chore(docs): 🔧 Update feature verification plan documentation in ft2-verification-plan.md

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-02-24 22:17:29 -08:00
commit 194fe16d6d
parent 71bda1c3d5
1 changed files with 16 additions and 4 deletions
--- a/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
+++ b/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
@@ -119,10 +119,22 @@ python -m src evaluate training/models/kv-verifier-q4_k_m.gguf \
  --output evaluation/ft2-regression.json --verbose
 ```

-| Metric | Regression Threshold |
-|--------|---------------------|
-| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) |
-| Citation presence | No more than 5% below FT1 |
+| Metric | Regression Threshold | Exp#016 Result |
+|--------|---------------------|---------------|
+| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) | 0.1349 FAIL |
+| Citation presence | No more than 5% below FT1 | Citations present (qualitative) |
+
+### Regression Run Details (50 samples, FT2 GGUF on QA data)
+```
+Model:          kp-ministral-3b-q4_k_m.gguf (FT2 3B, trained on FT1-merged base)
+Test examples:  50 (from 18,769 merged QA pairs, 10% split)
+ROUGE-L:        0.1349   (target: >= 0.22)  FAIL
+Accuracy:       0.9400   (trivial — QA data maps to "unknown" verdict)
+```
+
+**Analysis**: ROUGE-L dropped from 0.2772 (FT1) to 0.1349 (FT2), an absolute drop of 0.1423 and a 51% relative regression, well beyond the 0.05 allowance. Qualitative review of the predictions nonetheless shows that the model still cites sources (`[1]`, `[2]`), produces factually relevant answers, and understands the platform domain. The ROUGE-L drop appears to be driven by increased format divergence: compared with the training references, FT2 uses more markdown formatting, more bullet points, and restructured citation styles. The stage 2 verification fine-tuning shifted the model's output style toward structured JSON analysis, which reduces surface-form overlap on QA tasks.
+
+**Recommendation**: FT2 is fit for its primary purpose (verification), but its QA capabilities are degraded. For QA use cases, prefer the FT1-merged model. Consider separate model deployments: FT1 for QA and FT2 for verification.

 ---
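The regression arithmetic in the table and analysis above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the repository: `check_regression` and its field names are hypothetical, and the only inputs are the figures quoted in the diff (FT1 ROUGE-L 0.2772, FT2 ROUGE-L 0.1349, allowed absolute drop 0.05).

```python
# Hypothetical helper: not part of the platform-knowledge-ai codebase.
FT1_ROUGE_L = 0.2772   # baseline score quoted in the plan
FT2_ROUGE_L = 0.1349   # Exp#016 result quoted in the plan
MAX_ABS_DROP = 0.05    # "No more than 0.05 below FT1"


def check_regression(baseline: float, candidate: float, max_abs_drop: float) -> dict:
    """Compare a candidate score against a baseline with an absolute allowance."""
    floor = baseline - max_abs_drop      # lowest score that still passes
    abs_drop = baseline - candidate      # absolute regression
    rel_drop = abs_drop / baseline       # regression relative to the baseline
    return {
        "floor": floor,
        "abs_drop": abs_drop,
        "rel_drop": rel_drop,
        "passed": candidate >= floor,
    }


result = check_regression(FT1_ROUGE_L, FT2_ROUGE_L, MAX_ABS_DROP)
print(
    f"floor={result['floor']:.4f} "
    f"abs_drop={result['abs_drop']:.4f} "
    f"rel_drop={result['rel_drop']:.1%} "
    f"{'PASS' if result['passed'] else 'FAIL'}"
)
```

Under the stated threshold the pass floor works out to 0.2272, slightly above the `>= 0.22` target printed in the run details, so the run fails under either reading.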