chore(docs): 🔧 Update feature verification plan documentation in ft2-verification-plan.md

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
Lilith 2026-02-24 22:17:29 -08:00
parent 71bda1c3d5
commit 194fe16d6d


@@ -119,10 +119,22 @@ python -m src evaluate training/models/kv-verifier-q4_k_m.gguf \
--output evaluation/ft2-regression.json --verbose
```
| Metric | Regression Threshold |
|--------|---------------------|
| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) |
| Citation presence | No more than 5% below FT1 |
| Metric | Regression Threshold | Exp#016 Result |
|--------|---------------------|---------------|
| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) | 0.1349 FAIL |
| Citation presence | No more than 5% below FT1 | Citations present (qualitative) |
### Regression Run Details (50 samples, FT2 GGUF on QA data)
```
Model: kp-ministral-3b-q4_k_m.gguf (FT2 3B, trained on FT1-merged base)
Test examples: 50 (from 18,769 merged QA pairs, 10% split)
ROUGE-L: 0.1349 (target: >= 0.22) FAIL
Accuracy: 0.9400 (trivial — QA data maps to "unknown" verdict)
```
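The FAIL verdict in the run above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical and are not taken from the `src evaluate` tooling. It applies the table's "no more than 0.05 below FT1" rule literally, which gives a floor of 0.2272; the run log's `>= 0.22` target looks like a rounded form of that value, and FT2 fails under either reading.

```python
# Scores copied from the regression run above; all names here are illustrative.
FT1_ROUGE_L = 0.2772   # baseline (FT1)
FT2_ROUGE_L = 0.1349   # candidate (FT2, Exp#016)
MAX_ABS_DROP = 0.05    # "no more than 0.05 below FT1"


def check_regression(baseline: float, candidate: float, max_drop: float) -> dict:
    """Compare a candidate score to a baseline using an absolute-drop budget."""
    floor = baseline - max_drop
    drop = baseline - candidate
    return {
        "floor": round(floor, 4),
        "absolute_drop": round(drop, 4),
        "relative_drop": round(drop / baseline, 4),
        "passed": candidate >= floor,
    }


result = check_regression(FT1_ROUGE_L, FT2_ROUGE_L, MAX_ABS_DROP)
print(result)
```

The relative drop is reported next to the absolute one because the threshold is stated in absolute ROUGE-L points, while the size of the regression is easier to judge as a fraction of the baseline.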
**Analysis**: ROUGE-L dropped from 0.2772 (FT1) to 0.1349 (FT2), a 51% relative regression. However, qualitative review of the predictions shows that the model still cites sources (`[1]`, `[2]`), produces factually relevant answers, and understands the platform domain. The drop appears to be driven by format divergence: FT2 uses more markdown formatting, bullet points, and restructured citation styles than the training references do. The verification fine-tuning (stage 2) shifted the model's output style toward structured JSON analysis, which reduces surface-form overlap on QA tasks.
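To illustrate why formatting changes alone can lower ROUGE-L, here is a minimal sketch of an LCS-based ROUGE-L F1 score over whitespace tokens. It is a simplified stand-in, not the scorer used by the evaluation command, so its numbers will not match the reported ones (real implementations typically add their own tokenization and may apply stemming). The example strings are invented for illustration and do not come from the QA dataset.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (simplified)."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and two predictions with the same content.
reference = "The cache stores key value pairs in memory [1]."
plain = "The cache stores key value pairs in memory [1]."
restyled = "**Summary**\n- Key value pairs are kept in an in-memory cache\n- Source: [1]"

print(rouge_l_f1(reference, plain))
print(rouge_l_f1(reference, restyled))
```

The restyled prediction carries the same information and still cites `[1]`, but the added markup tokens and reordered phrasing shorten the longest common subsequence, so it scores lower than the plain answer. This is the kind of surface-form effect the analysis attributes the drop to, although the sketch does not establish that this is the cause in Exp#016.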
**Recommendation**: FT2 is fit for its primary purpose (verification), but its QA capabilities are degraded. For QA use cases, prefer the FT1-merged model. Consider separate model deployments: FT1 for QA, FT2 for verification.
---