chore(docs): 🔧 Update feature verification plan documentation in ft2-verification-plan.md
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
parent 71bda1c3d5
commit 194fe16d6d
1 changed files with 16 additions and 4 deletions
@@ -119,10 +119,22 @@ python -m src evaluate training/models/kv-verifier-q4_k_m.gguf \
--output evaluation/ft2-regression.json --verbose
```

| Metric | Regression Threshold |
|--------|---------------------|
| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) |
| Citation presence | No more than 5% below FT1 |

| Metric | Regression Threshold | Exp#016 Result |
|--------|---------------------|---------------|
| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) | 0.1349 FAIL |
| Citation presence | No more than 5% below FT1 | Citations present (qualitative) |

### Regression Run Details (50 samples, FT2 GGUF on QA data)
```
Model: kp-ministral-3b-q4_k_m.gguf (FT2 3B, trained on FT1-merged base)
Test examples: 50 (from 18,769 merged QA pairs, 10% split)
ROUGE-L: 0.1349 (target: >= 0.22) FAIL
Accuracy: 0.9400 (trivial — QA data maps to "unknown" verdict)
```
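
The regression gate reported above can be sketched in a few lines of Python. This is a minimal illustration, not the project's `src evaluate` implementation: the function name `check_regression` and the constant names are hypothetical, and the only values taken from the plan are the FT1 baseline (0.2772), the allowed drop (0.05), and the Exp#016 score (0.1349).

```python
# Hypothetical helper; names are illustrative and not part of the repository.
FT1_ROUGE_L = 0.2772   # FT1 baseline from the threshold table
MAX_DROP = 0.05        # allowed absolute regression
FT2_ROUGE_L = 0.1349   # Exp#016 result


def check_regression(baseline: float, score: float, max_drop: float) -> dict:
    """Compare a score against `baseline - max_drop` and report the drop."""
    floor = baseline - max_drop
    return {
        "floor": floor,
        "passed": score >= floor,
        "absolute_drop": baseline - score,
        "relative_drop": (baseline - score) / baseline,
    }


report = check_regression(FT1_ROUGE_L, FT2_ROUGE_L, MAX_DROP)
print(f"floor={report['floor']:.4f} passed={report['passed']} "
      f"relative_drop={report['relative_drop']:.1%}")
```

With these inputs the floor works out to 0.2272, which the run log rounds to 0.22, and the FT2 score falls below it, so the gate fails.
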
**Analysis**: ROUGE-L dropped from 0.2772 (FT1) to 0.1349 (FT2), a 51% relative regression. However, qualitative review of the predictions shows that the model still cites sources (`[1]`, `[2]`), produces factually relevant answers, and understands the platform domain. The ROUGE-L drop appears to be driven by increased format divergence: FT2 uses more markdown formatting, more bullet points, and restructured citation styles than the training references. The verification fine-tuning (stage 2) shifted the model's output style toward structured JSON analysis, which reduces surface-form overlap on QA tasks.

**Recommendation**: FT2 is fit for its primary purpose (verification), but its QA capabilities are degraded. For QA use cases, the FT1-merged model should be preferred. Consider separate model deployments: FT1 for QA, FT2 for verification.

---
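
To illustrate the format-divergence explanation in the analysis, the sketch below computes a simplified ROUGE-L F1 from the longest common subsequence (LCS) of whitespace tokens. It is a toy under stated assumptions, not the scorer used in the evaluation: the tokenization, the example strings, and the function names are all made up for illustration, and real ROUGE implementations typically normalize text and may apply stemming.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall on whitespace tokens."""
    ref, pred = reference.split(), prediction.split()
    if not ref or not pred:
        return 0.0
    lcs = lcs_length(ref, pred)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up strings: same facts and citation, different surface form.
reference = "Sessions expire after 30 minutes of inactivity [1]."
plain = "Sessions expire after 30 minutes of inactivity [1]."
structured = "- **Session expiry**: 30 minutes\n- **Trigger**: inactivity\n- Source: [1]"

print(rouge_l_f1(reference, plain))
print(rouge_l_f1(reference, structured))
```

The structured answer scores well below the plain one even though it conveys the same facts and keeps the citation, which is the kind of surface-form penalty the analysis describes.
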