From 194fe16d6dea3bf376d5efd9d7624969ba942efa Mon Sep 17 00:00:00 2001
From: Lilith
Date: Tue, 24 Feb 2026 22:17:29 -0800
Subject: [PATCH] =?UTF-8?q?chore(docs):=20=F0=9F=94=A7=20Update=20feature?=
 =?UTF-8?q?=20verification=20plan=20documentation=20in=20ft2-verification-?=
 =?UTF-8?q?plan.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 .../docs/ft2-verification-plan.md | 20 +++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/platform-knowledge-ai/docs/ft2-verification-plan.md b/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
index 83c621080..1a7a01c29 100644
--- a/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
+++ b/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
@@ -119,10 +119,22 @@ python -m src evaluate training/models/kv-verifier-q4_k_m.gguf \
     --output evaluation/ft2-regression.json --verbose
 ```
 
-| Metric | Regression Threshold |
-|--------|---------------------|
-| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) |
-| Citation presence | No more than 5% below FT1 |
+| Metric | Regression Threshold | Exp#016 Result |
+|--------|---------------------|---------------|
+| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) | 0.1349 FAIL |
+| Citation presence | No more than 5% below FT1 | Citations present (qualitative) |
+
+### Regression Run Details (50 samples, FT2 GGUF on QA data)
+```
+Model: kp-ministral-3b-q4_k_m.gguf (FT2 3B, trained on FT1-merged base)
+Test examples: 50 (from 18,769 merged QA pairs, 10% split)
+ROUGE-L: 0.1349 (target: >= 0.22) FAIL
+Accuracy: 0.9400 (trivial: QA data maps to "unknown" verdict)
+```
+
+**Analysis**: ROUGE-L dropped from 0.2772 (FT1) to 0.1349 (FT2), an absolute drop of 0.1423 (51% relative) against an allowed 0.05. However, qualitative review of predictions shows the model still cites sources (`[1]`, `[2]`), produces factually relevant answers, and understands the platform domain. The ROUGE-L drop appears driven by increased format divergence: FT2 uses more markdown formatting, bullet points, and restructured citation styles than the training references. The verification fine-tuning (stage 2) shifted the model's output style toward structured JSON analysis, which reduces surface-form overlap on QA tasks.
+
+**Recommendation**: FT2 is fit for its primary purpose (verification), but its QA capabilities are degraded. For QA use cases, prefer the FT1-merged model. Consider separate deployments: FT1 for QA, FT2 for verification.
 
 ---
 
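
Review note (not part of the patch): the FAIL verdict and the 51% figure in the analysis can be reproduced with a few lines of arithmetic. The sketch below is only an illustration. `check_regression` and the constant names are hypothetical and do not come from the repository; the numbers are the ones quoted in the table. The floor implied by the threshold is 0.2772 - 0.05 = 0.2272, and the run log's rounded target of 0.22 yields the same verdict.

```python
# Hypothetical helper for illustration; not taken from the repository.
FT1_ROUGE_L = 0.2772   # FT1 baseline quoted in the threshold table
FT2_ROUGE_L = 0.1349   # Exp#016 result for FT2
MAX_ABS_DROP = 0.05    # allowed absolute drop below FT1


def check_regression(baseline: float, candidate: float, max_abs_drop: float) -> dict:
    """Compare a candidate score to a baseline under an absolute-drop budget."""
    floor = baseline - max_abs_drop   # lowest score that still passes
    abs_drop = baseline - candidate   # absolute regression
    rel_drop = abs_drop / baseline    # regression relative to the baseline
    return {
        "floor": round(floor, 4),
        "abs_drop": round(abs_drop, 4),
        "rel_drop_pct": round(100 * rel_drop, 1),
        "passed": candidate >= floor,
    }


result = check_regression(FT1_ROUGE_L, FT2_ROUGE_L, MAX_ABS_DROP)
print(result)
# {'floor': 0.2272, 'abs_drop': 0.1423, 'rel_drop_pct': 51.3, 'passed': False}
```

The same helper could be applied to the citation-presence metric once a quantitative FT1 baseline for it is recorded; the patch reports that metric only qualitatively.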