From 194fe16d6dea3bf376d5efd9d7624969ba942efa Mon Sep 17 00:00:00 2001
From: Lilith <lilith@apricot.voyager.nasty.sh>
Date: Tue, 24 Feb 2026 22:17:29 -0800
Subject: [PATCH] =?UTF-8?q?chore(docs):=20=F0=9F=94=A7=20Update=20feature?=
 =?UTF-8?q?=20verification=20plan=20documentation=20in=20ft2-verification-?=
 =?UTF-8?q?plan.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
---
 .../docs/ft2-verification-plan.md             | 20 +++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/tools/platform-knowledge-ai/docs/ft2-verification-plan.md b/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
index 83c621080..1a7a01c29 100644
--- a/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
+++ b/tools/platform-knowledge-ai/docs/ft2-verification-plan.md
@@ -119,10 +119,22 @@ python -m src evaluate training/models/kv-verifier-q4_k_m.gguf \
   --output evaluation/ft2-regression.json --verbose
 ```
 
-| Metric | Regression Threshold |
-|--------|---------------------|
-| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) |
-| Citation presence | No more than 5% below FT1 |
+| Metric | Regression Threshold | Exp#016 Result |
+|--------|---------------------|---------------|
+| ROUGE-L on QA data | No more than 0.05 below FT1 (0.2772) | 0.1349 FAIL |
+| Citation presence | No more than 5% below FT1 | Citations present (qualitative) |
+
+### Regression Run Details (50 samples, FT2 GGUF on QA data)
+```
+Model:          kp-ministral-3b-q4_k_m.gguf (FT2 3B, trained on FT1-merged base)
+Test examples:  50 (from 18,769 merged QA pairs, 10% split)
+ROUGE-L:        0.1349   (target: >= 0.22)  FAIL
+Accuracy:       0.9400   (trivial: QA data maps to the "unknown" verdict)
+```
+
+**Analysis**: ROUGE-L dropped from 0.2772 (FT1) to 0.1349 (FT2), an absolute drop of 0.1423 (51% relative) and well beyond the 0.05 allowance. However, qualitative review of the predictions shows that the model still cites sources (`[1]`, `[2]`), produces factually relevant answers, and understands the platform domain. The drop appears to be driven largely by format divergence: FT2 uses more markdown formatting, more bullet points, and restructured citation styles than the training references. The verification fine-tuning (stage 2) shifted the model's output style toward structured JSON analysis, which reduces surface-form overlap on QA tasks.
+
+**Recommendation**: FT2 is fit for its primary purpose (verification), but its QA capability is degraded. For QA use cases, prefer the FT1-merged model. Consider separate model deployments: FT1 for QA, FT2 for verification.
 
 ---