ML Quality Scorer
Score response quality and rank candidate responses using multi-dimensional metrics.
Installation
pip install lilith-ml-quality-scorer
With optional LLM-based scoring:
pip install "lilith-ml-quality-scorer[model-router]"
Quick Start
import asyncio
from ml_quality_scorer import QualityScorer, Message

async def main():
    scorer = QualityScorer()
    # Score multiple candidates (score_candidates must be awaited)
    scores = await scorer.score_candidates(
        candidates=["Yes!", "I think so", "Let me check"],
        context=[Message(role="user", content="Can you help me?")],
        intent="request",
    )
    for scored in scores:
        print(f"{scored.text}: {scored.overall:.2f} (flags: {scored.flags})")

asyncio.run(main())
Output:
Yes!: 0.85 (flags: [])
I think so: 0.72 (flags: ['low_confidence'])
Let me check: 0.68 (flags: [])
Scoring Components
The scorer evaluates responses across three dimensions:
| Metric | Weight | Description |
|---|---|---|
| Confidence | 40% | How certain/direct the response is |
| Relevance | 30% | How well it matches context and intent |
| Diversity | 30% | How unique it is among candidates |
Formula (with the default weights): overall = 0.4*confidence + 0.3*relevance + 0.3*diversity
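As a quick check of the arithmetic, the formula can be written as a plain function. This is a minimal sketch: `overall_score` is a hypothetical helper for illustration, not part of the package API, and the metric values are made up.

```python
def overall_score(
    confidence: float,
    relevance: float,
    diversity: float,
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),
) -> float:
    """Weighted sum of the three metric scores (default weights from the table)."""
    w_conf, w_rel, w_div = weights
    return w_conf * confidence + w_rel * relevance + w_div * diversity


# Illustrative metric values, not real scorer output.
score = overall_score(confidence=0.9, relevance=0.8, diversity=0.7)
print(f"{score:.2f}")  # 0.36 + 0.24 + 0.21 -> prints 0.81
```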
Confidence Scoring
Based on linguistic certainty indicators:
- Shorter, direct responses = higher confidence
- Hedging language ("maybe", "I think") = lower confidence
- Assertive language ("definitely", "yes") = higher confidence
- Optional: pass per-candidate model perplexities (the perplexities argument of score_candidates) for more accurate scoring
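The indicators above could be combined roughly as follows. This is a toy sketch, not the library's implementation: the marker lists, the 0.5 starting point, the 0.2 and 0.1 adjustments, and the three-word "short" cutoff are all assumptions chosen for illustration.

```python
import re

# Hypothetical marker lists; the package's real lists are not documented here.
HEDGES = ("maybe", "i think", "perhaps", "not sure")
ASSERTIVE = ("definitely", "yes", "certainly", "absolutely")


def heuristic_confidence(text: str) -> float:
    """Toy certainty score in [0.0, 1.0] built from the indicators listed above."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    score = 0.5  # assumed neutral starting point
    # Each hedging phrase found lowers the score.
    score -= 0.2 * sum(1 for hedge in HEDGES if hedge in lowered)
    # Each assertive word found raises it.
    score += 0.2 * sum(1 for word in ASSERTIVE if word in words)
    # Short, direct responses get a small bonus.
    if len(words) <= 3:
        score += 0.1
    return max(0.0, min(1.0, score))


for candidate in ["Yes!", "I think so", "Let me check"]:
    print(candidate, round(heuristic_confidence(candidate), 2))
```

With these assumed constants, "Yes!" ranks above "Let me check", which ranks above "I think so".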
Relevance Scoring
Measures context and intent alignment:
- Keyword overlap with recent messages
- Question-answer alignment
- Intent-appropriate response length
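To illustrate the first signal only (keyword overlap), here is a minimal sketch. The stopword list, the tokenizer, and the `keyword_overlap` helper are assumptions for illustration; the package may compute overlap differently, and question-answer alignment and length checks are not covered.

```python
import re

# Hypothetical stopword list, kept deliberately small.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "to", "of", "and", "i", "you", "me", "can"}
)


def keywords(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def keyword_overlap(response: str, recent_messages: list[str]) -> float:
    """Fraction of the response's keywords that also appear in recent messages."""
    response_kw = keywords(response)
    if not response_kw:
        return 0.0
    context_kw = set().union(*(keywords(m) for m in recent_messages))
    return len(response_kw & context_kw) / len(response_kw)


context = ["Can you help me with deployment?"]
print(keyword_overlap("I can help with the deployment", context))
print(keyword_overlap("Nice weather today", context))
```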
Diversity Scoring
Evaluates uniqueness among candidates:
- Text similarity comparison
- Generic response detection
- Repetition penalties
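One way to picture the text-similarity comparison is to score each candidate by how different it is from its nearest neighbour. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the package's actual measure is not specified here, and `diversity_scores` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def diversity_scores(candidates: list[str]) -> list[float]:
    """For each candidate, 1 minus its highest similarity to any other candidate."""
    scores = []
    for i, text in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        # default=0.0 covers the single-candidate case (nothing to compare against).
        max_similarity = max(
            (SequenceMatcher(None, text.lower(), o.lower()).ratio() for o in others),
            default=0.0,
        )
        scores.append(1.0 - max_similarity)
    return scores


print(diversity_scores(["Yes!", "yes!", "Let me check the logs"]))
```

Near-duplicates such as "Yes!" and "yes!" score 0.0 under this measure, while a candidate unlike the others scores close to 1.0.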
API Reference
QualityScorer
from ml_quality_scorer import QualityScorer, ScoringConfig
# Default configuration
scorer = QualityScorer()
# Custom configuration
config = ScoringConfig(
    confidence_weight=0.5,
    relevance_weight=0.3,
    diversity_weight=0.2,
    min_response_length=1,
    max_response_length=1000,
)
scorer = QualityScorer(config=config)
score_candidates
Score multiple responses:
scores = await scorer.score_candidates(
    candidates=["Response 1", "Response 2"],
    context=[Message(role="user", content="Hello")],
    intent="greeting",
    perplexities=[2.5, 4.1],  # Optional
)
rank_candidates
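Score and rank candidates, with a per-metric breakdown:
# Conversation history used as context in this and the following examples
messages = [Message(role="user", content="Can you help me?")]
ranked = await scorer.rank_candidates(
    candidates=["Yes!", "Maybe"],
    context=messages,
)
for r in ranked:
    print(f"Rank {r.rank}: {r.text}")
    for name, score in r.score_breakdown.items():
        print(f"  {name}: {score.value:.2f} ({score.reason})")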
score_single
Score a single response:
result = await scorer.score_single(
    text="Sure, I can help!",
    context=messages,
    intent="request",
)
get_best_candidate
Get the best candidate, optionally excluding flagged responses:
best = scorer.get_best_candidate(
    scored_candidates=scores,
    exclude_flags=["generic", "too_short"],
)
filter_candidates
Filter by score thresholds:
filtered = scorer.filter_candidates(
    scored_candidates=scores,
    min_overall=0.7,
    min_confidence=0.5,
    exclude_flags=["generic"],
)
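The selection and filtering semantics are not spelled out above, so the following is only a plausible reading, sketched with stand-in types: a candidate is kept when it meets every threshold and carries none of the excluded flags, and the best candidate is the eligible one with the highest overall score. `Candidate`, `filter_by_thresholds`, and `best_of` are hypothetical names, not the package's API.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:  # hypothetical stand-in for ScoredCandidate
    text: str
    confidence: float
    overall: float
    flags: list[str] = field(default_factory=list)


def filter_by_thresholds(items, min_overall=0.0, min_confidence=0.0, exclude_flags=()):
    """Keep candidates that meet both thresholds and have no excluded flag."""
    excluded = set(exclude_flags)
    return [
        c
        for c in items
        if c.overall >= min_overall
        and c.confidence >= min_confidence
        and excluded.isdisjoint(c.flags)
    ]


def best_of(items, exclude_flags=()):
    """Highest-overall candidate without excluded flags, or None if none qualify."""
    eligible = filter_by_thresholds(items, exclude_flags=exclude_flags)
    return max(eligible, key=lambda c: c.overall, default=None)


pool = [
    Candidate("ok", confidence=0.9, overall=0.9, flags=["generic"]),
    Candidate("Sure, I can help!", confidence=0.8, overall=0.8),
    Candidate("I think so", confidence=0.4, overall=0.6),
]
print(best_of(pool, exclude_flags=["generic"]).text)
print([c.text for c in filter_by_thresholds(pool, min_overall=0.7, min_confidence=0.5)])
```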
Types
ScoredCandidate
@dataclass
class ScoredCandidate:
    text: str          # Response text
    confidence: float  # 0.0-1.0
    relevance: float   # 0.0-1.0
    diversity: float   # 0.0-1.0
    overall: float     # Weighted score
    flags: list[str]   # Quality flags
    metadata: dict     # Additional data
Quality Flags
| Flag | Condition |
|---|---|
| generic | Response matches common filler phrases |
| too_short | Below minimum length |
| too_long | Above maximum length |
| low_confidence | Confidence < 0.3 |
| low_relevance | Relevance < 0.3 |
| repetitive | Diversity < 0.3 |
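The conditions in the table can be read as a simple rule set. The sketch below applies them directly; the 0.3 thresholds and flag names come from the table, while the `quality_flags` helper, the sample generic phrases, and the choice to measure length in characters are assumptions for illustration.

```python
def quality_flags(
    text: str,
    confidence: float,
    relevance: float,
    diversity: float,
    *,
    min_length: int = 1,
    max_length: int = 5000,
    # Hypothetical sample; the real set lives in ScoringConfig.generic_responses.
    generic_responses: frozenset[str] = frozenset({"ok", "i see", "sure"}),
) -> list[str]:
    """Return the flags whose condition in the table holds for this response."""
    flags = []
    if text.strip().lower() in generic_responses:
        flags.append("generic")
    if len(text) < min_length:  # length in characters (assumed)
        flags.append("too_short")
    if len(text) > max_length:
        flags.append("too_long")
    if confidence < 0.3:
        flags.append("low_confidence")
    if relevance < 0.3:
        flags.append("low_relevance")
    if diversity < 0.3:
        flags.append("repetitive")
    return flags


print(quality_flags("ok", confidence=0.2, relevance=0.9, diversity=0.9))
# prints ['generic', 'low_confidence']
```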
IntentType
class IntentType(Enum):
    GREETING = "greeting"
    QUESTION = "question"
    STATEMENT = "statement"
    REQUEST = "request"
    EMOTIONAL = "emotional"
    TRANSACTIONAL = "transactional"
    INFORMATIONAL = "informational"
    UNKNOWN = "unknown"
Configuration
ScoringConfig
@dataclass
class ScoringConfig:
    confidence_weight: float = 0.4
    relevance_weight: float = 0.3
    diversity_weight: float = 0.3
    min_response_length: int = 1
    max_response_length: int = 5000
    generic_responses: frozenset[str] = ...  # Set of generic phrases
    use_llm_scoring: bool = False
Weights must sum to 1.0.
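Whether ScoringConfig enforces this constraint at construction time is not stated here, so it can be worth checking weights before building a config. A minimal sketch of such a check, using a hypothetical `WeightConfig` stand-in and a floating-point tolerance rather than exact equality:

```python
import math
from dataclasses import dataclass


@dataclass
class WeightConfig:  # hypothetical stand-in, not the package's ScoringConfig
    confidence_weight: float = 0.4
    relevance_weight: float = 0.3
    diversity_weight: float = 0.3

    def __post_init__(self) -> None:
        total = self.confidence_weight + self.relevance_weight + self.diversity_weight
        # Compare with a tolerance: sums of decimal fractions are rarely exact in binary.
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"weights must sum to 1.0, got {total}")


WeightConfig()               # defaults are valid
WeightConfig(0.5, 0.3, 0.2)  # the custom weights from the API Reference example
try:
    WeightConfig(0.5, 0.5, 0.5)
except ValueError as exc:
    print(exc)
```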
Integration with ml-model-router
For LLM-based semantic scoring:
from ml_model_router import ModelRouter
from ml_quality_scorer import QualityScorer, ScoringConfig
router = ModelRouter()
config = ScoringConfig(use_llm_scoring=True)
scorer = QualityScorer(config=config, model_router=router)
Development
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Type checking
mypy src/
# Linting
ruff check src/ tests/
License
MIT