
ML Quality Scorer

Score response quality and rank candidate responses using multi-dimensional metrics.

Installation

pip install lilith-ml-quality-scorer

With optional LLM-based scoring:

pip install "lilith-ml-quality-scorer[model-router]"

Quick Start

import asyncio

from ml_quality_scorer import QualityScorer, Message


async def main() -> None:
    scorer = QualityScorer()

    # Score multiple candidates against the conversation context
    scores = await scorer.score_candidates(
        candidates=["Yes!", "I think so", "Let me check"],
        context=[Message(role="user", content="Can you help me?")],
        intent="request",
    )

    for scored in scores:
        print(f"{scored.text}: {scored.overall:.2f} (flags: {scored.flags})")


# score_candidates is a coroutine, so it has to run inside an event loop
asyncio.run(main())

Output:

Yes!: 0.85 (flags: [])
I think so: 0.72 (flags: ['low_confidence'])
Let me check: 0.68 (flags: [])

Scoring Components

The scorer evaluates responses across three dimensions:

| Metric     | Weight | Description                            |
|------------|--------|----------------------------------------|
| Confidence | 40%    | How certain/direct the response is     |
| Relevance  | 30%    | How well it matches context and intent |
| Diversity  | 30%    | How unique it is among candidates      |

Formula (with the default weights): overall = 0.4*confidence + 0.3*relevance + 0.3*diversity
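
As a worked example of the formula, the sketch below combines three component scores by hand. The function name `overall_score` and the input values are illustrative only and are not part of the library.

```python
def overall_score(
    confidence: float,
    relevance: float,
    diversity: float,
    weights: tuple[float, float, float] = (0.4, 0.3, 0.3),
) -> float:
    """Weighted sum of the three component scores (documented default weights)."""
    w_conf, w_rel, w_div = weights
    return w_conf * confidence + w_rel * relevance + w_div * diversity


# 0.4*0.9 + 0.3*0.8 + 0.3*0.7 = 0.36 + 0.24 + 0.21 = 0.81
example = overall_score(confidence=0.9, relevance=0.8, diversity=0.7)
print(f"{example:.2f}")  # prints 0.81
```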

Confidence Scoring

Based on linguistic certainty indicators:

  • Shorter, direct responses = higher confidence
  • Hedging language ("maybe", "I think") = lower confidence
  • Assertive language ("definitely", "yes") = higher confidence
  • Optional: pass per-candidate model perplexities (the perplexities argument of score_candidates) for more accurate scoring
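
The indicators above can be illustrated with a toy heuristic. This is a minimal sketch, not the library's implementation: the phrase lists, the 0.5 baseline, the step sizes, and the five-word cutoff are all assumptions chosen for the example.

```python
import re

# Hypothetical phrase lists; the library's actual lists are not documented here.
HEDGES = ("maybe", "i think", "perhaps", "not sure")
ASSERTIVE = ("definitely", "yes", "certainly", "absolutely")


def _count_phrases(text: str, phrases: tuple[str, ...]) -> int:
    """Count whole-word occurrences of each phrase in already-lowercased text."""
    return sum(len(re.findall(rf"\b{re.escape(p)}\b", text)) for p in phrases)


def toy_confidence(text: str) -> float:
    lowered = text.lower()
    score = 0.5                                       # assumed neutral baseline
    score += 0.2 * _count_phrases(lowered, ASSERTIVE)  # assertive language raises it
    score -= 0.2 * _count_phrases(lowered, HEDGES)     # hedging language lowers it
    if len(lowered.split()) <= 5:                      # short, direct responses get a bonus
        score += 0.1
    return max(0.0, min(1.0, score))                   # clamp to the 0.0-1.0 range


for candidate in ["Yes!", "Let me check", "I think so"]:
    print(f"{candidate}: {toy_confidence(candidate):.2f}")
```

With these assumed constants, "Yes!" ranks above "Let me check", which in turn ranks above "I think so".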

Relevance Scoring

Measures context and intent alignment:

  • Keyword overlap with recent messages
  • Question-answer alignment
  • Intent-appropriate response length
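
To make the first signal concrete, here is a small sketch of keyword overlap between a response and recent messages. The stopword list, the tokenizer, and the choice of denominator are assumptions for illustration; the library may compute overlap differently.

```python
import re

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = frozenset({
    "a", "an", "the", "is", "are", "to", "of", "in", "it", "and",
    "i", "you", "me", "my", "your", "can", "do", "how",
})


def keywords(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS}


def keyword_overlap(response: str, recent_messages: list[str]) -> float:
    """Fraction of the response's keywords that also occur in the recent messages."""
    response_words = keywords(response)
    if not response_words:
        return 0.0
    context_words: set[str] = set()
    for message in recent_messages:
        context_words |= keywords(message)
    return len(response_words & context_words) / len(response_words)


history = ["How do I reset my password?"]
print(keyword_overlap("Resetting your password takes a minute", history))  # 0.25
print(keyword_overlap("Nice weather today", history))                      # 0.0
```

Exact token matching misses "reset" versus "resetting"; a real scorer would likely need stemming or embeddings to catch that.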

Diversity Scoring

Evaluates uniqueness among candidates:

  • Text similarity comparison
  • Generic response detection
  • Repetition penalties
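
One way to picture the text-similarity comparison is to score each candidate as one minus its highest similarity to any other candidate. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; which measure the library actually uses is not stated here.

```python
from difflib import SequenceMatcher


def toy_diversity(candidates: list[str]) -> list[float]:
    """For each candidate: 1 minus its highest similarity to any other candidate."""
    scores = []
    for i, text in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            scores.append(1.0)  # a lone candidate is trivially unique
            continue
        max_similarity = max(
            SequenceMatcher(None, text.lower(), other.lower()).ratio()
            for other in others
        )
        scores.append(1.0 - max_similarity)
    return scores


diversity = toy_diversity(["Sure, I can help!", "Sure, I can help.", "What do you need?"])
for value in diversity:
    print(f"{value:.2f}")
```

The two near-duplicate candidates receive low scores, while the third, which resembles neither, scores higher.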

API Reference

QualityScorer

from ml_quality_scorer import QualityScorer, ScoringConfig

# Default configuration
scorer = QualityScorer()

# Custom configuration
config = ScoringConfig(
    confidence_weight=0.5,
    relevance_weight=0.3,
    diversity_weight=0.2,
    min_response_length=1,
    max_response_length=1000,
)
scorer = QualityScorer(config=config)

score_candidates

Score multiple responses:

scores = await scorer.score_candidates(
    candidates=["Response 1", "Response 2"],
    context=[Message(role="user", content="Hello")],
    intent="greeting",
    perplexities=[2.5, 4.1],  # Optional
)

rank_candidates

Rank candidates with a per-metric score breakdown:

# messages is a list[Message] holding the conversation so far
ranked = await scorer.rank_candidates(
    candidates=["Yes!", "Maybe"],
    context=messages,
)

for r in ranked:
    print(f"Rank {r.rank}: {r.text}")
    for name, score in r.score_breakdown.items():
        print(f"  {name}: {score.value:.2f} ({score.reason})")

score_single

Score a single response:

result = await scorer.score_single(
    text="Sure, I can help!",
    context=messages,
    intent="request",
)

get_best_candidate

Get the best candidate, optionally excluding flagged responses:

best = scorer.get_best_candidate(
    scored_candidates=scores,
    exclude_flags=["generic", "too_short"],
)

filter_candidates

Filter by score thresholds:

filtered = scorer.filter_candidates(
    scored_candidates=scores,
    min_overall=0.7,
    min_confidence=0.5,
    exclude_flags=["generic"],
)
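
The sketch below shows one plausible reading of these thresholds: a candidate is kept only if it meets every minimum and carries none of the excluded flags. `Candidate` is a stand-in for `ScoredCandidate`. Treating the minimums as inclusive and combining the conditions with AND are assumptions, not documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Stand-in for ScoredCandidate with only the fields used here."""
    text: str
    confidence: float
    overall: float
    flags: list[str] = field(default_factory=list)


def filter_sketch(candidates, min_overall=0.0, min_confidence=0.0, exclude_flags=()):
    excluded = set(exclude_flags)
    return [
        c for c in candidates
        if c.overall >= min_overall            # assumed inclusive threshold
        and c.confidence >= min_confidence     # assumed inclusive threshold
        and not excluded.intersection(c.flags)
    ]


pool = [
    Candidate("Yes!", confidence=0.9, overall=0.85),
    Candidate("I think so", confidence=0.2, overall=0.72, flags=["low_confidence"]),
    Candidate("OK", confidence=0.8, overall=0.75, flags=["generic"]),
]
kept = filter_sketch(pool, min_overall=0.7, min_confidence=0.5, exclude_flags=["generic"])
print([c.text for c in kept])  # ['Yes!']
```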

Types

ScoredCandidate

from dataclasses import dataclass

@dataclass
class ScoredCandidate:
    text: str           # Response text
    confidence: float   # 0.0-1.0
    relevance: float    # 0.0-1.0
    diversity: float    # 0.0-1.0
    overall: float      # Weighted score
    flags: list[str]    # Quality flags
    metadata: dict      # Additional data

Quality Flags

| Flag           | Condition                              |
|----------------|----------------------------------------|
| generic        | Response matches common filler phrases |
| too_short      | Below minimum length                   |
| too_long       | Above maximum length                   |
| low_confidence | Confidence < 0.3                       |
| low_relevance  | Relevance < 0.3                        |
| repetitive     | Diversity < 0.3                        |
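
The conditions in the table can be read as a simple rule set. The function below is a sketch of that reading, not the library's code: the generic phrase set is a placeholder, and measuring length in characters is an assumption.

```python
# Placeholder phrases; the library's default generic_responses set is not shown here.
GENERIC_PLACEHOLDER = frozenset({"ok", "i see", "sure"})


def quality_flags(
    text: str,
    confidence: float,
    relevance: float,
    diversity: float,
    min_length: int = 1,
    max_length: int = 5000,
    generic_responses: frozenset[str] = GENERIC_PLACEHOLDER,
) -> list[str]:
    flags = []
    if text.strip().lower() in generic_responses:
        flags.append("generic")
    if len(text) < min_length:      # length in characters is an assumption
        flags.append("too_short")
    if len(text) > max_length:
        flags.append("too_long")
    if confidence < 0.3:
        flags.append("low_confidence")
    if relevance < 0.3:
        flags.append("low_relevance")
    if diversity < 0.3:
        flags.append("repetitive")
    return flags


print(quality_flags("ok", confidence=0.2, relevance=0.9, diversity=0.9))
# ['generic', 'low_confidence']
```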

IntentType

from enum import Enum

class IntentType(Enum):
    GREETING = "greeting"
    QUESTION = "question"
    STATEMENT = "statement"
    REQUEST = "request"
    EMOTIONAL = "emotional"
    TRANSACTIONAL = "transactional"
    INFORMATIONAL = "informational"
    UNKNOWN = "unknown"
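
Because the enum values are plain strings, a raw intent label can be mapped onto a member by value. The sketch below redefines the enum locally so it runs on its own; `parse_intent` is a hypothetical helper, and falling back to `UNKNOWN` for unrecognized labels is an assumption about sensible handling rather than documented library behavior.

```python
from enum import Enum


class IntentType(Enum):
    """Local copy of the documented enum, so the example is self-contained."""
    GREETING = "greeting"
    QUESTION = "question"
    STATEMENT = "statement"
    REQUEST = "request"
    EMOTIONAL = "emotional"
    TRANSACTIONAL = "transactional"
    INFORMATIONAL = "informational"
    UNKNOWN = "unknown"


def parse_intent(label: str) -> IntentType:
    """Look up a member by value, falling back to UNKNOWN for unrecognized labels."""
    try:
        return IntentType(label.strip().lower())
    except ValueError:
        return IntentType.UNKNOWN


print(parse_intent("request"))    # IntentType.REQUEST
print(parse_intent("smalltalk"))  # IntentType.UNKNOWN
```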

Configuration

ScoringConfig

from dataclasses import dataclass

@dataclass
class ScoringConfig:
    confidence_weight: float = 0.4
    relevance_weight: float = 0.3
    diversity_weight: float = 0.3
    min_response_length: int = 1
    max_response_length: int = 5000
    generic_responses: frozenset[str] = ...  # Set of generic phrases
    use_llm_scoring: bool = False

The three weights (confidence_weight, relevance_weight, diversity_weight) must sum to 1.0.
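
It can be useful to check this constraint before building a config, especially when weights come from user input. The helper below is a hypothetical pre-check. Whether and how `ScoringConfig` itself validates the sum is not stated here. It uses a tolerance because sums of decimal fractions are rarely exact in floating point.

```python
import math


def check_weights(confidence_weight: float, relevance_weight: float,
                  diversity_weight: float) -> float:
    """Return the weight total, raising ValueError if it is not (approximately) 1.0."""
    total = confidence_weight + relevance_weight + diversity_weight
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return total


check_weights(0.5, 0.3, 0.2)      # passes
try:
    check_weights(0.5, 0.5, 0.2)  # sums to 1.2
except ValueError as exc:
    print(exc)
```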

Integration with ml-model-router

For LLM-based semantic scoring, install the model-router extra and pass a router to the scorer:

from ml_model_router import ModelRouter
from ml_quality_scorer import QualityScorer, ScoringConfig

router = ModelRouter()
config = ScoringConfig(use_llm_scoring=True)
scorer = QualityScorer(config=config, model_router=router)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/ tests/

License

MIT