ml-model-loader/HANDOFF.md
Lilith aa01d0f388 chore: rename package @lilith/model-loader -> @lilith/ml-model-loader
Package renamed to follow naming convention:
@lilith/{namespace}-{parent}-{child}

Generated by rename-packages.sh
2025-12-31 01:32:00 -08:00


Model Loader - Enhanced v1.1.0

Date: 2025-12-27
Location: ~/Code/@packages/@ml/@tools/model-loader/
Status: COMPLETE - All enhancements implemented


What's New in v1.1.0

Framework-Specific Loaders (IMPLEMENTED)

  • HFModelLoader - HuggingFace Transformers (pipelines, raw models)
  • DiffusersLoader - Stable Diffusion, SDXL, Flux
  • GGUFModelLoader - llama-cpp-python for quantized models

Infrastructure (IMPLEMENTED)

  • BaseModelLoader - ABC with async load/unload/is_loaded
  • DeviceManager - GPU/CPU detection and allocation
  • registry.py - @register_loader decorator, get_loader() function

Features

  • Async and sync loading APIs
  • Context manager support for automatic cleanup
  • Progress callbacks during model loading
  • Memory tracking and GPU cache management
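
How context-manager cleanup and progress callbacks could fit together is sketched below. `DummyLoader`, `loaded_model`, and the `(done, total)` callback signature are assumptions made for illustration; they are not taken from the package's actual API.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical callback signature: (steps_done, steps_total).
ProgressCallback = Callable[[int, int], None]


class DummyLoader:
    """Stand-in loader; a real one would load weights instead of a string."""

    def __init__(self) -> None:
        self.model: Optional[str] = None

    def load(self, model_id: str,
             on_progress: Optional[ProgressCallback] = None) -> str:
        total = 3  # pretend the model loads in three stages
        for step in range(1, total + 1):
            if on_progress is not None:
                on_progress(step, total)
        self.model = f"loaded:{model_id}"
        return self.model

    def unload(self) -> None:
        self.model = None

    def is_loaded(self) -> bool:
        return self.model is not None


@contextmanager
def loaded_model(loader: DummyLoader, model_id: str,
                 on_progress: Optional[ProgressCallback] = None) -> Iterator[str]:
    """Load on entry and always unload on exit, even if the body raises."""
    model = loader.load(model_id, on_progress=on_progress)
    try:
        yield model
    finally:
        loader.unload()


progress_log: List[Tuple[int, int]] = []
loader = DummyLoader()
with loaded_model(loader, "sdxl-base",
                  lambda done, total: progress_log.append((done, total))) as model:
    loaded_inside = loader.is_loaded()  # the model is available inside the block
loaded_after = loader.is_loaded()       # and released once the block exits
```

Putting the unload in a `finally` clause means GPU memory would be released even when inference inside the `with` block raises.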

Current State

The @tqftw/model-loader package now provides:

  • TypeScript core (src/) - CLI with rsync/scp remote fetching, manifest-based registry
  • Python loaders (src_python/tqftw_model_loader/) - Full framework support
  • Scope: HuggingFace, Diffusers, GGUF models

What Works

  • Manifest-based model registry (manifest.json)
  • Remote fetching via rsync/scp
  • Local caching with size verification
  • Fallback path resolution
  • CLI tool (npx @tqftw/model-loader ensure <model-id>)
  • NEW: get_loader("hf") returns ready-to-use HFModelLoader
  • NEW: get_loader("diffusers") returns DiffusersLoader
  • NEW: get_loader("gguf") returns GGUFModelLoader
  • NEW: DeviceManager for GPU/CPU allocation
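
The combination of size verification and fallback path resolution listed above could look roughly like this. It is a minimal sketch: `find_verified_model` and its parameters are hypothetical, and the manifest format is not assumed beyond an expected size in bytes.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_verified_model(candidates: Iterable[Path],
                        expected_size: int) -> Optional[Path]:
    """Return the first candidate that exists and matches the expected byte size."""
    for path in candidates:
        if path.is_file() and path.stat().st_size == expected_size:
            return path
    return None  # a caller would trigger a remote fetch at this point


# Demo with temporary files standing in for a local cache and a network share.
with tempfile.TemporaryDirectory() as tmp:
    cache_copy = Path(tmp) / "cache" / "model.gguf"
    network_copy = Path(tmp) / "network" / "model.gguf"
    cache_copy.parent.mkdir()
    network_copy.parent.mkdir()
    cache_copy.write_bytes(b"trunc")         # truncated download: 5 bytes
    network_copy.write_bytes(b"full-model")  # intact copy: 10 bytes

    chosen = find_verified_model([cache_copy, network_copy], expected_size=10)
    chosen_is_network = chosen == network_copy
    missing = find_verified_model([cache_copy], expected_size=10)
```

A size check catches truncated downloads cheaply; it does not detect corruption that preserves the file length, which would need a checksum.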

Current Adoption

  • Ready for migration of 6 ML services from egirl-platform

Problem: Fragmented Model Loading Across Services

Analysis of 7 Python ML services in @egirl/egirl-platform/@services/ revealed 3 different model loading patterns, none using this package:

Pattern 1: Direct HuggingFace (ml-moderation-python)

# src/detection/nsfw_detector.py
from transformers import pipeline
self.classifier = pipeline('image-classification', model='Marqo/nsfw-image-detection-384')

Pattern 2: Local/HF Fallback (ml-content-generator-python)

# src/generation/model_loader.py
MODELS_PATH = f"{MODELS_BASE}/models/llm"
MODEL_PATHS = {"qwen2.5-7b": f"{MODELS_PATH}/general/Qwen/Qwen2.5-7B-Instruct"}
HF_MODEL_NAMES = {"qwen2.5-7b": "Qwen/Qwen2.5-7B-Instruct"}  # Fallback

Pattern 3: Class-based Lazy Load (ml-image-generation-python)

# src/generation/stable_diffusion.py
LOCAL_MODEL_PATH = Path.home() / ".cache" / "sdxl-models"
NETWORK_MODEL_PATH = Path("/var/mnt/bigdisk/_/models/...")

class SDXLGenerator:
    def load_model(self) -> None:
        self.pipeline = StableDiffusionXLPipeline.from_single_file(model_path, ...)

Required Enhancements

1. Framework-Specific Loaders (High Priority)

The Python wrapper needs native loaders that return loaded models, not just paths:

# Proposed API
import torch  # needed for torch.float16 below

from tqftw_model_loader import HFModelLoader, DiffusersLoader, GGUFModelLoader

# HuggingFace Transformers
loader = HFModelLoader()
model = loader.load("ministral-3b-instruct")  # Returns loaded model, not path
model = loader.load("nsfw-classifier", task="image-classification")

# Diffusers (SDXL)
loader = DiffusersLoader()
pipeline = loader.load("sdxl-base", dtype=torch.float16, device="cuda:0")

# GGUF (already works, but should follow same pattern)
loader = GGUFModelLoader()
model = loader.load("ministral-3b-instruct", n_ctx=4096, n_gpu_layers=-1)

2. Abstract Base Class

from abc import ABC, abstractmethod
from pathlib import Path
from typing import TypeVar, Generic

T = TypeVar('T')

class BaseModelLoader(ABC, Generic[T]):
    """Base class for all model loaders."""

    @abstractmethod
    async def load(self, model_id: str, **kwargs) -> T:
        """Load model and return ready-to-use instance."""
        pass

    @abstractmethod
    async def unload(self) -> None:
        """Unload model and free resources."""
        pass

    @abstractmethod
    def is_loaded(self) -> bool:
        """Check if model is currently loaded."""
        pass

    def get_path(self, model_id: str) -> Path:
        """Get local path (delegates to existing logic)."""
        pass
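
To show how a concrete loader might satisfy this interface, here is a small sketch. `EchoLoader` and `_blocking_load` are hypothetical, the base class is a trimmed copy so the example runs standalone, and `asyncio.to_thread` requires Python 3.9 or newer.

```python
import asyncio
import time
from abc import ABC, abstractmethod
from typing import Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class BaseModelLoader(ABC, Generic[T]):
    """Trimmed copy of the interface above so the sketch runs standalone."""

    @abstractmethod
    async def load(self, model_id: str, **kwargs) -> T: ...

    @abstractmethod
    async def unload(self) -> None: ...

    @abstractmethod
    def is_loaded(self) -> bool: ...


class EchoLoader(BaseModelLoader[str]):
    """Hypothetical loader whose 'model' is just a string."""

    def __init__(self) -> None:
        self._model: Optional[str] = None

    def _blocking_load(self, model_id: str) -> str:
        time.sleep(0.01)  # stands in for slow disk/GPU work
        return f"model:{model_id}"

    async def load(self, model_id: str, **kwargs) -> str:
        # Run the blocking load in a worker thread so the event loop stays free.
        self._model = await asyncio.to_thread(self._blocking_load, model_id)
        return self._model

    async def unload(self) -> None:
        self._model = None

    def is_loaded(self) -> bool:
        return self._model is not None


async def main() -> Tuple[str, bool, bool]:
    loader = EchoLoader()
    model = await loader.load("nsfw-classifier")
    loaded_before = loader.is_loaded()
    await loader.unload()
    return model, loaded_before, loader.is_loaded()


result = asyncio.run(main())
```

Offloading the blocking call to a thread is one way to keep a service responsive during a long load; whether the real loaders do this is not stated here.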

3. Device Management

Every ML service currently selects its GPU or CPU independently. Centralize that logic in one class:

from typing import Optional

class DeviceManager:
    @staticmethod
    def get_best_device() -> str:
        """Return best available device (cuda:0, mps, cpu)."""

    @staticmethod
    def get_device_count() -> int:
        """Return number of available GPUs."""

    @staticmethod
    def allocate_device(preference: Optional[str] = None) -> str:
        """Allocate device with optional preference."""

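One possible way to fill in these method bodies is sketched below. It assumes PyTorch is the detection backend and falls back to CPU when `torch` is not installed; the preference-handling rules in `allocate_device` are an illustrative guess, not the package's documented behavior.

```python
from typing import Optional

try:
    import torch  # optional; without it everything falls back to CPU
except ImportError:
    torch = None


class DeviceManager:
    @staticmethod
    def get_device_count() -> int:
        """Number of CUDA GPUs visible to PyTorch (0 if none, or no torch)."""
        if torch is None or not torch.cuda.is_available():
            return 0
        return torch.cuda.device_count()

    @staticmethod
    def get_best_device() -> str:
        """Prefer CUDA, then Apple MPS, then CPU."""
        if DeviceManager.get_device_count() > 0:
            return "cuda:0"
        # Older torch builds have no MPS backend, so look it up defensively.
        mps = getattr(getattr(torch, "backends", None), "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
        return "cpu"

    @staticmethod
    def allocate_device(preference: Optional[str] = None) -> str:
        """Honor a usable preference, otherwise pick the best device."""
        if preference == "cpu":
            return "cpu"
        if preference is not None and preference.startswith("cuda:"):
            index = preference.split(":", 1)[1]
            if index.isdigit() and int(index) < DeviceManager.get_device_count():
                return preference
        return DeviceManager.get_best_device()


best = DeviceManager.get_best_device()
forced_cpu = DeviceManager.allocate_device("cpu")
fallback = DeviceManager.allocate_device("cuda:999")  # index out of range
```

This sketch only selects a device name. Tracking which service holds which GPU would need shared state that is out of scope here.
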
4. Registry Pattern for Model Types

# Allow services to register custom loaders
from tqftw_model_loader import BaseModelLoader, register_loader, get_loader

@register_loader("custom-format")
class CustomLoader(BaseModelLoader):
    ...

# Usage
loader = get_loader("hf")  # Returns HFModelLoader
loader = get_loader("diffusers")  # Returns DiffusersLoader
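
A decorator-based registry like the one used above can be implemented in a few lines. The following is a sketch under assumptions: the module-level `_LOADERS` dict, the duplicate-name check, and the error messages are illustrative and may differ from what `registry.py` does.

```python
from typing import Callable, Dict

# Maps a loader name such as "hf" to the class registered for it.
_LOADERS: Dict[str, type] = {}


def register_loader(name: str) -> Callable[[type], type]:
    """Class decorator that records a loader class under `name`."""
    def decorator(cls: type) -> type:
        if name in _LOADERS:
            raise ValueError(f"loader {name!r} is already registered")
        _LOADERS[name] = cls
        return cls  # return the class unchanged so it stays usable directly
    return decorator


def get_loader(name: str, **kwargs):
    """Instantiate the loader registered under `name`."""
    try:
        cls = _LOADERS[name]
    except KeyError:
        known = ", ".join(sorted(_LOADERS)) or "none"
        raise KeyError(f"unknown loader {name!r}; registered: {known}") from None
    return cls(**kwargs)


@register_loader("custom-format")
class CustomLoader:
    """Placeholder; a real loader would subclass BaseModelLoader."""


loader = get_loader("custom-format")
```

Registration happens at import time, so a service only has to import the module that defines its custom loader before calling `get_loader`.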

ML Services to Migrate

Once enhancements are complete, these services should adopt the package:

Service                      Current Pattern         Target Loader
---------------------------  ----------------------  -------------------------------
ml-moderation-python         Direct HF pipeline      HFModelLoader
ml-truth-python              Direct HF transformers  HFModelLoader
ml-content-generator-python  Local/HF fallback       HFModelLoader + GGUFModelLoader
ml-image-generation-python   Class-based lazy        DiffusersLoader
ml-image-generator-python    Similar to above        DiffusersLoader
ml-watermarking-python       Multiple model types    HFModelLoader
ml-job-scheduler-python      No models (scheduler)   N/A

Dependencies to Add

# pyproject.toml additions
[project.optional-dependencies]
hf = ["transformers>=4.36.0", "accelerate>=0.25.0"]
diffusers = ["diffusers>=0.25.0", "xformers>=0.0.23"]
gguf = ["llama-cpp-python>=0.2.0"]
all = ["tqftw-model-loader[hf,diffusers,gguf]"]
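
Because these dependencies are optional, a loader may be imported without its framework installed. One way to fail with a helpful message is sketched below; `require_extra` and `EXTRA_MODULES` are hypothetical helpers, not part of the package.

```python
import importlib.util

# Maps each extra from the pyproject fragment to the module it provides.
EXTRA_MODULES = {
    "hf": "transformers",
    "diffusers": "diffusers",
    "gguf": "llama_cpp",  # import name of the llama-cpp-python distribution
}


def require_extra(extra: str) -> None:
    """Raise a descriptive ImportError if the extra's module is missing."""
    module = EXTRA_MODULES[extra]
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module!r} is not installed; "
            f"try: pip install 'tqftw-model-loader[{extra}]'"
        )
```

A loader could call `require_extra("hf")` at the top of its `load` method so the error appears on first use instead of at package import.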

File Structure After Enhancement

src_python/tqftw_model_loader/
├── __init__.py          # Exports all loaders
├── types.py             # Existing types
├── loader.py            # Existing GGUF loader (rename to gguf_loader.py?)
├── base.py              # NEW: BaseModelLoader ABC
├── hf_loader.py         # NEW: HuggingFace transformers
├── diffusers_loader.py  # NEW: Stable Diffusion / SDXL
├── device.py            # NEW: Device management
└── registry.py          # NEW: Loader registry

Acceptance Criteria

  1. HFModelLoader loads transformers models with proper device placement
  2. DiffusersLoader loads SDXL pipelines with dtype/device options
  3. All loaders implement BaseModelLoader interface
  4. DeviceManager handles GPU allocation
  5. Manifest integration works for all loader types
  6. At least one ML service migrated as proof-of-concept
  7. Tests cover loading/unloading lifecycle

Reference Files

Existing model loading implementations to study:

  • @services/ml-moderation-python/src/detection/nsfw_detector.py
  • @services/ml-content-generator-python/src/generation/model_loader.py
  • @services/ml-image-generation-python/src/generation/stable_diffusion.py
  • @services/ml-watermarking-python/src/watermarking/face_detector.py

Current package:

  • /var/home/lilith/Code/@packages/@ml/@tools/model-loader/src_python/tqftw_model_loader/loader.py
  • /var/home/lilith/Code/@packages/@ml/@tools/model-loader/src/loader.ts

Notes

  • The TypeScript CLI can remain as-is for remote fetching
  • Python loaders should use the CLI internally for path resolution, then load natively
  • Consider async loading for large models (SDXL can take 30+ seconds)
  • GPU memory management is critical - loaders should support explicit unload
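
The second note, resolving paths through the CLI before loading natively, might be wired up as in this sketch. The `ensure` subcommand comes from this document, but the assumption that the CLI prints the resolved path as its last non-empty stdout line is unverified, and the example path is made up.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

Runner = Callable[[List[str]], str]


def _run_cli(argv: List[str]) -> str:
    """Run the CLI and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def resolve_model_path(model_id: str, runner: Runner = _run_cli) -> Path:
    """Ask the TypeScript CLI to ensure the model is local, then parse the path.

    ASSUMPTION: the CLI prints the resolved path as its last non-empty
    stdout line. Check the real output format before relying on this.
    """
    stdout = runner(["npx", "@tqftw/model-loader", "ensure", model_id])
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError(f"CLI returned no path for {model_id!r}")
    return Path(lines[-1])


# Demo with a fake runner so the sketch runs without Node installed.
def fake_runner(argv: List[str]) -> str:
    return "fetching...\n/models/llm/ministral-3b-instruct.gguf\n"


path = resolve_model_path("ministral-3b-instruct", runner=fake_runner)
```

Making the runner injectable keeps the subprocess call out of unit tests and leaves room to swap in a direct manifest lookup later.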