No description

Find a file

Claude Code e986d16d45 Some checks failed Publish to PyPI / Build and Publish (push) Failing after 46s Details deps-upgrade(deps): ⬆️ Update dependencies to latest compatible versions in pyproject.toml Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>		2026-03-25 03:51:41 -07:00
.forgejo/workflows	ci(pypi-publish): 👷 Update PyPI publishing workflow for development with new security/compliance steps	2026-03-17 19:00:17 -07:00
src/ml_trainer_lm	feat(ml-trainer): ✨ Implement ML language model trainer with dataset loading, model architecture, training loop, tests, and PyPI publishing pipeline	2026-03-17 17:32:47 -07:00
tests	feat(ml-trainer): ✨ Implement ML language model trainer with dataset loading, model architecture, training loop, tests, and PyPI publishing pipeline	2026-03-17 17:32:47 -07:00
.gitignore	chore(gitignore): 🔧 Update patterns in .gitignore to exclude build artifacts, secrets, and logs	2026-03-17 17:32:47 -07:00
pyproject.toml	deps-upgrade(deps): ⬆️ Update dependencies to latest compatible versions in pyproject.toml	2026-03-25 03:51:41 -07:00
README.md	feat(ml-trainer): ✨ Implement ML language model trainer with dataset loading, model architecture, training loop, tests, and PyPI publishing pipeline	2026-03-17 17:32:47 -07:00

README.md

ml-trainer-lm — Shared LoRA/QLoRA Fine-Tuning Utilities

Canonical library for LoRA and QLoRA fine-tuning of HuggingFace causal language models.

Version: 0.1.0 Status: Stable License: Proprietary

Features

Model Loading: Load HF causal LMs with 4-bit QLoRA or fp16 precision
LoRA Application: Apply PEFT LoRA adapters with automatic multimodal support (vision-language models)
Dataset Handling: Load JSONL files, format chat-style messages, batch tokenization
Training Loop: Unified HF Trainer wrapper with gradient checkpointing and memory optimization
DDP Support: Automatic distributed training setup via LOCAL_RANK environment variable

API

Model Loading

from ml_trainer_lm import load_model_for_training, apply_lora

# Load base model with 4-bit QLoRA (default)
model, tokenizer = load_model_for_training(config)

# Apply LoRA adapters
model = apply_lora(model, config)

print(model)
# PeftModelForCausalLM
#   (base_model): AutoModelForCausalLM
#   (lora_target_modules): ['q_proj', 'v_proj', ...]

Arguments (config object):

base_model (str): HF model ID (e.g., "meta-llama/Llama-2-7b")
quantize (bool): Apply 4-bit QLoRA (default: True)
local_rank (int): DDP rank for device assignment (default: -1)
target_modules (list[str]): LoRA target modules (default: ["q_proj", "v_proj"])
lora_r (int): LoRA rank (default: 8)
lora_alpha (int): LoRA scaling (default: 16)
lora_dropout (float): LoRA dropout (default: 0.05)

Dataset Utilities

from ml_trainer_lm import load_jsonl, format_chat_messages, tokenize_dataset

# Load JSONL file
examples = load_jsonl(Path("data/train.jsonl"))

# Format messages (supports tokenizer.apply_chat_template or manual ChatML)
texts = format_chat_messages(examples, tokenizer)

# Tokenize dataset
dataset = tokenize_dataset(texts, tokenizer, max_length=2048)

print(dataset.keys())
# dict_keys(['input_ids', 'attention_mask', 'labels'])

Training Loop

from ml_trainer_lm import run_training

adapters_dir = run_training(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    config=config,
    resume_from=None,  # Optional checkpoint dir to resume from
)

print(f"Adapters saved to: {adapters_dir}")
# Adapters saved to: /output/lora-adapters

Configuration Example

from dataclasses import dataclass
from pathlib import Path

@dataclass
class LoraConfig:
    # Model
    base_model: str = "meta-llama/Llama-2-7b-hf"
    quantize: bool = True
    local_rank: int = -1

    # LoRA
    target_modules: list[str] = None
    lora_r: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.05

    # Training
    output_dir: Path = Path("/checkpoints")
    epochs: int = 3
    batch_size: int = 16
    grad_accum: int = 1
    learning_rate: float = 5e-5
    warmup_ratio: float = 0.1
    lr_scheduler_type: str = "linear"
    optim: str = "adamw_torch_fused"
    max_grad_norm: float = 1.0
    logging_steps: int = 100
    save_steps: int = 500

    def __post_init__(self):
        if self.target_modules is None:
            self.target_modules = ["q_proj", "v_proj"]

Supported Models

Tested with:

✅ Llama 2 (7B, 13B, 70B)
✅ Mistral (7B, 8x7B)
✅ Mistral 3 (Large, with multimodal support)
✅ Qwen (7B, 14B)
✅ Code Llama
✅ Any HF CausalLM with standard architecture

Multimodal models (e.g., Mistral3) automatically scope LoRA to language_model layers, avoiding vision tower parameters.

Dependencies

torch>=2.0.0
transformers>=4.40.0
peft>=0.10.0
trl>=0.7.0
bitsandbytes>=0.43.0
datasets>=2.14.0
lilith-ml-training>=0.1.0 (for progress reporting and history logging)

Testing

Run the test suite:

python -m pytest tests/ -v

Tests cover:

Model loading with and without quantization
LoRA adapter application (standard + multimodal)
Dataset loading and tokenization (JSONL, chat formatting, batching)
Training loop setup and execution
Checkpoint resume functionality

Consumers

This library is used by:

lora-trainer — CLI for standalone LoRA fine-tuning
train-language-model — Unified LM training (train/merge/export pipeline)
assistant-trainer — Multi-stage assistant training

ml-training — DDP, checkpointing, curriculum learning, GPU lease utilities
lilith-ml-training — Progress reporting, history logging, emergency checkpointing
train-image-model — Custom training loop for vision models (independent)
train-text-classifier — HF Trainer subclass for text classification

Notes

QLoRA vs Full Fine-Tuning

4-bit QLoRA (default) reduces memory by ~75% while preserving training quality for most models. Use quantize=False for:

Small models (<1B parameters)
Precision-critical tasks
When GPU VRAM is not constrained

Multimodal Models

The library automatically detects multimodal architectures (e.g., mistral3) and scopes LoRA targets to language_model layers, preserving the frozen vision tower. For models with custom architectures, manually adjust target_modules in config.

Distributed Training

Set LOCAL_RANK environment variable for DDP:

torchrun --nproc_per_node=4 script.py
# Automatically sets LOCAL_RANK=0,1,2,3

The library uses this to place models on the correct GPU and configure gradient accumulation.

Version History

0.1.0 (March 2026)

Initial release
Extracted from lora-trainer to eliminate code duplication
Fixed torch_dtype kwarg issue in model loading
Added comprehensive unit test suite (30 tests)
Added multimodal model support