peft-fine-tuning

1views

1installs

Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use when fine-tuning large models (7B-70B) with limited GPU memory, when you need to train <1% of parameters with minimal accuracy loss, or for multi-adapter serving. HuggingFace's official library integrated with transformers ecosystem.

Install

mkdir -p .claude/skills/peft-fine-tuning && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4153" && unzip -o skill.zip -d .claude/skills/peft-fine-tuning && rm skill.zip

Installs to .claude/skills/peft-fine-tuning

About this skill

PEFT (Parameter-Efficient Fine-Tuning)

Fine-tune LLMs by training <1% of parameters using LoRA, QLoRA, and 25+ adapter methods.

When to use PEFT

Use PEFT/LoRA when:

Fine-tuning 7B-70B models on consumer GPUs (RTX 4090, A100)
Need to train <1% parameters (6MB adapters vs 14GB full model)
Want fast iteration with multiple task-specific adapters
Deploying multiple fine-tuned variants from one base model

Use QLoRA (PEFT + quantization) when:

Fine-tuning 70B models on single 24GB GPU
Memory is the primary constraint
Can accept ~5% quality trade-off vs full fine-tuning

Use full fine-tuning instead when:

Training small models (<1B parameters)
Need maximum quality and have compute budget
Significant domain shift requires updating all weights

Quick start

Installation

# Basic installation
pip install peft

# With quantization support (recommended)
pip install peft bitsandbytes

# Full stack
pip install peft transformers accelerate bitsandbytes datasets

LoRA fine-tuning (standard)

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import get_peft_model, LoraConfig, TaskType
from datasets import load_dataset

# Load base model
model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank (8-64, higher = more capacity)
    lora_alpha=32,                 # Scaling factor (typically 2*r)
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Attention layers
    bias="none"                    # Don't train biases
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,043,307,008 || trainable%: 0.17%

# Prepare dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=512, padding="max_length")

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training
training_args = TrainingArguments(
    output_dir="./lora-llama",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=lambda data: {"input_ids": torch.stack([f["input_ids"] for f in data]),
                                 "attention_mask": torch.stack([f["attention_mask"] for f in data]),
                                 "labels": torch.stack([f["input_ids"] for f in data])}
)

trainer.train()

# Save adapter only (6MB vs 16GB)
model.save_pretrained("./lora-llama-adapter")

QLoRA fine-tuning (memory-efficient)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4 (best for LLMs)
    bnb_4bit_compute_dtype="bfloat16",   # Compute in bf16
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare for training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# LoRA config for QLoRA
lora_config = LoraConfig(
    r=64,                              # Higher rank for 70B
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
# 70B model now fits on single 24GB GPU!

LoRA parameter selection

Rank (r) - capacity vs efficiency

Rank	Trainable Params	Memory	Quality	Use Case
4	~3M	Minimal	Lower	Simple tasks, prototyping
8	~7M	Low	Good	Recommended starting point
16	~14M	Medium	Better	General fine-tuning
32	~27M	Higher	High	Complex tasks
64	~54M	High	Highest	Domain adaptation, 70B models

Alpha (lora_alpha) - scaling factor

# Rule of thumb: alpha = 2 * rank
LoraConfig(r=16, lora_alpha=32)  # Standard
LoraConfig(r=16, lora_alpha=16)  # Conservative (lower learning rate effect)
LoraConfig(r=16, lora_alpha=64)  # Aggressive (higher learning rate effect)

Target modules by architecture

# Llama / Mistral / Qwen
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# GPT-2 / GPT-Neo
target_modules = ["c_attn", "c_proj", "c_fc"]

# Falcon
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# BLOOM
target_modules = ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"]

# Auto-detect all linear layers
target_modules = "all-linear"  # PEFT 0.6.0+

Loading and merging adapters

Load trained adapter

from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM

# Option 1: Load with PeftModel
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora-llama-adapter")

# Option 2: Load directly (recommended)
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-llama-adapter",
    device_map="auto"
)

Merge adapter into base model

# Merge for deployment (no adapter overhead)
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")

# Push to Hub
merged_model.push_to_hub("username/llama-finetuned")

Multi-adapter serving

from peft import PeftModel

# Load base with first adapter
model = AutoPeftModelForCausalLM.from_pretrained("./adapter-task1")

# Load additional adapters
model.load_adapter("./adapter-task2", adapter_name="task2")
model.load_adapter("./adapter-task3", adapter_name="task3")

# Switch between adapters at runtime
model.set_adapter("task1")  # Use task1 adapter
output1 = model.generate(**inputs)

model.set_adapter("task2")  # Switch to task2
output2 = model.generate(**inputs)

# Disable adapters (use base model)
with model.disable_adapter():
    base_output = model.generate(**inputs)

PEFT methods comparison

Method	Trainable %	Memory	Speed	Best For
LoRA	0.1-1%	Low	Fast	General fine-tuning
QLoRA	0.1-1%	Very Low	Medium	Memory-constrained
AdaLoRA	0.1-1%	Low	Medium	Automatic rank selection
IA3	0.01%	Minimal	Fastest	Few-shot adaptation
Prefix Tuning	0.1%	Low	Medium	Generation control
Prompt Tuning	0.001%	Minimal	Fast	Simple task adaptation
P-Tuning v2	0.1%	Low	Medium	NLU tasks

IA3 (minimal parameters)

from peft import IA3Config

ia3_config = IA3Config(
    target_modules=["q_proj", "v_proj", "k_proj", "down_proj"],
    feedforward_modules=["down_proj"]
)
model = get_peft_model(model, ia3_config)
# Trains only 0.01% of parameters!

Prefix Tuning

from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # Prepended tokens
    prefix_projection=True       # Use MLP projection
)
model = get_peft_model(model, prefix_config)

Integration patterns

With TRL (SFTTrainer)

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./output", max_seq_length=512),
    train_dataset=dataset,
    peft_config=lora_config,  # Pass LoRA config directly
)
trainer.train()

With Axolotl (YAML config)

# axolotl config.yaml
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_target_linear: true  # Target all linear layers

With vLLM (inference)

from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model with LoRA support
llm = LLM(model="meta-llama/Llama-3.1-8B", enable_lora=True)

# Serve with adapter
outputs = llm.generate(
    prompts,
    lora_request=LoRARequest("adapter1", 1, "./lora-adapter")
)

Performance benchmarks

Memory usage (Llama 3.1 8B)

Method	GPU Memory	Trainable Params
Full fine-tuning	60+ GB	8B (100%)
LoRA r=16	18 GB	14M (0.17%)
QLoRA r=16	6 GB	14M (0.17%)
IA3	16 GB	800K (0.01%)

Training speed (A100 80GB)

Method	Tokens/sec	vs Full FT
Full FT	2,500	1x
LoRA	3,200	1.3x
QLoRA	2,100	0.84x

Quality (MMLU benchmark)

Model	Full FT	LoRA	QLoRA
Llama 2-7B	45.3	44.8	44.1
Llama 2-13B	54.8	54.2	53.5

Common issues

CUDA OOM during training

# Solution 1: Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Solution 2: Reduce batch size + increase accumulation
TrainingArguments(
    per

---

*Content truncated.*

More by davila7

View all skills by davila7 →

software-architecture

davila7

Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.

526190

planning-with-files

davila7

Implements Manus-style file-based planning for complex tasks. Creates task_plan.md, findings.md, and progress.md. Use when starting complex multi-step tasks, research projects, or any task requiring >5 tool calls.

84107

scroll-experience

davila7

Expert in building immersive scroll-driven experiences - parallax storytelling, scroll animations, interactive narratives, and cinematic web experiences. Like NY Times interactives, Apple product pages, and award-winning web experiences. Makes websites feel like experiences, not just pages. Use when: scroll animation, parallax, scroll storytelling, interactive story, cinematic website.

13087

humanizer

davila7

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases. Credits: Original skill by @blader - https://github.com/blader/humanizer

11557

game-development

davila7

Game development orchestrator. Routes to platform-specific skills based on project needs.

15249

2d-games

davila7

2D game development principles. Sprites, tilemaps, physics, camera.

14448

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

1,6811,428

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

1,2591,319

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

1,5271,144

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

1,349807

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,261727

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

1,466674

Related MCP Servers

Browse all servers

Browser Use

Browser Use lets LLMs and agents access and scrape any website in real time, making web scraping and web page scraping e

79,9420 tools

TypeScript Refactoring

TypeScript Refactoring offers advanced TypeScript/JavaScript code analysis and intelligent refactoring for seamless and

4400 tools

DINO-X

DINO-X is a powerful multimodal AI model that lets you detect, localize, and describe anything in images using natural l

1120 tools

Auth0

Integrate Auth0 with AI agents to manage Auth0 operations using natural language. Easily create apps and retrieve domain

940 tools

LLMs.txt Explorer

Explore websites using llms.txt files with LLMs.txt Explorer. Fetch and parse site-specific language model instructions

750 tools

HTTP Request

Advanced web scraper lets LLMs bypass anti-bot protection using HTTP requests, ideal for web scraping tools like Octopar

430 tools

Install

mkdir -p .claude/skills/peft-fine-tuning && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4153" && unzip -o skill.zip -d .claude/skills/peft-fine-tuning && rm skill.zip

Installs to .claude/skills/peft-fine-tuning

Stats

Views

Installs

Author

davila7

7 skills published

Links

Source Code

peft-fine-tuning

Install

About this skill

PEFT (Parameter-Efficient Fine-Tuning)

When to use PEFT

Quick start

Installation

LoRA fine-tuning (standard)

QLoRA fine-tuning (memory-efficient)

LoRA parameter selection

Rank (r) - capacity vs efficiency

Alpha (lora_alpha) - scaling factor

Target modules by architecture

Loading and merging adapters

Load trained adapter

Merge adapter into base model

Multi-adapter serving

PEFT methods comparison

IA3 (minimal parameters)

Prefix Tuning

Integration patterns

With TRL (SFTTrainer)

With Axolotl (YAML config)

With vLLM (inference)

Performance benchmarks

Memory usage (Llama 3.1 8B)

Training speed (A100 80GB)

Quality (MMLU benchmark)

Common issues

CUDA OOM during training

More by davila7

software-architecture

planning-with-files

scroll-experience

humanizer

game-development

2d-games

You might also like

flutter-development

ui-ux-pro-max

drawio-diagrams-enhanced

godot

nano-banana-pro

pdf-to-markdown

Related MCP Servers