book-sft-pipeline
This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
Install
```bash
mkdir -p .claude/skills/book-sft-pipeline && curl -L -o skill.zip "https://mcp.directory/api/skills/download/1782" && unzip -o skill.zip -d .claude/skills/book-sft-pipeline && rm skill.zip
```
Installs to .claude/skills/book-sft-pipeline
About this skill
Book SFT Pipeline
A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.
When to Activate
Activate this skill when:
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models (8B or less) on limited data
Core Concepts
The Three Pillars of Book SFT
1. Intelligent Segmentation
Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.
2. Diverse Instruction Generation
Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.
3. Style Over Content
The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.
Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR AGENT                        │
│  Coordinates pipeline phases, manages state, handles failures   │
└───────────────────────┬─────────────────────────────────────────┘
                        │
        ┌───────────────┼───────────────┬───────────────┐
        ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │ INSTRUCTION  │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │  Chunks →    │ │   Pairs →    │
│              │ │ 150-400 words│ │   Prompts    │ │    JSONL     │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                        │
        ┌───────────────┴───────────────┐
        ▼                               ▼
┌──────────────┐                ┌──────────────┐
│   TRAINING   │                │  VALIDATION  │
│    AGENT     │                │    AGENT     │
│   LoRA on    │                │ AI detector  │
│    Tinker    │                │ Originality  │
└──────────────┘                └──────────────┘
```
Phase 1: Text Extraction
Critical Rules
- Always source ePub over PDF - OCR errors become learned patterns
- Use paragraph-level extraction - Extract from <p> tags to preserve breaks
- Remove front/back matter - Copyright and TOC pollute the dataset
```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```
Phase 2: Intelligent Segmentation
Smaller Chunks + Overlap
Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).
```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words
    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```
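A quick usage sketch tying extraction and segmentation together; the file path is hypothetical and counts vary by book.

```python
# Hypothetical path; any DRM-free ePub works.
text = extract_epub("books/novel.epub")
chunks = segment(text)
print(f"{len(text.split())} words -> {len(chunks)} chunks")
```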
Expected Results
For an 86,000-word book:
- Old method (250-650 words): ~150 chunks
- New method (150-400 + overlap): ~300 chunks
- With 2 variants per chunk: 600+ training examples
Phase 3: Diverse Instruction Generation
The Key Insight
Using a single prompt template causes memorization. Diverse templates teach the underlying style.
```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```
Instruction Generation
```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.
Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```
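`llm_call` is left abstract above. A minimal sketch of one possible implementation, assuming the `google-generativeai` client and a `GEMINI_API_KEY` environment variable; any fast, cheap LLM client would work equally well.

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
_model = genai.GenerativeModel("gemini-1.5-flash")

def llm_call(prompt: str) -> str:
    # Return the model's text response for a single prompt.
    return _model.generate_content(prompt).text.strip()
```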
Phase 4: Dataset Construction
Message Format
```json
{
  "messages": [
    {"role": "system", "content": "You are an expert creative writer..."},
    {"role": "user", "content": "Write in the style of Author: Scene description..."},
    {"role": "assistant", "content": "The actual book text from chunk..."}
  ]
}
```
Multiple Variants Per Chunk
```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```
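A minimal sketch of assembling the full JSONL dataset with a held-out test set (per the 50-example guideline later in this document). `Chunk` is a stand-in for whatever object the segmentation phase produces; the output file names are hypothetical.

```python
import json
import random
from dataclasses import dataclass

@dataclass
class Chunk:
    id: int
    text: str

def write_dataset(chunks, instructions, author, test_size=50):
    examples = []
    for chunk, instruction in zip(chunks, instructions):
        examples.extend(build_examples(chunk, instruction, author))
    random.shuffle(examples)
    test, train = examples[:test_size], examples[test_size:]
    for name, split in [("train.jsonl", train), ("test.jsonl", test)]:
        with open(name, "w") as f:
            for ex in split:
                f.write(json.dumps(ex) + "\n")
    return len(train), len(test)
```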
Phase 5: LoRA Training on Tinker
Configuration
```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                      # 352MB adapter
    "learning_rate": 5e-4,                # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```
Why Base Model?
Use base (pretrained) models, not instruction-tuned versions:
- Base models are more malleable for new styles
- Instruct models have patterns that resist overwriting
- Style is a low-level pattern that base models capture better
Training Loop
```python
import tinker
from tinker import types

# service_client: a Tinker service client created earlier in the session
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base",
    rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```
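`batches` above is assumed to be prepared beforehand. A simple shuffle-and-chunk over the training examples might look like the sketch below; converting each example into Tinker's tokenized datum format is library-specific and not shown here.

```python
import random

def make_batches(examples, batch_size=4, seed=0):
    # Shuffle once, then slice into fixed-size batches.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
```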
Phase 6: Validation
Modern Scenario Test
Test with scenarios that couldn't exist in the original book:
```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```
If the model applies style markers to modern scenarios, it learned style, not content.
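A sketch of how this test might be run, assuming the trained adapter has been exported from Tinker in a PEFT-compatible format; the local adapter path and the `AUTHOR` value are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base")
model = PeftModel.from_pretrained(base, "./adapters/final")  # hypothetical local path

AUTHOR = "the target author"
for prompt in TEST_PROMPTS:
    # Match the prompt formatting used during training as closely as possible.
    user = f"Write a passage in the style of {AUTHOR}: {prompt}"
    inputs = tokenizer(user, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.8)
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```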
Originality Verification
```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```
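Beyond spot-checking with grep, a programmatic check can flag long n-gram overlap between model outputs and the training data. A minimal sketch, assuming the dataset file above:

```python
import json

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Collect all 8-word n-grams from the assistant turns in the training data.
train_ngrams = set()
with open("dataset.jsonl") as f:
    for line in f:
        example = json.loads(line)
        train_ngrams |= ngrams(example["messages"][-1]["content"])

def overlap_ratio(output_text):
    # For 8-word n-grams, any non-trivial overlap suggests verbatim copying.
    grams = ngrams(output_text)
    return len(grams & train_ngrams) / max(len(grams), 1)
```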
AI Detector Testing
Test outputs with GPTZero, Pangram, or ZeroGPT.
Known Issues and Solutions
Character Name Leakage
Symptom: Model uses original character names in new scenarios. Cause: Limited name diversity from one book. Solution: Train on multiple books or add synthetic examples.
Model Parrots Exact Phrases
Symptom: Outputs contain exact sentences from training data. Cause: Too few prompt variations or too many epochs. Solution: Use 15+ templates, limit to 3 epochs.
Fragmented Outputs
Symptom: Sentences feel incomplete. Cause: Poor segmentation breaking mid-thought. Solution: Always break at paragraph boundaries.
Guidelines
- Always source ePub over PDF - OCR errors become learned patterns
- Never break mid-sentence - Boundaries must be grammatically complete
- Use diverse prompts - 15+ templates, 5+ system prompts
- Use base models - Not instruct versions
- Use smaller chunks - 150-400 words for more examples
- Reserve test set - 50 examples minimum
- Test on modern scenarios - Proves style transfer vs memorization
- Verify originality - Grep training data for output phrases
Expected Results
| Metric | Value |
|---|---|
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% perfect |
Cost Estimate
| Component | Cost |
|---|---|
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| Total | ~$2.00 |
Integration with Context Engineering Skills
This example applies several skil
Content truncated.