Content Core

by lfnovo

Extracts and processes content from URLs, documents, videos, audio files, and images into clean, structured text. Uses AI to automatically detect media types and apply the right extraction method.



What it does

  • Extract text from PDFs, Word docs, and other documents
  • Transcribe videos and audio files to text
  • Extract content from web URLs
  • Perform OCR on images to extract text
  • Process ZIP archives and other compressed files
  • Generate AI summaries of extracted content

Best for

  • Content researchers analyzing diverse media sources
  • Data analysts processing mixed document formats
  • AI developers building content processing pipelines
  • Anyone needing to extract text from various file types

Key advantages:

  • Auto-detects media types and chooses the extraction method
  • Handles 15+ file formats in one tool
  • Built-in AI summarization

About Content Core

Content Core is a community-built MCP server published by lfnovo that provides AI assistants with tools and capabilities via the Model Context Protocol. It extracts text and audio from URLs, documents, videos, and images for unified content processing. It is categorized under AI/ML and productivity. This server exposes one tool that AI clients can invoke during conversations and coding sessions.

How to install

You can install Content Core in your AI client of choice. Use the install panel on this page to get one-click setup for Cursor, Claude Desktop, VS Code, and other MCP-compatible clients. This server runs locally on your machine via the stdio transport.

License

Content Core is released under the MIT license. This is a permissive open-source license, meaning you can freely use, modify, and distribute the software.

Tools (1)

extract_content

Extract content from a URL or file using Content Core's auto engine.

Args:

  • url: Optional URL to extract content from.
  • file_path: Optional file path to extract content from.

Returns: JSON object containing extracted content and metadata.

Raises: ValueError if neither or both of url and file_path are provided.
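The exactly-one-of contract above can be sketched as follows. This is a minimal illustration of the validation logic, not the server's actual implementation; the stub payload is hypothetical:

```python
from typing import Optional

def extract_content(url: Optional[str] = None, file_path: Optional[str] = None) -> dict:
    """Enforce the exactly-one-of contract, then return a stub payload."""
    if (url is None) == (file_path is None):
        raise ValueError("Provide exactly one of `url` or `file_path`.")
    source = url if url is not None else file_path
    # A real implementation would dispatch to the auto engine here.
    return {"source": source, "content": "", "metadata": {}}
```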

Content Core


Content Core is a powerful, AI-powered content extraction and processing platform that transforms any source into clean, structured content. Extract text from websites, transcribe videos, process documents, and generate AI summaries—all through a unified interface with multiple integration options.

🚀 What You Can Do

Extract content from anywhere:

  • 📄 Documents - PDF, Word, PowerPoint, Excel, Markdown, HTML, EPUB
  • 🎥 Media - Videos (MP4, AVI, MOV) with automatic transcription
  • 🎵 Audio - MP3, WAV, M4A with speech-to-text conversion
  • 🌐 Web - Any URL with intelligent content extraction
  • 🖼️ Images - JPG, PNG, TIFF with OCR text recognition
  • 📦 Archives - ZIP, TAR, GZ with content analysis

Process with AI:

  • Clean & format extracted content automatically
  • 📝 Generate summaries with customizable styles (bullet points, executive summary, etc.)
  • 🎯 Context-aware processing - explain to a child, technical summary, action items
  • 🔄 Smart engine selection - automatically chooses the best extraction method

🛠️ Multiple Ways to Use

🖥️ Command Line (Zero Install)

# Extract content from any source
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" ccore document.pdf

# Generate AI summaries  
uvx --from "content-core" csum video.mp4 --context "bullet points"

🤖 Claude Desktop Integration

One-click setup with Model Context Protocol (MCP) - extract content directly in Claude conversations.

🔍 Raycast Extension

Smart auto-detection commands:

  • Extract Content - Full interface with format options
  • Summarize Content - 9 summary styles available
  • Quick Extract - Instant clipboard extraction

🖱️ macOS Right-Click Integration

Right-click any file in Finder → Services → Extract or Summarize content instantly.

🐍 Python Library

import content_core as cc

# Extract from any source
result = await cc.extract("https://example.com/article")
summary = await cc.summarize_content(result, context="explain to a child")

⚡ Key Features

  • 🎯 Intelligent Auto-Detection: Automatically selects the best extraction method based on content type and available services
  • 🔧 Smart Engine Selection:
    • URLs: Firecrawl → Jina → Crawl4AI (optional) → BeautifulSoup fallback chain
    • Documents: Docling → Enhanced PyMuPDF → Simple extraction fallback
    • Media: OpenAI Whisper transcription
    • Images: OCR with multiple engine support
  • 📊 Enhanced PDF Processing: Advanced PyMuPDF engine with quality flags, table detection, and optional OCR for mathematical formulas
  • 🌍 Multiple Integrations: CLI, Python library, MCP server, Raycast extension, macOS Services
  • ⚡ Zero-Install Options: Use uvx for instant access without installation
  • 🧠 AI-Powered Processing: LLM integration for content cleaning and summarization
  • 🔄 Asynchronous: Built with asyncio for efficient processing
  • 🐍 Pure Python Implementation: No system dependencies required - simplified installation across all platforms
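The fallback chains listed above amount to trying each engine in order and moving on when one fails. A minimal sketch of that pattern, with illustrative stand-in engines rather than Content Core's internals:

```python
def extract_with_fallback(source, engines):
    """Try each (name, extractor) pair in order; return the first success."""
    errors = []
    for name, extract in engines:
        try:
            return name, extract(source)
        except Exception as exc:  # a real chain would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all engines failed: " + "; ".join(errors))

# Hypothetical stand-ins for the URL chain (Firecrawl -> Jina):
def firecrawl(url):
    raise ConnectionError("no API key configured")

def jina(url):
    return f"clean text of {url}"

engine_used, text = extract_with_fallback(
    "https://example.com", [("firecrawl", firecrawl), ("jina", jina)]
)
```

Here the first engine fails, so the chain falls through to the second, mirroring how a missing Firecrawl key falls back to Jina.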

Getting Started

Installation

Install Content Core using pip - no system dependencies required!

# Basic installation (PyMuPDF + BeautifulSoup/Jina extraction)
pip install content-core

# With enhanced document processing (adds Docling)
pip install content-core[docling]

# With local browser-based URL extraction (adds Crawl4AI)
# Note: Requires Playwright browsers (~300MB). Run:
pip install content-core[crawl4ai]
python -m playwright install --with-deps

# Full installation (with all optional features)
pip install content-core[docling,crawl4ai]

Note: The core installation uses pure Python implementations and doesn't require system libraries like libmagic, ensuring consistent, hassle-free installation across Windows, macOS, and Linux. Optional features like Crawl4AI (browser automation) may require additional system dependencies.

Alternatively, if you’re developing locally:

# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core

# Install with uv
uv sync

Command-Line Interface

Content Core provides three CLI commands for extracting, cleaning, and summarizing content: ccore, cclean, and csum. These commands support input from text, URLs, files, or piped data (e.g., via cat file | command).

Zero-install usage with uvx:

# Extract content
uvx --from "content-core" ccore https://example.com

# Clean content  
uvx --from "content-core" cclean "messy content"

# Summarize content
uvx --from "content-core" csum "long text" --context "bullet points"

ccore - Extract Content

Extracts content from text, URLs, or files, with optional formatting. Usage:

ccore [-f|--format xml|json|text] [-d|--debug] [content]

Options:

  • -f, --format: Output format (xml, json, or text). Default: text.
  • -d, --debug: Enable debug logging.
  • content: Input content (text, URL, or file path). If omitted, reads from stdin.

Examples:

# Extract from a URL as text
ccore https://example.com

# Extract from a file as JSON
ccore -f json document.pdf

# Extract from piped text as XML
echo "Sample text" | ccore --format xml

cclean - Clean Content

Cleans content by removing unnecessary formatting, spaces, or artifacts. Accepts text, JSON, XML input, URLs, or file paths. Usage:

cclean [-d|--debug] [content]

Options:

  • -d, --debug: Enable debug logging.
  • content: Input content to clean (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

# Clean a text string
cclean "  messy   text   "

# Clean piped JSON
echo '{"content": "  messy   text   "}' | cclean

# Clean content from a URL
cclean https://example.com

# Clean a file’s content
cclean document.txt

csum - Summarize Content

Summarizes content with an optional context to guide the summary style. Accepts text, JSON, XML input, URLs, or file paths.

Usage:

csum [--context "context text"] [-d|--debug] [content]

Options:

  • --context: Context for summarization (e.g., "explain to a child"). Default: none.
  • -d, --debug: Enable debug logging.
  • content: Input content to summarize (text, URL, file path, JSON, or XML). If omitted, reads from stdin.

Examples:

# Summarize text
csum "AI is transforming industries."

# Summarize with context
csum --context "in bullet points" "AI is transforming industries."

# Summarize piped content
cat article.txt | csum --context "one sentence"

# Summarize content from URL
csum https://example.com

# Summarize a file's content
csum document.txt

Quick Start

You can quickly integrate content-core into your Python projects to extract, clean, and summarize content from various sources.

import content_core as cc

# Extract content from a URL, file, or text
result = await cc.extract("https://example.com/article")

# Clean messy content
cleaned_text = await cc.clean("...messy text with [brackets] and extra spaces...")

# Summarize content with optional context
summary = await cc.summarize_content("long article text", context="explain to a child")

# Extract audio with custom speech-to-text model
from content_core.common import ProcessSourceInput
result = await cc.extract(ProcessSourceInput(
    file_path="interview.mp3",
    audio_provider="openai",
    audio_model="whisper-1"
))
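Because `cc.extract`, `cc.clean`, and `cc.summarize_content` are coroutines, they must run inside an event loop. A self-contained sketch of the usual wrapper, with a stub coroutine standing in for `content_core.extract` so it runs without the library installed:

```python
import asyncio

async def extract_stub(source: str) -> dict:
    # Hypothetical stand-in for `content_core.extract`, which is a coroutine.
    return {"source": source, "content": "example text"}

async def main() -> dict:
    return await extract_stub("https://example.com/article")

result = asyncio.run(main())
```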

Documentation

For more information on how to use the Content Core library, including details on AI model configuration and customization, refer to our Usage Documentation.

MCP Server Integration

Content Core includes a Model Context Protocol (MCP) server that enables seamless integration with Claude Desktop and other MCP-compatible applications. The MCP server exposes Content Core's powerful extraction capabilities through a standardized protocol.


Quick Setup with Claude Desktop

# Install Content Core (MCP server included)
pip install content-core

# Or use directly with uvx (no installation required)
uvx --from "content-core" content-core-mcp

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
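If you prefer to patch the config programmatically, merging the entry into the `mcpServers` map preserves any servers already registered. A small sketch (the helper name is illustrative; writing the result back to your client's config path is left out because that path varies by OS):

```python
import json

def add_server(config: dict, name: str, server: dict) -> dict:
    """Merge one MCP server entry into a claude_desktop_config.json dict."""
    config.setdefault("mcpServers", {})[name] = server
    return config

config = add_server(
    {},
    "content-core",
    {"command": "uvx", "args": ["--from", "content-core", "content-core-mcp"]},
)
print(json.dumps(config, indent=2))
```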

For detailed setup instructions, configuration options, and usage examples, see our MCP Documentation.

Enhanced PDF Processing

Content Core features an optimized PyMuPDF extraction engine with significant improvements for scientific documents and complex PDFs.

Key Improvements

  • 🔬 Mathematical Formula Extraction: E

README truncated. View full README on GitHub.
