github-archive


Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011.

Install

mkdir -p .claude/skills/github-archive && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3846" && unzip -o skill.zip -d .claude/skills/github-archive && rm skill.zip

Installs to .claude/skills/github-archive

About this skill

GitHub Archive

Purpose: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.

When to Use This Skill

  • Investigating security incidents involving GitHub repositories
  • Building threat actor attribution profiles
  • Verifying claims about repository activity (media reports, incident reports)
  • Reconstructing attack timelines with definitive timestamps
  • Analyzing automation system compromises
  • Detecting supply chain reconnaissance
  • Cross-repository behavioral analysis
  • Workflow execution verification (legitimate vs API abuse)
  • Pattern-based anomaly detection
  • Recovering deleted content: PRs, issues, branches, tags, entire repositories

GitHub Archive analysis should be your FIRST step in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.

Core Principles

ALWAYS PREFER GitHub Archive as forensic evidence over:

  • Local git command outputs (git log, git show) - commits can be backdated/forged
  • Unverified claims from articles or reports - require independent confirmation
  • GitHub web interface screenshots - can be manipulated
  • Single-source evidence - always cross-verify

GitHub Archive IS your ground truth for:

  • Actor attribution (who performed actions)
  • Timeline reconstruction (when events occurred)
  • Event verification (what actually happened)
  • Pattern analysis (behavioral fingerprinting)
  • Cross-repository activity tracking
  • Deleted content recovery (issues, PRs, tags, commit references remain in archive)
  • Repository deletion forensics (commit SHAs persist even after repo deletion and history rewrites)

What Persists After Deletion

Deleted Issues & PRs:

  • Issue creation events (IssuesEvent) remain in archive
  • Issue comments (IssueCommentEvent) remain accessible
  • PR open/close/merge events (PullRequestEvent) persist
  • Forensic Value: Recover deleted evidence of social engineering, reconnaissance, or coordination

Deleted Tags & Branches:

  • CreateEvent records for tag/branch creation persist
  • DeleteEvent records document when deletion occurred
  • Forensic Value: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)

Deleted Repositories:

  • All PushEvent records to the repository remain queryable
  • Commit SHAs are permanently recorded in archive
  • Fork relationships (ForkEvent) survive deletion
  • Forensic Value: Access commit metadata even after threat actor deletes evidence

Deleted User Accounts:

  • All activity events remain attributed to deleted username
  • Timeline reconstruction remains possible
  • Limitation: Direct code access lost, but commit SHAs can be searched elsewhere
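The persistence rules above can be put to work directly. Below is a minimal sketch of building the kind of query used to recover PushEvent records (and their commit SHAs) for a repository that has since been deleted; the repository name and table are placeholders, not values from a real investigation.

```python
# Sketch: recover PushEvent commit SHAs for a deleted repository.
# Repo name and table are illustrative placeholders.

def deleted_repo_push_query(repo: str, table: str) -> str:
    """Return SQL that pulls PushEvent records for `repo` from GitHub Archive."""
    return f"""
SELECT
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.ref') AS ref,
    JSON_EXTRACT(payload, '$.commits') AS commits  -- SHAs survive repo deletion
FROM `{table}`
WHERE
    repo.name = '{repo}'
    AND type = 'PushEvent'
ORDER BY created_at
"""

query = deleted_repo_push_query('deleted-org/deleted-repo',
                                'githubarchive.day.20250615')
```

Run the result through the client shown in Quick Start; even if the repository no longer exists on GitHub, the archived events still return.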

Quick Start

Investigate if user opened PRs in June 2025:

from google.cloud import bigquery
from google.oauth2 import service_account

# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Query for PR events
query = """
SELECT
    created_at,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
    actor.login = 'suspected-actor'
    AND repo.name = 'target/repository'
    AND type = 'PullRequestEvent'
ORDER BY created_at
"""

results = client.query(query)
for row in results:
    print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
    print(f"  Title: {row.pr_title}")

Expected Output (if PR exists):

2025-06-15 14:23:11 UTC: PR #123 - opened
  Title: Add new feature
2025-06-20 09:45:22 UTC: PR #123 - closed
  Title: Add new feature

Interpretation:

  • No results → Claim disproven (no PR activity found)
  • Results found → Claim verified, proceed with detailed analysis

Setup

Prerequisites

  1. Google Cloud Project:

    • Log in to the Google Cloud Console
    • Create a project and enable the BigQuery API
    • Create a service account with the BigQuery User role
    • Download the JSON credentials file
  2. Install BigQuery Client:

pip install google-cloud-bigquery google-auth

Initialize Client

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id
)

Free Tier: Google provides 1 TiB of query data processed per month free.

Cost Management & Query Optimization

Understanding GitHub Archive Costs

BigQuery charges $6.25 per TiB of data scanned (after the 1 TiB free tier). GitHub Archive tables are large - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. Unoptimized queries can cost $10-100+, while optimized versions of the same query cost $0.10-1.00.

Key Cost Principle: BigQuery uses columnar storage - you pay for ALL data in the columns you SELECT, not just matching rows. A query with SELECT * on one day of data scans ~3 GB even with LIMIT 10.
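To make the pricing concrete, the per-query cost is simple arithmetic on bytes scanned at the $6.25/TiB on-demand rate quoted above (free tier ignored for simplicity):

```python
# Back-of-envelope cost math at the on-demand rate of $6.25 per TiB scanned.
PRICE_PER_TIB_USD = 6.25

def scan_cost_usd(bytes_scanned: int) -> float:
    """On-demand cost of scanning `bytes_scanned`, ignoring the free tier."""
    return bytes_scanned / (1024 ** 4) * PRICE_PER_TIB_USD

one_day = 3 * 1024 ** 3       # ~3 GB: one day of data with SELECT *
full_year = 400 * 1024 ** 3   # ~400 GB: a year-long daily wildcard

print(f"One day:  ${scan_cost_usd(one_day):.4f}")   # One day:  $0.0183
print(f"One year: ${scan_cost_usd(full_year):.2f}") # One year: $2.44
```

Note that the yearly wildcard alone burns through a meaningful fraction of the monthly free tier, which is why the optimization rules below matter.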

ALWAYS Estimate Costs Before Querying

CRITICAL RULE: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.

from google.cloud import bigquery

def estimate_gharchive_cost(query: str) -> dict:
    """Estimate cost before running GitHub Archive query."""
    client = bigquery.Client()

    # Dry run - validates query and returns bytes to scan
    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(query, job_config=dry_run_config)

    bytes_processed = job.total_bytes_processed
    gb_processed = bytes_processed / (1024**3)
    tib_processed = bytes_processed / (1024**4)
    estimated_cost = tib_processed * 6.25

    return {
        'bytes': bytes_processed,
        'gigabytes': round(gb_processed, 2),
        'tib': round(tib_processed, 4),
        'estimated_cost_usd': round(estimated_cost, 4)
    }

# Example: Always check cost before running
estimate = estimate_gharchive_cost(your_query)
print(f"Cost estimate: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")

if estimate['estimated_cost_usd'] > 1.0:
    print("⚠️ HIGH COST QUERY - Review optimization before proceeding")

Command-line dry run:

bq query --dry_run --use_legacy_sql=false 'YOUR_QUERY_HERE' 2>&1 | grep "bytes"

When to Ask the User About Costs

ASK USER BEFORE RUNNING if any of these conditions apply:

  1. Estimated cost > $1.00 - Always confirm with user for queries over $1
  2. Wildcard spans > 3 months - Queries like githubarchive.day.2025* scan the entire year (~400 GB)
  3. No partition filter - Queries without date/time filters scan entire table range
  4. SELECT * used - Selecting all columns dramatically increases cost
  5. Cross-repository searches - Queries without repo.name filter scan all GitHub activity

Example user confirmation:

Query estimate: 120 GB ($0.75)
Scanning: githubarchive.day.202506* (June 2025, 30 days)
Reason: Cross-repository search for actor 'suspected-user'

This exceeds typical query cost ($0.10-0.30). Proceed? [y/n]

DON'T ASK if:

  • Estimated cost < $0.50 AND query is well-scoped (specific repo + date range)
  • User explicitly requested broad analysis (e.g., "scan all of 2025")
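The ask/don't-ask rules above can be sketched as a small gate function. This is one possible encoding of the thresholds listed, with illustrative parameter names, not an API the skill ships:

```python
def should_confirm(estimate_usd: float,
                   months_spanned: float,
                   has_partition_filter: bool,
                   uses_select_star: bool,
                   has_repo_filter: bool,
                   user_requested_broad: bool = False) -> bool:
    """Return True if the user should confirm before the query runs."""
    if user_requested_broad:
        return False  # user explicitly asked for a broad scan
    if estimate_usd < 0.50 and has_partition_filter and has_repo_filter:
        return False  # cheap and well-scoped: run without asking
    return (
        estimate_usd > 1.00          # rule 1: over $1
        or months_spanned > 3        # rule 2: wide wildcard
        or not has_partition_filter  # rule 3: no date filter
        or uses_select_star          # rule 4: SELECT *
        or not has_repo_filter       # rule 5: cross-repository search
    )
```

For example, the $0.75 cross-repository search shown above would trigger confirmation (no repo filter), while a $0.20 single-repo, single-month query would not.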

Cost Optimization Techniques for GitHub Archive

1. Select Only Required Columns (50-90% cost reduction)

-- ❌ EXPENSIVE: Scans ALL columns (~3 GB per day)
SELECT * FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans only needed columns (~0.3 GB per day)
SELECT
    type,
    created_at,
    repo.name,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Never use SELECT * in production queries. Always specify exact columns needed.

2. Use Specific Date Ranges (10-100x cost reduction)

-- ❌ EXPENSIVE: Scans entire year (~400 GB)
SELECT ... FROM `githubarchive.day.2025*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans specific month (~40 GB)
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ BEST: Scans single day (~3 GB)
SELECT ... FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Strategy: Start with narrow date ranges (1-7 days), then expand if needed. Use monthly tables (githubarchive.month.202506) for multi-month queries instead of daily wildcards.
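The table-selection strategy above can be sketched as a helper that picks the cheapest table pattern for a date window. This is an illustrative sketch (the multi-month branch assumes the window stays within one calendar year and relies on a WHERE clause to trim extra months):

```python
from datetime import date

def gharchive_table(start: date, end: date) -> str:
    """Pick a GitHub Archive table pattern for a date window (sketch).

    Single day -> exact daily table; same month -> daily wildcard for
    that month; longer spans -> monthly tables via a month wildcard
    (assumes one calendar year; trim extra months in WHERE).
    """
    if start == end:
        return f"githubarchive.day.{start:%Y%m%d}"
    if (start.year, start.month) == (end.year, end.month):
        return f"githubarchive.day.{start:%Y%m}*"
    return f"githubarchive.month.{start:%Y}*"

print(gharchive_table(date(2025, 6, 15), date(2025, 6, 15)))
# githubarchive.day.20250615
```

Starting from the narrowest pattern and widening only when the narrow query comes back empty keeps most investigations in the $0.10-0.30 range.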

3. Filter by Repository Name (5-50x cost reduction)

-- ❌ EXPENSIVE: Scans all GitHub activity
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Filter by repo (BigQuery can prune data blocks)
SELECT ... FROM `githubarchive.day.202506*`
WHERE
    repo.name = 'target-org/target-repo'
    AND actor.login = 'target-user'

Rule: Always include repo.name filter when investigating a specific repository.

4. Avoid SELECT * with Wildcards (Critical)

-- ❌ CATASTROPHIC: Can scan 1+ TiB ($6.25+)
SELECT * FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

-- ✅ OPTIMIZED: Scans ~50 GB ($0.31)
SELECT
    created_at,
    actor.login,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

5. Use LIMIT Correctly (Does NOT reduce cost on GHArchive)

IMPORTANT: LIMIT does *not* reduce the amount of data scanned. BigQuery reads every selected column in full before applying LIMIT, so a query with LIMIT 10 is billed the same as one returning all rows. Use LIMIT only to cap result size; control cost by narrowing columns, date ranges, and filters as shown above.
