github-archive


Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011.

Install

mkdir -p .claude/skills/github-archive && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3846" && unzip -o skill.zip -d .claude/skills/github-archive && rm skill.zip

Installs to .claude/skills/github-archive

About this skill

GitHub Archive

Purpose: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.

When to Use This Skill

  • Investigating security incidents involving GitHub repositories
  • Building threat actor attribution profiles
  • Verifying claims about repository activity (media reports, incident reports)
  • Reconstructing attack timelines with definitive timestamps
  • Analyzing automation system compromises
  • Detecting supply chain reconnaissance
  • Cross-repository behavioral analysis
  • Workflow execution verification (legitimate vs API abuse)
  • Pattern-based anomaly detection
  • Recovering deleted content: PRs, issues, branches, tags, entire repositories

GitHub Archive analysis should be your FIRST step in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.

Core Principles

ALWAYS PREFER GitHub Archive as forensic evidence over:

  • Local git command outputs (git log, git show) - commits can be backdated/forged
  • Unverified claims from articles or reports - require independent confirmation
  • GitHub web interface screenshots - can be manipulated
  • Single-source evidence - always cross-verify

GitHub Archive IS your ground truth for:

  • Actor attribution (who performed actions)
  • Timeline reconstruction (when events occurred)
  • Event verification (what actually happened)
  • Pattern analysis (behavioral fingerprinting)
  • Cross-repository activity tracking
  • Deleted content recovery (issues, PRs, tags, commit references remain in archive)
  • Repository deletion forensics (commit SHAs persist even after repo deletion and history rewrites)

What Persists After Deletion

Deleted Issues & PRs:

  • Issue creation events (IssuesEvent) remain in archive
  • Issue comments (IssueCommentEvent) remain accessible
  • PR open/close/merge events (PullRequestEvent) persist
  • Forensic Value: Recover deleted evidence of social engineering, reconnaissance, or coordination

Deleted Tags & Branches:

  • CreateEvent records for tag/branch creation persist
  • DeleteEvent records document when deletion occurred
  • Forensic Value: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)

Deleted Repositories:

  • All PushEvent records to the repository remain queryable
  • Commit SHAs are permanently recorded in archive
  • Fork relationships (ForkEvent) survive deletion
  • Forensic Value: Access commit metadata even after threat actor deletes evidence

Deleted User Accounts:

  • All activity events remain attributed to deleted username
  • Timeline reconstruction remains possible
  • Limitation: Direct code access lost, but commit SHAs can be searched elsewhere
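The persistence rules above can be put to work directly. Below is a minimal sketch of building the kind of query used to recover PushEvent records (and their commit SHAs) for a repository that has since been deleted; the repository name and table are placeholders, not values from a real investigation.

```python
# Sketch: recover PushEvent commit SHAs for a deleted repository.
# Repo name and table are illustrative placeholders.

def deleted_repo_push_query(repo: str, table: str) -> str:
    """Return SQL that pulls PushEvent records for `repo` from GitHub Archive."""
    return f"""
SELECT
    created_at,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.ref') AS ref,
    JSON_EXTRACT(payload, '$.commits') AS commits  -- SHAs survive repo deletion
FROM `{table}`
WHERE
    repo.name = '{repo}'
    AND type = 'PushEvent'
ORDER BY created_at
"""

query = deleted_repo_push_query('deleted-org/deleted-repo',
                                'githubarchive.day.20250615')
```

Run the result through the client shown in Quick Start; even if the repository no longer exists on GitHub, the archived events still return.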

Quick Start

Investigate if user opened PRs in June 2025:

from google.cloud import bigquery
from google.oauth2 import service_account

# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Query for PR events
query = """
SELECT
    created_at,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
    JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
    actor.login = 'suspected-actor'
    AND repo.name = 'target/repository'
    AND type = 'PullRequestEvent'
ORDER BY created_at
"""

results = client.query(query)
for row in results:
    print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
    print(f"  Title: {row.pr_title}")

Expected Output (if PR exists):

2025-06-15 14:23:11 UTC: PR #123 - opened
  Title: Add new feature
2025-06-20 09:45:22 UTC: PR #123 - closed
  Title: Add new feature

Interpretation:

  • No results → Claim disproven (no PR activity found)
  • Results found → Claim verified, proceed with detailed analysis

Setup

Prerequisites

  1. Google Cloud Project:

    • Log in to the Google Cloud Console
    • Create a project and enable the BigQuery API
    • Create a service account with the BigQuery User role
    • Download the JSON credentials file
  2. Install BigQuery Client:

pip install google-cloud-bigquery google-auth

Initialize Client

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'path/to/credentials.json',
    scopes=['https://www.googleapis.com/auth/bigquery']
)

client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id
)

Free Tier: Google provides 1 TiB of query data processed per month free.

Cost Management & Query Optimization

Understanding GitHub Archive Costs

BigQuery charges $6.25 per TiB of data scanned (after the 1 TiB free tier). GitHub Archive tables are large - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. Unoptimized queries can cost $10-100+, while optimized versions of the same query cost $0.10-1.00.

Key Cost Principle: BigQuery uses columnar storage - you pay for ALL data in the columns you SELECT, not just matching rows. A query with SELECT * on one day of data scans ~3 GB even with LIMIT 10.
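To make the pricing concrete, the per-query cost is simple arithmetic on bytes scanned at the $6.25/TiB on-demand rate quoted above (free tier ignored for simplicity):

```python
# Back-of-envelope cost math at the on-demand rate of $6.25 per TiB scanned.
PRICE_PER_TIB_USD = 6.25

def scan_cost_usd(bytes_scanned: int) -> float:
    """On-demand cost of scanning `bytes_scanned`, ignoring the free tier."""
    return bytes_scanned / (1024 ** 4) * PRICE_PER_TIB_USD

one_day = 3 * 1024 ** 3       # ~3 GB: one day of data with SELECT *
full_year = 400 * 1024 ** 3   # ~400 GB: a year-long daily wildcard

print(f"One day:  ${scan_cost_usd(one_day):.4f}")   # One day:  $0.0183
print(f"One year: ${scan_cost_usd(full_year):.2f}") # One year: $2.44
```

Note that the yearly wildcard alone burns through a meaningful fraction of the monthly free tier, which is why the optimization rules below matter.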

ALWAYS Estimate Costs Before Querying

CRITICAL RULE: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.

from google.cloud import bigquery

def estimate_gharchive_cost(query: str) -> dict:
    """Estimate cost before running GitHub Archive query."""
    client = bigquery.Client()

    # Dry run - validates query and returns bytes to scan
    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(query, job_config=dry_run_config)

    bytes_processed = job.total_bytes_processed
    gb_processed = bytes_processed / (1024**3)
    tib_processed = bytes_processed / (1024**4)
    estimated_cost = tib_processed * 6.25

    return {
        'bytes': bytes_processed,
        'gigabytes': round(gb_processed, 2),
        'tib': round(tib_processed, 4),
        'estimated_cost_usd': round(estimated_cost, 4)
    }

# Example: Always check cost before running
estimate = estimate_gharchive_cost(your_query)
print(f"Cost estimate: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")

if estimate['estimated_cost_usd'] > 1.0:
    print("⚠️ HIGH COST QUERY - Review optimization before proceeding")

Command-line dry run:

bq query --dry_run --use_legacy_sql=false 'YOUR_QUERY_HERE' 2>&1 | grep "bytes"

When to Ask the User About Costs

ASK USER BEFORE RUNNING if any of these conditions apply:

  1. Estimated cost > $1.00 - Always confirm with user for queries over $1
  2. Wildcard spans > 3 months - Queries like githubarchive.day.2025* scan the entire year (~400 GB)
  3. No partition filter - Queries without date/time filters scan entire table range
  4. SELECT * used - Selecting all columns dramatically increases cost
  5. Cross-repository searches - Queries without repo.name filter scan all GitHub activity

Example user confirmation:

Query estimate: 120 GB ($0.75)
Scanning: githubarchive.day.202506* (June 2025, 30 days)
Reason: Cross-repository search for actor 'suspected-user'

This exceeds typical query cost ($0.10-0.30). Proceed? [y/n]

DON'T ASK if:

  • Estimated cost < $0.50 AND query is well-scoped (specific repo + date range)
  • User explicitly requested broad analysis (e.g., "scan all of 2025")
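The ask/don't-ask rules above can be sketched as a small gate function. This is one possible encoding of the thresholds listed, with illustrative parameter names, not an API the skill ships:

```python
def should_confirm(estimate_usd: float,
                   months_spanned: float,
                   has_partition_filter: bool,
                   uses_select_star: bool,
                   has_repo_filter: bool,
                   user_requested_broad: bool = False) -> bool:
    """Return True if the user should confirm before the query runs."""
    if user_requested_broad:
        return False  # user explicitly asked for a broad scan
    if estimate_usd < 0.50 and has_partition_filter and has_repo_filter:
        return False  # cheap and well-scoped: run without asking
    return (
        estimate_usd > 1.00          # rule 1: over $1
        or months_spanned > 3        # rule 2: wide wildcard
        or not has_partition_filter  # rule 3: no date filter
        or uses_select_star          # rule 4: SELECT *
        or not has_repo_filter       # rule 5: cross-repository search
    )
```

For example, the $0.75 cross-repository search shown above would trigger confirmation (no repo filter), while a $0.20 single-repo, single-month query would not.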

Cost Optimization Techniques for GitHub Archive

1. Select Only Required Columns (50-90% cost reduction)

-- ❌ EXPENSIVE: Scans ALL columns (~3 GB per day)
SELECT * FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans only needed columns (~0.3 GB per day)
SELECT
    type,
    created_at,
    repo.name,
    actor.login,
    JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Never use SELECT * in production queries. Always specify exact columns needed.

2. Use Specific Date Ranges (10-100x cost reduction)

-- ❌ EXPENSIVE: Scans entire year (~400 GB)
SELECT ... FROM `githubarchive.day.2025*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Scans specific month (~40 GB)
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ BEST: Scans single day (~3 GB)
SELECT ... FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'

Strategy: Start with narrow date ranges (1-7 days), then expand if needed. Use monthly tables (githubarchive.month.202506) for multi-month queries instead of daily wildcards.
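The table-selection strategy above can be sketched as a helper that picks the cheapest table pattern for a date window. This is an illustrative sketch (the multi-month branch assumes the window stays within one calendar year and relies on a WHERE clause to trim extra months):

```python
from datetime import date

def gharchive_table(start: date, end: date) -> str:
    """Pick a GitHub Archive table pattern for a date window (sketch).

    Single day -> exact daily table; same month -> daily wildcard for
    that month; longer spans -> monthly tables via a month wildcard
    (assumes one calendar year; trim extra months in WHERE).
    """
    if start == end:
        return f"githubarchive.day.{start:%Y%m%d}"
    if (start.year, start.month) == (end.year, end.month):
        return f"githubarchive.day.{start:%Y%m}*"
    return f"githubarchive.month.{start:%Y}*"

print(gharchive_table(date(2025, 6, 15), date(2025, 6, 15)))
# githubarchive.day.20250615
```

Starting from the narrowest pattern and widening only when the narrow query comes back empty keeps most investigations in the $0.10-0.30 range.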

3. Filter by Repository Name (5-50x cost reduction)

-- ❌ EXPENSIVE: Scans all GitHub activity
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'

-- ✅ OPTIMIZED: Filter by repo (BigQuery can prune data blocks)
SELECT ... FROM `githubarchive.day.202506*`
WHERE
    repo.name = 'target-org/target-repo'
    AND actor.login = 'target-user'

Rule: Always include repo.name filter when investigating a specific repository.

4. Avoid SELECT * with Wildcards (Critical)

-- ❌ CATASTROPHIC: Can scan 1+ TiB ($6.25+)
SELECT * FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

-- ✅ OPTIMIZED: Scans ~50 GB ($0.31)
SELECT
    created_at,
    actor.login,
    repo.name,
    JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'

5. Use LIMIT Correctly (Does NOT reduce cost on GHArchive)

IMPORTANT: LIMIT does *not* reduce the amount of data scanned. BigQuery reads every selected column in full before applying LIMIT, so a query with LIMIT 10 is billed the same as one returning all rows. Use LIMIT only to cap result size; control cost by narrowing columns, date ranges, and filters as shown above.
