github-archive
Investigate GitHub security incidents using tamper-proof GitHub Archive data via BigQuery. Use when verifying repository activity claims, recovering deleted PRs/branches/tags/repos, attributing actions to actors, or reconstructing attack timelines. Provides immutable forensic evidence of all public GitHub events since 2011.
Install
mkdir -p .claude/skills/github-archive && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3846" && unzip -o skill.zip -d .claude/skills/github-archive && rm skill.zip
Installs to .claude/skills/github-archive
About this skill
GitHub Archive
Purpose: Query immutable GitHub event history via BigQuery to obtain tamper-proof forensic evidence for security investigations.
When to Use This Skill
- Investigating security incidents involving GitHub repositories
- Building threat actor attribution profiles
- Verifying claims about repository activity (media reports, incident reports)
- Reconstructing attack timelines with definitive timestamps
- Analyzing automation system compromises
- Detecting supply chain reconnaissance
- Cross-repository behavioral analysis
- Workflow execution verification (legitimate vs API abuse)
- Pattern-based anomaly detection
- Recovering deleted content: PRs, issues, branches, tags, entire repositories
GitHub Archive analysis should be your FIRST step in any GitHub-related security investigation. Start with the immutable record, then enrich with additional sources.
Core Principles
ALWAYS PREFER GitHub Archive as forensic evidence over:
- Local git command outputs (git log, git show) - commits can be backdated/forged
- Unverified claims from articles or reports - require independent confirmation
- GitHub web interface screenshots - can be manipulated
- Single-source evidence - always cross-verify
GitHub Archive IS your ground truth for:
- Actor attribution (who performed actions)
- Timeline reconstruction (when events occurred)
- Event verification (what actually happened)
- Pattern analysis (behavioral fingerprinting)
- Cross-repository activity tracking
- Deleted content recovery (issues, PRs, tags, commit references remain in archive)
- Repository deletion forensics (commit SHAs persist even after repo deletion and history rewrites)
What Persists After Deletion
Deleted Issues & PRs:
- Issue creation events (IssuesEvent) remain in archive
- Issue comments (IssueCommentEvent) remain accessible
- PR open/close/merge events (PullRequestEvent) persist
- Forensic Value: Recover deleted evidence of social engineering, reconnaissance, or coordination
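A minimal sketch of what that recovery can look like (the repository name and June 2025 date range below are placeholders): comment text survives in the IssueCommentEvent payload even after the issue or comment is deleted.
-- Recover issue/PR comment text even after the issue or comment is deleted
-- (placeholder repo and date range)
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.issue.number') as issue_number,
  JSON_EXTRACT_SCALAR(payload, '$.comment.body') as comment_body
FROM `githubarchive.day.202506*`
WHERE
  repo.name = 'target-org/target-repo'
  AND type = 'IssueCommentEvent'
ORDER BY created_at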
Deleted Tags & Branches:
- CreateEvent records for tag/branch creation persist
- DeleteEvent records document when deletion occurred
- Forensic Value: Reconstruct attack staging infrastructure (e.g., malicious payload delivery tags)
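A sketch of that reconstruction, assuming a placeholder repository and date range: pairing CreateEvent and DeleteEvent rows shows when each tag or branch appeared and when it was removed.
-- Pair tag/branch creation and deletion events (placeholder repo and date range)
SELECT
  type,
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') as ref_name,
  JSON_EXTRACT_SCALAR(payload, '$.ref_type') as ref_type
FROM `githubarchive.day.202506*`
WHERE
  repo.name = 'target-org/target-repo'
  AND type IN ('CreateEvent', 'DeleteEvent')
ORDER BY created_at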
Deleted Repositories:
- All PushEvent records to the repository remain queryable
- Commit SHAs are permanently recorded in archive
- Fork relationships (ForkEvent) survive deletion
- Forensic Value: Access commit metadata even after threat actor deletes evidence
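A minimal sketch of pulling commit metadata for a repository that no longer exists (repository name and date range are placeholders): PushEvent payloads carry a commits array whose SHAs and messages remain queryable.
-- List commit SHAs pushed to a repository that has since been deleted
-- (placeholder repo and date range)
SELECT
  created_at,
  actor.login,
  JSON_EXTRACT_SCALAR(payload, '$.ref') as branch,
  JSON_EXTRACT_SCALAR(commit_json, '$.sha') as commit_sha,
  JSON_EXTRACT_SCALAR(commit_json, '$.message') as commit_message
FROM `githubarchive.day.202506*`,
  UNNEST(JSON_EXTRACT_ARRAY(payload, '$.commits')) as commit_json
WHERE
  repo.name = 'target-org/deleted-repo'
  AND type = 'PushEvent'
ORDER BY created_at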
Deleted User Accounts:
- All activity events remain attributed to deleted username
- Timeline reconstruction remains possible
- Limitation: Direct code access lost, but commit SHAs can be searched elsewhere
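As an illustration, a summary like the following can rebuild an activity profile for a username even after the account is gone (the username and month table are placeholders):
-- Summarize archived activity attributed to a username after account deletion
-- (placeholder username and month)
SELECT
  type,
  repo.name as repo_name,
  COUNT(*) as event_count,
  MIN(created_at) as first_seen,
  MAX(created_at) as last_seen
FROM `githubarchive.month.202506`
WHERE actor.login = 'deleted-username'
GROUP BY type, repo_name
ORDER BY event_count DESC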
Quick Start
Investigate if user opened PRs in June 2025:
from google.cloud import bigquery
from google.oauth2 import service_account
# Initialize client (see Setup section for credentials)
credentials = service_account.Credentials.from_service_account_file(
'path/to/credentials.json',
scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
# Query for PR events
query = """
SELECT
created_at,
repo.name,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.number') as pr_number,
JSON_EXTRACT_SCALAR(payload, '$.pull_request.title') as pr_title,
JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.202506*`
WHERE
actor.login = 'suspected-actor'
AND repo.name = 'target/repository'
AND type = 'PullRequestEvent'
ORDER BY created_at
"""
results = client.query(query)
for row in results:
print(f"{row.created_at}: PR #{row.pr_number} - {row.action}")
print(f" Title: {row.pr_title}")
Expected Output (if PR exists):
2025-06-15 14:23:11 UTC: PR #123 - opened
Title: Add new feature
2025-06-20 09:45:22 UTC: PR #123 - closed
Title: Add new feature
Interpretation:
- No results → Claim disproven (no PR activity found)
- Results found → Claim verified, proceed with detailed analysis
Setup
Prerequisites
- Google Cloud Project:
  - Log in to the Google Developer Console
  - Create a project and activate the BigQuery API
  - Create a service account with the BigQuery User role
  - Download the JSON credentials file
- Install BigQuery Client:
  pip install google-cloud-bigquery google-auth
Initialize Client
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file(
'path/to/credentials.json',
scopes=['https://www.googleapis.com/auth/bigquery']
)
client = bigquery.Client(
credentials=credentials,
project=credentials.project_id
)
Free Tier: Google provides 1 TiB of data processed per month free.
Cost Management & Query Optimization
Understanding GitHub Archive Costs
BigQuery charges $6.25 per TiB of data scanned (after the 1 TiB free tier). GitHub Archive tables are large - a single month table can be 50-100 GB, and yearly wildcards can scan multiple TiBs. Unoptimized queries can cost $10-100+, while optimized versions of the same query cost $0.10-1.00.
Key Cost Principle: BigQuery uses columnar storage - you pay for ALL data in the columns you SELECT, not just matching rows. A query with SELECT * on one day of data scans ~3 GB even with LIMIT 10.
ALWAYS Estimate Costs Before Querying
CRITICAL RULE: Run a dry run to estimate costs before executing any query against GitHub Archive production tables.
from google.cloud import bigquery
def estimate_gharchive_cost(query: str) -> dict:
"""Estimate cost before running GitHub Archive query."""
client = bigquery.Client()
# Dry run - validates query and returns bytes to scan
dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=dry_run_config)
bytes_processed = job.total_bytes_processed
gb_processed = bytes_processed / (1024**3)
tib_processed = bytes_processed / (1024**4)
estimated_cost = tib_processed * 6.25
return {
'bytes': bytes_processed,
'gigabytes': round(gb_processed, 2),
'tib': round(tib_processed, 4),
'estimated_cost_usd': round(estimated_cost, 4)
}
# Example: Always check cost before running
estimate = estimate_gharchive_cost(your_query)
print(f"Cost estimate: {estimate['gigabytes']} GB → ${estimate['estimated_cost_usd']}")
if estimate['estimated_cost_usd'] > 1.0:
print("⚠️ HIGH COST QUERY - Review optimization before proceeding")
Command-line dry run:
bq query --dry_run --use_legacy_sql=false 'YOUR_QUERY_HERE' 2>&1 | grep "bytes"
When to Ask the User About Costs
ASK USER BEFORE RUNNING if any of these conditions apply:
- Estimated cost > $1.00 - Always confirm with user for queries over $1
- Wildcard spans > 3 months - Queries like githubarchive.day.2025* scan the entire year (~400 GB)
- No partition filter - Queries without date/time filters scan the entire table range
- SELECT * used - Selecting all columns dramatically increases cost
- Cross-repository searches - Queries without a repo.name filter scan all GitHub activity
Example user confirmation:
Query estimate: 120 GB ($0.75)
Scanning: githubarchive.day.202506* (June 2025, 30 days)
Reason: Cross-repository search for actor 'suspected-user'
This exceeds typical query cost ($0.10-0.30). Proceed? [y/n]
DON'T ASK if:
- Estimated cost < $0.50 AND query is well-scoped (specific repo + date range)
- User explicitly requested broad analysis (e.g., "scan all of 2025")
Cost Optimization Techniques for GitHub Archive
1. Select Only Required Columns (50-90% cost reduction)
-- ❌ EXPENSIVE: Scans ALL columns (~3 GB per day)
SELECT * FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Scans only needed columns (~0.3 GB per day)
SELECT
type,
created_at,
repo.name,
actor.login,
JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
Never use SELECT * in production queries. Always specify exact columns needed.
2. Use Specific Date Ranges (10-100x cost reduction)
-- ❌ EXPENSIVE: Scans entire year (~400 GB)
SELECT ... FROM `githubarchive.day.2025*`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Scans specific month (~40 GB)
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'
-- ✅ BEST: Scans single day (~3 GB)
SELECT ... FROM `githubarchive.day.20250615`
WHERE actor.login = 'target-user'
Strategy: Start with narrow date ranges (1-7 days), then expand if needed. Use monthly tables (githubarchive.month.202506) for multi-month queries instead of daily wildcards.
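For a multi-month window, _TABLE_SUFFIX pruning on the monthly tables keeps the scan limited to just the months you need; the repository name below is a placeholder.
-- Scan only June and July 2025 by pruning monthly tables with _TABLE_SUFFIX
SELECT
  created_at,
  actor.login,
  repo.name,
  JSON_EXTRACT_SCALAR(payload, '$.action') as action
FROM `githubarchive.month.2025*`
WHERE
  _TABLE_SUFFIX BETWEEN '06' AND '07'
  AND repo.name = 'target-org/target-repo'
  AND type = 'PullRequestEvent'
ORDER BY created_at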
3. Filter by Repository Name (5-50x cost reduction)
-- ❌ EXPENSIVE: Scans all GitHub activity
SELECT ... FROM `githubarchive.day.202506*`
WHERE actor.login = 'target-user'
-- ✅ OPTIMIZED: Filter by repo (BigQuery can prune data blocks)
SELECT ... FROM `githubarchive.day.202506*`
WHERE
repo.name = 'target-org/target-repo'
AND actor.login = 'target-user'
Rule: Always include repo.name filter when investigating a specific repository.
4. Avoid SELECT * with Wildcards (Critical)
-- ❌ CATASTROPHIC: Can scan 1+ TiB ($6.25+)
SELECT * FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'
-- ✅ OPTIMIZED: Scans ~50 GB ($0.31)
SELECT
created_at,
actor.login,
repo.name,
JSON_EXTRACT_SCALAR(payload, '$.ref') as branch
FROM `githubarchive.day.2025*`
WHERE type = 'PushEvent'
5. Use LIMIT Correctly (Does NOT reduce cost on GHArchive)
IMPORTANT: LIMIT does NOT reduce the amount of data scanned. BigQuery reads every selected column for all matching rows before applying LIMIT, so a query with LIMIT 10 costs the same as the query without it. Use LIMIT to keep result sets manageable, not to control cost.
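A quick way to confirm this is to dry-run the same query with and without LIMIT and compare the byte estimates, which should be identical. A minimal illustration (the date is arbitrary):
-- Both queries below scan the same number of bytes; LIMIT only trims the
-- rows returned, so verify cost with a dry run, not with LIMIT
SELECT created_at, actor.login, repo.name
FROM `githubarchive.day.20250615`
WHERE type = 'PushEvent'
-- Adding LIMIT: identical bytes scanned, identical cost
SELECT created_at, actor.login, repo.name
FROM `githubarchive.day.20250615`
WHERE type = 'PushEvent'
LIMIT 10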