Parsing Data from GitHub (Crypto Project Activity)
GitHub activity is one of the few independently verifiable off-chain signals for fundamental analysis of crypto projects. Commit counts, active-contributor counts, release frequency, issue-closure speed — all are publicly available and correlate with real development work. Services like Santiment and Electric Capital use exactly these metrics as part of their developer activity scores.
GitHub REST API vs GraphQL API
GitHub provides two APIs. For collecting repository data the GraphQL API (v4) is the better fit: it returns in one request what would take 5–10 REST requests:
```graphql
query RepoActivity($owner: String!, $repo: String!) {
  repository(owner: $owner, name: $repo) {
    stargazerCount
    forkCount
    defaultBranchRef {
      target {
        ... on Commit {
          history(first: 100) {
            totalCount
            nodes {
              committedDate
              author { name email }
              additions
              deletions
            }
          }
        }
      }
    }
    releases(last: 5) {
      nodes { tagName createdAt }
    }
    issues(states: OPEN) { totalCount }
    pullRequests(states: MERGED, last: 30) {
      nodes { createdAt mergedAt }
    }
  }
}
```
One request — and you have the last 100 commits on the default branch, the releases, open issues, and merged PRs. The REST API would require separate requests to `/commits`, `/releases`, `/issues`, and `/pulls`.
Authentication: a Personal Access Token (PAT) — a classic token with the `public_repo` scope, or a fine-grained token with read-only access to public repositories. Without a token the rate limit is 60 req/hour; with one it is 5,000 req/hour for REST and 5,000 points/hour for GraphQL. A PAT is created via Settings → Developer settings → Personal access tokens.
What Exactly to Collect
For a developer-activity dashboard, a standard metric set:
| Metric | Endpoint / Field | Collection frequency |
|---|---|---|
| Commit count (30d/90d) | `defaultBranchRef.target.history.totalCount` | Daily |
| Active contributors | `defaultBranchRef.target.history.nodes[].author` | Daily |
| Code churn (additions + deletions) | `history.nodes[].additions/deletions` | Daily |
| Stars / forks | `stargazerCount`, `forkCount` | Daily |
| Issues velocity | open issues + issues closed in last 30d | Weekly |
| PR merge time (median) | `pullRequests.mergedAt - createdAt` | Weekly |
| Release cadence | dates of the latest releases | Weekly |
Contributor deduplication: one developer can commit under several emails. Normalize via the GitHub login (when the commit is associated with an account) or by fuzzy matching on the name. Bots (Dependabot, Renovate, github-actions) should be excluded from the human contributor count.
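A sketch of that normalization, assuming the GraphQL query above is extended with `user { login }` inside `author` (the bot pattern list is a heuristic, not exhaustive):

```python
import re
from collections import Counter

# Heuristic bot patterns — extend as needed
BOT_PATTERNS = re.compile(r"\[bot\]$|^dependabot|^renovate|^github-actions", re.I)

def human_contributors(commits: list[dict]) -> Counter:
    """Count commits per human contributor from GraphQL `history.nodes`.

    Prefers the GitHub login (stable across emails) over the commit
    name/email, and drops commits from known bots."""
    counts: Counter = Counter()
    for c in commits:
        author = c.get("author") or {}
        login = (author.get("user") or {}).get("login")
        name = author.get("name") or ""
        if BOT_PATTERNS.search(login or name):
            continue
        counts[login or author.get("email") or name] += 1
    return counts
```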
Handling Rate Limits and Large Number of Repositories
A crypto index of 200 projects — each with 5–10 repositories — means 1,000–2,000 repositories. With daily collection and a 5,000 points/hour limit, the request budget has to be managed.
GraphQL request cost grows with the number of requested nodes: GitHub's formula roughly divides the total node count by 100, with a minimum of 1 point per call, so a single query like the one above costs only a few points. But paginating deep commit history multiplies the cost per repository, and secondary rate limits on request volume apply on top — at 2,000 repositories the daily run can still exhaust one token's budget.
Solutions:

- Multiple PAT tokens with rotation. Note that rate limits are per account, not per token, so the tokens must belong to different accounts; a pool of 10 gives 50k points/hour — enough for daily collection of 2,000 repositories.
- Incremental collection: don't re-collect the full history every time. Store `last_collected_at` and request only new commits via the `since` parameter.
- Prioritization: "hot" repositories (high activity, large project TVL) are collected more often, "cold" ones once a week.
```python
import asyncio

import httpx

class GitHubRateLimiter:
    def __init__(self, tokens: list[str]):
        self.tokens = list(tokens)
        self.remaining = {t: 5000 for t in tokens}

    async def get_token(self) -> str:
        # Pick the token with the most remaining points
        token = max(self.tokens, key=lambda t: self.remaining[t])
        if self.remaining[token] < 100:
            await asyncio.sleep(3600)  # wait for the hourly reset
            self.remaining = {t: 5000 for t in self.tokens}
        return token

    async def graphql(self, query: str, variables: dict) -> dict:
        token = await self.get_token()
        async with httpx.AsyncClient() as client:
            resp = await client.post(
                "https://api.github.com/graphql",
                json={"query": query, "variables": variables},
                headers={"Authorization": f"Bearer {token}"},
            )
        # The GraphQL point budget is reported in the response headers
        remaining = resp.headers.get("x-ratelimit-remaining")
        if remaining is not None:
            self.remaining[token] = int(remaining)
        return resp.json()
```
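The incremental-collection idea can be sketched as a delta query: GraphQL's `history` connection accepts a `since: GitTimestamp` argument, so only commits after the stored `last_collected_at` are fetched. The query and helper below are an illustrative sketch, not the collector's actual code:

```python
from datetime import datetime, timezone

INCREMENTAL_QUERY = """
query NewCommits($owner: String!, $repo: String!, $since: GitTimestamp!) {
  repository(owner: $owner, name: $repo) {
    defaultBranchRef {
      target {
        ... on Commit {
          history(since: $since, first: 100) {
            totalCount
            nodes { committedDate additions deletions }
          }
        }
      }
    }
  }
}
"""

def incremental_variables(owner: str, repo: str, last_collected_at: datetime) -> dict:
    """Variables for a delta query. Expects a timezone-aware datetime,
    assumed to be stored per repository in the metrics DB."""
    return {
        "owner": owner,
        "repo": repo,
        "since": last_collected_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```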
Project to Repository Mapping
Compiling and maintaining the project → github_repos mapping is a separate task. Sources:
- Electric Capital Developer Report publishes an open-source mapping of projects to GitHub repositories
- DeFiLlama has a `github` field in its protocol data: `GET https://api.llama.fi/protocols` returns the protocol list with GitHub URLs
- Manual curation: for new projects or projects with non-standard GitHub org names
Important nuance: large projects (Ethereum, Solana, Uniswap) have dozens of repositories in their organizations. Summing activity across a whole org requires filtering — repo mirrors, forks, and documentation repositories distort the metrics.
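One way to apply that filter, using fields from the REST org listing (`GET /orgs/{org}/repos`); the set of documentation-repo names is a heuristic assumption:

```python
# Names that usually hold docs/marketing rather than code — heuristic, extend as needed
NON_CODE_NAMES = {"docs", "documentation", "website", "brand-assets"}

def is_relevant_repo(repo: dict) -> bool:
    """Decide whether a repo from GET /orgs/{org}/repos should count
    toward the project's activity: skip forks, mirrors, archived repos,
    and repos that look like pure documentation."""
    if repo.get("fork") or repo.get("archived") or repo.get("mirror_url"):
        return False
    return repo.get("name", "").lower() not in NON_CODE_NAMES
```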
Storage and Aggregation
TimescaleDB or ClickHouse for time-series metrics. Schema:
```sql
CREATE TABLE github_metrics (
    project_id       INTEGER,
    repo_full_name   TEXT,
    measured_date    DATE,
    commits_30d      INTEGER,
    contributors_30d INTEGER,
    additions_30d    BIGINT,
    deletions_30d    BIGINT,
    stars            INTEGER,
    open_issues      INTEGER,
    PRIMARY KEY (repo_full_name, measured_date)
);

-- Aggregate by project (all its repositories)
CREATE VIEW project_dev_activity AS
SELECT project_id, measured_date,
       SUM(commits_30d) AS total_commits,
       COUNT(DISTINCT repo_full_name) AS active_repos,
       -- NOTE: summing per-repo counts over-counts developers
       -- active in several repos of the same project
       SUM(contributors_30d) AS total_contributors
FROM github_metrics
GROUP BY project_id, measured_date;
```
Normalized developer-activity score: log(commits + 1) × log(contributors + 1) — the logarithm smooths outliers (one project with 10k commits shouldn't dominate the ranking).
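The score in one line, for reference (natural log; the choice of base only rescales the ranking):

```python
import math

def dev_activity_score(commits: int, contributors: int) -> float:
    """log(commits + 1) * log(contributors + 1), as defined above."""
    return math.log(commits + 1) * math.log(contributors + 1)
```

With this shape, going from 100 to 10,000 commits at a fixed contributor count roughly doubles the score rather than multiplying it by 100.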