GitHub Development Activity Scraping for Crypto Projects

Parsing Data from GitHub (Crypto Project Activity)

GitHub activity is one of the few verifiable off-chain signals available for fundamental analysis of crypto projects. Commit counts, active-contributor counts, release frequency, issue-closure speed — all of it is publicly available and correlates with real development work. Services like Santiment and Electric Capital use exactly these metrics as part of their developer activity scores.

GitHub REST API vs GraphQL API

GitHub provides two APIs. For collecting repository data the GraphQL API v4 is the better fit: it can fetch in one request what would take 5–10 REST requests:

query RepoActivity($owner: String!, $repo: String!) {
  repository(owner: $owner, name: $repo) {
    stargazerCount
    forkCount
    defaultBranchRef {
      target {
        ... on Commit {
          history(first: 100) {
            totalCount
            nodes {
              committedDate
              author { name email }
              additions
              deletions
            }
          }
        }
      }
    }
    releases(last: 5) {
      nodes { tagName createdAt }
    }
    issues(states: OPEN) { totalCount }
    pullRequests(states: MERGED, last: 30) {
      nodes { createdAt mergedAt }
    }
  }
}

One request — and you have the last 100 commits, recent releases, open issue counts, and merged PRs. The REST API would require separate requests to /commits, /releases, /issues and /pulls.

Authentication: a Personal Access Token (PAT). A classic token needs the public_repo scope; a fine-grained token needs read-only access to public repositories. Without a token the rate limit is 60 req/hour; with a token it is 5,000 req/hour for REST and 5,000 points/hour for GraphQL. PATs are created via Settings → Developer settings → Personal access tokens.
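As a minimal sketch of wiring the query above into an authenticated call, the request reduces to a single POST to the GraphQL endpoint (REPO_ACTIVITY_QUERY here is a stand-in for the query text shown above):

```python
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Stand-in for the RepoActivity query shown above, stored as a plain string
REPO_ACTIVITY_QUERY = "query RepoActivity($owner: String!, $repo: String!) { ... }"

def build_graphql_request(token: str, owner: str, repo: str) -> dict:
    """Assemble URL, headers and JSON body for one repository query."""
    return {
        "url": GITHUB_GRAPHQL_URL,
        "headers": {
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        "json": {
            "query": REPO_ACTIVITY_QUERY,
            "variables": {"owner": owner, "repo": repo},
        },
    }

req = build_graphql_request("ghp_example", "ethereum", "go-ethereum")
```

Any HTTP client can then send `req["json"]` to `req["url"]` with `req["headers"]`; the variables dict is where owner/repo are parameterized per repository.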

What Exactly to Collect

A standard metric set for a developer activity dashboard:

| Metric | Endpoint / field | Collection frequency |
|---|---|---|
| Commit count (30d/90d) | defaultBranchRef.target.history.totalCount | Daily |
| Active contributors | defaultBranchRef.target.history.nodes[].author | Daily |
| Code churn (additions + deletions) | history.nodes[].additions / deletions | Daily |
| Stars / forks | stargazerCount, forkCount | Daily |
| Issues velocity | open issues + closed in last 30d | Weekly |
| PR merge time (median) | pullRequests.mergedAt - createdAt | Weekly |
| Release cadence | dates of latest releases | Weekly |
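The PR merge-time metric falls straight out of the createdAt/mergedAt pairs the query already returns. A sketch, assuming the ISO-8601 timestamps GitHub emits:

```python
from datetime import datetime
from statistics import median

def median_merge_hours(pr_nodes: list[dict]) -> float:
    """Median time from PR creation to merge, in hours."""
    def parse(ts: str) -> datetime:
        # GitHub returns e.g. "2024-05-01T12:00:00Z"
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))

    durations = [
        (parse(n["mergedAt"]) - parse(n["createdAt"])).total_seconds() / 3600
        for n in pr_nodes
        if n.get("mergedAt")
    ]
    return median(durations) if durations else 0.0

prs = [
    {"createdAt": "2024-05-01T00:00:00Z", "mergedAt": "2024-05-01T06:00:00Z"},
    {"createdAt": "2024-05-02T00:00:00Z", "mergedAt": "2024-05-03T00:00:00Z"},
    {"createdAt": "2024-05-04T00:00:00Z", "mergedAt": "2024-05-04T12:00:00Z"},
]
# median of 6h, 24h, 12h -> 12.0
```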

Contributor deduplication: one developer can commit under several emails. Normalize via the GitHub login (when the commit is associated with an account) or by fuzzy matching on name. Bots (Dependabot, Renovate, github-actions) should be excluded from the human contributor count.
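A minimal sketch of the dedup-plus-bot-filter step; the bot name patterns here are illustrative, not exhaustive:

```python
import re

# Illustrative patterns for common CI/dependency bots
BOT_PATTERN = re.compile(r"(\[bot\]$|^dependabot|^renovate|^github-actions)", re.I)

def human_contributors(commit_nodes: list[dict]) -> set[str]:
    """Deduplicate commit authors by lowercased email, dropping known bots."""
    seen: dict[str, str] = {}  # email -> first display name seen
    for node in commit_nodes:
        author = node.get("author") or {}
        name, email = author.get("name", ""), author.get("email", "")
        if not email or BOT_PATTERN.search(name):
            continue
        seen.setdefault(email.lower(), name)
    return set(seen.values())

commits = [
    {"author": {"name": "Alice", "email": "alice@example.com"}},
    {"author": {"name": "Alice Smith", "email": "ALICE@example.com"}},
    {"author": {"name": "dependabot[bot]", "email": "support@github.com"}},
]
# -> {"Alice"}: two emails collapse to one person, the bot is dropped
```

In production the email key would be replaced by the GitHub login where the API associates one, with email matching only as a fallback.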

Handling Rate Limits and Large Number of Repositories

A crypto index of 200 projects, each with 5–10 repositories, means 1,000–2,000 repositories. With daily collection against a 5,000-points/hour limit, the request budget has to be managed.

GraphQL request cost is derived from the number of nodes a query can return, divided by 100 and rounded up (minimum 1 point), so even the complex per-repository query above usually costs only a few points. The practical bottleneck for 1,000–2,000 repositories is the request volume itself plus GitHub's secondary rate limits, so the collection schedule still needs a budget.

Solutions:

  1. Multiple PAT tokens with rotation. A pool of 10 tokens gives 50,000 points/hour combined and adds headroom against secondary rate limits.

  2. Incremental collection: don't re-fetch the full history every time. Store last_collected_at and request only new commits via the since parameter.

  3. Prioritization: "hot" repositories (high activity, large project TVL) are collected more often; "cold" ones once a week.

A token pool with rotation might look like:

import asyncio
import httpx

class GitHubRateLimiter:
    def __init__(self, tokens: list[str]):
        self.tokens = list(tokens)
        self.remaining = {t: 5000 for t in tokens}

    async def get_token(self) -> str:
        # Pick the token with the most remaining points
        token = max(self.tokens, key=lambda t: self.remaining[t])
        if self.remaining[token] < 100:
            await asyncio.sleep(3600)  # wait for the hourly window to reset
            self.remaining = {t: 5000 for t in self.tokens}
        return token

    async def graphql(self, query: str, variables: dict) -> dict:
        token = await self.get_token()
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(
                "https://api.github.com/graphql",
                json={"query": query, "variables": variables},
                headers={"Authorization": f"Bearer {token}"},
            )
            resp.raise_for_status()
        # GraphQL rate-limit state is reported in the X-RateLimit-* headers
        remaining = resp.headers.get("x-ratelimit-remaining")
        if remaining is not None:
            self.remaining[token] = int(remaining)
        return resp.json()

Project to Repository Mapping

Compiling and maintaining the project → github_repos list is a separate task. Sources:

  • The Electric Capital Developer Report publishes an open-source mapping of projects to GitHub repositories
  • DeFiLlama exposes a github field in protocol data: GET https://api.llama.fi/protocols returns the protocol list with GitHub URLs
  • Manual curation: for new projects or projects with non-standard GitHub org names
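Assuming the DeFiLlama payload shape (a list of protocol objects, some carrying a github field as either a string or a list), extracting the mapping is a small transform; the sample records below are illustrative:

```python
def github_mapping(protocols: list[dict]) -> dict[str, list[str]]:
    """Map protocol name -> GitHub orgs/repos from a DeFiLlama-style payload."""
    mapping: dict[str, list[str]] = {}
    for p in protocols:
        gh = p.get("github")
        if not gh:
            continue  # many protocols simply have no repo listed
        mapping[p["name"]] = gh if isinstance(gh, list) else [gh]
    return mapping

sample = [
    {"name": "Uniswap", "github": ["Uniswap"]},
    {"name": "NoRepoProtocol"},
]
# -> {"Uniswap": ["Uniswap"]}
```

The same function can be pointed at the live GET https://api.llama.fi/protocols response; field names should be re-verified against the current API before relying on it.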

An important nuance: large projects (Ethereum, Solana, Uniswap) have dozens of repositories in their organization. When summing activity across a whole org, filter out mirrors, forks, and documentation repositories, since they distort the metrics.

Storage and Aggregation

TimescaleDB or ClickHouse works well for the time-series metrics. Schema:

CREATE TABLE github_metrics (
    project_id    INTEGER,
    repo_full_name TEXT,
    measured_date  DATE,
    commits_30d    INTEGER,
    contributors_30d INTEGER,
    additions_30d  BIGINT,
    deletions_30d  BIGINT,
    stars          INTEGER,
    open_issues    INTEGER,
    PRIMARY KEY (repo_full_name, measured_date)
);

-- Aggregate by project (all its repositories)
CREATE VIEW project_dev_activity AS
SELECT project_id, measured_date,
       SUM(commits_30d)      AS total_commits,
       COUNT(DISTINCT repo_full_name) AS active_repos,
       SUM(contributors_30d) AS total_contributors
FROM github_metrics
GROUP BY project_id, measured_date;

A normalized developer activity score: log(commits + 1) × log(contributors + 1). The logarithm smooths outliers (one project with 10k commits shouldn't dominate the ranking).
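The formula above is a two-line function; the comparison in the comment shows why the damping matters:

```python
import math

def dev_activity_score(commits: int, contributors: int) -> float:
    """Log-damped activity score: outliers compress instead of dominating."""
    return math.log(commits + 1) * math.log(contributors + 1)

# A project with 10,000 commits from 5 people scores lower than one with
# 300 commits from 30 people: breadth of contributors outweighs raw volume.
narrow = dev_activity_score(10_000, 5)
broad = dev_activity_score(300, 30)
```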