GNN analysis of social graphs
The social graph is one of the most natural applications of GNNs. Classic tasks include bot and spam detection, link prediction, community detection, and influence analysis. Structural information (who is connected to whom) is often more important than content information for these tasks.
Community Detection and Centrality Analysis
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F

import networkx as nx
# BUGFIX: the original also imported GAEConv, which does not exist in
# torch_geometric.nn (the graph-autoencoder wrapper is called GAE) and was
# never used below — the bogus name raised ImportError at module load.
from torch_geometric.nn import GCNConv
from torch_geometric.utils import to_networkx, negative_sampling
from community import community_louvain  # python-louvain
class SocialGraphAnalyzer:
"""Анализ структуры социального графа"""
def build_graph_from_edges(self, edges: pd.DataFrame,
node_features: pd.DataFrame = None) -> tuple:
"""
edges: source_id, target_id, weight (optional)
node_features: node_id, feature_1, ..., feature_n
"""
# Маппинг строковых ID в числовые индексы
all_nodes = pd.unique(edges[['source_id', 'target_id']].values.ravel())
node_idx = {nid: i for i, nid in enumerate(all_nodes)}
n_nodes = len(node_idx)
src = edges['source_id'].map(node_idx).values
dst = edges['target_id'].map(node_idx).values
# Ненаправленный граф: добавляем обратные рёбра
edge_index = torch.tensor([
np.concatenate([src, dst]),
np.concatenate([dst, src])
], dtype=torch.long)
# Признаки узлов
if node_features is not None:
feat_matrix = node_features.set_index('node_id').reindex(all_nodes).fillna(0).values
x = torch.tensor(feat_matrix, dtype=torch.float)
else:
# Degree как базовый признак
degrees = np.bincount(src, minlength=n_nodes) + np.bincount(dst, minlength=n_nodes)
x = torch.tensor(degrees.reshape(-1, 1), dtype=torch.float)
return edge_index, x, node_idx
def detect_communities_louvain(self, edge_index: torch.Tensor,
n_nodes: int) -> dict:
"""
Алгоритм Лувена для обнаружения сообществ.
Оптимизирует modularity — меру качества разбиения.
"""
# Конвертируем в NetworkX
G = nx.Graph()
G.add_nodes_from(range(n_nodes))
edges = edge_index.T.numpy()
G.add_edges_from(edges)
# Алгоритм Лувена
partition = community_louvain.best_partition(G)
# Modularity quality
modularity = community_louvain.modularity(partition, G)
community_sizes = pd.Series(partition).value_counts().sort_values(ascending=False)
return {
'node_to_community': partition,
'n_communities': len(set(partition.values())),
'modularity': round(modularity, 4),
'largest_community_size': int(community_sizes.iloc[0]),
'community_size_distribution': community_sizes.head(10).to_dict()
}
def compute_node_centrality(self, G: nx.Graph,
top_k: int = 20) -> pd.DataFrame:
"""Метрики центральности узлов"""
# Degree centrality
degree_centrality = nx.degree_centrality(G)
# Betweenness (для небольших графов; для больших — approximation)
if G.number_of_nodes() < 5000:
betweenness = nx.betweenness_centrality(G, normalized=True)
else:
betweenness = nx.betweenness_centrality(G, k=500, normalized=True) # Аппроксимация
# PageRank
pagerank = nx.pagerank(G, alpha=0.85, max_iter=100)
df = pd.DataFrame({
'degree_centrality': degree_centrality,
'betweenness': betweenness,
'pagerank': pagerank,
})
# Нормализованный composite score
df_norm = (df - df.min()) / (df.max() - df.min() + 1e-9)
df['influence_score'] = (
df_norm['degree_centrality'] * 0.30 +
df_norm['betweenness'] * 0.35 +
df_norm['pagerank'] * 0.35
)
return df.nlargest(top_k, 'influence_score')
class BotDetectorGNN(nn.Module):
    """Graph attention network for detecting bots in social networks.

    GAT is preferred over GCN for this task: bot accounts tend to connect in
    anomalous patterns, and attention weights can surface those neighbourhoods.
    """

    def __init__(self, node_features: int, hidden_dim: int = 64):
        super().__init__()
        from torch_geometric.nn import GATConv

        # Three attention layers. The first uses 4 heads whose outputs are
        # concatenated, hence the hidden_dim * 4 input of the second layer.
        self.conv1 = GATConv(node_features, hidden_dim, heads=4, dropout=0.3)
        self.conv2 = GATConv(hidden_dim * 4, hidden_dim, heads=1, dropout=0.3)
        self.conv3 = GATConv(hidden_dim, 32, heads=1, dropout=0.3)

        # Small MLP head producing two logits: human vs bot.
        self.classifier = nn.Sequential(
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(16, 2)
        )

    def forward(self, x, edge_index):
        """Return per-node class logits of shape (n_nodes, 2)."""
        hidden = F.elu(self.conv1(x, edge_index))
        hidden = F.elu(self.conv2(hidden, edge_index))
        hidden = self.conv3(hidden, edge_index)
        return self.classifier(hidden)

    def get_bot_probability(self, x: torch.Tensor,
                            edge_index: torch.Tensor) -> np.ndarray:
        """Return each node's probability of belonging to the 'bot' class."""
        self.eval()  # disable dropout for deterministic inference
        with torch.no_grad():
            class_probs = torch.softmax(self.forward(x, edge_index), dim=-1)
            bot_probs = class_probs[:, 1]
        return bot_probs.cpu().numpy()
class LinkPredictor(nn.Module):
    """
    Link prediction: predict the appearance of new edges.
    Applications: "People you may know", partner recommendations, fraud rings.
    """

    def __init__(self, node_features: int, hidden_dim: int = 64):
        super().__init__()
        # Two-layer GCN encoder; node embeddings end up with hidden_dim // 2 dims.
        self.encoder = nn.ModuleList([
            GCNConv(node_features, hidden_dim),
            GCNConv(hidden_dim, hidden_dim // 2),
        ])
        # Decoder: score an edge from the element-wise product of the two node
        # embeddings. BUGFIX: the first Linear's input size must match the
        # encoder output (hidden_dim // 2); the original used hidden_dim and
        # crashed with a shape mismatch on any real input.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim // 2, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def encode(self, x, edge_index):
        """Run the GCN encoder; returns embeddings z of shape (n_nodes, hidden_dim // 2)."""
        for conv in self.encoder:
            x = F.relu(conv(x, edge_index))
        return x

    def decode(self, z, edge_index):
        """Score node pairs from the element-wise product of their embeddings.

        Returns a 1-D tensor of probabilities, one per pair.
        BUGFIX: squeeze only the last (feature) dimension — a bare .squeeze()
        collapsed a single-pair batch to a 0-d tensor, breaking iteration in
        predict_new_links.
        """
        src_emb = z[edge_index[0]]
        dst_emb = z[edge_index[1]]
        return self.decoder(src_emb * dst_emb).squeeze(-1)

    def forward(self, x, edge_index, pos_edge, neg_edge=None):
        """Return scores for the positive (and, if given, negative) edge sets."""
        z = self.encode(x, edge_index)
        pos_scores = self.decode(z, pos_edge)
        if neg_edge is not None:
            neg_scores = self.decode(z, neg_edge)
            return pos_scores, neg_scores
        return pos_scores

    def predict_new_links(self, z: torch.Tensor,
                          candidate_pairs: torch.Tensor,
                          threshold: float = 0.7) -> list:
        """Predict new links among candidate pairs.

        Args:
            z: node embeddings produced by encode().
            candidate_pairs: (2, n_pairs) long tensor of node index pairs.
            threshold: minimum predicted probability to report a link.

        Returns:
            List of {'node_a', 'node_b', 'probability'} dicts sorted by
            descending probability.
        """
        with torch.no_grad():
            scores = self.decode(z, candidate_pairs)
        predicted = [
            {
                'node_a': int(candidate_pairs[0, i]),
                'node_b': int(candidate_pairs[1, i]),
                'probability': round(float(score), 3)
            }
            for i, score in enumerate(scores)
            if float(score) >= threshold
        ]
        return sorted(predicted, key=lambda item: -item['probability'])
Detection of fraudulent rings
class FraudRingDetector:
    """Detect organised fraud by analysing bot-heavy, densely linked subgraphs."""

    def __init__(self, gnn_model: BotDetectorGNN):
        self.model = gnn_model

    def find_suspicious_clusters(self, graph_data,
                                 bot_probs: np.ndarray,
                                 min_cluster_bot_ratio: float = 0.6,
                                 min_cluster_size: int = 5) -> list[dict]:
        """
        Look for densely connected subgraphs with a high share of bots.
        A fraud-ring signature: a tightly interlinked group of accounts.
        Candidate clusters are the connected components of the graph; each is
        kept only if it is large enough and its mean bot probability exceeds
        the threshold. Results are sorted by descending risk_score.
        """
        G = to_networkx(graph_data, to_undirected=True)
        # Attach the per-node bot probabilities as node attributes.
        for node in G.nodes():
            G.nodes[node]['bot_prob'] = float(bot_probs[node])

        clusters = []
        for component in nx.connected_components(G):
            if len(component) < min_cluster_size:
                continue  # too small to be an organised ring
            members = list(component)
            bot_share = np.mean([G.nodes[n]['bot_prob'] for n in members])
            if bot_share < min_cluster_bot_ratio:
                continue  # predominantly human — not suspicious

            # Density metrics of the candidate cluster.
            sub = G.subgraph(component)
            density = nx.density(sub)
            clustering = nx.average_clustering(sub)
            clusters.append({
                'cluster_id': len(clusters),
                'nodes': members,
                'size': len(members),
                'bot_probability': round(float(bot_share), 3),
                'density': round(density, 3),
                'avg_clustering': round(clustering, 3),
                'risk_score': round(bot_share * density * clustering, 3)
            })
        return sorted(clusters, key=lambda c: -c['risk_score'])
GNN for Twitter/Telegram bot detection: AUC 0.90-0.94 (TwiBot-22 dataset). Link prediction: Hits@50 around 0.65-0.75 on OGB-Collab. The key advantage over feature-based methods is that the GNN captures collusive patterns (coordinated inauthentic behavior) through graph structure, which bots cannot easily disguise.







