Developing an AI e-Discovery Legal System
e-Discovery (electronic disclosure) is the process of identifying, collecting, and analyzing electronic documents for litigation or investigations. An AI system processes terabytes of data and surfaces the relevant documents.
e-Discovery Stages (EDRM Framework)
Identification: determine data sources (email servers, file systems, messengers, cloud storage).
Preservation: legal hold, i.e. retaining data unchanged once notice of litigation is received.
Collection: data gathering from sources with chain of custody compliance.
Processing: conversion to single format, deduplication, filtering by date/custodian.
Review: AI-assisted review — document prioritization by relevance.
Production: document delivery to opposing party in required format.
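The chain-of-custody requirement in the Collection stage can be sketched as a tamper-evident event log. This is an illustrative structure, not a standard schema; `CustodyEvent` and the helper names are assumptions:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CustodyEvent:
    """One handling event for a piece of collected evidence."""
    document_sha256: str   # hash of the collected file, fixed at collection time
    custodian: str         # who holds the evidence after this event
    action: str            # e.g. "collected", "transferred", "processed"
    timestamp: str
    prev_event_hash: str   # links events into a tamper-evident chain

    def event_hash(self) -> str:
        payload = "|".join([self.document_sha256, self.custodian,
                            self.action, self.timestamp, self.prev_event_hash])
        return hashlib.sha256(payload.encode()).hexdigest()

def append_event(chain: list[CustodyEvent], document_sha256: str,
                 custodian: str, action: str) -> CustodyEvent:
    prev = chain[-1].event_hash() if chain else "GENESIS"
    event = CustodyEvent(document_sha256, custodian, action,
                         datetime.now(timezone.utc).isoformat(), prev)
    chain.append(event)
    return event

def verify_chain(chain: list[CustodyEvent]) -> bool:
    """Recompute the hash links; editing any past event breaks the chain."""
    prev = "GENESIS"
    for event in chain:
        if event.prev_event_hash != prev:
            return False
        prev = event.event_hash()
    return True
```

Hash-chaining the events means any after-the-fact edit to an earlier record invalidates every later link, which is the property chain-of-custody documentation needs.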
Technology-Assisted Review (TAR)
TAR (predictive coding) is the central AI task in e-Discovery: the system trains on a small seed set labeled by attorneys and predicts relevance for the remaining corpus:
from datetime import date
from pydantic import BaseModel

class DocumentRelevance(BaseModel):
    document_id: str
    relevance_score: float     # 0-1
    is_privileged: bool        # attorney-client privilege
    is_responsive: bool        # answers the disclosure request
    key_topics: list[str]
    custodians: list[str]      # participants in the correspondence
    date: date | None

def predict_relevance(
    document: str,
    seed_set: list[tuple[str, bool]],  # (doc, is_relevant) pairs for training
) -> DocumentRelevance:
    # Active learning: select the most informative documents
    # for the next round of attorney annotation.
    ...
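A minimal sketch of that active-learning loop, assuming scikit-learn with TF-IDF features and uncertainty sampling (real TAR platforms use richer models and review protocols); `rank_for_review` is a hypothetical helper name:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_for_review(
    seed_set: list[tuple[str, bool]],   # attorney-labeled (doc, is_relevant)
    unreviewed: list[str],              # remaining corpus
    batch_size: int = 2,
) -> list[int]:
    """Return indices of unreviewed docs whose predicted relevance is most
    uncertain (probability closest to 0.5): the next annotation batch."""
    texts = [doc for doc, _ in seed_set]
    labels = [int(lbl) for _, lbl in seed_set]
    vec = TfidfVectorizer().fit(texts + unreviewed)
    model = LogisticRegression().fit(vec.transform(texts), labels)
    probs = model.predict_proba(vec.transform(unreviewed))[:, 1]
    uncertainty = np.abs(probs - 0.5)   # 0 = maximally uncertain
    return [int(i) for i in np.argsort(uncertainty)[:batch_size]]
```

Each round, attorneys label the returned batch, the labels join the seed set, and the model is retrained until the recall target is met.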
Privileged Document Detection
Attorney-client privileged documents are exempt from disclosure. The AI flags:
- Communications with external counsel (by email domain)
- Requests for legal advice
- Documents marked Confidential/Privileged
- Attorney work product
A false negative is critical here: producing a privileged document to the opposing party is a serious violation and can waive the privilege.
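A rule-based first pass along these lines can feed the privilege review queue. The domain list and markers below are made-up placeholders, and the rules are deliberately over-inclusive, since the costly error is the false negative:

```python
import re

# Placeholder rules; real matters use counsel-specific domain lists.
COUNSEL_DOMAINS = {"lawfirm-example.com"}
PRIVILEGE_MARKERS = re.compile(
    r"\b(privileged|attorney[- ]client|work product|legal advice)\b", re.I)

def flag_potentially_privileged(sender: str, recipients: list[str],
                                body: str) -> bool:
    """Over-inclusive screen: anything flagged goes to attorney review."""
    participants = [sender] + recipients
    # External counsel on the thread is a strong privilege signal.
    if any(p.rsplit("@", 1)[-1].lower() in COUNSEL_DOMAINS for p in participants):
        return True
    # Otherwise fall back to textual privilege markers.
    return bool(PRIVILEGE_MARKERS.search(body))
```

Flagged documents are then routed to manual privilege review rather than excluded automatically.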
Data and Formats
Typical sources: Outlook/Exchange (PST), Gmail (mbox), Slack/Teams (JSON API), SharePoint (CSOM), file servers. Conversion to a single format: Relativity RSMF or a custom pipeline via Apache Tika.
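The deduplication step in Processing usually starts with exact content hashing (near-duplicate detection and email threading are layered on top). A minimal sketch:

```python
import hashlib

def dedupe(docs: list[tuple[str, bytes]]) -> list[tuple[str, bytes]]:
    """Exact deduplication by content hash: keep the first copy of each
    distinct byte stream (e.g. an attachment copied into many mailboxes)."""
    seen: set[str] = set()
    unique: list[tuple[str, bytes]] = []
    for doc_id, content in docs:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((doc_id, content))
    return unique
```

Dropping exact duplicates early is cheap and routinely shrinks enterprise collections substantially before the expensive review stage.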
Scale: enterprise e-Discovery routinely involves millions of documents. An approximate nearest-neighbor (ANN) index such as FAISS searches millions of vectors in under 100 ms.
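For illustration, a brute-force cosine-similarity search using the same normalize-then-inner-product convention as FAISS's `IndexFlatIP`; at corpus scale the matrix multiply below is replaced by a FAISS ANN index (e.g. IVF or HNSW variants):

```python
import numpy as np

def build_index(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index: np.ndarray, query: np.ndarray, k: int = 5) -> list[int]:
    """Return indices of the k most similar documents to the query vector."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q                  # cosine similarity against all docs
    return [int(i) for i in np.argsort(-scores)[:k]]
```

The brute-force version is exact but O(n) per query; the ANN index trades a small recall loss for sublinear search time.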