Implementing AI-Powered Document Key Thesis Extraction in Mobile Applications
Thesis extraction is not summarization. The goal is not to retell the document in fewer words but to pull out the specific claims the author makes: hypotheses and conclusions in a research paper, key obligations in a contract, recommendations and metrics in a report.
This requires an understanding of the document's structure and a different prompt than summarization uses.
Document loading on mobile client
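On mobile, documents typically come from three sources: PDFs picked with UIDocumentPickerViewController, photos picked with PHPickerViewController, and text pasted from the clipboard or fetched from a URL.
A PDF needs its text extracted before it can be sent to an LLM. On iOS, PDFKit handles this: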
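import PDFKit

// Returns the text of all pages, separated by blank lines.
// Returns an empty string if the file cannot be opened as a PDF.
func extractText(from url: URL) -> String {
    guard let document = PDFDocument(url: url) else { return "" }
    return (0..<document.pageCount).compactMap { index in
        document.page(at: index)?.string
    }.joined(separator: "\n\n")
}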
PDFKit returns no text for scanned PDFs, where each page is just an image. Scans need OCR, either on-device with VNRecognizeTextRequest from the Vision framework or in the cloud with Google Document AI. Treat OCR as a separate pipeline stage.
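One way to keep OCR as a separate stage is to decide per page whether the text PDFKit returned is usable. The sketch below is only an illustration: `PageRoute`, `route(pageText:minCharacters:)`, and the 20-character threshold are assumptions introduced here, not part of PDFKit or Vision.

```swift
import Foundation

/// How a page should be handled further down the pipeline (hypothetical type).
enum PageRoute: Equatable {
    case nativeText(String)   // PDFKit returned usable text
    case needsOCR             // page looks like a scan; hand it to the OCR stage
}

/// Routes a page based on the text PDFKit extracted for it.
/// `minCharacters` is an assumed threshold, not a measured value.
func route(pageText: String?, minCharacters: Int = 20) -> PageRoute {
    let trimmed = (pageText ?? "").trimmingCharacters(in: .whitespacesAndNewlines)
    return trimmed.count >= minCharacters ? .nativeText(trimmed) : .needsOCR
}

let routes = [
    route(pageText: "This Agreement is made between the parties listed below."),
    route(pageText: "  \n"),   // whitespace only: treated as a scan
    route(pageText: nil)       // no text layer at all
]
```

Pages routed to `.needsOCR` would then be rendered to images and passed to whichever OCR engine the app uses.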
On Android, use PdfRenderer to render pages into Bitmaps and run ML Kit Text Recognition on them for OCR, or use a library such as itextpdf or pdfbox-android to extract text directly from digital PDFs.
Prompt for thesis extraction
The prompt is the most critical part. An instruction like "Highlight key ideas" produces a summary. Thesis extraction needs an explicit definition of a thesis and structured output:
You are an expert document analyst. Extract the key theses from the document.
A thesis is a specific, arguable claim the author makes — not a topic or summary.
Return JSON:
{
  "theses": [
    {
      "text": "exact or closely paraphrased thesis statement",
      "location": "section or paragraph reference",
      "type": "hypothesis|conclusion|recommendation|fact|argument|obligation|condition",
      "confidence": 0.0-1.0
    }
  ],
  "document_type": "research|contract|report|article|other"
}
Limit: 5-10 most important theses only.
The type field matters. For contracts, only obligation and condition are relevant; for papers, hypothesis and conclusion. Filtering by type on the client shows only what is relevant for your use case.
struct Thesis: Codable {
    let text: String
    let location: String
    let type: ThesisType
    let confidence: Float
}

// Raw values must match the "type" strings the model is asked to return.
enum ThesisType: String, Codable {
    case hypothesis, conclusion, recommendation, fact, argument, obligation, condition
}
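A minimal sketch of decoding the model's JSON and filtering by type on the client. The types are repeated so the example compiles on its own. `ThesisResponse`, `relevantTheses(in:types:minConfidence:)`, the 0.5 confidence cutoff, and the sample JSON are assumptions made for illustration, and the `condition` case is included here only because the text names it for contracts.

```swift
import Foundation

enum ThesisType: String, Codable {
    case hypothesis, conclusion, recommendation, fact, argument, obligation, condition
}

struct Thesis: Codable {
    let text: String
    let location: String
    let type: ThesisType
    let confidence: Float
}

/// Hypothetical wrapper for the top-level JSON object requested in the prompt.
struct ThesisResponse: Codable {
    let theses: [Thesis]
    let documentType: String

    enum CodingKeys: String, CodingKey {
        case theses
        case documentType = "document_type"
    }
}

/// Decodes the response and keeps only theses of the requested types
/// whose confidence is at or above the (assumed) cutoff.
func relevantTheses(in json: Data,
                    types: Set<ThesisType>,
                    minConfidence: Float = 0.5) throws -> [Thesis] {
    let response = try JSONDecoder().decode(ThesisResponse.self, from: json)
    return response.theses.filter { types.contains($0.type) && $0.confidence >= minConfidence }
}

// Invented sample response, for illustration only.
let sampleJSON = Data("""
{
  "theses": [
    {"text": "The supplier must deliver within 30 days.", "location": "Section 4.1", "type": "obligation", "confidence": 0.9},
    {"text": "The contract covers hardware maintenance.", "location": "Section 1", "type": "fact", "confidence": 0.8},
    {"text": "Payment is due only after acceptance.", "location": "Section 5.2", "type": "condition", "confidence": 0.3}
  ],
  "document_type": "contract"
}
""".utf8)

let contractTheses = try relevantTheses(in: sampleJSON, types: [.obligation, .condition])
```

Decoding throws if the model returns a type string outside the enum, so a production client would likely want a fallback case or a retry.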
Display: annotations in document
A thesis is more valuable when it is anchored to a specific location in the document. On iOS, a PDFAnnotation can highlight the corresponding fragment:
// findPage(for:in:) and findBounds(for:on:) are app-specific helpers, not PDFKit APIs.
func highlightThesis(_ thesis: Thesis, in document: PDFDocument) {
    guard let page = findPage(for: thesis.location, in: document) else { return }
    let annotation = PDFAnnotation(
        bounds: findBounds(for: thesis.text, on: page),
        forType: .highlight,
        withProperties: nil
    )
    annotation.color = colorForType(thesis.type)
    annotation.contents = thesis.text
    page.addAnnotation(annotation)
}

// One highlight color per thesis type; yellow is the fallback.
func colorForType(_ type: ThesisType) -> UIColor {
    switch type {
    case .conclusion: return .systemGreen.withAlphaComponent(0.4)
    case .hypothesis: return .systemBlue.withAlphaComponent(0.4)
    case .recommendation: return .systemOrange.withAlphaComponent(0.4)
    default: return .systemYellow.withAlphaComponent(0.4)
    }
}
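To find the bounds of text on a PDF page, search with PDFDocument's findString(_:withOptions:) and take the bounds of the returned PDFSelection on its page. This works for digital PDFs when the thesis text is quoted verbatim; a paraphrased thesis will not match. For scans, the coordinates must come from the OCR results.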
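A hedged sketch of that lookup with PDFKit. `highlightTargets(for:in:)` is a name introduced here, not one of the article's helpers. It uses only the first match and returns one rectangle per line, because a thesis often wraps across lines.

```swift
import PDFKit

/// Returns one rectangle per line of the first match, paired with its page.
/// Returns an empty array when the text is not found, for example when the
/// model paraphrased it or the PDF is a scan with no text layer.
func highlightTargets(for text: String,
                      in document: PDFDocument) -> [(page: PDFPage, bounds: CGRect)] {
    // Case-insensitive search across the whole document.
    guard let match = document.findString(text, withOptions: [.caseInsensitive]).first else {
        return []
    }
    // Split the match into per-line selections so each highlight hugs its line.
    return match.selectionsByLine().compactMap { line -> (page: PDFPage, bounds: CGRect)? in
        guard let page = line.pages.first else { return nil }
        return (page: page, bounds: line.bounds(for: page))
    }
}
```

Each returned pair can be passed to `PDFAnnotation(bounds:forType:withProperties:)` and added to its page, as in the highlighting function above.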
Working with large documents
A 50-page contract is roughly 60k tokens. That fits in the gpt-4o context window, but it is expensive and slow. A smarter approach: first extract the document structure (headings, sections), then process each section separately and aggregate the theses.
func extractThesesFromLargeDocument(_ text: String) async throws -> [Thesis] {
    let sections = splitBySections(text) // split by heading patterns
    var allTheses = [Thesis]()

    for section in sections {
        guard section.content.count > 200 else { continue } // skip TOC and empty sections
        let theses = try await extractTheses(from: section.content, sectionTitle: section.title)
        allTheses.append(contentsOf: theses)
    }

    // Deduplicate similar theses via embeddings similarity
    return deduplicate(allTheses)
}
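The article does not show `splitBySections`, so the following is one possible sketch. The heading patterns (numbered headings such as "1. Scope" or "2.3 Payment", and short all-caps lines) and the `Section` type are assumptions chosen for illustration. Real documents will need patterns tuned to their layout.

```swift
import Foundation

/// Hypothetical section model matching the fields used in the loop above.
struct Section: Equatable {
    let title: String
    let content: String
}

/// Assumed heading heuristics: "1. Title", "2.3 Title", or a short ALL-CAPS line.
func isHeading(_ line: String) -> Bool {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    guard !trimmed.isEmpty, trimmed.count <= 80 else { return false }
    // Section number followed by a word that starts with an uppercase letter.
    if trimmed.range(of: #"^\d+(\.\d+)*\.?\s+\p{Lu}"#, options: .regularExpression) != nil {
        return true
    }
    let letters = trimmed.filter { $0.isLetter }
    return letters.count >= 3 && letters.allSatisfy { $0.isUppercase }
}

func splitBySections(_ text: String) -> [Section] {
    var sections = [Section]()
    var title = "Preamble"   // label for text before the first heading
    var lines = [String]()

    // Closes the section collected so far and resets the line buffer.
    func flush() {
        let content = lines.joined(separator: "\n")
            .trimmingCharacters(in: .whitespacesAndNewlines)
        // Drop an empty preamble; keep empty titled sections so callers can filter them.
        if !(title == "Preamble" && content.isEmpty) {
            sections.append(Section(title: title, content: content))
        }
        lines.removeAll()
    }

    for line in text.components(separatedBy: .newlines) {
        if isHeading(line) {
            flush()
            title = line.trimmingCharacters(in: .whitespaces)
        } else {
            lines.append(line)
        }
    }
    flush()
    return sections
}

let sampleSections = splitBySections("""
1. Scope
The supplier delivers hardware.
2. Payment
Payment is due within 30 days.
""")
```

Sections that come out very short, such as table-of-contents entries, are then skipped by the length check in the loop above.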
Deduplication is critical because different sections often repeat the same idea. The simple option is Jaccard similarity over the thesis text; cosine similarity of embeddings is more accurate.
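Both similarity measures can be sketched in a few lines. The 0.7 threshold, the word-level tokenization, and the function names are assumptions for illustration. The embedding vectors for `cosine` would come from whichever embedding model the app uses, which is not shown here.

```swift
import Foundation

/// Lowercased word set used for the Jaccard comparison.
func wordSet(_ text: String) -> Set<String> {
    Set(text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty })
}

/// Jaccard similarity: |A ∩ B| / |A ∪ B|, in 0...1.
func jaccard(_ a: Set<String>, _ b: Set<String>) -> Double {
    let unionCount = a.union(b).count
    return unionCount == 0 ? 0 : Double(a.intersection(b).count) / Double(unionCount)
}

/// Cosine similarity between two embedding vectors of equal dimension.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "vectors must have the same dimension")
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = sqrt(a.map { $0 * $0 }.reduce(0, +))
    let normB = sqrt(b.map { $0 * $0 }.reduce(0, +))
    return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB)
}

/// Keeps the first occurrence of each group of near-duplicate texts.
func deduplicateTexts(_ texts: [String], threshold: Double = 0.7) -> [String] {
    var kept: [(text: String, words: Set<String>)] = []
    for text in texts {
        let words = wordSet(text)
        let isDuplicate = kept.contains { jaccard($0.words, words) >= threshold }
        if !isDuplicate {
            kept.append((text: text, words: words))
        }
    }
    return kept.map { $0.text }
}

let unique = deduplicateTexts([
    "The supplier must deliver goods within 30 days.",
    "The supplier must deliver the goods within 30 days.",
    "Payment is due after acceptance."
])
```

Jaccard only catches near-verbatim repeats. Two theses that state the same idea in different words need the embedding comparison, with the same keep-first loop and `cosine` in place of `jaccard`.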
Timeline estimates
Basic thesis extraction from a text document takes 3–5 days. The full pipeline, with PDF parsing, OCR for scans, in-document annotations, and large-file handling, takes 2–3 weeks.







