Implementation of Contextual Compression for RAG

Contextual Compression is a post-processing technique for retrieved documents where only the part of each chunk relevant to the specific query is extracted. This reduces "noise" in LLM context, cuts token count, and improves answer faithfulness.

Problem Without Contextual Compression

Standard RAG passes complete chunks (512–1024 tokens) to the LLM. In a typical case, a 600-token chunk contains only about 80 tokens that actually answer the question; the rest is irrelevant context (see the rough estimate after the list below). This:

  • Increases cost (more input tokens)
  • Lowers accuracy (LLM "gets lost" in irrelevant text)
  • Reduces effective context window (less room for truly important chunks)
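A rough estimate makes the overhead concrete. The numbers below are the illustrative ones from above (~600-token chunks, ~80 relevant tokens each) combined with k = 8 retrieved chunks, as configured in the retriever further down:

k = 8                 # retrieved chunks per query
chunk_tokens = 600    # average chunk size
relevant_tokens = 80  # tokens per chunk that actually answer the question

full_context = k * chunk_tokens        # 4800 tokens sent to the LLM
useful_context = k * relevant_tokens   # 640 tokens that actually matter
print(f"Sent {full_context} tokens, only {useful_context} useful "
      f"({useful_context / full_context:.0%} of the retrieved context)")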

LLM-based Contextual Compression

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# LLM-based compressor (vectorstore is assumed to be an existing vector store built over your documents)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)

compressed_docs = compression_retriever.invoke(
    "What is the procedure for contract approval?"
)

# Each document contains only the relevant fragment
for doc in compressed_docs:
    print(len(doc.page_content), "chars (vs original ~2000)")
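To get answers end to end, the compression retriever can be dropped into an ordinary RAG chain. A minimal LCEL sketch, reusing the llm and compression_retriever defined above (the prompt wording and the format_docs helper are illustrative):

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt = ChatPromptTemplate.from_template(
    "Answer strictly from the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the compressed fragments into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is the procedure for contract approval?")

Keep in mind that LLMChainExtractor makes one extra LLM call per retrieved document, which is the source of the added latency shown in the case study below.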

Embedding-based Compressor (EmbeddingsFilter)

A faster and cheaper variant filters documents by cosine similarity between the query embedding and each chunk embedding:

from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76,  # Filter documents below threshold
)

filtering_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)
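A quick way to sanity-check the threshold is to compare how many chunks survive filtering for a few representative queries (the query string is illustrative):

query = "What is the procedure for contract approval?"

base_docs = vectorstore.as_retriever(search_kwargs={"k": 8}).invoke(query)
filtered_docs = filtering_retriever.invoke(query)

print(f"Retrieved {len(base_docs)} chunks, kept {len(filtered_docs)} above the 0.76 threshold")

Unlike LLMChainExtractor, EmbeddingsFilter only drops whole documents below the threshold; it does not shorten the ones it keeps, so savings come from discarding chunks rather than trimming them.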

Pipeline: Compression + Reranking

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Pipeline: EmbeddingsFilter → EmbeddingsRedundantFilter → CrossEncoderReranker
cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)

compressor_pipeline = DocumentCompressorPipeline(
    transformers=[
        EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.75),
        EmbeddingsRedundantFilter(embeddings=embeddings),  # Remove duplicates
        reranker,  # Rank remaining
    ]
)

pipeline_retriever = ContextualCompressionRetriever(
    base_compressor=compressor_pipeline,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)
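Because DocumentCompressorPipeline accepts any document transformer, a text splitter can be placed first so that deduplication and filtering operate on smaller fragments rather than whole chunks. A sketch of this variant (the splitter parameters are illustrative):

from langchain_text_splitters import CharacterTextSplitter

# Break retrieved chunks into ~300-character fragments before filtering
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0, separator=". ")

fine_grained_pipeline = DocumentCompressorPipeline(
    transformers=[
        splitter,
        EmbeddingsRedundantFilter(embeddings=embeddings),
        EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.75),
        reranker,
    ]
)

fine_grained_retriever = ContextualCompressionRetriever(
    base_compressor=fine_grained_pipeline,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)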

Practical Case: Context Compression for Technical Documentation

Task: an assistant for technical manuals (chunks of ~800 tokens). After compression, the average context was reduced from 4800 to 1200 tokens per query.

Metric               | Without Compression | With Compression (LLM)
Input tokens/query   | 5200                | 1450
Faithfulness (RAGAS) | 0.79                | 0.94
Answer Relevancy     | 0.81                | 0.89
Cost (GPT-4o-mini)   | 1.0×                | 0.3×
Latency              | 1.8 s               | 2.4 s (includes compression LLM calls)

Compression reduced cost by 3.3× while improving faithfulness by 19%.
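The faithfulness and relevancy figures can be reproduced with the ragas library. A minimal sketch, assuming the ragas 0.1.x dataset schema with question/answer/contexts columns (newer releases rename these) and a placeholder generated answer:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One row per evaluated query: the question, the generated answer (placeholder here),
# and the compressed context fragments that were passed to the LLM
eval_data = {
    "question": ["What is the procedure for contract approval?"],
    "answer": ["<generated answer goes here>"],
    "contexts": [[doc.page_content for doc in compressed_docs]],
}

result = evaluate(Dataset.from_dict(eval_data), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores: faithfulness, answer_relevancy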

Timeline

  • Implementing Contextual Compression: 2–3 days
  • Tuning threshold/compressor: 2–3 days
  • Total: 1 week