# Implementation of Contextual Compression for RAG
Contextual Compression is a post-processing technique for retrieved documents: from each chunk, only the part relevant to the specific query is extracted. This reduces "noise" in the LLM context, cuts token count, and improves answer faithfulness.
## Problem Without Contextual Compression
Standard RAG passes complete chunks (512–1024 tokens) to the LLM. A typical situation: a chunk has 600 tokens, of which only 80 actually answer the question; the rest is irrelevant context (a back-of-the-envelope estimate of the waste follows the list below). This:
- Increases cost (more input tokens)
- Lowers accuracy (LLM "gets lost" in irrelevant text)
- Reduces effective context window (less room for truly important chunks)
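A quick calculation shows how little of the context does useful work. The figures below are the illustrative ones from above, not measurements:

```python
# Share of useful context per query (illustrative figures from above)
k = 8                  # retrieved chunks per query
chunk_tokens = 600     # tokens per chunk
relevant_tokens = 80   # tokens per chunk that answer the question

total_tokens = k * chunk_tokens        # 4800 tokens sent as context
useful_tokens = k * relevant_tokens    # 640 tokens that matter
print(f"Useful share: {useful_tokens / total_tokens:.0%}")  # ~13%
```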
## LLM-based Contextual Compression
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# LLM-based compressor: extracts only the query-relevant passages
# from each retrieved document
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

# `vectorstore` is assumed to be an already-initialized vector store
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)

compressed_docs = compression_retriever.invoke(
    "What is the procedure for contract approval?"
)

# Each document now contains only the relevant fragment
for doc in compressed_docs:
    print(len(doc.page_content), "chars (vs original ~2000)")
```
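Note that `LLMChainExtractor` makes one LLM call per retrieved document, so compression cost and latency grow with `k`; this is the extra latency visible in the case study below.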
## Embedding-based Compressor (EmbeddingsFilter)
A faster and cheaper variant filters by cosine similarity. Unlike `LLMChainExtractor`, `EmbeddingsFilter` keeps or drops whole documents rather than extracting passages from them:
```python
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embeddings_filter = EmbeddingsFilter(
    embeddings=embeddings,
    similarity_threshold=0.76,  # Drop documents below this query similarity
)

filtering_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 8}),
)
```
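A quick way to sanity-check the threshold is to run a few representative queries and see how many of the retrieved documents survive. The query below is just a placeholder:

```python
docs = filtering_retriever.invoke("What is the procedure for contract approval?")
print(f"{len(docs)} of 8 documents passed the 0.76 threshold")
```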
## Pipeline: Compression + Reranking
```python
from langchain.retrievers.document_compressors import (
    CrossEncoderReranker,
    DocumentCompressorPipeline,
)
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Pipeline: EmbeddingsFilter → EmbeddingsRedundantFilter → CrossEncoderReranker
cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
reranker = CrossEncoderReranker(model=cross_encoder, top_n=3)

compressor_pipeline = DocumentCompressorPipeline(
    transformers=[
        EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.75),
        EmbeddingsRedundantFilter(embeddings=embeddings),  # Remove near-duplicates
        reranker,  # Keep the top 3 by cross-encoder relevance
    ]
)

pipeline_retriever = ContextualCompressionRetriever(
    base_compressor=compressor_pipeline,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)
```
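For completeness, here is a minimal sketch of wiring the compressed retriever into an answer-generating chain; the prompt wording is an assumption, not part of the original setup:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Prompt text is a placeholder; adapt it to your domain
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Retrieve → compress/rerank → format → generate
rag_chain = (
    {"context": pipeline_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm  # the ChatOpenAI instance defined earlier
    | StrOutputParser()
)
print(rag_chain.invoke("What is the procedure for contract approval?"))
```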
## Practical Case: Context Compression for Technical Documentation
Task: an assistant for technical manuals (chunks of ~800 tokens). After compression, the average context per query shrank from ~4800 to ~1200 tokens.
| Metric | Without Compression | With Compression (LLM) |
|---|---|---|
| Input tokens/query | 5200 | 1450 |
| Faithfulness (RAGAS) | 0.79 | 0.94 |
| Answer Relevancy | 0.81 | 0.89 |
| Cost (GPT-4o-mini) | 1× | 0.3× |
| Latency | 1.8s | 2.4s (added LLM compression step) |
Compression reduced cost by 3.3× while improving faithfulness by 19%.
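The RAGAS figures above can be reproduced along these lines. This is a sketch assuming a small evaluation set of questions, retrieved contexts, and generated answers; the field names follow the classic RAGAS dataset schema, which newer versions rename:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_data = Dataset.from_dict({
    "question": ["What is the procedure for contract approval?"],
    "contexts": [[doc.page_content for doc in compressed_docs]],
    "answer": ["Contracts are approved by ..."],  # generated answer (placeholder)
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores, e.g. faithfulness and answer relevancy
```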
## Timeline
- Implementing Contextual Compression: 2–3 days
- Tuning threshold/compressor: 2–3 days
- Total: 1 week