Groq Integration for Fast LLM Inference
Groq runs inference on its own LPU (Language Processing Unit), a processor specialized for language-model workloads. The result is roughly 500–800 tokens/sec versus the 50–100 typical of GPU-based providers. This enables scenarios that were previously impractical: real-time transcription with instant answers, or interactive coding assistants with no noticeable delay.
Basic Integration
```python
import os

from groq import Groq, AsyncGroq

# Read the key from the environment; the SDK also picks up
# GROQ_API_KEY automatically if api_key is omitted
client = Groq(api_key=os.environ["GROQ_API_KEY"])
async_client = AsyncGroq(api_key=os.environ["GROQ_API_KEY"])

# Synchronous request: noticeably faster than most GPU-backed providers
response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Explain the concept"}],
    temperature=0,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
```python
# Async variant for concurrent workloads
async def fast_query(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # fastest model, ~750 tok/s
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
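At these speeds, throughput is usually limited by how many requests you issue concurrently rather than by the model itself. A minimal sketch of bounded fan-out with `asyncio.Semaphore`; `run_bounded` and the limit of 8 are illustrative, not part of the Groq SDK:

```python
import asyncio

async def run_bounded(worker, prompts, limit=8):
    # Cap in-flight requests so bursts don't trip provider rate limits
    sem = asyncio.Semaphore(limit)

    async def guarded(prompt):
        async with sem:
            return await worker(prompt)

    # gather preserves input order in the results
    return await asyncio.gather(*(guarded(p) for p in prompts))
```

Usage with the helper above: `answers = asyncio.run(run_bounded(fast_query, prompts))`.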
```python
# Streaming: low latency to the first token
def stream_fast(prompt: str):
    stream = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```
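A simple way to verify the low-latency claim is to time the first chunk. This helper works on any token generator, including `stream_fast`; the function name is illustrative:

```python
import time

def first_token_latency(token_gen):
    # Returns (seconds until the first token arrived, full concatenated text)
    start = time.perf_counter()
    first = next(token_gen)
    latency = time.perf_counter() - start
    text = first + "".join(token_gen)
    return latency, text
```

Usage: `ttft, answer = first_token_latency(stream_fast("Hi"))`.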
Audio Transcription (Whisper on Groq)
```python
# Whisper on Groq: among the fastest cloud transcription options
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", audio_file.read()),
        model="whisper-large-v3",
        language="ru",
        response_format="verbose_json",  # includes segment timestamps
    )
print(transcription.text)

# Translation (transcribes and translates to English)
with open("audio.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        file=("audio.mp3", audio_file.read()),
        model="whisper-large-v3",
    )
print(translation.text)
```
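The `verbose_json` format returns per-segment `start`/`end` timestamps, which makes subtitle generation straightforward. A sketch that converts segments to SRT; it assumes each segment exposes `start`, `end`, and `text` fields, as in Whisper's verbose output, and `to_srt` is an illustrative name:

```python
def to_srt(segments):
    def ts(seconds):
        # SRT timestamp format: HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}"
        )
    return "\n\n".join(blocks)
```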
Available Groq Models
| Model | Speed | Context | Use |
|---|---|---|---|
| llama-3.1-70b-versatile | ~330 tok/s | 128K | General tasks |
| llama-3.1-8b-instant | ~750 tok/s | 128K | Realtime apps |
| mixtral-8x7b-32768 | ~500 tok/s | 32K | Long context |
| gemma2-9b-it | ~500 tok/s | 8K | Fast tasks |
| whisper-large-v3 | — | — | Audio |
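The table above can be encoded as a simple picker: choose the fastest model whose context window fits the request. The numbers mirror the table; `pick_model` is an illustrative helper, not an SDK function:

```python
MODELS = {
    # name: (approx tokens/sec, context window in tokens)
    "llama-3.1-8b-instant": (750, 128_000),
    "mixtral-8x7b-32768": (500, 32_000),
    "gemma2-9b-it": (500, 8_000),
    "llama-3.1-70b-versatile": (330, 128_000),
}

def pick_model(prompt_tokens):
    # Fastest model whose context window can hold the prompt
    candidates = [
        (speed, name)
        for name, (speed, ctx) in MODELS.items()
        if prompt_tokens <= ctx
    ]
    if not candidates:
        raise ValueError("prompt exceeds every model's context window")
    return max(candidates)[1]
```

In practice you would also weigh quality, not just speed; this only captures the speed/context trade-off.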
When Groq Is the Right Choice
Groq is a good fit for:
- Chatbots requiring under 500 ms to the first token
- Real-time code completion (IDE assistants)
- Batch processing with tight SLAs
- Near-real-time audio transcription
Groq is less suitable for:
- Tasks with very long outputs (cost grows with output length, and output tokens are priced higher than input)
- Workloads where maximum accuracy matters (Llama 70B trails Claude Opus / GPT-4o)
- Very high-volume workloads where cost dominates
Groq Pricing
| Model | Input (1M) | Output (1M) |
|---|---|---|
| Llama 3.1 70B | $0.59 | $0.79 |
| Llama 3.1 8B | $0.05 | $0.08 |
| Whisper Large v3 | $0.111 / audio hour | — |
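With the rates above, per-request cost is simple arithmetic (prices in USD per million tokens; `request_cost` is an illustrative helper):

```python
PRICES = {
    # model: (input $/1M tokens, output $/1M tokens)
    "llama-3.1-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(model, input_tokens, output_tokens):
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000
```

For example, a 70B request with 2,000 input and 1,000 output tokens costs about $0.002.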
Timeline
- Basic integration: 0.5 days
- Realtime chat with streaming: 1–2 days
- Whisper transcription pipeline: 2–3 days