Spike Testing: Testing Sudden Traffic Surges
A spike test simulates a sudden, multifold load increase: a TV ad, a news story, a flash sale, a DDoS attack. Unlike a stress test, the load is not ramped up gradually but jumps almost instantaneously. The goal is to ensure the system doesn't crash and recovers within an acceptable time.
Typical Spike Scenarios
- Flash sale: normal traffic 200 RPS → 2000 RPS in 30 seconds
- Email campaign: 100k users click link in 5 minutes
- News spike: major media publication—traffic ×10 in 2 minutes
- Bot attack: sudden DDoS from thousands of IPs
k6 Spike Test
// tests/spike/flash-sale.js
import http from 'k6/http'
import { check, sleep } from 'k6'
import { Rate } from 'k6/metrics'

const errorRate = new Rate('errors')

export const options = {
  scenarios: {
    // Baseline traffic always present
    baseline: {
      executor: 'constant-vus',
      vus: 20,
      duration: '15m',
    },
    // Spike: sudden increase
    spike: {
      executor: 'ramping-arrival-rate',
      startRate: 20,
      timeUnit: '1s',
      preAllocatedVUs: 500,
      maxVUs: 1000,
      stages: [
        { duration: '5m', target: 20 },   // normal load
        { duration: '10s', target: 500 }, // sharp spike
        { duration: '2m', target: 500 },  // peak
        { duration: '10s', target: 20 },  // load removal
        { duration: '5m', target: 20 },   // recovery
      ],
    },
  },
  thresholds: {
    // During the spike, allow degradation but not failure
    'http_req_duration{scenario:spike}': [
      { threshold: 'p(95)<3000', abortOnFail: false },
    ],
    // Minimize errors
    'errors{scenario:spike}': ['rate<0.05'], // < 5% during spike
    // After the spike: full recovery
    'http_req_duration{scenario:baseline}': ['p(95)<500'],
  },
}

const BASE_URL = __ENV.BASE_URL || 'https://staging.example.com'

export default function () {
  // Flagship endpoint for spike testing
  const res = http.get(`${BASE_URL}/api/products/flash-sale`, {
    timeout: '10s',
  })
  const success = check(res, {
    'status 200': (r) => r.status === 200,
    'responded in time': (r) => r.timings.duration < 3000,
  })
  errorRate.add(!success)
  sleep(Math.random() * 0.5)
}
Artillery Spike Scenario
# tests/spike/artillery-spike.yml
config:
  target: "{{ $processEnvironment.BASE_URL }}"
  phases:
    - name: "Normal traffic"
      duration: 300
      arrivalRate: 50
    - name: "Spike onset"
      duration: 30
      arrivalRate: 50
      rampTo: 500
    - name: "Spike peak"
      duration: 120
      arrivalRate: 500
    - name: "Spike recovery"
      duration: 30
      arrivalRate: 500
      rampTo: 50
    - name: "Post-spike normal"
      duration: 300
      arrivalRate: 50
  ensure:
    # System must survive
    thresholds:
      - http.codes.200.percent: 95  # >= 95% successful responses
      - http.response_time.p95: 5000  # p95 < 5 seconds

scenarios:
  - name: "Flash sale page"
    flow:
      - get:
          url: "/api/products/flash-sale"
Checking the Autoscaling Response
#!/bin/bash
# scripts/watch-autoscaling.sh
# Run in parallel with the spike test
NAMESPACE="production"
DEPLOYMENT="api"

echo "timestamp,replicas,ready_replicas,cpu_usage"
while true; do
  TS=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  REPLICAS=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
    -o jsonpath='{.spec.replicas}')
  READY=$(kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
    -o jsonpath='{.status.readyReplicas}')
  CPU=$(kubectl top pods -n "$NAMESPACE" --selector=app="$DEPLOYMENT" \
    --no-headers | awk '{sum+=$2} END {print sum}')
  echo "$TS,$REPLICAS,$READY,${CPU}m"
  sleep 15
done
Checking Queues and Circuit Breakers
# scripts/monitor-queues.py
# Monitor queue depth during the spike
import time
import redis

r = redis.Redis()
QUEUES = ['jobs:default', 'jobs:critical', 'jobs:email']

def monitor_queues():
    return {queue: r.llen(queue) for queue in QUEUES}

if __name__ == '__main__':
    while True:
        print(time.strftime('%H:%M:%S'), monitor_queues())
        time.sleep(5)

# During the spike, queues may accumulate tasks.
# Normal: the queue grows during the spike but drains afterwards.
# Problem: the queue never drains, because workers can't keep up.
What to Observe During a Spike Test

Metric                  Before spike   During spike   Recovery
──────────────────────────────────────────────────────────────
RPS                     50             500            50
p95 latency (ms)        200            2000           200  ✓
Error rate (%)          0.1            2.0            0.1  ✓
DB active connections   10             50             10   ✓
DB queue wait (ms)      5              500            5    ✓
App replicas (k8s)      2              8              2    ✓
Memory per pod (MB)     256            512            256  ✓
Job queue depth         0              5000           0    ✓ (in 5m)
If a metric doesn't recover within 5 minutes after the load is removed, that's a problem.
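That recovery criterion can be automated with a small check that compares post-spike samples against the pre-spike baseline. A minimal sketch, assuming hypothetical metric names and tolerance values:

```python
def recovered(baseline, after, rel_tol=0.2, abs_tol=1.0):
    """Return True if every metric returned to within tolerance of its
    pre-spike baseline. rel_tol handles latency-style metrics; abs_tol
    handles metrics whose baseline is zero (e.g. queue depth)."""
    for name, base in baseline.items():
        limit = max(base * (1 + rel_tol), base + abs_tol)
        if after.get(name, float("inf")) > limit:
            return False
    return True

# Samples taken before the spike and 5 minutes after load removal
baseline = {"p95_ms": 200, "error_pct": 0.1, "queue_depth": 0}
after = {"p95_ms": 230, "error_pct": 0.1, "queue_depth": 0}
print(recovered(baseline, after))
```

Wiring this into CI right after the load tool exits turns "did we recover?" from a manual dashboard check into a pass/fail gate.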
Typical Spike Issues and Solutions
Connection pool exhaustion: during spike all workers simultaneously request DB connections. Solution: pgBouncer transaction mode, increase max_connections, rate-limit at application level.
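The application-level rate limit can be sketched as a bounded semaphore in front of the driver's connect call. This is a hypothetical `BoundedPool`, not tied to any specific driver: callers beyond the cap queue briefly, then fail fast and shed load instead of stampeding the database.

```python
import threading
import contextlib

class BoundedPool:
    """App-level cap on concurrent DB connections. During a spike,
    excess callers wait up to acquire_timeout, then get an error
    instead of piling onto the database."""

    def __init__(self, max_conns=20, acquire_timeout=2.0):
        self._sem = threading.BoundedSemaphore(max_conns)
        self._timeout = acquire_timeout

    @contextlib.contextmanager
    def connection(self, connect_fn):
        # Block until a slot frees up, or give up after the timeout
        if not self._sem.acquire(timeout=self._timeout):
            raise TimeoutError("DB pool saturated; shedding load")
        conn = connect_fn()
        try:
            yield conn
        finally:
            conn.close()
            self._sem.release()
```

Failing fast here is deliberate: a 503 returned in 2 seconds is cheaper than a connection that queues inside the database for 30.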
Thundering herd on cache miss: spike clears cache, all requests hit DB simultaneously. Solution: request coalescing (one DB request, others wait), probabilistic early expiration.
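Request coalescing can be sketched in a few lines. This is a hypothetical in-process `Coalescer` (production systems often use a library or a distributed lock instead): the first thread to miss becomes the leader and hits the DB once; everyone else waits for its result.

```python
import threading

class Coalescer:
    """Collapse concurrent identical cache-miss lookups into one DB call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Event signalled when the value lands

    def get(self, key, load_fn, cache):
        val = cache.get(key)
        if val is not None:
            return val
        with self._lock:
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            try:
                val = load_fn(key)  # the single DB hit
                cache[key] = val
            finally:
                with self._lock:
                    del self._inflight[key]
                ev.set()
            return val
        # Followers wait for the leader instead of hitting the DB
        ev.wait()
        return cache.get(key)
```

The same idea appears under the name "single-flight" in Go's golang.org/x/sync/singleflight package.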
Memory pressure: spike allocates many objects, GC can't keep up. Solution: increase heap limit, profile allocations.
HPA responds too slowly: the Kubernetes HPA reacts with a lag, since metrics are scraped and evaluated periodically, so scale-up can trail the spike by a minute or more, and new pods still need time to start. Solution: tune --horizontal-pod-autoscaler-sync-period, use KEDA for event-driven scaling, keep pre-warmed (overprovisioned) pods.
KEDA for Instant Scaling
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaledobject
spec:
  scaleTargetRef:
    name: api-deployment
  minReplicaCount: 3
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_per_second
        query: sum(rate(http_requests_total[30s]))
        threshold: '100'  # 1 pod per 100 RPS
Timeline
A spike test with autoscaling and circuit breaker monitoring typically takes 1–2 business days.