Setting Up Elasticsearch Monitoring and Alerts (Kibana)
Elasticsearch won't report problems on its own — you need to set up monitoring in advance. The most common scenarios: the disk filled up, the cluster went red because of unassigned shards, the heap hit 95% and a GC storm kicked in. Without alerts, you learn about issues from your users.
Stack Monitoring in Kibana
Kibana Stack Monitoring is a built-in dashboard for monitoring Elasticsearch, Logstash, Kibana, and Beats. Monitoring data is collected by Metricbeat or by legacy internal collection (deprecated, not recommended).
Setting up Metricbeat for ES metric collection:
# metricbeat.yml
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["https://localhost:9200"]
    username: "remote_monitoring_user"
    password: "${ES_MONITOR_PASSWORD}"
    ssl.certificate_authorities: ["/etc/elasticsearch/certs/ca.crt"]
    scope: cluster
    metricsets:
      - ccr
      - cluster_stats
      - enrich
      - index
      - index_recovery
      - index_summary
      - ml_job
      - node
      - node_stats
      - pending_tasks
      - shard

output.elasticsearch:
  hosts: ["https://monitoring-es:9200"]
  username: "metricbeat_writer"
  password: "${MONITOR_WRITER_PASSWORD}"
It's better to write metrics to a separate monitoring cluster, not the one being monitored — otherwise, when the main cluster has issues, you lose monitoring too.
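Once Metricbeat is shipping data, monitoring indices appear on the monitoring cluster. A quick sanity check from Dev Tools (the exact index pattern varies by stack version):

```
GET _cat/indices/.monitoring-es-*?v
```

If the list is empty after a few collection periods, check Metricbeat's own logs and the `remote_monitoring_user` permissions.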
Key Metrics
Cluster health — first thing to look at:
- green — all shards (primary + replica) are assigned
- yellow — primary shards OK, some replicas are not assigned (normal for a single node, a problem in production)
- red — some primary shards are not assigned, part of the data is inaccessible
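A polling check against this status is a few lines of code. The sketch below only classifies an already-parsed /_cluster/health response body; the sample payload is illustrative, and fetching the real response from the cluster is left out.

```python
# Classify a parsed /_cluster/health response body.
# The sample payload is illustrative, not from a real cluster.
def health_summary(payload: dict) -> str:
    status = payload["status"]
    if status == "green":
        return "ok: all primary and replica shards assigned"
    if status == "yellow":
        return f"warn: {payload['unassigned_shards']} replica shard(s) unassigned"
    # red: at least one primary is unassigned
    return f"critical: primary shard(s) unassigned ({payload['unassigned_shards']} total)"

sample = {"status": "yellow", "unassigned_shards": 3}
print(health_summary(sample))  # warn: 3 replica shard(s) unassigned
```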
JVM Heap Used % — critical indicator:
- < 75% — normal
- 75–85% — monitor, frequent GC possible
- 85–95% — danger zone, performance degrades
- > 95% — JVM freezes on GC, cluster stops responding
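These ranges are easy to encode for an automated check. A hypothetical helper that mirrors the thresholds above:

```python
def heap_zone(used_pct: float) -> str:
    """Map JVM heap usage percent to the zones described above."""
    if used_pct < 75:
        return "normal"
    if used_pct < 85:
        return "watch"     # frequent GC possible
    if used_pct < 95:
        return "danger"    # performance degrades
    return "critical"      # long GC pauses, node may stop responding

print(heap_zone(72), heap_zone(91))  # normal danger
```

The input would come from `jvm.mem.heap_used_percent` in the node stats.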
Disk usage per node — ES restricts shard allocation and then indexing as the disk fills (default watermarks):
- low watermark (85%) — no new shards are allocated to the node
- high watermark (90%) — ES starts relocating shards away from the node
- flood_stage (95%) — indices with a shard on the node get a read-only block; indexing stops
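The active thresholds can be read back from cluster settings; `include_defaults` shows the values even when nothing was overridden, and `filter_path` just trims the response:

```
GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk.watermark*
```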
Search latency — query execution time, tracked as p50, p95, p99. Watch p95 and p99: latency spikes that hit only a few percent of queries are invisible in averages but signal real problems.
Indexing rate — documents/sec. A sharp drop indicates resource issues.
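The indexing rate isn't exposed as a rate directly — it is derived from the `index_total` counter in two successive `_stats` samples. A sketch (the field path follows the index stats API; the sample numbers are invented):

```python
def indexing_rate(prev: dict, curr: dict, interval_s: float) -> float:
    """Docs indexed per second between two _stats samples."""
    prev_count = prev["_all"]["primaries"]["indexing"]["index_total"]
    curr_count = curr["_all"]["primaries"]["indexing"]["index_total"]
    return (curr_count - prev_count) / interval_s

# Two illustrative samples taken 30 seconds apart:
sample_prev = {"_all": {"primaries": {"indexing": {"index_total": 1_000_000}}}}
sample_curr = {"_all": {"primaries": {"indexing": {"index_total": 1_003_000}}}}
print(indexing_rate(sample_prev, sample_curr, 30))  # 100.0 docs/sec
```

Stack Monitoring and elasticsearch_exporter compute the same delta for you; this is only what the number means.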
Watcher — Elasticsearch Alerts
Elastic Watcher (X-Pack) is a built-in alerting mechanism. Configured via API or Kibana UI.
Alert on red cluster status:
PUT _watcher/watch/cluster_status_red
{
  "metadata": {
    "es_password": "<elastic user password>",
    "telegram_token": "<telegram bot token>",
    "telegram_chat_id": "<chat id>"
  },
  "trigger": {
    "schedule": { "interval": "1m" }
  },
  "input": {
    "http": {
      "request": {
        "scheme": "https",
        "host": "localhost",
        "port": 9200,
        "path": "/_cluster/health",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "{{ctx.metadata.es_password}}"
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.status": { "eq": "red" }
    }
  },
  "actions": {
    "send_telegram": {
      "webhook": {
        "scheme": "https",
        "host": "api.telegram.org",
        "port": 443,
        "method": "post",
        "path": "/bot{{ctx.metadata.telegram_token}}/sendMessage",
        "params": {
          "chat_id": "{{ctx.metadata.telegram_chat_id}}",
          "text": "ALERT: Elasticsearch cluster status is RED! Time: {{ctx.execution_time}}"
        }
      }
    }
  }
}
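Before trusting the schedule, dry-run the watch: `_execute` runs it immediately, and setting the action mode to `simulate` keeps the webhook from actually firing:

```
POST _watcher/watch/cluster_status_red/_execute
{
  "ignore_condition": true,
  "action_modes": { "send_telegram": "simulate" }
}
```

The response shows the input payload, the condition result, and what each action would have done.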
Alert on disk usage > 85%:
PUT _watcher/watch/disk_usage_high
{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "http": {
      "request": {
        "scheme": "https",
        "host": "localhost",
        "port": 9200,
        "path": "/_nodes/stats/fs",
        "auth": { "basic": { "username": "elastic", "password": "changeme" } }
      }
    }
  },
  "condition": {
    "script": {
      "source": """
        for (node in ctx.payload.nodes.values()) {
          def total = node.fs.total.total_in_bytes;
          def free = node.fs.total.free_in_bytes;
          // multiply by 100.0 first: (total - free) / total alone
          // is integer division on the byte counters and yields 0
          def used_pct = 100.0 * (total - free) / total;
          if (used_pct > 85) return true;
        }
        return false;
      """
    }
  },
  "actions": {
    "log": {
      "logging": {
        "level": "warn",
        "text": "High disk usage detected on Elasticsearch node"
      }
    }
  }
}
Setting Up Alerts via Kibana UI
In Kibana 8.x, rules live under Stack Management > Rules (called Alerts and Actions in earlier versions). It's a visual rule builder — no hand-written JSON.
Built-in rule types: Elasticsearch cluster health, nodes changed, version mismatch, CPU usage, JVM memory.
Notification connectors: Email, Slack, PagerDuty, Webhook (covers Telegram, Teams).
Metrics via Prometheus + Grafana
If infrastructure already uses Prometheus, connect elasticsearch_exporter:
docker run -d \
--name elasticsearch_exporter \
-p 9114:9114 \
prometheuscommunity/elasticsearch-exporter:latest \
--es.uri=https://elastic:changeme@localhost:9200 \
--es.ssl-skip-verify \
--es.all \
--es.indices \
--es.shards
Prometheus scrape_config:
- job_name: 'elasticsearch'
  scrape_interval: 30s
  static_configs:
    - targets: ['localhost:9114']
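With the exporter scraped, alerting moves to Prometheus rules. A sketch of two rules using metric names exposed by elasticsearch_exporter (the alert names, durations, and thresholds are examples):

```
groups:
  - name: elasticsearch
    rules:
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster status is RED"
      - alert: ElasticsearchDiskHigh
        expr: 1 - (elasticsearch_filesystem_data_free_bytes / elasticsearch_filesystem_data_size_bytes) > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85% on {{ $labels.name }}"
```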
Import Grafana dashboard ID 6483 (Elasticsearch Overview) — a ready-made dashboard with the key metrics.
Elasticsearch Logs
Key log files on each node (for a cluster named myapp-prod):
- /var/log/elasticsearch/myapp-prod.log — main log
- /var/log/elasticsearch/myapp-prod_gc.log — GC log (analyze when chasing heap issues)
- /var/log/elasticsearch/myapp-prod_server.log — system events
Slow search queries are logged via slow log:
PUT /products/_settings
{
"index.search.slowlog.threshold.query.warn": "5s",
"index.search.slowlog.threshold.query.info": "2s",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.indexing.slowlog.threshold.index.warn": "5s"
}
Timeline
Basic monitoring via Kibana Stack Monitoring with Metricbeat — 1 working day. Setting up alerts via Watcher or Kibana Rules with notifications to Telegram/Slack — 1 more day. Grafana dashboard with Prometheus — 1 day if Prometheus infrastructure exists.