# Replacing Elasticsearch with ClickHouse: A 90% Cost-Reduction Migration
A practical guide based on shipping this for a crypto-derivatives platform — the annual observability bill went from ~$400K to ~$50K, with faster queries and AI-powered log search as a bonus.

If you're paying mid six figures for Elasticsearch and ~90% of your queries are aggregations (error counts, latency percentiles, service health), you're paying full price for a feature you barely use: full-text search.

This post walks through how a crypto exchange replaced Elasticsearch with ClickHouse for OpenTelemetry logs, traces, and metrics. Same OTEL instrumentation, just a different backend. Result: a 5x smaller storage footprint, 2-6x faster queries on benchmarks, and natural-language log queries via an AI agent — at ~10% of the cost.

If you have an existing Elasticsearch + Kibana observability stack and you've been wondering whether ClickHouse is a serious alternative, this is the deep dive. It includes the schema, the migration plan, the OTEL Collector configuration, the cost numbers, and the gotchas.

Note on numbers: the cost figures below ($400K Elasticsearch → ~$50K ClickHouse) are real annual numbers from this deployment, on log volumes typical of a high-traffic trading platform (low tens of TB ingested per year, 90-day hot retention, multi-region). Your mileage will vary substantially with log volume, retention, and cluster size. The architectural patterns are universal.
## Contents

- Executive Summary
- The Problem — Elasticsearch at $400K/Year
- Why ClickHouse for Observability
- Platform Architecture
- ClickHouse Schema — Logs, Traces, Metrics
- End-to-End Distributed Tracing — HTTP to ClickHouse
- Full-Text Log Search — Replacing Kibana Discover
- Visualization — Grafana Replaces Kibana
- Data Retention & Tiered Storage
- AI Layer — Natural Language Over Logs and Traces
- Migration Plan — Zero Downtime Cutover (Standard OTEL or BindPlane)
- Cost Analysis
- Risk Assessment
- Success Metrics
- Reference Links

## Executive Summary

Replace the current Elasticsearch-based observability stack (application logs + OTEL traces + metrics) with ClickHouse — reducing annual infrastructure cost from $400K to ~$35-60K while gaining faster aggregations, better compression, unified storage with business data, and AI-powered log querying.

The problem:

- Paying $400K/year for the Elasticsearch managed service
- Elasticsearch is optimized for full-text search, but most log queries are aggregations (error counts, latency percentiles, service health) — where ClickHouse is 2-6x faster on cold queries and 1.7-2.6x faster on hot queries (ClickHouse/TextBench benchmark, OTEL logs at 1B-50B rows)
- Logs, traces, metrics, and business data live in separate systems — no cross-correlation
- Storage costs are high: the same OTEL dataset takes 5x more space in Elasticsearch (245 GB vs 49 GB at 1B rows; 12 TB vs 2.4 TB at 50B rows)
- Kibana is the only query interface — no programmatic access, no AI layer

The plan:

- Keep OpenTelemetry (OTEL) as the instrumentation standard — zero application changes
- Change only the destination: swap the Elasticsearch exporter for the ClickHouse exporter in the OTEL Collector
- All logs, traces, and metrics land in ClickHouse
- Grafana reads ClickHouse for dashboards and alerts
- The AI agent queries logs/traces in plain English via the same LibreChat + MCP platform
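The destination swap in the plan above is a one-block change in the collector config. A minimal sketch of the dual-write phase — hostnames and credentials are placeholders, and the options shown are the commonly used subset of the contrib clickhouseexporter's settings, so check them against the exporter's README before deploying:

```yaml
# otel-collector.yaml — dual-write sketch (hypothetical endpoints)
exporters:
  elasticsearch:
    endpoints: ["https://es.internal:9200"]      # existing destination, kept during migration
  clickhouse:
    endpoint: tcp://clickhouse.internal:9000     # ClickHouse native protocol DSN
    database: otel
    logs_table_name: otel_logs
    traces_table_name: otel_traces
    ttl: 2160h                                   # 90-day retention
    timeout: 10s
    retry_on_failure:
      enabled: true

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [elasticsearch, clickhouse]     # dual-write until cutover
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [elasticsearch, clickhouse]
```

At cutover, removing `elasticsearch` from the two exporter lists is the entire change.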
Targets:

| Metric | Target |
|---|---|
| Annual cost reduction | $340-365K saved (~85-90% reduction) |
| Storage reduction | 5x smaller total footprint (16x on column files) — real benchmark at 1B-50B OTEL rows |
| Query speed improvement | 2-6x faster cold queries, 1.7-2.6x hot queries (ClickHouse/TextBench) |
| Retention period | Same or longer — at lower cost |
| Unification | Logs + traces + metrics + business data in one DB |
| AI queries over logs | Plain English → SQL → instant answer |

## The Problem — Elasticsearch at $400K/Year

| Cost Driver | Elasticsearch Behaviour | Annual Cost Share |
|---|---|---|
| Storage | Row-oriented index, 2-3x compression, needs SSD | ~$100K |
| Compute | CPU-heavy indexing on every write, inverted index maintenance | ~$150K |
| Licensing | Elastic managed service / Elastic Cloud premium | ~$100K |
| Operations | Shard management, index lifecycle management (ILM), tuning | ~$50K (eng time) |

Elasticsearch was built for full-text search on documents (web pages, articles). Application logs and OTEL telemetry are structured time-series data — they need aggregations, not document search.

| What teams actually query | Elasticsearch efficiency | ClickHouse efficiency |
|---|---|---|
| "Error count per service last hour" | Slow (aggregation on inverted index) | Fast (columnar scan) |
| "P99 latency for /api/trade endpoint" | Slow (percentile aggregation) | Fast (built-in quantile functions) |
| "Show logs for TraceId = abc123" | Fast (indexed term lookup) | Fast (bloom filter) |
| "Which services degraded after deploy at 14:00?" | Medium | Fast |
| "Free-text: find logs containing OutOfMemoryError" | Fast (native) | Good (tokenbf bloom filter) |

~90% of real observability queries are aggregations. With Elasticsearch, you're paying full price for a capability (full-text search) that covers only ~10% of use cases.
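To make the row-vs-column distinction in this table concrete, here is a toy Python sketch — not ClickHouse internals, just an illustration of the two physical layouts. Counting errors per service touches only the service and severity columns, so a columnar engine scans two tight arrays and never reads the log bodies; a document store materializes whole documents for the same answer.

```python
from collections import Counter

# The same six log records in two physical layouts.
rows = [  # row-oriented: one document per log line (Elasticsearch-style)
    {"service": "trading", "severity": "ERROR", "body": "OutOfMemoryError"},
    {"service": "trading", "severity": "INFO",  "body": "trade ok"},
    {"service": "gateway", "severity": "ERROR", "body": "upstream timeout"},
    {"service": "gateway", "severity": "INFO",  "body": "200 OK"},
    {"service": "trading", "severity": "ERROR", "body": "risk check failed"},
    {"service": "gateway", "severity": "INFO",  "body": "200 OK"},
]
columns = {  # column-oriented: one array per field (ClickHouse-style)
    "service":  [r["service"] for r in rows],
    "severity": [r["severity"] for r in rows],
}

# "Error count per service" reads exactly two columns; the (large)
# log bodies are never touched — that is the columnar I/O saving.
errors = Counter(
    svc for svc, sev in zip(columns["service"], columns["severity"])
    if sev == "ERROR"
)
print(dict(errors))  # {'trading': 2, 'gateway': 1}
```

In a real columnar store the per-column arrays are also compressed together, which is where the 5x storage gap in the benchmark comes from.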
Where ClickHouse wins most — benchmark by query type (cold, 50B rows):

| Query Category | Examples | ClickHouse speedup |
|---|---|---|
| Log retrieval (text match + fetch rows) | "Find logs containing OutOfMemoryError" | Narrowest gap — ES competitive |
| Error/match counts | "Count 500 errors in last hour" | Moderate advantage |
| Service-level breakdowns | "Group errors by service" | Large advantage |
| Time-series trend analysis | "Error rate per minute over last 24h" | Widest gap — 6x+ faster |

The speedup grows with analytical complexity. Retrieval-only queries (find and show matching rows) are ES's home turf. The moment you add grouping, aggregation, or time-bucketing on top of a text match — the dominant pattern in observability — ClickHouse's vectorized engine pulls away decisively.

## Platform Architecture

The current stack:

```
Application
    │
    ▼
OTEL SDK (instrumented)
    │
    ▼
OTEL Collector
    │
    ▼
Elasticsearch Exporter
    │
    ▼
Elasticsearch Cluster
    │
    ▼
Kibana  ← only UI, no programmatic access
```

Its limitations:

- No way to join log data with business data (e.g., "which users were affected by this error?")
- Kibana dashboards require manual setup — no AI layer
- Retention is limited by cost — older logs are deleted or archived to cold storage with no query access
- Every new service that emits logs increases Elasticsearch cost linearly

Proof point: Cloudflare replaced Elasticsearch with ClickHouse for HTTP logs at a scale of 36 petabytes of data; queries that took 30+ seconds in Elasticsearch now return dramatically faster.

## End-to-End Distributed Tracing — HTTP to ClickHouse

ClickHouse reads the traceparent header from every query and emits its own internal spans to system.opentelemetry_span_log:

```sql
-- ClickHouse internal spans for a specific trace
SELECT
    trace_id,
    span_id,
    parent_span_id,
    operation_name,
    (finish_time_us - start_time_us) / 1000 AS duration_ms,
    attribute['clickhouse.query'] AS sql_query
FROM system.opentelemetry_span_log
WHERE trace_id = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY start_time_us;
```

Export these to otel_traces via a materialized view so they appear in Grafana alongside app spans:
```sql
-- Auto-export ClickHouse internal spans to otel_traces
CREATE MATERIALIZED VIEW otel.clickhouse_spans_mv TO otel.otel_traces AS
SELECT
    fromUnixTimestamp64Micro(start_time_us) AS Timestamp,
    lower(hex(trace_id)) AS TraceId,
    lower(hex(span_id)) AS SpanId,
    lower(hex(parent_span_id)) AS ParentSpanId,
    '' AS TraceState,
    operation_name AS SpanName,
    'SPAN_KIND_SERVER' AS SpanKind,
    'clickhouse' AS ServiceName,
    map() AS ResourceAttributes,
    attribute AS SpanAttributes,
    (finish_time_us - start_time_us) * 1000 AS Duration,  -- convert to nanoseconds
    'STATUS_CODE_OK' AS StatusCode,
    '' AS StatusMessage,
    [] AS `Events.Timestamp`,
    [] AS `Events.Name`,
    [] AS `Events.Attributes`,
    [] AS `Links.TraceId`,
    [] AS `Links.SpanId`,
    [] AS `Links.TraceState`,
    [] AS `Links.Attributes`
FROM system.opentelemetry_span_log;

-- Full waterfall for a single trade request
SELECT
    SpanName,
    ServiceName,
    Duration / 1e6 AS duration_ms,
    SpanAttributes['http.method'] AS http_method,
    SpanAttributes['http.route'] AS route,
    SpanAttributes['db.statement'] AS sql,
    SpanAttributes['db.rows_affected'] AS rows,
    StatusCode,
    Timestamp
FROM otel.otel_traces
WHERE TraceId = '4bf92f3577b34da6a3ce929d0e0e4736'
ORDER BY Timestamp ASC;
```

Result — full waterfall in one query:

```
SpanName                        Service              duration_ms
──────────────────────────────  ───────────────────  ───────────
http.server POST /api/trade     api-gateway          245.3 ms
└─ trade.execute                trading-service      241.1 ms
   ├─ db.query (risk check)     trading-service       18.4 ms  ← SELECT mart_user_risk_profile
   ├─ db.query (insert trade)   trading-service        4.2 ms  ← INSERT INTO mart_trades_futures
   ├─ SELECT (internal CH)      clickhouse             3.1 ms  ← ClickHouse internal span
   └─ kafka.produce             trading-service        2.1 ms
```

## Full-Text Log Search — Replacing Kibana Discover

Full-text search used to be the primary reason to keep Elasticsearch. ClickHouse has significantly closed this gap with a new text index type that works natively on object storage (S3) with the same performance as local disk — removing the last major technical advantage Elasticsearch held.
Reference: ClickHouse Full-Text Search on Object Storage

### The text index — how to add it to otel_logs

```sql
-- Add the new text index to the Body column
ALTER TABLE otel.otel_logs
    ADD INDEX body_text_idx(Body)
    TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(Body))
    GRANULARITY 1;

-- Materialize the index on existing data
ALTER TABLE otel.otel_logs MATERIALIZE INDEX body_text_idx;
```

What the text index supports:

| Function | Example | Use Case |
|---|---|---|
| hasToken | hasToken(Body, 'OutOfMemoryError') | Exact token match — fastest |
| hasAllTokens | hasAllTokens(Body, ['trade', 'failed']) | All tokens must appear |
| hasAnyTokens | hasAnyTokens(Body, ['ERROR', 'FATAL']) | Any token match |
| LIKE | Body LIKE '%OutOfMemoryError%' | Wildcard match |
| startsWith | startsWith(Body, 'WARN') | Prefix match |
| match (regex) | match(Body, 'user_[0-9]+') | Regex search |

Performance: 7.4x speedup vs a full table scan on text search (ClickHouse benchmark, 10M rows with array tags).

Why it works on S3: the index uses sequential dictionary reads with front-coding compression — no random I/O, which is the key constraint on object storage. 94.5% of tokens appear in ≤6 rows, so embedded posting lists handle the vast majority of lookups without reading large posting lists.

Real-world proof point — gitTrends: ClickHouse's own reference demo (github.com/ClickHouse/gitTrends) searches 10 billion+ GitHub events using hasToken() on a body text index — the exact same pattern used for log search here. The app lets users compare the FTS index vs a bloom-filter skip index vs a full table scan in real time, with live row-scan counters streamed from ClickHouse. Sub-second queries at 10B rows validate the production viability of hasToken() for high-volume text search workloads.
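What hasToken can match depends entirely on what the tokenizer emitted at index time. A quick Python approximation of splitByNonAlpha plus the lower() preprocessor — an approximation only, the authoritative behavior is ClickHouse's own tokenizer:

```python
import re

def split_by_non_alpha(body: str) -> list[str]:
    """Approximate ClickHouse's splitByNonAlpha tokenizer:
    split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-zA-Z0-9]+", body) if t]

def index_tokens(body: str) -> set[str]:
    # preprocessor = lower(Body): lowercase before tokenizing
    return set(split_by_non_alpha(body.lower()))

tokens = index_tokens(
    "2024-05-01 ERROR [trading-service] OutOfMemoryError: heap exhausted"
)
# hasToken(Body, 'OutOfMemoryError') is matched against these lowercased
# whole tokens: 'outofmemoryerror' is present, but the substring 'memory'
# alone is NOT a token — hasToken never matches partial tokens.
assert "outofmemoryerror" in tokens
assert "memory" not in tokens
```

This is why `hasToken` is the fastest path (exact token lookup) while substring-style searches fall back to `LIKE '%...%'` over the same index.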
### Search queries — text index (preferred)

```sql
-- Uses the new text index — 7.4x faster than a full scan
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel.otel_logs
WHERE hasToken(Body, 'OutOfMemoryError')
  AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100;

-- LIKE also uses the text index automatically
SELECT Timestamp, ServiceName, SeverityText, Body
FROM otel.otel_logs
WHERE Body LIKE '%OutOfMemoryError%'
  AND Timestamp >= now() - INTERVAL 1 HOUR
ORDER BY Timestamp DESC
LIMIT 100;

-- Near O(1) — the bloom_filter index makes this very fast
SELECT *
FROM otel.otel_logs
WHERE TraceId = 'abc123def456'
ORDER BY Timestamp;

-- Filter on OTEL log attributes — common in structured logging
SELECT ServiceName, count() AS error_count
FROM otel.otel_logs
WHERE LogAttributes['http.status_code'] = '500'
  AND Timestamp >= today()
GROUP BY ServiceName
ORDER BY error_count DESC;

-- "All ERROR logs from trading-service in last 2 hours that mention user_id"
SELECT
    Timestamp,
    SeverityText,
    Body,
    LogAttributes['user_id'] AS user_id,
    LogAttributes['error_code'] AS error_code,
    TraceId
FROM otel.otel_logs
WHERE ServiceName = 'trading-service'
  AND SeverityText = 'ERROR'
  AND Timestamp >= now() - INTERVAL 2 HOUR
  AND hasToken(Body, 'user_id')  -- uses text index
ORDER BY Timestamp DESC
LIMIT 200;
```

The text index preprocessor lowercases at index time, so search is case-insensitive:

SELECT Timestamp, ServiceName, Body
FROM otel.otel_logs
WHERE hasToken(lower(Body), 'outofmemoryerror')  -- matches any casing of OutOfMemoryError
  AND Timestamp >= now() - INTERVAL 1 HOUR;

### Query performance — Elasticsearch vs ClickHouse

| Query Type | Elasticsearch | ClickHouse | Notes |
|---|---|---|---|
| Exact token match (hasToken) | ~10ms | ~15-30ms | Near-comparable with text index |
| Keyword search (LIKE) | ~10ms | ~20-50ms | text index — 7.4x faster than scan |
| Structured attribute filter | ~50ms | ~5-20ms | CH wins (columnar) |
| Aggregation (error count by service) | ~200ms-2s | ~10-50ms | CH wins significantly |
| Time-range + service filter | ~100ms | ~10-30ms | CH wins (partition pruning) |
| TraceId lookup | ~10ms | ~20-50ms | Comparable (bloom filter) |
| Free-text fuzzy search | Excellent | Good (regex via match()) | ES still leads for fuzzy |
| Search on S3/object storage | Degraded (SSD required) | Full speed on S3 | CH advantage — no SSD needed |

Bottom line: ~95% of observability queries are structured (service + time + severity + attribute) — ClickHouse wins on all of those. For the remaining ~5% requiring keyword search in log bodies, the new text index brings ClickHouse to near-Elasticsearch performance. The only remaining ES advantage is fuzzy/phrase search, which is rarely needed for structured application logs.

Coming soon in ClickHouse: phrase search (position-aware token matching) and JSON column indexing — which will close the remaining gap further.
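A side benefit the comparison table doesn't capture: unlike the Kibana-only setup, every one of these queries is reachable from any HTTP client. A minimal stdlib-only Python sketch against ClickHouse's HTTP interface — the host is a placeholder, and real deployments should add authentication headers:

```python
import json
import urllib.parse
import urllib.request

CH_HOST = "http://clickhouse.internal:8123"  # placeholder host

def build_query_url(sql: str, host: str = CH_HOST) -> str:
    """URL for the ClickHouse HTTP interface. FORMAT JSONEachRow returns
    one JSON object per row — trivial to parse from any language."""
    return f"{host}/?{urllib.parse.urlencode({'query': sql + ' FORMAT JSONEachRow'})}"

def query_clickhouse(sql: str, host: str = CH_HOST) -> list[dict]:
    # Add X-ClickHouse-User / X-ClickHouse-Key headers in real use.
    with urllib.request.urlopen(build_query_url(sql, host)) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]

url = build_query_url("SELECT 1")
print(url)  # http://clickhouse.internal:8123/?query=SELECT+1+FORMAT+JSONEachRow
```

This is the same interface Grafana and the AI layer below sit on top of — there is nothing proprietary between you and the data.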
## Visualization — Grafana Replaces Kibana

```shell
# Install ClickHouse datasource plugin
grafana-cli plugins install grafana-clickhouse-datasource

# Or in docker-compose:
#   environment:
#     - GF_INSTALL_PLUGINS=grafana-clickhouse-datasource
```

```yaml
# grafana/provisioning/datasources/clickhouse.yaml
apiVersion: 1
datasources:
  - name: ClickHouse-OTEL
    type: grafana-clickhouse-datasource
    uid: clickhouse-otel
    jsonData:
      host: clickhouse
      port: 9000
      database: otel
      username: grafana_readonly
    secureJsonData:
      password: ${GRAFANA_CH_PASSWORD}
```

```sql
-- Error rate per service — last 1 hour, 1-minute buckets
SELECT
    toStartOfMinute(Timestamp) AS time,
    ServiceName,
    countIf(SeverityText = 'ERROR') AS errors,
    count() AS total,
    errors / total * 100 AS error_rate_pct
FROM otel.otel_logs
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY time, ServiceName
ORDER BY time ASC;

-- HTTP endpoint latency percentiles — last 30 minutes
SELECT
    SpanAttributes['http.route'] AS endpoint,
    quantile(0.50)(Duration) / 1e6 AS p50_ms,
    quantile(0.95)(Duration) / 1e6 AS p95_ms,
    quantile(0.99)(Duration) / 1e6 AS p99_ms,
    count() AS request_count
FROM otel.otel_traces
WHERE SpanKind = 'SPAN_KIND_SERVER'
  AND Timestamp >= now() - INTERVAL 30 MINUTE
GROUP BY endpoint
ORDER BY p99_ms DESC;

-- Slowest ClickHouse queries in last 1 hour
SELECT
    SpanAttributes['db.statement'] AS sql,
    count() AS calls,
    avg(Duration) / 1e6 AS avg_ms,
    max(Duration) / 1e6 AS max_ms,
    quantile(0.99)(Duration) / 1e6 AS p99_ms
FROM otel.otel_traces
WHERE SpanAttributes['db.system'] IN ('clickhouse', 'postgresql')
  AND Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY sql
ORDER BY p99_ms DESC
LIMIT 20;

-- Live log tail with filtering — wire to Grafana Logs panel
SELECT Timestamp, ServiceName, SeverityText, Body, TraceId, LogAttributes
FROM otel.otel_logs
WHERE Timestamp >= $__timeFrom()  -- Grafana time variables
  AND Timestamp <= $__timeTo();
```

## AI Layer — Natural Language Over Logs and Traces

User: "Which services logged the most in the last hour?"
→ SQL: SELECT ServiceName, count() FROM otel_logs WHERE Timestamp >= now()-INTERVAL 1 HOUR GROUP BY ServiceName ORDER BY count() DESC

User: "Show me all traces where ClickHouse queries took more than 500ms today"
→ SQL: SELECT TraceId, ServiceName, Duration/1e6
AS ms, SpanAttributes['db.statement'] AS sql FROM otel_traces WHERE SpanAttributes['db.system']='clickhouse' AND Duration > 500000000 AND Timestamp >= today() ORDER BY Duration DESC

User: "Which users were affected by the trading-service errors between 2pm and 3pm today?"
→ SQL: SELECT l.LogAttributes['user_id'], u.kyc_status, count() AS errors FROM otel_logs l JOIN mart_users u ON l.LogAttributes['user_id'] = u.username WHERE l.ServiceName='trading-service' AND l.SeverityText='ERROR' AND l.Timestamp BETWEEN today()+toIntervalHour(14) AND today()+toIntervalHour(15) GROUP BY 1,2 ORDER BY errors DESC

The last query is impossible in Kibana — it crosses observability data (logs) with business data (users). In ClickHouse, it's one SQL query the AI generates automatically.

Add to business_glossary.yaml:

```yaml
# Observability terms
- "error logs"    = otel_logs WHERE SeverityText = 'ERROR'
- "slow query"    = otel_traces WHERE SpanAttributes['db.system'] IN ('clickhouse','postgresql') AND Duration > 500000000
- "HTTP 5xx"      = otel_logs WHERE LogAttributes['http.status_code'] LIKE '5%'
- "trade service" = ServiceName = 'trading-service'
- "trace"         = otel_traces WHERE TraceId = ''
- "latency"       = Duration / 1e6 (milliseconds)
- "p99"           = quantile(0.99)(Duration) / 1e6
- "error rate"    = countIf(SeverityText='ERROR') / count() * 100
```

## Migration Plan — Zero Downtime Cutover (Standard OTEL or BindPlane)

If you run a small number of centralized OTEL Collectors (one per environment), the standard approach is sufficient — edit the collector config YAML to add the ClickHouse exporter alongside the existing Elasticsearch exporter for dual-write. No extra tooling needed.

If you run OTEL Collectors on many individual servers/services, BindPlane is worth considering. It is a centralized management platform for OTEL Collector fleets — instead of editing YAML configs on each server manually, you manage all collector configurations from one dashboard.
Without BindPlane:

1. Edit otel-collector.yaml on server 1
2. Edit otel-collector.yaml on server 2
3. ... Edit otel-collector.yaml on server N
4. Restart each collector

With BindPlane:

1. Add the ClickHouse destination once in the BindPlane UI
2. Roll out to the entire fleet in one click

What BindPlane adds for this migration:

| Capability | Value |
|---|---|
| Central config management | One change pushes to all collectors instantly |
| Dual-write in one click | Route to Elasticsearch AND ClickHouse simultaneously without touching individual collectors |
| Service-by-service cutover | Route trading-service logs to ClickHouse first, validate, then add more services gradually |
| Severity-based routing | Route ERROR logs to Elasticsearch (keep during validation), INFO/DEBUG to ClickHouse only |
| Safe fleet rollout | Progressive rollout with automatic rollback on failure |
| 130+ sources and destinations | Supports the standard OTEL ClickHouse exporter as a destination |

BindPlane for self-managed ClickHouse: BindPlane's native "ClickStack destination" connects to the ClickHouse Cloud managed product. For self-managed ClickHouse, use the standard OTEL clickhouseexporter as a generic OTLP destination in BindPlane — same outcome, slightly more manual config.
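For completeness: the severity-based routing BindPlane offers can also be approximated in plain collector YAML with the contrib routing connector — more manual, but no extra platform. A sketch with placeholder pipeline names; verify the OTTL statement syntax against the routing connector's docs before relying on it:

```yaml
# Plain-collector equivalent of severity-based routing (sketch)
connectors:
  routing:
    default_pipelines: [logs/clickhouse]               # INFO/DEBUG → ClickHouse only
    error_mode: ignore
    table:
      - statement: route() where severity_number >= 17 # 17 = SEVERITY_NUMBER_ERROR
        pipelines: [logs/clickhouse, logs/elasticsearch]

service:
  pipelines:
    logs/in:
      receivers: [otlp]
      exporters: [routing]
    logs/clickhouse:
      receivers: [routing]
      exporters: [clickhouse]
    logs/elasticsearch:
      receivers: [routing]
      exporters: [elasticsearch]
```

This keeps ERROR logs flowing to Elasticsearch during validation while everything else goes to ClickHouse only — the same safety property as BindPlane's routing, minus the UI and progressive rollout.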
Example BindPlane destination config:

```yaml
# BindPlane destination config for self-managed ClickHouse
destination:
  type: otlp
  name: clickhouse-self-managed
  config:
    endpoint: tcp://clickhouse.internal:9000
    headers:
      - key: x-clickhouse-database
        value: otel
    tls:
      insecure: false
```

Reference: https://clickhouse.com/blog/bindplane-faster-otel-migrations-to-clickstack

Phase 1 — stand up and dual-write:

| Task | Details |
|---|---|
| Deploy ClickHouse schema | Create otel_logs, otel_traces, otel_metrics tables |
| Add ClickHouse exporter to OTEL Collector | Dual-write: send to both Elasticsearch AND ClickHouse (via YAML edit or BindPlane) |
| Set up Grafana | Install ClickHouse datasource, build core dashboards |
| Validate data parity | Compare row counts, spot-check log content |
| Test trace waterfall | Pick 5-10 real TraceIds, verify waterfall in Grafana matches Kibana |

Phase 2 — parallel run:

| Task | Details |
|---|---|
| Run both stacks simultaneously | Elasticsearch + ClickHouse receive the same data |
| Migrate dashboards | Rebuild all Kibana dashboards in Grafana |
| Migrate alerts | Recreate all Kibana alerts in Grafana Alerting |
| Train teams | Grafana walkthrough for each team |
| Build AI log queries | Add observability terms to the business glossary |
| Gradual service cutover (optional) | Use BindPlane to route one service at a time to ClickHouse-only, validate, expand |

Phase 3 — cutover:

| Task | Details |
|---|---|
| Confirm all dashboards working in Grafana | Sign-off from each team |
| Remove Elasticsearch exporter from OTEL Collector | Single-line config change (or one click in BindPlane) |
| Verify ClickHouse-only flow | 24-hour monitoring window |
| Cancel Elasticsearch subscription | After a 48-hour clean run |

Phase 4 — optimize:

| Task | Details |
|---|---|
| Tune TTL and tiered storage | Configure S3 cold tier based on actual usage patterns |
| Enable AI log queries in LibreChat | Add observability glossary, test with teams |
| Set up cross-data dashboards | Logs + business data correlation panels in Grafana |
| Performance tuning | Review slow queries via system.query_log, tune ORDER BY keys if needed |

Safety properties:

- Dual-write period: both systems receive data simultaneously for 4 weeks — no data-loss risk
- Rollback: removing the ClickHouse
exporter (or reverting the BindPlane config) restores Elasticsearch-only in seconds
- No application changes: OTEL SDK configuration is unchanged throughout
- Gradual cutover: cut over one service at a time using BindPlane routing rules if preferred

## Cost Analysis

Current Elasticsearch stack:

| Component | Annual Cost |
|---|---|
| Elasticsearch managed service (compute) | ~$200K |
| Storage (SSD, replicated) | ~$100K |
| Licensing (Elastic managed / premium) | ~$50K |
| Engineering time (ops, tuning, ILM) | ~$50K |
| Total | ~$400K |

ClickHouse self-hosted:

| Component | Annual Cost |
|---|---|
| ClickHouse nodes (2x m7g.4xlarge, Graviton3) | ~$18K |
| Storage SSD (hot, last 7 days) | ~$3K |
| S3 (cold, 7-90 days) | ~$2K |
| ClickHouse Keeper (3x t3.small for consensus) | ~$2K |
| Engineering time (minimal ops) | ~$10K |
| Total | ~$35K |

ClickHouse Cloud:

| Component | Annual Cost |
|---|---|
| ClickHouse Cloud (auto-scaling) | ~$36-60K |
| S3 tiered storage | Included |
| Engineering time | ~$5K (fully managed) |
| Total | ~$41-65K |

| Scenario | Annual Cost | Saving vs Elasticsearch |
|---|---|---|
| Current (Elasticsearch) | $400K | — |
| ClickHouse Cloud | $41-65K | $335-359K saved (84-90%) |
| ClickHouse Self-Hosted | $35K | $365K saved (91%) |

Soft benefits:

- AI queries over logs — eliminates ad-hoc log digging by engineering (~20 hrs/month)
- Cross-data correlation — compliance/risk can correlate errors with affected users instantly
- Longer retention — at ClickHouse costs, retain 1 year vs 90 days for the same budget
- Unified cluster — observability + business data in one system, one operations team

## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Full-text search gaps | Low | Low | 95% of queries are structured; bloom filters cover keyword search |
| Data loss during migration | Very Low | High | 4-week dual-write window eliminates risk |
| Grafana learning curve | Medium | Low | Grafana is widely used; team familiarity is high |
| ClickHouse cluster instability | Low | High | 2-replica HA, Keeper for consensus, daily S3 backups |
| OTEL Collector overload | Low | Medium | Batch processor + memory limiter configured; scale collector horizontally |
| Schema changes in new service | Low | Low | MergeTree handles new columns gracefully; OTEL schema is stable |

One more, lower priority: cold data access latency on S3 (Medium likelihood, Low impact) — S3 queries are slower but
acceptable for historical lookups.

## Success Metrics

| Metric | Target |
|---|---|
| All dashboards migrated to Grafana | 100% |
| Elasticsearch subscription cancelled | Done |
| Log query latency (aggregation) | Faster than the Elasticsearch baseline |
| Storage footprint | 5x smaller footprint |
| Alert parity | All Kibana alerts recreated in Grafana |

| Metric | Target |
|---|---|
| Annual cost reduction | > $300K vs Elasticsearch baseline |
| AI log queries active | Teams using LibreChat for log analysis |
| Cross-data queries | At least 5 dashboards joining logs + business data |
| Retention extended | From 90 days to 180+ days (same cost) |
| Engineering time saved | 20+ hrs/month (no more ad-hoc log queries) |

## Reference Links

ClickHouse observability:

- ClickHouse Observability docs: https://clickhouse.com/docs/use-cases/observability
- Observability solution guide: https://clickhouse.com/docs/use-cases/observability/overview
- ClickHouse as Elasticsearch alternative: https://clickhouse.com/blog/elasticsearch-to-clickhouse-for-logs
- Building an Observability solution with ClickHouse: https://clickhouse.com/blog/storing-log-data-in-clickhouse-fluent-bit-vector-open-telemetry
- ClickHouse for logs blog: https://clickhouse.com/blog/using-clickhouse-for-log-analytics

OpenTelemetry:

- OTEL Collector ClickHouse exporter: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter
- OTEL Collector contrib repo: https://github.com/open-telemetry/opentelemetry-collector-contrib
- OTEL ClickHouse schema reference: https://clickhouse.com/docs/use-cases/observability/schema-design

Grafana:

- Grafana ClickHouse datasource: https://grafana.com/grafana/plugins/grafana-clickhouse-datasource
- Grafana ClickHouse plugin docs: https://grafana.com/docs/grafana/latest/datasources/clickhouse

Case studies and benchmarks:

- Cloudflare: ClickHouse for HTTP logs: https://blog.cloudflare.com/log-analytics-using-clickhouse
- ClickHouse vs Elasticsearch log analytics benchmark: https://clickhouse.com/blog/elasticsearch-log-analytics-clickhouse
- Benchmark source code (reproducible): https://github.com/ClickHouse/TextBench
- BindPlane — OTEL fleet management for migrations:
https://clickhouse.com/blog/bindplane-faster-otel-migrations-to-clickstack
- Langfuse + ClickHouse (LLM observability): https://clickhouse.com/blog/langfuse-llm-analytics

Component summary:

| Component | Technology | Purpose |
|---|---|---|
| Instrumentation | OTEL SDK (unchanged) | Auto-instrument apps — zero code changes |
| Collection | OTEL Collector + clickhouseexporter | Receives and ships logs/traces/metrics to ClickHouse |
| Storage | ClickHouse otel database | Logs, traces, metrics — compressed columnar storage |
| Visualization | Grafana + ClickHouse datasource | Dashboards, trace waterfall, log explorer, alerting |
| AI queries | LibreChat + Qwen + MCP (existing) | Plain-English queries over logs and traces |
| Cold storage | S3 (tiered via ClickHouse TTL) | Cheap long-term retention for historical data |
| HA | ClickHouse 2-replica cluster | Same HA setup as the existing ClickHouse cluster |

Kibana → Grafana feature mapping:

| Kibana Feature | Grafana Equivalent |
|---|---|
| Discover (log search) | Explore → Logs panel with ClickHouse query |
| Dashboard | Dashboard (same concept) |
| Visualize | Panel with ClickHouse SQL query |
| APM (traces) | Explore → Traces panel with ClickHouse datasource |
| Alerts | Grafana Alerting (same or better) |
| Index Lifecycle Management | ClickHouse TTL (simpler — one SQL line) |
| KQL (Kibana Query Language) | SQL (standard, more powerful) |
| Lens (drag-drop charts) | Grafana panel builder |

If you're considering this migration, the decisions that matter most:

- Don't skip the OTEL aggregator pattern. Agent-only loses data on ClickHouse blips. Run a couple of central aggregators with retry-on-failure — that's the production-grade choice.
- Use BindPlane if you have a large collector fleet. Worth it for fleet-wide config rollout. For a handful of central collectors, standard YAML is fine.
- Get the schema right the first time. ORDER BY (ServiceName, SeverityText, Timestamp) and the right CODECs are the difference between a 3× and a 30× compression ratio. The schema in Section 6 has been validated at production scale.
- Run dual-write for 4 weeks, not 1.
The gradual cutover is cheap insurance and lets you validate every dashboard and alert before cutting Elasticsearch off.

The AI layer pays for itself. Plain-English log queries via LibreChat + an LLM mean no more pinging engineering when the compliance team needs a one-off analysis. Once ClickHouse has the data, the AI integration is one config change.

The full schema, OTEL Collector configs, Grafana queries, migration plan, and 16 production-tested recovery runbooks live in the companion repo on GitHub. The repo also includes the AI DBA MCP server (152 tools for ClickHouse operations) — if you're operating at scale, that's worth a look.

If you've migrated off Elasticsearch (or are mid-migration), I'd love to compare notes. Reach out via LinkedIn or comment below.

If this is useful to your team, the deeper architectural piece — Building an Agentic AI Data Platform on ClickHouse — is coming next.
