blog
Re-Architecting a 100TB+ Data Pipeline on GCP
2024
Big DataClickHouseBigQueryGCPETLPostgreSQL
The Challenge
At Air360, our PostgreSQL databases were buckling under the volume of UX event data. We needed to process 100TB+ of data per month — click events, scroll events, page views, and session recordings — while maintaining sub-second query response times for real-time analytics dashboards.
Architecture Decisions
- ▹ClickHouse for real-time analytical queries — sub-second response on billions of rows, columnar storage optimized for aggregation
- ▹Google BigQuery for batch analytics, long-term storage, and ad-hoc BI queries
- ▹Parquet format for cold storage — reduced storage costs by 70% compared to JSON
- ▹Data freshness monitoring with alerts for pipeline lag exceeding 5 minutes
- ▹Dead-letter queues for graceful error handling and replay capability
- ▹Partitioning and clustering strategies for query performance optimization
yaml
# Data Pipeline Architecture
source:
events: [clicks, scrolls, page_views, sessions]
volume: "100TB+/month"
format: JSON (ingestion) → Parquet (storage)
processing:
real_time:
engine: ClickHouse
latency: "sub-second on billions of rows"
use_case: "Live analytics dashboards"
batch:
engine: Google BigQuery
use_case: "Ad-hoc BI, long-term trends"
reliability:
freshness_sla: "<5 minutes lag"
error_handling: "Dead-letter queues + replay"
storage_savings: "−70% (JSON → Parquet)"
uptime: "99.95%"Key Takeaways
The migration was not just a tech change — it was an organizational shift. We moved from "query our database" to "design our data plane." Teams had to think in terms of data freshness SLAs, schema evolution, and storage tiers. The operational complexity increased, but so did our capacity by 100x.