blog

Re-Architecting a 100TB+ Data Pipeline on GCP

2024

Big DataClickHouseBigQueryGCPETLPostgreSQL

The Challenge

At Air360, our PostgreSQL databases were buckling under the volume of UX event data. We needed to process 100TB+ of data per month — click events, scroll events, page views, and session recordings — while maintaining sub-second query response times for real-time analytics dashboards.

Architecture Decisions

▹ClickHouse for real-time analytical queries — sub-second response on billions of rows, columnar storage optimized for aggregation
▹Google BigQuery for batch analytics, long-term storage, and ad-hoc BI queries
▹Parquet format for cold storage — reduced storage costs by 70% compared to JSON
▹Data freshness monitoring with alerts for pipeline lag exceeding 5 minutes
▹Dead-letter queues for graceful error handling and replay capability
▹Partitioning and clustering strategies for query performance optimization

yaml

# Data Pipeline Architecture
source:
  events: [clicks, scrolls, page_views, sessions]
  volume: "100TB+/month"
  format: JSON (ingestion) → Parquet (storage)

processing:
  real_time:
    engine: ClickHouse
    latency: "sub-second on billions of rows"
    use_case: "Live analytics dashboards"
  batch:
    engine: Google BigQuery
    use_case: "Ad-hoc BI, long-term trends"

reliability:
  freshness_sla: "<5 minutes lag"
  error_handling: "Dead-letter queues + replay"
  storage_savings: "−70% (JSON → Parquet)"
  uptime: "99.95%"

Key Takeaways

The migration was not just a tech change — it was an organizational shift. We moved from "query our database" to "design our data plane." Teams had to think in terms of data freshness SLAs, schema evolution, and storage tiers. The operational complexity increased, but so did our capacity by 100x.

← All entries Home →