Open Source Lakehouse Platform
Written by Administrator   
Sunday, 14 June 2026 06:59

AI-assisted deployment of a fully open-source lakehouse platform on Scaleway. Pick only the components you need: the AI provisions infrastructure, applies configuration, sets up integrations between services, and validates the final deployment. Components can be added or removed later without rebuilding the stack.
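As a rough illustration of how a chosen subset of components might resolve to an install order, the sketch below uses a hypothetical dependency map inferred from the component descriptions on this page (for example, Airflow needing PostgreSQL for its metadata). The map, the component keys, and the `plan_install` helper are assumptions for illustration, not the platform's actual provisioning logic.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map, inferred from the component descriptions
# on this page; the real provisioning logic may differ.
DEPENDS_ON = {
    "kubernetes": set(),
    "object-storage": set(),
    "postgresql": {"kubernetes"},
    "kafka": {"kubernetes"},
    "spark": {"kubernetes", "object-storage"},
    "trino": {"kubernetes", "object-storage", "postgresql"},
    "clickhouse": {"kubernetes"},
    "airflow": {"kubernetes", "postgresql"},
    "openmetadata": {"kubernetes", "postgresql"},
    "superset": {"kubernetes", "trino"},
}


def plan_install(selected):
    """Return the selected components plus their transitive dependencies,
    ordered so that every dependency comes before its dependents."""
    needed = set()
    stack = list(selected)
    while stack:
        name = stack.pop()
        if name not in DEPENDS_ON:
            raise KeyError(f"unknown component: {name}")
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDS_ON[name])
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: DEPENDS_ON[name] for name in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # Selecting only Superset pulls in what it (hypothetically) relies on.
    print(plan_install(["superset"]))
```

Removing a component later would be the reverse check: it can go only if no remaining component lists it as a dependency.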

Scaleway Object Storage (S3)

S3-compatible object storage for the raw (bronze), silver, and gold data layers. Single source of truth for the lakehouse warehouse, accessed by Spark and Trino through s3a:// URIs and by ClickHouse through its S3 integration.
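For Spark, reaching Scaleway Object Storage through s3a:// usually comes down to a handful of Hadoop S3A settings. The sketch below builds them as a plain dictionary and derives per-layer paths. The bucket name `lakehouse`, the `fr-par` region default, and both helper names are placeholders chosen for illustration.

```python
LAYERS = ("raw", "silver", "gold")


def s3a_spark_conf(access_key, secret_key, region="fr-par"):
    """Build Spark settings for the Hadoop S3A connector.

    The endpoint follows Scaleway's s3.<region>.scw.cloud pattern;
    the default region is an assumption.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": f"https://s3.{region}.scw.cloud",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing avoids relying on per-bucket DNS names.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def layer_uri(layer, table, bucket="lakehouse"):
    """Return the s3a:// location of a table inside one data layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"s3a://{bucket}/{layer}/{table}"


if __name__ == "__main__":
    print(layer_uri("silver", "orders"))
```

In practice the credentials would come from a Kubernetes secret or environment variables rather than being passed around as literals.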

Delta Lake

Open-source storage layer on top of S3 providing ACID transactions, schema enforcement, time travel, and unified batch/streaming reads. The foundation of the bronze, silver, gold medallion architecture.
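Time travel follows from how the table is stored: each commit appends an entry to an ordered transaction log, and a reader reconstructs any version by replaying the log up to that point. The toy model below mimics that idea in memory. It is a simplification for illustration, with a hypothetical `ToyTableLog` class, and not the storage layer's actual log format or API.

```python
class ToyTableLog:
    """In-memory stand-in for an ordered commit log of add/remove actions."""

    def __init__(self):
        self._commits = []  # commit N is stored at index N

    def commit(self, add=(), remove=()):
        """Append one atomic commit and return its version number."""
        self._commits.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self._commits) - 1

    def snapshot(self, version=None):
        """Return the set of live data files as of `version` (latest if None)."""
        if version is None:
            version = len(self._commits) - 1
        if not 0 <= version < len(self._commits):
            raise ValueError(f"no such version: {version}")
        files = set()
        # Replay every commit up to and including the requested version.
        for entry in self._commits[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files


if __name__ == "__main__":
    log = ToyTableLog()
    log.commit(add=["part-0.parquet"])
    v1 = log.commit(add=["part-1.parquet"])
    # A rewrite that replaces two small files with one larger file.
    log.commit(add=["part-2.parquet"], remove=["part-0.parquet", "part-1.parquet"])
    print(sorted(log.snapshot(v1)))  # files visible at version 1
    print(sorted(log.snapshot()))    # files visible at the latest version
```

Because old files are only marked as removed in the log rather than deleted at commit time, earlier versions remain readable until they are cleaned up.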

Apache Kafka

Durable event log and streaming backbone. Captures change-data-capture streams from source systems, decouples producers from consumers, and feeds real-time pipelines into Spark, ClickHouse, and downstream APIs.
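To make the change-data-capture flow concrete, the sketch below replays a small stream of change events into a keyed table state, which is roughly what a downstream consumer does when it materializes a CDC topic. The event shape (`op`, `key`, `row`) and the `apply_changes` helper are hypothetical and not tied to any particular connector.

```python
def apply_changes(events, state=None):
    """Replay ordered change events and return the resulting table state.

    Each event is a dict with an "op" ("insert", "update", or "delete"),
    a primary "key", and, for inserts and updates, the full "row".
    """
    state = dict(state or {})
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            state[key] = event["row"]
        elif op == "delete":
            state.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {op}")
    return state


if __name__ == "__main__":
    events = [
        {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
        {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
        {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
        {"op": "delete", "key": 2},
    ]
    print(apply_changes(events))
```

Since the log is durable and ordered, a new consumer can replay the same events from the beginning and arrive at the same state, which is what decouples producers from consumers.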

Apache Spark

Distributed processing engine for ETL/ELT pipelines, Delta Lake writes, and large-scale data transformations. Deployed on Kubernetes via the Spark Operator.

OpenMetadata

Centralized metadata catalog and data lineage platform. Tracks schemas, ownership, quality metrics, and dependencies across all components.

ClickHouse

Real-time analytical columnar database for sub-second queries on billions of rows. Ideal for dashboards and ad-hoc OLAP workloads.

Apache Airflow

Workflow orchestration and scheduling. DAG-based pipeline definitions for batch and incremental data processing across the platform.

Kubernetes

Container orchestration platform underlying every other component. All services run as deployments on a single Scaleway-managed Kubernetes cluster.

Apache Superset

Self-service business intelligence and dashboarding. Connects directly to Trino and ClickHouse for interactive exploration of the lakehouse.

Trino

Distributed SQL query engine that federates queries across Delta Lake (on S3), PostgreSQL, and other catalogs, so every source can be queried with one SQL dialect.
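Federation rests on three-part names of the form catalog.schema.table, so a single statement can join tables that live in different systems. The sketch below renders such a query as a string. The catalog names `delta` and `postgresql`, the schemas, tables, and columns are all hypothetical, and the helpers only build SQL text without executing it.

```python
def qualified(catalog, schema, table):
    """Render a quoted three-part name: "catalog"."schema"."table"."""
    parts = (catalog, schema, table)
    # Double any embedded double quote so the identifier stays valid.
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def federated_join_sql():
    """Join a lakehouse table with a relational table in one statement."""
    orders = qualified("delta", "gold", "orders")
    customers = qualified("postgresql", "public", "customers")
    return (
        "SELECT c.name, sum(o.amount) AS total\n"
        f"FROM {orders} AS o\n"
        f"JOIN {customers} AS c ON c.id = o.customer_id\n"
        "GROUP BY c.name"
    )


if __name__ == "__main__":
    print(federated_join_sql())
```

The resulting text could be submitted through any SQL client that speaks to the query engine; which catalogs exist depends on how the deployment is configured.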

PostgreSQL

Relational store for the Hive Metastore, Airflow metadata, the OpenMetadata catalog, and other transactional workloads.

Last Updated on Wednesday, 17 June 2026 07:46