About Datalake Foundation

The foundation layer for your data lakehouse — bridging bronze storage and curated silver tables.

What It Is

Datalake Foundation is a Scala library that provides the foundation layer in a data lakehouse architecture. It bridges raw "bronze" storage and curated "silver" layers by standardizing and automating data ingestion and transformation.

Built for Apache Spark 4.0 and Delta Lake 4.0, Datalake Foundation runs natively on Databricks (Runtime 17.3+) and Microsoft Fabric (Runtime 2.0), eliminating the repetitive, manual code that data engineering teams typically write for each new data source. It integrates seamlessly with Databricks workflows, Unity Catalog, OneLake, and the broader lakehouse platform ecosystem.

Instead of writing custom Spark jobs per entity, you define your entities, columns, and processing strategies in metadata. Datalake Foundation handles the rest — transformations, key generation, change detection, and writes.
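For illustration only, here is a minimal sketch of what entity metadata could look like when modeled as plain Scala values. The case classes, field names, and the sample "customer" entity below are hypothetical and are not Datalake Foundation's actual configuration schema:

    // Hypothetical metadata model, for illustration only (not the library's actual API).
    object MetadataSketch {
      final case class ColumnDef(name: String, dataType: String, isBusinessKey: Boolean = false)

      sealed trait ProcessStrategy
      case object FullLoad extends ProcessStrategy
      case object Merge    extends ProcessStrategy
      case object Historic extends ProcessStrategy // SCD Type 2

      final case class EntityDef(
        name: String,
        connection: String,
        columns: Seq[ColumnDef],
        strategy: ProcessStrategy
      )

      // One metadata record per source entity replaces a hand-written Spark job.
      val customer = EntityDef(
        name = "customer",
        connection = "sales_db",
        columns = Seq(
          ColumnDef("customer_id", "int", isBusinessKey = true),
          ColumnDef("name", "string"),
          ColumnDef("modified_at", "timestamp")
        ),
        strategy = Merge
      )
    }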

Architecture

Four cleanly separated layers, each with a clear responsibility.

Metadata Layer

Configuration and entity management. Supports JSON files, SQL Server function-based config (cfg.fnGetFoundationConfig()), folder-based configs, and programmatic string-based definitions. Integrates with Data Factory for orchestration-ready entity group serialization. Provides a settings hierarchy: environment > connection > entity.
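As a rough illustration of that settings hierarchy, the snippet below shows the most specific level winning. The keys, values, and plain maps are invented for the example and are not the library's actual data structures:

    // Environment > connection > entity: the most specific setting wins.
    // Keys and values are made up for illustration.
    val environmentSettings = Map("timezone" -> "UTC", "targetContainer" -> "silver")
    val connectionSettings  = Map("targetContainer" -> "silver/sales")
    val entitySettings      = Map("partitionBy" -> "load_date")

    // Later maps override earlier ones, so entity-level settings take precedence.
    val effectiveSettings = environmentSettings ++ connectionSettings ++ entitySettings
    // effectiveSettings("targetContainer") == "silver/sales"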

Processing Layer

Strategy pattern with three implementations: Full Load (complete overwrite with partition pruning), Merge (incremental upsert via Delta MERGE), and Historic (SCD Type 2 with temporal tracking).
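For context, the Merge strategy's incremental upsert corresponds to a standard Delta Lake MERGE. The sketch below uses only the public Delta Lake Scala API; the target path, aliases, and the PrimaryKey join column are chosen for illustration rather than taken from the library:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Minimal upsert sketch using the public Delta Lake API; not the library's own code.
    def upsertSlice(spark: SparkSession, slice: DataFrame, targetPath: String): Unit = {
      DeltaTable.forPath(spark, targetPath).as("t")
        .merge(slice.as("s"), "t.PrimaryKey = s.PrimaryKey")
        .whenMatched().updateAll()    // key already present: update the existing row
        .whenNotMatched().insertAll() // new key: insert the row
        .execute()
    }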

Core Layer

Utilities and base classes — schema comparison, column operations, implicit DataFrame extensions, data type definitions, and an expression evaluation engine.
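The implicit-extension pattern looks roughly like the following; the normalizeColumnNames helper is a made-up example of the kind of operation such extensions provide, not a documented method of the library:

    import org.apache.spark.sql.DataFrame

    // Sketch of the implicit DataFrame extension pattern; the helper itself is hypothetical.
    object DataFrameExtensions {
      implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
        // Lower-case and snake_case all column names.
        def normalizeColumnNames: DataFrame =
          df.columns.foldLeft(df) { (acc, name) =>
            acc.withColumnRenamed(name, name.trim.toLowerCase.replaceAll("\\W+", "_"))
          }
      }
    }

    // Usage: import DataFrameExtensions._ and then call rawDf.normalizeColumnNames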

Logging Layer

Structured logging with Delta and Parquet appenders, async support to prevent blocking, run ID tracking, and detailed audit trails for production monitoring. Processing summaries with row-level metrics (inserted, updated, deleted, touched counts) per merge operation.
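Those per-merge row counts correspond to the operation metrics Delta Lake itself records in the table history. The sketch below reads them with the public Delta API (the table path is a placeholder) and is independent of the library's own appenders:

    import io.delta.tables.DeltaTable
    import org.apache.spark.sql.SparkSession

    // Read the operation metrics of the most recent commit from Delta's table history.
    def lastOperationMetrics(spark: SparkSession, targetPath: String): Map[String, String] =
      DeltaTable.forPath(spark, targetPath)
        .history(1)                        // most recent commit only
        .select("operationMetrics")
        .head()
        .getMap[String, String](0)
        .toMap                             // e.g. numTargetRowsInserted / Updated / Deleted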


The Transformation Pipeline

Every data slice passes through a 12-step pipeline, fully automated; a few of the steps are sketched in Spark terms after the list.

1. Inject Transformations
2. Add Calculated Columns
3. Compute Source Hash
4. Add Temporal Columns
5. Track Source Filename
6. Generate Primary Key
7. Cast Column Types
8. Rename Columns
9. Add Deleted Flag
10. Track LastSeen
11. Normalize Names
12. Cache DataFrame
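The sketch below shows how several of these steps can be expressed as ordinary Spark column operations. The output column names (SourceHash, ValidFrom, SourceFile, PrimaryKey, IsDeleted, LastSeen) are illustrative choices, not necessarily the names the library writes:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Steps 3-6, 9, 10 and 12 expressed as plain Spark transformations (illustrative names).
    def applyCoreSteps(slice: DataFrame, businessKeys: Seq[String]): DataFrame =
      slice
        .withColumn("SourceHash", sha2(concat_ws("||", slice.columns.map(col): _*), 256)) // 3: change-detection hash
        .withColumn("ValidFrom", current_timestamp())                                     // 4: temporal column
        .withColumn("SourceFile", input_file_name())                                      // 5: source filename
        .withColumn("PrimaryKey", sha2(concat_ws("||", businessKeys.map(col): _*), 256))  // 6: key from business columns
        .withColumn("IsDeleted", lit(false))                                              // 9: deleted flag
        .withColumn("LastSeen", current_timestamp())                                      // 10: last-seen timestamp
        .cache()                                                                          // 12: cache for downstream writes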

Open Source

Datalake Foundation is released under the GPL-3.0. The source code is publicly available, and contributions are welcome.

Built and maintained by Rucal, the library reflects years of experience building production data pipelines for enterprise clients.

Organizations that need to use Datalake Foundation in proprietary software without copyleft obligations can obtain a commercial license from Rucal.

Want to Contribute?

Check out the source code, open an issue, or submit a pull request.