Metadata-Driven Data Ingestion for Apache Spark
Datalake Foundation standardizes and automates your bronze-to-silver data pipelines on Databricks, with Microsoft Fabric support coming soon. Configure once, process everything.
The Missing Layer in Your Data Lakehouse
Extraction tools get data into your lake. Analytics tools model it for dashboards. But who handles the messy middle — transforming raw bronze data into clean, reliable silver tables?
| Stage | Tooling | Role |
|---|---|---|
| Source → Bronze | Data Factory / Fivetran | Extraction & loading |
| Bronze → Silver | Datalake Foundation | Ingestion & transformation |
| Silver → Gold | dbt / Synapse | Analytics modeling |
Six Lines. That's All It Takes.
Your pipeline handles transformations, keys, hashing, and writes automatically — no repetitive coding required.
```scala
val settings = new JsonMetadataSettings()
settings.initialize("/path/to/metadata.json")
val metadata = new Metadata(settings)
val entity = metadata.getEntity(42)
val processing = new Processing(entity, "2025-07-01-slice.parquet")
processing.Process()
```

Load metadata, select an entity, process a data slice — Datalake Foundation handles the rest.
Add to Your Project
Available on Maven Central. One line to add it to your build.
SBT (Scala):

```scala
libraryDependencies += "nl.rucal" % "datalakefoundation_2.13" % "1.6.2"
```
Maven (pom.xml):

```xml
<dependency>
  <groupId>nl.rucal</groupId>
  <artifactId>datalakefoundation_2.13</artifactId>
  <version>1.6.2</version>
</dependency>
```
View on Maven Central
Core Features
Processing Strategies
Full Load, Merge (upsert), and Historic (SCD Type 2) — choose the right strategy per entity, all configured via metadata.
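The three strategies can be pictured as a simple sum type selected per entity from metadata. This is an illustrative plain-Scala sketch; the type and function names are hypothetical, not Datalake Foundation's actual API:

```scala
// Illustrative model of the three processing strategies
// (hypothetical types, not the library's own).
sealed trait ProcessingStrategy
case object FullLoad extends ProcessingStrategy // overwrite the target each run
case object Merge    extends ProcessingStrategy // upsert on the primary key
case object Historic extends ProcessingStrategy // SCD Type 2: keep every version

// A hypothetical per-entity lookup, as the metadata layer might resolve it.
def strategyFor(name: String): ProcessingStrategy = name.trim.toLowerCase match {
  case "merge"    => Merge
  case "historic" => Historic
  case _          => FullLoad // default to a full overwrite
}
```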
Metadata-Driven
Define pipelines in JSON, SQL Server, or folder-based configs. No manual coding — just configuration.
Enterprise Ready
Watermarking, delete inference, schema drift detection, structured logging, merge metrics, and Data Factory orchestration built in.
How Datalake Foundation Compares
The only dedicated, metadata-driven bronze-to-silver library for Apache Spark.
| Feature | Datalake Foundation | Databricks DLT | Custom-Built Spark | dbt |
|---|---|---|---|---|
| Bronze-to-silver focus | ✓ | Partial | Manual | ✗ |
| SCD Type 2 built-in | ✓ | ✗ | Manual | ✓ |
| Metadata-driven config | ✓ | YAML | ✗ | YAML/SQL |
| No repetitive code needed | ✓ | ✓ | ✗ | ✓ |
| Watermark incremental | ✓ | ✓ | Manual | Manual |
| Delete inference | ✓ | ✗ | Manual | ✗ |
| Schema drift detection | ✓ | Partial | Manual | Partial |
| Open source | ✓ | Partial | N/A | ✓ |
| Spark 4.0 native | ✓ | ✗ | Varies | ✗ |
| Merge metrics & observability | ✓ | Partial | Manual | ✗ |
| Data Factory orchestration | ✓ | ✗ | Manual | ✗ |
| Microsoft Fabric support | ✓ | N/A | N/A | Partial |
Databricks DLT is powerful for declarative pipelines but doesn't offer built-in SCD Type 2, metadata-repository management, or delete inference. Custom-built Spark jobs give full control but require significant engineering effort per entity. dbt excels at silver-to-gold analytics modeling but doesn't operate at the ingestion layer.
Key Capabilities
Automatic Transformations
Calculated columns, type casting, column renaming, primary key generation, and source hashing.
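Source hashing boils down to a stable digest over a row's column values, so changed rows can be detected by comparing hashes. A minimal plain-Scala sketch; the hash algorithm and separator here are assumptions, not the library's documented choices:

```scala
import java.security.MessageDigest

// Deterministic SHA-256 hash over a row's column values, joined with a
// separator so ("ab","c") and ("a","bc") hash differently.
// (Algorithm and separator are illustrative, not the library's exact ones.)
def sourceHash(values: Seq[String]): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  val bytes  = digest.digest(values.mkString("\u0001").getBytes("UTF-8"))
  bytes.map(b => f"${b & 0xff}%02x").mkString
}
```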
Incremental Processing
Multi-column watermarks with AND/OR grouping, dynamic expressions, partition-aware filtering, and programmatic watermark reset.
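Multi-column watermarks effectively compile down to a SQL predicate pushed into the source read. A plain-Scala sketch of that idea; the case class and function are illustrative, not the library's API:

```scala
// A hypothetical watermark definition: column, comparison, last-seen value.
case class Watermark(column: String, op: String, lastValue: String)

// Combine watermarks into one SQL predicate, with configurable AND/OR
// grouping between them (illustrative, not the library's API).
def watermarkPredicate(marks: Seq[Watermark], joinWith: String = "AND"): String =
  marks
    .map(w => s"${w.column} ${w.op} '${w.lastValue}'")
    .mkString(s" $joinWith ")
```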
SCD Type 2 History
Full version history with ValidFrom/ValidTo/IsCurrent temporal tracking.
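The ValidFrom/ValidTo/IsCurrent bookkeeping can be sketched on plain case classes: when a key's value changes, the current version is closed and a new open-ended version is appended. This illustrates what the Historic strategy does conceptually, not the library's implementation:

```scala
import java.time.LocalDate

// One historical version of a record, mirroring the
// ValidFrom/ValidTo/IsCurrent convention described above.
case class Version(key: Int, value: String,
                   validFrom: LocalDate, validTo: Option[LocalDate],
                   isCurrent: Boolean)

// Apply a changed value for `key`: close the current version and
// append a new open-ended one (the core of SCD Type 2).
def applyChange(history: Seq[Version], key: Int, newValue: String,
                asOf: LocalDate): Seq[Version] = {
  val closed = history.map {
    case v if v.key == key && v.isCurrent =>
      v.copy(validTo = Some(asOf), isCurrent = false)
    case v => v
  }
  closed :+ Version(key, newValue, asOf, None, isCurrent = true)
}
```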
Flexible Output
Write to file paths (ADLS, S3, OneLake) or Unity Catalog tables with variable interpolation and per-entity mixed output mode.
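Variable interpolation in output paths amounts to substituting per-entity tokens into a configured template. A minimal sketch, where the `${name}` token syntax and the variable names are assumptions for illustration:

```scala
// Replace ${name} tokens in an output-path template with per-entity values
// (token syntax here is illustrative, not necessarily the library's exact one).
def interpolatePath(template: String, vars: Map[String, String]): String =
  vars.foldLeft(template) { case (path, (k, v)) =>
    path.replace("${" + k + "}", v)
  }
```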
Built for Databricks & Microsoft Fabric
Datalake Foundation runs natively on Databricks and Microsoft Fabric Runtime 2.0, integrating with the tools your team already uses.
Databricks Runtime 17.3+
Tested and optimized for the latest Databricks Runtime with Spark 4.0 and Delta Lake 4.0 support.
Unity Catalog
Write directly to Unity Catalog managed tables or external locations on ADLS and S3.
Databricks Workflows
Orchestrate Datalake Foundation jobs with Databricks Workflows, notebooks, or any Spark job scheduler.
Microsoft Fabric
Run on Fabric Runtime 2.0 with OneLake path-based operations. The Full Load strategy works today; Merge and Historic are ready for when Runtime 2.0 adds MERGE support.
From Silver to Insight
Clean silver tables power your entire analytics stack.
Synapse Analytics
Expose your silver and gold layers as SQL databases via Synapse Serverless SQL. Query Delta tables directly without moving data.
Power BI
Connect Power BI to your curated silver tables — via Synapse SQL endpoints or direct Delta Lake connectivity.
Internal Applications
Enriched data is available to any application via SQL connections, APIs, or direct file access on ADLS.
Need Help Implementing?
Rucal provides consulting, commercial licensing, and enterprise support for Datalake Foundation.