Metadata-Driven Data Ingestion for Apache Spark

Datalake Foundation standardizes and automates your bronze-to-silver data pipelines on Databricks, with Microsoft Fabric support coming soon. Configure once, process everything.

v1.6.2 — Scala 2.13 · Spark 4.0 · Delta Lake 4.0

The Missing Layer in Your Data Lakehouse

Extraction tools get data into your lake. Analytics tools model it for dashboards. But who handles the messy middle — transforming raw bronze data into clean, reliable silver tables?

Source → Bronze: Data Factory / Fivetran (extraction & loading)

Bronze → Silver: Datalake Foundation (ingestion & transformation)

Silver → Gold: dbt / Synapse (analytics modeling)

Six Lines. That's All It Takes.

Your pipeline handles transformations, keys, hashing, and writes automatically — no repetitive coding required.

scala
val settings = new JsonMetadataSettings()
settings.initialize("/path/to/metadata.json")
val metadata = new Metadata(settings)

val entity = metadata.getEntity(42)
val processing = new Processing(entity, "2025-07-01-slice.parquet")
processing.Process()

Load metadata, select an entity, process a data slice — Datalake Foundation handles the rest.

Add to Your Project

Available on Maven Central. Add a single dependency to your build.

SBT (Scala)

sbt
libraryDependencies += "nl.rucal" % "datalakefoundation_2.13" % "1.6.2"

Maven (pom.xml)

xml
<dependency>
  <groupId>nl.rucal</groupId>
  <artifactId>datalakefoundation_2.13</artifactId>
  <version>1.6.2</version>
</dependency>

Core Features

Processing Strategies

Full Load, Merge (upsert), and Historic (SCD Type 2) — choose the right strategy per entity, all configured via metadata.
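
To make the difference between the first two strategies concrete, here is a minimal sketch in plain Scala, with an in-memory map standing in for a silver table. `Customer`, `StrategySketch`, `fullLoad`, and `merge` are hypothetical names chosen for this illustration; they are not part of the Datalake Foundation API, and the library's actual behavior on Delta tables may differ in detail.

```scala
// Illustrative only: plain Scala collections stand in for Delta tables.
// Customer, fullLoad and merge are hypothetical names, not library API.
final case class Customer(id: Int, name: String)

object StrategySketch {
  type Table = Map[Int, Customer] // silver table keyed by primary key

  // Full Load: the incoming slice replaces the target entirely,
  // so the previous contents of `target` are ignored.
  def fullLoad(target: Table, slice: Seq[Customer]): Table =
    slice.map(r => r.id -> r).toMap

  // Merge (upsert): matching keys are updated, new keys are inserted,
  // and rows absent from the slice are left untouched.
  def merge(target: Table, slice: Seq[Customer]): Table =
    target ++ slice.map(r => r.id -> r)
}
```

With a target holding ids 1 and 2 and a slice holding ids 2 and 3, `fullLoad` keeps only 2 and 3, while `merge` keeps 1, updates 2, and adds 3.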

Metadata-Driven

Define pipelines in JSON, SQL Server, or folder-based configs. No manual coding — just configuration.

Enterprise Ready

Watermarking, delete inference, schema drift detection, structured logging, merge metrics, and Data Factory orchestration built in.
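
As an illustration of what schema drift detection involves, the sketch below compares an expected schema with the schema of an incoming slice. It assumes schemas can be reduced to column-name to type-name maps; `SchemaDrift` and `detectDrift` are hypothetical names for this example and say nothing about how the library implements the check.

```scala
// Illustrative only: schemas are modelled as column-name -> type-name maps.
// SchemaDrift and detectDrift are hypothetical names, not library API.
final case class SchemaDrift(
    added: Set[String],   // columns present in the slice but not expected
    removed: Set[String], // expected columns missing from the slice
    retyped: Set[String]  // columns present in both whose type changed
) {
  def isEmpty: Boolean = added.isEmpty && removed.isEmpty && retyped.isEmpty
}

object SchemaDriftSketch {
  def detectDrift(
      expected: Map[String, String],
      incoming: Map[String, String]
  ): SchemaDrift =
    SchemaDrift(
      added = incoming.keySet -- expected.keySet,
      removed = expected.keySet -- incoming.keySet,
      retyped = expected.keySet
        .intersect(incoming.keySet)
        .filter(c => expected(c) != incoming(c))
    )
}
```

A pipeline could log a non-empty result, or stop processing, depending on how strictly drift should be handled.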

How Datalake Foundation Compares

The only dedicated, metadata-driven bronze-to-silver library for Apache Spark.

Feature | Datalake Foundation | Databricks DLT | Custom-Built Spark | dbt
Bronze-to-silver focus Partial Manual
SCD Type 2 built-in Manual
Metadata-driven config YAML YAML/SQL
No repetitive code needed
Watermark incremental Manual Manual
Delete inference Manual
Schema drift detection Partial Manual Partial
Open source Partial N/A
Spark 4.0 native Varies
Merge metrics & observability Partial Manual
Data Factory orchestration Manual
Microsoft Fabric support N/A N/A Partial

Databricks DLT is powerful for declarative pipelines but doesn't offer built-in SCD Type 2, metadata-repository management, or delete inference. Custom-built Spark jobs give full control but require significant engineering effort per entity. dbt excels at silver-to-gold analytics modeling but doesn't operate at the ingestion layer.

Key Capabilities

Automatic Transformations

Calculated columns, type casting, column renaming, primary key generation, and source hashing.
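
Primary key generation and source hashing can be pictured as two fingerprints per row: one over the business-key columns, one over everything else. The sketch below is a plain-Scala illustration under stated assumptions: SHA-256 as the hash function, a unit-separator character between values, and a source hash that covers only non-key columns. None of these choices are confirmed by the library, and `HashingSketch`, `primaryKey`, and `sourceHash` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Illustrative only: hypothetical names; SHA-256 is an assumption.
object HashingSketch {
  // Hex-encoded SHA-256 of the values joined by a unit separator, so that
  // ("ab", "c") and ("a", "bc") do not collapse to the same input.
  private def sha256(values: Seq[String]): String = {
    val joined = values.mkString("\u001f")
    MessageDigest
      .getInstance("SHA-256")
      .digest(joined.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b))
      .mkString
  }

  // Key hash: identifies the record, built from the business-key columns.
  // Throws NoSuchElementException if a key column is missing from the row.
  def primaryKey(row: Map[String, String], keyColumns: Seq[String]): String =
    sha256(keyColumns.map(row))

  // Source hash: fingerprints the non-key columns in sorted column order,
  // so the result does not depend on map iteration order.
  def sourceHash(row: Map[String, String], keyColumns: Seq[String]): String =
    sha256((row.keySet -- keyColumns).toSeq.sorted.map(row))
}
```

Comparing the source hash of an incoming row with the stored one is a cheap way to tell whether a record with a known key has actually changed.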

Incremental Processing

Multi-column watermarks with AND/OR grouping, dynamic expressions, partition-aware filtering, and programmatic watermark reset.
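
One way to think about multi-column watermarks with AND/OR grouping is as a small condition tree evaluated against each row. The sketch below models that idea in plain Scala with numeric watermark values. `Condition`, `NewerThan`, `All`, `AnyOf`, and `WatermarkSketch` are hypothetical names for this illustration; the library's configuration format and evaluation rules are not shown here.

```scala
// Illustrative only: hypothetical types, not library API.
sealed trait Condition
final case class NewerThan(column: String, lastSeen: Long) extends Condition
final case class All(parts: Seq[Condition]) extends Condition   // AND group
final case class AnyOf(parts: Seq[Condition]) extends Condition // OR group

object WatermarkSketch {
  // True when the row should be picked up by the incremental run.
  // A row that lacks the watermark column never matches that condition.
  def matches(row: Map[String, Long], cond: Condition): Boolean = cond match {
    case NewerThan(col, last) => row.get(col).exists(_ > last)
    case All(parts)           => parts.forall(matches(row, _))
    case AnyOf(parts)         => parts.exists(matches(row, _))
  }

  // New watermark for a column: the highest value seen in the processed
  // rows, or the previous value if no row carried that column.
  def advance(rows: Seq[Map[String, Long]], column: String, previous: Long): Long =
    rows.flatMap(_.get(column)).foldLeft(previous)((a, b) => a max b)
}
```

For example, `AnyOf(Seq(NewerThan("modified", 100L), All(Seq(NewerThan("created", 50L), NewerThan("version", 1L)))))` selects rows that were modified after 100, or that were both created after 50 and have a version above 1.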

SCD Type 2 History

Full version history with ValidFrom/ValidTo/IsCurrent temporal tracking.
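
The ValidFrom/ValidTo/IsCurrent bookkeeping can be sketched as follows. This is a generic SCD Type 2 illustration in plain Scala, not the library's implementation: `Version`, `Scd2Sketch`, and `applyChange` are hypothetical names, and an open ValidTo is modelled as `None`, which is an assumption about representation.

```scala
import java.time.LocalDate

// Illustrative only: hypothetical names; the field names mirror the
// ValidFrom / ValidTo / IsCurrent columns mentioned in the text.
final case class Version(
    key: Int,
    value: String,
    validFrom: LocalDate,
    validTo: Option[LocalDate], // None while the version is still open
    isCurrent: Boolean
)

object Scd2Sketch {
  // Applies one incoming record to the full history table.
  def applyChange(
      history: Seq[Version],
      key: Int,
      value: String,
      asOf: LocalDate
  ): Seq[Version] =
    history.find(v => v.key == key && v.isCurrent) match {
      // Unchanged value: the history stays as it is.
      case Some(cur) if cur.value == value => history
      // Changed value: close the current version and open a new one.
      case Some(cur) =>
        history.map { v =>
          if (v == cur) v.copy(validTo = Some(asOf), isCurrent = false) else v
        } :+ Version(key, value, asOf, None, isCurrent = true)
      // Unknown key: open the first version.
      case None =>
        history :+ Version(key, value, asOf, None, isCurrent = true)
    }
}
```

Each key therefore has at most one current version, and every earlier version keeps the date range in which it was valid.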

Flexible Output

Write to file paths (ADLS, S3, OneLake) or Unity Catalog tables with variable interpolation and per-entity mixed output mode.
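
Variable interpolation in output locations amounts to filling placeholders in a template with per-entity values. The sketch below shows the general idea in plain Scala. The `${name}` placeholder syntax and the variable names are assumptions made for this example, and `PathTemplateSketch` and `interpolate` are hypothetical; the library's own syntax may differ.

```scala
import scala.util.matching.Regex

// Illustrative only: the ${name} syntax is an assumption, not library API.
object PathTemplateSketch {
  private val Placeholder: Regex = """\$\{(\w+)\}""".r

  // Replaces each ${name} with its value from `vars`.
  // Placeholders without a matching variable are left untouched.
  def interpolate(template: String, vars: Map[String, String]): String =
    Placeholder.replaceAllIn(
      template,
      m => Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched))
    )
}
```

The same function covers both output styles mentioned above: a template can resolve to a storage path or to a three-part table name.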

Built for Databricks & Microsoft Fabric

Datalake Foundation runs natively on Databricks and, currently for the Full Load strategy, on Microsoft Fabric Runtime 2.0, integrating with the tools your team already uses.

Databricks Runtime 17.3+

Tested and optimized for Databricks Runtime 17.3 and later, with Spark 4.0 and Delta Lake 4.0 support.

Unity Catalog

Write directly to Unity Catalog managed tables or external locations on ADLS and S3.

Databricks Workflows

Orchestrate Datalake Foundation jobs with Databricks Workflows, notebooks, or any Spark job scheduler.

Microsoft Fabric

Run on Fabric Runtime 2.0 with OneLake path-based operations. The Full Load strategy works today; the Merge and Historic strategies will work once Runtime 2.0 adds MERGE support.

From Silver to Insight

Clean silver tables power your entire analytics stack.

Synapse Analytics

Expose your silver and gold layers as SQL databases via Synapse Serverless SQL. Query Delta tables directly without moving data.

Power BI

Connect Power BI to your curated silver tables — via Synapse SQL endpoints or direct Delta Lake connectivity.

Internal Applications

Enriched data is available to any application via SQL connections, APIs, or direct file access on ADLS.

Need Help Implementing?

Rucal provides consulting, commercial licensing, and enterprise support for Datalake Foundation.