Metadata-Driven Data Ingestion for Apache Spark
Datalake Foundation standardizes and automates your bronze-to-silver data pipelines on Databricks, with Microsoft Fabric support coming soon. Configure once, process everything.
The Missing Layer in Your Data Lakehouse
Extraction tools get data into your lake. Analytics tools model it for dashboards. But who handles the messy middle — transforming raw bronze data into clean, reliable silver tables?
| Stage | Tooling | Role |
|---|---|---|
| Source → Bronze | Data Factory / Fivetran | Extraction & loading |
| Bronze → Silver | Datalake Foundation | Ingestion & transformation |
| Silver → Gold | dbt / Synapse | Analytics modeling |
Six Lines. That's All It Takes.
Your pipeline handles transformations, keys, hashing, and writes automatically — no repetitive coding required.
```scala
val settings = new JsonMetadataSettings()
settings.initialize("/path/to/metadata.json")
val metadata = new Metadata(settings)
val entity = metadata.getEntity(42)
val processing = new Processing(entity, "2025-07-01-slice.parquet")
processing.Process()
```

Load metadata, select an entity, process a data slice — Datalake Foundation handles the rest.
Add to Your Project
Available on Maven Central. One line to add it to your build.
SBT (Scala):

```scala
libraryDependencies += "nl.rucal" % "datalakefoundation_2.13" % "1.6.2"
```
Maven (pom.xml):

```xml
<dependency>
  <groupId>nl.rucal</groupId>
  <artifactId>datalakefoundation_2.13</artifactId>
  <version>1.6.2</version>
</dependency>
```
View on Maven Central
Core Features
Processing Strategies
Full Load, Merge (upsert), and Historic (SCD Type 2) — choose the right strategy per entity, all configured via metadata.
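The three strategies can be pictured as a simple sum type selected per entity from metadata. This is an illustrative plain-Scala sketch; the type and function names are hypothetical, not Datalake Foundation's actual API:

```scala
// Illustrative model of the three processing strategies
// (hypothetical types, not the library's own).
sealed trait ProcessingStrategy
case object FullLoad extends ProcessingStrategy // overwrite the target each run
case object Merge    extends ProcessingStrategy // upsert on the primary key
case object Historic extends ProcessingStrategy // SCD Type 2: keep every version

// A hypothetical per-entity lookup, as the metadata layer might resolve it.
def strategyFor(name: String): ProcessingStrategy = name.trim.toLowerCase match {
  case "merge"    => Merge
  case "historic" => Historic
  case _          => FullLoad // default to a full overwrite
}
```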
Metadata-Driven
Define pipelines in JSON, SQL Server, or folder-based configs. No manual coding — just configuration.
Enterprise Ready
Watermarking, delete inference, schema drift detection, structured logging, merge metrics, and Data Factory orchestration built in.
How Datalake Foundation Compares
The only dedicated, metadata-driven bronze-to-silver library for Apache Spark.
| Feature | Datalake Foundation | Databricks DLT | Custom-Built Spark | dbt |
|---|---|---|---|---|
| Bronze-to-silver focus | ✓ | Partial | Manual | ✗ |
| SCD Type 2 built-in | ✓ | ✗ | Manual | ✓ |
| Metadata-driven config | ✓ | YAML | ✗ | YAML/SQL |
| No repetitive code needed | ✓ | ✓ | ✗ | ✓ |
| Watermark incremental | ✓ | ✓ | Manual | Manual |
| Delete inference | ✓ | ✗ | Manual | ✗ |
| Schema drift detection | ✓ | Partial | Manual | Partial |
| Open source | ✓ | Partial | N/A | ✓ |
| Spark 4.0 native | ✓ | ✗ | Varies | ✗ |
| Merge metrics & observability | ✓ | Partial | Manual | ✗ |
| Data Factory orchestration | ✓ | ✗ | Manual | ✗ |
| Microsoft Fabric support | ✓ | N/A | N/A | Partial |
Databricks DLT is powerful for declarative pipelines but doesn't offer built-in SCD Type 2, metadata-repository management, or delete inference. Custom-built Spark jobs give full control but require significant engineering effort per entity. dbt excels at silver-to-gold analytics modeling but doesn't operate at the ingestion layer.
Key Capabilities
Automatic Transformations
Calculated columns, type casting, column renaming, primary key generation, and source hashing.
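Source hashing boils down to a stable digest over a row's column values, so changed rows can be detected by comparing hashes. A minimal plain-Scala sketch; the hash algorithm and separator here are assumptions, not the library's documented choices:

```scala
import java.security.MessageDigest

// Deterministic SHA-256 hash over a row's column values, joined with a
// separator so ("ab","c") and ("a","bc") hash differently.
// (Algorithm and separator are illustrative, not the library's exact ones.)
def sourceHash(values: Seq[String]): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  val bytes  = digest.digest(values.mkString("\u0001").getBytes("UTF-8"))
  bytes.map(b => f"${b & 0xff}%02x").mkString
}
```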
Incremental Processing
Multi-column watermarks with AND/OR grouping, dynamic expressions, partition-aware filtering, and programmatic watermark reset.
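Multi-column watermarks effectively compile down to a SQL predicate pushed into the source read. A plain-Scala sketch of that idea; the case class and function are illustrative, not the library's API:

```scala
// A hypothetical watermark definition: column, comparison, last-seen value.
case class Watermark(column: String, op: String, lastValue: String)

// Combine watermarks into one SQL predicate, with configurable AND/OR
// grouping between them (illustrative, not the library's API).
def watermarkPredicate(marks: Seq[Watermark], joinWith: String = "AND"): String =
  marks
    .map(w => s"${w.column} ${w.op} '${w.lastValue}'")
    .mkString(s" $joinWith ")
```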
SCD Type 2 History
Full version history with ValidFrom/ValidTo/IsCurrent temporal tracking.
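The ValidFrom/ValidTo/IsCurrent bookkeeping can be sketched on plain case classes: when a key's value changes, the current version is closed and a new open-ended version is appended. This illustrates what the Historic strategy does conceptually, not the library's implementation:

```scala
import java.time.LocalDate

// One historical version of a record, mirroring the
// ValidFrom/ValidTo/IsCurrent convention described above.
case class Version(key: Int, value: String,
                   validFrom: LocalDate, validTo: Option[LocalDate],
                   isCurrent: Boolean)

// Apply a changed value for `key`: close the current version and
// append a new open-ended one (the core of SCD Type 2).
def applyChange(history: Seq[Version], key: Int, newValue: String,
                asOf: LocalDate): Seq[Version] = {
  val closed = history.map {
    case v if v.key == key && v.isCurrent =>
      v.copy(validTo = Some(asOf), isCurrent = false)
    case v => v
  }
  closed :+ Version(key, newValue, asOf, None, isCurrent = true)
}
```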
Flexible Output
Write to file paths (ADLS, S3, OneLake) or Unity Catalog tables with variable interpolation and per-entity mixed output mode.
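Variable interpolation in output paths amounts to substituting per-entity tokens into a configured template. A minimal sketch, where the `${name}` token syntax and the variable names are assumptions for illustration:

```scala
// Replace ${name} tokens in an output-path template with per-entity values
// (token syntax here is illustrative, not necessarily the library's exact one).
def interpolatePath(template: String, vars: Map[String, String]): String =
  vars.foldLeft(template) { case (path, (k, v)) =>
    path.replace("${" + k + "}", v)
  }
```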
Built for Databricks & Microsoft Fabric
Datalake Foundation runs natively on Databricks and Microsoft Fabric Runtime 2.0, integrating with the tools your team already uses.
Databricks Runtime 17.3+
Tested and optimized for the latest Databricks Runtime with Spark 4.0 and Delta Lake 4.0 support.
Unity Catalog
Write directly to Unity Catalog managed tables or external locations on ADLS and S3.
Databricks Workflows
Orchestrate Datalake Foundation jobs with Databricks Workflows, notebooks, or any Spark job scheduler.
Microsoft Fabric
Run on Fabric Runtime 2.0 with OneLake path-based operations. The Full Load strategy works today; Merge and Historic are ready for when Runtime 2.0 adds MERGE support.
From Silver to Insight
Clean silver tables power your entire analytics stack.
Synapse Analytics
Expose your silver and gold layers as SQL databases via Synapse Serverless SQL. Query Delta tables directly without moving data.
Power BI
Connect Power BI to your curated silver tables — via Synapse SQL endpoints or direct Delta Lake connectivity.
Internal Applications
Enriched data is available to any application via SQL connections, APIs, or direct file access on ADLS.
Need Help Implementing?
Rucal provides consulting, commercial licensing, and enterprise support for Datalake Foundation.