Data Lake Ingestion Features

Everything you need to build production-grade bronze-to-silver pipelines — configured, not coded.

Processing Strategies

Three battle-tested strategies covering the common ingestion patterns: full reload, incremental upsert, and full history. Choose per entity via metadata.

Full Load

Complete overwrite with dynamic partition mode. Automatic database creation in catalog mode and first-run detection for seamless initial loads.
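
A minimal sketch of what a full load amounts to at the Spark level, assuming a DataFrame sourceDf and a Delta table partitioned by LoadDate; the library drives all of this from metadata, so the table and column names here are illustrative.

scala
// Dynamic partition overwrite: only the partitions present in the incoming
// DataFrame are replaced; the rest of the table is left untouched.
sourceDf.write
  .format("delta")
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("LoadDate")
  .saveAsTable("silver.customer") // created on first run in catalog mode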

Merge (Upsert)

Delta MERGE with partition elimination for performance. Source hash change detection, soft-delete inference, and detailed metrics — inserted, updated, deleted, and touched counts per operation.
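
The equivalent hand-written Delta MERGE looks roughly like this; in the library the join condition, partition predicate, and hash comparison are generated from metadata, so the names below (and the updates DataFrame) are illustrative.

scala
import io.delta.tables.DeltaTable

DeltaTable.forName(spark, "silver.customer").as("t")
  .merge(
    updates.as("s"),
    // Including the partition column in the condition enables partition elimination.
    "t.CustomerId = s.CustomerId AND t.Region = s.Region")
  .whenMatched("t.SourceHash <> s.SourceHash").updateAll() // skip unchanged rows
  .whenNotMatched().insertAll()
  .execute()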

Historic (SCD Type 2)

Full version history with ValidFrom, ValidTo, and IsCurrent temporal columns. Automatic version management with source hash comparison to detect real changes.
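
The temporal columns follow the standard SCD Type 2 query pattern. A sketch, assuming a history table named silver.customer_history:

scala
// Current state of every entity.
val current = spark.table("silver.customer_history")
  .where("IsCurrent = true")

// Point-in-time view: which version was valid on a given date?
val asOf = spark.table("silver.customer_history")
  .where("ValidFrom <= '2024-06-01' AND (ValidTo > '2024-06-01' OR ValidTo IS NULL)")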

Advanced Watermark System

Fine-grained incremental processing with multi-column watermarks, dynamic expressions, and full history tracking.

Multi-Column Watermarks

Define multiple watermark columns per entity with AND/OR operation groups for complex incremental filtering scenarios.
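
Conceptually, an AND/OR group compiles down to a combined filter over the bronze source. A hedged sketch of the kind of predicate that results; the column names and values are illustrative.

scala
// Two watermark columns combined with OR: pick up rows whose data changed
// OR whose deletion marker moved since the previous run.
val slice = spark.table("bronze.orders").where(
  "ModifiedDate > '2024-06-01T00:00:00' OR DeletedDate > '2024-06-01T00:00:00'")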

Dynamic Expressions

Runtime Scala expression engine with access to Java time libraries, epoch day calculations, and previous watermark values for fully dynamic incremental windows.
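
The exact expression syntax is metadata-defined; the snippets below are assumptions meant only to show the flavor of what a runtime expression can compute.

scala
// Illustrative watermark expressions (not the library's literal syntax):
val slidingWindow = "java.time.LocalDate.now.minusDays(7).toString" // last 7 days
val yesterday     = "java.time.LocalDate.now.toEpochDay - 1"        // as an epoch day
val resumePoint   = "previousWatermark" // hypothetical binding for the last stored value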

Watermark Storage

Delta-based system tables with full history tracking per entity ID, column, timestamp, and value. Query your watermark history like any other table.
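
Because the storage is an ordinary Delta table, inspecting watermark history needs nothing beyond standard Spark. The table and column names below are assumptions, not the library's actual schema.

scala
import org.apache.spark.sql.functions.col

spark.table("_system.watermarks")
  .where(col("EntityId") === "sales.orders")
  .orderBy(col("Timestamp").desc)
  .show() // one row per entity/column/run with the stored value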

Reset & Re-processing

Programmatic watermark reset per column or entire entity. Re-process any time window without rebuilding tables.
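
Since watermarks live in a Delta table, a reset is conceptually just removing the stored entries for a column or entity, after which the next run starts from the configured default. A sketch only; prefer the library's reset API over raw SQL, and note that the table and column names are assumptions.

scala
spark.sql("""
  DELETE FROM _system.watermarks
  WHERE EntityId = 'sales.orders' AND ColumnName = 'ModifiedDate'
""")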

Production Observability

Know exactly what happened in every run — row counts, schema changes, and structured logs.

scala
// Returned after every merge operation
ProcessingSummary(
  recordsInSlice = 1560,
  inserted       = 84,
  updated        = 305,
  deleted        = 7,
  unchanged      = 1164
)

Every merge operation returns a structured ProcessingSummary with exact counts.

Merge Metrics

Structured counts for every merge operation: rows inserted, updated, deleted, and total touched. Feed into dashboards or alerting systems.
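
One way to land these counts in a queryable metrics table for dashboards or alerting; the schema and table name are illustrative, using the numbers from the summary above.

scala
val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
import spark.implicits._

Seq(("run-42", "sales.orders", 84L, 305L, 7L))
  .toDF("RunId", "Entity", "Inserted", "Updated", "Deleted")
  .write.format("delta").mode("append").saveAsTable("ops.merge_metrics")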

Structured Logging

Delta and Parquet log appenders with async support to prevent pipeline blocking. Run ID tracking, audit markers, and detailed trails for production monitoring.

Schema Drift Detection

Automatic detection of added and removed columns, type mismatches, and structural changes. Non-blocking warnings so pipelines continue while you investigate.
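
A minimal sketch of the added/removed-column part of the check using plain Spark schema comparison, assuming a source DataFrame sourceDf; the library's actual detection also covers type mismatches and structural changes.

scala
val sourceCols = sourceDf.schema.fieldNames.toSet
val targetCols = spark.table("silver.customer").schema.fieldNames.toSet

val added   = sourceCols -- targetCols // new columns arriving from the source
val removed = targetCols -- sourceCols // columns the source stopped delivering

if (added.nonEmpty || removed.nonEmpty)
  println(s"Schema drift: added $added, removed $removed") // warn, don't fail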

Flexible Metadata Sources

Store your pipeline configuration wherever it fits your architecture.

JSON Configuration

Single file or folder-based configuration. Version-control friendly — review pipeline changes in pull requests alongside your code.

SQL Server

Centralized metadata via the cfg.fnGetFoundationConfig() function. Enterprise-friendly: manage hundreds of entities from a single database.

Programmatic (String)

Pass configuration as a JSON string for testing, CI/CD pipelines, and embedded scenarios where file access isn't available.
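
A sketch of the string form; the JSON shape shown is an assumption, so check the documentation for the actual configuration schema.

scala
val config =
  """{
    |  "entities": [{
    |    "name": "sales.orders",
    |    "strategy": "merge",
    |    "businessKey": ["OrderId"],
    |    "watermark": { "column": "ModifiedDate" }
    |  }]
    |}""".stripMargin
// Hand the string to the metadata loader instead of a file path.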

Orchestration & Output Creation

Serialize entity configurations for orchestrators, schedulers, or custom tooling — with a built-in Data Factory output and an extensible architecture for more.

The outputs package serializes entity configurations into ready-to-use formats for external consumers. Organize entities into named groups, request configuration per group at runtime, and feed the result into ADF ForEach activities, Databricks Workflows, or any orchestration tool.

scala
// In your Data Factory pipeline or orchestrator
val items = DataFactory.getConfigItems(
  metadata,
  groupName = "daily-incremental"
)
// Returns JSON array of entity configs
// ready for ADF ForEach or any orchestrator

Serialize entity group configs for orchestration — one call, ready-to-use JSON output.

Data Factory Output

Built-in serialization of entity configs to JSON arrays for ADF ForEach activities, Databricks Workflows, or any JSON-consuming orchestrator.

Entity Groups

Organize entities into logical groups (e.g. daily-incremental, weekly-full). Request config per group at runtime for targeted orchestration.

Extensible Output Architecture

The outputs package is designed for new output types. Need a different format or orchestration target? Add a new output object to the package.
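
What a custom output could look like; the trait or interface the outputs package actually expects is an assumption here, so treat this purely as a shape.

scala
object AirflowOutput {
  // Serialize one group's entity configs into whatever your orchestrator consumes.
  def getConfigItems(entities: Seq[Map[String, String]], groupName: String): String =
    entities
      .filter(_.get("group").contains(groupName))
      .map(e => s"""{"entity":"${e("name")}"}""")
      .mkString("[", ",", "]")
}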

Built-in Data Quality

Catch problems before they reach your silver tables.

Duplicate Business Key Detection

Automatic detection of duplicate business keys before merge with a clear DuplicateBusinesskeyException. Prevents silent data corruption.
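
Conceptually the check is a group-by on the business key before the merge runs; in the library this is automatic and raises DuplicateBusinesskeyException, so the sketch below (with an assumed sourceDf and key column) is only an illustration.

scala
import org.apache.spark.sql.functions.col

val duplicates = sourceDf
  .groupBy("CustomerId") // business key column(s)
  .count()
  .where(col("count") > 1)

if (!duplicates.isEmpty)
  sys.error("Duplicate business keys in source slice") // the library throws its own exception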

Delete Inference Safety

Watermark window scoping, partition filters, NULL handling, and first-run safety to ensure deletes are only inferred when the data supports it.
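
Conceptually, safe delete inference scopes the "missing from source" check to the slice that was actually processed. A sketch using Delta's whenNotMatchedBySource clause (available in recent Delta releases); all names, the window literal, and the soft-delete column are illustrative.

scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.lit

DeltaTable.forName(spark, "silver.orders").as("t")
  .merge(slice.as("s"), "t.OrderId = s.OrderId")
  // Only rows inside the current watermark window can be inferred as deleted.
  .whenNotMatchedBySource("t.ModifiedDate > '2024-06-01'")
  .update(Map("IsDeleted" -> lit(true))) // soft delete, never a hard delete here
  .execute()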

Expression Evaluation Engine

Runtime Scala expression evaluation for watermark values and calculated columns. Access Java time libraries and previous run state dynamically.

Multi-Platform Support

One library, multiple runtimes. Run wherever Spark runs.

Databricks

Runtime 17.3+, Unity Catalog managed and external tables, Databricks Workflows, and full integration with the Databricks lakehouse platform.

Microsoft Fabric

Runtime 2.0 with OneLake path-based operations. The Full Load strategy works today; Merge and Historic are ready for when Runtime 2.0 adds Delta MERGE support.

Any Spark Environment

Standard Spark 4.0 and Delta Lake 4.0 APIs. Runs on any environment that supports these versions — cloud, on-premises, or local development.

Ready to Simplify Your Pipelines?

Get started with Datalake Foundation or talk to Rucal about implementation support.