Data Lake Ingestion Features
Everything you need to build production-grade bronze-to-silver pipelines — configured, not coded.
Processing Strategies
Three battle-tested strategies for every ingestion pattern. Choose per entity via metadata.
Full Load
Complete overwrite with dynamic partition mode. Automatic database creation in catalog mode and first-run detection for seamless initial loads.
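The behavior can be sketched with plain Spark APIs. This is a simplified illustration, not the library's internal code; the schema, table, and partition column names are invented:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Automatic database creation on first use (names are hypothetical)
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")

// Dynamic mode: an overwrite replaces only the partitions present
// in the incoming DataFrame; all other partitions stay untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val source = spark.read.parquet("/landing/sales")  // hypothetical bronze input

if (!spark.catalog.tableExists("silver.sales")) {
  // First-run detection: create the table on the initial load
  source.write.format("delta").partitionBy("LoadDate").saveAsTable("silver.sales")
} else {
  // Subsequent runs: overwrite only the incoming partitions
  source.write.mode("overwrite").insertInto("silver.sales")
}
```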
Merge (Upsert)
Delta MERGE with partition elimination for performance. Source hash change detection, soft-delete inference, and detailed metrics — inserted, updated, deleted, and touched counts per operation.
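The core of such a merge can be sketched with the standard Delta Lake API. A simplified illustration, not the library's implementation; it assumes a `spark` session and an `updates` DataFrame in scope, and the key, hash, and partition column names are invented:

```scala
import io.delta.tables.DeltaTable

val target = DeltaTable.forName(spark, "silver.customers")

target.as("t")
  .merge(
    updates.as("s"),
    // Including the partition column in the match condition lets Delta
    // prune files it never has to read (partition elimination).
    "t.CustomerId = s.CustomerId AND t.Region = s.Region"
  )
  .whenMatched("t.SourceHash <> s.SourceHash")  // touch only rows that actually changed
  .updateAll()
  .whenNotMatched()
  .insertAll()
  .execute()
```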
Historic (SCD Type 2)
Full version history with ValidFrom, ValidTo, and IsCurrent temporal columns. Automatic version management with source hash comparison to detect real changes.
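The general SCD Type 2 flow can be sketched in two steps: close the current version when the source hash changed, then insert the new version as current. A simplified illustration under invented table and column names, not the library's implementation; `updates` is assumed to be a staged temp view:

```scala
// 1. Close current versions whose source hash changed
spark.sql("""
  MERGE INTO silver.products AS t
  USING updates AS s
  ON t.ProductId = s.ProductId AND t.IsCurrent = true
  WHEN MATCHED AND t.SourceHash <> s.SourceHash THEN
    UPDATE SET ValidTo = current_timestamp(), IsCurrent = false
""")

// 2. Insert new and changed rows as the current version
spark.sql("""
  INSERT INTO silver.products
  SELECT s.*, current_timestamp() AS ValidFrom,
         CAST(NULL AS TIMESTAMP) AS ValidTo, true AS IsCurrent
  FROM updates s
  LEFT ANTI JOIN silver.products t
    ON t.ProductId = s.ProductId AND t.IsCurrent = true
   AND t.SourceHash = s.SourceHash
""")
```

Rows whose hash is unchanged match an open current version in the anti-join and are skipped, so only genuine changes produce a new version.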
Advanced Watermark System
Fine-grained incremental processing with multi-column watermarks, dynamic expressions, and full history tracking.
Multi-Column Watermarks
Define multiple watermark columns per entity with AND/OR operation groups for complex incremental filtering scenarios.
Dynamic Expressions
Runtime Scala expression engine with access to Java time libraries, epoch day calculations, and previous watermark values for fully dynamic incremental windows.
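The kind of expression such an engine evaluates can be plain Scala over java.time. A hypothetical example: in practice the previous watermark value would be injected by the engine, here it is hard-coded:

```scala
import java.time.LocalDate

// Hypothetical binding: the engine would supply the previous watermark
val previousWatermark = LocalDate.parse("2024-06-10")

// Start the incremental window 3 days before the last watermark
// (the overlap guards against late-arriving data), as an epoch day
val windowStart = previousWatermark.minusDays(3)
val epochDay    = windowStart.toEpochDay
```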
Watermark Storage
Delta-based system tables with full history tracking per entity ID, column, timestamp, and value. Query your watermark history like any other table.
Reset & Re-processing
Programmatic watermark reset per column or entire entity. Re-process any time window without rebuilding tables.
Production Observability
Know exactly what happened in every run — row counts, schema changes, and structured logs.
// Returned after every merge operation
ProcessingSummary(
  recordsInSlice = 1560,
  inserted = 84,
  updated = 305,
  deleted = 7,
  unchanged = 1164
)
Every merge operation returns a structured ProcessingSummary with exact counts.

Merge Metrics
Structured counts for every merge operation: rows inserted, updated, deleted, and total touched. Feed into dashboards or alerting systems.
Structured Logging
Delta and Parquet log appenders with async support to prevent pipeline blocking. Run ID tracking, audit markers, and detailed trails for production monitoring.
Schema Drift Detection
Automatic detection of added and removed columns, type mismatches, and structural changes. Non-blocking warnings so pipelines continue while you investigate.
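The detection idea can be sketched as a diff of two Spark schemas. An illustrative helper, not the library's implementation:

```scala
import org.apache.spark.sql.types.StructType

// Compare the previously seen schema with the current slice's schema
// and emit non-blocking warnings for each kind of drift.
def reportSchemaDrift(previous: StructType, current: StructType): Unit = {
  val prev = previous.fields.map(f => f.name -> f.dataType).toMap
  val curr = current.fields.map(f => f.name -> f.dataType).toMap

  (curr.keySet -- prev.keySet).foreach(c => println(s"WARN column added:   $c"))
  (prev.keySet -- curr.keySet).foreach(c => println(s"WARN column removed: $c"))
  curr.keySet.intersect(prev.keySet)
    .filter(c => prev(c) != curr(c))
    .foreach(c => println(s"WARN type changed:  $c ${prev(c)} -> ${curr(c)}"))
}
```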
Flexible Metadata Sources
Store your pipeline configuration wherever it fits your architecture.
JSON Configuration
Single file or folder-based configuration. Version-control friendly — review pipeline changes in pull requests alongside your code.
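A per-entity configuration file might look like the fragment below. Every field name here is invented for illustration; consult the library's documented metadata schema for the real keys:

```json
{
  "entityName": "sales",
  "strategy": "merge",
  "businessKeys": ["SalesOrderId"],
  "watermark": [
    { "column": "ModifiedDate", "operation": "AND" }
  ],
  "group": "daily-incremental"
}
```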
SQL Server
Centralized metadata via the cfg.fnGetFoundationConfig() table-valued function. Enterprise-friendly — manage hundreds of entities from a single database.
Programmatic (String)
Pass configuration as a JSON string for testing, CI/CD pipelines, and embedded scenarios where file access isn't available.
Orchestration & Output Creation
Serialize entity configurations for orchestrators, schedulers, or custom tooling — with a built-in Data Factory output and an extensible architecture for more.
The outputs package serializes entity configurations into ready-to-use formats for external consumers.
Organize entities into named groups, request configuration per group at runtime, and feed the result into ADF ForEach activities, Databricks Workflows, or any orchestration tool.
// In your Data Factory pipeline or orchestrator
val items = DataFactory.getConfigItems(
  metadata,
  groupName = "daily-incremental"
)
// Returns JSON array of entity configs
// ready for ADF ForEach or any orchestrator
Serialize entity group configs for orchestration — one call, ready-to-use JSON output.
Data Factory Output
Built-in serialization of entity configs to JSON arrays for ADF ForEach activities, Databricks Workflows, or any JSON-consuming orchestrator.
Entity Groups
Organize entities into logical groups (e.g. daily-incremental, weekly-full). Request config per group at runtime for targeted orchestration.
Extensible Output Architecture
The outputs package is designed for new output types. Need a different format or orchestration target? Add a new output object to the package.
Built-in Data Quality
Catch problems before they reach your silver tables.
Duplicate Business Key Detection
Automatic detection of duplicate business keys before the merge, failing fast with a clear DuplicateBusinesskeyException. Prevents silent data corruption.
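The guard can be sketched with a pre-merge aggregation. An illustrative helper, not the library's implementation (a generic exception stands in for the library's DuplicateBusinesskeyException):

```scala
import org.apache.spark.sql.DataFrame

// Fail fast if the incoming slice carries any business key more than once,
// instead of letting a MERGE silently pick an arbitrary winner.
def assertUniqueBusinessKeys(slice: DataFrame, keys: Seq[String]): Unit = {
  val duplicates = slice
    .groupBy(keys.map(slice.col): _*)
    .count()
    .filter("count > 1")
  if (!duplicates.isEmpty)
    throw new IllegalStateException(
      s"Duplicate business keys found for columns ${keys.mkString(", ")}")
}
```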
Delete Inference Safety
Watermark window scoping, partition filters, NULL handling, and first-run safety to ensure deletes are only inferred when the data supports it.
Expression Evaluation Engine
Runtime Scala expression evaluation for watermark values and calculated columns. Access Java time libraries and previous run state dynamically.
Multi-Platform Support
One library, multiple runtimes. Run wherever Spark runs.
Databricks
Runtime 17.3+, Unity Catalog managed and external tables, Databricks Workflows, and full integration with the Databricks lakehouse platform.
Microsoft Fabric
Runtime 2.0 with OneLake path-based operations. The Full Load strategy works today; the Merge and Historic strategies are ready for when Runtime 2.0 adds Delta MERGE support.
Any Spark Environment
Standard Spark 4.0 and Delta Lake 4.0 APIs. Runs on any environment that supports these versions — cloud, on-premises, or local development.
Ready to Simplify Your Pipelines?
Get started with Datalake Foundation or talk to Rucal about implementation support.