How to Backfill Data Effectively: A Data Engineer’s Guide

Data Saint Consulting Inc
7 min read · Nov 19, 2023

Data backfilling is a common task for data engineers, especially when working with large and constantly changing datasets. It is the process of filling in missing historical data on a new system, or replacing old records with new ones as part of an update¹. Backfills typically follow a data anomaly or data quality incident that has let bad data into the data warehouse¹.
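To make the definition concrete, here is a minimal sketch of a backfill over daily partitions. It is an illustration under stated assumptions, not a production pattern: the `warehouse` dictionary stands in for a real table, and `backfill`, `date_range`, and `recompute` are hypothetical names chosen for this example. The function fills any missing day in a range and replaces existing days only when `overwrite` is set.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterator, List


def date_range(start: date, end: date) -> Iterator[date]:
    """Yield each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(
    warehouse: Dict[date, dict],
    recompute: Callable[[date], dict],
    start: date,
    end: date,
    overwrite: bool = False,
) -> List[date]:
    """Fill missing daily partitions in [start, end].

    Existing partitions are replaced only when overwrite=True.
    Returns the list of dates that were written.
    """
    written = []
    for day in date_range(start, end):
        if day in warehouse and not overwrite:
            continue  # partition already present, leave it alone
        warehouse[day] = recompute(day)
        written.append(day)
    return written


# Hypothetical in-memory "warehouse" with a gap on Nov 2.
warehouse = {
    date(2023, 11, 1): {"orders": 10},
    date(2023, 11, 3): {"orders": 7},
}

# Stand-in for re-reading the source system for a given day.
def recompute(day: date) -> dict:
    return {"orders": 0}

filled = backfill(warehouse, recompute, date(2023, 11, 1), date(2023, 11, 3))
rerun = backfill(warehouse, recompute, date(2023, 11, 1), date(2023, 11, 3))
```

Because the function skips partitions that already exist, running it a second time over the same range writes nothing. Passing `overwrite=True` covers the other half of the definition, where old records are replaced with new ones.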

Backfilling data can be challenging and time-consuming, especially if done manually or without proper planning. The only thing worse than backfilling data is having to do it a second time after making a mistake¹. To avoid this mess, data engineers need to follow some best practices and patterns for backfilling data effectively. In this article, we will cover some of these tips and techniques, as well as how to use lakeFS, an open-source project that enables data version control, to simplify and automate the backfilling process.

Why Backfill Data?

There are many reasons why data engineers may need to backfill data, such as:

Missing data:

Sometimes, a part of the data may be missing when the first calculation is done, due to network issues, system failures, or human errors². For example, a column may…


For consultation services regarding data engineering and analytics: datasaintconsulting@gmail.com