As we enter 2023 and smart automation takes center stage, the importance of data lineage is becoming clear. But what is data lineage? And what data lineage problems are we going to see in the coming years?
Well, let’s start from the beginning.
Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to destinations. It includes all the data transformations that occur along the way: transformations that reveal what changed and why it needed to change, showing how the data evolved.
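To make the definition concrete, here is a minimal sketch of recording lineage alongside the data itself. The class names (`TrackedDataset`, `LineageStep`), the `orders.csv` source, and the sample rows are all hypothetical, chosen only for illustration; real lineage tools capture far more detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LineageStep:
    name: str      # what was done
    reason: str    # why the change was needed
    rows_in: int   # row count before the transformation
    rows_out: int  # row count after the transformation


@dataclass
class TrackedDataset:
    source: str
    rows: List[dict]
    lineage: List[LineageStep] = field(default_factory=list)

    def apply(self, name: str, reason: str,
              fn: Callable[[List[dict]], List[dict]]) -> "TrackedDataset":
        """Run a transformation and append a lineage record for it."""
        new_rows = fn(self.rows)
        step = LineageStep(name, reason, len(self.rows), len(new_rows))
        return TrackedDataset(self.source, new_rows, self.lineage + [step])


orders = TrackedDataset(
    source="orders.csv",  # hypothetical source name
    rows=[{"id": 1, "amount": "10.5"}, {"id": 2, "amount": None}],
)

cleaned = (
    orders
    .apply("drop_null_amount", "amount is required downstream",
           lambda rows: [r for r in rows if r["amount"] is not None])
    .apply("cast_amount", "reports need numeric totals",
           lambda rows: [{**r, "amount": float(r["amount"])} for r in rows])
)

# Print the recorded history: what changed, why, and how row counts moved.
for step in cleaned.lineage:
    print(f"{cleaned.source}: {step.name} ({step.reason}) "
          f"{step.rows_in} -> {step.rows_out} rows")
```

Each transformation carries its own "what" and "why", so the history travels with the dataset instead of living only in someone's head.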
But knowing a dataset’s source isn’t always enough to fully understand it. Tracing how data was transformed along the way can have a big impact in areas like data migrations, data governance, and strategic decision-making, where organizations rely on data to understand their processes and results.
The truth of the matter is, even in 2022, data processes are still hurting us, and many of the issues that come up are due to how we go about building our pipelines from the get-go. Dozens of plumbers in the field of data pipelines admit that the most common approach is to build them step by step. This means designs are fairly straightforward: start with the data sources, create a strategy for ingestion, come up with a suitable data processing plan, and determine where the destination data should be sent.
Basically, these operators or transformers are placed one after the other, and in the end, we can only hope the pipeline works as it should. With so many different pieces being put together, the logic gets lost, and this is where the importance of data lineage comes into play.
When you can’t see the data, you can’t see the problems. Here are some of the problems that have been making the idea of data lineage seem further and further from reality.
Debugging and Maintenance
Data pipelines are built using a backward process that results in lengthy debugging and maintenance – this is the root of the problem. Each aspect of a pipeline is normally coded separately, with different teams and different engineers in charge of each step – and the result? Knowledge gets lost along the way.
Most data pipelines are built by combining different, individually coded inputs, transformations, and outputs, then running data through the flow and hoping it reaches the intended destination in the format and file types you want.
The issue is that this so-called “normal way” doesn’t show you what is wrong along your pipeline until you are done developing your entire data journey. And when we detect an issue, which we pretty much always do, we’ve solved only half the problem. Fixing that issue is a whole new ball game, many times requiring knowledge from every team and engineer involved.
All of this wastes time and effort.
Even in companies with a perfect data flow set-up, time is being wasted in the wrong places, for the wrong reasons. Engineers, the lifeblood of a company, are spending their days merely getting their pipelines up and running, not to mention the upkeep. All in all, these are tasks that bring no value to the company as a whole.
Shortage of Data Engineers
Data engineers are increasingly hard to come by. For one, no real education or degree even exists for them because data engineers need a unique set of skills to succeed in their field, an ever-evolving field that is growing exponentially.
A data engineer needs to understand everything from programming languages like Python and SQL to databases, the popular ETL and ELT technologies, widely known streaming services, and different infrastructure types, both on-premises and in the cloud. This huge variety of technologies means an even larger variety of skill sets, making the data engineer position one of the hardest to fill.
Data stacks are becoming more and more complex as well, so the role of a data engineer keeps growing in scope. For these reasons, the demand for data engineers is increasing by 50% every year, and the cost to acquire one is skyrocketing.
So why are we wasting their valuable time on routine tasks such as manual impact analysis or incident investigation, which increase their frustration, lead to burnout, and risk them leaving?
High Employee Turnover
More than 60% of data engineers switch jobs after no more than 2 years.
Employee turnover is a big issue, especially in engineering, affecting our teams and forcing us to spend our days onboarding new colleagues, filling them in, and educating them on tools, technologies, and the current data architecture – a design that could change in its entirety in several months’ time.
The most difficult part of onboarding a new colleague is that when a data engineer leaves the company, just as with a software engineer, important documentation either gets lost or is simply not understandable.
During their time employed, engineers and data citizens write endless amounts of documentation, creating a maze of sorts. When they leave, new employees are dropped into this maze, and the time it takes them to find their way out can be enormous.
Due to the complexities of data pipelines themselves, many data engineers eventually recreate data pipelines using their own code. This keeps new employees from doing what they were hired to do: add value to the company.
Data Landscape Complexity
In an ecosystem like data, where systems are only getting more and more complex, data groups need all the help they can get. Data lineage is the light at the end of the pipeline and that’s why tackling the beast now is more important than ever.
Why Are We Talking About Data Lineage in 2022?
Moving into the big data era, data management has undergone a massive transformation.
In the past, we gathered data merely to summarize old processes and derive business insights. Mass amounts of data were collected without thinking about why, and then modern AI or ML algorithms were trained on it to predict the future of our businesses.
But now data collection is all-encompassing. As data infrastructures have grown in complexity, they have evolved from batch-processed, structured data into a crazy data ecosystem with thousands of components aimed at one goal: to derive more value from existing data sources and ensure it is sent to the right destinations, in many circumstances in real time. But these new, crazier, all-encompassing ecosystems are, to put it simply, too much for older data flows to handle; they are just too diverse and too interconnected.
For business intelligence, users need data accessibility and simplicity – and all of this comes down to data lineage.
5 Steps to Tackle Data Lineage in its Entirety
1. Documentation is Everything
Document the work in a common, agreed-upon way so the whole data group is on the same page, and if anyone leaves the company, each component configuration is clear. When this is achieved, the entire pipeline is documented in a way that helps current and future data consumers understand how their data is built. There are obviously different levels of documentation; the more detailed it is, the more it will help tomorrow’s data engineer.
In general, when coding processes, documentation is key to describing what is occurring throughout multiple lines of code. We’ve seen too many cases where code can only be understood by those within the same department, those who speak not only the same language but code in it as well. This makes it difficult for anyone using the data to explain the origins of any data set as they do not know exactly how the code was made or how their results came to be – causing a kink in the chain.
2. Measure Twice, Cut Once
Creating a data pipeline should be done step by step: if you climb too fast, you might miss one. Data pipelines should be created incrementally. After each building block or transformer is placed, output results should be checked to ensure they give us the result we are expecting. If they do, great! Now you can go ahead and set the next block in your logic and test it out. If they don’t? No problem! We know where to start looking for our problem.
By using a step-by-step method, if you’re scratching your head and trying to understand what’s going on with a result you can just take a breather, adjust the logic, and give it another go.
These iterations may be a little more time-consuming, but you increase the likelihood that the data pipeline does what is required of it. Whether you’re using no code, low code, or just code, make sure you break down your pipeline transformations into smaller pieces so they are easier to understand and to maintain.
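The incremental approach described above can be sketched in a few lines. This is only an illustration under assumed names: `run_incrementally`, the step names, and the sample row are hypothetical, and each block is paired with a check that runs right after it.

```python
from typing import Callable, List, Tuple

Rows = List[dict]
# A step is (name, transformation, check on the transformation's output).
Step = Tuple[str, Callable[[Rows], Rows], Callable[[Rows], bool]]


def run_incrementally(rows: Rows, steps: List[Step]) -> Rows:
    """Apply each block, then check its output before moving on."""
    for name, transform, check in steps:
        rows = transform(rows)
        if not check(rows):
            # Fail at the first block whose output looks wrong,
            # so we know where to start looking.
            raise ValueError(f"check failed after step '{name}'")
    return rows


steps: List[Step] = [
    ("parse_price",
     lambda rows: [{**r, "price": float(r["price"])} for r in rows],
     lambda rows: all(isinstance(r["price"], float) for r in rows)),
    ("add_total",
     lambda rows: [{**r, "total": r["price"] * r["qty"]} for r in rows],
     lambda rows: all(r["total"] >= 0 for r in rows)),
]

result = run_incrementally([{"price": "2.5", "qty": 4}], steps)
print(result)
```

When a check fails, the error names the block that produced the unexpected output, which is the "we know where to start looking" benefit in practice.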
3. Maintenance is Everyone’s Problem
Maintenance in the realm of data is a phrase that should not be taken lightly. The process is tedious, never-ending, and relies heavily on data lineage. Data pipeline maintenance usually includes determining that an issue occurred, locating it, coming up with a solution, and finally implementing and verifying it.
Many solutions offer alerts when pipelines break or have scheduled maintenance, but tracking specific keys and events, and checking their schema as well as their value transformations throughout a pipeline in near real time, is crucial to saving time. If you have created your pipeline correctly and have broken it down into smaller pieces or blocks, then it is much easier to pinpoint the problem.
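As a rough sketch of checking keys and schema between blocks, the helper below compares each row against an expected schema and reports what does not match. The function name `schema_violations`, the field names, and the sample rows are hypothetical; a production setup would more likely rely on a dedicated validation or monitoring tool.

```python
from typing import Dict, List

Schema = Dict[str, type]


def schema_violations(rows: List[dict], expected: Schema) -> List[str]:
    """Return human-readable problems for rows that do not match the schema."""
    problems = []
    for i, row in enumerate(rows):
        for key, expected_type in expected.items():
            if key not in row:
                problems.append(f"row {i}: missing key '{key}'")
            elif not isinstance(row[key], expected_type):
                problems.append(
                    f"row {i}: '{key}' is {type(row[key]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


# Hypothetical schema expected after a parsing block has run.
expected_after_parse: Schema = {"user_id": int, "amount": float}
rows = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": "2", "amount": 5.0},  # wrong type for user_id
    {"amount": 1.0},                  # user_id missing entirely
]

issues = schema_violations(rows, expected_after_parse)
for issue in issues:
    print(issue)
```

Running a check like this after each block narrows a failure down to the step where a key disappeared or a value changed type.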
4. It Takes a Village
Creating and maintaining data pipelines normally takes a combination of domain experts, who own the logic, and data engineers, who own the implementation. When everyone is looking at the same data, using the same tools, less gets lost in translation.
5. Customization is Everything
Every company has its own capabilities, data pipelines, and architecture, which is why one tool could be a great fit for your company, while another tool may not be. Time and time again, data engineers will run a proof of concept (POC) with an ETL tool and sign a contract, only to learn afterward that they forgot to verify a very important use case for their company. A few weeks later, they drop the tool and are searching for an alternative all over again. With the many customization options available today, we no longer have to conform to a certain fit, but it is vital to review prospective tools and ensure they can adapt to our ever-changing business needs.
Tackling data lineage is no longer a beast that can’t be tamed. Follow these simple steps for a data infrastructure solution that can scale as you require now and withstand the test of time.