Companies were once hopeful about the rise of the so-called data lake: a repository for raw data pulled in from thousands of sources for further transformation and analysis. This architecture is now falling out of favor with CIOs due to its security flaws, governance issues, and scalability costs.
A data lake has a certain hoarding element to it, a store-now-extract-value-later mentality that can come back to bite the company, given how progressively harder value extraction becomes as the platform scales up.
It’s too early to wave goodbye to the trusty data warehouse, estimated to grow into a $51 billion market by 2028, but alas, this design has its issues too: it tends to be too rigid for ever-changing business needs and struggles with latency as well as real-time streaming data. A deeper look at the problem reveals that many of the issues with both architectures come down to a bottleneck in their underlying piping.
Both a warehouse and a lake, two approaches locked in a silent stand-off behind the industry’s scenes, require robust data pipelines to get the data in. These pipelines usually incorporate the same steps: extracting (E) the data from the source, transforming (T) the data in the desired ways, and loading (L) the data into the repository. An ETL pipeline is the default for a data warehouse, while ELT moves the raw data into the lake first, transforming it later before passing the processed data to end users. Transformation is ultimately a major component in both approaches and, unfortunately, the key pain point for any data project out there.
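The difference between the two orderings can be sketched in a few lines of Python. This is a minimal illustration, not any specific tool's API; the record schema and function names are assumptions for the example only.

```python
def extract(source):
    # Pull raw records from a source (an in-memory list stands in for an API).
    return list(source)

def transform(records):
    # A toy "T": uppercase names and drop records missing an amount.
    return [
        {"name": r["name"].upper(), "amount": r["amount"]}
        for r in records
        if r.get("amount") is not None
    ]

def etl(source, warehouse):
    # ETL: transform in flight, so only processed rows reach the warehouse.
    warehouse.extend(transform(extract(source)))

def elt(source, lake):
    # ELT: load raw data as-is; the "T" step runs later, inside the lake.
    lake.extend(extract(source))
    return transform(lake)  # deferred transformation, run on demand
```

The practical consequence is visible in the sketch: in ETL the repository never sees malformed rows, while in ELT the lake stores everything and pushes the cleanup cost downstream.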
“T” for trouble
Data transformation is an umbrella term for any manipulation data has to go through before it is ready for analysts to dig in. If dealing with a binary output from a sensor array, you may want to convert it into a human-readable format. If your logistics provider sends in data with dates formatted as “DD/MM/YY” while your warehouse is set for “MM/DD/YY,” this is the point where you get the format right. Duplicate and corrupt records need weeding out too, as part of routine data cleaning.
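A transformation step like the one above can be surprisingly simple to state and still easy to get wrong at scale. As a rough sketch, assuming records arrive as dicts with illustrative `id` and `date` fields:

```python
from datetime import datetime

def normalize_row(row, seen_ids):
    """Convert a DD/MM/YY date to MM/DD/YY; drop corrupt and duplicate rows.

    `row` is assumed to be a dict with "id" and "date" keys; the field
    names are hypothetical, not from any specific system.
    """
    try:
        parsed = datetime.strptime(row["date"], "%d/%m/%y")
    except (KeyError, ValueError):
        return None  # corrupt record: date missing or malformed
    if row["id"] in seen_ids:
        return None  # duplicate record already processed
    seen_ids.add(row["id"])
    return {**row, "date": parsed.strftime("%m/%d/%y")}
```

Note that even this toy version has to make policy decisions (what counts as corrupt, which copy of a duplicate wins), which is part of why “T” resists being fully automated.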
This process may also include calculations or alterations in the data, with new variables added into the set. For example, when receiving a stream of separate transaction events, you may want to add hourly and daily statistics, which would have to be calculated based on the incoming granular data points.
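The hourly roll-up described above can be sketched as a small aggregation over an event stream. The `(timestamp, amount)` pair layout is an assumption for illustration, not a standard schema:

```python
from collections import defaultdict
from datetime import datetime

def hourly_totals(events):
    """Roll granular transaction events up into per-hour totals.

    Each event is assumed to be an (ISO-8601 timestamp, amount) pair.
    """
    totals = defaultdict(float)
    for ts, amount in events:
        # Truncate the timestamp to the start of its hour to form the bucket key.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        totals[hour.isoformat()] += amount
    return dict(totals)
```

In a real streaming setting the same idea runs incrementally over windows rather than over a complete list, which is where much of the computational cost discussed below comes from.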
Naturally, data transformation is computationally intensive. In both ETL and ELT, “T” is the component that performs complex operations on a variety of data types, which makes it demanding in terms of CPU workloads. This creates a scalability issue and drives up the project’s costs too.
In strategic terms, transformation is meant to adjust the raw data into the format that’s ultimately defined by its value from the business perspective, which depends on the business question at hand. Thus, if the business question changes, the company must adjust the transformation part of the pipeline to accommodate the strategic shift, even if the incoming raw data is exactly the same.
“T” is also very time-consuming, as it often involves writing and testing the code that handles the incoming raw data, bringing it as close to the production-grade level as possible. Accumulating and preparing data for any given analytics project takes a week on average, an IDG poll found. From personal experience, “T” is where most of that delay comes from, and one week is lightspeed for many companies I’ve spoken with.
Breaking through the conundrum
The transformation challenge is part of the equation for any business-oriented data infrastructure. Raw data is not particularly helpful for analysts and Business Intelligence teams, and tasking them with its initial processing is the ultimate waste of their time and the company’s money.
Over the past few years, businesses have been increasingly reliant on cloud warehousing, which offers a workaround to the “T” challenge. A cloud infrastructure scales up the underlying computational resources in line with the system’s needs. The prices go up too, accordingly, and unless the cloud is private, this approach also means handing management of all of your critical data to a third party, which is about as secure as it sounds.
Hypothetically, a higher degree of data standardization across entire industries could somewhat mitigate the challenge. This would reduce the need for transformations, as the data flowing in would arrive in a more analytics-friendly format than it does now. That said, it is hard to imagine this working in real-world terms, where companies would still use custom data formats as a competitive edge. Furthermore, data engineers would still need to adjust the pipeline for the specifics of every source and API, so it’s hardly a silver-bullet solution.
To streamline and optimize their pipelines, whether ETL or ELT, companies must be strategic about the overall architecture of their data infrastructure. They should see whether they can move some of the workload to edge devices, which are in some cases capable of handling a few initial transformations. On-premises capabilities and processing are another piece of the puzzle that needs to find the right place, with cloud processing as the third and final pillar. Optimizing the data pipeline is more than a matter of cleaning up the code; it is a balancing act that requires strategically distributing the workload across the entire hardware stack in use.
The data transformation challenge is a major hurdle for any company looking to become data-driven. Collecting troves and troves of raw data works against companies that have only a vague idea of what they are storing it for in the first place. Businesses should either transform data as soon as possible, with an outlook to generate immediate value, or not store it at all. With both data warehouses and data lakes, overcoming the transformation challenge is a matter of bringing companies’ business arms closer to the IT desks, and of course, clever innovation too.
About the Author
Ronen Korman is Founder and CEO at Datorios, formerly Metrolink.ai, a data transformation framework company. He is a technology leader with 30 years of R&D experience leading high-risk, high-budget multidisciplinary technological projects. He served as commander of the Israel Defense Forces (IDF) elite technological Unit 81 with the rank of Colonel, and was Head of Technology for the Operation and Cyber division (General equivalent) at the Israeli Prime Minister’s Office. He has vast experience in resource optimization and making decisions under uncertainty. Ronen believes in the power of the human mind and considers nothing to be truly impossible. He has been awarded the prestigious Israel Defense Prize and takes pride in being a seasoned technology geek.