The global landscape is in the midst of a digital transformation that’s pushing the boundaries of data processing and management.
In modern business environments, data pipelines stand at the core of every digital process and action. From marketing and sales to product development and customer support, it’s your data operations that make these applications work. In this sense, data processing is the “end-to-end” management of structured data, including its capture, aggregation, integration, analytics, and delivery.
While the Extract and Load areas of the ETL process receive a lot of attention, it’s the Transformation part that holds the key to unlocking the success of a data project, especially considering the significant cost and time it usually requires. In fact, our focus on transformation is one of the core functional elements that make Datorios such a powerful platform.
ETL? The magic is actually in the T
“Transformation” means adjusting raw data into a specific format that’s ultimately defined by its value from the business perspective, which depends on the business question at hand. So if the business question changes, the company needs to be able to adjust the transformation part of the pipeline to accommodate the strategic shift, even if the incoming raw data remains the same.
Data Pipeline Method #1 – Control the flow
A fundamental rule of thumb for any effective data transformation is flow control. Controlling the data flow lets the business connect the data’s value to the business context. For example, let’s say we’re pulling data from IoT sensors on a solar electricity production system, collecting endpoint battery temperature, usage, and other operational data. In this context, the data matters for two reasons: ensuring the batteries don’t overheat (safety) and validating usage (efficiency). Here are three ways to apply flow control in effective, long-lasting data pipelines:
Conditional data flow
This is the key feature for data quality, ensuring that only data within the expected value range is processed. In our example, only events with temperature readings in a plausible range are passed downstream.
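As a minimal sketch of such a quality gate (the field names and temperature thresholds are illustrative assumptions, not a real sensor schema):

```python
# Illustrative bounds: readings outside this range are assumed faulty.
MIN_TEMP_C = -20.0
MAX_TEMP_C = 90.0

def is_valid_event(event: dict) -> bool:
    """Return True only when the temperature reading falls in a sane range."""
    temp = event.get("battery_temp_c")
    return temp is not None and MIN_TEMP_C <= temp <= MAX_TEMP_C

def filter_events(events):
    """Conditional data flow: drop events that fail the quality gate."""
    return [e for e in events if is_valid_event(e)]

events = [
    {"sensor_id": "s1", "battery_temp_c": 34.5},
    {"sensor_id": "s2", "battery_temp_c": 410.0},  # implausible -> dropped
    {"sensor_id": "s3"},                           # missing reading -> dropped
]
print(filter_events(events))
```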
Conditional event treatment
Another use case for flow control is conditional pretreatment of data according to its content, so that a single target can consume it. Let’s say a source sends two types of events – one packed and the other not. If an event isn’t packed, we send it straight to a defined flow. If it is packed, we extract it first and then send it to the target along with the open events’ data flow.
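A minimal sketch of this pretreatment, assuming packed events carry their payload as a JSON string while open events already carry it as a ready-to-use object (an illustrative convention, not a fixed format):

```python
import json

def unpack_if_needed(event: dict) -> dict:
    """Conditional pretreatment: packed events carry their payload as a JSON
    string; open events carry a dict. Both leave in the same open format."""
    payload = event["payload"]
    if isinstance(payload, str):          # packed -> extract first
        payload = json.loads(payload)
    return {**event, "payload": payload}

events = [
    {"id": 1, "payload": {"temp_c": 31.0}},    # open event
    {"id": 2, "payload": '{"temp_c": 78.5}'},  # packed event
]
unified = [unpack_if_needed(e) for e in events]
```

After this step both event types can flow into the same target without the target knowing which form they arrived in.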
Contextual treatment of events
There may be cases where two different contexts come from the same data source, and the data should flow and be processed differently depending on which one applies. For example, an urgent safety alert for overheating versus a non-urgent service log for under-usage.
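A rough sketch of such contextual routing, with illustrative field names, flow names, and thresholds:

```python
OVERHEAT_C = 60.0        # illustrative safety threshold
UNDER_USAGE_KWH = 0.5    # illustrative efficiency threshold

def route(event: dict) -> str:
    """Route events from a single source to different flows based on context."""
    if event.get("battery_temp_c", 0.0) >= OVERHEAT_C:
        return "urgent_safety_alerts"   # overheating: act immediately
    if event.get("usage_kwh", float("inf")) < UNDER_USAGE_KWH:
        return "service_log"            # under-usage: non-urgent review
    return "default_flow"
```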
Data Pipeline Method #2 – Correlations
The event context isn’t always sufficient to properly handle an event – sometimes additional data sources or historical data are required to provide the full context or enrich the original data.
In this case, it’s possible to correlate different data sets arriving from various sources that are logically connected:
- Time-based correlation
- Location-based correlation
- Value-based correlation
We often see solutions that exclude correlations from the ETL transformation, perhaps due to technical limitations such as the difficulty of retaining state. But correlation is an inseparable part of the transformation process and a crucial component of data preprocessing for data warehousing and more.
For example, let’s say we want to correlate two data sources based on time, and then distribute the correlated values based on a filtering condition. Processing such a case from raw data inside the data warehouse is costly and can cause major delays, especially if the sources are unsynchronized and high-volume, and it defeats the purpose of data warehousing.
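To make this concrete, here is a simplified sketch of a time-window join between two unsynchronized sources, followed by a filtering condition on the correlated value. The field names, window size, and nearest-match strategy are all illustrative assumptions:

```python
def correlate_by_time(src_a, src_b, window_s=5.0):
    """Time-based correlation: pair each event from source A with the nearest
    event from source B whose timestamp is within window_s seconds.
    Each B event is matched at most once."""
    joined, used = [], set()
    for a in src_a:
        best, best_dt = None, window_s
        for i, b in enumerate(src_b):
            dt = abs(a["ts"] - b["ts"])
            if i not in used and dt <= best_dt:
                best, best_dt = i, dt
        if best is not None:
            used.add(best)
            joined.append({**src_b[best], **a})  # A's fields win on clashes
    return joined

temps = [{"ts": 0.0, "temp_c": 30.0}, {"ts": 100.0, "temp_c": 80.0}]
usage = [{"ts": 1.5, "usage_kwh": 2.0}, {"ts": 250.0, "usage_kwh": 0.1}]
joined = correlate_by_time(temps, usage)

# Distribute the correlated events based on a filtering condition:
hot = [e for e in joined if e["temp_c"] > 60.0]
```

In a production pipeline this matching would typically be done incrementally over streams rather than over in-memory lists, but the logic is the same.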
Another important use of correlations is controlling the flow of events according to an event’s previous state. By correlating an event to its historical values, we can enrich the business relevance of the data that flows downstream. A simple example is reducing the number of readings from an oversampling sensor: by correlating the current event to the last processed event, we can control the time intervals at which events pass. Or, by correlating a current event to historic values, we can ensure that only events whose changes have business value are processed.
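A small stateful sketch of this idea: a gate that remembers the last processed event and passes a new one only when enough time has elapsed or the value changed meaningfully (the interval and delta thresholds are illustrative):

```python
class ChangeGate:
    """Stateful correlation gate: compares each event to the last *processed*
    event and passes it only when enough time has elapsed since then, or
    when the value changed by a meaningful amount."""

    def __init__(self, min_interval_s: float = 60.0, min_delta: float = 1.0):
        self.min_interval_s = min_interval_s   # illustrative throttle window
        self.min_delta = min_delta             # illustrative "meaningful" change
        self._last = None

    def should_pass(self, event: dict) -> bool:
        if self._last is None:                 # first event always passes
            self._last = event
            return True
        elapsed = event["ts"] - self._last["ts"]
        delta = abs(event["battery_temp_c"] - self._last["battery_temp_c"])
        if elapsed >= self.min_interval_s or delta >= self.min_delta:
            self._last = event
            return True
        return False                           # oversampled, nothing new: drop
```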
Data Pipeline Method #3 – Data Shaping
This involves changing or adjusting the structure and values of an event so it fits the business need and gains the correct structure for further use in analytics or machine learning.
- Meaningful data: Adjusting the key names of the different fields gives them meaning and context.
- Standardization: Recalculating values so they share the same format (lowercase, uppercase, title case, currency signs, etc.) or the same units (time formats, units of measurement, etc.)
- Cleaning: Correcting or removing missing and invalid data
- Creating new keys: Generating new values from mathematical or logical functions over the event’s fields and metadata
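The four shaping steps above can be sketched together in one small function; the input and output field names are illustrative assumptions:

```python
def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9

def shape(event: dict) -> dict:
    """Shape a raw event: rename keys, standardize units and case,
    clean missing values, and derive a new key."""
    temp_f = event.get("tempF")                                       # cleaning:
    temp_c = fahrenheit_to_celsius(temp_f) if temp_f is not None else None  # tolerate a missing reading
    shaped = {
        "sensor_id": event["id"].lower(),              # standardization (case)
        "battery_temp_c": temp_c,                      # meaningful key name + unit conversion
        "usage_kwh": event.get("usage_wh", 0) / 1000,  # unit standardization
    }
    # New key derived from a logical function over the event's values:
    shaped["overheating"] = temp_c is not None and temp_c > 60.0
    return shaped
```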
To summarize: It’s all about Data Transformation
These three methods are fundamental to data pipelines, and it’s important to combine them where relevant – like Lego building blocks – in a way that best serves the business questions that arise along the pipeline. For example, changing the pipeline flow based on a value calculated from correlated data.
To fully leverage the benefits of a Data Warehouse and avoid creating a costly, resource-heavy data swamp, it’s important to ensure that your data strategy includes the following:
- A data transformation solution that preprocesses your data
- A data transformation solution that rapidly adjusts to business changes, producing high-performance pipelines that can handle complex preprocessing of high-frequency data events and multisource correlations
- A data transformation solution that shortens the pipeline production process and prevents operational bottlenecks in the data-to-value process