I’m often asked how to set up auto scaling in Kubernetes. But unlike scaling compute, scaling data brings more dimensions into the equation.
Several years ago, while leading development at an IDF (Israel Defense Forces) technological unit, I found myself facing a major challenge: how to scale a massive data pipeline. We had various sources, including streaming and complex data in parallel, and dozens of pipelines to maintain. The mission was to alert on potential human or hostile cyber activity within minutes. In such situations, the last thing you want is an infrastructure that can’t scale or support the volume. At the time, we relied on two approaches to scale, and both were static; neither could adjust dynamically in a timely manner.
Over the last decade, scaling technology has gone through a massive evolution. Nowadays, auto-scaling is possible in Kubernetes: it can automatically adjust the scale of your clusters, adding pods when they’re needed.
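To make this concrete, here is a minimal sketch using the official Kubernetes Python client to create a HorizontalPodAutoscaler for a pipeline deployment. The deployment name “pipeline-worker”, the namespace, and the thresholds are illustrative assumptions, not values from any particular setup.

```python
# Minimal sketch: create an autoscaling/v1 HorizontalPodAutoscaler that keeps
# a (hypothetical) "pipeline-worker" Deployment between 1 and 10 replicas,
# targeting roughly 70% average CPU utilization.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="pipeline-worker-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="pipeline-worker"
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```

Note that this kind of autoscaling reacts to compute metrics such as CPU; as the rest of this post argues, that is only one of the dimensions that matter for data pipelines.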
Many people wonder how to set up auto scaling in Kubernetes effectively. But unlike scaling compute, scaling data has two dimensions: horizontal (adding more pods) and vertical (making better use of the resources within each pod).
Not all pipelines are the same: some are heavy consumers of CPU and memory, while others need more network and I/O resources. That’s why adding more compute to a data pipeline doesn’t always improve throughput; the real bottleneck may be a shortage of network or I/O capacity.
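One rough way to tell the two apart is to compare CPU time with wall-clock time for a unit of work: a ratio close to 1 suggests a compute-bound transformer, while a ratio close to 0 suggests it is mostly waiting on the network or disk. The sketch below is illustrative only; the helper name and the sample workloads are assumptions, not part of any particular framework.

```python
import time

def cpu_utilization_ratio(work):
    """Run work() once and return the fraction of wall-clock time spent
    on the CPU. Values near 1.0 hint at a CPU/memory-bound step; values
    near 0.0 hint at a network/IO-bound step."""
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    work()
    wall = time.monotonic() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0

# An IO-heavy step (sleep stands in for a network call) -> ratio near 0.0
print(cpu_utilization_ratio(lambda: time.sleep(0.5)))

# A CPU-heavy step -> ratio near 1.0
print(cpu_utilization_ratio(lambda: sum(i * i for i in range(10**6))))
```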
The Datorios approach offers multi-layer smart scaling built on three core principles: pod-level scaling, in-pod adjustment, and a rule-based management layer.
To illustrate, assume each data pipeline runs in one Kubernetes pod. Within the Datorios framework, each data transformer essentially functions as a worker inside that pod, and an orchestration layer continuously monitors the latency and data backlog of each transformer (worker). Based on these signals, the system adjusts dynamically at two levels:
At the pod level, each pod runs a pipeline until it exhausts its reserved resources (CPU, RAM). At that point, the platform automatically scales out, creating another pod and assigning it to the relevant pipeline.
Within the pod, Datorios dynamically adjusts behavior based on real-time pipeline logic. For example, a pipeline that is currently blocked on the network isn’t using all of its compute resources. In that case, the framework automatically adjusts and processes queued events concurrently, making better use of the pod’s resources without hurting the processing latency of any single event.
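To illustrate the idea behind this in-pod adjustment (and not Datorios’ actual implementation), here is a small asyncio sketch: while one event waits on a simulated network call, other queued events keep making progress, so the pod’s CPU isn’t left idle. The function names and the concurrency limit are illustrative assumptions.

```python
import asyncio

async def send_downstream(event):
    """Stand-in for a network call (e.g. writing to a sink or an API)."""
    await asyncio.sleep(0.1)  # simulated network latency
    return f"delivered {event}"

async def run_pipeline(events, concurrency):
    """Process queued events with a bounded level of concurrency.
    While one event waits on the network, others make progress."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(event):
        async with semaphore:
            return await send_downstream(event)

    return await asyncio.gather(*(worker(e) for e in events))

# 20 events, up to 5 in flight at once: roughly 5x faster than strictly
# sequential processing when the bottleneck is the network, not the CPU.
results = asyncio.run(run_pipeline(range(20), concurrency=5))
print(len(results))
```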
In addition to the two levels of scaling, a truly transformative, modern data framework also needs a management layer. Our management layer automatically orchestrates pipeline scaling based on a defined set of rules, facilitating smart auto-scaling.
Not all pipelines are born equal. Beyond serving different business needs, some pipelines deserve higher priority than others, and it’s important to allocate your resources accordingly: think of a pipeline that supports traffic management during critical rush hours versus one that feeds an occasional BI tool. This brings us to the concept of selective scaling, where resources are allocated to pipelines dynamically, only when needed, and for a predefined duration.
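One way to picture selective scaling is as a small set of rules that grant extra pods only to higher-priority pipelines, only inside their critical time window, and only for a predefined burst duration. The sketch below is purely illustrative; the field names, priorities, and thresholds are assumptions rather than Datorios’ actual rule format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class ScalingRule:
    pipeline: str
    priority: int              # lower = more critical; a fuller orchestrator
                               # would use this to break ties (unused here)
    max_pods: int              # ceiling during an approved burst
    window_start: int          # hour of day when extra capacity is allowed
    window_end: int
    burst_duration: timedelta  # how long the extra pods may be kept

def allowed_pods(rule: ScalingRule, now: datetime,
                 burst_started: Optional[datetime]) -> int:
    """Decide how many pods a pipeline may use right now: extra capacity is
    granted only inside the rule's time window and only while the predefined
    burst duration has not elapsed; otherwise fall back to a single pod."""
    in_window = rule.window_start <= now.hour < rule.window_end
    burst_active = (burst_started is not None
                    and now - burst_started < rule.burst_duration)
    return rule.max_pods if (in_window and burst_active) else 1

# A critical "rush hour" traffic pipeline vs. a low-priority BI feed.
rush_hour = ScalingRule("traffic-mgmt", priority=0, max_pods=8,
                        window_start=7, window_end=10,
                        burst_duration=timedelta(hours=2))
bi_feed = ScalingRule("bi-report", priority=5, max_pods=2,
                      window_start=0, window_end=24,
                      burst_duration=timedelta(minutes=30))

now = datetime.now()
print(allowed_pods(rush_hour, now, burst_started=now))   # up to 8 inside its window
print(allowed_pods(bi_feed, now, burst_started=None))    # 1 (no active burst)
```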
The bottom line is that the linear approach of simply adding more pods struggles to address scaling efficiently. When you think about auto scaling, make sure to consider both scaling dimensions: horizontal and vertical.
In other words, implement a smart data orchestration layer that scales based on actual business and operational needs.