The Industry 4.0 revolution centers on how we collect, analyze, and ultimately use our data.
When it comes to data, one of the main issues is the sheer number of disparate data sources and formats. This creates a massive labyrinth through which companies, data teams, and end users have to navigate blindly.
A data executive at one of the largest financial companies in the world once said,
“We have all of the data we could want, but it’s all over the place, in different systems, different databases, and different formats. Just to get the data you need takes an Act of God.”
Whether it’s a two-person startup or one of the largest corporations in the world, companies are dealing with this problem. It’s baffling that this challenge persists in a market saturated with companies selling ETL/ELT tools. However, once you do the research and look under the hood of some of the top data management tools on the market, you can see why this problem has not yet been resolved.
What is ETL?
‘E’ is for Extraction: gathering data from a source.
‘T’ is for Transformation. Transforming data covers a massive amount of territory; it can be complex, full-fledged feature engineering or something as simple as normalizing a field such as zip codes.
‘L’ is for Load: sending that data to a destination.
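To make the ‘T’ step concrete, here is a minimal, hypothetical Python transformer that standardizes US zip codes. The field names and rules are illustrative only, not taken from any specific tool:

```python
def normalize_zip(raw: str) -> str:
    """Normalize a US zip code to the 5-digit form (drops any +4 suffix)."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    if len(digits) == 9:  # ZIP+4 without a dash, e.g. "941071234"
        digits = digits[:5]
    return digits.zfill(5) if digits else ""

# A transformation step applies rules like this to every record in flight.
records = [{"zip": "94107-1234"}, {"zip": "2139"}, {"zip": " 60614 "}]
cleaned = [{**r, "zip": normalize_zip(r["zip"])} for r in records]
print(cleaned)  # [{'zip': '94107'}, {'zip': '02139'}, {'zip': '60614'}]
```

Even a "simple" formatting rule hides decisions (leading zeros, ZIP+4 handling) that some tool, or some engineer, has to make.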
The acronym ETL can be rearranged depending on the tools you use and the flows in your organization. In some cases, data is pulled from a source (‘E’) and dropped into a destination (‘L’), at which point it finally gets transformed, making it an ELT flow. The ETL/ELT process was the catalyst for the world of data warehouses: places where data is parked for use in the future.
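The difference in ordering can be sketched in a few lines of stdlib-only Python. The source data, transform, and in-memory "warehouse" here are hypothetical stand-ins, not any vendor's API:

```python
# Hypothetical source rows with raw, untyped values.
source = [{"name": " Ada ", "amount": "10"}, {"name": "Grace", "amount": "25"}]

def extract():
    return list(source)

def transform(rows):
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

warehouse = []

def load(rows):
    warehouse.extend(rows)

# ETL: transform in flight, before the data lands in the warehouse.
load(transform(extract()))

# ELT: land the raw data first ('E' then 'L'), transform later at the destination.
raw_zone = []
raw_zone.extend(extract())
curated = transform(raw_zone)  # 'T' happens inside the warehouse
```

Same three steps, same result; what changes is where the compute happens and what the destination has to store.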
Present Data Flows
As time progresses, the ETL/ELT world is getting more complicated: the ‘E’ and the ‘L’ no longer happen only in databases or data warehouses. Cloud adoption means data now comes from a wide range of sources, whether databases, data stores, or data objects, both structured and unstructured, and arrives in multiple modes, including batch, events, real-time streaming, and APIs. The start and end points are no longer a single RDBMS instance where the data is parked for basic use; they are much more complex.
The ETL Tools Today
Most tools available today transform the data (‘T’) before it is loaded to the destination (‘L’), because our intensely data-driven culture requires that data be prepped on the fly, ready for use before it lands at a destination for consumption.
With the advancement of this data-on-demand culture, data now has multiple endpoints, stemming from the complex data demands of machine learning and AI, business analytics, finance, HR, marketing, and more. Once API delivery, real-time dashboarding, virtualization, modeling, AI, and processing are added to the equation, the result is a highly complex data environment that requires rethinking how data is moved and consumed. Ultimately, it’s all about providing end users with the data they need, when they need it, and how they want it.
The ETL market today holds many solutions to this complex data problem. Because the three big cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) primarily serve the needs of their own cloud offerings, they are not reviewed here. Instead, the focus is on ETL tools designed to answer dilemmas beyond what a single cloud service offers. The opinions below are based on actual hands-on product usage, not vendor websites or third-party reviews.
Matillion is an ELT company that offers Matillion ETL and Matillion Data Loader. This cloud-based SaaS solution offers VPC deployment bundled with a user management system. Matillion comes with a list of data sources to extract from and the ability to test connections. The solution lets you check that operations are running as expected and create and share credentials as needed.
Thinking about the future of data flows, issues with Matillion’s product come to light. The system is cloud-based and cannot be installed in a non-cloud data center. Files cannot be retrieved from local systems, and the available destinations are very limited. Adding more connections involves numerous UX steps, making the process complex and time-consuming. The lack of streaming capabilities is evident from the get-go, and because the tools offer no out-of-the-box transformers, end users must write scripts manually. The tools are not event-driven, meaning the system can only connect to an empty cloud storage bucket and load files as they arrive. If you are working with update-style rather than batch data, all of your pipelines will be separated, which can increase costs since pricing is per pipeline.
dbt is a cloud-based SaaS service for data transformation that is low-cost to operate. It offers direct integration with Git and Python as well as a built-in scheduler. The key thing dbt is missing: it supports only a minimal set of data sources and does not integrate with streaming data. Troubleshooting within dbt, from testing utilities to the connections themselves, is poor compared to other tools on the market. These connection limitations are really felt by end users, who must understand all the intimate details of each connection with no help from the product itself.
A cloud SaaS ELT platform, Rivery is equipped with many connectors and out-of-the-box template kits for data infrastructure setups. Built into the platform, Rivery offers the ability to build a library of connections to share with users. Its built-in Python options make it simple to work with, and its built-in scheduler lets you automate processes for more fluid data flows. Among Rivery’s flaws: only very basic transformations can be done within the system, and they require writing SQL. Another frustration is that multiple pipelines are required to run and transform the same data set. Not only is the source schema not viewable, but the system also requires hard-coded paths and is not event-driven.
An all-encompassing SaaS ETL platform, Datorios offers data sovereignty with cloud VPC and on-prem installations. For both source and destination, it supports batch, event, and stream-based connectors with a simple API. Building pipelines with Datorios is virtually effortless: our no-code data transformations make it possible for anyone, regardless of technical ability, to create the simplest or the most complex data flows. With scheduled or event-based triggering, tasks are handled event by event, allowing for the creation of any needed configuration. For those proficient in Python, a coding option for creating custom data transformations is available as well. Offering declarative programming capabilities, Datorios makes it simple to implement CI/CD pipelines, and our autoscale feature allows for shorter adaptation periods as business needs change.
Datorios’s usage-based pricing model is fully visible at all times and gives customers complete control over their data costs. Our solution also inherits the customer’s privacy protocols entirely, guaranteeing flexible data governance capabilities. Making waves in its initial debut, Datorios was created to solve the fundamental challenges of ETL/ELT processes.
A data processor and distributor, Apache NiFi is an open-source environment that provides a minimal-code solution with an array of out-of-the-box connectors and processors. Managed versions of NiFi are available only through third parties, and these managed installations require an admin to maintain. The software itself is fairly difficult to manage, with an interface geared more toward data management than ETL. This becomes apparent when building parallel pipelines, as repetition is not easily achieved. With its cluster-based installation built on Apache ZooKeeper, NiFi definitely has some cool features, but not ones helpful for ETL purposes.
An open-source orchestrator that evolved into an ETL tool through the embedding of Python tasks, Apache Airflow lets you work with ML flows and create ETL pipelines. An easy-to-onboard solution, Airflow lets you implement CI/CD and create manual or external pipeline triggers while designing these imperative pipelines.
Advanced Python knowledge is compulsory, as everything must be coded manually and no out-of-the-box transformers are available. With no user management and limited support, the tool can be difficult to manage and even harder to scale, as it offers no native cloud integrations. Because everything is coded from scratch, this solution works well only if you know exactly what you want, how to code it, and how to troubleshoot and deploy that code to meet your business requirements.
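To make the "everything is hand-coded" point concrete, here is a stdlib-only sketch of the kind of imperative task wiring such an orchestrator expects you to write yourself. This is deliberately not Airflow's actual API (which involves DAG objects, operators, and a scheduler); it only illustrates the burden of every step being custom Python:

```python
# Hypothetical hand-wired pipeline: each step is code you write and maintain.
results = {}

def extract():
    results["raw"] = ["3", "1", "2"]

def transform():
    results["clean"] = sorted(int(x) for x in results["raw"])

def load():
    results["loaded"] = list(results["clean"])

# The "DAG": an explicit, manually maintained execution order.
for task in (extract, transform, load):
    task()

print(results["loaded"])  # [1, 2, 3]
```

There is no built-in transformer to fall back on here: parsing, ordering, error handling, and deployment are all yours, which is exactly the trade-off described above.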
Where ETL Tools are Headed
Future-forward ETL tools ensure their solutions deliver on cost, reliability, scalability, and adaptability. All of the tools mentioned above let you create data pipelines, but the ones you choose will ultimately determine your integration times and ongoing maintenance costs. Only solutions that work with multiple endpoints, allow for both ETL and ELT, and can adapt and scale as your needs change will stand the test of time. Choose the right tools today, backed by future-forward thinking, and get back to what matters most: the accurate, real-time data needed for decision making.