The claim that 80% of data scientists’ time is wasted on data preparation has become an all-too-familiar narrative. Is that 80% figure really true?
Well, that is difficult to define as it depends on the organization, the skills, the procedures that were developed and assimilated, as well as the tools applied. But one thing can be said for sure – sourcing data for data science requires significant effort, regardless of the percentage cited.
Digital transformation is challenging: it involves an exhausting process of pulling data from somewhat known sources to a known and controlled schema. Once this process has been completed, it can be quite stable, only requiring reasonable adjustments and modifications to ETL pipelines.
Data science is different. The power of data science lies in the combination and integration of data from multiple sources – the richer the sources are, the better the model gets.
The problem with these data sources is that they are not semantically compatible. Too many data types from too many data sources and domains have to be continuously transformed into a single format. In addition, data privacy and consistency need to be taken into account throughout. The process ends with saving meaningful data, which only marks the start of the iterative training cycle – meaning more data adjustments and version control. It is these processes, these hiccups, that show where that 80% of waste lies.
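To make the harmonization step concrete, here is a minimal sketch – the sources, field names, and units are all hypothetical – of normalizing records from two semantically incompatible sources into one common schema:

```python
from datetime import datetime, timezone

# Two hypothetical sources describing the same entity with
# incompatible field names, units, and timestamp formats.
crm_record = {"customer": "Ada", "signup": "2023-05-01", "spend_usd": 120}
app_record = {"user_name": "Ada", "joined_ts": 1682899200, "spend_cents": 12000}

def from_crm(rec):
    # Normalize the CRM shape into the common schema.
    return {
        "name": rec["customer"],
        "joined": datetime.strptime(rec["signup"], "%Y-%m-%d").date().isoformat(),
        "spend_usd": float(rec["spend_usd"]),
    }

def from_app(rec):
    # Normalize the app-event shape: epoch seconds -> ISO date, cents -> dollars.
    return {
        "name": rec["user_name"],
        "joined": datetime.fromtimestamp(rec["joined_ts"], tz=timezone.utc)
                          .date().isoformat(),
        "spend_usd": rec["spend_cents"] / 100.0,
    }

unified = [from_crm(crm_record), from_app(app_record)]
```

Even this toy example shows the pain point: every new source needs its own mapping function, and any schema change upstream breaks it – which is exactly why the work is continuous rather than one-off.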
Conventional ETL tools for data warehousing are built for a stable schema, meaning they cannot handle the instability and flexibility data scientists need. Data investigation and integration tools for data scientists must reduce the cost of instability and versioning to a frictionless level. They must give the ability to observe, monitor, control, and validate events down to the last key value – and the standard ETL tools available today just don’t do that.
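What "validating events down to the last key value" can look like in practice is sketched below; the rules and field names are illustrative assumptions, not any particular tool's API. The point is that a failure is traceable to the exact key of the exact event, rather than to a whole failed batch:

```python
# Hypothetical key-level validation rules: each rule names the key it
# checks, so a violation can be traced to a single field of a single event.
RULES = {
    "user_id": lambda v: isinstance(v, str) and len(v) > 0,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def validate(event):
    # Return a list of (key, offending value) pairs that violate a rule.
    errors = []
    for key, check in RULES.items():
        if key not in event:
            errors.append((key, "<missing>"))
        elif not check(event[key]):
            errors.append((key, event[key]))
    return errors

good = {"user_id": "u-42", "amount": 10.5}
bad = {"user_id": "", "amount": -3}
```

Here `validate(good)` returns an empty list, while `validate(bad)` pinpoints both offending keys.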
But let’s say you’ve overcome that “80%”. Now comes the next issue: the 80% of models that never become operational.
For an operational model to truly be productive, it needs to be fed with the operational data in production and run on infrastructure with the right performance and scalability. The migration of the model to the operational environment, which requires the assistance of data engineering teams, creates an overhead that usually causes the entire process to fail. Taking the work data scientists developed and making it functional is now a must-have for raising the rate at which ML models reach production. Therefore, data investigation and integration tools must be a part of, or well connected to, the operational data and infrastructure of the company.
Datorios’ complete solution was created by engineers who saw the dilemma data scientists face with data preparation and set out to solve it. Data infrastructures are commonly created with conventional needs in mind, but solutions built around data science challenges – from model development and training through migration to operation – leave the issues data scientists face where they belong: in the past.
The two main pillars of our solution are a rapid, enjoyable data investigation and preparation process and a smooth transition to operational data. With all of this found within the same easy-to-use platform, you can now eliminate the 80% of waste and make what was once 20% of data scientists’ time what it should be: the full 100%.