Apache Flink and the Real-Time Data Blind Spot
If you’re like most organizations, you have real-time data processing projects built, under construction, or planned. These systems stream data from one or more points of origin – point-of-sale terminals, connected machines, vehicles – to back-end applications for processing, and the more complex they become, the greater the risk of a real-time data blind spot. Real-time AI systems are nearly always fed by streaming data as well.
The infrastructure for this usually includes a data streaming platform like Apache Kafka, which is used to reliably transport data from its point of origin to its destinations. It also likely includes a framework like Apache Flink, which is used to process the data as it streams by, calculating averages, looking for anomalies, or attaching other data to the stream for downstream actions and analytics.
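As a concrete example of the kind of job Flink runs, here is a minimal sketch that reads transaction amounts from Kafka and computes a one-minute rolling average. The `pos-transactions` topic, the broker address, and the plain-number payload format are all illustrative assumptions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class RollingAverageJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Topic name, broker address, and payload format are illustrative.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("pos-transactions")
                .setGroupId("rolling-average")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "pos-transactions")
                .map(Double::parseDouble) // assumes each message is a plain numeric amount
                .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .aggregate(new AverageAggregate())
                .print(); // a real job would sink this to Kafka, a database, etc.

        env.execute("rolling-average");
    }

    // Accumulates (sum, count) per window and emits the mean.
    public static class AverageAggregate implements AggregateFunction<Double, double[], Double> {
        @Override public double[] createAccumulator() { return new double[] {0.0, 0.0}; }
        @Override public double[] add(Double value, double[] acc) { acc[0] += value; acc[1]++; return acc; }
        @Override public Double getResult(double[] acc) { return acc[1] == 0 ? 0.0 : acc[0] / acc[1]; }
        @Override public double[] merge(double[] a, double[] b) { return new double[] {a[0] + b[0], a[1] + b[1]}; }
    }
}
```

Everything between the source and the sink in a job like this is exactly the processing that, as we'll see, most monitoring tools cannot see into.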
As a Head of Data or Head of Data Infrastructure, you’re ultimately responsible for the success or failure of real-time data delivery and processing, so maintaining visibility into it is crucial. You need to ensure that data is flowing and being processed in a timely fashion, with no downtime and with solid security. Streaming data governance is especially challenging because:
Data is in transit. Unlike data sitting stationary in a database, streaming data must be protected and monitored throughout its entire journey.
It’s sensitive data. Many real-time systems act on streaming customer data – purchases, locations, medical device readings, and so on. A data breach could be disastrous.
It’s business-critical. The data-driven actions taken in real time can include payment approval, fraud detection, vehicle routing, purchase recommendations, securities trading, and alerting. If these systems start acting unpredictably or experience downtime, things will become stressful for you, to say the least.
With these challenges in mind, let’s talk about a data processing blind spot that most data infrastructure teams don’t realize they have until things go sideways.
As mentioned previously, your stream processing programs are created and executed using a framework like Apache Flink, which is widely used by real-time processing leaders such as Netflix and Uber. Flink makes it easier to run programs that process data in motion and to handle the issues that complicate stream processing: scale, data arriving out of order, data in unexpected formats, and so on.
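Handling out-of-order data, for instance, takes only a few lines. Here is a minimal sketch, assuming a hypothetical SensorReading event type, in which a late-arriving record is still processed under its correct event time:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OutOfOrderDemo {

    // Simple POJO event type, purely for illustration.
    public static class SensorReading {
        public String sensorId;
        public long timestampMillis;
        public double value;

        public SensorReading() {}

        public SensorReading(String sensorId, long timestampMillis, double value) {
            this.sensorId = sensorId;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<SensorReading> readings = env.fromElements(
                new SensorReading("s1", 1_000L, 20.5),
                new SensorReading("s1", 3_000L, 21.0),
                new SensorReading("s1", 2_000L, 20.8)); // arrives out of order

        // Use event time and tolerate records arriving up to five seconds
        // late, so event-time windows still produce correct results
        // despite reordering in transit.
        DataStream<SensorReading> withEventTime = readings.assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.timestampMillis));

        withEventTime.print();
        env.execute("out-of-order-demo");
    }
}
```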
Popular observability platforms can report on real-time data systems, showing you beautiful dashboards that indicate the health of your infrastructure. What few of them do, however, is show you what is happening inside your stream processing applications. They will show you what data flows into a Flink application and what data flows out. The processing that happens inside is a blind spot, and it should be a source of concern, because that is often where the data triggers real-time actions.
With this real-time processing blind spot, you cannot know which of your stream processing programs is misbehaving, why it is misbehaving, or who is best placed to fix it.
If something goes wrong with real-time data processing, swift action must be taken. You need to know who in the organization is best suited to troubleshoot the issue, and you need to give people common visibility into the actual data processing so they can find the root cause and correct it as quickly as possible.
Infrastructure metrics in your monitoring systems won’t help much. They can indicate that something is wrong, but they cannot tell you why. You may have hundreds of data streams and hundreds of stream processing programs. Which programs are causing the problem? Who developed them? A program that’s throwing errors might be the obvious culprit, but those errors could be the result of an upstream process that has begun emitting unexpected output.
Traces and logs emitted by the stream processing programs will help with troubleshooting. Unfortunately, their availability often depends on developers taking the time to add observability capabilities such as logging to their programs, keeping that output consistent with the programs written by other developers, and doing it all without degrading the performance of their Flink jobs.
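Flink routes SLF4J loggers through the cluster’s logging configuration, which at least makes the plumbing consistent across jobs. Here is a minimal sketch of the kind of instrumentation each developer would need to add; the operator and the log messages are illustrative:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// An operator that logs from inside a Flink job. Flink routes SLF4J
// output through the cluster's Log4j configuration, so all jobs log
// to the same place in the same format.
public class AuditedEnricher implements MapFunction<String, String> {

    // static so the logger is not serialized with the function instance
    private static final Logger LOG = LoggerFactory.getLogger(AuditedEnricher.class);

    @Override
    public String map(String record) {
        if (record == null || record.isEmpty()) {
            LOG.warn("Dropping empty record");
            return record;
        }
        // Parameterized logging avoids string concatenation cost
        // whenever the debug level is disabled.
        LOG.debug("Processed record of length {}", record.length());
        return record;
    }
}
```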
To prevent real-time data processing problems, and to speed up resolution of the issues you cannot prevent, consider this three-point plan:
1. Prevent issues with streaming data governance
One of the most common causes of real-time processing errors is a poorly communicated change to a data stream’s schema upstream. Streaming data governance systems help prevent this by cataloging streaming data schemas and the systems that produce and consume them. This “data contract” makes it easier to see how a change might affect other systems, reducing errors caused by poor change management. It also makes clear who on the team owns each real-time infrastructure component, so when an issue arises you can quickly assign an owner to begin troubleshooting. Many streaming data governance systems are available; some have better change notification facilities than others, so be sure to evaluate that capability carefully.
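Schema compatibility can also be checked programmatically before a change ships. Here is a minimal sketch using Apache Avro’s built-in compatibility checker (the Order schema is illustrative); it verifies that a consumer still on the old schema can read records written with the new one:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaChangeCheck {
    public static void main(String[] args) {
        // The original producer schema (the existing "data contract").
        Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        // A proposed change: a new field with a default value, so
        // consumers on the old schema can still read new records.
        Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"},"
                + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"USD\"}]}");

        // Can a reader using v1 decode data written with v2?
        SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(v1, v2);

        System.out.println(result.getType()); // COMPATIBLE
    }
}
```

A governance platform automates this same check across every producer and consumer of a stream, rather than leaving it to each team to remember.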
2. Increase visibility with Flink observability tooling
There are observability tools designed specifically for Apache Flink programs and other real-time processing frameworks. These are a must-have for eliminating the real-time data blind spot because, much like a flight data recorder, they can reveal exactly what a Flink program did at any given time. They let you step through a Flink program’s execution, line by line, to pinpoint where the processing went wrong – or to confirm that it did not, so you can eliminate it as the cause of the problem.
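Even without a dedicated tool, Flink’s built-in metrics API can expose some of what happens inside a job to your existing dashboards. Here is a minimal sketch; the operator and metric name are illustrative:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Counts malformed records inside the job so the condition is visible
// per operator and per subtask, not just at the inputs and outputs.
public class MalformedRecordCounter extends RichMapFunction<String, String> {

    private transient Counter malformed;

    @Override
    public void open(Configuration parameters) {
        // Registered metrics are exported by whatever reporter the
        // cluster is configured with (Prometheus, JMX, and so on).
        malformed = getRuntimeContext().getMetricGroup().counter("malformedRecords");
    }

    @Override
    public String map(String record) {
        if (record == null || record.isEmpty()) {
            malformed.inc();
        }
        return record;
    }
}
```

Custom metrics show that something is going wrong inside the job; record-and-replay tooling goes further and shows you exactly which records and which code paths were involved.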
3. Increase access to streaming data
Because securing production streaming data is new to many IT organizations, it’s common for companies to heavily restrict access to it. That means that when disaster strikes, the people best qualified to troubleshoot often have no access to the production data. To widen the pool of available troubleshooters, you can either (a) use a Kafka UI tool to implement role-based access control over streaming data and mask PII fields, or (b) use a Flink observability tool to share replayable recordings of Flink program executions. Record and replay lets people see what happened without giving them live access to production data.
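For option (a), masking can be as simple as a transformation applied before a stream is exposed to a wider audience. Below is a minimal, purely illustrative sketch – production deployments would typically rely on tokenization or a governance tool’s field-level policies rather than a regular expression:

```java
import org.apache.flink.api.common.functions.MapFunction;

// Masks card-number-like fields before the stream is shared more widely.
public class PiiMasker implements MapFunction<String, String> {
    @Override
    public String map(String record) {
        // Keep only the last four digits of any 16-digit sequence.
        return record.replaceAll("\\b\\d{12}(\\d{4})\\b", "************$1");
    }
}
```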
Real-time processing can be a game-changer for your business and your customers. Stream processing observability and governance are essential to preventing a real-time data blind spot and to resolving the data issues that inevitably arise. Move forward with confidence – and without data processing blind spots!