Apache Flink: First Impressions
In this article, I will recount my initial foray into Apache Flink, shedding light on my background, first impressions, onboarding experience, practical applications, and my overall insights on utilizing Apache Flink for streaming data processing.
As a software engineer, I have dabbled in a wide variety of areas, from smart homes and IoT to OCR for documents and checks. For most of my professional career, I’ve developed fintech and insurance applications using .NET. My roles have always had me writing code and dealing with data to one end or another. Getting data, using it, cleaning/transforming it, and sending it to be consumed/analyzed has always been a passion of mine. Over the years, I’ve tried, failed, succeeded, loved, and hated thousands of frameworks, libraries, languages, tools, databases, cloud providers, DevOps stacks, and everything in between. Recently, my journey has led me to solution engineering roles at two amazing companies. One of them is Datorios, where our R&D team has recently set out to reimagine Flink. Considering this background, here are my first impressions and experiences with Apache Flink.
What exactly is Apache Flink, and why all the hype? At its core, Flink is a Java framework designed for in-memory streaming data processing. If you need processing that is fault-tolerant, large-scale, high-volume, high-throughput, fast, distributed, or some combination of these, Flink is the tool to explore. It supports various languages, including Python, SQL, Scala, Java, and other JVM languages, and it can even handle batch processing. Being open source, Flink has many third-party integrations and boasts a rapidly growing community. As for in-memory speed, Apache claims that Flink can achieve it at any scale, and judging by the companies that use it, I would say that’s probably true.
So, what are some practical use cases for Flink? A quick internet search yields consistent answers.
Distributed stream processing, speed/latency, statefulness, and fault tolerance, along with scalable capabilities, answer the ‘why.’ Flink has libraries for transformations, advanced windowing, and even features like pattern detection using CEP (Complex Event Processing). Additionally, you can write straight SQL, which in the world of data, is extremely powerful. Flink’s incorporation of watermarks and timestamps addresses progress tracking, especially with out-of-order data. These factors, among others, contribute to its widespread adoption by major corporations.
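To make the watermark idea concrete, here is a minimal, self-contained Python sketch of how a bounded-out-of-orderness watermark trails the highest event timestamp seen so far. This is not the actual Flink API; the function and class names are my own, and it only illustrates the concept behind Flink’s built-in watermark strategy for out-of-order data:

```python
from dataclasses import dataclass

@dataclass
class Event:
    key: str
    timestamp: int  # event time, e.g. epoch millis

def bounded_out_of_orderness_watermark(events, max_out_of_orderness):
    """Yield (event, watermark) pairs: the watermark trails the highest
    timestamp seen so far by a fixed bound, which is conceptually how
    Flink tracks event-time progress when data arrives out of order."""
    max_ts_seen = float("-inf")
    for event in events:
        max_ts_seen = max(max_ts_seen, event.timestamp)
        watermark = max_ts_seen - max_out_of_orderness
        yield event, watermark

# Out-of-order stream: 95 and 90 arrive after later timestamps.
events = [Event("a", 100), Event("a", 95), Event("a", 130), Event("a", 90)]
for event, wm in bounded_out_of_orderness_watermark(events, max_out_of_orderness=10):
    late = event.timestamp < wm  # events behind the watermark count as "late"
    print(event.timestamp, wm, late)
```

With a bound of 10, the event at timestamp 95 is still on time (the watermark is only at 90), but the event at 90 arriving after 130 falls behind the watermark (120) and would be treated as late.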
Honestly, anything event-driven or data-streaming could be a good use case for Flink. Consider manufacturing facilities or hospitals, where countless sensors generate data simultaneously, necessitating prompt actions, alerts, and logical decision-making. This is where Flink excels.
From what my experience has taught me so far, I would say Flink handles all the meat and bones remarkably well. Compared to other tools, I think it has everything you need without compromising much. Where Flink starts lacking, for me, is the user experience; the UI, documentation, and learning resources left much to be desired. All of this can be overcome, and it seems like Flink was built from the ground up to be the best streaming engine around. I think it succeeded in that, but it has not yet been given the time it needs to polish the user experience. While it does offer some monitoring features, such as back-pressure metrics and the ability to submit jobs through the web UI, I believe this can be drastically improved, and I look forward to what may be added in future iterations.
How did I start? Installing Flink, of course! Flink runs on all UNIX-like environments. Start by downloading Flink and ensuring you have a compatible Java version (Java 11 for recent releases; Java 8 still works but is deprecated). There are tons of installation guides out there, and for me, this was the easiest part of using Flink. You can follow along with the official guide or just look online for other tutorials. I opted for the standalone install for the first go-around, running on my Windows machine inside a virtual machine with Ubuntu 22.04. Once you have Flink downloaded and extracted, just start the cluster with the “./bin/start-cluster.sh” script, and you are ready to submit a job.
Next, I experimented with examples in both Java and Python from articles I found online. I ran into challenges stemming from version disparities between Flink 1.18 (my version) and online guides ranging from v1.6 to v1.15. The API differs between versions: some methods are deprecated, and others have been moved around. This resulted in wasted hours trying to figure out what still existed and where things had moved. The problem seems more significant with PyFlink (Python) than with Flink’s Java API, but both had similar issues.
So what did I try the first time? A bit of everything.
Once you figure out the basic layout for sources and sinks, the most challenging part, for me, was transformations, as they can get complex and confusing very quickly. Apache Flink has a large set of features like event time processing, windowing, and exactly-once processing end-to-end. Figuring out how to string these things together while transforming, cleaning, and enriching your data can be tricky, to say the least.
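As an illustration of the kind of logic a keyBy -> window -> aggregate pipeline expresses, here is a rough, pure-Python sketch of counting keyed events in tumbling event-time windows. Again, this is not the Flink API; the function name and data layout are my own, chosen only to show what the windowed aggregation computes:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size_ms):
    """Group (key, timestamp) events into fixed, non-overlapping
    event-time windows and count occurrences per key per window,
    roughly what a keyed tumbling-window count job produces."""
    counts = defaultdict(int)
    for key, ts in events:
        # A tumbling window of size W starting at t covers [t, t + W).
        window_start = (ts // window_size_ms) * window_size_ms
        counts[(key, window_start)] += 1
    return dict(counts)

events = [("sensor-1", 0), ("sensor-1", 4000), ("sensor-2", 6000), ("sensor-1", 11000)]
print(tumbling_window_counts(events, window_size_ms=5000))
# {('sensor-1', 0): 2, ('sensor-2', 5000): 1, ('sensor-1', 10000): 1}
```

In real Flink, the engine also has to decide when a window is complete (using the watermarks discussed earlier) and how to manage the per-window state; this sketch shows only the grouping and counting logic.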
Examples for the simple use cases I was trying to solve seemed lacking, so I thought, why not use AI to help? I tried, but it didn’t work well. I found that ChatGPT 3.5 thinks the newest Flink version is 1.14.3, which is obviously not correct and resulted in many older examples that no longer worked. One interesting finding was that the paid version, GPT-4, did know the correct Flink version.
I tried Google Bard and Microsoft Copilot as well. In general, none of them gave exemplary results. Some produced older code with deprecated or relocated libraries, while others got confused between Java and Python. I even encountered incorrect code and ridiculously bad formatting from Google Bard. Overall, most of the AI tools I tried helped point me in the right direction but fell short at resolving specific issues. I think Copilot was the most effective, but I’m sure the others will catch up given time and more code-example availability.
With limited online assistance and no luck exploring AI tools, I decided to buy a book and start with the basics. I purchased “Stream Processing with Apache Flink” by Fabian Hueske and Vasiliki Kalavri. The book does a great job laying out basic concepts and use cases, and I recommend it for anyone starting to learn Flink. As I continue to read it, I will post more in the future. Two sentences toward the beginning caught my attention:
While the book has been extremely helpful so far, these two lines made me chuckle. That has not been the case, for me at least.
As I started getting my feet wet, I was hit with my next conundrum: Java or Python? I understand that, in the real world, a company may already use one of these, so the choice may be easy. For me, I was free to use either one without restraints. But trying to decide which route to take required some further investigation.
What did I find? Java seemingly won over Python. Flink is written in Java, so overall support and community should be better, and some even suggest it should be more performant due to cross-language overhead. I’m not sure how much overhead we’re talking about, but it’s good to be aware of. Since Flink is written in Java, you get all the shiny new toys first; Python additions always come after. So if the latest changes are relevant to you, keep that in mind.
Java won again when it came to code examples and internet searches for answers. StackOverflow and general search results had more Java answers than Python ones. Next, and I think most importantly, the APIs themselves differ: Java has the full feature set, while Python has a subset (e.g., sending data through a socket or the Distinct keyword). Finally, Flink’s own documentation has examples in Java/Scala but not always in Python. So if you have the option, my scales are tipped in favor of Java.
Embarking on this onboarding journey has been an excellent introduction to Apache Flink and the world of data streaming. I am going to dive into more advanced topics as I work through use cases. I will write about that experience with code examples in future posts. One thing I will definitely use in the future is the Apache mailing list and Flink’s Slack channel. I found that late in the game, but have heard from others that they are responsive and helpful, so check that out when seeking assistance.
Overall, Flink stands out as a powerful tool. When compared to other frameworks, libraries, and languages I have used, I think the community could benefit from more real-world examples, StackOverflow support, and in-depth articles. While the learning curve may be steep, particularly due to documentation and example gaps, Flink’s potential becomes apparent with exploration. I look forward to more research and future releases.
What’s next: I will continue to blog about my experiences with Apache Flink.
Feel free to share your comments and ideas! Have you run into any of these issues? What aspects have proven critical or complex in your Flink projects? Let’s continue the conversation!