Navigating Apache Flink: A Software Engineer’s Onboarding Journey

In this article, I will recount my initial foray into Apache Flink, shedding light on my background, first impressions, onboarding experience, practical applications, and my overall insights on utilizing Apache Flink for streaming data processing.

My Journey

As a software engineer, I have dabbled in a wide variety of areas, from smart homes and IoT to OCR for documents and checks. For most of my professional career, I’ve developed fintech and insurance applications using .NET. My roles have always had me writing code and working with data to one end or another. Getting data, cleaning and transforming it, and sending it on to be consumed or analyzed has always been a passion of mine. Over the years, I’ve tried, failed, succeeded, loved, and hated countless frameworks, libraries, languages, tools, databases, cloud providers, DevOps setups, and everything in between. Recently, my journey has led me to solution engineering roles at two amazing companies, one of which is Datorios, where our R&D team has recently set out to reimagine Flink. With that background in mind, here are my first impressions and experiences with Apache Flink.

Apache Flink: In Short

What exactly is Apache Flink, and why all the hype? At its core, Flink is a Java framework designed for in-memory streaming data processing. If you need fault tolerance, large scale, high volume, high throughput, speed, distribution, or some combination of these capabilities, Flink is a tool worth exploring. It supports various languages, including Python, SQL, Scala, Java, and other JVM languages, and it can even handle batch processing. Being open source, Flink has many third-party integrations and boasts a rapidly growing community. As for in-memory speed, Apache claims that Flink can achieve it at any scale, and judging by the companies that use it, I would say that’s probably true.

Apache Flink’s Edge In Practice

So, what are some practical use cases for Flink? A quick internet search will yield consistent responses:

  1. Fraud detection
  2. Real-time alerting
  3. Stock market quotes or trading
  4. Anomaly detection
  5. IoT applications

Distributed stream processing, speed/latency, statefulness, and fault tolerance, along with scalable capabilities, answer the ‘why.’ Flink has libraries for transformations, advanced windowing, and even features like pattern detection using CEP (Complex Event Processing). Additionally, you can write straight SQL, which in the world of data, is extremely powerful. Flink’s incorporation of watermarks and timestamps addresses progress tracking, especially with out-of-order data. These factors, among others, contribute to its widespread adoption by major corporations.
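
Flink’s actual watermark generators live in its APIs, but the core idea is simple enough to sketch in plain Python. The class and method names below are invented for illustration only: a bounded-out-of-orderness generator tracks the largest event timestamp seen so far and subtracts an allowed lateness, and that running value tells downstream operators how far event time has progressed even when records arrive out of order.

```python
from dataclasses import dataclass

@dataclass
class Event:
    key: str
    timestamp: int  # event time, e.g. epoch millis

class BoundedOutOfOrdernessWatermarks:
    """Toy sketch of bounded-out-of-orderness watermarking (not Flink's API)."""

    def __init__(self, max_lateness: int):
        self.max_lateness = max_lateness
        self.max_timestamp = float("-inf")

    def on_event(self, event: Event) -> None:
        # Track the highest event timestamp seen so far.
        self.max_timestamp = max(self.max_timestamp, event.timestamp)

    def current_watermark(self):
        # Events with timestamps at or below this value are considered late.
        return self.max_timestamp - self.max_lateness

gen = BoundedOutOfOrdernessWatermarks(max_lateness=5)
for ts in [100, 103, 101, 110, 107]:  # out-of-order arrivals
    gen.on_event(Event("sensor-1", ts))

print(gen.current_watermark())  # 110 - 5 = 105
```

The out-of-order arrivals (101 after 103, 107 after 110) never move the watermark backward, which is exactly the guarantee windowing logic relies on.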

Honestly, anything event-driven or data-streaming could be a good use case for Flink. Consider manufacturing facilities or hospitals, where countless sensors generate simultaneous data that demands prompt actions, alerts, and logical decision-making. This is where Flink excels.
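
To make the sensor scenario concrete, here is a hypothetical, framework-free sketch of the kind of keyed, stateful alerting logic Flink would run per sensor. In real Flink this would live in something like a KeyedProcessFunction; the sensor names, threshold, and window size below are all invented for illustration.

```python
from collections import defaultdict

THRESHOLD = 75.0  # hypothetical alert threshold

state = defaultdict(list)  # per-sensor state: a small rolling window of readings
alerts = []

def process(sensor_id: str, reading: float, window_size: int = 3) -> None:
    """Keep the last few readings per sensor; alert when their average is too high."""
    recent = state[sensor_id]
    recent.append(reading)
    if len(recent) > window_size:
        recent.pop(0)
    avg = sum(recent) / len(recent)
    if avg > THRESHOLD:
        alerts.append((sensor_id, round(avg, 1)))

for sensor_id, value in [("temp-1", 70), ("temp-1", 80), ("temp-1", 90), ("temp-2", 60)]:
    process(sensor_id, value)

print(alerts)  # [('temp-1', 80.0)]
```

The key point is that state is partitioned per sensor (per key), which is what lets Flink scale this pattern across thousands of sensors in parallel.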

My First Thoughts on Apache Flink

  • “Why haven’t I used this before?”: My initial reaction after reading a few sentences about Flink on various websites was, “This looks amazing and simple. Why haven’t I used this before?”
  • “How many APIs?”: Cue 436+ browser tabs and a whole lot of reading. The first thing I noticed was the existence of different APIs: Flink is split into the Table API and the DataStream API. Determining which one to use and when is still somewhat confusing, but I’ll dive deeper into that later. In some cases you can use either, depending on what you are trying to achieve, while in others only one makes sense. Googling around did not yield many results, which turned out to be the case for the majority of my questions.
  • “Where’s the documentation?”: Diving in headfirst may have been a mistake. I was caught off guard by the steep learning curve and the scarcity of documentation, both on Flink’s website and across the web. Code examples are sparse, and Python examples are even more limited. Even Stack Overflow and Reddit seem short on questions and answers compared to other frameworks and languages I have worked with over the years. Admittedly, coming from a computer science background rather than a data engineering one meant more of a learning curve for me. Couple that with more experience in C# than Java or Python, and I had my work cut out for me.
  • “We’re getting there”: With some more exploration, I must admit I was amazed at Flink’s extensive capabilities in handling state, batching and streaming, fault tolerance, and parallel processing. It seems they have thought of everything you need in a data pipeline while still maintaining speed. Most tools in this arena are missing something: some excel at batch processing, but few handle streaming efficiently, often sacrificing speed for ease of use. So what does Flink sacrifice?

From what my experience has taught me so far, I would say Flink handles all the meat and bones remarkably well. Compared to other tools, it has everything you need without compromising much. Where Flink starts lacking, for me, is the user experience: the UI, documentation, and learning resources all left much to be desired. All of this can be overcome; it seems Flink was built from the ground up to be the best streaming engine around, and I think it succeeded, but it has not yet been given the time it needs to polish the user experience. While it does offer some monitoring features, such as backpressure metrics and the ability to submit jobs, I believe this can be drastically improved, and I look forward to what may be added in future iterations.

My Onboarding Journey With Apache Flink: Step By Step

Install

How did I start? By installing Flink, of course! Flink runs on all UNIX-like environments. Start by downloading Flink and ensuring you have a supported Java version installed (Java 11 for Flink 1.18). There are plenty of installation guides out there, and for me this was the easiest part of using Flink. You can follow the official guide or look online for other tutorials. I opted for the standalone install for my first go-around, running on my Windows machine inside a virtual machine with Ubuntu 22.04. Once you have Flink downloaded, just start the cluster with the “start-cluster.sh” script, and you are ready to submit a job.

Screenshot: the Apache Flink install guide

Design

Next, I experimented with examples in both Java and Python from articles I found online. I ran into challenges stemming from version disparities between Flink 1.18 (my version) and online guides ranging from v1.6 to v1.15. Each version differs: some methods are deprecated, others have moved. This cost me hours of figuring out what still existed and where things had gone. The problem seems worse with PyFlink (Python) than with Flink’s Java API, but both had similar issues.

So what did I try the first time? A bit of everything.

  • Sources: reading from CSV, JSON, a REST API, a socket (stream), collections, and Kafka
  • Sinks: tables, CSV, JSON, and Kafka
  • Installation: Flink on Linux
  • Transformations and more: reduce, map, flat map, filter, join, SQL, windowing, data types, and temp tables
  • Languages: Python (PyFlink) and Java

Transform

Once you figure out the basic layout of sources and sinks, the most challenging part, for me, was transformations; they can get complex and confusing very quickly. Apache Flink has a large feature set, including event-time processing, windowing, and end-to-end exactly-once processing. Figuring out how to string these together while transforming, cleaning, and enriching your data can be tricky, to say the least.
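
As a concrete illustration of one of those building blocks, here is a minimal, pure-Python sketch of how a tumbling event-time window buckets events: each event lands in the window starting at `timestamp // window_size * window_size`, and aggregation happens per (key, window) pair. Flink’s real windowing layers watermarks, triggers, and state backends on top of this same idea; everything named here is illustrative.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # hypothetical window length, e.g. seconds

def window_start(timestamp: int) -> int:
    # Align each timestamp to the start of its tumbling window.
    return (timestamp // WINDOW_SIZE) * WINDOW_SIZE

# (key, window_start) -> count, i.e. a keyed windowed aggregation
counts = defaultdict(int)

events = [("click", 3), ("click", 7), ("view", 9), ("click", 12)]
for key, ts in events:
    counts[(key, window_start(ts))] += 1

print(dict(counts))
# {('click', 0): 2, ('view', 0): 1, ('click', 10): 1}
```

Notice that the event at timestamp 12 falls into a new window; in real Flink, the [0, 10) windows would only be emitted once the watermark passes 10.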

Examples for the simple use cases I was trying to solve seemed lacking, so I thought: why not use AI to help? I tried, but it didn’t work well. I found that ChatGPT 3.5 thinks the newest Flink version is 1.14.3, which is not correct, and it produced many older examples that no longer worked. One interesting finding was that the paid GPT-4 did know the correct Flink version:

ChatGPT 3.5: (screenshot)

ChatGPT 4: (screenshot)

I tried Google Bard and Microsoft Copilot as well. In general, none of them gave exemplary results. Some produced older code with deprecated or relocated libraries, while others got confused between Java and Python. I even encountered outright incorrect code and ridiculously bad formatting from Google Bard. Overall, most of the AI tools I tried helped point me slightly in the right direction but fell short of resolving specific issues. I think Copilot was the most effective, but I’m sure the others will catch up given time and more code examples to learn from.

With limited online assistance and no luck with AI tools, I decided to buy a book and start with the basics. I purchased “Stream Processing with Apache Flink” by Fabian Hueske and Vasiliki Kalavri. The book does a great job laying out basic concepts and use cases, and I recommend it for anyone starting to learn Flink. As I continue reading, I will post more in the future. Two sentences toward the beginning caught my attention:

  • “In addition to these features, Flink is a very developer-friendly framework due to its easy-to-use APIs.”
  • “Layered APIs with varying tradeoffs for expressiveness and ease of use.”

While the book has been extremely helpful so far, these two lines made me chuckle. That has not been the case, for me at least.

Java or Python?

As I started getting my feet wet, I hit my next conundrum: Java or Python? I understand that in the real world a company may already use one of these, making the choice easy. I was free to use either without restraints, but deciding which route to take required some further investigation.

What did I find? Java seemingly wins over Python. Flink is written in Java, so overall support and community are better, and some even suggest the Java API is more performant because PyFlink incurs cross-language overhead. I’m not sure how much overhead we’re talking about, but it’s good to be aware of. Since Flink is written in Java, you get all the shiny new toys first; Python additions always come after. So if the latest changes are relevant to you, keep that in mind.

Java won again when it came to code examples and internet searches for answers: Stack Overflow and search results in general had more Java answers than Python ones. Next, and I think most important, is the difference in the APIs themselves. Java has the full feature set, while Python has a subset (e.g., sending data through a socket or the distinct operation). Finally, Flink’s own documentation has examples in Java/Scala but not always in Python. So if you have the option, my scales tip in favor of Java, and I will leave you with this breakdown I found:

Screenshot: Google Bard’s Java vs. Python breakdown

Conclusion

Embarking on this onboarding journey has been an excellent introduction to Apache Flink and the world of data streaming. I am going to dive into more advanced topics as I work through use cases, and I will write about that experience with code examples in future posts. Two things I will definitely use going forward are the Apache mailing list and Flink’s Slack channel. I found them late in the game, but I have heard from others that they are responsive and helpful, so check them out when seeking assistance.

Overall, Flink stands out as a powerful tool. Compared to other frameworks, libraries, and languages I have used, I think the community could benefit from more real-world examples, Stack Overflow support, and in-depth articles. While the learning curve may be steep, particularly due to gaps in documentation and examples, Flink’s potential becomes apparent with exploration. I look forward to more research and future releases.

What’s next: I will continue to blog about my experiences with Apache Flink.

Feel free to share your comments and ideas! Have you run into any of these issues? What aspects have proven critical or complex in your Flink projects? Let’s continue the conversation!
