What is the SITREP on Apache Kafka & Flink?

Adron

1 year ago

[Image: Apache Kafka and Apache Flink logos in black and white, with a note hoping this lagniappe is useful.]

I’ve worked with (** see the note at the end of the article) a number of Apache projects over the years, often pretty closely: Apache Cassandra, Apache Flink, Apache Kafka, Apache Zookeeper, and numerous others. But for the last few years I haven’t been hands-on with the technology. A few questions popped up recently that, fortunately, I was able to answer from existing knowledge, but it made me real curious about what the SITREP (situation report) is for the Apache Kafka and Flink projects TODAY, i.e. rolling into 2025! The following is a quick dive into the history and then the latest details (and drama?) of Apache Kafka, Flink, and, tangentially, some other projects (Zookeeper?).

Apache Projects – Context & Quick Details

If you’re unfamiliar with the Apache Projects in a general sense, I highly suggest going and checking out the Apache Project Directory and Apache Projects List. There you will find all sorts of fascinating information about the organization itself, how the projects are organized, and the trend of committees and related details. For example, I always love checking out the initial charts on retired and active that show on the directory page, as I’ve snapshotted below.

Another favorite chart is this one, which I’ll write some thoughts on in the very near future.

Tangential Projects

When checking up on Kafka & Flink and getting back up to speed, I’d be remiss if I didn’t include some of the key tangential projects. Here is a list with links to the important ones. They are related in the sense that they are used either directly or tangentially with Kafka or Flink. (Note: If I missed any, please leave a comment on the post and I’ll add it to this list!)

Projects Related to Apache Kafka

Apache Zookeeper
- Role: Originally used for managing Kafka’s metadata and ensuring distributed consensus.
- Relationship: Kafka is moving toward a Zookeeper-less architecture with the KRaft protocol, but Zookeeper remains important for legacy deployments.
Apache Kafka Connect
- Role: A framework for integrating Kafka with external systems (databases, cloud services, etc.).
- Relationship: Extends Kafka’s usability by enabling easy ingestion and egress of data.
Apache Avro
- Role: A data serialization system often used with Kafka for schema management.
- Relationship: Enables efficient serialization and deserialization of messages sent over Kafka topics.
Apache Camel
- Role: An integration framework that provides numerous connectors.
- Relationship: Used for integrating Kafka with diverse endpoints and implementing routing logic.
Apache Storm
- Role: A stream processing system.
- Relationship: While Flink has overtaken Storm in many scenarios, Storm can still work alongside Kafka in legacy pipelines.
Apache Samza
- Role: A stream processing framework developed by LinkedIn.
- Relationship: Like Flink, Samza can process Kafka streams but has a more niche adoption.
Apache NiFi
- Role: A data flow management system.
- Relationship: Frequently used with Kafka to design, schedule, and monitor complex data flows.

Projects Related to Apache Flink

Apache Hadoop
- Role: A distributed storage and processing framework.
- Relationship: Flink can read and write data from/to Hadoop Distributed File System (HDFS) for batch and stream processing.
Apache Beam
- Role: A unified programming model for defining batch and stream processing pipelines.
- Relationship: Flink serves as one of Beam’s execution engines, enabling portability across environments.
Apache Pulsar
- Role: A distributed messaging system and Kafka alternative.
- Relationship: Pulsar can act as an event source for Flink, much like Kafka, and supports Flink’s connectors.
Apache Cassandra
- Role: A distributed NoSQL database.
- Relationship: Frequently used as a sink for Flink’s stream processing jobs or as part of a pipeline originating in Kafka.
Apache Hudi
- Role: A data lake framework for managing datasets stored in HDFS or cloud storage.
- Relationship: Works well with Flink for streaming updates and incremental processing in a data lake.
Apache Iceberg
- Role: A table format for petabyte-scale analytics.
- Relationship: Flink integrates with Iceberg for performing stream processing on large datasets.
Apache Druid
- Role: A real-time analytics database.
- Relationship: Often paired with Flink for real-time ingestion and with Kafka for data streaming into analytics workloads.
Apache Kafka Streams
- Role: A stream processing library for Kafka.
- Relationship: Often compared to Flink. Kafka Streams is the simpler alternative for stream processing, while Flink is more powerful for complex use cases.

Projects Bridging Kafka and Flink

Apache Kafka Connectors for Flink
- Flink provides native connectors for reading from and writing to Kafka, enabling seamless integration between the two systems.
Apache Flink Stateful Functions (StateFun)
- A framework for building stateful applications with event-driven functions.
- Relationship: Integrates well with Kafka for triggering stateful computations and workflows.
Apache Pinot
- Role: A real-time distributed OLAP datastore.
- Relationship: Kafka provides real-time ingestion, while Flink processes and transforms data before storing it in Pinot for analytics.

Complementary Tools

Schema Management:
- Apache Avro for defining and sharing schemas, alongside Apache Parquet (columnar file format) and Apache Arrow (in-memory columnar format) for the data moving through Kafka and Flink pipelines.
Orchestration and Management:
- Apache Airflow for orchestrating workflows that include Kafka and Flink jobs.
- Apache Oozie for managing workflows, especially in Hadoop environments.
Observability:
- Apache SkyWalking for monitoring and analyzing the performance of distributed systems, including Kafka and Flink.

The Latest News & History of Apache Kafka & Apache Flink

Now for getting up to speed with the latest news on where Apache Flink and Apache Kafka are. Hopefully this section of the article provides some insight for those looking to get into, or get back into, using or developing for or against these technologies. The idea is to give you some context on where the tools have been in their development path, where they are now, and where they’re headed.

2010
- Apache Flink: Initiated as the “Stratosphere” research project, a collaboration between Technische Universität Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
2011
- Apache Kafka: Developed at LinkedIn to address scalability issues with real-time data feeds and open-sourced.
2012
- Apache Kafka: Graduated from the Apache Incubator on October 23 and became a top-level project.
2014
- Apache Flink: Forked from Stratosphere’s distributed execution engine and entered the Apache Incubator in March.
- Apache Flink: Graduated as a top-level Apache project in December.
2015
- Apache Flink: Released version 0.9.0 on June 24 and version 0.10.0 on November 16.
2016
- Apache Flink: Released version 1.0.0 on March 8, introducing backward compatibility for APIs. Released version 1.1.0 on August 8.
2017
- Apache Flink: Released versions 1.2.0 (February 6), 1.3.0 (June 1), and 1.4.0 (December 12).
- Apache Kafka: Released version 1.0.0, marking its maturity as a platform.
2018
- Apache Flink: Released versions 1.5.0 (May 25), 1.6.0 (August 8), and 1.7.0 (November 30).
2019
- Apache Flink: Released versions 1.8.0 (April 9) and 1.9.0 (August 22).
2020
- Apache Flink: Released versions 1.10.0 (February 11), 1.11.0 (July 6), and 1.12.0 (December 10).
2021
- Apache Flink: Released versions 1.13.0 (May 3) and 1.14.0 (September 29).
2022
- Apache Flink: Released versions 1.15.0 (May 5) and 1.16.0 (October 28).
2023
- Apache Flink: Released versions 1.17.0 (March 23) and 1.18.0 (October 24).
2024
- Apache Flink: Released version 1.19.0 in March 2024 and announced Flink 2.0, with a disaggregated storage architecture for cloud-native capabilities.
- Apache Kafka: The community continues development toward Kafka 4.0, with plans to make KRaft mode the default and remove ZooKeeper dependencies.
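
As a quick bit of arithmetic on the timeline above, here is a minimal Python sketch that works out how far apart the Flink 1.x releases landed. The dates are copied straight from the list above (1.0.0 through 1.18.0), and the names `FLINK_RELEASES` and `release_gaps` are just ones I made up for illustration. Treat the output as a rough feel for cadence, not an official project metric.

```python
from datetime import date
from statistics import mean

# Flink 1.x release dates exactly as listed in the timeline above.
FLINK_RELEASES = [
    ("1.0.0", date(2016, 3, 8)),
    ("1.1.0", date(2016, 8, 8)),
    ("1.2.0", date(2017, 2, 6)),
    ("1.3.0", date(2017, 6, 1)),
    ("1.4.0", date(2017, 12, 12)),
    ("1.5.0", date(2018, 5, 25)),
    ("1.6.0", date(2018, 8, 8)),
    ("1.7.0", date(2018, 11, 30)),
    ("1.8.0", date(2019, 4, 9)),
    ("1.9.0", date(2019, 8, 22)),
    ("1.10.0", date(2020, 2, 11)),
    ("1.11.0", date(2020, 7, 6)),
    ("1.12.0", date(2020, 12, 10)),
    ("1.13.0", date(2021, 5, 3)),
    ("1.14.0", date(2021, 9, 29)),
    ("1.15.0", date(2022, 5, 5)),
    ("1.16.0", date(2022, 10, 28)),
    ("1.17.0", date(2023, 3, 23)),
    ("1.18.0", date(2023, 10, 24)),
]


def release_gaps(releases):
    """Return (from_version, to_version, days) for each consecutive pair."""
    return [
        (prev_version, next_version, (next_date - prev_date).days)
        for (prev_version, prev_date), (next_version, next_date)
        in zip(releases, releases[1:])
    ]


gaps = release_gaps(FLINK_RELEASES)
days = [gap_days for _, _, gap_days in gaps]

# Shortest and longest waits between two consecutive minor releases.
shortest = min(gaps, key=lambda gap: gap[2])
longest = max(gaps, key=lambda gap: gap[2])

print(f"{len(gaps)} gaps, mean {mean(days):.1f} days")
print(f"shortest: {shortest[0]} -> {shortest[1]} ({shortest[2]} days)")
print(f"longest:  {longest[0]} -> {longest[1]} ({longest[2]} days)")
```

With these dates the average works out to roughly five months between minor releases, which is a handy rule of thumb when planning upgrades.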

2024 Highlights

In 2024, the data streaming ecosystem witnessed significant advancements, particularly surrounding Apache Flink and Apache Kafka. These developments encompassed major software releases, influential conferences, emerging trends, and notable industry movements.

Major Software Releases

Apache Flink 2.0 Preview: Marking its first major version change in eight years, the Apache Flink community unveiled a preview of Flink 2.0 in October 2024. This release introduced innovative features and improvements, setting the stage for the future of stream processing.
Confluent’s Integration of Flink and Iceberg: In March 2024, Confluent enhanced its hosted Kafka service by integrating Apache Flink and Apache Iceberg. This integration aimed to provide developers with robust tools for real-time data processing and analytics.

Conferences and Events

Flink Forward Berlin 2024: Celebrating the 10-year anniversary of Apache Flink, this conference took place from October 21-24, 2024. It featured speakers from industry leaders such as Apple, Alibaba, Mercedes-Benz, and Uber, highlighting Flink’s impact across various sectors.
Kafka Summit London 2024: Held in March 2024, this summit attracted over 3,500 attendees, both in-person and online. The event showcased the latest developments in Apache Kafka, with discussions on its integration with Apache Flink and the evolving data streaming landscape.
Current 2024: On September 17-18, 2024, Austin, Texas, hosted Current 2024, bringing together the data streaming community to discuss advancements in Apache Kafka and Apache Flink. The event featured insights from tech leaders and industry giants on real-time event streaming.

Emerging Trends

The year saw several key trends shaping the data streaming domain:

Data Sharing and Data Contracts: Emphasis on data sharing across business units and the implementation of data contracts became prevalent, aiming to enhance data governance and policy enforcement.
Serverless Stream Processing: The adoption of serverless architectures for stream processing gained traction, offering scalability and ease of deployment for real-time applications.
Integration with Generative AI: The convergence of data streaming platforms with Generative AI technologies, such as large language models, opened new avenues for intelligent data processing and analytics.

Industry Movements

Confluent’s Growth: Confluent, a key player in the data streaming industry, reported a 400% earnings growth in Q3 2024, with sales rising to $250.2 million. The company’s partnerships with tech giants like Amazon, Microsoft, and Google underscored its pivotal role in facilitating real-time data streaming and AI applications.

Stable and Development Versions

Apache Flink:
- Latest Stable Version: 1.20.0, released August 2, 2024.
- Development Version: Flink 2.0 (preview release announced October 23, 2024).
Apache Kafka:
- Latest Stable Version: 3.8.1, released November 2024.
- Development Version: Kafka 4.0 (under active development).

References & Article Addendums

I added the ** at the beginning of this article to be specific: I have not directly (emphasis on directly) contributed to the open source code bases of these Apache projects. I have, however, used, deployed, and even consulted on and discussed the architecture and design of the projects over the years. One of these days I may sling some code into a repository or two!