Apache Flink is an open-source distributed stream processing framework designed for high-throughput, low-latency, and fault-tolerant real-time data processing. It natively supports both stream processing (unbounded data) and batch processing (bounded data), treating batch as a special case of streaming—often summarized as “batch is a bounded stream.”
Key Features
1. Unified Stream and Batch Processing
- Flink provides a single runtime for both streaming and batch workloads.
- The DataStream API handles unbounded (streaming) and bounded (batch) data uniformly.
- The legacy DataSet API (for batch) is deprecated; modern Flink favors DataStream API or Table API / SQL for all use cases.
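To pick the execution mode explicitly, a minimal sketch (class name and sample data are illustrative):

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING (default), BATCH, or AUTOMATIC (picks BATCH when all sources are bounded).
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        // The pipeline code is identical in either mode; only scheduling and
        // shuffle behavior differ under the hood.
        env.fromElements("flink", "unifies", "batch", "and", "streaming")
           .map(String::toUpperCase)
           .print();

        env.execute("execution-mode-sketch");
    }
}
```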
2. Event Time Semantics & Watermarks
- Processes events based on event time (when the event actually occurred), not processing time.
- Uses watermarks to handle out-of-order events and manage late data in windowed computations.
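For illustration, a minimal event-time sketch using bounded-out-of-orderness watermarks; the sensor tuples, lateness bound, and window size are made-up values:

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (sensorId, eventTimestampMillis, value) -- toy in-memory data standing in for a real source.
        DataStream<Tuple3<String, Long, Double>> readings = env.fromElements(
                Tuple3.of("sensor-1", 1_000L, 21.5),
                Tuple3.of("sensor-1", 7_000L, 22.0),
                Tuple3.of("sensor-1", 4_000L, 23.1)); // arrives out of order

        readings
            // Tolerate events arriving up to 5 seconds out of order.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Double>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((reading, recordTs) -> reading.f1))
            .keyBy(reading -> reading.f0)
            // 1-minute tumbling windows evaluated on event time and closed by watermarks.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .maxBy(2) // max reading value per sensor per window
            .print();

        env.execute("event-time-sketch");
    }
}
```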
3. Stateful Computations
- Supports rich stateful operations with:
  - Keyed State: Scoped per key (e.g., per user ID).
  - Operator State: Scoped per operator instance.
- State can be stored in memory or on disk (e.g., via RocksDB backend) for large-scale applications.
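A minimal keyed-state sketch (class and state names are illustrative): a running sum per key kept in `ValueState`, which Flink snapshots and restores together with the rest of the job:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key; must be applied after keyBy(...) so state is scoped per key.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        // The descriptor names the state so it can be checkpointed and restored.
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("running-sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long updated = (sum.value() == null ? 0L : sum.value()) + in.f1;
        sum.update(updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}

// Usage: events.keyBy(e -> e.f0).flatMap(new RunningSum());
```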
4. Fault Tolerance with Exactly-Once Guarantees
- Implements distributed snapshots (checkpoints) using asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm.
- Provides exactly-once processing semantics—even across failures.
- Supports end-to-end exactly-once when used with compatible sources/sinks (e.g., Kafka with transactional writes).
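A sketch of enabling exactly-once checkpointing and wiring a transactional Kafka sink; the broker address, topic, and intervals are placeholders, and the flink-connector-kafka dependency is assumed to be on the classpath:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot state every 60 s; on failure, Flink restores the latest completed
        // checkpoint and replays from the recorded source offsets.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // End-to-end exactly-once: the sink writes in Kafka transactions that are
        // committed only when the corresponding checkpoint completes.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")                 // placeholder broker
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                   // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("my-flink-job")           // required for EXACTLY_ONCE
                .build();

        env.fromElements("event-1", "event-2").sinkTo(sink);
        env.execute("exactly-once-sketch");
    }
}
```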
5. High Performance
- Low latency (milliseconds) and high throughput (millions of events per second).
- Pipelined execution engine avoids batch-style scheduling overhead.
Core Architecture
| Component | Role |
|---|---|
| JobManager | Master node: schedules tasks, coordinates checkpoints, handles failures. |
| TaskManager | Worker node: executes tasks, manages task slots (resource units). |
| Client | Submits Flink jobs to the cluster (e.g., via `flink run`). |
| State Backend | Determines how state is stored and accessed, e.g., `HashMapStateBackend` or `EmbeddedRocksDBStateBackend` (which replace the older `MemoryStateBackend`/`FsStateBackend`/`RocksDBStateBackend`); see the sketch below the table. |
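For example, a configuration sketch choosing the RocksDB backend in code; the checkpoint path is a placeholder, and `EmbeddedRocksDBStateBackend` is the post-1.13 name of the RocksDB backend:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep working state in RocksDB on local disk so it can grow beyond memory;
        // incremental checkpoints upload only the changed files.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Durable location for checkpoint snapshots (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");
    }
}
```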
APIs
- DataStream API
  - For stream (and batch) processing in Java/Scala/Python.
  - Supports windowing, time semantics, state, and custom functions.
- Table API & SQL
  - Declarative, high-level API for both streaming and batch (see the SQL sketch after this list).
  - Standard SQL support (based on Apache Calcite); integrates with connectors (e.g., Kafka, JDBC) and catalogs (e.g., Hive).
- ProcessFunction
  - Low-level API for fine-grained control over time and state, e.g., event-time timers (see the timer sketch after this list).
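As an illustration of the Table API / SQL layer, a windowed aggregation defined entirely in SQL over a Kafka-backed table; the table name, topic, broker, and fields are hypothetical:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Declare a streaming table over a Kafka topic, including an event-time watermark.
        tEnv.executeSql(
            "CREATE TABLE orders (" +
            "  order_id   STRING," +
            "  amount     DOUBLE," +
            "  order_time TIMESTAMP(3)," +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'broker:9092'," +
            "  'format' = 'json'" +
            ")");

        // 1-minute tumbling-window revenue, expressed declaratively.
        tEnv.executeSql(
            "SELECT window_start, SUM(amount) AS revenue " +
            "FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```

And a minimal `KeyedProcessFunction` sketch with an event-time timer; the inactivity-alert use case and types are illustrative, and a production version would also track and clear earlier timers in state:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (userId, eventTimestampMillis). Emits an alert 10 minutes of event time after each event.
public class InactivityAlert extends KeyedProcessFunction<String, Tuple2<String, Long>, String> {

    private static final long TIMEOUT_MS = 10 * 60 * 1000;

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<String> out) {
        // Register a timer that fires once the watermark passes event time + timeout.
        ctx.timerService().registerEventTimeTimer(event.f1 + TIMEOUT_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect("Key " + ctx.getCurrentKey() + " inactive since " + (timestamp - TIMEOUT_MS));
    }
}
```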
Use Cases
- Real-time analytics (e.g., dashboards, monitoring)
- Fraud detection
- Recommendation engines
- ETL / data pipelines (with exactly-once guarantees)
- IoT and sensor data processing
Ecosystem Integration
- Sources/Sinks: Kafka, Kinesis, Pulsar, HDFS, S3, JDBC, Elasticsearch, etc. (a Kafka source sketch follows below).
- Deployment: Standalone, YARN, Kubernetes, Docker, cloud (AWS, GCP, Azure).
- Monitoring: Prometheus, Grafana, Flink Web UI.
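For example, a sketch of reading from Kafka with the unified KafkaSource connector; the broker, topic, and group id are placeholders, and the flink-connector-kafka dependency is assumed to be available:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")        // placeholder broker
                .setTopics("input-topic")                  // placeholder topic
                .setGroupId("flink-consumer")              // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // No watermarks here; see the event-time example above for watermarked sources.
        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        lines.print();
        env.execute("kafka-source-sketch");
    }
}
```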