An overview of ksqlDB

What is ksqlDB?

ksqlDB is a database that's purpose-built for stream processing applications.

What exactly does that mean? It consolidates the many components found in virtually every stream processing architecture.

That is important because almost all streaming architectures today are piecemeal solutions cobbled together from different projects. At a minimum, you need a subsystem to acquire events from existing data sources, another to store them, another to process them, and another to serve queries against aggregated materializations. Integrating each subsystem can be difficult. Each has its own mental model. And it’s easy to wonder, given all this complexity: Is this all worth it?

What if it were as easy to build a stream processing app as it is to build a CRUD app? That's what we wondered.

ksqlDB aims to provide one mental model for doing everything you need. You can build a complete streaming app against ksqlDB, which in turn has just one dependency: Apache Kafka®.

Why use ksqlDB?

These eventful times.

The concepts of events and event-driven programming are fundamental to the development of modern software. Event-centric thinking unlocks a host of new design techniques and applications that were previously out of reach.

Apache Kafka has emerged as the de facto standard for working with events, offering the abstraction of a durable, immutable, append-only log. But working with streams of events directly in Kafka can be a bit low-level. In some ways, Kafka is like a file system for your events.

Streams need a friend.

To be productive, you need some way of deriving new streams of events from existing ones. Stream processing scratches that itch: it's a programming paradigm that lets you work with groups of events as if they were in-memory collections.

Most stream processing frameworks offer a directed acyclic graph (DAG) API to express these programs. Some frameworks raise the abstraction a step higher by exposing a SQL API.
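
To give a flavor of that SQL abstraction, here is a minimal sketch in ksqlDB's SQL dialect. The orders stream and its columns are hypothetical, chosen only for illustration:

```sql
-- Derive a new stream of events from an existing one.
-- The 'orders' stream and its columns are hypothetical.
CREATE STREAM large_orders AS
  SELECT orderId, customerId, total
  FROM orders
  WHERE total > 1000
  EMIT CHANGES;
```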

An upside-down world, or right-side up?

If you’re new to the world of stream processing, everything might seem upside-down. In a traditional database, you execute queries that run from scratch on every invocation and return their complete result set. You can think of it like this: queries have an active role, while data has a passive role. In the traditional world, data waits to be acted upon.

With stream processing, this flow of control is inverted: the data becomes active and flows through a perpetually running query. Each time new data arrives, the query incrementally updates its result, efficiently recomputing the current state of the world.
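
As a sketch of what that looks like in ksqlDB, the following persistent query maintains a running count that is revised incrementally as each event arrives. The pageviews stream and its columns are hypothetical:

```sql
-- A perpetually running query: the per-user count is updated
-- incrementally each time a new event arrives.
CREATE TABLE pageviews_per_user AS
  SELECT userId, COUNT(*) AS views
  FROM pageviews
  GROUP BY userId
  EMIT CHANGES;
```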

This is a fully asynchronous flow, and it is where Kafka truly shines. Consider what it would be like to attempt this with a traditional database: you could query the data, but you would have little native ability to process streams of data as they arrive, or to subscribe to changes as they occur.
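
In ksqlDB, subscribing to changes is expressed as a push query. This minimal sketch reuses the hypothetical table from above:

```sql
-- A push query: results stream to the client as changes occur,
-- rather than returning once and terminating.
SELECT userId, views
FROM pageviews_per_user
EMIT CHANGES;
```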

Too many cooks in the kitchen.

Yet stream processing by itself isn't enough to build an app. In the real world, you need other supporting pieces. You need to source events from external systems into Kafka. You need to process them with a stream processor. And finally, because many applications still need synchronous access, you need to pour the results into an external database so that an application can query them on demand. After all, asynchronicity isn't a perfect fit for every design.

In practice, a non-trivial deployment represents a large number of moving parts: often 3 to 5 separate distributed systems. The parts don't fit together as well as you'd hope; all of these systems are complex and each integration is a small project to figure out. It's like trying to build a car out of parts, but the parts come from different manufacturers who don't talk to each other.

Moreover, each system has a separate model for securing, scaling, and monitoring. Making any change requires thinking in several mental models at every step.

Divide and conquer.

Although this larger system can look big and scary, it's less intimidating when broken down into its core functions. In many ways, an event streaming architecture is simply a more capable ETL system: events are extracted from source systems and stored durably; they're transformed and aggregated into new shapes; and they're loaded into an external database, which serves queries that look up information for applications.

Each component ended up as a separate system because each evolved independently, as the streaming-native solution to what had traditionally been treated as a batch problem.

Cut from the same mold.

As the streaming world grew up, robust solutions for each piece of the puzzle emerged. Kafka Connect supports a vast range of connectors to source and sink data to external systems. Kafka itself became a progressively more capable storage layer for events. Kafka Streams and KSQL developed powerful runtimes and APIs for processing events. And, deep inside of Kafka Streams, the concept of a state store grew from humble beginnings into something that could be synchronously queried like a database table.

Reunited.

When you look at all of these pieces, it's easy to question whether this is all worth it. We think that building stream processing applications should be as easy as building CRUD applications on top of a regular database. That's why we're building ksqlDB, a database that's purpose-built for stream processing applications.

The idea is that it can offer one SQL query language to work with both streams of events (asynchronous pushes) and point-in-time state (synchronous pulls). It requires no separate clusters for event capture, processing, or query serving, meaning it has very few moving parts. There is one mental model for working across the entire stack, and scaling, monitoring, and security can follow a uniform approach.
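
Concretely, the split looks something like the following sketch, again with hypothetical names. A push query subscribes to changes as they happen, while a pull query fetches the current state and returns, much like a lookup against a traditional database:

```sql
-- Push query (asynchronous): subscribe to changes as they occur.
SELECT * FROM pageviews_per_user EMIT CHANGES;

-- Pull query (synchronous): fetch the current value for a key and return.
SELECT * FROM pageviews_per_user WHERE userId = 'user_42';
```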

With ksqlDB, you can build a complete real-time application with just a small set of SQL statements. Things can be simple again.
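
To make that concrete, here is a sketch of what such an application might look like end to end. The connector class, connection properties, topic, and all names are hypothetical placeholders rather than a drop-in configuration:

```sql
-- 1. Acquire events: stream rows from an external database into Kafka.
--    (Connector class and connection details are hypothetical and abridged;
--    a real connector needs additional properties.)
CREATE SOURCE CONNECTOR clicks_source WITH (
  'connector.class'   = 'io.debezium.connector.postgresql.PostgresConnector',
  'database.hostname' = 'db.example.com',
  'database.user'     = 'ksqldb',
  'database.password' = 'secret',
  'database.dbname'   = 'app'
);

-- 2. Declare a stream over the resulting topic.
CREATE STREAM clicks (userId VARCHAR, url VARCHAR)
  WITH (kafka_topic = 'clicks', value_format = 'json');

-- 3. Transform: materialize an incrementally maintained aggregate.
CREATE TABLE clicks_per_user AS
  SELECT userId, COUNT(*) AS total_clicks
  FROM clicks
  GROUP BY userId
  EMIT CHANGES;

-- 4. Serve: applications fetch current state with a pull query.
SELECT * FROM clicks_per_user WHERE userId = 'user_42';
```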