Stateful stream processing is not a new concept, but the approaches and best practices around it are still evolving and far from straightforward. The state itself can be represented in a variety of ways. I’ve recently spent quite a bit of time learning and building stream processing pipelines that use a particular type of state, and I’d love to share more thoughts on the topic.
I’ve been using Kafka Connect for a few years now, but until recently I never paid much attention to Single Message Transformations (SMTs). SMTs are simple transforms applied to individual messages as they pass through Connect: right after a source connector produces them, or just before a sink connector receives them. They can drop a field, rename a field, add a timestamp, etc.
I always thought that any kind of transformation should be done in a processing layer (for example, Kafka Streams) before hitting the integration layer (Kafka Connect). However, my recent experience configuring an Elasticsearch sink connector proved me wrong! Complex transformations should definitely be handled outside of Connect, but SMTs can be quite handy for simple enrichment and routing!
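To make this concrete, here is a sketch of what such a connector config might look like, trimmed to the SMT-relevant keys (the connector name, topic, and field names are made up; `ReplaceField` and `InsertField` are transforms that ship with Kafka Connect, and `blacklist` is the field-exclusion option used in the Connect versions of that era):

```json
{
  "name": "es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "events",
    "transforms": "dropDebug,addTimestamp",
    "transforms.dropDebug.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.dropDebug.blacklist": "debug_info",
    "transforms.addTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.addTimestamp.timestamp.field": "ingested_at"
  }
}
```

Two small transforms, zero extra services: the `debug_info` field never reaches Elasticsearch, and every document gets an `ingested_at` timestamp.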
A few months ago, I finished reading Flow: The Psychology of Optimal Experience by Mihaly Csikszentmihalyi. Wow, what a fantastic book!
Have you ever done something that made you focus so intensely that you forgot about your surroundings, lost track of time, and just enjoyed what you were doing? This is known as Flow.
Kafka Streams is an advanced stream-processing library with a high-level, intuitive DSL and a great set of features, including exactly-once processing semantics, reliable stateful event-time processing, and more.
Naturally, after completing a few basic tutorials and examples, a question arises: how should I structure an application for a real, production use case? The answer depends heavily on your problem; however, I feel there are a few very useful patterns that apply to pretty much any application.
Kafka Connect is a modern open-source Enterprise Integration Framework that leverages the Apache Kafka ecosystem. With Connect you get access to dozens of connectors that can move data between Kafka and various external systems (S3, JDBC databases, Elasticsearch, etc.).
Kafka Connect provides a REST API for managing connectors. It supports operations such as describing, adding, modifying, pausing, resuming, and deleting connectors.
Using the REST API to manage connectors can become tedious, especially when you have to deal with dozens of different connectors. Although it’s possible to use web UI tools like lensesio/kafka-connect-ui, it makes sense to follow basic deployment principles: config management, version control, CI/CD, etc. In other words, it’s perfectly fine to start with manual, ad-hoc REST API calls, but ultimately any large Kafka Connect cluster needs some kind of automation for deploying connectors.
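A minimal sketch of what such automation can boil down to, assuming a Connect worker on its default REST port and connector configs kept in version control (the names and config values here are made up): the REST API’s `PUT /connectors/{name}/config` endpoint creates the connector if it doesn’t exist and updates it if it does, which makes deployments idempotent and easy to wire into CI/CD.

```python
import json
import urllib.request

# Assumption: a Connect worker reachable on the default REST port.
CONNECT_URL = "http://localhost:8083"

def build_upsert_request(name: str, config: dict) -> urllib.request.Request:
    """Build an idempotent create-or-update request for a connector.

    PUT /connectors/{name}/config creates the connector when it's missing
    and reconfigures it when it exists, so re-running a deploy is safe.
    """
    return urllib.request.Request(
        url=f"{CONNECT_URL}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def deploy(name: str, config: dict) -> None:
    # In a CI/CD job this is the whole deploy step:
    # read the config from the repo, PUT it to the cluster.
    with urllib.request.urlopen(build_upsert_request(name, config)) as resp:
        print(name, resp.status)
```

With a loop over the config files in a repo, this one endpoint replaces all the ad-hoc `curl` calls: the repo becomes the source of truth, and the script converges the cluster to it.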
I want to describe the approach that my team uses to make Connect management simple and reliable.
I wrote a guest blog post for Qubole.
Once again I’m writing a “Year in Review” post, mostly focused on professional life and tech. You can check out the 2017 edition here.
For the last two years I’ve been working with Apache Kafka a lot: building infrastructure (running clusters on bare metal, in VMs, and in containers), improving monitoring and alerting, developing consumers, producers, and stream processors, tuning, maintenance, and so on. So I consider myself a very proficient user.
Still, in all these years I never had a chance to read the ultimate “Kafka: The Definitive Guide” book. I finally got a copy at Strata NYC earlier this year and finished it about a month ago. Surprisingly, while reading it I left a lot of bookmarks and notes for myself that might be useful for beginners as well as experienced users. Obviously, they’re very subjective and specific.
It’s Q4 of 2018, and it’s really interesting to observe the changes in the Big Data landscape, especially around open-source frameworks and tools. Yes, it’s still very fragmented, but the actual solutions and architectures are slowly starting to converge.
Right now I’m at the beginning of a huge platform redesign at work. We always talk about various frameworks and libraries (which are really just implementation details), but I started to wonder: what qualities should modern data pipelines have going forward? The list I came up with is below.
An interesting observation based on recent conversations at work: the stricter the data format used in an API definition, the harder it is to change the API’s behaviour later. And vice versa: APIs that use flexible data formats are easier to change.
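A toy illustration of that asymmetry (the field names and payload are made up): a consumer that strictly validates every payload against the declared contract breaks the moment the producer adds a field, while a lenient consumer absorbs the change.

```python
import json

# Hypothetical v1 API contract: only these fields are declared.
KNOWN_FIELDS = {"id", "amount"}

def parse_strict(payload: str) -> dict:
    """Reject any payload carrying fields outside the declared contract."""
    event = json.loads(payload)
    unknown = set(event) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return event

def parse_flexible(payload: str) -> dict:
    """Keep the fields we understand, silently ignore the rest."""
    event = json.loads(payload)
    return {k: v for k, v in event.items() if k in KNOWN_FIELDS}

# The producer evolves the API by adding a "currency" field:
v2_payload = '{"id": 1, "amount": 9.5, "currency": "USD"}'
```

Here `parse_flexible` keeps working on the v2 payload, while `parse_strict` has to be redeployed in lockstep with the producer. That lockstep is exactly what makes strict formats harder to evolve; explicit schema-compatibility rules are the usual middle ground between the two extremes.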