UPDATE: see part 2 here.

Apache Flink is a powerful data processing framework, but sometimes you realize that it doesn’t always work the way you want. Maybe you want to adjust an existing feature a bit, and you don’t have time to go through the review process and wait for the next release. Maybe you have a very specific use case. Or maybe you’ve discovered a bug that hasn’t been fixed yet, and you want to fix it ASAP.

The ability to customize Flink in these situations (and in some others) can be invaluable.

The Flink documentation describes how to build Flink from source, but it doesn’t really explain what to do afterwards or how to maintain your fork. In this post, I’ll walk through an end-to-end process for forking and maintaining Apache Flink.

Note: The strategy that I describe here works well if you want to make and keep up to ten updates or so in your fork, occasionally contributing them upstream and syncing the latest changes. It probably won’t work as well for active development, for example, if you want to contribute dozens of changes.

Prerequisites

Make sure you have the following installed (wherever you’re going to build Flink, e.g. locally or in CI):

Branching & Versioning

Let’s discuss versioning and the branching strategy first.

I propose to:

  • Never commit any changes to the master branch (with one exception described below). The main reason for this is to avoid merge conflicts. You want to be able to sync up with upstream anytime.
  • Always create custom branches based on the release branches and only commit custom changes there. For example, you may create a your-company-release-1.18 branch from release-1.18.
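
For example, setting up the remotes and creating such a branch could look like this (just a sketch; the remote names and the your-company prefix follow the conventions used in this post):

sh
# assumes "upstream" points to apache/flink and "origin" points to your fork
git remote add upstream https://github.com/apache/flink.git
git fetch upstream

# create the custom branch from the upstream release branch and push it to your fork
git checkout -b your-company-release-1.18 upstream/release-1.18
git push -u origin your-company-release-1.18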

I also find it useful to append the git SHA to the version string, so you can tell exactly which commit any given build came from. Your final version may look something like this:

1.18.0-your-company-1d6150f

which would be used for the 1d6150f commit in the your-company-release-1.18 branch.

I hope this makes sense so far! You’ll see why this convention is important in the Maintenance section.

Now, let’s actually modify the version in the project files. You should run the following command in the project root, which will apply it to all Maven submodules recursively:

sh
# derive the short commit SHA (e.g. 1d6150f) to embed in the version string
GIT_SHA=$(git rev-parse --short HEAD)
VERSION="1.18.0-your-company-$GIT_SHA"

mvn org.codehaus.mojo:versions-maven-plugin:2.8.1:set -DnewVersion="$VERSION"

Building

Let’s say you have your custom branch ready, you committed at least one custom change, and you set the custom version. Now, we need to:

  • Build all Maven modules and publish them to your artifact repository of choice (like Artifactory).
  • Build a new Docker image that uses your custom build.

Modules

Run the following command to build all modules:

sh
mvn clean install -DskipTests -Pdocs-and-source -Pscala-2.12 -Dscala-2.12

You can ignore the Scala parameters if you don’t use Scala in your projects. NOTE: this command skips tests to speed up the build.

Even without the tests running, it’ll take a while. Grab a beverage of your choice.

Configure your CI environment to publish all of the JARs generated in the local Maven repository (e.g. under /root/.m2/repository). You’ll want to upload the source JARs as well; they’re very useful when developing in an IDE.
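
The exact publishing step depends on your setup, but with Maven it could look roughly like this (a sketch only: the repository id and URL are hypothetical, credentials are expected to come from a matching server entry in your settings.xml, and the id::default::url format is the maven-deploy-plugin 2.x syntax - the 3.x plugin drops the default segment). The altDeploymentRepository property overrides the deployment repository configured in Flink’s Apache parent pom:

sh
# re-runs the build and uploads every module to a hypothetical internal repository
mvn deploy -DskipTests -Pdocs-and-source -Pscala-2.12 -Dscala-2.12 \
  -DaltDeploymentRepository=company-releases::default::https://artifactory.example.com/artifactory/libs-release-local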

Once they’re published, use the new Flink version for all of your Flink dependencies (e.g. 1.18.0-your-company-1d6150f instead of 1.18.0).

Docker Image

We also need to build a new Docker image that uses the custom Flink distribution.

Run the following command to build the Flink distribution package:

sh
mvn clean package -DskipTests -Dscala-2.12 -Prelease -pl flink-dist -am -Dgpg.skip

Also, feel free to ignore the Scala parameter if you don’t use Scala.

Now, let’s build a custom Docker image. Clone this project and find the subfolder for your Flink and Scala versions. I’ll use this one as an example (Flink 1.18, Java 11, Scala 2.12).

Flink Docker images don’t have build args, so you can just use the standard docker build ... command using your build/CI tools. Make sure to use the same version we defined above for naming the image.
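
For example, the image build itself could look like this (a sketch; the registry name is hypothetical, and it assumes you’ve already swapped in your custom distribution using one of the approaches described below):

sh
# run inside the flink-docker subfolder you picked (Flink 1.18, Java 11, Scala 2.12)
VERSION="1.18.0-your-company-1d6150f"

docker build -t registry.example.com/flink:$VERSION .
docker push registry.example.com/flink:$VERSION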

The Flink distribution is located in $FLINK_HOME, usually /opt/flink. There are two ways to replace the default one.

If you build the Flink distribution and the Docker image in one step in CI, you just need to copy the files to /opt/flink replacing the existing ones.

But there’s also a different, multi-stage approach that doesn’t require modifying the Flink Docker images; keeping the distribution archive around can also be great for audit purposes.

You can archive the distribution with a command like this:

sh
tar czf flink-$VERSION-bin-scala_2.12.tgz -C ./flink-dist/target/flink-$VERSION-bin flink-$VERSION

Then set these two environment variables for the docker build step (a sketch of the full flow follows the list):

  • FLINK_TGZ_URL should point to the archive we built above. It’ll be downloaded and extracted into /opt/flink.
  • CHECK_GPG should be set to false; this simplifies the rest of the build process. You can keep it as true if you want to verify the GPG signature, but that process is a bit involved, and I won’t cover it here.
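
Putting it together, the flow could look roughly like this (a sketch: the artifact-storage URL and credentials are hypothetical, and since the image has no build args, the two values above are edited directly in the Dockerfile rather than passed at build time):

sh
# upload the distribution archive somewhere the Docker build can download it from (hypothetical URL)
curl -u "$REPO_USER:$REPO_PASSWORD" -T "flink-$VERSION-bin-scala_2.12.tgz" \
  "https://artifactory.example.com/artifactory/flink-dist/flink-$VERSION-bin-scala_2.12.tgz"

# then, in the flink-docker subfolder, point FLINK_TGZ_URL at that URL and set CHECK_GPG to false
# (they are plain ENV values in the Dockerfile), and build/push the image as shown above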

Congrats! You now have both the custom Flink JAR files and a custom Docker image.

By the way, as you probably noticed, we didn’t run tests anywhere. Flink has A LOT of tests, and running just the unit tests can take a while. You can set up a CI step that runs them for you and trigger it manually before releasing a new version. You can also just run tests locally, but only in the module you modified.
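
For example, assuming you’ve already run the full mvn clean install -DskipTests from the Building section, something like this keeps the feedback loop short (flink-runtime is just an example module name):

sh
# run only the tests of the module you actually changed
mvn test -pl flink-runtime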

Deploying

This part is pretty straightforward!

Use the custom Flink JAR files in your project as dependencies. Build a project JAR and use the custom Flink Docker image to deploy it. You should see the custom version string in the Flink Web UI.
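
For example, if you deploy in application mode and bake the job JAR into the image, the application image could be built roughly like this (a sketch; the image names, the JAR path, and the /opt/flink/usrlib convention are assumptions that depend on your deployment setup):

sh
# hypothetical application image built on top of the custom Flink image
cat > Dockerfile.app <<'EOF'
FROM registry.example.com/flink:1.18.0-your-company-1d6150f
COPY target/my-flink-job.jar /opt/flink/usrlib/my-flink-job.jar
EOF

docker build -f Dockerfile.app -t registry.example.com/my-flink-job:1.0.0 .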

Maintenance

There are a few workflows to cover.

Submitting an upstream contribution

This is fairly easy as long as you keep all custom changes as individual git commits, so one commit = one contribution. You can create a patch file from a commit and then use it to open a PR.
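
For example (the commit hash and the branch name are placeholders):

sh
# turn a single custom commit into a patch file
git format-patch -1 1d6150f

# apply it to a fresh branch created from upstream master and use that branch for the PR
git checkout -b FLINK-XXXXX-my-fix upstream/master
git am 0001-*.patch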

Creating a new custom version

Let’s say Flink 1.19 is out. You’d create a new custom branch (e.g. your-company-release-1.19, based on release-1.19) and then start cherry-picking commits from the your-company-release-1.18 one. Ideally, you’ve contributed some of your changes by then, so you don’t need to cherry-pick all commits - some of them are already upstream.
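
In git terms, it could look like this (the commit hashes are placeholders):

sh
# create the new custom branch from the upstream 1.19 release branch
git fetch upstream
git checkout -b your-company-release-1.19 upstream/release-1.19

# cherry-pick the custom commits that are not upstream yet
git cherry-pick 1d6150f 9f21ab3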

Syncing from upstream

If your fork is a public repo, GitHub makes it very easy - just follow this guide. Since all your custom changes live in dedicated release branches, you won’t run into any conflicts.

If your fork is private, you can use the following script for syncing:

sh
#!/usr/bin/env bash

set -o errexit
set -o pipefail

readonly upstream_remote="upstream"

if ! git remote -v | grep -q "$upstream_remote"; then
  echo "Please add an upstream remote before using this script!"
  exit 1
fi

# Syncing everything from the remote including tags
git fetch --tags "$upstream_remote"

# The magic command that should keep OUR commits on top
git rebase "$upstream_remote/master"

# Pushing updates, tags need a separate command
git push origin HEAD --force
git push origin --tags

This script assumes that you have the upstream repo added as a remote (e.g. git remote add upstream $git_repo). Any time you run it, it pulls the latest changes from upstream and pushes them to your fork (origin). You can actually commit this script and any other build/CI/CD changes to master, and the rebase will always keep them on top of the branch (but don’t try to modify any existing files - this will lead to conflicts).

Outro

And that’s it! As you can see, setting up and maintaining a Flink fork is not that hard. It does require a certain discipline, but, as a result, you’re rewarded with a Flink version that contains exactly the changes YOU want! Isn’t it amazing?

PS: if you still find the process complicated, wait for part 2. It’ll show a different approach that lets you customize Flink without maintaining a fork repo. Stay tuned!

UPDATE: see part 2 here.