UPDATE: see the part 2 here.
Apache Flink is a powerful data processing framework, but sometimes you realize that it doesn’t always work the way you want. Maybe you want to adjust an existing feature a bit, and you don’t have time to go through the review process and wait for the next release. Maybe you have a very specific use case. Or maybe you’ve discovered a bug that hasn’t been fixed yet, and you want to fix it ASAP.
The ability to customize Flink in these situations (and in some others) can be invaluable.
Flink documentation describes how to build Flink from source, but it doesn’t really explain what to do after or how to maintain your fork. In this post, I’ll explain an end-to-end process for forking and maintaining Apache Flink.
Note: The strategy that I describe here works well if you want to make and keep up to ten updates or so in your fork, occasionally contributing them upstream and syncing the latest changes. It probably won’t work as well for active development, for example, if you want to contribute dozens of changes.
Make sure you have the following installed (wherever you’re going to build Flink, e.g. locally and in CI):
- Java 8 and Maven 3.8.6, as of Flink 1.18. Even though Flink now supports Java 11 (and even has experimental support for Java 17), Java 8 and a specific version of Maven are needed. See this for installation instructions and this for getting an older version.
Let’s discuss versioning and the branching strategy first.
I propose to:
- Never commit any changes to the master branch (with one exception described below). The main reason for this is to avoid merge conflicts. You want to be able to sync up with upstream anytime.
- Always create custom branches based on the release branches and only commit custom changes there. For example, you may create a `1.18.0-your-company` branch based on the `release-1.18` branch.
I also find it useful to append a git sha to the version string, so you can differentiate every commit. Your final version may look something like this: `1.18.0-your-company-1d6150f`, which will be used for the `1d6150f` commit in the `1.18.0-your-company` branch.
I hope this makes sense so far! You’ll see why this convention is important in the sections below.
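As a small sketch of that convention, the version string can be assembled from the base release, a company suffix, and the short git sha of a commit (the helper and the `your-company` suffix are placeholders, not anything Flink defines):

```shell
# Sketch: build the custom version string; "your-company" is a placeholder.
# In a real checkout, the sha would come from: git rev-parse --short HEAD
make_version() {
  local base="$1" suffix="$2" sha="$3"
  echo "${base}-${suffix}-${sha}"
}

VERSION=$(make_version "1.18.0" "your-company" "1d6150f")
echo "$VERSION"   # → 1.18.0-your-company-1d6150f
```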
Now, let’s actually modify the version in the project files. You should run the following command in the project root, which will apply it to all Maven submodules recursively:

```sh
mvn org.codehaus.mojo:versions-maven-plugin:2.8.1:set -DnewVersion=$VERSION
```
Let’s say you have your custom branch ready, you committed at least one custom change, and you set the custom version. Now, we need to:
- Build all Maven modules and publish them to your artifact repository of choice (like Artifactory).
- Build a new Docker image that uses your custom build.
Run the following command to build all modules:

```sh
mvn clean install -DskipTests -Pdocs-and-source -Pscala-2.12 -Dscala-2.12
```
You can ignore the Scala parameters if you don’t use Scala in your projects. NOTE: this command skips tests to speed up the build.
Even without the tests running, it’ll take a while. Grab a beverage of your choice.
Configure your CI environment to publish all generated JARs from the local Maven repository (e.g. from `/root/.m2/repository`). You’ll want to upload the JARs with sources; they’re very useful when developing in an IDE.
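One hedged option for the publishing step is Maven’s `altDeploymentRepository` property; the repository id and URL below are placeholders, and credentials would come from your `settings.xml`:

```sh
# Placeholder repository id/url; this syntax is for maven-deploy-plugin 2.x.
mvn deploy -DskipTests \
  -DaltDeploymentRepository=internal::default::https://artifactory.example.com/artifactory/maven-local
```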
Once they’re published, you must use the new Flink version for all your Flink dependencies (e.g. `1.18.0-your-company-1d6150f` instead of `1.18.0`).
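In a Maven project, that might look like this (the artifact shown is just an example):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java</artifactId>
  <version>1.18.0-your-company-1d6150f</version>
</dependency>
```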
We also need to build a new Docker image that uses the custom Flink distribution.
Run the following command to build the Flink distribution package:

```sh
mvn clean package -DskipTests -Dscala-2.12 -Prelease -pl flink-dist -am -Dgpg.skip
```
Also, feel free to ignore the Scala parameter if you don’t use Scala.
Flink Docker images don’t have build args, so you can just use the standard `docker build ...` command with your build/CI tools. Make sure to use the same version we defined above when naming the image.
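For example, tagging and publishing could look roughly like this (the registry name is a placeholder):

```sh
docker build -t your-registry.example.com/flink:$VERSION .
docker push your-registry.example.com/flink:$VERSION
```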
The Flink distribution is located in `/opt/flink`. There are two ways to replace the default one.
If you build the Flink distribution and the Docker image in one step in CI, you just need to copy the files to `/opt/flink`, replacing the existing ones.
But there is also a different, multi-stage approach that doesn’t require modifying Flink Docker images; keeping the distribution archive somewhere can also be great for audit purposes.
You can archive the distribution with a command like this:

```sh
tar czf flink-$VERSION-bin-scala_2.12.tgz -C ./flink-dist/target/flink-$VERSION-bin flink-$VERSION
```
Then set these two environment variables for the `docker build` step:
- `FLINK_TGZ_URL` should point to the archive we built above. It’ll be downloaded and unarchived to `/opt/flink`.
- `CHECK_GPG` should be set to `false`; this will simplify the rest of the build process. You can keep it as `true` if you want to verify the GPG signature, but that process is a bit involved, and I won’t cover it here.
Congrats! You now have both the custom Flink JAR files and a custom Docker image.
By the way, as you probably noticed, we didn’t run tests anywhere. Flink has A LOT of tests, and just running unit tests alone can take a while. You can set up a CI step that does it for you, and you can choose to run it manually before releasing a new version. You can also just run tests locally, but only in the package you modified.
This part is pretty straightforward!
Use the custom Flink JAR files in your project as dependencies. Build a project JAR and use the custom Flink Docker image to deploy it. You should see the custom version string in the Flink Web UI.
There are a few workflows to cover.
Fairly easy as long as you keep all custom changes as individual git commits. So, one commit = one contribution. You can create a patch file from a commit and then use it to create a PR.
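As a minimal, self-contained sketch of the patch workflow (it uses a throwaway demo repo here; in practice you’d run `git format-patch` inside your Flink fork against a real commit):

```shell
# Throwaway demo repo; in your fork you'd skip straight to format-patch.
demo=$(mktemp -d)
cd "$demo"
git init -q .
echo "my fix" > fix.txt
git add fix.txt
git -c user.email=you@example.com -c user.name=you commit -q -m "My custom Flink change"

# One commit = one contribution: export the latest commit as a patch file.
git format-patch -1 HEAD -o patches

# The patch can then be applied in a clean clone of apache/flink with:
#   git am patches/0001-My-custom-Flink-change.patch
ls patches   # → 0001-My-custom-Flink-change.patch
```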
Let’s say Flink 1.19 is out. You’d create a new custom branch like `1.19.0-your-company` and then start cherry-picking commits from the `1.18.0-...` one. Ideally, you’ve already contributed some of your changes upstream, so you don’t need to cherry-pick all of them - some of your changes are already there.
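The upgrade itself might look roughly like this (branch names follow the convention above; the sha is a placeholder for each of your remaining custom commits):

```sh
git fetch upstream
git checkout -b 1.19.0-your-company upstream/release-1.19
# List fork-only commits that still need to be carried over:
git log --oneline upstream/release-1.18..1.18.0-your-company
# Cherry-pick each custom change that isn't upstream yet:
git cherry-pick <sha>
```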
If your fork is a public repo, GitHub makes it very easy - just follow this guide. Since all your custom changes exist in special release branches, you never have any conflicts.
If your fork is private, you can use the following script for syncing:

```sh
#!/usr/bin/env bash

set -o errexit
set -o pipefail

upstream_remote=upstream

if ! git remote -v | grep $upstream_remote >/dev/null 2>&1; then
  echo "Please add an upstream remote before using this script!"
  exit 1
fi

# Syncing everything from the remote including tags
git fetch --tags $upstream_remote

# The magic command that should keep OUR commits on top
git rebase $upstream_remote/master

# Pushing updates, tags need a separate command
git push origin HEAD --force
git push --tags
```
This script assumes that you have the upstream repo added as a remote (e.g. `git remote add upstream $git_repo`). Any time you run it, it’ll pull the latest changes from upstream and push them to your origin repo. You can actually commit this script and any other build/CI/CD changes to master, and the script will always keep them on top of the branch (but don’t try to modify any existing files - this will lead to merge conflicts).
And that’s it! As you can see, setting up and maintaining a Flink fork is not that hard. It does require a certain discipline, but, as a result, you’re rewarded with a Flink version that contains exactly the changes that YOU want! Isn’t it amazing?
PS: if you still find the process complicated, wait for the part 2. It’ll show a different approach that allows you to customize Flink, but it doesn’t require maintaining a fork repo. Stay tuned!