Connecting technologies in the cloud

ETL tool comparison: Why we chose Apache NiFi

What have we been using so far?

ETL was not our core business, but an increasing number of clients asked us to implement data pipelines for different reasons and in different contexts. Since our IT team consists mostly of backend developers, our initial approach was to build everything from scratch: write serverless functions and schedule them.

One problem was that, without a well-maintained framework, the code was not reusable enough to apply quickly to other projects. This was especially painful because some clients asked us to use Azure, others preferred Google Cloud, and most demanded that our code run on their on-premise infrastructure. Orchestration, monitoring and alerting were also limited and time-consuming to implement properly. But the main problem was that clients and non-technical colleagues from other departments wanted insight into the data pipelines, and showing them a bunch of code was not very helpful.

Our requirements for the ETL tool

So at the beginning of this year we decided to adopt an externally maintained framework to create and manage our data pipelines. Here is the list of our requirements:

Intuitive GUI

Non-technical staff and clients should be able to follow the general logic of our pipelines, and ideally even make minor configuration adjustments within their area of expertise (e.g. change Google Search Console request parameters).

Fallback code execution

As much as we wanted to use a GUI, we also wanted to make sure that we could always run custom code to handle particular issues or requirements, without this requiring a huge effort or conflicting with the tool's default mode of operation.
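To make this requirement concrete, here is a minimal sketch of what such a fallback can look like in NiFi, the tool we ended up choosing. NiFi's ExecuteStreamCommand processor pipes a FlowFile's content to an external command's standard input and uses the command's standard output as the new content. The script below assumes the incoming content is a JSON array of records; the field names (clicks, impressions) and the derived ctr field are hypothetical and only serve as an illustration.

```python
import json
import sys


def transform(record: dict) -> dict:
    """Hypothetical rule: add a click-through rate to one record."""
    clicks = record.get("clicks", 0)
    impressions = record.get("impressions", 0)
    out = dict(record)  # copy so the input record is left untouched
    out["ctr"] = round(clicks / impressions, 4) if impressions else 0.0
    return out


def process(raw: str) -> str:
    """Parse a JSON array of records, transform each, and serialize again."""
    records = json.loads(raw)
    return json.dumps([transform(r) for r in records])


def main() -> None:
    # Skip when started interactively without piped input.
    if sys.stdin.isatty():
        return
    raw = sys.stdin.read()
    # An empty FlowFile stays empty instead of raising a JSON error.
    if raw.strip():
        sys.stdout.write(process(raw))


if __name__ == "__main__":
    main()
```

In the flow, ExecuteStreamCommand would be configured to run the Python interpreter with this script as its argument. Keeping the logic in plain functions such as transform and process also means it can be tested outside of NiFi.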

System independence

The tool has to be able to run on any cloud provider, and also within on-premise infrastructure. Most mid-size companies in Europe (in Germany at least) are very focused on data ownership and refuse to use services from US cloud providers.

Long-term support

We were looking for a tool under steady, active development. Nowadays there are lots of ETL software options available, but many disappear or are abandoned as fast as they appeared. Others progress slowly or, in the case of open source projects, receive very few contributions.

Nice-to-have

We appreciate that many tools ship with a large number of processors that work out of the box. Since most of our projects are in the SEO context, we preferred tools with ready-to-use connections for Google Search Console and Google Business, or at least for Google APIs in general, although this was not a requirement.

Other options we have considered

With these requirements in mind, we evaluated several options. Some were discarded immediately because they did not meet all the requirements. Here we mention the tools that made it into closer consideration. They are all great, but allow me to briefly state the critical issue that made us pass on each of them.

Kettle

We were already using Pentaho in one of our projects, so Kettle (Pentaho Data Integration) was our initial favorite. The main issue here was that most plugins in the marketplace and on GitHub (at least the ones that interested us) had not been updated for years, or were not available at all. Community contribution to this product seemed to have stalled. There is also little support for streaming data.

Open Studio

Even more than with Pentaho, it felt as though essential features were left out or locked behind a paywall. I can imagine that this is intentional, to promote Talend's cloud product Stitch, which looks much more polished and complete but, unfortunately for us, is cloud-only.

KNIME

This one was actually in the running until the end. It shares a lot of the qualities that NiFi has. But eventually we decided against it because it's a German university project. That is great in itself, but since our IT team mostly resides in the Americas, it would have been very difficult to find new staff with experience in this particular software.

Advantages of Apache NiFi

So, there are many great tools out there, and I think many of them are more than capable of handling the work we do. But Apache NiFi immediately felt like a perfect match for our needs and our work style. Here are the features we appreciate most:

Web UI

The core business of our IT team is web development. The fact that NiFi runs as a web application was therefore very appealing to us. Our IT team mainly uses Linux, our non-technical staff prefers Windows and stakeholders often own a Mac, so this feature guarantees identical availability, support and user experience for all users. It's also a big plus that no one has to install anything: a browser is the only requirement.

Docker

We are very happy to run NiFi in Docker containers. This allows us to set up NiFi in any infrastructure and on any cloud provider. It also lets us scale NiFi up and down to the needs of each client, and makes it easy to connect with other services and to integrate into our continuous deployment.

Processors

Not only does Apache NiFi come with a huge set of very diverse and highly configurable native processors (i.e. pre-built ETL components/connectors), but it also allows us to easily extend it with custom processors and services.

Documentation

The documentation is very comprehensive, up-to-date and complete. There is a detailed usage description for each component, and each configuration option.

Open source

Apache NiFi is not one of the most popular options that we looked at. But about 4k stars on GitHub indicate that there is a solid mass of developers interested in and contributing to it. The open source nature also removes any doubt about what is happening with the data.

Apache Software Foundation

Finally, the fact that the Apache Software Foundation, with its huge data integration ecosystem, is behind this project makes it unlikely that it will be abandoned in the near future, and also makes it easier to find new developers.

Conclusion

After working with Apache NiFi for a couple of months, we are very happy with it. We are still only scratching the surface of the functionality that Apache NiFi has to offer. Our ETL team is growing while we translate our old 100% code solutions into NiFi pipelines. We intend to invest more time in custom processors and pull requests to the core repository. If I find the time, I will use this blog to comment on our experience and progress.