
Apache Airflow Part 1 - Why and Goals for a Near-Serverless ELT

Reasons I want to use Airflow for a near-serverless ELT proof of concept
A laptop with data analytics

At GE Aerospace, my work involves supporting data ingestion into a data lake. I'm not going to go into the details here, but I would love to use Airflow instead of the stack we have today.

I've done a few proofs of concept with Airflow in the past. It is a solid solution, and with the hype of AI these days, quick and reliable data ingestion has never been more critical.

Why Apache Airflow?

  • It's open source.
  • Hugely popular and used by many companies.
    • Features and integrations are available for nearly everything.

Why not other solutions?

Over the last few years, every provider seems to be reducing its on-premises options in favor of its hosted solution.

  • Talend - on-premises features are already being deprecated.
  • Fivetran - moving to its own cloud-based solution.
  • Databricks - not available in us-gov-east-1 as of 2024-07-10.
  • Prefect - lots of cloud-only features, like audit logs, Workspaces, and Automation.
  • Dagster - the Dagit UI lacks any authentication when self-hosted.

And I get why: for small to midsize companies, it's easier to just deploy a cloud-based solution. But in a really large and/or regulation-heavy environment, it's more important to be able to self-host and manage your own data rather than shipping it off to a third party and trusting them with it.

My goals

I am, however, going to put a few constraints on how I want to use Airflow.

  • Can't use AWS MWAA (it's not offered in AWS US Gov East)
  • Local Development
    • Changes to a job must be testable locally before being pushed.
  • CI/CD Pipeline
    • Changes to a job must be deployable to production via a CI/CD pipeline.
  • Job Management
    • Offer a web UI to view and re-trigger jobs.
    • Prefer code-first configuration over UI-based management.
    • Time to modify a job should be less than 5 minutes.
    • Time to create a job should be less than 30 minutes.
  • Cost
    • I'd like to get as close to zero cost as possible, ideally spinning down to near-zero resource usage when no jobs are running.
  • Maintenance
    • Would like to be able to deploy new versions of the Airflow container on ECS.
      • This isn't a hard requirement; I could use an EC2 instance and update it in place, but that's another box to maintain long term.
        • If instead it were just ECS pointing at an RDS database, you could restore the DB from a snapshot and test a deployment before releasing to production.
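
To make the local-development goal concrete: for quick testing on a laptop, Airflow can run in standalone mode from a small Docker Compose file. This is a rough sketch for local use only; the image tag and mount paths here are assumptions, not a production deployment:

```yaml
# Local development only -- not a production deployment.
services:
  airflow:
    # Official image; pin whichever Airflow version you actually target.
    image: apache/airflow:2.9.2
    # `airflow standalone` runs the scheduler, webserver, and a local
    # SQLite metadata DB in one container -- enough to test DAGs locally.
    command: airflow standalone
    ports:
      - "8080:8080"   # web UI at http://localhost:8080
    volumes:
      - ./dags:/opt/airflow/dags   # mount local DAG files for fast iteration
```

For the ECS goal, the same image could back an ECS task definition pointing at RDS instead of the local SQLite database.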

Repo

I've just started a repo at: https://github.com/ChrisTowles/airflow-playground

I don't usually post Proofs of Concept like this publicly, but I'm doing it on my own time, so let's see how this goes and where it takes me.