Reliable Python apps, anyone?

How Tower Enables Reliable Execution of Python Apps on Iceberg Tables

Jan 16, 2025

A growing number of data teams are bringing (post-)modern Python frameworks such as Polars, DataFusion, and dltHub to production. They are drawn to these frameworks because they don't want to learn decades-old Big Data technologies. They don't want to use complex distributed programming APIs and are looking for simple, single-node, in-memory libraries to develop their apps: libraries they can use to build and debug apps locally on their development machines, with modern editors like VS Code or Cursor and local Apple Silicon GPUs.

Other data teams are looking to standardize their data storage on open technologies and use open table formats like Apache Iceberg and Delta Lake to build their data lakes and warehouses. They hope to escape vendor lock-in, so common in existing Big Data platforms, and create architectures where multiple engines can operate on the same set of tables, requiring no ETL. The data community even coined a term for this - Open Lakehouses.

The first group of data teams does not always care about open storage, but amazing things happen when they do. In our previous post, we showed how to set up an environment where Python data apps can read from and write to open lakehouse tables stored in the Iceberg format. With just some S3 storage and a little Python code, you get a powerful way to run data processing in production. And these data apps can work alongside Big Data engines on the same storage too!

When we developed and tested our Python data apps, we did this locally on our dev machines, and it worked great. But how do we transition the apps from our development laptops to a managed service and make them production jobs? In other words, we want to turn our data apps into production pipelines (apps that read and transform inputs and write them into a persistent store) or queries (similar to pipelines, but without output persistence).

Tower was developed for precisely this case. You can give Tower your Python code, any Python code, and run it there. Tower will turn your queries, pipelines, and everything in between into reliable data apps. To get started, install the Tower CLI:

pip install tower-cli

Tower provides examples for running data apps on Iceberg, so let's get them.

We will switch to a temp directory and git-clone the repo with the Tower examples:

cd /tmp 
git clone https://github.com/tower/tower-examples

Cloning into 'tower-examples'...
remote: Enumerating objects: 114, done.
remote: Counting objects: 100% (114/114), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 114 (delta 35), reused 102 (delta 23), pack-reused 0 (from 0)
Receiving objects: 100% (114/114), 129.45 KiB | 523.00 KiB/s, done.
Resolving deltas: 100% (35/35), done

If you get permission errors with Git, you need to configure Git to authenticate with GitHub; GitHub's documentation provides a guide for this setup.

The app we want to run is called "iceberg-analyze", and it is located in a subfolder of our just-cloned GitHub repo.

cd tower-examples/06-iceberg-analyze

Let's review how Tower apps look by listing the files in the folder.  

ls -al

-rw-r--r--   1 datancoffee  wheel  2080 Jan 14 09:45 README.md
-rw-r--r--   1 datancoffee  wheel   780 Jan 14 09:45 Towerfile
-rw-r--r--   1 datancoffee  wheel  1476 Jan 14 09:45 iceberg_analyze_with_polars.py
-rw-r--r--   1 datancoffee  wheel    30 Jan 14 09:45 requirements.txt

Here you will see the "requirements.txt" file typical for Python projects, the script "iceberg_analyze_with_polars.py" that contains our app, and something you may not have seen before: a Towerfile.

Let's go through them one by one:

1. requirements.txt

This is a regular Python requirements file that lists the app's dependencies. As per requirements.txt, our app depends on four packages:

cat requirements.txt

pyiceberg
polars
numpy
pyarrow

We use PyIceberg because Iceberg requires a catalog to coordinate changes to table metadata (see our previous post), and PyIceberg is the interface to the catalog we are using.

We use Polars because it's a great DataFrame framework that lets us easily write expressions to transform data in Iceberg tables.

PyArrow is a library for working with in-memory columnar data, and NumPy is frequently used to manipulate numeric arrays.
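These libraries also interoperate nicely; for example, Polars can exchange data with NumPy and PyArrow directly (a small illustration, not part of the example app):

import numpy as np
import polars as pl

# Build a Polars column from a NumPy array, then hand the frame to PyArrow.
df = pl.DataFrame({"value": np.arange(5)})
arrow_table = df.to_arrow()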

2. Python script

We recommend reviewing the 06-iceberg-analyze/iceberg_analyze_with_polars.py script (github). We won't go line by line here, as we already did that in our previous post.
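For orientation, here is a condensed sketch of the kind of logic the script contains: connect to the catalog, load the Iceberg table named by the iceberg_table parameter, and run a Polars aggregation over it. This is a simplification of the real script, and the column used in the aggregation is a hypothetical placeholder.

import os

import polars as pl
from pyiceberg.catalog import load_catalog

# Connect to the "default" catalog configured via PYICEBERG_CATALOG__DEFAULT__* variables.
catalog = load_catalog("default")

# Load the Iceberg table whose name is passed in as a parameter.
icetable = catalog.load_table(os.getenv("iceberg_table"))

# Lazily scan the table with Polars and run a simple aggregation.
df = (
    pl.scan_iceberg(icetable)
    .group_by("year")  # hypothetical column, for illustration only
    .agg(pl.len().alias("row_count"))
    .collect()
)
print(df)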

Usually, to run a Python script, we would type:

python iceberg_analyze_with_polars.py

How do we run this script in Tower, though? That's what the Towerfile is for.

3. Towerfile

The Towerfile tells Tower what the app consists of. Here, you can give your app a name, tell Tower what Python script is the main one, and specify all other files that are part of the app.

[app]
name = "iceberg-analyze"
script = "./iceberg_analyze_with_polars.py"
source = [
   "./*.py",
   "./requirements.txt"
]


[[parameters]]
name = "iceberg_table"
description = "The name of the iceberg table"
default = "invalid_table"


[[parameters]]
name = "AWS_REGION"
description = "The region of S3 endpoint"
default = "us-east-1"


[[parameters]]
name = "PYICEBERG_CATALOG__DEFAULT__SCOPE"
description = "The principal role you configured"
default = "PRINCIPAL_ROLE:<principal_role_you_configured>"


[[parameters]]
name = "PYICEBERG_CATALOG__DEFAULT__URI"
description = "The URI of the REST catalog"
default = "<https://... for REST catalogs>"


[[parameters]]
name = "PYICEBERG_CATALOG__DEFAULT__WAREHOUSE"
description = "The ID of the catalog"
default = "<catalog_identifier>"


Since most apps have parameters, you can declare them here as well, along with schedules.

For example, our app needs to know the name of the table on which to run our Polars DataFrame expression. That's what the iceberg_table parameter is for.

We also pass AWS_REGION because the AWS S3 library our app uses needs to talk to the right S3 endpoint.

Finally, three parameters (URI, Scope, Warehouse) define how to talk to an Iceberg catalog, in our case the Snowflake Open Catalog. All catalog parameters start with PYICEBERG_CATALOG in their names.
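As background on those names: PyIceberg reads environment variables of the form PYICEBERG_CATALOG__<NAME>__<KEY> and turns them into properties of the named catalog. Setting the three parameters above is therefore roughly equivalent to the following programmatic configuration (a sketch with the Towerfile's placeholder values; the app itself relies on the environment variables):

from pyiceberg.catalog import load_catalog

# Roughly equivalent to setting PYICEBERG_CATALOG__DEFAULT__URI,
# PYICEBERG_CATALOG__DEFAULT__WAREHOUSE, and PYICEBERG_CATALOG__DEFAULT__SCOPE.
catalog = load_catalog(
    "default",
    uri="https://<id>.<region>.snowflakecomputing.com/polaris/api/catalog",
    warehouse="<catalog_identifier>",
    scope="PRINCIPAL_ROLE:<principal_role_you_configured>",
)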

Now that we have our Towerfile, let's run our app in Tower. 

Tower app lifecycle

In Tower, all apps go through 3 stages:

1. Creation

2. Deployment

3. Execution

After you create an app, you deploy it when you change its code or configs. Then, you run it ad hoc, on a scheduled basis, or when triggered by orchestrators.

Let's walk through these stages with our own app, using the Tower CLI.

To run apps on Tower, you first need to tell Tower who you are by logging in.

Note: The next set of steps requires you to have an account with Tower. You don't need a Tower account to install the Tower CLI or to look at examples, but you need an account to run commands in the Tower cloud. 

Tower is currently in private beta. Contact the Tower founders at tower.dev to get an invite code to the private beta.

Once you have the invite code, follow the instructions in the invitation email and set up your account.

After you set up your Tower account, return to the Tower CLI. You can now log in to Tower.

tower login

      _____ _____      _____ ___  
     |_   _/ _ \ \    / / __| _ \ 
       | || (_) \ \/\/ /| _||   / 
       |_| \___/ \_/\_/ |___|_|_\ 
                                  
Waiting for login... Done!
Success! Hello, builder

The CLI will open a browser login page where you can sign in with a social login, such as Google or GitHub, or with your email and password.

Creating and Deploying an App

Once you log in, create a new app and name it "iceberg-analyze".

tower apps create --name=iceberg-analyze

Creating app Done!
Success! App "iceberg-analyze" was created

You can now deploy the source code and other files as a package to Tower.

tower deploy

Building package... Done!
  Deploying to Tower... [00:00:00] [████████████████████████████████████████] 6.00 KiB/6.00 KiB (0s)
Success! Version `v1` of your code has been deployed to Tower


Running an App

Now comes the most exciting part - running the app.

You can do it by typing "tower run" inside the folder with a Towerfile, and Tower will run the app mentioned in the Towerfile. Or, you can explicitly run an app by name, e.g., "tower run <app-name>".

After reviewing the "iceberg_analyze_with_polars.py" script, you might ask: how does this code know how to connect to the Iceberg lakehouse, and aren't we supposed to pass the name of the Iceberg table so that the following two lines of code execute correctly?

iceberg_table_name = os.getenv("iceberg_table")
icetable = catalog.load_table(iceberg_table_name)

Let's briefly discuss how Tower passes application parameters and secrets to your app code. 

Both secrets and parameters are injected into your application as environment variables. Secrets are passed to your application implicitly, without you specifying them on the command line, while parameters are variables you must specify on the Tower command line when running the app.

When deciding which variables to pass as secrets or parameters, consider the following:

Best to pass as secrets:

  • Variables that your teammates should not be able to explicitly read (e.g. credentials to your databases and cloud access keys)

  • Variables that your app needs across many of its runs and not just for one run

  • Variables that all of your apps need to have access to

  • Variables that you want to differentiate by the environment, e.g., have different values for "production" and "development"

Best to pass as parameters:

  • Variables whose values change from app run to app run
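Whichever way a value arrives, the app reads it the same way: from the environment. A minimal sketch, using the variable names from this post:

import os

# Inside the app, secrets and parameters are indistinguishable:
# both arrive as plain environment variables.
aws_access_key = os.getenv("AWS_ACCESS_KEY_ID")    # injected as a secret
aws_region = os.getenv("AWS_REGION", "us-east-1")  # passed as a parameter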

Here is what our app needs to know:

To access S3, it needs to know the AWS credentials:

  • AWS_ACCESS_KEY_ID

  • AWS_SECRET_ACCESS_KEY

To access the Iceberg catalog, the app needs another credential: 

  • PYICEBERG_CATALOG__DEFAULT__CREDENTIAL

Let's create secrets for these 3 credentials:

tower secrets create --name=AWS_ACCESS_KEY_ID --value='AK...'
tower secrets create --name=AWS_SECRET_ACCESS_KEY --value='ABC...'
tower secrets create --name=PYICEBERG_CATALOG__DEFAULT__CREDENTIAL --value='o...=:L...='

When the app runs, these three secrets will be passed to it implicitly.

If you remember from reviewing the Towerfile, we also need to pass five parameters:

  • three additional settings to access the Iceberg catalog (URI, Scope, Warehouse),

  • the name of the AWS region, and

  • the name of the Iceberg table.

And with this, our run command looks like this:

tower run \
 --parameter=PYICEBERG_CATALOG__DEFAULT__SCOPE=PRINCIPAL_ROLE:python_app_role \
 --parameter=PYICEBERG_CATALOG__DEFAULT__URI='https://<id>.<region>.snowflakecomputing.com/polaris/api/catalog' \
 --parameter=PYICEBERG_CATALOG__DEFAULT__WAREHOUSE='peach_lake_open_catalog' \
 --parameter=iceberg_table='default.japan_trade_stats_2017_2020' \
 --parameter=AWS_REGION='eu-central-1'

Scheduling run... Done!
Success! Run #1 for app `iceberg-analyze` has been scheduled

When you run this command, Tower will execute the data app in Tower’s cloud environment. You can review its status and logs via the CLI or the Tower web UI.

Next Steps

In this post, we showed you how to transform a Python script into a data app you can reliably run in production.

We did not cover some of Tower's more advanced features, like local development, production environments, scheduling, and observability, but we will address those in future posts.

Tower is currently in private beta, and the Tower founders would love to hear from you and discuss your use case. Contact us to get an invite code to the beta, and let us help you build your future data platform.
