Ducks vs Pythons: How to write Iceberg tables using PyArrow and analyze them using Polars
Dec 19, 2024
This is the final post of our series "Building an Open, Multi-Engine Data Lakehouse with Python and S3" (see this post for motivation). In it, we will finally dive into using Python and Python only to read and write Iceberg tables stored in an S3 lakehouse. During the production of this movie, no ducklings, sparks, or snowflakes will be harmed or involved in any way.
All six posts of the series are now out:
Part 1: Building Open, Multi-Engine Lakehouse with S3 and Python
Part 2: How Iceberg stores table data and metadata
Part 3: Picking Snowflake Open Catalog as a managed Iceberg catalog
Part 4: How to create S3 buckets to store your lakehouse
Part 5: How to set up a managed Iceberg catalog using Snowflake Open Catalog
Part 6: How to write Iceberg tables using PyArrow and analyze them using Polars (you are here)
Let us clarify something very important.
We really, really like DuckDB. No, really.
Here are our co-founders, Serhii and Brad, taking a selfie, while Hannes Mühleisen, the creator of DuckDB, is listening to something undoubtedly important at a recent event.
At that same event, in a casual conversation about application platforms and code isolation, Hannes said something that resonated with us: "Processes are the best containers. But someone needs to solve loading libraries."
Of course, Hannes was thinking of DuckDB's own approach of embedding itself into the operating system process of the application that was using it.
But what he said also gave us some ideas for what we wanted to do with Python-based data applications.
There, we said it. Not only did Hannes Mühleisen create DuckDB, but he also influenced the direction of the Tower founding team.
Now that that's behind us, let's all agree that it would be silly to attempt to write another post about using DuckDB on top of Iceberg data. There are many, perhaps too many, quality posts about it already.
The situation with the Python ecosystem is not as simple, however, primarily because of the still-developing PyIceberg library that is supposed to provide the interface to Iceberg catalogs.
Therefore, we have focused on developing a set of recommendations for using PyIceberg with Snowflake's Open Catalog. This post provides some background on why we chose Open Catalog as our managed Iceberg catalog service.
Writing an Iceberg table by using PyArrow only
The minimal Python setup to write an Iceberg table is PyIceberg together with PyArrow. Let's set it up.
pip install "pyiceberg[pyarrow,s3fs]"
PyIceberg will be our interface to the catalog, and PyArrow will help us materialize the tables to S3 storage.
PyIceberg needs a bit of setup to know which Iceberg catalog server or service to talk to. Review this Tabular guide on how to pass catalog configuration parameters. We have three options:
Use a local .pyiceberg.yaml file (see the sketch after this list)
Set 4 environment variables: URI, Credential, Scope, and Warehouse
Specify the configuration in code
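For reference, option 1 would mean creating a .pyiceberg.yaml file along these lines. This is a minimal sketch with placeholder values; we did not use this option ourselves, so double-check the keys against the PyIceberg configuration docs:

catalog:
  default:
    uri: https://<account-id>.<region>.snowflakecomputing.com/polaris/api/catalog
    credential: <client_id>:<client_secret>
    scope: PRINCIPAL_ROLE:python_app_role
    warehouse: <warehouse_name>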
We chose to pass the parameters via environment variables:
export PYICEBERG_CATALOG__DEFAULT__URI=https://<account-id>.<region>.snowflakecomputing.com/polaris/api/catalog
export PYICEBERG_CATALOG__DEFAULT__CREDENTIAL=<client_id>:<client_secret>
export PYICEBERG_CATALOG__DEFAULT__SCOPE=PRINCIPAL_ROLE:python_app_role
export PYICEBERG_CATALOG__DEFAULT__WAREHOUSE=<warehouse_name>
The catalog name you use in your code (e.g., "default") must also be in the environment variable names. That's why the variables in the above example are named PYICEBERG_CATALOG__DEFAULT__….
Alternatively, we could have also embedded the parameters in code, like here:
catalog = load_catalog(
    "default",
    **{
        "uri": "https://<account-id>.<region>.snowflakecomputing.com/polaris/api/catalog",
        "credential": "<client_id>:<client_secret>",
        "scope": "PRINCIPAL_ROLE:python_app_role",
        "warehouse": "peach_lake_open_catalog",
    }
)
We can now write a basic data app (get it from GitHub!) that first accesses our catalog by name (we can do this because we set the environment variables and included "default" in their names).
catalog = load_catalog("default")
The app then downloads our Japan Trade Stats CSV file from S3 (see this post for details on our test dataset) and caches it locally using the fsspec library so that we don't have to redownload this 1GB file every time we debug the app.
The app then uses PyArrow to read the CSV file into an in-memory Arrow table, creates the table "japan_trade_stats_2017_2020" in the Iceberg lakehouse under the "default" namespace, and appends the Arrow table to it, which writes the data and metadata files to our S3 bucket. It then scans the Iceberg table and calls to_arrow() to read it back into memory.
Finally, we print the number of rows in our trade stats table (~15 million!).
from pyiceberg.catalog import load_catalog
from pyarrow import csv
import fsspec
import os

# The catalog name "default" must be in the config file or in env variable names
catalog = load_catalog("default")

# Get source and destination paths
csv_file_path = os.getenv("csv_file_path")
iceberg_table_name = os.getenv("iceberg_table")

# Download from S3 a large CSV file with Japan Trade Statistics
# (import and export of goods by month), caching it locally
of = fsspec.open(
    "filecache://{csv_file_path}".format(csv_file_path=csv_file_path),
    mode='rb',
    cache_storage='/tmp/cache1',
    target_protocol='s3',
    target_options={'anon': True},
)

# Read the CSV file into a PyArrow table so that we can take its schema
# and pass it to Iceberg
with of as f:
    artable = csv.read_csv(f)

# Create the "default" namespace in our open lakehouse
catalog.create_namespace_if_not_exists("default")

# Create or retrieve the table to hold our Japan trade stats
icetable = catalog.create_table_if_not_exists(iceberg_table_name, schema=artable.schema)

# Add the contents of the CSV file to our Iceberg table
# (this writes the data files to S3 and commits a new snapshot)
icetable.append(artable)

# Scan the table back from S3 and count its records
l = len(icetable.scan().to_arrow())

# Print the number of records in the table
print(l)
Before running the app, let's set some environment variables to pass parameters to our app.
export csv_file_path='mango-public-data/japan-trade-stats/custom_2020.csv'
export iceberg_table='default.japan_trade_stats_2017_2020'
Let's also set the AWS region on our development machine so that we communicate with the right S3 endpoint (use the region where your lakehouse bucket is located).
export AWS_REGION='eu-central-1'
To run the data app (source available on Github), just type:
This app takes about 5-10 minutes to run, depending on how fast the connection is between your development machine and the AWS region of your lakehouse. Its output is the number of rows: 14,897,200, to be precise.
If you want to see the contents of the Iceberg table that was written, type:
aws s3 ls s3://<your-lakehouse-bucket> --recursive
If you read our post reviewing how Iceberg stores table data and metadata and downloaded a snapshot of an Iceberg lakehouse to your local drive, you will notice that the downloaded snapshot contains the same number of files and folders as your data application just produced; only the file names are different.
The lakehouse files will look like this:
# revisit the files you downloaded previously
ls -R s0
default

s0/default:
japan_trade_stats_2017_2020

s0/default/japan_trade_stats_2017_2020:
data    metadata

s0/default/japan_trade_stats_2017_2020/data:
00000-0-b0823e8d-5894-4b67-ae80-0ff4a480e1c0.parquet
00000-2-b0823e8d-5894-4b67-ae80-0ff4a480e1c0.parquet
00000-1-b0823e8d-5894-4b67-ae80-0ff4a480e1c0.parquet

s0/default/japan_trade_stats_2017_2020/metadata:
00000-6ae8bff7-eb6b-4c09-92ad-9ff0a7e93062.metadata.json
00001-5d511f22-9819-47fd-a010-97d75a964dfd
Reread our previous post if you want to understand what is inside those files.
Reading an Iceberg table and running an aggregation report by using Polars only
We are not done having fun with Python. Let's have more fun!
We just added four years of Japanese trade stats to our lakehouse. Why don't we run some queries on that data?
Here is what the data looks like in the 'default.japan_trade_stats_2017_2020' table.
exp_imp,Year,month,ym,Country,Custom,hs2,hs4,hs6,hs9,Q1,Q2,Value
1,2017,01,201701,103,100,00,0000,000000,000000190,0,431793,1511679
1,2017,01,201701,105,100,00,0000,000000,000000190,0,1034578,3050141
1,2017,01,201701,106,100,00,0000,000000,000000190,0,326216,1336245
1,2017,01,201701,107,100,00,0000,000000,000000190,0,19959,5460
1,2017,01,201701,108,100,00,0000,000000,000000190,0,340805,2745685
1,2017,01,201701,110,100,00,0000,000000,000000190,0,221405,357296
…
2,2018,07,201807,118,200,61,6110,611030,611030030,203,111,337
2,2018,07,201807,122,200,61,6110,611030,611030030,10371,1114,3423
2,2018,07,201807,125,200,61,6110,611030,611030030,2901,1621,3175
2,2018,07,201807,105,200,61,6110,611030,611030099,624240,186121,454246
2,2018,07,201807,106,200,61,6110,611030,611030099,835,76,581
2,2018,07,201807,110,200,61,6110,611030,611030099,187594,50712,118238
2,2018,07,201807,111,200,61,6110,611030,611030099,16208,7334,14731
…
A deeper review of this dataset and its code values can be found here.
Summarized:
Column exp_imp indicates whether a row is an export or an import: 1 is export, 2 is import
Columns hs2 to hs9 are goods classification codes
Columns Q1 and Q2 measure quantity (see here for details)
Column Value is measured in 1000s of Japanese Yen
How about calculating the value of exports by month? Without using SQL or DuckDB, of course!
Polars to the rescue!
Let's first pip install Polars:
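pip install polars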
Here is a data app (get it from GitHub) that acquires a reference to our Iceberg table, creates a Polars dataframe from it, and then runs an aggregation query using Polars.
from pyiceberg.catalog import load_catalog
import polars as pl
import os

catalog = load_catalog("default")

iceberg_table_name = os.getenv("iceberg_table")
icetable = catalog.load_table(iceberg_table_name)

# Create a lazy Polars frame on top of the Iceberg table
df = pl.scan_iceberg(icetable)

# Keep exports only (exp_imp == 1), then sum Value by month
df2 = (
    df.filter(pl.col("exp_imp") == 1)
    .collect()
    .select([pl.col("ym"), pl.col("Value")])
    .group_by(["ym"])
    .sum()
    .sort("ym")
)

print(df2)
Before running the app, we must set environment variables to pass the table's name and the AWS region to the app.
export iceberg_table='default.japan_trade_stats_2017_2020'
export AWS_REGION='eu-central-1'
To run the app, type
The whole app, including the data download from the lakehouse to our dev box, took about 20 seconds. It prints a time series of Japanese exports by month over four years.
python iceberg_analyze_with_polars.py

shape: (46, 2)
┌────────┬────────────┐
│ ym     ┆ Value      │
│ ---    ┆ ---        │
│ i64    ┆ i64        │
╞════════╪════════════╡
│ 201701 ┆ 5420942231 │
│ 201702 ┆ 6347122559 │
│ 201703 ┆ 7227998277 │
│ 201704 ┆ 6330189902 │
│ 201705 ┆ 5851616236 │
│ …      ┆ …          │
│ 202006 ┆ 4862354455 │
│ 202007 ┆ 5369179220 │
│ 202008 ┆ 5233104509 │
│ 202009 ┆ 6054141065 │
│ 202010 ┆ 6565808457 │
└────────┴────────────┘
Bringing it all together
Congrats on finishing our series "Building an Open, Multi-Engine Data Lakehouse with Python and S3"!
Setting everything up took several tools and six posts, but we were able to build a realistic multi-engine environment for a lakehouse. Data is stored in S3, an Iceberg catalog is managed by a reputable party, and Spark and DuckDB are not used at all. We used Python and Python only, with PyArrow for table creation and Polars for data analysis.
There are many examples that show how to create local Iceberg catalogs (running on your laptop) or how to use Spark or DuckDB to work with lakehouses, but we managed to build our lakehouse without any of that. If you are a Pythonista who wants to run a set of data apps on top of an S3 lakehouse, you should know how to do so now.
The sample code used for this post is available on GitHub here and here. Please star the repos if you like what you read.
What did we think of Snowflake’s Open Catalog?
We wrote a post about it!
What are our future areas of investigation?
Until now, the discussion of Iceberg's performance has centered on its ability to run big data loads. Iceberg can store many petabytes of data, and engines running on top of Iceberg support queries that can scale into these sizes. All of that is great.
However, there haven't been many evaluations of lakehouses' ability to run High Concurrency, Low Latency (HCLL) workloads, e.g., a BI workload that serves many users and delivers data to dashboards and to Power BI and Tableau client applications.
Further, what if you have multiple engines (e.g., Spark, Snowflake, and plain old Python) running thousands of queries per second (and mixing them with DML operations) on your lakehouse? Will this architecture and the catalogs that underpin it be able to handle such a high rate of operations from potentially non-homogeneous engines?
When we worked on improving Snowflake's ability to run HCLL workloads, the most significant gains came from improving the handling of table metadata, eliminating access bottlenecks, and improving resource scheduling. What kind of improvements the Iceberg standard or catalog implementations will need in order to support HCLL and multi-engine workloads remains to be seen.
Another concern we had was the programming model currently imposed on Python developers who want to interact with an Iceberg lakehouse. Imagine if a SQL developer, to write a SELECT query that joins two tables, had to first load the database catalog, then load both tables individually, and then do the join manually. More must be done to make using Iceberg from Python data apps as simple as writing a SQL query. That's why there are so many DuckDB examples out there: the developer experience is so much better.
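To make the point concrete, here is a hypothetical sketch of what a two-table join looks like from Python today with PyIceberg and Polars. The orders and customers tables and their columns are made up for illustration; they are not part of our lakehouse:

from pyiceberg.catalog import load_catalog
import polars as pl

# Step 1: load the catalog explicitly
catalog = load_catalog("default")

# Step 2: load each table individually from the catalog (hypothetical table names)
orders = pl.scan_iceberg(catalog.load_table("default.orders"))
customers = pl.scan_iceberg(catalog.load_table("default.customers"))

# Step 3: do the join and aggregation yourself in a dataframe engine
report = (
    orders.join(customers, on="customer_id", how="inner")
    .group_by("country")
    .agg(pl.col("amount").sum())
    .collect()
)
print(report)

In SQL, all three steps would collapse into a single SELECT with a JOIN and a GROUP BY.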
We hope you found this post and the whole series useful. If you want to talk to a Tower founder about reliably and securely running Python data apps in production, please get in touch with us.
We are working on a new set of posts on simplifying Python data app development on Iceberg data. Follow us on LinkedIn or sign up for our Substack newsletter to get notified when we publish new research.