Planting lakehouse tables into object storage
How to create S3 buckets to store your lakehouse data and metadata
Dec 17, 2024
In this post, we will do something critical for building a lakehouse. We will create storage buckets in S3!
All six posts of this 6-part series are now available:
Part 1: Building Open, Multi-Engine Lakehouse with S3 and Python
Part 2: How Iceberg stores table data and metadata
Part 3: Picking Snowflake Open Catalog as a managed Iceberg catalog
Part 4: How to create S3 buckets to store your lakehouse
Part 5: How to set up a managed Iceberg catalog using Snowflake Open Catalog
Part 6: How to write Iceberg tables using PyArrow and analyze them using Polars
In one of our previous posts, we shared instructions for setting up a local development environment for exploring and manipulating Iceberg storage. We also reviewed an Iceberg lakehouse's cactus-like (or tree-like, if you prefer) folder structure.
From here on, we expect that you can run the "aws" CLI tool and interact with S3 storage to copy files and sync folders.
Try running the following command to see what data you have access to:
aws s3 ls
If you created a new user per our previous instructions, you should be able to run the command without errors, but you won't see any data yet. We will fix that in the next step.
Next, we suggest that you do the following two things:
Create a cloud bucket to store the data that you will use to populate your actual lakehouse S3 bucket.
Create a cloud bucket to store the lakehouse data and metadata.
Creating an S3 bucket to store input data
Creating an S3 bucket to store input data is purely for your convenience. You don't have to create it: you could keep the data that will end up in your lakehouse on your development machine, or download it from other people's cloud resources (for example, you could use our public read-only data bucket s3://mango-public-data).
However, having a dedicated S3 bucket for your input data keeps data access consistent during development: you can place the input data in the same cloud region as the lakehouse storage, which makes loads faster, and you won't be caught off guard when someone shuts down access to their resources.
Log in to the AWS Management Console with your newly created Data Engineer user.
Select S3.
Click on Create bucket.
Give the bucket a name (we will refer to your bucket as <your-input-data-bucket>).
Under Object Ownership, choose the recommended "ACLs disabled".
Disable "Block All Public Access" if the data you will store in this bucket and use for populating the lakehouse is not confidential. For example, you might have some datasets you downloaded from Kaggle or elsewhere that you want to park in this bucket. Review the AWS commentary on opening up your bucket for public access. In a later step, you will use a policy to set access to "Read-Only."
Select Create bucket.
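If you prefer the command line over the console, here is a rough CLI equivalent of the steps above. This is a sketch on our part, not part of the original walkthrough: the placeholders <your-input-data-bucket> and <your-region> are yours to fill in, and newly created buckets already have ACLs disabled by default.
# Create the input-data bucket (use the same region you plan to use for the lakehouse)
aws s3 mb s3://<your-input-data-bucket> --region <your-region>
# Mirror the console step that disables "Block all public access"
# (only do this if the data you will store here is not confidential)
aws s3api put-public-access-block --bucket <your-input-data-bucket> --public-access-block-configuration BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false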
To enable read access to this bucket for users outside of your AWS account, let's set a bucket policy that allows it (this is also discussed here):
In the S3 console, find your bucket and click on it.
Go to the Permissions tab and find the "Bucket policy" section.
Click Edit and enter the following policy:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": "*", "Action": [ "s3:GetObject" ], "Resource": "arn:aws:s3:::<your-input-data-bucket>/*" } ] }
If you need some seed data to populate your lakehouse, we recommend the Japan Trade Data dataset on Kaggle. It comes in several sizes, from a single year of trade to 32 years of trade stats. We recommend downloading the 4-year dataset (years 2017-2020; 1GB of uncompressed CSV, 20 million rows) and perhaps also the 32-year dataset (years 1988-2020; 4.5GB of uncompressed CSV, 100 million rows).
Once you have the custom_2017_2020.csv and custom_1988_2020.csv files on your development machine, you can use the following commands to upload them to your S3 bucket. We recommend using a separate folder in the S3 bucket for the Japan Trade Data dataset. The AWS CLI commands below will create this folder automatically if it does not exist yet.
aws s3 cp custom_2017_2020.csv s3://<your-input-data-bucket>/japan-trade-stats/
aws s3 cp custom_1988_2020.csv s3://<your-input-data-bucket>/japan-trade-stats/
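As an aside, if you keep both CSV files in a local folder (say, japan-trade-stats, a folder name we are assuming here), aws s3 sync can upload the whole folder in one command and will skip files that are already up to date:
aws s3 sync ./japan-trade-stats/ s3://<your-input-data-bucket>/japan-trade-stats/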
You can copy (write) to the S3 bucket from your development machine because your user (that you created in this step) is part of the DataEngineersGroup, which has the PowerUserAccess policy attached. This policy is quite powerful: it allows creating, deleting, and using a wide range of AWS services.
To list what you have in the bucket, run the s3 ls command:
aws s3 ls s3://<your-input-data-bucket>/japan-trade-stats/ --recursive
You should see something like this:
2024-12-15 18:15:17 4544707885 japan-trade-stats/custom_1988_2020.csv
2024-12-15 18:15:17 938925623 japan-trade-stats/custom_2017_2020.csv
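If you opened this bucket for public reads earlier, you can also sanity-check anonymous access by downloading one object without credentials using the --no-sign-request flag (a quick check we suggest, not part of the original walkthrough). Note that the policy above grants only s3:GetObject, so an anonymous aws s3 ls against the bucket will still be denied; downloading a specific object should work:
aws s3 cp s3://<your-input-data-bucket>/japan-trade-stats/custom_2017_2020.csv . --no-sign-request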
You are now ready to create your Iceberg lakehouse!
Creating an S3 bucket to store Iceberg lakehouse data and metadata
In this step, you will create the S3 bucket to store your future Iceberg lakehouse! It will contain the lakehouse's data and metadata. While the process will look similar to the previous S3 bucket creation step, you will protect this bucket better than the public bucket you created before.
We suggest using the Snowflake guide to create the S3 bucket for your lakehouse.
By the end of it, you will have a bucket and possibly a folder inside it (useful if you want to host multiple lakehouses in the same bucket). Write down the S3 URI (we will refer to it as <your-lakehouse-bucket>); you will need to copy it into the Snowflake configuration later.
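For reference, here is a rough CLI sketch of creating a locked-down lakehouse bucket. The placeholder <your-lakehouse-bucket-name> is ours, and the Snowflake guide remains the authoritative set of steps; unlike the public input-data bucket, this one keeps "Block all public access" fully enabled.
# Create the lakehouse bucket
aws s3 mb s3://<your-lakehouse-bucket-name> --region <your-region>
# Explicitly keep "Block all public access" enabled for this bucket
aws s3api put-public-access-block --bucket <your-lakehouse-bucket-name> --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true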
Next Steps
The next step is to set up a managed Iceberg catalog using Snowflake Open Catalog (step-by-step instructions are available now).
The final post in the series ties everything together and shows how to use common Python tools (PyArrow and Polars) to run data applications on top of an S3-based open lakehouse.
Follow us on LinkedIn or sign up for our Substack newsletter to get notified when we publish the next post in our series.