Check out our new case study with Taktile and dltHub
To use Snowflake as an Iceberg catalog, or not to use: that is the question.
Picking Snowflake Open Catalog as a managed Iceberg catalog for Open Lakehouse
Dec 17, 2024
If you came here after reading the first post of our series "Building an Open, Multi-Engine Data Lakehouse with Python and S3", you already know of our shocking (SHOCKING!) decision to declare Snowflake's Open Catalog the winner in our evaluation of managed Iceberg catalog providers for our open lakehouse (if you need a refresher on what open lakehouses are, our first post is a great place to start).
All six posts of the series are now available:
Part 1: Building Open, Multi-Engine Lakehouse with S3 and Python
Part 2: How Iceberg stores table data and metadata
Part 3: Picking Snowflake Open Catalog as a managed Iceberg catalog
Part 4: How to create S3 buckets to store your lakehouse
Part 5: How to set up a managed Iceberg catalog using Snowflake Open Catalog
Part 6: How to write Iceberg tables using PyArrow and analyze them using Polars
To be clear - we are not discussing Snowflake's data warehousing product here. We are talking about a new managed Iceberg catalog service that Snowflake launched in GA in October 2024.
Why would we pick a product from a vendor potentially challenged by the emergence of open lakehouses? Would this vendor be serious about continuing to invest in it? Is it another attempt to lock one into a closed data ecosystem? These are all good questions, and we first had to ponder our decision, but in the end, we decided the benefits outweighed the risks.
Iceberg Catalog options
There is a large variety of Iceberg catalogs out there, and we would waste valuable pixel space by rehashing their comparisons. If you want to read up on some comparisons, look at this deep dive by Alex Merced or review the PyIceberg catalog docs.
Here is a list of options that you have:
Vendor- or Cloud-specific catalogs
Amazon DynamoDB
Snowflake (built-in version, not the Polaris one)
Unity Catalog (built-in version, not the OSS one)
AWS Glue
Hive Metastore
Family of JDBC-based Relational DB Engines (like Postgres)
SQLite
Nessie
Family of REST interface catalogs
Vendor and Cloud-specific Catalogs
If you are trying to build a truly open, multi-engine lakehouse (see the first post of the series for the motivation for this), then using data platform- or cloud-specific catalogs is not a good idea. Catalogs like Amazon DynamoDB, the built-in Snowflake catalog, and the built-in Unity Catalog are, therefore, less appealing. We say this with full respect to the AWS, Snowflake, and Databricks teams that built them, especially because we have personally worked with many of them.
Note: AWS changed its Glue catalog a few days ago to support the REST interface. Glue is now much more interesting from an interoperability point of view.
Hive, JDBC, and SQLite Catalogs
You probably also don't want to use the Hive Metastore catalog because the contributor community's momentum is moving away from this catalog, not toward it.
The idea of using relational DB engines via the JDBC catalog type was interesting, especially because some older Stack Overflow posts from Tabular employees recommended Postgres as the backend. However, rumor has it that interest in JDBC catalog backends is declining.
Using SQLite as the catalog access engine is a cute idea, but it would work only in a development setup where the catalog is queried from a dev machine, so our goals of a realistic data storage setup and multi-engine support would be thrown out the window.
Nessie
Nessie is a cool and modern catalog from Dremio that deserves more research. And it is available as a managed service! However, Dremio announced that they would eventually merge Nessie into Polaris, so we did not put it at the top of our evaluation list.
REST Catalogs
The catalog options we reviewed previously do not allow us to build open, multi-engine lakehouses. This leaves us with only one choice: catalogs that conform to the REST catalog interface. The REST catalog interface is a relatively new addition, introduced in Apache Iceberg's 0.14.0 release, and it is being promoted in the Iceberg community as the holy grail of catalog types.
The best argument for choosing a catalog that conforms to the REST spec is that it will be easier for you to swap your catalog provider down the road, although if history is any guide, many vendors will try to add features to their catalogs that bind you to their implementations. But this is a battle we won't have to fight for another 2-3 years, so for now, REST catalog it is.
Among the REST catalogs, the options are grouped into:
Self-managed REST Catalogs
Managed REST Catalogs
Tabular (no public sign-up, by invitation only)
Snowflake Open Catalog (based on Polaris) (in GA, how to sign up)
Dremio Enterprise Data Catalog (based on Nessie)
AWS Glue (available since December 2024! See Glue REST docs)
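The portability argument can be made concrete: with a REST-spec catalog, switching providers is largely a matter of swapping connection properties, while the client code stays the same. A sketch with hypothetical endpoints and credentials (the URIs and names below are illustrative placeholders, not real endpoints):

```python
# Two interchangeable REST catalog configurations. Each dict holds the
# properties you would pass to pyiceberg's load_catalog(); only the
# connection details differ between providers, not the client code.
provider_a = {
    "type": "rest",
    "uri": "https://catalog.provider-a.example/api/catalog",  # placeholder
    "credential": "client-id:client-secret",                  # placeholder
    "warehouse": "my_lakehouse",
}
provider_b = {
    "type": "rest",
    "uri": "https://catalog.provider-b.example/api/catalog",  # placeholder
    "credential": "client-id:client-secret",                  # placeholder
    "warehouse": "my_lakehouse",
}

# Migrating providers means changing which set of properties you load;
# the rest of the pipeline (table reads and writes) is unchanged.
active = provider_a
```

In practice you would then call `pyiceberg.catalog.load_catalog("lakehouse", **active)`; we omit the live call here since it needs real credentials.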
Self-managed REST Catalogs
The self-managed REST catalogs assume that you are willing to run the catalog server somewhere: on your laptop, on an EC2 VM, or inside a Kubernetes cluster. They either provide instructions for starting that server process or give you a Docker container to deploy in the cloud.
Unity Catalog OSS
Total Contributors: 77 (top 10 work mostly at Databricks)
GitHub Stars: 2.3k
Claim to fame:
Supports Hudi and Delta Lake in addition to Iceberg.
Is more than just a technical catalog: it manages AI models and unstructured data, provides governance.
Apache Polaris
Total Contributors: 50 (top 10 work mostly at Snowflake and Dremio)
GitHub Stars: 1.1k
Claim to fame:
Supported by a more diverse list of engines, including Flink and Dremio (although Unity is catching up).
Dremio’s Nessie is going to be merged with Polaris, giving it additional unique features.
Apache Gravitino
Total Contributors: 115 (top 10 work mostly at datastrato.ai)
GitHub Stars: 1k
Claim to fame:
Largest community of contributors of all REST catalogs, concentrated primarily in the APAC region.
Recently introduced RBAC (Role-based Access Control).
Lakekeeper
Total Contributors: 7 (most work at Hansetag)
GitHub Stars: 0.2k
Claim to fame:
Smaller community of developers, concentrated in Germany.
Planning to add Fine Grained Access Control.
Of the four self-managed catalog options, the first two, Unity OSS and Polaris, provide the best combination of breadth of support and industry momentum. The Gravitino catalog, while having many contributors, seems to have a regional focus, and Lakekeeper is still too early in its development.
If we get more time, we will evaluate the first two catalogs in later posts, but we first decided to see what's available in the managed catalog category. The average user's time is better spent on their data than on operating and scaling additional infrastructure.
Managed REST Catalogs
As a reminder, we have four options for managed REST Catalogs:
Tabular
Snowflake Open Catalog
Dremio Enterprise Data Catalog
AWS Glue (its new REST API)
Here, the choices were clear. The Tabular catalog is now by invitation only after the acquisition by Databricks, and Dremio's Enterprise Data Catalog, based on Nessie, will eventually merge with Apache Polaris. The news that the AWS Glue catalog added REST support is so fresh that we did not have time to evaluate it. It's something we want to do in the future.
This left us with Snowflake's Open Catalog as the first in line for evaluation, which made us a little worried that we would be locking ourselves into something specific to a particular data platform (Snowflake). Still, we decided to try it, seeing that the catalog went into GA more than a month ago, in October 2024, which typically signals high production readiness. In our evaluation, we decided to pay close attention to the ease of on- and offboarding from this catalog to reduce the lock-in concern.
Our evaluation of Snowflake’s Open Catalog
Ease of Onboarding
This post walks through setting up and using Snowflake's Open Catalog step by step. We found the setup experience mostly intuitive and well-guided, although at one point we were confused by the mix-up of the terms "service principal" and "service connection."
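To give a flavor of what onboarding looks like from Python, here is roughly what a PyIceberg configuration for Open Catalog looks like. Values in angle brackets are placeholders you get from your Open Catalog account and service connection; treat the exact property names as a sketch and check the current docs before relying on them:

```yaml
# ~/.pyiceberg.yaml (placeholder values, for illustration only)
catalog:
  open_catalog:
    uri: https://<account>.snowflakecomputing.com/polaris/api/catalog
    credential: <client_id>:<client_secret>
    warehouse: <catalog_name>
    scope: PRINCIPAL_ROLE:<principal_role_name>
```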
Governance
We liked that Open Catalog provides solid RBAC (role-based access control) governance functionality in the catalog, in addition to the minimally necessary REST interface implementation and Iceberg metadata management. It seems that Snowflake copied their RBAC implementation from core Snowflake warehouses to the Open Catalog product. That's great!
Managed Service
We also like that it is fully managed and that we do not have to run a Docker container on a box under our desk. Of course, the managed aspect means that Snowflake will eventually monetize this service. Open Catalog is currently free, but Snowflake will start charging for it in mid-2025. Their pricing model is based on the number of API calls (0.5 credits for 1 million API requests, as per the official Credits Consumption Table).
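To get a feel for the announced pricing model, here is a back-of-the-envelope calculation. The per-million-requests rate comes from the Credits Consumption Table; the dollar price per credit is a made-up assumption for illustration, since it varies by edition and region:

```python
# Back-of-the-envelope Open Catalog cost estimate.
CREDITS_PER_MILLION_REQUESTS = 0.5  # from the Credits Consumption Table

def monthly_cost(requests: int, usd_per_credit: float) -> float:
    """Estimated monthly cost in USD for a given number of API requests."""
    credits = requests / 1_000_000 * CREDITS_PER_MILLION_REQUESTS
    return credits * usd_per_credit

# Hypothetical workload: 20 million catalog API calls a month at an
# assumed $3 per credit -> 10 credits -> $30.
print(monthly_cost(20_000_000, 3.0))  # -> 30.0
```

Even with generous assumptions, catalog API charges look like a rounding error next to compute and storage for most workloads.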
Ease of Offboarding
Once you start relying on Open Catalog's RBAC functionality, and we predict you will, migrating off it will become harder. RBAC is not yet in the Iceberg spec, and each vendor will implement it differently. As a possible off-ramp path, you might consider migrating to Unity OSS (which is working on adding RBAC) or Gravitino (which already offers RBAC).
What’s missing
One thing missing from this offering is table maintenance, although, to be fair, it is not part of the Iceberg spec (here is what the Iceberg creators have to say about it).
Table maintenance is a big deal for keeping lakehouses performing at their best. Over time, as you delete or add sets of rows in your tables, the Iceberg table metadata (the manifest list and manifest files) becomes increasingly complex to analyze when determining which data files to prune during queries. If you have worked with Snowflake, Databricks SQL, or Fabric before, you know that the performance of analytical queries largely depends on the query engine's ability to minimize the number of data files it needs to scan. If the table's metadata does not make it simple to prune (eliminate from scanning) data files, your performance will suffer.
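To make the pruning point concrete, here is a toy model of what a query engine does with file-level statistics. This is an illustration of the idea, not Iceberg's actual metadata structures:

```python
# Toy illustration of metadata-based pruning: each data file carries
# min/max statistics for a column, and the engine skips any file whose
# value range cannot possibly match the query predicate.
files = [
    {"path": "a.parquet", "min_id": 0,   "max_id": 99},
    {"path": "b.parquet", "min_id": 100, "max_id": 199},
    {"path": "c.parquet", "min_id": 200, "max_id": 299},
]

def prune(files, lo, hi):
    """Keep only files whose [min_id, max_id] range overlaps [lo, hi]."""
    return [f["path"] for f in files if f["max_id"] >= lo and f["min_id"] <= hi]

# A query for ids between 150 and 250 only needs to scan two of three files.
print(prune(files, 150, 250))  # -> ['b.parquet', 'c.parquet']
```

Compaction and maintenance keep these per-file ranges tight and non-overlapping; without them, every file's range eventually overlaps every predicate and nothing gets skipped.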
This is why data warehouses like Snowflake have background processes that constantly optimize the metadata files to achieve the best pruning effectiveness. This is also why the new S3 Tables offering from AWS claims 3x lower query latency and up to 10x higher QPS rates. But these maintenance operations do not come for free. S3 Tables charges for table compaction, and if you look at the pricing example, the expected maintenance costs are 20% of overall storage and maintenance costs, which is reasonable.
What’s coming
Dremio's recent announcement of plans to merge their catalog, Project Nessie, into Apache Polaris is a big deal and hopefully will address the lack of table maintenance functionality. "We will treat Polaris as our catalog, and we will merge Nessie into Polaris," as per Read Maloney, Dremio's chief marketing officer.
Dremio's commercial managed service based on Nessie offers several advanced features, such as column- and row-level access control (also known as Fine-Grained Access Control) and automated table maintenance. If these two features indeed migrate over to Polaris, and if Snowflake's Open Catalog, based on Polaris, ends up offering them as well, the resulting managed service will have a robust governance and data security story (RBAC + FGAC) similar to Unity Catalog's and a good shot at high performance.
Next Steps
If you liked our review of Snowflake's Open Catalog, follow us on LinkedIn or sign up for our Substack newsletter to get notified when we publish the next post in our series. In that post, we will tie everything together and show how one can use common Python tools (PyArrow and Polars) to run data applications on top of an S3-based open lakehouse.
If you can't wait to start building an Iceberg lakehouse, here are two things that you absolutely must do to get going:
Create S3 buckets to store your lakehouse (read our new post now)
Set up a managed Iceberg catalog using Snowflake Open Catalog (step-by-step instructions and witty commentary in this post, available now)