
Hey everyone, I’m Kote. I recently joined Tower, where my role is to understand our customers' needs. Stepping into data engineering means picking up a lot of new terms and ideas. It can feel overwhelming at first, with words like serverless compute, ETL, and data lakehouse popping up everywhere. Honestly, I’d much rather be at a real lake in a house.
Since the best way to learn about something is to try to explain it, I decided to explain, in my own terms, what some of these concepts mean to anyone who might need a bit of clarification.
In this post, I’ll explain what data lakehouses are and how they stack up against data warehouses and data lakes. I’ll do my best to use simple language, break down the differences, and talk about why data lakehouses are getting so much attention lately. If, by the end of this post, you want to try building your own data lakehouse, I suggest you try Tower.
What is a Data Warehouse?
Let’s start with what a data warehouse is. Imagine it as a storage space for data, much like a well-organized wardrobe. Everything has its place: shelves are like rows, clothes next to each other are columns, and shelf blocks are like pages. The rack is the table, and the floor is the schema. The comparison isn’t perfect, but it helps convey the idea.
In the case of a data warehouse, everything is in tables, and everything in tables is in rows and columns. It is a centralized repository for structured data. Information such as tables of sales figures and customer records is stored this way.
Before data enters a warehouse, it usually gets cleaned up and transformed to fit a predefined schema (a set structure).
Now, we’re about to get a bit heavy. Maybe Google some of these terms, or ask your local AI companion. Characteristics of a data warehouse:
Structured and Relational: Data is stored in relational tables with clear schemas. This makes it easy to run SQL queries and generate reports because everything is in a known format.
Governance and Quality: Because the data is modeled and cleaned before loading, you get high-quality, consistent data. It’s straightforward to manage access, apply security, and ensure everyone is analyzing the same, correct information. Imagine washing your clothes, folding them, and then putting them in the wardrobe. No germs are making it in there, and you know what is what.
Use cases: Data warehouses are optimized for fast analytical queries. Since the data is pre-organized and indexed, you can run complex reports (e.g., monthly sales trends) quickly. Perfect for business intelligence, reporting on study findings, and so on. In the case of a wardrobe, you just have a quick look at the neatly folded clothes and know exactly what is where and how you can use it.
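To make the wardrobe concrete, here is a tiny sketch of the warehouse idea using SQLite. The table name and the sales figures are made up for illustration; the point is that because the schema is fixed up front, a monthly sales trend is a single SQL query.

```python
import sqlite3

# A toy "warehouse": one relational table with a fixed, known schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        sale_date TEXT,   -- ISO date, e.g. '2024-03-15'
        region    TEXT,
        amount    REAL
    )
""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2024-03-01", "EU", 120.0),
        ("2024-03-15", "EU", 80.0),
        ("2024-04-02", "US", 200.0),
    ],
)

# Because the structure is known, an analytical query is one statement:
rows = conn.execute("""
    SELECT substr(sale_date, 1, 7) AS month, SUM(amount) AS total
    FROM sales
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [('2024-03', 200.0), ('2024-04', 200.0)]
```

Everything was cleaned and typed before it went in, so reading it back out is fast and unambiguous.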
But, as it turns out, data warehouses have some limitations. Unstructured data (data not organized into tables and rows, e.g., video or images) and semi-structured data are hard to store. Free-form text, images, logs, etc., must be organized into tables before they can be stored in the warehouse. This requires data transformation. (I’ve no idea how a clothes metaphor could work here, don’t ask.)
Scaling a warehouse to handle massive amounts of data can become quite costly. These days, companies deal with petabytes of information and need more flexible ways to store all their raw data. That’s where data lakes and data lakehouses come in.
What is a Data Lake?
Next, a data lake. If a warehouse is like a wardrobe, a data lake is like the chair in your bedroom: piled high with clothes that don’t need to be in the laundry. It’s a big pile of everything: you vaguely know what’s in there, but there’s no structure to tell you exactly what.
Maybe a better metaphor is a reservoir for water, where the water represents data, so, basically, a lake. You pour in data just as it is, without organizing or cleaning it first. A data lake is a central place to store huge amounts of information, whether it’s structured, semi-structured, or unstructured, and always in its original form. There’s no strict structure at the start. Data lakes are built to handle lots of data and keep costs down, which makes them very flexible.
Examples of services used as data lakes include Azure Data Lake Storage, Google Cloud Storage, and more. You can also build one yourself.
Again, let’s get heavier with terms. Characteristics of a data lake:
Toss in all kinds of data: database dumps, clickstream logs, images, JSON files, you name it. Pile the chair high with clothes, unread books, a water bottle, maybe the laptop you watch YouTube on before sleep. It’ll hold it.
Scalability and Cost-Effectiveness: Data lakes typically use cheap, scalable storage (like Amazon S3). It’s cheaper overall than storing the same in a data warehouse.
Use Cases: Great for machine learning, data discovery, and big data analytics. Data scientists can examine the data lake for raw data to find new patterns. Lakes are also useful for archival storage of data that might be useful someday but doesn’t fit neatly into tables yet.
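To illustrate how little structure a lake demands, here is a toy sketch that uses a local folder as a stand-in for an object store like Amazon S3 (which real data lakes typically sit on). The file names and contents are made up; the point is that files of any shape go in as-is, and nothing describes what they contain.

```python
import json
import pathlib
import tempfile

# A toy "data lake": one flat storage area that accepts any file as-is.
# (Real lakes use object stores like Amazon S3; a local folder stands in here.)
lake = pathlib.Path(tempfile.mkdtemp(prefix="lake_"))

# Dump raw data in whatever shape it arrives - no schema, no cleaning.
(lake / "clickstream.log").write_text("2024-04-02T10:00Z GET /home\n")
(lake / "users.json").write_text(json.dumps([{"id": 1, "name": "Kote"}]))
(lake / "photo.bin").write_bytes(b"\x89PNG...")  # pretend image bytes

# Nothing stops you from storing all of this, but nothing describes it either:
stored = sorted(p.name for p in lake.iterdir())
print(stored)  # ['clickstream.log', 'photo.bin', 'users.json']
```

Cheap and flexible, but you can already see the chair filling up: the lake knows it holds three files, and nothing more.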
But there’s a downside. A data lake can become a messy, disorganized place where it’s hard to find or trust the data. People sometimes call this a data swamp, full of redundant, outdated, or trivial information. Imagine a chair so piled up, you can’t even see it anymore. With so many different files, it’s tough to keep data quality high and manage everything properly. It can also be slower to analyze data directly from a lake, since you have to process large raw files each time.
What is a Data Lakehouse?
If the warehouse is a wardrobe and the lake is the chair with clothes piled on it, then the lakehouse is the room where both are found. There’s also someone there to sort and organize things. In simple terms, a data lakehouse is a way to store data that combines the low-cost, flexible storage of a data lake with the organization and speed of a data warehouse. It also has the computing power to process raw data directly within the lakehouse.
In short, it’s the best of both worlds. A data lakehouse uses the affordable storage of a data lake to keep lots of raw data, then adds the management and speed of a warehouse. You get the flexibility to store any kind of data in any format, plus structure, organization, and quick analysis. It’s like picking up the clothes from the chair and sorting them into piles on the floor. It might look a bit messy, but it’s much easier to work with.
Here are the characteristics of a data lakehouse, in ascending complexity:
Multi-format Storage: Like a lake, a lakehouse can hold structured, semi-structured, and unstructured data all in one place. You don’t need separate systems for different data types.
Vendor Choice: Compatible with multiple vendors, including Snowflake, Databricks, and Microsoft Fabric.
Affordable: Cheaper than proprietary data warehouses like Redshift or SQL Server.
Schema Flexibility and Enforcement: Lakehouses often use a mix of ways to organize data. You can bring in raw data using a method called schema-on-read, similar to a data lake. The lakehouse also has a metadata layer, so you can set rules and structures when you need them.
ACID Transactions and Governance: Unlike a pure data lake, a lakehouse supports ACID transactions and data governance features. These ensure that updates to the data are reliable and consistent. Governance tools (like data catalogs or unified security controls) are built in, preventing the data swamp problem and keeping data organized and trustworthy.
Performance and Analytics: A lakehouse is built for fast queries and analytics. With tools like indexing, caching, and query optimization, it can reach speeds comparable to those of a warehouse, even with large datasets. This means you can run BI reports or SQL queries directly on the lakehouse data and get quick results. You can also use the same data for machine learning, eliminating the need for separate copies.
Use Cases: Unified analytics and AI. Lakehouse platforms power everything from dashboard reporting to advanced analytics in one place. For example, a healthcare company could use a lakehouse to store patient records (structured data like in a warehouse and unstructured data like doctors’ notes or medical images) and analyze all of it together. Teams can run traditional BI dashboards and also train machine learning models on the same source of data.
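The schema-on-read idea from the list above can be sketched in a few lines. The field names and values here are made up: raw JSON records land as-is, and a structure is imposed only at the moment you read them.

```python
import json

# Raw records landed in the lakehouse as-is; fields and types vary per record.
raw_lines = [
    '{"user": "ana", "amount": "19.99"}',
    '{"user": "bo", "amount": 5, "coupon": "SPRING"}',
]

def read_with_schema(line):
    """Schema-on-read: impose types at query time, not at load time."""
    record = json.loads(line)
    return {
        "user": str(record["user"]),
        "amount": float(record["amount"]),  # coerce to one type on read
        "coupon": record.get("coupon"),     # optional field defaults to None
    }

records = [read_with_schema(line) for line in raw_lines]
print(round(records[0]["amount"] + records[1]["amount"], 2))  # 24.99
```

The raw data stayed flexible on the way in, yet every consumer downstream sees one consistent shape.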
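The ACID guarantee mentioned above can be demonstrated in miniature with SQLite. Lakehouse table formats such as Delta Lake and Apache Iceberg provide this kind of atomicity on top of object storage, but their APIs vary, so this sketch only illustrates the behavior itself: if any statement in a transaction fails, the whole transaction rolls back.

```python
import sqlite3

# ACID in miniature: either every statement in a transaction applies, or none do.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'Ada')")
conn.commit()

try:
    with conn:  # one atomic transaction
        conn.execute("INSERT INTO patients VALUES (2, 'Grace')")
        conn.execute("INSERT INTO patients VALUES (1, 'Dup')")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass  # the whole transaction rolls back, including the valid insert

count = conn.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
print(count)  # 1 - 'Grace' was rolled back along with the failing insert
```

Without this guarantee, a half-applied update would silently corrupt the table, which is exactly the data swamp problem a lakehouse is designed to prevent.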
The biggest challenge with data lakehouses is their complexity. This architecture is still new and can be tough to set up. It often takes more technical skill than using a basic warehouse or lake. But the benefits are worth it: you get a flexible, all-in-one data platform. You no longer have to manage separate systems or complicated pipelines to move data around.
Is a Data Lakehouse right for you?
Now that we’ve covered all the chair analogies, the real question is: what’s the best choice for your organization?
At Tower, we like to dogfood what we offer to our customers, which is why we created our own lakehouse to understand product usage metrics. You can read about how we did it in our blog series.
In my next post in the lakehouse series, I’ll discuss why data lakehouses have become so popular in the industry and why so many organizations are picking them up. I might even cover why it might be the right thing for you. You can also try Tower and build a lakehouse for your own org.
There will also be a part 3, where we’ll discuss which option suits different types of organizations. I’ll invite some experienced industry experts to join that conversation. In the meantime, we’d love to hear your thoughts on what might work best: join our Discord and let’s chat.