The rise of cloud computing, data mesh, and especially data lakehouses all reflect the industry's massive effort to adopt architectures that can keep pace with exponential data growth.
But the industry is still seeking alternatives. While solutions such as the data lakehouse typically pair an open-source processing engine with a table format for data governance and performance, some vendors are already building business intelligence tools that supplement the metadata architecture with a critical addition: the managed semantic layer.
Here’s what this newly added offering – and the resulting data structuring around it – means for the future of data analysis.
How Far We’ve Come
The advent of data warehouses in the 1980s was a critical development for enterprise data storage: storing data in a single location made it more accessible, allowed users to query their data with greater ease, and aided enterprises in integrating data across their organizations.
Unfortunately, “greater ease” often comes at the expense of quality. While data warehouses made data easier to store and access, they did not make it easier to move data efficiently – transfer queues could grow so long that query results were outdated by the time engineers obtained them.
Subsequently, a slew of new data warehouse variations have come about. Yet the inherent nature of data warehouse structure means that even with reconfigurations, not enough can be done to alleviate overcrowded pipelines or to keep overworked engineers from simply chasing their tails.
That’s why data innovators have largely turned away from the data warehouse altogether, leading to the rise of data lakes and lakehouses. These solutions were designed not only for data storage, but with data sharing and syncing in mind–unlike their warehouse predecessors, data lakes aren’t bogged down by vendor lock-in, data duplication challenges, or single truth source complications.
Thus, a new industry standard was born in the early 2010s.
But however quickly the industry has embraced data lakes, the explosion of new data is once again outpacing these standards. To build the infrastructure needed for efficient data transfer and usable open-format file management, a semantic layer – the table-like structure that improves performance and explainability during analytics – must be integrated into the data storage.
Blueprinting the Semantic Layer Architecture
Though the semantic layer has existed for years as open-standard table formats, its applications have remained largely static. Traditionally, this layer was a tool configured by engineers to translate an organization’s data into more straightforward business terms. The intention was to create a “data catalog” that consolidates the often-complex layers of data into usable and familiar language.
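The translation such a “data catalog” performs can be pictured as a simple lookup from physical schema names to business terms. A minimal, hypothetical sketch in Python (the table, column, and label names below are invented for illustration – real semantic layers are far richer, covering metrics, joins, and access rules):

```python
# Hypothetical "data catalog": maps cryptic physical column references
# to the business-friendly labels that analysts actually work with.
SEMANTIC_LAYER = {
    "fct_ord.ord_amt_usd": "Order revenue (USD)",
    "fct_ord.cust_id": "Customer",
    "dim_cust.seg_cd": "Customer segment",
}

def to_business_term(physical_name: str) -> str:
    """Resolve a raw column reference to its business-facing label.

    Falls back to the physical name when no mapping exists, so
    uncatalogued columns remain usable.
    """
    return SEMANTIC_LAYER.get(physical_name, physical_name)

print(to_business_term("fct_ord.ord_amt_usd"))  # Order revenue (USD)
print(to_business_term("stg.tmp_col"))          # stg.tmp_col (no mapping)
```

Traditionally, engineers hand-maintained mappings like this; the managed approach described below moves that burden to the vendor.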
Now, the creators of the open table formats Apache Iceberg and Apache Hudi are proposing a new approach – “designed” metadata architecture in which the semantic layer is managed on the organization's behalf, resulting in better processing performance, higher compression rates, and lower cloud storage costs.
What exactly does that mean?
The concept is similar to how data lakehouse vendors take advantage of open-source processing engines. A semantic layer architecture takes the same open-source table formats and lets solution vendors externally manage an organization's data storage, eliminating manual coding configuration while improving performance and reducing the storage footprint.
The process of creating this semantic layer architecture goes as follows:
- An organization’s cloud data lake is connected to the managed semantic layer software (i.e., giving permission to a vendor to manage their storage);
- The now-managed data, stored in a table format, is connected with an open-source processing engine or a data warehouse with external table capabilities;
- Data pipelines can then be configured to continuously improve the quality of data insights as the data grows, relating every managed table to corresponding actionable business logic.
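The hookup in the second step amounts to pointing an open-source engine at the managed tables. As a rough illustration using Apache Iceberg's Spark integration, registering a vendor-managed REST catalog might look like this (the catalog name `managed`, endpoint URL, and runtime version are placeholders; real setup depends on the vendor and engine):

```shell
# Launch Spark SQL with the Iceberg runtime and a hypothetical
# vendor-managed REST catalog registered under the name "managed".
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0 \
  --conf spark.sql.catalog.managed=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.managed.type=rest \
  --conf spark.sql.catalog.managed.uri=https://vendor.example.com/catalog
```

Once registered, the engine queries the managed tables like any other (e.g., `SELECT * FROM managed.sales.orders`), while table maintenance stays with the vendor.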
Table formats are notoriously difficult to configure, so the recent performance improvement is an important trend to watch within the analytics industry. Table formats were not widely utilized until recently, and many enterprises still lack the infrastructure or capabilities to support them. Accordingly, as data lakehouses gain popularity and momentum, enterprises must improve their table format capabilities if they hope to keep pace.
With the generative AI revolution upon us, tools such as Databricks Dolly 2.0 can already be trained on data lakehouse architecture in exactly this way – and the recent strides in AI are only the beginning of what this technology can offer.
Data Down the Line
It is increasingly critical for data-reliant companies to find ways to stay ahead of the curve.
Future data lakehouse architectures will likely separate the semantic layer from the processing engine into two independent components, with the semantic layer offered as a paid feature for improved performance and compression. We can also expect table formats to support a wider range of file formats, not only columnar and structured data.
By focusing on a singular aspect of the data lakehouse concept (i.e., simulating the “warehouse”), enterprises can significantly improve the overall performance of their metadata architecture.
Because the ability to do more with your data means your data will do more for you.
About the author: Ohad Shalev is a product marketing manager at SQream. Having served for over eight years as an officer in the Israeli Military Intelligence, Ohad received his bachelor's degree in Philosophy and Middle Eastern Studies from the University of Haifa, and his master's in Political Communications from Tel Aviv University.