Dremio, which for almost four years now has offered a platform designed to facilitate BI analysis over data lakes (first in Hadoop clusters and now in the cloud), is today announcing a multi-month initiative to evolve its platform’s performance to the point of parity with dedicated data warehouse platforms.
The initiative, called Dart (itself a reference to Dremio’s core Apache “Arrow” technology), is delivering certain performance gains immediately and will continue to improve the platform over the course of the next year or so. Tomer Shiran, founder and chief product officer at Dremio, briefed ZDNet and explained that Dart will get the Dremio platform to outperform SQL engines like Apache Hive and Presto, and match the performance of the Snowflakes and Redshifts of the world, while still allowing customers to keep their data in open formats and stored in cloud object storage (or HDFS, for that matter).
Optimization hit parade
Shiran would understand intimately the extent to which standalone SQL engines leave something to be desired. As VP Product Management at the erstwhile MapR (whose platform is now the HPE Ezmeral Data Fabric), Shiran was a major force behind one such engine, Apache Drill. While that engine delivered on the promise of universal SQL query access to data in numerous sources, its performance and adoption were somewhat lackluster. When Shiran left MapR to co-found Dremio with fellow MapR alumnus Jacques Nadeau, he understood that smart optimization was the key to business intelligence (BI)-scale interactive querying of what we now call data lakes.
Dart is true to that mission. It introduces industrial-class query planning and expanded native code query execution, via Dremio’s open source Gandiva toolset. Dart also brings better ANSI SQL support, including nearly universal support for read-oriented query operations. In addition, by ditching the Hive metastore and placing metadata directly in the lake, Dremio can dispatch large metadata operations during execution instead of up front, further accelerating queries. Dremio says the result is query planning that is up to 8x faster, a processing rate that is up to 6x faster and execution that is up to 8x faster.
Potato, potahto
Despite the headline for this post about Dart converging the warehouse and lake paradigms, the headline for Dremio’s press release pushed the premise that Dart accelerates the obsolescence of cloud data warehouses. Clearly, different parties see the question differently. Vendors like Dremio and Databricks wish to convince you that the lake supersedes the warehouse. Vendors like Snowflake wish to do the opposite. Then there’s Microsoft, which offers both a warehouse and an Apache Spark-based data lake in its Azure Synapse Analytics service (and on-premises does essentially the same with SQL Server Big Data Clusters).
So what gives? The answer is that the technology matters less than the use case. Most warehouses are meticulously modeled, and operated with a high barrier to entry of new data, with strict curation. Most lakes seek to be inclusive of data to allow analysis of the “unknown unknowns.” Warehouses tend to use columnar, relational database technology and lakes tend to consist of CSV, JSON and Parquet files in cloud storage.
But one could argue here that Dremio is implementing warehouse technology rather than obsolescing it. The real difference is that in the Dremio case, the data is stored in open formats with which many other analytics engines are compatible. Most data warehouses, meanwhile, use proprietary formats optimized for, but captive to, their own platform.
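For readers who want the row-versus-column distinction made concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not Dremio's implementation: the sample orders, the field names and every function name below are hypothetical. It reads the same records from two open text formats (CSV and line-delimited JSON), pivots them into a column-oriented layout, and runs a simple aggregate over just the two columns it needs.

```python
import csv
import io
import json

# Hypothetical sample data: the same three orders in two open text formats.
CSV_TEXT = "order_id,region,amount\n1,east,10.5\n2,west,4.0\n3,east,5.5\n"
JSON_LINES = "\n".join([
    '{"order_id": 1, "region": "east", "amount": 10.5}',
    '{"order_id": 2, "region": "west", "amount": 4.0}',
    '{"order_id": 3, "region": "east", "amount": 5.5}',
])


def rows_from_csv(text):
    """Parse CSV text into a list of row dicts with typed values."""
    return [
        {
            "order_id": int(row["order_id"]),
            "region": row["region"],
            "amount": float(row["amount"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def rows_from_json_lines(text):
    """Parse line-delimited JSON text into a list of row dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def to_columns(rows):
    """Pivot row dicts into a column-oriented dict of lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_by_region(columns):
    """Sum 'amount' per 'region', touching only those two columns."""
    totals = {}
    for region, amount in zip(columns["region"], columns["amount"]):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


if __name__ == "__main__":
    columns = to_columns(rows_from_csv(CSV_TEXT))
    print(total_by_region(columns))
```

Because both files are in open formats, any engine that can parse them arrives at the same rows. The columnar pivot hints at why analytic engines favor that layout: an aggregate reads only the columns it needs rather than every field of every record.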
Just don’t call me late to query
Regardless of the storage medium and proprietary or open source approach, the coexistence of curated and modeled data with more inclusive, casually structured data must be accommodated. Use whatever labels you want. Just make sure you can accommodate both use cases and that the mission-critical queries run fast.