This post was originally published by Jordan Volz at Medium [AI]
The Modern Data Stack is quickly picking up steam in tech circles as the go-to cloud data architecture, and although its popularity has been quickly rising, it can be ambiguously defined at times. In this blog post, we’ll discuss what it is, how it came to be, and where we see it going in the future. Regardless of whether you’re new to the modern data stack or have been an early adopter, there should be something of interest for everyone.
What is the Modern Data Stack?
The Modern Data Stack commonly refers to a collection of technologies that comprise a cloud-native data platform, generally leveraged to reduce the complexity in running a traditional data platform. The individual components are not fixed, but they typically include:
- A Cloud Data Warehouse, such as Snowflake, Redshift, BigQuery, or Databricks Delta Lake
- A Data Integration Service, such as Fivetran, Segment, or Airbyte
- An ELT data transformation tool, almost certainly dbt
- A BI layer, such as Looker or Mode
- A Reverse ETL tool, such as Census or Hightouch
The goal is to make data actionable by reducing the time it takes for data to become useful to data workers in an organization. Gone are the days when it took weeks for data to land in your company’s analytical warehouse after creation; now it happens in hours or minutes. Companies that go down the path of the modern data stack adopt the technology as it fits their needs: you don’t necessarily need every component, and some may opt for other technologies, such as Airflow, Dagster, or Prefect for an orchestration layer. A simple sample architecture is illustrated below.
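The end-to-end flow described above can be sketched in a few lines of Python, with the standard library's sqlite3 standing in for the cloud warehouse (all table and column names here are illustrative):

```python
import sqlite3

# sqlite3 stands in for the cloud data warehouse; in the real stack this
# would be Snowflake/BigQuery/Redshift.
warehouse = sqlite3.connect(":memory:")

# 1. Extract & Load: raw records land in the warehouse untransformed.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1999, "complete"), (2, 525, "refunded"), (3, 4350, "complete")],
)

# 2. Transform in-warehouse with SQL, the way a dbt model would materialize
# a cleaned table on top of the raw data (ELT rather than ETL).
warehouse.execute(
    """
    CREATE TABLE fct_completed_orders AS
    SELECT order_id, ROUND(amount_cents / 100.0, 2) AS amount_usd
    FROM raw_orders
    WHERE status = 'complete'
    """
)

# 3. The BI and reverse ETL layers then read the modeled table directly.
count, total = warehouse.execute(
    "SELECT COUNT(*), SUM(amount_usd) FROM fct_completed_orders"
).fetchone()
print(count, total)
```

In the real stack, step 1 would be handled by an integration tool such as Fivetran or Airbyte, and step 2 would live in a version-controlled dbt model rather than an inline string.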
Simply having a data platform in the cloud does not make it a “modern data stack.” In fact, I would wager that most cloud architectures fail to meet the categorization. Lift-and-shifted platforms, cloud data lakes, and bespoke solutions often fail to capture the essence of the modern data stack and often feel as clunky as their on-premises cousins. So what makes something part of the modern data stack? If we look across the technologies in this ecosystem, we’ll notice that they share some common properties that get at its core. I’ll propose the following as key capabilities of technology in the modern data stack:
- Offered as a Managed Service: No or minimal setup and configuration is required from users, and absolutely no engineering. Users can get started today, and it’s not a vapid marketing promise.
- Centered around a Cloud Data Warehouse (CDW): Everything “just works” off-the-shelf if companies use a popular CDW. By being opinionated about where your data is, you eliminate messy integrations.
- Democratizes data via a SQL-Centric Ecosystem: Tools are built for data/analytics engineers and business users. These users often know the most about a company’s data, so it makes sense to try to upskill them by giving them tools that speak their language.
- Elastic Workloads: Pay for what you use. Scale up instantly to handle large workloads. Money is the only scale limitation in the modern cloud.
- Focus on Operational Workflows: Point-and-click tools are nice for low-tech users, but it’s all kind of meaningless if there’s not a viable path to production. Modern data stack tools are often built with automation as a core competency.
Users of the modern data stack routinely sing its praises. By adopting the modern data stack, companies get a low-cost platform that’s easy to set up, easy to use, and requires little expertise to churn out production workflows. It’s easy to see why so many have jumped on this trend and are never going back.
How it Started / How it’s Going
In the beginning, we stored data by drawing pictures on the walls of caves. Sometime later (1970), Edgar F. Codd invented the relational data model and published A Relational Model of Data for Large Shared Data Banks, which is credited with starting the RDBMS craze. By modern standards, this new technology was slow to get off the ground, but over the next couple of decades, many companies started offering databases to customers: IBM, Oracle, Microsoft, Teradata, etc. New technologies also emerged that made working with databases easier, such as data integration and reporting tools, and a new language was created, SQL, that made working with data in your database relatively straightforward (i.e. no coding necessary). And, for decades, on-prem databases were perfectly sufficient for the vast majority of use cases that companies were trying to solve with the small amounts of data they stored.
Over time, Moore’s Law, the creation of the Internet, and user-generated content made it such that it was not a totally crazy or ridiculously expensive idea that the average company would try to store and analyze large amounts of data. This was great for businesses, but not so great for RDBMS systems, which were not designed to handle large-scale data operations. Enter Big Data; the 2000s saw the advent of many new types of technology systems that were tailor-made to handle large volumes of data: Hadoop, Vertica, MongoDB, Netezza, etc. These systems were typically of the distributed SQL or NoSQL variety, focused on parallelizing data operations across clusters of servers, and they evolved quickly to handle a variety of use cases like their predecessors. Finally, enterprises had a viable option for handling large volumes of data.
Big Data’s reign lasted less than a decade before it was disrupted by burgeoning cloud technology in the early and mid-2010s. The traditionally on-premises big data technologies struggled to shift into the cloud, and the complexity, cost, and expertise required to operate and maintain these platforms couldn’t compete with much more nimble and agile cloud platforms. Soon, cloud data warehouses were back: they offered the same simplicity and ease of use as prior iterations of the RDBMS, but now you could forego the team of DBAs, and many of these systems were built to handle big-data-scale workloads. The shift started with smaller companies that lacked the manpower required for big data solutions, as the SaaS-oriented cloud environment drastically lowered the barrier to entry, but larger companies quickly hopped on board to simplify and reduce costs via elastic workloads. Separating compute and storage became fashionable once again, now that you paid for compute only when you used it. Seemingly overnight, everyone was migrating their data platforms to the cloud.
It started in 2010 with Google’s BigQuery; Redshift and Snowflake soon followed (2012). Almost immediately, an ecosystem of adjacent cloud-native technologies began to emerge: BI (Chartio, 2010; Looker, 2011; Mode, 2012); Data Integration (Fivetran, 2012; Segment, 2013; Stitch, 2015); Data Transformations (dbt, 2016; Dataform, 2018); and Reverse ETL (Census, 2018; Hightouch, 2018). These cloud-native technologies had many of the advantages listed above in comparison to their older counterparts. When organizations are cleaning house and prioritizing ease of use and usage-based pricing when adopting new data technologies, it’s difficult for legacy vendors to keep telling a compelling story with server-based or user-based pricing. If nothing else, it’s difficult to justify signing yet another six- or seven-figure contract to shift your standard tooling into the cloud when the new guy on the block has serious momentum and will let you get started for thousands of dollars a month, or less.
A cloud data warehouse, in and of itself, is useful but not transformative. The ecosystem that sprang up around these platforms is what really makes the modern data stack what it is. It’s now feasible to go from zero to production in the space of a week, without breaking the bank or spending months in architecture reviews or pipeline integration work. Data leaders who have gone down this path feel the tailwind of the modern data stack as their organizations quickly begin executing on their data initiatives. A half-century after the initial ideas of the data warehouse were laid down, the platform has become robust and approachable enough that any company can pick it up and be as competitive in data analytics as the best high-tech companies.
The Need for Innovation
Tristan Handy recently blogged about the state of the modern data stack. His fantastic post is well worth a read and highlights the cyclic nature of innovation that typically accompanies large paradigm shifts. In our case, many pieces of the current modern data stack launched in the early-to-mid 2010s, while the next few years saw innovation slow down a bit. The reason is that companies were busy adopting the technology and executing on use cases, and some time is needed to fully assess productivity and ROI before the next wave of innovation occurs. Likewise, innovators need to see positive signals from users before they put effort and resources into new ideas; otherwise, they risk innovating on a dead platform. Once there is positive reinforcement that the technology works, the gates open for fresh ideas to be explored.
Handy concludes that the next five years should see another round of innovation in the modern data stack, and he highlights five key areas that he believes are ripe for exploration: Governance, Real-Time, Completing the Feedback Loop, Democratized Data Access, and Vertical Analytical Experiences. I agree with Handy’s general premise that the modern data stack will undergo significant change in the years to come, but I will offer up a different set of five hotspots in the next section. I also believe that it is imperative for the modern data stack to evolve if it wishes to survive long term; the growth of the modern data stack isn’t merely a nice-to-have for users, it’s essential.
The need for innovation is key for several reasons. Historically, the modern data stack has been quickly adopted by smaller companies and cloud-savvy startups. Its strengths play into the resources of these companies; they don’t have teams of people to build and maintain a data platform, and they’re not bogged down by decades of tech debt or wary of migration efforts. With every company that adopts this approach, there’s growing evidence that this is the right way to do data in the cloud. However, large enterprises with a variety of data, complex data teams, and decades of old tech in the closet may find the current state of the modern data stack too narrow in its use cases and lacking enticing capabilities, like AI, data governance, and streaming. The modern data stack needs to appeal to large enterprises in order to survive past being just the latest data platform trend. To do so, vendors must address key enterprise gaps in order to bring large companies under its wing. If it’s able to do so, it will solidify its place as the premier cloud data platform. If not, I suspect large enterprises will largely elect to follow other architectural patterns in their cloud adoption.
Secondly, the modern data stack must evolve to handle more complex use cases beyond analytics. The analytical foundations of the modern data stack are a great starting point, and it’s certainly true that data teams need to learn to walk before they can run, but if more complex use cases require implementing different platforms, people will quickly lose interest in the modern data stack as business leaders demand solutions. The idea that a single platform can easily and efficiently handle the majority of your data use cases has certainly been a hot marketing concept in the past decade or two, but it appears that the modern data stack could actually deliver on this front.
Lastly, it’s difficult to talk about the growth of the modern data stack without mentioning the two largest infrastructure-independent vendors: Snowflake and Databricks. Snowflake, with its “Data Cloud” product, fits nicely into the modern data stack and the vast ecosystem that has spun up around it. Databricks (and Delta Lake) grew out of the big data technology space but differentiated itself by being cloud-native and has been reaping the rewards as a result. The companies have starkly different approaches: Snowflake is founded on SQL technology and has succeeded by being the easiest CDW to implement and get going with; partners bring in opinionated products that work with minimal setup and reduce time to value for customers. Databricks, on the other hand, believes in a more general approach in which skilled resources take the building blocks the company provides and build out efficient data pipelines. Recent announcements from both companies show each sprinting toward the other’s territory, so there’s an inherent understanding that these approaches are destined to merge into a platform that works for all enterprises: those seeking an easy, paint-by-numbers approach, and those wanting more knobs and control over the underlying machinery.
Modern Data Stack v2
In this current time of growth in the modern data stack, I believe there are five key areas ripe for innovation:
- Artificial Intelligence
- Data Sharing
- Data Governance
- Streaming
- Application Serving
These represent complex use cases that evolve naturally out of existing ones, enhanced enterprise readiness, and future-proofing the platform against inevitable disruptors. We’ll discuss each in turn below.
Artificial Intelligence
Anyone who has worked in data science over the past decade is probably familiar with the “Data Science Hierarchy of Needs,” which looks something like the following:
For the uninitiated: each step builds upon the one below it, and the diagram illustrates the dependencies necessary to reach a point where you can even begin solving AI problems. If a company doesn’t have a solid story for how it collects, stores, and transforms data, any data science project is doomed before it starts, because the foundations upon which it rests are rapidly shifting. Conveniently for the modern data stack, the first four layers of the diagram perfectly describe the tooling needed to get started; we’re only missing the AI layer:
AI represents a huge growth opportunity for many businesses. We’re confident that the leading companies in virtually every sector in 2030 will be the ones that prioritized enabling AI across their enterprise. However, there is no shortage of articles bemoaning how the average company fails to get its data science efforts into a meaningful production environment. As this diagram from Andreessen Horowitz makes clear, this is not getting any easier over time:
So why is the modern data stack an ideal platform for artificial intelligence? We plan to get into the gory details in a future article, but the gist is that the modern data stack is a prime landing spot for a new type of AI platform, a declarative data-first AI platform that radically simplifies operational AI. We’ve been building such a platform at Continual, and we believe that this will be the main way enterprises elect to execute AI use cases in the future. This data-first approach has been pioneered by companies like Apple (Overton) and Uber (Ludwig), and they offer a different approach to AI than traditional ML platforms that are pipeline or model-centric. Data-first systems prioritize automation and operationalization of AI, drastically reducing the time it takes to build and maintain new use cases, and push them into production.
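To make the distinction concrete, here is a toy sketch of the declarative, data-first idea; every name below is hypothetical and is not Continual’s (or any vendor’s) actual API. The user declares what to predict and from which table, and the “platform” owns training, here reduced to a trivial per-group baseline:

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical sketch of a declarative, data-first AI workflow. The spec and
# function names are illustrative only, not a real product's API.
@dataclass
class PredictionSpec:
    source_table: str  # feature data already modeled in the warehouse
    entity: str        # one row per entity, e.g. customer_id
    target: str        # the column to predict

def fit_baseline(rows, spec, feature):
    """The 'platform' side: given the declared target, learn a trivial
    per-feature-value rate as a stand-in for real model training."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[spec.target])
    return {value: mean(targets) for value, targets in groups.items()}

spec = PredictionSpec(source_table="customers", entity="customer_id", target="churned")
rows = [
    {"customer_id": 1, "plan": "free", "churned": 1},
    {"customer_id": 2, "plan": "free", "churned": 0},
    {"customer_id": 3, "plan": "pro", "churned": 0},
]
model = fit_baseline(rows, spec, feature="plan")
print(model)
```

The point of the design is that the user never writes the training pipeline: everything downstream of the declaration is the platform’s responsibility, which is what keeps the use case maintainable in production.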
Michael Kaminsky wrote a great blog last year on what a modern data science stack would look like, and we believe this is most fully realized in a declarative system. AI represents such a huge growth vector for companies that the modern data stack will struggle to stay relevant without innovation in this area. At the same time, legacy ML systems are both incongruent with the main principles of the modern data stack and have also demonstrated that they quickly drown in complexity. We believe a new wave of vendors are primed to disrupt the AI landscape with platforms that make operational AI both easy and robust. To learn more about our vision for AI on the modern data stack, I’ll refer you to our launch blog.
Data Sharing (Data-as-a-Service)
Companies like Census and Hightouch provide an invaluable service that allows users to easily and quickly move data out of their cloud data warehouse and into the downstream applications that need it. For a lot of use cases this is essential; without purpose-built tooling, companies would be stuck writing custom integration scripts or bespoke tooling that is difficult to implement and maintain.
There’s another related use case for companies: sharing data. Maybe it’s something as simple as keeping a database of product descriptions that distributors can access and use on their own websites. Today, access to the data is probably provided as an API, which requires engineering work to set up and maintain and likewise for each company that consumes the API. In the modern cloud era, why not simply make this database available to others on the same platform? That’s certainly the idea behind Snowflake’s Data Marketplace. Databricks also recently announced Delta Sharing, which seeks to accomplish a similar goal. By providing access to the data directly in the data platform, these companies have taken the data integration requirements down to zero and users can immediately make the data actionable in their own organizations.
These solutions are great for your customers who happen to be on the same data platform as you, but in reality you’re likely to interact with companies working across a variety of platform vendors. It would be excellent to have a tool that brokers access to your data across the different platforms you opt to participate in and makes it easy to upload and manage your datasets. Historically, data-as-a-service companies have focused on automating the construction of APIs for external use, but with the rise of the modern data stack, we envision a new need: companies that work within each platform’s native sharing mechanism and take advantage of these in-platform options.
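One way to picture such a broker: register a dataset once, then let the broker emit the platform-specific grant for each consumer. The sketch below is hypothetical, and its templates only gesture at each vendor’s sharing syntax rather than reproducing the real Snowflake or Delta Sharing APIs:

```python
# Hypothetical cross-platform data-sharing broker. The templates below are
# illustrative placeholders, not real vendor SQL.
SHARE_TEMPLATES = {
    "snowflake": "GRANT SELECT ON TABLE {table} TO SHARE {consumer}_share",
    "delta": "ALTER SHARE {consumer}_share ADD TABLE {table}",
}

def grant(table: str, consumer: str, platform: str) -> str:
    # One registration, one call per (consumer, platform) pair: the broker
    # emits whatever the target platform's native sharing mechanism needs.
    return SHARE_TEMPLATES[platform].format(table=table, consumer=consumer)

statements = [grant("product_descriptions", "acme", p) for p in ("snowflake", "delta")]
for s in statements:
    print(s)
```

The value is in the indirection: the publisher manages one dataset and one list of consumers, and the broker handles each platform’s sharing mechanics.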
Data Governance
Returning to the topic of making the modern data stack ready for large enterprises: the importance of data governance cannot be overstated. Some companies have so many datasets that it would be virtually impossible to track and make sense of their data without governance tools. This is easy for the individual to lose sight of, since one’s own purview is often fairly limited compared to a company’s total data landscape, but good data leaders consistently reinforce the importance of good data governance tools across discovery, observability, cataloging, lineage, auditing, etc.
Without good governance tools, the modern data stack will likely be too chaotic and unwieldy for large enterprises and their behemoth data volumes. Governance imposes order on an organization’s data and breaks down natural barriers, making discovery and collaboration actually possible!
It’s also much more likely that large enterprises will continue to elect a multi-vendor approach to the cloud, meaning that metadata tools that work across data platforms will deliver great ROI and be resilient to whatever homegrown tools the platform vendors themselves come up with.
There’s no shortage of companies attempting to solve this problem within the context of the modern data stack. Although this race is far from over, a few vendors who are showing great promise are Monte Carlo Data, Stemma, and Metaplane.
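Lineage, one slice of the governance story above, is at heart graph bookkeeping: record which tables each model reads from, then walk the graph to answer questions like “what is upstream of this dashboard table?” A minimal sketch, with invented table names:

```python
# Toy lineage graph: each table maps to the tables it reads from. Real
# governance tools build this automatically from query logs or dbt manifests.
lineage = {
    "fct_orders": ["raw_orders", "raw_customers"],
    "dashboard_revenue": ["fct_orders"],
}

def upstream(table, graph):
    """Depth-first walk collecting every transitive ancestor of a table."""
    seen = []
    for parent in graph.get(table, []):
        if parent not in seen:
            seen.append(parent)
            seen.extend(p for p in upstream(parent, graph) if p not in seen)
    return seen

print(upstream("dashboard_revenue", lineage))
```

Impact analysis is the same walk in reverse: invert the graph and ask what is downstream of a raw table before changing it.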
Streaming
Streaming data is really the holy grail of the cloud data warehouse. When talking to modern data stack users, the most common response I get when asking about real-time use cases is: “We can’t even fathom doing things in real time.” But inevitably there’s an acknowledgment that this capability would be amazing if it were made easy enough for the average company. If a CDW vendor can compellingly offer real-time access to data, it would likely be a huge game-changer for many companies.
As companies progress through their data use cases, however, they’ll inevitably arrive at the need to execute on streaming data. Maybe not today, maybe not tomorrow, but eventually we’ll come to a point where companies who can’t handle streaming data will lose to those who can. So, again, we’re at a fork in the road where the modern data stack can either provide an elegant solution that fits nicely alongside the other technology, or it will force customers to re-platform. The latter begins to erode confidence in the overall platform and thus, we need to find a way forward.
Like AI, streaming use cases tend to be very complicated, and this represents a large opportunity for vendors to simplify them for CDW-centric users. Current streaming technology seems to assume that you’re either a software developer or some kind of time wizard, and we haven’t yet seen the streaming world revolutionized the way data warehouses have been. Someone who offers a solution in which I don’t need to think about infrastructure or move my data into a new platform, and can simply and reliably get live data via a standard SQL query, will be greatly celebrated.
Today, functionality already exists across the various vendors: Snowflake has Snowpipe and has hinted at new capabilities; BigQuery, Redshift, and Snowflake have materialized views (albeit with serious limitations); and Databricks supports structured streaming. Outside of your standard CDW vendors, Confluent offers ksqlDB, Decodable is working to reimagine streaming data engineering, and Materialize is a new company positioning itself similarly but more native to the modern data stack (with a dbt plugin no less!). I inevitably struggle to see how a separate platform for streaming is going to fit in nicely with the larger modern data stack story, but it’s possible.
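The experience described above, live data via a plain query, amounts to maintaining aggregates incrementally as events arrive, so that reads are lookups rather than batch recomputes. A toy stand-in for a streaming materialized view (not any vendor’s implementation):

```python
from collections import defaultdict

class RunningRevenueView:
    """Materializes SUM(amount) GROUP BY product, updated one event at a time,
    as a toy stand-in for a streaming materialized view."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, event):
        # Each incoming event touches only the affected group; nothing is
        # recomputed from scratch.
        self.totals[event["product"]] += event["amount"]

    def query(self, product):
        # The 'SELECT total WHERE product = ?' read is a constant-time lookup.
        return self.totals[product]

view = RunningRevenueView()
events = [
    {"product": "a", "amount": 10.0},
    {"product": "b", "amount": 5.0},
    {"product": "a", "amount": 2.5},
]
for e in events:
    view.apply(e)
print(view.query("a"))
```

Products like Materialize generalize this idea to arbitrary SQL, keeping views fresh as sources change so the consumer never sees the machinery.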
Application Serving
The last item on our wish list for the modern data stack is application serving. Yes, eventually your data platform is so awesome that people will realize it contains all the data necessary for the killer application they’ve been writing, and they’ll want to hook up to it. This is generally where our ride comes to a grinding halt. Cloud data warehouses are squarely in the OLAP category, whereas live applications want high concurrency and low latency, which is OLTP territory. This does not compute.
There are workarounds, particularly for read-only workloads. Today the sensible thing to do is to copy your data into a system suitable for application serving: Redis, or Cassandra, or Memcached, or SingleStore, or … Again, we’re introducing a new platform for a new problem, which increases the complexity and burden of the system. It’s easy to see how innovation within the modern data stack to enable application serving directly from your existing data would be welcome.
Netflix recently took a unique approach to this dilemma with a tool they created called Bulldozer, which would actually fit very nicely into the modern data stack. The key is that they abstracted the underlying caching layer and implemented a declarative workflow to move the data; it’s easy to imagine applying this to your favorite CDW to let users “stage” data for application serving with just a click or two.
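A rough sketch of that Bulldozer-style pattern, with a plain dict standing in for the serving cache (Redis, Memcached, etc.) and an invented job-spec format rather than Bulldozer’s actual one:

```python
# Hypothetical declarative staging job: the spec names a warehouse table and
# a key column, and the mover copies rows into a key-value serving layer.
def stage_for_serving(job, warehouse_tables, cache):
    rows = warehouse_tables[job["source_table"]]
    for row in rows:
        # Keys like 'user_scores:7' let the application do O(1) point reads.
        cache[f"{job['source_table']}:{row[job['key_column']]}"] = row
    return len(rows)

# 'warehouse_tables' stands in for OLAP query results; 'cache' for Redis etc.
warehouse_tables = {
    "user_scores": [{"user_id": 7, "score": 0.91}, {"user_id": 8, "score": 0.42}],
}
cache = {}
n = stage_for_serving(
    {"source_table": "user_scores", "key_column": "user_id"},
    warehouse_tables,
    cache,
)
print(n, cache["user_scores:7"]["score"])
```

The design choice worth noting is the abstraction boundary: the user declares what to stage, and the caching layer behind the spec can be swapped without touching the application.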
Snowflake has lately been making a lot of noise around using the Snowflake Data Cloud for data applications, so this appears to be a use case they are focused on. Details pending.
If you’ve adopted a modern data stack, or are considering getting started, you’re in for an exciting next few years of innovation. A critical task will be assembling the set of best-of-breed components that drive business impact while reducing development and maintenance costs. At Continual, we believe that operational AI is a mandatory component of any truly modern data stack. Whether you’re maintaining a customer churn prediction to personalize marketing or a demand forecast to improve margins, it shouldn’t require a pipeline jungle and complex infrastructure to deliver. To experience how Continual makes operational AI easy on the modern data stack, you can request early access.