Five predictions for the future of the modern Data Stack

The Modern Data Stack is quickly picking up steam in tech circles as the go-to cloud data architecture, but despite its rising popularity, it is often ambiguously defined. In this blog post, we’ll discuss what it is, how it came to be, and where we see it going in the future. Whether you’re new to the modern data stack or were an early adopter, there should be something of interest for you.

Highlights from Data + AI Summit NA 2021

One of the biggest conferences in the data field, Data + AI Summit North America 2021, took place last week. This time I didn’t give a talk of my own, which let me enjoy the sessions all the more as a listener. In this short report, I want to summarize my notes on the new features in Spark 3.1 and the upcoming 3.2 that were discussed in the Apache Spark Internals and Best Practices topic.

Deployment should be a priority in any commercial data science project


In this article, I want to give some of the reasons why I became convinced that every data scientist should learn some data engineering skills (or become friends with some data engineers). I want to present my argument from two points of view: a more technical view and a user experience focused view.

Get started with MLOps

A comprehensive MLOps tutorial with open source tools. Getting machine learning (ML) models into production is hard work. Depending on the level of ambition, it can be surprisingly hard, actually. In this post I’ll go over my personal thoughts (with implementation examples) on principles suitable for the journey of putting ML models into production within a regulated industry, i.e. when everything needs to be auditable, compliant and in control: a situation where a hacked-together API deployed on an EC2 instance is not going to cut it.

MLOps Vs Data Engineering: A guide for the perplexed

Machine learning involves multiple stages and calls for a broad spectrum of skills. Advances in ML have led to the creation of new specialisations. The ML scene has many specialist roles, and their functionalities overlap to the extent that these designations are sometimes used interchangeably. Case in point — MLOps and data engineering.

Top 6 CI/CD practices for End-to-End development pipelines

In this article, we’ll talk about some often-misunderstood development principles that will help you build more resilient, production-ready pipelines using CI/CD tools. Then, we’ll make it concrete with a tutorial on how to set up your own pipeline using Buddy.

Data Software-as-a-Service: the case for a hybrid deployment architecture


As founders of companies that build solutions designed to help teams deliver on the promise of data, we knew we wanted to build great products that are easy to deploy and manage for our customers. We also knew that since we would be integrating with our customers’ data stacks, we would need to offer the highest level of security and compliance. The question was: how are we going to build them? SaaS? On-prem? Something else? To meet these goals, we chose a hybrid deployment architecture, a new approach that marries on-prem security with SaaS convenience. Here’s why.

7 reasons why you should consider a Data Lake (and Event-Driven ETL)

A data lake doesn’t need to be the end destination of your data. Data is constantly flowing, moving, changing its form and shape. A modern data platform should facilitate the ease of ingestion and discoverability, while at the same time allowing for a thorough and rigorous structure for reporting needs.

A brief Introduction to 5 predictive Models in Data Science

Predictive Modeling in Data Science answers the question “What is going to happen in the future, based on known past behaviors?” Modeling is an essential part of Data Science and it is mainly divided into predictive and preventive modeling. Predictive modeling is a process of using data and statistical algorithms to predict outcomes with data models.

Simplify data access and publish model results in Snowflake using Domino Data Lab

Arming data science teams with the access and capabilities needed to establish a two-way flow of information is one critical challenge many organizations face when it comes to unlocking value from their modeling efforts. Part of this challenge is that many organizations seek to align their data science workflows to data warehousing patterns and practices.

Data quality from First Principles

The right way to think about Data Quality, from Kimball and Uber’s points of view. If you’ve spent any amount of time in business intelligence, you know that data quality is a perennial challenge. It never really goes away. For instance, how many times have you been in a meeting and found that someone has to vouch for the numbers being presented?

Mythbusting the analytics journey

This isn’t your typical recruiting story. I wasn’t actively looking for a new job and Netflix was the only place I applied. I didn’t know anyone who worked there and just submitted my resume through the Jobs page 🤷🏼‍♀️ . I wasn’t even entirely sure what the right role fit would be and originally applied for a different position, before being redirected to the Analytics Engineer role. So if you find yourself in a similar situation, don’t be discouraged!

From Data Lakes to Data Reservoirs

Good ideas take hold and spread like wildfire. Recently the data community has standardized on at least one core data format that is good enough to get behind: the file storage format Parquet. We are going to learn a little more about why it is such an excellent choice for our data at rest. Data at rest just means it isn’t currently in active memory.
