This post was originally published by Ajay Khanna at Towards Data Science
The key to solving any analytical problem is to have the right data. Data is an asset. One that forward-thinking organizations seek out just as actively as they would revenue streams or new customers. And for good reason — with relevant data, organizations make smarter decisions and solve critical business challenges. Today, “good analytics” means going well beyond a few commonly available algorithms or a dashboard to showcase internal data.
Enterprises looking to make the most of the intelligence revolution must evolve and look beyond their four walls to get the data they need. Having access to relevant external data is a competitive advantage. Companies must build effective strategies to find, acquire, deploy, and understand new alternative data assets. An external data acquisition strategy can help solve complex problems much faster and make companies genuinely data-driven.
So if external data is the source of competitive advantage, how can companies get better at its acquisition? Turns out, the same continuous improvement processes so prevalent in the manufacturing sector during the nineties can be applied to modern-day data engineering. Two in particular stand out: the Kaizen quality framework (Why, What, Where, Who, When, and How) can help plan your data acquisition program, and the 5S technique offers a framework that can be used to manage and organize external data.
Kaizen Quality Framework
Kaizen’s approach to continuous improvement centers on six fundamental questions: Why, What, Where, Who, When, and How. Let’s explore how they relate to external data acquisition.
Why do we need external data?
The key is to start by defining the problems that you are ultimately trying to solve. Either you have a specific question that you need to answer (for which you need new, better, or complementary data), or you’re struggling to optimize an existing solution (for example, fine-tuning a model and improving the accuracy of its predictions).
What data do we need?
The first step is to look for relevant data that helps you uncover insights. External data such as foot traffic, pricing, firmographics, technographics, and other marketing and financial attributes can improve predictive models and business outcomes.
Where to get the data?
Finding the right data presents its own set of challenges. There are hundreds of thousands of premium and public data sources. Searching for and evaluating data for your use cases can take time, yet timely access to the data is key to organizational agility.
Who is going to use this data?
There is a wide variety of external data use cases, meaning the types of users vary. Users have different needs and skillsets. Users can range from marketing and sales operation members to customer insight teams, fraud and risk officers, business analysts, and data scientists. Understanding their needs and providing data that they can quickly evaluate and consume is critical to getting the maximum ROI from your data investments.
When should the data be refreshed?
Determine how often the data needs to be refreshed based on your use case and the data signals it requires. Ensure you have the correct data onboarding frequency to keep your data current, accurate, and relevant. Predictive model performance and drift should also be continuously monitored to see if the model is still performing according to business expectations. If any signal loss occurs (such as the loss of third-party cookies), you should immediately seek alternative data sources to maintain business continuity.
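One way to monitor a data signal for drift between refreshes is the Population Stability Index (PSI), sketched below. This is an illustrative choice, not a method prescribed by the frameworks above. The function name, the bin count, and the thresholds mentioned afterwards are assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Compare the distribution of one feature across two data snapshots.

    Bin edges come from the baseline sample; values in the current sample
    that fall outside that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / bins

    def bin_shares(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    base = bin_shares(baseline)
    cur = bin_shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))


# Hypothetical snapshots of one numeric signal from an external feed.
last_refresh = [float(v) for v in range(100)]
this_refresh = [v + 30.0 for v in last_refresh]
drift_score = population_stability_index(last_refresh, this_refresh)
```

A common rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but the right cut-off depends on the use case and should be agreed with the business owners of the model.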
How to use this data for analytics and machine learning?
After acquiring, preparing, and integrating the external data, the next step is to evaluate how you will consume the data within your analytics or machine learning platforms. There are various connection options to consider. See if you can access the data via export, API, or connections to storage such as S3 or Snowflake. Also, plan for connections to platforms or applications such as Google BigQuery, Databricks, and Salesforce, according to your business needs.
The 5S Technique
5S refers to the following Japanese words that can make your data program efficient and effective:
- Seiri (Sort) — sorting through all items and removing those that are unnecessary
- Seiton (Order) — putting all necessary items in the optimal place for fulfilling their function
- Seisō (Clean) — cleaning and inspecting regularly
- Seiketsu (Standardize) — standardize the processes used to sort, order, and clean
- Shitsuke (Process) — discipline and process orientation
Many times a sixth S is added: Safety. You can probably already see the link. Let’s see how we can use this framework for external data. Before you start buying external data, think about:
Seiri (Sort) — data preparation for consumption
One major challenge of utilizing external data is that it is often not consumption-ready. The data preparation process may include data cleaning, data transformation, and organizing it properly for data consumption.
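As a minimal sketch of what such preparation can look like, the function below cleans a hypothetical firmographic feed. The field names (`company`, `domain`, `revenue_usd`) and the choice to deduplicate on domain are assumptions made for illustration, not properties of any particular data provider.

```python
from typing import Dict, Iterable, List, Optional


def prepare_records(raw_rows: Iterable[Dict[str, str]]) -> List[Dict[str, Optional[object]]]:
    """Clean, transform, and deduplicate raw rows from an external feed."""
    seen_domains = set()
    cleaned = []
    for row in raw_rows:
        # Cleaning: collapse whitespace and normalize casing.
        name = " ".join(row.get("company", "").split()).title()
        domain = row.get("domain", "").strip().lower()
        if not name or not domain:
            continue  # drop rows missing the fields we key on

        # Transformation: turn "1,200,000" into a number, keep gaps explicit.
        revenue_text = row.get("revenue_usd", "").replace(",", "").strip()
        try:
            revenue = float(revenue_text) if revenue_text else None
        except ValueError:
            revenue = None

        # Organizing: keep only the first record seen for each domain.
        if domain in seen_domains:
            continue
        seen_domains.add(domain)
        cleaned.append({"company": name, "domain": domain, "revenue_usd": revenue})
    return cleaned


raw_feed = [
    {"company": "  ACME  corp ", "domain": " Acme.com ", "revenue_usd": "1,200,000"},
    {"company": "Acme Corp", "domain": "acme.com", "revenue_usd": "n/a"},
    {"company": "", "domain": "unknown.example", "revenue_usd": ""},
]
ready = prepare_records(raw_feed)
```

Here the second row is dropped as a duplicate and the third because it has no company name, leaving one consumption-ready record.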
Seiton (Order) — data matching and integration
To use the external data to solve your specific business problem, you will need to match it with your internal or training data sets. Matching, or identity resolution, is a complex activity. If you do not have the right tools, it can be very resource-intensive. Plan how you will bring order to your data so that it is ready for consumption.
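To give a feel for why matching is hard, here is a deliberately simple sketch that links company names by normalizing them and scoring similarity with Python's standard `difflib`. The suffix list, the 0.85 threshold, and the function names are assumptions. Production identity resolution typically uses more attributes (address, domain, identifiers) and dedicated tooling.

```python
import difflib
import re
from typing import Dict, List, Optional

# Assumed list of legal suffixes to ignore when comparing names.
_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "company"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in _SUFFIXES)


def match_records(
    internal_names: List[str],
    external_names: List[str],
    threshold: float = 0.85,
) -> Dict[str, Optional[str]]:
    """Map each internal name to its best external match, or None."""
    external = [(normalize_name(e), e) for e in external_names]
    matches: Dict[str, Optional[str]] = {}
    for name in internal_names:
        key = normalize_name(name)
        best_score, best_match = 0.0, None
        for ext_key, ext_name in external:
            if not key or not ext_key:
                continue  # nothing left to compare after normalization
            score = difflib.SequenceMatcher(None, key, ext_key).ratio()
            if score > best_score:
                best_score, best_match = score, ext_name
        matches[name] = best_match if best_score >= threshold else None
    return matches


linked = match_records(
    ["Acme Corp.", "Globex LLC"],
    ["ACME Corporation", "Initech Inc"],
)
```

This compares every internal name against every external name, which is fine for a quick evaluation of a sample but becomes expensive on large data sets. That cost is one reason matching can be so resource-intensive without the right tools.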
Seisō (Clean) — data quality and validation
Understanding the data source, its data quality, coverage, gaps, recency, frequency of updates, risks, and most importantly, the relevance to your business context is critical. Another risk of a data purchase is a lack of ROI. Make sure you know the return your data investment will deliver. Understand the quality and risks before investing in data products.
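Some of these checks can be run on a vendor sample before purchase. The sketch below computes per-field fill rates (a simple proxy for coverage and gaps) and the age of the most recent record (recency). The field names and the `profile_dataset` helper are hypothetical.

```python
from datetime import date
from typing import Any, Dict, List, Sequence


def profile_dataset(
    rows: List[Dict[str, Any]],
    required_fields: Sequence[str],
    date_field: str,
    as_of: date,
) -> Dict[str, Any]:
    """Summarize coverage and recency for a sample of external data."""
    total = len(rows)
    fill_rates = {}
    for field in required_fields:
        # A value counts as filled unless it is missing, None, or empty.
        filled = sum(1 for r in rows if r.get(field) not in (None, ""))
        fill_rates[field] = filled / total if total else 0.0

    update_dates = [r[date_field] for r in rows if r.get(date_field)]
    days_since_update = (as_of - max(update_dates)).days if update_dates else None

    return {
        "rows": total,
        "fill_rates": fill_rates,
        "days_since_update": days_since_update,
    }


sample = [
    {"domain": "a.example", "employees": 10, "updated": date(2024, 1, 10)},
    {"domain": "b.example", "employees": None, "updated": date(2024, 1, 20)},
]
report = profile_dataset(sample, ["domain", "employees"], "updated", date(2024, 1, 31))
```

A report like this does not measure business relevance or ROI, but it gives a quick, comparable baseline when evaluating several candidate sources.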
Seiketsu (Standardize) — standardize the processes to sort, order, and clean
Data acquisition is not a one-and-done event. Standardization of data acquisition, evaluation, cleaning, matching, and integration methods will help efficiently use the data on an ongoing basis.
Shitsuke (Process discipline) — process-orientation, monitoring and improvement
Put mechanisms in place to monitor the data for predictive model drift. Make data acquisition a part of your overall data management strategy and ensure data onboarding is evaluated for relevance, quality, coverage, risks, and readiness before purchase.
Safety — privacy, compliance, regulations
Think about privacy, security, and compliance requirements. Privacy regulations such as GDPR and CCPA make external data management harder than ever. Check external data for compliance, put safeguarding policies in place, and incorporate best practices to take the risk out of putting data to work. Compliant data will help you realize your advanced analytics and ML ambitions faster.
Analytics is now foundational to competitive performance, just as manufacturing ability was before digital transformation. The techniques used to enhance the manufacturing process can be directly applied to improving the organization’s analytical capabilities. Specifically, techniques such as 5S and Kaizen provide strategic questions every organization should ask to enhance its external data acquisition process. The means of production are changing, but these disciplines still hold true in today’s ML- and analytics-powered organizations.