V is for Data

This post was originally published by Rod Castor at Towards Data Science on Medium.

Data science, machine learning, and all forms of artificial intelligence have data at their core.

We often focus the bulk of our attention on formulas or code, and that makes sense for researchers in these fields. But most professionals and hobbyists alike are practitioners of data science and machine learning, not researchers.

For a practitioner, the formulas, the code, and the necessary platforms are mostly borrowed. We can find code on GitHub or another code-sharing space online. We can read and study the formulas in books or blogs. And we simply must master the mechanics of our chosen platform’s interface. For the practitioner, the challenge of succeeding in these disciplines lies elsewhere: in the data.

Data is the most challenging aspect of today’s data-driven world. Our data is frequently incomplete, dirty, corrupted, or, in rare circumstances, not collected at all. Preparing data for data science or artificial intelligence work takes the bulk of the effort, not the data science or ML itself.

Most companies have trudged along with mediocre data cleaning for decades. It hasn’t caught the attention of the executives who control the budget because the ill effects of bad data get lost in the sloppiness of our decision-making processes. Historically, the only data issues noticeable enough to demand attention were those that failed the simple data-in / data-out test.

If you enter data into software, the next time you attempt to recall that data it needs to be the same. If not, it’s clearly wrong. But today’s data goes far beyond that simple in-out test. A mountain of data may be collected and stored for later. Or user behaviors are recorded and then analyzed, and the output is just the analysis. The what-goes-in-must-come-out method of loosening the purse strings no longer covers the most valuable parts of our data store.

When a company decides to push the data it has been mining into a black box labeled ML, the result at the other end of that magic process depends entirely on the correctness of the data fed to the model. Data is the crux of artificial intelligence and data science work. Properly preparing our data is the key to success.

Since data is such a pivotal concern for our work, let’s consider the factors involved in sizing up the data we are using. Typically referred to as the six Vs, these terms describe the data and inform our choice of the platform and processes required to wield it properly.

Depending on who you ask, the actual number of Vs varies. We will stick with the basic six V version.

  • Volume: This indicates the amount of data. We might have a few hundred terabytes of data or be shooting for the moon with petabytes at our disposal. The current and projected total volume of data will help dictate the type of storage solutions we should consider for our work.
  • Velocity: The speed at which we need to process the data for it to remain valuable. I’ve worked with companies that only needed new data processed once per month. Others needed it daily, and the new data coming into the system was akin to someone turning on a firehose. Processing 10 million new rows per day requires a plan. The velocity will determine our process from start to finish: bringing new data into the system, processing it, analyzing it, and reporting on it.
  • Variety: Data comes in a variety of forms and from a variety of sources. In today’s market, data will come from internal systems, partner systems, public systems, and a plethora of devices. This multitude of sources also means a smorgasbord of data structures. Some data will be well-formed and highly structured; this kind may come from an internal database. Other forms might be completely unstructured, like a paragraph of customer feedback you need to parse and process. The rest of our data will live somewhere between these two extremes. This mix of variety determines the complexity of our ETL (Extract-Transform-Load) process.
  • Veracity: Veracity concerns itself with data’s error-prone nature. This can encompass bias, noise, abnormalities, authenticity, and ultimately its reliability and trustworthiness. Errors are the bane of the data engineer’s existence, and some data is more prone to error than other data. Understanding our data’s propensity for error will inform our testing methodology: what testing needs to be performed to ensure accuracy for the data used in our models? Answering that question is the ultimate outcome of understanding the veracity of our data (a minimal validation sketch follows this list).
  • Value: If we cannot extract value from our data, it’s meaningless. What value can we get from which data? This is one of my favorite Vs. This litmus test can be used to reduce the size of our dataset. If data from our source systems cannot yield value, we should eliminate it. If IT wants to store it somewhere in their large vault of data, let them. But we will delete it from the warehouse used by our data scientists and machine learning algorithms.
  • Variability: How long is your data valid? How long should you retain it? Does its meaning or shape change over time? Similar to weighing data’s value, determining its shelf life can also reduce the dataset we manage. Sometimes time renders data useless; other times, aged data changes the grain we are interested in analyzing. Recent sales matter at a daily or perhaps hourly grain, but sales from five years ago might only matter at a monthly grain (a grain-aggregation sketch also follows this list).
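
To make the veracity point concrete, here is a minimal sketch of the kind of automated checks a data engineer might run before data reaches a model. It assumes a pandas DataFrame with hypothetical columns such as order_id, order_date, and amount; the specific checks are illustrative, not a prescription.

```python
import pandas as pd

def basic_veracity_checks(df: pd.DataFrame) -> dict:
    """Report a few illustrative data-quality signals for a hypothetical orders table."""
    return {
        # Missing values undermine any downstream model.
        "null_counts": df[["order_id", "order_date", "amount"]].isna().sum().to_dict(),
        # Duplicate keys often point to a broken load process.
        "duplicate_order_ids": int(df["order_id"].duplicated().sum()),
        # Out-of-range values hint at corruption or unit errors.
        "negative_amounts": int((df["amount"] < 0).sum()),
        # Future-dated records are usually entry or timezone mistakes.
        "future_dates": int((pd.to_datetime(df["order_date"]) > pd.Timestamp.now()).sum()),
    }
```

Checks like these do not prove the data is trustworthy, but they surface the bias, noise, and abnormalities described above before a model ever sees them.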

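And to make the variability point concrete, here is a small sketch of rolling aged data up to a coarser grain. It assumes a hypothetical daily sales DataFrame with sale_date and amount columns, and the one-year cutoff is purely illustrative.

```python
import pandas as pd

def coarsen_old_sales(sales: pd.DataFrame, cutoff_days: int = 365) -> pd.DataFrame:
    """Keep recent sales at daily grain; roll older sales up to monthly grain."""
    sales = sales.assign(sale_date=pd.to_datetime(sales["sale_date"]))
    cutoff = pd.Timestamp.now().normalize() - pd.Timedelta(days=cutoff_days)

    recent = sales.loc[sales["sale_date"] >= cutoff, ["sale_date", "amount"]]
    old = sales.loc[sales["sale_date"] < cutoff]

    # Aggregate aged rows to one row per month, shrinking the dataset while
    # keeping the grain we still care about analyzing.
    old_monthly = (
        old.groupby(old["sale_date"].dt.to_period("M"))["amount"]
        .sum()
        .reset_index()
        .assign(sale_date=lambda d: d["sale_date"].dt.to_timestamp())
    )

    return pd.concat([recent, old_monthly], ignore_index=True)
```

The same weighing of value and shelf life can be applied to any large table before it lands in the warehouse our models read from.
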
While the six Vs are just the beginning of data management, they are absolutely necessary considerations. If we head down the road of data science or artificial intelligence with just a bucket full of data that hasn’t been weighed, counted, sorted, sifted, and analyzed under a microscope, we are setting ourselves up for failure.

We can think of it as similar to eating a meal on the go from a fast-food chain. Envision the food tossed together in a paper-lined styrofoam container. It’s a mess and hard to juggle. Now compare it to dining in a five-star restaurant. The courses are delivered one at a time on fine dishware, presented in a beautiful fashion of color and position on the plate.

However, there is one difference between the food analogy and real-life data. Food eaten from either the styrofoam container or the fancy restaurant will ultimately get the job done. But poorly prepared data will not deliver the same results as well-prepared data. So, please, let’s take the time to assess our data from the beginning.
