This post was originally published by Editorial Team at Inside Big Data
Feature engineering occupies a unique place in the realm of data science. For most supervised and unsupervised learning deployments (which comprise the majority of enterprise cognitive computing efforts), this process of determining which characteristics in training data are influential for achieving predictive modeling accuracy is the gatekeeper for unlocking the wonders of statistical Artificial Intelligence.
Other processes before and after feature generation (like data preparation or model management) are requisite for ensuring accurate machine learning models. Yet without knowing which data traits are determinative in achieving a model’s objective—like predicting an applicant’s risk for defaulting on a loan—organizations can’t get to the subsequent data science steps, rendering the preceding ones useless.
Consequently, feature engineering is one of the most indispensable tasks for building machine learning models. The exacting nature of this process hinges on:
- Labeled Training Data: The large quantities of training data that supervised and unsupervised learning require are one of machine learning’s main enterprise inhibitors. This concern is compounded by the scarcity of labeled training data for specific model objectives.
- Data Preparation: Even when there is enough available training data, simply cleansing, transforming, integrating, and modeling that data is one of the most laborious data science tasks.
- Engineering Manipulations: There’s an extensive array of data science tools and techniques for determining features, all of which require a copious amount of work as well.
Each of these factors makes feature engineering a lengthy, cumbersome process—without which, most machine learning is impossible. As such, there are a number of emergent and established data science approaches for either surmounting this obstacle or rendering it much less obtrusive.
According to Cambridge Semantics CTO Sean Martin, “In some ways feature engineering is starting to be less interesting, because nobody wants to do that hard work.” This sentiment is particularly meaningful in light of graph database approaches for hastening the feature engineering process, or eschewing it altogether with graph embedding, to get the same results faster and cheaper.
The Embedding Alternative
Graph embedding enables organizations to overcome feature engineering’s difficulties while still discerning data characteristics with the greatest influence on the accuracy of advanced analytics models. With “graph embedding, you don’t need to do a lot of feature engineering for that,” Martin revealed. “You essentially use the features of the graph sort of as is to learn the embedding.” According to Martin, graph embedding is the process of transforming a graph into vectors (numbers) that correctly capture the graph’s connections or topology so data scientists can do the mathematical transformations supporting machine learning.
For example, if there’s a knowledge graph about mortgage loans and risk, data scientists can employ embedding to vectorize this data, then use those vectors for machine learning transformations. Thus, they learn the model’s features from the graph vectors while eliminating the need for labeled training data, one of the core machine learning roadblocks. Frameworks like Apache Arrow can move graph data into data science tools that do the embedding; eventually users will be able to perform embeddings directly in competitive knowledge graph solutions.
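To make the idea of turning a graph into vectors concrete, here is a minimal Python sketch. It is not the method Martin describes, nor any product's algorithm; it is a toy landmark-based random-walk embedding, chosen only because it fits in a few lines of standard-library code. The triples, node names, landmark choice, and the `step` and `embed` helpers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("applicant_a", "holds", "loan_1"),
    ("applicant_b", "holds", "loan_2"),
    ("applicant_c", "holds", "loan_3"),
    ("loan_1", "issued_by", "lender_x"),
    ("loan_2", "issued_by", "lender_x"),
    ("loan_3", "issued_by", "lender_y"),
]


def build_adjacency(triples):
    """Treat the graph as undirected and collect each node's neighbours."""
    adj = defaultdict(set)
    for subject, _, obj in triples:
        adj[subject].add(obj)
        adj[obj].add(subject)
    return adj


def step(adj, dist):
    """Advance a uniform random walk by one hop from the distribution `dist`."""
    nxt = defaultdict(float)
    for node, prob in dist.items():
        share = prob / len(adj[node])
        for neighbour in adj[node]:
            nxt[neighbour] += share
    return dict(nxt)


def embed(adj, landmarks, max_steps=4):
    """Embed each node as its accumulated probability of visiting each landmark."""
    vectors = {}
    for node in list(adj):
        dist = {node: 1.0}
        vec = [0.0] * len(landmarks)
        for _ in range(max_steps):
            dist = step(adj, dist)
            for i, landmark in enumerate(landmarks):
                vec[i] += dist.get(landmark, 0.0)
        vectors[node] = vec
    return vectors


ADJ = build_adjacency(TRIPLES)
EMBEDDINGS = embed(ADJ, landmarks=["lender_x", "lender_y"])
```

In this toy graph, the two applicants whose loans share a lender receive the same vector, while the applicant in the disconnected part of the graph receives a different one. Real projects would more likely use a learned embedding such as node2vec; the point here is only that the vectors come from graph structure rather than from hand-built features.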
Swifter Feature Engineering
The underlying graph environment supporting this embedding process is also useful for transforming the effectiveness of traditional feature engineering, making it much more accessible to the enterprise. Part of this utility stems from graph data modeling capabilities. Semantic graph technology is predicated on standardized data models that all data types adhere to, which is crucial for accelerating facets of the data preparation phase because “you can integrate data from multiple sources more easily,” Martin observed. That ease of integration is directly responsible for including greater numbers of sources for machine learning training datasets and determining their relationships to one another, which provides additional inputs not gleaned from the individual sources.
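As a rough illustration of why a shared data model eases integration, the Python sketch below maps two differently shaped sources onto the same triple structure and then derives an input that neither source holds on its own. The source layouts, field names, predicates, and helper functions are assumptions made up for this example, not any vendor's schema.

```python
# Two hypothetical sources with different shapes.
CRM_ROWS = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Lin"},
]
LOAN_ROWS = [
    {"loan": "loan_1", "borrower": "c1", "amount": 250000},
    {"loan": "loan_2", "borrower": "c1", "amount": 90000},
]


def crm_to_triples(rows):
    """Map CRM records onto (subject, predicate, object) triples."""
    return [(row["customer_id"], "name", row["name"]) for row in rows]


def loans_to_triples(rows):
    """Map loan records onto the same triple model."""
    triples = []
    for row in rows:
        triples.append((row["borrower"], "holds", row["loan"]))
        triples.append((row["loan"], "amount", row["amount"]))
    return triples


# Once both sources share one model, integration is simple concatenation.
GRAPH = crm_to_triples(CRM_ROWS) + loans_to_triples(LOAN_ROWS)


def total_exposure(graph, customer):
    """Sum the amounts of every loan the customer holds."""
    loans = {o for s, p, o in graph if s == customer and p == "holds"}
    return sum(o for s, p, o in graph if s in loans and p == "amount")
```

Calling `total_exposure(GRAPH, "c1")` combines the customer identity from one source with the loan amounts from the other, which is the kind of cross-source input the paragraph above refers to.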
“You now get more sources of signal and the integration of them may give you signal that you wouldn’t receive in separate data sources,” Martin mentioned. Moreover, the inherent nature of graph settings—they provide rich, nuanced contextualization of relationships between nodes—is immensely helpful in identifying features. Martin commented that in graph environments, features are potentially links or connections between entities and their attributes, both of which are described with semantic techniques. Simply analyzing these connections leads to meaningful inputs for machine learning models.
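A small Python sketch can show what treating links as features might look like. It assumes a hypothetical set of triples and two illustrative helpers, `link_features` and `shared_neighbours`; none of these names come from the article or from a specific graph product.

```python
from collections import Counter

# Hypothetical triples; entity and predicate names are illustrative only.
TRIPLES = [
    ("applicant_a", "holds", "loan_1"),
    ("applicant_a", "holds", "loan_2"),
    ("applicant_a", "employed_by", "employer_x"),
    ("applicant_b", "holds", "loan_3"),
    ("applicant_b", "employed_by", "employer_x"),
    ("applicant_b", "co_signed", "loan_1"),
]


def link_features(triples, entity):
    """Count an entity's outgoing links, grouped by predicate."""
    return dict(Counter(p for s, p, _ in triples if s == entity))


def shared_neighbours(triples, left, right):
    """Return the objects both entities link to, regardless of predicate."""
    left_objects = {o for s, _, o in triples if s == left}
    right_objects = {o for s, _, o in triples if s == right}
    return left_objects & right_objects
```

Here `link_features(TRIPLES, "applicant_a")` yields one count per link type, and `shared_neighbours(TRIPLES, "applicant_a", "applicant_b")` surfaces the loan and employer the two applicants have in common. Either result could serve as a candidate model input.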
Speeding Up Feature Engineering
In addition to graph embedding and scrutinizing links between entities to ascertain features, data integration and analytics prep platforms built atop graph databases provide automatic query capabilities to hasten the feature engineering process. According to Martin, that process typically involves creating a table of attributes from relevant data and “one of those columns is the one you want to do predictions on.”
Automatic query generation expedites this endeavor because it “allows you to rapidly do feature engineering against a combination of data,” Martin acknowledged. “You can quickly build what are essentially extractions out of your graph, where each column is part of your feature that you’re modeling.” Automated queries also allow users to visually build wide tables from different parts of the graph, enabling them to use more of their data quicker. The result is an enhanced ability to “more rapidly experiment with the features that you want to extract,” Martin indicated.
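The wide table Martin describes, with one column reserved for the value to predict, can be sketched in plain Python. This is an illustration under assumptions, not the query generation any platform actually performs: the triples, column names, and `build_feature_table` helper are hypothetical.

```python
# Hypothetical attribute triples for three loans.
TRIPLES = [
    ("loan_1", "amount", 250000),
    ("loan_1", "term_years", 30),
    ("loan_1", "defaulted", False),
    ("loan_2", "amount", 120000),
    ("loan_2", "term_years", 15),
    ("loan_2", "defaulted", True),
    ("loan_3", "amount", 90000),
    ("loan_3", "defaulted", False),
]


def build_feature_table(triples, columns):
    """Pivot triples into one row per entity, with None for missing values."""
    rows = {}
    for subject, predicate, value in triples:
        if predicate in columns:
            row = rows.setdefault(subject, {c: None for c in columns})
            row[predicate] = value
    return [{"entity": subject, **attrs} for subject, attrs in sorted(rows.items())]


COLUMNS = ["amount", "term_years", "defaulted"]
TARGET = "defaulted"  # the column to predict

TABLE = build_feature_table(TRIPLES, COLUMNS)
X = [[row[c] for c in COLUMNS if c != TARGET] for row in TABLE]
y = [row[TARGET] for row in TABLE]
```

Changing `COLUMNS` is all it takes to try a different extraction, which mirrors the rapid experimentation described above. Missing attributes surface as `None` so that gaps stay visible rather than being silently filled.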
Automatic Data Profiling
Comparable to the capacity to automatically generate queries for feature engineering is the ability to automatically profile data in graph environments to accelerate the feature selection process. Data profiling “shows you what kind of data is in the graph and it gives you very detailed statistics about every dimension of this data, as well as samples,” Martin remarked. Automated data profiling expedites a dimension of data science that’s often necessary simply to understand how data may relate to a specific machine learning use case. This form of automation complements automatic query generation. A data scientist can take this statistical information “and that can be used as you start to build your feature table that you’re going to extract,” Martin specified. “You can do that sort of hand in hand by looking at the profiling of the data.”
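The following Python sketch gives a sense of what a profile with per-dimension statistics and samples might contain. The chosen statistics, the triples, and the `profile` helper are assumptions for illustration; actual graph platforms will report their own, likely richer, set of measures.

```python
from statistics import mean

# Hypothetical attribute triples to profile.
TRIPLES = [
    ("loan_1", "amount", 250000),
    ("loan_2", "amount", 120000),
    ("loan_3", "amount", 90000),
    ("loan_1", "region", "north"),
    ("loan_2", "region", "north"),
    ("loan_3", "region", "south"),
]


def profile(triples, sample_size=2):
    """Summarise every predicate: counts, distinct values, numeric stats, samples."""
    by_predicate = {}
    for _, predicate, value in triples:
        by_predicate.setdefault(predicate, []).append(value)

    report = {}
    for predicate, values in by_predicate.items():
        # Booleans are excluded so they are not treated as the numbers 0 and 1.
        numeric = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report[predicate] = {
            "count": len(values),
            "distinct": len(set(values)),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "mean": mean(numeric) if numeric else None,
            "samples": values[:sample_size],
        }
    return report


REPORT = profile(TRIPLES)
```

A report like this helps decide which dimensions are worth pulling into a feature table: a numeric attribute with a wide range, for example, or a categorical one with only a handful of distinct values.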
The Future of Features
Features are the definitive data traits enabling machine learning models to accurately issue predictions and prescriptions. In this respect, they’re the foundation of the statistical branch of AI. However, the effort, time, and resources required to engender those features may become unnecessary if data scientists simply learn them with graph embedding and are no longer reliant on hard-to-find labeled training data. The ramifications of this development could expand the use cases for supervised and unsupervised learning, making machine learning much more commonplace throughout the enterprise than it currently is.
Alternatively, graph platforms have other means of quickening feature engineering (based on their integration, automatic data profiling, and auto query generation mechanisms) so that it requires much less time, energy, and resources than it once did. Both approaches make machine learning more practical and utilitarian to organizations, broadening the value of data science as a discipline. “The biggest problem of all is to get the data together, clean it up, and extract it so you can do the feature engineering on it,” Martin posited. “An accelerator for your machine learning project is key.”