Tabular data rarely yields all the information it contains; messy formatting and a large number of entries make further analysis difficult. This is what gave rise to data visualization tools and techniques. Data visualization is the art of presenting data in graphical charts so that non-technical people can understand it easily. A well-chosen combination of elements such as colors, dimensions, and labels can produce visual reports that reveal surprising insights and help businesses grow.
News items grouped by category such as M&A activity, people movements, funding news, industry partnerships, customer wins, rumors and general scuttlebutt floating around the big data, data science and machine learning industries including behind-the-scenes anecdotes and curious buzz.
Validate the correctness and performance of machine learning systems through the ML product lifecycle.
[Photo by Tolga Ulkan on Unsplash]
Testing in the software industry is a well-researched and established area. The good practices learned from countless failed projects help us release frequently and leave fewer opportunities for defects to reach production. Common industry practices like CI, test coverage, and TDD are well adopted and tailored to every single project.
However, when we try to apply the SWE testing philosophy to machine learning, we have to solve some unique issues. In this post, we’ll cover some common problems in testing ML models (systems) and discuss potential solutions.
The ML system here stands for a system (pipeline) that generates predictions (insights) that can be consumed by users. It may include several machine learning models. For example, an OCR system could include one ML model to detect text regions, one ML model to tell which class the current text region belongs to (car plate vs. road sign), and one model to recognize the text from a picture.
A model is composed of the code (algorithm, pre-processing, post-processing, etc.), data, and the infrastructure that facilitates the runtime.
[Figure: The scope of ML system testing. Image by author]
Different types of testing cover the quality assurance of different components of the system.
- Data testing: ensuring new data satisfies your assumptions. This testing is needed before we train a model and before we make predictions. Before training, this covers both X and y (labels).
- Pipeline testing: ensuring your pipeline is set up correctly. It’s like integration testing in SWE. For an ML system, it may measure consistency (reproducibility) as well.
- Model evaluation: evaluating how good your ML pipeline is. Depending on the metrics and dataset you’re using, it could refer to different things:
  - Evaluation on a holdout/cross-validation dataset.
  - Evaluation of deployed pipelines against ground truth (continuous evaluation).
  - Evaluation based on the feedback of system users (business-related metrics, not a measurable ML proxy).
  There are a bunch of techniques that can be applied in the process, like slice-based evaluation, MVP (a critical subset of data) groups/samples analysis, ablation studies, and user subgroup-based experiments (like beta testing and A/B testing).
- Model testing: involves explicit checks for behaviors that we expect our model to follow. This type of testing is not meant to report accuracy-related indicators, but to prevent the model from behaving badly in production. Common test types include, but are not limited to:
  - Invariance (perturbation) tests: perturbations of the input that should not affect the model’s output.
  - Directional expectation tests: perturbations of the input that should have a predictable effect on the model output. For example, if the loss of blood during a surgery goes up, the blood needed for transfusion should go up as well.
  - Benchmark regression: use predefined samples and an accuracy gate to ensure a new version of the model won’t introduce serious issues.
[Figure: When do we perform tests? Image by author]
Some people may ask why we need both holdout evaluation and continuous evaluation to measure almost the same metrics, once in CI and once at serving time.
One reason we can’t fully estimate model performance from metrics on a predefined holdout dataset is that data leakage is sometimes harder to detect than it looks. For example, some features that were expected to exist at serving time turn out to have high latency to acquire, so our trained models are not used to seeing this feature always being empty.
Sometimes model evaluation can be very expensive, so integrating a full-cycle holdout evaluation into CI is not feasible. In this case, we can define a subset regression evaluation within CI, and only do the full evaluation before important milestones.
Model testing is not a one-off step; instead, it should be a continuously integrated process with automation set up. Some of the test cases can be run within the CI process, so each code commit triggers them and we can guarantee the code/model quality in the main branch of a repo. Others can be conducted in the serving environment, so we won’t be blind to how well our system performs, and we have relatively sufficient time to fix issues when they arise. Sometimes the ongoing tests within the serving environment can be seen as part of the monitoring component, and we can integrate them with an alerting tool to close the loop.
Machine learning systems are not straightforward to test, not only because they include more components (code + data) to verify, but also because they have a dynamic nature. Even if we don’t change anything, our models can become stale over time because the data changes (data drift) or the nature of things changes (concept drift).
Automated testing is an essential component of CI/CD for verifying the correctness of pipelines with a low footprint, while manual tests and human-in-the-loop verification are still crucial steps before we declare a new ML pipeline production-ready. After a pipeline has been released to production, continuous monitoring and evaluation ensure we’re not flying blind. Finally, customer feedback-based tests (i.e. A/B tests) can tell us whether the problem we are trying to solve is actually getting better.
There is no silver bullet in ML system testing; continuously trying to cover edge cases helps us leave fewer opportunities for mistakes. Hopefully, one day we can figure out a simple metric like code coverage to tell whether our system is good enough.
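The invariance, directional expectation, and benchmark regression tests described in the model testing article above can be sketched as plain checks. This is a minimal sketch and not the author's implementation: `predict_transfusion_ml` is a hypothetical stand-in for a trained model, and the feature names, benchmark samples, and error gate are illustrative assumptions.

```python
# Hypothetical stand-in for a trained model (an assumption for illustration):
# predicts the blood (ml) needed for transfusion from the blood loss (ml).
# `patient_id` is an input the prediction should NOT depend on.
def predict_transfusion_ml(blood_loss_ml: float, patient_id: str = "") -> float:
    return max(0.0, 0.8 * blood_loss_ml - 200.0)


def check_invariance() -> bool:
    """Perturbing an irrelevant input must leave the output unchanged."""
    base = predict_transfusion_ml(1000.0, patient_id="A-001")
    perturbed = predict_transfusion_ml(1000.0, patient_id="B-999")
    return abs(base - perturbed) < 1e-9


def check_directional_expectation() -> bool:
    """More blood loss must never lead to a smaller predicted transfusion."""
    losses = [0.0, 250.0, 500.0, 1000.0, 2000.0]
    preds = [predict_transfusion_ml(x) for x in losses]
    return all(a <= b for a, b in zip(preds, preds[1:]))


def check_benchmark_regression(max_abs_error: float = 50.0) -> bool:
    """Predefined samples plus an error gate guard against gross regressions."""
    benchmark = [(500.0, 200.0), (1000.0, 600.0), (1500.0, 1000.0)]
    errors = [abs(predict_transfusion_ml(x) - y) for x, y in benchmark]
    return max(errors) <= max_abs_error


if __name__ == "__main__":
    checks = (check_invariance, check_directional_expectation, check_benchmark_regression)
    for check in checks:
        print(check.__name__, "passed" if check() else "FAILED")
```

Checks like these are cheap enough to run on every commit in CI, while the full holdout evaluation can be reserved for milestones, as the article suggests.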
The data science lifecycle (DSLC) has been defined as an iterative process that leads from problem formulation to exploration, algorithmic analysis and data cleaning to obtaining a verifiable solution that can be used for decision making. For companies creating models to scale, an enterprise Machine Learning Operations (MLOps) platform not only needs to support enterprise-grade development and production, it needs to follow the same standard process that data scientists use.
For companies investing in data science, the stakes have never been so high. According to a recent survey from New Vantage Partners (NVP), 62 percent of firms have invested over $50 million in big data and AI, with 17 percent investing more than $500 million. Expectations are just as high as investment levels, with a survey from Data IQ revealing that a quarter of companies expect data science to increase revenue by 11 percent or more.
Time series is a unique field; it is a science in itself. As experts put it, ‘A good forecast is a blessing while a wrong forecast can prove to be dangerous’. This article aims to introduce the basic concepts of time series and briefly discuss the popular methods used to forecast time series data.
“The versatility of a data labeling tool can make or break your data quality. And the data quality can make or break your algorithms. And what happens when our algorithms misinterpret or fail?” (Karthik Vasudevan, Founder at Traindata Inc.) This post will guide you through five questions to help you choose the best data labeling tool.
Modern AI/ML systems’ success has been critically dependent on their ability to process massive amounts of raw data in a parallel fashion using task-optimized hardware. Can we leverage the power of GPU and distributed computing for regular data processing jobs too?
It’s Time that Leaders Unite Machine Learning and Causal Inference
Over the last few months, I had the chance to help various organisations and leaders leverage their large databases with machine learning. I worked in particular with membership organisations that struggle with rising dropout rates (churn), an issue that became even more serious during the pandemic, when individual incomes were declining and the fear of job loss was rising.
With machine learning, we used very large membership databases with individual-level information (e.g. age, gender, occupation, marital status, postal code, etc.) to identify the members with a high dropout risk and target them ex ante. A classification problem par excellence.
Machine Learning tells us the “What”, Causal Inference the “Why”
Despite the overall good performance of the machine learning models, our clients were always interested in one obvious question: why does an individual member leave? Unfortunately, machine learning models are built to predict things, not to identify their causes.
However, knowing the reason for leaving is of immense business value, as it determines the strategic decisions that leaders have to take. For instance, if someone ends her membership because she is moving abroad, offering lower membership fees will do little to nothing to keep her as a member.
These experiences made me realize that combining the predictive power of machine learning to know the “What” (e.g. individuals with a high probability of leaving) with the methods of causal inference to understand the “Why” (e.g. reasons for leaving) is essential to use the massive datasets within organisations to their fullest potential.
Measuring correlation is easy, measuring causality is not
To stick with our churn example, an organisation might be interested in whether an increase in its membership fees (one potential explanatory variable) leads members to leave (the outcome variable). Estimating this causal relationship is anything but trivial. As we all know from Statistics 101, correlation is not causation, and, in fact, the absence of correlation is not the absence of causation either. Thus, the underlying issue of measuring causality cannot be solved with more or even better data; nor is it a (predictive) modelling problem per se, the kind of problem at which machine learning has proven so successful in recent years.
Instead, the reason why causal effects are hard to measure is endogeneity. One of its most important sources is omitted variable bias, which arises when an unobserved, omitted factor influences the independent variable of interest and the outcome variable at the same time.
For instance, let’s imagine an organisation comes up with a new explanatory factor apart from membership fees and would like to investigate whether the age of an individual increases the likelihood of leaving the organisation. A commonly used method such as ordinary least squares (OLS) regression will not deliver meaningful results, because age and the probability of leaving may both be related to an unobserved confounding factor such as individual income: younger people usually earn lower salaries, and we often have no information about salaries.
Unfortunately, omitted variable bias is present in almost all empirical analyses of observational data. For machine learning applications, this is not much of a problem per se, because all we want is a good prediction of the outcome variable, the “What”. However, since we often need to know the causes of the “What”, understanding the “Why” is essential to enable organisations to adopt strategic actions in their favour. Therefore, beyond conventional ways of measuring associations, alternative approaches are needed to reliably quantify causality.
Many organisations run the best experiments for causal inference without even knowing
To see how, let’s dive a bit into the broad spectrum of approaches for credibly measuring causal relationships. Nowadays, most empirical scientists would argue that randomized controlled trials (RCTs) are something of a gold standard of causal inference. The logic behind RCTs is rather simple: the researcher randomly splits a sample of individuals into two groups and then gives one group (the “treatment group”) the treatment of interest, but not the other (the “control group”). The difference in the outcome between the treatment and the control group is then considered the causal effect of the treatment. If you closely followed the news about the efficacy of the COVID-19 vaccines, you will have noticed that these studies usually use the same research design: clinical RCTs.
Despite their massive strength and credibility in measuring causality, such controlled experiments are often very expensive, sometimes ethically questionable, and most of the time even impossible to conduct. For instance, if we wanted to use an RCT to understand whether age has a causal effect on the likelihood of ending a membership, we would need to change the age of randomly selected individuals in the treatment group. Obviously, we cannot change the age of a person. Therefore, we need different approaches.
Most of the time, such approaches are broadly called quasi-experimental research designs, and they are often the only way to measure causality with observational data. One of the earliest and most astonishing applications of these methods dates back to the ingenious London doctor John Snow, who used water-supply data to find out that cholera was transmitted via water rather than air (the dominant view in 1854), without a single look at the bacterium through a microscope. His discovery has most probably saved millions of lives.
While researchers nowadays are desperately trying to live up to the ingenuity of Dr. Snow, applying such quasi-experimental methods is anything but trivial. However, the good news is that many organisations are in fact conducting very large-scale quasi-experiments, often without even knowing it. The analysis of many of these datasets has enabled astonishing discoveries in business, economics, and political science over the last quarter of a century.
For instance, Hartmann and Klapper (2018) and Stephens-Davidowitz et al. (2017) quantified the returns to television advertising using the Super Bowl as a natural experiment. Anderson and Magruder (2012) found that an “extra half-star rating [on Yelp] causes restaurants to sell out 19 percentage points (49%) more frequently” by adopting a clever regression-discontinuity design. And in a very recent study, Garz and Martin (2020) measured the causal impact of media reporting on vote choice for the current government, which is highly relevant for political campaigning by parties in democracies around the world.
Leaders need to ensure we use the best of both worlds
All these examples highlight that understanding causal relationships can have large implications for the strategy an organisation adopts to compete successfully in the future. While AI-based solutions have already received a lot of interest beyond academia, causal inference has not yet been extensively leveraged for data-driven decision making.
However, predicting the “What” and understanding the “Why” is, in my opinion, the most comprehensive and valuable way of exploiting the wealth of information our organisations are sitting on nowadays. The challenges ahead are huge. But so are the datasets. We have the computational resources. And the (quasi-)experimental setups.
Leaders need to ask themselves: how can we combine the predictive power of machine learning with the strengths of causal inference to leverage our datasets in a post-pandemic world? To stay ahead, we should no longer ask for the “What” without the “Why”.
Reading suggestions
- The Book of Why, by Judea Pearl
- Causal Inference: The Mixtape, by Scott Cunningham
- Mostly Harmless Econometrics: An Empiricist’s Companion, by Joshua D. Angrist and Jörn-Steffen Pischke
Want to find out how to uncover the “What” and the “Why” in your data? We at LEAD Machine Learning are experts in the field of Data Science, Machine Learning and Causal Inference and are happy to connect with you on how you can exploit your large data resources and create real value for your missions ahead.
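The omitted variable bias and the RCT logic discussed in the causal inference article above can be illustrated with a small simulation. This is only a sketch on synthetic data: the variable names, the coefficients, and the assumption that age has no causal effect at all are made up for illustration and do not come from any real membership database.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data (illustrative assumption): income is the unobserved confounder.
income = rng.normal(0.0, 1.0, n)
age = 0.8 * income + rng.normal(0.0, 1.0, n)   # age is correlated with income
true_age_effect = 0.0                           # by construction, age has NO causal effect
churn_score = true_age_effect * age - 1.0 * income + rng.normal(0.0, 1.0, n)


def ols(y, *regressors):
    """Least-squares coefficients for y ~ 1 + regressors (intercept first)."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Naive OLS omits income, so the age coefficient absorbs the income effect.
naive_age_coef = ols(churn_score, age)[1]
# Controlling for the confounder recovers an age effect close to zero.
adjusted_age_coef = ols(churn_score, age, income)[1]

# RCT logic: a randomly assigned treatment (e.g. a fee increase) is independent
# of income, so a simple difference in group means estimates the causal effect.
treated = rng.integers(0, 2, n).astype(bool)
true_fee_effect = 0.5
churn_rct = true_fee_effect * treated - 1.0 * income + rng.normal(0.0, 1.0, n)
rct_estimate = churn_rct[treated].mean() - churn_rct[~treated].mean()

print(f"naive age coefficient:    {naive_age_coef:.3f}")
print(f"adjusted age coefficient: {adjusted_age_coef:.3f}")
print(f"RCT estimate of fee effect: {rct_estimate:.3f} (true {true_fee_effect})")
```

In practice the confounder is usually not observed, which is exactly why the adjusted regression is not available and quasi-experimental designs are needed.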
To get a better intuition for marketing mix models, this section will walk through building a marketing mix model from scratch in Python. This marketing mix model is going to be built off of this dataset from Kaggle.
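As a rough idea of what such a model can look like, here is a minimal sketch that fits sales as a linear function of channel spend with ordinary least squares. It does not use the Kaggle dataset mentioned above: the weekly spend series, the channel names, and the "true" coefficients are synthetic assumptions, and real marketing mix models typically add effects such as carry-over and saturation that are left out here.

```python
import numpy as np

rng = np.random.default_rng(42)
weeks = 200

# Synthetic weekly spend per channel (illustrative; not the Kaggle data).
tv = rng.uniform(0, 100, weeks)
radio = rng.uniform(0, 50, weeks)
social = rng.uniform(0, 30, weeks)

# Assumed "true" contributions, used only to generate the toy sales series.
true_coefs = {"base": 50.0, "tv": 0.5, "radio": 0.3, "social": 0.8}
sales = (
    true_coefs["base"]
    + true_coefs["tv"] * tv
    + true_coefs["radio"] * radio
    + true_coefs["social"] * social
    + rng.normal(0, 2.0, weeks)
)

# Fit sales ~ base + tv + radio + social by ordinary least squares.
X = np.column_stack([np.ones(weeks), tv, radio, social])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
fitted = dict(zip(["base", "tv", "radio", "social"], coefs))

# Estimated sales contribution of each channel = coefficient * total spend.
spend = {"tv": tv, "radio": radio, "social": social}
contributions = {name: fitted[name] * series.sum() for name, series in spend.items()}

for name, value in fitted.items():
    print(f"{name:>6}: {value:.3f}")
```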
The term ‘data scientist’ was nonexistent when I started my journey in the data analytics space, but now it is called the ‘sexiest job of the decade’, probably after space crews at SpaceX and Virgin Galactic! Data has always fascinated me and I am sure it will continue to do so for many years to come. Throughout this journey I have seen many projects take off as well as fall apart at various stages. VentureBeat’s 2019 figure still holds true: ‘87% of data science projects never make it to production’, and there are several underlying reasons that need serious intervention and fixes to improve this number.
We’ve heard about explainability when it comes to understanding the decisions made by our data science models. But what about explainability during the development process? As data scientists we often work in interdisciplinary teams, where not everyone may be as familiar with our specialty, just as we are not familiar with theirs. These environments allow us to build the best possible…
This article will explore what the possible future of data science capabilities for risk management could entail.
What will the future of data science for risk management hold?
Data science has been vital in enhancing risk management operations in recent times. With cyber attacks, including phishing and ransomware, on the rise since the Covid-19 pandemic took hold, managing and mitigating the effects of such incidents, with the aid of network visibility, is key to business continuity. Additionally, there are IT outages and insider threats to contend with, which also require a strong risk management strategy.
In this article, we explore how the future of data science’s role in risk management initiatives will take shape.
With incidents that can bring operations to a standstill becoming more diverse, it’s vital that risk management measures are as agile as possible to avoid being caught out. Data science can help businesses to better analyse short-term and long-term trends and to respond quickly to possible risks and disruption, and this is set to become a greater focus going forward.
“Whether in marketing, sales, demand, pricing or operations, the key to risk management is not only in spotting the potential risks, but in understanding their likelihood, scale and impact and then reacting accordingly,” said Matt Andrew, partner & UK managing director of Ekimetrics.
“In retail, for example, we’ve seen the impact of not having a thorough enough understanding of market, category and consumer trends and risks with mitigations in place soon enough to react in the face of a market-changing pandemic. For the likes of Arcadia Group and Debenhams, factors such as the high cost of brick and mortar stores and a failing offer, including poor e-commerce, became increasingly impossible to deal with. Those that had already begun to invest in this area of data science will have had a better chance to regroup quickly and make better decisions, from big pivots to the ability to capitalise on micro opportunities.
“By understanding the potential range of outcomes and how they interact through data analytics, businesses can support greater agility in their decision-making about where and how to invest, and help to future-proof against other risks that are yet to emerge.”
Hot topics and emerging trends in data science
We gauged the perspectives of experts in data science, asking them about the biggest emerging trends in data science.
Minimising reconciliation error through automation
A key aspect of data science that has a bright future is automation. This decreases strain on data scientists while speeding up processes, and in risk mitigation, automation can minimise errors in data reconciliation: the movement and alignment of critical company data between systems.
Douggie Melville-Clarke, head of data science at Duco, explained: “As businesses move towards making more data-first decisions, the emphasis on data automation is growing, with companies automating as much of the data reconciliation process as possible to speed up process, help businesses scale and crucially mitigate risk.
“Data reconciliation has traditionally cost financial firms significant sums of money through man hours and regulatory fines. Automation takes away the human error element from data reconciliation. Manual tasks can often become tedious to a human brain leaving room for error, but a computer can’t get bored or show up to work tired. It’s consistent. And this consistency is crucial when dealing with large datasets.
“Repeatable tasks can be delegated to a computer to handle more efficiently – and with a lower error rate – freeing up the workforce to do jobs that add more value to the business, such as new product offering or adapting to regulatory changes.
“Data automation platforms also enable businesses to get a full view of the data transformation process, end to end. Through automated data lineage, businesses can track the cleansing and manipulation processes the data undergoes, giving them a holistic view of the data in a structured way, as opposed to an unstructured one. This aids with error spotting and reporting, both internally and to regulatory boards.”
Handling more data, and looking to the future
According to Trevor Morgan, product manager at comforte AG, the value-add that data science is set to bring to risk management in the near future is two-fold: the ability to manage more data in one go, and looking to the future rather than past events.
“Enterprise data is growing nearly exponentially, and it is also increasing in complexity in terms of data types,” said Morgan.
“We have gone way past the time when humans could sift through this amount of data in order to see large-scale trends and derive actionable insights. The platforms and best practices of data science and data analytics incorporate technologies which automate the analytics workflows to a large extent, making dataset size and complexity much easier to tackle with far less effort than in years past.
“The second value-add is to leverage machine learning, and ultimately artificial intelligence, to go beyond historical and near-real-time trend analysis and ‘look into the future’, so to speak. Predictive analysis can unveil new customer needs for products and services and then forecast consumer reactions to resultant offers. Equally, predictive analytics can help uncover latent anomalies that lead to much better predictions about fraud detection and potentially risky behaviour.
“Nothing can foretell the future with 100% certainty, but the ability of modern data science to provide scary-smart predictive analysis goes well beyond what an army of humans could do manually.”
Worldwide security and risk management spending to exceed $150 billion in 2021 — Gartner
Gartner has forecasted that security and risk management spending worldwide will grow 12.4% to reach $150.4 billion in 2021
Higher regulation of AI
While AI has demonstrated the capability of helping to increase the agility of organisations’ decision making, there is also the matter of higher regulation of the technology to consider, with legislation in the EU being a notable example. To stay compliant, risk management aided by data science is likely to be the way forward.
“Data science and risk management professionals will work hand in hand to ensure risk and governance procedures are at a high standard,” said Theresa Bercich, director of product strategy and principal data scientist at Lucinity.
“AI compliance will be more regulated, as evidenced by the EU creating legislation around this topic. This means that new job titles, positions and people will join the world of AI (which has already started), that will create frameworks for governance and risk.
“The power of AI and the demand for its value proposition is driving significant changes in the technology space including the breakdown of traditional silos and the development of intelligent software deploying data in a productive manner.”
Data science trends in healthcare – as identified by experts in the field
Data science trends in banking – leveraging data science capabilities in order to accelerate operations and increase flexibility
How to embark on a data science career – the key factors to consider.
Dimensionality Reduction is the process of transforming a higher-dimensional dataset to a comparable lower-dimensional space. A real-world dataset often has a lot of redundant features. Dimensionality reduction techniques can be used to get rid of such redundant features or convert the n-dimensional datasets to 2 or 3 dimensions for visualization.
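One common technique of this kind is principal component analysis (PCA). The following is a minimal sketch of PCA via the singular value decomposition, assuming a small synthetic dataset in which two of the five features are redundant by construction; it is one possible illustration, not necessarily the method the linked article uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (illustrative): 200 samples, 5 features, 2 of them redundant
# because they are linear combinations of the first three.
latent = rng.normal(size=(200, 3))
X = np.column_stack([latent, latent[:, 0] + latent[:, 1], 2.0 * latent[:, 2]])


def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ Vt[:n_components].T, explained


# Reduce to 2 dimensions, e.g. for a scatter plot.
X_2d, explained = pca(X, n_components=2)

print("reduced shape:", X_2d.shape)
print("explained variance ratio per component:", np.round(explained, 3))
```

Because two features carry no new information, the last two explained-variance ratios are numerically zero, which is the signal that those dimensions can be dropped without losing anything.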
A global survey published today finds nearly a third (31%) of respondents consider the social impact of bias in models to be AI’s biggest challenge. This is followed by concerns about the impact AI is likely to have on data privacy (21%). More troubling, only 10% of respondents said their organization has addressed bias in AI, with another 30% planning to do so sometime in the next 12 months.
A sneak peek into your Slack space’s emotions. Ever wondered how engaging the content you delivered was? Was it clear or confusing? Did people misunderstand your message at that company-wide meeting? Remote environments give teachers and leaders very little chance to gain feedback and optimize their content for better performance.
Machine learning systems are complicated. And sometimes, it’s not the fault of the engineers who build them. It’s the nature of machine learning systems. Here is what I mean… Let’s say that you did a great job at finding good data, you prepared it reasonably well, and your model made great predictions. Everything is pretty cool at the moment!
Data science, machine learning, and all forms of artificial intelligence have data at their core. We often focus the bulk of our attention on formulas or code when it comes to these disciplines and that makes sense for researchers in those areas of knowledge. But most professionals and hobbyists alike are practitioners of data science and machine learning instead of researchers.
At the end of the day, it is important to remember that ML model code is only a small part (~5–10%) of a successful ML system, and the objective should be to create value by placing ML models into production.
Every few years, data science and technical terms enter the business lexicon, only to get popularised, over-hyped, and then retired from popularity. Machine learning, Artificial Intelligence, and so many other technologies are following these patterns. Unfortunately, even the most essential ideas can fall victim to this cycle. The latest victim: data-driven.