Each new generation of computer-driven algorithms and data structures brings its own set of challenges for data scientists, and the pioneers of each new technology struggle to establish themselves as leaders in their space.
However, over the last few decades there has been a shift in how companies approach machine learning research, training, and implementation. These changes have been driven by two key paradigms that have emerged:
1) Theory-Based vs. Data-Centric Approach and
2) Data-Driven vs. Rules-Based Approach.
This article will outline the history and use cases of these two machine learning paradigms, giving examples from different industrial sectors. We will then explore how current industry trends are affecting the adoption of each paradigm, ultimately leading to a third one, a collaboration between theory and data at every level, which is the approach Google took when designing its open-source machine learning library, TensorFlow.
What is Machine Learning?
Machine Learning is a growing field that studies the design and implementation of algorithms that can learn from data. Computer algorithms that can use data to make decisions are called ML algorithms.
Machine learning can be used to predict values that have not yet been observed or to generate entirely new data. In short, it is a field that applies information theory and computer science to make predictions and decisions based on data. Machine learning can be divided into two general categories: predictive and interventional.
Predictive Modeling: Building a Model to Predict Future Events
In predictive modeling, a model is fitted to data from the data-generating process in order to solve a specific problem. The process is supervised: the model is trained to make correct predictions.
A good example of a machine learning technique used to make predictions is the neural network. A neural network consists of an input layer, which receives an image or other sensory data, one or more hidden layers, which transform that data, and an output layer, which produces the predicted output.
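To make that layer structure concrete, here is a minimal sketch of a single forward pass in Python. The sizes, random weights, and tanh activation are arbitrary choices for illustration only, not part of any particular library or model described in this article.

```python
import numpy as np

# Input layer -> hidden layer -> output layer, as described above.
# All numbers here are made up purely for illustration.
rng = np.random.default_rng(0)

x = rng.random(4)                  # input layer: e.g. 4 sensor readings or pixel values
W_hidden = rng.random((3, 4))      # weights from the input layer to a 3-unit hidden layer
W_output = rng.random((1, 3))      # weights from the hidden layer to a single output unit

hidden = np.tanh(W_hidden @ x)     # hidden layer activations
prediction = W_output @ hidden     # output layer: the predicted value

print(prediction)
```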
Interventional Modeling: Using the Predicted Data to Make Changes
In interventional modeling, the goal is to use the model's predictions to change how decisions are made. This is the work of a clinician who, for example, might want to update hospital discharge criteria for a given disease to take the latest research into account.
In interventional modeling, it is the clinician who makes the change. It could be an adjustment to the predicted data, such as removing a given tumor from a patient's record, or something more invasive, such as changing the predicted data to remove a certain organ from the record.
An important difference between the two is that in interventional modeling, a person uses the model's output to make a decision, whereas in predictive modeling, the algorithm itself makes the decision based on the data.
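The following sketch illustrates that distinction, assuming scikit-learn is available. The data, the risk threshold, and the clinical decision are all invented for illustration; they do not come from any real system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.random((100, 3))                    # e.g. lab results for 100 patients (synthetic)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic outcome labels

# Predictive modeling: the algorithm makes the decision (the prediction) from the data.
model = LogisticRegression().fit(X, y)
risk = model.predict_proba(X[:1])[0, 1]

# Interventional modeling: a person uses the prediction to decide on a change.
RISK_THRESHOLD = 0.8   # chosen by the clinician, not learned from the data
if risk > RISK_THRESHOLD:
    print("Clinician reviews and updates the discharge criteria")
else:
    print("No change to the discharge criteria")
```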
Advantages of the Theory-Based Approach
The most obvious advantage of the theory-based approach is that it is grounded in facts. Facts are static; they do not change as the algorithm learns, which makes it easy to verify that the model behaves as predicted.
A particularly useful property of facts is that they are independent of a person's cultural and religious beliefs. This alone can make theory-based machine learning a great fit in many situations, especially in finance, where one must forecast the performance of companies based on their financial statements.
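As a rough illustration of what a theory-based rule looks like, here is a sketch of a fixed scoring formula applied to figures from a financial statement. The formula and its weights are entirely made up for this example; the point is only that the rule does not learn from data, so its output can be checked directly against the facts it uses.

```python
# A hypothetical theory-based rule: a fixed formula over financial-statement facts.
def forecast_score(revenue: float, net_income: float, total_debt: float) -> float:
    """Score a company from static financial-statement figures (illustrative only)."""
    margin = net_income / revenue      # profitability
    leverage = total_debt / revenue    # indebtedness
    return margin - 0.5 * leverage     # arbitrary weighting chosen by theory, not learned

print(forecast_score(revenue=1_000_000, net_income=120_000, total_debt=400_000))
```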
Disadvantages of the Theory-Based Approach
However, the theory-based approach has significant disadvantages as well. First, it is static, so it cannot be used to make rapid changes to the output the way intervention models can. It is also expensive, because of the time it takes to train the algorithms and produce fresh data.
Data-Centric Approach
In the data-centric approach, the algorithm focuses on producing accurate outputs rather than on making changes to the data. The goal is to make the model as efficient as possible, which often means producing consistent outputs regardless of the context in which the data was collected.
In the data-centric approach, a model’s inputs are static, while the model itself is dynamic, meaning that it is designed to change with the data.
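One common way such a dynamic model is realized is incremental (online) learning, where the model keeps updating as new data arrives. The sketch below assumes scikit-learn and uses a synthetic stream of batches purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(2)
model = SGDClassifier()                                  # a linear model trained incrementally

for _ in range(5):                                       # new batches of data arriving over time
    X_batch = rng.random((50, 4))                        # synthetic inputs
    y_batch = (X_batch.sum(axis=1) > 2.0).astype(int)    # synthetic labels
    model.partial_fit(X_batch, y_batch, classes=[0, 1])  # the model changes with the data

print(model.predict(rng.random((1, 4))))
```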
In data-driven (as opposed to rules-based) modeling, the goal is to make the model as sophisticated as possible while keeping its decision-making process as transparent as possible; rather than encoding the rules by hand, you let the model derive them from the data.
Some notable examples of data-driven approaches in machine learning are evolutionary algorithms, neural networks, and support vector machines; the sketch below contrasts one of them with a hand-written rule.
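Here is a hedged sketch of that contrast, using a support vector machine from the list above via scikit-learn. The data, the hand-written threshold, and the hidden pattern are all synthetic and chosen only to illustrate the difference between a rule written by a person and a boundary learned from data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)       # the "true" pattern hidden in the data

# Rules-based: the decision boundary is written by hand.
def rules_based(sample):
    return int(sample[0] > 0.4)       # a guessed threshold, fixed in code

# Data-driven: the decision boundary is learned from the data itself.
model = SVC().fit(X, y)

sample = np.array([[0.45, 0.9]])
print("rule says:", rules_based(sample[0]), "| SVM says:", int(model.predict(sample)[0]))
```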
Collaboration between Theory and Data
A fascinating aspect of machine learning is the collaboration between theory and data at every level. This might seem like a surprising thing to say about a field that studies how models learn, but it is a crucial part of the process.
For example, a neural network's architecture is a theoretical choice, yet it must fit the data: sometimes an output carries no useful information, such as a "not proven" category in a machine learning model of cellular automata.
In this case, the output is a "void," meaning it gives no information about whether or not the thing exists. These are known as "dark data" problems, and they are central to the theory-based approach.
Conclusion
Machine learning is a rapidly evolving field that makes computers capable of specific tasks by learning from experience. The two main paradigms that have shaped machine learning are the theory-based approach and the data-driven approach. In this article, we have discussed the history and use cases of these two paradigms and the key differences between them.