A few days ago I found the following question on Quora: "Is machine learning so popular because it is so easy to understand and work with?" Although the question is not new (it dates from two and a half years ago), only three people have answered it. Each in their own way, they all said that machine learning (ML) is not easy at all, and while I agree with the difficulties they pointed out, I believe there are still things to be said on this topic, especially since some people think the opposite (for example, Ben McRedmond) and, moreover, are trying to convince others of it.
So, what is ML? As a primer, ML is the ability to teach a computer to recognize patterns and to improve its accuracy as it learns. Being a method of learning from examples, the more examples you feed the machine, the smarter it theoretically gets at spotting those patterns. Think of Amazon.com, for example: the more you buy from that website, the better the recommendations it will give you afterwards.
And what about ML's popularity? Well, I don't think ML is popular because it's easy to understand and work with, but because of the possibilities it offers. In other words, instead of spending hour after hour identifying different patterns yourself, you simply teach computers how to do it. If you think that's easy, think of how you would teach your kid to do this kind of task, and then consider that your child actually understands what you're saying, while the computer doesn't (you need to formalize the knowledge in such a way that that box full of electronics can work with it). Of course, there are advantages to training the computer instead of the kid, as the computer won't forget what you told it or ignore you because it wants to play with LEGO, but still.
Since I started speaking of knowledge, this is actually the starting point of why ML is difficult: ML is an interdisciplinary domain, one requiring skills from at least three different fields: math, statistics and computer science. If you want to find out exactly which parts of these fields are relevant to ML, read Joseph Misiti's blog, where you'll find further details about what's needed to work in this area.
From my point of view, any ML task has to deal with three different elements, each posing specific problems: the data, the algorithms and the validation of the results. In this post I will only delve into the data and some of the problems it raises. The other two elements will be discussed in later posts.
So, regarding the data: there is not a single ML task that escapes the question "how much data is needed?". The answer is ambiguous at best. Most of the time, the answer you'll receive is "as much data as you can get". However, data acquisition is not easy at all; it requires a lot of work, knowledge and trust. The problem is even harder for startups, which haven't yet made a name for themselves and still need to convince others to hand over their data while getting nothing in return, since ML usually needs lots of data before it provides useful results. Boris noticed the same problem: "if startups want to succeed in machine learning, their top priority should be building proprietary data sets".
My answer to this question is that the quantity of data depends both on the model and on the data itself. But what is a model? A model is the information hidden in the dataset, usually characterized by some features determined from the data. The model is the output of the ML algorithm and represents a generalization of the analyzed data, while a feature is "a property on which a model is trained".
For example, in legal documents, in order to find the jurisdiction under which a specific contract is judged, the features investigated might be the words found in the document. Each such word could be a feature. However, not all of them are relevant to the contract's jurisdiction, so only some features are used to build the model that identifies it. As a starting point, one might use the word "jurisdiction" as a feature, along with words expressing different locations. Still, although it looks easy, this is not a simple task, because of all the ways a jurisdiction can be set without actually using those words. By combining these features we obtain the model that will be used to evaluate new contracts. The way the features are combined is given by the ML algorithm that is used and is derived from the training data.
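To make the idea concrete, here is a minimal sketch of keyword-based feature extraction; the feature list and the sample clause are invented for illustration, not taken from any real system:

```python
# Minimal sketch: map a contract's text to binary keyword features.
# The FEATURES list and the sample clause below are made up for illustration.
FEATURES = ["jurisdiction", "governed", "courts", "england", "new york"]

def extract_features(text):
    """Return a 0/1 vector: does each keyword occur in the text?"""
    lowered = text.lower()
    return [1 if feature in lowered else 0 for feature in FEATURES]

clause = ("This Agreement shall be governed by the laws of England, and the "
          "parties submit to the jurisdiction of its courts.")
print(extract_features(clause))  # → [1, 1, 1, 1, 0]
```

In practice such presence/absence features are only a starting point; as noted above, many contracts set a jurisdiction without using any of these words.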
Researchers have empirically determined that the amount of data needed to obtain good results from an ML algorithm should be at least 50 + 8 * (number of features). However, this number is only approximate, as it assumes having "good data".
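That rule of thumb is easy to turn into a back-of-the-envelope helper, treating 50 + 8 * n as a rough lower bound:

```python
# The 50 + 8 * (number of features) rule of thumb mentioned above,
# encoded as a quick helper for rough dataset sizing.
def min_training_examples(n_features):
    """Rough lower bound on training-set size, assuming 'good data'."""
    return 50 + 8 * n_features

# A model with 20 features would need at least roughly this many examples:
print(min_training_examples(20))  # → 210
```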
In reality, the data is "ugly, and unstructured". One problem is that most of the time the available data is affected by noise (small errors caused by various measurement faults) and needs to be cleaned. However, that is easier said than done, mainly because the noise is not always present and doesn't always affect the data in the same way. Besides, the ML task is trying to "guess" what the real data should look like, so sometimes we don't even know what correct data would be. Moreover, beyond noise, the data sometimes contains outliers (data that may appear fine on its face but actually isn't), which are a problem for some ML algorithms. To continue the previous example, an outlier might be a jurisdiction clause that somebody annotated as non-relevant for the considered model. Identifying outliers is extremely difficult and is usually done by plotting the data and manually eliminating the wrong instances. However, this method cannot be applied when the training data is huge, as required for modeling complex phenomena.

Another issue is that most of the time the data won't arrive on a silver platter ... you'll have to dig deep inside the data corpus to extract the relevant features. For our example, you'll have to parse the text, find the instances of the considered words, stem them, normalize the text, eliminate stop words and so on; in short, you'll apply techniques not from the ML field but from the natural language processing domain. This is in fact an important practical aspect of ML: you may need supporting skills from computer science just to pre-process the data (to clean it up, store it and manage it).
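As a rough sketch of those pre-processing steps, here is a toy pipeline in plain Python; a real project would use an NLP library such as NLTK or spaCy, and the stop-word list and suffix-stripping "stemmer" below are deliberately naive:

```python
# Toy pre-processing pipeline: tokenize, normalize, drop stop words, stem.
# Everything here is simplified for illustration; use a proper NLP library
# (e.g. NLTK or spaCy) for real work.
import re

STOP_WORDS = {"the", "of", "and", "to", "by", "a", "in", "shall", "be"}

def naive_stem(word):
    # Extremely crude suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize + normalize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [naive_stem(t) for t in tokens]               # stem

print(preprocess("The parties submit to the jurisdiction of the courts"))
# → ['partie', 'submit', 'jurisdiction', 'court']
```

Note how even the crude stemmer mangles "parties" into "partie" — a small taste of why cleaning real text is harder than it looks.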
Another data-related problem is that the data should cover the whole problem space. What I mean is that the problem space can be divided into multiple areas, with a dimensionality given by the number of features used. If the data is concentrated in a sub-space of the problem space, the ML algorithm building the model will have no idea how to evaluate the uncovered areas, and it will be prone to errors there. This problem is called the curse of dimensionality, and the easiest way to explain it requires hyper-dimensional oranges, so let's leave that for another time. And by the way, don't confuse the curse of dimensionality with the cures of dimensionality, which according to my mother involves cloves, garlic and hot water.
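A back-of-the-envelope calculation shows how quickly coverage collapses as features are added. Assuming, purely for illustration, that each feature's range is split into 10 bins and that each example can occupy at most one cell:

```python
# How much of the problem space can a fixed dataset possibly touch?
# With b bins per feature and d features there are b**d cells, and
# n examples can cover at most n of them. Illustrative numbers only.
def max_coverage(n_examples, n_features, bins_per_feature=10):
    """Upper bound on the fraction of cells the data can occupy."""
    n_cells = bins_per_feature ** n_features
    return min(1.0, n_examples / n_cells)

for d in (1, 2, 3, 6):
    print(d, max_coverage(1000, d))
# With 6 features, 1,000 examples can touch at most 0.1% of the cells.
```

The same 1,000 examples that saturate a one-dimensional space leave a six-dimensional one almost entirely empty — the essence of the curse.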
Finally, the test data should be consistent with the training data: the ML algorithm builds the best model it can from the data provided for training, but if the test data comes from a different distribution than the training data, all that work is for nothing. The algorithm will faithfully predict according to what it learned, but since the learned distribution doesn't match the one being tested, the predictions will be faulty.
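A tiny synthetic example of this mismatch: a one-dimensional "model" (a midpoint threshold between two class means) learned on one distribution misclassifies test points drawn from a shifted one. All numbers are invented for illustration:

```python
# A 1-D classifier reduced to its simplest form: learn the midpoint
# between two class means, then label new points by which side they fall on.
def learn_threshold(class_a, class_b):
    """Midpoint between the two class means: our entire 'model'."""
    mean_a = sum(class_a) / len(class_a)
    mean_b = sum(class_b) / len(class_b)
    return (mean_a + mean_b) / 2

# Training data: class A centered around 1.0, class B around 3.0.
threshold = learn_threshold([0.8, 1.0, 1.2], [2.8, 3.0, 3.2])
print(threshold)  # → 2.0

# Test data from a shifted distribution: class A now sits around 2.5,
# so the learned threshold wrongly labels it as class B.
print(2.5 > threshold)  # → True (an A point, misclassified as B)
```

The model is doing exactly what it was trained to do; it is the distribution shift between training and test that makes the predictions wrong.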
In this post I've only covered the most important issues related to the data used in ML tasks. In a future post, I will present the difficulties of modeling a problem, choosing the appropriate features (and how many of them), and identifying the best algorithm for obtaining the desired model. Afterwards, I will try to show why evaluating the results obtained by these algorithms is also problematic. Until then, I'll end this part with a quote from Misiti: "being a good data scientist (usually) takes years of experience. It requires more than just knowing how machine learning algorithms work. It's knowing what questions to ask and how to convey the answers to investors, management, and customers."