It's intuitive that if you focus on the topics you know least, you'll get a better test score. However, when we carry this idea over to mathematical function optimization in a very large, high-dimensional space (also known as Machine Learning), things tend not to follow our intuition anymore. So we had better test the assumption.
The thing with high dimensionality is that everything changes in ways we can't naturally picture. Our imagination goes up to, say, four dimensions (e.g. a 3D heat map, where the fourth dimension is the color), maybe even 5 or 6 if we specifically train our brain for it. We could add sound to that graph or other crazy ideas, but the brain needs time to acquire the intuition for each new "dimension". So there's no way for us to imagine a 10,000-dimensional space and everything that it entails.
Counterintuitively, one thing that changes a lot when we add just one more dimension is the distance between points in the space (points which, in our machine learning setting, represent data samples). Two samples that were close in 100D can be very far apart in 101D, and that's hard to predict. What does the distance between samples actually mean? It's a similarity metric. Pretty important.
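To make the "distance behaves strangely" point concrete, here's a minimal sketch (not from the original experiment) showing distance concentration: as dimensionality grows, the gap between a point's nearest and farthest neighbors shrinks relative to the distances themselves, so "closeness" carries less and less information.

```python
import math
import random

def min_max_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from a reference point
    among uniformly random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    ref = points[0]
    dists = [math.dist(ref, p) for p in points[1:]]
    return min(dists) / max(dists)

for dim in (2, 10, 100, 1000):
    # the ratio creeps toward 1 as the dimension grows
    print(f"{dim:>5}D  min/max distance ratio: {min_max_ratio(dim):.3f}")
```

In low dimensions the nearest neighbor is much closer than the farthest one; in 1000D they are almost equally far away.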
Back to our assumption: is picking the lowest-confidence sample as the next one to train on better than picking at random? TL;DR: yes.
Say we have a classifier that we train incrementally as the user provides more data (for this experiment, the online Passive-Aggressive algorithm is used). We want to maximize the impact of each new sample, so we're looking for the most significant insight we can get from the user. The lowest-confidence assumption above thus translates into: "we should ask the user to annotate the sample our classifier is most confused about".
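For readers who want to follow along, here's a minimal sketch of incremental training with scikit-learn's online Passive-Aggressive classifier; the toy feature vectors and label stream are made up, and the post's actual pipeline may differ.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

clf = PassiveAggressiveClassifier()
classes = np.array([0, 1])  # binary labels, declared up front for partial_fit

# Hypothetical stream of (features, label) pairs arriving one at a time.
stream = [(np.array([[1.0, 0.0]]), [1]),
          (np.array([[0.0, 1.0]]), [0]),
          (np.array([[0.9, 0.1]]), [1])]

for X, y in stream:
    clf.partial_fit(X, y, classes=classes)  # one online update per sample

# decision_function gives a signed margin; its magnitude can serve
# as the confidence score we'll sort by later.
print(clf.decision_function(np.array([[0.5, 0.5]])))
```

The key property is `partial_fit`: each call updates the model in place, so the classifier is usable (and scoreable) after every single sample.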
To test and validate this assumption we use an annotated data set (fairly small, ~200 training samples). Let the classifier be in some state where part of the data set has already been used to train it. We ask the classifier for predictions, together with a confidence score, for all samples in the unused part of the training set. We pick the sample with the lowest confidence score and train on it. Then we repeat the process until no more training samples are available.
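The loop just described can be sketched as follows; the pool here is synthetic (the post's real data set is ~200 annotated sentences), and the confidence score is assumed to be the absolute decision margin.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)  # stand-in labels

clf = PassiveAggressiveClassifier()
unused = list(range(len(X_pool)))

# Seed the classifier with the first couple of samples so it has a state.
seed_idx = [unused.pop(0), unused.pop(0)]
clf.partial_fit(X_pool[seed_idx], y_pool[seed_idx], classes=np.array([0, 1]))

while unused:
    # confidence = absolute margin; the smallest margin = most confused
    margins = np.abs(clf.decision_function(X_pool[unused]))
    pick = unused.pop(int(np.argmin(margins)))
    clf.partial_fit(X_pool[pick:pick + 1], y_pool[pick:pick + 1])
```

Note that in this simulation we read the true label only *after* picking the sample, mirroring a user answering an annotation request.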
Note: we refrain from taking advantage of the true label of each sample. These are available in the data set, but won't be in a real-life situation.
Throughout the process we use the test data set to evaluate the performance of the classifier's current state. Now, let's check the performance of this strategy against always picking new samples at random (i.e. the most straightforward strategy).
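Tracking performance along the way is simple: after each new training sample, score the current classifier on the held-out test set. A sketch with synthetic data (the real experiment measures F-score on its own test set):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X_train = rng.normal(size=(60, 4))
y_train = (X_train[:, 0] > 0).astype(int)  # stand-in labels
X_test = rng.normal(size=(40, 4))
y_test = (X_test[:, 0] > 0).astype(int)

clf = PassiveAggressiveClassifier()
history = []
for i in range(len(X_train)):
    clf.partial_fit(X_train[i:i + 1], y_train[i:i + 1], classes=np.array([0, 1]))
    # F-score of the classifier's current state on the test set
    history.append(f1_score(y_test, clf.predict(X_test), zero_division=0))
```

`history` is exactly the curve plotted below: one F-score per training sample consumed.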
On the same data set, we evaluate both strategies, based on the same online classifier. The data set is made of variable length natural language sentences and a binary label (true/false).
Explaining the pretty pictures:
- For each strategy, on the top we have a map of predictions for the test set, where each cell is a sample; blue represents a correct prediction, red is an incorrect one; the opacity represents confidence (transparent is 0 confidence)
- On the bottom we have the history of the classifier's performance; X-axis is number of samples, Y-axis is F-score
Note: in both cases, before the 40-sample mark, we see a streak of null F-scores because the classifier is in a pessimistic state and prefers to label everything as false.
Conclusion: The lowest-confidence strategy got our classifier to a mature level quicker and in a much more stable fashion.
The predictions heat map and the F-score plot are complementary. While it's natural that when the F-score rockets up the heat map turns blue rather quickly, we can also see how the colors fade (and the confidence with them).
PATH OF WISDOM
Side-note: Does this look familiar? Looks like human nature. We're young and we think we know it all, then we have a sudden realization of our ignorance. We start getting wiser and wiser but now we know that we don't possess the absolute truth and there's much to be learned.
We've seen how sorting the training data set by confidence helps, but this doesn't quite apply to a real-life scenario. If we have all the data beforehand, we'll just batch-train on all of it... obviously.
The more interesting scenario is when we don't have all the data at time t0: we get a sample at time t1, another at t2, and so on. Each of these iterations costs physical time put in by the user. Precious physical time. So, instead of having the user feed us information about some random sample, we now know we want them to work on the lowest-confidence sample. On every step tn, the process goes like this:
- compute confidence scores for all available samples
- pick the lowest-scoring sample
- get feedback from the user
- train the classifier with the new sample
Each time we ask the user for help, we can be sure we're making good use of their time.
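The steps above can be packed into one small function; `ask_user` here is a hypothetical stand-in for whatever UI actually collects the annotation.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

def annotate_lowest_confidence(clf, X_pool, ask_user):
    """One step tn of the loop: find the sample the classifier is most
    confused about, ask the user to label it, and train on the answer."""
    margins = np.abs(clf.decision_function(X_pool))  # confidence scores
    i = int(np.argmin(margins))                      # pick the lowest
    label = ask_user(X_pool[i])                      # get user feedback
    clf.partial_fit(X_pool[i:i + 1], [label])        # train on the new sample
    return i
```

In a real application you'd also drop the annotated sample from the pool before the next step, and `ask_user` would block on actual human input.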
Happy data acquisition!