The main task in any Machine Learning project is to select a learning algorithm and train it on some data, so the two things that can go wrong are “bad algorithm” and “bad data”. Let’s focus on bad data in this article.
Insufficient Quantity of Training Data
For a child to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of shapes and colors. Genius.
But this approach doesn’t work for Machine Learning; most Machine Learning algorithms need a lot of data to work properly. Even for very simple problems, a model typically needs thousands of examples, and for complex problems such as image or speech recognition, it may require millions of examples (unless it can reuse parts of an existing model).
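As a rough illustration, here is a minimal sketch showing how a model’s accuracy typically climbs as the training set grows. It assumes scikit-learn is installed; the digits dataset and the logistic regression model are illustrative choices, not part of the discussion above:

```python
# Minimal sketch: the same model evaluated on increasingly large training sets.
# Dataset and model are illustrative assumptions, not from the article.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# learning_curve trains the model on growing slices of the data and
# cross-validates each one.
train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, scores in zip(train_sizes, test_scores):
    print(f"{size:4d} training examples -> {scores.mean():.3f} mean CV accuracy")
```

Running this, you should see the mean cross-validation accuracy rise steadily with the number of training examples, which is exactly the point: more data usually helps.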
Poor-Quality Data
Obviously, if the training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will be harder for the system to detect the underlying patterns, so it is less likely to perform well. That’s why it is well worth the effort to spend time cleaning up your training data. In fact, most data scientists spend a significant part of their time doing just that.
For example:
- If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
- If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it. Both options are sketched below.
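Here is a minimal sketch of both clean-up options on a small hypothetical customer table. It assumes pandas and scikit-learn are available; the data and the age cutoff of 120 are invented for illustration:

```python
# Minimal sketch: drop a clear outlier, then fill missing values with the median.
# The data and the cutoff are hypothetical.
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, 31, None, 42, 28, None, 199],   # 199 is a clear outlier
    "income": [40_000, 52_000, 48_000, 61_000, 45_000, 50_000, 47_000],
})

# Option 1: discard instances whose age is clearly impossible.
df = df[df["age"].isna() | (df["age"] < 120)].copy()

# Option 2: fill the remaining missing ages with the median age.
imputer = SimpleImputer(strategy="median")
df["age"] = imputer.fit_transform(df[["age"]]).ravel()

print(df)
```

Whichever options you pick, the key point is that these are deliberate decisions about the data, made before any model is trained.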
Irrelevant Features
As the saying goes: garbage in, garbage out. The system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is to come up with a good set of features to train on. This process, called feature engineering, involves:
- Feature selection: selecting the most useful features to train on among the existing features.
- Feature extraction: combining existing features to produce a more useful one (the sketch below shows both of these steps).
- Creating new features by gathering new data.
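Here is a minimal sketch of the first two steps on hypothetical housing-style data. The features, target, and the choice of scikit-learn’s SelectKBest with f_regression are illustrative assumptions:

```python
# Minimal sketch: feature extraction (combining raw features) followed by
# feature selection (keeping the features most related to the target).
# All data here is synthetic and hypothetical.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(42)
n = 200
rooms = rng.integers(2, 10, n)
households = rng.integers(1, 4, n)
noise = rng.normal(0, 1, n)                        # an irrelevant feature
price = 50 * rooms / households + rng.normal(0, 5, n)

# Feature extraction: combine two raw features into a more useful one.
rooms_per_household = rooms / households

X = np.column_stack([rooms, households, noise, rooms_per_household])

# Feature selection: keep the k features most correlated with the target.
selector = SelectKBest(f_regression, k=2)
selector.fit(X, price)
print("selected feature indices:", selector.get_support(indices=True))
```

The extracted rooms_per_household feature tends to be selected over the raw ones, while the irrelevant noise feature is discarded, which is the whole aim of feature engineering.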
In the next article, we will take on a challenge based on the other culprit: “bad algorithm”.