Introduction
Machine learning is iterative by nature: the same steps are repeated over and over. To understand machine learning, then, one has to understand the machine learning process. That process is tricky and challenging, and it is rarely easy. The reason is clear: it involves a large amount of complex data, from which we try to extract meaningful predictive patterns and models.
That’s why, as I mentioned in my last article, this work is handled by data scientists, who are specialists in this space. In that article I also mentioned how rewarding a machine learning process can be. The benefits can be outstanding, but we should keep in mind that the process does not always succeed. It can fail, although that is rare. In this article, let’s focus on the processes and scenarios used in machine learning.
Series
We’ll cover machine learning concepts, processes, scenarios, and terminology in the form of a series. This is the second article of the series, and it focuses largely on machine learning processes and scenarios. The series consists of the following articles:
- Introduction to Machine Learning
- Machine learning processes and scenarios
- Machine learning: Deep dive
Baseline the need
In machine learning, asking the right question is the most important part of the process. Once we have the question, we should ask ourselves whether we have enough of the right data to answer it. If you ask the wrong question, or you do not have enough correct data, the answer you get will not be the one you expect. Take Internet banking transaction fraud again as an example. We ask the question, "How can we predict that a transaction is going to be fraudulent?" It may turn out that much of the predictive signal lies in which city the customer resides in, what their occupation or business is, or how long they have lived at their current address.
We might not have all of this data, and we may not be able to obtain some of it at all. In that case, we should ask ourselves whether we have at least enough of the right data to start. If we don’t, we are not going to get the answer we are looking for from the machine learning process. We should also ask ourselves what the criteria for success will be.
At the end of the process we get only a model that makes predictions, not an exact answer. So we should ask how good those predictions need to be for the entire process to count as a success. In our example, if the fraud prediction is right in 16 out of 20 cases, is that good enough? What about 14 out of 20, or should it be 18 out of 20? How do we decide? Knowing the answers to these questions is really important, because without them we won’t get the desired result and will never know when the process is complete and we have an actual predictive model.
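To make the "16 out of 20" question concrete, here is a minimal sketch of a success check. The function name and the 80% bar are assumptions chosen for illustration. In practice the bar is something the team agrees on before modelling starts.

```python
def meets_success_criterion(correct: int, total: int, required_rate: float) -> bool:
    """Return True when the share of correct predictions reaches the agreed bar."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total >= required_rate


# Hypothetical bar: at least 80% of fraud predictions must be right.
REQUIRED_RATE = 0.80

for correct in (14, 16, 18):
    verdict = "success" if meets_success_criterion(correct, 20, REQUIRED_RATE) else "not yet"
    print(f"{correct}/20 = {correct / 20:.0%} -> {verdict}")
```

With this bar, 14 out of 20 falls short while 16 and 18 out of 20 pass. Moving the bar to 0.90 would leave only 18 out of 20 as a success, which is why the criterion has to be fixed up front.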
Machine Learning: Process
If we go into the details of the machine learning process, we first identify, choose, and obtain the data we want to work with. For our example, we would often need to work with domain experts, that is, people who know a lot about fraudulent transactions and about the actual problem we need to solve. Being experts, they know which data is likely to be predictive. But the data we start with is raw and unstructured, and it is never in the form needed for actual processing. It may contain duplicates, it may have missing values, and it may include a lot of extra data that is not needed.
The data may come from various sources, which can also result in duplicate or redundant data. This is where pre-processing comes in, so that the process can understand the data. The good thing is that machine learning products usually provide data pre-processing modules for raw or unstructured data. For example, in capital markets there is always a need for price predictions for instruments, equities, or other assets, and an algorithm is applied to a huge amount of unstructured data coming from various feed providers. Multiple feed providers may supply the same data, and some may supply incomplete data while others supply complete data. So before applying the actual algorithm, we need to turn that unstructured data into structured, shaped data. This requires a pre-massaging step, which yields a candidate copy of the data that can then be processed by the algorithm to produce the actual golden copy.
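The feed-provider case can be sketched in a few lines. This is only an illustration under stated assumptions: the provider names, instruments, dates, and prices are made up, and the rule "keep the first non-missing price per instrument and date" is one simple choice among many.

```python
# Hypothetical raw ticks: (provider, instrument, date, price).
# A price of None marks a value the provider did not supply.
raw_ticks = [
    ("feed_a", "ACME", "2024-01-02", 101.5),
    ("feed_b", "ACME", "2024-01-02", 101.5),  # duplicate of the feed_a row
    ("feed_a", "ACME", "2024-01-03", None),   # feed_a is missing this price
    ("feed_b", "ACME", "2024-01-03", 102.0),  # feed_b fills the gap
    ("feed_a", "INIT", "2024-01-02", None),   # no provider has this price
]


def build_candidate_copy(ticks):
    """Merge ticks into one price per (instrument, date).

    Returns the cleaned rows plus the keys that no provider could fill.
    """
    merged = {}
    for _provider, instrument, date, price in ticks:
        key = (instrument, date)
        # Keep the first non-missing price seen for this key.
        if merged.get(key) is None:
            merged[key] = price
    clean = {key: price for key, price in merged.items() if price is not None}
    unresolved = sorted(key for key, price in merged.items() if price is None)
    return clean, unresolved


candidate, unresolved = build_candidate_copy(raw_ticks)
print(candidate)
print(unresolved)
```

The duplicate row collapses into one, the gap in one feed is filled from the other, and the row nobody can fill is reported separately so that a person can decide what to do with it.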
After pre-processing, we have well-structured data, and this data is now the input for machine learning. But is this a one-time job? Of course not. The process has to be iterative, and it is repeated until the data is ready. In machine learning, the major chunk of time is spent here: working on the data to make it structured, clean, and ready. Once the data is ready, algorithms can be applied to it. Machine learning products offer not only pre-processing tools but also a large number of machine learning algorithms. The result of applying an algorithm to the data is a model. But now the question is, is this the final model we needed?
No, it is a candidate model. A candidate model is the first reasonably appropriate model we get, but it still needs to be massaged. And do we get only one candidate model? Of course not. Since this is an iterative process, we do not actually know what the best candidate model is until we have produced several candidate models, again and again. We do this until we get a model that is good enough to be deployed. Once the model is deployed, applications start making use of it. So there is iteration at the small levels and at the largest level as well.
We need to repeat the entire process again and again and re-create the model at regular intervals. The reason is simple: scenarios and factors change, and we need to keep our model up to date and relevant at all times. This may eventually mean processing new data or applying new algorithms altogether.
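The loop of producing candidate models until one is good enough can be sketched as follows. Everything here is a toy assumption made for illustration: the labelled history is invented, and each "model" is just a single amount threshold rather than a real learning algorithm. A real project would also score candidates on data held back from training; the sketch reuses one dataset to stay short.

```python
# Hypothetical labelled history: (transaction amount, is_fraud).
history = [
    (20, False), (35, False), (50, False), (80, False), (120, False),
    (400, True), (150, False), (900, True), (650, True), (60, False),
]


def accuracy(threshold, rows):
    """Share of rows where the rule 'amount >= threshold' matches the label."""
    hits = sum((amount >= threshold) == is_fraud for amount, is_fraud in rows)
    return hits / len(rows)


def search_candidates(rows, thresholds, good_enough):
    """Try candidate models in turn and stop once one is good enough.

    Returns (threshold, score) for the best candidate seen so far.
    """
    best = None
    for threshold in thresholds:
        score = accuracy(threshold, rows)
        if best is None or score > best[1]:
            best = (threshold, score)
        if score >= good_enough:
            break  # good enough to deploy, so stop iterating
    return best


best_threshold, best_score = search_candidates(
    history, thresholds=(50, 100, 200, 500), good_enough=0.9
)
print(best_threshold, best_score)
```

Re-creating the model at regular intervals then amounts to calling `search_candidates` again on fresh history, possibly with a different set of candidates.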
Machine Learning: Scenarios
Let’s look at a few scenarios that show how we can actually use machine learning.
Fraudulent Internet Banking Transaction
Let’s again take the example of fraudulent internet banking transactions. Let’s assume that a certain number of bank customers use their internet banking facility to pay through some third-party payment application or gateway. There should be a point at which a fraudulent transaction gets rejected. The challenge is to identify the fraudulent transactions.
In this scenario, we could take all the historical transaction data and run it through the machine learning process described in the earlier sections, eventually getting a predictive model that an application can use to make decisions.
Predicting Customer Switching
Another example is the challenge of finding out how likely a customer is to switch. Take an internet data provider or a mobile company. In this space, customers usually call the call centers. For every caller, the call center employee needs to know how likely that customer is to switch to a competitor. Knowing that, the call center executive can offer a better or more lucrative deal to keep the customer from switching and retain them.
The challenge is how to identify those customers, and the answer is again machine learning. Data providers and mobile companies usually have a lot of recorded call data. The data may be vast and very detailed, so an application could be created around it to consolidate it. That application could use Spark, Hadoop, or any other big data technology.
The company may then need to combine the consolidated data with more data, such as data coming from CRM systems, to build up a sufficient amount of the right data for machine learning to use. This is not uncommon. The machine learning process can take data from multiple sources. The result is a predictive model that the call center application can use to make decisions and predictions about a customer’s likelihood of switching. It adds real value to the business and helps overall growth.
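Combining call data with CRM data can be sketched like this. The field names, customer ids, and values are hypothetical, and plain dictionaries stand in for whatever big data technology a real application would use.

```python
from collections import defaultdict

# Hypothetical call records: (customer_id, call_minutes, was_complaint).
call_records = [
    ("c1", 12, True),
    ("c1", 7, True),
    ("c2", 3, False),
    ("c3", 9, True),
]

# Hypothetical CRM rows keyed by customer id.
crm = {
    "c1": {"plan": "basic", "tenure_months": 5},
    "c2": {"plan": "premium", "tenure_months": 40},
}


def consolidate(calls, crm_rows):
    """Summarise calls per customer, then join the summary with CRM data.

    Customers missing from the CRM are dropped, like an inner join.
    """
    totals = defaultdict(lambda: {"calls": 0, "complaints": 0})
    for customer_id, _minutes, was_complaint in calls:
        totals[customer_id]["calls"] += 1
        totals[customer_id]["complaints"] += int(was_complaint)
    return {
        customer_id: {**stats, **crm_rows[customer_id]}
        for customer_id, stats in totals.items()
        if customer_id in crm_rows
    }


features = consolidate(call_records, crm)
print(features)
```

Each resulting row holds one customer with fields from both sources, which is the shape of input a machine learning algorithm would expect when predicting who is likely to switch.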
Conclusion
It’s all about asking the right questions, which is where the machine learning process begins. After this, we need the right, structured data to answer those questions, and this is the part that takes most of the time in the complete machine learning process. Then the process runs through any number of iterations until we get the desired predictive model. The model is then deployed, and it is updated from time to time to adapt to the changes that happen periodically. In the next article we’ll focus on some terminology and look at the machine learning process more closely.