When a lot of us think about Data Science and Machine Learning, we might shy away and think - 'that's only for the clever boffins' ... 'I didn't do advanced maths in school, I'll avoid that...', or 'I only do basic coding and SQL, it's no use to me...'. As with a lot of areas, if we don't at least take a look at it, we are doing ourselves a disservice. Sure, I will be the first to admit that Data Science and Machine Learning can be hard, but guess what, that's only a small part of the domain. The majority of it is quite accessible to most developers and is not that difficult to either learn or use. In addition, the Data Science and Machine Learning tools that Microsoft gives us in Azure make it super easy to get started. Remember, you can't eat an apple in one mouthful; it takes one bite at a time.
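In the introductory tutorial 'The five questions Data Science can answer', Brandon Rohrer, a Senior Data Scientist at Microsoft, demystifies Data Science by breaking it down into five basic groups of algorithms, each of which answers one question: Is this A or B? (classification); Is this weird? (anomaly detection); How much or how many? (regression); How is this organized? (clustering); and What should I do next? (reinforcement learning).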
This may seem like a very small list, especially when we have learned of the black art of data science and machine learning and how difficult it is! When you think about it, however, it makes sense. I read recently that 'machine learning is nothing more than doing the same things humans do - looking for patterns, but at machine speed'. And that's the core of it ... at machine speed.
Humans are extremely good at spotting patterns. Where we fall down is when we have to do it very fast, or for many, many iterations of the same thing. Conversely, this is where machines excel - they are able to repeat the same thing over and over without tiring, and with exact and consistent precision. When we as humans look at something new, we generally ask the same questions that data science asks: mmm, what's that? (is it A or B); is it like something we already know? (is it weird); is it more or less than we are used to or expect? (how much/many); how is what we are looking at held together? (how is it organized); and finally, given the information or data we are presented with, what do we do next?
Sometimes, when we are working on a problem, we may find that one single approach or 'question' gives us exactly what we want. This, however, is not that common. It is more usual to have to ask a series of questions of the data, taking the output or answer of one question and feeding it in as the input to another. If this sounds like a workflow, well, it is. And this is where, for example, the Microsoft Azure Machine Learning Studio comes into play. It is presented to us like a flow-chart program, where we can link inputs to processes and outputs to inputs, and chain steps together into dynamic workflows that take hitherto unconnected blocks of data and combine and enhance them to give new and interesting insights.
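To make the 'output of one question becomes the input of the next' idea concrete, here is a minimal sketch in plain Python. It is not the Azure ML Studio API; the function names, the sample readings, and the cutoff values are all made up for illustration. It chains a preparation step, an 'is it weird?' step, and an 'is it A or B?' step.

```python
from functools import partial
from statistics import mean, stdev


def drop_missing(readings):
    """Preparation step: discard readings that are missing (None)."""
    return [r for r in readings if r is not None]


def flag_weird(readings, threshold=2.0):
    """'Is it weird?': pair each reading with True if it lies more than
    `threshold` sample standard deviations away from the mean."""
    mu, sigma = mean(readings), stdev(readings)
    return [(r, abs(r - mu) > threshold * sigma) for r in readings]


def label(flagged, cutoff):
    """'Is it A or B?': label every non-weird reading 'high' or 'low'."""
    return ["high" if r >= cutoff else "low" for r, weird in flagged if not weird]


def run_pipeline(data, steps):
    """Feed the output of each step in as the input of the next one."""
    for step in steps:
        data = step(data)
    return data


# Hypothetical sample data: one missing value and one obvious outlier (95).
readings = [10, 12, None, 11, 13, 95, 9, 12]
steps = [drop_missing, flag_weird, partial(label, cutoff=11.5)]
result = run_pipeline(readings, steps)
print(result)
```

A visual tool lets you wire up the same kind of chain by dragging boxes rather than writing functions, but the shape of the workflow is the same.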
The Azure ML Studio is incredibly powerful, and dramatically lowers the barrier to entry for Data Science and Machine Learning. It saves you from having to worry about setting up the production environment and effectively gives you 'Data Science' and 'Machine Learning' as a service. The following screenshot shows an example from the studio where Azure ML is being used to automatically classify the news articles from the BBC website home page. You can read more about it here.
Data science is not rocket science but, like rocket science, while it may seem amazing when we see the end results, getting there takes a lot of hard work. You might be surprised to learn that for the majority of data science projects, 50-80% or more of the time is spent not running magic algorithms to predict the future or find patterns, but actually cleaning and preparing the data so that it can be worked with optimally. It is rare that we build a report using SQL that shows data from one single table. Usually, we are joining and combining tables, handling exceptions, and so on. With Data Science things are no different, and preparation of data is key. As you would expect, Azure ML Studio gives us many options in this area.
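As a rough illustration of what 'cleaning and preparing' can look like, the sketch below uses Python's built-in sqlite3 module to join two small tables and tidy the result. The table names, columns, and rows are hypothetical, and this is plain SQL rather than anything specific to Azure ML Studio.

```python
import sqlite3


def prepare_orders(conn):
    """Join orders to customers, trim stray whitespace, normalise the
    country code to upper case, and skip orders with no amount."""
    return conn.execute(
        """
        SELECT TRIM(c.name)           AS name,
               UPPER(TRIM(c.country)) AS country,
               o.amount
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        WHERE o.amount IS NOT NULL
        ORDER BY o.id
        """
    ).fetchall()


# Hypothetical, deliberately messy sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, ' Alice ', 'ie'), (2, 'Bob', ' UK');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 2, NULL), (3, 2, 35.5), (4, 3, 12.0);
    """
)

clean_rows = prepare_orders(conn)
print(clean_rows)
```

Here the order with a missing amount and the order pointing at a customer who does not exist are both dropped before any 'clever' algorithm ever sees the data, which is exactly the kind of unglamorous work that takes up most of a project.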
Data Science is not all about the Guru, the Rockstar, the elusive 'Data Scientist' ... rather, it is about a tightly knit group of people working together in a team. It is very very difficult to find a single person who can do all the Data Science work that an organization will need. Normally, it is better to start to bring people together in your organization who have at least some of the skills needed and use these to form the foundations you can build upon.
Data Science is not about chasing unicorns; it's about identifying what skills you have and what ones you need, and bringing it all together. A good Data Science team will, overall, be strong at data engineering/wrangling, SQL, visualization, and ETL/cleaning, and will normally have at least one person who understands the different options available from an algorithmic point of view.
To get a really good grasp of some of the fundamental concepts in Data Science/Machine Learning, take five short minutes and go through the amazing, beautiful, and insightful visualization/tutorial 'A Visual Introduction to Machine Learning' - it's brought to you by the combined genius of Stephanie Yee and Tony Chu ... Then, if you went wow, put it into your diary to go back every other week and look at it over and over again for the next few months, until you are so comfortable with each of the concepts shown that you could describe them to someone else using only a stick of chalk and a blackboard!
The wrap-up is this - if you are a developer and want to get started, you need look no further than the free resources that Microsoft has provided in the Data Science for Beginners series - it's a great series, very easy to understand, and really broadens your knowledge. Check it out. You'll be glad you did!