Introduction
Data matters well beyond the field of data science. It is involved everywhere, from setting your morning alarm, to managing your schedule on Skype, to messaging a friend.
Before you dig deeper into machine learning, let us go over the key concepts it builds on. To get started, let us begin with data!
Data
In general, data are facts collected together for analysis or reference. Data and information are closely related concepts. In today’s world, data often refers to information that has been translated into binary digital form.
Data can be collected, measured, analyzed, and then visualized using different analytics tools. Unprocessed data, or raw data, consists of numbers or characters that still need to be cleaned, for example to remove outliers or data entry errors.
Data comes in different sizes and flavors. It may be text, numbers, clickstreams, graphs, tables, images, transactions, videos, or some or all of the above. With that said, it’s time to learn more about data, starting with data types.
Data Types
Data types are an important concept because you need to know the type of your data before applying any statistical measure to it.
Categorical Data
Categorical data consists of categorical variables, whose values are names or labels that sort the data into groups. For example, the color of a pen (green, blue, red) or the names of cities (London, Paris, Washington). Categorical data can also be coded as numbers, like ‘0’ for female or ‘1’ for male. It can therefore represent information like language, gender, education, or qualification.
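To make the idea of numeric codes for categories concrete, here is a minimal Python sketch. The sample list and the names pen_colors, code_of, and encoded are made up for illustration, and the integer codes are arbitrary labels, not quantities.

```python
from collections import Counter

# Hypothetical sample: pen colors recorded as labels.
pen_colors = ["green", "blue", "red", "blue", "green"]

# Give each distinct label an arbitrary integer code.
# Sorting the labels only makes the assignment reproducible.
categories = sorted(set(pen_colors))
code_of = {label: code for code, label in enumerate(categories)}

# Replace every label by its code.
encoded = [code_of[color] for color in pen_colors]

# Counting how often each category occurs is a meaningful operation
# on categorical data; averaging the codes would not be.
frequency = Counter(pen_colors)
```

Here the codes only stand in for the names, so the useful summary is the frequency of each category.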
Nominal Data
The word nominal means ‘name’ and comes from the Latin word ‘nomen’. Data that are differentiated only by a naming system are called nominal data. Items of nominal data all fall in the same group and have something in common. An example would be a set of countries. Nominal data can also take numeric values, but these must not be confused with ordinal data, because the numbers are used only for reference and carry no order.
Ordinal Data
Ordinal data are similar to nominal data, except that the categories have an order. They represent discrete, ordered units by their position on a scale. Often numbers are assigned to items to show their relative position, which may indicate superiority, temporal position, etc. Since ordinal data shows only sequence, arithmetic operations cannot be performed on it. At times, letters or other sequential symbols may also be used.
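The following short Python sketch shows what working with order, but not arithmetic, can look like. The size scale and the order list are hypothetical examples, not taken from any real data set.

```python
# Hypothetical ordered scale: the position in this list defines the rank.
SIZE_ORDER = ["small", "medium", "large"]
rank = {label: position for position, label in enumerate(SIZE_ORDER)}

orders = ["large", "small", "medium", "small", "large"]

# Sorting by rank is meaningful because the categories are ordered.
sorted_orders = sorted(orders, key=lambda label: rank[label])

# The median (middle element of the sorted list) is well defined too,
# even though adding or averaging sizes would make no sense.
median_size = sorted_orders[len(sorted_orders) // 2]
```

Only comparisons between ranks are used here; the ranks are never added or averaged.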
Numerical Data
Numerical data consists of values that can be measured, and these values can be placed in ascending or descending order.
Intervals
Interval data are measured on a scale where neighboring points are an equal distance apart. Examples are the set of real numbers R or a level of happiness rated from 1 to 10. A range of values on such a scale is also called an interval. Open intervals are written with parentheses and don’t include their endpoints, for example (1,2), i.e. greater than 1 and less than 2. Closed intervals are written with square brackets and include their endpoints, for example [1,2], i.e. greater than or equal to 1 and less than or equal to 2.
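The difference between open and closed intervals can be sketched as two small membership checks in Python. The function names in_open_interval and in_closed_interval are invented for this illustration.

```python
def in_open_interval(x: float, low: float, high: float) -> bool:
    """True if x lies in (low, high): endpoints are excluded."""
    return low < x < high


def in_closed_interval(x: float, low: float, high: float) -> bool:
    """True if x lies in [low, high]: endpoints are included."""
    return low <= x <= high


# The endpoint 1 belongs to [1,2] but not to (1,2).
endpoint_in_open = in_open_interval(1, 1, 2)
endpoint_in_closed = in_closed_interval(1, 1, 2)
```

A value strictly between the endpoints, such as 1.5, belongs to both intervals.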
Ratio
In its most basic form, a ratio is the relationship between two numbers that specifies how many times the first number contains the second. Ratio data are like interval data in that equal differences between values mean the same thing everywhere on the scale; the difference is that ratio data has an absolute zero. Examples are length, weight, and height.
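Why the absolute zero matters can be seen in a small Python sketch. It compares lengths (ratio data) with Celsius temperatures, which are used here as an assumed example of an interval scale without an absolute zero; all numbers are made up.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Ratio data: "twice as long" holds in any unit because zero means "no length".
length_ratio_m = 4.0 / 2.0        # 4 m compared with 2 m
length_ratio_cm = 400.0 / 200.0   # the same lengths in centimeters

# Interval-only data: the zero point is arbitrary, so the ratio changes
# when the unit changes and "twice as hot" is not a meaningful statement.
temp_ratio_c = 20.0 / 10.0
temp_ratio_f = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)
```

The length ratio is 2 in both units, while the temperature ratio differs between Celsius and Fahrenheit.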
Record
Simply put, a record is an information asset that we keep. Some records are kept for their business value, some because an authority requires it, and some until a disposition is applied. This information can come in any form and media type.
In computer data processing, a record is a collection of fields, possibly of different data types, relating to a single individual or item. For example, a circle record might contain a radius and a center, where the center is itself a point record containing x and y coordinates.
In the context of a database, a record is sometimes called a row: a group of fields within a table that describe one specific entity. Several data records make up a data file, and several data files make up a database. For example, a row in a table of customers' contact information would contain fields such as serial number, name, city, pin code, and contact number.
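The circle example can be sketched with Python dataclasses, one possible way to model records. The class names Point and Circle and the variable unit_circle are chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Point:
    """A point record with two numeric fields."""
    x: float
    y: float


@dataclass
class Circle:
    """A circle record whose center field is itself a Point record."""
    center: Point
    radius: float


# One record instance: a circle of radius 1 centered at the origin.
unit_circle = Circle(center=Point(x=0.0, y=0.0), radius=1.0)
```

Fields are then read by name, for example unit_circle.center.x or unit_circle.radius.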
Data Set
A data set is a collection of data, whether related or discrete items, that may be accessed individually, in combination, or as a whole.
The items in a data set usually share almost the same physical structure; often a data set corresponds to the contents of a single database table. An example would be a database containing a data set of business information such as salaries, sales figures, and profit. Remember that a database itself can also be considered a data set.
Structured Data
When you hear the term structured data, think of a spreadsheet. It’s easy for us to understand, and for computers too. Structured data consists of predefined fields, as in a database or spreadsheet. The data are organized in a format suited to effective processing and analysis. Data stored in rows and columns, as in a spreadsheet, is an example of structured data.
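Because every row has the same predefined fields, a program can pick out values by field name. The Python sketch below reads a tiny, made-up customer table with the standard csv module; the column names and rows are hypothetical.

```python
import csv
import io

# A tiny, made-up table in comma-separated form: one header row, three records.
raw_table = """serial_number,name,city
1,Asha,London
2,Ben,Paris
3,Chloe,London
"""

# Each row becomes a dictionary keyed by the column names in the header.
rows = list(csv.DictReader(io.StringIO(raw_table)))

# Predefined fields make selection straightforward.
london_names = [row["name"] for row in rows if row["city"] == "London"]
```

Note that csv reads every value as text, so serial_number is the string "1" until it is converted.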
Unstructured Data
Think of it as a text document. For a computer, a spreadsheet is easier to understand than a text document. The computer sees a string of letters rather than grasping what a word means. If a computer sees the letters T-I-G-E-R, it doesn’t know that they spell tiger or that the word refers to an animal.
One application of machine learning is to take this unstructured data and turn it into structured information that computers are able to work with.
Aha! This document talks about Tiger – an animal.
Other types of unstructured data include images, videos, and audio files.
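As a very rough illustration of turning unstructured text into structured information, the Python sketch below counts words in a made-up sentence. This is a plain word count, not machine learning, and it does not tell the computer that a tiger is an animal; it only shows free text becoming a table of words and counts.

```python
import re
from collections import Counter

# Hypothetical unstructured input: free text.
document = "The tiger is a large cat. A tiger hunts alone."

# Split the lowercased text into runs of letters.
words = re.findall(r"[a-z]+", document.lower())

# Structured result: a mapping from each word to how often it appears.
word_counts = Counter(words)
```

The resulting counts can be stored in rows and columns like any other structured data.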
Data Exploration
Data exploration is the first step in data analysis. It involves collecting information from various sources, which often hold unstructured data, in order to find the characteristics worth a more focused analysis. Visual analytics tools or advanced statistical software might be used for this purpose.
Data Mining
Data mining refers to finding patterns in large data sets. It is a subfield of computer science that aims to extract information from a data set and transform it into a comprehensible structure for further use. The patterns found may feed further analysis, for example machine learning for predictive analytics. The data mining task may be semi-automatic or fully automatic, depending upon the amount of data.
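One simple kind of pattern is a pair of items that often appear together. The Python sketch below counts such pairs in a handful of made-up shopping baskets; real data mining works on far larger data sets and uses more elaborate methods, so treat this only as a toy illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in transactions:
    # Sorting gives every pair one canonical order, e.g. ("bread", "milk").
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The pair that occurs together most often, with its count.
most_common_pair, support = pair_counts.most_common(1)[0]
```

In this toy data, bread and milk appear together in three of the four baskets.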
Descriptive Analytics
In simple words, descriptive analytics provides information about what happened in the past. It’s a preliminary stage of data processing that summarizes past data to extract useful information and prepare the data for further analytics.
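Summarizing past data can be as simple as a few statistics. The Python sketch below uses the standard statistics module on made-up monthly sales figures; the numbers and the name monthly_sales are hypothetical.

```python
import statistics

# Hypothetical sales figures for six past months.
monthly_sales = [120, 130, 150, 110, 160, 140]

# A small summary of what happened.
summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "lowest": min(monthly_sales),
    "highest": max(monthly_sales),
}
```

The summary describes the past months only; it makes no claim about future ones.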
Predictive Analytics
Predictive analytics is used to identify probabilities and trends based on historical or current data in order to provide information about what might happen in the future.
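A minimal sketch of a trend-based prediction is shown below in Python: it fits a straight line to made-up past sales with ordinary least squares and extends it one month ahead. The function name fit_line and the data are invented here, and a straight line is only one of many possible models.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales in months 1 to 5.
months = [1, 2, 3, 4, 5]
sales = [100, 110, 120, 130, 140]

slope, intercept = fit_line(months, sales)

# Extend the trend to month 6.
forecast_month_6 = slope * 6 + intercept
```

The forecast is only as good as the assumption that the past trend continues.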