Classification is a machine learning technique that learns from a historical
dataset whose records are already categorized, and uses it to predict the
category of a new observation from the given inputs. Two types of data analysis
are used to predict future data trends: classification and prediction. Here we
will use these techniques to classify various fruits and find the model that
predicts them with the best accuracy. Some examples of classification
applications are mail checking (spam or not), credit card fraud detection,
speech recognition, and biometric identification.
Steps to be followed
1. Understand the dataset
2. Methods of showing data (Visualization)
3. Create a train and test set to generate accuracy.
1. Understanding the data
The fruits dataset is a multivariate dataset introduced by Mr. Iain Murray from Edinburgh University. It contains measurements of dozens of pieces of fruit such as apples, oranges, and lemons.
1.1 Shape of data
Let’s look at how many instances we
have in the dataset.
- import pandas as pd
- import matplotlib.pyplot as plt
- import seaborn as sns
- fruit=pd.read_csv('fruit.csv')
-
- print(fruit.shape)
Here, the dataset contains 59 pieces of fruit (rows) and seven columns.
1.2 Types of fruits and count
-
- print(fruit.groupby('fruit_name').size())
- # Name the column explicitly; recent seaborn versions treat the first positional argument as data
- sns.countplot(x='fruit_name', data=fruit)
Graphical representation of fruit counts
1.3 Count of data features
In the data frame, each row represents
one piece of fruit, which is measured by four features: mass, width, height, and color_score.
1.4 Statistical distribution
The numerical features of the fruits
can be summarized by the mean, median, and percentiles. If the features are
not on the same scale, we need to apply a scaling technique before modeling.
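As a minimal sketch of this step, `DataFrame.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column. The small data frame below is made up for illustration (its values are an assumption, not rows taken from fruit.csv); with the real data you would call `fruit.describe()` directly.

```python
import pandas as pd

# Hypothetical sample rows, NOT taken from fruit.csv; they only mimic its columns.
sample = pd.DataFrame({
    'mass': [192, 180, 86, 84, 160],
    'width': [8.4, 8.0, 6.2, 6.0, 7.5],
    'height': [7.3, 6.8, 4.7, 4.6, 7.5],
    'color_score': [0.55, 0.59, 0.80, 0.79, 0.81],
})

# Count, mean, std, min, quartiles (the 50% row is the median), and max per column.
summary = sample.describe()
print(summary)

# Range (max - min) of each feature; very different ranges suggest scaling is needed.
ranges = summary.loc['max'] - summary.loc['min']
print(ranges)
```

In this made-up sample, mass spans a range of about a hundred while color_score stays below 1, which is the kind of mismatch that motivates scaling.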
2. Methods of showing data (Visualization)
Here, we will apply two types of
visualization techniques to determine the distribution of variables and their
correlations.
2.1 Boxplot
A boxplot shows the distribution of each feature for every type of fruit.
-
- plt.figure(figsize=(15,10))
- plt.subplot(2,2,1)
- sns.boxplot(x='fruit_name',y='mass',data=fruit)
- plt.subplot(2,2,2)
- sns.boxplot(x='fruit_name',y='width',data=fruit)
- plt.subplot(2,2,3)
- sns.boxplot(x='fruit_name',y='height',data=fruit)
- plt.subplot(2,2,4)
- sns.boxplot(x='fruit_name',y='color_score',data=fruit)
2.2 Pair plot – scatter matrix
Each type of fruit is plotted in a
different color, which makes it easier to tell the classes apart and to see
the correlations between the features.
-
- sns.pairplot(fruit,hue='fruit_name')
3. Create a train and test set to generate accuracy
3.1 Split dataset
Now we will split the data frame
into two parts, a training set and a test set, and then scale the features
with a scaler that is fitted on the training set only.
- feature_names = ['mass', 'width', 'height', 'color_score']
- X = fruit[feature_names]
- y = fruit['fruit_label']
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- from sklearn.preprocessing import MinMaxScaler
- scaler = MinMaxScaler()
- X_train = scaler.fit_transform(X_train)
- X_test = scaler.transform(X_test)
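To make the scaling step concrete, the sketch below applies `MinMaxScaler` to a tiny made-up array and checks the result against the formula (x - min) / (max - min), where min and max come from the training data. The numbers are illustrative assumptions, not values from the fruit dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for two features (an assumption for illustration).
train = np.array([[100.0, 6.0], [150.0, 7.0], [200.0, 8.0]])
test = np.array([[125.0, 9.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from the training set only
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same transformation computed by hand.
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual = (test - col_min) / (col_max - col_min)

print(train_scaled)
print(test_scaled)
print(manual)
```

Because the scaler is fitted on the training set only, a test value that lies outside the training range (9.0 in the second column here) is mapped outside the interval [0, 1].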
3.2 Machine learning (Modeling)
Now it is time to find the best-suited algorithm, the one that
gives the highest accuracy. To do so, we will try some
frequently used algorithms for modeling the dataset.
3.2.1 Decision tree
-
- from sklearn.tree import DecisionTreeClassifier
- clf = DecisionTreeClassifier().fit(X_train, y_train)
- print('DecisionTreeClassifier:')
- print('Accuracy of training set: {:.2f}'
- .format(clf.score(X_train, y_train)))
- print('Accuracy of test set: {:.2f}'
- .format(clf.score(X_test, y_test)))
3.2.2 Logistic regression
-
- from sklearn.linear_model import LogisticRegression
- logreg = LogisticRegression()
- logreg.fit(X_train, y_train)
- print('LogisticRegression:')
- print('Accuracy of training set: {:.2f}'
- .format(logreg.score(X_train, y_train)))
- print('Accuracy of test set: {:.2f}'
- .format(logreg.score(X_test, y_test)))
3.2.3 K-nearest neighbor
-
- from sklearn.neighbors import KNeighborsClassifier
- knn = KNeighborsClassifier()
- knn.fit(X_train, y_train)
- print('KNeighborsClassifier:')
- print('Accuracy of training set: {:.2f}'
- .format(knn.score(X_train, y_train)))
- print('Accuracy of test set: {:.2f}'
- .format(knn.score(X_test, y_test)))
3.2.4 Gaussian Naive Bayes
-
- from sklearn.naive_bayes import GaussianNB
- gnb = GaussianNB()
- gnb.fit(X_train, y_train)
- print('GaussianNB:')
- print('Accuracy of training set: {:.2f}'
- .format(gnb.score(X_train, y_train)))
- print('Accuracy of test set: {:.2f}'
- .format(gnb.score(X_test, y_test)))
3.2.5 Support vector machine
-
- from sklearn.svm import SVC
- svm = SVC()
- svm.fit(X_train, y_train)
- print('Support vector machine:')
- print('Accuracy of training set: {:.2f}'
- .format(svm.score(X_train, y_train)))
- print('Accuracy of test set: {:.2f}'
- .format(svm.score(X_test, y_test)))
After modeling, we can see that
the most accurate model is K-nearest neighbor, which gives the highest accuracy
score on the test set.
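The five models above can also be compared in a single loop, which keeps the train and test accuracies side by side. This is only a sketch under stated assumptions: scikit-learn's built-in iris data stands in for fruit.csv so that the block runs on its own, `max_iter=1000` and `random_state=0` are added for stable runs, and the names `models`, `scores`, and `best` are illustrative. The ranking it prints for iris says nothing about the ranking on the fruit data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces fruit.csv so the sketch is self-contained.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same preprocessing as in section 3.1: fit on the training set, reuse on the test set.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

models = {
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Logistic regression': LogisticRegression(max_iter=1000),
    'K-nearest neighbor': KNeighborsClassifier(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Support vector machine': SVC(),
}

# Fit every model and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print('{}: train {:.2f}, test {:.2f}'.format(name, *scores[name]))

# Pick the model with the highest test accuracy.
best = max(scores, key=lambda name: scores[name][1])
print('Best test accuracy:', best)
```

With the fruit data, replacing the stand-in `X` and `y` with `fruit[feature_names]` and `fruit['fruit_label']` would reproduce the comparison from the sections above.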
3.3 Prediction
Now that we have the most accurate model, we run the KNN model directly on the
test set to obtain its final accuracy, confusion matrix, and classification report.
-
-
- from sklearn.metrics import classification_report
- from sklearn.metrics import confusion_matrix
- from sklearn.metrics import accuracy_score
- pred = knn.predict(X_test)
- print(accuracy_score(y_test, pred))
- print(confusion_matrix(y_test, pred))
- print(classification_report(y_test, pred))
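To show how the accuracy and the confusion matrix relate, the sketch below uses a handful of made-up labels (the class codes 1, 2, 3 and the predictions are assumptions, not output from the fruit model). Rows of the confusion matrix are the true classes, columns are the predicted classes, and the accuracy is the sum of the diagonal divided by the total number of samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels; 1, 2, 3 are arbitrary class codes.
y_true = [1, 1, 2, 3, 3, 3]
y_pred = [1, 3, 2, 3, 3, 1]

# Rows are true classes, columns are predicted classes (in sorted label order).
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct predictions sit on the diagonal, so accuracy = trace / total.
accuracy_from_cm = np.trace(cm) / cm.sum()
print(accuracy_from_cm)
print(accuracy_score(y_true, y_pred))
```

Both printed accuracies are equal, since `accuracy_score` counts exactly the predictions that land on the diagonal.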
Summary
In this article, we compared several classifiers on the fruit dataset and found that K-nearest neighbor gave the best accuracy. I hope you found
it easy to follow.