Simple classification techniques on a fruit dataset

Elavarasan R

Classification

Classification is a machine learning technique that learns from a historical, labeled dataset and assigns new observations to categories based on the given inputs. Classification and prediction are two forms of data analysis used to predict future data trends. Here we will use several classification techniques to classify various fruits and compare the accuracy of the models. Some examples of classification applications are mail checking (spam or not), credit card fraud detection, speech recognition, and biometric identification.

Steps to be followed

1. Understand the dataset

2. Methods of showing data (Visualization)

3. Create a train and test set to generate accuracy

1. Understanding the data

The fruits dataset is a multivariate dataset introduced by Iain Murray from the University of Edinburgh. It contains measurements of a few dozen pieces of fruit, such as apples, oranges, and lemons.

1.1 Shape of data

Let’s look at how many instances the dataset has.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
fruit=pd.read_csv('fruit.csv')
#fruits shape
print(fruit.shape)

Here, the dataset contains 59 pieces of fruit (rows) and seven columns.

1.2 Types of fruits and count

#types of fruits
print(fruit.groupby('fruit_name').size())
#bar chart of the number of rows per fruit type
sns.countplot(x='fruit_name', data=fruit)
plt.show()

Graphical representation of fruit counts

1.3 Data features

In the data frame, each row represents one piece of fruit, which is measured by four numeric features: mass, width, height and color_score.

#preview data
print(fruit.head(15))

1.4 Statistical distribution

The numerical columns can be summarized by the mean, median and percentiles. If the features are not on the same scale, we need to apply a scaling technique before modeling.

#Description of data
print(fruit.describe())

2. Methods of showing data (Visualization)

Here, we will apply two types of visualization techniques to determine the distribution of variables and their correlations.

2.1 Boxplot

A boxplot shows the distribution of each feature for each type of fruit.

#Boxplot
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.boxplot(x='fruit_name',y='mass',data=fruit)
plt.subplot(2,2,2)
sns.boxplot(x='fruit_name',y='width',data=fruit)
plt.subplot(2,2,3)
sns.boxplot(x='fruit_name',y='height',data=fruit)
plt.subplot(2,2,4)
sns.boxplot(x='fruit_name',y='color_score',data=fruit)

2.2 Pair plot – scatter matrix

Each type of fruit is plotted in a different color, which makes it easier to see how the classes separate and how the features correlate with each other.

#pairplot
sns.pairplot(fruit,hue='fruit_name')

3. Create a train and test set to generate accuracy

3.1 Split dataset

Now, we will split the data frame into two parts: a training set and a test set. The features are then scaled to the same range.

feature_names = ['mass', 'width', 'height', 'color_score']
X = fruit[feature_names]
y = fruit['fruit_label']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
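
To make the scaling step concrete, here is a minimal sketch of what min-max scaling does, written in plain Python. The numbers are made up for illustration and are not taken from the fruit dataset, and the helper names fit_min_max and apply_min_max are hypothetical. The point it tries to show is that the minimum and maximum are learned from the training set only and then reused for the test set, which is what fit_transform followed by transform does above.

```python
# Min-max scaling by hand: x_scaled = (x - min) / (max - min).
# The values below are made-up masses in grams, not rows from fruit.csv.

def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale values using a previously learned minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

train_mass = [80.0, 120.0, 160.0, 200.0]
test_mass = [100.0, 220.0]

lo, hi = fit_min_max(train_mass)                  # learned from training data only
train_scaled = apply_min_max(train_mass, lo, hi)
test_scaled = apply_min_max(test_mass, lo, hi)    # reuses the training min and max

print(train_scaled)
print(test_scaled)
```

Note that a test value larger than the training maximum is scaled to a value above 1. This is expected, because the test set is not used when the scaler is fitted.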

3.2 Machine learning (Modeling)

Now it is time to find the best-suited algorithm, that is, the one that gives the highest accuracy. To do so, we will try some frequently used algorithms for modeling the dataset.

3.2.1 Decision tree

# DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('DecisionTreeClassifier:')
print('Accuracy of training set: {:.2f}'
.format(clf.score(X_train, y_train)))
print('Accuracy of test set: {:.2f}'
.format(clf.score(X_test, y_test)))

3.2.2 Logistic regression

#LogisticRegression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('LogisticRegression:')
print('Accuracy of training set: {:.2f}'
.format(logreg.score(X_train, y_train)))
print('Accuracy of test set: {:.2f}'
.format(logreg.score(X_test, y_test)))

3.2.3 K-nearest neighbor

#KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print('KNeighborsClassifier:')
print('Accuracy of training set: {:.2f}'
.format(knn.score(X_train, y_train)))
print('Accuracy of test set: {:.2f}'
.format(knn.score(X_test, y_test)))
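
As a rough illustration of how K-nearest neighbor decides on a label, the sketch below classifies a new point by a majority vote among its k closest training points, using Euclidean distance. It is a simplified stand-in, not scikit-learn's implementation; the function knn_predict and the toy points are hypothetical. KNeighborsClassifier() uses five neighbors by default, while the sketch uses k=3 to keep the example small.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Distance from the query to every training point, paired with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Most common label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: made-up (scaled width, scaled height) pairs with made-up labels
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
labels = ['lemon', 'lemon', 'lemon', 'apple', 'apple', 'apple']

print(knn_predict(points, labels, (0.2, 0.2)))
print(knn_predict(points, labels, (0.9, 0.9)))
```

Because the vote depends on distances, features on very different scales would let one feature dominate, which is one reason the data is scaled before fitting.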

3.2.4 Gaussian Naive Bayes

#GaussianNB
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('GaussianNB:')
print('Accuracy of training set: {:.2f}'
.format(gnb.score(X_train, y_train)))
print('Accuracy of test set: {:.2f}'
.format(gnb.score(X_test, y_test)))

3.2.5 Support vector machine

#SVC
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print('Support vector machine:')
print('Accuracy of training set: {:.2f}'
.format(svm.score(X_train, y_train)))
print('Accuracy of test set: {:.2f}'
.format(svm.score(X_test, y_test)))

After modeling, we find that K-nearest neighbor is the best model here, as it provides the highest accuracy score.

3.3 Prediction

Now we have the model with the best accuracy. We run the KNN model on the test set to obtain its final accuracy, together with the confusion matrix and the classification report.
#prediction
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
pred = knn.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
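
To clarify what these metrics report, the sketch below computes the accuracy and a confusion matrix by hand for a handful of made-up labels (they are not results from the fruit dataset). Rows correspond to the true class and columns to the predicted class, which is the same layout that confusion_matrix returns. The variable names are hypothetical.

```python
# Made-up true and predicted labels for illustration only
y_true = [1, 1, 2, 2, 3, 3, 3, 4]
y_pred = [1, 2, 2, 2, 3, 3, 1, 4]

# Accuracy: fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows are true classes, columns are predicted classes
classes = sorted(set(y_true) | set(y_pred))
index = {c: i for i, c in enumerate(classes)}
matrix = [[0] * len(classes) for _ in classes]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

print(accuracy)
for row in matrix:
    print(row)
```

The diagonal of the matrix counts the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of samples.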

Summary

In this article, we explored the fruit dataset, compared several classification models on it, and selected the one with the best accuracy. I hope you have understood it well.