Introduction to Classification Algorithm [Types]

What is a Classification Algorithm?

Classification is the task of working out which target class an observation belongs to. In the analysis phase, the first step is to know what the target classes are.

An algorithm is a procedure or formula for solving a problem in mathematics or computer science by carrying out a sequence of specified actions. A computer program can be viewed as a detailed algorithm, and algorithms are used in almost every area of information technology. For example, a search engine algorithm takes a search string of keywords as input, matches it against the relevant web pages, and returns the results. An encryption algorithm such as the U.S. government's Data Encryption Standard (DES) uses a secret key to protect data from being read by unauthorized parties, because leaked national information can put a country in danger. As long as the algorithm is sufficiently sophisticated, no one lacking the key can decrypt the secured data.

Some Examples of Target Classes

  • Analyzing buyer data to predict whether a customer will buy computer accessories (target class: yes or no)
  • Grouping and differentiating fruits based on their colour, taste, size, and weight (target class: apple, mango, litchi, cherry, papaya, orange, melon, and tomato)
  • Differentiating gender from hair length (target class: male or female)

Now, let us understand the concept of a classification algorithm by differentiating people by gender based on their hair length (by no means am I trying to stereotype by gender; this is only for the sake of example). To do this, we need a suitable boundary value for hair length. Suppose the decision boundary is a hair length of 25.0 cm; then we say that if the hair length is less than that, the gender could be male, and otherwise female.
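The hair-length rule can be written as a one-line classifier. This is only a minimal sketch: the function name is hypothetical, and it assumes that lengths below the 25.0 cm boundary are labelled male and everything else female.

```python
# Decision boundary taken from the example in the text.
DECISION_BOUNDARY_CM = 25.0


def classify_by_hair_length(hair_length_cm, boundary_cm=DECISION_BOUNDARY_CM):
    """Return the predicted target class for a single hair length.

    Assumption for illustration: lengths below the boundary map to
    "male", and all other lengths map to "female".
    """
    return "male" if hair_length_cm < boundary_cm else "female"


# Classify a few made-up observations.
for length in (8.0, 25.0, 40.0):
    print(length, "->", classify_by_hair_length(length))
```

A real classifier learns the boundary from labelled data instead of having it fixed by hand, but the prediction step looks much like this.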

Dataset Sources and Content

The dataset contains salary data. The following describes our dataset:

Number of classes: 2 (">50K" and "<=50K")
Number of attributes (columns): 7
Number of instances (rows): 48,842

This data was taken from the Census Bureau database.

Explanation

Two salary classes are taken into account: the first is greater than 50K, and the second is less than or equal to 50K. With 7 attributes (columns) and 48,842 instances (rows) drawn from the Census Bureau database, a classifier can assign each person to one of the two salary groups based on those 7 attributes. This avoids a great deal of manual calculation and labour.

Applications of Classification Algorithms

  • Email spam classification
  • Predicting bank customers' willingness to repay loans
  • Identifying cancerous tumour cells
  • Sentiment analysis
  • Drug classification
  • Facial keypoint detection
  • Pedestrian detection in automotive driving

Types of Classification Algorithms

Classification algorithms can be broadly grouped as follows:

  • Linear classifiers
      • Logistic regression
      • Naïve Bayes classifier
      • Fisher's linear discriminant
  • Support vector machines
      • Least squares support vector machines
  • Quadratic classifiers
  • Kernel estimation
      • K-nearest neighbour
  • Decision trees
      • Random forests
  • Neural networks
  • Learning vector quantization

Explanation of Some of the Important Types of Classification Algorithms

1. Logistic Regression

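Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It estimates the probability that an observation belongs to a class by passing a weighted sum of the input variables through the logistic (sigmoid) function.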
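To see how a logistic model produces a class, here is a small sketch of the prediction step only. The weight and bias are made-up values for the hair-length example, not fitted coefficients, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 class label."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical coefficients for a single feature (hair length in cm).
weights = [0.2]
bias = -5.0

for hair_length in (10.0, 40.0):
    probability = predict_proba(weights, bias, [hair_length])
    label = predict_class(weights, bias, [hair_length])
    print(hair_length, round(probability, 3), label)
```

Fitting the model means choosing the weights and bias so that these probabilities match the training labels as closely as possible.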

R-code

x <- cbind(x_train, y_train)
# train the model using the training set and check the summary
logistic <- glm(y_train ~ ., data = x, family = "binomial")
summary(logistic)
# predict output (type = "response" returns probabilities)
predicted <- predict(logistic, x_test, type = "response")

Several steps can help us improve the model:

  • Include interaction terms
  • Remove features
  • Apply regularization techniques
  • Use a non-linear model

Advantage

  • It is designed for classification and is most useful to understand the influence of some independent variables on a single outcome variable.

Disadvantages

  • In its basic (binomial) form, it works only when the predicted variable is binary.

2. Decision Trees

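A decision tree is a supervised learning algorithm that is mostly used for classification problems. It repeatedly splits the data into smaller, more homogeneous groups based on the values of the input attributes.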
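The splitting idea behind a decision tree can be sketched with the Gini impurity, one common measure of how mixed a group of labels is. The data and function names below are made up for illustration, and the sketch only searches for a single split on one numeric feature, whereas a full tree repeats the search recursively.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best single split."""
    best_threshold, best_impurity = None, float("inf")
    candidates = sorted(set(values))
    # Try the midpoint between each pair of neighbouring values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Made-up hair lengths (cm) and labels.
hair_lengths = [5, 10, 20, 30, 40, 50]
genders = ["male", "male", "male", "female", "female", "female"]

threshold, impurity = best_split(hair_lengths, genders)
print(threshold, impurity)
```

On this toy data the best threshold separates the two classes perfectly, which is why the impurity of the split drops to zero.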

R-code

library(rpart)
x <- cbind(x_train, y_train)
# grow tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# predict output (type = "class" returns class labels)
predicted <- predict(fit, x_test, type = "class")

Advantages

  • A decision tree is simple to understand and visualize, requires little data preparation, and can handle both numerical and categorical data.

Disadvantages

  • It can create overly complex trees that do not generalize well to new data (overfitting).

3. Naive Bayes Classifier

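It is a classification technique based on Bayes' theorem, with the assumption of independence between predictors.

It helps us to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = (P(x|c) P(c)) / P(x)

Here,

  • P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
  • P(c) is the prior probability of the class.
  • P(x|c) is the likelihood, that is, the probability of the predictor given the class.
  • P(x) is the prior probability of the predictor.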

Example:

Now we will classify, on the basis of the weather, whether the players will play or not.

  • Step 1: First, convert the data set into a frequency table.
  • Step 2: Next, create a likelihood table by finding the probabilities, for example the probability of overcast weather = 0.29 and the probability of playing = 0.64.
  • Step 3: Finally, calculate the posterior probability for each class using the naïve Bayes equation. The class with the highest posterior probability is the prediction.

Example:

A golf player will play if the weather is sunny. Is this statement correct?

We can answer this by solving

P(yes|sunny) = P(sunny|yes) * P(yes) / P(sunny)

Here,

P(sunny|yes) = 3/9 = 0.33,
P(sunny) = 5/14 = 0.36,
P(yes) = 9/14 = 0.64.

So P(yes|sunny) = (3/9 * 9/14) / (5/14) = 3/5 = 0.60

Since 0.60 is greater than 0.5, "yes" is the more probable class, so the statement is likely correct.
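The same calculation can be checked in a few lines of code. This sketch uses only the counts quoted in the example (3 sunny days with play, 9 days with play, 5 sunny days, 14 days in total); the variable names are illustrative.

```python
from fractions import Fraction

# Counts taken from the worked example above.
total_days = 14
yes_days = 9
sunny_days = 5
sunny_and_yes_days = 3

p_yes = Fraction(yes_days, total_days)                        # P(yes)
p_sunny = Fraction(sunny_days, total_days)                    # P(sunny)
p_sunny_given_yes = Fraction(sunny_and_yes_days, yes_days)    # P(sunny|yes)

# Bayes' theorem: P(yes|sunny) = P(sunny|yes) * P(yes) / P(sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(float(p_yes_given_sunny))  # 0.6
print(float(p_no_given_sunny))   # 0.4
```

Working with exact fractions avoids the small drift that appears when the rounded values 0.33, 0.64, and 0.36 are multiplied directly.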

R-code

library(e1071)
x <- cbind(x_train, y_train)
# fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# predict output
predicted <- predict(fit, x_test)

Advantages

  • This type of algorithm needs only a small amount of training data to estimate the required parameters.
  • This method is extremely fast compared to more sophisticated methods.

Disadvantages

  • Its probability estimates are often poor, because the independence assumption rarely holds in practice.

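4. SVM (Support Vector Machine)

An SVM separates groups by treating each observation as a point in a space with one dimension per feature. For example, if we only had two features, such as the height and hair length of an individual, we would first plot the data in a two-dimensional space where each point has two coordinates. The algorithm then finds the line (in general, a hyperplane) that best separates the classes; the points closest to this boundary are known as support vectors.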
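The geometry can be illustrated with a linear decision function of the form w · x + b. The sketch below does not train an SVM: the weights, bias, and data points are all made up, and it only shows how a given separating line classifies points and which point lies closest to it.

```python
import math


def decision_value(weights, bias, point):
    """Signed score w . x + b; its sign gives the predicted side."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def predict(weights, bias, point):
    """Return +1 or -1 depending on the side of the line."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1


def distance_to_boundary(weights, bias, point):
    """Perpendicular distance from a point to the separating line."""
    norm = math.sqrt(sum(w * w for w in weights))
    return abs(decision_value(weights, bias, point)) / norm


# Made-up (height cm, hair length cm) points with labels -1 and +1.
points = [(180, 5), (175, 10), (170, 20), (160, 30), (165, 40), (155, 50)]
labels = [-1, -1, -1, 1, 1, 1]

# Hypothetical separating line, chosen by hand for illustration.
weights = (-0.5, 1.0)
bias = 60.0

predictions = [predict(weights, bias, p) for p in points]
closest = min(points, key=lambda p: distance_to_boundary(weights, bias, p))
print(predictions)
print(closest)
```

A trained SVM chooses the weights and bias so that the distance from the boundary to the closest points of each class is as large as possible.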

R-code

library(e1071)
x <- cbind(x_train, y_train)
# fitting model (y_train should be a factor for classification)
fit <- svm(y_train ~ ., data = x)
summary(fit)
# predict output
predicted <- predict(fit, x_test)

Advantages

  • It is effective in high-dimensional spaces and uses only a subset of the training points (the support vectors) in the decision function, so it is also memory efficient.

Disadvantages

  • It gives out complex outcomes that may be difficult to understand and analyze.
  • It will separate the data into whatever groups the user specifies, even if those groups are not relevant.

5. Stochastic Gradient Descent

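Stochastic gradient descent is a simple and efficient way to fit linear models. It is particularly useful when the sample size is large, because it updates the model using one sample (or a small batch) at a time instead of the whole data set.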
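The update rule itself is short. The following is a minimal sketch of stochastic gradient descent for a one-feature model with the logistic loss; the data, learning rate, and function names are assumptions made for illustration, and library implementations add many refinements.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(samples, labels, learning_rate=0.1, epochs=50, seed=101):
    """Fit weight and bias by updating on one sample at a time."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order
        for i in order:
            # Prediction error for this single sample.
            error = labels[i] - sigmoid(weight * samples[i] + bias)
            # Step along the gradient of the logistic loss.
            weight += learning_rate * error * samples[i]
            bias += learning_rate * error
    return weight, bias


# Made-up, centred feature values with binary labels.
samples = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

weight, bias = sgd_logistic(samples, labels)
predictions = [1 if weight * x + bias >= 0 else 0 for x in samples]
print(predictions)
```

Because each update touches only one sample, the cost per step does not grow with the size of the data set, which is what makes the method attractive for large samples.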

Python code

from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier(loss="modified_huber", shuffle=True, random_state=101)
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_test)

Advantages

  • Efficiency and ease of implementation.

Disadvantages

  • It requires several hyper-parameters, and it is sensitive to feature scaling.

Conclusion

Classification algorithms help in the effective analysis of buyer data, for example to predict whether a customer will buy computer accessories. They also help in grouping items and differentiating inputs from one another, which saves a great deal of time and effort. As a result, analysis becomes easier, and classification helps speed up the decision-making process, which is vital for sustaining and growing a business in a highly competitive world.

Simran Kaur Arora

Simran works at Hackr as a technical writer. The graduate in MS Computer Science from the well known CS hub, aka Silicon Valley, is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Traveling, sketching, and gardening are the hobbies that interest her.