Archive January 2020

Random Forest-Theory

Random forest is a supervised learning algorithm. As its name suggests, it builds a forest out of many decision trees. Each tree makes its own prediction, and the forest returns the prediction chosen by the majority of the trees.

The method of combining trees is known as an ensemble method. Ensembling is nothing but a combination of weak learners (individual trees) to produce a strong learner.

Let’s understand ensembling with an example. Suppose you want to watch a movie but are unsure about its reviews, so you ask 10 people who have already watched it. 8 of them say the movie is fantastic and 2 say it is not good. Since the majority is in favour, you decide to watch the movie. This is how we use ensemble techniques in our daily life too.
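The movie example can be written as a small majority vote. This is only a sketch of the idea: the `majority_vote` helper and the `reviews` list are made up for illustration.

```python
from collections import Counter

def majority_vote(opinions):
    """Return the most common opinion and the number of votes it received."""
    label, count = Counter(opinions).most_common(1)[0]
    return label, count

# Ten hypothetical reviewers: 8 liked the movie, 2 did not.
reviews = ["fantastic"] * 8 + ["not good"] * 2

decision, votes = majority_vote(reviews)
print(decision, votes)  # fantastic 8
```

A random forest classifier does the same thing, with decision trees in place of reviewers.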

Random Forest can be used to solve regression and classification problems. In regression problems, the dependent variable is continuous. In classification problems, the dependent variable is categorical.

Advantages and Disadvantages of Random Forest

Advantages are as follows:

  1. It is used to solve both regression and classification problems.
  2. It can also be used to solve unsupervised ML problems.
  3. It can handle thousands of input variables without variable selection.
  4. It can be used as a feature selection tool using its variable importance plot.
  5. It takes care of missing data internally in an effective manner.
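As a rough illustration of the feature selection point, scikit-learn exposes a fitted forest's variable importances through the `feature_importances_` attribute. The synthetic dataset below is an assumption made only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 8 features, only 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one value per feature, normalised to sum to 1.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Features at the bottom of the ranking are candidates for removal, and the same values can be drawn as a bar chart to get the variable importance plot.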

Disadvantages are as follows:

  1. It is a black-box model, so a Random Forest model is difficult to interpret.
  2. It can take a long time to compute a large number of trees.

How does Random Forest work?

The algorithm can be divided into two stages.

  • Random forest creation.
  • Perform prediction from the created random forest classifier.

Random forest creation:

To create a random forest, we follow these steps:

  1. Randomly select “k” features from total “m” features, where k << m.
  2. Among the “k” features, calculate the node “d” using the best split point.
  3. Split the node into child nodes using the best split.
  4. Repeat steps 1 to 3 until “L” nodes have been reached.
  5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
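The creation steps can be sketched with scikit-learn's `DecisionTreeClassifier`, which already picks the best split among `max_features` randomly chosen features at each node. The `build_forest` helper, its defaults, and the synthetic data are assumptions for illustration. Two things differ from the steps as written: each tree is trained on a bootstrap sample of the rows, and the node budget “L” is approximated by `max_leaf_nodes`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, k=3, max_leaf_nodes=8, seed=0):
    """Grow n_trees trees, each choosing splits among k random features."""
    rng = np.random.default_rng(seed)
    forest = []
    for i in range(n_trees):  # step 5: repeat to get n trees
        # Assumption: each tree sees a bootstrap sample of the rows.
        rows = rng.integers(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(
            max_features=k,                 # steps 1-3: best split among k random features
            max_leaf_nodes=max_leaf_nodes,  # step 4: stop once the node budget is used
            random_state=seed + i,
        )
        forest.append(tree.fit(X[rows], y[rows]))
    return forest

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = build_forest(X, y)
print(len(forest), "trees built")
```

In practice `RandomForestClassifier` does all of this internally. The loop is only meant to make the individual steps visible.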

Perform prediction from the created random forest classifier

To perform prediction, we take the following steps:

  1. Take the test features, use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome (target).
  2. Count the votes for each predicted target.
  3. Take the predicted target with the most votes as the final prediction of the random forest.
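The three prediction steps can be reproduced by hand from a fitted scikit-learn forest, whose individual trees are available in `estimators_`. This is a sketch on synthetic binary data, which is an assumption. Note that scikit-learn's own `predict` averages the trees' class probabilities rather than counting hard votes, so its output can differ slightly from the vote count below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)

# Step 1: every tree predicts an outcome for every test row -> shape (25, 50).
per_tree = np.array([tree.predict(X_test) for tree in forest.estimators_]).astype(int)

# Step 2: count the votes per class for each test row -> shape (2, 50).
votes = np.apply_along_axis(np.bincount, 0, per_tree, minlength=2)

# Step 3: the class with the most votes is the final prediction.
final = votes.argmax(axis=0)
print(final[:10])
```

Using an odd number of trees avoids ties in a two-class problem.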

Set the parameters for the random forest model:

Parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 50, 'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6, 'max_leaf_nodes': None}
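A dictionary like this can be unpacked straight into scikit-learn's `RandomForestClassifier`. The sketch below assumes a synthetic dataset and adds `random_state=0` for reproducibility. Neither is part of the original settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

parameters = {
    "bootstrap": True,
    "min_samples_leaf": 3,
    "n_estimators": 50,
    "min_samples_split": 10,
    "max_features": "sqrt",
    "max_depth": 6,
    "max_leaf_nodes": None,
}

# Synthetic data (assumption) so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# ** unpacks the dictionary into keyword arguments.
model = RandomForestClassifier(random_state=0, **parameters).fit(X, y)
print(len(model.estimators_), "trees, training accuracy:", model.score(X, y))
```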

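Hyperparameter tuning of the random forest classifier:

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.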

min_samples_leaf : int, float, optional (default=1)

The minimum number of samples required to be at a leaf node:

  • If int, then consider min_samples_leaf as the minimum number.
  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

n_estimators : integer, optional (default=10; 100 since scikit-learn 0.22):

  • The number of trees in the forest.

min_samples_split : int, float, optional (default=2):

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.
  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

max_features : int, float, string or None, optional (default=”auto”):

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.
  • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
  • If “auto”, then max_features=sqrt(n_features). This option was deprecated in scikit-learn 1.1 and removed in 1.3, so use “sqrt” in newer releases.
  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
  • If “log2”, then max_features=log2(n_features).
  • If None, then max_features=n_features.

max_depth : integer or None, optional (default=None):

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

max_leaf_nodes : int or None, optional (default=None):

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

If you want to learn more about the rest of the hyperparameters, see the scikit-learn documentation for RandomForestClassifier.
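One common way to tune these hyperparameters is a cross-validated grid search with scikit-learn's `GridSearchCV`. The grid values and the synthetic data below are assumptions chosen to keep the example small. A real search would usually cover more values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
param_grid = {"max_depth": [3, 6], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a forest refitted on the full data with the winning combination.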

Bagging & Boosting – Theory


Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method. The bootstrap method refers to random sampling with replacement. Here, “with replacement” means the same observation can appear more than once in a sample. Training on many such samples exposes the model to the variation present in the data, which helps to control its variance.

To create a bagging model, we first create multiple random samples so that each new sample acts as another (almost) independent dataset drawn from the original distribution. Then we fit a weak learner to each of these samples, and finally we aggregate their outputs to obtain an ensemble model with lower variance than its components.

Let’s understand it with an example. As the figure below shows, each bootstrap sample contains a different mix of observations and no two samples are identical. This affects the mean, standard deviation and other descriptive statistics of each sample. Training on many such samples produces more robust models.
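A quick way to see this is to draw a few bootstrap samples with NumPy and compare their statistics. The `population` array is a made-up dataset used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100)  # hypothetical dataset

# Five bootstrap samples: same size as the data, drawn with replacement.
samples = [
    rng.choice(population, size=population.size, replace=True) for _ in range(5)
]

for i, s in enumerate(samples, start=1):
    n_unique = np.unique(s).size  # below 100 whenever some value is repeated
    print(f"sample {i}: mean={s.mean():.2f}, std={s.std():.2f}, unique={n_unique}")
```

Each sample repeats some observations and omits others, so the means and standard deviations differ slightly from sample to sample.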

How Bagging Works?

  1. Generate multiple samples from the training set using the following scheme: pick an element of the training set at random, record it, and put it back. As a result, some elements of the training set appear several times in a generated sample and some are absent. Each sample should have the same size as the training set.
  2. Train your learner on each generated sample.
  3. To make a prediction, average the learners' predictions for regression, or take a vote for classification.

Applying bagging often helps to deal with overfitting by reducing prediction variance.

Bagging Algorithms:

  1. Take M bootstrap samples (with replacement)
  2. Train M different classifiers on these bootstrap samples
  3. For a new query, let all classifiers predict and take the average (or majority vote).
  4. If the classifiers make independent errors, then their ensembles can improve performance.
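The algorithm above can be sketched from scratch in a few lines. The helper names `bagging_fit` and `bagging_predict`, the choice of decision trees as base classifiers, and the synthetic binary dataset are all assumptions made for illustration. scikit-learn also ships a ready-made `BaggingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=15, seed=0):
    """Steps 1-2: draw M bootstrap samples and train one classifier on each."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Step 3: majority vote, assuming binary labels 0 and 1."""
    preds = np.array([m.predict(X) for m in models])  # shape (M, n_rows)
    return (preds.mean(axis=0) > 0.5).astype(int)

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

models = bagging_fit(X_train, y_train)
y_pred = bagging_predict(models, X_test)
print("test accuracy:", (y_pred == y_test).mean())
```

For regression, the vote in `bagging_predict` would be replaced by a plain average of the predictions.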


Boosting is an ensemble modeling technique that converts weak learners into a strong learner.

Let’s understand it with an example. Suppose you want to identify whether an email is SPAM or NOT SPAM. To do that, you could use criteria such as the following.

  1. Email has only one image file: it’s SPAM
  2. Email has only a link: it’s SPAM
  3. Email body contains a sentence like “You won a prize money of $ xxxx”: it’s SPAM
  4. Email from our official domain “”, Not a SPAM
  5. Email from known source, Not a SPAM

As we can see above, there are multiple rules for deciding whether an email is spam or not. Taken individually, however, these rules are not as powerful as all of them combined. Each individual rule is therefore a weak learner.

To convert weak learners into a strong learner, we combine the predictions of the weak learners using methods like:
•   Taking the average or weighted average
•   Choosing the prediction with the most votes

For example: above, we have defined 5 weak learners. Suppose 3 of them vote ‘SPAM’ and 2 vote ‘Not a SPAM’. In this case we classify the email as SPAM by default, because ‘SPAM’ has the higher number of votes (3).
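Both ways of combining the rules can be shown in a few lines. The votes follow the example (three rules say SPAM, two say NOT SPAM). The `weights` are purely hypothetical and are there only to show that a weighted average can reach a different verdict from a plain vote.

```python
# One vote per rule: 1 = SPAM, 0 = NOT SPAM (three SPAM votes, two NOT SPAM votes).
votes = [1, 1, 1, 0, 0]

# Method 1: plain majority vote.
majority = "SPAM" if sum(votes) > len(votes) / 2 else "NOT SPAM"

# Method 2: weighted average, trusting some rules more than others.
# Hypothetical weights: here the two "not spam" rules are trusted the most.
weights = [0.5, 0.5, 1.0, 2.0, 1.5]
score = sum(w * v for w, v in zip(weights, votes)) / sum(weights)
weighted = "SPAM" if score > 0.5 else "NOT SPAM"

print(majority)             # SPAM
print(round(score, 3))      # 0.364
print(weighted)             # NOT SPAM
```

Boosting algorithms learn such weights from the data instead of setting them by hand.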

Boosting Algorithm:

  1. The base learner takes the whole training distribution and assigns equal weight or attention to each observation.
  2. If the first base learner makes prediction errors, we pay more attention to the observations it predicted incorrectly. Then we apply the next base learner.
  3. Repeat step 2 until the limit on the number of base learners is reached or the desired accuracy is achieved.

Finally, it combines the outputs from weak learner and creates a strong learner which eventually improves the prediction power of the model.

Types of Boosting Algorithm:

  1. AdaBoost (Adaptive Boosting)
  2. Gradient Tree Boosting
  3. XGBoost

AdaBoost(Adaptive Boosting)

AdaBoost was the first successful and very popular boosting algorithm, developed for binary classification. It combines multiple “weak classifiers” into a single “strong classifier”.

  1. Initialise the dataset and assign equal weight to each data point.
  2. Provide this as input to the model and identify the wrongly classified data points.
  3. Increase the weight of the wrongly classified data points.
  4. if (got required results)
      Go to step 5
     else
      Go to step 2
  5. End
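These steps can be sketched from scratch with decision stumps as the weak classifiers. This follows the usual discrete AdaBoost weight update for labels -1 and +1. The helper names, the fixed number of rounds (used here instead of a "required results" check), and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Discrete AdaBoost with decision stumps; y must hold labels -1 and +1."""
    w = np.full(len(X), 1.0 / len(X))  # step 1: equal weight for every point
    stumps, alphas = [], []
    for _ in range(n_rounds):          # step 4: fixed number of rounds (assumption)
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y   # step 2: find the misclassified points
        err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)  # weighted error rate
        alpha = 0.5 * np.log((1 - err) / err)           # say of this stump
        # Step 3: raise the weights of the mistakes, lower the others.
        w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
        w = w / w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted vote of all stumps."""
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)

X, y01 = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
y = np.where(y01 == 1, 1, -1)

stumps, alphas = adaboost_fit(X, y)
accuracy = (adaboost_predict(stumps, alphas, X) == y).mean()
print("training accuracy:", accuracy)
```

Each stump's `alpha` grows as its weighted error shrinks, so more accurate stumps get a larger say in the final vote.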

Let’s understand the concept with following example.

BOX – 1: In box 1 we assign equal weights to all data points and apply a decision stump to classify them as + (plus) or – (minus). The decision stump (D1) draws a vertical line on the left side to separate the data points. As the box shows, this line incorrectly predicts three + (plus) as – (minus). We therefore assign higher weights to these three + (plus) and apply another decision stump, as shown in the image below.

Decision stump – 1

BOX – 2: In box 2 the three incorrectly predicted + (plus) are drawn larger than the rest of the data points. The second decision stump (D2) therefore tries to predict them correctly. A vertical line (D2) on the right side of the box now classifies the three previously misclassified + (plus) correctly, but in the process it causes new misclassification errors, this time with three – (minus). So we assign higher weights to these three – (minus) and apply another decision stump, as shown in the image below.

Decision stump -2

BOX – 3: In box 3 the three – (minus) have been given higher weights. A decision stump (D3) is applied to predict these misclassified observations correctly. This time a horizontal line is generated to separate + (plus) from – (minus), based on the higher weights of the misclassified observations.

Decision stump – 3

BOX – 4: In box 4 we combine D1, D2 and D3 to form a strong prediction with a more complex rule than any individual weak learner. As we can see, the combined model classifies these observations much better than any single weak learner does.

Decision Stump – 4

Python Code

from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=4, random_state=0, algorithm='SAMME')
# X is the feature matrix and Y the target labels
clf.fit(X, Y)

  • n_estimators : integer, optional (default=50)

The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.

  • random_state : int, RandomState instance or None, optional (default=None)
  • algorithm : {‘SAMME’, ‘SAMME.R’}, optional (default=’SAMME.R’)

If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. The base estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm.
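For a version that runs end to end, here is a minimal sketch on synthetic data, which is an assumption. It leaves `algorithm` at its default, because the accepted values of that argument have changed across scikit-learn releases. By default the weak learners are decision stumps, as in the boxes above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption).
X, Y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# At most 4 decision stumps, matching n_estimators=4 above.
clf = AdaBoostClassifier(n_estimators=4, random_state=0)
clf.fit(X_train, Y_train)

print("stumps fitted:", len(clf.estimators_))
print("stump weights:", clf.estimator_weights_)
print("test accuracy:", clf.score(X_test, Y_test))
```

Raising `n_estimators` adds more stumps to the vote, usually improving accuracy at the cost of longer training.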