Imagine that you have a dataset with a set of predictors (independent variables) and a target (dependent variable). Applying a decision tree such as J48 to that dataset allows you to predict the target variable for a new record.

The J48 decision tree is the WEKA project team's implementation of the C4.5 algorithm, the successor of ID3 (Iterative Dichotomiser 3). The RWeka package makes this nice piece of work available in R.

Let’s use it on the iris dataset. The flower species will be our target variable, and we will predict it from the measured features: sepal and petal length and width.

# If not already installed, start by installing the "RWeka" and "party" packages

> install.packages("RWeka")

> install.packages("party")

# Load both packages

> library(RWeka)

> library(party)

# We will use the 'iris' dataset from the 'datasets' package. It consists of 50 objects from each of three species of Iris flowers (setosa, virginica and versicolor). For each object, four attributes are measured: the length and width of the sepal and the petal.

> str(iris)

'data.frame':   150 obs. of  5 variables:

 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Using the C4.5 decision tree algorithm through its J48 WEKA implementation, we want to predict the target attribute "Species" from the length and width of the sepal and petal.

> m1 <- J48(Species~., data = iris)

> if(require("party", quietly = TRUE)) plot(m1)

[Figure: J48 decision tree for the iris dataset]

How do we read this output?

The attribute Petal.Width is the one that carries the most information, and for that reason it has been selected as the first split criterion.

Reading the graphic, we can see that if Petal.Width is less than or equal to 0.6, then 50 of the 150 objects fall into setosa.

Otherwise, if Petal.Width is greater than 1.7, then 46 of the 150 objects fall into virginica, and so on.
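If the plot is hard to read, printing the fitted model gives a text version of the same tree. And since the whole point is prediction, here is a minimal sketch of classifying a new, made-up flower (the object new.flower and its measurements are invented for illustration):

> print(m1)

> new.flower <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9, Petal.Length = 4.5, Petal.Width = 1.4)

> predict(m1, newdata = new.flower)

Run the last line to see which species the tree assigns to this flower.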

# To get the evaluation summary and confusion matrix of the J48 model, type the following:

> summary(m1)

=== Summary ===

Correctly Classified Instances         147               98      %

Incorrectly Classified Instances         3                2      %

Kappa statistic                          0.97  

Mean absolute error                      0.0233

Root mean squared error                  0.108 

Relative absolute error                  5.2482 %

Root relative squared error             22.9089 %

Coverage of cases (0.95 level)          98.6667 %

Mean rel. region size (0.95 level)      34      %

Total Number of Instances              150     

=== Confusion Matrix ===

  a  b  c   <-- classified as

 50  0  0 |  a = setosa

  0 49  1 |  b = versicolor

  0  2 48 |  c = virginica

A word of caution: summary() evaluates the tree on the same data that was used to build it, so the 98% accuracy above is a resubstitution estimate rather than a measure of how the tree would perform on unseen flowers (a cross-validation sketch follows the list below).

The confusion matrix tells us the following:

  • The decision tree has classified all 50 setosa objects as setosa.
  • The decision tree has classified 49 versicolor objects as versicolor and 1 as virginica, resulting in 1 misclassification.
  • The decision tree has classified 48 virginica objects as virginica and 2 as versicolor, resulting in 2 misclassifications.
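These training-set figures are optimistic by construction. RWeka can also cross-validate the classifier; a minimal sketch using 10 folds (the number of folds and the seed are arbitrary choices):

> evaluate_Weka_classifier(m1, numFolds = 10, seed = 1, class = TRUE)

The numFolds argument requests k-fold cross-validation instead of evaluating on the training data; class = TRUE adds detailed per-class statistics to the output.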

Open questions for students

The decision tree has chosen the value 0.6 of Petal.Width as its first split point. Why?

How a decision tree works internally

Behind every decision tree lies what is called information gain, a concept that measures how much knowing an attribute reduces the uncertainty about the target variable. It gives an idea of how important each attribute is within a dataset.

# The information gain calculation answers the question of why the algorithm decided to start with the attribute Petal.Width (install the "FSelector" package first if needed).

> library(FSelector)

> information.gain(Species~., data = iris)

             attr_importance

Sepal.Length       0.6522837

Sepal.Width        0.3855963

Petal.Length       1.3565450

Petal.Width        1.3784027

The point is that the attribute Petal.Width has an information gain of 1.378, the highest of the four attributes in the iris dataset.
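To make the definition concrete, here is a hand computation of entropy and of the gain obtained by a single binary split at Petal.Width <= 0.6. The helper names entropy and info.gain.split are made up for this sketch, and note that FSelector discretizes numeric attributes into several intervals, so its value of 1.378 is not directly comparable to this single-split gain:

> entropy <- function(y) { p <- prop.table(table(y)); p <- p[p > 0]; -sum(p * log2(p)) }

> info.gain.split <- function(y, left) { entropy(y) - mean(left) * entropy(y[left]) - mean(!left) * entropy(y[!left]) }

> info.gain.split(iris$Species, iris$Petal.Width <= 0.6)

The whole dataset has entropy log2(3) ≈ 1.585 (three equally frequent species); the left side of the split is pure setosa (entropy 0) and the right side is half versicolor, half virginica (entropy 1), so the gain is about 1.585 - 2/3 ≈ 0.918, a large reduction in uncertainty, which is exactly what the tree is looking for.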

# Let’s go further in this study. We will take the subset of iris containing only the objects with Petal.Width > 0.6 and compute the information gain within this subset.

> subset1.iris <- subset(iris, Petal.Width > 0.6)

> information.gain(Species~., data = subset1.iris)

             attr_importance

Sepal.Length       0.1605000

Sepal.Width        0.0000000

Petal.Length       0.6573738

Petal.Width        0.6901604

Once again Petal.Width is the attribute that carries the most information, and that is why the second split of the tree is made on Petal.Width as well.

# The next step is to compute the information gain of the subset that contains only the objects with Petal.Width <= 1.7

> subset2.iris <- subset(subset1.iris, Petal.Width <= 1.7)

> information.gain(Species~., data = subset2.iris)

             attr_importance

Sepal.Length       0.0000000

Sepal.Width        0.0000000

Petal.Length       0.2131704

Petal.Width        0.0000000

This time Petal.Length is the attribute with the highest information gain, so it becomes the split attribute at this node.

In summary, information gain is the mathematical tool that the J48 algorithm uses at each node of the tree to decide which variable best predicts the target variable.

Pruning the decision tree

In a future update we will develop the importance of the error tolerance parameter and the concept of pruning in decision trees.
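As a preview, the pruning behaviour of J48 can already be explored through its Weka options via Weka_control(); for example, C is the confidence threshold used for pruning (smaller values prune more aggressively) and M is the minimum number of instances per leaf. The values below are simply Weka's defaults, shown to illustrate the syntax:

> WOW("J48")   # Weka Option Wizard: lists all options accepted by J48

> m2 <- J48(Species ~ ., data = iris, control = Weka_control(C = 0.25, M = 2))

Lowering C (say, to 0.1) or raising M tends to produce a smaller, more heavily pruned tree.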

(Still working on this section!)

You can also post any comments about this article by visiting

cRomoData.com