How to prove it in math: why deeper decision trees will never have higher expected cross entropy?

What we discuss here is intuition versus mathematical derivation, and we are lucky on this topic: our intuition turns out to be consistent with the result of the derivation.

The expected cross-entropy is commonly used as the cost function for decision trees. You can find its definition everywhere. Let's start our story with a simple example.

1. From A Simple Example

Most of us have observed cases where deeper decision trees have lower cross-entropy than shallower ones. For example, suppose there are 30 black balls and 70 white balls in a bag:

Figure 1: root node with no extra layer


Since there is only the root node, the cross-entropy here is the same as the expected cross-entropy (using the natural logarithm):

ECE(0 layer) = -0.3 * log(0.3) - 0.7 * log(0.7) ≈ 0.6108

Let's add one more feature. Suppose these balls are made of either glass or wood. Of the 30 black balls, 20 are glass and 10 are wooden. Of the 70 white balls, 41 are glass and 29 are wooden. We want to use this feature (glass or wood) to help classify the color of the balls, so we add one more layer to our decision tree that splits on it.

Figure 2: decision tree with an additional feature as an extra layer for classification

After this split, the glass node holds 61 balls (20 black, 41 white) and the wood node holds 39 balls (10 black, 29 white), so the new expected cross-entropy is:

ECE(1 layer) = 0.61 * [-(20/61) log(20/61) - (41/61) log(41/61)] + 0.39 * [-(10/39) log(10/39) - (29/39) log(29/39)] ≈ 0.6079

Here it is: 0.6079 < 0.6108. After adding one more layer, the expected cross-entropy is lower than the original cross-entropy. You can change the black/white numbers any way you like; I have not been able to construct a counterexample in which adding a layer results in a higher expected cross-entropy.
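To make the arithmetic easy to check, here is a minimal sketch in Python (the helper names entropy and expected_ce are just illustrative, and natural logarithms are assumed, matching the numbers above):

```python
import math

def entropy(counts):
    """Cross-entropy of a single node given its per-class counts (natural log)."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

def expected_ce(leaves):
    """Weighted sum of leaf cross-entropies; weights are leaf size / total size."""
    total = sum(sum(leaf) for leaf in leaves)
    return sum(sum(leaf) / total * entropy(leaf) for leaf in leaves)

print(entropy([30, 70]))                  # ~0.6108..., the root-only value above
print(expected_ce([[20, 41], [10, 29]]))  # ~0.6079..., after the glass/wood split
```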

This caught my attention.

My intuition tells me this is a rule and there is something interesting behind it.

2. Let's form our conjecture

Conjecture: the expected cross-entropy will not increase after adding layers to the decision tree.

2.1 Intuition for the conjecture

Besides our observations, we have more intuitive beliefs supporting this conjecture.

Let's first take a look at the cross-entropy of a single node whose items fall into classes with proportions p_1, ..., p_K:

H = -Σ_k p_k log(p_k)

What does it mean? It is not just a bunch of logarithm calculations. Borrowing an idea from physics (forget Shannon's information theory for a while), entropy is a measure of the degree of disorder or randomness. In our case, the cross-entropy plays a similar role; more precisely, it measures the degree of purity of a leaf node.

A leaf node is purer if the items it holds come from a single type rather than from mixed types. For example, the leaf nodes in figure 3.C (filled with cyan, each holding only one item, hence perfectly pure) are much purer than the root node in figure 3.A.

The expected cross-entropy, calculated as the weighted sum of the cross-entropy over the leaf nodes, measures the overall degree of purity across all leaf nodes of the decision tree. For figure 3.C, the expected cross-entropy is 0 (a weighted sum of zeros is zero), lower than in the other cases (3.A, 3.B).

Figure 3: root node only (3.A), partially expanded tree (3.B), fully expanded tree (3.C). The leaf nodes in the fully expanded decision tree (figure 3.C, filled with cyan) are purer than the root node in the root-only scenario (figure 3.A).

Looking at figure 3, our intuition tells us the following (ECE = expected cross-entropy):

ECE (3.A) >= ECE (3.B) >= ECE (3.C) = 0

This is consistent with our math conjecture, which can be written as:

  • Inequality 1:

ECE(root node only) >= ECE(tree with 1 layer) >= ECE(tree with 2 layers) >= ... >= ECE(fully expanded decision tree) = 0.

It would be beautiful if this were true, and we can prove it now.
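Before the formal proof, a quick empirical sanity check is easy to run. The sketch below (hypothetical helper code, not from this article) draws random two-class nodes, splits them randomly into two children, and verifies that the weighted cross-entropy of the children never exceeds that of the parent:

```python
import math
import random

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)

random.seed(0)
for _ in range(10_000):
    # random parent node with two classes (blue, gray)
    b, g = random.randint(1, 100), random.randint(1, 100)
    # split each class randomly between a left and a right child
    b_l, g_l = random.randint(0, b), random.randint(0, g)
    children = [(b_l, g_l), (b - b_l, g - g_l)]
    parent_ece = entropy([b, g])
    child_ece = sum(sum(c) / (b + g) * entropy(c) for c in children if sum(c) > 0)
    assert child_ece <= parent_ece + 1e-12  # adding the layer never increases ECE
print("no counterexample in 10,000 random splits")
```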

3. Let's prove it in Math

  • Which part do we have to prove?

We will use a procedure similar to mathematical induction, but a little different. Standard mathematical induction goes as follows:

  1. prove the base case (from 0 to 1), which is usually the easiest part
  2. prove the step from n to n+1, which is the hard part

What I will do is:

  1. prove ECE(0 layer) >= ECE(1 layer); this, however, is the most interesting part of the proof
  2. repeatedly apply the 0-to-1 result to prove the step from n to n+1. This part is relatively easy.

3.1. ECE(0 layer) >= ECE(1 layer)

First, let's simplify the problem without loss of generality. Assume there are two types of balls in the bag, blue and gray, and that each node grows two branches (left and right) when the tree goes one level deeper.

Figure 4. Illustration of growing the decision tree from 0 layers to 1 layer.

According to figure 4, write N = b + g for the total number of balls, where the root holds b blue and g gray balls. The split sends l balls (b_l blue, g_l gray) to the left child and r balls (b_r blue, g_r gray) to the right child, so that l + r = N, b_l + b_r = b, and g_l + g_r = g. The inequality we want to prove is:

ECE(0 layer) = -(b/N) log(b/N) - (g/N) log(g/N)
>= (l/N) [-(b_l/l) log(b_l/l) - (g_l/l) log(g_l/l)] + (r/N) [-(b_r/r) log(b_r/r) - (g_r/r) log(g_r/r)] = ECE(1 layer)

The terms on the right-hand side can be regrouped by color:

ECE(1 layer) = [-(l/N)(b_l/l) log(b_l/l) - (r/N)(b_r/r) log(b_r/r)] + [-(l/N)(g_l/l) log(g_l/l) - (r/N)(g_r/r) log(g_r/r)]

Starting from this formula, we can apply Jensen's inequality:

  • Jensen's inequality: for a concave function f, f(E(x)) >= E(f(x))


Figure 5. borrowed from https://alliance.seas.upenn.edu/~cis520/dynamic/2017/wiki/uploads/Lectures/jensen.png
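To see the two-point, weighted form of Jensen's inequality in action with the function we will use below, here is a minimal numeric sketch (illustrative only; it assumes nothing beyond the Python standard library):

```python
import math
import random

f = lambda x: -x * math.log(x)  # concave for x > 0

random.seed(1)
for _ in range(10_000):
    w = random.random()                      # weights w and 1 - w sum to 1
    x1, x2 = random.uniform(1e-6, 1), random.uniform(1e-6, 1)
    lhs = f(w * x1 + (1 - w) * x2)           # f(E(x))
    rhs = w * f(x1) + (1 - w) * f(x2)        # E(f(x))
    assert lhs >= rhs - 1e-12                # Jensen: f(E(x)) >= E(f(x))
print("f(E(x)) >= E(f(x)) held in every trial")
```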

For two points x_1, x_2 and weights w_1, w_2 with w_1 + w_2 = 1, we can write Jensen's inequality as:

f(w_1 x_1 + w_2 x_2) >= w_1 f(x_1) + w_2 f(x_2)

Compare this with the two sides of the inequality we want to prove: with f(x) = -x log(x), each color group on the right-hand side has the form w_1 f(x_1) + w_2 f(x_2) with weights w_1 = l/N and w_2 = r/N, while the matching term on the left-hand side is f of the corresponding weighted average.

We can prove that the left side >= the right side as follows:

  1. prove f(x) = -x * log(x) is concave. This is easy: f'(x) = -log(x) - 1 and f''(x) = -1/x <= 0 for x > 0.
  2. prove the following two lines:

-(l/N)(b_l/l) log(b_l/l) - (r/N)(b_r/r) log(b_r/r) <= -(b/N) log(b/N)
-(l/N)(g_l/l) log(g_l/l) - (r/N)(g_r/r) log(g_r/r) <= -(g/N) log(g/N)

I'll only prove the first line (the blue terms); the second line (the gray terms) is proved the same way:

  • From Jensen's inequality with weights w_1 = l/N, w_2 = r/N (remember l + r = b + g = N, so the weights sum to 1) and points x_1 = b_l/l, x_2 = b_r/r:

(l/N) f(b_l/l) + (r/N) f(b_r/r) <= f((l/N)(b_l/l) + (r/N)(b_r/r)) = f((b_l + b_r)/N) = f(b/N) = -(b/N) log(b/N)

This completes the proof that ECE(0 layer) >= ECE(1 layer)!
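As a quick symbolic check of step 1 above, here is a short sketch using sympy (assumed to be installed):

```python
import sympy as sp

x = sp.symbols('x', positive=True)
f = -x * sp.log(x)
print(sp.diff(f, x))     # -log(x) - 1
print(sp.diff(f, x, 2))  # -1/x, which is <= 0 for x > 0, so f is concave
```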

3.2. ECE(n layer) >= ECE(n+1 layer)

To prove ECE(n layer) >= ECE(n+1 layer), just apply ECE(0 layer) >= ECE(1 layer) to every leaf node that gets split when the nth layer grows into the (n+1)th layer. Each such leaf contributes its weight times its cross-entropy to the ECE; after the split, that contribution is replaced by the same weight times the weighted cross-entropy of its children, which by the 0-to-1 result is not larger. Summing over all leaves, the total ECE cannot increase.

Figure 6. Repeatedly apply '0 to 1' to prove 'n to n+1'.

With that, the proof is finished. Our intuition didn't lie! It is true that a deeper decision tree will never have a higher expected cross-entropy than a shallower one.

4. Last word.

PS1. Jensen's inequality is extremely powerful!

Jensen's inequality is not difficult; I've been using it since middle school for math olympiad competitions. But it is a weapon of mass destruction in data science! We need it in the theoretical analysis of decision trees, K-means, and EM. It is very helpful!

PS2. Thanks to Entropy

Why are we able to apply Jensen's inequality to the decision tree? Because: 1. the cross-entropy function is concave, and 2. the expected cross-entropy is a weighted combination of leaf cross-entropies, with weights that sum to 1. These two factors are the keys to the 'non-increasing entropy' property of decision trees.

PS3. Is a lower expected cross-entropy what we really want in a decision tree?

Sadly, the answer is NO. What we want from a classifier is its ability to distinguish different types of items. ECE is an indirect index we use for the decision tree. A lower expected cross-entropy only means the leaf nodes have a higher degree of purity; it does not directly reveal the tree's ability to recognize a cat or a dog. A lower ECE may simply result from the decision tree being fully expanded so that each leaf node holds one item. Such a tree is certainly pure, but it is still useless: it makes terrible guesses beyond the existing samples. This is over-fitting.

Figure 7. A lower expected cross-entropy only means a high degree of purity on the leaf nodes. It may lead to over-fitting.

In figure 7, you can clearly see the comparison between a good classifier (7.B) and an over-fitted tree (7.C), both of which reach the lowest possible expected cross-entropy (0). To avoid over-fitting, we need to restrict the depth of the tree. What we want is a classifier like 7.B, not the pure but useless 7.C.
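To illustrate the point, here is a minimal sketch with scikit-learn (assumed to be installed; the dataset and parameters are arbitrary). On noisy data, a fully expanded tree typically drives its training score to 1.0 yet generalizes worse than a depth-restricted one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# noisy synthetic data, so a fully expanded tree is forced to memorize noise
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None = fully expanded, 3 = restricted depth
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.2f}, "
          f"test={tree.score(X_te, y_te):.2f}")
```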

Thanks for reading, and I appreciate your feedback!
