Proof that a randomly built binary search tree has logarithmic height

Question Detail: 

How do you prove that the expected height of a randomly built binary search tree with $n$ nodes is $O(\log n)$? There is a proof in CLRS Introduction to Algorithms (chapter 12.4), but I don't understand it.

Asked By : user1675999
Best Answer from Stack Exchange

Question Source : http://cs.stackexchange.com/questions/6342

Answered By : Merbs

Let's first think about this intuitively. In the best-case scenario, the tree is perfectly balanced; in the worst-case scenario, the tree is entirely unbalanced:

[Figures: a height-balanced binary search tree (left) and a worst-case, entirely unbalanced binary search tree (right)]

Starting from the root node $p$, the tree on the left has twice as many nodes at each succeeding depth, so that a tree of height $h$ (in this case 3) has $n=\sum_{i=0}^{h}2^i =2^{h+1}-1$ nodes. With a little math, $n\le2^{h+1}-1\rightarrow h\le\lceil\log_2(n+1)-1\rceil\le\lfloor \log_2 n\rfloor$, which is to say it has $O(\log n)$ height. For the entirely unbalanced tree, the height is simply $n-1\rightarrow O(n)$. So we have our bounds.
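To make the two extremes concrete, here is a minimal Python sketch (the helper name `balanced_height` is mine) checking that the tree built by always splitting at the middle element has height $\lfloor\log_2 n\rfloor$, while the chain has height $n-1$:

```python
import math

def balanced_height(n):
    """Height of the BST on n nodes built by always rooting at the middle."""
    if n == 0:
        return -1          # convention: the empty tree has height -1
    half = (n - 1) // 2    # middle element becomes the root
    return 1 + max(balanced_height(half), balanced_height(n - 1 - half))

# The balanced tree has height floor(log2 n) ...
for n in [1, 3, 7, 15, 100, 1000]:
    assert balanced_height(n) == math.floor(math.log2(n))

# ... while the entirely unbalanced tree (a chain) has height n - 1,
# which is O(n); no code needed for that one.
```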

If we were constructing a balanced tree from an ordered list $\{ 1,2,\dots,n\}$, we'd choose the middle element to be our root node. If we instead construct the tree randomly, each of the $n$ elements is equally likely to be picked as the root, and the height of our tree is: $$h_{\text{tree}}=1+\max (h_{\text{left subtree}}, h_{\text{right subtree}})$$ We know that in a binary search tree, the left subtree may only contain keys less than the root node. Thus, if we randomly choose the $i^{th}$ smallest element as the root, the left subtree has $i-1$ elements and the right subtree has $n-i$ elements, so more compactly: $h_n=1+\max (h_{i-1},h_{n-i})$. From there, it makes sense that if each element is equally likely to be picked, the expected value is just the average over all cases (rather than a weighted average). Hence: $\operatorname{E}[h_n]=\frac{1}{n}\sum_{i=1}^{n}\operatorname{E}[1+\max (h_{i-1},h_{n-i})]$
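The recurrence $h_n=1+\max (h_{i-1},h_{n-i})$ can be simulated directly: pick a uniformly random rank for the root and recurse on both sides. A quick Monte Carlo sketch (the function name, seed, and trial count are my own choices):

```python
import math
import random

def random_bst_height(n, rng):
    """Sample the height of a random BST on n keys via the recurrence."""
    if n == 0:
        return -1          # convention: the empty tree has height -1
    i = rng.randint(1, n)  # the root is the i-th smallest of the n keys
    return 1 + max(random_bst_height(i - 1, rng),
                   random_bst_height(n - i, rng))

rng = random.Random(42)
n, trials = 1024, 300
avg = sum(random_bst_height(n, rng) for _ in range(trials)) / trials

# The average height sits far below the worst case n - 1 and within a
# small constant factor of log2(n), consistent with O(log n).
assert math.log2(n) <= avg <= 6 * math.log2(n)
```

(The true asymptotic constant is known to be about $4.31\ln n$, but the loose bounds above are all this sketch needs.)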

As I'm sure you've noticed, I've deviated slightly from how CLRS proves this, because CLRS uses two relatively common proof techniques that are disconcerting for the uninitiated. The first is to use exponents (or logarithms) of what we want to find (in this case height), which makes the math work out a little more cleanly; the second is to use indicator functions (which I'm just going to ignore here). CLRS defines the exponential height as $Y_n=2^{h_n}$, so the analogous recurrence is $Y_n=2\times\max (Y_{i-1},Y_{n-i})$.

Assuming independence (that each draw of an element, out of the available elements, to be the root of a subtree is irrespective of all previous draws), we still have the relation: $$\operatorname{E}[Y_n]=\sum_{i=1}^{n}\frac{1}{n}\operatorname{E}[2\times\max (Y_{i-1},Y_{n-i})]=\frac{2}{n}\sum_{i=1}^{n}\operatorname{E}[\max (Y_{i-1},Y_{n-i})]$$ for which I did two steps: (1) moving the $\frac{1}{n}$ outside, because it is a constant and one of the properties of summations is that $\sum_i c\,a_i=c\sum_i a_i$; and (2) moving the 2 outside, because it is also a constant and one of the properties of expected values is $\operatorname{E}[aX]=a\operatorname{E}[X]$.

Now we're going to replace the $\max$ function with something larger, because otherwise simplifying is difficult. For nonnegative $X$, $Y$ we have $\operatorname{E}[\max(X,Y)]\le\operatorname{E}[\max(X,Y)+\min(X,Y)]=\operatorname{E}[X]+\operatorname{E}[Y]$, so: $$\operatorname{E}[Y_n]\le\frac{2}{n}\sum_{i=1}^{n}(\operatorname{E}[Y_{i-1}]+\operatorname{E}[Y_{n-i}])=\frac{2}{n}\sum_{i=0}^{n-1}2\operatorname{E}[Y_{i}]$$ The last step follows from the observation that as $i$ runs from $1$ to $n$, the index $i-1$ runs over $0,\dots,n-1$ while the index $n-i$ runs over $n-1,\dots,0$, so every term $\operatorname{E}[Y_0]$ through $\operatorname{E}[Y_{n-1}]$ appears exactly twice and we can replace the entire summation with an analogous one. The good news is that we have a recurrence, $\operatorname{E}[Y_n]\le\frac{4}{n}\sum_{i=0}^{n-1}\operatorname{E}[Y_{i}]$; the bad news is that we're not much further than where we started.
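The independence assumption actually holds here: conditioned on the root, the two subtrees of a random BST are built independently. That means the exact distribution of $h_n$, and hence $\operatorname{E}[Y_n]$, can be computed by dynamic programming and checked against the derived recurrence. A sketch (the convention $h_0=-1$ for the empty tree and the function name are mine):

```python
from fractions import Fraction

def height_dists(N):
    """dists[n] maps height h -> P(random BST on n keys has height h)."""
    dists = [{-1: Fraction(1)}]            # empty tree: height -1 w.p. 1
    for n in range(1, N + 1):
        d = {}
        for i in range(1, n + 1):          # i-th smallest key is the root
            left, right = dists[i - 1], dists[n - i]
            for hl, pl in left.items():    # subtrees are independent,
                for hr, pr in right.items():  # so probabilities multiply
                    h = 1 + max(hl, hr)
                    d[h] = d.get(h, Fraction(0)) + pl * pr / n
        dists.append(d)
    return dists

N = 20
dists = height_dists(N)
EY = [sum(p * Fraction(2) ** h for h, p in d.items()) for d in dists]

# E[Y_n] really does satisfy the derived bound E[Y_n] <= (4/n) sum E[Y_i]:
for n in range(1, N + 1):
    assert EY[n] <= Fraction(4, n) * sum(EY[:n])
```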

At this point, CLRS pulls an induction hypothesis $\operatorname{E}[Y_n]\le\frac{1}{4}\binom{n+3}{3}$ out of their ... repertoire of mathematical experience, one that relies on an identity $\sum_{i=0}^{n-1}\binom{i+3}{3}=\binom{n+3}{4}$ they leave to the reader to prove. What's important about their choice is that its leading term is $n^3$; since we are using the exponential height $Y_n=2^{h_n}$, a polynomial bound on $\operatorname{E}[Y_n]$ translates to $\operatorname{E}[h_n]\le\log_2 O(n^3)=3\log_2 n+O(1)\rightarrow O(\log n)$. Perhaps someone will comment on why this particular binomial was chosen. The general idea, though, is to bound our recurrence from above by a polynomial $cn^k$ for some constants $c$ and $k$.
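For what it's worth, both the identity and one plausible answer to "why this binomial" can be checked numerically: running the recurrence at equality, $b_n=\frac{4}{n}\sum_{i<n}b_i$, with the illustrative base case $b_0=\binom{3}{3}=1$, solves exactly to $b_n=\binom{n+3}{3}$, so the binomial is the natural closed form of the recurrence and the constant $\frac14$ is just what the induction needs to absorb the base cases:

```python
from fractions import Fraction
from math import comb

# The "hockey-stick" identity CLRS leaves to the reader:
#   sum_{i=0}^{n-1} C(i+3, 3) = C(n+3, 4)
for n in range(1, 200):
    assert sum(comb(i + 3, 3) for i in range(n)) == comb(n + 3, 4)

# The recurrence run at equality, with the illustrative base case b_0 = 1,
# solves exactly to the binomial C(n+3, 3):
b = [Fraction(1)]
for n in range(1, 60):
    b.append(Fraction(4, n) * sum(b))
    assert b[n] == comb(n + 3, 3)
```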

To conclude with a one-liner (the first step is Jensen's inequality, since $2^x$ is convex): $$2^{\operatorname{E}[h_n]}\le \operatorname{E}[2^{h_n}]=\operatorname{E}[Y_n]\le \frac{4}{n}\sum_{i=0}^{n-1}\operatorname{E}[Y_i]\le\frac{1}{4}\binom{n+3}{3}=\frac{(n+3)(n+2)(n+1)}{24}\rightarrow \operatorname{E}[h_n]=O(\log n) $$
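A quick numeric sanity check of that last implication (the additive constant 3 and the sampling stride are my own loose choices): taking $\log_2$ of the cubic bound stays within $3\log_2 n + O(1)$.

```python
import math

# E[h_n] <= log2((n+3)(n+2)(n+1)/24) <= 3*log2(n) + 3, i.e. O(log n).
for n in range(1, 10**6, 997):
    cubic_bound = (n + 3) * (n + 2) * (n + 1) / 24
    assert math.log2(cubic_bound) <= 3 * math.log2(n) + 3
```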
