Left-Factoring a grammar into LL(1)
I have a homework assignment where I need to convert a grammar into LL(1). I've already removed the left recursion, but I'm having trouble doing left-factoring. All of the examples I've found are simple, and look something like this:
A -> aX | aY
becomes:
A -> aZ
Z -> X | Y
I understand that. However, my grammar looks more like this:
X -> aE | IXE | (X)E E -> IE | BXE | ϵ I -> ++ | -- B -> + | - | ϵ
I'm not sure how to apply the simpler example to this. I've been trying for at least a couple of hours and I've lost track of all of the things I've tried. Generally, my attempts have looked something like this:
X -> X' | IXE X' -> aE | (X)E E -> IE | BIX'E | BX'E | ϵ
And I then try to convert the E rules into ones having only one production starting with + or -:
X -> X' | IXE X' -> aE | (X)E B' -> + | - E -> IE | B'IX'E | IX'E | B'X'E | X'E | ϵ
And then...
X -> X' | IXE X' -> aE | (X)E B' -> + | - E -> +P | -M | ϵ P -> +E | IX'E | +X'E | X'E M -> -E | IX'E | -X'E | X'E
And so on. But I continually end up with a lot of extra nonterminals, and some very long productions / chains of productions, without actually having left-factored it. I'm not sure how to approach this - I can't seem to eliminate some nonterminal having multiple productions starting with a + and with a -.
Asked By : Kami's Aibou
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/4862
Answered By : Raphael
Let's have a look at your grammar:
$\qquad \begin{align} X &\to aE \mid IXE \mid (X)E \\ E &\to IE \mid BXE \mid \varepsilon \\ I &\to \text{++} \mid \text{--} \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$
Note that $X$ does not need left-factoring: all rules have disjoint FIRST sets¹. If you want to make this obvious, you can drop $I$ and inline it:
$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid BXE \mid \varepsilon \\ B &\to \text{+} \mid \text{-} \mid \varepsilon \end{align}$
Similarly, we can inline $B$:
$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid XE \mid \varepsilon \end{align}$
Now we see that we actually have to do left-factoring on $E$: we have obvious conflicts, and we get additional conflicts via $XE$. So, let's inline $X$ once at $XE$:
$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{++}E \mid \text{--}E \mid \text{+}XE \mid \text{-}XE \mid aEE \mid \text{++}XEE \mid \text{--}XEE \mid (X)EE \mid \varepsilon \end{align}$
And now we can left-factor as easily as in your example:
$\qquad \begin{align} X &\to aE \mid \text{++}XE \mid \text{--}XE \mid (X)E \\ E &\to \text{+}P \mid \text{-}M \mid aEE \mid (X)EE \mid \varepsilon \\ P &\to \text{+}E \mid XE \mid \text{+}XEE \\ M &\to \text{-}E \mid XE \mid \text{-}XEE \end{align}$
By now we can see that we are not getting anywhere: by factoring away $\text{+}$ or $\text{-}$ from the alternatives, we dig up another $X$ which again has both $\text{+}$ and $\text{-}$ in its FIRST set.
So let's have a look at your language. Via
$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nBXE$
and
$\qquad \displaystyle X \Rightarrow aE \Rightarrow^* aI^n E \Rightarrow aI^nIE$
you have arbitrarily long prefixes of the form $+^+$ which end differently, semantic-wise: an LL(1) parser can not decide whether any given (next) $\text{+}$ belongs to a pair -- which would mean choosing alternative $IE$ -- or comes alone -- which would mean choosing $BXE$.
In consequence, it doesn't look like your language can be expressed by any LL(1) grammar, so trying to convert yours into one is futile.
It's even worse: as $BXE \Rightarrow BIXEE \Rightarrow^* BI^n X E^n E$, you can not decide to chose $BXE$ with any finite look-ahead. This is not a formal proof, but it strongly suggests that your language is not even LL.
If you think about what you are doing -- mixing Polish notation with unary operators -- it is not very surprising that parsing should be hard. Basically, you have to count from the left and from the right to identify even a single $B$-$\text{+}$ in a long chain of $\text{+}$. If I think of multiple $B$-$\text{+}$ in a chain, I'm not even sure the language (with two semantically different but syntactically equal $\text{+}$) can be parsed deterministically (without backtracking) at all.
- That would be the sets of terminals that can come first in derivations of a non-terminal/rule alternative.
Post a Comment