Earley parser: Difference between revisions

Line 1:

{{Short description|Algorithm for parsing context-free languages}}

{{Infobox algorithm

|name=~~{{PAGENAMEBASE}}~~

|name=Earley parser

|class=[[Parsing]] ~~grammars that are~~ [[Context-free grammar|context-free]]

|class=[[Parsing]], [[Context-free grammar|context-free]]

|data=[[String (computer science)|String]]

|time=<math>O(n^3)</math>

|time={{plainlist|

|~~best-time~~={{plainlist|

* <math>O(n)</math> for non-right recursive LR(k) grammars

* <math>~~\Omega~~(n)</math> for all [[~~deterministic~~ context-free ~~grammar~~]]s

* <math>O(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]]

* <math>~~\Omega~~(n^2)</math> for ~~[[Ambiguous grammar|unambiguous~~ grammars]]

* <math>O(n^3)</math> for all other context-free grammars

}}

|space={{plainlist|

* <math>O(n)</math> for non-right recursive LR(k) grammars

* <math>O(n^2)</math> for all other context-free grammars

}}

{{Infobox algorithm

|name=Leo variant

|class=[[Parsing]], [[Context-free grammar|context-free]]

|data=[[String (computer science)|String]]

|time={{plainlist|

* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)

* <math>O(n^3)</math> for all other context-free grammars

}}

|space={{plainlist|

* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)

* <math>O(n^2)</math> for all other context-free grammars

}}

~~|average-time=<math>\Theta(n^3)</math>~~

~~|space=~~

}}

In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]]~~, though (depending on the variant) it may suffer problems with certain nullable grammars~~.<ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> The algorithm, named after its inventor [[Jay Earley]], ~~is a [[chart parser]] that uses [[dynamic programming]];~~ it ~~is mainly used for parsing in [[computational linguistics]]. It~~ was first introduced in his dissertation<ref name=Earley1>{{cite book

In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]]. Named after its inventor [[Jay Earley]], it was first introduced in his dissertation in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley1">{{cite book

| last=Earley

| last = Earley | first = Jay

| first=Jay

| title = An Efficient Context-Free Parsing Algorithm

| title=An Efficient Context-Free Parsing Algorithm

| year = 1968

| year=1968

| publisher = Carnegie-Mellon Dissertation

| publisher=Carnegie-Mellon Dissertation

| url = http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | url-status = dead

| url=http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf

| archive-url = https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf

| ~~access~~-~~date~~=~~2012-09-12~~

| archive-date = 2017-09-22

~~| archive-date=2017-09-22~~

}}</ref><ref name="Earley2">{{citation

| archive-url=https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf

| last = Earley | first = Jay

| ~~url~~-~~status~~=~~dead~~

| doi = 10.1145/362007.362035

}}</ref> ~~in 1968 (and later appeared in abbreviated, more legible form in a journal).~~<ref name="Earley2">{{citation

| url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead

| last = Earley | first = Jay ~~| author-link = Jay Earley~~

| archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf

| doi = 10.1145/362007.362035 | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf ~~| url-status = dead~~ | archive-date = 2004-07-08 ~~| issue = 2~~

| archive-date = 2004-07-08

| journal = [[Communications of the ACM]]

| issue = 2

| pages = 94–102

| title = An efficient context-free parsing algorithm

| volume = 13

| year = 1970| s2cid = 47032707 }}</ref>

| year = 1970

| s2cid = 47032707

}}</ref> It is a [[chart parser]] that uses [[dynamic programming]].

Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages. However, the original Earley algorithm contained a bug, and its run time and memory use were high for the computers available at the time it was published. Earley parsing has since become a viable solution, as corrections and optimizations of the algorithm have been found and computers have become far more powerful.<ref>{{Cite web|title=Parsing: a timeline -- V3.1|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/personal/timeline_v3|date=2023-07-06|access-date=2026-05-05}}</ref>

Earley parsers perform particularly well when the rules are written [[left recursion|left-recursively]]. Some grammars can be automatically rewritten so that the Earley algorithm parses the language they describe more efficiently.

~~Earley parsers are appealing because they can parse all context~~-~~free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes~~ of ~~languages. The~~ Earley ~~parser executes in cubic time in the general case~~ <math>{O}(n^3)</math>~~, where ''n'' is the length of the parsed string, quadratic~~ time for ~~[[unambiguous grammar]]s~~ <math>{O}(n^2)</math>,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> and ~~linear time~~ for all ~~[[deterministic~~ context-free ~~grammar]]s. It performs particularly well when~~ the ~~rules are written [[left recursion|left-recursively]]~~.

The worst-case costs of the original Earley algorithm are: <math>O(n)</math> linear time and space for non-right recursive LR(k) grammars, <math>O(n^2)</math> quadratic time and space for [[unambiguous grammar]]s,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> <math>O(n^3)</math> cubic time and <math>O(n^2)</math> quadratic space for all other context-free grammars, where <math>n</math> is the length of the parsed string.<ref name="Leo" />

== Earley recogniser ==

The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.

~~== The algorithm ==~~

In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol.

~~Earley's algorithm is a top-down [[dynamic programming]] algorithm.~~ In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.

In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.

Input position 0 is the position prior to input. Input position ''n'' is the position after accepting the ''n''th token. (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.) For every input position, the parser generates a ''state set''. Each state is a [[tuple]] (X → α • β, ''i''), consisting of

Line 63:

Line 84:

The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects.

== Pseudocode ==

=== Pseudocode ===

Adapted from Speech and Language Processing<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,

Adapted from ''Speech and Language Processing''<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,

Line 197:

Line 218:

== Constructing the parse forest ==

Earley's dissertation<ref ~~name=Earley3~~>{{cite book

Earley's dissertation<ref>{{cite book

| last=Earley

| first=Jay

Line 219:

Line 240:

SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from.

== ~~Optimizations~~ ==

== Fixes and optimizations ==

~~Philippe McLean~~ and ~~R. Nigel Horspool~~ in their ~~paper [https://link~~.~~springer~~.~~com/content/pdf/10~~.~~1007%2F3-540-61053-7_68.pdf "A Faster Earley Parser"] combine Earley parsing with LR parsing and achieve an improvement in an order of magnitude~~.

In 1972, Alfred Aho and Jeffrey Ullman fix the empty-rule bug in Earley's algorithm, but their solution makes the algorithm even more costly to run.<ref>{{cite book

| last1 = Aho | first1 = Alfred | last2 = Ullman | first2 = Jeffrey

| title = The theory of parsing, translation, and compiling. Vol. 1.

| publisher = Prentice Hall

| year = 1972

| page = 321

}}</ref>

== ~~See also~~ ==

In 1991, Joop Leo achieves parsing in linear time and space for every LR(''k'') grammar.<ref name="Leo">{{citation

* [[~~CYK algorithm~~]]

| last = Leo | first = Joop M.I.M.

* [[Context-free grammar]]

| doi = 10.1016/0304-3975(91)90180-A

* [[List of algorithms#Parsing|~~Parsing algorithms]]~~

| issue = 1

| journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]]

| mr = 1112117

| pages = 165–176

| title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead

| volume = 82

| year = 1991

| doi-access = free

}}</ref>

== ~~Citations~~ ==

In 1996, Philippe McLean and R. Nigel Horspool, seemingly unaware of Leo's work, combine Earley parsing with LR parsing.<ref>{{cite journal

~~{{Reflist~~}}

| last1 = McLean | first1 = Philippe

| last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool

| title = A Faster Earley Parser

| journal = [[LNCS]]

| volume = 1060

| pages = 281-293

| year = 1996

| url = https://webhome.cs.uvic.ca/~nigelh/Publications/fastEarley.pdf

}}</ref>

~~== Other reference materials ==~~

In 2002, John Aycock and R. Nigel Horspool, still seemingly unaware of Leo's work, solve the empty-rule problem of Earley's algorithm without increasing its cost.<ref>{{cite journal

*{{cite journal

| last1 = Aycock | first1 = John

| last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool

Line 241:

Line 283:

| title = Practical Earley Parsing

| volume = 45

| year = 2002| citeseerx = 10.1.1.12.4254

| year = 2002

}}

| citeseerx = 10.1.1.12.4254

*{{~~citation~~

| url = https://webhome.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf

| ~~last~~ = ~~Leo~~ | ~~first~~ = ~~Joop M~~. I. M.

}}</ref>

| ~~doi~~ = 10.~~1016~~/~~0304~~-~~3975(91)90180~~-A

| ~~issue~~ = 1

In 2010, Jeffrey Kegler combines the 1991 and 2002 optimizations in his [[open source]] Marpa parser.<ref>{{cite web|title=Marpa is now O(n) for Right Recursions|date=2010-06-05|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/06/marpa-is-now-on-for-right-recursions.html|access-date=2026-05-05}}</ref><ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|date=2011-11-18|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> He later publishes a formal description of the Marpa recognizer.<ref>{{citation|title=Marpa, A practical general parser: the recognizer|url=https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer|date=2013-05-22|last=Kegler|first=Jeffrey}}</ref><ref>{{citation|title=Marpa, A practical general parser: the recognizer|edition=3rd|url=https://arxiv.org/abs/1910.08129|date=2023-01-27|last=Kegler|first=Jeffrey}}</ref> As of 2026, he claims that Marpa runs in linear time for:<ref>{{Cite web|title=The Marpa parser|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Marpa-web-site/|access-date=2026-05-05}}</ref>

| ~~journal~~ = ~~[[Theoretical Computer Science (journal)~~|~~Theoretical Computer Science]]~~

~~| mr~~ = ~~1112117~~

* All grammar classes that [[recursive descent]] parses.

| ~~pages~~ = ~~165–176~~

* The grammar class that the [[yacc]] family parses.

| title = A general ~~context~~-~~free parsing algorithm running~~ in linear time ~~on every LR(''k'') grammar without using lookahead~~

* Practically all unambiguous grammars.

| ~~volume~~ = 82

* Ambiguous grammars that are unions of a finite set of any of the above grammars.

| ~~year~~ = ~~1991~~

| ~~doi~~-access = free

== See also ==

}}

* [[CYK algorithm]]

* [[Context-free grammar]]

* [[List of algorithms#Parsing|Parsing algorithms]]

== Citations ==

== Other reference materials ==

*{{cite conference |first= Masaru|last= Tomita|title= LR parsers for natural languages |conference= 10th International Conference on Computational Linguistics |book-title= COLING|pages= 354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf}}

*{{cite conference|first=Masaru|last=Tomita|title=LR parsers for natural languages|conference=10th International Conference on Computational Linguistics|book-title=COLING|pages=354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf|url-status=dead|archive-url=https://web.archive.org/web/20180523172712/https://aclanthology.info/pdf/P/P84/P84-1073.pdf|archive-date=2018-05-23}}

[[Category:Parsing algorithms]]

[[Category:Dynamic programming]]

@@ Line 1: / Line 1: @@
 {{Short description|Algorithm for parsing context-free languages}}
 {{Infobox algorithm
-|name={{PAGENAMEBASE}}
+|name=Earley parser
-|class=[[Parsing]] grammars that are [[Context-free grammar|context-free]]
+|class=[[Parsing]], [[Context-free grammar|context-free]]
 |data=[[String (computer science)|String]]
-|time=<math>O(n^3)</math>
+|time={{plainlist|
-|best-time={{plainlist|
+* <math>O(n)</math> for non-right recursive LR(k) grammars
-* <math>\Omega(n)</math> for all [[deterministic context-free grammar]]s
+* <math>O(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]]
-* <math>\Omega(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]]
+* <math>O(n^3)</math> for all other context-free grammars
+}}
+|space={{plainlist|
+* <math>O(n)</math> for non-right recursive LR(k) grammars
+* <math>O(n^2)</math> for all other context-free grammars
+}}
+}}
+{{Infobox algorithm
+|name=Leo variant
+|class=[[Parsing]], [[Context-free grammar|context-free]]
+|data=[[String (computer science)|String]]
+|time={{plainlist|
+* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)
+* <math>O(n^3)</math> for all other context-free grammars
+}}
+|space={{plainlist|
+* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)
+* <math>O(n^2)</math> for all other context-free grammars
 }}
-|average-time=<math>\Theta(n^3)</math>
-|space=
 }}
-In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]], though (depending on the variant) it may suffer problems with certain nullable grammars.<ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> The algorithm, named after its inventor [[Jay Earley]], is a [[chart parser]] that uses [[dynamic programming]]; it is mainly used for parsing in [[computational linguistics]]. It was first introduced in his dissertation<ref name=Earley1>{{cite book
+In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]]. Named after its inventor [[Jay Earley]], it was first introduced in his dissertation in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley1">{{cite book
-  | last=Earley
+  | last = Earley | first = Jay
- | first=Jay
+  | title = An Efficient Context-Free Parsing Algorithm
-  | title=An Efficient Context-Free Parsing Algorithm
+  | year = 1968
-  | year=1968
+  | publisher = Carnegie-Mellon Dissertation
-  | publisher=Carnegie-Mellon Dissertation
+  | url = http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | url-status = dead
-  | url=http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
+  | archive-url = https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
- | access-date=2012-09-12
+  | archive-date = 2017-09-22
- | archive-date=2017-09-22
+  }}</ref><ref name="Earley2">{{citation
-  | archive-url=https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
+  | last = Earley | first = Jay
-  | url-status=dead
+  | doi = 10.1145/362007.362035
-  }}</ref> in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley2">{{citation
+ | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead
-  | last = Earley | first = Jay | author-link = Jay Earley
+ | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf
-  | doi = 10.1145/362007.362035 | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead | archive-date = 2004-07-08 | issue = 2
+ | archive-date = 2004-07-08
   | journal = [[Communications of the ACM]]
+ | issue = 2
   | pages = 94–102
   | title = An efficient context-free parsing algorithm
   | volume = 13
-  | year = 1970| s2cid = 47032707 }}</ref>
+  | year = 1970
+ | s2cid = 47032707
+}}</ref> It is a [[chart parser]] that uses [[dynamic programming]].
+Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages. However, the original Earley algorithm contained a bug, and its run time and memory use were high for the computers available at the time it was published. Earley parsing has since become a viable solution, as corrections and optimizations of the algorithm have been found and computers have become far more powerful.<ref>{{Cite web|title=Parsing: a timeline -- V3.1|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/personal/timeline_v3|date=2023-07-06|access-date=2026-05-05}}</ref>
+Earley parsers perform particularly well when the rules are written [[left recursion|left-recursively]]. Some grammars can be automatically rewritten so that the Earley algorithm parses the language they describe more efficiently.
-Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages.  The Earley parser executes in cubic time in the general case <math>{O}(n^3)</math>, where ''n'' is the length of the parsed string, quadratic time for [[unambiguous grammar]]s <math>{O}(n^2)</math>,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> and linear time for all [[deterministic context-free grammar]]s. It performs particularly well when the rules are written [[left recursion|left-recursively]].
+The worst-case costs of the original Earley algorithm are: <math>O(n)</math> linear time and space for non-right recursive LR(k) grammars, <math>O(n^2)</math> quadratic time and space for [[unambiguous grammar]]s,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> <math>O(n^3)</math> cubic time and <math>O(n^2)</math> quadratic space for all other context-free grammars, where <math>n</math> is the length of the parsed string.<ref name="Leo" />
 == Earley recogniser ==
 The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.
-== The algorithm ==
 In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol.
-Earley's algorithm is a top-down [[dynamic programming]] algorithm. In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
+In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
 Input position 0 is the position prior to input.  Input position ''n'' is the position after accepting the ''n''th token.  (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.)  For every input position, the parser generates a ''state set''.  Each state is a [[tuple]] (X → α • β, ''i''), consisting of
@@ Line 63: / Line 84: @@
 The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects.
-== Pseudocode ==
+=== Pseudocode ===
-Adapted from Speech and Language Processing<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,
+Adapted from ''Speech and Language Processing''<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,
 <syntaxhighlight lang="pascal">
@@ Line 197: / Line 218: @@
 == Constructing the parse forest ==
-Earley's dissertation<ref name=Earley3>{{cite book
+Earley's dissertation<ref>{{cite book
   | last=Earley
   | first=Jay
@@ Line 219: / Line 240: @@
 SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from.
-== Optimizations ==
+== Fixes and optimizations ==
-Philippe McLean and R. Nigel Horspool in their paper [https://link.springer.com/content/pdf/10.1007%2F3-540-61053-7_68.pdf "A Faster Earley Parser"] combine Earley parsing with LR parsing and achieve an improvement in an order of magnitude.
+In 1972, Alfred Aho and Jeffrey Ullman fix the empty-rule bug in Earley's algorithm, but their solution makes the algorithm even more costly to run.<ref>{{cite book
+ | last1 = Aho | first1 = Alfred | last2 = Ullman | first2 = Jeffrey
+ | title = The theory of parsing, translation, and compiling. Vol. 1.
+ | publisher = Prentice Hall
+ | year = 1972
+ | page = 321
+}}</ref>
-== See also ==
+In 1991, Joop Leo achieves parsing in linear time and space for every LR(''k'') grammar.<ref name="Leo">{{citation
-* [[CYK algorithm]]
+ | last = Leo | first = Joop M.I.M.
-* [[Context-free grammar]]
+ | doi = 10.1016/0304-3975(91)90180-A
-* [[List of algorithms#Parsing|Parsing algorithms]]
+ | issue = 1
+ | journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]]
+ | mr = 1112117
+ | pages = 165–176
+ | title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead
+ | volume = 82
+ | year = 1991
+ | doi-access = free
+}}</ref>
-== Citations ==
+In 1996, Philippe McLean and R. Nigel Horspool, seemingly unaware of Leo's work, combine Earley parsing with LR parsing.<ref>{{cite journal
-{{Reflist}}
+ | last1 = McLean | first1 = Philippe
+ | last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool
+ | title = A Faster Earley Parser
+ | journal = [[LNCS]]
+ | volume = 1060
+ | pages = 281-293
+ | year = 1996
+ | url = https://webhome.cs.uvic.ca/~nigelh/Publications/fastEarley.pdf
+}}</ref>
-== Other reference materials ==
+In 2002, John Aycock and R. Nigel Horspool, still seemingly unaware of Leo's work, solve the empty-rule problem of Earley's algorithm without increasing its cost.<ref>{{cite journal
-*{{cite journal
   | last1 = Aycock | first1 = John
   | last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool
@@ Line 241: / Line 283: @@
   | title = Practical Earley Parsing
   | volume = 45
-  | year = 2002| citeseerx = 10.1.1.12.4254
+  | year = 2002
-  }}
+ | citeseerx = 10.1.1.12.4254
-*{{citation
+ | url = https://webhome.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf
- | last = Leo | first = Joop M. I. M.
+  }}</ref>
- | doi = 10.1016/0304-3975(91)90180-A
- | issue = 1
+In 2010, Jeffrey Kegler combines the 1991 and 2002 optimizations in his [[open source]] Marpa parser.<ref>{{cite web|title=Marpa is now O(n) for Right Recursions|date=2010-06-05|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/06/marpa-is-now-on-for-right-recursions.html|access-date=2026-05-05}}</ref><ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|date=2011-11-18|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> He later publishes a formal description of the Marpa recognizer.<ref>{{citation|title=Marpa, A practical general parser: the recognizer|url=https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer|date=2013-05-22|last=Kegler|first=Jeffrey}}</ref><ref>{{citation|title=Marpa, A practical general parser: the recognizer|edition=3rd|url=https://arxiv.org/abs/1910.08129|date=2023-01-27|last=Kegler|first=Jeffrey}}</ref> As of 2026, he claims that Marpa runs in linear time for:<ref>{{Cite web|title=The Marpa parser|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Marpa-web-site/|access-date=2026-05-05}}</ref>
- | journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]]
- | mr = 1112117
+* All grammar classes that [[recursive descent]] parses.
- | pages = 165–176
+* The grammar class that the [[yacc]] family parses.
- | title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead
+* Practically all unambiguous grammars.
- | volume = 82
+* Ambiguous grammars that are unions of a finite set of any of the above grammars.
- | year = 1991
-| doi-access = free
+== See also ==
- }}
+* [[CYK algorithm]]
+* [[Context-free grammar]]
+* [[List of algorithms#Parsing|Parsing algorithms]]
+== Citations ==
+{{Reflist}}
+== Other reference materials ==
-*{{cite conference |first= Masaru|last= Tomita|title= LR parsers for natural languages |conference= 10th International Conference on Computational Linguistics |book-title= COLING|pages= 354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf}}
+*{{cite conference|first=Masaru|last=Tomita|title=LR parsers for natural languages|conference=10th International Conference on Computational Linguistics|book-title=COLING|pages=354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf|url-status=dead|archive-url=https://web.archive.org/web/20180523172712/https://aclanthology.info/pdf/P/P84/P84-1073.pdf|archive-date=2018-05-23}}
 {{parsers}}
 [[Category:Parsing algorithms]]
 [[Category:Dynamic programming]]

Earley parser: Difference between revisions

Navigation menu

Search