Earley parser: Difference between revisions

Jump to navigation Jump to search
imported>Progenitor Eri
m Returning lede to pre-13 March 2018 form which was grammatically correct but unduly changed
 
imported>Changaco
tweak one sentence to improve readability, fix Marpa cost claims and switch the reference to the primary source
 
Line 1: Line 1:
{{Short description|Algorithm for parsing context-free languages}}
{{Short description|Algorithm for parsing context-free languages}}
{{Infobox algorithm
{{Infobox algorithm
|name={{PAGENAMEBASE}}
|name=Earley parser
|class=[[Parsing]] grammars that are [[Context-free grammar|context-free]]
|class=[[Parsing]], [[Context-free grammar|context-free]]
|data=[[String (computer science)|String]]
|data=[[String (computer science)|String]]
|time=<math>O(n^3)</math>
|time={{plainlist|
|best-time={{plainlist|
* <math>O(n)</math> for non-right recursive LR(k) grammars
* <math>\Omega(n)</math> for all [[deterministic context-free grammar]]s
* <math>O(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]]
* <math>\Omega(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]]
* <math>O(n^3)</math> for all other context-free grammars
}}
|space={{plainlist|
* <math>O(n)</math> for non-right recursive LR(k) grammars
* <math>O(n^2)</math> for all other context-free grammars
}}
}}
{{Infobox algorithm
|name=Leo variant
|class=[[Parsing]], [[Context-free grammar|context-free]]
|data=[[String (computer science)|String]]
|time={{plainlist|
* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)
* <math>O(n^3)</math> for all other context-free grammars
}}
|space={{plainlist|
* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones)
* <math>O(n^2)</math> for all other context-free grammars
}}
}}
|average-time=<math>\Theta(n^3)</math>
|space=
}}
}}


In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]], though (depending on the variant) it may suffer problems with certain nullable grammars.<ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> The algorithm, named after its inventor [[Jay Earley]], is a [[chart parser]] that uses [[dynamic programming]]; it is mainly used for parsing in [[computational linguistics]]. It was first introduced in his dissertation<ref name=Earley1>{{cite book
In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]]. Named after its inventor [[Jay Earley]], it was first introduced in his dissertation in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley1">{{cite book
  | last=Earley
  | last = Earley | first = Jay
| first=Jay
  | title = An Efficient Context-Free Parsing Algorithm
  | title=An Efficient Context-Free Parsing Algorithm
  | year = 1968
  | year=1968
  | publisher = Carnegie-Mellon Dissertation
  | publisher=Carnegie-Mellon Dissertation
  | url = http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | url-status = dead
  | url=http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
  | archive-url = https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
| access-date=2012-09-12
  | archive-date = 2017-09-22
| archive-date=2017-09-22
  }}</ref><ref name="Earley2">{{citation
  | archive-url=https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf
  | last = Earley | first = Jay
  | url-status=dead
  | doi = 10.1145/362007.362035
  }}</ref> in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley2">{{citation
| url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead
  | last = Earley | first = Jay | author-link = Jay Earley
| archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf
  | doi = 10.1145/362007.362035 | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead | archive-date = 2004-07-08 | issue = 2
| archive-date = 2004-07-08
  | journal = [[Communications of the ACM]]
  | journal = [[Communications of the ACM]]
| issue = 2
  | pages = 94–102
  | pages = 94–102
  | title = An efficient context-free parsing algorithm
  | title = An efficient context-free parsing algorithm
  | volume = 13
  | volume = 13
  | year = 1970| s2cid = 47032707 }}</ref>
  | year = 1970
| s2cid = 47032707
}}</ref> It is a [[chart parser]] that uses [[dynamic programming]].
 
Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages. However, the original Earley algorithm contained a bug, and its run time and memory use were high for the computers available at the time it was published. Earley parsing has since become a viable solution, as corrections and optimizations of the algorithm have been found and computers have become far more powerful.<ref>{{Cite web|title=Parsing: a timeline -- V3.1|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/personal/timeline_v3|date=2023-07-06|access-date=2026-05-05}}</ref>
 
Earley parsers perform particularly well when the rules are written [[left recursion|left-recursively]]. Some grammars can be automatically rewritten so that the Earley algorithm parses the language they describe more efficiently.


Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages.  The Earley parser executes in cubic time in the general case <math>{O}(n^3)</math>, where ''n'' is the length of the parsed string, quadratic time for [[unambiguous grammar]]s <math>{O}(n^2)</math>,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> and linear time for all [[deterministic context-free grammar]]s. It performs particularly well when the rules are written [[left recursion|left-recursively]].
The worst-case costs of the original Earley algorithm are: <math>O(n)</math> linear time and space for non-right recursive LR(k) grammars, <math>O(n^2)</math> quadratic time and space for [[unambiguous grammar]]s,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> <math>O(n^3)</math> cubic time and <math>O(n^2)</math> quadratic space for all other context-free grammars, where <math>n</math> is the length of the parsed string.<ref name="Leo" />


== Earley recogniser ==
== Earley recogniser ==
The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.
The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser.


== The algorithm ==
In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol.
In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol.


Earley's algorithm is a top-down [[dynamic programming]] algorithm. In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.


Input position 0 is the position prior to input.  Input position ''n'' is the position after accepting the ''n''th token.  (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.)  For every input position, the parser generates a ''state set''.  Each state is a [[tuple]] (X → α • β, ''i''), consisting of
Input position 0 is the position prior to input.  Input position ''n'' is the position after accepting the ''n''th token.  (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.)  For every input position, the parser generates a ''state set''.  Each state is a [[tuple]] (X → α • β, ''i''), consisting of
Line 63: Line 84:
The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects.
The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects.


== Pseudocode ==
=== Pseudocode ===
Adapted from Speech and Language Processing<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,  
Adapted from ''Speech and Language Processing''<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin,  


<syntaxhighlight lang="pascal">
<syntaxhighlight lang="pascal">
Line 197: Line 218:


== Constructing the parse forest ==
== Constructing the parse forest ==
Earley's dissertation<ref name=Earley3>{{cite book
Earley's dissertation<ref>{{cite book
  | last=Earley
  | last=Earley
  | first=Jay
  | first=Jay
Line 219: Line 240:
SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from.
SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from.


== Optimizations ==
== Fixes and optimizations ==


Philippe McLean and R. Nigel Horspool in their paper [https://link.springer.com/content/pdf/10.1007%2F3-540-61053-7_68.pdf "A Faster Earley Parser"] combine Earley parsing with LR parsing and achieve an improvement in an order of magnitude.
In 1972, Alfred Aho and Jeffrey Ullman fix the empty-rule bug in Earley's algorithm, but their solution makes the algorithm even more costly to run.<ref>{{cite book
| last1 = Aho | first1 = Alfred | last2 = Ullman | first2 = Jeffrey
| title = The theory of parsing, translation, and compiling. Vol. 1.
| publisher = Prentice Hall
| year = 1972
| page = 321
}}</ref>


== See also ==
In 1991, Joop Leo achieves parsing in linear time and space for every LR(''k'') grammar.<ref name="Leo">{{citation
* [[CYK algorithm]]
| last = Leo | first = Joop M.I.M.
* [[Context-free grammar]]
| doi = 10.1016/0304-3975(91)90180-A
* [[List of algorithms#Parsing|Parsing algorithms]]
| issue = 1
| journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]]
| mr = 1112117
| pages = 165–176
| title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead
| volume = 82
| year = 1991
| doi-access = free
}}</ref>


== Citations ==
In 1996, Philippe McLean and R. Nigel Horspool, seemingly unaware of Leo's work, combine Earley parsing with LR parsing.<ref>{{cite journal
{{Reflist}}
| last1 = McLean | first1 = Philippe
| last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool
| title = A Faster Earley Parser
| journal = [[LNCS]]
| volume = 1060
| pages = 281-293
| year = 1996
| url = https://webhome.cs.uvic.ca/~nigelh/Publications/fastEarley.pdf
}}</ref>


== Other reference materials ==
In 2002, John Aycock and R. Nigel Horspool, still seemingly unaware of Leo's work, solve the empty-rule problem of Earley's algorithm without increasing its cost.<ref>{{cite journal
*{{cite journal
  | last1 = Aycock | first1 = John
  | last1 = Aycock | first1 = John
  | last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool
  | last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool
Line 241: Line 283:
  | title = Practical Earley Parsing
  | title = Practical Earley Parsing
  | volume = 45
  | volume = 45
  | year = 2002| citeseerx = 10.1.1.12.4254
  | year = 2002
  }}
| citeseerx = 10.1.1.12.4254
*{{citation
| url = https://webhome.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf
| last = Leo | first = Joop M. I. M.
  }}</ref>
| doi = 10.1016/0304-3975(91)90180-A
 
| issue = 1
In 2010, Jeffrey Kegler combines the 1991 and 2002 optimizations in his [[open source]] Marpa parser.<ref>{{cite web|title=Marpa is now O(n) for Right Recursions|date=2010-06-05|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/06/marpa-is-now-on-for-right-recursions.html|access-date=2026-05-05}}</ref><ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|date=2011-11-18|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> He later publishes a formal description of the Marpa recognizer.<ref>{{citation|title=Marpa, A practical general parser: the recognizer|url=https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer|date=2013-05-22|last=Kegler|first=Jeffrey}}</ref><ref>{{citation|title=Marpa, A practical general parser: the recognizer|edition=3rd|url=https://arxiv.org/abs/1910.08129|date=2023-01-27|last=Kegler|first=Jeffrey}}</ref> As of 2026, he claims that Marpa runs in linear time for:<ref>{{Cite web|title=The Marpa parser|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Marpa-web-site/|access-date=2026-05-05}}</ref>
| journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]]
 
| mr = 1112117
* All grammar classes that [[recursive descent]] parses.
| pages = 165–176
* The grammar class that the [[yacc]] family parses.
| title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead
* Practically all unambiguous grammars.
| volume = 82
* Ambiguous grammars that are unions of a finite set of any of the above grammars.
| year = 1991
 
| doi-access = free
== See also ==
}}
* [[CYK algorithm]]
* [[Context-free grammar]]
* [[List of algorithms#Parsing|Parsing algorithms]]
 
== Citations ==
{{Reflist}}
 
== Other reference materials ==


*{{cite conference |first= Masaru|last= Tomita|title= LR parsers for natural languages |conference= 10th International Conference on Computational Linguistics |book-title= COLING|pages= 354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf}}
*{{cite conference|first=Masaru|last=Tomita|title=LR parsers for natural languages|conference=10th International Conference on Computational Linguistics|book-title=COLING|pages=354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf|url-status=dead|archive-url=https://web.archive.org/web/20180523172712/https://aclanthology.info/pdf/P/P84/P84-1073.pdf|archive-date=2018-05-23}}


{{parsers}}
{{parsers}}
[[Category:Parsing algorithms]]
[[Category:Parsing algorithms]]
[[Category:Dynamic programming]]
[[Category:Dynamic programming]]