Earley parser: Difference between revisions
Jump to navigation
Jump to search
imported>Progenitor Eri m Returning lede to pre-13 March 2018 form which was grammatically correct but unduly changed |
imported>Changaco tweak one sentence to improve readability, fix Marpa cost claims and switch the reference to the primary source |
||
| Line 1: | Line 1: | ||
{{Short description|Algorithm for parsing context-free languages}} | {{Short description|Algorithm for parsing context-free languages}} | ||
{{Infobox algorithm | {{Infobox algorithm | ||
|name= | |name=Earley parser | ||
|class=[[Parsing]] | |class=[[Parsing]], [[Context-free grammar|context-free]] | ||
|data=[[String (computer science)|String]] | |data=[[String (computer science)|String]] | ||
|time=<math>O(n^3)</math> | |time={{plainlist| | ||
| | * <math>O(n)</math> for non-right recursive LR(k) grammars | ||
* <math> | * <math>O(n^2)</math> for [[Ambiguous grammar|unambiguous grammars]] | ||
* <math> | * <math>O(n^3)</math> for all other context-free grammars | ||
}} | |||
|space={{plainlist| | |||
* <math>O(n)</math> for non-right recursive LR(k) grammars | |||
* <math>O(n^2)</math> for all other context-free grammars | |||
}} | |||
}} | |||
{{Infobox algorithm | |||
|name=Leo variant | |||
|class=[[Parsing]], [[Context-free grammar|context-free]] | |||
|data=[[String (computer science)|String]] | |||
|time={{plainlist| | |||
* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones) | |||
* <math>O(n^3)</math> for all other context-free grammars | |||
}} | |||
|space={{plainlist| | |||
* <math>O(n)</math> for every LR-regular grammar and some others (including some ambiguous ones) | |||
* <math>O(n^2)</math> for all other context-free grammars | |||
}} | }} | ||
}} | }} | ||
In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]] | In [[computer science]], the '''Earley parser''' is an [[algorithm]] for [[parsing]] [[String (computer science)|strings]] that belong to a given [[context-free language]]. Named after its inventor [[Jay Earley]], it was first introduced in his dissertation in 1968 (and later appeared in abbreviated, more legible form in a journal).<ref name="Earley1">{{cite book | ||
| last=Earley | | last = Earley | first = Jay | ||
| title = An Efficient Context-Free Parsing Algorithm | |||
| title=An Efficient Context-Free Parsing Algorithm | | year = 1968 | ||
| year=1968 | | publisher = Carnegie-Mellon Dissertation | ||
| publisher=Carnegie-Mellon Dissertation | | url = http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | url-status = dead | ||
| url=http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | | archive-url = https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | ||
| archive-date = 2017-09-22 | |||
}}</ref><ref name="Earley2">{{citation | |||
| archive-url=https://web.archive.org/web/20170922004954/http://reports-archive.adm.cs.cmu.edu/anon/anon/usr/ftp/scan/CMU-CS-68-earley.pdf | | last = Earley | first = Jay | ||
| | | doi = 10.1145/362007.362035 | ||
}}</ref> | | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | url-status = dead | ||
| last = Earley | first = Jay | | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | ||
| doi = 10.1145/362007.362035 | url = http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | archive-url = https://web.archive.org/web/20040708052627/http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-55/lti/Courses/711/Class-notes/p94-earley.pdf | | archive-date = 2004-07-08 | ||
| journal = [[Communications of the ACM]] | | journal = [[Communications of the ACM]] | ||
| issue = 2 | |||
| pages = 94–102 | | pages = 94–102 | ||
| title = An efficient context-free parsing algorithm | | title = An efficient context-free parsing algorithm | ||
| volume = 13 | | volume = 13 | ||
| year = 1970| s2cid = 47032707 }}</ref> | | year = 1970 | ||
| s2cid = 47032707 | |||
}}</ref> It is a [[chart parser]] that uses [[dynamic programming]]. | |||
Earley parsers are appealing because they can parse all context-free languages, unlike [[LR parser]]s and [[LL parser]]s, which are more typically used in [[compiler]]s but which can only handle restricted classes of languages. However, the original Earley algorithm contained a bug, and its run time and memory use were high for the computers available at the time it was published. Earley parsing has since become a viable solution, as corrections and optimizations of the algorithm have been found and computers have become far more powerful.<ref>{{Cite web|title=Parsing: a timeline -- V3.1|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/personal/timeline_v3|date=2023-07-06|access-date=2026-05-05}}</ref> | |||
Earley parsers perform particularly well when the rules are written [[left recursion|left-recursively]]. Some grammars can be automatically rewritten so that the Earley algorithm parses the language they describe more efficiently. | |||
The worst-case costs of the original Earley algorithm are: <math>O(n)</math> linear time and space for non-right recursive LR(k) grammars, <math>O(n^2)</math> quadratic time and space for [[unambiguous grammar]]s,<ref>{{cite book | isbn=978-0-201-02988-8 | author=John E. Hopcroft and Jeffrey D. Ullman | title=Introduction to Automata Theory, Languages, and Computation | location=Reading/MA | publisher=Addison-Wesley | year=1979 | url-access=registration | url=https://archive.org/details/introductiontoau00hopc }} p.145</ref> <math>O(n^3)</math> cubic time and <math>O(n^2)</math> quadratic space for all other context-free grammars, where <math>n</math> is the length of the parsed string.<ref name="Leo" /> | |||
== Earley recogniser == | == Earley recogniser == | ||
The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser. | The following algorithm describes the Earley recogniser. The recogniser can be modified to create a parse tree as it recognises, and in that way can be turned into a parser. | ||
In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol. | In the following descriptions, α, β, and γ represent any [[string (computer science)|string]] of [[Terminal and nonterminal symbols|terminals/nonterminals]] (including the [[empty string]]), X and Y represent single nonterminals, and ''a'' represents a terminal symbol. | ||
In the following, we use Earley's dot notation: given a [[Formal grammar#The syntax of grammars|production]] X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected. | |||
Input position 0 is the position prior to input. Input position ''n'' is the position after accepting the ''n''th token. (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.) For every input position, the parser generates a ''state set''. Each state is a [[tuple]] (X → α • β, ''i''), consisting of | Input position 0 is the position prior to input. Input position ''n'' is the position after accepting the ''n''th token. (Informally, input positions can be thought of as locations at [[Lexical analysis|token]] boundaries.) For every input position, the parser generates a ''state set''. Each state is a [[tuple]] (X → α • β, ''i''), consisting of | ||
| Line 63: | Line 84: | ||
The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects. | The algorithm accepts if (X → γ •, 0) ends up in S(''n''), where (X → γ) is the top level-rule and ''n'' the input length, otherwise it rejects. | ||
== Pseudocode == | === Pseudocode === | ||
Adapted from Speech and Language Processing<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin, | Adapted from ''Speech and Language Processing''<ref name=Jurafsky>{{cite book|last=Jurafsky|first=D.|title=Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition|year=2009|publisher=Pearson Prentice Hall|isbn=9780131873216|url=https://books.google.com/books?id=fZmj5UNK8AQC}}</ref> by [[Daniel Jurafsky]] and James H. Martin, | ||
<syntaxhighlight lang="pascal"> | <syntaxhighlight lang="pascal"> | ||
| Line 197: | Line 218: | ||
== Constructing the parse forest == | == Constructing the parse forest == | ||
Earley's dissertation<ref | Earley's dissertation<ref>{{cite book | ||
| last=Earley | | last=Earley | ||
| first=Jay | | first=Jay | ||
| Line 219: | Line 240: | ||
SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from. | SPPF nodes are never labeled with a completed LR(0) item: instead they are labelled with the symbol that is produced so that all derivations are combined under one node regardless of which alternative production they come from. | ||
== | == Fixes and optimizations == | ||
In 1972, Alfred Aho and Jeffrey Ullman fix the empty-rule bug in Earley's algorithm, but their solution makes the algorithm even more costly to run.<ref>{{cite book | |||
| last1 = Aho | first1 = Alfred | last2 = Ullman | first2 = Jeffrey | |||
| title = The theory of parsing, translation, and compiling. Vol. 1. | |||
| publisher = Prentice Hall | |||
| year = 1972 | |||
| page = 321 | |||
}}</ref> | |||
== | In 1991, Joop Leo achieves parsing in linear time and space for every LR(''k'') grammar.<ref name="Leo">{{citation | ||
| last = Leo | first = Joop M.I.M. | |||
| doi = 10.1016/0304-3975(91)90180-A | |||
| issue = 1 | |||
| journal = [[Theoretical Computer Science (journal)|Theoretical Computer Science]] | |||
| mr = 1112117 | |||
| pages = 165–176 | |||
| title = A general context-free parsing algorithm running in linear time on every LR(''k'') grammar without using lookahead | |||
| volume = 82 | |||
| year = 1991 | |||
| doi-access = free | |||
}}</ref> | |||
== | In 1996, Philippe McLean and R. Nigel Horspool, seemingly unaware of Leo's work, combine Earley parsing with LR parsing.<ref>{{cite journal | ||
| last1 = McLean | first1 = Philippe | |||
| last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool | |||
| title = A Faster Earley Parser | |||
| journal = [[LNCS]] | |||
| volume = 1060 | |||
| pages = 281-293 | |||
| year = 1996 | |||
| url = https://webhome.cs.uvic.ca/~nigelh/Publications/fastEarley.pdf | |||
}}</ref> | |||
In 2002, John Aycock and R. Nigel Horspool, still seemingly unaware of Leo's work, solve the empty-rule problem of Earley's algorithm without increasing its cost.<ref>{{cite journal | |||
| last1 = Aycock | first1 = John | | last1 = Aycock | first1 = John | ||
| last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool | | last2 = Horspool | first2 = R. Nigel | author2-link = Nigel Horspool | ||
| Line 241: | Line 283: | ||
| title = Practical Earley Parsing | | title = Practical Earley Parsing | ||
| volume = 45 | | volume = 45 | ||
| year = 2002| citeseerx = 10.1.1.12.4254 | | year = 2002 | ||
}} | | citeseerx = 10.1.1.12.4254 | ||
| url = https://webhome.cs.uvic.ca/~nigelh/Publications/PracticalEarleyParsing.pdf | |||
}}</ref> | |||
In 2010, Jeffrey Kegler combines the 1991 and 2002 optimizations in his [[open source]] Marpa parser.<ref>{{cite web|title=Marpa is now O(n) for Right Recursions|date=2010-06-05|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2010/06/marpa-is-now-on-for-right-recursions.html|access-date=2026-05-05}}</ref><ref>{{cite web|last=Kegler|first=Jeffrey|title=What is the Marpa algorithm?|date=2011-11-18|url=http://blogs.perl.org/users/jeffrey_kegler/2011/11/what-is-the-marpa-algorithm.html|access-date=20 August 2013}}</ref> He later publishes a formal description of the Marpa recognizer.<ref>{{citation|title=Marpa, A practical general parser: the recognizer|url=https://www.academia.edu/10341474/Marpa_A_practical_general_parser_the_recognizer|date=2013-05-22|last=Kegler|first=Jeffrey}}</ref><ref>{{citation|title=Marpa, A practical general parser: the recognizer|edition=3rd|url=https://arxiv.org/abs/1910.08129|date=2023-01-27|last=Kegler|first=Jeffrey}}</ref> As of 2026, he claims that Marpa runs in linear time for:<ref>{{Cite web|title=The Marpa parser|author=Jeffrey Kegler|url=https://jeffreykegler.github.io/Marpa-web-site/|access-date=2026-05-05}}</ref> | |||
* All grammar classes that [[recursive descent]] parses. | |||
* The grammar class that the [[yacc]] family parses. | |||
* Practically all unambiguous grammars. | |||
* Ambiguous grammars that are unions of a finite set of any of the above grammars. | |||
| | == See also == | ||
* [[CYK algorithm]] | |||
* [[Context-free grammar]] | |||
* [[List of algorithms#Parsing|Parsing algorithms]] | |||
== Citations == | |||
{{Reflist}} | |||
== Other reference materials == | |||
*{{cite conference |first= Masaru|last= Tomita|title= LR parsers for natural languages |conference= 10th International Conference on Computational Linguistics |book-title= COLING|pages= 354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf}} | *{{cite conference|first=Masaru|last=Tomita|title=LR parsers for natural languages|conference=10th International Conference on Computational Linguistics|book-title=COLING|pages=354–357|year= 1984|url=https://aclanthology.info/pdf/P/P84/P84-1073.pdf|url-status=dead|archive-url=https://web.archive.org/web/20180523172712/https://aclanthology.info/pdf/P/P84/P84-1073.pdf|archive-date=2018-05-23}} | ||
{{parsers}} | {{parsers}} | ||
[[Category:Parsing algorithms]] | [[Category:Parsing algorithms]] | ||
[[Category:Dynamic programming]] | [[Category:Dynamic programming]] | ||