Chomsky hierarchy

From Wikipedia
Jump to navigation Jump to search
The Chomsky hierarchy
Set inclusions described by the Chomsky hierarchy

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a formal language's alphabet that are valid according to the language's syntax. The linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes (set inclusive).

History

[edit | edit source]

The general idea of a hierarchy of grammars was first described by Noam Chomsky in "Three models for the description of language" during the formalization of transformational-generative grammar (TGG).[1] Marcel-Paul Schützenberger also played a role in the development of the theory of formal languages; the paper "The algebraic theory of context free languages"[2] describes the modern hierarchy, including context-free grammars.[3]

Independently, alongside linguists, mathematicians were developing models of computation (via automata). Parsing a sentence in a language is similar to computation, and the grammars described by Chomsky proved to both resemble and be equivalent in computational power to various machine models.[4]

The hierarchy

[edit | edit source]

The following table summarizes each of Chomsky's four types of grammars, the class of language it generates, the type of automaton that recognizes it, and the form its rules must have. The classes are defined by the constraints on the productions rules.

Grammar Languages Recognizing automaton Production rules (constraints)[lower-alpha 1] Examples[5][6]
Type-3 Regular Finite-state automaton Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow \text{a}}
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow \text{a}B} (right regular)
or
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow \text{a}}
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow B\text{a}} (left regular)
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^n \mid n > 0\}}
Type-2 Context-free Non-deterministic pushdown automaton Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow \alpha} Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^n \mid n > 0\}}
Type-1 Context-sensitive Linear-bounded non-deterministic Turing machine Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha A \beta \rightarrow \alpha \gamma \beta} Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^nc^n \mid n > 0\}}
Type-0 Recursively enumerable Turing machine Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma \rightarrow \alpha} (Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} non-empty) Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{w \mid w} describes a terminating Turing machine Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \}}

Template:Noteslist

Note that the set of grammars corresponding to recursive languages is not a member of this hierarchy; these would be properly between Type-0 and Type-1.

Every regular language is context-free, every context-free language is context-sensitive, every context-sensitive language is recursive and every recursive language is recursively enumerable. These are all proper inclusions, meaning that there exist recursively enumerable languages that are not context-sensitive, context-sensitive languages that are not context-free and context-free languages that are not regular.[7]

Regular (Type-3) grammars

[edit | edit source]

Type-3 grammars generate the regular languages. Such a grammar restricts its rules to a single nonterminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed by a single nonterminal, in which case the grammar is right regular. Alternatively, all the rules can have their right-hand sides consist of a single terminal, possibly preceded by a single nonterminal (left regular). These generate the same languages. However, if left-regular rules and right-regular rules are combined, the language need no longer be regular. The rule Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S \rightarrow \varepsilon} is also allowed here if Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} does not appear on the right side of any rule. These languages are exactly all languages that can be decided by a finite-state automaton. Additionally, this family of formal languages can be obtained by regular expressions. Regular languages are commonly used to define search patterns and the lexical structure of programming languages.

For example, the regular language Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^n \mid n > 0\}} is generated by the Type-3 grammar Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G = (\{S\}, \{a\}, P, S)} with the productions Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P} being the following.

SaS
Sa

In linguistics, a language that is not regular is called supra-regular.[8][9]

Context-free (Type-2) grammars

[edit | edit source]

Type-2 grammars generate the context-free languages. These are defined by rules of the form Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A \rightarrow \alpha} with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} being a nonterminal and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} being a string of terminals and/or nonterminals. These languages are exactly all languages that can be recognized by a non-deterministic pushdown automaton. Context-free languages—or rather its subset of deterministic context-free languages—are the theoretical basis for the phrase structure of most programming languages, though their semantic analysis includes context-sensitive name resolution due to declarations and scope. Often a subset of grammars is used to make parsing easier, such as by an LL parser.

For example, the context-free language Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^n \mid n > 0\}} is generated by the Type-2 grammar Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G = (\{S\}, \{a, b\}, P, S)} with the productions Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P} being the following.

SaSb
Sab

The language is context-free but not regular (by the pumping lemma for regular languages).

Every context-free language can be generated by a grammar in Chomsky normal form.

Context-sensitive (Type-1) grammars

[edit | edit source]

Type-1 grammars generate context-sensitive languages. These grammars have rules of the form Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha A\beta \rightarrow \alpha\gamma\beta} with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} a nonterminal and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} , Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} strings of terminals and/or nonterminals. The strings Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} may be empty, but Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} must be nonempty. The rule Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S \rightarrow \epsilon} is allowed if Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} does not appear on the right side of any rule. The languages described by these grammars are exactly all languages that can be recognized by a linear bounded automaton (a nondeterministic Turing machine whose tape is bounded by a constant times the length of the input.)

For example, the context-sensitive language Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^nc^n \mid n > 0\}} is generated by the Type-1 grammar Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G = (\{S,A,B,C,W,Z\}, \{a, b, c\}, P, S)} with the productions Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P} being the following.

SaBC
SaSBC
CBCZ
CZWZ
WZWC
WCBC
aBab
bBbb
bCbc
cCcc

The language is context-sensitive but not context-free (by the pumping lemma for context-free languages). A proof that this grammar generates Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^nc^n \mid n > 0\}} is sketched in the article on Context-sensitive grammars.

Recursively enumerable (Type-0) grammars

[edit | edit source]

Type-0 grammars include all formal grammars. There are no constraints on the productions rules. They generate exactly all languages that can be recognized by a Turing machine, thus any language that is possible to be generated can be generated by a Type-0 grammar.[10] These languages are also known as the recursively enumerable or Turing-recognizable languages.[10] Note that this is different from the recursive languages, which can be decided by an always-halting Turing machine.

Natural languages

[edit | edit source]

When researching the position of natural language in the Chomsky hierarchy, it has been shown in the 1950s that natural language is not regular.[11] For instance, English contains center embedding constructions, which means that it cannot be regular.[8][9] To be more exact, the argument is that the language

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^mc^md^n \mid n,m > 0\}}

is not regular, which can be shown using the pumping lemma for context-free languages. In subsequent decades, it was shown that natural language is also not context-free, using the example of cross-serial dependencies in Swiss German.[12][13] In this case, the argument is that the language

Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle L = \{a^nb^mc^nd^m \mid n,m > 0\}}

is not context-free.

Citations

[edit | edit source]
  1. Chomsky 1956.
  2. Chomsky & Schützenberger 1963.
  3. Allott, Nicholas; Lohndal, Terje; Rey, Georges (27 April 2021). "Synoptic Introduction". A Companion to Chomsky. pp. 1–17. doi:10.1002/9781119598732.ch1. ISBN 9781119598701. S2CID 241301126.
  4. Kozen, Dexter C. (2007). Automata and computability. Undergraduate Texts in Computer Science. Springer. pp. 3–4. ISBN 978-0-387-94907-9.
  5. Geuvers, H.; Rot, J. (2016). "Applications, Chomsky hierarchy, and Recap" (PDF). Regular Languages. Archived (PDF) from the original on 2018-11-19.
  6. Sudkamp, Thomas A. (1997) [1988]. Languages and machines: An Introduction to the Theory of Computer Science. Reading, Massachusetts, USA: Addison Wesley Longman. p. 310. ISBN 978-0-201-82136-9.
  7. Chomsky, Noam (1963). "Chapter 12: Formal Properties of Grammars". In Luce, R. Duncan; Bush, Robert R.; Galanter, Eugene (eds.). Handbook of Mathematical Psychology. II. John Wiley and Sons, Inc. pp. 323–418.
  8. 8.0 8.1 Jäger, Gerhard; Rogers, James (2012). "Formal language theory: refining the Chomsky hierarchy". Philosophical Transactions of the Royal Society B. 367: 1956–1970. doi:10.1098/rstb.2012.00777.
  9. 9.0 9.1 Fitch, W. Tecumseh; Friederici, Angela D. (2012). "Artificial grammar learning meets formal language theory: an overview". Philosophical Transactions of the Royal Society B. 367: 1933–1955. doi:10.1098/rstb.2012.0103.
  10. 10.0 10.1 Sipser, Michael (1997). Introduction to the Theory of Computation (1st ed.). Cengage Learning. p. 130. ISBN 0-534-94728-X. The Church-Turing Thesis
  11. Chomsky, Noam (1957). Syntactic Structures. Mouton & Co.
  12. Huijbregts, Riny (1984). "The weak inadequacy of context-free phrase structure grammars". In de Haan, Geer; Trommele, Mieke; Zonnefeld, Wim (eds.). Van Periferie naar Kern. Foris. pp. 81–99.
  13. Shieber, Stuart M. (1985). "Evidence against the context-freeness of natural language". Linguistics and Philosophy. 8: 333–343. doi:10.1007/BF00630917.

References

[edit | edit source]

Template:Formal languages and grammars Template:Noam Chomsky
Cite error: There are <ref group=lower-alpha> tags or {{efn}} templates on this page, but the references will not show without a {{reflist|group=lower-alpha}} template or {{notelist}} template (see the help page).