Character encodings in HTML: Difference between revisions

From Wikipedia
imported>Ejazz128
m External links: i have removed the broken link
 
imported>Hairy Dude
 
Line 5: Line 5:
{{Html series}}
{{Html series}}
While Hypertext Markup Language ([[HTML]]) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]], two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.
While Hypertext Markup Language ([[HTML]]) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international [[character (computing)|character]]s were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit [[ASCII]], two goals are worth considering: the information's [[integrity]], and universal [[Web browser|browser]] display.
In version 5.3 of the now retired W3C specification, and the current Living Standard published by WHATWG, the only valid encoding is [[UTF-8]].<ref name="W3C5.3">{{cite book |chapter-url=https://www.w3.org/TR/2021/NOTE-html53-20210128/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML 5.3 |publisher=[[World Wide Web Consortium]] |date=28 January 2021 |access-date=2026-01-06}}</ref><ref name="html5charset">{{cite book |chapter-url=https://html.spec.whatwg.org/multipage/semantics.html#charset |chapter=Specifying the document's character encoding |title=HTML Standard |publisher=[[WHATWG]] |date=17 December 2025 |access-date=2026-01-06}}</ref>


==Specifying the document's character encoding==
==Specifying the document's character encoding==
There are two general ways to specify which character encoding is used in the document.
There are two general ways to specify which character encoding is used in the document.


First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{citation |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>
First, the [[web server]] can include the character encoding or "<code>charset</code>" in the [[HTTP|Hypertext Transfer Protocol]] (HTTP) <code>Content-Type</code> header, which would typically look like this:<ref>{{cite book |chapter-url=http://tools.ietf.org/html/rfc7231#section-3.1.1.5|chapter=Content-Type |title=Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content|publisher=[[IETF]] |date=June 2014 |doi=10.17487/RFC7231 |access-date=2014-07-30|editor-last1=Fielding |editor-last2=Reschke |editor-first1=R |editor-first2=J |last1=Fielding |first1=R. |last2=Reschke |first2=J. |s2cid=14399078 }}</ref>
<syntaxhighlight lang="http">
<syntaxhighlight lang="http">
Content-Type: text/html; charset=utf-8
Content-Type: text/html; charset=utf-8
Line 17: Line 20:
Second, a declaration can be included within the document itself.
Second, a declaration can be included within the document itself.


For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name=html5charset/>
For HTML it is possible to include this information inside the <code>head</code> element near the top of the document:<ref name="html5charset"/>
<!-- Please don't add a closing "/": that is incorrect here. -->
<!-- Please don't add a closing "/": that is incorrect here. -->
<syntaxhighlight lang="html">
<syntaxhighlight lang="html">
Line 23: Line 26:
</syntaxhighlight>
</syntaxhighlight>


[[HTML5]] also allows the following syntax to mean exactly the same:<ref name=html5charset>{{citation |chapter-url=http://www.w3.org/TR/html5/document-metadata.html#specifying-the-documents-character-encoding |chapter=Specifying the document's character encoding |title=HTML5 |publisher=[[World Wide Web Consortium]] |date=14 December 2017 |access-date=2018-05-28}}</ref>
[[HTML5]] also allows the following syntax to mean exactly the same:<ref name="html5charset"/>
<!-- Please don't add a closing "/": that is unnecessary here. -->
<!-- Please don't add a closing "/": that is unnecessary here. -->
<syntaxhighlight lang="html">
<syntaxhighlight lang="html">
Line 35: Line 38:


With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
With this second approach, because the character encoding cannot be known until the declaration is parsed, there is a problem knowing which character encoding is used in the document up to and including the declaration itself. If the character encoding is an [[ASCII extension]] then the content up to and including the declaration itself should be pure ASCII and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as [[UTF-16BE]] and [[UTF-16LE]], a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.
Although HTML written to the current Living Standard is required to be UTF-8, an encoding declaration, in any of the above forms, is nonetheless required. It must be a case-insensitive match for the string "utf-8" and the document must, in fact, be in UTF-8.<ref name="html5charset"/><ref name="W3C5.3"/>


===Encoding detection algorithm===
===Encoding detection algorithm===
As of HTML5 the recommended charset is [[UTF-8]].<ref name=html5charset/> An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:
 
# Explicit user instruction
# Explicit user instruction
# An explicit meta tag within the first 1024 bytes of the document
# An explicit meta tag within the first 1024 bytes of the document
Line 44: Line 50:
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.
# Analysis of the document bytes looking for specific sequences or ranges of byte values,<ref>{{cite web| url = http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding| title = HTML5 prescan a byte stream to determine its encoding}}</ref> and other tentative detection mechanisms.


Characters outside of the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for [[English language|English]]-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override ''incorrect'' charset label manually as well.
Characters outside of the printable ASCII range (32 to 126) may appear incorrectly if the document is served with an incorrect character encoding. This presents few problems for [[English language|English]]-speaking users, but other languages regularly—in some cases, always—require characters outside that range. In Chinese, Japanese, and Korean ([[CJK characters|CJK]]) language environments where there are several different multi-byte encodings in use, auto-detection is also often employed. Finally, browsers usually permit the user to override ''incorrect'' charset label manually as well.


It is increasingly common for multilingual websites and websites in non-Western languages to use [[UTF-8]], which allows use of the same encoding for all languages. [[UTF-16]] or [[UTF-32]], which can be used for all languages as well, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
[[UTF-8]] has been the most common character encoding on the Web since 2008, in part because, as an encoding of [[Unicode]], it allows use of the same encoding for all languages. <!--Sentence and source copied from [[UTF-8#Implementations and adoption]]:-->{{As of|2026|01}}, UTF-8 is used by 98.9% of web sites surveyed by W3Techs.<ref name=W3TechsWebEncoding>{{Cite web|url=https://w3techs.com/technologies/cross/character_encoding/ranking |title=Usage Survey of Character Encodings broken down by Ranking |website=W3Techs |language=en |date=January 2026 |access-date=2026-01-03}}</ref> [[UTF-16]] or [[UTF-32]], other encodings of Unicode, are less widely used because they can be harder to handle in programming languages that assume a [[byte-oriented]] ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.


Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.


==Permitted encodings==
==Permitted encodings==
The [[WHATWG]] Encoding Standard, referenced by recent HTML standards (the current WHATWG HTML Living Standard, as well as the formerly competing [[W3C]] HTML 5.0 and 5.1) specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.<ref name="html51">{{Cite web |url=https://www.w3.org/TR/html51/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5.1 Standard |publisher=W3C}}</ref><ref name="html50">{{Cite web |url=https://www.w3.org/TR/html5/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5 Standard |publisher=W3C}}</ref><ref name="html5living">{{Cite web |url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings |title=12.2.3.3 Character encodings |website=HTML Living Standard |publisher=WHATWG}}</ref> The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use [[UTF-8]] exclusively.<ref name="namesandlabels"/>
Version 5.3 of the retired W3C standard and the current ({{as of|2026|lc=y}}) WHATWG Living Standard both require UTF-8. No other encoding is considered valid.<ref name="W3C5.3"/><ref name="html5charset"/> Nonetheless, implementations must use the encoding sniffing algorithm to determine which encoding to apply to the document, in accordance with the [[robustness principle]].
 
The [[WHATWG]] Encoding Standard, referenced by both standards, specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.<ref name="html51">{{Cite web |url=https://www.w3.org/TR/html51/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5.1 Standard |publisher=W3C}}</ref><ref name="html50">{{Cite web |url=https://www.w3.org/TR/html5/syntax.html#character-encodings |title=8.2.2.3. Character encodings |website=HTML 5 Standard |publisher=W3C}}</ref><ref name="html5living">{{Cite web |url=https://html.spec.whatwg.org/multipage/parsing.html#character-encodings |title=12.2.3.3 Character encodings |website=HTML Living Standard |publisher=WHATWG}}</ref> The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use [[UTF-8]] exclusively.<ref name="namesandlabels"/>


Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:<ref name="html5living"/>
Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:<ref name="html5living"/>
{{columns-list|colwidth=12em|
{{columns-list|colwidth=12em|
* [[ISO-8859-2]]
* [[ISO-8859-2]]
* [[ISO-8859-7]]
* [[ISO-8859-7]]{{efn|name=not5.3|Omitted from W3C version 5.3.}}
* [[ISO-8859-8]]
* [[ISO-8859-8]]
* [[Windows-874]]{{efn|Also specified for <code>[[TIS-620]]</code>, <code>[[ISO-8859-11]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-874]]{{efn|Also specified for <code>[[TIS-620]]</code>, <code>[[ISO-8859-11]]</code> and related labels.<ref name="namesandlabels"/>}}{{efn|name=not5.3}}
* [[Windows-1250]]
* [[Windows-1250]]
* [[Windows-1251]]
* [[Windows-1251]]
* [[Windows-1252]]{{efn|Also specified for <code>[[ASCII]]</code>, <code>[[ISO-8859-1]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1252]]{{efn|Also specified for <code>[[ASCII]]</code>, <code>[[ISO-8859-1]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1254]]{{efn|Also specified for <code>[[ISO-8859-9]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1254]]{{efn|Also specified for <code>[[ISO-8859-9]]</code> and related labels.<ref name="namesandlabels"/>}}
* [[Windows-1255]]
* [[Windows-1255]]{{efn|name=not5.3}}
* [[Windows-1256]]
* [[Windows-1256]]
* [[Windows-1257]]
* [[Windows-1257]]
* [[Windows-1258]]
* [[Windows-1258]]{{efn|name=not5.3}}
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[GB 18030]]{{efn|Specified with 0xA3A0 as a duplicate encoding of the [[ideographic space]] (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).<ref name="gbenc"/><ref name="gbindex"/> Also, specified with 0x80 accepted as an alternative encoding of the [[euro sign]] (U+20AC; see [[Windows-936]]).<ref>{{cite web |url=https://encoding.spec.whatwg.org/#gb18030-decoder |title=10.2.1. gb18030 decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> Otherwise, follows the mappings from the 2005 standard.<ref name="gbindex">{{cite web |url=https://encoding.spec.whatwg.org/#index-gb18030 |title=5. Indexes (§ index gb18030) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Big5]]{{efn|[[Hong Kong Supplementary Character Set]] variant,<ref name="encoding_rs"/> although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-big5-pointer |title=5. Indexes (§ index Big5 pointer) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[Shift JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[Shift_JIS]]{{efn|The specification includes [[IBM]] and [[NEC]] extensions,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-jis0208 |title=5. Indexes (§ Index jis0208) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> and is more precisely [[Windows-31J]].<ref name="encoding_rs">{{cite web |url=https://docs.rs/encoding_rs/latest/encoding_rs/#notable-differences-from-iana-naming |title=Notable Differences from IANA Naming |work=Crate encoding_rs |publisher=docs.rs |author=Mozilla Foundation |author-link=Mozilla Foundation}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[ISO-2022-JP]]{{efn|The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. [[Half-width kana]] is converted to fullwidth by the encoder,<ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-iso-2022-jp-katakana |title=5. Indexes (§ Index ISO-2022-JP katakana) |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.<ref name="whatwgjisdecoder">{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-decoder |title=12.2.1. ISO-2022-JP decoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref> [[Shift Out]] and [[Shift In]] (0x0E and 0x0F) are excluded entirely to prevent attacks.<ref name="whatwgjisdecoder" /><ref>{{cite web |url=https://encoding.spec.whatwg.org/#iso-2022-jp-encoder |title=12.2.2. ISO-2022-JP encoder |institution=[[WHATWG]] |work=Encoding Standard |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
* [[EUC-KR]]{{efn|Actually [[Unified Hangul Code]] (Windows-949), which is a superset which covers the entire [[Hangul Syllables (Unicode block)|Hangul Syllables]] block.<ref name="encoding_rs"/><ref>{{cite web |url=https://encoding.spec.whatwg.org/#index-euc-kr |title=5. Indexes (§ index EUC-KR) |work=Encoding Standard |institution=[[WHATWG]] |last=van Kesteren |first=Anne |author-link=Anne van Kesteren}}</ref>}}
Line 126: Line 134:
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format
A ''[[numeric character reference]]'' in HTML refers to a character by its [[Universal Character Set]]/[[Unicode]] ''[[code point]]'', and uses the format


:<code>&#''nnnn'';</code>
{{block indent|<code>&#''nnnn'';</code>}}
or
or
:<code>&#x''hhhh'';</code>
{{block indent|<code>&#x''hhhh'';</code>}}


where ''nnnn'' is the code point in [[decimal]] form, and ''hhhh'' is the code point in [[hexadecimal]] form. The ''x'' must be lowercase in XML documents. The ''nnnn'' or ''hhhh'' may be any number of digits and may include leading zeros. The ''hhhh'' may mix uppercase and lowercase, though uppercase is the usual style.
where ''nnnn'' is the code point in [[decimal]] form, and ''hhhh'' is the code point in [[hexadecimal]] form. The ''x'' must be lowercase in XML documents. The ''nnnn'' or ''hhhh'' may be any number of digits and may include leading zeros. The ''hhhh'' may mix uppercase and lowercase, though uppercase is the usual style.
Line 141: Line 149:


===XML character references===
===XML character references===
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{citation |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>
Unlike traditional HTML with its large range of character entity references, in [[XML]] there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:<ref>{{cite book |chapter-url=http://www.w3.org/TR/REC-xml/#sec-references |chapter=Character and Entity References |title=XML |first1=T. |last1=Bray |author-link1=Tim Bray |first2=J. |last2=Paoli |first3=C. |last3=Sperberg-McQueen |author-link3=Michael Sperberg-McQueen |first4=E. |last4=Maler |first5=F. |last5=Yergeau |publisher=[[W3C]] |date=26 November 2008 |access-date=8 March 2010}}</ref>


{| class="wikitable"
{| class="wikitable"
! Reference !! Character !! Name !! Code point
|-
| <code>&amp;amp;</code>  ||align="center"| & || [[ampersand]]    || U+0026
| <code>&amp;amp;</code>  ||align="center"| & || [[ampersand]]    || U+0026
|-
|-
Line 168: Line 178:
== External links ==
== External links ==
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.w3.org/TR/REC-html40/sgml/entities.html Character entity references in HTML4]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding]
* [http://www.sitepoint.com/article/guide-web-character-encoding/ The Definitive Guide to Web Character Encoding] {{Webarchive|url=https://web.archive.org/web/20090729235752/http://www.sitepoint.com/article/guide-web-character-encoding/ |date=29 July 2009 }}
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook – more information about current browsers and their entity handling]
* [http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding HTML Entity Encoding chapter of Browser Security Handbook – more information about current browsers and their entity handling]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]
* [http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet The Open Web Application Security Project's wiki article on cross-site scripting (XSS)]

Latest revision as of 16:34, 6 January 2026

While Hypertext Markup Language (HTML) has been in use since 1991, HTML 4.0 from December 1997 was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

In version 5.3 of the now retired W3C specification, and the current Living Standard published by WHATWG, the only valid encoding is UTF-8.[1][2]

Specifying the document's character encoding

There are two general ways to specify which character encoding is used in the document.

First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[3]

Content-Type: text/html; charset=utf-8

This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation. Some HTTP server software can do this, for example Apache with the module mod_charset_lite.[4]
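As a rough illustration of how a client might read the encoding from such a header, the following Python sketch extracts the charset parameter using the standard library's email.message.Message class. The helper name charset_from_content_type is hypothetical, and real HTTP clients may apply stricter or looser parsing rules.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()
```

For example, charset_from_content_type("text/html; charset=utf-8") yields "utf-8", while a bare "text/html" yields None, leaving the encoding to be determined by other means.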

Second, a declaration can be included within the document itself.

For HTML it is possible to include this information inside the head element near the top of the document:[2]

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 also allows the following syntax to mean exactly the same:[2]

<meta charset="utf-8">

XHTML documents have a third option: to express the character encoding via XML declaration, as follows:[5]

<?xml version="1.0" encoding="utf-8"?>

With an in-document declaration, the character encoding cannot be known until the declaration is parsed, so a parser cannot be certain how to decode the document up to and including the declaration itself. If the character encoding is an ASCII extension, then the content up to and including the declaration should be pure ASCII, and this will work correctly. For character encodings that are not ASCII extensions (i.e. not a superset of ASCII), such as UTF-16BE and UTF-16LE, a processor of HTML, such as a web browser, should be able to parse the declaration in some cases through the use of heuristics.

Although HTML written to the current Living Standard is required to be UTF-8, an encoding declaration, in any of the above forms, is nonetheless required. It must be a case-insensitive match for the string "utf-8" and the document must, in fact, be in UTF-8.[2][1]
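The two requirements above (a declaration that matches "utf-8" case-insensitively, and content that really is UTF-8) can be checked with a short Python sketch. It is only an approximation: the regular expression is a simplified stand-in for the specification's prescan, only the in-document meta forms are considered (not the HTTP header), and the function names are hypothetical.

```python
import re
from typing import Optional

# Simplified pattern for a charset inside a <meta> tag. This is an
# assumption for illustration, not the specification's prescan algorithm.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_meta_charset(document: bytes) -> Optional[str]:
    """Return the lowercased charset declared in the first 1024 bytes, if any."""
    match = _META_CHARSET.search(document[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def is_declared_and_valid_utf8(document: bytes) -> bool:
    """True if the meta declaration says utf-8 and the bytes decode as UTF-8."""
    if declared_meta_charset(document) != "utf-8":
        return False
    try:
        document.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A document that declares utf-8 but contains a stray Windows-1252 byte such as 0xE9 fails the second check, which is the situation the "must, in fact, be in UTF-8" requirement rules out.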

Encoding detection algorithm

An "encoding sniffing algorithm" is defined in the specification to determine the character encoding of the document based on multiple sources of input, including:

  1. Explicit user instruction
  2. An explicit meta tag within the first 1024 bytes of the document
  3. A byte order mark (BOM) within the first three bytes of the document
  4. The HTTP Content-Type or other transport layer information
  5. Analysis of the document bytes looking for specific sequences or ranges of byte values,[6] and other tentative detection mechanisms.
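One of the inputs listed above, the byte order mark, is simple enough to sketch in Python. This covers only the BOM check, not the full sniffing algorithm or the relative priority of its inputs, and the function name sniff_bom is hypothetical.

```python
from typing import Optional

# Byte order marks for the Unicode encodings a BOM can identify.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
)


def sniff_bom(document: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if document.startswith(bom):
            return name
    return None
```

When sniff_bom returns None, a processor would fall back on the other sources in the list, such as the transport-layer information or a meta declaration.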

Characters outside of the printable ASCII range (32 to 126) may appear incorrectly if the document is served with an incorrect character encoding. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In Chinese, Japanese, and Korean (CJK) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed. Finally, browsers usually permit the user to override an incorrect charset label manually.
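The effect of a wrong label can be reproduced in a few lines of Python. The sketch below encodes text as UTF-8 and then decodes it as if it were Windows-1252. The helper name misdecode is hypothetical, and Python's cp1252 codec is used as an approximation of the windows-1252 encoding that browsers implement.

```python
def misdecode(text: str, actual: str = "utf-8", assumed: str = "cp1252") -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(actual).decode(assumed)


# ASCII characters survive the mismatch; the accented letter does not.
garbled = misdecode("café")
```

Here garbled is "cafÃ©": the two UTF-8 bytes of "é" (0xC3 0xA9) are each read as a separate Windows-1252 character, while the ASCII letters are unaffected.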

UTF-8 has been the most common character encoding on the Web since 2008, in part because, as an encoding of Unicode, it allows use of the same encoding for all languages. As of January 2026, UTF-8 is used by 98.9% of web sites surveyed by W3Techs.[7] UTF-16 or UTF-32, other encodings of Unicode, are less widely used because they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
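The size difference for ASCII-heavy text can be seen by encoding the same string three ways. This Python sketch uses the BOM-less "-le" codec variants so that only the character data is counted; the function name encoded_sizes is hypothetical.

```python
from typing import Dict


def encoded_sizes(text: str) -> Dict[str, int]:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        name: len(text.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


markup_sizes = encoded_sizes("<p>Hello</p>")
cjk_sizes = encoded_sizes("日本語")
```

The 12-character markup string takes 12 bytes in UTF-8, 24 in UTF-16 and 48 in UTF-32. The three-character Japanese string takes 9, 6 and 12 bytes respectively, so UTF-16 is more compact for such text, but HTML markup itself is ASCII and usually tips the balance toward UTF-8.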

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some platform-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers on different platforms or with different native languages will not see the page as intended.

Permitted encodings

Version 5.3 of the retired W3C standard and the current (as of 2026) WHATWG Living Standard both require UTF-8. No other encoding is considered valid.[1][2] Nonetheless, implementations must use the encoding sniffing algorithm to determine which encoding to apply to the document, in accordance with the robustness principle.

The WHATWG Encoding Standard, referenced by both standards, specifies a list of encodings which browsers must support. The HTML standards forbid support of other encodings.[8][9][10] The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively.[11]

Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard:[10]

  1. Omitted from W3C version 5.3.
  2. Also specified for TIS-620, ISO-8859-11 and related labels.[11]
  3. Also specified for ASCII, ISO-8859-1 and related labels.[11]
  4. Also specified for ISO-8859-9 and related labels.[11]
  5. Specified with 0xA3A0 as a duplicate encoding of the ideographic space (U+3000) for compatibility reasons, and as such excluding U+E5E5 (a private use character).[12][13] Also, specified with 0x80 accepted as an alternative encoding of the euro sign (U+20AC; see Windows-936).[14] Otherwise, follows the mappings from the 2005 standard.[13]
  6. Hong Kong Supplementary Character Set variant,[15] although most of the HKSCS extensions (those with lead bytes less than 0xA1) are not included by the encoder, only by the decoder.[16]
  7. The specification includes IBM and NEC extensions,[17] and is more precisely Windows-31J.[15]
  8. The specification uses the same index as used for Shift JIS (insofar as is within reach), i.e. includes NEC extensions. Half-width kana is converted to fullwidth by the encoder,[18] but accepted using an escape sequence (ESC 0x28 0x49) by the decoder.[19] Shift Out and Shift In (0x0E and 0x0F) are excluded entirely to prevent attacks.[19][20]
  9. Actually Unified Hangul Code (Windows-949), which is a superset which covers the entire Hangul Syllables block.[15][21]
  10. Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[22]
  11. For compatibility with deployed content, also specified for the plain UTF-16 label,[23] although a byte order mark (BOM), if present, takes priority over any label.[24] Specified for decoding only; form submissions from UTF-16-coded documents are to be encoded in UTF-8.[22]
  12. Maps 0x00 through 0x7F to U+0000 through U+007F, and 0x80 through 0xFF to U+F780 through U+F7FF (a Private Use Area range), such that the low 8 bits of the code point always match the original byte.[25]
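The byte-to-code-point mapping in the last note above (the one for x-user-defined) is simple enough to sketch directly. The function name is hypothetical, and the code follows only the mapping as stated in that note.

```python
def decode_x_user_defined(data: bytes) -> str:
    """Decode bytes using the mapping described for x-user-defined:
    0x00-0x7F map to U+0000-U+007F, 0x80-0xFF map to U+F780-U+F7FF."""
    return "".join(
        chr(b) if b < 0x80 else chr(0xF780 + (b - 0x80))
        for b in data
    )


decoded = decode_x_user_defined(b"A\x80\xff")

# The low 8 bits of every resulting code point equal the original byte,
# so the raw bytes can be recovered with a simple mask.
recovered = bytes(ord(ch) & 0xFF for ch in decoded)
print([hex(ord(ch)) for ch in decoded], recovered)
```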

The following additional encodings are listed in the Encoding Standard, and support for them is therefore also required:[11]

  1. Uses the same encoder and decoder as ISO-8859-8, but is not subject to the visual-order behaviour which is used for documents labelled as ISO-8859-8.[26]
  2. Titled KOI8-U and specified for both KOI8-U and KOI8-RU labels;[11] follows KOI8-RU in positions 0xAE and 0xBE (i.e. includes Ў/ў)[27][28] but KOI8-U in positions 0x93–9F.[27]
  3. Also specified for GB2312 and related labels. Handled the same as GB 18030 for decoding purposes.[29] For encoding purposes, labelling as GBK (or GB 2312) excludes four-byte codes, and favours the one-byte 0x80 representation for U+20AC.[12]
  4. The specification uses the same index as used for Shift JIS (insofar as is within reach of the EUC code set 1), i.e. includes NEC extensions. JIS X 0212 is included for decoding only.[30]

The following encodings are listed as explicit examples of forbidden encodings:[10]

The standard also defines a "replacement" decoder, which maps all content labelled as certain encodings to the replacement character (�), refusing to process it at all. This is intended to prevent attacks (e.g. cross site scripting) which may exploit a difference between the client and server in what encodings are supported in order to mask malicious content.[31] Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content.[32] The following encodings receive this treatment:[33]

Character references

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

HTML character references

A numeric character reference in HTML refers to a character by its Universal Character Set/Unicode code point, and uses the format

&#nnnn;

or

&#xhhhh;

where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
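A short sketch of both forms, using Python's standard `html` module to decode them again. The helper name `numeric_references` is hypothetical.

```python
import html


def numeric_references(ch: str) -> tuple:
    """Return the decimal and hexadecimal numeric character references for ch."""
    code_point = ord(ch)
    return f"&#{code_point};", f"&#x{code_point:X};"


dec_ref, hex_ref = numeric_references("λ")
print(dec_ref, hex_ref)

# Leading zeros and mixed-case hexadecimal digits decode to the same character.
variants = ["&#955;", "&#0955;", "&#x3BB;", "&#x3bb;", "&#x0003Bb;"]
decoded = [html.unescape(v) for v in variants]
print(decoded)
```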

Not all web browsers or email clients used by receivers of HTML documents, or text editors used by authors of HTML documents, will be able to render all HTML characters. Most modern software is able to display most or all of the characters for the user's language, and will draw a box or other clear indicator for characters it cannot render.

Most of the characters with codes from 0 to 127, the original 7-bit ASCII set, can be used without a character reference. Codes from 160 to 255 can all be written using character entity names. Only a few higher-numbered codes have entity names, but all can be written as numeric character references.

Character entity references can also have the format &name; where name is a case-sensitive alphanumeric string. For example, "λ" can also be encoded as &lambda; in an HTML document. The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup. This notably did not include XML's &apos; (') entity prior to HTML5. For a list of all named HTML character entity references along with the versions in which they were introduced, see List of XML and HTML character entity references.
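Python's standard library ships tables of these names, which makes the points above easy to check. In this sketch, `name2codepoint` is the library's table of HTML 4 entity names and `html5` is its table of HTML5 names; how well these tables track any particular version of the HTML standard is an assumption here.

```python
import html
from html.entities import html5, name2codepoint

# Entity names are case-sensitive: &lambda; and &Lambda; are different characters.
lower = html.unescape("&lambda;")
upper = html.unescape("&Lambda;")
print(lower, upper)

# The HTML 4 table maps the name to the same code point that a numeric
# character reference would use.
lambda_code_point = name2codepoint["lambda"]
print(hex(lambda_code_point))

# &apos; is absent from the HTML 4 names but present in the HTML5 table.
apos_in_html4 = "apos" in name2codepoint
apos_in_html5 = "apos;" in html5
print(apos_in_html4, apos_in_html5)
```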

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for markup delimiting characters as mentioned above, and for a few special characters (or none at all if a native Unicode encoding like UTF-8 is used). Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting. If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities. Other languages related to HTML have their own methods of escaping characters.
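As a minimal sketch of escaping the markup-delimiting characters, Python's `html.escape` can be applied to untrusted text before it is placed in a page. The input string is invented for illustration, and real applications usually rely on a templating system that escapes by context rather than on hand-written calls like this.

```python
import html

# Hypothetical untrusted input containing markup-delimiting characters.
untrusted = '<script>alert("x")</script> & more'

# With quote=True (the default), quotation marks are escaped as well,
# which matters when the text ends up inside a quoted attribute value.
safe = html.escape(untrusted, quote=True)
print(safe)

# Decoding the references gives back the original text unchanged.
round_trip = html.unescape(safe)
```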

XML character references

Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:[34]

Reference   Character   Name                Code point
&amp;       &           ampersand           U+0026
&lt;        <           less-than sign      U+003C
&gt;        >           greater-than sign   U+003E
&quot;      "           quotation mark      U+0022
&apos;      '           apostrophe          U+0027

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML entity set, along with XML's predefined entities.
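These rules can be observed with any conforming XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the helper name `parses` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if xml_text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined entities need no declaration.
predefined_ok = parses("<a>&amp;&lt;&gt;&quot;&apos;</a>")

# An HTML entity name that has not been declared is an error in XML.
eacute_ok = parses("<a>&eacute;</a>")

# The x in a hexadecimal reference must be lowercase; the digits may mix case.
lower_x_ok = parses("<a>&#xA1b;</a>")
upper_x_ok = parses("<a>&#XA1b;</a>")

print(predefined_ok, eacute_ok, lower_x_ok, upper_x_ok)
```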

References

  1. "Specifying the document's character encoding". HTML 5.3. World Wide Web Consortium. 28 January 2021. Retrieved 6 January 2026.
  2. "Specifying the document's character encoding". HTML Standard. WHATWG. 17 December 2025. Retrieved 6 January 2026.
  3. Fielding, R.; Reschke, J. (June 2014). "Content-Type". In Fielding, R; Reschke, J (eds.). Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. IETF. doi:10.17487/RFC7231. S2CID 14399078. Retrieved 30 July 2014.
  4. "Apache Module mod_charset_lite".
  5. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008), "Prolog and Document Type Declaration", XML, W3C, retrieved 8 March 2010
  6. "HTML5 prescan a byte stream to determine its encoding".
  7. "Usage Survey of Character Encodings broken down by Ranking". W3Techs. January 2026. Retrieved 3 January 2026.
  8. "8.2.2.3. Character encodings". HTML 5.1 Standard. W3C.
  9. "8.2.2.3. Character encodings". HTML 5 Standard. W3C.
  10. "12.2.3.3 Character encodings". HTML Living Standard. WHATWG.
  11. van Kesteren, Anne. "4.2: Names and labels". Encoding Standard. WHATWG.
  12. van Kesteren, Anne. "10.2.2. gb18030 encoder". Encoding Standard. WHATWG.
  13. van Kesteren, Anne. "5. Indexes (§ index gb18030)". Encoding Standard. WHATWG.
  14. van Kesteren, Anne. "10.2.1. gb18030 decoder". Encoding Standard. WHATWG.
  15. Mozilla Foundation. "Notable Differences from IANA Naming". Crate encoding_rs. docs.rs.
  16. van Kesteren, Anne. "5. Indexes (§ index Big5 pointer)". Encoding Standard. WHATWG.
  17. van Kesteren, Anne. "5. Indexes (§ Index jis0208)". Encoding Standard. WHATWG.
  18. van Kesteren, Anne. "5. Indexes (§ Index ISO-2022-JP katakana)". Encoding Standard. WHATWG.
  19. van Kesteren, Anne. "12.2.1. ISO-2022-JP decoder". Encoding Standard. WHATWG.
  20. van Kesteren, Anne. "12.2.2. ISO-2022-JP encoder". Encoding Standard. WHATWG.
  21. van Kesteren, Anne. "5. Indexes (§ index EUC-KR)". Encoding Standard. WHATWG.
  22. van Kesteren, Anne. "4.3. Output encodings". Encoding Standard. WHATWG.
  23. van Kesteren, Anne. "14.4. UTF-16LE". Encoding Standard. WHATWG.
  24. van Kesteren, Anne. "6. Hooks for standards (§ decode)". Encoding Standard. WHATWG.
  25. van Kesteren, Anne. "14.5. x-user-defined". Encoding Standard. WHATWG.
  26. van Kesteren, Anne. "9. Legacy single-byte encodings (§ Note)". Encoding Standard. WHATWG.
  27. van Kesteren, Anne. "index KOI8-U visualization". Encoding Standard. WHATWG.
  28. "Bug 17053: Support KOI8-RU mapping for KOI8-U". W3C Bugzilla. 19 August 2015.
  29. van Kesteren, Anne. "10.1. GBK". Encoding Standard. WHATWG.
  30. van Kesteren, Anne. "5. Indexes (§ Index jis0212)". Encoding Standard. WHATWG.
  31. van Kesteren, Anne. "14.1: replacement". Encoding Standard. WHATWG.
  32. van Kesteren, Anne. "2: Security background". Encoding Standard. WHATWG.
  33. van Kesteren, Anne. "4.2: Names and labels (§ replacement)". Encoding Standard. WHATWG.
  34. Bray, T.; Paoli, J.; Sperberg-McQueen, C.; Maler, E.; Yergeau, F. (26 November 2008). "Character and Entity References". XML. W3C. Retrieved 8 March 2010.