- Changing an HTML page to Unicode
- Answer
- Step 1: Save the data as UTF-8
- Step 2: Declare the encoding in your page
- Step 3: Ensure that your server does the right thing
- Further reading
- The charset attribute
- Syntax
- Values
- Default value
- HTML Encoding (Character Sets)
- From ASCII to UTF-8
- The HTML charset Attribute
- Differences Between Character Sets
- The ASCII Character Set
- The ANSI Character Set (Windows-1252)
- The ISO-8859-1 Character Set
- The UTF-8 Character Set
Changing an HTML page to Unicode
This page will help you change the character encoding of your HTML page to UTF-8.
Answer
Below we summarise the information you need to convert a simple page to a Unicode character encoding. Follow the links to other articles on the site if you need to get detailed information about any step.
For much more detailed advice about converting complex sites, software and data to Unicode, see the article Migrating to Unicode.
Step 1: Save the data as UTF-8
It is not sufficient to just change the declarations inside your pages to say that the page is encoded in UTF-8. You must ensure that your data is actually encoded, i.e. saved, in UTF-8.
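To see why the declaration and the saved bytes have to agree, here is a minimal Python sketch. The sample word is an arbitrary choice for illustration; only standard codecs are used.

```python
# Text saved in UTF-8 but read as ISO-8859-1 shows up as garbage (mojibake).
text = "café"
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")

# Text saved in ISO-8859-1 but declared as UTF-8 is not valid UTF-8 at all.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    declared_utf8_ok = True
except UnicodeDecodeError:
    declared_utf8_ok = False

print(misread)           # the accented letter is displayed as two wrong characters
print(declared_utf8_ok)  # False
```

In a browser the second case usually appears as replacement characters rather than an error, but the cause is the same: the bytes do not match the declared encoding.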
If you are working with hand-edited files then you should use the options of your editor to save the file in UTF-8 rather than the encoding you were using. If you are building files from scripts and databases, you should ensure that the data is converted as necessary and that the correct parameters are set in your scripting environment.
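For files built by scripts, the conversion can be sketched in Python as follows. The helper name convert_to_utf8 and the sample string are hypothetical, and the sketch assumes you know the encoding the data was originally saved in (Windows-1252 here).

```python
def convert_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Example: data that was saved as Windows-1252.
legacy = "Größe: 5 €".encode("windows-1252")
converted = convert_to_utf8(legacy, "windows-1252")

print(len(legacy), len(converted))  # non-ASCII characters take more bytes in UTF-8
```

If the source encoding is guessed wrongly, the decode step either fails or silently produces the wrong characters, so it is worth checking a sample of the output by eye.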
Note that you may have to ensure that the data does not include a UTF-8 signature, also known as a byte-order mark (BOM).
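The UTF-8 signature is the three bytes EF BB BF at the very start of the file. A small Python sketch for detecting and removing it (the function name strip_utf8_bom is an illustrative choice):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + b"<!DOCTYPE html>"
print(strip_utf8_bom(with_bom))
```

When reading text rather than raw bytes, Python's "utf-8-sig" codec does the same job: it decodes UTF-8 and drops the signature if one is present.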
Step 2: Declare the encoding in your page
You should change the character encoding declaration in your page (or add one if you don’t already declare it).
In its simplest form the declaration is <meta charset="utf-8">, and it should come at the beginning of the head element in your HTML code.
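As a rough check that a page carries the declaration early enough, the following Python sketch looks for the HTML5 form <meta charset="..."> in the first 1024 bytes, which is where the HTML specification requires the declaration to sit. The function name and the simplified regular expression are assumptions for illustration; this is not a full HTML parser and ignores the older http-equiv form.

```python
import re

# Matches <meta charset=...> with or without quotes, case-insensitively.
META_CHARSET = re.compile(
    rb'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(page: bytes):
    """Return the charset named by a <meta charset> tag in the first 1024 bytes, or None."""
    match = META_CHARSET.search(page[:1024])
    return match.group(1).decode("ascii").lower() if match else None


page = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>x</title>'
print(declared_charset(page))
```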
Step 3: Ensure that your server does the right thing
Although your data is in UTF-8 and you have declared it in the page, your server may still be serving the page with an accompanying HTTP header that says it is something else.
To test it, enter the URL of your page in the W3C Internationalization Checker. In the results table, find the row titled HTTP Content-Type under Character Encoding, and check that it says either UTF-8 or No encoding information found.
If the HTTP Content-Type shows an encoding other than UTF-8 you’ll need to take steps to rectify it, because the declaration in the HTTP header will override information inside the page.
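The precedence described here can be sketched in Python. The function names are hypothetical, and the sketch only models the rule stated above (the header wins over the in-page declaration); it relies on the standard library's email.message.Message to parse the header value.

```python
from email.message import Message


def header_charset(content_type: str):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None if absent


def effective_encoding(content_type: str, meta_charset):
    """The HTTP header wins when it names an encoding; otherwise use the page's own declaration."""
    return header_charset(content_type) or meta_charset


print(effective_encoding("text/html; charset=ISO-8859-1", "utf-8"))
print(effective_encoding("text/html", "utf-8"))
```

The first call shows the problem case: the page declares UTF-8, but the header says otherwise and takes priority.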
Changing the encoding sent in the HTTP header normally requires server admin privileges, though you may be able to do so yourself even if you are serving files via an ISP. If in doubt, consult your server administrator. See the explanation of one way to do this for an Apache server.
Further reading
- Getting started? Introducing Character Sets and Encodings
- Tutorial, Handling character encodings in HTML and CSS
- Migrating to Unicode A much more in-depth article about changing software and data to Unicode.
The charset attribute
Specifies the encoding of the document. The attribute was introduced in HTML5 and is intended as a shorter form of the <meta> tag that set the encoding in earlier versions of HTML and XHTML.
Syntax
<meta charset="encoding">
Values
The name of an encoding, for example UTF-8.
Default value
Typical document.
HTML Encoding (Character Sets)
To display an HTML page correctly, a web browser must know which character set to use.
From ASCII to UTF-8
ASCII was one of the first character encoding standards. ASCII defined 128 different characters that could be used on the internet: numbers (0-9), English letters (A-Z), and some special characters like ! $ + - ( ) @ < >.
ISO-8859-1 was the default character set for HTML 4. This character set supported 256 different character codes. HTML 4 also supported UTF-8.
ANSI (Windows-1252) was the original Windows character set. ANSI is identical to ISO-8859-1, except that ANSI assigns 27 extra printable characters to the values from 128 to 159 (five of those 32 values are not used).
The HTML5 specification encourages web developers to use the UTF-8 character set, which covers almost all of the characters and symbols in the world!
The HTML charset Attribute
To display an HTML page correctly, a web browser must know the character set used in the page.
This is specified in the <meta> tag, for example <meta charset="UTF-8">.
Differences Between Character Sets
The following table displays the differences between the character sets described above:
Number  ASCII  ANSI  8859  UTF-8  Description
32                                space
33      !      !     !     !      exclamation mark
34      "      "     "     "      quotation mark
35      #      #     #     #      number sign
36      $      $     $     $      dollar sign
37      %      %     %     %      percent sign
38      &      &     &     &      ampersand
39      '      '     '     '      apostrophe
40      (      (     (     (      left parenthesis
41      )      )     )     )      right parenthesis
42      *      *     *     *      asterisk
43      +      +     +     +      plus sign
44      ,      ,     ,     ,      comma
45      -      -     -     -      hyphen-minus
46      .      .     .     .      full stop
47      /      /     /     /      solidus
48      0      0     0     0      digit zero
49      1      1     1     1      digit one
50      2      2     2     2      digit two
51      3      3     3     3      digit three
52      4      4     4     4      digit four
53      5      5     5     5      digit five
54      6      6     6     6      digit six
55      7      7     7     7      digit seven
56      8      8     8     8      digit eight
57      9      9     9     9      digit nine
58      :      :     :     :      colon
59      ;      ;     ;     ;      semicolon
60      <      <     <     <      less-than sign
61      =      =     =     =      equals sign
62      >      >     >     >      greater-than sign
63      ?      ?     ?     ?      question mark
64      @      @     @     @      commercial at
65      A      A     A     A      Latin capital letter A
66      B      B     B     B      Latin capital letter B
67      C      C     C     C      Latin capital letter C
68      D      D     D     D      Latin capital letter D
69      E      E     E     E      Latin capital letter E
70      F      F     F     F      Latin capital letter F
71      G      G     G     G      Latin capital letter G
72      H      H     H     H      Latin capital letter H
73      I      I     I     I      Latin capital letter I
74      J      J     J     J      Latin capital letter J
75      K      K     K     K      Latin capital letter K
76      L      L     L     L      Latin capital letter L
77      M      M     M     M      Latin capital letter M
78      N      N     N     N      Latin capital letter N
79      O      O     O     O      Latin capital letter O
80      P      P     P     P      Latin capital letter P
81      Q      Q     Q     Q      Latin capital letter Q
82      R      R     R     R      Latin capital letter R
83      S      S     S     S      Latin capital letter S
84      T      T     T     T      Latin capital letter T
85      U      U     U     U      Latin capital letter U
86      V      V     V     V      Latin capital letter V
87      W      W     W     W      Latin capital letter W
88      X      X     X     X      Latin capital letter X
89      Y      Y     Y     Y      Latin capital letter Y
90      Z      Z     Z     Z      Latin capital letter Z
91      [      [     [     [      left square bracket
92      \      \     \     \      reverse solidus
93      ]      ]     ]     ]      right square bracket
94      ^      ^     ^     ^      circumflex accent
95      _      _     _     _      low line
96      `      `     `     `      grave accent
97      a      a     a     a      Latin small letter a
98      b      b     b     b      Latin small letter b
99      c      c     c     c      Latin small letter c
100     d      d     d     d      Latin small letter d
101     e      e     e     e      Latin small letter e
102     f      f     f     f      Latin small letter f
103     g      g     g     g      Latin small letter g
104     h      h     h     h      Latin small letter h
105     i      i     i     i      Latin small letter i
106     j      j     j     j      Latin small letter j
107     k      k     k     k      Latin small letter k
108     l      l     l     l      Latin small letter l
109     m      m     m     m      Latin small letter m
110     n      n     n     n      Latin small letter n
111     o      o     o     o      Latin small letter o
112     p      p     p     p      Latin small letter p
113     q      q     q     q      Latin small letter q
114     r      r     r     r      Latin small letter r
115     s      s     s     s      Latin small letter s
116     t      t     t     t      Latin small letter t
117     u      u     u     u      Latin small letter u
118     v      v     v     v      Latin small letter v
119     w      w     w     w      Latin small letter w
120     x      x     x     x      Latin small letter x
121     y      y     y     y      Latin small letter y
122     z      z     z     z      Latin small letter z
123     {      {     {     {      left curly bracket
124     |      |     |     |      vertical line
125     }      }     }     }      right curly bracket
126     ~      ~     ~     ~      tilde
127                               DEL
128            €                  euro sign
129                               NOT USED
130            ‚                  single low-9 quotation mark
131            ƒ                  Latin small letter f with hook
132            „                  double low-9 quotation mark
133            …                  horizontal ellipsis
134            †                  dagger
135            ‡                  double dagger
136            ˆ                  modifier letter circumflex accent
137            ‰                  per mille sign
138            Š                  Latin capital letter S with caron
139            ‹                  single left-pointing angle quotation mark
140            Œ                  Latin capital ligature OE
141                               NOT USED
142            Ž                  Latin capital letter Z with caron
143                               NOT USED
144                               NOT USED
145            ‘                  left single quotation mark
146            ’                  right single quotation mark
147            “                  left double quotation mark
148            ”                  right double quotation mark
149            •                  bullet
150            –                  en dash
151            —                  em dash
152            ˜                  small tilde
153            ™                  trade mark sign
154            š                  Latin small letter s with caron
155            ›                  single right-pointing angle quotation mark
156            œ                  Latin small ligature oe
157                               NOT USED
158            ž                  Latin small letter z with caron
159            Ÿ                  Latin capital letter Y with diaeresis
160                               no-break space
161            ¡     ¡     ¡      inverted exclamation mark
162            ¢     ¢     ¢      cent sign
163            £     £     £      pound sign
164            ¤     ¤     ¤      currency sign
165            ¥     ¥     ¥      yen sign
166            ¦     ¦     ¦      broken bar
167            §     §     §      section sign
168            ¨     ¨     ¨      diaeresis
169            ©     ©     ©      copyright sign
170            ª     ª     ª      feminine ordinal indicator
171            «     «     «      left-pointing double angle quotation mark
172            ¬     ¬     ¬      not sign
173                               soft hyphen
174            ®     ®     ®      registered sign
175            ¯     ¯     ¯      macron
176            °     °     °      degree sign
177            ±     ±     ±      plus-minus sign
178            ²     ²     ²      superscript two
179            ³     ³     ³      superscript three
180            ´     ´     ´      acute accent
181            µ     µ     µ      micro sign
182            ¶     ¶     ¶      pilcrow sign
183            ·     ·     ·      middle dot
184            ¸     ¸     ¸      cedilla
185            ¹     ¹     ¹      superscript one
186            º     º     º      masculine ordinal indicator
187            »     »     »      right-pointing double angle quotation mark
188            ¼     ¼     ¼      vulgar fraction one quarter
189            ½     ½     ½      vulgar fraction one half
190            ¾     ¾     ¾      vulgar fraction three quarters
191            ¿     ¿     ¿      inverted question mark
192            À     À     À      Latin capital letter A with grave
193            Á     Á     Á      Latin capital letter A with acute
194            Â     Â     Â      Latin capital letter A with circumflex
195            Ã     Ã     Ã      Latin capital letter A with tilde
196            Ä     Ä     Ä      Latin capital letter A with diaeresis
197            Å     Å     Å      Latin capital letter A with ring above
198            Æ     Æ     Æ      Latin capital letter AE
199            Ç     Ç     Ç      Latin capital letter C with cedilla
200            È     È     È      Latin capital letter E with grave
201            É     É     É      Latin capital letter E with acute
202            Ê     Ê     Ê      Latin capital letter E with circumflex
203            Ë     Ë     Ë      Latin capital letter E with diaeresis
204            Ì     Ì     Ì      Latin capital letter I with grave
205            Í     Í     Í      Latin capital letter I with acute
206            Î     Î     Î      Latin capital letter I with circumflex
207            Ï     Ï     Ï      Latin capital letter I with diaeresis
208            Ð     Ð     Ð      Latin capital letter Eth
209            Ñ     Ñ     Ñ      Latin capital letter N with tilde
210            Ò     Ò     Ò      Latin capital letter O with grave
211            Ó     Ó     Ó      Latin capital letter O with acute
212            Ô     Ô     Ô      Latin capital letter O with circumflex
213            Õ     Õ     Õ      Latin capital letter O with tilde
214            Ö     Ö     Ö      Latin capital letter O with diaeresis
215            ×     ×     ×      multiplication sign
216            Ø     Ø     Ø      Latin capital letter O with stroke
217            Ù     Ù     Ù      Latin capital letter U with grave
218            Ú     Ú     Ú      Latin capital letter U with acute
219            Û     Û     Û      Latin capital letter U with circumflex
220            Ü     Ü     Ü      Latin capital letter U with diaeresis
221            Ý     Ý     Ý      Latin capital letter Y with acute
222            Þ     Þ     Þ      Latin capital letter Thorn
223            ß     ß     ß      Latin small letter sharp s
224            à     à     à      Latin small letter a with grave
225            á     á     á      Latin small letter a with acute
226            â     â     â      Latin small letter a with circumflex
227            ã     ã     ã      Latin small letter a with tilde
228            ä     ä     ä      Latin small letter a with diaeresis
229            å     å     å      Latin small letter a with ring above
230            æ     æ     æ      Latin small letter ae
231            ç     ç     ç      Latin small letter c with cedilla
232            è     è     è      Latin small letter e with grave
233            é     é     é      Latin small letter e with acute
234            ê     ê     ê      Latin small letter e with circumflex
235            ë     ë     ë      Latin small letter e with diaeresis
236            ì     ì     ì      Latin small letter i with grave
237            í     í     í      Latin small letter i with acute
238            î     î     î      Latin small letter i with circumflex
239            ï     ï     ï      Latin small letter i with diaeresis
240            ð     ð     ð      Latin small letter eth
241            ñ     ñ     ñ      Latin small letter n with tilde
242            ò     ò     ò      Latin small letter o with grave
243            ó     ó     ó      Latin small letter o with acute
244            ô     ô     ô      Latin small letter o with circumflex
245            õ     õ     õ      Latin small letter o with tilde
246            ö     ö     ö      Latin small letter o with diaeresis
247            ÷     ÷     ÷      division sign
248            ø     ø     ø      Latin small letter o with stroke
249            ù     ù     ù      Latin small letter u with grave
250            ú     ú     ú      Latin small letter u with acute
251            û     û     û      Latin small letter u with circumflex
252            ü     ü     ü      Latin small letter u with diaeresis
253            ý     ý     ý      Latin small letter y with acute
254            þ     þ     þ      Latin small letter thorn
255            ÿ     ÿ     ÿ      Latin small letter y with diaeresis
The ASCII Character Set
ASCII uses the values from 0 to 31 (and 127) for control characters.
ASCII uses the values from 32 to 126 for letters, digits, and symbols.
ASCII does not use the values from 128 to 255.
The ANSI Character Set (Windows-1252)
ANSI is identical to ASCII for the values from 0 to 127.
ANSI has a proprietary set of characters for the values from 128 to 159.
ANSI assigns the same characters as ISO-8859-1 and Unicode to the values from 160 to 255.
The ISO-8859-1 Character Set
ISO-8859-1 is identical to ASCII for the values from 0 to 127.
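ISO-8859-1 assigns no printable characters to the values from 128 to 159.
ISO-8859-1 assigns the same characters as Unicode to the values from 160 to 255, although UTF-8 stores each of them as two bytes rather than one.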
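The difference between ANSI (Windows-1252) and ISO-8859-1 in the range 128 to 159, and their agreement from 160 to 255, can be checked with Python's built-in codecs. This is a small sketch; the variable names are arbitrary.

```python
euro = "€"
cp1252_euro = euro.encode("cp1252")  # a single byte, value 128, in Windows-1252

# ISO-8859-1 has no euro sign, so encoding it fails.
try:
    euro.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

# From 160 to 255 the two character sets agree value for value.
same_upper_range = all(
    bytes([value]).decode("cp1252") == bytes([value]).decode("iso-8859-1")
    for value in range(160, 256)
)

print(cp1252_euro[0], latin1_has_euro, same_upper_range)
```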
The UTF-8 Character Set
UTF-8 is identical to ASCII for the values from 0 to 127.
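Unicode, the character set that UTF-8 encodes, reserves the values from 128 to 159 for control characters, so they have no printable character.
Unicode assigns the same characters as both ANSI and 8859-1 to the values from 160 to 255, but UTF-8 stores each of them as two bytes.
Unicode continues from the value 256 with more than 100 000 additional characters, which UTF-8 stores as two to four bytes each.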
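The variable-length nature of UTF-8 can be seen by encoding a few characters in Python. The sample characters are arbitrary picks from different ranges.

```python
# Character -> Unicode value (code point).
samples = {"A": 65, "é": 233, "€": 8364, "😀": 128512}

# How many bytes UTF-8 needs for each character.
byte_lengths = {char: len(char.encode("utf-8")) for char in samples}

# The same character, value 233, is one byte in ISO-8859-1 but two in UTF-8.
latin1_e_acute = "é".encode("iso-8859-1")
utf8_e_acute = "é".encode("utf-8")

for char, length in byte_lengths.items():
    print(char, ord(char), length)
```

Only the values from 0 to 127 are stored as the same single byte in ASCII, ANSI, ISO-8859-1 and UTF-8, which is why pure English text survives a wrong encoding declaration while accented letters do not.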