- Setting the HTTP charset parameter
- The charset parameter
- Server setup
- Scripting the header
- Further reading
- HTML Encoding (Character Sets)
- From ASCII to UTF-8
- The HTML charset Attribute
- Differences Between Character Sets
- The ASCII Character Set
- The ANSI Character Set (Windows-1252)
- The ISO-8859-1 Character Set
- The UTF-8 Character Set
Setting the HTTP charset parameter
When a server sends a document to a user agent (e.g. a browser), it also sends information about the document's data format in the Content-Type field of the accompanying HTTP header. This information is expressed using a MIME type label. This article provides a starting point for those needing to set the encoding information in the HTTP header.
The charset parameter
Documents transmitted with HTTP that are of type text, such as text/html, text/plain, etc., can send a charset parameter in the HTTP header to specify the character encoding of the document.
It is very important to always label Web documents explicitly. The original HTTP 1.1 specification (RFC 2616) says that the default charset is ISO-8859-1, but there are too many unlabeled documents in other encodings, so in practice browsers fall back on the reader’s preferred encoding when there is no explicit charset parameter.
The line in the HTTP header typically looks like this:
Content-Type: text/html; charset=utf-8
In theory, any character encoding that has been registered with IANA can be used, but there is no browser that understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it. A Unicode encoding such as UTF-8 is a good choice for a number of reasons.
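To make the charset parameter concrete, here is a minimal sketch of how a client-side script might read it back out of a Content-Type value. It relies on the standard library's email.message.Message class to do the parameter parsing; the function name charset_from_content_type is a hypothetical helper chosen for this illustration, not something defined by HTTP or by the article.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, lowercased.

    Returns None when the header carries no charset parameter.
    The function name is hypothetical; the parsing is done by the
    standard library's MIME header support.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() handles quoting and lowercases the result.
    return msg.get_content_charset()
```

Charset names are case-insensitive, which is why the sketch normalizes them to lowercase before they are compared or passed to a decoder.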
Server setup
How to make the server send out appropriate charset information depends on the server. You will need the appropriate administrative rights to be able to change server settings.
Apache. This can be done via the AddCharset (Apache 1.3.10 and later) or AddType directives, for directories or individual resources (files). With AddDefaultCharset (Apache 1.3.12 and later), it is possible to set the default charset for a whole server. For more information, see the article on Setting ‘charset’ information in .htaccess.
Jigsaw. Use an indexer in JigAdmin to associate extensions with charsets, or set the charset directly on a resource.
IIS 5 and 6. In Internet Services Manager, right-click "Default Web Site" (or the site you want to configure) and go to "Properties" => "HTTP Headers" => "File Types..." => "New Type...". Put in the extension you want to map, separately for each extension; IIS users will probably want to map .htm and .html. Then, for Content type, add "text/html;charset=utf-8" (without the quotes; substitute your desired charset for utf-8; do not leave any spaces anywhere because IIS ignores all text after spaces). For IIS 4, you may have to use "HTTP Headers" => "Creating a Custom HTTP Header" if the above does not work.
Scripting the header
The appropriate header can also be set in server side scripting languages. For example:
Perl. Output the correct header before any part of the actual page. After the last header, use a double linebreak, e.g.:
print "Content-Type: text/html; charset=utf-8\n\n";
Python. Use the same solution as for Perl (except that you don’t need a semicolon at the end and that, in Python 3, print is a function, so the string goes in parentheses).
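For Python 3, a slightly more careful variant of the Perl approach is sketched below. It assumes a CGI-style script whose standard output is sent to the client. The helper name build_cgi_response is hypothetical. The point it illustrates is that the body must be encoded with the same charset that the header announces, so the sketch builds both from one charset argument and writes raw bytes.

```python
import sys


def build_cgi_response(body: str, charset: str = "utf-8") -> bytes:
    """Return header plus body as bytes for a CGI-style response.

    Hypothetical helper: the header names the charset, and the body is
    encoded with that same charset so the label and the data agree.
    """
    # The blank line (double linebreak) ends the header section.
    header = f"Content-Type: text/html; charset={charset}\n\n"
    return header.encode("ascii") + body.encode(charset)


if __name__ == "__main__":
    # Write bytes directly so Python's own output encoding cannot interfere.
    sys.stdout.buffer.write(build_cgi_response("<p>Gr\u00fc\u00dfe</p>"))
```

Writing through sys.stdout.buffer avoids a mismatch in which the header says utf-8 but the interpreter encodes the text with some other locale-dependent encoding.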
PHP. Use the header() function before generating any content, e.g.:
header('Content-type: text/html; charset=utf-8');
Java Servlets. Use the setContentType method on the ServletResponse before obtaining any object (Stream or Writer) used for output, e.g.:
resource.setContentType("text/html;charset=utf-8");
If you use a Writer, the Servlet automatically takes care of the conversion from Java Strings to the encoding selected.
JSP. Use the page directive, e.g.:
<%@ page contentType="text/html; charset=UTF-8" %>
Output from out.println() or the expression elements (<%= %>) is automatically converted to the encoding selected. Also, the page itself is interpreted as being in this encoding.
ASP and ASP.Net. ContentType and charset are set independently, and are properties of the response object. To set the charset, use e.g.:
<% Response.Charset = "utf-8" %>
In ASP.Net, setting Response.ContentEncoding takes care of both the charset parameter in the HTTP Content-Type and the actual encoding of the document sent out (which of course have to be the same). The default can be set in the globalization element in Web.config (or in Machine.config, which is originally set to UTF-8).
Further reading
- Setting charset information in .htaccess
- Checking HTTP Headers
- Tutorial, Handling character encodings in HTML and CSS
- Related links, Setting up a server
HTML Encoding (Character Sets)
To display an HTML page correctly, a web browser must know which character set to use.
From ASCII to UTF-8
ASCII was the first character encoding standard. ASCII defined 128 different characters that could be used on the internet: numbers (0-9), English letters (A-Z and a-z), and some special characters like ! $ + - ( ) @ < >.
ISO-8859-1 was the default character set for HTML 4. This character set supported 256 different character codes. HTML 4 also supported UTF-8.
ANSI (Windows-1252) was the original Windows character set. ANSI is identical to ISO-8859-1, except that ANSI assigns extra characters to the 32 values from 128 to 159 (27 of them are used; 5 are left unassigned).
The HTML5 specification encourages web developers to use the UTF-8 character set, which covers almost all of the characters and symbols in the world!
The HTML charset Attribute
To display an HTML page correctly, a web browser must know the character set used in the page.
This is specified in the <meta> tag:
<meta charset="UTF-8">
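As a rough illustration of how a tool could read that declaration, the sketch below uses Python's standard html.parser module to look for a meta element with a charset attribute. The class name MetaCharsetFinder and the function declared_charset are hypothetical names for this example, and real browsers follow a more elaborate detection procedure than this.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset attribute found on a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == "meta" and self.charset is None:
            for name, value in attrs:
                if name == "charset" and value:
                    self.charset = value.strip().lower()


def declared_charset(html_text: str):
    """Return the declared charset in lowercase, or None if absent."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

For example, declared_charset('<meta charset="UTF-8">') gives "utf-8", while a page with no such element gives None.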
Differences Between Character Sets
The following table displays the differences between the character sets described above:
Numb  ASCII  ANSI  8859  UTF-8  Description
32                              space
33    !      !     !     !      exclamation mark
34    "      "     "     "      quotation mark
35    #      #     #     #      number sign
36    $      $     $     $      dollar sign
37    %      %     %     %      percent sign
38    &      &     &     &      ampersand
39    '      '     '     '      apostrophe
40    (      (     (     (      left parenthesis
41    )      )     )     )      right parenthesis
42    *      *     *     *      asterisk
43    +      +     +     +      plus sign
44    ,      ,     ,     ,      comma
45    -      -     -     -      hyphen-minus
46    .      .     .     .      full stop
47    /      /     /     /      solidus
48    0      0     0     0      digit zero
49    1      1     1     1      digit one
50    2      2     2     2      digit two
51    3      3     3     3      digit three
52    4      4     4     4      digit four
53    5      5     5     5      digit five
54    6      6     6     6      digit six
55    7      7     7     7      digit seven
56    8      8     8     8      digit eight
57    9      9     9     9      digit nine
58    :      :     :     :      colon
59    ;      ;     ;     ;      semicolon
60    <      <     <     <      less-than sign
61    =      =     =     =      equals sign
62    >      >     >     >      greater-than sign
63    ?      ?     ?     ?      question mark
64    @      @     @     @      commercial at
65    A      A     A     A      Latin capital letter A
66    B      B     B     B      Latin capital letter B
67    C      C     C     C      Latin capital letter C
68    D      D     D     D      Latin capital letter D
69    E      E     E     E      Latin capital letter E
70    F      F     F     F      Latin capital letter F
71    G      G     G     G      Latin capital letter G
72    H      H     H     H      Latin capital letter H
73    I      I     I     I      Latin capital letter I
74    J      J     J     J      Latin capital letter J
75    K      K     K     K      Latin capital letter K
76    L      L     L     L      Latin capital letter L
77    M      M     M     M      Latin capital letter M
78    N      N     N     N      Latin capital letter N
79    O      O     O     O      Latin capital letter O
80    P      P     P     P      Latin capital letter P
81    Q      Q     Q     Q      Latin capital letter Q
82    R      R     R     R      Latin capital letter R
83    S      S     S     S      Latin capital letter S
84    T      T     T     T      Latin capital letter T
85    U      U     U     U      Latin capital letter U
86    V      V     V     V      Latin capital letter V
87    W      W     W     W      Latin capital letter W
88    X      X     X     X      Latin capital letter X
89    Y      Y     Y     Y      Latin capital letter Y
90    Z      Z     Z     Z      Latin capital letter Z
91    [      [     [     [      left square bracket
92    \      \     \     \      reverse solidus
93    ]      ]     ]     ]      right square bracket
94    ^      ^     ^     ^      circumflex accent
95    _      _     _     _      low line
96    `      `     `     `      grave accent
97    a      a     a     a      Latin small letter a
98    b      b     b     b      Latin small letter b
99    c      c     c     c      Latin small letter c
100   d      d     d     d      Latin small letter d
101   e      e     e     e      Latin small letter e
102   f      f     f     f      Latin small letter f
103   g      g     g     g      Latin small letter g
104   h      h     h     h      Latin small letter h
105   i      i     i     i      Latin small letter i
106   j      j     j     j      Latin small letter j
107   k      k     k     k      Latin small letter k
108   l      l     l     l      Latin small letter l
109   m      m     m     m      Latin small letter m
110   n      n     n     n      Latin small letter n
111   o      o     o     o      Latin small letter o
112   p      p     p     p      Latin small letter p
113   q      q     q     q      Latin small letter q
114   r      r     r     r      Latin small letter r
115   s      s     s     s      Latin small letter s
116   t      t     t     t      Latin small letter t
117   u      u     u     u      Latin small letter u
118   v      v     v     v      Latin small letter v
119   w      w     w     w      Latin small letter w
120   x      x     x     x      Latin small letter x
121   y      y     y     y      Latin small letter y
122   z      z     z     z      Latin small letter z
123   {      {     {     {      left curly bracket
124   |      |     |     |      vertical line
125   }      }     }     }      right curly bracket
126   ~      ~     ~     ~      tilde
127                             DEL
128          €                  euro sign
129                             NOT USED
130          ‚                  single low-9 quotation mark
131          ƒ                  Latin small letter f with hook
132          „                  double low-9 quotation mark
133          …                  horizontal ellipsis
134          †                  dagger
135          ‡                  double dagger
136          ˆ                  modifier letter circumflex accent
137          ‰                  per mille sign
138          Š                  Latin capital letter S with caron
139          ‹                  single left-pointing angle quotation mark
140          Œ                  Latin capital ligature OE
141                             NOT USED
142          Ž                  Latin capital letter Z with caron
143                             NOT USED
144                             NOT USED
145          ‘                  left single quotation mark
146          ’                  right single quotation mark
147          “                  left double quotation mark
148          ”                  right double quotation mark
149          •                  bullet
150          –                  en dash
151          —                  em dash
152          ˜                  small tilde
153          ™                  trade mark sign
154          š                  Latin small letter s with caron
155          ›                  single right-pointing angle quotation mark
156          œ                  Latin small ligature oe
157                             NOT USED
158          ž                  Latin small letter z with caron
159          Ÿ                  Latin capital letter Y with diaeresis
160                             no-break space
161          ¡     ¡     ¡      inverted exclamation mark
162          ¢     ¢     ¢      cent sign
163          £     £     £      pound sign
164          ¤     ¤     ¤      currency sign
165          ¥     ¥     ¥      yen sign
166          ¦     ¦     ¦      broken bar
167          §     §     §      section sign
168          ¨     ¨     ¨      diaeresis
169          ©     ©     ©      copyright sign
170          ª     ª     ª      feminine ordinal indicator
171          «     «     «      left-pointing double angle quotation mark
172          ¬     ¬     ¬      not sign
173                             soft hyphen
174          ®     ®     ®      registered sign
175          ¯     ¯     ¯      macron
176          °     °     °      degree sign
177          ±     ±     ±      plus-minus sign
178          ²     ²     ²      superscript two
179          ³     ³     ³      superscript three
180          ´     ´     ´      acute accent
181          µ     µ     µ      micro sign
182          ¶     ¶     ¶      pilcrow sign
183          ·     ·     ·      middle dot
184          ¸     ¸     ¸      cedilla
185          ¹     ¹     ¹      superscript one
186          º     º     º      masculine ordinal indicator
187          »     »     »      right-pointing double angle quotation mark
188          ¼     ¼     ¼      vulgar fraction one quarter
189          ½     ½     ½      vulgar fraction one half
190          ¾     ¾     ¾      vulgar fraction three quarters
191          ¿     ¿     ¿      inverted question mark
192          À     À     À      Latin capital letter A with grave
193          Á     Á     Á      Latin capital letter A with acute
194          Â     Â     Â      Latin capital letter A with circumflex
195          Ã     Ã     Ã      Latin capital letter A with tilde
196          Ä     Ä     Ä      Latin capital letter A with diaeresis
197          Å     Å     Å      Latin capital letter A with ring above
198          Æ     Æ     Æ      Latin capital letter AE
199          Ç     Ç     Ç      Latin capital letter C with cedilla
200          È     È     È      Latin capital letter E with grave
201          É     É     É      Latin capital letter E with acute
202          Ê     Ê     Ê      Latin capital letter E with circumflex
203          Ë     Ë     Ë      Latin capital letter E with diaeresis
204          Ì     Ì     Ì      Latin capital letter I with grave
205          Í     Í     Í      Latin capital letter I with acute
206          Î     Î     Î      Latin capital letter I with circumflex
207          Ï     Ï     Ï      Latin capital letter I with diaeresis
208          Ð     Ð     Ð      Latin capital letter Eth
209          Ñ     Ñ     Ñ      Latin capital letter N with tilde
210          Ò     Ò     Ò      Latin capital letter O with grave
211          Ó     Ó     Ó      Latin capital letter O with acute
212          Ô     Ô     Ô      Latin capital letter O with circumflex
213          Õ     Õ     Õ      Latin capital letter O with tilde
214          Ö     Ö     Ö      Latin capital letter O with diaeresis
215          ×     ×     ×      multiplication sign
216          Ø     Ø     Ø      Latin capital letter O with stroke
217          Ù     Ù     Ù      Latin capital letter U with grave
218          Ú     Ú     Ú      Latin capital letter U with acute
219          Û     Û     Û      Latin capital letter U with circumflex
220          Ü     Ü     Ü      Latin capital letter U with diaeresis
221          Ý     Ý     Ý      Latin capital letter Y with acute
222          Þ     Þ     Þ      Latin capital letter Thorn
223          ß     ß     ß      Latin small letter sharp s
224          à     à     à      Latin small letter a with grave
225          á     á     á      Latin small letter a with acute
226          â     â     â      Latin small letter a with circumflex
227          ã     ã     ã      Latin small letter a with tilde
228          ä     ä     ä      Latin small letter a with diaeresis
229          å     å     å      Latin small letter a with ring above
230          æ     æ     æ      Latin small letter ae
231          ç     ç     ç      Latin small letter c with cedilla
232          è     è     è      Latin small letter e with grave
233          é     é     é      Latin small letter e with acute
234          ê     ê     ê      Latin small letter e with circumflex
235          ë     ë     ë      Latin small letter e with diaeresis
236          ì     ì     ì      Latin small letter i with grave
237          í     í     í      Latin small letter i with acute
238          î     î     î      Latin small letter i with circumflex
239          ï     ï     ï      Latin small letter i with diaeresis
240          ð     ð     ð      Latin small letter eth
241          ñ     ñ     ñ      Latin small letter n with tilde
242          ò     ò     ò      Latin small letter o with grave
243          ó     ó     ó      Latin small letter o with acute
244          ô     ô     ô      Latin small letter o with circumflex
245          õ     õ     õ      Latin small letter o with tilde
246          ö     ö     ö      Latin small letter o with diaeresis
247          ÷     ÷     ÷      division sign
248          ø     ø     ø      Latin small letter o with stroke
249          ù     ù     ù      Latin small letter u with grave
250          ú     ú     ú      Latin small letter u with acute
251          û     û     û      Latin small letter u with circumflex
252          ü     ü     ü      Latin small letter u with diaeresis
253          ý     ý     ý      Latin small letter y with acute
254          þ     þ     þ      Latin small letter thorn
255          ÿ     ÿ     ÿ      Latin small letter y with diaeresis
The ASCII Character Set
ASCII uses the values from 0 to 31 (and 127) for control characters.
ASCII uses the values from 32 to 126 for letters, digits, and symbols.
ASCII does not use the values from 128 to 255.
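The three ranges above translate directly into a small check. The sketch below is only an illustration, and the function name ascii_category is a hypothetical choice; it classifies a single byte value the same way the three statements do.

```python
def ascii_category(value: int) -> str:
    """Classify a byte value (0-255) according to the ASCII ranges."""
    if not 0 <= value <= 255:
        raise ValueError("expected a byte value between 0 and 255")
    if value <= 31 or value == 127:
        return "control"
    if value <= 126:
        return "printable"
    # 128-255 lie outside ASCII entirely.
    return "not ASCII"


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if every byte in data is a valid ASCII value."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False
```

Python's own ascii codec enforces the same boundary, so decodes_as_ascii fails for any input containing a byte of 128 or above.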
The ANSI Character Set (Windows-1252)
ANSI is identical to ASCII for the values from 0 to 127.
ANSI has a proprietary set of characters for the values from 128 to 159.
ANSI is identical to ISO-8859-1, and to Unicode (the character set that UTF-8 encodes), for the values from 160 to 255.
The ISO-8859-1 Character Set
ISO-8859-1 is identical to ASCII for the values from 0 to 127.
ISO-8859-1 does not assign printable characters to the values from 128 to 159.
ISO-8859-1 is identical to Unicode (the character set that UTF-8 encodes) for the values from 160 to 255.
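The relationship between ANSI and ISO-8859-1 can be checked with Python's built-in codecs, where Windows-1252 is available as cp1252. This is a sketch under the assumption that Python's codec tables reflect the two character sets faithfully. The variable and function names are illustrative only.

```python
# Bytes 160-255 decode to the same characters in both character sets.
high_range = bytes(range(160, 256))
same_160_to_255 = high_range.decode("cp1252") == high_range.decode("iso-8859-1")

# Byte 128 is the euro sign in Windows-1252, but only a control code
# (not a printable character) when read as ISO-8859-1.
euro_in_ansi = bytes([128]).decode("cp1252")
byte_128_in_8859 = bytes([128]).decode("iso-8859-1")


def cp1252_unassigned() -> list:
    """List the values from 128 to 159 that Windows-1252 leaves unused."""
    unused = []
    for value in range(128, 160):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            unused.append(value)
    return unused
```

Because the two sets differ only in the 128 to 159 range, text mislabeled as ISO-8859-1 but actually saved as Windows-1252 usually shows up as broken curly quotes, dashes, and euro signs.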
The UTF-8 Character Set
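UTF-8 is identical to ASCII for the values from 0 to 127; each of these characters is stored as a single byte.
Unicode, the character set that UTF-8 encodes, does not assign printable characters to the values from 128 to 159.
Unicode is identical to both ANSI and 8859-1 for the values from 160 to 255, although UTF-8 stores each of these characters as two bytes rather than one.
Unicode continues from the value 256 with more than 100 000 additional characters.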
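The values in this section are character numbers (code points), not the bytes that UTF-8 actually writes to a file. The following sketch makes the distinction visible by printing the UTF-8 bytes for a few characters; utf8_hex is a hypothetical helper name, and the characters are written as escapes so the example does not depend on the encoding of the source file.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of one character as a hex string."""
    return ch.encode("utf-8").hex(" ")


# Character number 65 (A): one byte, the same as in ASCII.
ascii_letter = utf8_hex("A")

# Character number 233 (e with acute): one byte in ISO-8859-1,
# but two bytes in UTF-8.
e_acute_latin1 = "\u00e9".encode("iso-8859-1").hex(" ")
e_acute_utf8 = utf8_hex("\u00e9")

# Character number 8364 (euro sign): beyond 255, three bytes in UTF-8.
euro_utf8 = utf8_hex("\u20ac")
```

This is why a UTF-8 page that is mistakenly read as ISO-8859-1 or ANSI shows two odd characters in place of each accented letter: the two bytes are interpreted as two separate one-byte characters.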