PHP Charset FAQ

Warning: This blog post is more than 15 years old – read and use with care.

List of questions

  • General
    • What is the difference between Unicode and UTF-8/UTF-16/...?
    • What is the difference between a charset and an encoding?
    • What is the difference between a character and a byte?
    • How do I determine the charset/encoding of a string?
    • What does «multibyte charset/encoding» mean?
    • What does string transliteration mean?
    • Why do I have such strange characters on my website?
    • Which charset does $client send?
    • How do I send the correct charset/encoding for $client?
    • How do I ensure the correct overall charset/encoding in my web application?
    • Does accept-charset help in HTML forms?
    • Which charset/encoding do strings have in PHP?
    • How do I change the encoding of a string?
    • How do I iterate characterwise over a string?
    • How do I iterate bytewise over a string?
    • How do I determine the length of a string?
    • Why shouldn’t I use htmlentities?
    • Which encoding should I use for my source files?
    • How to ensure using the right encoding in my MySQL database?
    • How to ensure using the right encoding in my PostgreSQL database?

    If the FAQ was helpful to you, you can order me a thank you here: http://wishlist.kore-nordmann.de/

    General

    What is the difference between Unicode and UTF-8/UTF-16/...?

    Unicode is a charset, which means just a set of characters. It says nothing about how the characters are actually stored (mapped to bytes).

    UTF-8 / UTF-16 / ... are encodings, which define how a character is mapped to bytes in a string or byte array.

    UTF-8, UTF-16 and UTF-32 basically differ in the number of bytes used to encode a character. UTF-8 uses one byte for the characters defined in ASCII and a variable width of two to four bytes for all other characters. UTF-16 uses two bytes for most common characters and four bytes for the rest. UTF-32 always uses four bytes per character, which makes iterating over the characters in a string trivial but consumes much more space for common strings. Choosing the right default encoding for your application is not trivial and depends on the strings you typically handle and on how you use them.
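The width differences described above can be observed directly. The following is a minimal sketch, assuming the mbstring extension is loaded and PHP 7 or later (for the \u{...} escapes); the function name encodedLengths is made up for this example.

```php
<?php
// Hypothetical helper: byte length of one UTF-8 encoded character
// after converting it to each of the three encodings.
function encodedLengths(string $utf8Char): array
{
    $lengths = [];
    foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
        // strlen() counts bytes, not characters
        $lengths[$encoding] = strlen(mb_convert_encoding($utf8Char, $encoding, 'UTF-8'));
    }
    return $lengths;
}

$ascii  = encodedLengths('a');          // character defined in ASCII
$umlaut = encodedLengths("\u{00E4}");   // ä
$emoji  = encodedLengths("\u{1F600}");  // a character outside the 16-bit range
```

For the ASCII letter this yields 1, 2 and 4 bytes; only UTF-32 has the same width for all three characters.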

    What is the difference between a charset and an encoding?

    A charset is a set of characters which can be represented in a certain encoding. The encoding actually defines which bytes are used for a certain character.

    An example: The character ‘☯’ is available in the Unicode charset, and probably other charsets, too. But there are different encodings for this character, even for the same charset, like the following:

    Unicode character: ☯
    UTF-8 encoded:     0xE2 0x98 0xAF
    UTF-16 encoded:    0x26 0x2F
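The byte values above can be reproduced with a few lines of PHP. This is only a sketch, assuming the mbstring extension; hexBytes is a hypothetical helper name, and UTF-16BE (big endian, no byte order mark) is assumed as the UTF-16 flavour.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to $encoding and
// return its bytes in the "0xNN 0xNN" notation used in this FAQ.
function hexBytes(string $utf8, string $encoding): string
{
    $converted = mb_convert_encoding($utf8, $encoding, 'UTF-8');
    $bytes = unpack('C*', $converted); // one unsigned integer per byte
    return implode(' ', array_map(function ($byte) {
        return sprintf('0x%02X', $byte);
    }, $bytes));
}

$yinYang = "\u{262F}"; // ☯
$asUtf8  = hexBytes($yinYang, 'UTF-8');
$asUtf16 = hexBytes($yinYang, 'UTF-16BE');
```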

    What is the difference between a character and a byte?

    Generally a byte, or a sequence of bytes, is just an internal representation of a character, depending on the encoding used. The encoding maps characters to bytes or byte sequences.

    In single-byte encodings each character in the charset maps to exactly one byte, so bytes and characters are easily confused, because they always correspond one to one.

    Multibyte encodings like UTF-8 or UTF-16, on the other hand, are more common nowadays and map characters to multiple bytes, so that the representations of different characters may contain the same bytes:

    ⅱ, UTF-8 encoded: 0xE2 0x85 0xB1
    ⅲ, UTF-8 encoded: 0xE2 0x85 0xB2

    As you can see, the first two bytes used for both characters are the same, while only the third byte differs.
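In PHP the distinction shows up as the difference between strlen(), which counts bytes, and mb_strlen(), which counts characters. A small sketch, assuming the mbstring extension and PHP 7 or later:

```php
<?php
$two   = "\u{2171}"; // ⅱ
$three = "\u{2172}"; // ⅲ

$byteLength = strlen($two);             // number of bytes
$charLength = mb_strlen($two, 'UTF-8'); // number of characters

// The first two bytes of both characters are identical.
$sharedPrefix = substr($two, 0, 2) === substr($three, 0, 2);
```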

    How do I determine the charset/encoding of a string?

    There is no reliable way to do this.

    For example, all ISO 8859-* encodings accept any combination of bytes, so there is no way to tell which of them was used. You may guess the encoding if you know something about the contents of a string, for example by detecting multiple expected occurrences of some uncommon characters.

    UTF-8 multibyte character sequences do have some characteristics you may check for, but each UTF-8 string may also be an ISO 8859-* string. To check whether a string is a valid sequence of UTF-8 encoded characters you could use a regular expression, but this won’t actually tell you whether the string is UTF-8; it still might be in nearly any other encoding.
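One widely circulated pattern for such a validity check is sketched below. It is an illustration only, not necessarily the exact expression this FAQ refers to, and the function name isValidUtf8 is made up for this example. Note that it accepts only printable ASCII plus tab, line feed and carriage return in the single-byte range.

```php
<?php
// Hypothetical helper: true if $bytes is a well-formed UTF-8 sequence.
// The pattern works on bytes, so the /u modifier is deliberately not used.
function isValidUtf8(string $bytes): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # printable ASCII, tab, LF, CR
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    // preg_match() returns 1 on match, 0 on no match, false on error
    return preg_match($pattern, $bytes) === 1;
}

$valid   = isValidUtf8("\u{262F}"); // ☯ encoded as UTF-8
$invalid = isValidUtf8("B\xE4r");   // "Bär" encoded as ISO 8859-1
```

If the mbstring extension is available, mb_check_encoding($bytes, 'UTF-8') performs a comparable check without a regular expression.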

    Don’t use it on big strings though, it may crash PCRE.

    What does «multibyte charset/encoding» mean?

    A multibyte encoding uses not just one but multiple bytes for a character. The number of bytes used per character may vary, as in UTF-8 and UTF-16, or be fixed, as in UTF-32.

    In a single-byte encoding only 256 different characters can be represented, as this is the number of different values one byte can have (2^8). This number of characters is not sufficient for lots of languages, and especially not when you try to fit the characters of multiple languages into one charset and encoding (Unicode and UTF-*).

    What does string transliteration mean?

    When converting between different charsets it may happen that not all characters of the source string are available in the destination charset. In this case transliteration aims to provide another character or sequence of characters which sufficiently replaces the source character in the destination charset. Common transliterations for the German umlaut ä are «ae» or simply «a».

    Transliteration in PHP

    In PHP you can transliterate strings using different functions.

    1. The iconv() function supports very basic transliteration depending on the installed locales. Unknown characters are transliterated when you append the string //TRANSLIT to the destination encoding, as shown in the conversion example: How do I change the encoding of a string?.
    2. The extension pecl/translit offers transliterations between several charsets, not depending on installed locales on your system. Check out its documentation for details.
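A minimal sketch of the iconv() variant follows, assuming the iconv extension is loaded; the function name toAscii is made up for this example. The replacement chosen for a character depends on the iconv implementation and the active locale, so no particular output is promised here.

```php
<?php
// Hypothetical helper: transliterate a UTF-8 string to plain ASCII.
// Returns the converted string, or false if iconv() could not convert it.
function toAscii(string $utf8)
{
    // //TRANSLIT asks iconv to substitute characters that are
    // missing in the destination charset instead of failing.
    return iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
}

$unchanged = toAscii('abc');         // already ASCII
$converted = toAscii("B\u{00E4}r");  // "Bär"; result depends on locale and platform
```

Calling setlocale(LC_CTYPE, ...) with a suitable installed locale before the conversion may change which replacement is picked.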

    Why do I have such strange characters on my website?

    The content you send is encoded in a different encoding than the one specified for the client, or than the one the client detects.

    When speaking of websites we talk about browsers in most cases, which determine the encoding of a website based on two factors:

    • The Content-Type header sent by the webserver
    • The content-type meta tag in the (X)HTML header

    To ensure that the browser detects the correct charset and encoding of your website you should set both to the same value. The content type header sent by your website could either be configured in the web server configuration, or sent explicitly by PHP, using something like:

    header( 'Content-type: text/html; charset=utf-8' );
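To keep the HTTP header and the meta tag in sync, both can be generated from the same value. The following is a sketch under that assumption; the helper names contentTypeHeader and contentTypeMetaTag are hypothetical.

```php
<?php
// Hypothetical helpers: build both declarations from one charset value.
function contentTypeHeader(string $charset = 'utf-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

function contentTypeMetaTag(string $charset = 'utf-8'): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . htmlspecialchars($charset, ENT_QUOTES) . '">';
}

// Headers can only be sent before any output has started.
if (!headers_sent()) {
    header(contentTypeHeader());
}
$metaTag = contentTypeMetaTag(); // to be echoed inside the page's <head>
```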

    Which charset does $client send?

    In most cases the client sends the encoding specified for the website. This works only if the browser could determine the encoding unambiguously.

    If the browser can’t detect the specified charset, you cannot really know the charset of the input strings. This is one reason you should respect the HTTP header «Accept-Charset», even though most browsers nowadays know about UTF-8.

    In any case there may be misbehaving browsers, or clients which just try to feed your application with invalid data. That’s why you should try to gracefully handle strings which do not match your expected encoding.
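One possible way to handle unexpected input gracefully is sketched below, assuming the application expects UTF-8, the mbstring extension is loaded and PHP 7.2 or later is used (for mb_scrub()). The function name acceptUtf8Input is made up for this example.

```php
<?php
// Hypothetical helper: return input that is guaranteed to be valid UTF-8.
function acceptUtf8Input(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already well-formed, pass through unchanged
    }
    // mb_scrub() replaces ill-formed byte sequences with a substitute character
    return mb_scrub($raw, 'UTF-8');
}

$clean    = acceptUtf8Input('abc');
$repaired = acceptUtf8Input("B\xE4r"); // ISO 8859-1 bytes sent to a UTF-8 application
```

Rejecting the request instead of repairing the string is an equally valid policy; which one fits depends on the application.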

    How do I send the correct charset/encoding for $client?

    Most HTTP clients send a header with a list of encodings/charsets they can understand and prefer. The header is called Accept-Charset and is available in PHP in $_SERVER['HTTP_ACCEPT_CHARSET']. Most clients actually send lists of encodings they understand instead of lists of charsets. A typical header may look like:

    utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5

    The header tells us that the client likes UTF-8 most (which is an encoding), and thinks it can also handle all kinds of charsets / encodings, with a slight preference for windows-1251, cp1251 and koi8-r. Nowadays most clients can handle UTF-8 encoded content. For other clients you should either use plain ASCII, which should work in most cases, or transform the contents to the requested encoding before sending.
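Reading such a header can be sketched as follows. This is a simplified parser written for illustration, not a complete implementation of the HTTP grammar, and the function name parseAcceptCharset is hypothetical.

```php
<?php
// Hypothetical helper: turn an Accept-Charset value into a map of
// charset name => quality, sorted by descending preference.
function parseAcceptCharset(string $header): array
{
    $preferences = [];
    foreach (explode(',', $header) as $item) {
        $parts = array_map('trim', explode(';', $item));
        $charset = strtolower(array_shift($parts));
        if ($charset === '') {
            continue; // skip empty list entries
        }
        $quality = 1.0; // a missing q parameter means full preference
        foreach ($parts as $param) {
            if (preg_match('/^q\s*=\s*([0-9.]+)$/i', $param, $match)) {
                $quality = (float) $match[1];
            }
        }
        $preferences[$charset] = $quality;
    }
    arsort($preferences); // highest quality first, keys preserved
    return $preferences;
}

$preferences = parseAcceptCharset(
    'utf-8;q=1.0, windows-1251;q=0.8, cp1251;q=0.8, koi8-r;q=0.8, *;q=0.5'
);
$preferred = array_key_first($preferences); // requires PHP 7.3+
```

In a web request the input would come from $_SERVER['HTTP_ACCEPT_CHARSET'], which may be absent, so check it with isset() first.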

    There is also an HTTP header Accept-Encoding, which actually has nothing to do with the encodings we talk about in this FAQ, but contains a list of usable compression formats.

    How do I ensure the correct overall charset/encoding in my web application?

    You need to ensure that you know the correct charset at every point of your application. You should recode all input strings directly when receiving them, maybe in a central input handler where you convert everything to the charset you use consistently in your application. For this read: Which charset does $client send?.
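A central input handler of this kind might look like the following sketch. It assumes the application works in UTF-8 internally, that the source encoding is already known, and that the mbstring extension is loaded; the function name normalizeInput is made up for this example.

```php
<?php
// Hypothetical helper: recode every string in a (possibly nested)
// input array from $sourceEncoding to UTF-8.
function normalizeInput(array $input, string $sourceEncoding): array
{
    array_walk_recursive($input, function (&$value) use ($sourceEncoding) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
        }
    });
    return $input; // the caller's array is untouched, a converted copy is returned
}

$normalized = normalizeInput(
    ['name' => "B\xE4r", 'tags' => ["caf\xE9"]], // ISO 8859-1 encoded values
    'ISO-8859-1'
);
```

In a real application the array would typically be $_POST or $_GET, converted once at the entry point.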

    With all content following a defined encoding, you should make sure that the backend uses the same encoding and also returns data in it. Check the databases section for this, depending on the type of backend you use.

    With all content in your defined encoding, you should also ensure that you declare this encoding in the output. Check How do I send the correct charset/encoding for $client? for details.

    Does accept-charset help in HTML forms?

    HTML forms may define the attribute accept-charset, which tells the browser which encoding to use when sending data to the server. The funny thing is that it breaks horribly and is not handled properly by any browser. A form using this might look like:
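A form with this attribute, rendered from PHP, could be sketched as follows. The field names, the sample value of the hidden field and the function name renderForm are assumptions made for this illustration, not the markup used in the test described here.

```php
<?php
// Hypothetical helper: render a test form with the given accept-charset value.
function renderForm(string $acceptCharset): string
{
    $charset = htmlspecialchars($acceptCharset, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<form method="post" action="" accept-charset="{$charset}">
    <input type="hidden" name="contained" value="äöü ☯">
    <input type="text" name="pasted">
    <input type="submit" value="Send">
</form>
HTML;
}

$form = renderForm('ISO-8859-1');
```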

    Testing this in different browsers on a website which is encoded using UTF-8 causes the following results. The «accept-charset» column contains the value specified in the accept-charset attribute of the form. The «Contained string» column shows the string passed in by the browser for the hidden field, while the «Pasted string» is something the user entered.

    HTML accept-charset Attribute

    The accept-charset attribute specifies the character encodings that are to be used for the form submission.

    Syntax

    <form accept-charset="character_set">

    Attribute Values

    Value          Description
    character_set  A space-separated list of one or more character encodings that are to be used for the form submission. Common values:
                     • UTF-8: Character encoding for Unicode
                     • ISO-8859-1: Character encoding for the Latin alphabet

    In theory, any character encoding can be used, but no browser understands all of them. The more widely a character encoding is used, the better the chance that a browser will understand it.


    # UTF-8

      If you’re using the [PDO](http://www.php.net/manual/en/book.pdo.php) abstraction layer with PHP ≥ 5.3.6, you can specify `charset` in the [DSN](http://www.php.net/manual/en/ref.pdo-mysql.connection.php):

    // PDO: set the connection charset in the DSN
    $handle = new PDO('mysql:charset=utf8mb4');

    // mysqli
    $conn = mysqli_connect('localhost', 'my_user', 'my_password', 'my_db');
    $conn->set_charset('utf8mb4');        // object oriented style
    mysqli_set_charset($conn, 'utf8mb4'); // procedural style

    // Legacy mysql extension (removed in PHP 7): procedural only, charset comes first
    $conn = mysql_connect('localhost', 'my_user', 'my_password');
    mysql_set_charset('utf8mb4', $conn);
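In practice the DSN also names the host and the database. A small sketch of assembling it is shown below; the function name mysqlDsn is hypothetical, and the host, database and credentials are the placeholder values used above.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that includes the charset.
function mysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = mysqlDsn('localhost', 'my_db');

// Connecting needs a running MySQL server, so it is only shown as a comment:
// $handle = new PDO($dsn, 'my_user', 'my_password');
```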
