Setting the encoding in a PHP file

Converting text between UTF-8 and Windows-1251

Encoding problems often come up when writing parsers or reading data from XML and CSV files. Below are several ways to deal with them.

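Windows-1251 to UTF-8

// Option 1: iconv (//IGNORE on the target drops characters that cannot be converted)
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;

// Option 2: mbstring (use one option or the other, not both in a row)
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;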

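UTF-8 to Windows-1251

// Option 1: iconv (//IGNORE on the target drops characters that Windows-1251 lacks)
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;

// Option 2: mbstring (use one option or the other, not both in a row)
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;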
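When a parser receives input from several sources, it may not be known in advance whether a string is already UTF-8. A minimal sketch of a guard for that case follows; the helper name to_utf8() and the Windows-1251 fallback are assumptions made for illustration, not part of the original recipe.

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8, converting from the fallback
 * encoding only when the input is not already valid UTF-8.
 */
function to_utf8(string $text, string $fallback = 'Windows-1251'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2"; // "Привет" as Windows-1251 bytes
$utf8 = to_utf8($cp1251);
var_dump($utf8 === to_utf8($utf8)); // a second call leaves the text unchanged
```

The check is a heuristic: a short Windows-1251 string can occasionally happen to be valid UTF-8, so a known source encoding is always preferable to guessing.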

When nothing else helps

// Undo double encoding: Windows-1251 bytes that were misread as Windows-1252 and then saved as UTF-8
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;
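To see why the two-step conversion works, it can help to reproduce the damage first. The sketch below uses mb_convert_encoding() instead of iconv() and a sample word written with Unicode escapes; both choices are assumptions made so the example does not depend on the encoding of the file it is saved in.

```php
<?php
// "Привет" written with Unicode escapes.
$original = "\u{41F}\u{440}\u{438}\u{432}\u{435}\u{442}";

// Simulate the damage: Windows-1251 bytes wrongly read as Windows-1252
// and then stored as UTF-8.
$cp1251Bytes = mb_convert_encoding($original, 'Windows-1251', 'UTF-8');
$mojibake    = mb_convert_encoding($cp1251Bytes, 'UTF-8', 'Windows-1252');

// Repair: go back to the raw bytes, then read them as Windows-1251.
$rawBytes = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
$repaired = mb_convert_encoding($rawBytes, 'UTF-8', 'Windows-1251');

echo $mojibake, "\n"; // Latin letters with accents instead of Cyrillic
var_dump($repaired === $original);
```

The repair only succeeds when the text was damaged in exactly this way; applied to healthy UTF-8 it will corrupt the text instead.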

Sometimes it gets absurd, but this works:

// Round trip through Windows-1251: anything Windows-1251 cannot represent is dropped
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;

file_get_contents / cURL

Sometimes file_get_contents() or cURL returns mojibake (Алмазные борÑ). Here the text itself is not in the wrong encoding; the output simply carries no BOM, so the browser guesses the encoding incorrectly.

$text = file_get_contents('https://example.com');
// Prepend the UTF-8 BOM so the browser detects the encoding
$text = "\xEF\xBB\xBF" . $text;
echo $text;
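If the fetched text may already start with a BOM, prepending a second one would leave stray bytes in the output. A small sketch of an idempotent variant; the constant and function names are hypothetical.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Add the UTF-8 BOM only when the text does not already start with one.
function with_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? $text : UTF8_BOM . $text;
}

// Strip a leading UTF-8 BOM, e.g. before parsing the text as CSV or XML.
function without_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? substr($text, 3) : $text;
}

var_dump(with_utf8_bom(with_utf8_bom('abc')) === UTF8_BOM . 'abc');
```

Sending header('Content-Type: text/html; charset=UTF-8') before any output is another way to tell the browser which encoding to use.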

There are also cases where file_get_contents() returns what looks like unreadable binary data.

This is GZIP-compressed text: the function does not send the proper headers and does not decompress the response itself. The fix is to use cURL:

function getcontents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Request gzip and let cURL decompress the response
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    // Certificate checks are switched off here; avoid this outside of testing
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

echo getcontents('https://example.com');
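If switching to cURL is not an option, the compressed body can also be unpacked after the fact. The sketch below assumes the zlib extension is loaded; the helper name maybe_gunzip() is hypothetical.

```php
<?php
// Decompress $data if it looks like a GZIP stream, otherwise return it as is.
function maybe_gunzip(string $data): string
{
    // GZIP streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($data, "\x1f\x8b", 2) !== 0) {
        return $data;
    }
    $decoded = @gzdecode($data);
    // Fall back to the raw data if decoding fails.
    return $decoded === false ? $data : $decoded;
}

// Intended use (not executed here because it needs network access):
// echo maybe_gunzip(file_get_contents('https://example.com'));
var_dump(maybe_gunzip(gzencode('hello')) === 'hello');
```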


mb_internal_encoding

encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module. You should notice that the internal encoding is totally different from the one for multibyte regex.

Return Values

If encoding is set, the function returns true on success or false on failure. In this case, the character encoding for multibyte regex is NOT changed. If encoding is omitted, the current character encoding name is returned.


Errors/Exceptions

As of PHP 8.0.0, a ValueError is thrown if the value of encoding is an invalid encoding. Prior to PHP 8.0.0, an E_WARNING was emitted instead.
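Code that has to run on both sides of that change can wrap the call so that an invalid name is reported the same way everywhere. A minimal sketch; the wrapper name set_internal_encoding_safely() is hypothetical.

```php
<?php
// Try to set the internal encoding; return false for an unknown name
// on both PHP 7 and PHP 8.
function set_internal_encoding_safely(string $encoding): bool
{
    try {
        // PHP < 8.0: returns false and emits an E_WARNING (silenced with @).
        return @mb_internal_encoding($encoding);
    } catch (ValueError $e) {
        // PHP >= 8.0: an invalid encoding name throws instead.
        return false;
    }
}

var_dump(set_internal_encoding_safely('UTF-8'));
var_dump(set_internal_encoding_safely('no-such-encoding'));
echo mb_internal_encoding(), "\n"; // still the last valid value
```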

Changelog

Version   Description
8.0.0     encoding is nullable now.
8.0.0     Now throws a ValueError if encoding is an invalid encoding. Previously an E_WARNING was emitted instead.

Examples

Example #1 mb_internal_encoding() example

<?php
/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

/* Display current internal character encoding */
echo mb_internal_encoding();
?>

See Also

  • mb_http_input() — Detect HTTP input character encoding
  • mb_http_output() — Set/Get HTTP output character encoding
  • mb_detect_order() — Set/Get character encoding detection order
  • mb_regex_encoding() — Set/Get character encoding for multibyte regex

User Contributed Notes

Especially when writing PHP scripts for use on different servers, it is a very good idea to explicitly set the internal encoding somewhere at the top of every document served, e.g. with mb_internal_encoding('UTF-8').

This, in combination with the MySQL statement "SET NAMES 'utf8'", will save a lot of debugging trouble.

Also, use the multi-byte string functions instead of the ones you may be used to, e.g. mb_strlen() instead of strlen(), etc.
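The difference is easy to see on a short Cyrillic word. The sample string below is an assumption chosen for illustration and is written with Unicode escapes so the result does not depend on the encoding of the source file.

```php
<?php
$word = "\u{41F}\u{440}\u{438}\u{432}\u{435}\u{442}"; // "Привет", six letters

var_dump(strlen($word));             // counts bytes: int(12)
var_dump(mb_strlen($word, 'UTF-8')); // counts characters: int(6)

// substr() cuts in the middle of a character, mb_substr() does not.
var_dump(mb_check_encoding(substr($word, 0, 3), 'UTF-8'));              // bool(false)
var_dump(mb_check_encoding(mb_substr($word, 0, 3, 'UTF-8'), 'UTF-8'));  // bool(true)
```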

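header('Content-Type: text/html; charset=UTF-8');

mb_internal_encoding('UTF-8');
mb_http_output('UTF-8');
// The original note also called mb_http_input('UTF-8'), but that function only reports
// the detected input encoding; its argument is an input type such as "G" or "P", not an encoding name.
mb_regex_encoding('UTF-8');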

Be aware that the strings in your source files must match the encoding you specify by mb_internal_encoding. It appears the Parser loads raw bytes from the file and refers to its internal encoding to determine their actual encoding.

To demonstrate, the following outputs as expected when the /source/ file is Latin-1 encoded:

mb_internal_encoding("iso-8859-1");
mb_http_output("UTF-8");
ob_start("mb_output_handler");


Now, a typical use of mb_internal_encoding is shown as follows. Make the change to "utf-8" but leave the /source/ file encoding unchanged:

mb_internal_encoding("UTF-8");
mb_http_output("UTF-8");
ob_start("mb_output_handler");

The output will show just the tag and no text.

Save the file as UTF-8 encoding and then the results will be as expected.
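One plausible reading of this behaviour is that bytes typed into a Latin-1 file are simply not valid UTF-8, so treating them as UTF-8 cannot produce readable text. The sketch below illustrates that with a sample string of umlauts; the string is an assumption, since the text echoed in the original note is not preserved here.

```php
<?php
// "öäü" as the bytes a Latin-1 encoded source file would contain.
$latin1 = "\xF6\xE4\xFC";

// Declared as UTF-8, these bytes are invalid.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Declared correctly as ISO-8859-1, they convert cleanly.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```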
