PHP: encode a file to UTF-8

PHP 8.2: utf8_encode and utf8_decode functions deprecated

utf8_encode and utf8_decode functions, despite their names, are used to convert strings between ISO-8859-1 (also known as "Latin 1") and UTF-8 encodings. These functions do not attempt to detect the actual character encoding in a given text, and always convert character encodings between ISO-8859-1 and UTF-8, even if the source text is not encoded in ISO-8859-1.

Although PHP includes utf8_encode and utf8_decode functions in its standard library, these functions cannot be used to detect and convert other character encodings such as Windows-1252, UTF-16, and UTF-32 to UTF-8. Passing arbitrary text to utf8_encode function is prone to bugs that do not result in any warnings or errors but may lead to undesired results.

Some frequent examples of bugs include:

  • The Euro sign ( € , UTF-8 byte sequence \xE2\x82\xAC ), when passed to the utf8_encode function as utf8_encode("€"), results in garbled text (also called "Mojibake"): â¬ (bytes \xC3\xA2\xC2\x82\xC2\xAC).
  • The German Eszett character ( ß , UTF-8 byte sequence \xC3\x9F ), when passed through utf8_encode("ß"), results in Ã (bytes \xC3\x83\xC2\x9F).

Neither of the examples above emits any warnings or errors, although the resulting text is wrong.
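The double encoding described above can be made visible by printing the resulting bytes as hex. This is a minimal sketch, assuming the mbstring extension is available; mb_convert_encoding() with an explicit ISO-8859-1 source is used as a stand-in for utf8_encode() so that no deprecation notice is raised, and the helper name latin1_bytes_to_utf8_hex is made up for illustration.

```php
<?php
// Hypothetical helper: treat every input byte as one ISO-8859-1 character
// (which is what utf8_encode() does) and return the UTF-8 result as hex.
function latin1_bytes_to_utf8_hex(string $bytes): string
{
    return bin2hex(mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1'));
}

// "€" that is already UTF-8 (3 bytes) turns into 6 bytes of mojibake.
echo latin1_bytes_to_utf8_hex("\xE2\x82\xAC"), "\n"; // c3a2c282c2ac

// "ß" that is already UTF-8 (2 bytes) turns into 4 bytes.
echo latin1_bytes_to_utf8_hex("\xC3\x9F"), "\n";     // c383c29f

// A genuine ISO-8859-1 "ß" (1 byte) is converted correctly.
echo latin1_bytes_to_utf8_hex("\xDF"), "\n";         // c39f
```

Only the last call receives real ISO-8859-1 input, and only that call produces the expected UTF-8 bytes.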

Because of the misleading function names, lack of error messages and warnings, and the lack of support for character encodings other than ISO-8859-1, utf8_encode and utf8_decode functions are deprecated in PHP 8.2.

Using the utf8_encode and utf8_decode functions emits a deprecation notice in PHP 8.2, and the functions will be removed in PHP 9.0.

utf8_encode('foo');
utf8_decode('foo');

Function utf8_encode() is deprecated in ... on line ...
Function utf8_decode() is deprecated in ... on line ...

Replacements for the deprecated functions

The utf8_encode function converts an ISO-8859-1 encoded string to UTF-8. Most utf8_encode calls in legacy PHP applications use this function as an additional safeguard against potentially malformed text, but as shown in the examples above, using it this way often results in undesired outcomes rather than fixing any malformed text.

Similarly, calling the utf8_decode function on a string decodes that string to ISO-8859-1 character encoding. The majority of web applications, web sites, and text formats in fact expect UTF-8 encoded text and not ISO-8859-1.


It might be ideal to reevaluate the need of utf8_encode and utf8_decode function calls prior to replacing them, because more often than not, these function calls are not required, and only result in undesired outcomes.
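One way to reevaluate such calls is to convert only when the input is not already valid UTF-8. The sketch below assumes the mbstring extension, and it assumes that any input failing the UTF-8 check is ISO-8859-1; that second assumption is about the application's data and is not something PHP can verify. The function name ensure_utf8 is hypothetical.

```php
<?php
// Hypothetical guard: leave valid UTF-8 untouched, convert everything else
// from ISO-8859-1 (an assumption about where the data comes from).
function ensure_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        // Already valid UTF-8: converting again would produce mojibake.
        return $text;
    }

    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(ensure_utf8("\xE2\x82\xAC")), "\n"; // e282ac (unchanged)
echo bin2hex(ensure_utf8("caf\xE9")), "\n";      // 636166c3a9
```

Pure ASCII text passes the UTF-8 check as well, so it is returned unchanged.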

PHP does not bundle multi-byte character encoding functions in its core, but the mbstring , intl , and iconv extensions provide robust and accurate functionality to detect and convert character encodings. Both mbstring and iconv are bundled with PHP; mbstring is the more widely used of the two in modern PHP applications, and can be polyfilled as well.

Replacements for utf8_encode

If the actual use case of an existing utf8_encode function call is to convert a known ISO-8859-1 string to UTF-8, it is possible to use iconv , intl , or mbstring extensions to properly convert the encoding. Alternatively, it is possible to directly convert code-points to UTF-8 string as well using user-land PHP albeit with a small performance penalty.

When the use case of utf8_encode is to automatically detect the character encoding and convert it to UTF-8, even though the function did not detect character encodings in the first place, the replacement would be detecting the character encoding first, and then converting it to UTF-8.

Approach | ISO-8859-1 to UTF-8 | Any encoding to UTF-8
PHP Standard Functions | ISO-8859-1 to UTF-8 using Standard PHP Functions | N/A
With mbstring | ISO-8859-1 to UTF-8 using mbstring | Any encoding to UTF-8 using mbstring
With intl | ISO-8859-1 to UTF-8 using intl | N/A
With iconv | ISO-8859-1 to UTF-8 using iconv | N/A

ISO-8859-1 to UTF-8 using Standard PHP Functions

symfony/polyfill-php72 library provides a PHP function that mimics the utf8_encode functionality using standard PHP functions. For better readability and to convey the meaning of the function, it is renamed to iso8859_1_to_utf8 in the example below.

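function iso8859_1_to_utf8(string $s): string {
    // Double the buffer: the second half holds the input, the first half receives the output.
    $s .= $s;
    $len = \strlen($s);

    for ($i = $len >> 1, $j = 0; $i < $len; ++$i, ++$j) {
        switch (true) {
            // ASCII bytes are copied unchanged.
            case $s[$i] < "\x80": $s[$j] = $s[$i]; break;
            // 0x80-0xBF become \xC2 followed by the same byte.
            case $s[$i] < "\xC0": $s[$j] = "\xC2"; $s[++$j] = $s[$i]; break;
            // 0xC0-0xFF become \xC3 followed by the byte minus 64.
            default: $s[$j] = "\xC3"; $s[++$j] = \chr(\ord($s[$i]) - 64); break;
        }
    }

    return substr($s, 0, $j);
}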

With the function above declared in application code, it is now possible to replace all utf8_encode calls with the new iso8859_1_to_utf8 function to avoid the deprecation notice:

- utf8_encode($string);
+ iso8859_1_to_utf8($string);

ISO-8859-1 to UTF-8 using mbstring

mbstring extension, one of the most widely used optional PHP extensions, provides a cleaner and straight-forward approach to convert ISO-8859-1 encoded strings to UTF-8. This can be used to replace the utf8_encode function deprecated in PHP 8.2.

- utf8_encode($string);
+ mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

Any encoding to UTF-8 using mbstring

When the actual character encoding of the input text is unknown, forcing PHP to detect it can produce erroneous results. However, it is possible to make a reasonable guess of the source character encoding and convert it to UTF-8 using the mbstring extension.

- utf8_encode($string);
+ mb_convert_encoding($string, 'UTF-8', mb_list_encodings());
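A narrower variant of the same idea is to detect against a short list of encodings the application actually expects, instead of every encoding mbstring knows. The following is a sketch only: the function name to_utf8_guess and the default candidate list are assumptions chosen for illustration, and detection remains a heuristic that can pick the wrong encoding for short or ambiguous input.

```php
<?php
// Hypothetical helper: guess the source encoding from a small candidate list
// (strict mode rejects candidates for which the bytes are invalid), then
// convert to UTF-8. Returns null when no candidate matches.
function to_utf8_guess(
    string $text,
    array $candidates = ['UTF-8', 'Windows-1252', 'ISO-8859-1']
): ?string {
    $detected = mb_detect_encoding($text, $candidates, true);
    if ($detected === false) {
        return null;
    }

    return mb_convert_encoding($text, 'UTF-8', $detected);
}

// "\xE9" is not valid UTF-8 on its own, so a single-byte candidate is chosen.
echo bin2hex(to_utf8_guess("caf\xE9")), "\n"; // 636166c3a9
```

Keeping the candidate list short and ordered by likelihood tends to give more predictable results than passing mb_list_encodings().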

ISO-8859-1 to UTF-8 using intl

The UConverter class in the intl extension also provides a way to convert character encodings from one to another. It follows a similar function signature as mbstring counterparts as well. Using UConverter::transcode , it is possible to replicate utf8_encode functionality:

- utf8_encode($string);
+ UConverter::transcode($string, 'UTF8', 'ISO-8859-1');

ISO-8859-1 to UTF-8 using iconv

Applications that can use the iconv extension can replace the utf8_encode function using iconv function:

- utf8_encode($string);
+ iconv('ISO-8859-1', 'UTF-8', $string);

Replacements for utf8_decode

utf8_decode function decodes a UTF-8 encoded string to ISO-8859-1. With the utf8_decode function deprecated, it is possible to replicate this functionality using PHP standard functions, mbstring extension, intl extension, or iconv extension.

Approach | UTF-8 to ISO-8859-1
PHP Standard Functions | UTF-8 to ISO-8859-1 using Standard PHP Functions
With mbstring | UTF-8 to ISO-8859-1 using mbstring
With intl | UTF-8 to ISO-8859-1 using intl
With iconv | UTF-8 to ISO-8859-1 using iconv

UTF-8 to ISO-8859-1 using Standard PHP Functions

Similar to the utf8_encode polyfill, the symfony/polyfill-php72 library provides a PHP function that mimics the utf8_decode functionality:

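function utf8_to_iso8859_1(string $string): string {
    $s = (string) $string;
    $len = \strlen($s);

    for ($i = 0, $j = 0; $i < $len; ++$i, ++$j) {
        switch ($s[$i] & "\xF0") {
            // Two-byte sequence: decode the code point, keep it only if it fits in one byte.
            case "\xC0":
            case "\xD0":
                $c = (\ord($s[$i] & "\x1F") << 6) | \ord($s[++$i] & "\x3F");
                $s[$j] = $c < 256 ? \chr($c) : '?';
                break;

            // Four-byte sequence: skip one extra byte, then fall through.
            case "\xF0":
                ++$i;
                // no break

            // Three-byte sequence: not representable in ISO-8859-1.
            case "\xE0":
                $s[$j] = '?';
                $i += 2;
                break;

            // ASCII byte: copied unchanged.
            default:
                $s[$j] = $s[$i];
        }
    }

    return substr($s, 0, $j);
}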

With the function above included, it is now possible to replace utf8_decode calls with the new utf8_to_iso8859_1 function:

- utf8_decode($string);
+ utf8_to_iso8859_1($string);

UTF-8 to ISO-8859-1 using mbstring

Using mbstring , the following example replaces the deprecated utf8_decode function with mb_convert_encoding :

- utf8_decode($string);
+ mb_convert_encoding($string, 'ISO-8859-1', 'UTF-8');

UTF-8 to ISO-8859-1 using intl

With the help of UConverter::transcode in the intl extension, the following example shows a utf8_decode replacement:

- utf8_decode($string);
+ UConverter::transcode($string, 'ISO-8859-1', 'UTF8', ['to_subst' => '?']);

UTF-8 to ISO-8859-1 using iconv

iconv function can also be used to mimic and replace the utf8_decode functionality to avoid the utf8_decode deprecation in PHP 8.2:

- utf8_decode($string);
+ iconv('UTF-8', 'ISO-8859-1', $string);
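Whichever replacement is used, converting UTF-8 to ISO-8859-1 loses every character that ISO-8859-1 cannot represent. A round-trip comparison is one way to find out beforehand whether a given string would survive. This is a sketch assuming the mbstring extension; the function name is_latin1_representable is hypothetical.

```php
<?php
// Hypothetical check: convert UTF-8 to ISO-8859-1 and back, and report
// whether the text is unchanged. Characters outside ISO-8859-1 are replaced
// by a substitute character on the way out, so the comparison fails for them.
function is_latin1_representable(string $utf8): bool
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

    return $back === $utf8;
}

var_dump(is_latin1_representable("caf\xC3\xA9"));  // bool(true):  "café"
var_dump(is_latin1_representable("\xE2\x82\xAC")); // bool(false): "€"
```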

Backwards Compatibility Impact

utf8_encode and utf8_decode functions are sometimes used in legacy PHP applications and applications that process incoming data and files with various character encodings. These functions are deprecated in PHP 8.2, and will be removed in PHP 9.0 because these functions are misleadingly named, and are prone to unexpected and undesired results that emit no warnings or errors.

In PHP 8.2 and later, using these functions results in a deprecation notice each time the functions are called.

utf8_encode and utf8_decode functions are to be removed from PHP in PHP 9.0.

A large number of applications that use these functions use them without being aware that they only work with ISO-8859-1 character encoding and nothing else for the source character encoding. It is possible that the ideal fix for the deprecation is to see why these functions are used in the first place, and determine if they are absolutely necessary.


Depending on the availability of PHP extensions and the willingness to use a somewhat slower pure-PHP implementation, utf8_encode and utf8_decode function calls can be replaced with any of the alternatives described above.
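The choice between extensions can also be made at runtime. The sketch below tries mbstring, then iconv, then intl, and finally falls back to a plain PHP loop; the function name latin1_to_utf8 and the order of preference are assumptions made for illustration, not a recommendation from the PHP project.

```php
<?php
// Hypothetical wrapper: ISO-8859-1 to UTF-8 using whichever extension is loaded.
function latin1_to_utf8(string $latin1): string
{
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    }

    if (function_exists('iconv')) {
        $converted = iconv('ISO-8859-1', 'UTF-8', $latin1);
        if ($converted !== false) {
            return $converted;
        }
    }

    if (class_exists('UConverter')) {
        $converted = UConverter::transcode($latin1, 'UTF-8', 'ISO-8859-1');
        if ($converted !== false) {
            return $converted;
        }
    }

    // Pure-PHP fallback: every byte >= 0x80 becomes a two-byte UTF-8 sequence.
    $out = '';
    $len = strlen($latin1);
    for ($i = 0; $i < $len; ++$i) {
        $byte = ord($latin1[$i]);
        $out .= $byte < 0x80
            ? $latin1[$i]
            : chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
    }

    return $out;
}

echo bin2hex(latin1_to_utf8("\xDF")), "\n"; // c39f
```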


Re-encoding text between UTF-8 and Windows-1251

Encoding problems come up often when writing parsers and when reading data from XML and CSV files. Below are some ways to solve them.

Windows-1251 to UTF-8

Using iconv:

$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;

Using mbstring:

$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;

UTF-8 to Windows-1251

Using iconv:

$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;

Using mbstring:

$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;

When nothing else helps

$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;
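The two-step conversion above repairs text whose Windows-1251 bytes were at some point misread as Windows-1252 and then stored as UTF-8. The sketch below shows the same repair with mbstring instead of iconv (an assumption made so the example does not depend on the system's iconv implementation); the function name repair_cp1251_mojibake is hypothetical.

```php
<?php
// Hypothetical helper mirroring the two iconv() calls above.
function repair_cp1251_mojibake(string $garbled): string
{
    // Step 1: undo the wrong decoding to get the original Windows-1251 bytes back.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

    // Step 2: decode those bytes with the encoding they were really written in.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');
}

// Build a garbled sample: "Привет" as Windows-1251 bytes, misread as Windows-1252.
$cp1251  = mb_convert_encoding('Привет', 'Windows-1251', 'UTF-8');
$garbled = mb_convert_encoding($cp1251, 'UTF-8', 'Windows-1252');

echo $garbled, "\n";                         // Ïðèâåò
echo repair_cp1251_mojibake($garbled), "\n"; // Привет
```

The repair is only reliable when every original byte has a defined Windows-1252 character; bytes that were dropped or substituted during the first mistake cannot be recovered.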

Sometimes it gets absurd, but this works:

$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;

file_get_contents / cURL

Sometimes file_get_contents() or cURL returns gibberish (Алмазные борÑ). The cause here is not the encoding itself but the missing BOM: without it, whatever displays the text may not recognise it as UTF-8.

$text = file_get_contents('https://example.com');
$text = "\xEF\xBB\xBF" . $text;
echo $text;
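If the same text can pass through this step more than once, it may be worth adding the BOM only when it is not already there, and stripping it again before further processing. A small sketch follows; the constant UTF8_BOM and both function names are made up for illustration.

```php
<?php
// The UTF-8 byte order mark, as used in the snippet above.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: prepend the BOM unless the text already starts with it.
function with_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? $text : UTF8_BOM . $text;
}

// Hypothetical helper: remove a leading BOM if present.
function without_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? substr($text, 3) : $text;
}

echo bin2hex(with_utf8_bom('abc')), "\n";                // efbbbf616263
echo bin2hex(with_utf8_bom(with_utf8_bom('abc'))), "\n"; // efbbbf616263
echo without_utf8_bom(with_utf8_bom('abc')), "\n";       // abc
```

When the output is an HTTP response, sending a Content-Type header with charset=utf-8 is a common alternative to a BOM.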

There are also cases where file_get_contents() returns unreadable, binary-looking output.

This is GZIP-compressed text, returned because the function does not send the right headers. A solution using cURL:

function getcontents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    // Request gzip and let cURL decompress the response.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
    // Note: the next two options switch off TLS certificate verification.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

echo getcontents('https://example.com');
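When switching to cURL is not an option, the compressed body can also be decoded after the fact. The sketch below assumes the zlib extension is available and that the body is a gzip stream (recognised by its first two bytes); the function name maybe_gunzip is hypothetical.

```php
<?php
// Hypothetical helper: decompress the body if it looks like gzip,
// otherwise return it unchanged.
function maybe_gunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1F 0x8B.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }

    $decoded = gzdecode($body);

    // On a corrupt stream gzdecode() returns false; keep the original body then.
    return $decoded === false ? $body : $decoded;
}

// Usage, with the same example URL as above:
// $text = maybe_gunzip(file_get_contents('https://example.com'));
echo maybe_gunzip(gzencode('hello')), "\n"; // hello
```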


PHP UTF-8 Conversion


  1. Use utf8_encode() and utf8_decode() to Encode and Decode Strings in PHP
  2. Use iconv() to Convert a String to UTF-8

UTF-8 is a way to encode Unicode characters, using one to four bytes per character.

It makes it possible to handle special characters and characters from languages other than English.

PHP has different ways to convert text into UTF-8.

Use utf8_encode() and utf8_decode() to Encode and Decode Strings in PHP

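Both utf8_encode() and utf8_decode() are built-in functions in PHP, although both are deprecated as of PHP 8.2.

utf8_encode() converts an ISO-8859-1 string to UTF-8, and utf8_decode() converts a UTF-8 string back to ISO-8859-1. Neither handles any other encoding, and each takes the string as its only parameter.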

<?php
$demo = "\xE0\xE9\xED"; // ISO-8859-1 string àéí
echo "UTF-8 Encoded String: ";
echo utf8_encode($demo) . "\n";
echo "UTF-8 Decoded String: ";
echo utf8_decode(utf8_encode($demo)) . "\n";
echo "UTF-8 Encoded String from the decoded: ";
echo utf8_encode(utf8_decode(utf8_encode($demo))) . "\n";
?>

The code above encodes an ISO-8859-1 string to UTF-8 and then decodes the output again. The input string is given in ISO-8859-1 encoding.

UTF-8 Encoded String: àéí
UTF-8 Decoded String: ���
UTF-8 Encoded String from the decoded: àéí

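utf8_decode() converts a string of ISO-8859-1 characters encoded as UTF-8 back to single-byte ISO-8859-1 .

When ISO-8859-1 encoded text is read as UTF-8 , bytes above \x7F do not form valid sequences, which is why they are usually displayed as the replacement character (�), as in the second line of the output.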

Use iconv() to Convert a String to UTF-8

iconv() is another built-in PHP function, used to convert a string from one character encoding to another.

It takes three parameters: the first is the string's current encoding, the second is the encoding to convert to, and the third is the string itself.
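As a short illustration of the three parameters, the sketch below converts the same ISO-8859-1 sample bytes used earlier (àéí) to UTF-8. It assumes the iconv extension is loaded; the wrapper name latin1_to_utf8_iconv is made up, and the check for false is there because iconv() returns false when the conversion fails.

```php
<?php
// Hypothetical wrapper around iconv(source encoding, target encoding, string).
function latin1_to_utf8_iconv(string $latin1): string
{
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
    if ($utf8 === false) {
        throw new RuntimeException('iconv() could not convert the input');
    }

    return $utf8;
}

$demo = "\xE0\xE9\xED"; // ISO-8859-1 bytes for àéí
echo bin2hex(latin1_to_utf8_iconv($demo)), "\n"; // c3a0c3a9c3ad
```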
