PHP: convert Unicode to UTF-8

Converting Unicode code points to UTF-8

Currently I have something like this \u4eac\u90fd and I want to convert it to UTF-8 so I can insert it into a database.


Most likely, the \u escape sequence was already sent by the web browser. This would be the original source of your problem — you need to make the web browser stop doing that.

For that, you need to make sure that the browser knows what encoding to use when submitting the form. The browser will, by default, always use the encoding of the HTML page that contains the form. Make sure that this web page is encoded in UTF-8, and has a UTF-8 charset declaration in a meta header. With that done, the browser should submit UTF-8 data correctly, and you shouldn’t need to convert anything at all.
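To confirm that the submitted data really arrives as UTF-8 before it goes into the database, a check along these lines may help. This is only a sketch: the function name is_valid_utf8 is made up for illustration, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not from the answer above): true if the bytes form valid UTF-8.
function is_valid_utf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// A complete 3-byte sequence (U+4EAC) passes; a truncated one does not.
$complete  = is_valid_utf8("\xE4\xBA\xAC");
$truncated = is_valid_utf8("\xE4\xBA");
```

In a form handler the same check would be applied to the submitted field before the insert.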

Credit for using JSON goes to @bobince (https://stackoverflow.com/a/7107750), where the reverse is sought (UTF-8 to code points). There, ASCII characters will not be converted to code points, but with json_decode, escaped ASCII code points will be converted to characters, e.g. '"\u0041"' -> 'A'.

(Remember that you need the double quotes inside your string. I was confused why json_decode('\u4eac\u90fd'); was giving no output 🙂)
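The point about the double quotes can be seen in a short sketch. The variable names are only for illustration; the escaped text is the one from the question.

```php
<?php
$escaped = '\u4eac\u90fd';

// Without surrounding double quotes the input is not valid JSON, so the result is null.
$withoutQuotes = json_decode($escaped);

// Wrapped in double quotes it is a JSON string literal and decodes to UTF-8.
$withQuotes = json_decode('"' . $escaped . '"');

// The decoded value is six bytes: e4 ba ac (U+4EAC) followed by e9 83 bd (U+90FD).
$bytes = bin2hex($withQuotes);
```

Echoing null prints nothing, which matches the "no output" behaviour described above.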

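Note there will be special requirements for 4-byte UTF-8 encodings, where the code point consists of 5 or 6 hexadecimal digits. JSON has no curly-brace form such as \u{10348}; its \u escape always takes exactly four hexadecimal digits, so these characters are written as two escapes: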

echo json_encode('𐍈'); //output: "\ud800\udf48" 

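𐍈 is U+10348. The two \u escapes are not two separate characters but a UTF-16 surrogate pair (a high surrogate followed by a low surrogate) that together encode the one code point. Keep this in mind when dealing with 4-byte UTF-8 encodings (e.g. emoji).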
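The arithmetic behind that pair can be sketched as follows. The helper name surrogate_pair is hypothetical; the formula is the standard UTF-16 one (subtract 0x10000, then split the remaining 20 bits into two groups of 10).

```php
<?php
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function surrogate_pair(int $codePoint): array
{
    $offset = $codePoint - 0x10000;
    $high = 0xD800 + ($offset >> 10);   // top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF); // bottom 10 bits
    return [$high, $low];
}

[$high, $low] = surrogate_pair(0x10348);

// Same textual form as the json_encode output shown above.
$escaped = sprintf('\u%04x\u%04x', $high, $low);

// json_decode recombines the pair into a single 4-byte UTF-8 character.
$decoded = json_decode('"' . $escaped . '"');
```

So for U+10348 the offset is 0x348, which gives d800 and df48, and decoding the pair yields one character of four bytes.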


This is one of those frustrating examples where a standard purpose-made function should exist, but instead one has to use a workaround and finds many complicated user functions online.


PHP Snippet: How to decode Unicode escapes in PHP strings (`\u1234`)

The json_decode function can automatically convert Unicode escape sequences (such as \u2018) in PHP strings:

$text = 'Elon \u2018Technoking\u2019 Musk';

echo json_decode('"' . $text . '"'); // prints "Elon ‘Technoking’ Musk"

You can also create a helper function for this:

function unicode_decode(string $text): string
{
    // Wrap the text in double quotes so that it is a valid JSON string literal.
    return json_decode('"' . $text . '"');
}

$text = 'Elon \u2018Technoking\u2019 Musk';

echo unicode_decode($text); // prints "Elon ‘Technoking’ Musk"

For more information, please refer to this Stack Overflow post.
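One caveat with wrapping the whole text in quotes: if the text itself contains a double quote or a line break, it is no longer a valid JSON string and json_decode returns null. A more defensive variant, shown here only as a sketch with a made-up function name, passes just the runs of \uXXXX escapes to json_decode and leaves everything else alone.

```php
<?php
// Hypothetical helper: decode only the \uXXXX runs, not the surrounding text.
function decode_escape_runs(string $text): string
{
    // One or more consecutive \uXXXX escapes, so surrogate pairs stay together.
    $regex = '/(?:\\\\u[0-9a-fA-F]{4})+/';

    $result = preg_replace_callback($regex, function (array $m): string {
        $decoded = json_decode('"' . $m[0] . '"');
        // If the run cannot be decoded, leave it as it was.
        return is_string($decoded) ? $decoded : $m[0];
    }, $text);

    return $result ?? $text;
}

$plain  = decode_escape_runs('Elon \u2018Technoking\u2019 Musk');
$quoted = decode_escape_runs('He said "hi" \u2018ok\u2019');
```

The trade-off is an extra regular expression pass, in exchange for not depending on the rest of the text being JSON-safe.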

Method 2: Decoding using Unicode Encoding in PHP 7+

Since PHP 7.0, you can use a Unicode escape syntax to represent code points:

echo "\u{9999}"; // prints the character with code point U+9999
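A small sketch of how this syntax behaves, with variable names chosen only for illustration: the escape is resolved by the PHP parser in double-quoted strings, so text that merely contains the same characters at runtime is left untouched.

```php
<?php
// Double-quoted: the escape is resolved when the script is compiled.
$compiled = "\u{4eac}";

// Single-quoted: no escape processing, the text stays as 8 literal characters.
$literal = '\u{4eac}';

// A string assembled at runtime is not re-parsed, so it also stays literal.
$runtime = '\u' . '{4eac}';

$compiledBytes = bin2hex($compiled); // UTF-8 bytes of U+4EAC
```

This is why input that arrives with escapes already in it (from a file, a form, an API) needs one of the decoding approaches described here.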

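That syntax only applies to string literals in your source code. If you have a string that contains a literal \u9999 at runtime, you can use a regular expression to match each escape and convert it to the corresponding character: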

// The string with Unicode
$text = 'Elon \u2018Technoking\u2019 Musk';

// The regular expression to match Unicode code point escapes
$regex = '/\\\\u([0-9a-fA-F]{4})/';

// Replace the Unicode code point escapes with their corresponding UTF-8 characters
$decoded = preg_replace_callback($regex, function ($match) {
    // Convert the HTML entity (`&#xXXXX;`) to its UTF-8 character equivalent.
    // Note: the 'HTML-ENTITIES' encoding of mbstring is deprecated as of PHP 8.2.
    return mb_convert_encoding('&#x' . $match[1] . ';', 'UTF-8', 'HTML-ENTITIES');
}, $text);

echo $decoded; // Outputs "Elon ‘Technoking’ Musk"

For more information, please refer to the PHP documentation.

Again, you can create a helper function as shown above. If you are wondering whether to use a snippet such as this, you might find my considerations on vanilla PHP vs. using packages of interest.


Converting these types of unicode to UTF8 in PHP

None of the other answers work perfectly as is. I’ve combined them together and my addition results in this one:

$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

This one definitely does work 🙂

I must mention that using the mb_convert_encoding() method will also decode any HTML entities already present in the original string (for example, &quot; becomes "), because it involves parsing HTML. Beware.
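The side effect can be illustrated with a sketch. Note that it swaps in html_entity_decode for mb_convert_encoding (the 'HTML-ENTITIES' encoding is deprecated as of PHP 8.2); both decode every entity in the string they are given. The sample string and variable names are made up.

```php
<?php
$original = 'Say &quot;hi&quot; \u2018there\u2019';

// Whole-string approach: existing entities such as &quot; get decoded as well.
$asEntities  = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $original);
$wholeString = html_entity_decode($asEntities, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Per-match approach: only the entity built from each \uXXXX escape is decoded.
$perMatch = preg_replace_callback(
    '/\\\\u([0-9a-fA-F]{4})/',
    function (array $m): string {
        return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
    },
    $original
);
```

In the first result the &quot; entities have turned into real quote characters; in the second they are still there, because only the matched escapes were passed through the decoder.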

I encountered the same problem recently, so was glad to see this question. Doing some tests, I found the following code works:

$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $original_string);
//$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');

The only thing I changed is that I commented out the second line of code, so the result contains HTML numeric entities (&#xXXXX;) that the browser renders, rather than UTF-8 characters. The web page, however, must be set to display UTF-8.

It doesn’t always work, because a \uXXXX code can contain digits AND letters. Try replacing \d (just digits) with \w (which matches letters, digits, and the underscore).
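The difference between the patterns can be checked with a short sketch. The sample text and variable names are illustrative; the third pattern, restricted to hexadecimal digits, is an addition that is stricter than \w.

```php
<?php
$sample = 'Kyoto: \u4eac\u90fd';

$digitsOnly = '/\\\\u\d{4}/';          // misses escapes that contain a-f
$wordChars  = '/\\\\u\w{4}/';          // also accepts non-hex text such as \uzzzz
$hexOnly    = '/\\\\u[0-9a-fA-F]{4}/'; // exactly four hexadecimal digits

$digitsCount = preg_match_all($digitsOnly, $sample);
$wordCount   = preg_match_all($wordChars, $sample);
$hexCount    = preg_match_all($hexOnly, $sample);

// \w is looser than needed: it matches something that is not a valid escape.
$wordMatchesJunk = preg_match($wordChars, '\uzzzz');
$hexMatchesJunk  = preg_match($hexOnly, '\uzzzz');
```

With this sample, \d finds neither escape, while \w and the hex class both find two; only the hex class rejects \uzzzz.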

function unicode_conv($originalString)
{
    // The four \\\\ in the pattern here are necessary to match \u in the original string
    $replacedString = preg_replace("/\\\\u(\w{4})/", "&#x$1;", $originalString);
    $unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
    return $unicodeString;
}

See this comment for a way to get a unicode character from its numerical code. Then, you could write a regex replace that will replace each \uXXXX pattern with the equivalent character.
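A possible shape for that regex replace is sketched below. It is not the code from the linked comment: it assumes PHP 7.2+ with mbstring so that mb_chr is available, the function name is hypothetical, and the handling of surrogate pairs is an addition so that characters above U+FFFF survive.

```php
<?php
// Hypothetical helper: replace each \uXXXX escape with its UTF-8 character.
function decode_u_escapes(string $text): string
{
    // Alternative 1: a high surrogate escape followed by a low surrogate escape.
    // Alternative 2: any single \uXXXX escape.
    $regex = '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})|\\\\u([0-9a-f]{4})/i';

    $result = preg_replace_callback($regex, function (array $m): string {
        if (isset($m[3])) {
            $codePoint = (int) hexdec($m[3]);
        } else {
            // Combine the surrogate pair into one code point.
            $high = (int) hexdec($m[1]);
            $low  = (int) hexdec($m[2]);
            $codePoint = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
        }
        $char = mb_chr($codePoint, 'UTF-8');
        // Anything mb_chr rejects is left unchanged.
        return $char === false ? $m[0] : $char;
    }, $text);

    return $result ?? $text;
}

$kyoto  = decode_u_escapes('\u4eac\u90fd');
$gothic = decode_u_escapes('\ud800\udf48');
```

Because no HTML parsing is involved, entities that are already in the text are not touched.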

Alternatively, you could replace each \uXXXX pattern with its matching &#xXXXX; HTML entity form (hexadecimal, hence the x), and then use the following:

mb_convert_encoding($string_with_html_entities, 'UTF-8', 'HTML-ENTITIES');

// The four \\\\ in the pattern here are necessary to match \u in the original string
$replacedString = preg_replace("/\\\\u([0-9a-fA-F]{4})/", "&#x$1;", $originalString);
$unicodeString = mb_convert_encoding($replacedString, 'UTF-8', 'HTML-ENTITIES');
