Php get utf8 code

Contents

mb_convert_encoding
Parameters
Return Values
Errors/Exceptions
Changelog
Examples
See Also
User Contributed Notes 35 notes

mb_convert_encoding

Converts string from from_encoding, or the current internal encoding, to to_encoding. If string is an array, all its string values will be converted recursively.

Parameters

string
The string or array to be converted.

to_encoding
The desired encoding of the result.

from_encoding
The current encoding used to interpret string. Multiple encodings may be specified as an array or comma separated list, in which case the correct encoding will be guessed using the same algorithm as mb_detect_encoding().

If from_encoding is omitted or null, the mbstring.internal_encoding setting will be used if set, otherwise the default_charset setting.

See supported encodings for valid values of to_encoding and from_encoding.

Return Values

The encoded string or array on success, or false on failure.

Errors/Exceptions

As of PHP 8.0.0, a ValueError is thrown if the value of to_encoding or from_encoding is an invalid encoding. Prior to PHP 8.0.0, an E_WARNING was emitted instead.
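A minimal sketch of guarding against the error behaviour described above. The helper name convert_or_null() is hypothetical (it is not part of mbstring); it simply maps both the PHP 8 exception and the pre-8 false return to null.

```php
<?php
// Hypothetical helper (not part of mbstring): returns null when the
// conversion fails or an encoding name is not recognised.
function convert_or_null(string $text, string $to, string $from): ?string
{
    try {
        $result = mb_convert_encoding($text, $to, $from);
    } catch (ValueError $e) {
        // PHP 8.0+: invalid encoding names raise ValueError
        return null;
    }

    // Failure is reported as false (the pre-8.0 path for invalid names)
    return $result === false ? null : $result;
}
```

Callers can then test the result against null instead of handling two different failure styles.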

Changelog

Version	Description
8.0.0	mb_convert_encoding() will now throw a ValueError when to_encoding is passed an invalid encoding.
8.0.0	mb_convert_encoding() will now throw a ValueError when from_encoding is passed an invalid encoding.
8.0.0	from_encoding is nullable now.
7.2.0	This function now also accepts an array as string. Formerly, only strings were supported.
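As a rough illustration of the array support listed in the changelog (PHP 7.2.0 or later), the sketch below converts a nested array in one call. The sample data is made up, and the keys are kept ASCII so only the values are affected.

```php
<?php
// Made-up sample data: ISO-8859-1 bytes inside a nested array
$row = [
    'city' => "M\xFCnchen",
    'tags' => ["caf\xE9", "na\xEFve"],
];

// String values are converted recursively; the array shape is kept
$utf8 = mb_convert_encoding($row, 'UTF-8', 'ISO-8859-1');
```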

Examples

Example #1 mb_convert_encoding() example

<?php
/* Convert internal character encoding to SJIS */
$str = mb_convert_encoding($str, "SJIS");

/* Convert EUC-JP to UTF-7 */
$str = mb_convert_encoding($str, "UTF-7", "EUC-JP");

/* Auto detect encoding from JIS, eucjp-win, sjis-win, then convert str to UCS-2LE */
$str = mb_convert_encoding($str, "UCS-2LE", "JIS, eucjp-win, sjis-win");

/* If mbstring.language is "Japanese", "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "EUC-JP", "auto");
?>

User Contributed Notes 35 notes

For my last project I needed to convert several CSV files from Windows-1250 to UTF-8. After several days of searching around I found a function that partially solved my problem, but it still did not convert all the characters. So I made this:

I’ve been trying to find the charset of a Norwegian (with a lot of ø, æ, å) txt file written on a Mac, and I found it this way:

<?php
$text = "A strange string to pass, maybe with some ø, æ, å characters.";

// Convert the text from every encoding mbstring knows and print each result
foreach (mb_list_encodings() as $chr) {
    echo mb_convert_encoding($text, 'UTF-8', $chr) . " : " . $chr . "\n";
}
?>

The line that looks good gives you the encoding it was written in.

Hey guys. For everybody who’s looking for a function that converts an ISO string to UTF-8 or a UTF-8 string to ISO, here’s your solution:

public function encodeToUtf8($string)
{
    return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}

public function encodeToIso($string)
{
    return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}


For me these functions are working fine. Give them a try.

aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:

mbstring.substitute_character = "none"

in your php.ini. Be sure to include the quotes around none. Or at run-time with

<?php
ini_set('mbstring.substitute_character', "none");
?>
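The same behaviour can also be switched on for a single conversion with mb_substitute_character(). The following is only a sketch: it drops the euro sign, which ISO-8859-1 cannot represent, and then restores whatever setting was active before.

```php
<?php
// Remember the current substitute character so it can be restored
$previous = mb_substitute_character();

// "none" makes mbstring drop characters the target encoding lacks
mb_substitute_character('none');
$latin1 = mb_convert_encoding("a\u{20AC}b", 'ISO-8859-1', 'UTF-8');

// Put the earlier setting back for the rest of the script
mb_substitute_character($previous);
```

Restoring the previous value keeps the change from leaking into unrelated conversions later in the request.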

My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)

Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding.

<?php
function convert_to($source, $target_encoding)
{
    // detect the character encoding of the incoming file
    $encoding = mb_detect_encoding($source, "auto");

    // escape all of the question marks so we can remove artifacts from
    // the unicode conversion process
    $target = str_replace("?", "[question_mark]", $source);

    // convert the string to the target encoding
    $target = mb_convert_encoding($target, $target_encoding, $encoding);

    // remove any question marks that have been introduced because of illegal characters
    $target = str_replace("?", "", $target);

    // replace the token string "[question_mark]" with the symbol "?"
    $target = str_replace("[question_mark]", "?", $target);

    return $target;
}
?>

Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)
-A

Many people below talk about using
<?php
mb_convert_encoding($s, 'HTML-ENTITIES', 'UTF-8');
?>
to convert non-ASCII characters into HTML-readable entities. Because my web server was out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would automatically convert it to nasty latin1 instead of leaving it in its beautiful UTF-8 glory.

So [insert korean characters here] turned into .

I found myself needing to pass by reference (call-time pass-by-reference is of course deprecated/nonexistent in recent versions of PHP), so instead of
<?php
mb_convert_encoding(&$s, 'HTML-ENTITIES', 'UTF-8');
?>
which worked perfectly until I upgraded, I had to use
<?php
call_user_func_array('mb_convert_encoding', array(&$s, 'HTML-ENTITIES', 'UTF-8'));
?>

Hope it helps someone else out.

To add to the Flash conversion comment below, here’s how I convert back from what I’ve stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:

function htmltoflash($htmlstr)
{
    return str_replace("&lt;br /&gt;", "\n",
        str_replace("<", "&lt;",
            str_replace(">", "&gt;",
                mb_convert_encoding(html_entity_decode($htmlstr),
                    "UTF-8", "ISO-8859-1"))));
}

When you need to convert from HTML-ENTITIES but your UTF-8 string is partially broken (not all characters are valid UTF-8), passing the string to mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES'); corrupts the characters even more. In that case you need to replace the HTML entities one at a time to preserve the characters that are already correctly encoded. I wrote this closure for the job:
<?php
$decode_entities = function ($string) {
    // collect every distinct entity that occurs in the string
    preg_match_all("/&#?\w+;/", $string, $entities, PREG_SET_ORDER);
    $entities = array_unique(array_column($entities, 0));
    // decode each entity on its own and substitute it back
    foreach ($entities as $entity) {
        $decoded = mb_convert_encoding($entity, 'UTF-8', 'HTML-ENTITIES');
        $string = str_replace($entity, $decoded, $string);
    }
    return $string;
};
?>

If you are trying to generate a CSV (with extended chars) to be opened in Excel for Mac, the only thing that worked for me was:

I also tried this:

<?php
// Field separation OK, chars wrong
iconv('MACINTOSH', 'UTF8', $CSV);
// Field separation wrong, chars OK
chr(255) . chr(254) . mb_convert_encoding($CSV, 'UCS-2LE', 'UTF-8');
?>

But the first one didn’t show extended chars correctly, and the second one didn’t separate fields correctly.

If you have what looks like ISO-8859-1, but it includes "smart quotes" courtesy of Microsoft software, or people cutting and pasting content from Microsoft software, then what you’re actually dealing with is probably Windows-1252. Try this:

<?php
$cleanText = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
?>

The annoying part is that the auto detection (i.e. the mb_detect_encoding function) will often think Windows-1252 is ISO-8859-1. Close, but no cigar. This is critical if you’re then trying to do unserialize on the resulting text, because the byte count of the string needs to be perfect.
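To make the difference concrete, here is a small sketch with made-up input: the bytes 0x93 and 0x94 are curly quotes in Windows-1252 but C1 control characters in ISO-8859-1, so the two conversions produce different characters and different byte counts.

```php
<?php
// Made-up sample: "quoted" wrapped in Windows-1252 smart-quote bytes
$bytes = "\x93quoted\x94";

// Read as Windows-1252: U+201C and U+201D, three UTF-8 bytes each
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

// Read as ISO-8859-1: U+0093 and U+0094, two UTF-8 bytes each
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
```

The differing lengths are exactly what breaks unserialize() when the wrong source encoding is assumed.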

Text-encoding HTML-ENTITIES will be deprecated as of PHP 8.2.

To convert all non-ASCII characters into entities (to produce pure 7-bit HTML output), I was using:

<?php
echo mb_convert_encoding(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'), 'HTML-ENTITIES', 'UTF-8');
?>

I can get the identical result with:

<?php
echo mb_encode_numericentity(htmlentities($text, ENT_QUOTES, 'UTF-8'), [0x80, 0x10FFFF, 0, ~0], 'UTF-8');
?>

The output contains well-known named entities for some often used characters and numeric entities for the rest.
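A short sketch of that replacement wrapped in a function. The name to_ascii_html() is hypothetical, and the mask 0x1FFFFF is used here in place of ~0 as an assumption that it covers the same code point range.

```php
<?php
// Hypothetical wrapper: pure 7-bit HTML without the HTML-ENTITIES encoding
function to_ascii_html(string $text): string
{
    // Named entities where HTML 4.01 defines one (&eacute;, &amp;, ...)
    $escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

    // Numeric entities for every remaining character from U+0080 upward
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}
```

For example, an accented letter comes out as a named entity while a symbol without a named entity, such as U+2603, becomes a decimal numeric entity.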

/**
 * Convert Windows-1250 to UTF-8
 * Based on https://www.php.net/manual/en/function.mb-convert-encoding.php#112547
 */
class TextConverter
{
    private const ENCODING_TO = 'UTF-8';
    private const ENCODING_FROM = 'ISO-8859-2';

    // Windows-1250 byte => ISO-8859-2 byte, for characters the two code pages place differently
    private array $mapChrChr = [
        0x8A => 0xA9,
        0x8C => 0xA6,
        0x8D => 0xAB,
        0x8E => 0xAE,
        0x8F => 0xAC,
        0x9C => 0xB6,
        0x9D => 0xBB,
        0xA1 => 0xB7,
        0xA5 => 0xA1,
        0xBC => 0xA5,
        0x9F => 0xBC,
        0xB9 => 0xB1,
        0x9A => 0xB9,
        0xBE => 0xB5,
        0x9E => 0xBE
    ];

    // Windows-1250 byte => replacement string (presumably HTML entities, given the
    // html_entity_decode() call below); its entries are not present in this copy
    private array $mapChrString = [];

    /**
     * @param $text
     * @return string
     */
    public function execute($text): string
    {
        $map = $this->prepareMap();

        return html_entity_decode(
            mb_convert_encoding(strtr($text, $map), self::ENCODING_TO, self::ENCODING_FROM),
            ENT_QUOTES,
            self::ENCODING_TO
        );
    }

    /**
     * @return array
     */
    private function prepareMap(): array
    {
        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), chr($v)];
        }, $this->mapChrChr);

        $maps[] = $this->arrayMapAssoc(function ($k, $v) {
            return [chr($k), $v];
        }, $this->mapChrString);

        // combine both maps into a single array usable by strtr()
        return array_merge(...$maps);
    }

    /**
     * @param callable $function
     * @param array $array
     * @return array
     */
    private function arrayMapAssoc(callable $function, array $array): array
    {
        return array_column(
            array_map(
                $function,
                array_keys($array),
                $array
            ),
            1,
            0
        );
    }
}

If you are attempting to convert "UTF-8" text to "ISO-8859-1" and the result is always returned as "ASCII", place the following line of code before the mb_convert_encoding:

It is necessary to force a specific search order for the conversion to work.

It appears that when dealing with an unknown "from encoding" the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".

Instead of ini_set(), you can try this:

Clean a string for use as a filename by simply replacing all unwanted characters with an underscore (ASCII converts to 7bit). It removes slightly more chars than necessary. Hope it’s useful.


For those wanting to convert from $set to MacRoman, use iconv():

$string = iconv('UTF-8', 'macintosh', $string);

('macintosh' is the IANA name for the MacRoman character set.)

Why did you use the PHP HTML encode functions? mbstring has its own encoding, which is (as far as I tested it) much more useful:

$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

// convert UTF8 to DOS = CP850
//
// $utf8_text=UTF8-Formatted text;
// $dos=CP850-Formatted text;

$dos = mb_convert_encoding($utf8_text, "CP850", mb_detect_encoding($utf8_text, "UTF-8, CP850, ISO-8859-15", true));

Another example of recoding without the mbstring extension enabled.
(Russian koi->win, i.e. KOI8-R to Windows-1251; if the input is already in the win encoding, recode() returns the string unchanged.)

// 0 - win
// 1 - koi
function detect_encoding($str)
{
    $win = 0;
    $koi = 0;

    // count bytes that fall in the lowercase-letter range of each encoding
    for ($i = 0; $i < strlen($str); $i++) {
        if (ord($str[$i]) > 224 && ord($str[$i]) < 255) $win++;
        if (ord($str[$i]) > 192 && ord($str[$i]) < 223) $koi++;
    }

    if ($win < $koi) {
        return 1;
    } else return 0;
}

// recodes koi to win
function koi_to_win($string)
{
    $kw = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 254, 224, 225, 246, 228, 229, 244, 227, 245, 232, 233, 234, 235, 236, 237, 238, 239, 255, 240, 241, 242, 243, 230, 226, 252, 251, 231, 248, 253, 249, 247, 250, 222, 192, 193, 214, 196, 197, 212, 195, 213, 200, 201, 202, 203, 204, 205, 206, 207, 223, 208, 209, 210, 211, 198, 194, 220, 219, 199, 216, 221, 217, 215, 218);
    // reverse table (win to koi), not used by this function
    $wk = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 225, 226, 247, 231, 228, 229, 246, 250, 233, 234, 235, 236, 237, 238, 239, 240, 242, 243, 244, 245, 230, 232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241, 193, 194, 215, 199, 196, 197, 214, 218, 201, 202, 203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 198, 200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209);

    $end = strlen($string);
    // remap every byte above 128 through the koi-to-win table
    for ($pos = 0; $pos < $end; $pos++) {
        $c = ord($string[$pos]);
        if ($c > 128) {
            $string[$pos] = chr($kw[$c - 128]);
        }
    }

    return $string;
}

function recode($str)
{
    $enc = detect_encoding($str);
    if ($enc == 1) {
        $str = koi_to_win($str);
    }

    return $str;
}
