PHP check encoding string

Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php
$string  = '漢字はユニコード';
$needle  = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>

The usual problem is that the string is evaluated as a binary string, meaning PHP is not aware of encodings at all. Problems arise if you are getting a value "from outside" somewhere (database, POST request) and the encoding of the needle and the haystack is not the same. That typically means the source code is not saved in the same encoding as the data you are receiving "from outside". Therefore the binary representations don't match and nothing happens.
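A minimal sketch of that mismatch, assuming the source file holds the needle in UTF-8 while the incoming value is ISO-8859-1 (both encodings are chosen here only for illustration). Converting the incoming value to the source encoding before the replacement makes the bytes comparable:

```php
<?php
// Assumption: the needle is UTF-8 (as saved in the source file),
// the haystack arrives from outside as ISO-8859-1.
$needle   = "\xC3\xA9";   // "é" in UTF-8 (two bytes)
$haystack = "caf\xE9";    // "café" in ISO-8859-1 ("é" is one byte)

// The byte sequences differ, so nothing is replaced.
$unchanged = str_replace($needle, 'e', $haystack);

// Convert the incoming value to the source encoding first, then replace.
$haystackUtf8 = mb_convert_encoding($haystack, 'UTF-8', 'ISO-8859-1');
$fixed        = str_replace($needle, 'e', $haystackUtf8);

echo bin2hex($unchanged), "\n"; // 636166e9
echo $fixed, "\n";              // cafe
```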

PHP can input and output Unicode, but a little differently from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.

$unicode_str_for_Excel = chr(255) . chr(254) . mb_convert_encoding($utf8_str, 'UTF-16LE', 'UTF-8');
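The same idea as a small, self-contained sketch. The helper name utf8_to_excel_utf16() is hypothetical, and the hex dump is only there to make the byte order mark visible:

```php
<?php
// Hypothetical helper: prefix a UTF-16LE conversion with the FF FE byte order mark.
function utf8_to_excel_utf16(string $utf8): string
{
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

$bytes = utf8_to_excel_utf16("a\tb\n");

// BOM first, then each character as two bytes, low byte first.
echo bin2hex($bytes), "\n"; // fffe6100090062000a00
```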

SOME multibyte encodings can safely be used in str_replace() and the like, others cannot. It’s not enough to ensure that all the strings involved use the same encoding: obviously they have to, but it’s not enough. It has to be the right sort of encoding.

UTF-8 is one of the safe ones, because it was designed to be unambiguous about where each encoded character begins and ends in the string of bytes that makes up the encoded text. Some encodings are not safe: the last bytes of one character in a text followed by the first bytes of the next character may together make a valid character. str_replace() knows nothing about "characters", "character encodings" or "encoded text". It only knows about the string of bytes. To str_replace(), two adjacent characters with two-byte encodings just look like a sequence of four bytes and it's not going to know it shouldn't try to match the middle two bytes.
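A closely related case can be sketched with Shift_JIS, where the second byte of some two-byte characters is also a valid ASCII byte. The example assumes this file is saved as UTF-8 and that mbstring is built with SJIS support; the character and the replacement are chosen only for illustration:

```php
<?php
// "表" is encoded in Shift_JIS as the two bytes 0x95 0x5C; 0x5C is also ASCII "\".
$sjis = mb_convert_encoding('表', 'SJIS', 'UTF-8');
echo bin2hex($sjis), "\n";   // 955c

// A byte-wise replacement of "\" hits the second byte of the character.
$broken = str_replace('\\', '/', $sjis);
echo bin2hex($broken), "\n"; // 952f

// The same replacement on the UTF-8 form leaves the character untouched,
// because no byte of a multibyte UTF-8 sequence is in the ASCII range.
$utf8 = str_replace('\\', '/', '表');
```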

While real-world examples can be found of str_replace() mangling text, it can be illustrated by using the HTML-ENTITIES encoding. It's not one of the safe ones. All of the strings being passed to str_replace() are valid HTML-ENTITIES-encoded text so the "all inputs use the same encoding" rule is satisfied.

<?php
// Note: the HTML-ENTITIES encoding is deprecated in mbstring as of PHP 8.2.
$string = 'x&lt;y';
mb_internal_encoding('HTML-ENTITIES');

echo "Text length: ", mb_strlen($string), "\tString length: ", strlen($string), " . ", $string, "\n";
// Three characters, six bytes; the text reads "x<y".

$newstring = str_replace('l', 'g', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " . ", $newstring, "\n";
// Three characters, six bytes, but now the text reads "x>y"; the wrong characters have changed.

$newstring = str_replace(';', ':', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " . ", $newstring, "\n";
// Now even the length of the text is wrong and the text is trashed.

?>

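Even though neither 'l' nor ';' appear in the text "x<y", str_replace() still found them in the underlying bytes and changed them. In one case it turned the text into "x>y", and in the other it broke the encoding completely.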

One more reason to use UTF-8 if you can, I guess.

A small note for those who will follow rawsrc at gmail dot com’s advice: mb_split uses regular expressions, in which case it may make sense to use built-in function mb_ereg_replace.

Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
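A short sketch of what this means in practice: indexing character by character with mb_substr() in a loop rescans the string on every call, while splitting once with mb_str_split() (available since PHP 7.4) gives an array that can be indexed in constant time. The sample string is arbitrary:

```php
<?php
$text = '漢字はユニコード';

// Quadratic overall: each mb_substr() call walks the string from the start.
$slow   = [];
$length = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $slow[] = mb_substr($text, $i, 1, 'UTF-8');
}

// Linear: decode once into an array of characters, then index the array.
$chars = mb_str_split($text, 1, 'UTF-8');

echo $chars[2], "\n"; // は
```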

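<?php
// Note: PHP 8.3 and later ship a native mb_str_pad(); rename this helper there.
// The byte/character difference of $input is added to the target length so that
// the byte-based str_pad() pads to $pad_length characters.
// This assumes $pad_string consists of single-byte characters.
function mb_str_pad($input, $pad_length, $pad_string, $pad_style, $encoding = "UTF-8") {
    return str_pad($input,
        strlen($input) - mb_strlen($input, $encoding) + $pad_length, $pad_string, $pad_style);
}
?>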

Yet another single-line mb_trim() function

<?php
// Note: PHP 8.4 and later ship a native mb_trim(); rename this helper there.
function mb_trim($string, $trim_chars = '\s') {
    return preg_replace('/^[' . $trim_chars . ']*(?U)(.*)[' . $trim_chars . ']*$/u', '\\1', $string);
}
$string = ' "some text." ';
echo mb_trim($string, '\s".');
//some text
?>

This would be one way to create a multibyte substr_replace function

<?php
// Replaces the characters from $posOpen to $posClose (inclusive) with $replace.
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen) . $replace . mb_substr($output, $posClose + 1);
}
?>

str_replace is NOT multi-byte safe.

This Ukrainian word gives a bug when used in the next code: відео

$result = str_replace(str_split($rubishcharacters), ' ', $searchstring);
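The likely cause is that str_split() cuts the list of unwanted characters into single bytes, so any byte shared with a Cyrillic letter gets replaced as well. A sketch of the effect, assuming UTF-8 throughout and using "№" as a hypothetical unwanted character, with mb_str_split() (PHP 7.4+) as one possible fix:

```php
<?php
$rubbish = '№';        // hypothetical unwanted character, UTF-8 bytes E2 84 96
$search  = 'відео№1';  // "і" is encoded as D1 96 and shares the byte 96

// Byte-wise: the needles are the single bytes E2, 84 and 96.
$bytewise = str_replace(str_split($rubbish), ' ', $search);

// Character-wise: the only needle is the whole character "№".
$charwise = str_replace(mb_str_split($rubbish, 1, 'UTF-8'), ' ', $search);

var_dump(mb_check_encoding($bytewise, 'UTF-8')); // bool(false): "і" was cut in half
echo $charwise, "\n";                            // відео 1
```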

PHP5 has no mb_trim() (a native one only arrived in PHP 8.4), so here's one I made. It works just as trim() does, but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I’ve seen to this problem, I wanted to emulate the full functionality of trim() — in particular, the ability to customise the character list.

<?php
/**
 * Trim characters from either (or both) ends of a string in a way that is
 * multibyte-friendly.
 *
 * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
 * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
 * course, the added bonus that you can put unicode characters in the charlist.
 *
 * We are using a PCRE character-class to do the trimming in a unicode-aware
 * way, so we must escape ^, \, - and ] which have special meanings here.
 * As you would expect, a single \ in the charlist is interpreted as
 * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
 * you can ignore this detail.
 *
 * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
 * because they can be extremely useful when dealing with UCS. '\pZ', for example,
 * matches every 'separator' character defined in Unicode, including non-breaking
 * and zero-width spaces.
 *
 * It doesn't make sense to have two or more of the same character in a character
 * class, therefore we interpret a double \ in the character list to mean a
 * single \ in the regex, allowing you to safely mix normal characters with PCRE
 * special classes.
 *
 * *Be careful* when using this bonus feature, as PHP also interprets backslashes
 * as escape characters before they are even seen by the regex. Therefore, to
 * specify '\\s' in the regex (which will be converted to the special character
 * class '\s' for trimming), you will usually have to put *4* backslashes in the
 * PHP code, as you can see from the default value of $charlist.
 *
 * @param string  $string   the string to trim
 * @param string  $charlist list of characters to remove from the ends of this string.
 * @param boolean $ltrim    trim the left?
 * @param boolean $rtrim    trim the right?
 * @return string
 */
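function mb_trim($string, $charlist = '\\\\s', $ltrim = true, $rtrim = true)
{
    $both_ends = $ltrim && $rtrim;

    // Reconstruction: the lines that built $left_pattern and $right_pattern are
    // missing from the source. Following the docblock, escape ^, -, ] and \ for
    // use inside a character class, then collapse each escaped double backslash
    // (now four backslashes) into one so that classes such as \s survive.
    $char_class_inner = preg_replace(
        array('/[\^\-\]\\\\]/S', '/\\\\{4}/S'),
        array('\\\\$0', '\\\\'),
        $charlist
    );
    $work_horse    = '[' . $char_class_inner . ']+';
    $left_pattern  = '^' . $work_horse;
    $right_pattern = $work_horse . '$';

    if ($both_ends)
    {
        $pattern_middle = $left_pattern . '|' . $right_pattern;
    }
    elseif ($ltrim)
    {
        $pattern_middle = $left_pattern;
    }
    else
    {
        $pattern_middle = $right_pattern;
    }

    return preg_replace("/$pattern_middle/usSD", '', $string);
}
?>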

mb_check_encoding

Checks if the specified byte stream is valid for the specified encoding. If value is of type array, all keys and values are validated recursively. It is useful to prevent a so-called "Invalid Encoding Attack".

Parameters

value
The byte stream or array to check. If it is omitted, this function checks all the input from the beginning of the request.

As of PHP 8.1.0, omitting this parameter or passing null is deprecated.

encoding
The expected encoding.

Return Values

Returns true on success or false on failure.

Changelog

Version  Description
8.1.0    Calling this function with null as value or without argument is deprecated.
8.0.0    value and encoding are nullable now.
7.2.0    This function now also accepts an array as value. Formerly, only strings have been supported.
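
A brief usage sketch, assuming UTF-8 as the expected encoding. The sample byte strings are made up for illustration; the array form requires PHP 7.2 or later:

```php
<?php
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1, not valid UTF-8

$validResult   = mb_check_encoding($valid, 'UTF-8');
$invalidResult = mb_check_encoding($invalid, 'UTF-8');

// Arrays are checked recursively, keys included; one bad value fails the whole array.
$arrayResult = mb_check_encoding(['name' => $valid, 'city' => $invalid], 'UTF-8');

var_dump($validResult);   // bool(true)
var_dump($invalidResult); // bool(false)
var_dump($arrayResult);   // bool(false)
```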

    PHP – How to detect character encoding using mb_detect_encoding()

    In PHP, mb_detect_encoding() is used to detect the character encoding. It can detect the character encoding for a string from an ordered list of candidates. This function is supported in PHP 4.0.6 and later.

    mb_detect_encoding() is useful with multibyte encodings, where not all sequences of bytes form a valid string. If the input string contains such a sequence, that encoding is rejected and the next candidate is checked.

    Syntax

    string|false mb_detect_encoding(string $string, array|string|null $encoding = null, bool $strict = false)

    Automatic detection of character encoding is not entirely reliable without some additional information. We can say that character encoding detection is similar to decoding an encrypted string without the key. A Content-Type HTTP header can be used as an indication of the character encoding of data that is stored or transmitted with it.
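
One way to combine the two sources of information is sketched below. The helper guess_encoding() and its fallback list are hypothetical, and the union return type assumes PHP 8.0 or later. A declared charset is trusted only if mbstring lists it and the body is actually valid in it; otherwise strict detection is used:

```php
<?php
// Hypothetical helper: prefer the charset declared in a Content-Type header,
// fall back to strict detection over a short candidate list.
function guess_encoding(string $body, ?string $contentType): string|false
{
    $known = array_map('strtolower', mb_list_encodings());

    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)"?/i', $contentType, $m)
        && in_array(strtolower($m[1]), $known, true)
        && mb_check_encoding($body, $m[1])
    ) {
        return $m[1];
    }

    return mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1'], true);
}

$declared = guess_encoding("caf\xC3\xA9", 'text/html; charset=UTF-8');
$detected = guess_encoding("caf\xE9", null);
$mismatch = guess_encoding("caf\xE9", 'text/html; charset=UTF-8');

var_dump($declared, $detected, $mismatch);
```

In the third call the header claims UTF-8 but the body is not valid UTF-8, so the sketch ignores the declaration and falls back to detection.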

    Parameters

    The mb_detect_encoding function accepts three parameters −

    • $string − This parameter is used for the string being examined.
    • $encoding − This parameter is a list of character encodings to try, in order. The list may be given as an array of strings or as a single comma-separated string. If the encoding is omitted or null, the current detect_order is used, as set with the mbstring.detect_order configuration option or the mb_detect_order() function.
    • $strict − This parameter controls the behavior when the string is not valid in any of the listed encodings. If strict is set to false, the closest matching encoding is returned. If strict is set to true, false is returned.

    Return Values

    It returns the detected character encoding, or false if the string is not valid in any of the listed encodings.

    Example 1

    mb_detect_encoding() function without strict parameter
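
The listing for this example is not present in the text. A minimal sketch of a call without the strict parameter could look like the following, assuming the file is saved as UTF-8; the sample string and the candidate list are made up for illustration:

```php
<?php
// Assumed input: a UTF-8 string containing multibyte characters.
$string = '漢字はユニコード';

// The candidate list is passed explicitly so php.ini settings do not matter.
// ASCII is rejected because the string contains bytes above 0x7F.
$detected = mb_detect_encoding($string, ['ASCII', 'UTF-8']);

var_dump($detected); // string(5) "UTF-8"
```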

    Output

    Example 2

    mb_detect_encoding() function using strict parameter.
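
The listing for this example is not present in the text either. The sketch below is an assumption built around a string that is valid ISO-8859-1 but not valid UTF-8. Results in non-strict mode are a best guess and may differ between PHP versions, so only the strict results are annotated:

```php
<?php
// Assumed input: "áéóú" encoded as ISO-8859-1, which is not valid UTF-8.
$str = "\xE1\xE9\xF3\xFA";

// Not valid in any listed encoding: non-strict still returns a closest match.
$looseMiss  = mb_detect_encoding($str, ['ASCII', 'UTF-8'], false);
$strictMiss = mb_detect_encoding($str, ['ASCII', 'UTF-8'], true);   // false

// Once a valid encoding is in the list, strict mode finds it.
$looseHit  = mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], false);
$strictHit = mb_detect_encoding($str, ['ASCII', 'UTF-8', 'ISO-8859-1'], true); // "ISO-8859-1"

var_dump($looseMiss, $strictMiss, $looseHit, $strictHit);
```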

    Output

    string(5) "UTF-8"
    bool(false)
    string(10) "ISO-8859-1"
    string(10) "ISO-8859-1"
