PHP UTF-8 string to array

Converting strings to arrays in PHP

Examples of splitting text into an array using different delimiters.

Split text by line breaks

$text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    . "Proin blandit magna eu tempus ullamcorper.\n"
    . "Sed porta justo sed nibh elementum condimentum.\n"
    . "Quisque non eros sit amet elit commodo maximus eget a eros.";
$array = explode("\n", $text);
print_r($array);

Result:

Array
(
    [0] => Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    [1] => Proin blandit magna eu tempus ullamcorper.
    [2] => Sed porta justo sed nibh elementum condimentum.
    [3] => Quisque non eros sit amet elit commodo maximus eget a eros.
)
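Text that comes from forms or from files saved on Windows often uses "\r\n" rather than "\n", and splitting on "\n" alone leaves a stray "\r" at the end of each element. A minimal sketch, assuming mixed line endings in the input; the helper name splitLines is hypothetical and not part of the original examples. It relies on the PCRE escape \R, which matches any line-break sequence:

```php
<?php
// Hypothetical helper: split on any line-break sequence (\r\n, \n or \r)
// instead of "\n" only.
function splitLines(string $text): array
{
    // \R treats \r\n as one unit, so no empty element appears between \r and \n.
    $parts = preg_split('/\R/u', $text);

    // preg_split() returns false on failure (for example, invalid UTF-8 with /u).
    return $parts === false ? [] : $parts;
}

$text = "Lorem ipsum.\r\nProin blandit.\nSed porta.";
$lines = splitLines($text);
print_r($lines);
```

Returning an empty array on failure is a design choice for this sketch; stricter code might throw an exception instead.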

Split text by sentences

$text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin blandit magna eu tempus ullamcorper! Sed porta justo sed nibh elementum condimentum. Quisque non eros sit amet elit commodo maximus eget a eros?";
// Turn line breaks into spaces so that sentences on separate lines are still split.
$text = str_replace("\n", ' ', $text);
// Split on whitespace that follows ".", "!" or "?" and precedes a letter.
// The u modifier is required for the Cyrillic range to work on UTF-8 text.
$array = preg_split('/(?<=[.!?])\s+(?=[a-zа-яё])/iu', $text);
print_r($array);

Result:

Array
(
    [0] => Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    [1] => Proin blandit magna eu tempus ullamcorper!
    [2] => Sed porta justo sed nibh elementum condimentum.
    [3] => Quisque non eros sit amet elit commodo maximus eget a eros?
)

Split text by words

$text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin blandit magna eu tempus ullamcorper.";
// Remove everything except letters, digits and whitespace (u modifier for UTF-8).
$text = preg_replace('/[^a-zа-яё0-9\s]/iu', '', $text);
// Split on runs of whitespace and skip empty elements.
$array = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($array);

Result:

Array
(
    [0] => Lorem
    [1] => ipsum
    [2] => dolor
    [3] => sit
    [4] => amet
    [5] => consectetur
    [6] => adipiscing
    [7] => elit
    [8] => Proin
    [9] => blandit
    [10] => magna
    [11] => eu
    [12] => tempus
    [13] => ullamcorper
)

Split text by letters

$text = "Lorem ipsum dolor sit amet";
$array = str_split($text);
print_r($array);

Result (elements 5, 11, 17 and 21 are spaces):

Array
(
    [0] => L
    [1] => o
    [2] => r
    [3] => e
    [4] => m
    [5] =>  
    [6] => i
    [7] => p
    [8] => s
    [9] => u
    [10] => m
    [11] =>  
    [12] => d
    [13] => o
    [14] => l
    [15] => o
    [16] => r
    [17] =>  
    [18] => s
    [19] => i
    [20] => t
    [21] =>  
    [22] => a
    [23] => m
    [24] => e
    [25] => t
)
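str_split() works on bytes, so it is only safe for single-byte text such as the ASCII example above. A short sketch of the difference on a UTF-8 string, assuming PHP 7.4 or later with the mbstring extension (where mb_str_split() is available) and a source file saved as UTF-8; the sample word is illustrative:

```php
<?php
$text = "Привет"; // 6 Cyrillic letters, 2 bytes each in UTF-8

$bytes = str_split($text);                // one element per byte
$chars = mb_str_split($text, 1, 'UTF-8'); // one element per character

echo count($bytes), PHP_EOL; // 12
echo count($chars), PHP_EOL; // 6
print_r($chars);
```

Each element of $bytes is half of a letter and is not valid UTF-8 on its own, which is why the multibyte function is the safer choice for non-ASCII text.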

Split text by several delimiters

$text = "Lorem ipsum dolor sit amet-proin blandit magna eu:Sed porta justo.";
// Inside a character class "|" is a literal, so this splits on "-", "|" or ":".
$array = preg_split('/[-|:]/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($array);

Result:

Array
(
    [0] => Lorem ipsum dolor sit amet
    [1] => proin blandit magna eu
    [2] => Sed porta justo.
)

If the delimiter consists of several characters, for example the HTML tags <br> and <hr>:

$text = "Lorem ipsum dolor sit amet,<br>proin blandit magna eu.<hr>Sed porta justo.";
$array = preg_split('/<br>|<hr>/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($array);

Result:

Array
(
    [0] => Lorem ipsum dolor sit amet,
    [1] => proin blandit magna eu.
    [2] => Sed porta justo.
)

Split text into equal parts

$text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin blandit magna eu tempus ullamcorper.";
$length = 10; // size of each part, in bytes
// str_split() with a length argument returns the parts directly.
$result = str_split($text, $length);
print_r($result);

Result:

Array
(
    [0] => Lorem ipsu
    [1] => m dolor si
    [2] => t amet, co
    [3] => nsectetur 
    [4] => adipiscing
    [5] =>  elit. Pro
    [6] => in blandit
    [7] =>  magna eu 
    [8] => tempus ull
    [9] => amcorper.
)
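The example above fixes the size of each part. When the number of parts is what matters, the part size can be derived from the string length. A minimal sketch under stated assumptions: PHP 7.4 or later with mbstring, UTF-8 input, and a hypothetical helper name splitIntoParts that does not come from the original article:

```php
<?php
// Hypothetical helper: split $text into at most $parts pieces of (nearly)
// equal length, counting characters rather than bytes.
function splitIntoParts(string $text, int $parts, string $encoding = 'UTF-8'): array
{
    $length = mb_strlen($text, $encoding);
    if ($parts < 1 || $length === 0) {
        return [];
    }

    // Round up so that no more than $parts pieces are produced.
    $size = (int) ceil($length / $parts);

    return mb_str_split($text, $size, $encoding);
}

$text = "Съешь ещё этих мягких булок"; // 27 characters
$result = splitIntoParts($text, 3);
print_r($result);
```

Because the size is rounded up, the last piece may be shorter than the others, and short strings may yield fewer pieces than requested.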


mb_str_split

This function returns an array of strings. It is a version of str_split() that supports variable-width character encodings as well as fixed-width encodings of 1, 2, or 4 bytes per character. If the length parameter is specified, the string is broken into chunks of that many characters (not bytes). The encoding parameter is optional, but specifying it is good practice.

Parameters

string: The string to split into characters or chunks.

length: If specified, each element of the returned array will be composed of multiple characters instead of a single character.

encoding: The character encoding, given as a string naming one of the supported encodings. If it is omitted or null, the internal character encoding value will be used.

Return Values

mb_str_split() returns an array of strings.
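A short illustration of the length and encoding parameters described above, assuming PHP 7.4 or later with mbstring and a source file saved as UTF-8; the sample text is illustrative:

```php
<?php
$text = "Привет, мир"; // 11 characters

// One character per element, with the encoding stated explicitly.
$chars = mb_str_split($text, 1, 'UTF-8');

// Chunks of 4 characters (not bytes); the last chunk may be shorter.
$chunks = mb_str_split($text, 4, 'UTF-8');

echo count($chars), PHP_EOL; // 11
print_r($chunks);
```

Passing the encoding explicitly keeps the result independent of the mb_internal_encoding() setting of the environment the script runs in.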

Changelog

Version Description
8.0.0 encoding is nullable now.
8.0.0 This function no longer returns false on failure.

User Contributed Notes 3 notes

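Note: this polyfill returns null when it cannot use an argument, for example an unknown encoding.

if (!function_exists('mb_str_split')) {
    // Assumed signature, mirroring the built-in mb_str_split().
    function mb_str_split($string, $split_length = 1, $encoding = null)
    {
        // Fall back to the internal encoding when none is given.
        if ($encoding === null) {
            $encoding = mb_internal_encoding();
        }
        // Accept canonical encoding names first, then their aliases.
        if (!in_array($encoding, mb_list_encodings(), true)) {
            static $aliases;
            if ($aliases === null) {
                $aliases = [];
                // Use a separate loop variable so that $encoding is not overwritten.
                foreach (mb_list_encodings() as $name) {
                    $encoding_aliases = mb_encoding_aliases($name);
                    if ($encoding_aliases) {
                        foreach ($encoding_aliases as $alias) {
                            $aliases[] = $alias;
                        }
                    }
                }
            }
            if (!in_array($encoding, $aliases, true)) {
                trigger_error('mb_str_split(): Unknown encoding "' . $encoding . '"', E_USER_WARNING);
                return null;
            }
        }

        $result = [];
        $length = mb_strlen($string, $encoding);
        for ($i = 0; $i < $length; $i += $split_length) {
            $result[] = mb_substr($string, $i, $split_length, $encoding);
        }
        return $result;
    }
}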

if (!function_exists('mb_str_split')) {
    function mb_str_split($string = '', $length = 1, $encoding = null) {
        $split = array(); // always return an array, even for an empty string
        if ($encoding === null) {
            $encoding = mb_internal_encoding();
        }
        $mb_strlen = mb_strlen($string, $encoding);
        for ($pi = 0; $pi < $mb_strlen; $pi += $length) {
            // No empty() check here: empty('0') is true and would drop the character "0".
            $split[] = mb_substr($string, $pi, $length, $encoding);
        }
        return $split;
    }
}

Lazy polyfill for UTF-8 only:

function utf8_str_split(string $input, int $splitLength = 1)
{
    // Match consecutive runs of 1 to $splitLength characters, starting at the beginning.
    $re = \sprintf('/\\G.{1,%d}/us', $splitLength);
    \preg_match_all($re, $input, $m);
    return $m[0];
}


mb_split

Splits the multibyte string string using the regular expression pattern and returns the result as an array.

Parameters

pattern: The regular expression pattern.

string: The string being split.

limit: If the optional limit argument is specified, the string will be split into at most limit parts.

Return Values

The result of the split as an array, or false on failure.
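The parameters and return value described above can be sketched as follows. This is an illustrative example, assuming the mbstring extension with regex support and a source file saved as UTF-8; the sample text and variable names are not from the manual:

```php
<?php
// Tell the multibyte regex functions that patterns and subjects are UTF-8.
mb_regex_encoding('UTF-8');

$text = "яблоко, груша; слива, вишня";

// The pattern has no /.../ delimiters.
// Split on a comma or semicolon followed by optional whitespace.
$all = mb_split('[,;]\s*', $text);

// With limit = 2 the remainder of the string stays in the last element.
$firstAndRest = mb_split('[,;]\s*', $text, 2);

// mb_split() returns false on failure, so check before using the result.
if ($all === false || $firstAndRest === false) {
    throw new RuntimeException('mb_split() failed');
}

print_r($all);
print_r($firstAndRest);
```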

Notes

Note:

The character encoding set by mb_regex_encoding() is used by default for this function.

See Also

• mb_regex_encoding() - Set/get the character encoding for multibyte regular expressions
• mb_ereg() - Regular expression match with multibyte support

User Contributed Notes 8 notes

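A (simpler) way to extract all characters from a UTF-8 string into an array with a single call to a built-in function:

$str = 'Ма-
руся';
// -1 means "no limit"; passing null for the limit is deprecated since PHP 8.1.
print_r(preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY));

Output:

Array
(
    [0] => М
    [1] => а
    [2] => -
    [3] => 

    [4] => р
    [5] => у
    [6] => с
    [7] => я
)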

The $pattern argument doesn't use /pattern/ delimiters, unlike other regex functions such as preg_match.

# Works. No slashes around the pattern:
print_r(mb_split("\s", "hello world"));
# Array
# (
#     [0] => hello
#     [1] => world
# )

# Doesn't work:
print_r(mb_split("/\s/", "hello world"));
# Array
# (
#     [0] => hello world
# )

I figure most people will want a simple way to break up a multibyte string into its individual characters. Here's a function I'm using to do that. Change UTF-8 to your chosen encoding.

function mbStringToArray($string) {
    $array = array(); // so that an empty string yields an empty array
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen) {
        // Take the first character, then continue with the rest of the string.
        $array[] = mb_substr($string, 0, 1, "UTF-8");
        $string = mb_substr($string, 1, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}

To split a string such as "日、に、本、ほん、語、ご" on the "、" delimiter, set the regex and internal encodings to UTF-8 before calling mb_split():

mb_regex_encoding('UTF-8');
mb_internal_encoding("UTF-8");
$v = mb_split('、', "日、に、本、ほん、語、ご");

In addition to Sezer Yalcin's tip.

This function splits a multibyte string into an array of characters. Comparable to str_split().

// PHP 7.4 and later already provide mb_str_split(), so define it only if missing.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string) {
        # Split at all positions not after the start: ^
        # and not before the end: $
        return preg_split('/(?<!^)(?!$)/u', $string);
    }
}

$string = '火车票';
$charlist = mb_str_split($string);

I agree that some people might want an mb_explode('', $string);

This is my solution for it:

$string = 'Hallöle'; // sample input, inferred from the expected result below
$array = array_map(function ($i) use ($string) {
    return mb_substr($string, $i, 1);
}, range(0, mb_strlen($string) - 1));

// $array is now ['H', 'a', 'l', 'l', 'ö', 'l', 'e']

Another way to str_split a multibyte string is to convert it to UTF-32, where every character occupies exactly 4 bytes (UTF-16 is not suitable for a fixed-size split, since its characters take 2 or 4 bytes):

$s = 'әӘөүҗңһ';

// $temp_s = iconv('UTF-8', 'UTF-32BE', $s);
$temp_s = mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
$temp_a = str_split($temp_s, 4); // 4 bytes = one character in UTF-32
$temp_a_len = count($temp_a);
for ($i = 0; $i < $temp_a_len; $i++) {
    // $temp_a[$i] = iconv('UTF-32BE', 'UTF-8', $temp_a[$i]);
    $temp_a[$i] = mb_convert_encoding($temp_a[$i], 'UTF-8', 'UTF-32BE');
}

print_r($temp_a);

// It is also possible to join the pieces directly in UTF-32:
define('SLS', mb_convert_encoding('/', 'UTF-32BE', 'UTF-8'));
$temp_s = mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
$temp_a = str_split($temp_s, 4);
$temp_s = implode(SLS, $temp_a);
$temp_s = mb_convert_encoding($temp_s, 'UTF-8', 'UTF-32BE');
echo $temp_s; // ә/Ә/ө/ү/җ/ң/һ

We are talking about multibyte (e.g. UTF-8) strings here, so preg_split without the u modifier will fail for the following string:

'Weiße Rosen sind nicht grün!'

And because I didn't find a regex to simulate a str_split, I optimized the first solution from adjwilli a bit:

$string = 'Weiße Rosen sind nicht grün!';
$stop = mb_strlen($string);
$result = array();

for ($idx = 0; $idx < $stop; $idx++) {
    $result[] = mb_substr($string, $idx, 1);
}

Here is an example with adjwilli's function:

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

function mbStringToArray($string)
{
    $stop = mb_strlen($string);
    $result = array();

    for ($idx = 0; $idx < $stop; $idx++) {
        $result[] = mb_substr($string, $idx, 1);
    }

    return $result;
}

// The second argument of print_r() makes it return the output as a string.
echo PHP_EOL, print_r(mbStringToArray('Weiße Rosen sind nicht grün!'), true), PHP_EOL;

Let me know [by personal email] if someone has found a regex to simulate a str_split with mb_split.
