PHP convert file encoding

How to unify text file encoding to utf-8 in PHP

I’m Nishimura, the creator of QuizGenerator. I did not take part in the early development of learningBOX; I joined from the release of version 2.0. Handling character codes came up during that work, and I thought I should say a bit more about them, so I have put together this article.
This article explains how to unify text file encoding to utf-8 in PHP.

  • 1. Shift-JIS is unavoidable
  • 2. How to avoid garbled text
  • 3. How do we determine the character code?
  • 4. Summary

Shift-JIS is unavoidable.

Systems such as learningBOX and QuizGenerator sometimes receive text files such as CSVs. In a modern web system the text file should be encoded in utf-8, and ideally no other character code would be accepted. In practice, however, Shift_JIS files are uploaded every so often, and the resulting garbled text is reported as a defect.
So QuizGenerator converts Shift_JIS files to utf-8 and then continues processing.

Shift JIS is one of the character codes, standardized as a JIS standard, for text that includes Japanese. It is a rearranged version of the JIS code: whereas the JIS code represents characters in 7-bit units, Shift JIS uses a single byte for ASCII and half-width katakana and two bytes (16 bits) for kanji, kana, and other full-width characters.

I don’t trust mb_convert_encoding.

PHP has a function called mb_convert_encoding that converts between character codes. At first glance it looks as if this one function can both determine the character code and convert the text to utf-8, but its detection is not trustworthy.

Take mb_convert_encoding("ah", "utf-8", "utf-8, sjis-win"). If "ah" is utf-8 it should come back unchanged, and if it is Shift_JIS it should be converted to utf-8 (at least as far as the official documentation is concerned). In practice the result can be badly wrong.
A string passed in as utf-8 may be forcibly interpreted as Shift_JIS, broken in the process, and then converted to utf-8, which produces an incomprehensible value.

How to avoid garbled text

If you specify the source character code, mb_convert_encoding will work properly.
In other words, convert from Shift_JIS to utf-8 only when the input really is Shift_JIS, and leave it untouched when it is already utf-8. That basically works.

mb_convert_encoding("ah", "utf-8", "sjis-win")
The above code works fine if "ah" is Shift_JIS. If it is utf-8, use the string as it is.
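As a concrete illustration of naming the source encoding explicitly, here is a minimal sketch. The sample bytes are an assumption chosen for the example: 0x82 0xA0 is the Shift_JIS sequence for the hiragana "あ", and the mbstring extension is assumed to be enabled.

```php
<?php
// Sample input (assumption): the Shift_JIS bytes for the hiragana "あ".
$sjisBytes = "\x82\xA0";

// The source encoding is stated explicitly, so nothing is guessed.
$utf8 = mb_convert_encoding($sjisBytes, 'UTF-8', 'SJIS-win');

// Print the result as hex so the terminal encoding does not matter.
echo bin2hex($utf8), PHP_EOL; // e38182
```

Printing the hex form avoids a second round of garbling when the console itself is not set to utf-8.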

How do we determine the character code?

There is a function called mb_detect_encoding. If it worked correctly in the first place, the problem could be solved simply by passing its result to mb_convert_encoding.

Since the standard function does not work reliably, you have to do the check on your own.

Fortunately it is not that hard to determine whether a byte sequence satisfies the utf-8 specification, so just do it.
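The article does not show its own check, so the following is only a sketch of what a hand-written utf-8 validity test might look like. The function name isValidUtf8 is hypothetical. It walks the bytes, checks lead and continuation bytes, and rejects overlong forms, surrogates, and code points above U+10FFFF.

```php
<?php
// Hypothetical helper: returns true when $bytes is well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $lead = ord($bytes[$i]);
        if ($lead <= 0x7F) {            // single-byte ASCII
            $i++;
            continue;
        }
        if ($lead >= 0xC2 && $lead <= 0xDF) {        // 2-byte sequence
            $need = 1; $min = 0x80;    $cp = $lead & 0x1F;
        } elseif ($lead >= 0xE0 && $lead <= 0xEF) {  // 3-byte sequence
            $need = 2; $min = 0x800;   $cp = $lead & 0x0F;
        } elseif ($lead >= 0xF0 && $lead <= 0xF4) {  // 4-byte sequence
            $need = 3; $min = 0x10000; $cp = $lead & 0x07;
        } else {
            return false;               // stray continuation or invalid lead byte
        }
        if ($i + $need >= $len) {
            return false;               // sequence is cut off at the end
        }
        for ($k = 1; $k <= $need; $k++) {
            $c = ord($bytes[$i + $k]);
            if (($c & 0xC0) !== 0x80) {
                return false;           // not a continuation byte
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
        if ($cp < $min || ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF) {
            return false;
        }
        $i += $need + 1;
    }
    return true;
}

var_dump(isValidUtf8("\xE3\x81\x82")); // utf-8 bytes of "あ"
var_dump(isValidUtf8("\x82\xA0"));     // Shift_JIS bytes of "あ"
```

Shift_JIS text containing Japanese almost always fails this test quickly, because its lead bytes fall in ranges that utf-8 only allows as continuation bytes or that require different trailing bytes.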

What if it’s not utf-8?

If it’s not utf-8, treat it as Shift_JIS. I won’t support other encodings until someone actually brings in euc-jp or utf-16 files. People who use those should at least know about encodings, so they are asked to convert the files themselves.
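Put together, the rule "keep utf-8 as is, otherwise assume Shift_JIS" might be sketched as follows. The function name toUtf8 is hypothetical, and the validity check here uses PCRE's /u modifier as a compact stand-in for a hand-written test; that choice is an assumption of this sketch, not something the article specifies.

```php
<?php
// Hypothetical helper: normalize uploaded text to UTF-8.
function toUtf8(string $bytes): string
{
    // With the /u modifier, preg_match() returns false on malformed UTF-8
    // and 1 when the empty pattern matches a well-formed subject.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes; // already UTF-8: leave untouched
    }

    // Everything else is assumed to be Shift_JIS (Windows-31J).
    $converted = mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
    if ($converted === false) {
        throw new RuntimeException('Conversion from SJIS-win failed');
    }
    return $converted;
}

// Usage with file contents (the path is a placeholder):
// $text = toUtf8(file_get_contents('/path/to/upload.csv'));
echo bin2hex(toUtf8("\x82\xA0")), PHP_EOL; // e38182
```

Because pure ASCII is valid utf-8, ASCII-only files pass through unchanged, which is also what a Shift_JIS conversion would have produced.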

Another trap.

I’ve used the term Shift_JIS many times in this article, but what is called Shift_JIS today is often actually Windows-31J (MS932), an extension of Shift_JIS.
If you specify Shift_JIS in PHP, every character outside the original Shift_JIS specification will be garbled. Unless you have a special reason, use Windows-31J or sjis-win instead of Shift_JIS. Oddly, the official documentation says to use Windows-31J while only sjis-win appears in its list, but at least with PHP 7.3.13 both names worked fine.
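To see the difference on your own installation, a small comparison like the one below can help. The sample bytes are an assumption chosen for the example: 0x87 0x40 is the circled digit one (①) in Windows-31J, a vendor extension that the original Shift_JIS standard does not define. What plain SJIS returns for it may vary between PHP versions, so the sketch prints both results rather than assuming one.

```php
<?php
// Sample input (assumption): "①" as encoded in Windows-31J.
$bytes = "\x87\x40";

$results = [];
foreach (['SJIS', 'SJIS-win'] as $source) {
    // Same bytes, two different source encodings.
    $results[$source] = mb_convert_encoding($bytes, 'UTF-8', $source);
    printf("%-8s => %s\n", $source, bin2hex($results[$source]));
}
```

With SJIS-win the output should be e291a0, the utf-8 encoding of U+2460.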

Will a unified future ever come?

About 20 years ago, when I first started web programming, UTF-8 was not yet the standard and garbled characters were an everyday occurrence. It seems safe to say that the unification of the web on utf-8 is now largely complete.
Smartphones were born after utf-8 had spread, so they are built on the premise of utf-8, and files in other encodings such as Shift_JIS tend to come out garbled on them. On the other hand, files exchanged on Windows are often still Shift_JIS.

Summary

In this article, I have introduced how to unify the encoding of text files into utf-8 in PHP. As a company from Japan, we will keep Shift_JIS in mind for a while longer so that our products stay easy to use for Japanese users. (I really wish I could forget about IE11.)

Converting text between UTF-8 and WINDOWS-1251

Encoding problems come up often when writing parsers and when reading data from XML and CSV files. Below are several ways to solve them.

windows-1251 to UTF-8

// Option 1: iconv. //IGNORE belongs on the target encoding only.
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;

// Option 2: mbstring.
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;

UTF-8 to windows-1251

// Option 1: iconv. //IGNORE drops characters that windows-1251 cannot represent.
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;

// Option 2: mbstring.
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;

When nothing else helps

// Undo double encoding: text that was really cp1251 but was read as cp1252 and then saved as UTF-8.
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;

Sometimes it gets absurd, but this works:

// Round trip through windows-1251; //IGNORE discards anything that cannot be represented.
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;
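To make the cp1252/cp1251 double conversion above less mysterious, here is a small worked sketch. The helper name repairCp1251ReadAsCp1252 and the sample word are assumptions for illustration. The sample is built by taking the cp1251 bytes of "Привет" and deliberately decoding them as cp1252, which is the mistake the two iconv() calls are meant to reverse.

```php
<?php
// Hypothetical helper wrapping the two-step repair.
function repairCp1251ReadAsCp1252(string $mojibake): string
{
    // Step 1: turn each character back into the single cp1252 byte it came from.
    $bytes = iconv('UTF-8', 'CP1252', $mojibake);
    if ($bytes === false) {
        throw new RuntimeException('Text contains characters outside cp1252');
    }
    // Step 2: reinterpret those bytes as cp1251 and encode the result as UTF-8.
    $fixed = iconv('CP1251', 'UTF-8', $bytes);
    if ($fixed === false) {
        throw new RuntimeException('Bytes are not valid cp1251');
    }
    return $fixed;
}

// Build the broken sample: cp1251 bytes of "Привет", wrongly decoded as cp1252.
$cp1251Bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";
$mojibake = iconv('CP1252', 'UTF-8', $cp1251Bytes); // displays as "Ïðèâåò"

echo repairCp1251ReadAsCp1252($mojibake), PHP_EOL;  // Привет
```

The repair only works when the damage really was this specific misreading; applied to healthy text it will corrupt it, so it is worth testing on a sample first.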

file_get_contents / cURL

Sometimes file_get_contents() or cURL return text that displays as garbage (Алмазные борÑ). The cause here is not the encoding itself but a missing BOM: the bytes are already UTF-8, and without the mark the viewer guesses a different encoding.

$text = file_get_contents('https://example.com');
$text = "\xEF\xBB\xBF" . $text; // prepend the UTF-8 BOM
echo $text;
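If the same text may pass through this step more than once, or the source sometimes already includes the mark, it can be worth adding the BOM only when it is absent. A minimal sketch, with a hypothetical helper name and constant:

```php
<?php
// The three bytes of the UTF-8 byte order mark.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: prepend the BOM unless the text already starts with it.
function withUtf8Bom(string $text): string
{
    if (strncmp($text, UTF8_BOM, 3) === 0) {
        return $text; // already marked
    }
    return UTF8_BOM . $text;
}

echo bin2hex(withUtf8Bom('abc')), PHP_EOL; // efbbbf616263
```

Calling the helper twice leaves a single BOM, whereas unconditional concatenation would stack them.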

There are also cases where file_get_contents() returns text that looks like binary garbage.

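This is GZIP-compressed text: the function does not send the right request headers, so the compressed response is never decoded. The fix is to fetch the page through cURL:

function getcontents($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);   // follow redirects
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');    // request gzip and decode it automatically
    // The next two lines disable TLS certificate checks; this is insecure outside of testing.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

echo getcontents('https://example.com');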
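When the compressed body has already been fetched, for example by file_get_contents(), another possible approach is to decode it after the fact. The sketch below assumes the zlib extension is available; the helper name gunzipIfNeeded is hypothetical. It looks for the two-byte gzip signature (0x1F 0x8B) and only then calls gzdecode().

```php
<?php
// Hypothetical helper: decode the body if it looks like gzip, otherwise return it unchanged.
function gunzipIfNeeded(string $body): string
{
    // Every gzip stream starts with the magic bytes 0x1F 0x8B.
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body;
    }
    $decoded = gzdecode($body);
    // Fall back to the raw body if the data turned out not to be valid gzip.
    return $decoded === false ? $body : $decoded;
}

// Demonstration with locally compressed data instead of a network call.
echo gunzipIfNeeded(gzencode('hello')), PHP_EOL; // hello
```

This keeps certificate verification untouched, since it does not change how the request itself is made.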

descom-es/php-file-encoding

PHP convert files encoding

PHP class to convert files encoding

You can install it with composer:

composer require descom/file_encoding

Then call encodeFile($file, $encoding_to, $encodings_detected):

use Descom\File\Encoding;

$codification = new Encoding();

$file = 'file.txt';
$encoding_to = 'UTF-8';
$encodings_detected = 'UTF-8,ISO-8859-1,WINDOWS-1252';

$result = $codification->encodeFile($file, $encoding_to, $encodings_detected);
