Decoding UTF-8 in JavaScript

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.

mathiasbynens/utf8.js

utf8.js is a well-tested UTF-8 encoder/decoder written in JavaScript. Unlike many other JavaScript solutions, it is designed to be a proper UTF-8 encoder/decoder: it can encode/decode any scalar Unicode code point values, as per the Encoding Standard. Here’s an online demo.

Feel free to fork if you see possible improvements!

    <script src="utf8.js"></script>

utf8.encode(string)

Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)

    // U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
    utf8.encode('\xA9');
    // → '\xC2\xA9'
    // U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
    utf8.encode('\uD800\uDC01');
    // → '\xF0\x90\x80\x81'

utf8.decode(byteString)

Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)

    utf8.decode('\xC2\xA9');
    // → '\xA9'
    utf8.decode('\xF0\x90\x80\x81');
    // → '\uD800\uDC01'
    // → U+10001 LINEAR B SYLLABLE B038 E
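The "UTF-8-encoded string" in the descriptions above is a byte string: every character stands for one byte, so its code units stay in the range 0x00 to 0xFF. As a minimal sketch that is not part of utf8.js, the helpers below (byteStringToBytes and bytesToByteString are hypothetical names) convert between that form and a Uint8Array, so the README examples can be cross-checked against the built-in TextEncoder and TextDecoder, assuming an environment that provides them.

```javascript
// Hypothetical helpers, not part of utf8.js.

// Convert a byte string (one character per byte) to a Uint8Array.
function byteStringToBytes(byteString) {
  const bytes = new Uint8Array(byteString.length);
  for (let i = 0; i < byteString.length; i++) {
    const code = byteString.charCodeAt(i);
    if (code > 0xFF) {
      throw new RangeError('Not a byte string: code unit ' + code + ' at index ' + i);
    }
    bytes[i] = code;
  }
  return bytes;
}

// Convert a Uint8Array back to a byte string.
function bytesToByteString(bytes) {
  let out = '';
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}

// Cross-check the README examples with the built-in APIs.
const encodedCopyright = bytesToByteString(new TextEncoder().encode('\xA9'));
console.log(encodedCopyright === '\xC2\xA9'); // true

const decodedLinearB = new TextDecoder().decode(byteStringToBytes('\xF0\x90\x80\x81'));
console.log(decodedLinearB === '\uD800\uDC01'); // true
```

The range check matters because a string containing code units above 0xFF is ordinary text, not a byte string, and silently truncating it would corrupt the data.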

utf8.version

A string representing the semantic version number.

utf8.js has been tested in at least Chrome 27-39, Firefox 3-34, Safari 4-8, Opera 10-28, IE 6-11, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.11, PhantomJS 1.9.0, and Rhino 1.7RC4.

Unit tests & code coverage

After cloning this repository, run npm install to install the dependencies needed for development and testing. You may want to install Istanbul globally using npm install istanbul -g.

Once that’s done, you can run the unit tests in Node using npm test or node tests/tests.js. To run the tests in Rhino, Ringo, Narwhal, PhantomJS, and web browsers as well, use grunt test.

To generate the code coverage report, use grunt cover.

Why is the first release named v2.0.0? Haven’t you heard of semantic versioning?

Long before utf8.js was created, the utf8 module on npm was registered and used by another (slightly buggy) library. @ryanmcgrath was kind enough to give me access to the utf8 package on npm when I told him about utf8.js. Since there has already been a v1.0.0 release of the old library, and to avoid breaking backwards compatibility with projects that rely on the utf8 npm package, I decided to tag the first release of utf8.js as v2.0.0 and take it from there.

utf8.js is available under the MIT license.

TextDecoder and TextEncoder

What if the binary data is actually a string? For example, we received a file containing text data.

The built-in TextDecoder object lets us decode data from a binary buffer into a regular string.

First, we need to create the decoder:

let decoder = new TextDecoder([label], [options]);
  • label – the encoding; utf-8 is the default, but big5, windows-1251 and many others are also supported.
  • options – an object with additional settings:
    • fatal – boolean; if true, an exception is thrown for invalid (undecodable) characters; otherwise (the default) they are replaced with the character \uFFFD.
    • ignoreBOM – boolean; if true, the BOM (an optional byte-order mark) is ignored, which is very rarely needed.

…and then call its decode method:

let str = decoder.decode([input], [options]);
  • input – the binary buffer (BufferSource) to decode.
  • options – an object with additional settings:
    • stream – true when decoding a stream of data, where decoder is called again and again for each incoming chunk. In that case a multi-byte character may occasionally be split between chunks. This option tells TextDecoder to remember the unfinished character and decode it together with the next chunk.

let uint8Array = new Uint8Array([72, 101, 108, 108, 111]);
alert( new TextDecoder().decode(uint8Array) ); // Hello

let uint8Array = new Uint8Array([228, 189, 160, 229, 165, 189]);
alert( new TextDecoder().decode(uint8Array) ); // 你好
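The fatal and stream options described above can be illustrated with a short sketch. The byte values and variable names are chosen here for illustration only; the sketch assumes an environment with a standard TextDecoder.

```javascript
// fatal: true makes the decoder throw a TypeError instead of emitting U+FFFD.
const lenient = new TextDecoder('utf-8');
const strict = new TextDecoder('utf-8', { fatal: true });
const invalid = new Uint8Array([72, 105, 0xFF]); // "Hi" followed by an invalid byte

console.log(lenient.decode(invalid)); // "Hi\uFFFD"

let strictError = null;
try {
  strict.decode(invalid);
} catch (e) {
  strictError = e;
}
console.log(strictError instanceof TypeError); // true

// stream: true lets a multi-byte character span two chunks.
// The three bytes of "你" (228, 189, 160) are split across the chunks here.
const streaming = new TextDecoder();
const chunk1 = new Uint8Array([228, 189]);
const chunk2 = new Uint8Array([160, 229, 165, 189]);

let text = streaming.decode(chunk1, { stream: true }); // nothing complete yet
text += streaming.decode(chunk2, { stream: true });
text += streaming.decode(); // final call without stream flushes any pending bytes
console.log(text); // "你好"
```

Without stream: true, the first chunk would be decoded on its own and its incomplete character would be replaced with U+FFFD.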

      We can decode part of a binary array by creating a subarray view:

      let uint8Array = new Uint8Array([0, 72, 101, 108, 108, 111, 0]);

      // The string is in the middle of the array.
      // Note that subarray only creates a new view; the underlying data is not copied.
      // Changes made through the view affect the original array, and vice versa.
      let binaryString = uint8Array.subarray(1, -1);

      alert( new TextDecoder().decode(binaryString) ); // Hello

      TextEncoder

      TextEncoder does the reverse: it converts a string into an array of bytes.

      It has the following syntax:
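The text breaks off at this point. As a sketch based on the standard API rather than on the original article: the TextEncoder constructor takes no encoding argument and always produces UTF-8. Its encode(str) method returns a new Uint8Array, and encodeInto(str, destination) writes into an existing one. The variable names below are illustrative.

```javascript
const encoder = new TextEncoder(); // always UTF-8

// encode returns a new Uint8Array holding the UTF-8 bytes of the string.
const helloBytes = encoder.encode('Hello');
console.log(Array.from(helloBytes)); // [72, 101, 108, 108, 111]

// encodeInto writes into a caller-supplied Uint8Array and reports progress.
const destination = new Uint8Array(8);
const progress = encoder.encodeInto('你好', destination);
console.log(progress.read);    // 2 (UTF-16 code units consumed)
console.log(progress.written); // 6 (bytes written)
```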

      Decode UTF-8 with JavaScript

      I have JavaScript in an XHTML web page that is passing UTF-8 encoded strings. It needs to continue to pass the UTF-8 version, as well as decode it. How is it possible to decode a UTF-8 string for display?

        

      Solution 1: [1]

      To answer the original question, here is how you decode UTF-8 in JavaScript:

      function encode_utf8(s) {
        return unescape(encodeURIComponent(s));
      }

      function decode_utf8(s) {
        return decodeURIComponent(escape(s));
      }

      We have been using this in our production code for 6 years, and it has worked flawlessly.

      Note, however, that escape() and unescape() are deprecated.

      Solution 2: [2]

      // http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt

      /* utf.js - UTF-8 <=> UTF-16 convertion
       *
       * Copyright (C) 1999 Masanao Izumo <[email protected]>
       * Version: 1.0
       * LastModified: Dec 25 1999
       * This library is free. You can redistribute it and/or modify it.
       */

      function Utf8ArrayToStr(array) {
        var out, i, len, c;
        var char2, char3;

        out = "";
        len = array.length;
        i = 0;
        while (i < len) {
          c = array[i++];
          switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
              // 0xxxxxxx
              out += String.fromCharCode(c);
              break;
            case 12: case 13:
              // 110x xxxx   10xx xxxx
              char2 = array[i++];
              out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
              break;
            case 14:
              // 1110 xxxx  10xx xxxx  10xx xxxx
              char2 = array[i++];
              char3 = array[i++];
              out += String.fromCharCode(((c & 0x0F) << 12) |
                                         ((char2 & 0x3F) << 6) |
                                         ((char3 & 0x3F) << 0));
              break;
          }
        }

        return out;
      }

      Solution 3: [3]

      Perhaps using TextDecoder will be sufficient.

      It is not supported in IE, though.

      var decoder = new TextDecoder('utf-8'),
          decodedMessage;

      decodedMessage = decoder.decode(message.data);

      Handling non-UTF8 text

      In this example, we decode the Russian text "Привет, мир!", which means "Hello, world!" In our TextDecoder() constructor, we specify the Windows-1251 character encoding, which is appropriate for Cyrillic script.

      let win1251decoder = new TextDecoder('windows-1251');
      let bytes = new Uint8Array([207, 240, 232, 226, 229, 242, 44, 32, 236, 232, 240, 33]);
      console.log(win1251decoder.decode(bytes)); // Привет, мир!

      The interface for the TextDecoder is described here.

      Retrieving a byte array from a string is equally simple:

      const decoder = new TextDecoder();
      const encoder = new TextEncoder();

      // Convert the string to a byte array.
      const byteArray = encoder.encode('Größe');

      // Now we can decode it back to a string if desired.
      console.log(decoder.decode(byteArray));

      If your data is in a different encoding, you must account for that when decoding: pass one of the valid encoding labels to the TextDecoder constructor. In current browsers TextEncoder takes no encoding parameter and always produces UTF-8.
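As a small sketch of handling data that arrives in a different encoding: decode the legacy bytes with a TextDecoder for that encoding, then re-encode the resulting string as UTF-8. The byte values are the first word of the Windows-1251 example above; the variable names are illustrative, and the sketch assumes the runtime ships the windows-1251 decoder (browsers do, and so do Node.js builds with full ICU data).

```javascript
// Bytes of the word "Привет" in windows-1251 (one byte per letter).
const legacyBytes = new Uint8Array([207, 240, 232, 226, 229, 242]);

// Step 1: legacy bytes -> JavaScript string.
const textFromLegacy = new TextDecoder('windows-1251').decode(legacyBytes);

// Step 2: JavaScript string -> UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode(textFromLegacy);

console.log(textFromLegacy);   // "Привет"
console.log(utf8Bytes.length); // 12 (each Cyrillic letter takes two bytes in UTF-8)
```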

      Solution 4: [4]

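      An update to @Albert’s answer that adds a case for emoji (4-byte sequences).

      function Utf8ArrayToStr(array) {
        var out, i, len, c;
        var char2, char3, char4;

        out = "";
        len = array.length;
        i = 0;
        while (i < len) {
          c = array[i++];
          switch (c >> 4) {
            case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
              // 0xxxxxxx
              out += String.fromCharCode(c);
              break;
            case 12: case 13:
              // 110x xxxx   10xx xxxx
              char2 = array[i++];
              out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
              break;
            case 14:
              // 1110 xxxx  10xx xxxx  10xx xxxx
              char2 = array[i++];
              char3 = array[i++];
              out += String.fromCharCode(((c & 0x0F) << 12) |
                                         ((char2 & 0x3F) << 6) |
                                         ((char3 & 0x3F) << 0));
              break;
            case 15:
              // 1111 0xxx  10xx xxxx  10xx xxxx  10xx xxxx
              char2 = array[i++];
              char3 = array[i++];
              char4 = array[i++];
              out += String.fromCodePoint(((c & 0x07) << 18) |
                                          ((char2 & 0x3F) << 12) |
                                          ((char3 & 0x3F) << 6) |
                                          (char4 & 0x3F));
              break;
          }
        }

        return out;
      }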

      Solution 5: [5]

      Here is a solution that handles all Unicode code points, including the upper (4-byte) values, and is supported by all modern browsers (IE and others > 5.5). It uses decodeURIComponent(), but NOT the deprecated escape/unescape functions:

      function utf8_to_str(a) {
        for (var i = 0, s = ''; i < a.length; i++) {
          // Percent-encode every byte as two hex digits.
          var h = a[i].toString(16)
          if (h.length < 2) h = '0' + h
          s += '%' + h
        }
        return decodeURIComponent(s)
      }

      To create UTF-8 from a string:

      function utf8_from_str(s) {
        for (var i = 0, enc = encodeURIComponent(s), a = []; i < enc.length;) {
          if (enc[i] === '%') {
            // A percent escape stands for one byte.
            a.push(parseInt(enc.slice(i + 1, i + 3), 16))
            i += 3
          } else {
            // Unescaped characters are ASCII and map to a single byte.
            a.push(enc.charCodeAt(i++))
          }
        }
        return a
      }

      Solution 6: [6]

      @albert’s solution was the closest, I think, but it can only parse UTF-8 characters of up to 3 bytes.

      function utf8ArrayToStr(array) {
        var out, i, len, c;

        out = "";
        len = array.length;
        i = 0;

        // XXX: Invalid bytes are ignored
        while (i < len) {
          c = array[i++];
          if (c >> 7 == 0) {
            // 0xxx xxxx
            out += String.fromCharCode(c);
            continue;
          }

          // Invalid starting byte
          if (c >> 6 == 0x02) {
            continue;
          }

          // #### MULTIBYTE ####
          // How many bytes are left for this character?
          var extraLength = null;
          if (c >> 5 == 0x06) {
            extraLength = 1;
          } else if (c >> 4 == 0x0e) {
            extraLength = 2;
          } else if (c >> 3 == 0x1e) {
            extraLength = 3;
          } else if (c >> 2 == 0x3e) {
            extraLength = 4;
          } else if (c >> 1 == 0x7e) {
            extraLength = 5;
          } else {
            continue;
          }

          // Do we have enough bytes in our data?
          if (i + extraLength > len) {
            var leftovers = array.slice(i - 1);

            // If there is an invalid byte in the leftovers we might want to
            // continue from there.
            for (; i < len; i++) if (array[i] >> 6 != 0x02) break;
            if (i != len) continue;

            // All leftover bytes are valid.
            return { result: out, leftovers: leftovers };
          }

          // Remove the UTF-8 prefix from the char (res)
          var mask = (1 << (8 - extraLength - 1)) - 1,
              res = c & mask, nextChar, count;

          for (count = 0; count < extraLength; count++) {
            nextChar = array[i++];

            // Is the char a valid multibyte part?
            if (nextChar >> 6 != 0x02) { break; }
            res = (res << 6) | (nextChar & 0x3f);
          }

          if (count != extraLength) {
            i--;
            continue;
          }

          if (res <= 0xffff) {
            out += String.fromCharCode(res);
            continue;
          }

          // Code points above U+FFFF are emitted as a surrogate pair.
          res -= 0x10000;
          var high = ((res >> 10) & 0x3ff) + 0xd800,
              low = (res & 0x3ff) + 0xdc00;
          out += String.fromCharCode(high, low);
        }

        return { result: out, leftovers: [] };
      }

      EDIT: fixed the issue that @unhammer found.

      Solution 7: [7]

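      // String to UTF-8 byte buffer

      function strToUTF8(str) {
        return Uint8Array.from(
          // Turn every %XX escape into the single character with that code,
          // so the result has exactly one character per UTF-8 byte.
          encodeURIComponent(str).replace(/%(..)/g, (m, v) => String.fromCodePoint(parseInt(v, 16))),
          c => c.codePointAt(0)
        )
      }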

      Solution 8: [8]

      This is what I found after a more specific Google search than just "UTF-8 encode/decode". For those who are looking for a library to convert between encodings, here you go.

      var uint8array = new TextEncoder().encode(str);
      var str = new TextDecoder(encoding).decode(uint8array);

      Paste from repo readme

      All encodings from the Encoding specification are supported:

      utf-8 ibm866 iso-8859-2 iso-8859-3 iso-8859-4 iso-8859-5 iso-8859-6 iso-8859-7 iso-8859-8 iso-8859-8-i iso-8859-10 iso-8859-13 iso-8859-14 iso-8859-15 iso-8859-16 koi8-r koi8-u macintosh windows-874 windows-1250 windows-1251 windows-1252 windows-1253 windows-1254 windows-1255 windows-1256 windows-1257 windows-1258 x-mac-cyrillic gb18030 hz-gb-2312 big5 euc-jp iso-2022-jp shift_jis euc-kr replacement utf-16be utf-16le x-user-defined

      (Some encodings may be supported under other names, e.g. ascii, iso-8859-1, etc. See Encoding for additional labels for each encoding.)

      Solution 9: [9]

      Using my 1.6KB library, you can do

      ToString(FromUTF8(Array.from(usernameReceived))) 

      Solution 10: [10]

      You can use decodeURI for this.

      decodeURI('https://developer.mozilla.org/ru/docs/JavaScript_%D1%88%D0%B5%D0%BB%D0%BB%D1%8B');
      // "https://developer.mozilla.org/ru/docs/JavaScript_шеллы"

      Consider calling it inside a try...catch block, so that a URIError thrown for malformed input does not go unhandled.

      It is also supported in all browsers.
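A minimal sketch of that advice follows. The helper name safeDecodeURI is hypothetical, and returning null on failure is just one possible convention.

```javascript
// Hypothetical helper: returns null instead of throwing on malformed input.
function safeDecodeURI(text) {
  try {
    return decodeURI(text);
  } catch (e) {
    if (e instanceof URIError) {
      return null; // the escape sequences did not form valid UTF-8
    }
    throw e; // anything else is unexpected, so let it propagate
  }
}

console.log(safeDecodeURI('JavaScript_%D1%88%D0%B5%D0%BB%D0%BB%D1%8B')); // "JavaScript_шеллы"
console.log(safeDecodeURI('%E0%A4%A')); // null (truncated escape sequence)
```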

      Solution 11: [11]

      I reckon the easiest way would be to use the built-in JavaScript functions decodeURI() / encodeURI().

      Solution 12: [12]

      This is a solution with extensive error reporting.

      It takes a UTF-8 encoded byte array (where the byte array is represented as an array of numbers, each an integer between 0 and 255 inclusive) and produces a JavaScript string of Unicode characters. The helpers below validate the continuation bytes of a sequence; fromUtf8 performs the conversion in the other direction, from a string to a byte array.

      // Helper used by the error messages below. Its definition is missing from
      // this excerpt, so this version is an assumption: it formats a number in
      // binary, left-padded to at least 8 digits.
      function toBinary(value) {
        var str = value.toString(2);
        while (str.length < 8) str = "0" + str;
        return str;
      }

      function getNextByte(value, startByteIndex, startBitsStr, additional, index) {
        if (index >= value.length) {
          var startByte = value[startByteIndex];
          throw new Error("Invalid UTF-8 sequence. Byte " + startByteIndex
            + " with value " + startByte + " (" + String.fromCharCode(startByte)
            + "; binary: " + toBinary(startByte) + ") starts with " + startBitsStr
            + " in binary and thus requires " + additional
            + " bytes after it, but we only have " + (value.length - startByteIndex) + ".");
        }
        var byteValue = value[index];
        checkNextByteFormat(value, startByteIndex, startBitsStr, additional, index);
        return byteValue;
      }

      function checkNextByteFormat(value, startByteIndex, startBitsStr, additional, index) {
        if ((value[index] & 0xC0) != 0x80) {
          var startByte = value[startByteIndex];
          var wrongByte = value[index];
          throw new Error("Invalid UTF-8 byte sequence. Byte " + startByteIndex
            + " with value " + startByte + " (" + String.fromCharCode(startByte)
            + "; binary: " + toBinary(startByte) + ") starts with " + startBitsStr
            + " in binary and thus requires " + additional
            + " additional bytes, each of which should start with 10 in binary."
            + " However byte " + (index - startByteIndex) + " after it with value "
            + wrongByte + " (" + String.fromCharCode(wrongByte) + "; binary: "
            + toBinary(wrongByte) + ") does not start with 10 in binary.");
        }
      }

      function fromUtf8(str) {
        var value = [];
        var destIndex = 0;
        for (var index = 0; index < str.length; index++) {
          // Note: charCodeAt returns UTF-16 code units (at most 0xFFFF), so the
          // branches for larger values are never reached with this input.
          var code = str.charCodeAt(index);
          if (code <= 0x7F) {
            value[destIndex++] = code;
          } else if (code <= 0x7FF) {
            value[destIndex++] = ((code >> 6 ) & 0x1F) | 0xC0;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
          } else if (code <= 0xFFFF) {
            value[destIndex++] = ((code >> 12) & 0x0F) | 0xE0;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
          } else if (code <= 0x1FFFFF) {
            value[destIndex++] = ((code >> 18) & 0x07) | 0xF0;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
          } else if (code <= 0x03FFFFFF) {
            // Legacy 5-byte form; the lead byte prefix is 111110xx (0xF8).
            value[destIndex++] = ((code >> 24) & 0x03) | 0xF8;
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
          } else if (code <= 0x7FFFFFFF) {
            // Legacy 6-byte form; the lead byte prefix is 1111110x (0xFC).
            value[destIndex++] = ((code >> 30) & 0x01) | 0xFC;
            value[destIndex++] = ((code >> 24) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 18) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 12) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 6 ) & 0x3F) | 0x80;
            value[destIndex++] = ((code >> 0 ) & 0x3F) | 0x80;
          } else {
            throw new Error("Unsupported Unicode character \"" + str.charAt(index)
              + "\" with code " + code + " (binary: " + toBinary(code) + ") at index "
              + index + ". Cannot represent it as UTF-8 byte sequence.");
          }
        }
        return value;
      }
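The decoding direction that the description mentions (byte array to string) is not part of the code shown here. The following is a separate, self-contained sketch of a strict decoder with error reporting, not the answer author's implementation; the name utf8BytesToString and the wording of the error messages are assumptions. It accepts only sequences of 1 to 4 bytes and rejects overlong forms, surrogates, and values above U+10FFFF.

```javascript
// A sketch of a strict UTF-8 decoder; utf8BytesToString is a hypothetical name.
function utf8BytesToString(bytes) {
  let out = '';
  let i = 0;
  while (i < bytes.length) {
    const start = i;
    const lead = bytes[i++];
    let extra;     // number of continuation bytes expected
    let codePoint; // accumulated payload bits
    let min;       // smallest code point allowed for this sequence length

    if (lead <= 0x7F) {
      extra = 0; codePoint = lead; min = 0;
    } else if ((lead & 0xE0) === 0xC0) {
      extra = 1; codePoint = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) === 0xE0) {
      extra = 2; codePoint = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) === 0xF0) {
      extra = 3; codePoint = lead & 0x07; min = 0x10000;
    } else {
      throw new Error('Invalid lead byte 0x' + lead.toString(16) + ' at index ' + start);
    }

    for (let k = 0; k < extra; k++) {
      if (i >= bytes.length) {
        throw new Error('Truncated sequence starting at index ' + start);
      }
      const next = bytes[i++];
      if ((next & 0xC0) !== 0x80) {
        throw new Error('Byte at index ' + (i - 1) + ' is not a continuation byte');
      }
      codePoint = (codePoint << 6) | (next & 0x3F);
    }

    // Reject overlong encodings, surrogates, and out-of-range values.
    if (codePoint < min || codePoint > 0x10FFFF ||
        (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
      throw new Error('Invalid code point in sequence starting at index ' + start);
    }
    out += String.fromCodePoint(codePoint);
  }
  return out;
}

console.log(utf8BytesToString([228, 189, 160, 229, 165, 189])); // "你好"
```

Where it is available, new TextDecoder('utf-8', { fatal: true }) performs the same validation natively; a hand-written decoder is mainly useful when more specific error messages are wanted.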

      Solution 13: [13]

      const decoder = new TextDecoder();
      console.log(decoder.decode(new Uint8Array([97])));

      Sources

      This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
