JavaScript escape UTF-8

encodeURIComponent()

The encodeURIComponent() function encodes a URI by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two surrogate characters). Compared to encodeURI(), this function encodes more characters, including those that are part of the URI syntax.
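As a rough illustration of the "one, two, three, or four escape sequences" rule, the sketch below counts the %XX groups produced for a few sample characters. The helper name countEscapes is made up for this example.

```javascript
// Each sample needs a different number of UTF-8 bytes, and therefore
// a different number of %XX escape sequences.
const samples = ["$", "é", "€", "😀"];

// Hypothetical helper: count the %XX groups in the encoded output.
function countEscapes(str) {
  const matches = encodeURIComponent(str).match(/%[0-9A-F]{2}/g);
  return matches ? matches.length : 0;
}

for (const ch of samples) {
  console.log(ch, encodeURIComponent(ch), countEscapes(ch));
}
// $ %24 1
// é %C3%A9 2
// € %E2%82%AC 3
// 😀 %F0%9F%98%80 4
```

The emoji is stored as a surrogate pair in the JavaScript string, which is why it is the only sample that yields four escape sequences.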

Syntax

encodeURIComponent(uriComponent) 

Parameters

uriComponent: A string to be encoded as a URI component (a path, query string, fragment, etc.). Other values are converted to strings.

Return value

A new string representing the provided uriComponent encoded as a URI component.

Exceptions

URIError: Thrown if uriComponent contains a lone surrogate.

Description

encodeURIComponent() is a function property of the global object.

encodeURIComponent() uses the same encoding algorithm as described in encodeURI(). It escapes all characters except:

A–Z a–z 0–9 - _ . ! ~ * ' ( )

Compared to encodeURI(), encodeURIComponent() escapes a larger set of characters. Use encodeURIComponent() on user-entered fields from forms POSTed to the server; this will encode & symbols that may inadvertently be generated during data entry for special HTML entities or other characters that require encoding/decoding. For example, if a user writes Jack & Jill, without encodeURIComponent(), the ampersand could be interpreted on the server as the start of a new field and jeopardize the integrity of the data.
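The Jack & Jill case can be sketched as follows. The URL and the field name "name" are placeholders invented for this example; the URL class is used only to show how a receiver would split the query string.

```javascript
// Hypothetical endpoint and field name, for illustration only.
const base = "https://example.com/search";
const userInput = "Jack & Jill";

// Unencoded: the raw "&" splits the value into two fields.
const broken = `${base}?name=${userInput}`;

// Encoded: "&" becomes %26 and stays inside the value.
const safe = `${base}?name=${encodeURIComponent(userInput)}`;

console.log(new URL(broken).searchParams.get("name")); // "Jack "
console.log(new URL(safe).searchParams.get("name")); // "Jack & Jill"
```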

For application/x-www-form-urlencoded, spaces are to be replaced by +, so one may wish to follow an encodeURIComponent() replacement with an additional replacement of %20 with +.
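A minimal sketch of that two-step replacement is shown below. The helper names formEncodeComponent and formEncode, and the sample field names, are assumptions made for this example.

```javascript
// Encode one key or value the way application/x-www-form-urlencoded expects:
// percent-encode first, then turn each %20 (space) into "+".
function formEncodeComponent(value) {
  return encodeURIComponent(value).replace(/%20/g, "+");
}

// Build a body from a plain object (field names here are made up).
function formEncode(fields) {
  return Object.entries(fields)
    .map(([k, v]) => `${formEncodeComponent(k)}=${formEncodeComponent(String(v))}`)
    .join("&");
}

console.log(formEncode({ name: "Jack & Jill", note: "a+b c" }));
// "name=Jack+%26+Jill&note=a%2Bb+c"
```

A literal + in the input is already escaped to %2B by encodeURIComponent(), so the later %20 replacement cannot confuse it with a space.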

Examples

The following example provides the special encoding required within UTF-8 Content-Disposition and Link server response header parameters (e.g., UTF-8 filenames):

const fileName = "my file(2).txt";
const header = `Content-Disposition: attachment; filename*=UTF-8''${encodeRFC5987ValueChars(
  fileName,
)}`;

console.log(header);
// "Content-Disposition: attachment; filename*=UTF-8''my%20file%282%29.txt"

function encodeRFC5987ValueChars(str) {
  return (
    encodeURIComponent(str)
      // The following creates the sequences %27 %28 %29 %2A (Note that
      // the valid encoding of "*" is %2A, which necessitates calling
      // toUpperCase() to properly encode). Although RFC3986 reserves "!",
      // RFC5987 does not, so we do not need to escape it.
      .replace(
        /['()*]/g,
        (c) => `%${c.charCodeAt(0).toString(16).toUpperCase()}`,
      )
      // The following are not required for percent-encoding per RFC5987,
      // so we can allow for a little better readability over the wire: |`^
      .replace(/%(7C|60|5E)/g, (str, hex) =>
        String.fromCharCode(parseInt(hex, 16)),
      )
  );
}

Encoding for RFC3986

The more recent RFC3986 reserves !, ', (, ), and *, even though these characters have no formalized URI delimiting uses. The following function encodes a string for RFC3986-compliant URL component format. It also encodes [ and ], which are part of the IPv6 URI syntax. An RFC3986-compliant encodeURI implementation should not escape them, which is demonstrated in the encodeURI() example.

function encodeRFC3986URIComponent(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (c) => `%${c.charCodeAt(0).toString(16).toUpperCase()}`,
  );
}
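To make the difference visible, the sketch below repeats the function so that it runs on its own and compares its output with plain encodeURIComponent(). The input string is an arbitrary sample.

```javascript
function encodeRFC3986URIComponent(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (c) => `%${c.charCodeAt(0).toString(16).toUpperCase()}`,
  );
}

const input = "it's (really) *new*!";

console.log(encodeURIComponent(input));
// "it's%20(really)%20*new*!"

console.log(encodeRFC3986URIComponent(input));
// "it%27s%20%28really%29%20%2Anew%2A%21"
```

Both results decode back to the same string with decodeURIComponent(), so the stricter form costs nothing on the receiving side.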

Encoding a lone high surrogate throws

A URIError will be thrown if one attempts to encode a surrogate which is not part of a high-low pair. For example:

// High-low pair OK
encodeURIComponent("\uD800\uDFFF"); // "%F0%90%8F%BF"

// Lone high surrogate throws "URIError: malformed URI sequence"
encodeURIComponent("\uD800");

// Lone low surrogate throws "URIError: malformed URI sequence"
encodeURIComponent("\uDFFF");

You can use String.prototype.toWellFormed(), which replaces lone surrogates with the Unicode replacement character (U+FFFD), to avoid this error. You can also use String.prototype.isWellFormed() to check if a string contains lone surrogates before passing it to encodeURIComponent().
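One possible way to apply this is sketched below. The wrapper name safeEncodeURIComponent is hypothetical, and the regular-expression fallback for engines without toWellFormed() is an assumption of this sketch, not something the text above prescribes.

```javascript
// Replace lone surrogates with U+FFFD before encoding, so no URIError is thrown.
function safeEncodeURIComponent(str) {
  const wellFormed =
    typeof str.toWellFormed === "function"
      ? str.toWellFormed()
      : // Fallback for older engines (an assumption of this sketch):
        // match a high surrogate with no low surrogate after it, or a
        // low surrogate with no high surrogate before it.
        str.replace(
          /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
          "\uFFFD",
        );
  return encodeURIComponent(wellFormed);
}

console.log(safeEncodeURIComponent("a\uD800b")); // "a%EF%BF%BDb"
console.log(safeEncodeURIComponent("\uD800\uDFFF")); // "%F0%90%8F%BF"
```

Replacing the lone surrogate loses information, so a caller that must reject bad input may prefer to check first and report an error.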

This page was last modified on Jul 24, 2023 by MDN contributors.

UTF-8 and Javascript

I use JavaScript to read data from an HTML page that declares the UTF-8 charset.

My JavaScript file is also loaded with the UTF-8 charset,

but I get an encoding problem when I read data with innerHTML.

Is there something I missed?

The files themselves must also be saved with UTF-8 encoding.

How to decode a UTF-8 encoded String using Java? It is MIME-encoded: the "B" encoding, to be specific (RFC 2047, section 4.1). I think you can decode it using the JavaMail javax.mail.internet.InternetHeaders or MimeUtility class. (Answered Jul 5, 2010 by J-16 SDiZ.)

How can I decode an utf-8 encoded string in JavaScript?

I am making a REST API project that gets data from a Python script and returns it via Node.js. The data is sent from the Python script to Node.js with the following code:

json_dump = json.dumps(data)
print(json_dump.encode("utf-8", "replace"))

And js gets the data with the following code:

PythonShell.run('script.py', options, function (err, data) {
  if (err) throw err;
  res.json(JSON.parse(data));
});

But I get the following error:

Unexpected token b in JSON at position 0 

The JSON arrives correctly but starts with a 'b', and many characters are either not printed or printed like this: "\xf0\x9f\xa4\x91". What can I do?

Remove the .encode("utf-8", "replace") call. It converts the string to a bytes object, whose printed representation starts with b'...'.

json_dump = json.dumps(data)
print(json_dump)
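If raw UTF-8 bytes do reach the JavaScript side, they can be decoded there. The sketch below assumes the bytes are available as a Uint8Array (how they arrive is outside the scope of the question) and uses the standard TextDecoder.

```javascript
// The bytes behind the "\xf0\x9f\xa4\x91" text from the question.
const bytes = Uint8Array.from([0xf0, 0x9f, 0xa4, 0x91]);

// TextDecoder defaults to UTF-8.
const text = new TextDecoder().decode(bytes);
console.log(text); // "🤑" (U+1F911)

// json.dumps() escapes non-ASCII characters by default, so a payload
// like this parses directly once it arrives as a plain string.
const payload = '{"emoji": "\\ud83e\\udd11"}';
console.log(JSON.parse(payload).emoji === text); // true
```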

Using JavaScript's atob to decode base64 doesn’t, Things change. The escape/unescape methods have been deprecated. You can URI-encode the string before you Base64-encode it. Note that this doesn't produce Base64-encoded UTF-8, …

Using encodeURI() vs. escape() for utf-8 strings in JavaScript

I am handling UTF-8 strings in JavaScript and need to escape them.

Both escape() / unescape() and encodeURI() / decodeURI() work in my browser.

> var hello = "안녕하세요"
> var hello_escaped = escape(hello)
> hello_escaped
"%uC548%uB155%uD558%uC138%uC694"
> var hello_unescaped = unescape(hello_escaped)
> hello_unescaped
"안녕하세요"

> var hello = "안녕하세요"
> var hello_encoded = encodeURI(hello)
> hello_encoded
"%EC%95%88%EB%85%95%ED%95%98%EC%84%B8%EC%9A%94"
> var hello_decoded = decodeURI(hello_encoded)
> hello_decoded
"안녕하세요"

However, Mozilla says that escape() is deprecated.

Although encodeURI() and decodeURI() work with the above utf-8 string, the docs (as well as the function names themselves) tell me that these methods are for URIs; I do not see utf-8 strings mentioned anywhere.

Simply put, is it okay to use encodeURI() and decodeURI() for utf-8 strings?

When it comes to escape and unescape , I live by two rules:

Avoiding them when you easily can:

As mentioned in the question, both escape and unescape have been deprecated. In general, one should avoid using deprecated functions.

So, if encodeURIComponent or encodeURI does the trick for you, you should use that instead of escape .

Using them when you can’t easily avoid them:

Browsers will, as far as possible, strive to achieve backwards compatibility. All major browsers have already implemented escape and unescape ; why would they un-implement them?

Browsers would have to redefine escape and unescape if the new specification requires them to do so. But wait! The people who write specifications are quite smart. They too, are interested in not breaking backwards compatibility!

I realize that the above argument is weak. But trust me: when it comes to browsers, deprecated stuff works. This even includes deprecated HTML tags.

Using escape and unescape :

So naturally, the next question is, when would one use escape or unescape ?

Recently, while working on CloudBrave, I had to deal with utf8 , latin1 and inter-conversions.

After reading a bunch of blog posts, I realized how simple this was:

var utf8_to_latin1 = function (s) {
  return unescape(encodeURIComponent(s));
};
var latin1_to_utf8 = function (s) {
  return decodeURIComponent(escape(s));
};

These inter-conversions, without using escape and unescape are rather involved. By not avoiding escape and unescape , life becomes simpler.
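For comparison, here is a sketch of the same two conversions without escape and unescape, using the standard TextEncoder and TextDecoder. The function names are made up for this example, and this is one possible alternative rather than the approach the answer recommends.

```javascript
// JavaScript string -> "byte string" (one character per UTF-8 byte).
function toUtf8ByteString(s) {
  let out = "";
  for (const byte of new TextEncoder().encode(s)) {
    out += String.fromCharCode(byte);
  }
  return out;
}

// "Byte string" -> normal JavaScript string.
function fromUtf8ByteString(byteString) {
  const bytes = Uint8Array.from(byteString, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

console.log(toUtf8ByteString("é") === "\u00C3\u00A9"); // true
console.log(fromUtf8ByteString("\u00C3\u00A9")); // "é"
```

The loop is longer than the one-liners above, which is the trade-off the answer describes.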

Hope this helps.

It is never okay to use encodeURI() or encodeURIComponent() . Let’s try it out:

console.log(encodeURIComponent('@#*'));

Input: @#* . Output: %40%23* . Wait, so, what exactly happened to the * character? Why wasn’t that converted? Imagine this: You ask a user what file to delete and their response is * . Server-side, you convert that using encodeURIComponent() and then run rm * . Well, got news for you: using encodeURIComponent() means you just deleted all files.

Use fixedEncodeURI() , when trying to encode a complete URL (i.e., all of example.com?arg=val ), as defined and further explained at the MDN encodeURI() Documentation .

function fixedEncodeURI(str) {
  return encodeURI(str).replace(/%5B/g, '[').replace(/%5D/g, ']');
}

Or, you may need to use fixedEncodeURIComponent() when trying to encode part of a URL (i.e., the arg or the val in example.com?arg=val), as defined and further explained in the MDN encodeURIComponent() documentation.

function fixedEncodeURIComponent(str) {
  return encodeURIComponent(str).replace(/[!'()*]/g, function(c) {
    return '%' + c.charCodeAt(0).toString(16).toUpperCase();
  });
}
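A short usage sketch may help when choosing between the two. Both functions are repeated so the example runs on its own (with uppercase hex digits in the component version); the sample URLs are arbitrary.

```javascript
function fixedEncodeURI(str) {
  return encodeURI(str).replace(/%5B/g, "[").replace(/%5D/g, "]");
}

function fixedEncodeURIComponent(str) {
  return encodeURIComponent(str).replace(
    /[!'()*]/g,
    (c) => "%" + c.charCodeAt(0).toString(16).toUpperCase(),
  );
}

// Whole URL: delimiters such as ":", "/", "?", "=" and "[" "]" survive.
console.log(fixedEncodeURI("http://[::1]:8080/a b?arg=val"));
// "http://[::1]:8080/a%20b?arg=val"

// One component: delimiters inside the value are escaped.
console.log("http://example.com/?arg=" + fixedEncodeURIComponent("a/b?c=d&e*"));
// "http://example.com/?arg=a%2Fb%3Fc%3Dd%26e%2A"
```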

If you are unable to distinguish them based on the above description, I always like to simplify it with:

Mozilla says that escape() is deprecated.

Yes, you should avoid both escape() and unescape().

Simply put, is it okay to use encodeURI() and decodeURI() for utf-8 strings?

Yes, but depending on the form of your input and the required form of your output you may need some extra work.

From your question I assume you have a JavaScript string and you want to convert encoding to UTF-8 and finally store the string in some escaped form.

First of all, it's important to note that JavaScript string encoding is UCS-2, similar to UTF-16 and different from UTF-8.

encodeURIComponent() is good for the job, as it turns the UCS-2 JavaScript string into UTF-8 and escapes it in the form of a sequence of %nn substrings, where each nn is the two hex digits of each byte.

However encodeURIComponent() does not escape letters, digits and few other characters in the ASCII range. But this is easy to fix.

For example, if you want to turn a JavaScript string into an array of numbers representing the bytes of the original string UTF-8 encoded you may use this function:

//
// Convert JavaScript UCS2 string to array of bytes representing the string UTF8 encoded
//
function StringUTF8AsBytesArrayFromString( s ) {
  var i, n, u;
  u = [];
  s = encodeURIComponent( s );
  n = s.length;
  for( i = 0; i < n; i++ ) {
    if( s.charAt( i ) == '%' ) {
      // A %nn escape: parse the two hex digits as one byte.
      u.push( parseInt( s.substring( i + 1, i + 3 ), 16 ) );
      i += 2;
    } else {
      // An unescaped ASCII character: its code is the byte value.
      u.push( s.charCodeAt( i ) );
    }
  }
  return u;
}

If you want to turn the string in its hexadecimal representation:

//
// Convert JavaScript UCS2 string to hex string representing the bytes of the string UTF8 encoded
//
function StringUTF8AsHexFromString( s ) {
  var u, i, n;
  u = StringUTF8AsBytesArrayFromString( s );
  n = u.length;
  s = '';
  for( i = 0; i < n; i++ ) {
    s += ( u[ i ] < 16 ? '0' : '' ) + u[ i ].toString( 16 );
  }
  return s;
}

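If you change the line in the for loop to

s += '%' + ( u[ i ] < 16 ? '0' : '' ) + u[ i ].toString( 16 );

(adding the % sign before each pair of hex digits), the function returns a fully percent-escaped string.

The resulting escaped string (UTF-8 encoded) may be turned back into a JavaScript UCS-2 string with decodeURIComponent().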
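The same three outputs (byte array, hex string, fully percent-escaped string) can also be sketched with the standard TextEncoder instead of parsing the output of encodeURIComponent(). The names utf8Bytes and utf8Hex are hypothetical, and this is an alternative technique, not the answer's own code.

```javascript
// UTF-8 bytes of a JavaScript string.
function utf8Bytes(str) {
  return Array.from(new TextEncoder().encode(str));
}

// Two lowercase hex digits per byte, with an optional prefix such as "%".
function utf8Hex(str, prefix = "") {
  return utf8Bytes(str)
    .map((b) => prefix + b.toString(16).padStart(2, "0"))
    .join("");
}

console.log(utf8Bytes("안")); // [ 236, 149, 136 ]
console.log(utf8Hex("안")); // "ec9588"
console.log(decodeURIComponent(utf8Hex("안", "%"))); // "안"
```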

How does the UTF-8 encoding algorithm work on 8-bit, JS integers have 32-bit binary operators, so you can safely work with 4 × 8 bits (4 bytes) in one single number. That's what your decoder receives as a parameter. UTF-8 encoding is variable in size. If the code point takes only 7 bits (= ASCII), it fits into one byte, which has a leading zero to indicate that it …

UTF-8 to UTF-16LE Javascript

I need to convert a UTF-8 string to UTF-16LE in JavaScript, like the iconv() function in PHP.

The output should look like this:

49 00 6e 00 64 00 65 00 78 00

I found this function to decode UTF-16LE, and it works fine, but I don't know how to do the same to encode.

function decodeUTF16LE( binaryStr ) {
  var cp = [];
  for( var i = 0; i < binaryStr.length; i+=2) {
    // Low byte first, then the high byte shifted into place.
    cp.push( binaryStr.charCodeAt(i) | ( binaryStr.charCodeAt(i+1) << 8 ) );
  }
  return String.fromCharCode.apply( String, cp );
}

The end goal is to create a binary file that can be downloaded.

function download(filename, text) {
  var a = window.document.createElement('a');
  var byteArray = new Uint8Array(text.length);
  for (var i = 0; i < text.length; i++) {
    byteArray[i] = text.charCodeAt(i) & 0xff;
  }
  a.href = window.URL.createObjectURL(new Blob([byteArray.buffer]));
  a.download = filename;
  // Append anchor to body.
  document.body.appendChild(a);
  a.click();
  // Remove anchor from body
  document.body.removeChild(a);
}

var byteArray = new Uint8Array(text.length * 2);
for (var i = 0; i < text.length; i++) {
  byteArray[i*2] = text.charCodeAt(i); // & 0xff;
  byteArray[i*2+1] = text.charCodeAt(i) >> 8; // & 0xff;
}

It’s the inverse of your decodeUTF16LE function. Notice that neither works with code points outside of the BMP.
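The byte layout used in that loop can also be sketched with a DataView, which makes the little-endian choice explicit. The names utf16leBytes and toHex are made up for this example; the expected output is the hex dump given in the question.

```javascript
// Encode a JavaScript string as UTF-16LE bytes.
function utf16leBytes(str) {
  const bytes = new Uint8Array(str.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < str.length; i++) {
    // charCodeAt returns one UTF-16 code unit; `true` selects little-endian.
    view.setUint16(i * 2, str.charCodeAt(i), true);
  }
  return bytes;
}

// Render bytes as space-separated hex pairs.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(toHex(utf16leBytes("Index")));
// "49 00 6e 00 64 00 65 00 78 00"
```

Because the loop works on code units, a character stored as a surrogate pair is written as two 16-bit units.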

Thanks a lot, Bergi. This works perfectly when combined with the standard UTF-8 to UTF-16 encode function:

function encodeUTF16LE(str) {
  var out, i, len, c;
  var char2, char3;
  out = "";
  len = str.length;
  i = 0;
  while(i < len) {
    c = str.charCodeAt(i++);
    switch(c >> 4) {
      case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7:
        // 0xxxxxxx
        out += str.charAt(i-1);
        break;
      case 12: case 13:
        // 110x xxxx 10xx xxxx
        char2 = str.charCodeAt(i++);
        out += String.fromCharCode(((c & 0x1F) << 6) | (char2 & 0x3F));
        break;
      case 14:
        // 1110 xxxx 10xx xxxx 10xx xxxx
        char2 = str.charCodeAt(i++);
        char3 = str.charCodeAt(i++);
        out += String.fromCharCode(((c & 0x0F) << 12) | ((char2 & 0x3F) << 6) | (char3 & 0x3F));
        break;
    }
  }
  var byteArray = new Uint8Array(out.length * 2);
  for (var i = 0; i < out.length; i++) {
    byteArray[i*2] = out.charCodeAt(i); // & 0xff;
    byteArray[i*2+1] = out.charCodeAt(i) >> 8; // & 0xff;
  }
  return String.fromCharCode.apply( String, byteArray );
}
