Java convert byte to utf 8

Encode a String to UTF-8 in Java

When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.

UTF-8 represents a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points.

A code point can represent a single character, but can also have other meanings, such as formatting. "Variable-width" means that different code points are encoded with different numbers of bytes (between one and four). As a space-saving measure, code points in the lower ranges, which include the most commonly used characters, are represented with fewer bytes than those used less frequently.

UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
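To make the variable width concrete, here's a minimal sketch that counts the UTF-8 bytes of four sample characters. The class name `Utf8Width` and the helper `utf8Length` are hypothetical, chosen just for this illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Returns how many bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the sample does not depend on the source file encoding.
        String[] samples = {
            "A",            // U+0041, in the ASCII range
            "\u010D",       // U+010D, Latin small letter c with caron
            "\u3042",       // U+3042, Hiragana letter A
            "\uD83D\uDE00"  // U+1F600, an emoji stored as a surrogate pair in UTF-16
        };
        for (String s : samples) {
            System.out.println(s + " -> " + utf8Length(s) + " byte(s)");
        }
    }
}
```

The four samples need 1, 2, 3 and 4 bytes respectively, and the single byte for `A` is identical to its ASCII value.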

Note: Java encodes all Strings into UTF-16, which uses a minimum of two bytes to store code points. Why would we need to convert to UTF-8 then?

Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn’t support as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to a more universal UTF-8.
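As a rough illustration of how the two encodings differ in size, the sketch below counts the encoded bytes of two sample Strings. The class and method names are hypothetical, and `UTF_16BE` is an assumption made so that no byte-order mark is included in the count:

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Size of the text when encoded as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of the text when encoded as big-endian UTF-16 (no byte-order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String german = "Wie hei\u00DFen Sie?"; // "Wie heißen Sie?"
        String hiragana = "\u3042";             // a single Hiragana letter

        System.out.println("German:   UTF-8 = " + utf8Size(german)
                + ", UTF-16 = " + utf16Size(german));
        System.out.println("Hiragana: UTF-8 = " + utf8Size(hiragana)
                + ", UTF-16 = " + utf16Size(hiragana));
    }
}
```

Mostly-ASCII text comes out roughly half the size in UTF-8, while some scripts take three bytes per character in UTF-8 versus two in UTF-16, so "smaller" depends on the content.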

We’ll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis — such as č , ß and あ , simulating user input.

Let’s write out a couple of Strings:

String serbianString = "Šta radiš?"; // What are you doing?
String germanString = "Wie heißen Sie?"; // What's your name?
String japaneseString = "よろしくお願いします"; // Pleased to meet you.

Now, let’s leverage the String(byte[] bytes, Charset charset) constructor of a String, to recreate these Strings, but with a different Charset , simulating ASCII input that arrived to us in the first place:

String asciiSerbianString = new String(serbianString.getBytes(), StandardCharsets.US_ASCII);
String asciigermanString = new String(germanString.getBytes(), StandardCharsets.US_ASCII);
String asciijapaneseString = new String(japaneseString.getBytes(), StandardCharsets.US_ASCII);

System.out.println(asciiSerbianString);
System.out.println(asciigermanString);
System.out.println(asciijapaneseString);

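Once we've created these Strings by decoding their bytes as ASCII, printing them shows the damage: every byte that isn't valid ASCII is replaced with the replacement character (U+FFFD, usually displayed as �).

While the first two Strings contain just a few characters that aren't valid ASCII, the final one doesn't contain a single ASCII character, so none of it survives.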
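The effect can be reproduced deterministically with a small sketch. It assumes the bytes are produced with an explicit `UTF_8` rather than the platform default, and the class and helper names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

class AsciiDecoding {
    // Encodes the text as UTF-8, then (incorrectly) decodes those bytes as US-ASCII.
    static String decodeAsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    // Counts how many U+FFFD replacement characters the decoded text contains.
    static long countReplacements(String decoded) {
        return decoded.chars().filter(c -> c == '\uFFFD').count();
    }

    public static void main(String[] args) {
        String serbian = "\u0160ta radi\u0161?"; // "Šta radiš?"
        String decoded = decodeAsAscii(serbian);
        System.out.println(decoded);
        // Each of the two non-ASCII letters is two UTF-8 bytes, so four bytes are replaced.
        System.out.println(countReplacements(decoded)); // prints 4
    }
}
```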

To avoid this issue, we can assume that not all input might already be encoded to our liking — and encode it to iron out such cases ourselves. There are several ways we can go about encoding a String to UTF-8 in Java.

Encoding a String in Java simply means converting its characters into a sequence of bytes according to the rules of a Charset. Decoding is the reverse: interpreting a byte array through a Charset to form a String instance.

Using the getBytes() method

The String class naturally offers a getBytes() method, which encodes the String into a new byte array. Since the resulting bytes depend on the encoding, we can pass a Charset to getBytes() to choose it.


By default, without providing a Charset , the bytes are encoded using the platform’s default Charset — which might not be UTF-8 or UTF-16. Let’s get the bytes of a String and print them out:

String serbianString = "Šta radiš?"; // What are you doing?
byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);

for (byte b : bytes) {
    System.out.print(String.format("%s ", b));
}

-59 -96 116 97 32 114 97 100 105 -59 -95 63 
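Some of the values are negative because Java's byte type is signed (-128 to 127). A minimal sketch, using a hypothetical `toHex` helper, masks each byte with 0xFF to show the conventional unsigned hexadecimal form instead:

```java
import java.nio.charset.StandardCharsets;

class UnsignedBytes {
    // Formats each byte as two uppercase hex digits, treating it as unsigned.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u0160ta radi\u0161?".getBytes(StandardCharsets.UTF_8); // "Šta radiš?"
        System.out.println(toHex(bytes)); // prints C5 A0 74 61 20 72 61 64 69 C5 A1 3F
    }
}
```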

The numbers printed above are the byte values of the UTF-8 encoding, not code points, and they're not really useful to human eyes. Though, again, we can leverage String's constructor to make a human-readable String from this very sequence. Since we've encoded this byte array as UTF_8, we can decode it back into a new String:

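// Decodes with the platform's default Charset, so this is only correct where that default is UTF-8
String utf8String = new String(bytes);
System.out.println(utf8String);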

Note: To avoid depending on the platform's default Charset, you can pass the Charset explicitly to the String constructor when decoding the bytes:

String utf8String = new String(bytes, StandardCharsets.UTF_8); 

This now prints the exact same String we started with, decoded from its UTF-8 bytes:

Šta radiš?
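The round trip only works when the same Charset is used in both directions. The following sketch, with hypothetical helper names, decodes the same UTF-8 bytes once correctly and once as ISO-8859-1 to show the difference:

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Correct: decode with the Charset the bytes were encoded with.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Mismatched: every byte becomes one character, so multi-byte sequences turn into garbage.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u0160ta radi\u0161?"; // "Šta radiš?"
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeUtf8(bytes).equals(original));   // prints true
        System.out.println(decodeLatin1(bytes).equals(original)); // prints false
    }
}
```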


Encode a String to UTF-8 with Java 7 StandardCharsets

Since Java 7, the StandardCharsets class has provided constants for the Charsets that every Java platform must support: US_ASCII, ISO_8859_1, UTF_8, UTF_16BE, UTF_16LE and UTF_16.

Each Charset has an encode() method, which accepts either a String or a CharBuffer (which implements CharSequence, same as a String), and a decode() method, which accepts a ByteBuffer. In practical terms, this means we can pass a String straight into the encode() method of a Charset.

The encode() method returns a ByteBuffer — which we can easily turn into a String again.

Earlier when we used our getBytes() method, we stored the bytes we got in an array of bytes, but when using the StandardCharsets class, things are a bit different. We first need to use a class called ByteBuffer to store our bytes. Then, we need to both encode and then decode back our newly allocated bytes. Let’s see how this works in code:

String japaneseString = "よろしくお願いします"; // Pleased to meet you.

ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);

// decode() reads only the bytes that were written; byteBuffer.array() may include unused trailing capacity
String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();
System.out.println(utf8String);

Running this code results in:

よろしくお願いします
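If we need the encoded bytes themselves rather than a decoded String, it's safer to copy only the bytes between the buffer's position and limit, because the backing array can be larger than the encoded content. A minimal sketch, with hypothetical class and method names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferRoundTrip {
    // Copies only the readable bytes of the buffer into a right-sized array.
    static byte[] toByteArray(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        // duplicate() shares the content but has its own position, so the caller's buffer is untouched.
        buffer.duplicate().get(out);
        return out;
    }

    // Encodes to a ByteBuffer and decodes straight back through the same Charset.
    static String roundTrip(String text) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(text);
        return StandardCharsets.UTF_8.decode(encoded).toString();
    }

    public static void main(String[] args) {
        String text = "\u3088\u308D\u3057\u304F"; // first four characters of the Japanese sample
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(text);
        System.out.println(toByteArray(encoded).length); // prints 12
        System.out.println(roundTrip(text).equals(text)); // prints true
    }
}
```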

Encode a String to UTF-8 with Apache Commons

The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.

For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.

Using Maven, let’s add the commons-codec dependency to our pom.xml file:

<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.15</version>
</dependency>

Alternatively if you’re using Gradle:

implementation 'commons-codec:commons-codec:1.15'

Now, we can utilize the utility classes of Apache Commons Codec. Here, we'll be leveraging its StringUtils class (org.apache.commons.codec.binary.StringUtils).

It allows us to convert Strings to and from bytes using various encodings required by the Java specification. This class is null-safe and thread-safe, so we’ve got an extra layer of protection when working with Strings.


To encode a String to UTF-8 with Apache Commons' StringUtils class, we can use the getBytesUtf8() method, which functions much like the getBytes() method with a specified Charset:

String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytesUtf8(germanString);
String utf8String = StringUtils.newStringUtf8(bytes);
System.out.println(utf8String);
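To illustrate what "null-safe" means here, the sketch below uses only the JDK. The class `NullSafeUtf8` and its two methods are hypothetical stand-ins, not the Commons Codec source; they assume the behavior described in the Codec Javadoc, where a null input yields null instead of a NullPointerException:

```java
import java.nio.charset.StandardCharsets;

class NullSafeUtf8 {
    // Returns the UTF-8 bytes of the text, or null when the text is null.
    static byte[] getBytesUtf8(String text) {
        return text == null ? null : text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes into a String, or returns null when the array is null.
    static String newStringUtf8(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String german = "Wie hei\u00DFen Sie?"; // "Wie heißen Sie?"
        System.out.println(newStringUtf8(getBytesUtf8(german)).equals(german)); // prints true
        System.out.println(newStringUtf8(getBytesUtf8(null)));                  // prints null
    }
}
```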

Or, you can use the regular StringUtils class from the commons-lang3 dependency:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>
implementation group: 'org.apache.commons', name: 'commons-lang3', version: $

And now, we can use much the same approach as with regular Strings:

String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytes(germanString, StandardCharsets.UTF_8);
String utf8String = StringUtils.toEncodedString(bytes, StandardCharsets.UTF_8);
System.out.println(utf8String);

As before, this approach is thread-safe and null-safe.

Conclusion

In this tutorial, we’ve taken a look at how to encode a Java String to UTF-8. We’ve taken a look at a few approaches — manually creating a String using getBytes() and manipulating them, the Java 7 StandardCharsets class as well as Apache Commons.


Byte Encodings and Strings

If a byte array contains non-Unicode text, you can convert the text to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either of these methods, you specify the encoding identifier as one of the parameters.

The example that follows converts characters between UTF-8 and Unicode. UTF-8 is a transmission format for Unicode that is safe for UNIX file systems. The full source code for the example is in the file StringConverter.java .

The StringConverter program starts by creating a String containing Unicode characters:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");

When printed, the String named original appears as:

AêñüC

To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
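On Java 7 and later, the same conversion can be written without the try block by passing a Charset constant instead of an encoding name, since those overloads don't throw UnsupportedEncodingException. A minimal sketch with hypothetical helper names:

```java
import java.nio.charset.StandardCharsets;

class NoCheckedException {
    // Encodes to UTF-8; the Charset overload declares no checked exception.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";
        byte[] utf8Bytes = toUtf8(original);
        String roundTrip = fromUtf8(utf8Bytes);
        System.out.println(roundTrip.equals(original)); // prints true
    }
}
```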

The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes.

The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java . Here is the printBytes method:

public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
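UnicodeFormatter.java isn't reproduced here, so the following is only a guess at what a byteToHex helper could look like, not the tutorial's actual source. The class name `HexSketch` is hypothetical:

```java
class HexSketch {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    // Converts one byte to two lowercase hex digits, treating the byte as unsigned.
    static String byteToHex(byte b) {
        char high = DIGITS[(b >> 4) & 0x0F]; // upper four bits
        char low = DIGITS[b & 0x0F];         // lower four bits
        return new String(new char[] {high, low});
    }

    public static void main(String[] args) {
        System.out.println("0x" + byteToHex((byte) 0xc3)); // prints 0xc3
        System.out.println("0x" + byteToHex((byte) 0x41)); // prints 0x41
    }
}
```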

The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
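The difference in length can also be checked directly. In this sketch, ISO_8859_1 is an assumption standing in for the platform default that produced the five-byte defaultBytes array above; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

class LengthComparison {
    // Number of bytes in the UTF-8 encoding of the text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes in the ISO-8859-1 encoding of the text.
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";
        System.out.println(original.length());      // prints 5
        System.out.println(utf8Length(original));   // prints 8
        System.out.println(latin1Length(original)); // prints 5
    }
}
```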


How to convert byte[] array to String in Java

In Java, we can use new String(bytes, StandardCharsets.UTF_8) to convert a byte[] to a String .

 // string to byte[]
 byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);

 // byte[] to string
 String s = new String(bytes, StandardCharsets.UTF_8);

1. byte[] in text and binary data

For text or character data, we use new String(bytes, StandardCharsets.UTF_8) to convert the byte[] to a String directly. However, when the byte[] holds binary data such as an image or other non-text data, the best practice is to convert the byte[] into a Base64 encoded string.

 // convert file to byte[]
 byte[] bytes = Files.readAllBytes(Paths.get("/path/image.png"));

 // Java 8 - Base64 class, finally.

 // encode, convert byte[] to base64 encoded string
 String s = Base64.getEncoder().encodeToString(bytes);
 System.out.println(s);

 // decode, convert base64 encoded string back to byte[]
 byte[] decode = Base64.getDecoder().decode(s);

 // Base64 encoded strings are still widely used for
 // 1. email attachments
 // 2. embedding image files inside HTML or CSS
  • For text data byte[] , try new String(bytes, StandardCharsets.UTF_8) .
  • For binary data byte[] , try Base64 encoding.
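To see why binary data shouldn't go through new String(...), the sketch below compares both round trips on three arbitrary bytes. The class name, method names and sample bytes are made up for this illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryAsText {
    // Decodes the bytes as UTF-8 text and encodes them again; invalid sequences are lost.
    static boolean survivesUtf8RoundTrip(byte[] data) {
        byte[] back = new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, back);
    }

    // Base64 maps every possible byte value to printable characters and back without loss.
    static boolean survivesBase64RoundTrip(byte[] data) {
        String text = Base64.getEncoder().encodeToString(data);
        return Arrays.equals(data, Base64.getDecoder().decode(text));
    }

    public static void main(String[] args) {
        // 0xFF is never valid in UTF-8, so it gets replaced during decoding.
        byte[] data = {0x00, (byte) 0xFF, 0x10};
        System.out.println(survivesUtf8RoundTrip(data));                 // prints false
        System.out.println(survivesBase64RoundTrip(data));               // prints true
        System.out.println(Base64.getEncoder().encodeToString(data));    // prints AP8Q
    }
}
```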

2. Convert byte[] to String (text data)

The example below converts a string to a byte array (byte[]) and vice versa.

Warning
A common mistake is to call bytes.toString() to get the string from the bytes. bytes.toString() only returns the array's type and hash code (for example, [B@372f7a8d), NOT the text stored in the byte[]! The correct way to convert byte[] to string is new String(bytes, StandardCharsets.UTF_8).

 package com.mkyong.string;

 import java.nio.charset.StandardCharsets;

 public class ConvertBytesToString2 {

     public static void main(String[] args) {

         String str = "This is raw text!";

         // string to byte[]
         byte[] bytes = str.getBytes(StandardCharsets.UTF_8);

         System.out.println("Text : " + str);
         System.out.println("Text [Byte Format] : " + bytes);

         // no, don't do this, it returns the array's type and hash code, not the text
         System.out.println("Text [Byte Format] toString() : " + bytes.toString());

         // convert byte[] to string
         String s = new String(bytes, StandardCharsets.UTF_8);
         System.out.println("Output : " + s);

         // old code, UnsupportedEncodingException
         // String s1 = new String(bytes, "UTF-8");

     }

 }
 Text : This is raw text!
 Text [Byte Format] : [B@372f7a8d
 Text [Byte Format] toString() : [B@372f7a8d
 Output : This is raw text!
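If the goal is to inspect the byte values themselves rather than decode them as text, Arrays.toString(bytes) is the call to reach for. A minimal sketch with a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrintBytes {
    // Lists the numeric value of every byte, unlike bytes.toString().
    static String describe(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        byte[] bytes = "Hi".getBytes(StandardCharsets.UTF_8);
        System.out.println(describe(bytes)); // prints [72, 105]
    }
}
```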

3. Convert byte[] to String (binary data)

The below example converts an image phone.png into a byte[] , and uses the Java 8 Base64 class to convert the byte[] to a Base64 encoded String.

Later, we convert the Base64 encoded string back to the original byte[] and save it into another image named phone2.png .

 package com.mkyong.string;

 import java.io.IOException;
 import java.nio.file.Files;
 import java.nio.file.Path;
 import java.nio.file.Paths;
 import java.util.Base64;

 public class ConvertBytesToStringBase64 {

     public static void main(String[] args) {

         String filepath = "/Users/mkyong/phone.png";
         Path path = Paths.get(filepath);

         if (Files.notExists(path)) {
             throw new IllegalArgumentException("File does not exist!");
         }

         try {

             // convert the file's content to byte[]
             byte[] bytes = Files.readAllBytes(path);

             // encode, byte[] to Base64 encoded string
             String s = Base64.getEncoder().encodeToString(bytes);
             System.out.println(s);

             // decode, Base64 encoded string to byte[]
             byte[] decode = Base64.getDecoder().decode(s);

             // save into another image file.
             Files.write(Paths.get("/Users/mkyong/phone2.png"), decode);

         } catch (IOException e) {
             e.printStackTrace();
         }

     }

 }
 bh5aLyZALN4othXL2mByHo1aZA5ts5k/uw/sc7DBngGY.

 # if everything is OK, it saves the byte[] into a new image, phone2.png
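To check that the copy really is identical without needing phone.png, the same steps can be run against temporary files and the results compared byte by byte. This is only a sketch: the class name, the temporary file names and the sample bytes are all assumptions made for the example:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

class Base64FileRoundTrip {
    // Writes content to a file, copies it through a Base64 string, and compares both files.
    static boolean roundTrip(byte[] content) {
        try {
            Path source = Files.createTempFile("source", ".bin");
            Path copy = Files.createTempFile("copy", ".bin");
            try {
                Files.write(source, content);

                // encode the file's bytes, then decode them into the second file
                String encoded = Base64.getEncoder().encodeToString(Files.readAllBytes(source));
                Files.write(copy, Base64.getDecoder().decode(encoded));

                return Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(copy));
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(copy);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Arbitrary non-text bytes standing in for image data.
        byte[] sample = {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, (byte) 0xFF, 0x00};
        System.out.println(roundTrip(sample)); // prints true
    }
}
```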
