Unicode to UTF-8 in Java

Encode a String to UTF-8 in Java

When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.

UTF-8 represents a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points.

A code point can represent a single character, but it can also serve other purposes, such as formatting. "Variable-width" means that different code points are encoded with different numbers of bytes (between one and four). As a space-saving measure, commonly used code points are represented with fewer bytes than those used less frequently.

UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
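To make the variable width concrete, here is a minimal sketch that counts the UTF-8 bytes needed for a few sample characters. The class name Utf8Widths and the helper utf8Length are hypothetical names chosen for illustration; the characters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Widths {

    // Hypothetical helper: number of bytes UTF-8 needs for the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00DF (sharp s), U+3042 (hiragana a), U+1F600 (an emoji)
        String[] samples = {"A", "\u00DF", "\u3042", "\uD83D\uDE00"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + utf8Length(sample) + " byte(s)");
        }
    }
}
```

The four samples need one, two, three and four bytes respectively, and the single-byte case is exactly the ASCII range.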

Note: Java's String API is based on UTF-16, which uses a minimum of two bytes to store a code point. Why would we need to convert to UTF-8 then?

Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn’t support as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to a more universal UTF-8.

We'll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis, such as š, ß and あ, simulating user input.

Let’s write out a couple of Strings:

String serbianString = "Šta radiš?"; // What are you doing?
String germanString = "Wie heißen Sie?"; // What's your name?
String japaneseString = "よろしくお願いします"; // Pleased to meet you.

Now, let's use the String(byte[] bytes, Charset charset) constructor to recreate these Strings with a different Charset, simulating ASCII input that arrived in the first place:

String asciiSerbianString = new String(serbianString.getBytes(), StandardCharsets.US_ASCII);
String asciigermanString = new String(germanString.getBytes(), StandardCharsets.US_ASCII);
String asciijapaneseString = new String(japaneseString.getBytes(), StandardCharsets.US_ASCII);

System.out.println(asciiSerbianString);
System.out.println(asciigermanString);
System.out.println(asciijapaneseString);

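Once we've created these Strings by decoding their bytes as ASCII, printing them shows the damage: a byte outside the ASCII range cannot be decoded and typically shows up as the replacement character (U+FFFD).

The first two Strings contain just a few characters that aren't valid ASCII, so they stay mostly readable, while the final one contains no ASCII characters at all and becomes entirely unreadable.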
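The effect can be reproduced deterministically with a small sketch. It assumes the bytes are UTF-8 (passed explicitly rather than relying on the platform default); AsciiDecodeDemo and decodeAsAscii are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo {

    // Hypothetical helper: encodes the text as UTF-8, then (wrongly) decodes it as ASCII.
    static String decodeAsAscii(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "\u0160ta radi\u0161?" is the Serbian sample string written with escapes.
        String damaged = decodeAsAscii("\u0160ta radi\u0161?");
        System.out.println(damaged);
        // The ASCII letters survive; the non-ASCII bytes become U+FFFD.
        System.out.println(damaged.indexOf('\uFFFD') >= 0);
    }
}
```

The ASCII portion of the text ("ta radi") is unchanged, which is why mostly-Latin text degrades gracefully while Japanese text does not.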

To avoid this issue, we can assume that not all input might already be encoded to our liking — and encode it to iron out such cases ourselves. There are several ways we can go about encoding a String to UTF-8 in Java.

Encoding a String in Java simply means converting its characters into a sequence of bytes according to the rules of a particular Charset. Decoding is the reverse: interpreting a byte array with a Charset to form a String instance.

Using the getBytes() method

The String class offers a getBytes() method, which encodes the String into a new byte array. Since the resulting bytes depend on the encoding, we can pass a Charset to getBytes() to control how the characters are turned into bytes.

Without a Charset argument, the bytes are encoded using the platform's default Charset, which might not be UTF-8 (it defaults to UTF-8 from Java 18 onward unless configured otherwise). Let's get the bytes of a String using UTF-8 explicitly and print them out:
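As a quick illustration of why the Charset matters, the sketch below encodes the same character with two charsets and prints the platform default. DefaultCharsetDemo and encode are hypothetical names introduced for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {

    // Hypothetical helper: encodes the text with an explicitly chosen charset.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // The default varies between platforms and Java versions.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String sharpS = "\u00DF"; // the German sharp s
        System.out.println(Arrays.toString(encode(sharpS, StandardCharsets.ISO_8859_1))); // [-33]
        System.out.println(Arrays.toString(encode(sharpS, StandardCharsets.UTF_8)));      // [-61, -97]
    }
}
```

Passing the Charset explicitly keeps the result identical on every machine, which is the reason the examples here avoid the no-argument getBytes().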

String serbianString = "Šta radiš?"; // What are you doing?
byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);

for (byte b : bytes) {
    System.out.print(String.format("%s ", b));
}

This prints:

-59 -96 116 97 32 114 97 100 105 -59 -95 63

These are the UTF-8 bytes of our String, printed as signed values (a Java byte ranges from -128 to 127), and they're not really useful to human eyes. Though, again, we can use String's constructor to make a human-readable String from this very sequence. Since we encoded this byte array as UTF-8, we can make a new String from it:
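Because the signed values are hard to compare with UTF-8 tables, it can help to print the same bytes as unsigned hexadecimal. This is a minimal sketch; HexDump and toHex are hypothetical names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {

    // Hypothetical helper: formats each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF converts the signed byte to its unsigned value (0..255).
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u0160ta radi\u0161?".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bytes)); // C5 A0 74 61 20 72 61 64 69 C5 A1 3F
    }
}
```

The pairs C5 A0 and C5 A1 are the two-byte sequences for Š and š; they correspond to -59 -96 and -59 -95 in the signed output.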

String utf8String = new String(bytes);
System.out.println(utf8String);

Note: new String(bytes) decodes with the platform's default Charset, so it only restores the text where that default is UTF-8. To decode reliably, pass the Charset to the String constructor:

String utf8String = new String(bytes, StandardCharsets.UTF_8);

Printing utf8String now outputs the exact same String we started with:

Šta radiš?


Encode a String to UTF-8 with Java 7 StandardCharsets

Java 7 introduced the StandardCharsets class, which provides constants for several Charsets, such as US_ASCII, ISO_8859_1, UTF_8 and UTF_16, among others.

Each Charset has an encode() and a decode() method. encode() accepts a CharBuffer or a String and returns a ByteBuffer, while decode() accepts a ByteBuffer and returns a CharBuffer. In practical terms, this means we can pass a String directly into the encode() method of a Charset.

Earlier, when we used the getBytes() method, we stored the bytes we got in an array of bytes. With Charset.encode(), things are a bit different: the bytes come back in a ByteBuffer, and to get a String again we decode that buffer with the same Charset. Let's see how this works in code:

String japaneseString = "よろしくお願いします"; // Pleased to meet you.

ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);

// Decode the buffer itself rather than byteBuffer.array():
// the backing array may be longer than the encoded data.
String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();

System.out.println(utf8String);

Running this code results in:

よろしくお願いします
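If a plain byte[] is needed from the ByteBuffer, one cautious approach is to copy only the bytes between the buffer's position and limit, since the array backing the buffer is not guaranteed to match the encoded length. The sketch below shows this; ByteBufferBytes and encodeToArray are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferBytes {

    // Hypothetical helper: encodes to UTF-8 and copies out exactly the encoded bytes.
    static byte[] encodeToArray(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] bytes = new byte[buffer.remaining()]; // remaining() == limit - position
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String greeting = "\u3088\u308D\u3057\u304F"; // first four characters of the Japanese sample
        byte[] bytes = encodeToArray(greeting);
        System.out.println(bytes.length); // 12, three bytes per character
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(greeting)); // true
    }
}
```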

Encode a String to UTF-8 with Apache Commons

The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.

For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.

Using Maven, let’s add the commons-codec dependency to our pom.xml file:

<dependency>
    <groupId>commons-codec</groupId>
    <artifactId>commons-codec</artifactId>
    <version>1.15</version>
</dependency>

Alternatively if you’re using Gradle:

implementation 'commons-codec:commons-codec:1.15'

Now, we can use the utility classes of Apache Commons Codec. Here, we'll be using its StringUtils class (org.apache.commons.codec.binary.StringUtils).

It allows us to convert Strings to and from bytes using various encodings required by the Java specification. This class is null-safe and thread-safe, so we’ve got an extra layer of protection when working with Strings.

To encode a String to UTF-8 with Apache Commons' StringUtils class, we can use the getBytesUtf8() method, which functions much like the getBytes() method with a specified Charset:

String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytesUtf8(germanString);
String utf8String = StringUtils.newStringUtf8(bytes);
System.out.println(utf8String);

Or, you can use the regular StringUtils class from the commons-lang3 dependency:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
</dependency>

Or, with Gradle:

implementation group: 'org.apache.commons', name: 'commons-lang3', version: $

And now, we can use much the same approach as with regular Strings:

String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytes(germanString, StandardCharsets.UTF_8);
String utf8String = StringUtils.toEncodedString(bytes, StandardCharsets.UTF_8);
System.out.println(utf8String);

Unlike the plain String methods, though, this approach is thread-safe and null-safe.


Conclusion

In this tutorial, we've taken a look at how to encode a Java String to UTF-8. We've covered a few approaches: calling getBytes() with an explicit Charset, using the Java 7 StandardCharsets class, and using Apache Commons.


Converting Unicode to UTF-8 in Java

Unicode is an international character encoding standard that can represent most of the world's written languages. Unicode code points are written in hexadecimal. Unicode itself spans code points from U+0000 to U+10FFFF; a Java char, however, is a 16-bit value, so its lowest value is \u0000 and its highest value is \uFFFF.
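Since a Java char holds only 16 bits, a code point above U+FFFF occupies two chars (a surrogate pair). The following sketch, using the hypothetical names CodePointDemo and codePoints, contrasts the number of chars with the number of code points.

```java
public class CodePointDemo {

    // Hypothetical helper: counts Unicode code points rather than 16-bit chars.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the 16-bit range, so it needs a surrogate pair.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());    // 2 chars
        System.out.println(codePoints(emoji)); // 1 code point

        String euro = "\u20AC"; // U+20AC fits in a single char
        System.out.println(euro.length());     // 1
        System.out.println(codePoints(euro));  // 1
    }
}
```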

UTF-8 is a variable-width character encoding. UTF-8 can be as compact as ASCII, but it can also hold any Unicode character at the cost of some increase in file size. UTF stands for Unicode Transformation Format. The "8" means that it uses 8-bit blocks to represent a character. The number of blocks needed to represent a character varies from 1 to 4.

To convert Unicode to UTF-8 in Java, we use the getBytes() method. It encodes a String into a sequence of bytes and returns a byte array.

Declaration

The getBytes() method is declared as follows.

public byte[] getBytes(String charsetName) throws UnsupportedEncodingException

where charsetName is the name of the charset used to encode the String into a byte array.
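Because this overload takes the charset as a name, it declares the checked UnsupportedEncodingException, whereas the overload that takes a Charset constant does not. The sketch below contrasts the two; GetBytesOverloads, byName and byConstant are hypothetical names, and wrapping the failure in an IllegalArgumentException is just one possible choice.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesOverloads {

    // Hypothetical helper: name-based overload, which can fail for unknown names.
    static byte[] byName(String text, String charsetName) {
        try {
            return text.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Hypothetical helper: constant-based overload, no checked exception involved.
    static byte[] byConstant(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "hey\u6366";
        System.out.println(Arrays.equals(byName(text, "UTF-8"), byConstant(text))); // true
    }
}
```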

Let's look at a program that converts Unicode to UTF-8 in Java using the getBytes() method.

Example 1

public class Example {
    public static void main(String[] args) throws Exception {
        String str1 = "\u0000";
        String str2 = "\uFFFF";
        byte[] arr = str1.getBytes("UTF-8");
        byte[] brr = str2.getBytes("UTF-8");
        System.out.println("UTF-8 for \\u0000");
        for (byte a : arr) {
            System.out.print(a);
        }
        System.out.println("\nUTF-8 for \\uffff");
        for (byte b : brr) {
            System.out.print(b);
        }
    }
}

Output

UTF-8 for \u0000
0
UTF-8 for \uffff
-17-65-65

Explanation

  1. The string str1 is assigned \u0000, the lowest value a Java char can hold. The string str2 is assigned \uFFFF, the highest value a Java char can hold.

String str1 = "\u0000";
String str2 = "\uFFFF";

  2. To convert them to UTF-8, we use the getBytes("UTF-8") method. This gives us byte arrays as follows:

byte[] arr = str1.getBytes("UTF-8");
byte[] brr = str2.getBytes("UTF-8");

  3. We then print the contents of each array:

for (byte a : arr) {
    System.out.print(a);
}
for (byte b : brr) {
    System.out.print(b);
}

  4. To convert UTF-8 back to Unicode, we create a String object, passing the UTF-8 byte array and the name of the charset the bytes are encoded in, that is, UTF-8.

Example 2

Let's look at a program that converts UTF-8 to Unicode by creating a new String object.
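The program itself did not survive in this copy of the article. A minimal reconstruction, assembled from the two fragments quoted in the explanation that follows, might look like this; the class name Utf8ToUnicodeExample and the helper decodeUtf8 are assumptions, not the original author's names.

```java
import java.io.UnsupportedEncodingException;

public class Utf8ToUnicodeExample {

    // Hypothetical helper: builds a String from UTF-8 encoded bytes.
    static String decodeUtf8(byte[] utf8Bytes) throws UnsupportedEncodingException {
        return new String(utf8Bytes, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String str = "hey\u6366";
        // Unicode (String) to UTF-8 bytes
        byte[] charset = str.getBytes("UTF-8");
        // UTF-8 bytes back to Unicode (String)
        String result = decodeUtf8(charset);
        System.out.println(result);
    }
}
```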

Output

The program prints the original string: hey followed by the character U+6366.

First, we converted the given Unicode string to UTF-8 using the getBytes() method, so that we could check the round trip afterwards:

String str = "hey\u6366";
byte[] charset = str.getBytes("UTF-8");

Then we converted the charset byte array back to Unicode by creating a new String object as follows:

String result = new String(charset, "UTF-8");
System.out.println(result);


Java Programs and Examples with Output

This Java program converts the Unicode characters of an input file to the UTF-8 encoding format.

import java.io.*;
import java.util.Date;
import java.text.DateFormat;
import java.text.SimpleDateFormat;

public class Unicode2UTFConverter {

    /**
     * Creates a new UTF8-encoded byte array representing the
     * char[] passed in.
     *
     * @param unicode An array of Unicode characters, which may have UCS4
     *      characters encoded in UTF-16. If this array is null, null is returned.
     * @param nullTerminate If true, a line feed (0x0a) is appended to the output.
     * @exception CharConversionException If the input characters are invalid.
     */
    protected static byte[] UnicodeToUTF8(char[] unicode, boolean nullTerminate)
            throws CharConversionException {
        int uni;            // unicode index
        int utf;            // UTF8 index
        int maxsize;        // maximum size of UTF8 output
        byte[] utf8 = null; // UTF8 output buffer
        byte[] temp = null; // used to create an array of the correct size
        char ch;            // Unicode character
        int ucs;            // UCS4 encoding of a character
        boolean failed = true;

        if (unicode == null) {
            return null;
        }

        try {
            // Allocate worst-case size: at most 3 UTF8 bytes per 16-bit char
            maxsize = unicode.length * 3;
            if (nullTerminate) {
                maxsize++;
            }
            utf8 = new byte[maxsize];

            for (uni = 0, utf = 0; uni < unicode.length; uni++) {
                // Convert UCS2 to UCS4
                // Assuming that character may have UTF-16 encoding
                ch = unicode[uni];
                if (ch >= 0xd800 && ch <= 0xdbff) {
                    // This is the high half of a UTF-16 surrogate pair
                    ucs = (ch - 0xd800) << 10;
                    if (uni == unicode.length - 1) {
                        // There is no lower half
                        throw new CharConversionException();
                    }
                    ch = unicode[++uni];
                    if (ch < 0xdc00 || ch > 0xdfff) {
                        // not in the low-half zone
                        throw new CharConversionException();
                    }
                    ucs |= ch - 0xdc00;
                    ucs += 0x00010000;
                } else if (ch >= 0xdc00 && ch <= 0xdfff) {
                    // orphaned low-half character
                    throw new CharConversionException();
                } else {
                    ucs = unicode[uni]; // UCS2 char to UCS4
                }

                // UCS4 to UTF8 conversion
                // Note that the Standard UTF encoding is allowed till 4 bytes, i.e. up to 10FFFF.
                // However this program can encode till 6 bytes of unicode character
                // (the 5- and 6-byte branches cannot be reached from UTF-16 input).
                if (ucs < 0x80) {
                    // 0000 0000 - 0000 007f (ASCII)
                    utf8[utf++] = (byte) ucs;
                } else if (ucs < 0x800) {
                    // 0000 0080 - 0000 07ff
                    utf8[utf++] = (byte) (0xc0 | ucs >> 6);
                    utf8[utf++] = (byte) (0x80 | (ucs & 0x3f));
                } else if (ucs < 0x0010000) {
                    // 0000 0800 - 0000 ffff
                    utf8[utf++] = (byte) (0xe0 | ucs >> 12);
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 6) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | (ucs & 0x3f));
                } else if (ucs < 0x00200000) {
                    // 0001 0000 - 001f ffff
                    utf8[utf++] = (byte) (0xf0 | ucs >> 18);
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 12) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 6) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | (ucs & 0x3f));
                } else if (ucs < 0x04000000) {
                    // 0020 0000 - 03ff ffff
                    utf8[utf++] = (byte) (0xf8 | ucs >> 24);
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 18) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 12) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 6) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | (ucs & 0x3f));
                    System.out.println(currentDate() + " :Warning: UTF-8 code for Unicode Character is Illegal");
                } else {
                    // 0400 0000 - 7fff ffff
                    utf8[utf++] = (byte) (0xfc | ucs >> 30);
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 24) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 18) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 12) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | ((ucs >> 6) & 0x3f));
                    utf8[utf++] = (byte) (0x80 | (ucs & 0x3f));
                    System.out.println(currentDate() + " :Warning: UTF-8 code for Unicode Character is Illegal");
                }
            }

            if (nullTerminate) {
                utf8[utf++] = (byte) 0x0a; // line feed
            }

            // Copy into a correct-sized array; utf is now the size of the UTF8 output
            temp = new byte[utf];
            for (int i = 0; i < utf; i++) {
                temp[i] = utf8[i];
                utf8[i] = 0;
            }
            utf8 = temp;
            failed = false;
            return utf8;
        } finally {
            // Clean up the buffer if the conversion failed part-way
            if (failed && utf8 != null) {
                java.util.Arrays.fill(utf8, (byte) 0);
            }
        }
    }

    /**
     * Main method
     */
    public static void main(String[] args) {
        char[] unicode;
        byte[] utf8;

        if (args.length != 2) {
            System.out.println("Usage: java Unicode2UTFConverter <input file> <output file>");
            System.exit(0);
        }

        String InFilePath = args[0];  // Input filename is first argument.
        String OutFilePath = args[1]; // Output filename is second argument.

        System.out.println(currentDate() + " : Starting Unicode to UTF8 Conversion");

        // The input is read with the platform's default charset, as in the original program.
        try (BufferedReader lin = new BufferedReader(new InputStreamReader(new FileInputStream(InFilePath)));
             FileOutputStream fos = new FileOutputStream(OutFilePath)) {
            String ls; // A temp val to hold each line.
            while ((ls = lin.readLine()) != null) {
                unicode = ls.toCharArray();
                utf8 = UnicodeToUTF8(unicode, true);
                fos.write(utf8);
            }
            System.out.println(currentDate() + " : Unicode to UTF8 Conversion Successful");
        } catch (CharConversionException e) {
            System.out.println("Error converting Unicode " + e);
        } catch (Exception e) {
            System.out.println("Error: " + e);
        }
    }

    private static String currentDate() {
        DateFormat shortFormatter = SimpleDateFormat.getDateTimeInstance(
                SimpleDateFormat.SHORT, SimpleDateFormat.MEDIUM);
        long currentTimeInMillis = System.currentTimeMillis();
        Date today = new Date(currentTimeInMillis);
        return shortFormatter.format(today);
    }
}
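For comparison, the same file conversion can be sketched with the standard library's own encoders instead of hand-written bit manipulation. This is only an illustration under stated assumptions: the input file is assumed to be UTF-16 (the original program does not say which encoding it expects), and FileTranscoder and transcodeToUtf8 are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileTranscoder {

    // Hypothetical helper: reads text in the given charset and rewrites it as UTF-8.
    static void transcodeToUtf8(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int read;
            // Copy characters in chunks; line endings are preserved as they are.
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("Usage: java FileTranscoder <input file> <output file>");
            return;
        }
        // Assumption: the input file is UTF-16 encoded.
        transcodeToUtf8(Paths.get(args[0]), StandardCharsets.UTF_16, Paths.get(args[1]));
    }
}
```

Naming the input charset explicitly avoids depending on the platform default, and the reader rejects malformed input with an exception rather than silently writing bad bytes.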
