Java character code to character

How do I convert unicode codepoints to their character representation?

How do I convert strings representing code points to the appropriate character? For example, I want a function that takes U+00E4 and returns ä . I know that the Character class has a method toChars(int codePoint) which takes an integer, but there is no method which takes a string of this form. Is there a built-in function, or do I have to do some transformation on the string to get the integer which I can pass to that method?

9 Answers

Code points are written as hexadecimal numbers prefixed by U+

int codepoint = Integer.parseInt(yourString.substring(2), 16);
char[] ch = Character.toChars(codepoint);

@k-den Yes, with something like new StringBuilder().appendCodePoint(codepoint).toString().charAt(0) , but be aware that codepoints above 64k will result in two chars, a high and low surrogate pair. You may prefer to leave off the .charAt(0) and simply get the result as a String .
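To make the surrogate-pair behaviour concrete, here is a minimal, self-contained sketch (the class name is mine, not from the answer) showing that a BMP code point yields one char while a supplementary one yields two:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // BMP code point: one char
        System.out.println(Character.toChars(0x00E4).length);   // 1

        // Supplementary code point: a high/low surrogate pair
        char[] pair = Character.toChars(0x1F601);
        System.out.println(pair.length);                        // 2
        System.out.printf("0x%X 0x%X%n", (int) pair[0], (int) pair[1]); // 0xD83D 0xDE01

        // Collecting into a String sidesteps the char[] handling entirely
        String s = new StringBuilder().appendCodePoint(0x1F601).toString();
        System.out.println(s.codePointCount(0, s.length()));    // 1
    }
}
```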

The question asked for a function to convert a string value representing a Unicode code point (i.e. «+Unnnn» rather than the Java formats of «\unnnn» or «0xnnnn»). However, newer releases of Java have enhancements which simplify the processing of a string containing multiple code points in Unicode format:

  • The introduction of Streams in Java 8.
  • Method public static String toString(int codePoint), which was added to the Character class in Java 11. It returns a String rather than a char[] , so Character.toString(0x00E4) returns «ä» .

Those enhancements allow a different approach to solving the issue raised in the OP. This method transforms a set of code points in Unicode format to a readable String in a single statement:

void processUnicode() {
    // Create a test string containing "Hello World 😁" with code points in Unicode format.
    // Include an invalid code point (+U0wxyz), and a code point outside the Unicode range (+U70FFFF).
    String data = "+U0048+U0065+U006c+U006c+U0wxyz+U006f+U0020+U0057+U70FFFF+U006f+U0072+U006c+U0000064+U20+U1f601";
    String text = Arrays.stream(data.split("\\+U"))
            .filter(s -> !s.isEmpty()) // First element returned by split() is a zero length string.
            .map(s -> {
                try {
                    return Integer.parseInt(s, 16);
                } catch (NumberFormatException e) {
                    System.out.println("Ignoring element [" + s + "]: NumberFormatException from parseInt(\"" + s + "\")");
                }
                return null; // If the code point is not represented as a valid hex String.
            })
            .filter(v -> v != null) // Ignore syntactically invalid code points.
            .filter(i -> Character.isValidCodePoint(i)) // Ignore code points outside of Unicode range.
            .map(i -> Character.toString(i)) // Obtain the string value directly from the code point. (Requires JDK >= 11)
            .collect(Collectors.joining());
    System.out.println(text); // Prints "Hello World 😁"
}
run:
Ignoring element [0wxyz]: NumberFormatException from parseInt("0wxyz")
Hello World 😁
BUILD SUCCESSFUL (total time: 0 seconds)
  • With this approach there is no longer any need for a specific function to convert a code point in Unicode format. That’s dispersed instead, through multiple intermediate operations in the Stream processing. Of course the same code could still be used to process just a single code point in Unicode format.
  • It’s easy to add intermediate operations to perform further validation and processing on the Stream , such as case conversion, removal of emoticons, etc.
"\u00E4"

new String(new int[] { 0x00E4 }, 0, 1);

@Anirudh right, and you dealt with it appropriately. But I wonder whether «\u00e4» was known to be equivalent (that is, in Java source code). You got +1 from me.
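For the record, the two forms really are interchangeable; a small sketch (variable names are mine) verifying the equivalence:

```java
public class EscapeEquivalence {
    public static void main(String[] args) {
        String fromLiteral = "\u00E4";                                // Unicode escape in source code
        String fromIntArray = new String(new int[] { 0x00E4 }, 0, 1); // built from a code point at runtime

        System.out.println(fromLiteral.equals(fromIntArray));         // true
    }
}
```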

public String codepointToString(int cp) {
    StringBuilder sb = new StringBuilder();
    if (Character.isBmpCodePoint(cp)) {
        sb.append((char) cp);
    } else if (Character.isValidCodePoint(cp)) {
        sb.append(Character.highSurrogate(cp));
        sb.append(Character.lowSurrogate(cp));
    } else {
        sb.append('?');
    }
    return sb.toString();
}

This example does not use char[].

// this code is Kotlin, but you can write the same thing in Java
val sb = StringBuilder()
val cp: Int // codepoint
when {
    Character.isBmpCodePoint(cp) -> sb.append(cp.toChar())
    Character.isValidCodePoint(cp) -> {
        sb.append(Character.highSurrogate(cp))
        sb.append(Character.lowSurrogate(cp))
    }
    else -> sb.append('?')
}
jshell> Character.toString(Integer.parseInt("U+00E4".substring(2), 16))
$1 ==> "ä"

Well, the second part is not possible, because a code point may be 4 bytes wide and the char datatype can hold only 2 bytes.


So a more generalized approach is to never use the char data type in Java; use int or String instead.

What datatype is used to hold a code point? A single code point can be held in an int data type. A Unicode String is technically an array of int, rather than an array of char.
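A short sketch of that idea (the sample string is my own): String.codePoints(), available since Java 8, lets you treat the text as a stream of int code points and avoid char altogether:

```java
public class CodePointIteration {
    public static void main(String[] args) {
        String text = "\u00E4\uD83D\uDE00b"; // "ä😀b": three code points, four chars
        System.out.println(text.length());                          // 4 char units
        System.out.println(text.codePointCount(0, text.length()));  // 3 code points

        // Iterate over code points as ints, never touching char
        text.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println(); // U+00E4 U+1F600 U+0062
    }
}
```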

String smiley = new String(new int[] { 0x1F600 }, 0, 1); // an int[] array of code points can be converted to a String
System.out.println("print smiley = " + smiley);

If you are using IntelliJ IDEA, you can copy the output smiley and paste it within a double-quoted string. You will get this: "\uD83D\uDE00"

If you print this string you again get a smiley

System.out.println("\uD83D\uDE00"); 

Why can't we use a single \u escape to represent the smiley within a string? Because when the \u escape was designed, all Unicode characters could be represented by 2 bytes, i.e. 4 hexadecimal digits. So there are always exactly 4 hexadecimal digits after \u in a Java string literal. To represent a larger Unicode value you would need a larger hexadecimal number, but that would break existing Java strings. So Java uses the same approach as the UTF-16 representation.

The following two are equivalent.

String smiley = new String(new int[] { 0x1F600 }, 0, 1); // using the single code point number
String smiley2 = "\uD83D\uDE00"; // code point split into two 2-byte parts (UTF-16 surrogate pair)
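You can check the split yourself with Character.highSurrogate and Character.lowSurrogate; a minimal sketch (this verification code is mine, not part of the answer):

```java
public class SurrogateSplit {
    public static void main(String[] args) {
        int cp = 0x1F600; // 😀
        char hi = Character.highSurrogate(cp);
        char lo = Character.lowSurrogate(cp);
        System.out.printf("0x%X 0x%X%n", (int) hi, (int) lo); // 0xD83D 0xDE00

        // The pair reassembles into the original code point
        System.out.printf("0x%X%n", Character.toCodePoint(hi, lo)); // 0x1F600
    }
}
```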

Unicode Character Representations

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode Standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).

In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding. For more information on Unicode terminology, refer to the Unicode Glossary.
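The two examples from the quoted documentation can be run directly; a minimal sketch:

```java
public class CharVsIntOverloads {
    public static void main(String[] args) {
        // A lone high surrogate is an undefined character to the char overload
        System.out.println(Character.isLetter('\uD840'));  // false

        // The int overload handles supplementary code points (here a CJK ideograph)
        System.out.println(Character.isLetter(0x2F81A));   // true
    }
}
```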



How do I convert a single character code to a `char` given a character set?

I want to convert decimal to ASCII, and this code returns unexpected results. Here is the code I am using.

public static void main(String[] args) {
    char ret = (char) 146;
    System.out.println(ret); // returns nothing.
}

I expect to get the single character "'" as per http://www.ascii-code.com/ . Has anyone come across this? Thanks.

"Extended ASCII" is a misnomer. ASCII is by definition up to 127 only. There are character sets which extend that range, but how they do that varies wildly, so you really need to know which one you're talking about. What you're doing in your code is printing the Unicode code point 146, which coincides with the single upper quote, luckily: fileformat.info/info/unicode/char/92/index.htm

In my input I have decimals greater than 127. Some get converted correctly, but some, like 146, give trouble.

Your characters are not ASCII. They are most likely windows-1252. In the windows-1252 charset, 146 is actually '\u2019' . If you want to see the correct character, change your code to (char)0x2019 or '\u2019' .

Your terminal might not be using the same encoding as the one whose numeric code you are assuming will output the character you want. So, the program and terminal are both doing exactly the right thing, it's just that they don't interpret the numeric value the way you expect, because they are using a different encoding.

2 Answers

First of all the page you linked to says this about the code point range in question:

The extended ASCII codes (character code 128-255)

There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.

This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146 (and another reference just because). So that's already asking for trouble. You can see this also if you do the conversion through String :

String s = new String(new byte[] { (byte) 146 }, "iso-8859-1");
System.out.println(s);

Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:

String s = new String(new byte[] { (byte) 146 }, "windows-1252");
System.out.println(s);

So the first mistake is that page is confusing.


But the big mistake is that you can't do what you're trying to do the way you are doing it. A char in Java is a UTF-16 code unit (or half of a code point, for the supplementary characters > 0xFFFF): a single char corresponds to a BMP code point, while a pair of them, or an int, covers the full range, including the supplementary characters.

Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.

So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:

String s = new String(new byte[] { (byte) 146 }, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);

You could grab the char again via s.charAt(0) . Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.

However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
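A sketch of that pitfall (the UTF-8 byte sequence below is mine; windows-1252 cannot produce a supplementary character, so this uses UTF-8 instead): after decoding the bytes for U+1F600, charAt(0) returns only the high surrogate, while codePointAt(0) returns the whole code point:

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryPitfall {
    public static void main(String[] args) {
        // UTF-8 encoding of U+1F600 (😀)
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
        String s = new String(utf8, StandardCharsets.UTF_8);

        System.out.printf("0x%X%n", (int) s.charAt(0)); // 0xD83D: just the high surrogate
        System.out.printf("0x%X%n", s.codePointAt(0));  // 0x1F600: the full code point
    }
}
```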

As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:

Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] { (byte) 146 }));
char c = cb.get(0);
System.out.println(c);

Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).

As an aside: In your case, 146 (0x92) cast directly to char corresponds to the UTF-16 character "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.

Also, fyi, the official UTF-16 code point for right single quote is 0x2019. You could reliably store that in a char by using that value, e.g.:

System.out.println((char)0x2019); 

You can also see this for yourself by looking at the value after the conversion from windows-1252:

String s = new String(new byte[] { (byte) 146 }, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int) c); // outputs 0x2019
String s = new String(new byte[] { (byte) 146 }, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019

