Unicode to ANSI in Python

From the Python Cookbook

Converting Between Unicode and Plain Strings

Credit: David Ascher, Paul Prescod

Problem

You need to deal with data that doesn’t fit in the ASCII character set.

Solution

Unicode strings can be encoded in plain strings in a variety of ways, according to whichever encoding you choose:

# Convert Unicode to plain Python string: "encode"
unicodestring = u"Hello world"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")

# Convert plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
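The recipe above is written for Python 2, where a separate `unicode` type and a `unicode()` built-in exist. As a sketch of the same round-trip in Python 3, where every `str` is already Unicode and decoding is a method on `bytes`:

```python
# Python 3 version of the recipe above (the original uses Python 2's
# unicode type and unicode() built-in).
unicodestring = "Hello world"

# "encode": characters -> bytes
utf8bytes = unicodestring.encode("utf-8")
asciibytes = unicodestring.encode("ascii")
isobytes = unicodestring.encode("ISO-8859-1")
utf16bytes = unicodestring.encode("utf-16")

# "decode": bytes -> characters
plainstring1 = utf8bytes.decode("utf-8")
plainstring2 = asciibytes.decode("ascii")
plainstring3 = isobytes.decode("ISO-8859-1")
plainstring4 = utf16bytes.decode("utf-16")
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
```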

Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode—what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don’t need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold up to 256 values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means that each Unicode character takes more than one byte, so you need to make the distinction between characters and bytes.

Standard Python strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are “8-bit string” and “plain string.” In this recipe we will call them byte strings, to remind you of their byte-orientedness.

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python’s long integers. You don’t have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
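To make that boundary concrete, here is a short Python 3 sketch; `io.BytesIO` stands in for any byte-oriented object such as a file or socket, and the message text is purely illustrative:

```python
import io

# A BytesIO buffer stands in for any byte-oriented object (file, socket).
buf = io.BytesIO()
message = "naïve café"

# Writing: we must choose an encoding to turn characters into bytes.
buf.write(message.encode("utf-8"))

# Reading back: we must decode the bytes with the same encoding.
data = buf.getvalue()
roundtrip = data.decode("utf-8")
print(roundtrip)
```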

There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one “right” encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:

  • The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII: a pure ASCII file is also a valid UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts.
  • The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. An older, fixed-width predecessor of UTF-16 is known as UCS-2.
  • The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each supports only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
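As a quick illustration of how the choice of encoding changes the bytes produced (a Python 3 sketch, not part of the original recipe):

```python
# The same text under three encodings (Python 3).
s = "café"
print(s.encode("utf-8"))     # é is two bytes: 0xC3 0xA9
print(s.encode("latin-1"))   # é is the single byte 0xE9
print(s.encode("utf-16"))    # BOM plus two bytes per character
```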

See Also

Unicode is a huge topic, but a recommended book is Unicode: A Primer, by Tony Graham (Hungry Minds, Inc.); details are available at http://www.menteith.com/unicode/primer/.


Convert Unicode Characters to ASCII String in Python


Unicode is the global character-encoding standard covering all languages. Unlike ASCII, which supports only a single byte per character, a Unicode character may take up to 4 bytes, allowing the standard to represent characters from any language.

This tutorial demonstrates how to convert Unicode characters into an ASCII string. The goal is to either remove the characters that aren’t supported in ASCII or replace the Unicode characters with their corresponding ASCII character.

Use unicodedata.normalize() and encode() to Convert Unicode to ASCII String in Python

The Python module unicodedata provides access to the Unicode character database, along with utility functions that make accessing, filtering, and looking up these characters significantly easier.

unicodedata has a function called normalize() that accepts two parameters: the normalization form to apply and the Unicode string to normalize.

There are four Unicode normalization forms: NFC, NFKC, NFD, and NFKD. The official documentation provides a thorough, in-depth explanation of each type. The NFKD normalized form is used throughout this tutorial.
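As a small illustration of the difference between canonical (NFC/NFD) and compatibility (NFKC/NFKD) forms, using characters chosen purely for demonstration:

```python
import unicodedata

# 'ﬁ' (the fi ligature) and a precomposed 'é'.
s = "ﬁé"
print(len(unicodedata.normalize("NFC", s)))   # 2: ligature and é stay composed
print(unicodedata.normalize("NFKD", s))       # ligature -> "fi", é -> e + combining accent
print(len(unicodedata.normalize("NFKD", s)))  # 4 code points
```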


Let’s declare a string that contains multiple Unicode characters.

import unicodedata

stringVal = u'Här är ett exempel på en svensk mening att ge dig.'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore'))

After calling the normalize() method, chain a call to the function encode() , which does the conversion from Unicode to ASCII.

The u prefix before the string value tells Python that the literal contains Unicode characters. It is required in Python 2; in Python 3 it is redundant, since every str is already Unicode, but it remains legal.

The first parameter of encode() specifies the target encoding, and the second specifies what should be done when a character cannot be converted. In this case, the second parameter is ignore, which drops any character that can’t be converted.

b'Har ar ett exempel pa en svensk mening att ge dig.' 

Notice that the Unicode characters from the original string ( ä and å ) have been replaced with their ASCII counterpart ( a ).

The b prefix at the beginning of the output denotes a bytes object, since the encode() function returns bytes. To convert the result back into a string and remove the prefix and quotes, chain a call to decode() after encode().

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode()) 
Har ar ett exempel pa en svensk mening att ge dig. 

Let’s try another example using replace as the second parameter of the encode() function.

For this example, let’s try out a string having characters that do not have ASCII counterparts.

import unicodedata

stringVal = u'áæãåāœčćęßßßわた'

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'replace').decode())

None of the characters in this string are part of ASCII, though under NFKD some decompose into an ASCII letter plus a combining mark, while others (such as ß and the Japanese kana) have no ASCII counterpart at all.


The replace parameter outright replaces the characters without ASCII counterparts with a question mark ( ? ). If we were to use ignore on the same string:

print(unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore').decode()) 

In summary, to convert Unicode characters into ASCII characters, use the normalize() function from the unicodedata module and the built-in encode() function for strings. You can either ignore or replace Unicode characters that do not have ASCII counterparts. The ignore option will remove the character, and the replace option will replace it with question marks.
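The error handlers can be compared side by side. The sketch below also tries 'xmlcharrefreplace', another standard handler (not used above) that preserves unconvertible characters as XML numeric references:

```python
import unicodedata

stringVal = u'áæãåāœčćęßßßわた'
normalized = unicodedata.normalize('NFKD', stringVal)

print(normalized.encode('ascii', 'ignore').decode())    # unmapped characters dropped
print(normalized.encode('ascii', 'replace').decode())   # unmapped characters become '?'
print(normalized.encode('ascii', 'xmlcharrefreplace').decode())  # kept as &#NNN; references
```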


(CkPython) Convert a File from utf-8 to ANSI (such as Windows-1252)

This example is to satisfy a particular user’s support question:

I have a file that contains this text:

        Original file text (utf-8)   Converted using Notepad (ANSI)
Text    CAFÉ                         CAFÉ
Hex     43 41 46 c3 89               43 41 46 c9

The utf-8 representation of the character É is the two bytes 0xC3 0x89. When Notepad is displaying the utf-8 file, it is interpreting the bytes as if they are ANSI (1 byte per char), and thus it is showing the ANSI char for 0xC3 (Ã) and the ANSI char for 0x89 (‰). After converting to ANSI, the É is represented by the single byte 0xC9.


import sys
import chilkat

# This example assumes the Chilkat API to have been previously unlocked.
# See Global Unlock Sample for sample code.

charset = chilkat.CkCharset()
charset.put_FromCharset("utf-8")
charset.put_ToCharset("ANSI")

# We could alternatively be more specific and say "Windows-1252".
# The term "ANSI" means whatever character encoding is defined as the ANSI
# encoding for the computer. In Poland, for example, it would be the
# single-byte-per-char encoding used to represent Eastern European language
# chars, which is Windows-1250.
charset.put_ToCharset("Windows-1252")

success = charset.ConvertFile("qa_data/txt/cafeUtf8.txt", "qa_output/cafeAnsi.txt")
if success != True:
    print(charset.lastErrorText())
    sys.exit()

print("Success.")
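Chilkat is a commercial library, but the same conversion can be sketched with only the standard library, assuming Windows-1252 (cp1252) as the concrete "ANSI" code page; the file paths here are illustrative temporary files:

```python
import os
import tempfile

# Sketch: read a UTF-8 file and rewrite it as Windows-1252.
workdir = tempfile.mkdtemp()
utf8_path = os.path.join(workdir, "cafeUtf8.txt")
ansi_path = os.path.join(workdir, "cafeAnsi.txt")

# Create the sample UTF-8 input file.
with open(utf8_path, "w", encoding="utf-8") as f:
    f.write("CAFÉ")

# Decode from UTF-8 and re-encode as cp1252 in one pass.
with open(utf8_path, "r", encoding="utf-8") as src, \
     open(ansi_path, "w", encoding="cp1252") as dst:
    dst.write(src.read())

# Inspect the raw bytes of the converted file.
with open(ansi_path, "rb") as f:
    print(f.read().hex(" "))   # 43 41 46 c9 -- É is now the single byte 0xC9
```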

© 2000-2023 Chilkat Software, Inc. All Rights Reserved.
