Python encode unicode to bytes

Python Convert Unicode to Bytes, ASCII, UTF-8, Raw String

Be on the Right Side of Change

Python Convert Unicode to Bytes

Converting Unicode strings to bytes is a common task: files, network protocols, and many machine learning pipelines work with bytes rather than text. Let’s take a look at how this can be accomplished.

Method 1 Built-in function bytes()

A string can be converted to bytes with the built-in bytes() constructor. Given a string and an encoding, it encodes the string in that encoding, just as the string's own encode() method would. Let’s see how it works and immediately check the data type:

>>> A = 'Hello'
>>> print(bytes(A, 'utf-8'), type(bytes(A, 'utf-8')))  # b'Hello' <class 'bytes'>

The b prefix in the output is a sign that this is a bytes object. Unlike the following method, bytes() does not apply any encoding by default when given a string: the encoding must be specified explicitly, otherwise Python raises TypeError: string argument without an encoding.
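A minimal sketch of that behavior, with illustrative variable names: passing an encoding succeeds, while omitting it for a string argument raises TypeError.

```python
text = 'Hello'

# With an explicit encoding, bytes() returns a bytes object.
encoded = bytes(text, 'utf-8')
print(encoded, type(encoded))

# Without an encoding, bytes() does not guess and raises TypeError.
try:
    bytes(text)
    error_message = ''
except TypeError as exc:
    error_message = str(exc)
    print('TypeError:', error_message)
```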

Method 2 Built-in function encode()

Perhaps the most common way to accomplish this task is the string's own encode() method, which performs the conversion directly instead of going through bytes().

encode() is called on a Unicode string and returns a bytes object. It takes two optional arguments: the target encoding and an error handler. Any supported encoding can be given: ASCII, UTF-8 (used by default), UTF-16, latin-1, etc. Error handling can work in several ways:


strict – used by default, raises a UnicodeError (specifically, UnicodeEncodeError) when it encounters a character that is not supported by this encoding;

ignore – unsupported characters are skipped;

replace – unsupported characters are replaced with “?”;

xmlcharrefreplace – unsupported characters are replaced with the corresponding XML character reference, such as &#233;;

backslashreplace – unsupported characters are replaced with sequences starting with a backslash;

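namereplace – unsupported characters are replaced with sequences like \N{...}, where the braces contain the character's Unicode name;

surrogateescape – on decoding, replaces each undecodable byte with a surrogate code, from U+DC80 to U+DCFF; on encoding, these surrogates are turned back into the original bytes;

surrogatepass – allows surrogate codes to be encoded and decoded instead of treating them as errors, is used with the following encodings: utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le.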
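To make the handlers above concrete, here is a short sketch that encodes one non-ASCII string to ASCII with each handler that produces output rather than raising. The sample string and variable names are illustrative choices, not part of the original article.

```python
sample = 'caf\u00e9'  # 'café'; U+00E9 cannot be represented in ASCII

# Handlers that substitute or drop the unsupported character.
handlers = ['ignore', 'replace', 'xmlcharrefreplace',
            'backslashreplace', 'namereplace']
results = {name: sample.encode('ascii', errors=name) for name in handlers}

for name, value in results.items():
    print(f'{name:18} {value}')

# 'strict' (the default) raises instead of substituting.
try:
    sample.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print('strict raised:', strict_failed)
```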

>>> A = '\u0048\u0065\u006C\u006C\u006F'
>>> print(A.encode())  # b'Hello'

In this example, we specified neither the encoding nor the error handler, so the defaults applied: UTF-8 encoding and the strict handler, which did not cause any errors here. Relying on the defaults is discouraged, however: other developers may use encodings other than UTF-8 without declaring them, and when the assumed encoding differs from the actual one, the resulting characters no longer match the original content.
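A small sketch of why the encoding should be stated explicitly: the same text yields different bytes under different encodings, and decoding with the wrong one silently produces wrong text. The sample string and the choice of latin-1 as the "wrong" codec are illustrative.

```python
text = 'Привет'

# The same text produces different byte sequences in different encodings.
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16-le')
utf32_bytes = text.encode('utf-32-le')
print(len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))

# Decoding with a mismatched codec does not fail here, it just gives garbage.
wrong = utf8_bytes.decode('latin-1')
right = utf8_bytes.decode('utf-8')
print(wrong == text, right == text)
```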

Python Convert Unicode to ASCII

Now let’s look at methods for converting further. The goal is a string that contains only ASCII characters.

Method 1 Built-in function decode()

The decode() method of a bytes object, like encode(), works with two arguments: the encoding and the error handler. Let’s see how it works:

>>> print(A.encode('ascii').decode('ascii'))  # Hello

This method works if the input Unicode string contains only ASCII characters, but as soon as a code point appears outside the range from 0 to 127, encoding to ASCII fails:

>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii').decode('ascii'))  # UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)
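The failure can also be detected or inspected programmatically. The sketch below assumes Python 3.7 or later for str.isascii(); the variable names are illustrative. It checks the string up front and then reads the start and end attributes of the exception to locate the offending characters.

```python
text = 'Hello\t\u5316\u4eb1'

# str.isascii() reports in advance whether ASCII encoding can succeed.
can_encode = text.isascii()
print(can_encode)

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # start is inclusive and end is exclusive, like a slice.
    bad_start, bad_end = exc.start, exc.end
    bad_slice = text[bad_start:bad_end]
    print(bad_start, bad_end, repr(bad_slice))
```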

You can use various error handlers, for example, backslashreplace (to replace unsupported characters with sequences starting with backslashes) or namereplace (to insert sequences like \N{...}):


>>> A = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1'
>>> print(A.encode('ascii', 'backslashreplace').decode('ascii'))  # Hello	\u5316\u4eb1
>>> print(A.encode('ascii', 'namereplace').decode('ascii'))  # Hello	\N{CJK UNIFIED IDEOGRAPH-5316}\N{CJK UNIFIED IDEOGRAPH-4EB1}

The result is valid ASCII, but it may be unexpected or uninformative, which can lead to further errors or to time wasted on additional processing.

Method 2 Module unidecode()

PyPI hosts the Unidecode package. Its unidecode module exports a function that takes a Unicode string and returns a string that can be encoded into ASCII bytes in Python 3.x:

>>> from unidecode import unidecode
>>> print(unidecode(A))  # Hello Hua Ye

You can also pass an errors argument to unidecode(), which determines what to do with characters not present in its transliteration tables. The default is ignore, which means that Unidecode drops these characters (replaces them with an empty string). strict raises UnidecodeError; the exception object has an index attribute that can be used to find the invalid character. replace substitutes “?” (or another string specified in the replace_str argument). preserve keeps the original non-ASCII character in the string. Note that if preserve is used, the string returned by unidecode() is no longer guaranteed to be ASCII!

Python Convert Unicode to UTF-8

UTF-8 is the default encoding in Python and the most popular encoding overall, to the point of being a de facto standard. Assuming other developers treat it the same way and do not forget to declare the encoding in the script header, almost all string handling tasks boil down to encoding to and decoding from UTF-8.

For this task, both of the above methods are applicable.


Method 1 Built-in function encode() and decode()

With encode(), we first get a byte string by applying UTF-8 encoding to the input Unicode string, and then use decode(), which turns those UTF-8 bytes back into a Unicode string that is readable and can be printed to the console or shown to the user.

>>> B = '\u0048\u0065\u006C\u006C\u006F\t\u5316\u4EB1\t\u041f\u0440\u0438\u0432\u0435\u0442'
>>> print(B.encode('utf-8').decode('utf-8'))  # Hello	化亱	Привет
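As a small illustration of what happens between the two calls, the sketch below compares the number of characters in the string with the number of bytes in its UTF-8 form and confirms that the round trip is lossless. The variable names other than B are illustrative.

```python
B = 'Hello\t\u5316\u4eb1\t\u041f\u0440\u0438\u0432\u0435\u0442'

encoded = B.encode('utf-8')
decoded = encoded.decode('utf-8')

# ASCII characters take 1 byte, Cyrillic letters 2, these CJK characters 3.
char_count = len(B)
byte_count = len(encoded)
print(char_count, byte_count)

# Encoding and then decoding with the same codec returns the original string.
print(decoded == B)
```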

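Since UTF-8 can represent every Unicode character used in popular applications, environments, or operating systems, encoding to UTF-8 almost never fails, and the error handler can usually be left at its default. Decoding is a different matter: arbitrary bytes are not necessarily valid UTF-8, so decode() can still raise an error.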
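One edge case is worth knowing about: lone surrogate code points are not valid characters on their own, so the strict handler rejects them even for UTF-8. The sketch below, with illustrative variable names, shows this together with the surrogatepass and surrogateescape handlers from the list earlier in the article.

```python
lone = '\ud800'  # a lone surrogate code point

# UTF-8 with the default strict handler refuses to encode it.
try:
    lone.encode('utf-8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes the surrogate anyway.
passed = lone.encode('utf-8', errors='surrogatepass')

# surrogateescape lets invalid bytes survive a decode/encode round trip.
raw = b'abc\xff'  # 0xFF is never valid in UTF-8
text = raw.decode('utf-8', errors='surrogateescape')
restored = text.encode('utf-8', errors='surrogateescape')

print(strict_ok, passed, repr(text), restored)
```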

Method 2 Module unidecode

>>> print(list(map(float, [ord(i) for i in B[:5]])))  # [72.0, 101.0, 108.0, 108.0, 111.0]

Or we can use a for loop; each code point is printed as a float, since we explicitly convert it to this type:

>>> for i in B[:5]: print(float(ord(i)), end=' ')  # 72.0 101.0 108.0 108.0 111.0
