Python encode JSON to UTF-8

Python Encode Unicode and non-ASCII characters into JSON

This article is a guide to working with Unicode and non-ASCII characters in Python when generating and parsing JSON data. It looks at the different ways to handle these characters in JSON and covers the following topics related to encoding and serializing them:

  1. How to encode Unicode and non-ASCII characters into JSON in Python.
  2. How to save non-ASCII or Unicode data as-is, without converting it to a \u escape sequence, in JSON.
  3. How to serialize Unicode data and write it into a file.
  4. How to serialize Unicode objects into UTF-8 JSON strings, instead of \u escape sequences.
  5. How to escape non-ASCII characters while encoding them into JSON in Python.

What is a UTF-8 Character?

Unicode is a standard that assigns a unique code point to the characters of most of the world's written languages. It includes characters from many different scripts, such as Latin, Greek, and Chinese, and can represent a wide range of characters and symbols. Non-ASCII characters are characters that are not part of the ASCII (American Standard Code for Information Interchange) character set, which consists of only 128 characters.

UTF-8 is a character encoding that represents each Unicode code point using one to four bytes. It is the most widely used character encoding for the Web and is supported by all modern web browsers and most other applications. UTF-8 is also backward-compatible with ASCII, so any ASCII text is also a valid UTF-8 text.
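As a small illustration of the one-to-four-byte claim, the sketch below encodes a few sample characters (chosen here for illustration, not taken from the text above) and counts how many UTF-8 bytes each one needs.

```python
# Sample characters chosen for illustration: ASCII, Latin-1 range, CJK, emoji.
samples = ["A", "ø", "明", "😀"]

# Map each character to the number of bytes in its UTF-8 encoding.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")
```

The last line checks the backward-compatibility point: pure ASCII text produces identical bytes under both encodings.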

What is JSON?

The json module is a built-in Python module for working with JSON (JavaScript Object Notation) data. It provides methods for encoding and decoding JSON, as well as for working with the data structures that represent it. The json.dumps() method serializes an object (e.g. a Python dictionary or list) to a JSON-formatted string. This string can then be saved to a file, sent over a network connection, or used in any other way that requires the data to be represented as a string.

Here is how you could use the json.dumps() method to encode a Python dictionary as a JSON string.
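A minimal sketch of that call follows. The dictionary contents are made up for illustration.

```python
import json

# Hypothetical sample data; any JSON-serializable dict works the same way.
person = {"name": "Alice", "languages": ["Python", "Go"], "active": True}

# Serialize the dictionary to a JSON-formatted str.
json_text = json.dumps(person)
print(json_text)

# json.loads() reverses the operation and returns an equal dictionary.
restored = json.loads(json_text)
```

Note that Python's True becomes JSON's true in the serialized text, and is converted back on decoding.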

Python Encode Unicode and non-ASCII characters as-is into JSON

In this article, we will address the following frequently asked questions about working with Unicode JSON data in Python.

  • How to serialize Unicode or non-ASCII data into JSON as-is instead of as a \u escape sequence (for example, store the Unicode string ø as-is instead of as \u00f8 in JSON).
  • How to encode Unicode data in UTF-8 format.
  • How to serialize all incoming non-ASCII characters escaped (for example, store the Unicode string ø as \u00f8 in JSON).

RFC 7159 requires that JSON be represented using either UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability.
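On the decoding side, Python's json.loads() accepts bytes as well as str and, per the json module documentation, expects those bytes to be UTF-8, UTF-16, or UTF-32. The sketch below uses a made-up payload, encodes the same JSON text in all three encodings, and decodes each one again.

```python
import json

# Made-up payload for illustration.
payload = {"city": "Tromsø"}
json_text = json.dumps(payload, ensure_ascii=False)

# Encode the same JSON text in the three encodings named by the RFC.
encoded = {name: json_text.encode(name) for name in ("utf-8", "utf-16", "utf-32")}

# json.loads() detects the encoding of bytes input on its own.
decoded = {name: json.loads(raw) for name, raw in encoded.items()}

for name, raw in encoded.items():
    print(name, len(raw), "bytes")
```

All three byte strings decode to the same dictionary. UTF-8 is the most compact of the three for this payload.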

The ensure_ascii parameter

Python's built-in json module provides the json.dump() and json.dumps() methods to encode Python objects into JSON data.

Both json.dump() and json.dumps() take an ensure_ascii parameter. It is True by default, so the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii=False, these characters are output as-is.

The json module always produces str objects, not bytes objects. Escaping non-ASCII characters is allowed by JSON, so the escaped and the as-is forms are both valid.

  • With ensure_ascii=True, we get a safe way of representing Unicode characters: the resulting JSON contains only ASCII characters, even if the data has Unicode inside.
  • With ensure_ascii=False, the resulting JSON stores Unicode characters as-is instead of as \u escape sequences.
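To make the difference concrete, this short sketch serializes one sample string both ways and checks that the two results differ only in representation, not in the data they decode to.

```python
import json

text = "ø"  # sample non-ASCII character

escaped = json.dumps(text)                    # ensure_ascii=True is the default
as_is = json.dumps(text, ensure_ascii=False)

print(escaped.isascii(), as_is.isascii())     # True False
print(type(escaped).__name__)                 # str

# Both forms are valid JSON and decode to the same Python string.
same_data = json.loads(escaped) == json.loads(as_is) == text
```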

Save non-ASCII or Unicode data as-is not as \u escape sequence in JSON

In this example, we will try to encode the Unicode Data into JSON. This solution is useful when you want to dump Unicode characters as characters instead of escape sequences.

Set ensure_ascii=False in json.dumps() to encode Unicode as-is into JSON

    import json

    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    # use dump() method to write it in file
    encodedUnicode = json.dumps(unicodeData, ensure_ascii=False)
    print("JSON character encoding by setting ensure_ascii=False", encodedUnicode)

    print("Decoding JSON", json.loads(encodedUnicode))

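Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    JSON character encoding by setting ensure_ascii=False {"string1": "明彦", "string2": "ø"}
    Decoding JSON {'string1': '明彦', 'string2': 'ø'}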

Note: This example is useful to store the Unicode string as-is in JSON.

JSON Serialize Unicode Data and Write it into a file.

In the above example, we saw how to save non-ASCII or Unicode data as-is, not as a \u escape sequence, in JSON. Now, let's see how to write JSON-serialized Unicode data as-is into a file.

    import json

    sampleDict = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }

    # Open the file with an explicit UTF-8 encoding and write the characters as-is
    with open("unicodeFile.json", "w", encoding='utf-8') as write_file:
        json.dump(sampleDict, write_file, ensure_ascii=False)
    print("Done writing JSON serialized Unicode Data as-is into file")

    # Read the file back using the same encoding
    with open("unicodeFile.json", "r", encoding='utf-8') as read_file:
        print("Reading JSON serialized Unicode data from file")
        sampleData = json.load(read_file)
    print("Decoded JSON serialized Unicode data")
    print(sampleData["string1"], sampleData["string2"])

Output:

    Done writing JSON serialized Unicode Data as-is into file
    Reading JSON serialized Unicode data from file
    Decoded JSON serialized Unicode data
    明彦 ø

JSON file after writing Unicode data as-is:

    {"string1": "明彦", "string2": "ø"}
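One way to confirm what actually lands on disk is to read the file back in binary mode. The sketch below is only an illustration: it uses a temporary directory so nothing is left behind, and compares the raw bytes written with and without ensure_ascii.

```python
import json
import os
import tempfile

data = {"string1": "明彦", "string2": "ø"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "unicodeFile.json")

    # Write the characters as-is; the file object encodes them as UTF-8.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False)
    with open(path, "rb") as fp:
        raw_as_is = fp.read()

    # Write again with the default escaping for comparison.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp)
    with open(path, "rb") as fp:
        raw_escaped = fp.read()

print(len(raw_as_is), len(raw_escaped))
```

The as-is file holds the UTF-8 bytes of the characters themselves, while the escaped file is pure ASCII and somewhat longer.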

Serialize Unicode objects into UTF-8 JSON strings instead of \u escape sequence

You can also set JSON encoding to UTF-8. UTF-8 is the recommended default for maximum interoperability. Set ensure_ascii=False and then encode the resulting JSON string using 'utf-8'.

    import json

    # encoding in UTF-8
    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    print("Unicode JSON Data encoding using utf-8")
    encodedUnicode = json.dumps(unicodeData, ensure_ascii=False).encode('utf-8')
    print("JSON character encoding by setting ensure_ascii=False", encodedUnicode)

    print("Decoding JSON", json.loads(encodedUnicode))

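Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    Unicode JSON Data encoding using utf-8
    JSON character encoding by setting ensure_ascii=False b'{"string1": "\xe6\x98\x8e\xe5\xbd\xa6", "string2": "\xc3\xb8"}'
    Decoding JSON {'string1': '明彦', 'string2': 'ø'}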

Encode both Unicode and ASCII (Mix Data) into JSON using Python

In this example, we will see how to encode a Python dictionary that contains both Unicode and ASCII data into JSON.

    import json

    # Stand-in dictionary mixing Unicode and ASCII data (assumed for illustration;
    # the literal is missing from the source text)
    sampleDict = {"name": "明彦", "age": 25}
    print("unicode Data is ", sampleDict)

    # set ensure_ascii=True
    jsonDict = json.dumps(sampleDict, ensure_ascii=True)
    print("JSON character encoding by setting ensure_ascii=True")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

    # set ensure_ascii=False
    jsonDict = json.dumps(sampleDict, ensure_ascii=False)
    print("JSON character encoding by setting ensure_ascii=False")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

    # set ensure_ascii=False and encode using utf-8
    jsonDict = json.dumps(sampleDict, ensure_ascii=False).encode('utf-8')
    print("JSON character encoding by setting ensure_ascii=False and UTF-8")
    print(jsonDict)
    print("Decoding JSON", json.loads(jsonDict))

Output (for the stand-in dictionary above):

    unicode Data is  {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=True
    {"name": "\u660e\u5f66", "age": 25}
    Decoding JSON {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=False
    {"name": "明彦", "age": 25}
    Decoding JSON {'name': '明彦', 'age': 25}
    JSON character encoding by setting ensure_ascii=False and UTF-8
    b'{"name": "\xe6\x98\x8e\xe5\xbd\xa6", "age": 25}'
    Decoding JSON {'name': '明彦', 'age': 25}

Python Escape non-ASCII characters while encoding it into JSON

Let's see how to store all incoming non-ASCII characters escaped in JSON. It is a safe way of representing Unicode characters. By setting ensure_ascii=True, we make sure the resulting JSON contains only ASCII characters (even if the data has Unicode inside).

    import json

    unicodeData = {
        "string1": "明彦",
        "string2": u"\u00f8"
    }
    print("unicode Data is ", unicodeData)

    # set ensure_ascii=True
    encodedUnicode = json.dumps(unicodeData, ensure_ascii=True)
    print("JSON character encoding by setting ensure_ascii=True")
    print(encodedUnicode)

    print("Decoding JSON")
    print(json.loads(encodedUnicode))

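Output:

    unicode Data is  {'string1': '明彦', 'string2': 'ø'}
    JSON character encoding by setting ensure_ascii=True
    {"string1": "\u660e\u5f66", "string2": "\u00f8"}
    Decoding JSON
    {'string1': '明彦', 'string2': 'ø'}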
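Characters outside the Basic Multilingual Plane are a useful edge case. With ensure_ascii=True, Python's json module writes them as a pair of \u escapes (a UTF-16 surrogate pair). The emoji below is a sample chosen for illustration.

```python
import json

emoji = "\U0001F600"  # sample character above U+FFFF

escaped = json.dumps(emoji)   # ensure_ascii=True by default
print(escaped)                # "\ud83d\ude00"

# Decoding joins the surrogate pair back into a single character.
round_trip = json.loads(escaped)
```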

About Vishal

I’m Vishal Hule, Founder of PYnative.com. I am a Python developer, and I love to write articles to help students, developers, and learners.

How do I get Russian text out of JSON in Python?

Alexander, that means your file is being opened in some encoding other than UTF-8, which is the encoding the written data is in.

trapwalker

Alexander, I answered further down. Your file is UTF-8, but apparently it is being opened as a binary file. Convert it to utf-8 explicitly. Or show the code and I'll tell you how to configure logging, or whatever it is you have there.
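The symptom described here can be reproduced with a small sketch. It writes UTF-8 JSON to a temporary file (the file name and sample text are made up), then reads it back once with the wrong encoding and once with the right one.

```python
import json
import os
import tempfile

data = {"status": "Привет"}  # sample Russian text

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data.json")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False)

    # Wrong: decoding the UTF-8 bytes as cp1251 yields mojibake.
    with open(path, "r", encoding="cp1251") as fp:
        garbled = json.load(fp)

    # Right: name the encoding the file was actually written in.
    with open(path, "r", encoding="utf-8") as fp:
        correct = json.load(fp)
```

The JSON structure survives in both cases. Only the text inside the string is damaged when the encodings do not match.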

json.dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)

trapwalker

There is no need to encode or decode anything there.
The text in the response is Unicode, and `json.loads` parses it properly.
The problem is most likely the encoding of your Windows console: it uses some single-byte encoding such as cp1251 or cp866.
When you try to print Unicode in that terminal, you get an error because, when converting automatically from Unicode to the console encoding, Python tries to use the default codec, which is of course 'ascii'.

Windows will be Windows, with its merciless terminal and default encodings.

But you can print this text: it contains no characters that a single-byte encoding cannot represent. Try this:

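    # obj is the already parsed JSON response from the question
    x = obj['result'][-1]['status']
    try:
        print('cp1251:', x.encode('cp1251'))
    except UnicodeEncodeError:
        # cp1251 cannot represent the text, so try cp866
        try:
            print('cp866:', x.encode('cp866'))
        except UnicodeEncodeError:
            print('no way')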

The general rules for working with encodings are:
- on input to the program, convert everything to Unicode.
- on output, encode everything into the required encoding.
- if the output is printing to standard output (stdout), there are four possible situations:
1) you print to a Windows terminal that uses the cp1251 encoding
2) you print to a terminal that uses the cp866 encoding
3) you print to stdout that is redirected to a file; the pipe does not know its encoding, that is, no encoding is set, so you can encode into any encoding and that is what gets written to the file. Use utf8: it is the right encoding for everything.
4) you are on Linux, your terminal uses the default encoding, utf8, and all is well.
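These rules are sometimes summed up as "decode early, encode late". Here is a minimal sketch, with made-up input bytes standing in for data that arrives from outside the program:

```python
# Bytes as they might arrive from a file or socket (sample data, cp1251-encoded).
incoming = "Привет".encode("cp1251")

# Input boundary: decode to str (Unicode) as soon as the data enters the program.
text = incoming.decode("cp1251")

# Inside the program, work only with str.
shouted = text.upper()

# Output boundary: encode into the target encoding, UTF-8 here.
outgoing = shouted.encode("utf-8")
```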

In any case, when you print something or save it to a file, you need to understand that the text has to be encoded into some encoding. This can happen implicitly (as in your case), but when the text is encoded into the default encoding (ascii), not every character can be represented in it. ASCII has only 128 characters. The result was a predictable error.
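The same error is easy to reproduce by encoding explicitly. The sample string in this sketch is an assumption; the point is only that 'ascii' cannot represent Cyrillic letters.

```python
text = "Привет"  # sample text with non-ASCII characters

try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    # The exception names the codec and the position of the offending characters.
    error_message = str(exc)

print(error_message is not None)
```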

The standard input/output streams have an encoding attribute:

    import sys
    sys.stdout.encoding  # 'UTF-8'

In your case it will be either None, if the output is redirected to a file,
or 'cp1251', or 'cp866', or something else along those lines.
If it is not None, you can try to encode your string into that encoding. Some characters may still fail to convert (not in your case); they can be ignored with a special argument of the encode method.
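That special argument is the errors parameter of str.encode(). The helper below is a hypothetical sketch, not something from the thread: it encodes for a given stream encoding, falls back to UTF-8 when none is reported, and replaces characters the target cannot represent.

```python
import sys

def encode_for_output(text, encoding=None):
    """Encode text for an output stream, replacing unencodable characters.

    Hypothetical helper: falls back to UTF-8 when no encoding is given.
    """
    target = encoding or "utf-8"
    return text.encode(target, errors="replace")

# getattr guards against stream objects that lack an encoding attribute.
stdout_encoding = getattr(sys.stdout, "encoding", None)
encoded = encode_for_output("Привет", stdout_encoding)

# Characters missing from the target encoding become "?" with errors="replace"
# and are dropped with errors="ignore".
replaced = "明".encode("cp866", errors="replace")
ignored = "明".encode("cp866", errors="ignore")
```

Choosing "replace" over "ignore" keeps a visible marker where a character was lost, which tends to make encoding problems easier to notice.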
