Python pickle utf 8

Contents

pickle a list as UTF-8
How to pickle unicodes and save them in utf-8 databases
How to pickle and unpickle to portable string in Python 3

pickle a list as UTF-8

So far I have managed to create a list of all files. My next step is to write that list to a single file which I can then edit. The problem is that I get a format issue when I save the list to a file. I want the final file to be UTF-8 encoded. This is what I want my file to look like:

Hub Hub mm 150.000000000000 Bohrung Bohru mm 135.000000000000

but what I get at the moment is:

”ŒHub Hub mm 150.000000000000 ”Œ%Bohrung Bohru mm 135.000000000000

    import os
    import pickle

    folderpath = r"C:/Users/l-reh/Desktop/HTB"
    filepaths = [os.path.join("C:/Users/l-reh/Desktop/HTB/", name) for name in os.listdir(folderpath)]
    all_files = []
    for path in filepaths:
        with open(path, 'r') as f:
            file = f.readlines()
            all_files.append(file)
    with open("C:/Users/l-reh/Desktop/Bachelorarbeit/DB Testdatensatz/HTB.htb", 'wb') as f:
        pickle.dump(all_files, f)

BTW: Without the input data, your issue is impossible to reproduce. However, you should hard-code example data in order to create a minimal reproducible example. As a new user here, please also take the tour and read How to Ask.

1 Answer

pickle produces a binary format, which includes per-field "header" bytes (describing type, length, and for some pickle protocols, framing data) that are going to look like garbage text if you view the output as text. You can’t say "I want it to be pickle, but not have these bytes" because those bytes are part of the pickle serialization format. If you don’t want those bytes, you need to choose a different serialization format (presumably using a custom serializer that matches this HTB format). This has nothing to do with UTF-8 encoding or lack thereof (your input is ASCII); the problem is that you are demanding a result that’s literally impossible within the limits of your design.
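If the goal is an editable UTF-8 text file rather than a pickle, one option is to skip pickle and write the lines out directly. The following is only a minimal sketch: it assumes the input files are UTF-8 text and that concatenating their lines gives the desired layout. The function name `merge_text_files` and the temporary demo files are hypothetical, not part of the original question.

```python
import os
import tempfile


def merge_text_files(folder, out_path, encoding="utf-8"):
    """Concatenate every file in `folder` into one text file at `out_path`."""
    with open(out_path, "w", encoding=encoding) as out:
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue  # skip sub-directories
            with open(path, "r", encoding=encoding) as src:
                out.writelines(src.readlines())


# Self-contained demo using temporary files instead of the real HTB folder.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = os.path.join(tmp, "HTB")
    os.mkdir(src_dir)
    with open(os.path.join(src_dir, "a.txt"), "w", encoding="utf-8") as f:
        f.write("Hub Hub mm 150.000000000000\n")
    with open(os.path.join(src_dir, "b.txt"), "w", encoding="utf-8") as f:
        f.write("Bohrung Bohru mm 135.000000000000\n")

    merged_path = os.path.join(tmp, "HTB.htb")
    merge_text_files(src_dir, merged_path)
    with open(merged_path, encoding="utf-8") as f:
        merged_text = f.read()
```

Because the output file is opened in text mode with an explicit encoding, it contains only the characters from the input lines and none of the pickle header bytes.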


How to pickle unicodes and save them in utf-8 databases

I have a database (mysql) where I want to store pickled data. The data can be for instance a dictionary, which may contain unicode, e.g.

    import pickle

    pickled_data = pickle.dumps(data)
    print type(pickled_data)  # returns <type 'str'>

The resulting pickled_data is a string. When I try to store this in a database (e.g. in a TextField) this can cause problems. In particular, at some point I get a

UnicodeDecodeError "'utf8' codec can't decode byte 0xe9 in position X"

I can see two options:

1. Encode the result of the pickle.dump to utf-8 and store it. When I want to pickle.load, I have to decode it.
2. Store the pickled string in binary format (how?), which forces all characters to be within ASCII.

My issue is that I can’t see what the long-term consequences of choosing either of these options are. Since the change already requires some effort, I’d like an opinion on this, including suggestions for any better alternatives.

(P.S. This is for instance useful in Django)

I’m not the only one caught out by the wording in the Python 2 documentation for pickle protocol 0 then. I see they have improved the wording in the Python 3 docs to make it clear that all protocols are binary.

1 Answer

Pickle data is opaque, binary data, even when you use protocol version 0:

    >>> pickle.dumps(data, 0)
    '(dp0\nI1\nV\xe9\np1\ns.'

When you try to store that in a TextField, Django will try to decode that data as UTF-8 to store it; this is what fails, because this is not UTF-8 encoded data; it is binary data instead:

    >>> pickled_data.decode('utf8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 9: invalid continuation byte

The solution is to not try to store this in a TextField. Use a BinaryField instead:

A field to store raw binary data. It only supports bytes assignment. Be aware that this field has limited functionality. For example, it is not possible to filter a queryset on a BinaryField value.

You have a bytes value (Python 2 strings are byte strings, renamed to bytes in Python 3).

If you insist on storing the data in a text field, explicitly decode it as latin1; the Latin-1 codec maps bytes one-to-one to Unicode code points:

    >>> pickled_data.decode('latin1')
    u'(dp0\nI1\nV\xe9\np1\ns.'

and make sure you encode it again before unpickling again:

    >>> encoded = pickled_data.decode('latin1')
    >>> pickle.loads(encoded)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/Libraries/buildout.python/parts/opt/lib/python2.7/pickle.py", line 1381, in loads
        file = StringIO(str)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
    >>> pickle.loads(encoded.encode('latin1'))
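The same Latin-1 round trip can be sketched for Python 3, where pickle.dumps() returns bytes. The sample `data` value here is an assumption chosen only for illustration.

```python
import pickle

data = {1: "\xe9"}  # hypothetical sample: a dict holding a non-ASCII string

pickled_bytes = pickle.dumps(data)          # bytes, not text
as_text = pickled_bytes.decode("latin1")    # every byte maps to one code point
restored = pickle.loads(as_text.encode("latin1"))  # encode again before loading
```

This only survives storage if the text column preserves every code point from U+0000 to U+00FF unchanged, which is one reason a binary field is the safer choice.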

Do note that if you let this value go to the browser and back again in a text field, the browser is likely to have replaced characters in that data. Internet Explorer will replace \n characters with \r\n, for example, because it assumes it is dealing with text.

Not that you should ever accept pickle data from a network connection in any case, because that is a security hole waiting to be exploited.


How to pickle and unpickle to portable string in Python 3

I need to pickle a Python 3 object to a string which I want to unpickle from an environment variable in a Travis CI build. The problem is that I can’t seem to find a way to pickle to a portable string (unicode) in Python 3:

    import os, pickle
    from my_module import MyPickleableClass

    obj = ...  # the object to pickle

    pickled = pickle.dumps(obj)

    # raises TypeError: str expected, not bytes
    os.environ['pickled'] = pickled

    # raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb (...)
    os.environ['pickled'] = pickled.decode('utf-8')

    pickle.loads(os.environ['pickled'])

Is there a way to serialize complex objects like datetime.datetime to unicode or to some other string representation in Python3 which I can transfer to a different machine and deserialize?


Update

I have tested the solutions suggested by @kindall, but pickle.dumps(obj, 0).decode() raises a UnicodeDecodeError. The base64 approach works, although it needs an extra decode/encode step. That solution works on both Python 2.x and Python 3.x.

    import codecs
    import pickle

    # encode returns bytes so it needs to be decoded to string
    pickled = codecs.encode(pickle.dumps(obj), 'base64').decode()

    type(pickled)  # str on Python 3

    unpickled = pickle.loads(codecs.decode(pickled.encode(), 'base64'))

Yes, I would prefer a safer format like JSON if possible. Pickles are as good as executable code and running arbitrary code out of an envvar seems pretty dirty to me even if due to the application it is not at present a security hole. Don’t resort to pickle until you actually need that flexibility; you certainly don’t for datetime.

4 Answers

pickle.dumps() produces a bytes object. Expecting these arbitrary bytes to be valid UTF-8 text (the assumption you are making by trying to decode it to a string from UTF-8) is pretty optimistic. It’d be a coincidence if it worked!

One solution is to use the older pickling protocol that uses entirely ASCII characters. This still comes out as bytes, but since those bytes contain only ASCII code points, it can be converted to a string without stress:

pickled = str(pickle.dumps(obj, 0))

You could also use some other encoding method to encode a binary-pickled object to text, such as base64:

    import codecs
    pickled = codecs.encode(pickle.dumps(obj), "base64").decode()

unpickled = pickle.loads(codecs.decode(pickled.encode(), "base64"))
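A self-contained sketch of the base64 round trip follows. It uses the `base64` module rather than `codecs`, which gives a single line of text without the newlines that the `base64` codec inserts. The sample object is hypothetical.

```python
import base64
import datetime
import pickle

obj = {"when": datetime.datetime(2020, 1, 2, 3, 4, 5)}  # hypothetical sample

# bytes -> base64 bytes -> ASCII-only str, safe to paste into a text field
pickled_text = base64.b64encode(pickle.dumps(obj)).decode("ascii")

# str -> original pickle bytes -> object
unpickled = pickle.loads(base64.b64decode(pickled_text))
```

Since no protocol is passed, this uses the default pickle protocol; base64 works with any protocol because it treats the pickle as opaque bytes.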

Using pickle with protocol 0 seems to result in shorter strings than base64-encoding binary pickles (and abarnert’s suggestion of hex-encoding is going to be even larger than base64), but I haven’t tested it rigorously or anything. Test it with your data and see.

If space efficiency really matters, you definitely want to use pickle protocol 4 and environb when possible, or maybe even pickle protocol 4 plus bzip. For Windows, I’d test pickle protocol 4 plus bzip plus base64 vs. pickle protocol 0, but my guess would be that the former is smaller. But I was assuming a few hundred bytes of memory/bandwidth/etc. for each CI build isn’t going to be worth worrying about either way.

@abarnert The first solution doesn’t work with protocol 4, only with 0, which is ascii, and thus makes it possible to decode. The other solution with base64 can use any protocol, and will by default use whatever is the pickle.DEFAULT_PROTOCOL, which is 3 in py3, and 4 since py3.8.

pickled = str(pickle.dumps(obj, 0)) doesn’t work for me. The code runs, but it produces a string that cannot be passed to pickle.loads because it embeds the b'...'. I went with pickled = pickle.dumps(obj, 0).decode() which can then be passed to loads as unpickled = pickle.loads(pickled.encode())
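The difference between the two calls can be seen in a short Python 3 sketch. The sample object is an assumption and deliberately contains only ASCII text.

```python
import pickle

obj = [1, 2, "three"]  # hypothetical sample containing only ASCII text

raw = pickle.dumps(obj, 0)        # protocol 0 output is still a bytes object

as_repr = str(raw)                # repr of the bytes, wrapped in b'...'
as_text = raw.decode("ascii")     # the actual pickle content as text

unpickled = pickle.loads(as_text.encode("ascii"))
```

Decoding as ASCII can still fail when the object contains strings with characters such as é, because protocol 0 may write those as single non-ASCII bytes. That is consistent with the UnicodeDecodeError reported in the update above.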

This answer plus @Mathieson’s reply are the ones clearly explaining the problem and fixing it for me. Thanks!

If you want to store bytes in the environment, instead of encoded text, that’s what environb is for.

This doesn’t work on Windows. (As the docs imply, you should check os.supports_bytes_environ if you’re on 3.2+ instead of just assuming that Unix does and Windows doesn’t…) So for that, you’ll need to smuggle the bytes into something that can be encoded no matter what your system encoding is, e.g., using backslash-escape, or even hex. So, for example:

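    import codecs
    import os

    if os.supports_bytes_environ:
        # environb takes bytes for both the key and the value
        os.environb[b'pickled'] = pickled
    else:
        # the hex codec returns bytes, so decode to str for os.environ
        os.environ['pickled'] = codecs.encode(pickled, 'hex').decode('ascii')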
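For the hex fallback, the reading side has to reverse the encoding before unpickling. A minimal sketch of the full round trip, with a hypothetical sample payload and the variable name `pickled` taken from the example above:

```python
import codecs
import os
import pickle

pickled = pickle.dumps({"answer": 42})  # hypothetical sample payload

# Store: hex digits are plain ASCII, so they fit in an ordinary str variable.
os.environ["pickled"] = codecs.encode(pickled, "hex").decode("ascii")

# Load: turn the hex digits back into the original pickle bytes.
restored_bytes = codecs.decode(os.environ["pickled"].encode("ascii"), "hex")
restored = pickle.loads(restored_bytes)
```

Hex doubles the size of the payload, so it trades space for a value that works in any environment.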

The assignment to an environ item was just for the sake of the example. What I really need is to put the serialized string into a Travis CI environment variable through a web form.

@PeterHudec: Good thing it’s just for the sake of example, because bobince is right; this wouldn’t have worked in general.

I think the simplest answer, especially if you don’t care about Windows, is to just store the bytes in the environment, as suggested in my other answer.

But if you want something clean and debuggable, you might be happier using something designed as a text-based format.

pickle does have a "plain text" protocol 0, as explained in kindall’s answer. It’s certainly more readable than protocol 3 or 4, but it’s still not something I’d actually want to read.

JSON is much nicer, but it can’t handle datetime out of the box. You can come up with your own encoding (the stdlib’s json module is extensible) for the handful of types you need to encode, or use something like jsonpickle. It’s generally safer, more efficient, and more readable to come up with custom encodings for each type you care about than a general "pack arbitrary types in a Turing-complete protocol" scheme like pickle or jsonpickle, but of course it’s also more work, especially if you have a lot of extra types.
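As one possible shape for such a custom encoding, the sketch below extends the stdlib json module to round-trip datetime values. The `__datetime__` tag and the helper names `encode_extra` and `decode_extra` are hypothetical conventions, not part of any library.

```python
import datetime
import json


def encode_extra(value):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(value, datetime.datetime):
        return {"__datetime__": value.isoformat()}
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def decode_extra(mapping):
    """Called by json.loads for every decoded JSON object."""
    if "__datetime__" in mapping:
        return datetime.datetime.fromisoformat(mapping["__datetime__"])
    return mapping


original = {"name": "build", "started": datetime.datetime(2020, 1, 2, 3, 4, 5)}
text = json.dumps(original, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)
```

Unlike a pickle, the resulting text is readable and loading it does not execute arbitrary code. `datetime.fromisoformat` requires Python 3.7 or later.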

JSON Schema lets you define languages in JSON, similar to what you’d do in XML. It comes with a built-in date-time string format, and the jsonschema library for Python knows how to use it.

YAML has a standard extension repository that includes many types JSON doesn’t, including a timestamp. Most of the zillion "yaml" modules for Python already know how to encode datetime objects to and from this type. If you need additional types beyond what YAML includes, it was designed to be extensible declaratively. And there are libraries that do the equivalent of jsonpickle, defining new types on the fly, if you really need that.

And finally, you can always write an XML language.
