Python pandas read csv encoding windows 1251

Содержание

Saved searches
Use saved searches to filter your results more quickly
read_csv fails to read file if there are cyrillic symbols in filename #17773
read_csv fails to read file if there are cyrillic symbols in filename #17773
Comments
Code Sample, a copy-pastable example if possible
Problem description
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
Reading russian language data from csv
Solution 2
Solution 3
Erba Aitbayev
Comments
Чтение данных на русском языке из CSV

Saved searches

Use saved searches to filter your results more quickly

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv fails to read file if there are cyrillic symbols in filename #17773

Comments

Code Sample, a copy-pastable example if possible

import pandas cyrillic_filename = "./файл_1.csv" # 'c' engine fails: df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251") --------------------------------------------------------------------------- OSError Traceback (most recent call last) ipython-input-18-9cb08141730c> in module>() 2 3 cyrillic_filename = "./файл_1.csv" ----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251") d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 403 404 # Create the parser. --> 405 parser = TextFileReader(filepath_or_buffer, **kwds) 406 407 if chunksize or iterator: d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds) 762 self.options['has_index_names'] = kwds['has_index_names'] 763 --> 764 self._make_engine(self.engine) 765 766 def close(self): d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine) 983 def _make_engine(self, engine='c'): 984 if engine == 'c': --> 985 self._engine = CParserWrapper(self.f, **self.options) 986 else: 987 if engine == 'python': d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds) 1603 kwds['allow_leading_cols'] = self.index_col is not False 1604 -> 1605 self._reader = parsers.TextReader(src, **kwds) 1606 1607 # XXX pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)() OSError: Initializing from file failed # 'python' engine work: df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251") df.size >>172440 # 'c' engine works if filename can be encoded to utf-8 latin_filename = "./file_1.csv" df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251") df.size >>172440

Problem description

The ‘c’ engine should read the files with non-UTF-8 filenames

Expected Output

File content readed into dataframe

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.2
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.0.0
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
None

The text was updated successfully, but these errors were encountered:

Источник

Reading russian language data from csv

\ea is the windows-1251 / cp5347 encoding for к . Therefore, you need to use windows-1251 decoding, not UTF-8.

In Python 2.7, the CSV library does not support Unicode properly — See «Unicode» in https://docs.python.org/2/library/csv.html

They propose a simple work around using:

class UnicodeReader: """ A CSV reader which will iterate over lines in the CSV file "f", which is encoded in the given encoding. """ def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds): f = UTF8Recoder(f, encoding) self.reader = csv.reader(f, dialect=dialect, **kwds) def next(self): row = self.reader.next() return [unicode(s, "utf-8") for s in row] def __iter__(self): return self

This would allow you to do:

def loadCsv(filename): lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251" ) # if you really need lists then uncomment the next line # this will let you do call exact lines by doing `line_12 = lines[12]` # return list(lines) # this will return an "iterator", so that the file is read on each call # use this if you'll do a `for x in x` return lines

If you try to print dataset , then you’ll get a representation of a list within a list, where the first list is rows, and the second list is colums. Any encoded bytes or literals will be represented with \x or \u . To print the values, do:

for csv_line in loadCsv("myfile.csv"): print u", ".join(csv_line)

If you need to write your results to another file (fairly typical), you could do:

with io.open("my_output.txt", "w", encoding="utf-8") as my_ouput: for csv_line in loadCsv("myfile.csv"): my_output.write(u", ".join(csv_line))

This will automatically convert/encode your output to UTF-8.

Solution 2

import pandas as pd pd.read_csv(path_file , "cp1251")

import csv with open(path_file, encoding="cp1251", errors='ignore') as source_file: reader = csv.reader(source_file, delimiter=",")

Solution 3

Can your .csv be another encoding, not UTF-8? (considering error message, it even should be). Try other cyrillic encodings such as Windows-1251 or CP866 or KOI8.

Erba Aitbayev

Comments

2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы 2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы 2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы

Delimiter is ; symbol. I want to read data and put it into array. I tried to read this data using this code:

def loadCsv(filename): lines = csv.reader(open(filename, "rb"),delimiter=";" ) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [str(x) for x in dataset[i]] return dataset

mydata = loadCsv('krish(csv3).csv') print mydata

[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]

import codecs with codecs.open('krish(csv3).csv','r',encoding='utf8') as f: text = f.read() print text

newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte

What is the problem? When using codecs how to specify delimiter in my data? I just want to read data from file and put it in 2-dimensional array.

return ([f.decode(‘cp1251’) if isinstance(s, bytes) else f for f in row] for row in csv.reader(open(filename, «rb»),delimiter=»;»))

Источник

Чтение данных на русском языке из CSV

Разделитель ; символ. Я хочу читать данные и помещать их в массив. Я попытался прочитать эти данные, используя этот код:

def loadCsv(filename): lines = csv.reader(open(filename, "rb"),delimiter=";" ) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [str(x) for x in dataset[i]] return dataset

mydata = loadCsv('krish(csv3).csv') print mydata

[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]

import codecs with codecs.open('krish(csv3).csv','r',encoding='utf8') as f: text = f.read() print text

newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte

В чем проблема? При использовании кодеков, как указать разделитель в моих данных? Я просто хочу прочитать данные из файла и поместить его в 2-мерный массив.

return ([f.decode(‘cp1251’) if isinstance(s, bytes) else f for f in row] for row in csv.reader(open(filename, «rb»),delimiter=»;»))

Источник