- Saved searches
- Use saved searches to filter your results more quickly
- read_csv fails to read file if there are cyrillic symbols in filename #17773
- read_csv fails to read file if there are cyrillic symbols in filename #17773
- Comments
- Code Sample, a copy-pastable example if possible
- Problem description
- Expected Output
- Output of pd.show_versions()
- INSTALLED VERSIONS
- Reading russian language data from csv
- Solution 2
- Solution 3
- Erba Aitbayev
- Comments
- Чтение данных на русском языке из CSV
Saved searches
Use saved searches to filter your results more quickly
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
read_csv fails to read file if there are cyrillic symbols in filename #17773
read_csv fails to read file if there are cyrillic symbols in filename #17773
Comments
Code Sample, a copy-pastable example if possible
import pandas cyrillic_filename = "./файл_1.csv" # 'c' engine fails: df = pandas.read_csv(cyrillic_filename, engine="c", encoding="cp1251") --------------------------------------------------------------------------- OSError Traceback (most recent call last) ipython-input-18-9cb08141730c> in module>() 2 3 cyrillic_filename = "./файл_1.csv" ----> 4 df = pandas.read_csv(cyrillic_filename , engine="c", encoding="cp1251") d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision) 653 skip_blank_lines=skip_blank_lines) 654 --> 655 return _read(filepath_or_buffer, kwds) 656 657 parser_f.__name__ = name d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds) 403 404 # Create the parser. --> 405 parser = TextFileReader(filepath_or_buffer, **kwds) 406 407 if chunksize or iterator: d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds) 762 self.options['has_index_names'] = kwds['has_index_names'] 763 --> 764 self._make_engine(self.engine) 765 766 def close(self): d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine) 983 def _make_engine(self, engine='c'): 984 if engine == 'c': --> 985 self._engine = CParserWrapper(self.f, **self.options) 986 else: 987 if engine == 'python': d:\0_dev\services\protocol_sort\venv\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds) 1603 kwds['allow_leading_cols'] = self.index_col is not False 1604 -> 1605 self._reader = parsers.TextReader(src, **kwds) 1606 1607 # XXX pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:4209)() pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source (pandas\_libs\parsers.c:8895)() OSError: Initializing from file failed # 'python' engine work: df = pandas.read_csv(cyrillic_filename, engine="python", encoding="cp1251") df.size >>172440 # 'c' engine works if filename can be encoded to utf-8 latin_filename = "./file_1.csv" df = pandas.read_csv(latin_filename, engine="c", encoding="cp1251") df.size >>172440
Problem description
The ‘c’ engine should read the files with non-UTF-8 filenames
Expected Output
File content readed into dataframe
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.2
scipy: 0.19.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 4.0.0
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
None
The text was updated successfully, but these errors were encountered:
Reading russian language data from csv
\ea is the windows-1251 / cp5347 encoding for к . Therefore, you need to use windows-1251 decoding, not UTF-8.
In Python 2.7, the CSV library does not support Unicode properly — See «Unicode» in https://docs.python.org/2/library/csv.html
They propose a simple work around using:
class UnicodeReader: """ A CSV reader which will iterate over lines in the CSV file "f", which is encoded in the given encoding. """ def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds): f = UTF8Recoder(f, encoding) self.reader = csv.reader(f, dialect=dialect, **kwds) def next(self): row = self.reader.next() return [unicode(s, "utf-8") for s in row] def __iter__(self): return self
This would allow you to do:
def loadCsv(filename): lines = UnicodeReader(open(filename, "rb"), delimiter=";", encoding="windows-1251" ) # if you really need lists then uncomment the next line # this will let you do call exact lines by doing `line_12 = lines[12]` # return list(lines) # this will return an "iterator", so that the file is read on each call # use this if you'll do a `for x in x` return lines
If you try to print dataset , then you’ll get a representation of a list within a list, where the first list is rows, and the second list is colums. Any encoded bytes or literals will be represented with \x or \u . To print the values, do:
for csv_line in loadCsv("myfile.csv"): print u", ".join(csv_line)
If you need to write your results to another file (fairly typical), you could do:
with io.open("my_output.txt", "w", encoding="utf-8") as my_ouput: for csv_line in loadCsv("myfile.csv"): my_output.write(u", ".join(csv_line))
This will automatically convert/encode your output to UTF-8.
Solution 2
import pandas as pd pd.read_csv(path_file , "cp1251")
import csv with open(path_file, encoding="cp1251", errors='ignore') as source_file: reader = csv.reader(source_file, delimiter=",")
Solution 3
Can your .csv be another encoding, not UTF-8? (considering error message, it even should be). Try other cyrillic encodings such as Windows-1251 or CP866 or KOI8.
Erba Aitbayev
Comments
2-комнатная квартира РДТ', мкр Тастак-3, Аносова — Толе би;Алматы 2-комнатная квартира БГР', мкр Таугуль, Дулати (Навои) — Токтабаева;Алматы 2-комнатная квартира ЦФМ', мкр Тастак-2, Тлендиева — Райымбека;Алматы
Delimiter is ; symbol. I want to read data and put it into array. I tried to read this data using this code:
def loadCsv(filename): lines = csv.reader(open(filename, "rb"),delimiter=";" ) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [str(x) for x in dataset[i]] return dataset
mydata = loadCsv('krish(csv3).csv') print mydata
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
import codecs with codecs.open('krish(csv3).csv','r',encoding='utf8') as f: text = f.read() print text
newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
What is the problem? When using codecs how to specify delimiter in my data? I just want to read data from file and put it in 2-dimensional array.
return ([f.decode(‘cp1251’) if isinstance(s, bytes) else f for f in row] for row in csv.reader(open(filename, «rb»),delimiter=»;»))
Чтение данных на русском языке из CSV
Разделитель ; символ. Я хочу читать данные и помещать их в массив. Я попытался прочитать эти данные, используя этот код:
def loadCsv(filename): lines = csv.reader(open(filename, "rb"),delimiter=";" ) dataset = list(lines) for i in range(len(dataset)): dataset[i] = [str(x) for x in dataset[i]] return dataset
mydata = loadCsv('krish(csv3).csv') print mydata
[['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-3, \xc0\xed\xee\xf1\xee\xe2\xe0 \x97 \xd2\xee\xeb\xe5 \xe1\xe8', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf3\xe3\xf3\xeb\xfc, \xc4\xf3\xeb\xe0\xf2\xe8 (\xcd\xe0\xe2\xee\xe8) \x97 \xd2\xee\xea\xf2\xe0\xe1\xe0\xe5\xe2\xe0', '\xc0\xeb\xec\xe0\xf2\xfb'], ['2-\xea\xee\xec\xed\xe0\xf2\xed\xe0\xff \xea\xe2\xe0\xf0\xf2\xe8\xf0\xe0, \xec\xea\xf0 \xd2\xe0\xf1\xf2\xe0\xea-2, \xd2\xeb\xe5\xed\xe4\xe8\xe5\xe2\xe0 \x97 \xd0\xe0\xe9\xfb\xec\xe1\xe5\xea\xe0', '\xc0\xeb\xec\xe0\xf2\xfb']]
import codecs with codecs.open('krish(csv3).csv','r',encoding='utf8') as f: text = f.read() print text
newchars, decodedbytes = self.decode(data, self.errors) UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 2: invalid continuation byte
В чем проблема? При использовании кодеков, как указать разделитель в моих данных? Я просто хочу прочитать данные из файла и поместить его в 2-мерный массив.
return ([f.decode(‘cp1251’) if isinstance(s, bytes) else f for f in row] for row in csv.reader(open(filename, «rb»),delimiter=»;»))