python 3 read csv UnicodeDecodeError

I have a very simple bit of code that takes in a CSV and puts it into a 2D array. It runs fine on Python 2, but in Python 3 I get the error below. Looking through the documentation, I think I need to use .decode(). Could someone please explain how to use it in the context of my code, and why I don't need to do anything in Python 2?
Error:
line 21, in
for row in datareader:
File "/usr/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 5002: invalid start byte
import csv
import sys
fullTable = sys.argv[1]
datareader = csv.reader(open(fullTable, 'r'), delimiter=',')
full_table = []
for row in datareader:
    full_table.append(row)
print(full_table)

open(sys.argv[1], encoding='ISO-8859-1')
The CSV contained characters that were not valid UTF-8, which seemed to be the default. I am, however, surprised that Python 2 dealt with this issue without any problems.
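To answer the "why" part: Python 2's open() and csv module work on raw bytes and never decode anything, while Python 3 opens files in text mode and decodes them (UTF-8 by default on most systems), which is where the error comes from. A minimal sketch of the fix, assuming the file is actually Latin-1 encoded (byte 0xa9 is the copyright sign in that encoding):
import csv
import sys

# Open with an explicit encoding. ISO-8859-1 maps every byte to a
# character, so decoding can never fail with UnicodeDecodeError.
with open(sys.argv[1], 'r', encoding='ISO-8859-1', newline='') as f:
    full_table = [row for row in csv.reader(f, delimiter=',')]
print(full_table)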

Related

How to read a gzip jsonl from a byte offset to debug a BigQuery error?

I am exporting JSON newline files into BigQuery, and the BigQuery errors give me a byte offset into the original gzipped JSONL file, such as:
JSON parsing error in row starting at position 727720: Repeated field must be imported as a JSON array. Field: named_entities.alt_form.
I have tried using the Python package indexed_gzip to read from the offset, but it sadly mangles the lines. I have also tried, unsuccessfully, to get the relevant line using the built-in Python gzip package:
import gzip
import ujson as json
f = open('myfile.json.gz', 'rb')
g = gzip.GzipFile(fileobj=f)
fasz = g.read()
byte_offset_to_line = {}
for line in g:
    byte_offset = f.tell()
    byte_offset_to_line[byte_offset] = line
target = 727720
ls = sorted([(abs(target-k),k) for k in byte_offset_to_line.keys() if k < target])
line_of_interest = byte_offset_to_line[ls[0][1]]
text = str(line_of_interest)
malformed_json = json.loads(text[2:-3])
With the above snippet I can get the nearest line's byte offset. But when I then tried uploading just that line to a test table in BQ, it sadly worked, so I think I am not getting the correct line.
Is there a better approach to this problem? To be honest, I am not sure why my snippet above doesn't work.
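One likely issue, assuming BigQuery reports positions in the uncompressed stream: f.tell() returns the position in the compressed file, not in the decompressed data, and the earlier g.read() call exhausts the stream before the loop even runs. A minimal sketch that tracks uncompressed offsets instead (file name as in the question):
import gzip

target = 727720  # byte offset reported by BigQuery

with gzip.open('myfile.json.gz', 'rb') as g:
    offset = 0  # uncompressed offset of the start of the current line
    for line in g:
        if offset <= target < offset + len(line):
            print(line.decode('utf-8'))
            break
        offset += len(line)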

Unable to print output of JSON code into a .csv file

I'm getting the following errors when trying to decode this data; the second error came after trying to compensate for the Unicode error:
Error 1:
write.writerows(subjects)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 160: ordinal not in range(128)
Error 2:
with open("data.csv", encode="utf-8", "w",) as writeFile:
SyntaxError: non-keyword arg after keyword arg
Code
import requests
import json
import csv
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encode="utf-8", "w",) as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
Using requests, and with the correction to the second part (shown below), I have no problem running this. I think your first error is a consequence of the second one being incorrect.
I am on Python 3 and can run your code with my fix to the open line and with
r = urllib.request.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
Personally, I would use requests:
import requests
import csv
data = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1').json()
subjects = []
for post in data['posts']:
    subjects.append([post['title'], post['episodeNumber'],
                     post['audioSource'], post['image']['large'], post['excerpt']['long']])
with open("data.csv", encoding="utf-8", mode="w") as writeFile:
    write = csv.writer(writeFile)
    write.writerows(subjects)
For your second error: looking at the documentation for the open function, you need to use the correct argument names, and name the mode argument if you are not matching arguments positionally.
with open("data.csv", encoding="utf-8", mode="w") as writeFile:

SPARK read.json throwing java.io.IOException: Too many bytes before newline

I am getting the following error on reading a large 6 GB single-line JSON file:
Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648
Spark does not read JSON files with newlines, hence the entire 6 GB JSON file is on a single line:
jf = sqlContext.read.json("jlrn2.json")
configuration:
spark.driver.memory 20g
Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.
Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant line from the Spark SQL Programming Guide:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
So if your JSON document is in the form...
[
{ [record] },
{ [record] }
]
You'll want to change it to
{ [record] }
{ [record] }
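A minimal sketch of that conversion (hypothetical file names; note that json.load reads the whole document into memory, so for a 6 GB file you would want a streaming parser such as ijson instead):
import json

# Rewrite a top-level JSON array as line-delimited JSON:
# one self-contained object per line, as Spark expects.
with open('records.json') as src, open('records.jsonl', 'w') as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + '\n')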
I stumbled upon this while reading a huge JSON file in PySpark and getting the same error. So, if anyone else is also wondering how to save a JSON file in a format that PySpark can read properly, here is a quick example using pandas:
import pandas as pd
from collections import defaultdict

# create some dict you want to dump
list_of_things_to_dump = [1, 2, 3, 4, 5]
dump_dict = defaultdict(list)
for number in list_of_things_to_dump:
    dump_dict["my_number"].append(number)

# save data like this using pandas; it will work off the bat with PySpark
output_df = pd.DataFrame.from_dict(dump_dict)
with open('my_fancy_json.json', 'w') as f:
    f.write(output_df.to_json(orient='records', lines=True))
After that loading JSON in PySpark is as easy as:
df = spark.read.json("hdfs:///user/best_user/my_fancy_json.json", schema=schema)

Python 3 Pandas Error: pandas.parser.CParserError: Error tokenizing data. C error: Expected 11 fields in line 5, saw 13

I checked out this answer as I am having a similar problem.
Python Pandas Error tokenizing data
However, for some reason ALL of my rows are being skipped.
My code is simple:
import pandas as pd
fname = "data.csv"
input_data = pd.read_csv(fname)
and the error I get is:
File "preprocessing.py", line 8, in <module>
input_data = pd.read_csv(fname) #raw data file ---> pandas.core.frame.DataFrame type
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 465, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 251, in _read
return parser.read()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 710, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 1154, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 754, in pandas.parser.TextReader.read (pandas/parser.c:7391)
File "pandas/parser.pyx", line 776, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7631)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._read_rows (pandas/parser.c:8253)
File "pandas/parser.pyx", line 816, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8127)
File "pandas/parser.pyx", line 1728, in pandas.parser.raise_parser_error (pandas/parser.c:20357)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 11 fields in line 5, saw 13
The solution is to use pandas' built-in delimiter sniffing by passing sep=None (sniffing requires the Python parser engine):
input_data = pd.read_csv(fname, sep=None, engine='python')
For those landing here: I got this error when the file was actually an .xls file, not a true .csv. Try resaving it as a CSV in a spreadsheet app.
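As an alternative sketch, assuming the file really is a legacy Excel workbook despite its extension, pandas can read it in place (this requires the xlrd package):
import pandas as pd

# Read the mislabeled .xls file directly instead of resaving it;
# xlrd identifies the file by content, not by extension.
input_data = pd.read_excel("data.csv", engine="xlrd")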
I had the same error. I read my csv data using this:
d1 = pd.read_csv('my.csv')
then I tried this:
d1 = pd.read_csv('my.csv', sep='\t')
and this time it was right.
So you could try this method if your delimiter is not ',': the default is ',', so if you don't specify it explicitly, it goes wrong.
pandas.read_csv
This error means you have an unequal number of columns across rows. In your case, the rows before line 5 had 11 columns, but line 5 has 13 inputs (columns).
For this problem, you can try the following approach to open and read your file:
import csv

with open('filename.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')  # if you have a csv file, use the comma delimiter
    for row in reader:
        print(row)
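Building on that, here is a small sketch (hypothetical filename) that reports only the rows whose field count differs from the header row, which makes the offending lines easy to find:
import csv

# Print the line number and field count of every row that does not
# match the header row, instead of printing the whole file.
with open('filename.csv', 'r', newline='') as file:
    reader = csv.reader(file, delimiter=',')
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print('line {}: expected {} fields, saw {}'.format(lineno, len(header), len(row)))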
This parsing error can occur for multiple reasons, and solutions for the different causes have been posted here as well as in Python Pandas Error tokenizing data.
I posted a solution to one possible reason for this error here: https://stackoverflow.com/a/43145539/6466550
I have had similar problems. With my csv files, it occurred because they were created in R, so they had some extra commas and different spacing than a "regular" csv file.
I found that if I did a read.table in R, I could then save it using write.csv with the option row.names = F.
I could not get any of the read options in pandas to help me.
The problem could be that one or more rows of the csv file contain more delimiters (commas) than expected. It is solved when each row matches the number of delimiters in the first line of the csv file, where the column names are defined.
Use \t+ in the separator pattern instead of \t:
import pandas as pd

fname = "data.csv"
# A regex separator requires the Python parser engine.
input_data = pd.read_csv(fname, sep='\t+', engine='python', header=None)

How do you read a file inside a zip file as text, not bytes?

A simple program for reading a CSV file inside a ZIP archive:
import csv, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
for row in csv.DictReader(items_file):
    pass
works in Python 2.7:
$ python2.7 test_zip_file_py3k.py ~/data.zip
$
but not in Python 3.2:
$ python3.2 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 8, in <module>
for row in csv.DictReader(items_file):
File "/somedir/python3.2/csv.py", line 109, in __next__
self.fieldnames
File "/somedir/python3.2/csv.py", line 96, in fieldnames
self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not bytes (did you open the file
in text mode?)
The csv module in Python 3 wants to see a text file, but zipfile.ZipFile.open returns a zipfile.ZipExtFile that is always treated as binary data.
How does one make this work in Python 3?
I just noticed that Lennart's answer didn't work with Python 3.1, but it does work with Python 3.2. They enhanced zipfile.ZipExtFile in Python 3.2 (see the release notes). These changes appear to make zipfile.ZipExtFile work nicely with io.TextIOWrapper.
Incidentally, it works in Python 3.1 if you uncomment the hacky lines below to monkey-patch zipfile.ZipExtFile, not that I would recommend this sort of hackery. I include it only to illustrate the essence of what was done in Python 3.2 to make things work nicely.
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
# items_file.readable = lambda: True
# items_file.writable = lambda: False
# items_file.seekable = lambda: False
# items_file.read1 = items_file.read
items_file = io.TextIOWrapper(items_file)
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0} -- row = {1}'.format(idx, row))
If I had to support py3k < 3.2, then I would go with the solution in my other answer.
Update for 3.6+
Starting with 3.6, support for mode='U' was removed:
Changed in version 3.6: Removed support of mode='U'. Use io.TextIOWrapper for reading compressed text files in universal newlines mode.
Starting with 3.8, a Path object was added, which gives us an open() method we can call like the built-in open() function (passing newline='' in the case of our CSV) to get back an io.TextIOWrapper object that the csv readers accept. See Yuri's answer below.
You can wrap it in an io.TextIOWrapper.
items_file = io.TextIOWrapper(items_file, encoding='your-encoding', newline='')
Should work.
And if you just want to read a file into a string:
from zipfile import ZipFile

with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        eggs = myfile.read().decode('UTF-8')
Lennart's answer is on the right track (Thanks, Lennart, I voted up your answer) and it almost works:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(items_file, encoding='iso-8859-1', newline='')
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Traceback (most recent call last):
File "test_zip_file_py3k.py", line 7, in <module>
items_file = io.TextIOWrapper(items_file,
encoding='iso-8859-1',
newline='')
AttributeError: readable
The problem appears to be that io.TextIOWrapper's first required parameter is a buffer, not a file object.
This appears to work:
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
This seems a little complex, and it also seems annoying to have to read a whole (perhaps huge) zip file into memory. Any better way?
Here it is in action:
$ cat test_zip_file_py3k.py
import csv, io, sys, zipfile
zip_file = zipfile.ZipFile(sys.argv[1])
items_file = zip_file.open('items.csv', 'rU')
items_file = io.TextIOWrapper(io.BytesIO(items_file.read()))
for idx, row in enumerate(csv.DictReader(items_file)):
    print('Processing row {0}'.format(idx))
$ python3.1 test_zip_file_py3k.py ~/data.zip
Processing row 0
Processing row 1
Processing row 2
...
Processing row 250
Starting with Python 3.8, the zipfile module has the Path object, which we can use with its open() method to get an io.TextIOWrapper object, which can be passed to the csv readers:
import csv, sys, zipfile
# Give a string path to the ZIP archive, and
# the archived file to read from
items_zipf = zipfile.Path(sys.argv[1], at='items.csv')
# Then use the open method, like you'd usually
# use the built-in open()
items_f = items_zipf.open(newline='')
# Pass the TextIO-like file to your reader as normal
for row in csv.DictReader(items_f):
    print(row)
Here's a minimal recipe to open a zip file and read a text file inside that zip. I found the trick to be the TextIOWrapper read() method, not mentioned in any of the answers above (BytesIO.read() was mentioned, but the Python docs recommend TextIOWrapper).
import zipfile
import io
# Create the ZipFile object
zf = zipfile.ZipFile('my_zip_file.zip')
# Read a file that is inside the zip...reads it as a binary file-like object
my_file_binary = zf.open('my_text_file_inside_zip.txt')
# Convert the binary file-like object directly to text using TextIOWrapper and its read() method
my_file_text = io.TextIOWrapper(my_file_binary, encoding='utf-8', newline='').read()
I wish they had kept the mode='U' parameter in the ZipFile open() method to do this same thing, since it was so succinct, but, alas, it is obsolete.
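For reference, a short sketch of the modern equivalent of mode='U' (hypothetical file names): io.TextIOWrapper performs universal-newline translation by default, and iterating over it streams the file line by line without loading it all into memory:
import io
import zipfile

with zipfile.ZipFile('my_zip_file.zip') as zf:
    with io.TextIOWrapper(zf.open('my_text_file_inside_zip.txt'),
                          encoding='utf-8') as text_file:
        for line in text_file:
            print(line, end='')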