I'm getting UnicodeDecodeError when trying to load a JSON file into a dataframe

So, I'm using the following code to get pandas to read my JSON text file:
import json
import pandas as pd

f = open('C:/Users/stans/WFH Project/data.json')
data = json.load(f)
df = pd.DataFrame(data, index=[0])
f.close()
Once I execute the cell, I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1535: character maps to <undefined>
I used the above code for a smaller sample of JSON data and it worked. But since I updated the file to include a much larger sample, I get that error.
I verified that the JSON format is correct, and I also tried passing the following to the open statement:
encoding='utf-8'
and
errors='ignore'
Both produced value errors. Any ideas? Thanks in advance for your help!
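A 'charmap' error like this usually means Python is falling back to the Windows default code page rather than the file's real encoding. As a rough way to narrow it down, here is a minimal sketch that tries a few common encodings until one both decodes and parses; the candidate list is an assumption, not something from the question:
import json
import pandas as pd

CANDIDATES = ('utf-8', 'utf-8-sig', 'cp1252', 'latin-1')  # assumed guesses
data = None
for enc in CANDIDATES:
    try:
        with open('C:/Users/stans/WFH Project/data.json', encoding=enc) as f:
            data = json.load(f)   # fails on either a decode or a JSON error
        print('decoded successfully with', enc)
        break
    except (UnicodeDecodeError, ValueError):
        continue

if data is None:
    raise RuntimeError('none of the candidate encodings worked')
df = pd.DataFrame(data, index=[0])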

Related

In Python 3.6 JSON module why do I have to use both loads and load?

I am trying to persist some data to disk. I am attempting to use Python's JSON module, but I can't access the data with a simple json.load and I can't figure out why. Here's my code:
import json

jsondata = json.dumps({'a': 1,
                       'b': 'string',
                       'c': {'k1': (1, 3), 'k2': (12, 3)}})
f = open('jsonfile.json', 'w')
json.dump(jsondata, f)
f.close()

g = open('jsonfile.json', 'r')
result = json.load(g)
g.close()
print(result['b'])
This gives me the error "TypeError: string indices must be integers".
However, if I replace the access block with
g = open('jsonfile.json', 'r')
result = json.loads(json.load(g))
g.close()
print(result['b'])
It gives me the result I expect. I have read through the documentation a number of times and it seems like the simple json.load by itself should be sufficient. I can't figure out why I would have to use json.loads as well. I feel like I'm missing something. Any insight would be welcome.
Thanks, петр костюкевич.
The problem was that I was converting the data to a string before the dump, so I needed to convert it back. This code worked:
import json

jsondata = {'a': 1,
            'b': 'string',
            'c': {'k1': (1, 3), 'k2': (12, 3)}}
f = open('jsonfile.json', 'w')
json.dump(jsondata, f)   # dump the dict directly; no json.dumps() first
f.close()

g = open('jsonfile.json', 'r')
result = json.load(g)   # note: the tuples come back as lists, since JSON has no tuple type
g.close()
print(result['b'])
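For anyone else puzzled by this: json.dumps() already serializes to a str, so feeding its output to json.dump() serializes twice, and the file ends up holding a JSON string rather than a JSON object. A minimal demonstration of the double encoding:
import json

obj = {'b': 'string'}
double = json.dumps(json.dumps(obj))    # serialize twice, as the original code did
print(double)                     # "{\"b\": \"string\"}" -- a JSON string, not an object
print(type(json.loads(double)))   # <class 'str'>, so result['b'] fails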

How to open and read a JSON file?

I have a JSON file, but it weighs 186 MB. I tried to read it via Python:
import json

f = open('file.json', 'r')
r = json.loads(f.read())
ValueError: Extra data: line 88 column 2 -...
How do I open it? Help me.
Your JSON file isn't a JSON file, it's several JSON files mashed together.
The first instance of this occurs at the 1630070th character:
'шова"}]}]}{"response":[{"count'
^ here
That said, jq appears to be able to handle it, so the individual parts are fine.
You'll need to split the file at the boundaries of the individual JSON objects. Try catching the JSONDecodeError and use its .colno to slice the text into correct chunks.
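A sketch of that splitting approach, using json.JSONDecoder.raw_decode(), which parses one document and reports the index where it stopped (the filename is the one from the question):
import json

def split_concatenated_json(text):
    # Parse a string holding several JSON documents mashed together.
    decoder = json.JSONDecoder()
    idx, documents = 0, []
    while idx < len(text):
        obj, end = decoder.raw_decode(text, idx)   # parse one document
        documents.append(obj)
        idx = end
        while idx < len(text) and text[idx].isspace():
            idx += 1                               # skip whitespace between documents
    return documents

with open('file.json', 'r') as f:
    parts = split_concatenated_json(f.read())
print(len(parts))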
It should be:
r = json.load(f)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am trying to read Twitter data from a JSON file using Python 2.7.12.
The code I used is:
import json
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def get_tweets_from_file(file_name):
    tweets = []
    with open(file_name, 'rw') as twitter_file:
        for line in twitter_file:
            if line != '\r\n':
                line = line.encode('ascii', 'ignore')
                tweet = json.loads(line)
                if u'info' not in tweet.keys():
                    tweets.append(tweet)
    return tweets
Result I got:
Traceback (most recent call last):
  File "twitter_project.py", line 100, in <module>
    main()
  File "twitter_project.py", line 95, in main
    tweets = get_tweets_from_dir(src_dir, dest_dir)
  File "twitter_project.py", line 59, in get_tweets_from_dir
    new_tweets = get_tweets_from_file(file_name)
  File "twitter_project.py", line 71, in get_tweets_from_file
    line = line.encode('ascii', 'ignore')
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
I went through all the answers from similar issues and came up with this code and it worked last time. I have no clue why it isn't working now.
In my case (macOS), there was a .DS_Store file in my data folder, a hidden and auto-generated file, and it caused the issue. I was able to fix the problem after removing it.
It doesn't help that you have sys.setdefaultencoding('utf-8'), which is confusing things further - It's a nasty hack and you need to remove it from your code.
See https://stackoverflow.com/a/34378962/1554386 for more information
The error is happening because line is a str and you're calling encode(). encode() only makes sense if the string is Unicode, so Python tries to convert it to Unicode first using the default encoding, which in your case is UTF-8 (but should be ASCII). Either way, 0x80 is not valid ASCII or UTF-8, so it fails.
0x80 is valid in some character sets. In windows-1252/cp1252 it's €.
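For instance, the offending byte decodes cleanly under cp1252 in a Python 2 shell:
>>> '\x80'.decode('cp1252')   # byte 0x80 interpreted as windows-1252
u'\u20ac'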
The trick here is to understand the encoding of your data all the way through your code. At the moment, you're leaving too much up to chance. Unicode string types are a handy Python feature that allow you to decode encoded strings and forget about the encoding until you need to write or transmit the data.
Use the io module to open the file in text mode and decode the file as it goes - no more .decode()! You need to make sure the encoding of your incoming data is consistent. You can either re-encode it externally or change the encoding in your script. Here I've set the encoding to windows-1252:
import io

with io.open(file_name, 'r', encoding='windows-1252') as twitter_file:
    for line in twitter_file:
        # line is now a <type 'unicode'>
        tweet = json.loads(line)
The io module also provides Universal Newlines. This means \r\n is detected as a newline, so you don't have to watch for it.
For others who come across this question due to the error message, I ran into this error trying to open a pickle file when I opened the file in text mode instead of binary mode.
This was the original code:
import pickle as pkl

with open(pkl_path, 'r') as f:
    obj = pkl.load(f)
And this fixed the error:
import pickle as pkl

with open(pkl_path, 'rb') as f:   # 'rb': pickle data must be read in binary mode
    obj = pkl.load(f)
I got a similar error by accidentally trying to read a parquet file as a CSV:
pd.read_csv('file.parquet')       # wrong: raises the decode error
pd.read_parquet('file.parquet')   # right
The error occurs when you are trying to read a tweet containing a sentence like
"#Mike http:\www.google.com \A8&^)((&() how are&^%()( you"
which cannot be read as a normal string; you are supposed to read it as a raw string.
But converting to a raw string still gives an error, so I suggest you read the JSON file like this instead:
import codecs
import json

keys = []
fulldata = []
with codecs.open('tweetfile', 'rU', 'utf-8') as f:
    for line in f:
        data = json.loads(line)
        print data["tweet"]
        keys.append(data["id"])
        fulldata.append(data["tweet"])
This will load the data from the JSON file.
You can also write it to a CSV using pandas:
import pandas as pd

output = pd.DataFrame(data={"tweet": fulldata, "id": keys})
output.to_csv("tweets.csv", index=False, quoting=1)
Then read from the CSV to avoid the encoding and decoding problems.
Hope this helps you solve your problem.
Midhun

Saving Pandas DataFrame and meta-data to JSON format

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)
Background
A) I can successfully read/write my rather large Pandas Dataframe from/to JSON using DataFrame.to_json() and DataFrame.from_json(). No problems.
B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()
My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)
Failure modes
C) I have found that even the simplified case of
df_dict = df.to_dict()
json.dump(df_dict, fp)
fails with:
TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
D) Investigating, I've found that the complement also fails.
df.to_json(fp)
json.load(fp)
fails with
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)
So it appears that the Pandas JSON format and Python's JSON library are not compatible.
My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.
Question
What is the canonical/recommended method for adding metadata to a Pandas DataFrame and storing it to a JSON-formatted file?
Python 2.7.10
Pandas 0.17
Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.
Lesson: Read the documentation kids, even when it's your own documentation.
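In code, the lesson from Edit 1 amounts to flattening the MultiIndex before serializing. A minimal sketch (the sample frame is invented for illustration):
import json
import pandas as pd

# A small Multi-Indexed frame, standing in for the real data.
df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.MultiIndex.from_product([['US', 'CA'], [2014, 2015]],
                                                   names=['country', 'year']))

flat = df.reset_index()   # MultiIndex -> ordinary columns; avoids tuple keys
top_level_dict = {'data': flat.to_dict(), 'metadata': {'some': 'stuff'}}
with open('test.json', 'w') as fp:
    json.dump(top_level_dict, fp)   # no more "key ... is not a string" TypeError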
Edit 2:
If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.
You should be able to just put the data on separate lines.
Writing:
f = open('test.json', 'w')
df.to_json(f)
print >> f   # writes a newline separator (Python 2 syntax)
json.dump(metadata, f)
f.close()
Reading:
f = open('test.json')
df = pd.read_json(next(f))
metadata = json.loads(next(f))
f.close()
In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.
In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.
If you similarly need to store the data and metadata in a single JSON blob, the following code will work:
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
    json.dump(top_level_dict, outfile)
Just make sure the DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.
Reading the data back in:
with open(FILENAME, 'r') as infile:
    top_level_dict = json.load(infile)
df_as_dict = top_level_dict.pop('data', {})
df = pandas.DataFrame.from_dict(df_as_dict)
meta = top_level_dict['metadata']
At this point, you'll need to re-create your Multi-Index (if applicable)
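A sketch of that re-creation step, assuming the index levels were flattened into ordinary columns before saving (the column names are hypothetical):
# 'country' and 'year' are invented index-column names; use your own.
df = df.set_index(['country', 'year'])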

How to read .csv file that contains utf-8 values by pandas dataframe

I'm trying to read a .csv file that contains UTF-8 data in some of its columns, reading it into a pandas DataFrame. The code is as follows:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14:invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3:invalid continuation byte
Could you please share your ideas and experience with such problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
I attached a small sample of the raw data to the question; I cannot put a full record because of the legal restrictions on the medical data.
I had this problem for no apparent reason. I managed to get it to work using this:
df = pd.read_csv('file', encoding="ISO-8859-1")
Not sure why, though.
I also did as Irh09 proposed, but the second file it read was decoded wrongly and it couldn't find a column containing accented characters (á, é, í, ó, ú).
So I recommend wrapping the attempt in a try/except like this:
try:
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv('file', encoding="ISO-8859-1")
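One caveat: the ISO-8859-1 fallback can never raise a decode error, because every possible byte value is valid in that encoding, so a wrong guess shows up as garbled characters rather than an exception. If you'd rather detect the encoding, here is a minimal sketch using the third-party chardet package (an assumption; it is not mentioned in the thread):
import chardet
import pandas as pd

# Sniff the encoding from the first chunk of raw bytes.
with open('Cancer_training.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

df = pd.read_csv('Cancer_training.csv', encoding=guess['encoding'])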