Line-delimited JSON format txt file, how to import with pandas

I have a line-delimited JSON file saved with a .txt extension, and I want to import it with pandas. I tried the usual approaches:
df = pd.read_csv('df.txt')
df = pd.read_json('df.txt')
df = pd.read_fwf('df.txt')
but none of them works. read_csv raises
ParserError: Error tokenizing data. C error: Expected 29 fields in line 1354, saw 34
read_json raises
ValueError: Trailing data
and read_fwf does return the data, but organized in a weird way where the column name sits on the left, next to the data.
Can anyone tell me how to solve this?

pd.read_json('df.txt', lines=True)
read_json accepts a boolean argument lines, which tells it to read the file as one JSON object per line.
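For example, a minimal sketch (the file df.txt and its two records are made up for illustration):
import pandas as pd

# a hypothetical line-delimited JSON file: one object per line
with open('df.txt', 'w') as f:
    f.write('{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}\n')

df = pd.read_json('df.txt', lines=True)
print(df)
#    a  b
# 0  1  x
# 1  2  y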

Related

Error trying to open json file [json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes]

I'm trying to open a json file using the json library in Python 3.8 but I have not succeeded.
This is my MWE:
import json

with open(pbit_path + file_name, 'r') as f:
    data = json.load(f)
print(data)
where pbit_path and file_name together form the absolute path of the .json file. As an example, this is a sample of the .json file I'm trying to open.
https://github.com/pwnaoj/desktop-tutorial/blob/master/DataModelSchema.json
Error returned
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I have also tried using the functions loads(), dump(), dumps().
I appreciate any suggestions
Thanks in advance.
I found a solution to my problem. It is an encoding issue: the file I am trying to read is encoded as UCS-2 (UTF-16 little-endian), so in Python:
import json

with open(file_path, mode='r', encoding='utf_16_le') as f:
    data = f.read()
data = json.loads(data)
# no explicit close() needed: the with block closes the file
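If you are unsure which encoding a file uses, one quick check is to look at its first bytes for a byte order mark (a minimal sketch; file_path is the same path variable as above, and the BOM values are the standard ones):
# read the first bytes and compare against well-known byte order marks
with open(file_path, 'rb') as f:
    head = f.read(4)

if head.startswith(b'\xff\xfe'):
    print('UTF-16 little-endian (or UTF-32 LE if followed by two null bytes)')
elif head.startswith(b'\xfe\xff'):
    print('UTF-16 big-endian')
elif head.startswith(b'\xef\xbb\xbf'):
    print('UTF-8 with BOM')
else:
    print('no BOM found; possibly plain UTF-8/ASCII')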

Convert dataframe to JSON using Python

I have been trying to convert a dataframe to JSON using Python. I am able to do it successfully, but I am not getting the required format of JSON.
Code -
df1 = df.rename_axis('CUST_ID').reset_index()
df.to_json('abc.json')
Here, abc.json is the filename of JSON and df is the required dataframe.
What I am getting -
{"CUST_LAST_UPDATED":
{"1000":1556879045879.0,"1001":1556879052416.0},
"CUST_NAME":{"1000":"newly
updated_3_file","1001":"heeloo1"}}
What I want -
[{"CUST_ID":1000,"CUST_NAME":"newly
updated_3_file","CUST_LAST_UPDATED":1556879045879},
{"CUST_ID":1001,"CUST_NAME":"heeloo1","CUST_LAST_UPDATED":1556879052416}]
Error -
Traceback (most recent call last):
  File "C:/Users/T/PycharmProject/test_pandas.py", line 19, in <module>
    df1 = df.rename_axis('CUST_ID').reset_index()
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 3379, in reset_index
    new_obj.insert(0, name, level_values)
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2613, in insert
    allow_duplicates=allow_duplicates)
  File "C:\Users\T\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\internals.py", line 4063, in insert
    raise ValueError('cannot insert {}, already exists'.format(item))
ValueError: cannot insert CUST_ID, already exists
df.head() Output -
   CUST_ID  CUST_LAST_UPDATED             CUST_NAME
0     1000      1556879045879  newly updated_3_file
1     1001      1556879052416               heeloo1
How to change the format while converting dataframe to JSON?
Use DataFrame.rename_axis with DataFrame.reset_index to create a column from the index, and then DataFrame.to_json with orient='records':
df1 = df.rename_axis('CUST_ID').reset_index()
df1.to_json('abc.json', orient='records')
[{"CUST_ID":"1000",
"CUST_LAST_UPDATED":1556879045879.0,
"CUST_NAME":"newly updated_3_file"},
{"CUST_ID":"1001",
"CUST_LAST_UPDATED":1556879052416.0,
"CUST_NAME":"heeloo1"}]
EDIT:
Because the data already has a default index and CUST_ID is an ordinary column (which is why reset_index raised the error above), skip the rename_axis/reset_index step and use:
df.to_json('abc.json', orient='records')
Verify:
print (df.to_json(orient='records'))
[{"CUST_ID":1000,
"CUST_LAST_UPDATED":1556879045879,
"CUST_NAME":"newly pdated_3_file"},
{"CUST_ID":1001,
"CUST_LAST_UPDATED":1556879052416,
"CUST_NAME":"heeloo1"}]
You can convert a dataframe to JSON format using to_dict:
df1.to_dict('records')
The output will be the one that you need.
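For instance, a minimal sketch using the sample data from the question:
import pandas as pd

df1 = pd.DataFrame({'CUST_ID': [1000, 1001],
                    'CUST_LAST_UPDATED': [1556879045879, 1556879052416],
                    'CUST_NAME': ['newly updated_3_file', 'heeloo1']})

print(df1.to_dict('records'))
# [{'CUST_ID': 1000, 'CUST_LAST_UPDATED': 1556879045879, 'CUST_NAME': 'newly updated_3_file'},
#  {'CUST_ID': 1001, 'CUST_LAST_UPDATED': 1556879052416, 'CUST_NAME': 'heeloo1'}]
Note that to_dict returns Python objects rather than a JSON string, so for writing a file, df1.to_json('abc.json', orient='records') remains the more direct route.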
If the dataframe has NaN values in some rows and you don't want them in your JSON file, the script below drops them per row:
import argparse
import json

import pandas as pd

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--csv")
    parser.add_argument("--json")
    args = parser.parse_args()

    entities = pd.read_csv(args.csv)
    # dropna() on each row removes that row's NaN entries before conversion
    json_data = [row.dropna().to_dict() for index, row in entities.iterrows()]
    with open(args.json, "w") as file:
        json.dump(json_data, file)
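Invoked from a shell, assuming the script is saved as csv_to_json.py (a hypothetical name):
python csv_to_json.py --csv input.csv --json output.json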

how to edit large json files with pandas?

I have a 200 MB txt file which includes roughly 25k JSON objects (metadata and the content of newspaper articles). Now I want to manipulate the data so that the file is smaller and only contains data relevant for my analysis (only 3 out of 16 columns).
Question:
How to delete/drop columns in a pandas dataframe and save these changes to the .json file?
JSON:
{"_version_":1609422219455234049,
"content": " abc ",
"docType":"shNews",
"id":"SNW_000050a3-38c6-4794-8e73-3ab3464be248",
"publishDate":"2017-08-16T16:01:018Z",
"stakeholderId":482,
"status":"BlackListed",
"systemDate":"2017-08-16T17:42:010Z"
"tags2":"type_de_Institution;subtype_de_Administration;industry_de_Staat;continent_de_Europa;country_de_Deutschland;level_de_National;highrelevance_eu_0;"
,"title":"Waffen schaffen keine Sicherheit. Von Außenminister Sigmar Gabriel",
"url":"http://www.auswaertiges-amt.de/sid_A5AB4A9D659FF8612B357392137BE7EB/DE/Infoservice/Presse/Interviews/2017/170816-BM_Rheinische_Post.html"}
Code:
import pandas as pd
articles=pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
print (type (articles))
df = pd.DataFrame(articles)
df[df['tags2'].str.contains('country_de_Deutschland')==True]
I already tried this:
df.to_json ("example_name.json")
The actual result of the line I tried is a JSON file that is larger than the original, and Atom cannot read it. Moreover, the changes I made in the dataframe (deleting/dropping columns) are not applied to the .json file on my PC.
import pandas as pd

# read_json already returns a DataFrame, so no extra pd.DataFrame(...) call is needed
df = pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
print(type(df))
# you forgot to reassign df, so your filter never changed it
df = df[df['tags2'].str.contains('country_de_Deutschland')==True]
df.to_json("example_name.json")

Convert Pandas DataFrame to JSON format

I have a Pandas DataFrame with two columns – one with the filename and one with the hour in which it was generated:
File Hour
F1 1
F1 2
F2 1
F3 1
I am trying to convert it to a JSON file with the following format:
{"File":"F1","Hour":"1"}
{"File":"F1","Hour":"2"}
{"File":"F2","Hour":"1"}
{"File":"F3","Hour":"1"}
When I use the command DataFrame.to_json(orient = "records"), I get the records in the below format:
[{"File":"F1","Hour":"1"},
{"File":"F1","Hour":"2"},
{"File":"F2","Hour":"1"},
{"File":"F3","Hour":"1"}]
I'm just wondering whether there is an option to get the JSON file in the desired format. Any help would be appreciated.
The output you get from DF.to_json is a string, so you can slice off the surrounding brackets and replace the commas between records with newlines to match the requested format:
out = df.to_json(orient='records')[1:-1].replace('},{', '}\n{')
To write the output to a text file, you could do:
with open('file_name.txt', 'w') as f:
    f.write(out)
In newer versions of pandas (0.20.0+, I believe), this can be done directly:
df.to_json('temp.json', orient='records', lines=True)
Direct compression is also possible:
df.to_json('temp.json.gz', orient='records', lines=True, compression='gzip')
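To read such a file back later (a sketch; read_json gained compression support around the same versions):
import pandas as pd

df = pd.read_json('temp.json', lines=True)
# the compressed variant; recent pandas infers gzip from the extension
df = pd.read_json('temp.json.gz', lines=True, compression='infer')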
I think what the OP is looking for is:
with open('temp.json', 'w') as f:
    f.write(df.to_json(orient='records', lines=True))
This should do the trick.
Use this one-liner to convert a pandas DataFrame to a list of dictionaries:
import json

# df is your DataFrame; the dumps/loads round trip yields plain lists and dicts
json_list = json.loads(json.dumps(list(df.T.to_dict().values())))
Try this one:
json.dumps(json.loads(df.to_json(orient="records")))
Convert the dataframe to a list of dictionaries:
list_dict = []
for index, row in df.iterrows():
    list_dict.append(dict(row))
Then save the file:
with open("output.json", "w") as f:
    f.write("\n".join(str(item) for item in list_dict))
To transform a DataFrame into real JSON (not a string) I use:
from io import StringIO
import json

buff = StringIO()
# df is your DataFrame
df.to_json(path_or_buf=buff, orient='records')
dfJson = json.loads(buff.getvalue())
Instead of using dataframe.to_json(orient='records'), use dataframe.to_json(orient='index').
This converts the dataframe into JSON shaped like a dict of {index -> {column -> value}}.
Here is a small utility class that converts JSON to a DataFrame and back; hope you find it helpful.
# -*- coding: utf-8 -*-
from pandas.io.json import json_normalize  # in pandas 1.0+: from pandas import json_normalize


class DFConverter:

    # Converts the input JSON to a DataFrame
    def convertToDF(self, dfJSON):
        return json_normalize(dfJSON)

    # Converts the input DataFrame to JSON
    def convertToJSON(self, df):
        resultJSON = df.to_json(orient='records')
        return resultJSON
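A quick usage sketch (the sample records are made up for illustration):
conv = DFConverter()
records = [{"File": "F1", "Hour": 1}, {"File": "F2", "Hour": 1}]
df = conv.convertToDF(records)
print(conv.convertToJSON(df))
# [{"File":"F1","Hour":1},{"File":"F2","Hour":1}]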

Python 3 Pandas Error: pandas.parser.CParserError: Error tokenizing data. C error: Expected 11 fields in line 5, saw 13

I checked out this answer as I am having a similar problem.
Python Pandas Error tokenizing data
However, for some reason ALL of my rows are being skipped.
My code is simple:
import pandas as pd
fname = "data.csv"
input_data = pd.read_csv(fname)
and the error I get is:
File "preprocessing.py", line 8, in <module>
input_data = pd.read_csv(fname) #raw data file ---> pandas.core.frame.DataFrame type
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 465, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 251, in _read
return parser.read()
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 710, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/pandas/io/parsers.py", line 1154, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 754, in pandas.parser.TextReader.read (pandas/parser.c:7391)
File "pandas/parser.pyx", line 776, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7631)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._read_rows (pandas/parser.c:8253)
File "pandas/parser.pyx", line 816, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8127)
File "pandas/parser.pyx", line 1728, in pandas.parser.raise_parser_error (pandas/parser.c:20357)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 11 fields in line 5, saw 13
The solution is to use pandas' built-in delimiter sniffing:
input_data = pd.read_csv(fname, sep=None, engine='python')
sep=None requires the Python parsing engine (the C engine cannot sniff the separator); passing engine='python' explicitly avoids the fallback warning.
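If you prefer to detect the delimiter explicitly before handing the file to pandas, the standard library's csv.Sniffer can do it (a minimal sketch):
import csv
import pandas as pd

with open(fname, 'r') as f:
    dialect = csv.Sniffer().sniff(f.read(4096))  # sample the first few KB

input_data = pd.read_csv(fname, sep=dialect.delimiter)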
For those landing here: I got this error when the file was actually an .xls file, not a true .csv. Try resaving it as a CSV in a spreadsheet app.
I had the same error. I read my csv data using this:
d1 = pd.read_csv('my.csv')
then I tried this:
d1 = pd.read_csv('my.csv', sep='\t')
and this time it worked.
So you could try this method if your delimiter is not ','; the default is ',', and if you don't specify the real one explicitly, parsing goes wrong.
pandas.read_csv
This error means the rows have unequal numbers of columns: based on the earlier lines the parser expected 11 fields per row, but line 5 contains 13.
For this problem, you can try the following approach to open and read your file:
import csv

with open('filename.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')  # if you have a csv file, use the comma delimiter
    for row in reader:
        print(row)
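Once the offending rows are identified this way, pandas can also be told to skip them outright (a sketch; on_bad_lines arrived in pandas 1.3, and older versions used error_bad_lines=False):
import pandas as pd

# skip any row whose field count does not match the header
input_data = pd.read_csv('filename.csv', on_bad_lines='skip')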
This parsing error could occur for multiple reasons and solutions to the different reasons have been posted here as well as in Python Pandas Error tokenizing data.
I posted a solution to one possible reason for this error here: https://stackoverflow.com/a/43145539/6466550
I have had similar problems. With my csv files it occurred because they were created in R, which left some extra commas and different spacing than a "regular" csv file.
I found that if I did a read.table in R, I could then save it using write.csv with the option row.names = F.
I could not get any of the read options in pandas to help me.
The problem could be that one or more rows of the csv file contain more delimiters (commas) than expected. It is solved when each row matches the number of delimiters in the first line of the csv file, where the column names are defined.
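A common cause is an unquoted field that itself contains a comma; quoting the field keeps the column count consistent. A small illustration with made-up data:
import pandas as pd

# an unquoted row like   2,hello, world   would have 3 fields instead of 2;
# quoting the field restores 2 columns per row
with open('fixed.csv', 'w') as f:
    f.write('id,title\n1,plain title\n2,"hello, world"\n')

df = pd.read_csv('fixed.csv')  # parses cleanly: 2 columns, 2 rows
print(df)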
Use \t+ in the separator pattern instead of \t:
import pandas as pd

fname = "data.csv"
# a regex separator like '\t+' requires the Python parsing engine
input_data = pd.read_csv(fname, sep='\t+', engine='python', header=None)