how to edit large json files with pandas? - json

I have a 200mb txt file which includes roughly about 25k JSON files (metadata and the content of newspaper articles). Now i want to manipulate the data so that the file is smaller and it only contains such data which is relevant for my analysis (only 3 out of 16 columns).
Question:
How to delete/drop columns in pandas dataframe and safe these changes to the .json file?
JSON:
{"_version_":1609422219455234049,
"content": " abc ",
"docType":"shNews",
"id":"SNW_000050a3-38c6-4794-8e73-3ab3464be248",
"publishDate":"2017-08-16T16:01:018Z",
"stakeholderId":482,
"status":"BlackListed",
"systemDate":"2017-08-16T17:42:010Z"
"tags2":"type_de_Institution;subtype_de_Administration;industry_de_Staat;continent_de_Europa;country_de_Deutschland;level_de_National;highrelevance_eu_0;"
,"title":"Waffen schaffen keine Sicherheit. Von Außenminister Sigmar Gabriel",
"url":"http://www.auswaertiges-amt.de/sid_A5AB4A9D659FF8612B357392137BE7EB/DE/Infoservice/Presse/Interviews/2017/170816-BM_Rheinische_Post.html"}
Code:
import pandas as pd
articles=pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
print (type (articles))
df = pd.DataFrame(articles)
df[df['tags2'].str.contains('country_de_Deutschland')==True]
i already tried this:
df.to_json ("example_name.json")
The actual result of the line i tried is a json file which is larger than the original file and atom cannot read it out. Moreover the changes i made in the dataframe (del/drop of columns) are not applied to the .json file on my pc.

import pandas as pd
df = pd.read_json('/Users/Flo/export_harnisch.json', lines=True, orient='columns')
# read_json should convert things into dataframe already
print(type(articles))
# you forgot to re assign df
df = df[df['tags2'].str.contains('country_de_Deutschland')==True]
df.to_json("example_name.json")

Related

Pandas parallel URL downloads with pd.read_html

I know I can download a csv file from a web page by doing:
import pandas as pd
import numpy as np
from io import StringIO
URL = "http://www.something.com"
data = pd.read_html(URL)[0].to_csv(index=False, header=True)
file = pd.read_csv(StringIO(data), sep=',')
Now I would like to do the above for more URLs at the same time, like when you open different tabs in your browser. In other words, a way to parallelize this when you have different URLs, instead of looping through or doing it one at a time. So, I thought of having a series of URLs inside a dataframe, and then create a new column which contains the strings 'data', one for each URL.
list_URL = ["http://www.something.com", "http://www.something2.com",
"http://www.something3.com"]
df = pd.DataFrame(list_URL, columns =['URL'])
df['data'] = pd.read_html(df['URL'])[0].to_csv(index=False, header=True)
But it gives me error: cannot parse from 'Series'
Is there a better syntax, or does this mean I cannot do this in parallel for more than one URL?
You could try like this:
import pandas as pd
URLS = [
"https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
"https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]
df = pd.DataFrame(URLS, columns=["URL"])
df["data"] = df["URL"].map(
lambda x: pd.read_html(x)[0].to_csv(index=False, header=True)
)
print(df)
# Output
URL data
0 https://en.wikipedia.org/wiki/Periodic_t... 0\r\nPart of a series on the\r\nPeriodic...
1 https://en.wikipedia.org/wiki/Planet#Pla... 0\r\n"The eight known planets of the Sol...

Reading Json files using pyspark

I am trying to read multiple json files from dbfs in databricks.
raw_df = spark.read.json('/mnt/testdatabricks/metrics-raw/',recursiveFileLookup=True)
This returns data for only 35 files whereas there are around 1600 files.
I tried to read some of the files (except those 35) using pandas and it returned data.
However the driver fails when I try to read all 1600 files using pandas.
import pandas as pd
from glob import glob
jsonFiles = glob('/dbfs/mnt/testdatabricks/metrics-raw/***/*.json')
dfList = []
for jsonFile in jsonFiles:
df = pd.read_json(jsonFile)
dfList.append(df)
print("written :", jsonFile )
dfTrainingDF = pd.concat(dfList, axis=0)
Not sure why spark is not able to read all the files.
Try:
spark.read.option("recursiveFileLookup", "true").json("file:///dir1/subdirectory")
Ref: How to make Spark session read all the files recursively?

line-delimited json format txt file, how to import with pandas

I have a line-delimited Json format txt file. The format of the file is .txt. Now I want to import it with pandas. Usually I can import with
df = pd.read_csv('df.txt')
df = pd.read_json('df.txt')
df = pd.read_fwf('df.txt')
they all give me an error.
ParserError: Error tokenizing data. C error: Expected 29 fields in line 1354, saw 34
ValueError: Trailing data
this returns the data, but the data is organized in a weird way where column name is in the left, next to the data
can anyone tells me how to solve this?
pd.read_json('df.txt', lines=True)
read_json accepts a boolean argument lines which will Read the file as a json object per line.

Export JSON to CSV using Python

I wrote a code to extract some information from a website. the output is in JSON and I want to export it to CSV. So, I tried to convert it to a pandas dataframe and then export it to CSV in pandas. I can print the results but still, it doesn't convert the file to a pandas dataframe. Do you know what the problem with my code is?
# -*- coding: utf-8 -*-
# To create http request/session
import requests
import re, urllib
import pandas as pd
from BeautifulSoup import BeautifulSoup
url = "https://www.indeed.com/jobs?
q=construction%20manager&l=Houston&start=10"
# create session
s = requests.session()
html = s.get(url).text
# exctract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' +
urllib.quote(job_ids)
# do Ajax request and convert the response to json
ajax_content = s.get(ajax_url).json()
print(ajax_content)
#Convert to pandas dataframe
df = pd.read_json(ajax_content)
#Export to CSV
df.to_csv("c:\\users\\Name\desktop\\newcsv.csv")
The error message is:
Traceback (most recent call last):
File "C:\Users\Mehrdad\Desktop\Indeed 06.py", line 21, in
df = pd.read_json(ajax_content)
File "c:\python27\lib\site-packages\pandas\io\json\json.py", line 408, in read_json
path_or_buf, encoding=encoding, compression=compression,
File "c:\python27\lib\site-packages\pandas\io\common.py", line 218, in get_filepath_or_buffer
raise ValueError(msg.format(_type=type(filepath_or_buffer)))
ValueError: Invalid file path or buffer object type:
The problem was that nothing was going into the dataframe when you called read_json() because it was a nested JSON dict:
import requests
import re, urllib
import pandas as pd
from pandas.io.json import json_normalize
url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston&start=10"
s = requests.session()
html = s.get(url).text
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
ajax_content= s.get(ajax_url).json()
df = json_normalize(ajax_content).transpose()
df.to_csv('your_output_file.csv')
Note that I called json_normalize() to collapse the nested columns from the JSON. I also called transpose() so that the rows were labelled with the job ID rather than columns. This will give you a dataframe that looks like this:
0079ccae458b4dcf <p><b>Company Environment: </b></p><p>Planet F...
0c1ab61fe31a5c62 <p><b>Commercial Construction Project Manager<...
0feac44386ddcf99 <div><div>Trendmaker Homes is currently seekin...
...
It's not really clear what your expected output is, though ... what are you expecting the DataFrame/CSV file to look like?. If you actually were looking for just a single row/Series with the job ID's as column labels, just remove the call to transpose()

Reading a big JSON file with multiple objects in Python

I have a big GZ compressed JSON file where each line is a JSON object (i.e. a python dictionary).
Here is an example of the first two lines:
{"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}
So, I'm using the following code to read each line into a Pandas DataFrame:
import json
import gzip
import pandas as pd
import random
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
data_lan = pd.DataFrame()
for line in fin:
data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)
But it's taking years.
Any suggestion to read the data quicker?
EDIT:
Finally what solved the problem:
import json
import gzip
import pandas as pd
with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
data_lan = []
for line in fin:
data_lan.append(json.loads(line.decode('utf-8')))
data = pd.DataFrame(data_lan)
I've worked on a similar problem myself, The append() is kinda slow. I generally use a list of dicts to load the json file and then create a Dataframe at once. This ways, you can have the flexibility the lists give you and only when you're sure about the Data in the list you convert it into a Dataframe. Below is an implementation of the concept:
import pandas as pd
import gzip
def get_contents_from_json(file_path)-> dict:
"""
Reads the contents of the json file into a dict
:param file_path:
:return: A dictionary of all contents in the file.
"""
try:
with gzip.open(file_path) as file:
contents = file.read()
return json.loads(contents.decode('UTF-8'))
except json.JSONDecodeError:
print('Error while reading json file')
except FileNotFoundError:
print(f'The JSON file was not found at the given path: \n{file_path}')
def main(file_path: str):
file_contents = get_contents_from_json(file_path)
if not isinstance(file_contents,list):
# I've considered you have a JSON Array in your file
# if not let me know in the comments
raise TypeError("The file doesn't have a JSON Array!!!")
all_columns = file_contents[0].keys()
data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')
if __name__ == '__main__':
main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
A pandas DataFrame fits into a contiguous block of memory which means that pandas needs to know the size of the data set when the frame is created. Since append changes the size, new memory must be allocated and the original plus new data sets are copied in. As your set grows, the copy gets bigger and bigger.
You can use from_records to avoid this problem. First, you need to know the row count and that means scanning the file. You could potentially cache that number if you do it often, but its a relatively fast operation. Now you have the size and pandas can allocate the memory efficiently.
# count rows
with gzip.GzipFile(file_to_test, 'r',) as fin:
row_count = sum(1 for _ in fin)
# build dataframe from records
with gzip.GzipFile(file_to_test, 'r',) as fin:
data_lan = pd.DataFrame.from_records(fin, nrows=row_count)