I have been tasked with creating a method to download multiple PDFs from URLs included in JSON files. There is probably 1 URL per JSON file, with approximately 500k JSON files to process in any one batch.
Here's a sample of the JSON file:
{
    "from": null,
    "id": "sfm_c4kjatol7u8psvqfati0",
    "imb_code": "897714123456789",
    "mail_date": null,
    "mail_type": "usps_first_class",
    "object": "self_mailer",
    "press_proof": "https://lob-assets.com/sid-self_mailers/sfm_c4kjatol7u8psvqfati0.pdf?version=v1&expires=1635274615&signature=AZlb0MSzZPuCjtKFkXRr_OoHzDzEy23UqzmKFWs5bycKCEcIyfe2od58zHzfP1a-iW5d9azFYUT1PnosqKcvBg",
    "size": "11x9_bifold",
    "target_delivery_date": null,
    "to": {
        "address_city": "SAN FRANCISCO",
        "address_country": "UNITED STATES",
        "address_line1": "185 BERRY ST STE 6100",
        "address_line2": null,
        "address_state": "CA",
        "address_zip": "94107-1741",
        "company": "Name.COM",
        "name": "EMILE ILES"
    }
}
The JSON file is converted to CSV and the URL is downloaded. Here's what I have been trying to use, but it is not working. What am I missing?
Import urllib.request, json, requests, os, csvkit
from itertools import islice
from pathlib import Path

path = Path("/Users/MyComputer/Desktop/self_mailers")
paths = [i.path for i in islice(os.scandir(path), 100)]

in2csv data.json > data.csv

with open('*.json', 'r') as f:
    urls_dict = json.load(f)
    urls_dict = urls_dict[0]
    itr = iter(urls_dict)
    len(list(itr))
    f.write(r.pdf)
Why are you converting your JSON to a CSV?
By the way, if you are unsure of where the URLs are in the JSON files, I would do this:
import os
import json
from rethreader import Rethreader
from urllib.parse import urlparse
from urllib.request import urlretrieve

def download_pdf(url):
    # use urlparse to find the pdf name
    filename = urlparse(url).path.rsplit('/')[-1]
    urlretrieve(url, filename)

# use multi-threading for faster downloads
downloader = Rethreader(download_pdf).start()

def verify_url(value):
    if not isinstance(value, str):
        # if the value is not a string, it's not a url either
        return False
    try:
        parsed_url = urlparse(value)
    except AttributeError:
        # value cannot be parsed as a url
        return False
    if not (parsed_url.scheme and parsed_url.netloc and parsed_url.path):
        # value cannot be a url because it lacks a scheme, host or path
        return False
    return True

def parse_data(data):
    for value in data.values():
        if verify_url(value):
            downloader.add(value)

for file in os.listdir():
    with open(file) as fp:
        try:
            json_data = json.load(fp)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # this file is not valid json; skip to the next one
            continue
        parse_data(json_data)

# quit the downloader after downloading the files
downloader.quit()
If you know which keys the URLs can be under, I would do this:
# The other parts same as before

def parse_data(data):
    for key in ['possible_key', 'another_possible_key']:
        if key in data and verify_url(data[key]):
            downloader.add(data[key])
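If you would rather not add the third-party rethreader dependency, here is a minimal sketch of the same idea built on the standard library's concurrent.futures; it reuses the verify_url helper from above, and max_workers is an arbitrary choice:

import os
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlretrieve

def download_pdf(url):
    # derive the local filename from the last path segment of the url
    filename = urlparse(url).path.rsplit('/', 1)[-1]
    urlretrieve(url, filename)

with ThreadPoolExecutor(max_workers=16) as pool:
    for file in os.listdir():
        with open(file) as fp:
            try:
                json_data = json.load(fp)
            except (json.JSONDecodeError, UnicodeDecodeError):
                continue
        for value in json_data.values():
            if verify_url(value):  # same helper as in the snippet above
                pool.submit(download_pdf, value)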
I have a JSON file which I need to load into memory in chunks. Consider this file.json example:
[{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1}
]
Then I want to use the pandas read_json function combined with chunk sizes:
import pandas as pd

file = "file.json"
dtype = {"weekly_report_day": str, "project_id": str, "actions": int, "event_id": int}
chunked = pd.read_json(file, orient='records', dtype=dtype, chunksize=1, lines=True)

for df in chunked:
    print(df)
This, however, returns an error. I would like to ask for suggestions (I need to use chunks, as the original data is very large).
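For context, a likely cause: pandas only supports chunksize together with lines=True, and lines=True expects newline-delimited JSON (one object per line), whereas file.json above is a JSON array. A minimal sketch of converting the array to JSON Lines first, under that assumption:

import json
import pandas as pd

# rewrite the JSON array as one object per line (JSON Lines)
with open("file.json") as src, open("file.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")

dtype = {"weekly_report_day": str, "project_id": str, "actions": int, "event_id": int}
for df in pd.read_json("file.jsonl", orient="records", dtype=dtype, chunksize=1, lines=True):
    print(df)

Note that the conversion step itself loads the whole array into memory; for files too large for that, a streaming parser such as ijson would be needed.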
Good afternoon. I'm trying to find the top 10 IPs in access.log (the standard Apache server log). There is code like this:
import argparse
import json
import re
from collections import defaultdict, Counter

parser = argparse.ArgumentParser(description='parser script')
parser.add_argument('-f', dest='logfile', action='store', default='access.log')
args = parser.parse_args()

regul_ip = r"^(?P<ips>.*?)"
regul_method = r"\"(?P<request_method>GET|POST|PUT|DELETE|HEAD)"

def req_by_method():
    dict_ip = defaultdict(lambda: {"GET": 0, "POST": 0, "PUT": 0, "DELETE": 0, "HEAD": 0})
    with open(args.logfile) as file:
        for index, line in enumerate(file.readlines()):
            try:
                ip = re.search(regul_ip, line).group()
                method = re.search(regul_method, line).groups()[0]
                return Counter(dict_ip).most_common(10)
            except AttributeError:
                pass
            dict_ip[ip][method] += 1
    print(json.dumps(dict_ip, indent=4))
    with open("final_log.json", "w") as jsonfile:
        json.dump(dict_ip, jsonfile, indent=5)
When the code is executed, I only get: []
How can I fix this code to make it work? I also need to output to the final JSON file a set of fields for each line: "ip", "method", "status code", "url", and the duration of the request.
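For what it's worth, the [] comes from the return Counter(dict_ip).most_common(10) sitting inside the try block: it runs on the first line that parses, before any counts have been accumulated, so the Counter is still empty. Below is a minimal sketch of the counting loop with the return moved after the loop; it assumes the same imports and argument parsing as above, and swaps in a stricter IP pattern, since the lazy ^(?P<ips>.*?) matches the empty string:

regul_ip = r"^(?P<ips>\S+)"  # assumption: the IP is the first whitespace-delimited field

def req_by_method():
    dict_ip = defaultdict(lambda: {"GET": 0, "POST": 0, "PUT": 0, "DELETE": 0, "HEAD": 0})
    with open(args.logfile) as file:
        for line in file:
            try:
                ip = re.search(regul_ip, line).group()
                method = re.search(regul_method, line).groups()[0]
            except AttributeError:
                # line did not match either pattern; skip it
                continue
            dict_ip[ip][method] += 1
    # rank IPs by total request count only after the whole file has been read
    totals = Counter({ip: sum(methods.values()) for ip, methods in dict_ip.items()})
    return dict_ip, totals.most_common(10)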
Input:
There are 5 part JSON files named test_part1.json, test_part2.json, test_part3.json, test_part4.json, and test_part5.json in s3://test/json_files/data/.
Expected Output:
A single CSV file.
Explanation: All of the JSON files have the same number of columns with the same structure. They are basically part files of the same source.
I want to merge/repartition all of them, convert them into a CSV file, and store it in S3.
import pandas as pd
import os
import boto3
import numpy

# Boto3 clients
resource = boto3.resource('s3')
client = boto3.client('s3')
session = boto3.session.Session()

bucket = 'test'
path = 'json_files/data/'
delimiter = '/'
suffix = '.json'

json_files = client.list_objects(Bucket=bucket, Prefix=path, Delimiter=delimiter)

for obj in json_files['Contents']:
    obj = client.get_object(Bucket=bucket, Key=obj['Key'])
    df = pd.read_json(obj["Body"], lines=True)
    print(df)
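A sketch of the missing merge step, assuming the part files share a schema as described; the bucket and prefix come from the question, while the output key json_files/data/merged.csv is a made-up name:

import io

frames = []
for obj in json_files['Contents']:
    body = client.get_object(Bucket=bucket, Key=obj['Key'])["Body"]
    frames.append(pd.read_json(body, lines=True))

# concatenate all part files into one DataFrame and upload it as a single CSV
merged = pd.concat(frames, ignore_index=True)
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
client.put_object(Bucket=bucket, Key='json_files/data/merged.csv',
                  Body=buffer.getvalue().encode('utf-8'))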
I am trying to read JSON from a file, get the values, transform them, and write them back to a new file.
{
    "metadata": {
        "info": "important info"
    },
    "timestamp": "2018-04-06T12:19:38.611Z",
    "content": {
        "id": "1",
        "name": "name test",
        "objects": [
            {
                "id": "1",
                "url": "http://example.com",
                "properties": [
                    {
                        "id": "1",
                        "value": "1"
                    }
                ]
            }
        ]
    }
}
Above is the JSON that I read from the file. Below I attach a Python program that gets the values, creates new JSON, and writes it to a file.
import json
from pprint import pprint

def load_json(file_name):
    return json.load(open(file_name))

def get_metadata(json):
    return json["metadata"]

def get_timestamp(json):
    return json["timestamp"]

def get_content(json):
    return json["content"]

def create_json(metadata, timestamp, content):
    dct = dict(__metadata=metadata, timestamp=timestamp, content=content)
    return json.dumps(dct)

def write_json_to_file(file_name, json_content):
    with open(file_name, 'w') as file:
        json.dump(json_content, file)

STACK_JSON = 'stack.json'
STACK_OUT_JSON = 'stack-out.json'

if __name__ == '__main__':
    json_content = load_json(STACK_JSON)
    print("Loaded JSON:")
    print(json_content)
    metadata = get_metadata(json_content)
    print("Metadata:", metadata)
    timestamp = get_timestamp(json_content)
    print("Timestamp:", timestamp)
    content = get_content(json_content)
    print("Content:", content)
    created_json = create_json(metadata, timestamp, content)
    print("\n\n")
    print(created_json)
    write_json_to_file(STACK_OUT_JSON, created_json)
But the problem is that the created JSON is not correct. As the final result I get:
"{\"__metadata\": {\"info\": \"important info\"}, \"timestamp\": \"2018-04-06T12:19:38.611Z\", \"content\": {\"id\": \"1\", \"name\": \"name test\", \"objects\": [{\"id\": \"1\", \"url\": \"http://example.com\", \"properties\": [{\"id\": \"1\", \"value\": \"1\"}]}]}}"
That is not what I want to achieve. It's not correct JSON. What am I doing wrong?
Solution:
Change the write_json_to_file(...) method like this:
def write_json_to_file(file_name, json_content):
    with open(file_name, 'w') as file:
        file.write(json_content)
Explanation:
The problem is that when you call write_json_to_file(STACK_OUT_JSON, created_json) at the end of your script, the variable created_json contains a string: it's the JSON representation of the dictionary created in the create_json(...) function. But inside write_json_to_file(file_name, json_content), you're calling:
json.dump(json_content, file)
You're telling the json module to write the JSON representation of the variable json_content (which contains a string) into the file. And the JSON representation of a string is a single value wrapped in double quotes ("), with every double quote it contains escaped as \".
What you want to achieve is to simply write the value of the json_content variable into the file and not have it first JSON-serialized again.
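Equivalently, you could keep json.dump in the writer and instead have create_json return the dict rather than a pre-serialized string, so the data is serialized exactly once. A sketch of that variant:

def create_json(metadata, timestamp, content):
    # return the dict itself; serialization happens once, in the writer
    return dict(__metadata=metadata, timestamp=timestamp, content=content)

def write_json_to_file(file_name, json_content):
    with open(file_name, 'w') as file:
        json.dump(json_content, file, indent=4)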
Problem
You're converting a dict into JSON, and then, right before you write it into a file, you're converting it into JSON again. When you serialize an already-serialized JSON string, every " inside it gets escaped as \", because the serializer treats the whole string as a single value.
How to solve it?
It's a good idea to read the JSON file, convert it into a dict, and perform all sorts of operations on it, converting back to JSON only when you want to print output, write to a file, or return a result. json.dump() is expensive: it adds roughly 2 ms of overhead, which might not seem like much, but when your code otherwise runs in 500 microseconds, that is about 4 times your runtime.
Other Recommendations
After seeing your code, I realize you're coming from a Java background. While in Java getThis()/getThat() accessors are a fine way to modularize your code, since Java organizes everything into classes, in Python they just hurt readability, as the PEP 8 style guide for Python points out.
I've updated the code below:
import json

def get_contents_from_json(file_path) -> dict:
    """
    Reads the contents of the json file into a dict
    :param file_path:
    :return: A dictionary of all contents in the file.
    """
    try:
        with open(file_path) as file:
            contents = file.read()
            return json.loads(contents)
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')

def write_to_json_file(metadata, timestamp, content, file_path):
    """
    Creates a dict of all the data and then writes it into the file
    :param metadata: The meta data
    :param timestamp: the timestamp
    :param content: the content
    :param file_path: The file in which json needs to be written
    :return: None
    """
    output_dict = dict(metadata=metadata, timestamp=timestamp, content=content)
    with open(file_path, 'w') as outfile:
        json.dump(output_dict, outfile, sort_keys=True, indent=4, ensure_ascii=False)

def main(input_file_path, output_file_path):
    # get a dict from the loaded json
    data = get_contents_from_json(input_file_path)
    # print() supports multiple args, so you don't need multiple print statements
    print('JSON:', json.dumps(data), 'Loaded JSON as dict:', data, sep='\n')
    try:
        # read your data from the dict instead of through getter methods; it's more pythonic
        metadata = data['metadata']
        timestamp = data['timestamp']
        content = data['content']
        # just cumulating your print statements
        print("Metadata:", metadata, "Timestamp:", timestamp, "Content:", content, sep='\n')
        # write your json to the file
        write_to_json_file(metadata, timestamp, content, output_file_path)
    except KeyError:
        print('Could not find the proper keys in the provided json')
    except TypeError:
        print('There is something wrong with the loaded data')

if __name__ == '__main__':
    main('stack.json', 'stack-out.json')
Advantages of the above code:
More modular and hence easily unit-testable
Handling of exceptions
Readable
More pythonic
Comments because they are just awesome!
I am new to Python and Django. I am an IT professional who deploys software that monitors computers. The API outputs JSON. I want to create a Django app that reads the API and outputs the data to an HTML page. Where do I get started? I think the idea is to write the JSON feed to a Django model. Any help/advice is greatly appreciated.
Here's a simple single file to extract the JSON data:
import urllib2
import json

def printResults(data):
    theJSON = json.loads(data)
    for i in theJSON[""]:  # "" is a placeholder key, left blank in the original
        print i

def main():
    urlData = ""  # placeholder for the api url
    webUrl = urllib2.urlopen(urlData)
    if webUrl.getcode() == 200:
        data = webUrl.read()
        printResults(data)
    else:
        print "Received error"

if __name__ == '__main__':
    main()
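Note that urllib2 is Python 2 only. An equivalent sketch for Python 3 with urllib.request, keeping the same placeholder URL and key:

import json
from urllib.request import urlopen

def print_results(data):
    the_json = json.loads(data)
    for i in the_json[""]:  # "" is still a placeholder key
        print(i)

def main():
    url_data = ""  # placeholder for the api url
    with urlopen(url_data) as web_url:
        if web_url.getcode() == 200:
            print_results(web_url.read())
        else:
            print("Received error")

if __name__ == '__main__':
    main()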
If you have a URL returning JSON as the response, you could try this:
import requests
import json
url = 'http://....' # Your api url
response = requests.get(url)
json_response = response.json()
Now json_response is a list containing dicts. Let's suppose you have this structure:
[
    {
        'code': 'ABC',
        'avg': 14.5,
        'max': 30
    },
    {
        'code': 'XYZ',
        'avg': 11.6,
        'max': 21
    },
    ...
]
You can iterate over the list and take every dict into a model.
from yourmodels import CurrentModel
...

for obj in json_response:
    cm = CurrentModel()
    cm.avg = obj['avg']
    cm.max = obj['max']
    cm.code = obj['code']
    cm.save()
Or you could use a bulk method, but keep in mind that bulk_create does not trigger the model's save() method.
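For completeness, a sketch of the bulk variant with the same hypothetical CurrentModel; bulk_create inserts all rows in a single query instead of one query per row:

CurrentModel.objects.bulk_create([
    CurrentModel(code=obj['code'], avg=obj['avg'], max=obj['max'])
    for obj in json_response
])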