Extracting data from a JSON database (Python 3)

I want to write a program that loads data from a JSON database into a Python list of dictionaries and counts the number of times the mean temperature was above versus below freezing. However, I am struggling to extract information from the database successfully, and I am concerned my algorithm is off. My plan:
1) define a function that loads data from the json file.
2) define a function that extracts information from the file
3) use that extracted information to tally the number of times the temp was above or below freezing
import json
def load_weather_data(): #function 1: Loads data
    with open("NYC4-syr-weather-dec-2015.json", encoding = 'utf8') as w: #w for weather
        data = w.read()
        weather = json.loads(data)
        print(type(weather))
    return weather
def extract_temp(weather): #function 2: Extracts information on weather
    info = {}
    info['Mean TemperatureF'] = weather['Mean TemperatureF']#i keep getting a type error here
    return info
print("Above and blelow freezing")
weather = load_weather_data()
info = extract_temp(weather)
above_freezing = 0
below_freezing = 0
for temperature in weather: # summing the number of times the weather was above versus below freezing
    if info['Mean Temperature'] >32:
        above_freezing=above_freezing+1
    elif info['mean temperature']<32:
        below_freezing = below_freezing +1
print(above_freezing)
print(below_freezing)
If you have any ideas, please let me know! Thank you.

You are trying to extract temperature from the weather list one time before starting the loop when really you should be doing it for each temperature object in the loop. You haven't posted sample data, but I think that weather is a list and you are trying to use it as a dict. Below is a fix with a couple of other changes for tidiness.
import json

# fixed: pass the filename in so that the function works on other files
def load_weather_data(filename):  # function 1: loads data
    with open(filename, encoding='utf8') as w:  # w for weather
        # fixed: fewer steps
        return json.load(w)

# fixed: not needed, doesn't simplify anything
#def extract_temp(weather): #function 2: Extracts information on weather
#    info = {}
#    info['Mean TemperatureF'] = weather['Mean TemperatureF']#i keep getting a type error here
#    return info

print("Above and below freezing")
weather = load_weather_data("NYC4-syr-weather-dec-2015.json")
above_freezing = 0
below_freezing = 0
for temperature in weather:  # counting the records above versus below freezing
    # fixed: read the value from each record inside the loop; the key is
    # assumed to be 'Mean TemperatureF', as used in extract_temp above
    if temperature['Mean TemperatureF'] > 32:
        above_freezing += 1
    elif temperature['Mean TemperatureF'] < 32:
        below_freezing += 1
print(above_freezing)
print(below_freezing)
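Since no sample data was posted, here is a minimal sketch of the same tally on an in-memory list, which can be checked without the file. The helper name tally_freezing, the sample records, and the possibility that readings are stored as strings are all assumptions for illustration.

```python
def tally_freezing(records, key="Mean TemperatureF", freezing=32):
    """Count records whose temperature is above vs. below the freezing point."""
    above = below = 0
    for record in records:
        temp = record.get(key)
        if temp is None:
            continue  # skip records without a reading
        temp = float(temp)  # readings may be stored as strings (assumption)
        if temp > freezing:
            above += 1
        elif temp < freezing:
            below += 1
    return above, below


# Hypothetical sample, shaped like the list that json.load() is assumed to return.
sample = [
    {"Mean TemperatureF": 45},
    {"Mean TemperatureF": "28"},
    {"Mean TemperatureF": 32},
    {"Mean TemperatureF": 50},
]
above, below = tally_freezing(sample)
print(above, below)  # prints: 2 1
```

A reading of exactly 32 counts as neither above nor below, which matches the strict comparisons in the original loop.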

Related

How can i extract information quickly from 130,000+ Json files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to calculate numbers based on the data in them (for example, count the speakers by gender). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name',StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page[ "Contents" ]:
            keyString = key[ "Key" ]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = (json_content['dict-name'])

To use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). It's also not clear whether your 2-3 files per second covers only the read or includes part of the number crunching; either way, multiprocessing should speed this up considerably. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To make this useful for me, I run it on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n; look for the n) and inf1 (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. Multiprocessing is orders of magnitude faster than a single process.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
from utils_file_handling import file_utilities

ufh = file_utilities()  # instantiate the class - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix, pass '' (empty string) or the function will fail

# define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        # the last argument (15) is the chunksize: keys handed to a worker at a time
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# the guard keeps worker processes from re-running the script on platforms that spawn
if __name__ == '__main__':
    # create your master keys list upfront; you can loop through all or slice the list to test
    keys_list = ufh.get_keys_from_prefix(bucket, prefix)
    # keys_list = keys_list[0:2000]  # as an example
    num_proc = os.cpu_count()  # how many processors your machine has; function above defaults to 4 unless given
    df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect dataframes for each file
    df_new = pd.concat(df_list, sort=False)
    df_new = df_new.reset_index(drop=True)
    # do your analysis on the dataframe
File 2: class functions
# utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd

# define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """file handling functions"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            # a page that lists no keys has no 'Contents' entry, so default to []
            keys = [content['Key'] for content in page.get('Contents', [])]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
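If the work per file is mostly waiting on the network and only a simple tally is needed (such as the gender count in the question), a thread pool is another option. This is a different technique from the multiprocessing answer above, not a restatement of it. The sketch below is written under assumptions: count_field, the fetch callable, and the "gender" field name are hypothetical, and an in-memory dict stands in for the bucket so the example runs offline.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_field(keys, fetch, field, max_workers=32):
    """Fetch JSON documents concurrently and tally the values of one field.

    `fetch` is any callable that maps a key to the JSON text of that object.
    """
    def extract(key):
        return json.loads(fetch(key)).get(field)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs extract() in worker threads and yields results in key order
        return Counter(pool.map(extract, keys))


# Hypothetical in-memory stand-in for the bucket.
fake_bucket = {
    "a.json": '{"gender": "F"}',
    "b.json": '{"gender": "M"}',
    "c.json": '{"gender": "F"}',
}
counts = count_field(fake_bucket, fake_bucket.__getitem__, "gender")
print(counts["F"], counts["M"])  # prints: 2 1
```

Against S3, fetch would presumably wrap a get_object call like the one in read_json_file_from_s3 and return the decoded body; the best max_workers value depends on the machine and network and would need measuring.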

How to handle large JSON file in Pytorch?

I am working on a time series problem. The training data, made up of different time series, is stored in a large JSON file of about 30GB. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?

I suppose IterableDataset is what you need, because:
you probably want to traverse files without random access;
the number of samples in the JSON files is not pre-computed.
I've made a minimal usage example, assuming that every line of a dataset file is itself a JSON document, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset

class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...
...
dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)
for batch in dataloader:
    y = model(batch)

Generally, you do not need to change/overload the default data.DataLoader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract item-by-item from the JSON file, you feed it to the "vanilla" data.DataLoader, and all the batching, multi-processing, etc. is done for you based on the dataset provided.
If, for example, you have a folder with several JSON files, each containing several examples, you can have a Dataset that looks like:
import bisect
from torch.utils import data

class MyJsonsDataset(data.Dataset):
    def __init__(self, jfolder):
        super(MyJsonsDataset, self).__init__()
        self.filenames = []  # keep track of the jfiles you need to load
        self.cumulative_sizes = [0]  # keep track of number of examples viewed so far
        # this is not actually python code - just pseudo code for you to follow
        for each jsonfile in jfolder:
            self.filenames.append(jsonfile)
            l = number of examples in jsonfile
            self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
        # discard the first element
        self.cumulative_sizes.pop(0)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # first you need to know which of the files holds the idx example
        jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if jfile_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
        # now you need to retrieve the `sample_idx` example from self.filenames[jfile_idx]
        return retrieved_example
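The index arithmetic in the pseudocode above can be checked in isolation. Below is a runnable sketch of the same cumulative-size lookup as a plain Python class, so it runs without PyTorch; subclassing data.Dataset, as the answer suggests, would be the step that plugs it into a DataLoader. The class name JsonFilesIndex, the locate helper, and the assumption that each file holds one JSON array of examples are all illustrative.

```python
import bisect
import json
import os
import tempfile


class JsonFilesIndex:
    """Maps a global example index to (file index, local index) via cumulative sizes."""

    def __init__(self, filenames):
        self.filenames = list(filenames)
        self.cumulative_sizes = []
        total = 0
        for name in self.filenames:
            with open(name) as f:
                total += len(json.load(f))  # assumes each file is one JSON array
            self.cumulative_sizes.append(total)

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def locate(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # first file whose cumulative size exceeds idx
        file_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        start = self.cumulative_sizes[file_idx - 1] if file_idx else 0
        return file_idx, idx - start

    def __getitem__(self, idx):
        file_idx, sample_idx = self.locate(idx)
        with open(self.filenames[file_idx]) as f:
            return json.load(f)[sample_idx]  # re-reads the whole file on every access


# Demo with two hypothetical files holding 3 and 2 examples.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, examples in [("a.json", [10, 11, 12]), ("b.json", [20, 21])]:
        path = os.path.join(folder, name)
        with open(path, "w") as f:
            json.dump(examples, f)
        paths.append(path)
    index = JsonFilesIndex(paths)
    total = len(index)
    all_examples = [index[i] for i in range(total)]
print(total, all_examples)  # prints: 5 [10, 11, 12, 20, 21]
```

Re-reading a file on each access is only acceptable for small files; for a single 30GB file, some form of caching or a line-oriented layout would be needed instead.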

Access multiple dictionaries in a file - Python

I am very new to JSON files. I have a JSON file with multiple JSON objects such as the following:
{"ID":"12345","Timestamp":"20140101", "Usefulness":"Yes",
"Code":[{"event1":"A","result":"1"},…]}
{"ID":"1A35B","Timestamp":"20140102", "Usefulness":"No",
"Code":[{"event1":"B","result":"1"},…]}
{"ID":"AA356","Timestamp":"20140103", "Usefulness":"No",
"Code":[{"event1":"B","result":"0"},…]}
…
I want to parse these json objects like a stream. The end game for me however is to create pairwise combinations of event1 and result. like so:
[AB, AB, BB],[11,10,10]
What I know:
The exact structure of the dict
What I do not know: How to extract these dict by dict to perform this operation.
I cannot modify the existing file, so don't tell me to add '[ ], and ','
Additional Help:
I might run into files that I cannot store directly in memory, so a streaming solution would be more appreciated.

The easiest thing there is to feed the file stream into a custom generator, that would "pre-parse" the json objects. That can be done with some state variables counting somewhat naively the number of open { and [ - each time it reaches zero, it yields a string with a full JSON object.
I could not figure out your desired final intent from the example you provided. I suppose you have other dicts inside "Code", and what you want in the end is a pair of the combined "event1, result" inside each "Code" value for the outermost dicts. If that is not it, feel free to adapt the code.
(An ordered dict is good enough for storing the results you need - and you can retrieve the separate lists for keys and values if you need)
from collections import OrderedDict
import json
import string
import sys

def char_streamer(stream):
    for line in stream:
        for char in line:
            yield char

def json_source(stream):
    result = []
    curly_count = 0
    bracket_count = 0
    inside_string = False
    previous_is_escape = False
    for char in char_streamer(stream):
        if not result and char in string.whitespace:
            continue
        result.append(char)
        if char == '"':
            if not inside_string:
                inside_string = True   # opening quote
            elif not previous_is_escape:
                inside_string = False  # unescaped closing quote
        if inside_string:
            # braces and brackets inside a string must not be counted
            if char == "\\":  # single '\' character
                previous_is_escape = not previous_is_escape
            else:
                previous_is_escape = False
            continue
        if char == "{":
            curly_count += 1
        if char == "[":
            bracket_count += 1
        if char == "}":
            curly_count -= 1
        if char == "]":
            bracket_count -= 1
        if curly_count == 0 and bracket_count == 0 and result:
            # all brackets are balanced: a full JSON object has been read
            yield json.loads("".join(result))
            result = []

def main(filename):
    result = OrderedDict()
    with open(filename) as file:
        for data_part in json_source(file):
            # aggregate your data into `result` here
            pass
    print(result.keys(), result.values())

main(sys.argv[1])
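As an alternative to counting brackets by hand (this is not the approach of the code above), the standard library's json.JSONDecoder.raw_decode can parse one value from the front of a string and report where it stopped. The sketch below reads the file in chunks and decodes whatever complete objects the buffer holds. The function name iter_json_objects and the sample records are hypothetical, and it assumes the top-level values are objects or arrays, since a bare number split across two chunks could be decoded too early.

```python
import io
import json


def iter_json_objects(stream, chunk_size=65536):
    """Yield successive top-level JSON values from a text stream, read in chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value is incomplete so far; read another chunk
            yield obj
            buffer = buffer[end:]
        if not chunk:  # end of stream
            if buffer.strip():
                raise ValueError("trailing data is not valid JSON")
            return


# Hypothetical records modeled on the question's layout.
sample = io.StringIO(
    '{"ID": "12345", "Code": [{"event1": "A", "result": "1"}]}\n'
    '{"ID": "1A35B",\n "Code": [{"event1": "B", "result": "1"}]}\n'
)
pairs = [
    (code["event1"], code["result"])
    for obj in iter_json_objects(sample, chunk_size=16)
    for code in obj["Code"]
]
print(pairs)  # prints: [('A', '1'), ('B', '1')]
```

Only one object plus at most one chunk is held in memory at a time, but an incomplete object is re-parsed on every new chunk, so very large individual objects would make this slow.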

Twitter streaming script is throwing a keyerror on location field of the tweet

I have written a Python script to stream tweets using the tweepy module. After streaming tweets for around 3 minutes, I dump them into a .json file. I then try to populate a pandas dataframe with the location and text fields of each tweet. The text field gets populated, but not for every tweet in the .json file (problem 1), and for the location field a KeyError is thrown (problem 2). What exactly is going wrong?
twitter_stream_dump.py
import time
import json
import pandas as pd
import re
#tweepy based modules
import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener
#initializing authentication credentials
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener) :
    def __init__(self,time_limit) :
        self.start_time = time.time()
        self.limit = time_limit
        self.saveFile = open('requests.json','a')
        super(StdOutListener,self).__init__()
    def on_data(self, data) :
        if ((time.time() - self.start_time) < self.limit) :
            self.saveFile.write(data)
            self.saveFile.write('\n')
            return True
        else :
            self.saveFile.close()
            return False
    def on_error(self, status) :
        print(status)
def getwords(string) :
    return re.findall(r"[\w'#]+|[.,!?;]",string)
if __name__ == '__main__' :
    #This handles Twitter authetification and the connection to Twitter Streaming API
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    time_limit = input("Enter the time limit in minutes : ")
    time_limit *= 60
    stream = Stream(auth,listener = StdOutListener(time_limit))
    string = raw_input("Enter the list of keywords/hashtags to be compared : ")
    keyword_list = getwords(string)
    #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track = keyword_list)
tweets_data_path = 'requests.json'
tweets_data = []
tweet_list = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file :
    try :
        tweet = json.loads(line)
        tweet_list.append(tweet)
    except :
        continue
num_tweets_collected = len(tweet_list)
#Creates a data frame structure
tweet_dataframe = pd.DataFrame()
text_dump = open('text_dump.txt', 'w')
#Populating the location field of the data frame
#tweet_dataframe['location'] = map(lambda tweet : tweet['location'], tweet_list)
tweet_dataframe['text'] = map(lambda tweet : tweet['text'], tweet_list)
print(tweet_dataframe['text'])
Errors :
abhijeet-mohanty-2:Desktop SubrataMohanty$ python twitter_stream_dump.py
Enter the time limit in minutes : 3
Enter the list of keywords/hashtags to be compared : python ruby scala
Traceback (most recent call last):
File "twitter_stream_dump.py", line 81, in <module>
tweet_dataframe['location'] = map(lambda tweet : tweet['location'], tweet_list)
File "twitter_stream_dump.py", line 81, in <lambda>
tweet_dataframe['location'] = map(lambda tweet : tweet['location'], tweet_list)
KeyError: 'location'
requests.json (My .json file)
https://drive.google.com/file/d/0B1p05OszaBkXLWFsQ2VmeWVjbDQ/view?usp=sharing

The location field is a user-defined value and will sometimes not be present.
That's why you're getting the KeyError.
Note that location is part of the "user profile" metadata that comes with a tweet. It's intended to describe a user's location (like their hometown), and not the geotagged location of a given tweet.
In case you're interested in geotags, first check a tweet to see if the geo_enabled field is true. If so, the geo, coordinates, and place fields may contain geotagged information.
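To avoid the KeyError when a field is absent, dict.get can be used in place of indexing. The sketch below assumes, following the description above, that location sits inside each tweet's user object; the helper name user_location and the sample records are made up for illustration, including one record with no user or text at all.

```python
def user_location(tweet):
    """Return the profile location of a tweet's author, or None if it is absent."""
    user = tweet.get("user") or {}  # tolerate a missing or null user object
    return user.get("location")


# Hypothetical records; real stream output has many more fields.
tweets = [
    {"text": "hello", "user": {"location": "Pune"}},
    {"text": "no location", "user": {"location": None}},
    {"delete": {"status": {"id": 1}}},
]
locations = [user_location(t) for t in tweets]
texts = [t.get("text") for t in tweets]
print(locations)  # prints: ['Pune', None, None]
print(texts)      # prints: ['hello', 'no location', None]
```

Both lists have one entry per record, so they can be passed to a dataframe as columns of equal length.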
As for missing text entries, I don't see the same issue when using the data you provided. It's possible the issue was caused by your try/except clause when reading in lines of data. Consider this approach:
for line in tweets_file:
    if line.rstrip():  # skip blank lines
        tweet = json.loads(line)
        tweet_list.append(tweet)
num_tweets_collected = len(tweet_list)
texts = [tweet['text'] for tweet in tweet_list]
tweet_dataframe = pd.DataFrame(texts, columns=['text'])
Sample output:
print(tweet_dataframe.head())
# text
# 0 Tweets and python BFF <3 15121629.976126991
# 1 RT #zeroSteiner: Can now write more post modul...
# 2 •ruby• #MtvInstagLSelena #MtvColabTaylors
# 3 Ruby Necklace July Birthstone Jewelry Rosary...
# 4 #ossia I didn't see any such thing as Python. ...
A few quick summary stats show that no lines are missing, and no entries are null:
print("N tweets: {}".format(num_tweets_collected))
# N tweets: 286
print("N rows in dataframe: {}".format(tweet_dataframe.shape[0]))
# N rows in dataframe: 286
null_count = tweet_dataframe.text.isnull().sum()
print("Tweets with no text field extracted: {}".format(null_count))
# Tweets with no text field extracted: 0

can you convert a dict() to a sequence?

I have a dict() of all the NMEA sentences that are found in a CSV. I tried creating another CSV to write the contents of the dict() into it for statistical and logging purposes. However, I can't, apparently because the dict() is not 'callable'?
import csv
#Counts the number of times a GPS command is observed
def list_gps_commands(data):
    """Counts the number of times a GPS command is observed.
    Returns a dictionary object."""
    gps_cmds = dict()
    for row in data:
        try:
            gps_cmds[row[0]] += 1
        except KeyError:
            gps_cmds[row[0]] = 1
    return gps_cmds
print(list_gps_commands(read_it))
print ("- - - - - - - - - - - - -")
with open('gpsresults.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',', dialect='excel')
    spamwriter.writerow(list_gps_commands(read_it))
Can someone help me? Is there a way I can convert the keys/values into sequences so the csv module can recognize it? Or another way?

Use csv.DictWriter instead of csv.writer.
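A minimal sketch of what that could look like, writing to an in-memory buffer so it runs without touching the disk. The function names and the sample counts are hypothetical; the second function shows an alternative layout with plain csv.writer, one row per command, which may suit logging better.

```python
import csv
import io


def write_counts(counts, csvfile):
    """Write a {command: count} dict as a header row plus one row of counts."""
    writer = csv.DictWriter(csvfile, fieldnames=list(counts))
    writer.writeheader()
    writer.writerow(counts)


def write_counts_as_rows(counts, csvfile):
    """Alternative layout: one (command, count) row per dictionary entry."""
    writer = csv.writer(csvfile)
    writer.writerow(["command", "count"])
    writer.writerows(counts.items())


gps_cmds = {"$GPGGA": 2, "$GPRMC": 1}  # hypothetical counts
wide, tall = io.StringIO(), io.StringIO()
write_counts(gps_cmds, wide)
write_counts_as_rows(gps_cmds, tall)
print(wide.getvalue())
print(tall.getvalue())
```

When writing to a real file in Python 3, opening it with newline='' (as the csv module documentation recommends) avoids blank lines between rows on Windows.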
