Sort JSON dictionaries by a datetime format that is not consistent - json

I have a JSON file (POST responses from an API) and I need to sort its dictionaries by a certain key in order to parse the file in chronological order. After studying the data, I can either sort by the date format in the metadata or by the number sequence of the S5CV[0156]P0.xml filenames.
A text example that you can load as JSON is here - http://pastebin.com/0NS5BiDk
I have written two pieces of code to sort the list of objects by a certain key. The first sorts by the 'text' of the XML, the second by [metadata][0][value].
The first one works, but a few of the XMLs, even though they are higher in number, actually contain documents older than I expected.
For the second one the date format is not consistent and sometimes the value is not present at all. I am struggling to extract the datetime in a consistent way. It also gives me an error I cannot figure out - string indices must be integers.
# 1st code (it works but not ideal)
# load post response r1 in json (python 3.5)
j = r1.json()

# iterate through dictionaries and sort by the 4 num of xml (ex. 0156)
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)
# 2nd code. I need something to make a consistent datetime,
# handle missing values and solve the list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

# extract the last 3 blocks of characters from [metadata][0][value]
# (usually something like "7th april, 1922.") and transform into a datetime
# using dparser.parse
def date(key):
    return dparser.parse(' '.join(key.split(' ')[-3:]), fuzzy=True)

def order(slist):
    try:
        return sorted(slist, key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))
# update
orig_list = j["tree"]["children"][0]["children"]
cleaned_list = sorted((x for x in orig_list if extract_date(x) != DEFAULT_DATE),
                      key=extract_date)

first_date = extract_date(cleaned_list[0])
if first_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[0]['date'] = first_date
print(first_date)

middle = len(cleaned_list)//2
middle_date = extract_date(cleaned_list[middle])
if middle_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[middle]['date'] = middle_date
print(middle_date)

last_date = extract_date(cleaned_list[-1])
if last_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[-1]['date'] = last_date
print(last_date)

Clearly you can't use the .xml filenames to sort the data if their numbering is unreliable, so the most promising strategy seems to be what you're attempting in the 2nd code.
When I mentioned needing a datetime to sort the items in my comments on your other question, I literally meant something like datetime.date instances, not strings like "28th july, 1933". Strings wouldn't provide the proper ordering because they are compared lexicographically with one another, not chronologically like datetime.dates.
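As a quick illustration (not from the original post), compare how the two kinds of values order themselves:
from datetime import date

# string comparison is lexicographic: "28..." < "3..." because "2" < "3"
print("28th july, 1933" < "3rd may, 1922")   # True, which is wrong chronologically
# datetime.date objects compare by actual calendar order
print(date(1933, 7, 28) < date(1922, 5, 3))  # False, which is correct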
Here's something that seems to work. It uses the re module to search for the date pattern in the strings that usually contain one (those whose "name" entry is "Comprising period from"). If there's more than one date match in the string, it uses the last one. The match is then converted into a date instance and returned as the value to key on.
Since some of the items don't have valid date strings, a default one is substituted for sorting purposes. In the code below, the earliest representable date is used as the default, which makes all items with date problems appear at the beginning of the sorted list. Any items following them should be in the proper order.
Not sure what you should do about items lacking date information—if it isn't there, your only options are to guess a value, ignore them, or consider it an error.
# v3.2.1
import datetime
import json
import re

# default date when one isn't found
DEFAULT_DATE = datetime.date(datetime.MINYEAR, 1, 1)  # 01/01/0001

MONTHS = ('january february march april may june july august september october'
          ' november december'.split())
# dictionary to map month names to numeric values 1-12
MONTH_TO_ORDINAL = dict(zip(MONTHS, range(1, 13)))

DMY_DATE_REGEX = (r'(3[01]|[12][0-9]|[1-9])\s*(?:st|nd|rd|th)?\s*'
                  + r'(' + '|'.join(MONTHS) + r')(?:[,.])*\s*'
                  + r'([0-9]{4})')
MDY_DATE_REGEX = (r'(' + '|'.join(MONTHS) + r')\s+'
                  + r'(3[01]|[12][0-9]|[1-9])\s*(?:st|nd|rd|th)?,\s*'
                  + r'([0-9]{4})')
DMY_DATE = re.compile(DMY_DATE_REGEX, re.IGNORECASE)
MDY_DATE = re.compile(MDY_DATE_REGEX, re.IGNORECASE)

def extract_date(item):
    metadata0 = item["metadata"][0]  # check only first item in metadata list
    if metadata0.get("name") != "Comprising period from":
        return DEFAULT_DATE
    else:
        value = metadata0.get("value", "")
        matches = DMY_DATE.findall(value)  # try dmy pattern (most common)
        if matches:
            day, month, year = matches[-1]  # use last match if more than one
        else:
            matches = MDY_DATE.findall(value)  # try mdy pattern...
            if matches:
                month, day, year = matches[-1]  # use last match if more than one
            else:
                print('warning: date patterns not found in "{}"'.format(value))
                return DEFAULT_DATE

        # convert strings found into numerical values
        year, month, day = int(year), MONTH_TO_ORDINAL[month.lower()], int(day)
        return datetime.date(year, month, day)

# test files: 'json_sample.txt', 'india_congress.txt', 'olympic_games.txt'
with open('json_sample.txt', 'r') as f:
    j = json.load(f)

orig_list = j["tree"]["children"][0]["children"]
sorted_list = sorted(orig_list, key=extract_date)
for item in sorted_list:
    print(json.dumps(item, indent=4))
To answer your latest follow-on questions, you could leave out all the items in the list that don't have recognizable dates by using extract_date() to filter them out beforehand in a generator expression with something like this:
# to obtain a list containing only entries with a parsable date
cleaned_list = sorted((x for x in orig_list if extract_date(x) != DEFAULT_DATE),
                      key=extract_date)
Once you have a sorted list of items that all have a valid date, you can do things like the following, again reusing the extract_date() function:
# extract and display dates of items in cleaned list
print('first date: {}'.format(extract_date(cleaned_list[0])))
print('middle date: {}'.format(extract_date(cleaned_list[len(cleaned_list)//2])))
print('last date: {}'.format(extract_date(cleaned_list[-1])))
Calling extract_date() on the same item multiple times is somewhat inefficient. To avoid that you could easily add the datetime.date value it returns to the object on-the-fly since it's a dictionary, and then just refer to it as often as needed with very little additional overhead:
# add extracted datetime.date entry to a list item[i] if a valid one was found
date = extract_date(some_list[i])
if date != DEFAULT_DATE:  # valid date found?
    some_list[i]['date'] = date  # save by adding it to object
This effectively caches the extracted date by storing it in the item itself. Afterwards, the datetime.date value can simply be referenced with some_list[i]['date'].
As a concrete example, consider this revised way of displaying the dates of the first, middle, and last objects:
# display dates of items in cleaned list
print('first date: {}'.format(cleaned_list[0]['date']))
middle = len(cleaned_list)//2
print('middle date: {}'.format(cleaned_list[middle]['date']))
print('last date: {}'.format(cleaned_list[-1]['date']))

Related

Get dates from a Google Sheet into a data frame and convert each date to integers

What do I want to achieve?
Get the dates from the Google Sheet into the dataframe and convert each date in the dataframe to integers.
If this is better done with JSON, let me know: get the dates from the Google Sheet as JSON and convert each date to integers.
Why I am doing it:
# this code takes the date, month and year in integer format
from QuantLib import *
valuation_date = Date(22, 8, 2018)
print(valuation_date+2)
Output:
August 24th, 2018
My solution is below, but I want something better, and there is one problem with it.
Date_df = df['Date line 6']
date_lst = []
month_lst = []
year_lst = []
for i in Date_df:
    date_lst.append(int(i[0:2]))
    month_lst.append(int(i[3:4]))
    year_lst.append(int(i[-4:]))
The problem in the above code is this line:
date_lst.append(int(i[0:2]))
If the day is 12 it works, because indexes 0 and 1 are both digits and "12" is appended to date_lst. But if the day is a single digit such as 8, the slice is "8/" (the character at index 1 is "/"), which cannot be converted to int and produces an error.
Instead of using a dataframe, I am working with JSON.
First, get the data from the sheet and convert it to JSON:
json_data = pd.DataFrame(sheet.get_all_records()).to_json()
Write the data to a JSON file for later use:
with open('sheet_json_data.json', 'w') as f:
    f.write(json_data)
Read the JSON file and convert it to a Python object:
with open('sheet_json_data.json') as f:
    json_data = json.load(f)
Here json.load(f) converts the JSON to a dictionary.
Finally, create a datetime from the string and get the day, month and year as ints:
import datetime as dt  # needed for dt.datetime below

str_date = json_data['Date line 6'][str(1)]
date_time = dt.datetime.strptime(str_date, '%m/%d/%Y')
print(type(date_time.day))
print(date_time.month)
print(date_time.year)
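A more compact variation of the same idea (a sketch, assuming every value in the 'Date line 6' column follows the '%m/%d/%Y' format used above) parses every row in one loop instead of slicing fixed positions, so single-digit days and months are handled too:
import datetime as dt

date_lst, month_lst, year_lst = [], [], []
for str_date in json_data['Date line 6'].values():
    parsed = dt.datetime.strptime(str_date, '%m/%d/%Y')
    date_lst.append(parsed.day)      # int; works for "8/2/2018" as well as "12/22/2018"
    month_lst.append(parsed.month)
    year_lst.append(parsed.year)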

Use partykit (nodeapply, info_node) to extract information within a function

I am trying to extract information returned by the nodeapply(..., info_node) function.
I want to automate the process so that I can extract the information for a list of node ids and operate on it later.
An example follows:
data("cars", package = "datasets")
ct <- ctree(dist ~ speed, data = cars)
node5 <-nodeapply(as.simpleparty(ct), ids = 5, info_node)
node5$`5`$n
I use the code above to extract the number of records in node 5.
I want to create a function to extract the info for a series of nodes:
infonode <- function(x, y) {
  for (j in x) {
    info = nodeapply(y, j, info_node)
    print(info$`j`$n)
  }
}
But the result always comes back as NULL.
I wonder if the way "j" is used within the function is wrong, leading to a NULL read in the print.
If someone could help me it would be greatly appreciated!
Thanks
You can give nodeapply() a vector of ids, and then not just a list with a single element will be extracted but a list of all selected nodes. This is the only partykit-specific part of your question.
From that point forward it is simply operating on standard named lists in R, without anything partykit-specific about it. To address your problem you can use [[ indexing rather than $ indexing, either with an integer or a character index:
node5[[1]]$n
## n
## 19
node5[["5"]]$n
## n
## 19
Thus, in your infonode() function you could replace info$`j`$n by either info[[1]]$n or info[[as.character(j)]]$n.
However, I would simply do this with an sapply():
ni <- nodeapply(as.simpleparty(ct), ids = 3:5, info_node)
sapply(ni, "[[", "n")
## 3.n 4.n 5.n
## 15 16 19
Or some variation of this...

How to use PySpark to load a rolling window from daily files?

I have a large number of fairly large daily files stored in a blob storage engine (S3, Azure Data Lake, etc.): data1900-01-01.csv, data1900-01-02.csv, ..., data2017-04-27.csv. My goal is to perform a rolling N-day linear regression, but I am having trouble with the data-loading aspect. I am not sure how to do this without nested RDDs.
The schema for every .csv file is the same.
In other words, for every date d_t, I need the data x_t joined with the data (x_t-1, x_t-2, ..., x_t-N).
How can I use PySpark to load an N-day Window of these daily files? All of the PySpark examples I can find seem to load from one very large file or data set.
Here's an example of my current code:
dates = [('1995-01-03', '1995-01-04', '1995-01-05'), ('1995-01-04', '1995-01-05', '1995-01-06')]
p = sc.parallelize(dates)

def test_run(date_range):
    dt0 = date_range[-1]  # get the latest date
    s = '/daily/data{}.csv'
    df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORMED')
    file_list = [s.format(dt) for dt in date_range[:-1]]  # get a window of trailing dates
    df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORMED')
    return 1

p.filter(test_run)
p.map(test_run)  # fails with the same error as p.filter
I'm on PySpark version '2.1.0'
I'm running this on an Azure HDInsight cluster jupyter notebook.
spark here is of type <class 'pyspark.sql.session.SparkSession'>
A smaller, more reproducible example is as follows:
p = sc.parallelize([1, 2, 3])

def foo(date_range):
    df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])
    return 1

p.filter(foo).count()
You are better off using DataFrames rather than RDDs. The DataFrame read.csv API accepts a list of paths, like:
pathList = ['/path/to/data1900-01-01.csv','/path/to/data1900-01-02.csv']
df = spark.read.csv(pathList)
Have a look at the documentation for read.csv.
You can form the list of paths to your daily data files by doing a date calculation over a window of N days, e.g. "path/to/data" + datetime.today().strftime("%Y-%m-%d") + ".csv" (this only gets you today's file name, but it is not hard to work out the date calculation for N days).
However, keep in mind that the schemas of all the daily CSVs must be the same for the above to work.
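A rough sketch of that date calculation (assuming the '/daily/data{}.csv' naming from the question, and that the path list is built on the driver rather than inside executors):
from datetime import date, timedelta

N = 3
latest = date(1995, 1, 5)  # most recent day of the window
window_dates = [latest - timedelta(days=i) for i in range(N)]
path_list = ['/daily/data{}.csv'.format(d.isoformat()) for d in window_dates]

# one DataFrame holding all N days; the per-file schemas must match
df = spark.read.csv(path_list, header=True)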
Edit: when you parallelize the list of dates, i.e. p, each date gets processed individually by a different executor, so the input to test_run wasn't really a list of dates; it was one individual string like 1995-01-01.
Try this instead and see if it works.
# Get the list of dates
date_range = window(dates, N)

s = '/daily/data{}.csv'
dt0 = date_range[-1]  # most recent file
df0 = spark.read.csv(s.format(dt0), header=True, mode='DROPMALFORMED')

# read previous files
file_list = [s.format(dt) for dt in date_range[:-1]]
df1 = spark.read.csv(file_list, header=True, mode='DROPMALFORMED')

r, resid = computeLinearRegression(df0, df1)
r.write.save('/daily/r{}.csv'.format(dt0))
resid.write.save('/daily/resid{}.csv'.format(dt0))

Removing characters from column in pandas data frame

My goal is to (1) import Twitter JSON, (2) extract the data of interest, and (3) create a pandas data frame for the variables of interest. Here is my code:
import json
import pandas as pd

tweets = []
for line in open('00.json'):
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})

# Create a data frame satisfying conditions:
df2 = df[(df['Lang'] == ('en')) & (df['Geo'].dropna())]
So far, everything seems to be working fine.
Now, the extracted values for Geo result in the following example:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
To get rid of everything except the coordinates inside the square brackets I tried using:
df2.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df2['Geo'] = df2['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only.
The following line from your question indicates that this is an issue with understanding the underlying data type of the returned object.
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
You are returning a Python dictionary here -- not a string! If you want to return just the values of the coordinates, you should just use the 'coordinates' key to return those values, e.g.
df2.loc[1921,'Geo']['coordinates']
[39.11890951, -84.48903638]
The returned object in this case will be a Python list object containing the two coordinate values. If you want just one of the values, you can slice the list, e.g.
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
This workflow is much easier to deal with than casting the dictionary to a string, parsing the string, and recapturing the coordinate values as you are trying to do.
So let's say you want to create a new column called "geo_coord0" which contains all of the coordinates in the first position (as shown above). You could use something like the following:
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
This uses a Python list comprehension to iterate over all entries in the df2['Geo'] column and for each entry it uses the same syntax we used above to return the first coordinate value. It then assigns these values to a new column in df2.
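One caveat worth noting (an assumption about the data, since tweets without coordinates usually carry Geo = None): if df2 can still contain rows whose Geo is missing, the comprehension above would raise a TypeError on those rows. A defensive variant could look like this:
# skip rows whose Geo is not a dict instead of failing on None
df2["geo_coord0"] = [x['coordinates'][0] if isinstance(x, dict) else None
                     for x in df2['Geo']]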
See the Python documentation on data structures for more details on the data structures discussed above.

How to fetch a JSON file to get a row position from a given value or argument

I'm using wget to fetch several dozen JSON files on a daily basis that go like this:
{
    "results": [
        {
            "id": "ABC789",
            "title": "Apple"
        },
        {
            "id": "XYZ123",
            "title": "Orange"
        }]
}
My goal is to find a row's position in each JSON file given a value or set of values (i.e. "in which row is XYZ123 located?"). In the previous example ABC789 is in row 1, XYZ123 in row 2, and so on.
For now I use Google Refine to "quickly" visualize (using the Text Filter option) where XYZ123 stands (row 2).
But since it takes a while to do this manually for each file, I was wondering if there is a quick and efficient way to do it in one go.
What can I do and how can I fetch the files and make the request? Thanks in advance! FoF0
In Python:
import json

# assume json_string = your loaded data
data = json.loads(json_string)

mapped_vals = []
for ent in data['results']:
    mapped_vals.append(ent['id'])
The order of items in the list matches the order in the JSON data, since a list is a sequenced collection.
In PHP:
$data = json_decode($json_string);
$output = array();
foreach ($data->results as $values) {
    $output[] = $values->id;
}
Again, the ordered nature of PHP arrays ensures that the output is ordered as-is with regard to indexes.
Either example could be modified to use a mapped dictionary (python) or an associative array (php) if needs demand.
You could adapt these to functions that take the id value as an argument, track how far they are into the array, and when found, break out and return the current index.
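For instance, a minimal sketch of such a function in Python (using the same 'results' structure shown in the question; the 1-based position matches the row numbering you describe):
def find_row(data, wanted_id):
    # walk the 'results' list and return the 1-based position of the matching id
    for position, entry in enumerate(data['results'], start=1):
        if entry['id'] == wanted_id:
            return position
    return None  # id not present in this file

# e.g. find_row(json.loads(json_string), 'XYZ123') returns 2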
Wow. I posted the original question 10 months ago, when I knew nothing about Python or computer programming whatsoever!
Answer
But I learned basic Python last December and came up with a solution that not only gets the rank order but also inserts the results into a MySQL database:
import urllib.request
import json

# Make connection and get the content
response = urllib.request.urlopen('http://whatever.com/search?=ids=1212,125,54,454')
content = response.read()

# Decode Json search results to type dict
json_search = json.loads(content.decode("utf8"))

# Get 'results' key-value pairs to a list
search_data_all = []
for i in json_search['results']:
    search_data_all.append(i)

# Prepare MySQL list with ranking order for each id item
ranks_list_to_mysql = []
for i in range(len(search_data_all)):
    d = {}
    d['id'] = search_data_all[i]['id']
    d['rank'] = i + 1
    ranks_list_to_mysql.append(d)
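The MySQL insert itself is not shown above. A minimal sketch of that last step, assuming the mysql-connector-python package and a table named search_ranks with columns (id, rank), both of which are assumptions to adapt to your own schema and credentials:
import mysql.connector

# hypothetical connection details and table name
conn = mysql.connector.connect(user='user', password='secret', database='mydb')
cursor = conn.cursor()
rows = [(d['id'], d['rank']) for d in ranks_list_to_mysql]
cursor.executemany("INSERT INTO search_ranks (id, `rank`) VALUES (%s, %s)", rows)
conn.commit()
conn.close()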