I'm trying to figure out the best way to produce a JSON file from R. I have the following data frame tmp:
> tmp
  gender age welcoming proud tidy unique
1      1  30         4     4    4      4
2      2  34         4     2    4      4
3      1  34         5     3    4      5
4      2  33         2     3    2      4
5      2  28         4     3    4      4
6      2  26         3     2    4      3
The output of dput(tmp) is as follows:
tmp <- structure(list(gender = c(1L, 2L, 1L, 2L, 2L, 2L), age = c(30,
34, 34, 33, 28, 26), welcoming = c(4L, 4L, 5L, 2L, 4L, 3L), proud = c(4L,
2L, 3L, 3L, 3L, 2L), tidy = c(4L, 4L, 4L, 2L, 4L, 4L), unique = c(4L,
4L, 5L, 4L, 4L, 3L)), .Names = c("gender", "age", "welcoming",
"proud", "tidy", "unique"), na.action = structure(c(15L, 39L,
60L, 77L, 88L, 128L, 132L, 172L, 272L, 304L, 305L, 317L, 328L,
409L, 447L, 512L, 527L, 605L, 618L, 657L, 665L, 670L, 708L, 709L,
729L, 746L, 795L, 803L, 826L, 855L, 898L, 911L, 957L, 967L, 983L,
984L, 988L, 1006L, 1161L, 1162L, 1224L, 1245L, 1256L, 1257L,
1307L, 1374L, 1379L, 1386L, 1387L, 1394L, 1401L, 1408L, 1434L,
1446L, 1509L, 1556L, 1650L, 1717L, 1760L, 1782L, 1814L, 1847L,
1863L, 1909L, 1930L, 1971L, 2004L, 2022L, 2055L, 2060L, 2065L,
2082L, 2109L, 2121L, 2145L, 2158L, 2159L, 2226L, 2227L, 2281L
), .Names = c("15", "39", "60", "77", "88", "128", "132", "172",
"272", "304", "305", "317", "328", "409", "447", "512", "527",
"605", "618", "657", "665", "670", "708", "709", "729", "746",
"795", "803", "826", "855", "898", "911", "957", "967", "983",
"984", "988", "1006", "1161", "1162", "1224", "1245", "1256",
"1257", "1307", "1374", "1379", "1386", "1387", "1394", "1401",
"1408", "1434", "1446", "1509", "1556", "1650", "1717", "1760",
"1782", "1814", "1847", "1863", "1909", "1930", "1971", "2004",
"2022", "2055", "2060", "2065", "2082", "2109", "2121", "2145",
"2158", "2159", "2226", "2227", "2281"), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
Using the rjson package, I run the line toJSON(tmp) which produces the following JSON file:
{"gender":[1,2,1,2,2,2],
"age":[30,34,34,33,28,26],
"welcoming":[4,4,5,2,4,3],
"proud":[4,2,3,3,3,2],
"tidy":[4,4,4,2,4,4],
"unique":[4,4,5,4,4,3]}
I also experimented with the RJSONIO package; the output of toJSON() was the same. What I would like to produce is the following structure:
{"traits":["gender","age","welcoming","proud", "tidy", "unique"],
"values":[
{"gender":1,"age":30,"welcoming":4,"proud":4,"tidy":4, "unique":4},
{"gender":2,"age":34,"welcoming":4,"proud":2,"tidy":4, "unique":4},
....
]
}
I'm not sure how best to do this. I realize that I could parse it line by line using Python, but I feel there is probably a better way. I also realize that my data structure in R does not reflect the meta-information I want in the JSON file (specifically the traits line), but I am mainly interested in producing the data formatted like the line
{"gender":1,"age":30,"welcoming":4,"proud":4,"tidy":4, "unique":4}
as I can manually add the first line.
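For what it's worth, the line-by-line Python route mentioned above could be sketched roughly as follows. This is an illustrative sketch only; the file name tmp.json is my own assumption for wherever the column-oriented rjson output has been saved.

import json

# Reshape the column-oriented JSON produced by rjson into the desired
# row-oriented form. "tmp.json" is an assumed file name.
with open("tmp.json") as f:
    cols = json.load(f)                      # {"gender": [...], "age": [...], ...}

traits = list(cols)                          # column names, in file order
rows = [dict(zip(traits, values)) for values in zip(*(cols[t] for t in traits))]

print(json.dumps({"traits": traits, "values": rows}, indent=2))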
EDIT: I found a useful blog post where the author dealt with a similar problem and provided a solution. The following function produces formatted JSON (as a string) from a data frame:
toJSONarray <- function(dtf){
  clnms <- colnames(dtf)

  name.value <- function(i){
    quote <- ''
    # original: if(class(dtf[, i]) != 'numeric'){
    # modified so that integer columns are also not enclosed in quotes:
    if(class(dtf[, i]) != 'numeric' && class(dtf[, i]) != 'integer'){
      quote <- '"'
    }
    paste('"', i, '" : ', quote, dtf[, i], quote, sep = '')
  }

  objs <- apply(sapply(clnms, name.value), 1, function(x){ paste(x, collapse = ', ') })
  objs <- paste('{', objs, '}')

  # original: res <- paste('[', paste(objs, collapse=', '), ']')
  # newline added between rows for readable output:
  res <- paste('[', paste(objs, collapse = ',\n'), ']')
  return(res)
}
Using the package jsonlite:
> jsonlite::toJSON(list(traits = names(tmp), values = tmp), pretty = TRUE)
{
"traits": ["gender", "age", "welcoming", "proud", "tidy", "unique"],
"values": [
{
"gender": 1,
"age": 30,
"welcoming": 4,
"proud": 4,
"tidy": 4,
"unique": 4
},
{
"gender": 2,
"age": 34,
"welcoming": 4,
"proud": 2,
"tidy": 4,
"unique": 4
},
{
"gender": 1,
"age": 34,
"welcoming": 5,
"proud": 3,
"tidy": 4,
"unique": 5
},
{
"gender": 2,
"age": 33,
"welcoming": 2,
"proud": 3,
"tidy": 2,
"unique": 4
},
{
"gender": 2,
"age": 28,
"welcoming": 4,
"proud": 3,
"tidy": 4,
"unique": 4
},
{
"gender": 2,
"age": 26,
"welcoming": 3,
"proud": 2,
"tidy": 4,
"unique": 3
}
]
}
Building upon Andrie's idea with apply, you can get exactly what you want by modifying the tmp variable before calling toJSON.
library(RJSONIO)
modified <- list(
traits = colnames(tmp),
values = unname(apply(tmp, 1, function(x) as.data.frame(t(x))))
)
cat(toJSON(modified))
Building further on Andrie and Richie's ideas, use alply instead of apply to avoid converting numbers to characters:
library(RJSONIO)
library(plyr)
modified <- list(
traits = colnames(tmp),
values = unname(alply(tmp, 1, identity))
)
cat(toJSON(modified))
plyr's alply is similar to apply, but it returns a list automatically, whereas apply on its own would return a vector or array unless you add the more complicated per-row function used in Richie Cotton's answer. Those extra steps, including the call to t, also mean that if your dataset has any non-numeric columns, the numbers get converted to strings.
Using alply avoids that concern. For example, take your tmp dataset and add a character column:
tmp$grade <- c("A","B","C","D","E","F")
Then compare the output of this code (with alply) against the apply-based example above.
It seems to me you can do this by sending each row of your data.frame to JSON with the appropriate apply statement.
For a single row:
library(RJSONIO)
> x <- toJSON(tmp[1, ])
> cat(x)
{
"gender": 1,
"age": 30,
"welcoming": 4,
"proud": 4,
"tidy": 4,
"unique": 4
}
The entire data.frame:
x <- apply(tmp, 1, toJSON)
cat(x)
{
"gender": 1,
"age": 30,
"welcoming": 4,
"proud": 4,
"tidy": 4,
"unique": 4
} {
...
} {
"gender": 2,
"age": 26,
"welcoming": 3,
"proud": 2,
"tidy": 4,
"unique": 3
}
Another option is to use split() to turn your data.frame with N rows into N data.frames of one row each.
library(RJSONIO)
modified <- list(
traits = colnames(tmp),
values = split(tmp, seq_len(nrow(tmp)))
)
cat(toJSON(modified))
Related
I'm new to Node and SQL, and I have a question:
I have 3 tables in MySQL
1- Day
2- Exercises
3- Sets
I used a JOIN SQL statement to retrieve data from all 3 tables. The problem is that I have 1 day and 1 exercise but 3 sets of reps, so with my statement I get 3 day objects with the same dayID and exerciceID, each containing a single set of reps.
Any idea how to combine everything into a single object when I have a single dayID?
This is a little app which lets me store daily exercises.
This is the postman response
[
{
"dayID": 11,
"dayName": "monday",
"exerciceID": 5,
"exName": "Biceps Curl braces",
"exComments": "close to body elbows",
"setID": 3,
"repNumbers": 12,
"timeBetween": "3",
"weights": 12,
"comments": "Add another 2 reps"
},
{
"dayID": 11,
"dayName": "monday",
"exerciceID": 5,
"exName": "Biceps Curl braces",
"exComments": "close to body elbows",
"setID": 4,
"repNumbers": 12,
"timeBetween": "3",
"weights": 12,
"comments": "Add another 2 reps"
},
{
"dayID": 11,
"dayName": "monday",
"exerciceID": 5,
"exName": "Biceps Curl braces",
"exComments": "close to body elbows",
"setID": 5,
"repNumbers": 12,
"timeBetween": "3",
"weights": 12,
"comments": "Add another 2 reps"
}
]
This is the statement I used:
const sql = `SELECT DISTINCT * FROM workoutday w JOIN exercice e ON w.dayID = e.dayID JOIN sets s ON e.exerciceID = s.exerciceID`;
Thanks
You have to use INNER JOIN, not just JOIN.
If that doesn't work, your query code in Node is probably being executed 3 times (check this last; 90% of the time it is the first option).
I have a JSON like this:
mytestdata = {
"success": True,
"message": "",
"data": {
"totalCount": 95,
"goal": [
{
"user_id": 123455,
"user_email": "john.smith#test.com",
"user_first_name": "John",
"user_last_name": "Smith",
"people_goals": [
{
"goal_id": 545555,
"goal_name": "test goal name",
"goal_owner": "123455",
"goal_narrative": "",
"goal_type": {
"id": 1,
"name": "Team"
},
"goal_create_at": "1595874095",
"goal_modified_at": "1595874095",
"goal_created_by": "123455",
"goal_updated_by": "123455",
"goal_start_date": "1593561600",
"goal_target_date": "1601424000",
"goal_progress": "34",
"goal_progress_color": "#ff9933",
"goal_status": "1",
"goal_permission": "internal,team",
"goal_category": [],
"goal_owner_full_name": "John Smith",
"goal_team_id": "766754",
"goal_team_name": "",
"goal_workstreams": []
}
]
}
]
}
}
I am trying to display all details in "people_goals" along with "user_last_name", "user_first_name","user_email", "user_id" with json_normalize.
So far I am able to display "people_goals", "user_first_name","user_email" with the code
df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email']], errors='ignore')
However, I am having an issue when trying to include ['goal', 'user_id'] in meta=[].
The error is:
TypeError Traceback (most recent call last)
<ipython-input-192-b7a124a075a0> in <module>
7 df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
8 meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email'], ['goal','user_id']],
----> 9 errors='ignore')
10
11 # df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'])
The only difference I see for 'user_id' is that it is not a string
Am I missing something here?
Your code works on my platform. Still, I've migrated away from using the record_path and meta parameters for two reasons: (a) they are difficult to work out, and (b) there are compatibility issues between versions of pandas.
Therefore I now use the approach of calling json_normalize() multiple times to progressively expand the JSON, or of using pd.Series. Both are included below as examples.
df = pd.json_normalize(data=mytestdata['data']).explode("goal")
df = pd.concat([df, df["goal"].apply(pd.Series)], axis=1).drop(columns="goal").explode("people_goals")
df = pd.concat([df, df["people_goals"].apply(pd.Series)], axis=1).drop(columns="people_goals")
df = pd.concat([df, df["goal_type"].apply(pd.Series)], axis=1).drop(columns="goal_type")
df.T
df2 = pd.json_normalize(pd.json_normalize(
pd.json_normalize(data=mytestdata['data']).explode("goal").to_dict(orient="records")
).explode("goal.people_goals").to_dict(orient="records"))
df2.T
print(df.T.to_string())
output
0
totalCount 95
user_id 123455
user_email john.smith#test.com
user_first_name John
user_last_name Smith
goal_id 545555
goal_name test goal name
goal_owner 123455
goal_narrative
goal_create_at 1595874095
goal_modified_at 1595874095
goal_created_by 123455
goal_updated_by 123455
goal_start_date 1593561600
goal_target_date 1601424000
goal_progress 34
goal_progress_color #ff9933
goal_status 1
goal_permission internal,team
goal_category []
goal_owner_full_name John Smith
goal_team_id 766754
goal_team_name
goal_workstreams []
id 1
name Team
I am trying to fit a curve to a set of data points, but I would like to preserve certain characteristics. As in the graph I am working from (not reproduced here), some of my curves end up being almost linear and some do not. I need a functional form to interpolate between the given data points or to extrapolate past the last given point.
The curves have been created using a simple regression:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, d, b, c):
    return c + b * np.sqrt(x) + d * x
My question now is: what is the best approach to ensure a positive slope past the last data point(s)? In my application a decrease in costs with increasing volume doesn't make sense, even if the data says so.
I would like to keep the order as low as possible; maybe ^3 would still be fine.
The data used to create the curve with the negative slope is
x_data = [ 100, 560, 791, 1117, 1576, 2225,
3141, 4434, 6258, 8834, 12470, 17603,
24848, 35075, 49511, 69889, 98654, 139258,
196573, 277479, 391684, 552893, 780453, 1101672,
1555099, 2195148, 3098628, 4373963, 6174201, 8715381,
12302462, 17365915]
y_data = [ 7, 8, 9, 10, 11, 12, 14, 16, 21, 27, 32, 30, 31,
38, 49, 65, 86, 108, 130, 156, 183, 211, 240, 272, 307, 346,
389, 436, 490, 549, 473, 536]
And for the positive one
x_data = [ 100, 653, 950, 1383, 2013, 2930,
4265, 6207, 9034, 13148, 19136, 27851,
40535, 58996, 85865, 124969, 181884, 264718,
385277, 560741, 816117, 1187796, 1728748, 2516062,
3661939, 5329675, 7756940, 11289641, 16431220, 23914400,
34805603, 50656927]
y_data = [ 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 14, 16, 18,
21, 25, 29, 35, 42, 50, 60, 72, 87, 105, 128, 156, 190,
232, 284, 347, 426, 522, 640]
The curve fitting itself is simply done using
popt, pcov = curve_fit(func, x_data, y_data)
For the plot:
plt.plot(x_data, func(np.array(x_data), *popt), 'g--', label='fit: d=%5.3f, b=%5.3f, c=%5.3f' % tuple(popt))
plt.plot(x_data, y_data, 'ro')
plt.xlabel('Volume')
plt.ylabel('Costs')
plt.show()
A simple solution might just look like this:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import least_squares
def fit_function(x, a, b, c, d):
    # a, b, c enter as squares, keeping those terms non-negative
    return a**2 + b**2 * x + c**2 * abs(x)**d

def residuals(params, xData, yData):
    return [fit_function(x, *params) - y for x, y in zip(xData, yData)]

# x1Data/y1Data and x2Data/y2Data are the two (x_data, y_data) sets from the question
fit1 = least_squares(residuals, [0.1, 0.1, 0.1, 0.5], loss='soft_l1', args=(x1Data, y1Data))
print(fit1.x)
fit2 = least_squares(residuals, [0.1, 0.1, 0.1, 0.5], loss='soft_l1', args=(x2Data, y2Data))
print(fit2.x)

testX1 = np.linspace(0, 1.1 * max(x1Data), 100)
testX2 = np.linspace(0, 1.1 * max(x2Data), 100)
testY1 = [fit_function(x, *fit1.x) for x in testX1]
testY2 = [fit_function(x, *fit2.x) for x in testX2]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x1Data, y1Data)
ax.scatter(x2Data, y2Data)
ax.plot(testX1, testY1)
ax.plot(testX2, testY2)
plt.show()
providing
>>[ 1.00232004e-01 -1.10838455e-04 2.50434266e-01 5.73214256e-01]
>>[ 1.00104293e-01 -2.57749592e-05 1.83726191e-01 5.55926678e-01]
and the corresponding plot of both fits (figure not reproduced here).
It simply takes the parameters as squares, thereby ensuring a positive slope. Naturally, the fit becomes worse if following the decreasing points at the end of data set 1 is forbidden; concerning those, I'd say they are just statistical outliers. That is why I used least_squares, which can handle this with a soft loss (see the SciPy documentation for details). Depending on what the real data set looks like, I'd think about removing them. Finally, I'd expect zero volume to produce zero costs, so the constant term in the fit function doesn't seem to make sense.
If the function is instead only of the type a**2 * x + b**2 * sqrt(x), the result looks like the second plot (figure not reproduced here), where the green graph is the result of leastsq, i.e. without the f_scale option of least_squares.
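To make that reduced-model remark concrete, here is a minimal sketch of the two-parameter fit. The function and variable names are my own; it reuses the question's second (monotonic) data set and the same soft_l1 loss as above.

import numpy as np
from scipy.optimize import least_squares

# Second data set from the question (positive-slope case)
x_data = np.array([100, 653, 950, 1383, 2013, 2930, 4265, 6207, 9034, 13148,
                   19136, 27851, 40535, 58996, 85865, 124969, 181884, 264718,
                   385277, 560741, 816117, 1187796, 1728748, 2516062, 3661939,
                   5329675, 7756940, 11289641, 16431220, 23914400, 34805603,
                   50656927], dtype=float)
y_data = np.array([6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 14, 16, 18, 21, 25, 29,
                   35, 42, 50, 60, 72, 87, 105, 128, 156, 190, 232, 284, 347,
                   426, 522, 640], dtype=float)

def reduced_model(x, a, b):
    # both coefficients are squared, so the curve cannot decrease for x >= 0
    return a**2 * x + b**2 * np.sqrt(x)

def reduced_residuals(params, x, y):
    return reduced_model(x, *params) - y

fit = least_squares(reduced_residuals, x0=[0.1, 0.1], loss='soft_l1',
                    args=(x_data, y_data))
print(fit.x)  # fitted (a, b)

Because both coefficients enter as squares, the slope a**2 + b**2 / (2 * sqrt(x)) is positive for every x > 0, which is exactly the extrapolation behaviour the question asks for.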
I'm trying to write a Python script that reads a .csv file and restructures the values into a specific format/data structure in .json that I can then import into MongoDB. I'm using pedestrian data as my dataset, and there are over a million entries with redundant data. I'm stuck on writing the actual script and on translating the data into my desired .json format.
data.csv (raw):
Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
3, 01-JUN-2009 00:00,2009,June,1,Monday,0,18,Collins Place (North),9
4, 01-JUN-2009 00:00,2009,June,1,Monday,0,16,Australia on Collins,39
5, 01-JUN-2009 00:00,2009,June,1,Monday,0,2,Bourke Street Mall (South),28
6, 01-JUN-2009 00:00,2009,June,1,Monday,0,1,Bourke Street Mall (North),37
7, 01-JUN-2009 00:00,2009,June,1,Monday,0,13,Flagstaff Station,1
8, 01-JUN-2009 00:00,2009,June,1,Monday,0,3,Melbourne Central,155
9, 01-JUN-2009 00:00,2009,June,1,Monday,0,15,State Library,98
10, 01-JUN-2009 00:00,2009,June,1,Monday,0,9,Southern Cross Station,7
11, 01-JUN-2009 00:00,2009,June,1,Monday,0,10,Victoria Point,8
12, 01-JUN-2009 00:00,2009,June,1,Monday,0,12,New Quay,30
Because I'll be uploading to MongoDB, the Id column is redundant in my context, so I need my script to skip it. Sensor_ID is not unique, but I'm planning to make it the primary key and to create a list of objects differentiated by their Hourly_Counts.
I'm aiming to generate JSON like this from the data:
data.json
[
{
"Sensor_ID": 4,
"Sensor_Name": "Town Hall(West)",
"countList":
[
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 0,
"Hourly_Counts": 194
},
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 1,
"Hourly_Counts": 82
}
]
},
{
"Sensor_ID": 17,
"Sensor_Name": "Collins Place(North)",
"countList":
[
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 0,
"Hourly_Counts": 21
}
]
}
]
And so on. I'm trying to make it so that when the script reads a Sensor_ID, it creates a JSON object from the fields listed and adds it to that sensor's countList. (In the example above I added a second entry from Sensor_ID = 4 to its countList.)
I am using Python 2.7.x, and I have looked at every question concerning this on Stack Overflow and every other website. Very few of them want to restructure the .csv data when converting to .json, so it's been a bit difficult.
Here is what I have so far. I'm still relatively new to Python, so I thought this would be a good thing to try out.
csvtojson.py
import csv
import json

def csvtojson():
    filename = 'data.csv'
    fieldnames = ('Id', 'Date_Time', 'Year', 'Month', 'Mdate', 'Day',
                  'Time', 'Sensor_ID', 'Sensor_Name', 'Hourly_Counts')
    dataTime = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day',
                'Time', 'Hourly_Counts')
    all_data = {}
    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames)
        # skip header
        next(reader)
        current_sensorID = None
        for row in reader:
            sensor_ID = row['Sensor_ID']
            sensorName = row['Sensor_Name']
            data = all_data[sensor_ID] = {}
            data['dataTime'] = dict((k, row[k]) for k in dataTime)
    print json.dumps(all_data, indent=4, sort_keys=True)

if __name__ == "__main__":
    csvtojson()
As far as I've got, countList ends up in its own object, but the script is not creating a list of objects, which may mess up the import into MongoDB. It does filter by Sensor_ID, but it overwrites the entry when there are duplicates instead of appending to countList. And I can't seem to get it into the format/data structure I want; I'm not even sure that's the right structure. The ultimate goal is to import the millions of rows into MongoDB the way I listed, and I'm trying a small set now to test it out.
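For reference, the grouping described above could be sketched as follows. This is an illustrative sketch rather than a fix of the code above; it assumes the data.csv header shown earlier and leaves the CSV values as strings, so numeric fields would still need casting.

import csv
import json
from collections import OrderedDict

count_fields = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day', 'Time', 'Hourly_Counts')

sensors = OrderedDict()
with open('data.csv', 'rb') as csvfile:
    # DictReader picks up the field names from the header row, so Id can simply be ignored
    for row in csv.DictReader(csvfile):
        key = row['Sensor_ID']
        if key not in sensors:
            sensors[key] = {'Sensor_ID': key,
                            'Sensor_Name': row['Sensor_Name'],
                            'countList': []}
        sensors[key]['countList'].append(dict((k, row[k]) for k in count_fields))

print(json.dumps(list(sensors.values()), indent=4))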
Please check the following:
https://github.com/gurdyals/test-repo/tree/master/MongoDB
Use the "MongoDB_py.zip" files. I did the same thing to convert CSV data to MongoDB dicts.
Please let me know if you have any questions.
Thanks
Here is sample code for doing something similar to the above using python pandas. You could also do some aggregation in the dataframe if you wish to summarise the data to get rid of the redundant data.
import pandas as pd
import pprint as pp
import json
from collections import defaultdict
results = defaultdict(lambda: defaultdict(dict))
df = pd.read_csv('data.csv')
df.set_index(['Sensor_ID', 'Sensor_Name'],inplace=True)
df.reset_index(inplace=True)
# one JSON string of records per (Sensor_ID, Sensor_Name) group
grouped = df.groupby(['Sensor_ID', 'Sensor_Name']).apply(
    lambda x: x.drop(['Sensor_ID', 'Sensor_Name'], axis=1).to_json(orient='records'))
grouped.name = 'countList'
js = json.loads(pd.DataFrame(grouped).reset_index().to_json(orient='records'))
print json.dumps(js, indent = 4)
The output:
[
{
"Sensor_ID": 1,
"countList": "[{\"Id\":6,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":37}]",
"Sensor_Name": "Bourke Street Mall (North)"
},
{
"Sensor_ID": 2,
"countList": "[{\"Id\":5,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":28}]",
"Sensor_Name": "Bourke Street Mall (South)"
},
{
"Sensor_ID": 3,
"countList": "[{\"Id\":8,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":155}]",
"Sensor_Name": "Melbourne Central"
},
{
"Sensor_ID": 4,
"countList": "[{\"Id\":1,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":194}]",
"Sensor_Name": "Town Hall (West)"
},
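Note that countList in this output is a JSON string rather than a list of nested objects. A possible follow-up, sketched here under the assumption that the grouped Series and the json import from the code above are available, is to parse each string back into objects before exporting:

# Sketch: turn each countList string back into nested objects, e.g. before
# handing the records to MongoDB. `grouped` is the Series built above.
records = []
for (sensor_id, sensor_name), counts in zip(grouped.index, grouped.values):
    records.append({
        'Sensor_ID': int(sensor_id),   # cast numpy int64 to a plain int for JSON
        'Sensor_Name': sensor_name,
        'countList': json.loads(counts),
    })
print(json.dumps(records, indent=4))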
How do I convert this into a pandas dataframe?
df_components = { "result1":
{"data" : [["43", "48", "27", "12"], ["67", "44", "24", "11"], ["11.85", "6.31", "5.18", "11.70"]],
"index" : [["Device_use_totala11. PS4", "Unweighted base"], ["Device_use_totala11. PS4", "Base"], ["Device_use_totala11. PS4", "Mean"]],
"columns" : [["Age", "Under 30"], ["Age", "30-44"], ["Age", "45-54"], ["Age", "55+"]]}
}
It's a dict of lists of lists. I thought this would work, but it returns something funky that doesn't look like a DataFrame:
pd.DataFrame(df_components['result1'])
Output looks like:
columns [[Age, Under 30], [Age, 30-44], [Age, 45-54], ...
data [[43, 48, 27, 12], [67, 44, 24, 11], [11.85, 6...
index [[Device_use_totala11. PS4, Unweighted base], ...
Expected output:
A MultiIndex DataFrame, something similar to the table below.
Your dict is not formatted in a way that can be passed directly to the DataFrame constructor; you need to do:
d = df_components["result1"]
df = pd.DataFrame(d["data"],
columns=pd.MultiIndex.from_tuples(d["columns"]),
index=pd.MultiIndex.from_tuples(d["index"]))
df
                                               Age
                                          Under 30  30-44  45-54    55+
Device_use_totala11. PS4 Unweighted base        43     48     27     12
                         Base                   67     44     24     11
                         Mean                11.85   6.31   5.18  11.70
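If helpful, a quick usage check of the resulting MultiIndex frame (my own example, not part of the answer above) could look like this:

# Select one column and one row from the MultiIndex frame built above
print(df[("Age", "Under 30")])
print(df.loc[("Device_use_totala11. PS4", "Mean")])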