Parsing a Data Table from Website That is Using JSON - json

I am trying to parse data from the Minnesota DNR page, and it says they are using JSON. I want to build a script to download the data tables from many different pages, but am focusing on one first. I have tried rvest, RJSONIO, and many other packages to no avail. The most frustrating error I am getting is:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "list"
Here is my code:
library(rvest)
kk<-read_html("http://www.dnr.state.mn.us/lakefind/showreport.html?downum=56003100")
node <- "table.table_colors:nth-child(1) > tbody:nth-child(1)"
html_table(node, fill=TRUE)
head(kk)
How do I get this table to download with the headers intact?

Just get the actual data that goes into making the table. It's JSON and not too complex:
library(httr)
res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
           query = list(type = "lake_survey", id = "56003100"))
str(content(res))
This lets you get metadata by county name:
get_lake_metadata <- function(county_name) {

  require(httr)
  require(dplyr)
  require(jsonlite)

  xlate_df <- data_frame(
    id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
           11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L,
           24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L,
           37L, 38L, 39L, 40L, 41L, 42L, 44L, 45L, 46L, 43L, 47L, 48L, 49L,
           50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L, 61L, 62L,
           63L, 64L, 65L, 66L, 67L, 68L, 70L, 71L, 72L, 69L, 73L, 74L, 75L,
           76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L, 84L, 85L, 86L, 87L),
    county = c("Aitkin", "Anoka", "Becker", "Beltrami", "Benton",
               "Big Stone", "Blue Earth", "Brown", "Carlton", "Carver",
               "Cass", "Chippewa", "Chisago", "Clay", "Clearwater", "Cook",
               "Cottonwood", "Crow Wing", "Dakota", "Dodge", "Douglas",
               "Faribault", "Fillmore", "Freeborn", "Goodhue", "Grant",
               "Hennepin", "Houston", "Hubbard", "Isanti", "Itasca", "Jackson",
               "Kanabec", "Kandiyohi", "Kittson", "Koochiching", "Lac Qui Parle",
               "Lake", "Lake of the Woods", "Le Sueur", "Lincoln", "Lyon",
               "Mahnomen", "Marshall", "Martin", "McLeod", "Meeker", "Mille Lacs",
               "Morrison", "Mower", "Murray", "Nicollet", "Nobles", "Norman",
               "Olmsted", "Otter Tail", "Pennington", "Pine", "Pipestone",
               "Polk", "Pope", "Ramsey", "Red Lake", "Redwood", "Renville",
               "Rice", "Rock", "Roseau", "Scott", "Sherburne", "Sibley",
               "St. Louis", "Stearns", "Steele", "Stevens", "Swift", "Todd",
               "Traverse", "Wabasha", "Wadena", "Waseca", "Washington",
               "Watonwan", "Wilkin", "Winona", "Wright", "Yellow Medicine"))

  target <- filter(xlate_df, tolower(county) == tolower(county_name))

  if (nrow(target) == 1) {
    res <- GET("http://maps2.dnr.state.mn.us/cgi-bin/lakefinder_json.cgi",
               query = list(county = target$id))
    fromJSON(content(res, as = "text", encoding = "UTF-8"))
  } else {
    message("County not found")
  }
}
get_lake_metadata("Anoka")
get_lake_metadata("Steele")

Related

Python- Issue parsing multi-layered API JSON into CSV

I'm trying to parse the NIH grant API and am running into a complex layering issue. In the JSON output below, I've been able to navigate into the "results" section which contains all the fields I want, except some are layered within another dictionary. What I'm trying to do is get the JSON data within "full_study_section", "organization", and "project_num_split" to be in the same layer as "appl_id", "contact_pi_name", "fiscal_year", and so forth. This post was helpful but I'm not quite sure how to level the layers through iteration.
{
  "meta": {
    "limit": 25,
    "offset": 0,
    "properties": {},
    "search_id": null,
    "sort_field": "project_start_date",
    "sort_order": "desc",
    "sorted_by_relevance": false,
    "total": 78665
  },
  "results": [
    {
      "appl_id": 10314644,
      "contact_pi_name": "BROCATO, EMILY ROSE",
      "fiscal_year": 2021,
      "full_study_section": {
        "group_code": "32",
        "name": "Special Emphasis Panel[ZAA1 GG (32)]",
        "sra_designator_code": "GG",
        "sra_flex_code": "",
        "srg_code": "ZAA1",
        "srg_flex": ""
      },
      "organization": {
        "city": null,
        "country": null,
        "dept_type": "PHARMACOLOGY",
        "external_org_id": 353201,
        "fips_country_code": null,
        "org_city": "RICHMOND",
        "org_country": "UNITED STATES",
        "org_duns": [
          "105300446"
        ],
        "org_fips": "US",
        "org_ipf_code": "353201",
        "org_name": "VIRGINIA COMMONWEALTH UNIVERSITY",
        "org_state": "VA",
        "org_state_name": null,
        "org_zipcode": "232980568"
      },
      "project_end_date": null,
      "project_num": "1F31AA029259-01A1",
      "project_num_split": {
        "activity_code": "F31",
        "appl_type_code": "1",
        "full_support_year": "01A1",
        "ic_code": "AA",
        "serial_num": "029259",
        "suffix_code": "A1",
        "support_year": "01"
      },
      "project_start_date": "2022-03-07T05:00:00Z",
      "subproject_id": null
    },
Code:
import requests
import json
import csv

params = {
    "criteria": {
        "fiscal_years": [2021]
    },
    "include_fields": [
        "ApplId", "ContactPiName", "FiscalYear",
        "OrgCountry", "AllText",
        "FullStudySection", "Organization", "ProjectEndDate",
        "ProjectNum", "ProjectNumSplit", "ProjectStartDate", "SubprojectId"
    ],
    "offset": 0,
    "limit": 25,
    "sort_field": "project_start_date",
    "sort_order": "desc"
}

response = requests.post("https://api.reporter.nih.gov/v2/projects/search", json=params)
#print(response.status_code)
#print(response.text)

resdecode = json.loads(response.text)
#print(json.dumps(resdecode, sort_keys=True, indent=4, separators=(',', ':')))

data = resdecode["results"]
#print(json.dumps(data, sort_keys=True, indent=4, separators=(',', ':')))

pns = resdecode["results"][0]["project_num_split"]
#print(json.dumps(pns, sort_keys=True, indent=4, separators=(',', ':')))

# for item in data:
#     appl_id = item.get("appl_id")
#     print(appl_id)

writerr = csv.writer(open('C:/Users/nkmou/Desktop/Venture/Tech Opportunities/NIH.csv', 'w', newline=''))
count = 0
for row in resdecode:
    if count == 0:
        header = resdecode.keys()
        writerr.writerow(header)
        count += 1
    writerr.writerow(row)
writerr.close()
In order to move the items under full_study_section, organization and project_num_split to the same level as appl_id, contact_pi_name and fiscal_year, you will have to loop through each of the results, copy the key/value pairs out of those three nested dicts into the parent, and then delete the full_study_section, organization and project_num_split keys once done. The code below should work as you expected.
import requests
import json
import csv

params = {
    "criteria": {
        "fiscal_years": [2021]
    },
    "include_fields": [
        "ApplId", "ContactPiName", "FiscalYear",
        "OrgCountry", "AllText",
        "FullStudySection", "Organization", "ProjectEndDate",
        "ProjectNum", "ProjectNumSplit", "ProjectStartDate", "SubprojectId"
    ],
    "offset": 0,
    "limit": 25,
    "sort_field": "project_start_date",
    "sort_order": "desc"
}

response = requests.post("https://api.reporter.nih.gov/v2/projects/search", json=params)
resdecode = json.loads(response.text)
data = resdecode["results"]

# Copy each nested dict's key/value pairs up one level, then drop the nested dict
for item in data:
    x = ["full_study_section", "organization", "project_num_split"]
    for i in x:
        for key, value in item[i].items():
            item[key] = value
        del item[i]

with open('C:/Users/nkmou/Desktop/Venture/Tech Opportunities/NIH.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    count = 0
    for row in data:
        if count == 0:
            header = row.keys()
            writer.writerow(header)
            count += 1
        writer.writerow(row.values())
You can move the items to the required level and remove the dict.
import json
import pprint
pp = pprint
file = open("test.json")
jsonData = json.load(file)
full_study_section = jsonData['results'][0]['full_study_section']
organization = jsonData['results'][0]['organization']
project_num_split = jsonData['results'][0]['project_num_split']
jsonData['results'][0].update(full_study_section)
jsonData['results'][0].update(project_num_split)
jsonData['results'][0].update(organization)
jsonData['results'][0].pop('full_study_section')
jsonData['results'][0].pop('project_num_split')
jsonData['results'][0].pop('organization')
pp.pprint(jsonData)
Output:
{u'meta': {u'limit': 25,
u'offset': 0,
u'properties': {},
u'search_id': None,
u'sort_field': u'project_start_date',
u'sort_order': u'desc',
u'sorted_by_relevance': False,
u'total': 78665},
u'results': [{u'activity_code': u'F31',
u'appl_id': 10314644,
u'appl_type_code': u'1',
u'city': None,
u'contact_pi_name': u'BROCATO, EMILY ROSE',
u'country': None,
u'dept_type': u'PHARMACOLOGY',
u'external_org_id': 353201,
u'fips_country_code': None,
u'fiscal_year': 2021,
u'full_support_year': u'01A1',
u'group_code': u'32',
u'ic_code': u'AA',
u'name': u'Special Emphasis Panel[ZAA1 GG (32)]',
u'org_city': u'RICHMOND',
u'org_country': u'UNITED STATES',
u'org_duns': [u'105300446'],
u'org_fips': u'US',
u'org_ipf_code': u'353201',
u'org_name': u'VIRGINIA COMMONWEALTH UNIVERSITY',
u'org_state': u'VA',
u'org_state_name': None,
u'org_zipcode': u'232980568',
u'project_end_date': None,
u'project_num': u'1F31AA029259-01A1',
u'project_start_date': u'2022-03-07T05:00:00Z',
u'serial_num': u'029259',
u'sra_designator_code': u'GG',
u'sra_flex_code': u'',
u'srg_code': u'ZAA1',
u'srg_flex': u'',
u'subproject_id': None,
u'suffix_code': u'A1',
u'support_year': u'01'}]}
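As a side note not covered by the answers above: pandas can do this flattening in one call with json_normalize, assuming the results list has the structure shown in the question. A minimal sketch with a made-up mini-record (the field names mirror the sample JSON; the data is illustrative only):

```python
import pandas as pd

# Hypothetical mini-record shaped like one NIH RePORTER result
results = [{
    "appl_id": 10314644,
    "fiscal_year": 2021,
    "organization": {"org_city": "RICHMOND", "org_state": "VA"},
    "project_num_split": {"activity_code": "F31", "support_year": "01"},
}]

# json_normalize lifts nested dicts into flat columns
# such as "organization_org_city" (sep controls the joined name)
df = pd.json_normalize(results, sep="_")
print(df.columns.tolist())
```

From there, `df.to_csv(..., index=False)` would produce the flat CSV the question is after, without any manual key shuffling.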

Choosing an intercept in lmer

I am using an lmer model (from lmerTest) to understand whether size is significantly correlated with gene expression, and if so, which specific genes are correlated with size (also accounting for 'female' and 'cage' as random effects):
lmer(Expression ~ size*genes + (1|female) + (1|cage), data = df)
In the summary output, one of my genes is being used up as an intercept (since it is highest in the alphabet, 'ctsk'). After reading around, it was recommended that I choose the highest (or lowest) expressed gene as my intercept to compare everything else against. In this case, the gene 'star' was the highest expressed. After re-levelling my data, and re-running the model with 'star' as the intercept, ALL the other slopes are now significant in summary() output, although anova() output is identical.
My questions are:
Is it possible to not have one of my genes used as an intercept? If it is not possible, then how do I know which gene I should choose as an intercept?
Can I test whether the slopes are different from zero? Perhaps this is where I would specify no intercept in my model (i.e. '0+size*genes')?
Is it possible to have the intercept as the mean of all slopes?
I will then use lsmeans to determine whether the slopes are significantly different from each other.
Here is some reproducible code:
df <- structure(list(size = c(13.458, 13.916, 13.356, 13.84, 14.15,
16.4, 15.528, 13.916, 13.458, 13.285, 15.415, 14.181, 13.367,
13.356, 13.947, 14.615, 15.804, 15.528, 16.811, 14.677, 13.2,
17.57, 13.947, 14.15, 16.833, 13.2, 17.254, 16.4, 14.181, 13.367,
14.294, 13.84, 16.833, 17.083, 15.847, 13.399, 14.15, 15.47,
13.356, 14.615, 15.415, 15.596, 15.847, 16.833, 13.285, 15.47,
15.596, 14.181, 13.356, 14.294, 15.415, 15.363, 15.4, 12.851,
17.254, 13.285, 17.57, 14.7, 17.57, 13.947, 16.811, 15.4, 13.399,
14.22, 13.285, 14.344, 17.083, 15.363, 14.677, 15.945), female = structure(c(7L,
12L, 7L, 11L, 12L, 9L, 6L, 12L, 7L, 7L, 6L, 12L, 8L, 7L, 7L,
11L, 9L, 6L, 10L, 11L, 8L, 10L, 7L, 12L, 10L, 8L, 10L, 9L, 12L,
8L, 12L, 11L, 10L, 10L, 9L, 8L, 12L, 6L, 7L, 11L, 6L, 9L, 9L,
10L, 7L, 6L, 9L, 12L, 7L, 12L, 6L, 6L, 6L, 8L, 10L, 7L, 10L,
11L, 10L, 7L, 10L, 6L, 8L, 11L, 7L, 6L, 10L, 6L, 11L, 9L), .Label = c("2",
"3", "6", "10", "11", "16", "18", "24", "25", "28", "30", "31",
"116", "119", "128", "135", "150", "180", "182", "184", "191",
"194", "308", "311", "313", "315", "320", "321", "322", "324",
"325", "329", "339", "342"), class = "factor"), Expression = c(1.10620339407889,
1.06152707257767, 2.03000185674761, 1.92971750056866, 1.30833983462599,
1.02760836165184, 0.960969703469363, 1.54706275342441, 0.314774666283256,
2.63330873720495, 0.895123048920455, 0.917716470037954, 1.3178821021651,
1.57879156856332, 0.633429011784367, 1.12641940390116, 1.0117475796626,
0.687813581350802, 0.923485880847423, 2.98926377892241, 0.547685277701021,
0.967691178046748, 2.04562285257417, 1.09072264997544, 1.57682235413366,
0.967061529758701, 0.941995966023426, 0.299517719292817, 1.8654758451133,
0.651369936708288, 1, 1.04407979584122, 0.799275069735012, 1.007255409328,
0.428129727802404, 0.93927930755046, 0.987394257033815, 0.965050972503591,
2.06719308587322, 1.63846508102874, 0.997380526962644, 0.60270197593643,
2.78682867333149, 0.552922632281237, 3.06702198884562, 0.890708510580522,
1.15168812515828, 0.929205084743164, 2.27254101826041, 1, 0.958147442333527,
1.05924173014089, 0.984356852670054, 0.623630720815415, 0.796864961771971,
2.4679841984147, 1.07248904053777, 1.79630829771291, 0.929642913565982,
0.296954006040077, 2.25741254504115, 1.17188536743493, 0.849778293699644,
2.32679163466857, 0.598119006609413, 0.975660099975423, 1.01494421228949,
1.14007557533352, 2.03638316428189, 0.777347547080068), cage = structure(c(64L,
49L, 56L, 66L, 68L, 48L, 53L, 49L, 64L, 56L, 55L, 68L, 80L, 56L,
64L, 75L, 69L, 53L, 59L, 66L, 63L, 59L, 64L, 68L, 59L, 63L, 50L,
48L, 68L, 80L, 49L, 66L, 59L, 50L, 48L, 63L, 68L, 62L, 56L, 75L,
55L, 81L, 48L, 59L, 56L, 62L, 81L, 68L, 56L, 49L, 55L, 62L, 55L,
63L, 50L, 56L, 59L, 75L, 59L, 64L, 59L, 55L, 63L, 66L, 56L, 53L,
50L, 62L, 66L, 81L), .Label = c("023", "024", "041", "042", "043",
"044", "044 bis", "045", "046", "047", "049", "051", "053", "058",
"060", "061", "068", "070", "071", "111", "112", "113", "123",
"126", "128", "14", "15", "23 bis", "24", "39", "41", "42", "44",
"46 bis", "47", "49", "51", "53", "58", "60", "61", "67", "68",
"70", "75", "76", "9", "D520", "D521", "D522", "D526", "D526bis",
"D533", "D535", "D539", "D544", "D545", "D545bis", "D546", "D561",
"D561bis", "D564", "D570", "D581", "D584", "D586", "L611", "L616",
"L633", "L634", "L635", "L635bis", "L637", "L659", "L673", "L676",
"L686", "L717", "L718", "L720", "L725", "L727", "L727bis"), class = "factor"),
genes = c("igf1", "gr", "ctsk", "ets2", "ctsk", "mtor", "igf1",
"sgk1", "sgk1", "ghr1", "ghr1", "gr", "ctsk", "ets2", "timp2",
"timp2", "ets2", "rictor", "sparc", "mmp9", "gr", "sparc",
"mmp2", "ghr1", "mmp9", "sparc", "mmp2", "timp2", "star",
"sgk1", "mmp2", "gr", "mmp2", "rictor", "timp2", "mmp2",
"mmp2", "mmp2", "mmp2", "rictor", "mtor", "ghr1", "star",
"igf1", "mmp9", "igf1", "igf2", "rictor", "rictor", "mmp9",
"ets2", "ctsk", "mtor", "ghr1", "mtor", "ets2", "ets2", "igf2",
"igf1", "sgk1", "sgk1", "ghr1", "sgk1", "igf2", "star", "mtor",
"igf2", "ghr1", "mmp2", "rictor")), .Names = c("size", "female",
"Expression", "cage", "genes"), row.names = c(1684L, 2674L, 10350L,
11338L, 10379L, 4586L, 1679L, 3637L, 3610L, 5537L, 5530L, 2676L,
10355L, 11313L, 8422L, 8450L, 11322L, 6494L, 9406L, 13262L, 2653L,
9407L, 12274L, 5564L, 13256L, 9394L, 12294L, 8438L, 750L, 3614L,
12303L, 2671L, 12293L, 6513L, 8437L, 12284L, 12305L, 12267L,
12276L, 6524L, 4567L, 5545L, 733L, 1700L, 13241L, 1674L, 7471L,
6528L, 6498L, 13266L, 11308L, 10347L, 4566L, 5541L, 4590L, 11315L,
11333L, 7482L, 1703L, 3607L, 3628L, 5529L, 3617L, 7483L, 722L,
4565L, 7476L, 5532L, 12299L, 6510L), class = "data.frame")
genes <- as.factor(df$genes)
library(lmerTest)
fit1 <- lmer(Expression ~ size * genes +(1|female) + (1|cage), data = df)
anova(fit1)
summary(fit1) # uses the gene 'ctsk' as the intercept, so re-level to use the highest expressed gene ('star') as the reference:
df$genes <- relevel(genes, "star")
# re-fit the model with 'star' as the intercept:
fit1 <- lmer(Expression ~ size * genes +(1|female) + (1|cage), data = df)
anova(fit1) # no difference here
summary(fit1) # lots of difference
My sample data is pretty long since the model wouldn't run otherwise; hopefully this is OK!
While it is possible to interpret the coefficients in your fitted model, that isn't the most fruitful or productive approach. Instead, just fit the model using whatever contrasts are used by default, and follow up with suitable post-hoc analyses.
For that, I suggest using the emmeans (estimated marginal means) package, which is a continuation of lsmeans where all future developments will take place. The package has several vignettes, and the one most relevant to your situation is vignette("interactions"), particularly the section on interactions with covariates.
Briefly, comparing intercepts can be very misleading, since those are predictions at size = 0, which is an extrapolation; moreover, as you suggest in your question, the real point here is probably to compare slopes rather than intercepts. For that purpose, there is an emtrends() function (or, if you like, its alias lstrends()).
I also strongly recommend displaying a graph of the model predictions so you can visualize what's going on. This may be done via
library(emmeans)
emmip(fit1, genes ~ size, at = list(size = range(df$size)))

Writing JSON children in R

I have a data set that I would like to group in JSON.
address city.x state.x latitude.x longitude.x
1 5601 W. Slauson Ave. #200 Culver City CA 33.99718 -118.40145
2 PO 163005 Austin TX 30.31622 -97.85877
3 10215 W. Jamesburg Street Wichita KS 37.70063 -97.43430
4 14556 Newport Ave Tustin CA 33.74165 -117.82127
5 2496 Falcon Crescent Virginia Beach VA 36.83840 -76.02862
6 1306 Wilshire Boulevard Santa Monica CA 34.03216 -118.49022
I would like to group together address and lat/long and put it all under the category of company.
I would like it to look like this:
{company: {address: {address: "5601 W. Slauson Ave. #200" ,
city.x: "Culver City" ,
state.x: "CA"}},
{geo: {latitude: "33.99718",
longitude: "-118.40145"}}},
{company: {address: {address: "PO 163005" ,
city.x: "Austin" ,
state.x: "TX"}},
{geo: {latitude: "30.31622",
longitude: "-97.85877"}}},
structure(list(address = c("5601 W. Slauson Ave. #200", "PO 163005",
"10215 W. Jamesburg Street", "14556 Newport Ave", "2496 Falcon Crescent",
"1306 Wilshire Boulevard"), city.x = c("Culver City", "Austin",
"Wichita", "Tustin", "Virginia Beach", "Santa Monica"), state.x = c("CA",
"TX", "KS", "CA", "VA", "CA"), latitude.x = c(33.997179, 30.316223,
37.700632, 33.741651, 36.838398, 34.032159), longitude.x = c(-118.40145,
-97.85877, -97.4343, -117.82127, -76.02862, -118.49022)), .Names = c("address",
"city.x", "state.x", "latitude.x", "longitude.x"), class = "data.frame", row.names = c(NA,
6L))
Any help would be appreciated!
The following code should output what you want:
for (i in 1:nrow(df)){
  cat("{company:{address:{address:\t\"", df$address[i],
      "\",\n\t\tcity.x:\t\"", df$city.x[i],
      "\",\n\t\tstate.x:\t \"", df$state.x[i],
      "\"}}\n\t {geo:{\tlatitude: \"", df$latitude.x[i],
      "\",\n\t\tlongitude: \"", df$longitude.x[i],
      "\"}}},\n", sep="")
}
with df as your data frame.
Another option is to use the rjson package.
require(rjson)
# This is necessary to avoid duplication of labels in the JSON output
names(df) <- NULL
reshaped <- apply(df, 1, FUN=function(x){
  list(address = list(
         address = x[1],
         city = x[2],
         state = x[3]),
       coords = list(
         latitude = x[4],
         longitude = x[5]))
})
result <- toJSON(reshaped)
The only difference from what you requested is that instead of having "company" as the root it will have sequential numbers. You could change it by changing the row names of your data (using rownames), but R does not support duplicate row names... the closest that I got was using
rownames(df) <- paste("company", 1:nrow(df), sep="")
and maybe with a little regexp magic you could strip the numbers in the output string...
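The "regexp magic" hinted at above is language-agnostic, so here is an illustrative sketch in Python (the sample string is made up, and it assumes the keys follow the "companyN" pattern produced by the rownames trick):

```python
import re

# Hypothetical JSON text as produced with rownames "company1", "company2", ...
json_text = ('{"company1":{"address":{"city":"Austin"}},'
             '"company2":{"address":{"city":"Wichita"}}}')

# Drop the trailing digits from each "companyN" key.
# Note: the result then contains duplicate "company" keys, which is
# syntactically legal but unusual JSON -- most parsers keep only the
# last occurrence, which is why the question's target format is awkward.
stripped = re.sub(r'"company\d+"', '"company"', json_text)
print(stripped)
```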

Creating a json string from a loadings object in R

After performing a factor analysis the loadings object looks like this:
Loadings:
Factor1 Factor2
IV1 0.844 -0.512
IV2 0.997
IV3 -0.235
IV4 -0.144
IV5 0.997
Factor1 Factor2
SS loadings 1.719 1.333
Proportion Var 0.344 0.267
Cumulative Var 0.344 0.610
I can target the factors themselves using print(fit$loadings[,1:2]) to get the following.
Factor1 Factor2
IV1 0.84352949 -0.512090197
IV2 0.01805673 0.997351400
IV3 0.05877499 -0.234710743
IV4 0.09088599 -0.144251843
IV5 0.99746785 0.008877643
I would like to create a json string that would look something like the following.
"loadings": {
"Factor1": {
"IV1": 0.84352949, "IV2":0.01805673, "IV3":0.05877499, "IV4": 0.09088599, "IV5": 0.99746785
},
"Factor2": {
"IV1": -0.512090197, "IV2": 0.997351400, "IV3": -0.234710743, "IV4": -0.144251843, "IV5": 0.008877643
}
}
I have tried accessing the individual properties using unclass(), hoping that I could then loop through and put them into a string, but have not had any luck (using loads <- loadings(fit), names(unclass(loads)) shows up as NULL).
Just seconding #GSee's comment (+1) and #dickoa's answer (+1) with a closer example:
Creating some demo data for reproducible example (you should also provide one in all your Qs):
> fit <- princomp(~ ., data = USArrests, scale = FALSE)
Load RJSONIO/rjson packages:
> library(RJSONIO)
Transform your data to fit your needs:
> res <- list(loadings = apply(fit$loadings, 2, list))
Return JSON:
> cat(toJSON(res))
{
  "loadings": {
    "Comp.1": [
      {
        "Murder": -0.041704,
        "Assault": -0.99522,
        "UrbanPop": -0.046336,
        "Rape": -0.075156
      }
    ],
    "Comp.2": [
      {
        "Murder": 0.044822,
        "Assault": 0.05876,
        "UrbanPop": -0.97686,
        "Rape": -0.20072
      }
    ],
    "Comp.3": [
      {
        "Murder": 0.079891,
        "Assault": -0.06757,
        "UrbanPop": -0.20055,
        "Rape": 0.97408
      }
    ],
    "Comp.4": [
      {
        "Murder": 0.99492,
        "Assault": -0.038938,
        "UrbanPop": 0.058169,
        "Rape": -0.072325
      }
    ]
  }
}
You can do something along these lines
require(RJSONIO) ## or require(rjson)
pca <- prcomp(~ ., data = USArrests, scale = FALSE)
export <- list(loadings = split(pca$rotation, rownames(pca$rotation)))
cat(toJSON(export))
## {
## "loadings": {
## "Assault": [ 0.99522, -0.05876, -0.06757, 0.038938 ],
## "Murder": [ 0.041704, -0.044822, 0.079891, -0.99492 ],
## "Rape": [ 0.075156, 0.20072, 0.97408, 0.072325 ],
## "UrbanPop": [ 0.046336, 0.97686, -0.20055, -0.058169 ]
## }
## }
If you want to export it :
cat(toJSON(export), file = "loadings.json")
If it doesn't really suit your need, just modify the data structure (export object) to the output you want.
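For comparison, the nested "loadings" shape the question asks for is a simple dict-of-dicts transform in any language; here is an illustrative Python sketch (the loading values are made up, not taken from a real factor analysis):

```python
import json

# Hypothetical loadings: rows are variables, columns are factors
variables = ["IV1", "IV2", "IV3"]
matrix = {"Factor1": [0.84, 0.02, 0.06],
          "Factor2": [-0.51, 0.99, -0.23]}

# Build {"loadings": {"Factor1": {"IV1": ..., ...}, ...}}
loadings = {factor: dict(zip(variables, column))
            for factor, column in matrix.items()}
print(json.dumps({"loadings": loadings}, indent=2))
```

The same zip-columns-with-row-names idea is what split() and apply() are doing in the R answers above.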

Strategies for formatting JSON output from R

I'm trying to figure out the best way of producing a JSON file from R. I have the following dataframe tmp in R.
> tmp
gender age welcoming proud tidy unique
1 1 30 4 4 4 4
2 2 34 4 2 4 4
3 1 34 5 3 4 5
4 2 33 2 3 2 4
5 2 28 4 3 4 4
6 2 26 3 2 4 3
The output of dput(tmp) is as follows:
tmp <- structure(list(gender = c(1L, 2L, 1L, 2L, 2L, 2L), age = c(30,
34, 34, 33, 28, 26), welcoming = c(4L, 4L, 5L, 2L, 4L, 3L), proud = c(4L,
2L, 3L, 3L, 3L, 2L), tidy = c(4L, 4L, 4L, 2L, 4L, 4L), unique = c(4L,
4L, 5L, 4L, 4L, 3L)), .Names = c("gender", "age", "welcoming",
"proud", "tidy", "unique"), na.action = structure(c(15L, 39L,
60L, 77L, 88L, 128L, 132L, 172L, 272L, 304L, 305L, 317L, 328L,
409L, 447L, 512L, 527L, 605L, 618L, 657L, 665L, 670L, 708L, 709L,
729L, 746L, 795L, 803L, 826L, 855L, 898L, 911L, 957L, 967L, 983L,
984L, 988L, 1006L, 1161L, 1162L, 1224L, 1245L, 1256L, 1257L,
1307L, 1374L, 1379L, 1386L, 1387L, 1394L, 1401L, 1408L, 1434L,
1446L, 1509L, 1556L, 1650L, 1717L, 1760L, 1782L, 1814L, 1847L,
1863L, 1909L, 1930L, 1971L, 2004L, 2022L, 2055L, 2060L, 2065L,
2082L, 2109L, 2121L, 2145L, 2158L, 2159L, 2226L, 2227L, 2281L
), .Names = c("15", "39", "60", "77", "88", "128", "132", "172",
"272", "304", "305", "317", "328", "409", "447", "512", "527",
"605", "618", "657", "665", "670", "708", "709", "729", "746",
"795", "803", "826", "855", "898", "911", "957", "967", "983",
"984", "988", "1006", "1161", "1162", "1224", "1245", "1256",
"1257", "1307", "1374", "1379", "1386", "1387", "1394", "1401",
"1408", "1434", "1446", "1509", "1556", "1650", "1717", "1760",
"1782", "1814", "1847", "1863", "1909", "1930", "1971", "2004",
"2022", "2055", "2060", "2065", "2082", "2109", "2121", "2145",
"2158", "2159", "2226", "2227", "2281"), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
Using the rjson package, I run the line toJSON(tmp) which produces the following JSON file:
{"gender":[1,2,1,2,2,2],
"age":[30,34,34,33,28,26],
"welcoming":[4,4,5,2,4,3],
"proud":[4,2,3,3,3,2],
"tidy":[4,4,4,2,4,4],
"unique":[4,4,5,4,4,3]}
I also experimented with the RJSONIO package; the output of toJSON() was the same. What I would like to produce is the following structure:
{"traits":["gender","age","welcoming","proud", "tidy", "unique"],
"values":[
{"gender":1,"age":30,"welcoming":4,"proud":4,"tidy":4, "unique":4},
{"gender":2,"age":34,"welcoming":4,"proud":2,"tidy":4, "unique":4},
....
]
I'm not sure how best to do this. I realize that I can parse it line by line using Python, but I feel like there is probably a better way. I also realize that my data structure in R does not reflect the meta-information desired in my JSON file (specifically the traits line), but I am mainly interested in producing the data formatted like the line
{"gender":1,"age":30,"welcoming":4,"proud":4,"tidy":4, "unique":4}
as I can manually add the first line.
EDIT: I found a useful blog post where the author dealt with a similar problem and provided a solution. This function produces a formatted JSON file from a data frame.
toJSONarray <- function(dtf){
  clnms <- colnames(dtf)
  name.value <- function(i){
    quote <- ''
    # if(class(dtf[, i]) != 'numeric'){
    if(class(dtf[, i]) != 'numeric' && class(dtf[, i]) != 'integer'){ # I modified this line so integers are also not enclosed in quotes
      quote <- '"'
    }
    paste('"', i, '" : ', quote, dtf[,i], quote, sep='')
  }
  objs <- apply(sapply(clnms, name.value), 1, function(x){paste(x, collapse=', ')})
  objs <- paste('{', objs, '}')
  # res <- paste('[', paste(objs, collapse=', '), ']')
  res <- paste('[', paste(objs, collapse=',\n'), ']') # added newline for formatting output
  return(res)
}
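For contrast with the hand-rolled string building above: in a language with a built-in JSON serializer, the target {"traits": ..., "values": ...} envelope is a one-liner. An illustrative Python sketch (the rows here are hypothetical, mirroring the first two rows of tmp):

```python
import json

# Hypothetical rows mirroring the tmp data frame
rows = [
    {"gender": 1, "age": 30, "welcoming": 4, "proud": 4, "tidy": 4, "unique": 4},
    {"gender": 2, "age": 34, "welcoming": 4, "proud": 2, "tidy": 4, "unique": 4},
]

# Wrap the records in the desired {"traits": [...], "values": [...]} envelope
out = {"traits": list(rows[0].keys()), "values": rows}
print(json.dumps(out, indent=2))
```

The jsonlite answer below achieves the same thing natively in R, which is the better fit here.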
Using the package jsonlite:
> jsonlite::toJSON(list(traits = names(tmp), values = tmp), pretty = TRUE)
{
  "traits": ["gender", "age", "welcoming", "proud", "tidy", "unique"],
  "values": [
    {
      "gender": 1,
      "age": 30,
      "welcoming": 4,
      "proud": 4,
      "tidy": 4,
      "unique": 4
    },
    {
      "gender": 2,
      "age": 34,
      "welcoming": 4,
      "proud": 2,
      "tidy": 4,
      "unique": 4
    },
    {
      "gender": 1,
      "age": 34,
      "welcoming": 5,
      "proud": 3,
      "tidy": 4,
      "unique": 5
    },
    {
      "gender": 2,
      "age": 33,
      "welcoming": 2,
      "proud": 3,
      "tidy": 2,
      "unique": 4
    },
    {
      "gender": 2,
      "age": 28,
      "welcoming": 4,
      "proud": 3,
      "tidy": 4,
      "unique": 4
    },
    {
      "gender": 2,
      "age": 26,
      "welcoming": 3,
      "proud": 2,
      "tidy": 4,
      "unique": 3
    }
  ]
}
Building upon Andrie's idea with apply, you can get exactly what you want by modifying the tmp variable before calling toJSON.
library(RJSONIO)
modified <- list(
  traits = colnames(tmp),
  values = unname(apply(tmp, 1, function(x) as.data.frame(t(x))))
)
cat(toJSON(modified))
Building further on Andrie and Richie's ideas, use alply instead of apply to avoid converting numbers to characters:
library(RJSONIO)
library(plyr)
modified <- list(
  traits = colnames(tmp),
  values = unname(alply(tmp, 1, identity))
)
cat(toJSON(modified))
plyr's alply is similar to apply but returns a list automatically; whereas without the more complicated function inside Richie Cotton's answer, apply would return a vector or array. And those extra steps, including t, mean that if your dataset has any non-numeric columns, the numbers will get converted to strings.
So use of alply avoids that concern.
For example, take your tmp dataset and add
tmp$grade <- c("A","B","C","D","E","F")
Then compare this code (with alply) vs the other example (with apply).
It seems to me you can do this by sending each row of your data.frame to JSON with the appropriate apply statement.
For a single row:
library(RJSONIO)
> x <- toJSON(tmp[1, ])
> cat(x)
{
  "gender": 1,
  "age": 30,
  "welcoming": 4,
  "proud": 4,
  "tidy": 4,
  "unique": 4
}
The entire data.frame:
x <- apply(tmp, 1, toJSON)
cat(x)
{
  "gender": 1,
  "age": 30,
  "welcoming": 4,
  "proud": 4,
  "tidy": 4,
  "unique": 4
} {
...
} {
  "gender": 2,
  "age": 26,
  "welcoming": 3,
  "proud": 2,
  "tidy": 4,
  "unique": 3
}
Another option is to use split to divide your data.frame with N rows into N data.frames of 1 row each.
library(RJSONIO)
modified <- list(
  traits = colnames(tmp),
  values = split(tmp, seq_len(nrow(tmp)))
)
cat(toJSON(modified))