R: Selecting certain from a JSON file - json

I've imported a JSON file into R from ( http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json ) and I'm trying to select only counties in Kansas.
Right now I have all the data into one variable and I'm trying to make subdata of this that is just counties of Kansas. I'm not sure how to go about this.

What you have there is geoJson, which can be read directly by library(sf), to give you an sf object, which is also data.frame. Then you can use the usual data.frame subsetting operations
library(sf)
sf <- sf::read_sf("http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json")
sf[sf$NAME == "Kansas", ]
# Simple feature collection with 1 feature and 5 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -102.0517 ymin: 36.99308 xmax: -94.58993 ymax: 40.00316
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# GEO_ID STATE NAME LSAD CENSUSAREA geometry
# 30 0400000US20 20 Kansas 81758.72 MULTIPOLYGON(((-99.541116 3...
And seeing as you want the individual counties, you need to use the counties data set
sf_counties <- sf::read_sf("http://eric.clst.org/wupl/Stuff/gz_2010_us_050_00_500k.json")
sf_counties[sf_counties$STATE == 20, ]

To stay with a JSON workflow, can try jqr
library(jqr)
url <- 'http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json'
download.file(url, (f <- tempfile(fileext = ".json")))
res <- paste0(readLines(f), collapse = " ")
out <- jq(res, '.features[] | select(.properties.NAME == "Kansas")')
can map easily like
library(leaflet)
leaflet() %>%
addTiles() %>%
addGeoJSON(out) %>%
setView(-98, 38, 6)

library(rjson)
lst=fromJSON(file = 'http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json')
index = which(sapply(lapply(lst$features,"[[",'properties'),'[[','NAME')=='Kansas')
subdata = lst$features[[index]]

Related

Scrape object from html with rvest

I am new in web scraping with r and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table in the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with html and css so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in that case indicate "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes( source , '.cmp-gridStat__item-container' ) %>% html_node( '.number' ) %>% html_text() %>% toString()
We can convert the text obtained from Daily situation update using vroom package
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elemnts
The pout of html_elemnts (html_nodes) is,
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node)` is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see html_nodes returns all value associated with the nodes whereashtml_node only returns the first node. Thus, the former fetches you all the nodes which is really helpful.
html_text vs html_text2
The html_text2retains the breaks in strings usually \n and \b. These are helpful when working with strings.
More info is in rvest documentation,
https://cran.r-project.org/web/packages/rvest/rvest.pdf
There is probably a much more elegant way to do this efficiently, but when I need brute force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and lookahead regex to get the exact piece of data I need. It basically takes the form of "?<=text_right_before).+?(?=text_right_after)
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html<-content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract.
If you want as a DataFrame (resembling as per webpage) you can write a user defined function, with the help of pdftools, to reconstruct the table from the pdf. Bit more effort but as you already have other answers covering using rvest thought I'd have a look at this. I looked at tabularize but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication I use e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
data <- pdf_text(filename)
text <- data[[1]] %>% strsplit(split = "\n{2,}")
web_data <- text[[1]][3:12]
df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
unlist() %>%
matrix(nrow = 10, ncol = 5, byrow = T) %>%
as_tibble()
colnames(df) <- text[[1]][2] %>%
strsplit(split = "\\s{2,}") %>%
map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
unlist()
title <- text[[1]][1] %>%
strsplit(split = "\n") %>%
unlist() %>%
tail(1) %>%
gsub("\\s+", " ", .) %>%
gsub(" TOTAL", "", .)
colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
colnames(df)[1] <- "Metric"
clean_col <- function(x) {
gsub("\\s+|,", "", x) %>% as.numeric()
}
clean_col2 <- function(x) {
gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
}
df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
Metric = clean_col2(Metric)
)
return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull items then process you could extract each column as an item in a list. Replace br elements such that the content within those end up in a comma separated list:
library(rvest)
library(magrittr)
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
html_elements("p") %>%
html_text(trim = T) %>%
gsub("\n\\s{2,}", " ", .)
%>%
stri_remove_empty())

R highcharter get data from plots saved as html

I plot data with highcharter package in R, and save them as html to keep interactive features. In most cases I plot more than one graph, therefore bring them together as a canvas.
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
I want to extract the data from plots that I have saved as html in the past, just like these. Normally I would do hc_list[[1]]$x$hc_opts$series, but when I import html into R and try to do the same, I get an error. It won't do the job.
> hc_imported <- htmltools::includeHTML("test_grid.html")
> hc_imported[[1]]$x$hc_opts$series
Error in hc_imported$x : $ operator is invalid for atomic vectors
If I would be able to write a function like
get_my_data(my_imported_highcharter,3) # get data from 3rd plot
it would be the best. Regards.
You can use below code
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
# hc_imported <- htmltools::includeHTML("test_grid.html")
# hc_imported[[1]]$x$hc_opts$series
library(jsonlite)
library(RCurl)
library(XML)
get_my_data<-function(my_imported_highcharter,n){
webpage <- readLines(my_imported_highcharter)
pagetree <- htmlTreeParse(webpage, error=function(...){})
body <- pagetree$children$html$children$body
divbodyContent <- body$children$div$children[[n]]
script<-divbodyContent$children[[2]]
data<-as.character(script$children[[1]])[6]
data<-fromJSON(data,simplifyVector = FALSE)
data<-data$x$hc_opts$series[[1]]$data
return(data)
}
get_my_data("test_grid.html",3)
get_my_data("test_grid.html",1)

Convert JSON into CSV in R programming

I have JSON of the form:
{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}
I need to convert above format JSON to CSV file in R.
CSV format :
ds y
123 45600
378 78689
343 23456
I'm using R library rjson to do so. I'm doing something like this:
jsonFile <- fromJSON(file=fileName)
json_data_frame <- as.data.frame(jsonFile)
but it's not doing the way I need it.
You can use jsonlite::fromJSON to read the data into a list, though you'll need to pull it apart to assemble it into a data.frame:
abc <- jsonlite::fromJSON('{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}')
abc <- data.frame(ds = names(abc[[1]]),
y = unlist(abc[[1]]), stringsAsFactors = FALSE)
abc
#> ds y
#> 123 123 45600
#> 378 378 78689
#> 343 343 23456
I believe you got the json file reader - fromJSON function right.
df <- data.frame( do.call(rbind, rjson::fromJSON( '{"a":true, "b":false, "c":null}' )) )
The code below gets me Google's Location History (json) archive from https://takeout.google.com. This is if you have enabled a 'Timeline' (location tracking) in Google Maps on your cell. Credit to http://rpubs.com/jsmanij/131030 for the original code. Note that json files like this can be quite large and plyr::llply is so much more efficient than lapply in parsing a list. Data.table gives me the more efficient 'rbindlist' to take the list to a data.table. Google logs between 350 to 800 GPS calls each day for me! A multi-year location history is converted to quite a sizeable list by 'fromJSON':
format(object.size(doc1),units="MB")
[1] "962.5 Mb"
I found 'do.call(rbind..)' un-optimized. The timestamp, lat, and long needed some work to be useful to Google Earth Pro, but I am getting carried away. At the end, I use 'write.csv' to take a data.table to CSV. That is all the original OP wanted here.
ts lat long latitude longitude
1: 1416680531900 487716717 -1224893214 48.77167 -122.4893
2: 1416680591911 487716757 -1224892938 48.77168 -122.4893
3: 1416680668812 487716933 -1224893231 48.77169 -122.4893
4: 1416680728947 487716468 -1224893275 48.77165 -122.4893
5: 1416680791884 487716554 -1224893232 48.77166 -122.4893
library(data.table)
library(rjson)
library(plyr)
doc1 <- fromJSON(file="LocationHistory.json", method="C")
object.size(doc1)
timestamp <- function(x) {as.list(x$timestampMs)}
timestamps <- as.list(plyr::llply(doc1$locations,timestamp))
timestamps <- rbindlist(timestamps)
latitude <- function(x) {as.list(x$latitudeE7)}
latitudes <- as.list(plyr::llply(doc1$locations,latitude))
latitudes <- rbindlist(latitudes)
longitude <- function(x) {as.list(x$longitudeE7)}
longitudes <- as.list(plyr::llply(doc1$locations,longitude))
longitudes <- rbindlist(longitudes)
datageoms <- setnames(cbind(timestamps,latitudes,longitudes),c("ts","lat","long")) [order(ts)]
write.csv(datageoms,"datageoms.csv",row.names=FALSE)

Reading complex json data as dataframe in R

I have the following json data:
json_data <- data.frame(changedContent=c('{"documents":[],"images":[],"profileCommunications":[],"shortListedProfiles":[],"matrimonyUser":{"createdBy":null,"parentMatrimonyUserId":0,"userSalutationVal":"Mr.","matrimonyUserCode":"173773","matrimonyUserName":"SUDIPTO DEB BARMAN","emailAddress":"sudipto06#yahoo.com","contactNumber":"9434944429","emailOTP":"","mobilePhoneOTP":"","isEmailOTPVerified":1,"isMobilePhoneOTPverified":1,"isHideContact":null,"isHideEmail":null,"lastLogInTime":null,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133028,"isDeleted":null,"isActive":1,"isAllowedLogin":null,"numberOfChildProfile":null,"matrimonyUserTypeId":100000006,"matrimonyUserTypeVal":"Online Customer","onlineStatusFlag":null,"lastSystemTransactionDateTime":null,"isLive":null,"mobileCountryCode":0,"userStatusIdValue":"Registered and Verified","crmUserStatusIdValue":null,"deactivateReasonIdValue":null,"deactivateReason":null,"matrimonyUserId":165614,"userSalutationId":100001617,"userStatusId":100002760,"crmUserStatusId":null,"deactivateReasonId":null,"createdOn":null},"aboutMes":[],"partnerPreference":{"isSubcastDealbreaker":null,"isOccupationDealbreaker":null,"isIndustryDealbreaker":null,"isIncomeDealbreaker":null,"isHeightDealbreaker":null,"isBodyTypeDealbreaker":null,"isHivDealbreaker":null,"isFamilyTypeDealbreaker":null,"isFamilyIncomeDealbreaker":null,"isDrinkingDealbreaker":null,"locationTypeIds":null,"isLocationTypeDealbreaker":null,"isLocationNameDealbreaker":null,"locationNameOthers":"","isMaritalStatusDealbreaker":null,"isSmokingDealbreaker":null,"isFoodHabitsDealbreaker":null,"isGothraDealbreaker":null,"isManglikDealbreaker":null,"isProfileCreatedbyDealbreaker":null,"religionIdsValues":"","casteIdsValues":null,"motherTongueIdsValues":"","minimumEducationValues":"","occupationIdsValues":"","industryIdsValues":"","bodyTypeIdsValues":"","hivIdValue":null,"familyTypeIdsValues":"","familyIncomeValues":"","drinkingIdValues":"","locationNameIdsValues":null,"maritalStatusIdsValues":"","smokingIdsValues":"","foodHabitsIdsValues":"","gothraIdsValues":"","manglikIdValue":null,"profileCreatedbyValues":"","heightFrom":null,"heightTo":null,"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133115,"isDeleted":null,"isActive":1,"partnerPreferenceId":2757,"isReligionDealbreaker":null,"casteIds":null,"isCasteDealbreaker":null,"isMotherTongueDealbreaker":null,"subcaste":"","religionIds":null,"motherTongueIds":null,"minimumEducation":null,"occupationIds":null,"industryIds":null,"bodyTypeIds":null,"income":null,"incomeValues":"","familyIncome":null,"hivId":0,"familyTypeIds":null,"drinkingId":null,"locationNameIds":"","maritalStatusIds":null,"smokingIds":null,"foodHabitsIds":null,"gothraIds":null,"manglikId":0,"profileCreatedby":null,"adbCount":0,"fifCount":0,"ageFrom":null,"ageTo":null,"isAgeDealbreaker":null,"isminimumEducationDealbreaker":null,"userId":165614,"createdOn":1440167133115,"height":null},"profileAgentDtl":{"campaignId":"","acquirerCode":0,"createdBy":4444,"modifiedBy":4444,"modifiedOn":1440167133110,"isDeleted":null,"isActive":1,"relationshipMangerId":0,"sourceCode":100000004,"userId":165614,"createdOn":1440167133110,"idOdNo":"","relationshipMangerName":null,"relationshipMangerContact":"","profileAgentDtlId":2757,"dateOfEntry":1437935400000,"formSerialNo":"3661","sourceCodeVal":null,"agentCode":null,"acquirerCodeVal":null,"agentName":"","agentMobileNo":"","adBookingNo":""},"profileBasicRegistrationDtl":{"sourceId":null,"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133109,"isDeleted":null,"isActive":1,"genderId":100000596,"priorityId":100001671,"profileCreatedById":100000590,"webSourceId":100001672,"dob":null,"genderVal":"Male","userId":165614,"profileCompleteness":null,"createdOn":1440167133109,"profileDtlId":2757,"nickName":null,"relation":null,"regViewersCount":null,"guestViewersCount":null,"trustScore":20,"webSourceVal":"Newspaper ","priorityVal":"Medium","profileCreatedByval":"Self","fieldContentModerationStatusId":null,"photoModerationStatusId":null,"documentModerationStatusId":null,"isPhotoHide":null,"isHoroscopeHide":null},"profileAstrologyDtl":{"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133111,"isDeleted":null,"isActive":1,"userId":165614,"createdOn":1440167133111,"profileAstrologyDtlId":2757,"gothraId":0,"gaanId":0,"nakshatraId":0,"sunSignId":0,"moonSignId":0,"manglikFlagId":0,"placeOfBirth":"0","timeOfBirth":null,"isPreferredPartnerDtl":null,"gothraVal":"","gaanVal":"","nakshatraVal":"","sunSignVal":"","moonSignVal":"","manglikFlagVal":""},"profileFamilyDtl":{"permanentAddress":null,"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133111,"isDeleted":null,"isActive":1,"familyIncome":0.0,"fathersStatusId":0,"mothersStatusId":0,"fathersOccupationId":0,"mothersOccupationId":0,"mothersIndustryId":null,"fathersIndustryId":null,"familyTypeId":0,"familyValueId":0,"familyKindId":0,"familyStatusId":0,"userId":165614,"createdOn":1440167133111,"moderatedOn":null,"profileFamilyDtlId":2757,"fathersName":"","fathersStatusVal":null,"motherName":"","mothersStatusVal":null,"numberOfSibling":0,"shortRefModerationStatus":null,"fathersOccupationVal":null,"mothersOccupationVal":null,"familyTypeVal":null,"familyValueVal":null,"familyKindVal":null,"familyStatusVal":null,"mothersIndustryVal":null,"fathersIndustryVal":null,"familyIncomeVal":"","moderatedBy":null,"moderatorRemarks":null,"ref1fullName":null,"ref1relationship":null,"ref1emailId":null,"ref1phoneNo":null,"ref1remarks":null,"ref2fullName":null,"ref2relationship":null,"ref2emailId":null,"ref2phoneNo":null,"ref2remarks":null},"profileLifestyleDtl":{"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133110,"isDeleted":null,"isActive":1,"favouriteBooksTypeIds":null,"favouriteHobbiesTypeIds":null,"favouriteMoviesTypeIds":null,"favouriteMusicTypeIds":null,"favouriteSportsTypeIds":null,"livingInHouseTypeId":0,"vehicleTypeOwnedId":0,"petsId":0,"drinkingStatusId":0,"numberOfKids":0,"userId":165614,"createdOn":1440167133110,"moderatedOn":null,"moderatedBy":null,"isModerated":null,"moderatorRemarks":null,"profileLifestyleDtlId":2757,"smokingStatusId":0,"foodHabitsId":0,"financialPlansId":0,"retirementPlansId":0,"vehicleDescription":null,"vehicleNumber":0,"childrenDesiredId":null,"isReligionImportantFlagId":null,"religiousBeliefs":0,"smokingStatusVal":"","drinkingStatusVal":null,"foodHabitsVal":"","financialPlansVal":null,"retirementPlansVal":null,"vehicleTypeOwnedVal":null,"livingInHouseTypeVal":null,"petsVal":null,"childrenDesiredVal":null,"favouriteBooksTypeVals":"","favouriteMoviesTypeVals":"","favouriteMusicTypeVals":"","favouriteSportsTypeVals":"","favouriteHobbiesTypeVals":"","isReligionImportantFlagVal":null,"religiousBeliefsVal":"","favouriteHobbiesRating":null,"favouriteHobbiesDescription":null,"noOfKidsVal":null},"profileOccupationEducationDtl":{"highestSpecializationVal":null,"highestSpecializationOthersVal":"","createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133110,"isDeleted":null,"isActive":1,"highestEducationId":null,"occupationId":null,"designationId":null,"incomeCurrencyId":null,"education2id":0,"education3id":0,"specialization2id":0,"specialization3id":0,"highestSpecializationId":null,"industryId":null,"annualIncome":null,"userId":165614,"createdOn":1440167133110,"moderatedOn":null,"moderatedBy":null,"isModerated":null,"moderatorRemarks":null,"highestEducationVal":null,"occupationVal":null,"industryVal":null,"incomeCurrencyVal":null,"designationVal":null,"education3val":null,"education2val":null,"specialization2val":null,"specialization2othersVal":"","specialization3val":null,"specialization3othersVal":"","additionalQualification":null,"professionalQualification":null,"occupationOthersVal":"","departmentId":null,"employmentSectorId":null,"companyName":"","highestEducationInstituteVal":null,"education2instituteVal":"0","education3instituteVal":"","professionalQualificationVal":null,"departmentVal":null,"employmentSectorVal":null,"annualIncomeVal":null,"profileOccupationEducationDtlId":2757,"schoolName2":"","schoolName1":"","education2instituteId":null,"education3instituteId":null},"profilePersonalDtl":{"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133110,"isDeleted":null,"isActive":1,"familyOriginId":0,"stateId":null,"countryId":null,"numChildrenProspect":0,"countryVal":null,"stateVal":null,"landmark":null,"locationVal":null,"userId":165614,"locationId":null,"religionId":100000598,"createdOn":1440167133110,"isPreferredPartnerDtl":null,"maritalStatusId":null,"maritalStatusVal":null,"subCaste":"","profilePersonalDtlId":2757,"motherTongueId":100000618,"casteId":null,"marryOutsideCasteId":0,"familyOriginVal":null,"facebookHandle":"","linkedInHandle":"","twiterHandle":null,"googlePlus":null,"casteText":"Kshatriya","homeTownText":"0","religionVal":"Hindu","motherTongueVal":"Bengali","marryOutsideCasteVal":"","isSocialMediaVerified":null,"numChildrenProspectVal":null,"locality":null},"profilePhysicalAttributesDtl":{"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133110,"isDeleted":null,"isActive":1,"hivId":0,"bodyTypeId":0,"complexionId":0,"bloodGroupId":0,"userId":165614,"createdOn":1440167133110,"height":null,"isPreferredPartnerDtl":null,"hairColourId":0,"eyeColourId":0,"hairLengthId":0,"physicalStatusId":null,"disabilitiesVal":"","hivVal":"","knownAilmentVal":"","bodyTypeVal":null,"complexionVal":null,"hairColourVal":"","eyeColourVal":"","hairLengthVal":"","physicalStatusVal":null,"bloodGroupVal":null,"profilePhysicalAttributesDtlId":2757,"weight":null},"profileSiblingsDtl":null,"profileImageDtl":null,"notes":[{"createdBy":4444,"modifiedBy":4444,"modifiedOn":1440167133115,"isDeleted":null,"isActive":1,"userId":165614,"createdOn":1440167133115,"profileNotesDtlId":3499,"notesDescription":""}],"references":[],"relationOthers":[],"photoIdentificationDetails":null,"preModAboutMes":[{"answer":"null ","preModerationAboutMeId":1439283144614540579,"moderationStatus":1,"createdBy":4444,"questionVal":null,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133092,"isActive":1,"isAnswerChange":0,"userId":165614,"questionId":1,"createdOn":1440167133092},{"answer":"null ","preModerationAboutMeId":1439283144614540580,"moderationStatus":1,"createdBy":4444,"questionVal":null,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133093,"isActive":1,"isAnswerChange":0,"userId":165614,"questionId":2,"createdOn":1440167133093},{"answer":"null ","preModerationAboutMeId":1439283144614540581,"moderationStatus":1,"createdBy":4444,"questionVal":null,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133094,"isActive":1,"isAnswerChange":0,"userId":165614,"questionId":3,"createdOn":1440167133094},{"answer":"null ","preModerationAboutMeId":1439283144614540582,"moderationStatus":1,"createdBy":4444,"questionVal":null,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133094,"isActive":1,"isAnswerChange":0,"userId":165614,"questionId":4,"createdOn":1440167133094}],"preModContent":[{"preModerationContentId":1439307323336466240,"isChangeMatrimonyUserName":null,"isChangeLocality":0,"isChangeLandmark":0,"permanentAddress":"Dev Barman,Mayapur,PO-Talbagicha,Kharadpur-721306","isChangePermanentAddress":1,"nameOfInstitutionHighestEducation":"0","highestSpecializationVal":null,"highestSpecializationOthersVal":null,"createdBy":4444,"matrimonyUserName":"SUDIPTO DEB BARMAN","userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133104,"isDeleted":null,"isActive":1,"highestEducationId":0,"occupationId":0,"designationId":0,"incomeCurrencyId":null,"highestSpecializationId":0,"industryId":0,"annualIncome":0.0,"stateId":100000269,"countryId":100000101,"dob":520972200000,"countryVal":null,"stateVal":null,"landmark":"","userId":165614,"moderationStatusId":null,"createdOn":1440167133104,"isChangeNameOfInstitutionHighestEducation":0,"isChangeHighestSpecialization":0,"highestEducationVal":null,"isChangeHighestEducation":0,"occupationVal":null,"isChangeOccupation":0,"industryVal":"","isChangeIndustry":0,"incomeCurrencyVal":null,"isChangeIncomeCurrency":0,"customerTypeId":null,"customerTypeVal":null,"isChangeCustomerType":1,"isChangeDob":1,"maritalStatusId":100000900,"maritalStatusVal":"Never Married","isChangeMaritalStatus":1,"isChangeCountry":1,"isChangeState":1,"cityId":0,"isChangeCity":0,"cityVal":null,"isChangeAnnualIncome":0,"designationVal":null,"isChangeDesignation":0,"subCaste":null,"hometown":null,"isChangeSubCaste":null,"isChangeHometown":null,"ref1fullName":null,"isChangeRef1fullName":null,"ref1relationship":null,"isChangeRef1relationship":null,"ref1emailId":null,"isChangeRef1emailId":null,"ref1phoneNo":null,"isChangeRef1phoneNo":null,"ref1remarks":null,"isChangeRef1remarks":null,"ref2fullName":null,"isChangeRef2fullName":null,"ref2relationship":null,"isChangeRef2relationship":null,"ref2emailId":null,"isChangeRef2emailId":null,"ref2phoneNo":null,"isChangeRef2phoneNo":null,"ref2remarks":null,"isChangeRef2remarks":null,"typeOfCustomer":null,"isChangeTypeOfCustomer":null,"highestEducationInstituteId":null,"typeOfCustomerId":100000006,"locality":""}],"preModReferences":[],"preModShortReferences":[{"moderationStatus":null,"createdBy":4444,"userSessionDtlId":null,"modifiedBy":4444,"modifiedOn":1440167133099,"isDeleted":null,"isActive":1,"userId":165614,"createdOn":1440167133099,"isModerated":null,"premoderationprofileImageDtlId":1772,"ref1fullName":"","isChangeRef1fullName":0,"ref1relationship":"","isChangeRef1relationship":0,"ref1emailId":"","isChangeRef1emailId":0,"ref1phoneNo":null,"isChangeRef1phoneNo":0,"ref1remarks":null,"ref2fullName":"","isChangeRef2fullName":0,"ref2relationship":"","isChangeRef2relationship":0,"ref2emailId":"","isChangeRef2emailId":0,"ref2phoneNo":null,"isChangeRef2phoneNo":0,"ref2remarks":null}],"paymentTransactions":[],"userPlanMappings":[],"userFeatureMappings":[],"userPlanMapping":null,"blockedProfiles":[],"notMyTypeProfiles":[]}')
I want to convert the above to a convenient data frame with 1 row each MatrimonyUserId in the above.I have tried a few things but unable to get this in desired format.
Assuming you can wrangle the json data into a nested list....
x <- jsonlite::fromJSON(jsontext)
I've found it's easiest to parse complex list structures by using the pipe operator and frequently checking the structure (limited to 1 or 2 levels.
str1 <- function(x) str(x, 1)
str2 <- function(x) str(x, 2)
# for pipe operator
library("magittr")
x %>% str1
x %>% .[[1]] %>% str2
Etc.

Unable to convert JSON to dataframe

I want to convert a json-file into a dataframe in R. With the following code:
link <- 'https://www.dropbox.com/s/ckfn1fpkcix1ccu/bevingenbag.json'
document <- fromJSON(file = link, method = 'C')
bev <- do.call("cbind", document)
i'm getting this:
type features
1 FeatureCollection list(type = "Feature", geometry = list(type = "Point", coordinates = c(6.54800000288927, 52.9920000044505)), properties = list(gid = "1496600", yymmdd = "19861226", lat = "52.992", lon = "6.548", mag = "2.8", depth = "1.0", knmilocatie = "Assen", baglocatie = "Assen", tijd = "74751"))
which is the first row of a matrix. All the other rows have the same structure. I'm interested in the properties = list(gid = "1496600", yymmdd = "19861226", lat = "52.992", lon = "6.548", mag = "2.8", depth = "1.0", knmilocatie = "Assen", baglocatie = "Assen", tijd = "74751") part, which should be converted into a dataframe with the columns gid, yymmdd, lat, lon, mag, depth, knmilocatie, baglocatie, tijd.
I searched for and tryed several solutions but none of them worked. I used the rjson package for this. I also tryed the RJSONIO & jsonlite package, but was unable to extract the desired information.
Anyone an idea how to solve this problem?
Here's a way to obtain the data frame:
library(rjson)
document <- fromJSON(file = "bevingenbag.json", method = 'C')
dat <- do.call(rbind, lapply(document$features,
function(x) data.frame(x$properties)))
Edit: How to replace empty values with NA:
dat$baglocatie[dat$baglocatie == ""] <- NA
The result:
head(dat)
gid yymmdd lat lon mag depth knmilocatie baglocatie tijd
1 1496600 19861226 52.992 6.548 2.8 1.0 Assen Assen 74751
2 1496601 19871214 52.928 6.552 2.5 1.5 Hooghalen Hooghalen 204951
3 1496602 19891201 52.529 4.971 2.7 1.2 Purmerend Kwadijk 200914
4 1496603 19910215 52.771 6.914 2.2 3.0 Emmen Emmen 21116
5 1496604 19910425 52.952 6.575 2.6 3.0 Geelbroek Ekehaar 102631
6 1496605 19910808 52.965 6.573 2.7 3.0 Eleveld Assen 40114
This is just another, quite similar, approach.
#SvenHohenstein's approach creates a dataframe at each step, an expensive process. It's much faster to create vectors and re-type the whole result at the end. Also, Sven's approach makes each column a factor, which might or might not be what you want. The approach below runs about 200 times faster. This can be important if you intend to do this repeatedly. Finally, you will need to convert columns lon, lat, mag, and depth to numeric.
library(microbenchmark)
library(rjson)
document <- fromJSON(file = "bevingenbag.json", method = 'C')
json2df.1 <- function(json){ # #SvenHohenstein approach
df <- do.call(rbind, lapply(json$features,
function(x) data.frame(x$properties, stringsAsFactors=F)))
return(df)
}
json2df.2 <- function(json){
df <- do.call(rbind,lapply(json[["features"]],function(x){c(x$properties)}))
df <- data.frame(apply(result,2,as.character), stringsAsFactors=F)
return(df)
}
microbenchmark(x<-json2df.1(document), y<-json2df.2(document), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# x <- json2df.1(document) 2304.34378 2654.95927 2822.73224 2977.75666 3227.30996 10
# y <- json2df.2(document) 13.44385 15.27091 16.78201 18.53474 19.70797 10
identical(x,y)
# [1] TRUE