The problem:
I have a JSON file with 20,000 lines. Each line is a web-log entry recording one user's activity. I want to build a data frame in R to work with this data. Here is an example of one JSON line (picked at random):
{"_type":"verifiedProductDetail","ts":1431820984214,"did":"7cd80696-4ede-49e4-a267-b887e684de32","profileId":"33021589-c159-4ec6-8c22-c0e5d9b600d9","preferenceIds":[],"price":115.0,"itemId":"10645","category":"/Binnenverlichting/Wandlampen","currency":1,"language":1,"name":"Wandlamp Linea 60 aluminium","url":"http://www.shop1.be/pagea/wandlampen.html_be","imageUrl":"http://vhetnevnejk.cloudfont.net/media/catalog/product/cache/7/thumbnail/450x/9df78eab33525dcdehl6e5fb8d27136e95/i/m/image_14583/Wandlamp.jpg","id":"871d275a-c856-4280-9cbd-f163b9f749e7","product":{"_id":"625363f4-0d80-3ff5-b091-174de3f9c9b2","domainId":"7cd80696-4ede-49e4-a267-b887e684de32","created":1427806290512,"updated":1436870460905,"itemId":"10645","prices":{"4":299.99,"1":69.99,"2":69.99,"5":299.99},"ratings":{"4":{"rate":1.0,"count":1,"created":1433447796660,"lan":4},"1":{"rate":0.9,"count":2,"created":1434355924529,"lan":1}},"categories":[{"language":3,"text":" Destockage","created":1427820384334},{"language":2,"text":" Outlet","created":1427883890399},{"language":1,"text":"/Binnenverlichting/Wandlampen","created":1431545171151},{"language":6,"text":" Outlet","created":1427876074772},{"language":4,"text":" Outlet","created":1427901573250},{"language":4,"text":" Beleuchtung nach Raum","created":1427827783211},{"language":11,"text":" Outlet","created":1427809161244}],"names":[{"language":3,"text":"Applique murale Linea 60cm en aluminium","created":1427820384334},{"language":2,"text":"Wall Lamp Linea 60 Aluminium","created":1427826729309},{"language":1,"text":"Wandlamp Linea 60 aluminium","created":1435695901730},{"language":6,"text":"Aplique de pared LINEA 60 aluminio ","created":1427819228360},{"language":11,"text":"Kinkiet Linea 60 aluminium","created":1427806290512},{"language":4,"text":"Wandleuchte Linea 60 
Aluminium","created":1436870460905}],"imageUrl":"hhttp://vhetnevnejk.cloudfont.net/media/catalog/product/cache/7/thumbnail/450x/9df78eab335evwnrf5fb8d27136e95/i/m/image_14083/LineaWandlamp.jpg","url":"http://www.lampyiswiatlo.pl/kinkiet-linea.html","overwritePrinciples":{},"sku":"10645","stock":-1},"preferences":[]}
Here is what I did in R:
install.packages("rjson")
library("rjson")
SampleFile <- "filesample.json"
json_data <- fromJSON(paste(readLines(SampleFile), collapse=""))
str(json_data)
summary(json_data)
This reads the file into R and extracts the variables:
> str(json_data)
List of 18
$ _type : chr "verifiedProductDetail"
$ ts : num 1.43e+12
$ did : chr "7cd80696-4ede-49e4-a267-b887e684de32"
$ profileId : chr "8be1a552-9124-453d-a0aa-7124c99b56c6"
$ preferenceIds: list()
$ price : num 26.9
$ itemId : chr "9858"
$ category : chr ""
$ currency : num 1
$ language : num 6
$ name : chr "up Weiss"
$ profile :List of 13
..$ _id : chr "8be1a552-9124-453d-a0aa-7124c99b56c6"
..$ created : num 1.43e+12
..$ updated : num 1.43e+12
[and others]
My issue: as you can see, every variable has length 1, meaning each one holds only a single value (the first entry in the JSON file). All the other values have disappeared. The summary() function shows this more clearly:
> summary(json_data)
Length Class Mode
_type 1 -none- character
ts 1 -none- numeric
did 1 -none- character
profileId 1 -none- character
preferenceIds 0 -none- list
price 1 -none- numeric
itemId 1 -none- character
category 1 -none- character
currency 1 -none- numeric
language 1 -none- numeric
name 1 -none- character
url 1 -none- character
imageUrl 1 -none- character
id 1 -none- character
profile 13 -none- list
product 14 -none- list
group 10 -none- list
preferences 0 -none- list
Summary: What is wrong with my code that makes it keep only the first value of each variable and drop all the others?
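A likely cause is that the file holds one JSON object per line rather than a single JSON document, so pasting all lines together and parsing once returns only the first object. Below is a minimal sketch of parsing line by line, assuming every line is a complete JSON object. The two sample records, the temporary file, and the choice of top-level fields are made up for illustration.

```r
library(rjson)

# Hypothetical stand-in for filesample.json: one JSON object per line.
tmp <- tempfile(fileext = ".json")
writeLines(c(
  '{"_type":"verifiedProductDetail","ts":1431820984214,"price":115.0,"itemId":"10645"}',
  '{"_type":"verifiedProductDetail","ts":1431820999000,"price":26.9,"itemId":"9858"}'
), tmp)

# Parse every line separately instead of pasting them together.
records <- lapply(readLines(tmp), rjson::fromJSON)

# Keep only scalar top-level fields; nested lists (product, profile, ...)
# would need separate handling.
top_level <- c("_type", "ts", "price", "itemId")
logs <- do.call(rbind, lapply(records, function(r) {
  as.data.frame(r[top_level], stringsAsFactors = FALSE)
}))

nrow(logs)
```

If jsonlite is available, jsonlite::stream_in(file(path)) is another option for line-delimited JSON, though I have not compared the two on this data.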
Related
I have RStudio connected to a MySQL database. The table I'm importing has a column of the MySQL JSON type: https://dev.mysql.com/doc/refman/5.7/en/json.html
When I import it into R, that column becomes a blob. Here is the table as imported:
'data.frame': 15 obs. of 5 variables:
$ id :integer64 1 2 3 4 5 6 7 8 ...
$ user_id : chr
$ survey_id:integer64 3 10 10 10 10 3 10 10 ...
$ p_id : chr "22zdae" "0" "0" "0" ...
$ data : blob [1:15] ..$ : raw 7b 22 45 78 ...
When I go to extract information from the blob I use the following code:
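library(purrr)  # is_empty() is assumed to come from purrr (rlang also exports it)
survey_data <- NULL  # accumulator; must exist before the first rbind()
for (row in seq_len(NROW(data))) {
  print(row)
  tryCatch({
    if (is_empty(data$data[[row]])) {
      x <- NA  # empty blob: record a missing value
    } else {
      x <- rawToChar(data$data[[row]])  # raw bytes -> one character string
    }
    survey_data <- rbind(survey_data, x)
  }, error = function(e) {
    cat("ERROR :", conditionMessage(e), "\n")
  })
}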
Each row comes out containing only part of what is in the database. For example:
Status": "Never married", "Liberal_Conserv": "Very Liberal",
"Political_Party": "Republican", "Kids_18yo_Number": ""}
This row has 251 variables in the database, not 4.
How can I accurately transform a blob into workable data?
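One thing worth checking is whether the string is really truncated or only looks that way when printed. Below is a rough sketch that converts each blob, compares byte counts, and then parses the full string. It assumes jsonlite is installed and that each blob holds one complete JSON object. The `blobs` list is a made-up stand-in for data$data.

```r
library(jsonlite)

# Hypothetical stand-in for data$data: a list of raw vectors, one of them empty.
blobs <- list(
  charToRaw('{"Status": "Never married", "Political_Party": "Republican", "Kids_18yo_Number": ""}'),
  raw(0)
)

# Convert each blob to a single string; empty blobs become NA.
json_strings <- vapply(blobs, function(b) {
  if (length(b) == 0) NA_character_ else rawToChar(b)
}, character(1))

# Compare the size of each blob with the size of the converted string.
ok <- !is.na(json_strings)
sizes_match <- all(lengths(blobs)[ok] == nchar(json_strings[ok], type = "bytes"))

# Parse each non-missing string into a named list of fields.
parsed <- lapply(json_strings, function(s) {
  if (is.na(s)) NULL else jsonlite::fromJSON(s)
})

length(parsed[[1]])  # number of fields recovered from the first blob
```

If the sizes match and the parsed list has all the expected fields, the earlier output was probably just cut short when printed.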
I'm sorry that I have no code to reproduce this; I can only provide a picture (see below).
A data frame of Facebook insights data, built from JSON, contains a column "values" whose cells are lists. For further processing I need exactly one value per cell. So row 3 in the picture should be split into two rows (holding either the list content or the value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If a list cell in the data frame holds 3 or more values, it should produce 3 or more single-value rows.
Do you know how to do it?
I use this code to get the json and data frame:
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result, I need just the values and their names.
Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
list(value = list(like = 38, share = 11)),
list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
## NULL
Instead of retaining lists in your output, I would suggest flattening out the data, perhaps using a function like this:
library(data.table)  # needed for is.data.table(), setDT(), tstrsplit(), :=

myFun <- function(indt, col) {
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  other_names <- setdiff(names(indt), col)
  list_col <- indt[[col]]
  # how many times each original row must be repeated
  rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
  flat <- {
    if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
    # split the dotted names produced by unlist() into separate columns
    setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
      , val := unlist(list_col, use.names = FALSE)][]
  }
  cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13
I have json files with data for countries. One of the files has the following data:
"[{\"count\":1,\"subject\":{\"name\":\"Namibia\",\"alpha2\":\"NA\"}}]"
I use the following code to convert the JSON into a data.frame with the jsonlite package:
df = as.data.frame(fromJSON(jsonfile, flatten=TRUE))
I was expecting a data.frame with numbers and strings:
count subject.name subject.alpha2
1 Namibia "NA"
Instead, the NA alpha2 code is being automatically converted into NA logical, and this is what I get:
str(df)
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: logi NA
I want alpha2 to be a string, not logical. How do I fix this?
That particular implementation of fromJSON (three different packages provide a function with that name) has a simplifyVector argument, and setting it to FALSE appears to prevent the coercion:
require(jsonlite)
> as.data.frame( fromJSON(test, simplifyVector=FALSE ) )
count subject.name subject.alpha2
1 1 Namibia NA
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : Factor w/ 1 level "Namibia": 1
$ subject.alpha2: Factor w/ 1 level "NA": 1
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ,stringsAsFactors=FALSE) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: chr "NA"
I tried seeing if that option worked well with the flatten argument, but was disappointed:
> str( fromJSON(test, simplifyVector=FALSE, flatten=TRUE) )
List of 1
$ :List of 2
..$ count : int 1
..$ subject:List of 2
.. ..$ name : chr "Namibia"
.. ..$ alpha2: chr "NA"
The accepted answer did not solve my use case.
However, rjson::fromJSON does this naturally, and to my surprise, 10 times faster on my data.
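For illustration, here is a small sketch of the rjson route on the sample string from the question. The step that turns the parsed list into a data frame is my own assumption about what is wanted, not something from the answers above.

```r
library(rjson)

json_txt <- '[{"count":1,"subject":{"name":"Namibia","alpha2":"NA"}}]'

# rjson returns plain nested lists and leaves the string "NA" untouched.
parsed <- rjson::fromJSON(json_txt)

# as.data.frame() on each nested record yields subject.name-style columns.
rows <- lapply(parsed, function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)

str(df)
```

With stringsAsFactors = FALSE the alpha2 column stays a character column, so is.na(df$subject.alpha2) should be FALSE for Namibia.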
How do I write a json array from R that has a sequence of lat and long?
I would like to write:
[[[1,2],[3,4],[5,6]]]
The best I can do is:
toJSON(matrix(1:6, ncol = 2, byrow = T))
#"[ [ 1, 2 ],\n[ 3, 4 ],\n[ 5, 6 ] ]"
How can I wrap the thing in another array (the json kind)?
This is important to me so I can write files into a geojson format as a LineString.
I usually run fromJSON on the desired output to see what the target object looks like:
ll <- fromJSON('[[[1,2],[3,4],[5,6]]]')
str(ll)
List of 1
$ :List of 3
..$ : num [1:2] 1 2
..$ : num [1:2] 3 4
..$ : num [1:2] 5 6
So we need to create a list containing an unnamed list, each of whose elements is a vector of 2 values:
xx <- list(setNames(split(1:6,rep(1:3,each=2)),NULL))
identical(toJSON(xx),'[[[1,2],[3,4],[5,6]]]')
[1] TRUE
If you have a matrix
m1 <- matrix(1:6, ncol=2, byrow=T)
maybe this helps:
library(rjson)
paste0("[",toJSON(setNames(split(m1, row(m1)),NULL)),"]")
#[1] "[[[1,2],[3,4],[5,6]]]"
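If switching packages is an option, jsonlite can produce the extra level of nesting without pasting brackets by hand. This is a sketch assuming jsonlite's toJSON (not the rjson one used above), which writes a matrix as an array of rows and an unnamed list as a JSON array:

```r
library(jsonlite)

m1 <- matrix(1:6, ncol = 2, byrow = TRUE)

# Wrapping the matrix in an unnamed list adds one outer JSON array.
out <- jsonlite::toJSON(list(m1))

as.character(out)
```

Since both rjson and jsonlite export toJSON, calling it with the jsonlite:: prefix avoids picking up the wrong one.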
I want to convert the following json and put the values into a data frame. It almost works but as.data.frame() puts everything into one row.
require(rjson)
require(RCurl)
y = getURI(url1)
y
[1] "[{\"close\":5.45836392962902,\"highest\":5.45837200714172,\"lowest\":5.45836392962902,\"open\":5.45837200714172,\"start_time\":\"2012-01-29T18:29:24-08:00\"},{\"close\":5.45837200714172,\"highest\":5.45837200714172,\"lowest\":5.45834791002201,\"open\":5.45835598753471,\"start_time\":\"2012-01-29T18:28:24-08:00\"}]"
x = fromJSON(y)
> str(x)
List of 2
$ :List of 5
..$ close : num 5.46
..$ highest : num 5.46
..$ lowest : num 5.46
..$ open : num 5.46
..$ start_time: chr "2012-01-29T18:29:24-08:00"
$ :List of 5
..$ close : num 5.46
..$ highest : num 5.46
..$ lowest : num 5.46
..$ open : num 5.46
..$ start_time: chr "2012-01-29T18:28:24-08:00"
as.data.frame(x)
close highest lowest open start_time close.1 highest.1 lowest.1 open.1 start_time.1
1 5.458364 5.458372 5.458364 5.458372 2012-01-29T18:29:24-08:00 5.458372 5.458372 5.458348 5.458356 2012-01-29T18:28:24-08:00
Instead of everything being on one row, I want the data in two rows:
close highest lowest open start_time
1 5.458364 5.458372 5.458364 5.458372 2012-01-29T18:29:24-08:00
2 5.458372 5.458372 5.458348 5.458356 2012-01-29T18:28:24-08:00
Is there something I can specify in as.data.frame for this to work?
EDIT:
do.call(rbind,lapply(x,as.data.frame))
The above coerces it into a data frame, but the timestamp column becomes a factor with two levels. This next part has its own question here
x = do.call(rbind,lapply(x,as.data.frame))
str(x)
'data.frame': 2 obs. of 5 variables:
$ close : num 5.46 5.46
$ highest : num 5.47 5.46
$ lowest : num 5.46 5.46
$ open : num 5.46 5.46
$ start_time: Factor w/ 2 levels "2012-01-29T21:48:24-05:00",..: 1 2
If I try to convert the column to POSIXct, I get
x$start_time = as.POSIXct(x$start_time)
x$start_time
[1] "2012-01-29 CST" "2012-01-29 CST"
But it loses the time data.
You might try:
do.call(rbind,lapply(x,as.data.frame))
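To also keep the time of day, the timestamp format has to be given explicitly, because as.POSIXct otherwise matches only the date part. Here is a sketch under two assumptions: the strings always end in an offset like -08:00, and UTC is an acceptable display time zone. The `x` below is a hand-built copy of the parsed list from the question.

```r
# Hypothetical parsed input, shaped like the str(x) output in the question.
x <- list(
  list(close = 5.45836392962902, highest = 5.45837200714172,
       lowest = 5.45836392962902, open = 5.45837200714172,
       start_time = "2012-01-29T18:29:24-08:00"),
  list(close = 5.45837200714172, highest = 5.45837200714172,
       lowest = 5.45834791002201, open = 5.45835598753471,
       start_time = "2012-01-29T18:28:24-08:00")
)

# One data frame per record, stacked; keep strings as character, not factor.
df <- do.call(rbind, lapply(x, as.data.frame, stringsAsFactors = FALSE))

# "%z" expects an offset like -0800, so drop the colon from -08:00 first.
stamp <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", df$start_time)
df$start_time <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")

format(df$start_time, "%Y-%m-%d %H:%M:%S")
```

Passing stringsAsFactors = FALSE through lapply avoids the factor column, and the explicit format string is what preserves hours, minutes and seconds.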