I'm trying to extract JSON data that is stored in a column of a CSV file. So far I have extracted the column in the right format, but the formatting is only correct when the column type is factor, and I can't parse a factor as JSON with the jsonlite package.
[1] {"id":509746197991998767,"visibility":{"percentage":100,"time":149797,"visible1":true,"visible2":false,"visible3":false,"activetab":true},"interaction":{"mouseovercount":1,"mouseovertime":1426,"videoplaytime":0,"engagementtime":0,"expandtime":0,"exposuretime":35192}}
Another approach is to use stringsAsFactors = F when importing, but then I struggle to get the formatting right; each entry looks like this:
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Am I missing something obvious here? I simply want to extract the JSON documents that sit inside a CSV file.
Here's a small example of the CSV file:
"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"
Regards,
Frederik.
df <- readr::read_csv('"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"',
col_types = "-cc")
Use jsonlite::fromJSON on each separate value, then tidyr::unnest:
library(dplyr)
# parse one cell; empty or missing cells become a single all-NA row
f <- function(.x)
  if (is.na(.x) || .x == "") data.frame()[1, ] else
    as.data.frame(jsonlite::fromJSON(.x))
df %>%
  tidyr::unnest(UnloadVars = lapply(UnloadVars, f)) %>%
  mutate_at(vars(ends_with("id")), as.character)
# A tibble: 6 x 16
# CookieID id visibility.percentage visibility.time visibility.visible1 visibility.visible2 visibility.visible3 visibility.activetab interaction.mouseovercount interaction.mouseovertime interaction.videoplaytime interaction.engagementtime interaction.expandtime interaction.exposuretime clicks.x clicks.y
# <chr> <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 -8857188784608690176 509746197991998784 100 149797 TRUE FALSE FALSE TRUE 1 1426 0 0 0 35192 NA NA
# 2 -1695626857458244096 2917654329769114112 46 0 FALSE FALSE FALSE TRUE NA NA NA NA NA NA NA NA
# 3 437299165071669184 2252707957388071936 99 10168 TRUE FALSE FALSE TRUE 0 0 0 0 0 542 105 449
# 4 292660729552227520 <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
# 6 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
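Note that the id values in the output above no longer match the ones in the CSV (for example 509746197991998784 instead of 509746197991998767), because the ids are parsed as doubles. A minimal sketch of one way around this, assuming your installed version of jsonlite supports the bigint_as_char argument of fromJSON; the shortened txt string is only an illustration:

```r
library(jsonlite)

# Shortened version of the first UnloadVars entry from the question
txt <- '{"id":509746197991998767,"visibility":{"percentage":100,"time":149797}}'

# Default parsing stores the id as a double, which cannot represent it exactly
as_double <- fromJSON(txt)

# bigint_as_char = TRUE asks the parser to return oversized integers as strings
# (assumption: available in your jsonlite version)
as_text <- fromJSON(txt, bigint_as_char = TRUE)
as_text$id
```

If this works in your setup, the same argument could be passed inside f so that the ids never go through a double at all.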
I used readr::read_csv to read in your sample data set.
> df <- readr::read_csv('~/sample.csv')
Parsed with column specification:
cols(
CookieID = col_double(),
UnloadVars = col_character()
)
As you can see, UnloadVars is read in as character and not as a factor. If I now examine the first value in the UnloadVars column, I see the following, which matches what you get:
> df$UnloadVars[1]
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Now I use jsonlite::fromJSON:
> j <- jsonlite::fromJSON(df$UnloadVars[1])
> j
$id
[1] 5.097462e+17
$visibility
$visibility$percentage
[1] 100
$visibility$time
[1] 149797
$visibility$visible1
[1] TRUE
$visibility$visible2
[1] FALSE
$visibility$visible3
[1] FALSE
$visibility$activetab
[1] TRUE
$interaction
$interaction$mouseovercount
[1] 1
$interaction$mouseovertime
[1] 1426
$interaction$videoplaytime
[1] 0
$interaction$engagementtime
[1] 0
$interaction$expandtime
[1] 0
$interaction$exposuretime
[1] 35192
I believe this is what you need, since JSON objects are parsed as lists in R.
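If a one-row table is more convenient than a nested list, as.data.frame() can flatten the parsed result. This is only a sketch on a shortened, made-up version of the first entry (txt and flat are illustrative names), and it assumes every nested object holds scalar values:

```r
library(jsonlite)

# Shortened, illustrative version of one UnloadVars entry
txt <- '{"id":1,
         "visibility":{"percentage":100,"visible1":true},
         "interaction":{"mouseovercount":1,"exposuretime":35192}}'
j <- fromJSON(txt)

# as.data.frame() on the nested list gives a single row;
# names of nested objects are joined to their fields with "."
flat <- as.data.frame(j)
names(flat)
```

Entries that contain arrays of objects (such as clicks) produce more than one row, so check the result before binding rows together.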
It can be very tricky to deal with JSON data. As a general guideline, you should always strive to have your data in a data frame. This, however, is not always possible. In this specific case, I don't see a way to have both the visibility and the interaction values at once in a nicely formatted data frame.
What I will do next is extract the information from interaction into a data frame.
Load the required packages and read the data:
library(purrr)
library(dplyr)
library(tidyr)
df <- read.csv("sample.csv", stringsAsFactors = FALSE)
Then remove the invalid JSON:
# remove rows without JSON (in this case, the 4th row)
df <- df %>%
  dplyr::filter(UnloadVars != "")
Transform each JSON string into a list and put the lists into the UnloadVars column. In case you didn't know, it is possible to have a list column in a data frame, which can be very useful.
out <- data_frame(CookieID = numeric(),
                  UnloadVars = list())
# parse one row at a time and append it to out
for (row in 1:nrow(df)) {
  new_row <- data_frame(CookieID = df[row, ]$CookieID,
                        UnloadVars = list(jsonlite::fromJSON(df[row, ]$UnloadVars)))
  out <- bind_rows(out, new_row)
}
out
We can now extract the IDs from the lists in UnloadVars. This is straightforward because there is only one ID per list.
out <- out %>%
  mutate(id = map_chr(UnloadVars, ~ .$id))
This final part can seem a bit intimidating. What I am doing here is taking the interaction part from the UnloadVars column and putting it into an interaction column. I then transform each element of interaction, which is a list, into a data frame with two columns: key and value. key contains the name of the interaction metric and value contains its value. Finally I unnest it, so we get rid of the list columns and end up with a nicely formatted data frame.
unpack_list <- function(obj) {
  # turn the named list into a one-row data frame, then reshape it to key/value pairs
  as.data.frame(obj) %>%
    gather(key)
}
df_interaction <- out %>%
  mutate(interaction = map(UnloadVars, ~ .$interaction)) %>%
  mutate(interaction = map(interaction, unpack_list)) %>%
  unnest(interaction)
df_interaction
The solution is not very elegant, but it gets the job done. You could apply the same logic to extract the information from visibility.
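For the visibility part, a rough sketch of the same idea using only jsonlite and base R might look like the following. The two payloads are shortened, made-up stand-ins for the UnloadVars column, and cookie_id, rows and df_visibility are illustrative names, not part of the original data:

```r
library(jsonlite)

# Made-up, shortened stand-ins for two UnloadVars entries and their CookieIDs
unload_vars <- c(
  '{"id":1,"visibility":{"percentage":100,"time":149797,"activetab":true}}',
  '{"id":2,"visibility":{"percentage":46,"time":0,"activetab":true}}'
)
cookie_id <- c("a", "b")

# One data frame row per entry: the CookieID plus the visibility fields
rows <- lapply(seq_along(unload_vars), function(i) {
  parsed <- fromJSON(unload_vars[i])
  data.frame(CookieID = cookie_id[i],
             as.data.frame(parsed$visibility),
             stringsAsFactors = FALSE)
})

# Works here because every entry has the same visibility fields
df_visibility <- do.call(rbind, rows)
df_visibility
```

Because every entry carries the same visibility fields, this gives a wide table directly, with no key/value reshaping.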
I'm messing around with tidyjson (latest from GitHub, published by Jeremy Stanley). I want to automate searching for and extracting nested objects. The examples below provide the output I want.
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
  enter_object("name") %>%
  gather_keys %>%
  append_values_string
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
  enter_object(name) %>%
  gather_keys %>%
  append_values_string
These both give the same output:
# A tbl_json: 2 x 3 tibble with a "JSON" attribute
`attr(., "JSON")` document.id key string
<chr> <int> <chr> <chr>
1 "bob" 1 first bob
2 "jones" 1 last jones
However, if I declare a character variable beforehand and pass it along, it fails.
object_name <- "name"
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(list(name="name")) %>%
gather_keys %>%
append_values_string
Error: Path components must be single names or character strings
Any ideas why this would happen?
If you are familiar with Hadley's book Advanced R, this is a case of non-standard evaluation, which unfortunately does not presently have a workaround in pure tidyjson (I would prefer an enter_object_ that uses standard evaluation, more like dplyr). I am hopeful that this functionality will be available at some point because, as you suggest, it would be nice to vectorize and automate these sorts of programs.
Non-standard evaluation is the "magic" that allows you to pass in the unquoted name and still get good results in your second example (instead of the program looking for an object called name). The hazard is that it does not resolve objects like object_name in your case.
That said, it seems you can work around this with do.call and a list of parameters (I fixed your example, as I think it went a bit awry):
library(tidyjson)
json <- "{\"name\": {\"first\": \"bob\", \"last\": \"jones\"}, \"age\": 32}"
object_name <- "name"
do.call(enter_object, args = list(json, object_name)) %>%
  gather_object %>%
  append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id name string
#> <chr> <int> <chr> <chr>
#> 1 "\"bob\"" 1 first bob
#> 2 "\"jones\"" 1 last jones
I definitely recommend checking out some of the new features and functionality in the development version of tidyjson with devtools::install_github('jeremystan/tidyjson'), but unfortunately there is no support for standard evaluation in "path"s yet.
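To see the non-standard evaluation issue in isolation, here is a small base R sketch. path_of is a hypothetical helper written for this illustration; it is not tidyjson's actual implementation, it only mimics the behaviour described above:

```r
# Hypothetical helper: captures its argument unevaluated, like an NSE function
path_of <- function(x) {
  expr <- substitute(x)
  if (is.name(expr)) {
    as.character(expr)        # unquoted name: use the symbol itself
  } else if (is.character(expr)) {
    expr                      # literal string: use it as is
  } else {
    stop("Path components must be single names or character strings")
  }
}

object_name <- "name"

path_of(name)          # the unquoted name is used as the path
path_of("name")        # a literal string works too
path_of(object_name)   # the variable is NOT resolved; its own name is used

# do.call() evaluates the argument list first, so the helper sees the value
do.call(path_of, list(object_name))
```

The last call is the reason the do.call workaround helps: by the time the function captures its argument, the variable has already been replaced by the string it holds.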
I have two objects of the same type in JSON:
json <- '[{"client":"ABC Company","totalUSD":1870.0000,"durationDays":365,"familySize":4,"assignmentType":"Long Term","homeLocation":"Chicago, IL","hostLocation":"Lyon, France","serviceName":"Service ABC","homeLocationGeoLat":41.8781136,"homeLocationGeoLng":-87.6297982,"hostLocationGeoLat":45.764043,"hostLocationGeoLng":4.835659},{"client":"ABC Company","totalUSD":21082.0000,"durationDays":365,"familySize":4,"assignmentType":"Long Term","homeLocation":"Chicago, IL","hostLocation":"Lyon, France","serviceName":"Service ABC","homeLocationGeoLat":41.8781136,"homeLocationGeoLng":-87.6297982,"hostLocationGeoLat":45.764043,"hostLocationGeoLng":4.835659}]'
How can I parse both objects into the same data.frame such that I have two rows that share the same columns?
To put that another way, I have a list of JSON objects that I am trying to parse into a data.frame.
I have tried this:
p <- rjson::newJSONParser()
p$addData(json)
df <- p$getObject()
This seems to return a list, whereas I want a data.frame:
> df
[[1]]
[[1]]$client
[1] "ABC Company"
[[1]]$totalUSD
[1] 1870
[[1]]$durationDays
[1] 365
[[1]]$familySize
[1] 4
[[1]]$assignmentType
[1] "Long Term"
[[1]]$homeLocation
[1] "Chicago, IL"
[[1]]$hostLocation
[1] "Lyon, France"
[[1]]$serviceName
[1] "Service ABC"
[[1]]$homeLocationGeoLat
[1] 41.87811
[[1]]$homeLocationGeoLng
[1] -87.6298
[[1]]$hostLocationGeoLat
[1] 45.76404
[[1]]$hostLocationGeoLng
[1] 4.835659
[[2]]
[[2]]$client
[1] "ABC Company"
[[2]]$totalUSD
[1] 21082
[[2]]$durationDays
[1] 365
[[2]]$familySize
[1] 4
[[2]]$assignmentType
[1] "Long Term"
[[2]]$homeLocation
[1] "Chicago, IL"
[[2]]$hostLocation
[1] "Lyon, France"
[[2]]$serviceName
[1] "Service ABC"
[[2]]$homeLocationGeoLat
[1] 41.87811
[[2]]$homeLocationGeoLng
[1] -87.6298
[[2]]$hostLocationGeoLat
[1] 45.76404
[[2]]$hostLocationGeoLng
[1] 4.835659
How can I parse this list of JSON objects?
EDIT: In this case, you want do.call and rbind:
do.call(rbind.data.frame, rjson::fromJSON(json))
or using your method:
p <- rjson::newJSONParser()
p$addData(json)
df <- p$getObject()
do.call(rbind, df)
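As an alternative sketch, assuming you are free to switch from rjson to the jsonlite package: jsonlite::fromJSON simplifies a JSON array of objects with the same keys into a data.frame by default. The json string below is a shortened version of the one in the question:

```r
library(jsonlite)

# Shortened version of the JSON array from the question
json <- '[{"client":"ABC Company","totalUSD":1870.0000,"durationDays":365},
          {"client":"ABC Company","totalUSD":21082.0000,"durationDays":365}]'

# An array of records with matching keys becomes one row per record
df <- fromJSON(json)
str(df)
```

This avoids the manual rbind step, and the columns come back as ordinary atomic vectors.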
I am working with R and the package 'elastic' to query an Elasticsearch database containing Twitter data in JSON format. The query works fine and I get the output content (out) as I expect.
class(out)
[1] "list"
and out$hits$hits returns
> out$hits$hits
[[1]]
[[1]]$`_index`
[1] "twitter_all_geo-2014-11-01"
[[1]]$`_type`
[1] "ctweet"
[[1]]$`_id`
[1] "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7"
[[1]]$`_version`
[1] 1
[[1]]$`_score`
[1] 2.10757
[[1]]$`_source`
[[1]]$`_source`$id
[1] "528330489049120770"
[[1]]$`_source`$created_at
[1] "2014-10-31T23:39:39+0000"
[[1]]$`_source`$user
[[1]]$`_source`$user$name
[1] "afterlifetemis"
[[1]]$`_source`$place
[[1]]$`_source`$place$geo_point
[[1]]$`_source`$place$geo_point[[1]]
[1] 30.4529
[[1]]$`_source`$place$geo_point[[2]]
[1] 50.61104
[[1]]$`_source`$place$city
[1] "Ukraine"
[[1]]$`_source`$place$country
[1] "Ukraine"
[[1]]$`_source`$place$country_code
[1] "UA"
[[1]]$`_source`$msg
[[1]]$`_source`$msg$text
[1] "u had one job artemis\none"
[[1]]$`_source`$msg$lang
[1] "EN"
[[1]]$`_source`$msg$hash_tags
list()
[[2]]
[[2]]$`_index`
[1] "twitter_all_geo-2014-11-01"
[[2]]$`_type`
[1] "ctweet"
...
...
Basically I wanted to save the data as a .csv file, so I entered
> write.csv(out$hits$hits,'out.csv')
Error in data.frame(text = "u had one job artemis\none", lang = "EN", : arguments imply differing number of rows: 1, 0
I assumed that it is necessary to convert it to a data.frame first, so I tried:
> df <- ldply (out, data.frame)
Error in data.frame(text = "u had one job artemis\none", lang = "EN", :
arguments imply differing number of rows: 1, 0
(I tried several other optimistic attempts too, like this one:)
> t(sapply(out$hits$hits, '[', 1:max(sapply(out$hits$hits, length))))
_index _type _id _version _score _source
[1,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7" 1 2.10757 List,5
[2,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-ba071fff-cafb-4d3f-947d-13c934905c1b" 1 2.10757 List,5
[3,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-dd64af32-4d59-4008-a3db-74471ad269d1" 1 2.10757 List,5
[4,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-4ba0d3d0-642d-4f9f-aaf9-c55929c35dc4" 1 2.10757 List,5
[5,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d7b8cbbc-87b3-44b5-8c9c-91c7b62f1458" 1 2.10757 List,5
[6,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-76353a7c-44c9-4863-a59d-adb16716ca18" 1 2.10757 List,5
[7,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-2aec0798-9918-4b66-9b2a-ef5a4d1f3711" 1 2.10757 List,5
[8,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-c9e7637d-358a-40ee-a06c-85af04c22191" 1 2.10757 List,5
[9,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-8928c1ef-f46a-4682-99c4-4dbc55270b03" 1 2.10757 List,5
[10,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d6b19975-b310-46c4-af11-af56971b7c4b" 1 2.10757 List,5
At first it looked good, but the actual tweet message is no longer in the matrix.
I was optimistic and thought I could convert it (back) to JSON first (using rjson):
toJSON(out)
Error in toJSON(out) : unable to escape string. String is not utf8
In the end I have a list that I can not save and can not convert to JSON, a data.frame or a data.table (because it is not uniform). Can anyone give me a hint on how to a) convert it to JSON, b) save the list to a .csv file, or c) put it in a data.frame?
Thanks a lot, I think I don't understand it.
-Tobias
I think unlist() and matrix() can do the job.
Here is an example that converts the value returned by Search(), out, into a data frame:
# get the first 3 hits from elasticsearch store
out <- Search(index="shakespeare", size=3)
# (optional) verify that all hits expand to the same length
# (should be true for data intended to be in a table format)
stopifnot(
  sapply(
    out$hits$hits,
    function(x) {!(length(unlist(x)) - length(unlist(out$hits$hits[[1]])))}
  )
)
# count number of columns, use unlist() to convert
# nested lists to a vector, use the first hit as proxy
nColumns <- length(unlist(out$hits$hits[[1]]))
# fetch column names ... as above
nNames <- names(unlist(out$hits$hits[[1]]))
# unlist all hits and convert to matrix with ncol Columns, don't forget byrow=TRUE!
df <- data.frame(matrix(unlist(out$hits$hits), ncol=nColumns, byrow=TRUE))
# setting the column names
names(df) <- nNames
# do whatever you want with df
print(df)
Cheers!
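When the hits do not all expand to the same length (for example because hash_tags is an empty list in one hit and filled in another), the matrix approach above no longer lines up. A rough base R sketch for that case, using two made-up hits shaped like the output in the question (hits, flat, all_names and rows are illustrative names):

```r
# Two made-up hits; the first has an empty hash_tags list, the second does not
hits <- list(
  list(`_id` = "a",
       `_source` = list(msg = list(text = "hello", lang = "EN", hash_tags = list()))),
  list(`_id` = "b",
       `_source` = list(msg = list(text = "bye", lang = "DE", hash_tags = list("x"))))
)

# Flatten every hit to a named character vector (empty lists simply vanish)
flat <- lapply(hits, unlist)

# Collect every field name that occurs in any hit
all_names <- unique(unlist(lapply(flat, names)))

# Re-index each hit by the full set of names; missing fields become NA
rows <- lapply(flat, function(v) {
  filled <- v[all_names]
  names(filled) <- all_names
  filled
})

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- all_names

# The result is rectangular, so it can be written out as CSV
write.csv(df, tempfile(fileext = ".csv"), row.names = FALSE)
```

All columns come out as character here because unlist() coerces each hit to a single vector type; convert individual columns afterwards if you need numbers.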
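You can use the "jqr" package in R. For example:
datacsv <- jq(out, ".hits.hits[] | @csv")
This formats your data as CSV, and with "jqr" you can also pick out just the fields that you want. Note that jq's @csv filter expects an array as input, so select the fields into an array first, and that jq() works on JSON text rather than on an R list.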