How to parse json with multiple keys in a column with R - json

I am new to R and am facing a problem parsing a JSON column in a dataset.
I have gone through pretty much all the threads about parsing JSON, but I cannot find a proper solution, as I believe my problem is a little different.
Here is my situation:
I am using R to connect to a SQL database via ODBC and get the table I need.
The TCbigdata column is the target JSON column, and the JSON looks like:
{
"memberid": "30325292",
"hotelgroup": {
"g_orders": "-1",
"g_sfristcreatedate": "-1",
"g_lastcreatedate": "-1",
"g_slastcreatedate": "-1",
"g_fristcreatedate": "-1"
},
"visa": {
"v_orders": "-1",
"v_maxcountryid": "-1",
"v_lastsorderdate": "-1",
"v_maxvisaperson": "-1",
"v_lastorderdate": "-1",
"v_lastvisacountryid": "-1",
"v_sorders": "-1"
},
"callcentertel": {
"lastcctzzycalldate": "-1",
"ishavecctcomplaintcall": "-1",
"lastcctchujingcalldate": "-1",
"lastcctyouluncalldate": "-1"
}....(key n, key n+1.. etc)..}
============ PROBLEM DESCRIPTION & DESIRED OUTPUT =================
My desired output would be all the nested variables. If possible, I want to drop the group keys such as memberid, hotelgroup, visa, callcentertel, etc., so that either:
1. the parsed columns are "g_orders", ..., "v_orders", ..., "lastcct...", etc. in one dataset, without group keys such as "hotelgroup", "visa", "callcentertel", etc.; or
2. the JSON is parsed into multiple datasets, e.g. a "hotelgroup" table with columns "g_orders", "g_sfristcreatedate", ..., and a "visa" table with columns "v_orders", "v_maxcountryid", ....
I am not sure whether there is a package for a problem like this. I have searched several demonstrations using jsonlite/RJSONIO/tidyjson, but failed to find a proper way.
Another part I find confusing: my dataset, which comes from the data warehouse via ODBC, returns TCbigdata as type "factor" instead of "character", as I had assumed it would be (matching what it is in the DW).
================ MY CODE...TBC ========================
Here is my code:
# SQL table
orgtc <- sqlQuery(channel1, 'SELECT idMemberInfo, memberid, refbizid, crttime, TCbigdata FROM tcbiz_fq_rcs_data.MemberInfo')
# Convert var_type
orgjf$JFMemberPortrait <- as.character(orgjf$JFMemberPortrait)
# ????? ---- TBD
library(jsonlite)
l <- fromJSON(orgjf$JFMemberPortrait, simplifyDataFrame = FALSE)
I appreciate your help!

Interesting question. There are really two pieces:
getting the JSON out of the DW
parsing the JSON into your desired output
It looks like you have made decent progress getting the JSON out of the DW. I'm not sure what you are using to connect, but I would recommend using the new-ish odbc package, which has a nice DBI interface.
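For example, a minimal sketch of that route (the DSN name below is a placeholder, not something from your setup; DBI/odbc also hand back character columns rather than factors, which sidesteps the factor issue you mention):
library(DBI)
con <- dbConnect(odbc::odbc(), dsn = "my_dw")   # hypothetical DSN
dw_data <- dbGetQuery(
  con,
  "SELECT idMemberInfo, memberid, refbizid, crttime, TCbigdata
     FROM tcbiz_fq_rcs_data.MemberInfo"
)
dbDisconnect(con)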
(Remember that reproducible examples are important for getting help quickly.) Once you have the data out of the DW, you should have something like the data_frame that I manufacture below.
Further, if you want to use tidyjson (my preference), be aware that it is currently off CRAN, and that the dev version at jeremystan/tidyjson has useful functionality (but is broken by the new dplyr). Here, I use the dev version from my repo:
suppressPackageStartupMessages(library(tidyverse))
# devtools::install_github("colearendt/tidyjson")
suppressPackageStartupMessages(library(tidyjson))
raw_json <- '{
"memberid": "30325292",
"hotelgroup": {
"g_orders": "-1",
"g_sfristcreatedate": "-1",
"g_lastcreatedate": "-1",
"g_slastcreatedate": "-1",
"g_fristcreatedate": "-1"
},
"visa": {
"v_orders": "-1",
"v_maxcountryid": "-1",
"v_lastsorderdate": "-1",
"v_maxvisaperson": "-1",
"v_lastorderdate": "-1",
"v_lastvisacountryid": "-1",
"v_sorders": "-1"
},
"callcentertel": {
"lastcctzzycalldate": "-1",
"ishavecctcomplaintcall": "-1",
"lastcctchujingcalldate": "-1",
"lastcctyouluncalldate": "-1"
}
}'
dw_data <- data_frame(
idMemberInfo = c(1:10)
, TCbigdata = as.character(lapply(c(1:10),function(x){return(raw_json)}))
)
dw_data
#> # A tibble: 10 x 2
#> idMemberInfo TCbigdata
#> <int> <chr>
#> 1 1 "{ …
#> 2 2 "{ …
#> 3 3 "{ …
#> 4 4 "{ …
#> 5 5 "{ …
#> 6 6 "{ …
#> 7 7 "{ …
#> 8 8 "{ …
#> 9 9 "{ …
#> 10 10 "{ …
# convert to tbl_json
dw_json <- as.tbl_json(dw_data, json.column = "TCbigdata")
# option 1 - let tidyjson do the work for you
# - you will need to rename
opt_1 <- dw_json %>% spread_all()
names(opt_1)
#> [1] "idMemberInfo"
#> [2] "memberid"
#> [3] "hotelgroup.g_orders"
#> [4] "hotelgroup.g_sfristcreatedate"
#> [5] "hotelgroup.g_lastcreatedate"
#> [6] "hotelgroup.g_slastcreatedate"
#> [7] "hotelgroup.g_fristcreatedate"
#> [8] "visa.v_orders"
#> [9] "visa.v_maxcountryid"
#> [10] "visa.v_lastsorderdate"
#> [11] "visa.v_maxvisaperson"
#> [12] "visa.v_lastorderdate"
#> [13] "visa.v_lastvisacountryid"
#> [14] "visa.v_sorders"
#> [15] "callcentertel.lastcctzzycalldate"
#> [16] "callcentertel.ishavecctcomplaintcall"
#> [17] "callcentertel.lastcctchujingcalldate"
#> [18] "callcentertel.lastcctyouluncalldate"
# for instance... as long as there are no conflicts
# (strips everything up to and including the first "." from the nested names)
rename_function <- function(x) {
  has_dot <- str_detect(x, "\\.")
  x[has_dot] <- str_sub(x[has_dot], str_locate(x[has_dot], "\\.")[, "start"] + 1)
  return(x)
}
opt_1 %>%
rename_all(.funs=list(rename_function)) %>%
names()
#> [1] "idMemberInfo" "memberid"
#> [3] "g_orders" "g_sfristcreatedate"
#> [5] "g_lastcreatedate" "g_slastcreatedate"
#> [7] "g_fristcreatedate" "v_orders"
#> [9] "v_maxcountryid" "v_lastsorderdate"
#> [11] "v_maxvisaperson" "v_lastorderdate"
#> [13] "v_lastvisacountryid" "v_sorders"
#> [15] "lastcctzzycalldate" "ishavecctcomplaintcall"
#> [17] "lastcctchujingcalldate" "lastcctyouluncalldate"
# option 2 - define what you want
# - more typing up front
opt_2 <- dw_json %>% spread_values(
g_orders = jstring(hotelgroup,g_orders)
, g_sfristcreatedate = jstring(hotelgroup, g_sfristcreatedate)
#...
, lastcctzzycalldate = jstring(callcentertel, lastcctzzycalldate)
#...
)
names(opt_2)
#> [1] "idMemberInfo" "g_orders" "g_sfristcreatedate"
#> [4] "lastcctzzycalldate"
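If you prefer your second option (one table per group key), the same tools cover that too; a sketch using enter_object from the same dev tidyjson:
# option 2-ish: one table per nested object
hotelgroup_tbl <- dw_json %>%
  enter_object(hotelgroup) %>%
  spread_all()
visa_tbl <- dw_json %>%
  enter_object(visa) %>%
  spread_all()
names(hotelgroup_tbl)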
Hope this helps! FWIW, I am hopeful that tidyjson-like behavior will persist in the R community.

Related

Scraping a table from OECD

I'm trying to scrape a table from https://data.oecd.org/unemp/unemployment-rate.htm, specifically the table at https://data.oecd.org/chart/66NJ. I want to scrape the months at the top and all the values in the rows 'OECD - Total' and 'The Netherlands'.
After trying lots of different code and searching on this and other forums, I just can't figure out how to scrape this table. I have tried many different HTML selectors found via SelectorGadget or by inspecting elements in my browser, but I keep getting 'list of 0' or 'character (empty)'.
Any help would be appreciated.
library(tidyverse)
library(rvest)
library(XML)
library(magrittr)
#Get element data from one page
url<-"https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07"
#scrape all elements
content <- read_html(url)
# trying to load the table (gives a list of 0)
inladentable <- readHTMLTable(url)
# gather all months (gives character(0))
months <- content %>%
html_nodes(".table-chart-sort-link") %>%
html_table()
# collect all values for the 'OECD - Total' row
wwpercentage <- content %>%
html_nodes(".table-chart-has-status-e") %>%
html_text()
# Combine into a tibble
wwtable <- tibble(months=months,wwpercentage=wwpercentage)
This is JSON and not HTML.
You can query it using httr and jsonlite:
library(httr)
res <- GET("https://stats.oecd.org/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07")
res <- jsonlite::fromJSON(content(res,as='text'))
res
#> $header
#> $header$id
#> [1] "98b762f3-47aa-4e28-978a-a4a6f6b3995a"
#>
#> $header$test
#> [1] FALSE
#>
#> $header$prepared
#> [1] "2020-09-30T21:58:10.5763805Z"
#>
#> $header$sender
#> $header$sender$id
#> [1] "OECD"
#>
#> $header$sender$name
#> [1] "Organisation for Economic Co-operation and Development"
#>
#>
#> $header$links
#> href
#> 1 https://stats.oecd.org:443/sdmx-json/data/DP_LIVE/.HUR.TOT.PC_LF.M/OECD?json-lang=en&dimensionAtObservation=allDimensions&startPeriod=2016-08&endPeriod=2020-07
#> rel
#> 1 request
#>
#>
#> $dataSets
#> action observations.0:0:0:0:0:0 observations.0:0:0:0:0:1
#> 1 Information 5.600849, 0.000000, NA 5.645914, 0.000000, NA
...
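From the structure printed above, res$dataSets$observations looks like a one-row data frame with one list-column per series/period, keyed by dimension indices. A rough, untested sketch for pulling out just the numeric values (matching the "0:0:...:n" keys back to country/time labels via res$structure$dimensions is a further step):
# assumes each observations.* cell holds a vector whose first element is the value
obs <- res$dataSets$observations
unemp <- tibble::tibble(
  key   = names(obs),   # "0:0:0:0:0:0", "0:0:0:0:0:1", ...
  value = vapply(obs, function(x) unlist(x)[1], numeric(1))
)
head(unemp)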

Extracting JSON-data from CSV file

I'm trying to extract JSON data which sits in a column of a CSV file. So far I've got to the point where I've extracted the column in the right format, but the formatting is only correct when the variable type is factor, and I can't parse a factor with the jsonlite package.
[1] {"id":509746197991998767,"visibility":{"percentage":100,"time":149797,"visible1":true,"visible2":false,"visible3":false,"activetab":true},"interaction":{"mouseovercount":1,"mouseovertime":1426,"videoplaytime":0,"engagementtime":0,"expandtime":0,"exposuretime":35192}}
Another approach is to use stringsAsFactors = FALSE when importing, but then I'm struggling to get the formatting right, where each entry looks like this:
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Am I missing something obvious here? I simply want to extract the JSON that sits inside a CSV file.
Here's a small example of the CSV file:
"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"
Regards,
Frederik.
df <- readr::read_csv('"","CookieID","UnloadVars"
"1",-8857188784608690176,"{""id"":509746197991998767,""visibility"":{""percentage"":100,""time"":149797,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":1,""mouseovertime"":1426,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":35192}}"
"2",-1695626857458244096,"{""id"":2917654329769114342,""visibility"":{""percentage"":46,""time"":0,""visible1"":false,""visible2"":false,""visible3"":false,""activetab"":true}}"
"3",437299165071669184,"{""id"":2252707957388071809,""visibility"":{""percentage"":99,""time"":10168,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":542},""clicks"":[{""x"":105,""y"":449}]}"
"4",292660729552227520,""
"5",7036383942916227072,"{""id"":2299674593327687292,""visibility"":{""percentage"":76,""time"":1145,""visible1"":true,""visible2"":false,""visible3"":false,""activetab"":true},""interaction"":{""mouseovercount"":0,""mouseovertime"":0,""videoplaytime"":0,""engagementtime"":0,""expandtime"":0,""exposuretime"":74},""clicks"":[{""x"":197,""y"":135},{""x"":197,""y"":135}]}"',
col_types = "-cc")
Using jsonlite::fromJSON on each separate value, then tidyr::unnest:
library(dplyr)
# parse one JSON string; empty/missing entries yield a zero-column row,
# so the CookieID is kept and the JSON-derived columns come out as NA
f <- function(.x)
  if (is.na(.x) || .x == "") data.frame()[1, ] else
    as.data.frame(jsonlite::fromJSON(.x))
df %>%
  tidyr::unnest(UnloadVars = lapply(UnloadVars, f)) %>%
  mutate_at(vars(ends_with("id")), as.character)
# A tibble: 6 x 16
# CookieID id visibility.percentage visibility.time visibility.visible1 visibility.visible2 visibility.visible3 visibility.activetab interaction.mouseovercount interaction.mouseovertime interaction.videoplaytime interaction.engagementtime interaction.expandtime interaction.exposuretime clicks.x clicks.y
# <chr> <chr> <int> <int> <lgl> <lgl> <lgl> <lgl> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 -8857188784608690176 509746197991998784 100 149797 TRUE FALSE FALSE TRUE 1 1426 0 0 0 35192 NA NA
# 2 -1695626857458244096 2917654329769114112 46 0 FALSE FALSE FALSE TRUE NA NA NA NA NA NA NA NA
# 3 437299165071669184 2252707957388071936 99 10168 TRUE FALSE FALSE TRUE 0 0 0 0 0 542 105 449
# 4 292660729552227520 <NA> NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
# 6 7036383942916227072 2299674593327687168 76 1145 TRUE FALSE FALSE TRUE 0 0 0 0 0 74 197 135
I used readr::read_csv to read in your sample data set.
> df <- readr::read_csv('~/sample.csv')
Parsed with column specification:
cols(
CookieID = col_double(),
UnloadVars = col_character()
)
As you can see, UnloadVars is read in as character and not factor. If I now examine the first value in the UnloadVars column, I see the following, which matches what you get:
> df$UnloadVars[1]
[1] "{\"id\":509746197991998767,\"visibility\":{\"percentage\":100,\"time\":149797,\"visible1\":true,\"visible2\":false,\"visible3\":false,\"activetab\":true},\"interaction\":{\"mouseovercount\":1,\"mouseovertime\":1426,\"videoplaytime\":0,\"engagementtime\":0,\"expandtime\":0,\"exposuretime\":35192}}"
Now, I use jsonlite::fromJSON,
> j <- jsonlite::fromJSON(df$UnloadVars[1])
> j
$id
[1] 5.097462e+17
$visibility
$visibility$percentage
[1] 100
$visibility$time
[1] 149797
$visibility$visible1
[1] TRUE
$visibility$visible2
[1] FALSE
$visibility$visible3
[1] FALSE
$visibility$activetab
[1] TRUE
$interaction
$interaction$mouseovercount
[1] 1
$interaction$mouseovertime
[1] 1426
$interaction$videoplaytime
[1] 0
$interaction$engagementtime
[1] 0
$interaction$expandtime
[1] 0
$interaction$exposuretime
[1] 35192
This, I believe, is what you need, since JSON is parsed into lists in R.
It can be very tricky to deal with JSON data. As a general guideline, you should always strive to have your data in a data frame; this, however, is not always possible. In this specific case, I don't see a way to have both the visibility and interaction values at once in a nicely formatted data frame.
What I will do next is extract the information from interaction into a data frame.
Load required packages and read the data
library(purrr)
library(dplyr)
library(tidyr)
df <- read.csv("sample.csv", stringsAsFactors = FALSE)
Then remove the invalid JSON:
# remove rows without JSON (in this case, the 4th row)
df <- df %>%
dplyr::filter(UnloadVars != "")
Transform each JSON string into a list and put it into the UnloadVars column. In case you didn't know, it is possible to have list columns in a data frame; this can be very useful.
out <- data_frame(CookieID = numeric(),
                  UnloadVars = list())
for (row in 1:nrow(df)) {
  new_row <- data_frame(CookieID = df[row, ]$CookieID,
                        UnloadVars = list(jsonlite::fromJSON(df[row, ]$UnloadVars)))
  out <- bind_rows(out, new_row)
}
out
We can now extract the IDs from the lists in UnloadVars. This is straightforward because there is only one id per list.
out <- out %>%
mutate(id = map_chr(UnloadVars, ~ .$id))
This final part can seem a bit intimidating, but what I am doing here is taking the interaction part of the UnloadVars column and putting it into an interaction column. I then transform each row of interaction, which is a list, into a data frame with two columns: key and value, where key holds the name of the interaction metric and value its value. Finally, I unnest it, so we get rid of the list columns and end up with a nicely formatted data frame.
# turn a named list (e.g. the interaction element) into a key/value data frame
unpack_list <- function(obj) {
  as.data.frame(obj) %>%
    gather(key, value)
}
df_interaction <- out %>%
  mutate(interaction = map(UnloadVars, ~ .$interaction)) %>%
  mutate(interaction = map(interaction, ~ unpack_list(.x))) %>%
  unnest(interaction)
df_interaction
The solution is not very elegant, but it gets the job done. You could apply the same logic to extract the information from visibility, as sketched below.
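For instance, a sketch of the same pattern applied to visibility (reusing unpack_list from above; note that gather will coerce the logical flags to a common type):
df_visibility <- out %>%
  mutate(visibility = map(UnloadVars, ~ .$visibility)) %>%
  mutate(visibility = map(visibility, ~ unpack_list(.x))) %>%
  unnest(visibility)
df_visibility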

Passing a character variable into a function in R (Tidyjson)

I'm messing around with tidyjson (the latest from GitHub, published by Jeremy Stanley). I want to more or less automate searching for and extracting the nested arrays. The following examples produce the output I want.
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object("name") %>%
gather_keys %>%
append_values_string
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(name) %>%
gather_keys %>%
append_values_string
These both give the same output:
# A tbl_json: 2 x 3 tibble with a "JSON" attribute
`attr(., "JSON")` document.id key string
<chr> <int> <chr> <chr>
1 "bob" 1 first bob
2 "jones" 1 last jones
However, if I declare a character variable beforehand and pass it along, it fails:
object_name <- "name"
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(list(name="name")) %>%
gather_keys %>%
append_values_string
Error: Path components must be single names or character strings
Any ideas why this would happen?
If you are familiar with Hadley's book Advanced R, this is a piece of non-standard evaluation that unfortunately does not presently have a workaround in pure tidyjson (I would prefer an enter_object_ that uses standard evaluation, more like dplyr). I am hopeful that functionality will become available at some point, because, as you suggest, it would be nice to vectorize and automate these sorts of programs.
Non-standard evaluation is the "magic" that allows you to pass in the unquoted name and still get good results in your second example (instead of the program looking for an object called name). The hazard is that it does not resolve objects like object_name in your case.
That said, it seems you can work around it with do.call and a list of parameters (I fixed your example, as I think it went a bit awry):
library(tidyjson)
json <- "{\"name\": {\"first\": \"bob\", \"last\": \"jones\"}, \"age\": 32}"
object_name <- "name"
do.call(enter_object, args = list(json, object_name)) %>% gather_object %>%
append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id name string
#> <chr> <int> <chr> <chr>
#> 1 "\"bob\"" 1 first bob
#> 2 "\"jones\"" 1 last jones
I definitely recommend checking out some of the new features / functionality in the development version of tidyjson with devtools::install_github('jeremystan/tidyjson'), but unfortunately no support for standard evaluation in "path"s yet.
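If you want to loop over several object names stored as strings, one hedged sketch is to wrap that do.call trick in a tiny helper (enter_object_chr is a made-up name here, not part of tidyjson):
# sketch: a standard-evaluation wrapper around enter_object via do.call
enter_object_chr <- function(.x, name) {
  do.call(enter_object, args = list(.x, name))
}
json %>%
  enter_object_chr(object_name) %>%
  gather_object %>%
  append_values_string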

Parse Multiple JSON Objects of Same Type in R

I have two objects of the same type in JSON:
json <- '[{"client":"ABC Company","totalUSD":1870.0000,"durationDays":365,"familySize":4,"assignmentType":"Long Term","homeLocation":"Chicago, IL","hostLocation":"Lyon, France","serviceName":"Service ABC","homeLocationGeoLat":41.8781136,"homeLocationGeoLng":-87.6297982,"hostLocationGeoLat":45.764043,"hostLocationGeoLng":4.835659},{"client":"ABC Company","totalUSD":21082.0000,"durationDays":365,"familySize":4,"assignmentType":"Long Term","homeLocation":"Chicago, IL","hostLocation":"Lyon, France","serviceName":"Service ABC","homeLocationGeoLat":41.8781136,"homeLocationGeoLng":-87.6297982,"hostLocationGeoLat":45.764043,"hostLocationGeoLng":4.835659}]'
How can I parse both objects into the same data.frame such that I have two rows that share the same columns?
To put that another way, I have a list of JSON objects that I am trying to parse into a data.frame.
I have tried this:
p <- rjson::newJSONParser()
p$addData(json)
df <- p$getObject()
This seems to return a list, whereas I want a data.frame:
> df
[[1]]
[[1]]$client
[1] "ABC Company"
[[1]]$totalUSD
[1] 1870
[[1]]$durationDays
[1] 365
[[1]]$familySize
[1] 4
[[1]]$assignmentType
[1] "Long Term"
[[1]]$homeLocation
[1] "Chicago, IL"
[[1]]$hostLocation
[1] "Lyon, France"
[[1]]$serviceName
[1] "Service ABC"
[[1]]$homeLocationGeoLat
[1] 41.87811
[[1]]$homeLocationGeoLng
[1] -87.6298
[[1]]$hostLocationGeoLat
[1] 45.76404
[[1]]$hostLocationGeoLng
[1] 4.835659
[[2]]
[[2]]$client
[1] "ABC Company"
[[2]]$totalUSD
[1] 21082
[[2]]$durationDays
[1] 365
[[2]]$familySize
[1] 4
[[2]]$assignmentType
[1] "Long Term"
[[2]]$homeLocation
[1] "Chicago, IL"
[[2]]$hostLocation
[1] "Lyon, France"
[[2]]$serviceName
[1] "Service ABC"
[[2]]$homeLocationGeoLat
[1] 41.87811
[[2]]$homeLocationGeoLng
[1] -87.6298
[[2]]$hostLocationGeoLat
[1] 45.76404
[[2]]$hostLocationGeoLng
[1] 4.835659
How can I parse this list of JSON objects?
EDIT: In this case, you want do.call and rbind:
do.call(rbind.data.frame, rjson::fromJSON(json))
or using your method:
p <- rjson::newJSONParser()
p$addData(json)
df <- p$getObject()
do.call(rbind, df)
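As an aside (not part of the original answer): jsonlite simplifies a JSON array of objects straight into a data.frame, which may be the shortest route here:
library(jsonlite)
# simplifyDataFrame = TRUE is the default, so this returns a two-row data.frame
df <- jsonlite::fromJSON(json)
str(df)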

Convert in R output of package Elastic (nested list?) to data.frame or JSON

I am working with R and the 'elastic' package to query an Elasticsearch DB containing Twitter data in JSON format. The query works fine and I get the output content (out) as I expect.
class(out)
[1] "list"
and out$hits$hits returns
> out$hits$hits
[[1]]
[[1]]$`_index`
[1] "twitter_all_geo-2014-11-01"
[[1]]$`_type`
[1] "ctweet"
[[1]]$`_id`
[1] "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7"
[[1]]$`_version`
[1] 1
[[1]]$`_score`
[1] 2.10757
[[1]]$`_source`
[[1]]$`_source`$id
[1] "528330489049120770"
[[1]]$`_source`$created_at
[1] "2014-10-31T23:39:39+0000"
[[1]]$`_source`$user
[[1]]$`_source`$user$name
[1] "afterlifetemis"
[[1]]$`_source`$place
[[1]]$`_source`$place$geo_point
[[1]]$`_source`$place$geo_point[[1]]
[1] 30.4529
[[1]]$`_source`$place$geo_point[[2]]
[1] 50.61104
[[1]]$`_source`$place$city
[1] "Ukraine"
[[1]]$`_source`$place$country
[1] "Ukraine"
[[1]]$`_source`$place$country_code
[1] "UA"
[[1]]$`_source`$msg
[[1]]$`_source`$msg$text
[1] "u had one job artemis\none"
[[1]]$`_source`$msg$lang
[1] "EN"
[[1]]$`_source`$msg$hash_tags
list()
[[2]]
[[2]]$`_index`
[1] "twitter_all_geo-2014-11-01"
[[2]]$`_type`
[1] "ctweet"
...
...
Basically I wanted to save the data as a .csv file, so I entered:
> write.csv(out$hits$hits,'out.csv')
Error in data.frame(text = "u had one job artemis\none", lang = "EN", : arguments imply differing number of rows: 1, 0
I assumed that it was necessary to convert it to a data.frame, so I tried:
> df <- ldply (out, data.frame)
Error in data.frame(text = "u had one job artemis\none", lang = "EN", :
arguments imply differing number of rows: 1, 0
(I tried several other, optimistic, attempts too, like this one:)
> t(sapply(out$hits$hits, '[', 1:max(sapply(out$hits$hits, length))))
_index _type _id _version _score _source
[1,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7" 1 2.10757 List,5
[2,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-ba071fff-cafb-4d3f-947d-13c934905c1b" 1 2.10757 List,5
[3,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-dd64af32-4d59-4008-a3db-74471ad269d1" 1 2.10757 List,5
[4,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-4ba0d3d0-642d-4f9f-aaf9-c55929c35dc4" 1 2.10757 List,5
[5,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d7b8cbbc-87b3-44b5-8c9c-91c7b62f1458" 1 2.10757 List,5
[6,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-76353a7c-44c9-4863-a59d-adb16716ca18" 1 2.10757 List,5
[7,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-2aec0798-9918-4b66-9b2a-ef5a4d1f3711" 1 2.10757 List,5
[8,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-c9e7637d-358a-40ee-a06c-85af04c22191" 1 2.10757 List,5
[9,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-8928c1ef-f46a-4682-99c4-4dbc55270b03" 1 2.10757 List,5
[10,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d6b19975-b310-46c4-af11-af56971b7c4b" 1 2.10757 List,5
In the beginning it looked good, but the actual tweet message is no longer in the matrix.
I was optimistic and thought maybe to convert it first (back) to JSON (using rjson):
toJSON(out)
Error in toJSON(out) : unable to escape string. String is not utf8
In the end I have a list that I cannot save, and cannot convert to JSON, a data.frame, or a data.table (because it is not uniform). Can anyone give me a hint on a) converting it to JSON, or b) how to save the list to a .csv file or put it into a data.frame?
Thanks a lot; I think I just don't understand it.
-Tobias
I think unlist() and matrix() can do the job.
Here is an example converting the Search() return out into a data frame:
# get the first 3 hits from elasticsearch store
out <- Search(index="shakespeare", size=3)
# (optional) verify that all hits expand to the same length
# (should be true for data intended to be in a table format)
stopifnot(
sapply(
out$hits$hits,
function(x) {!(length(unlist(x)) - length(unlist(out$hits$hits[[1]])))}
)
)
# count number of columns, use unlist() to convert
# nested lists to a vector, use the first hit as proxy
nColumns <- length(unlist(out$hits$hits[[1]]))
# fetch column names ... as above
nNames <- names(unlist(out$hits$hits[[1]]))
# unlist all hits and convert to matrix with ncol Columns, don't forget byrow=TRUE!
df <- data.frame(matrix(unlist(out$hits$hits), ncol=nColumns, byrow=TRUE))
# setting the column names
names(df) <- nNames
# do whatever you want with df
print(df)
Cheers!
you can use "jqr" package in R. For eg:-
datacsv<-jq(out,".hits.hits[] | #csv")
It will save your data into csv format and with the help of "jqr" you can also grep the fields that you want.