Json list not 'flattening' properly - json

I have the below list in a column of a data frame.
As you can see, the variables change through the items. The column affilications is not always present.
I have been trying to flatten the list to a data frame or to a list of 3, but I am geeting a single columg with all elements of every column.
Is there a way I can tell R that each element has 3 columns and that the first one is not always present and to fill it with let's say null.
[[1]]
NULL
[[2]]
affiliations author_id author_name
1 Punjabi University 780E3459 munish puri
2 Punjabi University 48D92C79 rajesh dhaliwal
3 Punjabi University 7D9BD37C r s singh
[[3]]
author_id author_name
1 7FF872BC barbara eileen ryan
[[4]]
author_id author_name
1 0299B8E9 fraser j harbutt
[[5]]
author_id author_name
1 7DAB7B72 richard m freeland
[[6]]
NULL
This is what I'm getting when I try and flatten it.
authors
1 Punjabi University
2 Punjabi University
3 Punjabi University
4 780E3459
5 48D92C79
6 7D9BD37C
7 munish puri
8 rajesh dhaliwal
9 r s singh
10 7FF872BC
But what I really need would be:
[[1]] NULL
[[2]]affiliations author_id author_name
1 Punjabi University 780E3459 munish puri
2 Punjabi University 48D92C79 rajesh dhaliwal
3 Punjabi University 7D9BD37C r s singh
[[3]] NULL author_id author_name
1 NULL 7FF872BC barbara eileen ryan

I i understand you correctly you have data as follows:
require(tidyverse)
list(
NULL,
tibble(a=c(2, 2), b=c(2, 2), c=c(2, 2)),
tibble(b=3, c=3)
)
So:
[[1]]
NULL
[[2]]
# A tibble: 2 x 3
a b c
<dbl> <dbl> <dbl>
1 2 2 2
2 2 2 2
[[3]]
# A tibble: 1 x 2
b c
<dbl> <dbl>
1 3 3
Using bind_rows results in:
bind_rows(list(
NULL,
tibble(a=c(2, 2), b=c(2, 2), c=c(2, 2)),
tibble(b=3, c=3)
))
# A tibble: 3 x 3
a b c
<dbl> <dbl> <dbl>
1 2 2 2
2 2 2 2
3 NA 3 3

Related

Scraping Website with Unchanging URL in R

I would like to scrape a series of tables from a website whose URL does not change when I click through the tables in my browser. Each table corresponds to a unique date. The default table is that which corresponds to today's date. I can scroll through past dates in my browser, but can't seem to find a way to do so in R.
Using library(rvest) this bit of code will reliably download the table that corresponds to today's date (I'm only interested in the first of the three tables).
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_table()
off <- off[[1]]
How can I download the table that corresponds to, say "2022-10-04", to "2022-10-06", or to yesterday?
I've tried to work through it by identifying the node under which the table lies, in the hopes that I could manipulate it to reflect a prior date. However, the following reproduces the same table as above:
webad <- "https://official.nba.com/referee-assignments/"
off <- webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > div > div.dayContent > div > table") %>%
html_table()
off <- off[[1]]
Scrolling through past dates in my browser, I've identified various places in the html that reference the prior date; but I can't seem to change it from R, yet alone get the table I download to reflect a change:
webad %>%
read_html() %>%
html_nodes("#main > div > section:nth-child(1) > article > header > div")
I've messed around some with html_form(), follow_link(), and set_values() also, but to no avail.
Is there a good way to navigate this particular URL in R?
You can consider the following approach :
library(RSelenium)
library(rvest)
port <- as.integer(4444L + rpois(lambda = 1000, 1))
rd <- rsDriver(chromever = "105.0.5195.52", browser = "chrome", port = port)
remDr <- rd$client
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
Here is another approach that can be considered :
library(RDCOMClient)
library(rvest)
url <- "https://official.nba.com/referee-assignments/"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Date <- doc$querySelector("#ref-filters-menu > li > div > button")
web_Obj_Date$dispatchEvent(clickEvent)
web_Obj_Date_Input <- doc$GetElementById('ref-date')
web_Obj_Date_Input[["Value"]] <- "2022-10-05"
web_Obj_Go_Button <- doc$querySelector("#date-filter")
web_Obj_Go_Button$dispatchEvent(clickEvent)
html_Content <- doc$Body()$innerHTML()
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 8 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Game" "Official 1" "Official 2" "Official 3" "Alternate" NA NA
2 "S" "M" "T" "W" "T" "F" "S"
3 "" "" "" "" "" "" "1"
4 "2" "3" "4" "5" "6" "7" "8"
5 "9" "10" "11" "12" "13" "14" "15"
6 "16" "17" "18" "19" "20" "21" "22"
7 "23" "24" "25" "26" "27" "28" "29"
8 "30" "31" "" "" "" "" ""
[[3]]
# A tibble: 7 x 7
Game `Official 1` `Official 2` `Official 3` Alternate `` ``
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "S" "M" "T" "W" "T" "F" "S"
2 "" "" "" "" "" "" "1"
3 "2" "3" "4" "5" "6" "7" "8"
4 "9" "10" "11" "12" "13" "14" "15"
5 "16" "17" "18" "19" "20" "21" "22"
6 "23" "24" "25" "26" "27" "28" "29"
7 "30" "31" "" "" "" "" ""
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA
If you install the Docker software (see https://docs.docker.com/engine/install/), you can consider the following approach with firefox :
library(RSelenium)
library(rvest)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
url <- "https://official.nba.com/referee-assignments/"
remDr$navigate(url)
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Date_Input <- remDr$findElement("id", 'ref-date')
web_Obj_Date_Input$clearElement()
web_Obj_Date_Input$sendKeysToElement(list("2022-10-05"))
web_Obj_Date_Input$doubleclick()
web_Obj_Date <- remDr$findElement("css selector", "#ref-filters-menu > li > div > button")
web_Obj_Date$clickElement()
web_Obj_Go_Button <- remDr$findElement("css selector", "#date-filter")
web_Obj_Go_Button$submitElement()
html_Content <- remDr$getPageSource()[[1]]
read_html(html_Content) %>% html_table()
[[1]]
# A tibble: 5 x 5
Game `Official 1` `Official 2` `Official 3` Alternate
<chr> <chr> <chr> <chr> <lgl>
1 Indiana # Charlotte John Goble (#10) Lauren Holtkamp (#7) Phenizee Ransom (#70) NA
2 Cleveland # Philadelphia Marc Davis (#8) Jacyn Goble (#68) Tyler Mirkovich (#97) NA
3 Toronto # Boston Josh Tiven (#58) Matt Boland (#18) Intae hwang (#96) NA
4 Dallas # Oklahoma City Courtney Kirkland (#61) Mitchell Ervin (#27) Cheryl Flores (#91) NA
5 Phoenix # L.A. Lakers Bill Kennedy (#55) Rodney Mott (#71) Jenna Reneau (#93) NA
[[2]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[3]]
# A tibble: 0 x 5
# ... with 5 variables: Game <lgl>, Official 1 <lgl>, Official 2 <lgl>, Official 3 <lgl>, Alternate <lgl>
# i Use `colnames()` to see all variable names
[[4]]
# A tibble: 6 x 7
S M T W T F S
<int> <int> <int> <int> <int> <int> <int>
1 NA NA NA NA NA NA 1
2 2 3 4 5 6 7 8
3 9 10 11 12 13 14 15
4 16 17 18 19 20 21 22
5 23 24 25 26 27 28 29
6 30 31 NA NA NA NA NA

Joining/Merge two keys to one key and removing duplicates

I want to merge two tables using SQL (Toad for Oracle) merging one key from one table to two keys of the main table while deleting the duplicated entries. I guess an illustration provides you better understanding of my problem:
Dataframe A:
Key1 Key2 ColA
1 2 1993
1 2 1992
1 4 1991
2 4 1990
2 5 1989
2 5 1988
3 5 1987
3 6 1986
3 6 1985
Dataframe B:
Key1&2 ColB
1 Adress1
1 Adress1
1 Adress1
2 Adress2
2 Adress2
3 Adress3
3 Adress3
3 Adress3
4 Adress4
4 Adress4
5 Adress5
6 Adress6
6 Adress6
Desired Dataframe:
Key1 Key2 ColA ColB-1 ColB-2
1 2 1993 Adress1 Adress2
1 2 1992 Adress1 Adress2
1 4 1991 Adress1 Adress4
2 4 1990 Adress2 Adress4
2 5 1989 Adress2 Adress5
2 5 1988 Adress2 Adress5
3 5 1987 Adress3 Adress5
3 6 1986 Adress3 Adress6
3 6 1985 Adress3 Adress6
I tried using following statement so far:
SELECT *
FROM A
LEFT JOIN B ON A.key1=B.key1&2
LEFT JOIN B ON A.key2=B.key1&2
However, as I explained, because dataframe B has duplicated rows in its keys, duplicated rows in my output as well.
Hope its clear.
Thanks,
KS
Is this what you want?
SELECT A.*, B1.ColB, B2.ColB
FROM A LEFT JOIN
(SELECT DISTINCT key1&2, ColB
FROM B
) B1 ON A.key1 = B1.key1&2 LEFT JOIN
(SELECT DISTINCT key1&2, ColB
FROM B
) B2 ON A.key2 = B2.key1&2;

How to extract strings from rows which have .json like format?

I have imported a .json file using library(jsonlite) stream_in(file(".json"))
However, one of the columns still looks as a .json format.
Im not really sure how proceed in order to extact the columns ID and email from the .json column.
My example:
date <- as.Date(as.character( c("2015-02-13",
"2015-02-14",
"2015-02-14")))
ID <- c(1,2,3)
name <- c("John","Michael","Thomas")
drinks <- c("Beer","Coffee","Tee")
consumed <- c(2,5,3)
john<- "{\"employeID\":\"1\",\"other_details\":{\"email\":\"john#gmx.com\"},\"computer\":\"yes\"}"
michael<- "{\"employeID\":\"2\",\"other_details\":{\"email\":\"michael#yahoo.com\"},\"computer\":\"yes\"}"
thomas<- "{\"employeID\":\"3\",\"other_details\":{\"email\":\"thomas#gmail.com\"},\"computer\":\"yes\"}"
json <- c(john,michael,thomas)
df <- data.frame(date,ID,name,drinks,consumed,json)
Where the data.frame looks like that:
I would like to get the following format:
date ID name drinks consumed email computer
#1 2015-02-13 1 John Beer 2 john#gmx.com yes
#2 2015-02-14 2 Michael Coffee 5 michael#yahoo.com no
#3 2015-02-14 3 Thomas Tee 3 thomas#gmail.com yes
What I have tried was to was first to use the library(jsonlite) again in different variations but it always results in:
fromJSON(df$json[1])
Error: Argument 'txt' must be a JSON string, URL or file.
How can I extract these fields properly?
df$json is a factor vector while fromJSON only accepts a JSON string, URL or file. You can try
fromJSON(as.character(df$json[1]))
or add stringsAsFactor=FALSE when you create df.
You do your task, you can try:
library(tidyverse)
df %>%
filter(json != "{}") %>% # Drop rows with json == "{}"
rowwise() %>%
do(data.frame(ID = .$ID, jsonlite::fromJSON(.$json), stringsAsFactors=FALSE)) %>%
merge(df %>% select(-json), by="ID", all.y=TRUE)
Output:
ID employeID email computer date name drinks consumed
1 1 1 john#gmx.com yes 2015-02-13 John Beer 2
2 2 2 michael#yahoo.com yes 2015-02-14 Michael Coffee 5
3 3 3 thomas#gmail.com yes 2015-02-14 Thomas Tee 3
It can handle cases with "{}" in json column.
df2 <- df %>%
rbind(data.frame(date="2015-02-14", ID=4, name="Kitman",
drinks="Chocolate", consumed=1, json="{}"))
df2 %>%
filter(json != "{}") %>%
rowwise() %>%
do(data.frame(ID = .$ID, jsonlite::fromJSON(.$json), stringsAsFactors=FALSE)) %>%
merge(df2 %>% select(-json), by="ID", all.y=TRUE)
Output:
ID employeID email computer date name drinks consumed
1 1 1 john#gmx.com yes 2015-02-13 John Beer 2
2 2 2 michael#yahoo.com yes 2015-02-14 Michael Coffee 5
3 3 3 thomas#gmail.com yes 2015-02-14 Thomas Tee 3
4 4 <NA> <NA> <NA> 2015-02-14 Kitman Chocolate 1
Outdated:
cbind(
df %>% select(-json),
df$json %>%
map(~as.data.frame(jsonlite::fromJSON(.))) %>%
do.call("rbind", .)
)
Output:
date ID name drinks consumed employeID email computer
1 2015-02-13 1 John Beer 2 1 john#gmx.com yes
2 2015-02-14 2 Michael Coffee 5 2 michael#yahoo.com yes
3 2015-02-14 3 Thomas Tee 3 3 thomas#gmail.com yes
First, try:
ndjson::stream_in("filename.json")
The ndjson package is faster than jsonlite and was built for flattening (it's very task-specific and not as swiss-army-knife-ish as the highly useful jsonlite pkg).
Or, we can keep the tidyverse idioms all the way through:
library(tidyverse)
map_df(df$json, ~jsonlite::fromJSON(as.character(.))) %>%
bind_cols(select(df, -json)) %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.list, as.character) %>%
select(ID, name, drinks, consumed, everything())
## # A tibble: 3 × 8
## ID name drinks consumed computer employeID other_details.email date
## <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <date>
## 1 1 John Beer 2 yes 1 john#gmx.com 2015-02-13
## 2 2 Michael Coffee 5 yes 2 michael#yahoo.com 2015-02-14
## 3 3 Thomas Tee 3 yes 3 thomas#gmail.com 2015-02-14
And, you get your character columns.

How can I replace empty cells with NA in R?

I'm new to R, and have been trying a bunch of examples but I couldn't get anything to change all of my empty cells into NA.
library(XML)
theurl <- "http://www.pro-football-reference.com/teams/sfo/1989.htm"
table <- readHTMLTable(theurl)
table
Thank you.
The result you get from readHTMLTable is giving you a list of two tables, so you need to work on each list element, which can be done using lapply
table <- lapply(table, function(x){
x[x == ""] <- NA
return(x)
})
table$team_stats
Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds TD Int NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14 4.0 124 109 922 17
2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9 3.7 76 75 581 29
3 Lg Rank Offense 1 1 <NA> <NA> 2 10 1 <NA> 20 2 1 1 1 <NA> 13 10 12 13 <NA> <NA> <NA> <NA>
4 Lg Rank Defense 3 4 <NA> <NA> 11 9 9 <NA> 25 11 3 9 5 <NA> 1 3 3 8 <NA> <NA> <NA> <NA>
You have a list of data.frames of factors, though the actual data is mostly numeric. Converting to the appropriate type with type.convert will automatically insert the appropriate NAs for you:
df_list <- lapply(table, function(x){
x[] <- lapply(x, function(y){type.convert(as.character(y), as.is = TRUE)});
x
})
df_list[[1]][, 1:18]
## Player PF Yds Ply Y/P TO FL 1stD Cmp Att Yds.1 TD Int NY/A 1stD.1 Att.1 Yds.2 TD.1
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1 209 493 1966 14
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3 178 372 1383 9
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0 NA 13 10 12
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0 NA 1 3 3
Or more concisely but with a lot of packages,
library(tidyverse) # for purrr functions and readr::type_convert
library(janitor) # for clean_names
df_list <- map(table, ~.x %>% clean_names() %>% dmap(as.character) %>% type_convert())
df_list[[1]]
## # A tibble: 4 × 23
## player pf yds ply y_p to fl x1std cmp att yds_2 td int ny_a
## <chr> <int> <int> <int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 Team Stats 442 6268 1021 6.1 25 14 350 339 483 4302 35 11 8.1
## 2 Opp. Stats 253 4618 979 4.7 37 16 283 316 564 3235 15 21 5.3
## 3 Lg Rank Offense 1 1 NA NA 2 10 1 NA 20 2 1 1 1.0
## 4 Lg Rank Defense 3 4 NA NA 11 9 9 NA 25 11 3 9 5.0
## # ... with 9 more variables: x1std_2 <int>, att_2 <int>, yds_3 <int>, td_2 <int>, y_a <dbl>,
## # x1std_3 <int>, pen <int>, yds_4 <int>, x1stpy <int>

Convert JSON to data.frame with more than 2 columns

I am trying to properly convert a JSON to a data.frame with 3 columns.
This is a simplification of my data
# simplification of my real data
my_data <- '{"Bag 1": [["bananas", 1], ["oranges", 2]],"Bag 2": [["bananas", 3], ["oranges", 4], ["apples", 5]]}'
library(jsonlite)
my_data <- fromJSON(my_data)
> my_data
$`Bag 1`
[,1] [,2]
[1,] "bananas" "1"
[2,] "oranges" "2"
$`Bag 2`
[,1] [,2]
[1,] "bananas" "3"
[2,] "oranges" "4"
[3,] "apples" "5"
I try to convert that to a data.frame
# this return an error about "arguments imply differing number of rows: 2, 3"
my_data <- as.data.frame(my_data)
> my_data <- as.data.frame(my_data)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 2, 3
This is my solution to create the data.frame
# my solution
my_data <- data.frame(fruit = do.call(c, my_data),
bag_number = rep(1:length(my_data),
sapply(my_data, length)))
# how it looks
my_data
> my_data
fruit bag_number
Bag 11 bananas 1
Bag 12 oranges 1
Bag 13 1 1
Bag 14 2 1
Bag 21 bananas 2
Bag 22 oranges 2
Bag 23 apples 2
Bag 24 3 2
Bag 25 4 2
Bag 26 5 2
But my idea is to obtain something like this to avoid problems like doing my_data[a:b,1] when I want to use ggplot2 and others.
fruit | quantity | bag_number
oranges | 2 | 1
bananas | 1 | 1
oranges | 4 | 2
bananas | 3 | 2
apples | 5 | 2
library(plyr)
# import data (note that the rJSON package does this differently than the jsonlite package)
data.import <- jsonlite::fromJSON(my_data)
# combine all data using plyr
df <- ldply(data.import, rbind)
# clean up column names
colnames(df) <- c('bag_number', 'fruit', 'quantity')
bag_number fruit quantity
1 Bag 1 bananas 1
2 Bag 1 oranges 2
3 Bag 2 bananas 3
4 Bag 2 oranges 4
5 Bag 2 apples 5
purrr / tidyverse version. You also get proper types with this and rid of "Bag":
library(jsonlite)
library(purrr)
library(readr)
fromJSON(my_data, flatten=TRUE) %>%
map_df(~as.data.frame(., stringsAsFactors=FALSE), .id="bag") %>%
type_convert() %>%
setNames(c("bag_number", "fruit", "quantity")) -> df
df$bag_number <- gsub("Bag ", "", df$bag_number)