R: Missing values when scraping webpage - html

When scraping data from a webpage, some elements/values are not returned.
Specifically, I use the rvest package to scrape.
The webpage that contains the information I want is https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ - however, when I scrape the data, the columns with prices only return "$-".
Sample code:
library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls) - 2)] %>%
  html_table()
Output of first df:
List of 22
 $ :'data.frame': 7 obs. of 6 variables:
  ..$ Instance                                     : chr [1:7] "B1L" "B1S" "B2S" "B1MS" ...
  ..$ Cores                                        : int [1:7] 1 1 2 1 2 4 8
  ..$ RAM                                          : chr [1:7] "0.50 GiB" "1.00 GiB" "4.00 GiB" "2.00 GiB" ...
  ..$ Temporary Storage                            : chr [1:7] "1 GiB" "2 GiB" "8 GiB" "4 GiB" ...
  ..$ Price                                        : chr [1:7] "$-" "$-" "$-" "$-" ...
  ..$ Prices with Azure Hybrid Benefit1 (% savings): chr [1:7] "$-" "$-" "$-" "$-" ...
What can I do to get the whole value of these specific elements?

The page ships a single set of price data regardless of the filters, stored as JSON in each price cell's data-amount attribute. So you need to take that attribute's value and parse the JSON.
library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls) - 2)] %>%
  html_table()
ss <- webpage %>% html_nodes("table span.price-data") %>% xml_attr("data-amount")
lapply(ss,function(x){data.frame(jsonlite::fromJSON(x))})
Sample output:
[[176]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 1.496 1.496 1.376 1.376
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 1.488 1.464 1.448 1.373 1.504
regional.us.west regional.us.west.2
1 1.376 1.248
[[177]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 4.464 4.464 4.224 4.224
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 4.448 4.4 4.368 4.365 4.48
regional.us.west regional.us.west.2
1 4.224 3.968
From there, match the region you are interested in and take its price from the corresponding column.
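To illustrate the matching step, here is a small self-contained sketch with two made-up data-amount strings shaped like the scraped ones (the region names follow the sample output above and may change as Azure updates the page):

```r
library(jsonlite)

# Two made-up 'data-amount' strings shaped like the scraped ones above
ss <- c('{"regional":{"us-east-2":1.373,"us-west":1.376}}',
        '{"regional":{"us-east-2":4.365,"us-west":4.224}}')

# Same parsing step as above: one 1-row data frame per price cell
prices <- lapply(ss, function(x) data.frame(fromJSON(x)))

# Pull one region's price out of each parsed record
us_east2 <- vapply(prices, `[[`, numeric(1), "regional.us.east.2")
us_east2
# 1.373 4.365
```

Note that `check.names = TRUE` in `data.frame()` is what turns `us-east-2` into the `regional.us.east.2` column name seen in the sample output.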


fast JSON to data.frame/data.table

I have this R code to convert JSON data to a data.frame. It works fine, but it is rather slow for huge JSON files. What is a more efficient way to do this (I wouldn't mind a data.table output)?
json_data <- fromJSON(json_dt_url)
json_data <- json_data[['data']]
my_df <- data.frame()
for (i in seq_along(json_data)) {
  my_df <- rbind(my_df, as.data.frame(json_data[[i]]))
}
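As a side note on the loop itself: growing my_df with rbind() inside a for loop is quadratic, so regardless of the parser it usually pays to build the pieces first and bind once, e.g. with do.call(rbind, ...) or data.table::rbindlist(). A minimal sketch on made-up data (json_data here is a hypothetical stand-in, not your file):

```r
# Stand-in for the parsed 'data' element of the JSON (hypothetical shape)
json_data <- list(list(id = 1L, val = "a"), list(id = 2L, val = "b"))

# Build all one-row pieces first, then bind once instead of rbind-in-a-loop
my_df <- do.call(rbind, lapply(json_data, as.data.frame))
my_df
#   id val
# 1  1   a
# 2  2   b
```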
If you are looking for fast JSON parsing, take a look at RcppSimdJson.
library(RcppSimdJson)
jsonfile <- system.file("jsonexamples", "small", "demo.json", package="RcppSimdJson")
res <- fload(jsonfile)
str(res)
#> List of 1
#> $ Image:List of 6
#> ..$ Width : int 800
#> ..$ Height : int 600
#> ..$ Title : chr "View from 15th Floor"
#> ..$ Thumbnail:List of 3
#> .. ..$ Url : chr "http://www.example.com/image/481989943"
#> .. ..$ Height: int 125
#> .. ..$ Width : int 100
#> ..$ Animated : logi FALSE
#> ..$ IDs : int [1:4] 116 943 234 38793
Created on 2020-08-05 by the reprex package (v0.3.0)
Using the benchmarking code from the package, we can compare different parsing approaches:
file <- system.file("jsonexamples", "mesh.json", package = "RcppSimdJson")
res <- bench::mark(
  RcppSimdJson = RcppSimdJson::fload(file),
  jsonlite = jsonlite::fromJSON(file),
  jsonify = jsonify::from_json(file),
  RJSONIO = RJSONIO::fromJSON(file),
  ndjson = ndjson::stream_in(file),
  check = FALSE
)
res
#> # A tibble: 5 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 RcppSimdJson   1.51ms   1.67ms    582.      5.82MB     5.98
#> 2 jsonlite      44.68ms  48.95ms     18.8     2.74MB    22.6
#> 3 jsonify        9.76ms  11.34ms     87.5     1.12MB    43.7
#> 4 RJSONIO       33.11ms  35.17ms     28.6     2.93MB     3.82
#> 5 ndjson       136.35ms 138.67ms      7.21    9.41MB    30.6
Created on 2020-08-05 by the reprex package (v0.3.0)
We see that RcppSimdJson is by far the fastest.
Another option is to flatten nested structures while parsing:
data2 <- fromJSON("data.json", flatten = TRUE)
Reference: https://rdrr.io/cran/jsonlite/f/vignettes/json-apis.Rmd
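For context, flatten = TRUE hoists nested objects into dotted top-level column names, which is often all that's needed (a minimal example using an inline JSON string instead of a data.json file):

```r
library(jsonlite)

txt <- '[{"id":1,"subject":{"name":"Namibia"}}]'

# flatten = TRUE turns the nested "subject" object into a dotted column
df <- fromJSON(txt, flatten = TRUE)
names(df)
# "id" "subject.name"
```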
Try this way:
library(jsonlite)
json_data <- read_json("data.json", simplifyVector = TRUE)
Include the sample input so that I can test the solution myself!

Filter in Nested Data Frame

I am playing around with the Yelp data set and want to filter the business set according to the category.
I imported the JSON file into R with
yelp_business = stream_in(file("yelp_academic_dataset_business.json"))
which results then in the following data frame:
'data.frame': 77445 obs. of 15 variables:
$ business_id : chr "5UmKMjUEUNdYWqANhGckJw" "UsFtqoBl7naz8AVUBZMjQQ" "3eu6MEFlq2Dg7bQh8QbdOg" "cE27W9VPgO88Qxe4ol6y_g" ...
$ full_address : chr "4734 Lebanon Church Rd\nDravosburg, PA 15034" "202 McClure St\nDravosburg, PA 15034" "1 Ravine St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" ...
$ hours :'data.frame': 77445 obs. of 7 variables:
..$ Friday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Tuesday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Thursday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Wednesday:'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Monday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr "21:00" NA NA NA ...
.. ..$ open : chr "11:00" NA NA NA ...
..$ Sunday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
..$ Saturday :'data.frame': 77445 obs. of 2 variables:
.. ..$ close: chr NA NA NA NA ...
.. ..$ open : chr NA NA NA NA ...
$ open : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ categories :List of 77445
..$ : chr "Fast Food" "Restaurants"
..$ : chr "Nightlife"
..$ : chr "Auto Repair" "Automotive"
..$ : chr "Active Life" "Mini Golf" "Golf"
..$ : chr "Shopping" "Home Services" "Internet Service Providers" "Mobile Phones" ...
..$ : chr "Bars" "American (New)" "Nightlife" "Lounges" ...
..$ : chr "Active Life" "Trainers" "Fitness & Instruction"
..$ : chr "Bars" "American (Traditional)" "Nightlife" "Restaurants"
..$ : chr "Auto Repair" "Automotive" "Tires"
..$ : chr "Active Life" "Mini Golf"
..$ : chr "Home Services" "Contractors"
..$ : chr "Veterinarians" "Pets"
..$ : chr "Libraries" "Public Services & Government"
..$ : chr "Automotive" "Auto Parts & Supplies"
I now want to filter the rows by business category, keeping every business whose category list contains "food".
However, if I just try it this way:
input ="food"
engage = filter(yelp_business, grepl(input, categories))
I receive the following error code:
Error: data_frames can only contain 1d atomic vectors and lists
I first suspected the nested structure was the reason. However, using tidyjson does not help either, since categories is a list, not a data frame, within the main data frame.
Does anyone have an idea how to solve this? I just need a list of all food businesses' ids so that I can filter Yelp's review JSON file and extract the written reviews.
Any help with this is really appreciated! Thanks a lot!
tidyjson does not yet support ndjson, and I am not quite sure how to work nicely with stream_in().
However, it is possible to read the file directly and process it naturally with tidyjson. I am using the development version from devtools::install_github('jeremystan/tidyjson').
document.id gives a nice identification of objects, so I find the document.ids that have "food" in one of the "categories". From that point, we filter and do whatever additional data analysis is desired.
library(dplyr)
library(stringr)
library(tidyjson)
j <- readLines("yelp_academic_dataset_business.json")
raw <- j %>% as.tbl_json()
## pull out the categories for filtering
prep <- raw %>%
  enter_object("categories") %>%
  gather_array() %>%
  append_values_string()
## filter to 'food' categories (use document.id to identify json objects)
keepids <- prep[str_detect(str_to_lower(prep$string), "food"), ]$document.id %>%
  unique()
## filter and do any further data analysis you want to do
raw %>%
  filter(document.id %in% keepids) %>%
  spread_values(
    name = json_chr(name),
    city = json_chr(city),
    state = json_chr(state),
    stars = json_chr(stars)
  )
#> # A tbl_json: 21 x 5 tibble with a "JSON" attribute
#> `attr(., "JSON")` document.id name city
#> <chr> <int> <chr> <chr>
#> 1 "{\"business_id\":..." 2 Cut and Taste Las Vegas
#> 2 "{\"business_id\":..." 8 Taco Bell Scottsdale
#> 3 "{\"business_id\":..." 10 Sehne Backwaren Stuttgart
#> 4 "{\"business_id\":..." 20 Graceful Cake Creations Mesa
#> 5 "{\"business_id\":..." 26 Chipotle Mexican Grill Toronto
#> 6 "{\"business_id\":..." 30 Carrabba's Italian Grill Glendale
#> 7 "{\"business_id\":..." 32 I Deal Coffee Toronto
#> 8 "{\"business_id\":..." 34 Lo-Lo's Chicken & Waffles Phoenix
#> 9 "{\"business_id\":..." 38 Kabob Palace Las Vegas
#> 10 "{\"business_id\":..." 43 Tea Shop 168 Markham
#> # ... with 11 more rows, and 2 more variables: state <chr>, stars <chr>
NOTE - I only processed the first 100 records of the yelp_academic_dataset_business.json file.
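If tidyjson is not available, the list column that stream_in() produces can also be filtered with base R by applying grepl() per row rather than inside filter(), which is what triggered the error above (a sketch on a toy stand-in for yelp_business, not the real file):

```r
# Toy stand-in shaped like the stream_in() result (hypothetical ids)
yelp_business <- data.frame(business_id = c("id1", "id2", "id3"),
                            stringsAsFactors = FALSE)
yelp_business$categories <- list(c("Fast Food", "Restaurants"),
                                 "Nightlife",
                                 c("Food", "Coffee & Tea"))

# any(grepl(...)) collapses each category vector to one TRUE/FALSE flag
keep <- vapply(yelp_business$categories,
               function(x) any(grepl("food", x, ignore.case = TRUE)),
               logical(1))
yelp_business$business_id[keep]
# "id1" "id3"
```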

Transform list cell in data frame into rows

I'm sorry I have no code to reproduce this; I can only provide a picture, shown below.
A data frame with Facebook insights data, prepared from JSON, contains a column "values" whose cells are lists. For the next manipulation I need only one value per row in that column, so row 3 in the picture should be transformed into two rows (with the list content or the value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If a list cell in the data frame holds 3 or more values, it should produce 3 or more single-value rows.
Do you know how to do this?
I use this code to get the json and data frame:
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result, I need just the values and their names.
Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
list(value = list(like = 38, share = 11)),
list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
Instead of retaining lists in your output, I would suggest flattening out the data, perhaps using a function like this:
library(data.table)

myFun <- function(indt, col) {
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  other_names <- setdiff(names(indt), col)
  list_col <- indt[[col]]
  ## how many flattened values each original row expands to
  rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
  ## split the names of the unlisted values into columns; keep the values in 'val'
  flat <- {
    if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
    setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
      , val := unlist(list_col, use.names = FALSE)][]
  }
  ## repeat each original row as needed and drop the list column
  cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13

"NA" in JSON file translates to NA logical

I have json files with data for countries. One of the files has the following data:
"[{\"count\":1,\"subject\":{\"name\":\"Namibia\",\"alpha2\":\"NA\"}}]"
I have the following code convert the json into a data.frame using the jsonlite package:
df <- as.data.frame(fromJSON(jsonfile, flatten = TRUE))
I was expecting a data.frame with numbers and strings:
  count subject.name subject.alpha2
1     1      Namibia           "NA"
Instead, the NA alpha2 code is being automatically converted into NA logical, and this is what I get:
str(df)
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: logi NA
I want alpha2 to be a string, not logical. How do I fix this?
That particular implementation of fromJSON (there are three different packages that export a function by that name) has a simplifyVector argument which appears to prevent the coercion:
require(jsonlite)
> as.data.frame( fromJSON(test, simplifyVector=FALSE ) )
count subject.name subject.alpha2
1 1 Namibia NA
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : Factor w/ 1 level "Namibia": 1
$ subject.alpha2: Factor w/ 1 level "NA": 1
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ,stringsAsFactors=FALSE) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: chr "NA"
I tried seeing if that option worked well with the flatten argument, but was disappointed:
> str( fromJSON(test, simplifyVector=FALSE, flatten=TRUE) )
List of 1
$ :List of 2
..$ count : int 1
..$ subject:List of 2
.. ..$ name : chr "Namibia"
.. ..$ alpha2: chr "NA"
The accepted answer did not solve my use case.
However, rjson::fromJSON does this naturally, and to my surprise, 10 times faster on my data.
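For completeness, a minimal illustration of the rjson behaviour (assumes the rjson package is installed; the input string is the one from the question):

```r
txt <- '[{"count":1,"subject":{"name":"Namibia","alpha2":"NA"}}]'

# rjson performs no type simplification, so "NA" stays a character string
parsed <- rjson::fromJSON(txt)
parsed[[1]]$subject$alpha2
# "NA"
```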

How do I write a json array from R that has a sequence of lat and long?

How do I write a json array from R that has a sequence of lat and long?
I would like to write:
[[[1,2],[3,4],[5,6]]]
the best I can do is:
toJSON(matrix(1:6, ncol = 2, byrow = T))
#"[ [ 1, 2 ],\n[ 3, 4 ],\n[ 5, 6 ] ]"
How can I wrap the thing in another array (the json kind)?
This is important to me so I can write files into a geojson format as a LineString.
I usually use fromJSON first to see what the target object should look like:
ll <- fromJSON('[[[1,2],[3,4],[5,6]]]')
str(ll)
List of 1
$ :List of 3
..$ : num [1:2] 1 2
..$ : num [1:2] 3 4
..$ : num [1:2] 5 6
So we need to create a one-element list whose single element is an unnamed list of 2-element vectors:
xx <- list(setNames(split(1:6,rep(1:3,each=2)),NULL))
identical(toJSON(xx),'[[[1,2],[3,4],[5,6]]]')
[1] TRUE
If you have a matrix
m1 <- matrix(1:6, ncol = 2, byrow = TRUE)
maybe this helps:
library(rjson)
paste0("[", toJSON(setNames(split(m1, row(m1)), NULL)), "]")
#[1] "[[[1,2],[3,4],[5,6]]]"
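The same outer wrapping also falls out of list nesting with jsonlite, whose toJSON() serialises an unnamed one-element list as an extra JSON array level (a sketch assuming jsonlite rather than the rjson used above):

```r
library(jsonlite)

m <- matrix(1:6, ncol = 2, byrow = TRUE)

# A matrix serialises row by row to [[1,2],[3,4],[5,6]];
# wrapping it in a one-element list adds the outer array
as.character(toJSON(list(m)))
# "[[[1,2],[3,4],[5,6]]]"
```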