Convert HTML Entity to proper character R - html

Does anyone know of a generic function in R that can convert an HTML entity such as &#228; to its proper character ä? I have seen functions that handle one specific entity and convert it to a normal character, but nothing generic. Any help would be appreciated. Thanks.
Edit: Below is a sample record from the data; the full data set has over 1 million such records. Is there an easier solution than reading the data into a massive vector and changing each element record by record?
wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains
wine/wineId: 43163
wine/variant: Pinot Noir
wine/year: 1999
review/points: N/A
review/time: 1337385600
review/userId: 1
review/userName: Eric
review/text: Well this is awfully gorgeous, especially with a nicely grilled piece of Copper River sockeye. Pine needle and piercing perfume move to a remarkably energetic and youthful palate of pure, twangy, red fruit. Beneath that is a fair amount of umami and savory aspect with a surprising amount of tannin. Lots of goodness here. Still quite young but already rewarding at this stage.
wine/name: 2001 Karth&#228;userhof Eitelsbacher Karth&#228;userhofberg Riesling Sp&#228;tlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
Update:
Using stringi::stri_trans_general() will convert any remaining accented character (such as ä) to its plain ASCII equivalent, and the vapply() results need to be assigned to save the changes.
library(XML)

# cellartracker-10records is the test file to use
tester <- "/Users/petergensler/Desktop/Wine Analysis/cellartracker-10records.txt"

decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x), "//p")[[1]])
}

# Using vapply, as we want to iterate over the raw file for cleaning
poop <- vapply(tester, decode, character(1), USE.NAMES = FALSE)

# Now use stringi to transliterate the remaining accented characters to ASCII
poop <- stringi::stri_trans_general(poop, "Latin-ASCII")
writeLines(poop, "wines.txt")
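A more generic, vectorised helper along these lines can also be sketched with the xml2 package (this is just an illustrative alternative, not part of the original code); read_html() errors on empty strings, so those are passed through unchanged:
library(xml2)

unescape_html <- function(x) {
  vapply(x, function(s) {
    if (!nzchar(s)) return(s)               # read_html() cannot parse ""
    xml_text(read_html(paste0("<x>", s, "</x>")))  # entities are decoded by libxml2
  }, character(1), USE.NAMES = FALSE)
}

unescape_html("Karth&#228;userhof Sp&#228;tlese")
# [1] "Karthäuserhof Spätlese"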

Here's one way via the XML package:
txt <- "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
The [[1]] bit is because getNodeSet() returns a list of parsed elements, even if there is only one element as is the case here.
This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.
If you want to run this for a character vector of length >1, then lapply() this:
txt <- rep(txt, 2)
decode <- function(x) {
  xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)
or if you want it as a vector, vapply():
> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
For the multi-line example, use the original version, but you have to write the character vector back out to a file if you want it as a multiline document again:
txt <- "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"
out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
This gives me
> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"
Which, if you write out using writeLines():
writeLines(out, "wines.txt")
You'll get a text file, which can be read in again using your other parsing code:
> readLines("wines.txt")
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
[2] "Riesling Spätlese"
[3] "wine/wineId: 3058"
[4] "wine/variant: Riesling"
[5] "wine/year: 2001"
[6] "review/points: N/A"
[7] "review/time: 1095120000"
[8] "review/userId: 1"
[9] "review/userName: Eric"
[10] "review/text: Hideously corked!"
And it is a file (from my bash terminal):
$ cat wines.txt
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!

Related

How to correctly identify html node

I want to scrape the price of a product on a webshop, but I'm struggling to identify the correct node for the price I want to scrape.
The relevant part of my code looks like this:
library(rvest)

"https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
  read_html() %>%
  html_nodes('span.woocommerce-Price-amount.amount') %>%
  html_text()
When executing this code I do get prices as a result, but not the ones I want (it shows the prices of other products that are listed further down the page).
How can I correctly identify the node for the price of the product itself (375.–)?
First: I don't know R.
This page uses JavaScript to add this price to the HTML, and I don't know if rvest can run JavaScript.
But I found the value as JSON in <form data-product_variations="...">, and with it I could display the prices for all options:
library(rvest)
library(jsonlite)

data <- "https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
  read_html() %>%
  html_nodes('form.variations_form.cart') %>%
  html_attr('data-product_variations') %>%
  fromJSON()

data$display_price
data$display_regular_price
data$image$title
Result:
> data$display_price
[1] 479 375 439 479 479
> data$display_regular_price
[1] 699 549 629 699 699
> data$image$title
[1] "aqua marina fusion bamboo padddel"
[2] "aqua marina fusion aluminium padddel"
[3] "aqua marina fusion carbon padddel"
[4] "aqua marina fusion hibi padddel"
[5] "aqua marina fusion silver padddel"
> colnames(data)
[1] "attributes" "availability_html" "backorders_allowed"
[4] "dimensions" "dimensions_html" "display_price"
[7] "display_regular_price" "image" "image_id"
[10] "is_downloadable" "is_in_stock" "is_purchasable"
[13] "is_sold_individually" "is_virtual" "max_qty"
[16] "min_qty" "price_html" "sku"
[19] "variation_description" "variation_id" "variation_is_active"
[22] "variation_is_visible" "weight" "weight_html"
[25] "is_bookable" "number_of_dates" "your_discount"
[28] "gtin" "your_delivery"
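To pull out just the 375.– price, one option (a small sketch based on the titles shown above, assuming the paddle name stays in the image title) is to filter on it:
# price of the aluminium-paddle variant shown on the page
data$display_price[grepl("aluminium", data$image$title)]
# [1] 375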
EDIT:
To work with pages that use JavaScript you may need other tools, such as phantomjs:
How to Scrape Data from a JavaScript Website with R | R-bloggers

rvest html_nodes() returns empty character

I am trying to scrape a website (https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281). In particular, I am trying to scrape all 281 "release dates" (with the first being '30-Oct-2006')
To do this, I am using the R package rvest and the SelectorGadget Chrome extension, on macOS 10.15.6.
I attempted the following code:
library(rvest)
library(httr)
library(xml2)
library(dplyr)
link = "https://genelab-data.ndc.nasa.gov/genelab/projects?page=1&paginate_by=281"
page = read_html(link)
year = page %>% html_nodes("td:nth-child(4) ul") %>% html_text()
However, this returns `character(0)`.
I used the code td:nth-child(4) ul because this is what SelectorGadget highlighted for each of the 281 release dates. I also tried to "View source page" but could not find these years listed on the source page.
I have read that rvest does not always work depending on the type of website. In this case, what is a possible workaround? Thank you.
This site gets the data from the API call https://genelab-data.ndc.nasa.gov/genelab/data/study/all, which returns JSON. You can use httr to get the data and parse the JSON:
library(httr)

url <- "https://genelab-data.ndc.nasa.gov/genelab/data/study/all"
output <- content(GET(url), as = "parsed", type = "application/json")

# sort by glds_id
output <- output[order(sapply(output, `[[`, i = "glds_id"))]

# build data frame
result <- list()
index <- 1
for (t in output[length(output):1]) {
  result[[index]] <- t$metadata
  result[[index]]$accession <- t$accession
  result[[index]]$legacy_accession <- t$legacy_accession
  index <- index + 1
}
df <- do.call(rbind, result)

options(width = 1200)
print(df)
Output sample (without all columns)
accession legacy_accession public_release_date title
[1,] "GLDS329" "GLDS-329" "30-Oct-2006" "Transcription profiling of atm mutant, adm mutant and wild type whole plants and roots of Arabidops" [truncated]
[2,] "GLDS322" "GLDS-322" "27-Aug-2020" "Comparative RNA-Seq transcriptome analyses reveal dynamic time dependent effects of 56Fe, 16O, and " [truncated]
[3,] "GLDS320" "GLDS-320" "18-Sep-2014" "Gamma radiation and HZE treatment of seedlings in Arabidopsis"
[4,] "GLDS319" "GLDS-319" "18-Jul-2018" "Muscle atrophy, osteoporosis prevention in hibernating mammals"
[5,] "GLDS318" "GLDS-318" "01-Dec-2019" "RNA seq of tumors derived from irradiated versus sham hosts transplanted with Trp53 null mammary ti" [truncated]
[6,] "GLDS317" "GLDS-317" "19-Dec-2017" "Galactic cosmic radiation induces stable epigenome alterations relevant to human lung cancer"
[7,] "GLDS311" "GLDS-311" "31-Jul-2020" "Part two: ISS Enterobacteriales"
[8,] "GLDS309" "GLDS-309" "12-Aug-2020" "Comparative Genomic Analysis of Klebsiella Exposed to Various Space Conditions at the International" [truncated]
[9,] "GLDS308" "GLDS-308" "07-Aug-2020" "Differential expression profiles of long non-coding RNAs during the mouse pronucleus stage under no" [truncated]
[10,] "GLDS305" "GLDS-305" "27-Aug-2020" "Transcriptomic responses of Serratia liquefaciens cells grown under simulated Martian conditions of" [truncated]
[11,] "GLDS304" "GLDS-304" "28-Aug-2020" "Global gene expression in response to X rays in mice deficient in Parp1"
[12,] "GLDS303" "GLDS-303" "15-Jun-2020" "ISS Bacillus Genomes"
[13,] "GLDS302" "GLDS-302" "31-May-2020" "ISS Enterobacteriales Genomes"
[14,] "GLDS301" "GLDS-301" "30-Apr-2020" "Eruca sativa Rocket Science RNA-seq"
[15,] "GLDS298" "GLDS-298" "09-May-2020" "Draft Genome Sequences of Sphingomonas sp. Isolated from the International Space Station Genome seq" [truncated]
...........................................................................
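If only the 281 release dates are needed, they can be pulled straight from the parsed list (a small sketch, assuming the public_release_date field shown in the table above sits in each study's metadata):
# one release date per study, e.g. "30-Oct-2006"
release_dates <- sapply(output, function(t) t$metadata$public_release_date)
head(release_dates)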

Read HTML Table from Greyhound via R

I'm trying to read the HTML data regarding Greyhound bus timings. An example can be found here. I'm mainly concerned with getting the schedule and status data off the table, but when I execute the following code:
library(XML)
url<-"http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
greyhound<-readHTMLTable(url)
greyhound<-greyhound[[2]]
This just produces a table of unrelated data. I'm not sure why it's grabbing data that's not even on the page, as opposed to the schedule and status table displayed on the site.
You can't retrieve the data using readHTMLTable because the trip results are generated by a JavaScript script. So you should select that script and parse it to extract the right information.
Here is a solution that does this:
Extract the JavaScript script that contains the JSON data.
Extract the JSON data from the script using a regular expression.
Parse the JSON data into an R list.
Reshape the resulting list into a table (a data.table here).
The code may look short, but it is really compact (it took me an hour to produce it)!
library(XML)
library(httr)
library(jsonlite)
library(data.table)

url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
dc <- htmlParse(GET(url))
script <- xpathSApply(dc, "//script/text()", xmlValue)[[5]]
res <- strsplit(script, "stopArray.push({", fixed = TRUE)[[1]][-1]

dcast(point ~ name,
      data = rbindlist(
        Map(function(x, y) {
          x <- paste('{', sub(');|);.*docum.*', "", x))
          dx <- unlist(fromJSON(x))
          data.frame(point = y, name = names(dx), value = dx)
        }, res, seq_along(res)),
        fill = TRUE
      )[name != "polyline"])
The resulting table:
point category direction id lat linkName lon
1: 1 2 empty 562310 41.878589630127 Chicago_Amtrak_IL -87.6398544311523
2: 2 2 empty 560252 41.8748474121094 Chicago_IL -87.6435165405273
3: 3 1 empty 561627 41.7223281860352 Chicago_95th_&_Dan_Ryan_IL -87.6247329711914
4: 4 2 empty 260337 41.6039199829102 Gary_IN -87.3386917114258
5: 5 1 empty 260447 40.4209785461426 Lafayette_e_IN -86.8942031860352
6: 6 2 empty 260392 39.7617835998535 Indianapolis_IN -86.161018371582
7: 7 2 empty 250305 39.1079406738281 Cincinnati_OH -84.5041427612305
name shortName ticketName
1: Chicago Amtrak: 225 S Canal St, IL 60606 Chicago Amtrak, IL CHD
2: Chicago: 630 W Harrison St, IL 60607 Chicago, IL CHD
3: Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628 Chicago 95th & Dan Ryan, IL CHD
4: Gary: 100 W 4th Ave, IN 46402 Gary, IN GRY
5: Lafayette (e): 401 N 3rd St, IN 47901 Lafayette (e), IN XIN
6: Indianapolis: 350 S Illinois St, IN 46225 Indianapolis, IN IND
7: Cincinnati: 1005 Gilbert Ave, OH 45202 Cincinnati, OH CIN
As @agstudy notes, the data is rendered to HTML; it's not delivered via HTML directly from the server. Therefore, you can (a) use something like RSelenium to scrape the rendered content, or (b) extract the data from the <script> tags that contain the data.
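For completeness, option (a) might look roughly like this with RSelenium (a sketch only, not part of either answer; it assumes a working local Selenium/Chrome setup):
library(RSelenium)
library(XML)

rd <- rsDriver(browser = "chrome", verbose = FALSE)   # starts a local Selenium server
remDr <- rd$client
remDr$navigate(url)
rendered <- remDr$getPageSource()[[1]]                # HTML after the JavaScript has run
greyhound <- readHTMLTable(htmlParse(rendered, asText = TRUE))
remDr$close()
rd$server$stop()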
To explain @agstudy's work, we observe that the data is contained in a series of stopArray.push() commands in one of the (many) script tags. For example:
stopArray.push({
"id" : "562310",
"name" : "Chicago Amtrak: 225 S Canal St, IL 60606",
"shortName" : "Chicago Amtrak, IL",
"ticketName" : "CHD",
"category" : 2,
"linkName" : "Chicago_Amtrak_IL",
"direction" : "empty",
"lat" : 41.87858963012695,
"lon" : -87.63985443115234,
"polyline" : "elr~Fnb|uOmC##nG?XBdH#rC?f#?P?V#`AlAAn#A`CCzBC~BE|CEdCA^Ap#A"
});
Now, this is json data contained inside each function call. I tend to think that if someone has gone to the work of formatting data in a machine-readable format, well golly we should appreciate it!
The tidyverse approach to this problem is as follows:
Download the page using the rvest package.
Identify the appropriate script tag to use by employing an xpath expression that searches for all script tags that contain the string url =.
Use a regular expression to pull out everything inside each stopArray.push() call.
Fix the formatting of the resulting object by (a) separating each block with commas, (b) surrounding the string by [] to indicate a json list.
Use jsonlite::fromJSON to convert into a data.frame.
Note that I hide the polyline column near the end, since it's too large to preview appropriately.
library(tidyverse)
library(rvest)
library(stringr)
library(jsonlite)

url <- "http://bustracker.greyhound.com/routes/4511/I/Chicago_Amtrak_IL-Cincinnati_OH/4511/10-26-2016"
page <- read_html(url)

page %>%
  html_nodes(xpath = '//script[contains(text(), "url = ")]') %>%
  html_text() %>%
  str_extract_all(regex("(?<=stopArray.push\\().+?(?=\\);)", multiline = T, dotall = T), F) %>%
  unlist() %>%
  paste(collapse = ",") %>%
  sprintf("[%s]", .) %>%
  fromJSON() %>%
  select(-polyline) %>%
  head()
#> id name
#> 1 562310 Chicago Amtrak: 225 S Canal St, IL 60606
#> 2 560252 Chicago: 630 W Harrison St, IL 60607
#> 3 561627 Chicago 95th & Dan Ryan: 14 W 95th St, IL 60628
#> 4 260337 Gary: 100 W 4th Ave, IN 46402
#> 5 260447 Lafayette (e): 401 N 3rd St, IN 47901
#> 6 260392 Indianapolis: 350 S Illinois St, IN 46225
#> shortName ticketName category
#> 1 Chicago Amtrak, IL CHD 2
#> 2 Chicago, IL CHD 2
#> 3 Chicago 95th & Dan Ryan, IL CHD 1
#> 4 Gary, IN GRY 2
#> 5 Lafayette (e), IN XIN 1
#> 6 Indianapolis, IN IND 2
#> linkName direction lat lon
#> 1 Chicago_Amtrak_IL empty 41.87859 -87.63985
#> 2 Chicago_IL empty 41.87485 -87.64352
#> 3 Chicago_95th_&_Dan_Ryan_IL empty 41.72233 -87.62473
#> 4 Gary_IN empty 41.60392 -87.33869
#> 5 Lafayette_e_IN empty 40.42098 -86.89420
#> 6 Indianapolis_IN empty 39.76178 -86.16102

How to loop - JSONP / JSON data using R

I thought I had parsed the data correctly using jsonlite & tidyjson. However, I am noticing that only the data from the first page is being parsed. Please advise how I could parse all the pages correctly. The total number of pages is over 1,300 according to the JSON output, so I think the data is available but not parsed correctly.
Note: I have used tidyjson, but am open to using jsonlite or any other library too.
library(dplyr)
library(tidyjson)
library(jsonlite)
library(httr)

req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1?OPERATION-NAME=findItemsByKeywords&SERVICE-VERSION=1.0.0&SECURITY-APPNAME=xxxxxx&GLOBAL-ID=EBAY-US&RESPONSE-DATA-FORMAT=JSON&callback=_cb_findItemsByKeywords&REST-PAYLOAD&keywords=harry%20potter&paginationInput.entriesPerPage=100")
txt <- content(req, "text")

# strip the JSONP wrapper to leave plain JSON
json <- sub("/**/_cb_findItemsByKeywords(", "", txt, fixed = TRUE)
json <- sub(")$", "", json)

data1 <- json %>% as.tbl_json %>%
  enter_object("findItemsByKeywordsResponse") %>% gather_array %>%
  enter_object("searchResult") %>% gather_array %>%
  enter_object("item") %>% gather_array %>%
  spread_values(
    ITEMID = jstring("itemId"),
    TITLE = jstring("title")
  ) %>%
  select(ITEMID, TITLE) # select only what is needed
Note: the JSON response includes "paginationOutput":[{"pageNumber":["1"],"entriesPerPage":["100"],"totalPages":["1393"],"totalEntries":["139269"]}]
No need for tidyjson. You will need to write another function/set of calls to get the total number of pages (it's over 1,400) to use the following, but that should be fairly straightforward. Try to compartmentalize your operations a bit more and use the full power of httr when you can to parameterize things:
library(dplyr)
library(jsonlite)
library(httr)
library(purrr)
get_pg <- function(i) {

  cat(".") # shows progress

  req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
                   query=list(`OPERATION-NAME`="findItemsByKeywords",
                              `SERVICE-VERSION`="1.0.0",
                              `SECURITY-APPNAME`="xxxxxxxxxxxxxxxxxxx",
                              `GLOBAL-ID`="EBAY-US",
                              `RESPONSE-DATA-FORMAT`="JSON",
                              `REST-PAYLOAD`="",
                              `keywords`="harry potter",
                              `paginationInput.pageNumber`=i,
                              `paginationInput.entriesPerPage`=100))

  dat <- fromJSON(content(req, as="text", encoding="UTF-8"))

  map_df(dat$findItemsByKeywordsResponse$searchResult[[1]]$item, function(x) {
    data_frame(ITEMID=flatten_chr(x$itemId),
               TITLE=flatten_chr(x$title))
  })

}

# "10" will need to be the max page number. I wasn't about to
# make 1,400 requests to ebay. I'd probably break them up into
# sets of 30 or 50 and save off temporary data frames as rdata files
# just so you don't get stuck in a situation where R crashes and you
# have to get all the data again.
srch_dat <- map_df(1:10, get_pg)

srch_dat
## Source: local data frame [1,000 x 2]
##
## ITEMID TITLE
## (chr) (chr)
## 1 371533364795 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 2 331128976689 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 3 131721213216 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 4 171430021529 New Harry Potter Hermione Granger Rotating Time Turner Necklace Gold Hourglass
## 5 261597812013 Harry Potter Time Turner+GOLD Deathly Hallows Charm Pendant necklace
## 6 111883750466 Harry Potter: Complete 8-Film Collection (DVD, 2011, 8-Disc Set)
## 7 251947403227 HOT New Harry Potter 14.5" Magical Wand Replica Cosplay In Box
## 8 351113839731 Marauder's Map Hogwarts Wizarding World Harry Potter Warner Bros LIMITED **NEW**
## 9 171912724869 Harry Potter Time Turner Necklace Hermione Granger Rotating Spins Gold Hourglass
## 10 182024752232 Harry Potter : Complete 8-Film Collection (DVD, 2011, 8-Disc Set) Free Shipping
## .. ... ...
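The answer above leaves finding the total number of pages as an exercise; a rough sketch of that step (my own, assuming the paginationOutput structure quoted in the question) could be:
get_total_pages <- function() {
  req <- httr::GET("http://svcs.ebay.com/services/search/FindingService/v1",
                   query=list(`OPERATION-NAME`="findItemsByKeywords",
                              `SERVICE-VERSION`="1.0.0",
                              `SECURITY-APPNAME`="xxxxxxxxxxxxxxxxxxx",
                              `GLOBAL-ID`="EBAY-US",
                              `RESPONSE-DATA-FORMAT`="JSON",
                              `REST-PAYLOAD`="",
                              `keywords`="harry potter",
                              `paginationInput.entriesPerPage`=100))
  # simplifyVector = FALSE keeps the nesting exactly as in the raw JSON, so the
  # paginationOutput path below matches the structure quoted in the question
  dat <- fromJSON(content(req, as="text", encoding="UTF-8"), simplifyVector = FALSE)
  as.integer(dat$findItemsByKeywordsResponse[[1]]$paginationOutput[[1]]$totalPages[[1]])
}

# n_pages <- get_total_pages()
# srch_dat <- map_df(seq_len(n_pages), get_pg)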

Encoding Issue in R htmlParse XML

I'm trying to scrape a website but can't handle this encoding issue:
library(XML)

# putting together the url:
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=", search_str,
             "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")

# get content and parse it:
doc <- htmlParse(url)

# encoding issue, like here..
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)
[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
[5] "Mà Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
[7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"
[8] "MO Rödel, R Ernst - Ecotropica, 2004 - gtoe.de"
# ...
any pointers?
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.1
> getOption("encoding")
[1] "native.enc"
This worked to some degree for me:
doc <- htmlParse(url, encoding = "UTF-8")
head(xpathSApply(doc, '//div[@class="gs_a"]', xmlValue))
#[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
#[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
#[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
#[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
#[5] "MÁ Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
#[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
though
xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)[[81]]
was displaying incorrectly on my Windows box, for example.
Switching to the font DotumChe via the GUI preferences, however, showed it displaying correctly, so it may just be a display issue, not a parsing one.
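One quick way to separate a parsing problem from a display problem (a small check of my own, not from the original answer) is to look at the declared encoding of the parsed strings and write them out to a file for inspection in a UTF-8 aware editor:
res <- xpathSApply(doc, '//div[@class="gs_a"]', xmlValue)

Encoding(res[81])                              # "UTF-8" if the parse kept the declared encoding
writeLines(res, "gs_a.txt", useBytes = TRUE)   # hypothetical file name; open it in a UTF-8 editor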