Using rvest to scrape an HTML table

I am trying to use the rvest package to scrape a table:
library(rvest)
x <- read_html ("http://www.jcb.jp/rate/usd04182016.html")
x %>% html_node(".CSVTable") %>% html_table
The page's elements look like this:
<table class="CSVTable">
<tbody>...</tbody>
<tbody class>...</tbody>
</table>
Why do I get the error "No matches"?

You're in luck (kind of). The site builds that table dynamically with an XHR request, so the table is not in the HTML that read_html() downloads, but that request simply fetches a CSV file.
library(rvest)
library(stringr)
pg <- read_html("http://www.jcb.jp/rate/usd04182016.html")
# the <script> tag that does the dynamic loading is in position 6 of the
# list of <script> tags
fil <- str_match(html_text(html_nodes(pg, "script")[6]), "(/uploads/[[:digit:]]+\\.csv)")[,2]
df <- read.csv(sprintf("http://www.jcb.jp%s", fil), header=FALSE, stringsAsFactors=FALSE)
df <- setNames(df[,3:6], c("buy", "mid", "sell", "symbol"))
head(df)
## buy mid sell symbol
## 1 3.6735 3.6736 3.6737 AED
## 2 68.2700 69.0700 69.8700 AFN
## 3 122.3300 122.6300 122.9300 ALL
## 4 479.5000 481.0000 482.5000 AMD
## 5 1.7710 1.8110 1.8510 ANG
## 6 165.0600 165.3100 165.5600 AOA
But, that also means you can just get the CSV directly:
read.csv("http://www.jcb.jp/uploads/20160418.csv")
(just format the date properly in your requests).
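As a rough sketch of what "format the date properly" could look like, the helper below builds the CSV URL for a given day. The function name jcb_csv_url is made up for illustration, and the /uploads/YYYYMMDD.csv pattern is assumed from the single URL shown above, so it may not hold for every date.

```r
# Hypothetical helper: build the CSV URL for a given date.
# Assumes the /uploads/YYYYMMDD.csv pattern seen in the example URL above.
jcb_csv_url <- function(date) {
  sprintf("http://www.jcb.jp/uploads/%s.csv", format(as.Date(date), "%Y%m%d"))
}

url <- jcb_csv_url("2016-04-18")
print(url)

# Reading it works like any other CSV (needs network access):
# df <- read.csv(url, header = FALSE, stringsAsFactors = FALSE)
```

Going through as.Date() means a malformed date fails early instead of producing a URL that silently returns a 404.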

Related

R web scraping with Rselenium and rvest

I need to scrape this webpage so that I end up with a data.frame like this:
value01      value02                                           id
SECTION I    LIVE ANIMALS ANIMAL PRODUCTS                      sectionI
CHAPTER 1    LIVE ANIMALS                                      chap0100000000
0101         Live horses, asses, mules and hinnies : (TN701)   0101000000-1
             - Horses :                                        0101210000-2
0101 21      - - Pure-bred breeding animals (NC018)            0101210000-80
0101 29      - - Other :                                       0101290000-3
0101 29 10   - - - For slaughter                               0101291000-80
0101 29 90   - - - Other                                       0101299000-80
0101 30      - Asses                                           0101300000-80
To obtain the first two rows of value01 and value02 I use:
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.section') %>% html_table())[2])
unlist((remDr$getPageSource()[[1]] %>% read_html(encoding = 'UTF-8') %>% html_elements('.chapter') %>% html_table())[2])
To obtain the remaining values of value01 and value02 I use the following (the values need cleaning afterwards, and I suspect there is a better way to get the data):
remDr$getPageSource()[[1]] %>% read_html() %>% html_element(xpath = '//*[@id="div_description"]') %>% html_table()
So my problem now is to get the id column of the data.frame I want and to put it all together. Any advice on how to proceed from here to achieve my goal?
The code you need to run first for the previous examples to work:
suppressMessages(suppressWarnings(library(RSelenium)))
suppressMessages(suppressWarnings(library(rvest)))
rD <- rsDriver(browser = 'firefox', port = 6000L, verbose = FALSE)
remDr <- rD[['client']]
remDr$navigate('https://ec.europa.eu/taxation_customs/dds2/taric/measures.jsp?Lang=en&Domain=TARIC&Offset=0&ShowMatchingGoods=false&callbackuri=CBU-1&SimDate=20220719')
It is not quite clear to me what you want to scrape exactly from that page, but this is how you can get the data I think you are after.
pg <- remDr$getPageSource()[[1]]
doc <- xml2::read_html(pg)
# first two lines
rvest::html_elements(doc, '#sectionI table, .chapter') |>
  rvest::html_table()
# get the data from each further line
lines <- rvest::html_elements(doc, ".evenLine")
data <- rvest::html_table(lines)
ids <- rvest::html_attrs(lines) |> sapply(function(x) x[1])
You'll need to clean the scraped data to your liking.
If this is not what you are looking for, you should clarify your question further.
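To put the pieces together into one data frame with an id column, something along these lines might work. This is only a sketch on made-up markup: the real page's structure is not shown here, so the div/table layout, and the assumption that each .evenLine element carries the code in its id attribute, are hypothetical.

```r
library(rvest)

# Hypothetical markup, loosely modelled on the rows described in the question.
html <- '
<div id="0101210000-80" class="evenLine">
  <table><tr><td>0101 21</td><td>- - Pure-bred breeding animals</td></tr></table>
</div>
<div id="0101290000-3" class="evenLine">
  <table><tr><td>0101 29</td><td>- - Other :</td></tr></table>
</div>'

doc <- minimal_html(html)
lines <- html_elements(doc, ".evenLine")

# For each line, take the two cell texts and the id attribute of the line itself.
rows <- lapply(lines, function(node) {
  cells <- html_text2(html_elements(node, "td"))
  data.frame(value01 = cells[1],
             value02 = cells[2],
             id = html_attr(node, "id"),
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
print(result)
```

With RSelenium, doc would instead come from read_html(remDr$getPageSource()[[1]]), and the selectors would need adjusting to the actual page.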

How can I filter out numbers from an html table in R?

I am currently working on a forecasting model and to do this I would like to import data from an HTML website into R and save the values-part of the data set into a new list.
I have used the following approach in R:
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document,"//*/comment()"))
doc.tables<-readHTMLTable(document)
# show BID/ASK block:
doc.tables[2]
Which (doc.tables[2]) gives me in this case the result:
$`NULL`
Bid 0,765
1 Ask 0,80
How can I extract the numbers (0,765 and 0,80) from the table and save them in a list?
The issue is that 0,765 is actually the name of a column in your data frame, because the first row was read as a header.
Your data frame is doc.tables[[2]].
You can grab the name by calling names(doc.tables[[2]])[2]
and store it in a variable, e.g. name <- names(doc.tables[[2]])[2]
Then you can grab the 0,80 with doc.tables[[2]][[2]], and store that in a variable too if you like.
Final code should look like... my_list <- list(name, doc.tables[[2]][[2]])
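A small self-contained sketch of the same idea, using a stand-in data frame instead of the live page (the values are the ones shown in the question; the object names tbl, raw and my_list are just for illustration). It also converts the decimal commas so the result is numeric.

```r
# Stand-in for doc.tables[[2]]: the first row was read as the header,
# so "0,765" ended up as a column name.
tbl <- data.frame(V1 = "Ask", V2 = "0,80", stringsAsFactors = FALSE)
names(tbl) <- c("Bid", "0,765")

# Collect the header value and the column value, then convert the decimal comma.
raw <- c(names(tbl)[2], tbl[[2]])
values <- as.numeric(sub(",", ".", raw, fixed = TRUE))

# Label each number with "Bid" / "Ask".
my_list <- setNames(as.list(values), c(names(tbl)[1], tbl[[1]]))
str(my_list)
```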
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
tbl <- page %>%
html_elements("tr") %>%
html_text() %>%
.[3:4] %>%
stringr::str_replace_all(",", ".")
tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)
page %>%
html_elements("tr") %>%
html_text() %>%
.[3:4] %>%
str_replace_all(",", ".") %>%
tibble(name = str_extract(., "Ask|Bid"),
value = readr::parse_number(.)) %>%
.[-1]
#> # A tibble: 2 x 2
#> name value
#> <chr> <dbl>
#> 1 Bid 0.765
#> 2 Ask 0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)
# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header=FALSE)
# Extract column from BID/ASK table
BidAsk = doc.tables[[2]][, 2]
# Replace the decimal comma with a point
BidAsk = gsub(",", ".", BidAsk)
# Convert to numeric
BidAsk = as.numeric(BidAsk)

Extract the element from html page in R

I am new to R and trying to scrape the map data from the following webpage:
https://www.svk.se/en/national-grid/the-control-room/. The map is called "The flow of electricity". I am trying to scrape the capacity numbers (in blue) and the corresponding countries. So far I could not find a solution on how to find the countries' names in the HTML code and consequently scrape them.
Here is an example of data I need:
Would you have any idea?
Thanks a lot in advance.
The data is not in a table, so we need to extract each piece of information individually.
Here is a way to do this using rvest.
library(rvest)
url <-'https://www.svk.se/en/national-grid/the-control-room/'
webpage <- url %>% read_html() %>%html_nodes('div.island')
tibble::tibble(country = webpage %>% html_nodes('span.country') %>% html_text(),
watt = webpage %>% html_nodes('span.watt') %>% html_text() %>%
gsub('\\s', '', .) %>% as.numeric(),
unit = webpage %>% html_nodes('span.unit') %>% html_text())
# country watt unit
# <chr> <dbl> <chr>
#1 SWEDEN 3761 MW
#2 DENMARK 201 MW
#3 NORWAY 2296 MW
#4 FINLAND 1311 MW
#5 ESTONIA 632 MW
#6 LATVIA 177 MW
#7 LITHUANIA 1071 MW
The flow data comes from an API call, so you need to make an additional XHR request (to a URL you can find in the network tab of your browser's dev tools) to get it. You don't need to specify values for the timestamp (Ticks) and random (rnd) parameters in the query string.
library(jsonlite)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
As a dataframe:
library(jsonlite)
library (plyr)
data <- jsonlite::read_json('https://www.svk.se/Proxy/Proxy/?a=http://driftsdata.statnett.no/restapi/PhysicalFlowMap/GetFlow?Ticks=&rnd=')
df <- ldply (data, data.frame)

Read HTML Table Into Data Frame with Hyperlinks in R

I am trying to read an HTML table from a publicly-accessible website into a data frame in R. The final column of the table contains hyperlinks, and I would like to read these hyperlinks into the table rather than the text that is displayed on the webpage. I've reviewed several posts here on StackOverflow and on other sites and have gotten almost there, but I haven't been able to read the hyperlinks themselves.
The table I'm trying to read is here: http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey.
The final column contains hyperlinks that point to the actual data in *.ZIP file format for download. I've managed to read the table into R as text, but I can't figure out how to resolve the hyperlinks in the final column.
Here's what I have so far:
library(XML)
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
page <- htmlParse( webURL )
tableNodes <- getNodeSet( page, "//table" )
myTable <- readHTMLTable( tableNodes[[3]] )
However, this contains the text in the final column, not the hyperlink. How do I replace the word "zip" in the final column of this table in R with the values for the corresponding hyperlink in each row?
I find using the rvest package easier than XML.
Here is a solution to obtain a list of the links:
webURL <- 'http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey'
library(rvest)
page<-read_html(webURL)
links<-page %>% html_nodes("a") %>% html_attr("href")
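To get the links into the table itself rather than as a separate vector, one option is to read the table with html_table() and then overwrite the last column with the href values. The sketch below runs on a small made-up table (the column names, file names and download paths are hypothetical), and it assumes exactly one link per data row; prefixing the host is also an assumption based on the relative URLs the site appears to use.

```r
library(rvest)

# Hypothetical two-row table with one download link per row.
html <- '<table>
<tr><th>Name</th><th>File</th></tr>
<tr><td>report_a_csv.zip</td><td><a href="/download?id=1">zip</a></td></tr>
<tr><td>report_b_xml.zip</td><td><a href="/download?id=2">zip</a></td></tr>
</table>'

doc <- minimal_html(html)
tbl_node <- html_element(doc, "table")

# Read the visible text first; the File column contains just "zip" here.
df <- as.data.frame(html_table(tbl_node))

# Replace the link text with the link target (one <a> per data row assumed).
hrefs <- html_attr(html_elements(tbl_node, "td a"), "href")
df$File <- paste0("http://mis.ercot.com", hrefs)
print(df)
```

If some rows had no link, the lengths would not match and the assignment would fail, so a per-row extraction would be the safer choice on messier tables.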
This code will let you target either the XML files or the CSV files and you get the filename as well as the URL so you can then iterate over the URLs and filenames and save them with names you'll recognize later on.
library(rvest)
library(dplyr)
pg <- read_html("http://mis.ercot.com/misapp/GetReports.do?reportTypeId=12300&reportTitle=LMPs%20by%20Resource%20Nodes,%20Load%20Zones%20and%20Trading%20Hubs&showHTMLView=&mimicKey")
csv_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'csv')]/..")
data_frame(
fil_name = html_nodes(csv_fils, "td.labelOptional_ind") %>% html_text(),
url = html_nodes(csv_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> csv_df
glimpse(csv_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015151.LMPSROSNODENP6788_20170729_094011_csv.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923018", "/misdownload/servlets/mirD...
xml_fils <- html_nodes(pg, xpath=".//td[contains(@class, 'labelOptional_ind') and contains(., 'xml')]/..")
data_frame(
fil_name = html_nodes(xml_fils, "td.labelOptional_ind") %>% html_text(),
url = html_nodes(xml_fils, xpath=".//td[4]/div/a") %>% html_attr("href")
) -> xml_df
glimpse(xml_df)
## Observations: 1,560
## Variables: 2
## $ fil_name <chr> "cdr.00012300.0000000000000000.20170729.094015016.LMPSROSNODENP6788_20170729_094011_xml.zip", "cdr...
## $ url <chr> "/misdownload/servlets/mirDownload?mimic_duns=&doclookupId=572923015", "/misdownload/servlets/mirD...

When scraping with rvest expected html_node not appearing

The ITTO website produces a table of timber products and flows directly under the search form once the query is submitted (on the same page). Using information I obtained from Chrome's SelectorGadget I'm expecting the table to appear as the css element "td". Using rvest to scrape information on Albania for 2014...
library(rvest)
session <- html_session("http://www.itto.int/annual_review_output/?mode=searchdata")
form <- html_form(session)[[2]]
form <- set_values(form, "countries[]" = "8", "products[]" = "1" ,"flows[]" = "1", "years[]" = "2014")
query <- submit_form(session, form, submit = NULL)
page <- read_html(query) %>% html_nodes("td")
page
This returns an empty node set, i.e. no "td" elements are found:
{xml_nodeset (0)}
Examining other elements of the page with html_nodes() suggests that submit_form() performed otherwise as expected.
So my question is where is the expected table?
It might be easier (in the long run) to scrape the select box options and just feed the POST call directly:
library(httr)
library(rvest)
res <- POST(url = "http://www.itto.int/annual_review_output/?mode=searchdata",
body = list(`countries[]` = "76",
`products[]` = "1", `flows[]` = "1",
`years[]` = "2014"),
encode = "form")
pg <- content(res, as="parsed")
html_nodes(pg, "td")
## {xml_nodeset (7)}
## [1] <td>Brazil</td>
## [2] <td>Ind. roundwood</td>
## [3] <td>Exports Quantity</td>
## [4] <td>1000 m3</td>
## [5] <td>2014</td>
## [6] <td style="text-align:right;">204.59</td>
## [7] <td>I</td>
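Scraping the select box options, as suggested above, could look roughly like this. The snippet uses a tiny made-up select element rather than the live form, so the markup is an assumption; the two value/label pairs are the ones that appear in this thread (8 for Albania, 76 for Brazil).

```r
library(rvest)

# Hypothetical excerpt of the search form's country select box.
html <- '<select name="countries[]">
<option value="8">Albania</option>
<option value="76">Brazil</option>
</select>'

doc <- minimal_html(html)

# XPath avoids any quoting trouble with the brackets in the field name.
opts <- html_elements(doc, xpath = '//select[@name="countries[]"]/option')

# Named lookup vector: country name -> value to send in the POST body.
countries <- setNames(html_attr(opts, "value"), html_text2(opts))
print(countries)
```

A value looked up this way, e.g. countries[["Brazil"]], can then be passed as `countries[]` in the body of the POST() call instead of a hard-coded number.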