Scrape table using rvest - Embedded symbols/links - html

I tried to scrape the table on the following webpage: http://www.comstats.de/squad/1-FC+Bayern+München
My approach is successful at first glance, using the following code:
library(rvest)

read_html("http://www.comstats.de/squad/1-FC+Bayern+München") %>%
  html_node("#inhalt > table.rangliste.autoColor.tablesorter.zoomable") %>%
  html_table(header = TRUE, fill = TRUE)
However, the second column contains a differing number of linked symbols per row, which leads to a corrupted table with rows of different lengths (which is why fill = TRUE is needed).
I was researching for hours... Who can help me out?

In case someone is searching for an answer to such questions as well: one possible solution is to use the package htmltab (https://cran.r-project.org/web/packages/htmltab/vignettes/htmltab.html):
library(htmltab)
htmltab(doc = "http://www.comstats.de/squad/1-FC+Bayern+München", which = '//*[@id="inhalt"]/table[2]')
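If you would rather stay within rvest, one possible workaround (a rough sketch, not tested against the live page) is to build the rows from the cell text yourself, so the varying number of embedded link symbols in the second column cannot shift the other cells:
library(rvest)
page <- read_html("http://www.comstats.de/squad/1-FC+Bayern+München")
rows <- html_nodes(page, "#inhalt table.rangliste.autoColor.tablesorter.zoomable tr")
# Extract the text of every cell in each row; each row keeps its own length,
# so extra symbols/links no longer corrupt the column alignment.
cells <- lapply(rows, function(r) html_text(html_nodes(r, "th, td"), trim = TRUE))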

Related

How to parse Table from Wikipedia using htmltab package?

All,
I am trying to parse a table located here: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population. I would like to use the htmltab package to achieve this task. Currently my code looks like the following; however, I am getting the error below. I tried passing "Rank" and "% of world population" to the which argument, but still received an error. I am not sure what could be wrong.
Please note: I am new to R and web scraping, so an explanation of the code would be a great help.
library(htmltab)
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population"
list_of_countries <- htmltab(doc = url3, which = "//th[text() = 'Country(or dependent territory)']/ancestor::table")
Error: Couldn't find the table. Try passing (a different) information to the which argument.
This is an XPath problem, not an R problem. If you inspect the HTML of that table, the relevant header is
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">
Country<br><small>(or dependent territory)</small>
</th>
So text() on this is just "Country".
For example, this could work (it is not the only option; you will just have to try out various XPath selectors to see what matches):
htmltab(doc = url3, which = "//th[text() = 'Country']/ancestor::table")
Alternatively, it's the first table on the page, so you could try which = 1 instead.
(NB in Chrome you can do $x("//th[text() = 'Country']") and so on in the developer console to try these things out, and no doubt in other browsers also)
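Another XPath that tends to be more forgiving here (untested against the live page, so treat it as a sketch) is contains(), which matches on the full text of the header cell rather than only its first text node:
library(htmltab)
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# contains(., 'Country') still matches even though the header text is split
# across <br>/<small> elements.
list_of_countries <- htmltab(doc = url3, which = "//th[contains(., 'Country')]/ancestor::table")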

How can I get specific data from the table using Rselenium?

I am trying to scrape a table that I believe is rendered using JavaScript. I want to get the data for indices (e.g., TSX), specifically the "Previous day" data for all indices. I am scraping the data using RSelenium, but it is unable to locate the element.
Following is my code for scraping the previous-day data for the index called TSX:
library(RSelenium)
driver<- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate("http://bmgfunds.com/interactive-charts/")
elem <- remDr$findElement(using="xpath", value="//*[@id='indices-quotes']/table/tbody/tr[1]/td[2]")
In order to get the XPath, I inspected the element and copied the XPath by right-clicking in the inspector pane.
I also tried using rvest.
library(rvest)
st_table <- read_html("http://bmgfunds.com/interactive-charts/")
table<-html_nodes(st_table, "tbody tr")
Unfortunately, I get an empty result: {xml_nodeset (0)}.
Any suggestion or help will be appreciated. Thanks
The table is within an iframe whose source is http://integration.nfusionsolutions.biz/client/bullionmanagementgroup/module/quotechartfull, so you can grab the table from there:
st_table <- read_html("http://integration.nfusionsolutions.biz/client/bullionmanagementgroup/module/quotechartfull")
(table <- html_table(st_table)[[3]])
This code grabs all the tables from that URL with html_table and selects the table that you want (which is the third element of the list).
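If you want to find the iframe source programmatically rather than reading it out of the page source by hand, something along these lines should work (a sketch using rvest):
library(rvest)
outer <- read_html("http://bmgfunds.com/interactive-charts/")
# List the src attribute of every iframe on the page, then read the one
# that hosts the quote table.
iframe_src <- html_attr(html_nodes(outer, "iframe"), "src")
iframe_src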

How can I scrape all content from each "option" of a "select" field of HTML with R?

I'm trying to use the rvest package to scrape a website.
This link will be used as an example: https://www.globalinnovationindex.org/analysis-indicator
The objective is to scrape the tables for all years (select id="ctl29_lstYear") and all indexes (select id="ctl29_lstIndex"). I already have a chunk that scrapes and formats those tables and turns them into lists (and yes... they are not an HTML <table>), but I can't use follow_link() or set_values() to navigate through the options for years and indexes and scrape them all.
Let's use a single pair of "options" for this example (year="2013" and index="Innovation Efficiency Ratio"):
So, I've looked at the rvest::set_values() documentation and I found this example:
search <- html_form(read_html("http://www.google.com"))[[1]]
set_values(search, q = "My little pony")
And then I tried this:
> session<-html_form(read_html("https://www.globalinnovationindex.org/analysis-indicator"))[[1]]
> set_values(session,list(ctl29$lstYear = "2013",ctl29$lstIndex="Innovation Efficiency Ratio"))
Error: unexpected '=' in "set_values(session,list(ctl29$lstYear ="
Why was the '=' unexpected after the names of the fields I want to modify? Is set_values() the best option for this kind of problem?
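One likely cause of the syntax error (a sketch, not a verified answer): ctl29$lstYear is not a syntactically valid R name, so it cannot appear to the left of = in a call unless it is wrapped in backticks, and set_values() takes named arguments rather than a list:
library(rvest)
session <- html_session("https://www.globalinnovationindex.org/analysis-indicator")
form <- html_form(session)[[1]]
# Backticks allow non-syntactic field names as argument names; the field
# names and values here are copied from the question and may need adjusting
# to whatever the form actually exposes.
form <- set_values(form,
                   `ctl29$lstYear`  = "2013",
                   `ctl29$lstIndex` = "Innovation Efficiency Ratio")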

Analysis of deviance table model output in HTML

I am trying to export the output of an analysis of deviance table in HTML format, so that it can be inserted into a Word document.
I created a GLM model as follows:
newmod <- glm(cbind(Recaptured, predated) ~ Morph * Plant * Site, data = survival, family = binomial)
Running the following code gives me the output that I would like to export to HTML:
anova(newmod,test="Chisq")
I have tried the following code to create an HTML table using stargazer; however, it doesn't seem to be working:
anova_mod<-anova(newmod,test="Chisq")
stargazer(newmod, type="html", out = "anova_output.htm")
Is there a simple way of doing this in R? I have managed to successfully export the summary statistics, but what I really need is the analysis of deviance table.
I believe you are looking for:
print(xtable(anova_mod), type = "html")
as indicated by this answer: Exporting R tables to HTML
Here is my full code for reproducing something similar to your question:
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,labels = c("Control", "Treatment 1", "Treatment 2"))
newmod = lm(weight ~ group, data = plant.df)
anova_mod=anova(newmod)
anova_mod
install.packages("xtable")
require(xtable)
print(xtable(anova_mod), type = "html")
You can then paste the output into an HTML visualizer such as https://htmledit.squarefree.com/ to see the resulting table.
Instead of printing it, you can write it to a file. I have not personally tested this part, but the second answer in this question should work for you: Save html result to a txt or html file
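For example, print.xtable() has a file argument, so the HTML can be written straight to disk (a sketch; the file name is just an example):
print(xtable(anova_mod), type = "html", file = "anova_output.html")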
Note: You can also reference all parts of anova_mod separately by adding a $ after it, like anova_mod$Df.

finding correct xpath for a table without an id

I am following a tutorial on R-Bloggers that uses rvest to scrape a table. I think I have the wrong column id value, but I don't understand how to get the correct one. Can someone explain what value I should use, and why?
As #hrbrmstr points out this is against the WSJ terms of service, however the answer is useful for those who face a similar issue with a different webpage.
library("rvest")
interest<-url("http://online.wsj.com/mdc/public/page/2_3020-libor.html")%>%read_html()%>%html_nodes(xpath='//*[@id="column0"]/table[1]') %>% html_table()
The structure returned is an empty list.
For me it is usually trial and error to find the correct table. In this case, the third table is what you are looking for:
library("rvest")
page<-url("http://online.wsj.com/mdc/public/page/2_3020-libor.html")%>%read_html()
tables<-html_nodes(page, "table")
html_table(tables[3])
Instead of using the XPath, I just parsed out the "table" tags and looked at each table to locate the correct one. The piping command is handy, but it makes it harder to debug when something goes wrong.
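A quick way to inspect each table when hunting for the right one (a small sketch along the same lines):
all_tables <- html_table(tables, fill = TRUE)
# Dimensions of every table on the page, then a preview of the candidate.
sapply(all_tables, dim)
head(all_tables[[3]])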