How to parse a table from Wikipedia using the htmltab package? [R]

All,
I am trying to parse a table located here: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population, and I would like to use the htmltab package to achieve this. Currently my code looks like the following, but I am getting the error below. I tried passing "Rank" and "% of world population" to the which argument, but still received an error. I am not sure what could be wrong.
Please note: I am new to R and web scraping, so if you could provide an explanation of the code, that would be a great help.
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population"
list_of_countries<- htmltab(doc = url3, which = "//th[text() = 'Country(or dependent territory)']/ancestor::table")
Error: Couldn't find the table. Try passing (a different) information to the which argument.

This is an XPath problem not an R problem. If you inspect the HTML of that table the relevant header is
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">
Country<br><small>(or dependent territory)</small>
</th>
So text() on this is just "Country".
For example, this could work (it is not the only option; you will just have to try out various XPath selectors):
htmltab(doc = url3, which = "//th[text() = 'Country']/ancestor::table")
Alternatively, it's the first table on the page, so you could try which = 1 instead.
(NB: in Chrome you can run $x("//th[text() = 'Country']") and similar in the developer console to try these things out, and no doubt in other browsers too.)
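If exact text() matching keeps tripping you up, a more forgiving variant is contains(), which matches as long as "Country" appears anywhere in the header's first text node. A sketch (this assumes the page structure at the time of writing; Wikipedia tables do change):

```r
library(htmltab)

url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
# contains() is more tolerant than an exact text() comparison:
# it still matches even though the <th> also holds a <small> child element
list_of_countries <- htmltab(doc = url3,
                             which = "//th[contains(text(), 'Country')]/ancestor::table")
head(list_of_countries)
```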

Related

How to read in a table from an HTML website using XML [R]

Issue
I am trying to read in the table from a website, specifically this website: https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=-1&Season=2004-05&SeasonType=Regular%20Season
Here is how I went about it:
library(XML)
url <- "https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=-1&Season=2004-05&SeasonType=Regular%20Season"
nbadata <- readHTMLTable(url, header = TRUE, which = 1, stringsAsFactors = FALSE)
I am getting an error message that I am not familiar with:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'Season'
Questions
Where did I go wrong in my code? Is XML the correct approach?
What does the error message mean?
I want to be able to extract the table from the 04-05 season all the way through to the 20-21 season. (You can see that the website offers a filter at the top left of the table that allows you to filter through seasons.) Is there an efficient way to extract each table from each season?

finding correct xpath for a table without an id

I am following a tutorial on R-Bloggers using rvest to scrape a table. I think I have the wrong column id value, but I don't understand how to get the correct one. Can someone explain what value I should use, and why?
As @hrbrmstr points out, this is against the WSJ terms of service; however, the answer is useful for those who face a similar issue with a different webpage.
library("rvest")
interest <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="column0"]/table[1]') %>%
  html_table()
The structure returned is an empty list.
For me it is usually trial and error to find the correct table. In this case, the third table is the one you are looking for:
library("rvest")
page <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>% read_html()
tables<-html_nodes(page, "table")
html_table(tables[3])
Instead of using the XPath, I just parsed out the "table" tags and looked at each table to locate the correct one. The piping command is handy, but it makes it harder to debug when something goes wrong.
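To make the trial and error quicker, you can print a short preview of every table and pick the right index by eye. A sketch along the same lines as the answer above:

```r
library(rvest)

page <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>% read_html()
tables <- html_nodes(page, "table")

# Preview the first couple of rows of each table to find the one you want
for (i in seq_along(tables)) {
  cat("--- table", i, "---\n")
  print(head(html_table(tables[[i]], fill = TRUE), 2))
}
```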

How to code Regular Expression with an IF ELSE function

I am trying to build a scraper to extract key metrics from a website. One of the metrics is the model number of the products on the website. I am using Outwit as the base program, but I'm now stuck when it comes to some exceptions in the site's source code.
Here is an example of the source code:
var zx_description = "Test Dress<br/><br/>Model: Nice01j<br/>
Where the information I am looking to extract is: Nice01j
The issue is that for some products the word Modell is spelled Model, and also that the actual model name/number does not always end with a row break; in some cases the code might look like this:
var zx_description = "Test Dress<br/><br/>Model: Nice01j";
I have managed to create the RegEx for everything before the model number, as below:
/var zx_description[\s\S]+?Modell:/
So now I'm looking to alter it so that it also takes into consideration that the spelling might be Model, with just one "l".
The second part is to create a RegEx for capturing the info after the actual model name, which should behave something like:
IF "<br" comes before '";' THEN stop at "<br" ELSE stop at '";'
Is this possible to state in a regular expression, and if so, how would I do that?
Based on your use of [\s\S], it looks to me like you need to run through a regular expression tutorial. For your question, focus specifically on optional items and capturing groups.
http://www.regular-expressions.info/tutorial.html
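To make those two hints concrete: an optional item ("Modell?") covers both spellings, and a non-capturing alternation ("(?:<br|\";)") stops the capture at whichever terminator appears first. Outwit's regex dialect may differ in small details, so here is the idea demonstrated with Python's re module:

```python
import re

samples = [
    'var zx_description = "Test Dress<br/><br/>Model: Nice01j<br/>',
    'var zx_description = "Test Dress<br/><br/>Modell: Nice01j";',
]

# "Modell?" matches Model or Modell; the lazy group (.+?) captures the
# model number and stops at the first "<br" or '";' that follows it
pattern = re.compile(r'var zx_description[\s\S]+?Modell?:\s*(.+?)(?:<br|";)')

for s in samples:
    print(pattern.search(s).group(1))  # Nice01j in both cases
```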

Error on declaring current date (vb for office)

The code in my form is this:
Dim dtmTest As Date
dtmTest = DateValue(Now)
and the error is: "external procedure not valid".
It highlights the word Now.
Just use:
dtmTest = Date()
Or, for date and time use:
dtmTest = Now()
From the above, it seems that you have a missing or broken reference. Look at the references (code window, Tools -> References) and check if any are marked MISSING; if there is one, untick it and look for a suitable matching reference. (http://support.microsoft.com/kb/283806)
It is generally best to use late binding in production, because the libraries for the various Office products, such as Excel, that your code references vary from PC to PC.
If you do not find a missing reference, you can try to delete the VBA library reference itself - it will not let you, but for some reason this sometimes helps.
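As an illustration of the late-binding advice: replacing an early-bound declaration with a late-bound one removes the compile-time dependency on a specific library version. A sketch (Excel is just an example object here):

```vba
' Early binding - requires a reference to the Excel object library,
' and breaks if the referenced version differs between PCs:
'     Dim xl As Excel.Application
'     Set xl = New Excel.Application

' Late binding - resolved at run time, no reference needed:
Dim xl As Object
Set xl = CreateObject("Excel.Application")
```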

Snippet of code works on my PC, but not another person's

There is a bit of code I have written which works on my PC but doesn't work on someone else's, and I am really confused. The code in question is:
Dim temp As HtmlHtmlElement
Dim s As String
s = "2222222"
For Each temp In html.getElementsByTagName("option")
    If temp.getAttribute("value") = s Then
        r.Offset(0, 1) = temp.innerText
    End If
Next temp
r is a Range object that is passed to the sub.
The variable html is an object that has been loaded with the HTML from a webpage, using XMLHTTP.
This code works fine on my PC: it finds the "option" tags in the HTML source and then checks whether the "value" attribute is equal to the string s. When I run it on someone else's PC, temp.getAttribute("value") returns a blank string, even though there is an attribute called value. The web page address is hard-coded, so it's not that he's using the wrong URL.
I use Excel 2007; he uses 2010.
Anyone got any ideas?
thanks
How have you declared and instantiated the html object?
For example, you say you're using XMLHTTP, but is that the only option? Does your code try to Set html using "Microsoft.XMLHTTP" first and, if that's not found, then try "MSXML2.XMLHTTP", or even different version numbers such as ServerXMLHTTP30/ServerXMLHTTP60?
If so, perhaps the problem is that the specific reference hasn't been enabled and you're fetching the webpage through different objects. Each of these can return a webpage slightly differently, with different encoding, UPPER/lowercase, etc., based on webserver settings and the object.
Edit: You may find this useful Using the right version of MSXML
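A minimal sketch of that idea: try the newer ProgIDs first and fall back to the older ones, so the same workbook behaves the same way on both PCs (the ProgIDs below are the standard MSXML ones; adjust to whatever your code actually creates):

```vba
Dim xhr As Object
On Error Resume Next
Set xhr = CreateObject("MSXML2.XMLHTTP.6.0")       ' modern machines
If xhr Is Nothing Then Set xhr = CreateObject("MSXML2.XMLHTTP")
If xhr Is Nothing Then Set xhr = CreateObject("Microsoft.XMLHTTP") ' legacy fallback
On Error GoTo 0
If xhr Is Nothing Then Err.Raise vbObjectError + 1, , "No XMLHTTP implementation found"
```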