finding correct xpath for a table without an id - html

I am following a tutorial on R-Bloggers that uses rvest to scrape a table. I think I have the wrong column id value, but I don't understand how to get the correct one. Can someone explain what value I should use, and why?
As @hrbrmstr points out, this is against the WSJ terms of service; however, the answer is useful for those who face a similar issue with a different webpage.
library("rvest")
interest <- url("http://online.wsj.com/mdc/public/page/2_3020-libor.html") %>% read_html() %>% html_nodes(xpath = '//*[@id="column0"]/table[1]') %>% html_table()
The structure returned is an empty list.

For me, finding the correct table is usually a matter of trial and error. In this case, the third table is the one you are looking for:
library("rvest")
page<-url("http://online.wsj.com/mdc/public/page/2_3020-libor.html")%>%read_html()
tables<-html_nodes(page, "table")
html_table(tables[3])
Instead of using an XPath, I just selected every "table" tag and looked at each table to locate the correct one. Piping is handy, but it makes it harder to debug when something goes wrong.
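The same inspect-each-table approach can be sketched in Python with only the standard library. The HTML snippet and the helper name preview_tables below are hypothetical stand-ins, not the WSJ page, and a real page would normally need a proper HTML parser because it is rarely well-formed XML. The sketch lists the first row of every table so the right index can be picked by eye.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a page with several tables.
PAGE = """
<html><body>
  <table><tr><td>Navigation</td></tr></table>
  <table><tr><td>Advertisement</td></tr></table>
  <table>
    <tr><th>Rate</th><th>Latest</th></tr>
    <tr><td>Overnight</td><td>0.1</td></tr>
  </table>
</body></html>
""".strip()


def preview_tables(root):
    """Return (index, first-row cell texts) for every table, 1-based like R."""
    previews = []
    for index, table in enumerate(root.iter("table"), start=1):
        first_row = table.find(".//tr")
        cells = [] if first_row is None else [
            "".join(cell.itertext()).strip() for cell in first_row
        ]
        previews.append((index, cells))
    return previews


root = ET.fromstring(PAGE)
for index, cells in preview_tables(root):
    print(index, cells)
```

Once the preview shows which index holds the data, that index plays the same role as the 3 in tables[3] above.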

Related

How to retrieve data from within a node name?

I am able to retrieve the data between the nodes, but not from inside the node itself. I have searched far and wide, but can't find a solution for this.
My XML looks like the following:
The XML is stored in an nvarchar column called fileXML in SQL Server 2008 R2.
I want to retrieve the history date, which is inside the node itself.
My current code, which retrieves the "18" from the node value, is the following:
, fileXML.value('(/commands/command/measure/categories/category/components/component/history)[1]', 'varchar(100)') as HisDate
As you can see in the picture above, this works.
But I can't seem to retrieve the information from within the node.
I searched the web and tried several things, such as:
fileXML.value('(/commands/command/measure/categories/category/components/component/history.name)[1]', 'varchar(100)') as HisDate
fileXML.value('(/commands/command/measure/categories/category/components/component/history/local-name)[1]', 'varchar(100)') as HisDate
fileXML.value('(/commands/command/measure/categories/category/components/component/history/local-name(.))[1]', 'varchar(100)') as HisDate
The first two returned NULL, and the last one gave an error message saying that a function is not supported. I could give many more examples of what I tried, but that would make the post messy.
Any help is greatly appreciated.
date is an attribute of the history element. So your path should be
/commands/command/measure/categories/category/components/component/history/@date
Untested as you supplied the XML as a picture.
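The difference between an element's value and one of its attributes can be illustrated with a small Python sketch. Because the original XML was only supplied as a picture, the document below is hypothetical: the nesting follows the path in the question, but the date value is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: the original was only supplied as a picture, so the
# date value below is an assumption made for illustration.
DOC = """
<commands><command><measure><categories><category><components><component>
  <history date="2015-01-01">18</history>
</component></components></category></categories></measure></command></commands>
""".strip()

root = ET.fromstring(DOC)  # root is the <commands> element
history = root.find(
    "command/measure/categories/category/components/component/history"
)

node_value = history.text        # element content, like .../history
his_date = history.get("date")   # attribute value, like .../history/@date
print(node_value, his_date)
```

The element text and the attribute are separate pieces of data, which is why a path ending in history returns "18" and never the date.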

Chicago Data Portal API format for filter with multiple conditions

This is probably very easy, but I can't find the answer in the documentation.
I have the following API URL, which I would like to extend with an additional filter condition on "location_description". It currently filters on "RESIDENCE":
https://data.cityofchicago.org/resource/6zsd-86xi.json?$$app_token=xxxxxxxxx&primary_type=BURGLARY&location_description=RESIDENCE
However, I would like to extend this to also include "APARTMENT" and "RESIDENCE-GARAGE".
When I try this format:
https://data.cityofchicago.org/resource/6zsd-86xi.json?$$app_token=xxxxxxxxx&primary_type=BURGLARY&location_description=RESIDENCE,APARTMENT,RESIDENCE-GARAGE
it does not work.
I tried different formats, including "" and (), but had no luck.
Question: How do I format this URL correctly so that I can filter on multiple "location_description"?
Thanks for the assistance.
For this, it might be a bit easier to use the $where parameter, since it resembles a SQL query and may be more familiar (and it is easier to look up general query documentation). For now, I've removed the $$app_token to simplify the URL:
https://data.cityofchicago.org/resource/6zsd-86xi.json?$where=primary_type='BURGLARY' AND (location_description='RESIDENCE' OR location_description='APARTMENT' OR location_description='RESIDENCE-GARAGE')
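The URL above contains spaces and quotes, so when building it in code it is safer to percent-encode the query. The following is a minimal Python sketch, assuming the filter values contain no single quotes (no escaping is attempted); the helper name build_where_url is hypothetical.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE_URL = "https://data.cityofchicago.org/resource/6zsd-86xi.json"


def build_where_url(primary_type, locations):
    """Build a $where query that ORs several location_description values."""
    ors = " OR ".join(f"location_description='{loc}'" for loc in locations)
    where = f"primary_type='{primary_type}' AND ({ors})"
    # Keep "$" literal in the parameter name; percent-encode everything else.
    query = urlencode({"$where": where}, safe="$", quote_via=quote)
    return f"{BASE_URL}?{query}"


url = build_where_url(
    "BURGLARY", ["RESIDENCE", "APARTMENT", "RESIDENCE-GARAGE"]
)
# Decode again to confirm the clause survives the round trip.
decoded = parse_qs(urlsplit(url).query)["$where"][0]
print(url)
print(decoded)
```

The decoded clause is the same expression as in the answer's URL; only the transport encoding differs.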

How to parse Table from Wikipedia using htmltab package?

All,
I am trying to parse one table located at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population, and I would like to use the htmltab package for this. My code currently looks like the following, but I am getting the error below. I tried passing "Rank" and "% of world population " to the which argument, but still received an error. I am not sure what could be wrong.
Please note: I am new to R and web scraping, so an explanation of the code would be a great help.
library(htmltab)
url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population"
list_of_countries <- htmltab(doc = url3, which = "//th[text() = 'Country(or dependent territory)']/ancestor::table")
Error: Couldn't find the table. Try passing (a different) information to the which argument.
This is an XPath problem, not an R problem. If you inspect the HTML of that table, the relevant header is
<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">
Country<br><small>(or dependent territory)</small>
</th>
So text() on this is just "Country".
For example, this could work (it is not the only option; you will just have to try out various XPath selectors to see):
htmltab(doc = url3, which = "//th[text() = 'Country']/ancestor::table")
Alternatively, it's the first table on the page, so you could try which = 1 instead.
(NB: in Chrome you can run $x("//th[text() = 'Country']") and so on in the developer console to try these things out, and no doubt in other browsers as well.)
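Why the header's own text is only "Country" can be seen in a small Python sketch. The snippet is a simplified, hypothetical version of the header shown above (attributes dropped, <br> written as <br/> so it parses as XML). An element's direct text stops at its first child element, while the text inside <small> belongs to that child.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the Wikipedia header cell shown above.
SNIPPET = """
<table>
  <tr>
    <th>Rank</th>
    <th>Country<br/><small>(or dependent territory)</small></th>
  </tr>
</table>
""".strip()

table = ET.fromstring(SNIPPET)
header = table.findall(".//th")[1]

first_text = header.text                # text before the first child element
full_text = "".join(header.itertext())  # text of the cell and its descendants
print(repr(first_text))
print(repr(full_text))
```

A selector that compares against the full visible label therefore fails, while one that compares against the direct text "Country" has a chance of matching.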

Google sheets importxml failure - Can't find the correct path to table from the link

I'm trying to retrieve a table that is updated twice per day. On other websites I was able to find the element, but my approach does not work on every website I have tried.
In this case the issue is:
In Google Sheets, using IMPORTXML, I can't find the correct path to the table from the link or identify the element.
The website for this example is: http://lotopolonia.com/tabel/arhiva/index.php
1. I need to retrieve the dates and numbers.
2. They are updated twice per day, and my sheet should be updated by adding just the latest line above the others. But I will deal with this after I solve the first point.
I looked at the XPath tutorial from w3c and understood the syntax a bit.
The problem is how to correctly identify the elements and nodes in the inspector to retrieve the data I need.
Also, I've installed a Chrome extension (XPath Helper), which shows the XPath better than what I got from Chrome.
I tried the following:
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row']/td[@class='colon2']")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row']/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='first_row'][1]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//*[@class='table_01']/table/tbody/tr[@class='first_row'][1]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[3]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[*]/td[*]")
=IMPORTXML("http://lotopolonia.com/tabel/arhiva/index.php","//table[@class='table_01']/tbody/tr[@class='second_row'][1]/child::td[*]")
The formulas look OK and produce no errors, but all of the requests above give the same result: imported content is empty.
Unfortunately, I have run out of ideas on how to interpret those elements...
Any idea how to go on?
Cheers
How about this answer? I used //table[@class='table_01']/tr[position()>2] as the XPath. Cell "A1" contains http://lotopolonia.com/tabel/arhiva/index.php.
=IMPORTXML(A1,"//table[@class='table_01']/tr[position()>2]")
table[@class='table_01'] selects the table.
tr[position()>2] selects the rows with the dates and numbers.
Result :
Note :
If you want to retrieve the whole table, please use =IMPORTXML(A1,"//table[@class='table_01']/tr").
If this is not what you wanted, I'm sorry.
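One plausible reason the attempts containing /tbody/ return nothing (an assumption, not something confirmed in this thread) is that browser inspectors show a <tbody> that the browser itself inserts while building the DOM, whereas the raw HTML fetched by IMPORTXML may have rows directly under <table>. The Python sketch below uses a hypothetical snippet with invented rows to show how the two paths behave on such markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw HTML as a server might send it: rows sit directly
# under <table>, with no <tbody> wrapper. All row contents are invented.
RAW = """
<html><body>
  <table class="table_01">
    <tr class="title"><td>Archive</td></tr>
    <tr class="header"><td>Date</td><td>Numbers</td></tr>
    <tr class="first_row"><td>2018-01-02</td><td>1 2 3</td></tr>
    <tr class="second_row"><td>2018-01-01</td><td>4 5 6</td></tr>
  </table>
</body></html>
""".strip()

root = ET.fromstring(RAW)

# Path copied from the inspector, including tbody: matches nothing here.
with_tbody = root.findall(".//table[@class='table_01']/tbody/tr")
# Path without tbody: matches every row of the table.
without_tbody = root.findall(".//table[@class='table_01']/tr")

# Equivalent of tr[position()>2]: skip the first two rows.
data_rows = without_tbody[2:]
rows = [[td.text for td in tr] for tr in data_rows]
print(len(with_tbody), len(without_tbody))
print(rows)
```

If a path copied from the inspector returns nothing, dropping tbody from it is a cheap first thing to try.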

Wikipedia api fulltext search to return articles with title, snippet and image

I've been looking for a way to query the Wikipedia API with a search string and get a list of articles with the following properties:
Title
Snippet/Description
One or more images related to the article.
I also have to make the query using jsonp.
I've tried using the list=search parameter
http://en.wikipedia.org/w/api.php?action=query&list=search&prop=images&format=json&srsearch=test&srnamespace=0&srprop=snippet&srlimit=10&imlimit=1
But it seems to ignore prop=images. I've also tried variations using prop=imageinfo and prop=pageimages, but they all give me the same result as using list=search alone.
I've also tried action=opensearch
http://en.wikipedia.org/w/api.php?action=opensearch&search=test&limit=10&format=xml
This gives me exactly what I want when I set format=xml, but it returns a simple array of page titles when using format=json, and therefore fails because of the JSONP requirement.
Is there another approach to doing this? I'd really like to solve this in a single request rather than making a first search request and then a second request for the images using titles=x|y|z.
As Bergi suggested, using generators is the way to go here. Specifically what I would do:
use list=search as a generator, to get the list of articles
use prop=pageimages to get a representative image for each article
use prop=extracts to get a description for each article
The whole query could look like this:
http://en.wikipedia.org/w/api.php?format=json&action=query&generator=search&gsrnamespace=0&gsrsearch=test&gsrlimit=10&prop=pageimages|extracts&pilimit=max&exintro&explaintext&exsentences=1&exlimit=max
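The query above can also be assembled programmatically. The following Python sketch only builds the URL and does not send a request. The helper name build_search_url and the callback name handleResults are hypothetical, and the callback parameter is included on the assumption that JSONP is requested that way, as the question requires JSONP.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_URL = "http://en.wikipedia.org/w/api.php"


def build_search_url(term, limit=10, callback=None):
    """Build the generator=search query with pageimages and extracts."""
    params = {
        "format": "json",
        "action": "query",
        "generator": "search",
        "gsrnamespace": 0,
        "gsrsearch": term,
        "gsrlimit": limit,
        "prop": "pageimages|extracts",
        "pilimit": "max",
        "exintro": "",       # flag parameter, sent with an empty value
        "explaintext": "",   # flag parameter, sent with an empty value
        "exsentences": 1,
        "exlimit": "max",
    }
    if callback is not None:
        params["callback"] = callback  # hypothetical JSONP callback name
    return f"{API_URL}?{urlencode(params)}"


url = build_search_url("test", callback="handleResults")
# Parse the query string back to check what would be sent.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(url)
```

Note that the list module's own parameters take a g prefix (gsrsearch, gsrlimit) when the module is used as a generator, exactly as in the URL above.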
I've tried using the list=search parameter, but it seems to ignore the prop=images
If you want to retrieve any properties, you need to specify a list of pages for which you want to get these; e.g. by using the titles=, pageids=, or revids= parameters. You didn't send any, so you did not get a result for the prop=images.
If you did use api.php?action=query&list=search&srsearch=test&prop=images&titles=test you would have gotten the search results for test and the images of the Test page.
You can, however, also use the collection that the list query generates for your property query, by using the list module as a generator. The query would look like
api.php?action=query&generator=search&gsrsearch=test&gsrnamespace=0&gsrprop=snippet&prop=images. Unfortunately, it does not yield the attributes that the list contained; it only uses the pageids for a basic property query.
Using two queries is probably the way to go. By the way, I'd recommend using the pageimages property; it will likely give you the best results.