Extracting population data from a website: Wikipedia town pages (HTML)

G'day Everyone,
I am looking for a raster layer of human population/habitation in Australia. I have tried finding some free datasets online but couldn't really find anything in a useful format. I thought it might be interesting to try to scrape population data from Wikipedia and make my own raster layer. To this end I have tried getting the info from Wikipedia, but knowing nothing about HTML has not helped me.
The idea is to supply a list of all the towns in Australia that have wiki pages and extract the appropriate data into a data.frame.
I can get the webpage source data into R, but am stuck on how to extract the particular data that I want. The code below shows where I am stuck, any help would be really appreciated or some hints in the right direction.
I thought I might be able to use readHTMLTable() because, in the normal webpage, the info I want is off to the right in a nice table. But when I use this function I get an error (below). Is there any way I can specify this table when I am getting the source info?
Sorry if this question doesn't make much sense, I don't have any idea what I am doing when it comes to searching HTML files.
Thanks for your help, it is greatly appreciated!
Cheers,
Adam
require(XML)  # htmlParse() and readHTMLTable() come from XML, not RJSONIO
loc.names <- data.frame(town = c('Sale', 'Bendigo'), state = c('Victoria', 'Victoria'))
u <- paste('http://en.wikipedia.org/wiki/', loc.names[, 1], ',_', loc.names[, 2],
           sep = '')
res <- lapply(u, function(x) htmlParse(x))
Error when I use readHTMLTable:
tabs <- readHTMLTable(res[1])
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"list"’
For instance, some of the data I need looks like this in the HTML source. My question is: how do I pick out these locations in the parsed HTML?
<span class="geo">-38.100; 147.067
title="Victoria (Australia)">Victoria</a>. It has a population (2011) of 13,186

res is a list, so you need to use res[[1]] rather than res[1] to access its elements.
Using readHTMLTable on these elements will give you all the tables. The table with the geo info has class = "infobox vcard", so you can extract those tables separately and then pass them to readHTMLTable:
require(XML)
lapply(sapply(res, getNodeSet, path = '//*[@class="infobox vcard"]'),
       readHTMLTable)
If you are not familiar with XPath, the selectr package allows you to use CSS selectors, which may be easier.
require(selectr)
> querySelectorAll(res[[1]], "table span .geo")
[[1]]
<span class="geo">-38.100; 147.067</span>
[[2]]
<span class="geo">-38.100; 147.067</span>
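Building on the two ideas above, here is a rough, untested sketch of the last step: pull the coordinates out of the geo span and the population row out of the infobox table, one row per town. The infobox column positions are guesses and may need tweaking from page to page.
info <- lapply(res, function(doc) {
  geo <- xpathSApply(doc, '//span[@class="geo"]', xmlValue)[1]   # e.g. "-38.100; 147.067"
  latlon <- as.numeric(strsplit(geo, ";\\s*")[[1]])
  box <- readHTMLTable(getNodeSet(doc, '//table[@class="infobox vcard"]')[[1]])
  pop <- as.character(box[grep("Population", box[, 1])[1], 2])   # first row mentioning Population
  data.frame(lat = latlon[1], lon = latlon[2], population = pop)
})
cbind(loc.names, do.call(rbind, info))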

Related

How to deal with missing row when binding column to data frame (a scraping issue!)

I'm attempting to create data frames by attaching URLs to a scraped HTML table, and then writing these to individual csv files. The data concern the passage of Bills through their respective stages in both the House of Commons and the Lords. I've written a function (see below) which reads the tables, parses the HTML code, scrapes the URLs required, binds the two together, extracts the rows concerned with the House of Lords, and then writes the csv files. This function is then run across two lists (one of links to the Bill stage pages and another of simplified file names).
library(XML)
lords_tables <- function(x, y) {
  tables <- as.data.frame(readHTMLTable(x))
  sitePage <- htmlParse(x)                               # parse the page source
  hrefs <- xpathSApply(sitePage, "//td/descendant::a[1]",
                       xmlGetAttr, 'href')               # first <a> href inside each <td>
  table_bind <- cbind(tables, hrefs)
  row_no <- grep(".+: House of Lords|Royal Assent",
                 table_bind$NULL.V2)                     # row positions of the Lords / Royal Assent stages
  lords_rows <- table_bind[row_no, ]                     # subset those rows
  write.csv(lords_rows, file = paste0(y, ".csv"))
}
# x = a list of links to the Bill pages/ y = list of simplified names
mapply(lords_tables, x=link_list, y=gsub_URL)
This works perfectly well for the cases where debates occurred for every stage. However, some cases pose a problem, such as:
browseURL("http://services.parliament.uk/bills/2010-12/armedforces/stages.html")
For this example, no debate occurred at the '3rd reading: House of Commons' stage or at 'Royal Assent'. This results in the following error being returned:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 21, 19
In overcoming this error I'd like to have an NA against the missing stage. Has anyone got an idea of how to achieve this? I'm a relative n00b so feel free to suggest a more elegant approach to the whole problem.
Thanks in advance!
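One way to keep the rows aligned, sketched below and untested: inside the function above (reusing its x and tables), instead of collecting every href into one flat vector, walk the table row by row and record NA for rows that carry no link, so the lengths match for cbind().
sitePage <- htmlParse(x)
rows  <- getNodeSet(sitePage, "//table//tr[td]")          # data rows only (skip header rows)
hrefs <- sapply(rows, function(r) {
  a <- getNodeSet(r, ".//td/descendant::a[1]")
  if (length(a) == 0) NA_character_ else xmlGetAttr(a[[1]], "href")
})
table_bind <- cbind(tables, hrefs)                        # lengths should now line up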

How to make a complex list into a dataframe in R?

I have a complex list obtained from a JSON file.
The JSON file came from a map service API in China.
I searched the site for a solution but couldn't find one that fits my question, so I am asking it here and hope it can be solved.
If I missed something that is already answered on the site, I apologize for that.
The code to get the list is as follows:
library(rjson)
library(RCurl)
key<-"fd5a14632c36aecd2e759a0cc91a3b4a"
origin<-"大润发东环店"
urlorigin <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",origin,"&city=苏州",sep = "")
dataorigin<-readLines(urlorigin,encoding="UTF-8")
origininfo<-fromJSON(dataorigin)
originpoi<-origininfo$geocodes[[1]]$location
destination<-"苏州大学本部北门"
urldest <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",destination,"&city=苏州",sep = "")
datadest<-readLines(urldest,encoding="UTF-8")
destinfo<-fromJSON(datadest)
destpoi<-destinfo$geocodes[[1]]$location
urlpath <- paste("http://restapi.amap.com/v3/direction/driving?key=",key,"&origin=",originpoi,"&destination=",destpoi, "&originid=&destinationid=&extensions=all&strategy=0&waypoints=&avoidpolygons=&avoidroad=",sep = "")
pathjson<-paste(readLines(urlpath,encoding = "UTF-8"),collapse = "")
pathinfo<-fromJSON(pathjson)
pathinfo is the list I end up with, and I want to convert it into a data frame that I can work with.
Thank you for your time.
I'm from China and my English is not that good, I apologize for that.
My Chinese is very limited as well. But your code to get the data is working (with some warnings).
pathinfo_df <- as.data.frame(lapply(pathinfo, rbind))
pathinfo_df is now a data frame.
summary(pathinfo_df)
status info infocode count
1:1 OK:1 10000:1 1:1
route.origin.Length route.origin.Class route.origin.Mode
1 -none- character
route.destination.Length route.destination.Class route.destination.Mode
1 -none- character
route.taxi_cost.Length route.taxi_cost.Class route.taxi_cost.Mode
1 -none- character
route.paths.Length route.paths.Class route.paths.Mode
1 -none- list
So there's plenty to select and play with. Read up on selecting from lists; see also:
str(pathinfo_df)
Then map it on Google Earth. Looks like the taxi might be costly. Have a good trip!
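As a small follow-up, a sketch (untested) of pulling individual pieces out of the nested list directly instead of flattening everything at once; the field names come from the summary() output above, and anything deeper will depend on the API response.
taxi_cost <- pathinfo$route$taxi_cost   # still a single character value
origin    <- pathinfo$route$origin      # start point as a character string
paths     <- pathinfo$route$paths       # list of candidate driving routes
str(paths[[1]], max.level = 1)          # inspect the structure of the first route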

Scrape html Twitter followers using R

I have a continuous task that I think can be automated using R.
Using the twitteR package I have extracted a list of tweets. Those have been categorized into positive (and neutral) and negative tweets. This has been a manual task, but I am looking into doing some machine learning on it.
My problem is the reach part. I want to know not only the number of positive and negative tweets but also the number of people who have potentially been exposed to each tweet.
There is a way to do this using the twitteR package, but it is slow, as it requires the machine to sleep between each and every search. With thousands of tweets this is not practical for me.
My thought was therefore to fetch the HTML source of each Twitter profile with something like webpage <- getURL("http://www.twitter.com/AngelHaze") and extract the number of followers from that.
On top of this, I want to be able to do it for a vector of URLs ("http://www.twitter.com/AngelHaze") and combine the results into a data frame with the ScreenName (AngelHaze) and the number of followers. I am from Denmark, so the source code containing the number of followers looks like this:
a class="ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor" title="196.262 følgere" data-nav="followers"
href="/AngelHaze/followers""
Where "196.262 følgere" is the relevant part.
Is this possible? And if yes, can anyone help me get going?
Best, Sander Ehmsen.
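A rough sketch of this approach (untested; Twitter's markup changes often, so the data-nav="followers" anchor and the Danish number format below are assumptions based on the snippet above):
library(RCurl)
library(XML)
get_followers <- function(url) {
  page <- getURL(url, ssl.verifypeer = FALSE)
  doc  <- htmlParse(page, asText = TRUE)
  node <- getNodeSet(doc, "//a[@data-nav='followers']")
  title <- if (length(node)) xmlGetAttr(node[[1]], "title") else NA  # e.g. "196.262 følgere"
  followers <- as.numeric(gsub("[^0-9]", "", title))                 # drop the thousands separators
  data.frame(ScreenName = basename(url), followers = followers,
             stringsAsFactors = FALSE)
}
urls <- c("http://www.twitter.com/AngelHaze")
do.call(rbind, lapply(urls, get_followers))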

Parsing HTML content into a MySQL database using a parser

I want to be able to parse specific content from a website into a MySQL database. For example, from the page http://allrecipes.com/Recipe/Fluffy-Pancakes-2/Detail.aspx I want to parse content into my database, which has a table with columns RecipeName and Ingredients 1-10.
So basically my database will contain the name and all the ingredients for that recipe. There is no need to edit the content, simply parse it in as is (i.e. 3/4 cup milk), since I am using character columns in my database.
How exactly do I go about doing this? I was looking at pre-built parsers, and it seems it's tough to find one that's easy to use, since I am fairly new to programming. Of course, I can manually enter the values, but I want to parse them in.
Would it be possible to just parse this content and write a file with RecipeName, Ingredient pairs which I can then load into my database? Or should I parse it directly into the database? I am also unsure how to connect a parser directly to a database, but I might be able to find some information online.
Basically, I am looking for help on how exactly to go about doing this, since I am not very well versed in programming and this seems more complicated than it probably is.
I am using Java as my main language right now, although I can't say I am very good at it. But I should be able to understand the basic concepts.
Any suggestions on what parser to use or how to do this?
Thanks!
This is how I would do it in PHP. This is almost certainly NOT the most efficient way to do it, nor has it been debugged.
function parseHTML($rawHTML){
    $startPosition = strpos($rawHTML,'<div class="ingredients"'); //Find where the ingredients list begins
    $endPosition = strpos($rawHTML,'</div>',$startPosition); //Find where it ends, searching from the start of the list
    $relevantPart = substr($rawHTML,$startPosition,$endPosition - $startPosition); //Isolate the ingredients list (substr takes a length, not an end position)
    $parsedString = strip_tags($relevantPart); //Strip the HTML tags off of the ingredients list
    return $parsedString;
}
Still to be done: You say you have a mySQL database with 10 separate ingredients columns. This code outputs everything as one big string. You would have to change the strip_tags($relevantPart) function to strip_tags($relevantPart,"<li>"). That would let the <li> tags through. Then, you would have to loop through every <li> tag, performing a similar function to this. It shouldn't be too hard, but I don't feel comfortable writing it with no functioning PHP server.

Graphs - find common data

I've just started reading up on graph theory and data structures.
I'm building an example application which should be able to find the xpath for the most common links. Imagine a Google SERP: my application should be able to find the xpath of all the links pointing to a result.
Imagine that these xpaths were found:
/html/body/h2/a
/html/body/p/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/p/strong/a
/html/body/div[@class=footer]/span[@id=copyright]/a
From these xpaths, I've thought of a graph like this (I might be completely lost here):
html
└── body
    ├── h2
    │   └── a (1)
    ├── p
    │   ├── a (1)
    │   └── strong
    │       └── a (3)
    └── div[@class=footer]
        └── span[@id=copyright]
            └── a (1)
Is this the best approach to this problem?
What would be the best way (data structure) to store this in memory? The language does not matter. We can see that we have 3 links matching the path html -> body -> p -> strong -> a.
As I said, I'm totally new to this, so please forgive me if I've thought about this completely wrong.
EDIT: I may be looking for the trie data structure?
Don't worry about tries yet. Just construct a tree using a standard graph representation (node = {value, count, parent}), collapsing identical branches as you insert and incrementing the counter. Then sort all the leaves by count in descending order and traverse from each leaf upwards to get its path.
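To make that concrete, a minimal sketch in R (untested; the node layout and names are illustrative): each path segment becomes a node that counts how many of the xpaths above pass through it, and identical branches collapse automatically because they share nodes.
# Each node is an environment holding a counter and a named list of children.
new_node <- function() { n <- new.env(); n$count <- 0; n$children <- list(); n }
insert <- function(root, path) {
  node <- root
  for (seg in strsplit(sub("^/", "", path), "/")[[1]]) {
    if (is.null(node$children[[seg]])) node$children[[seg]] <- new_node()
    node <- node$children[[seg]]
    node$count <- node$count + 1          # same branch seen again: just bump the counter
  }
  invisible(node)
}
xpaths <- c("/html/body/h2/a",
            "/html/body/p/a",
            "/html/body/p/strong/a",
            "/html/body/p/strong/a",
            "/html/body/p/strong/a",
            "/html/body/div[@class=footer]/span[@id=copyright]/a")
root <- new_node()
invisible(lapply(xpaths, insert, root = root))
# Three links share the path html -> body -> p -> strong -> a:
root$children$html$children$body$children$p$children$strong$children$a$count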