My goal is to extract info from this html page to create a database:
https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing
One of the variables is the price of the apartments. I've identified that some listings have a div with class="row_price" that contains the price (example A), but others don't have this div and therefore no price (example B). Hence I would like R to read the observations without a price as NA; otherwise the database gets scrambled, with each missing value taking the price from the observation that follows.
Example A
<div class="listing_column listing_row_price">
<div class="row_price">
$ 14,800
</div>
<div class="row_info">Ayer 19:53</div>
Example B
<div class="listing_column listing_row_price">
<div class="row_info">Ayer 19:50</div>
I think that if I extract the text from "listing_row_price" up to the beginning of "row_info" into a character vector, I will be able to get my desired output, which is:
...
10 4000
11 14800
12 NA
13 14000
14 8000
...
But so far I've gotten this one, plus another attempt that is full of NAs.
...
10 4000
11 14800
12 14000
13 8000
14 8500
...
Commands I used that didn't give what I want:
library(rvest)
library(stringr)
html1 <- read_html("file.html")
title <- html_nodes(html1, "div")
title <- toString(title)
pattern1 <- 'div class="row_price">([^<]*)<'
title3 <- unlist(str_extract_all(title, pattern1))
title3 <- title3[c(1:35)]
pattern2 <- '>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
title3 <- unlist(str_extract(title3, pattern2))
title3 <- gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ", "", title3, fixed = TRUE)
title3 <- as.data.frame(as.numeric(gsub(",", "", title3, fixed = TRUE)))
I also tried pattern1 <- 'listing_row_price">([<div class="row_price">]?)([^<]*)<', which I thought would match the "listing_row_price" part, then the "row_price" part if it exists, then capture the digits, and finally the < that follows.
There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. Here is a reasonably simple strategy that works in this case:
library(rvest)
page <- read_html('page.html')
# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')
# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x) {
  ifelse(length(html_children(x)) == 2,
         html_text(html_children(x)[1]),
         NA)
})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
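An even simpler variant, assuming a reasonably current version of rvest: html_node() (singular) is vectorized over a node set and returns a missing node wherever there is no match, so html_text() yields NA for exactly the listings without a price. A minimal sketch:
library(rvest)
page <- read_html('page.html')
listings <- html_nodes(page, css = '.listing_row_price')
# html_node() gives a missing node for listings without a .row_price child,
# so html_text() returns NA there instead of shifting prices to the wrong rows
prices <- html_text(html_node(listings, css = '.row_price'))
prices <- as.integer(gsub('[^0-9]', '', prices))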
The official Premier League website provides data with various statistics for the league's teams over seasons (e.g. this one). I used the function readHTMLTable from the XML R package to retrieve those tables. However, I noticed that the function cannot read the tables for the month of May, while for other months it works fine. Here is an example:
library(XML)
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Can someone please explain why this happens and how it can be fixed?
EDIT after @zyurnaidi's answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file(url)
raw <- readLines(con)
close(con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the page source messes things up
raw <- gsub(pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable(doc = raw, which = 1)
df[complete.cases(df), ] ## correct table
OK. There were a few hints that led me to the problem here:
1. The issue happens consistently in May, which is the last month of each season. That means there must be something unique to this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated document. The table and the html file are simply cut off right after the first team in the table.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the html file itself, I found that there are (unencoded?) character issues there. My guess is that this is caused by the cute little trophy icons shown after the team names.
Anyway, to solve this issue, you need to take out these problem characters. Instead of editing the downloaded html files by hand, my suggestion is:
1. View the page source of the EPL url for May's league table
2. Copy it all and paste it into a text editor, saving it as an html file
3. You can now use either htmlParse or readHTMLTable
There might be a better way to automate this, but I hope it can help.
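If the root cause really is stray, badly encoded characters, the cleanup can also be done generically with iconv rather than by matching the exact offending pattern. A sketch under that assumption (url as defined in the edit above; non-convertible bytes are simply dropped):
library(XML)
raw <- readLines(url)
raw <- iconv(raw, from = "UTF-8", to = "ASCII", sub = "")  # drop undecodable bytes
df <- readHTMLTable(doc = raw, which = 1)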
I am attempting to build a dataset from unstructured data. The raw data is a series of json files, each of which contains information about a single element of the data (e.g. each file becomes a row in the final data). I am looping through the jsons using jsonlite to turn each into a large nested list. The whole operation falls apart because of a seemingly simple problem:
I need to append rows to my data where some of the elements do not exist.
The raw data is like so:
jsn1 <- list(id='id1', col1=1, col2='A', col3=11)
jsn2 <- list(id='id2', col1=2, col2='B', col3=12)
jsn3 <- list(id='id3', col2='C', col3=13)
jsn4 <- list(id='id4', col1=3, col3=14)
The structure I am trying to get to is this:
df <- data.frame(id=c('id1','id2','id3','id4'),
                 col1=c(1,2,NA,3),
                 col2=c('A','B','C',NA),
                 col3=c(11,12,13,14))
> df
   id col1 col2 col3
1 id1    1    A   11
2 id2    2    B   12
3 id3   NA    C   13
4 id4    3 <NA>   14
My approach is along the lines of:
#Collect the json names in a vector
files=c('jsn1','jsn2','jsn3','jsn4')
#Initialize the dataframe with the first row filling in any missing values.
#I didn't do this at first, but it seems helpful.
df1=data.frame(id=jsn1$id,
               col1=jsn1$col1,
               col2=jsn1$col2,
               col3=jsn1$col3,
               stringsAsFactors=F)
#Loop through the remaining files, extracting the values and adding them to the dataframe.
for (i in 2:length(files)) {
  a <- get(files[i])
  new.row <- list(id=a$id,
                  col1=a$col1,
                  col2=a$col2,
                  col3=a$col3)
  df1 <- rbind(df1, new.row)
}
However, this doesn't work, because df1 <- rbind(df1, new.row) requires the columns to be the same length. I have tried df1 <- rbind.fill(df1, new.row), rbindlist(list(df1, new.row), use.names=T, fill=T), and df[nrow(df1) + 1, names(new.row)] <- new.row, and read this and this among others.
Most answers add to the data frame by "knowing" a priori which columns will be null / not null, then constructing a df without those columns and adding it with fill. That won't work here, as I have no idea ahead of time which columns will be present. The missing ones currently end up with 0 elements, which is the root of the problem, but I need to check whether they are present. It seems like there should be an easy way to handle this either "on read" or in the rbind, but I can't figure it out.
There are potentially hundreds of columns and millions of rows (though only 10s and 100s right now). The jsons are large, so reading them all into memory / concatenating the lists somehow is probably not feasible with the real data. A solution using data.table would probably be ideal, but any help is appreciated. Thanks.
You could do
data.table::rbindlist(mget(ls(pattern = "jsn[1-4]")), fill = TRUE)
# id col1 col2 col3
# 1: id1 1 A 11
# 2: id2 2 B 12
# 3: id3 NA C 13
# 4: id4 3 NA 14
Here mget(ls(pattern = "jsn[1-4]")) is a more programmatic way to gather the lists from the global environment that match the pattern jsn followed by the numbers 1-4. It's just the same as list(jsn1, jsn2, jsn3, jsn4) except it comes with names. You could just as easily do
rbindlist(list(jsn1, jsn2, jsn3, jsn4), fill = TRUE)
The ls() method will be better if you have many more jsn* lists.
You want rbind.fill from plyr. First you have to convert all your lists to data frames (here using lapply(mylists, as.data.frame)), then you can use rbind.fill to bind them, filling missing columns with NA:
library(plyr)
rbind.fill(lapply(list(jsn1, jsn2, jsn3, jsn4), as.data.frame))
id col1 col2 col3
1 id1 1 A 11
2 id2 2 B 12
3 id3 NA C 13
4 id4 3 <NA> 14
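A similar one-liner is possible with dplyr, if that is already in your stack: bind_rows() treats each named list of scalars as a one-row data frame and likewise fills missing columns with NA. A sketch, assuming the jsn* lists from the question:
library(dplyr)
bind_rows(jsn1, jsn2, jsn3, jsn4)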
I am going to jump back in because I have figured this out and want to share it in case anyone else comes across this problem. The problem was that reading values that are not present in a given json created NULLs in my list, which prevented rbindlist and rbind.fill from working.
To be more specific, I was doing this:
new.row <- list(id=a$id,
                col1=a$col1,
                col2=a$col2,
                col3=a$col3)
when a$col2 was not present in the list read from the json. This causes new.row$col2 to be NULL, and then you cannot use new.row in rbindlist or rbind.fill. However, all I needed to do was remove these NULLs from the list like so:
new.row <- plyr::compact(new.row)
before using rbindlist. Both answers were helpful in showing me that rbindlist and rbind.fill work once the NULL values are removed.
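Putting it all together, a minimal sketch of the corrected loop (using the example jsn* lists from the question):
library(data.table)
files <- c('jsn1', 'jsn2', 'jsn3', 'jsn4')
rows <- lapply(files, function(f) {
  a <- get(f)
  new.row <- list(id = a$id, col1 = a$col1, col2 = a$col2, col3 = a$col3)
  plyr::compact(new.row)  # drop the NULL entries for fields missing from this json
})
rbindlist(rows, fill = TRUE)  # fill = TRUE turns the dropped fields into NA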
I have a dataset with a couple million rows, in csv format, that I wish to import into Stata. I can do this, but there is a problem: a small percentage (but still many) of the observations are split across two lines in the CSV file. Most of the entries occupy only one line. The troublesome observations that take up 2 lines still follow the same pattern of being delimited by commas, but in the Stata dataset each one shows up as two rows, both containing only part of the total data.
I used import delimited to import the data. Is there anything that can be done at the import stage in Stata? I would prefer not to have to deal with this in the original CSV file if possible.
***Update
Here is an example of what the csv file looks like:
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Notice that there is no comma at the end of the line. Also notice that the problem is with the observation that begins with text 11.
This is basically how it shows up in Stata:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 1
4 2 text 13 text14 text15
5 text16 text17 text18 text19 text20
That the number sometimes sits right next to the text isn't a mistake - it is just to illustrate that the data is more complex than shown here.
Of course, this is how I need the data:
var1 var2 var3 var4 var5
1 text 1 text 2 text 3 text 4 text 5
2 text 6 text 7 text 8 text9 text10
3 text 11 text 12 text 13 text14 text15
4 text16 text17 text18 text19 text20
A convoluted way is (comments inline):
clear
set more off
*----- example data -----
// read each line as a single string variable by using a delimiter (;) that does not occur in the file
insheet using "~/Desktop/stata_tests/test.csv", names delim(;)
list
*----- what you want -----
// compute number of commas
gen numcom = length(var1var2var3var4var5) ///
- length(subinstr(var1var2var3var4var5, ",", "", .))
// save all data
tempfile orig
save "`orig'"
// keep observations that are fine
drop if numcom != 4
// save fine data
tempfile origfine
save "`origfine'"
*-----
// load all data
use "`orig'", clear
// keep offending observations
drop if numcom == 4
// for the -reshape-
gen i = int((_n-1)/2) +1
bysort i : gen j = _n
// check that pairs add up to 4 commas
by i : egen check = total(numcom)
assert check == 4
// no longer necessary
drop numcom check
// reshape wide
reshape wide var1var2var3var4var5, i(i) j(j)
// gen definitive variable
gen var1var2var3var4var5 = var1var2var3var4var51 + var1var2var3var4var52
keep var1var2var3var4var5
// append new observations with original good ones
append using "`origfine'"
// split
split var1var2var3var4var5, parse(,) gen(var)
// we're "done"
drop var1var2var3var4var5 numcom
list
But we don't really have the details of your data, so this may or may not work; it's just meant to be a rough draft. Depending on the memory space occupied by your data, and other details, you may need to improve parts of the code to make it more efficient.
Note: the file test.csv looks like
var1,var2,var3,var4,var5
text 1, text 2,text 3 ,text 4,text 5
text 6,text 7,text 8,text9,text10
text 11,text 1
2,text 13,text14,text15
text16,text17,text18,text19,text20
Note 2: I'm using insheet because I don't have Stata 13 at the moment. import delimited is the way to go if available.
Note 3: details on how the counting of commas works can be reviewed at Stata tip 98: Counting substrings within strings, by Nick Cox.
I would try the following strategy:
1. Import everything as a single string variable.
2. Count the commas on each line and combine a line with the following one if it is incomplete.
3. Delete the redundant material.
The comma count will be
length(variable) - length(subinstr(variable, ",", "", .))
If the observations in question are quoted in the CSV file, then you can use the bindquote(strict) option.
A bit of speculation without seeing the exact data: following Roberto Ferrer's comment, you might find the Stata command filefilter useful for cleaning the csv file before importing. You can substitute new string patterns for old ones, using basic characters as well as the more complex \n and \r terms.
I can't offer any code at the moment, but I suggest you take a good look at help import. The infile and infix commands state:
An observation can be on more than one line.
(I don't know whether this means that all observations must be on several lines, or whether it can handle cases where only some observations span more than one line.)
Check also the manuals if the examples and notes in the help files turn out to be insufficient.
I'd like to import data into R from a given webpage, say this one.
In the source code (but not on the actual page), the data I'd like to get is stored in a single line of javascript code which starts like this:
chart_Line1.setDataXML("<graph rotateNames (stuff omitted) >
<set value='699.99' name='16.02.2013' />
<set value='731.57' name='18.02.2013' />
<set value='more values' name='more dates' />
...
<trendLines> (now a different command starts, stuff omitted)
</trendLines></graph>")
(Note that I've included line breaks for readability; the data is in one single line in the original file. It would suffice to import only the line which starts with chart_Line1.setDataXML - it's line 56 in the source if you want to have a look yourself)
I can read the whole html file into a string using scan("URLofFile", what="raw"), but how do I extract the data from this?
Can I specify the data format with what="...", keeping in mind that there are no line breaks to separate the data, but several line breaks in the irrelevant prefix and suffix?
Is this something which can be done in a nice way using R tools, or do you suggest that this data acquisition should rather be done with a different script?
With some trial and error, I was able to find the exact line that contains the data. I read the whole html file and then discard all other lines.
require(zoo)
require(stringr)
# get html data and drop every line except the interesting one
theurl <- "https://www.magickartenmarkt.de/Black_Lotus_Unlimited.c1p5093.prod"
sec <- scan(file = theurl, what = "character", sep = "\n")
sec <- sec[45]
# extract all strings of the form "value='X'", where X is a 1 to 3 digit number with some separator and 2 decimal places
values <- str_extract_all(sec, "value='[0-9]{1,3}.[0-9]{2}'")
# dispose of all non-numerical, non-separator values
values <- str_replace_all(unlist(values),"[^0-9/.]","")
# get all dates in the form "name='DD.MM.YYYY"
dates <- str_extract_all(sec, "name='[0-9]{2}.[0-9]{2}.[0-9]{4}'")
# dispose of all non-numerical, non-separator values
dates <- str_replace_all(unlist(dates),"[^0-9/.]","")
# convert dates to canonical format
dates <- as.Date(dates,format="%d.%m.%Y")
# put values and dates into a list of ordered observations, converting the values from characters to numbers first.
MyZoo <- zoo(as.numeric(values),dates)
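One fragile spot above is the hard-coded sec[45]: if the page changes, the data may move to a different line. Since the question notes that the relevant line starts with chart_Line1.setDataXML, a sketch of a more robust selection under the same assumptions:
# locate the data line by its marker instead of hard-coding the line number
sec <- scan(file = theurl, what = "character", sep = "\n")
sec <- sec[grepl("chart_Line1.setDataXML", sec, fixed = TRUE)]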
I'm currently trying to scrape text from an HTML tree that I've parsed as follows:
require(RCurl)
require(XML)
query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)
query.IMDB
query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")
My first attempt was just to use grep on the resulting vector, but this fails.
data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable
My next attempt was to use grep on the individual points in the query.IMDB vector:
vect <- numeric(length(df.IMDB))
for (i in 1:length(df.IMDB)){
  vect[i] <- data[grep("Users rated this", "", df.IMDB)]
}
but this also throws the closure not subsettable error.
Finally trying the above function without data[] around the grep throws
Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero
I'm actually hoping eventually to replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm starting with a simpler version to get things working.
Can anyone advise which function I should be using to edit the text at each point of my query.IMDB vector?
No need to use grep here (avoid regular expressions with HTML files). Use the handy readHTMLTable function from the XML package:
library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire 8.2 2,694
2 Bart the Genius 7.8 1,167
3 Homer's Odyssey 7.5 1,005
4 There's No Disgrace Like Home 7.9 1,017
5 Bart the General 8.0 992
6 Moaning Lisa 7.4 988
This gives you the table of ratings. You may want to convert UserVotes to numeric.
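For instance, a minimal sketch of that conversion (the commas are thousands separators, so strip them before converting; readHTMLTable returns factors by default, hence the as.character step):
library(XML)
ratings <- readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]]
ratings$UserVotes <- as.numeric(gsub(",", "", ratings$UserVotes))
ratings$UserRating <- as.numeric(as.character(ratings$UserRating))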