Extract text from HTML node tree with R

I'm currently trying to scrape text from an HTML tree that I've parsed as follows:-
require(RCurl)
require(XML)
query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)
query.IMDB
query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")
My first attempt was just to use grep on the resulting vector, but this fails.
data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable
My next attempt was to use grep on the individual points in the query.IMDB vector:-
vect <- numeric(length(df.IMDB))
for (i in 1:length(df.IMDB)){
vect[i] <- data[grep("Users rated this", "", df.IMDB)]
}
but this also throws the closure not subsettable error.
Finally trying the above function without data[] around the grep throws
Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero
I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm doing a simpler version first to get the thing working.
Can anyone advise what function I should be using to edit the text in each point on my query.IMDB vector?

No need to use grep here (avoid regular expressions with HTML files). Use the handy function readHTMLTable from the XML package:
library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
                            Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire        8.2     2,694
2                   Bart the Genius        7.8     1,167
3                   Homer's Odyssey        7.5     1,005
4     There's No Disgrace Like Home        7.9     1,017
5                  Bart the General        8.0       992
6                      Moaning Lisa        7.4       988
This gives you the table of ratings. You should probably convert UserVotes to numeric, since the values contain thousands separators.
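The conversion to numeric could look roughly like the sketch below. It is only an illustration: the data frame `ratings` is a hypothetical stand-in built by hand from the first three rows of the output above, not the live result of readHTMLTable, and it assumes the vote counts use a comma as the thousands separator.

```r
# Hypothetical stand-in for the scraped table (first three rows of the output above)
ratings <- data.frame(
  Episode    = c("Simpsons Roasting on an Open Fire", "Bart the Genius", "Homer's Odyssey"),
  UserRating = c("8.2", "7.8", "7.5"),
  UserVotes  = c("2,694", "1,167", "1,005"),
  stringsAsFactors = FALSE
)

# as.character() guards against the columns having been read in as factors
ratings$UserVotes  <- as.numeric(gsub(",", "", as.character(ratings$UserVotes), fixed = TRUE))
ratings$UserRating <- as.numeric(as.character(ratings$UserRating))

str(ratings)
```

Stripping the commas before calling as.numeric() matters, because as.numeric("2,694") would return NA with a warning.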

Related

COBOL .csv File IO into Table Not Working

I am trying to learn COBOL, as I have heard of it and thought it would be fun to take a look at. I came across Micro Focus COBOL (not sure whether that is pertinent to this post), and since I like to write in Visual Studio, that was enough incentive to try to learn it.
I've been reading a lot about it and trying to follow the documentation and examples. So far I've gotten user input and output to the console working, so I decided to try file I/O. That went OK when I was just reading in one 'record' at a time (I realize that 'record' may be incorrect jargon). Although I've been programming for a while, I am an extreme noob with COBOL.
I have a C++ program that I wrote before that simply takes a .csv file, parses it, and then sorts the data by whatever column the user wants. I figured it wouldn't be too hard to do the same in COBOL. Apparently I misjudged in this regard.
I have a file, edited on Windows using Notepad++, called test.csv, which contains:
4001942600,140,4
4001942700,141,3
4001944000,142,2
This data is from the US Census, which has column headers titled GEOID, SUMLEV, and STATE. I removed the header row since I couldn't figure out how to read it in at the time, and then read in the other data. Anywho...
In Visual Studio 2015, on Windows 7 Pro 64-bit, using Micro Focus and step debugging, I can see in-record containing the first row of data. The unstring works fine for that run, but the next time the program 'loops' I can step debug, view in-record, and see that it contains the new data. However, when I expand the watch elements, the watch display looks like the following:
REC-COUNTER 002 PIC 9(3)
+ IN-RECORD {Length = 42} : "40019427004001942700 000 " GROUP
- GEOID {Length = 3} PIC 9(10)
GEOID(1) 4001942700 PIC 9(10)
GEOID(2) 4001942700 PIC 9(10)
GEOID(3) <Illegal data in numeric field> PIC 9(10)
- SUMLEV {Length = 3} PIC 9(3)
SUMLEV(1) <Illegal data in numeric field> PIC 9(3)
SUMLEV(2) 000 PIC 9(3)
SUMLEV(3) <Illegal data in numeric field> PIC 9(3)
- STATE {Length = 3} PIC X
STATE(1) PIC X
STATE(2) PIC X
STATE(3) PIC X
So I'm not sure why, just before the unstring operation the second time around, I can see the proper data, but after the unstring happens incorrect data is stored in the 'table'. What is also interesting is that if I continue, the third time around the correct data is stored in the 'table'.
identification division.
program-id.endat.
environment division.
input-output section.
file-control.
select in-file assign to "C:/Users/Shittin Kitten/Google Drive/Embry-Riddle/Spring 2017/CS332/group_project/cobol1/cobol1/test.csv"
organization is line sequential.
data division.
file section.
fd in-file.
01 in-record.
05 record-table.
10 geoid occurs 3 times pic 9(10).
10 sumlev occurs 3 times pic 9(3).
10 state occurs 3 times pic X(1).
working-storage section.
01 switches.
05 eof-switch pic X value "N".
* declaring a local variable for counting
01 rec-counter pic 9(3).
* Defining constants for new line and carraige return. \n \r DNE in cobol!
78 NL value X"0A".
78 CR value X"0D".
78 TAB value X"09".
******** Start of Program ******
000-main.
open input in-file.
perform
perform 200-process-records
until eof-switch = "Y".
close in-file;
stop run.
*********** End of Program ************
******** Start of Paragraph 2 *********
200-process-records.
read in-file into in-record
at end move "Y" to eof-switch
not at end compute rec-counter = rec-counter + 1;
end-read.
Unstring in-record delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
display "GEOID " & TAB &">> " & TAB & geoid of record-table(rec-counter).
display "SUMLEV >> " & TAB & sumlev of record-table(rec-counter).
display "STATE " & TAB &">> " & TAB & state of record-table(rec-counter) & NL.
************* End of Paragraph 2 **************
I'm very confused about why I can actually see the data after the read operation, but it isn't stored in the table. I have tried changing the declarations of the table to pic 9(some length) as well and the result changes but I can't seem to pinpoint what I'm not getting about this.

I think there are a few things you've not grasped yet, and which you need to.
In the DATA DIVISION, there are a number of SECTIONs, each of which has a specific purpose.
The FILE SECTION is where you define data structures which represent data on files (input, output or input-output). Each file has an FD, and subordinate to an FD will be one or more 01-level structures, which can be extremely simple, or complex.
Some of the exact behaviour is down to the particular implementation for a compiler, but you should treat things this way, for your own "minimal surprise" and for the sake of anyone who later has to amend your programs: for an input file, don't change the data after a READ, unless you are going to update the record (or if you are using a keyed READ, perhaps). You can regard the "input area" as a "window" on your data-file. On the next READ, the window is pointed to a different position. Alternatively, you can regard it as "the next record arrives, obliterating what was there previously". You have put the "result" of your UNSTRING into the record-area. The result will for sure disappear on the next read. You have the possibility (if the window is true for your compiler, and depending on the mechanism it uses for IO) of squishing the "following" data as well.
Your result should be in the WORKING-STORAGE, where it will remain undisturbed by new records being read.
READ filename INTO data-description is an implicit MOVE of the data from the record-area to data-description. If, as you have specified, data-description is the record-area, the result is "undefined". If you only want the data in the record-area, just a plain READ filename is all that is needed.
You have a similar issue with your original UNSTRING. You have the source and target fields referencing the same storage. "Undefined" and not the result you want. This is why the unnecessary UNSTRING "worked".
You have a redundant inline PERFORM. You process "something" after end-of-file. You make things more convoluted by using unnecessary "punctuation" in the PROCEDURE DIVISION (which you've apparently omitted to paste). Try using ADD instead of COMPUTE there. Look at the use of FILE STATUS, and of 88-level condition-names.
You don't need a "new line" for DISPLAY, because you get one for free unless you use NO ADVANCING.
You don't need to "concatenate" in the DISPLAY, because you get that for free as well.
DISPLAY and its cousin, ACCEPT, are the verbs (only intrinsic functions are functions in COBOL (except where your compiler supports user-defined functions)) which vary the most from compiler to compiler. If your compiler supports SCREEN SECTION in the DATA DIVISION you can format and process user-input in "screens". If you were to use IBM's Enterprise COBOL you'd have very basic DISPLAY/ACCEPT.
You "declare a local variable". Do you? In what sense? Local to the program.
You can pick up quite a lot of tips by looking at COBOL questions here from the last few years.

Well, I figured it out. While step debugging again and hovering the mouse over record-table, I noticed 26 white spaces present after the last data field. Earlier tonight I attempted to change this data on the 'fly', as it were, because normally Visual Studio allows this. I attempted to make the change but did not verify that it took (normally I don't have to), and apparently it did not take. I should have known better, since the icon displayed to the left of record-table shows a little closed padlock.
I normally program in C, C++, and C#, so when I see the little padlock it usually has something to do with scoping and visibility. Not knowing COBOL well enough, I overlooked this little detail.
I then decided to unstring in-record delimited by spaces into temp-string, just prior to the
Unstring temp-string delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
The result of this was the properly formatted data, at least as I understand it, stored into the table and printed to the console screen.
Now I have read that the unstring 'function' can utilize multiple 'operators' such as so I may try to combine these two unstring operations into one.
Cheers!
**** Update ****
I have read Mr. Woodger's reply below. If I could, I'd like to ask for a bit more assistance with this. I have also read this post, which is similar but above my level at this time: COBOL read/store in table
That is pretty much what I'm trying to do, but I don't understand some of the things Mr. Woodger is trying to explain. Below is the code, a bit more refined, with some questions I have as comments. I would very much like some assistance with this, or an offline conversation would be fine too.
`identification division.
* I do not know what 'endat' is
program-id.endat.
environment division.
input-output section.
file-control.
* assign a file path to in-file
select in-file assign to "C:/Users/Shittin Kitten/Google Drive/Embry-Riddle/Spring 2017/CS332/group_project/cobol1/cobol1/test.csv"
* Is line sequential what I need here? I think it is
organization is line sequential.
* Is the data devision similar to typedef in C?
data division.
* Does the file sectino belong to data division?
file section.
* Am I doing this correctly? Should this be below?
fd in-file.
* I believe I am defining a structure at this point
01 in-record.
05 record-table.
10 geoid occurs 3 times pic A(10).
10 sumlev occurs 3 times pic A(3).
10 state occurs 3 times pic A(1).
* To me the working-storage section is similar to ADA declarative section
* is this a correct analogy?
working-storage section.
* Is this where in-record should go? Is in-record a representative name?
01 eof-switch pic X value "N".
01 rec-counter pic 9(1).
* I don't know if I need these
78 NL value X"0A".
78 TAB value X"09".
01 sort-col pic 9(1).
********************************* Start of Program ****************************
*Now the procedure division, this is alot like ada to me
procedure division.
* Open the file
perform 100-initialize.
* Read data
perform 200-process-records
* loop until eof
until eof-switch = "Y".
* ask user to sort by a column
display "Would which column would you like to bubble sort? " & TAB.
* get user input
accept sort-col.
* close file
perform 300-terminate.
* End program
stop run.
********************************* End of Program ****************************
******************************** Start of Paragraph 1 ************************
100-initialize.
open input in-file.
* Performing a read, what is the difference in this read and the next one
* paragraph 200? Why do I do this here instead of just opening the file?
read in-file
at end
move "Y" to eof-switch
not at end
* Should I do this addition here? Also why a semicolon?
add 1 to rec-counter;
end-read.
* Should I not be unstringing here?
Unstring in-record delimited by "," into geoid of record-table,
sumlev of record-table, state of record-table.
******************************** End of Paragraph 1 ************************
********************************* Start of Paragraph 2 **********************
200-process-records.
read in-file into in-record
at end move "Y" to eof-switch
not at end add 1 to rec-counter;
end-read.
* Should in-record be something else? I think so but don't know how to
* declare and use it
Unstring in-record delimited by "," into
geoid in record-table(rec-counter),
sumlev in record-table(rec-counter),
state in record-table(rec-counter).
* These lines seem to give the printed format that I want
display "GEOID " & TAB &">> " & TAB & geoid of record-table(rec-counter).
display "SUMLEV >> " & TAB & sumlev of record-table(rec-counter).
display "STATE " & TAB &">> " & TAB & state of record-table(rec-counter) & NL.
********************************* End of Paragraph 2 ************************
********************************* Start of Paragraph 3 ************************
300-terminate.
display "number of records >>>> " rec-counter;
close in-file;
**************************** End of Paragraph 3 *****************************
`

Why can't readHTMLTable read Premier League tables for the month of May?

The official Premier League website provides data with various statistics for the league's teams over seasons (e.g. this one). I used the function readHTMLTable from the XML R package to retrieve those tables. However, I noticed that the function cannot read tables for May, while for other months it works well. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function cannot retrieve the data for May.
Please, can someone explain why this happens and how it can be fixed?
EDIT AFTER @zyurnaidi's answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file (url)
raw <- readLines (con)
close (con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the webpage source code mess the things up
raw <- gsub (pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable (doc = raw, which = 1)
df[complete.cases(df), ] ## correct table

OK. There were a few hints that helped me find the problem here:
1. The issue happens consistently in May, the last month of each season. This means there should be something unique about this particular case.
2. Direct parsing (htmlParse, from both the link and a downloaded file) produces a truncated file. The table and the HTML file are suddenly closed right after the first team in the table is reported.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the HTML file itself, I found that there are (unencoded?) character issues there. My guess is that this is caused by the cute little trophy icons shown after the team names.
Anyway, to solve this issue, you need to take out these erroneous characters. Instead of editing the downloaded HTML files, my suggestion is:
1. View the page source of the EPL URL for May's league table
2. Copy everything, paste it into a text editor, and save it as an HTML file
3. You can now use either htmlParse or readHTMLTable on the saved file
There might be a better way to automate this, but I hope it helps.

Convert JSON URL to R Data Frame

I'm having trouble converting a JSON file (from an API) to a data frame in R. An example is the URL http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json
I've tried a few different suggestions from Stack Overflow, including "convert json data to data frame in R", and various blog posts such as http://zevross.com/blog/2015/02/12/using-r-to-download-and-parse-json-an-example-using-data-from-an-open-data-portal/
The closest I've come is the code below, which gives me a large matrix with 4 "rows" and a bunch of "variables" (V1, V2, etc.). I'm assuming that this JSON file is in a different format than "normal" ones.
library(RCurl)    # provides getURL()
library(RJSONIO)  # provides fromJSON()
raw_data <- getURL("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
data <- fromJSON(raw_data)
final_data <- do.call(rbind, data)
I'm pretty agnostic as to how to get it to work so any R packages/process are welcome. Thanks in advance.

The jsonlite package automatically picks up the dataframe:
library(jsonlite)
mydata <- fromJSON("http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
names(mydata$players)
# [1] "id" "esbid" "gsisPlayerId" "name"
# [5] "position" "teamAbbr" "stats" "seasonPts"
# [9] "seasonProjectedPts" "weekPts" "weekProjectedPts"
head(mydata$players)
# id esbid gsisPlayerId name position teamAbbr stats.1
# 1 100029 FALSE FALSE San Francisco 49ers DEF SF 16
# 2 729 ABD660476 00-0025940 Husain Abdullah DB KC 15
# 3 2504171 ABR073003 00-0019546 John Abraham LB 15
# 4 2507266 ADA509576 00-0025668 Michael Adams DB 13
# 5 2505708 ADA515576 00-0022247 Mike Adams DB IND 15
# 6 1037889 ADA534252 00-0027610 Phillip Adams DB ATL 11
You can control how much of this simplification happens using the simplify arguments of jsonlite::fromJSON() (simplifyVector, simplifyDataFrame and simplifyMatrix).
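As a rough illustration of those arguments, the sketch below parses a small inline JSON string instead of the live API. The string `json_txt` and its field values are made up for the example; only the general shape (a "players" array of objects with a nested "stats" object) is meant to resemble the real response.

```r
library(jsonlite)

# Hypothetical miniature of the API response: an array of player objects
json_txt <- '{"players":[
  {"id":"1","name":"Player A","stats":{"1":"16","5":"2"}},
  {"id":"2","name":"Player B","stats":{"1":"15"}}
]}'

# Default behaviour: the array of objects is simplified into a data frame
simplified <- fromJSON(json_txt)
class(simplified$players)

# Simplification switched off: the result stays a plain nested list
raw_list <- fromJSON(json_txt, simplifyVector = FALSE)
class(raw_list$players)

# flatten = TRUE spreads the nested "stats" data frame into ordinary columns
flat <- fromJSON(json_txt, flatten = TRUE)
names(flat$players)
```

Keeping the nested list (simplifyVector = FALSE) is useful when you want to decide yourself how irregular parts such as "stats" should be mapped to columns.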

There's nothing "abnormal" about this JSON; it's just not a rectangular structure that fits trivially into a data frame. JSON can represent much richer data structures.
For example (using the rjson package, you've not said what you've used):
> data = rjson::fromJSON(file="http://api.fantasy.nfl.com/v1/players/stats?statType=seasonStats&season=2010&week=1&format=json")
> length(data[[4]][[10]]$stats)
[1] 14
> length(data[[4]][[1]]$stats)
[1] 21
(data[[1]] to data[[3]] look like headers.)
The "stats" of the 10th element of data[[4]] has 14 elements, while the "stats" of the first has 21. How is that going to fit into a rectangular data frame? R has stored it in a list because that's R's best way of storing irregular data structures.
Unless you can define a way of mapping the irregular data into a rectangular data frame, you can't store it in a data frame. Do you understand the structure of the data? That's essential.
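One possible mapping, sketched below in base R, is to take the union of all stat keys and fill the missing ones with NA. This is only one choice among many, and the `players` list here is a made-up miniature rather than the real API data; the column prefix "stat_" is likewise just an assumption for readability.

```r
# Hypothetical irregular data: each player has a different set of stat keys
players <- list(
  list(name = "Player A", stats = list("1" = "16", "5" = "2")),
  list(name = "Player B", stats = list("1" = "15"))
)

# Union of every stat key seen across all players
all_keys <- sort(unique(unlist(lapply(players, function(p) names(p$stats)))))

# One numeric row per player; keys a player lacks become NA
stat_matrix <- do.call(rbind, lapply(players, function(p) {
  as.numeric(unlist(p$stats)[all_keys])
}))
colnames(stat_matrix) <- paste0("stat_", all_keys)

df <- data.frame(
  name = vapply(players, function(p) p$name, character(1)),
  stat_matrix,
  stringsAsFactors = FALSE
)
df
```

Whether NA is the right filler (rather than 0, say) depends on what a missing stat means in the data, which is why understanding the structure comes first.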

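rjson and jsonlite have similarly named functions, such as fromJSON, so depending on the order in which you load them, one will mask the other. For my purposes, rjson structures the data much better than jsonlite, so I make sure to load them in the correct order, or load only rjson. Calling the function with an explicit namespace, as in rjson::fromJSON(...), also avoids the ambiguity.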

Load jsonlite:
library(jsonlite)
Define quandl_url:
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
Import the Quandl data:
quandl_data <- fromJSON(quandl_url)
quandl_data is a list:
quandl_data

How to extract text from several "div class" elements (HTML) using R?

My goal is to extract info from this html page to create a database:
https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing
One of the variables is the price of the apartments. I've found that some listings have the div class="row_price" code, which includes the price (example A), but others don't have this code and therefore have no price (example B). Hence I would like R to read the observations without a price as NA; otherwise it will mix up the database by assigning the price from the observation that follows.
Example A
<div class="listing_column listing_row_price">
<div class="row_price">
$ 14,800
</div>
<div class="row_info">Ayer 19:53</div>
Example B
<div class="listing_column listing_row_price">
<div class="row_info">Ayer 19:50</div>
I think that if I extract the text from "listing_row_price" to the beginning of the "row_info" in a character vector I will be able to get my desired output, which is:
...
10 4000
11 14800
12 NA
13 14000
14 8000
...
But so far I've gotten this one, and another full of NA.
...
10 4000
11 14800
12 14000
13 8000
14 8500
...
Commands I used that didn't get what I want:
html1<-read_html("file.html")
title<-html_nodes(html1,"div")
html1<-toString(title)
pattern1<-'div class="row_price">([^<]*)<'
title3<-unlist(str_extract_all(title,pattern1))
title3<-title3[c(1:35)]
pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
title3<-unlist(str_extract(title3,pattern2))
title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))
I also tried pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<', which I think says to extract the "listing_row_price" part, then, if it exists, extract the "row_price" part, then get the digits, and finally extract the < that follows.

There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:
library(rvest)
page <- read_html('page.html')
# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')
# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2,
html_text(html_children(x)[1]),
NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
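As a variation on the same idea (not the approach above, which counts children), the sketch below looks up the .row_price child of each listing directly and relies on rvest returning a missing node, and therefore NA text, when a listing has no such child. The inline `html` string is a hand-written stand-in built from Examples A and B in the question, with the outer div closed by assumption.

```r
library(rvest)

# Stand-in page: one listing with a price (Example A), one without (Example B)
html <- '
<div class="listing_column listing_row_price">
  <div class="row_price">
    $ 14,800
  </div>
  <div class="row_info">Ayer 19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer 19:50</div>
</div>'

page <- read_html(html)
listings <- html_nodes(page, css = ".listing_row_price")

# html_node() returns exactly one result per listing (a missing node if there
# is no match), so the output stays aligned with the listings
price_text <- html_text(html_node(listings, css = ".row_price"))

# Strip everything that is not a digit; NA stays NA
prices <- as.integer(gsub("[^0-9]", "", price_text))
prices
```

Because the result has one element per listing, it can be bound to other per-listing columns without the rows shifting.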

Use MySQL database query results to plot R graph using RApache and Brew

I am trying to plot a graph using R which is populated by MySQL query results. I have the following code:
rs = dbSendQuery(con, "SELECT BuildingCode, AccessTime from access")
data = fetch(rs, n=-1)
x = data[,1]
y = data[,2]
cat(colnames(data),x,y)
This gives me an output of:
BuildingCode AccessTime TEST-0 TEST-1 TEST-2 TEST-3 TEST-4 14:40:59 07:05:00 20:10:59 08:40:00 07:30:59
But this is where I get stuck. I have no idea how to pass the "cat" data into an R plot. I have spent hours searching online, and most of the examples of R plots I have come across use read.table(text=""). This is not feasible for me, as the data has to come from a database and not be hard-coded. I also found something about saving the output as a CSV, but MySQL cannot overwrite existing files, so after the code was executed once I was unable to run it again because the file already existed.
My question is, how can I use the "cat" data (or another way of doing it if there is a better way) to plot a graph using data that isn't hard coded?
Note: I am using RApache as my web server and I have installed the Brew package.

Make the plot using R, and just pass the path to the file back in cat:
<%
## Your other code to get the data, assuming it gets a data.frame called data
## Plot code
library(Cairo)
myplotfilename <- "/path/to/dir/myplot.png"
CairoPNG(filename = myplotfilename, width = 480, height = 480)
plot(x=data[,1],y=data[,2])
tmp <- dev.off()
cat(myplotfilename)
%>
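If the query returns both columns as character strings, which the cat output in the question suggests but which depends on the database driver, they may need converting before plot() can use them. The sketch below is one way to do that in base R. The `access` data frame is a hand-built stand-in for the fetched result using the values shown in the question, and `to_hours` is a hypothetical helper, not part of any package.

```r
# Stand-in for the result of fetch(rs, n = -1), using the values from the question
access <- data.frame(
  BuildingCode = c("TEST-0", "TEST-1", "TEST-2", "TEST-3", "TEST-4"),
  AccessTime   = c("14:40:59", "07:05:00", "20:10:59", "08:40:00", "07:30:59"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "HH:MM:SS" strings to decimal hours
to_hours <- function(hms) {
  parts <- do.call(rbind, strsplit(hms, ":", fixed = TRUE))
  as.numeric(parts[, 1]) + as.numeric(parts[, 2]) / 60 + as.numeric(parts[, 3]) / 3600
}

access$Hours <- to_hours(access$AccessTime)
access
```

Between the CairoPNG() and dev.off() calls above, something like barplot(access$Hours, names.arg = access$BuildingCode) would then draw one bar per building.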
