I am trying to parse the content of a website but I receive an error message. I don't know how to deal with the error:
require(RCurl)
require(XML)
html <- getURL("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt")
doc <- htmlParse(html, asText=TRUE)
This is the error message I get:
Error: XML content does not seem to be XML, nor to identify a file name
I am working on a Mac:
> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8 rJava_0.9-4 R.utils_1.26.2 R.oo_1.13.9 R.methodsS3_1.4.4 gsubfn_0.6-5 proto_0.3-10 RCurl_1.95-4.1
[9] bitops_1.0-6 splus2R_1.2-0 stringr_0.6.2 foreign_0.8-54 XML_3.95-0.2
loaded via a namespace (and not attached):
[1] tcltk_3.0.1 tools_3.0.1
Any ideas on how to solve this issue?
You don't need RCurl to get the file; the built-in tools can read text from URLs (e.g. scan or read.table).
The reason you're getting this error is that the file isn't valid XML or HTML. Strip out all the lines before the <HTML> tag and you should be good to go.
# read the filing line by line
sec <- scan(file = "http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt", what = "character", sep = "\n", allowEscapes = TRUE)
# drop the privacy-enhanced-message preamble; the <HTML> tag starts at line 56 in this filing
sec <- sec[56:length(sec)]
secHTML <- htmlParse(sec)
There are other, less ugly ways to get the file, but once you strip the 'text' preamble, XML should be able to parse it.
Alternatively, I think there's a parameter to htmlParse that lets you skip a number of lines. You can also locate the <HTML> tag programmatically rather than hard-coding the line number, as in the sketch below.
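For example, here is a rough sketch (using only base R and the XML package, with the same URL as in the question) that finds the first line containing <HTML> with grep instead of assuming it is line 56:
library(XML)
sec <- scan(file = "http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt", what = "character", sep = "\n", allowEscapes = TRUE)
# find where the HTML part of the filing begins
start <- grep("<HTML>", sec, ignore.case = TRUE)[1]
# collapse the remaining lines into one string and parse them
secHTML <- htmlParse(paste(sec[start:length(sec)], collapse = "\n"), asText = TRUE)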
> txt <- readLines(url("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt"))
> head(txt)
[1] "-----BEGIN PRIVACY-ENHANCED MESSAGE-----"
[2] "Proc-Type: 2001,MIC-CLEAR"
[3] "Originator-Name: webmaster#www.sec.gov"
[4] "Originator-Key-Asymmetric:"
[5] " MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen"
[6] " TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB"
> length(txt)
[1] 5517
The filings stored on the www.sec.gov website are a mixture of different file types. Some are plain text, some are JPG, some are GIF, some are PDF, some are XML, some are XBRL, some are HTML, and so on. The example file you are using is the "RAW Dissemination" type, which is actually a combination of any, or all, of the other types.
The file name "0001193125-06-125763.txt" is a concatenation of the "Accession Number" and the .txt extension. This RAW Dissemination file is made up of Header Data and a series of <DOCUMENT> ... </DOCUMENT> tag sets. What comes between the start and end DOCUMENT tags are the various "files" within the "filing".
Each of the different files within a filing should be treated separately. The PDF, JPG, and GIF file types are UUencoded and should be UUdecoded. Others, like TXT, HTML, XML, and XBRL, should be treated as plain text and, if needed, parsed as the appropriate type.
The Header Data tags contain information about the companies, people, filers, filer agents, etc. who submitted the filing.
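As a rough sketch (not part of the original answer, and assuming the DOCUMENT tags sit on their own lines, which is how EDGAR dissemination files are laid out), you can split the raw file into its embedded documents like this:
txt <- readLines("http://www.sec.gov/Archives/edgar/data/8947/000119312506125763/0001193125-06-125763.txt")
# line numbers where each embedded document starts and ends
starts <- grep("^<DOCUMENT>", txt)
ends <- grep("^</DOCUMENT>", txt)
# one character vector per embedded file
docs <- mapply(function(s, e) txt[s:e], starts, ends, SIMPLIFY = FALSE)
length(docs)  # number of files inside the filing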
I have several HTML files in a folder on my PC. I would like to read them into R, trying to keep the original format as much as possible. There is only text, by the way. I have tried two approaches, which failed miserably:
##first approach
library (tm)
cname <- file.path("C:", "Users", "usuario", "Desktop", "DEADataset", "The Phillipines", "gazzetes.presihtml")
docs <- Corpus(DirSource(cname))
## second approach
list_files_path<- list.files(path = './gazzetes.presihtml')
a<- paste0(list_files_path, names) # vector names contain the names of the file with the .HTML extension
rawHTML <- readLines(a)
Any guesses? All the best.
Your second approach is close to working, except that readLines accepts only one connection, while you are giving it a vector of multiple files. You can use lapply with readLines to achieve this. Here is an example:
# generate vector of html files
files <- c('/path/to/your/html/file1', '/path/to/your/html/file2')
# readLines for each file and put them in a list
lineList <- lapply(files, readLines)
# create a character vector that contains all lines from all files
lineVector <- unlist(lineList)
# collapse the character vector into a single string
html <- paste(lineVector, collapse = '\n')
# print the string with original formatting
cat(html)
I cannot find an answer on how to load a 7z file in R. I can't use this:
s <- system("7z e -o <path> <archive>")
because it fails with error 127. Maybe that's because I'm on Windows? However, the 7z file opens fine when I click it in Total Commander.
I'm trying something like this:
con <- gzfile(path, 'r')
ff <- readLines(con, encoding = "UTF-8")
h <- fromJSON(ff)
I get this error:
Error: parse error: trailing garbage
7z¼¯' ãSp‹ Ë:ô–¦ÐÐY#4U¶å¿ç’
(right here) ------^
The encoding is clearly off, whereas when I load this file uncompressed it is fine without specifying the encoding. Moreover, the output is twice as long. I have thousands of 7z files and need to read them one by one in a loop: read, analyze, and move on. Could anyone give me some hints on how to do this effectively?
When uncompressed it easily works using:
library(jsonlite)
f <- read_json(path, simplifyVector = T)
EDIT
There are many JSON files in one 7z archive. The above error is probably caused by the parser reading the raw data of the whole archive. I don't know how to link to these inner files or specify the connection attributes.
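One hedged hint, not from the original thread: error 127 from system() usually means the command could not be found, and gzfile() cannot read 7z archives. A minimal sketch that assumes 7-Zip is installed at its default Windows location (an assumption; adjust the path as needed) calls the executable by its full path and then reads the extracted JSON files:
library(jsonlite)
# assumption: default 7-Zip install location on Windows
seven_zip <- "C:/Program Files/7-Zip/7z.exe"
out_dir <- tempfile("un7z_")
dir.create(out_dir)
# extract the archive into a temporary folder (note: no space after -o)
system2(seven_zip, args = c("e", shQuote(path), paste0("-o", out_dir)))
# read and parse every extracted .json file
json_files <- list.files(out_dir, pattern = "\\.json$", full.names = TRUE)
parsed <- lapply(json_files, read_json, simplifyVector = TRUE)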
I am using R's RJSONIO to read JSON from a file. The JSON contains Unicode characters, which get read incorrectly.
The code works when the JSON is passed as a string, as shown by the author of the R package in the Stack Overflow question "How to correctly deal with escaped Unicode Characters in R", e.g. the em dash (—).
However, when the JSON is read from a file, it does not produce the correct Unicode representation, as seen below:
fromJSON(content="~/MTS/temp")
$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0
$query$categorymembers[[1]]$title
[1] "Banach\023Tarski paradox"
Where ~/MTS/temp contains:
{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}`
An alternative package called jsonlite works the way you would expect on my system (OS X) -- but I did verify that RJSONIO does not. This is after I saved your JSON snippet to a file called utext.txt:
file.show("utext.txt")
## {"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
jsonlite::fromJSON("~/temp/utext.txt")
## $query
## $query$categorymembers
## ns title
## 1 0 Banach–Tarski paradox
Here is another solution that is a bit more platform-dependent: encode your Unicode-escaped files prior to reading them. (Whether or not your platform has this utility, I do not know, but even on Windows you can probably find it.)
My system locale encoding is UTF-8 (the OS X standard), so when I run the command-line utility native2ascii I can encode the file as UTF-8 and then read it into R, where my locale is set to en_GB.UTF-8.
From a Terminal/shell:
native2ascii -reverse ~/temp/utext.txt ~/temp/utextUTF8.txt
Then in R:
RJSONIO::fromJSON("~/temp/utextUTF8.txt")
## $query
## $query$categorymembers
## $query$categorymembers[[1]]
## $query$categorymembers[[1]]$ns
## [1] 0
##
## $query$categorymembers[[1]]$title
## [1] "Banach–Tarski paradox"
Voilà, problem solved.
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
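For instance, here is a rough sketch of driving that conversion from R (assuming pdftotext is on the PATH and using a hypothetical foo.pdf):
pdf_file <- "foo.pdf"  # hypothetical input file
txt_file <- sub("\\.pdf$", ".txt", pdf_file)
# call the pdftotext utility, then read the plain-text result back into R
system2("pdftotext", args = c(shQuote(pdf_file), shQuote(txt_file)))
txt <- readLines(txt_file)
head(txt)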
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
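For example (a minimal sketch with a hypothetical file name), pdf_text() returns a character vector with one element per page:
library(pdftools)
pages <- pdf_text("document.pdf")  # hypothetical file
cat(pages[1])                      # text of the first page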
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install it, upload the PDF, and select the table in the PDF whose data you want to extract. It's not a direct solution in R, but it's certainly better than manual labor.
A purely R solution could be:
library('tm')
file <- 'namefile.pdf'
# build a PDF reader that passes the -layout option to the underlying conversion engine
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
# extract the lines of the first (and only) document from the corpus
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF lines in an array.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around a command-line Java application, the tabula-extractor JAR package.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
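A minimal sketch of that workflow (the file name, page number, and area values below are placeholders, not from the original post):
library(tabulizer)
# let tabula guess where the tables are in a (hypothetical) PDF
tables <- extract_tables("report.pdf")
# or restrict extraction to page 2 and a specific area, given as c(top, left, bottom, right) in points
tables_p2 <- extract_tables("report.pdf", pages = 2,
                            area = list(c(100, 50, 500, 550)), guess = FALSE)
str(tables)  # a list with one matrix per detected table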
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert each PDF to text:
exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"
for(i in 1:length(pdfFracList)){
fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)
pdfSource <- paste0(reportDir,"/", fileNumber, ".pdf")
txtDestination <- paste0(reportDir,"/", fileNumber, ".txt")
print(paste0("File number ", i, ", Processing file ", pdfSource))
system(paste(exeFile, "-table" , pdfSource, txtDestination, sep = " "), wait = TRUE)
}
I would like to create a Corpus for a collection of downloaded HTML files, and then read them into R for future text mining.
Essentially, this is what I want to do:
Create a Corpus from multiple html files.
I tried to use DirSource:
library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))
but it returns "invalid directory parameters"
Read in html files from the Corpus all at once.
Not sure how to do it.
Parse them, convert them to plain text, remove tags.
Many people suggested using the XML package; however, I didn't find a way to process multiple files. The examples are all for a single file.
Thanks very much.
This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO), and I've made a corpus out of them, then a document-term matrix, and then done a few trivial text mining tasks.
# get data
setwd("C:/Downloads/html") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(getURL("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R",
                 ssl.verifypeer = FALSE), con = "htmlToText.R")
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))
# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))
# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, PlainTextDocument)
a <- tm_map(a, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,]
inspect(a.dtm2)
# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean")
fit <- hclust(d, method="ward")
plot(fit)
# just for fun...
library(wordcloud)
library(RColorBrewer)
m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE)
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))
This will correct the error:
b <- Corpus(a,  # changed DirSource(a) to a
            readerControl = list(language = "eng", reader = readPlain))
But I think that to read your HTML you need to use an XML reader. Something like:
r <- Corpus(DirSource('c:/test'),
            readerControl = list(reader = readXML), spec)
But you need to supply the spec argument, which depends on your file structure.
See, for example, readReut21578XML; it is a good example of an XML/HTML parser.
To read all the html files into an R object you can use
# Set variables
folder <- 'C:/test'
extension <- '.htm'
# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)
# Read all the files into a list
htmls <- lapply(X = files,
                FUN = function(file){
                  .con <- file(description = paste(folder, file, sep = '/'))
                  .html <- readLines(.con)
                  close(.con)
                  names(.html) <- file
                  .html
                })
That will give you a list, and each element is the HTML content of each file.
I'll post later on parsing it, I'm in a hurry.
I found the package boilerpipeR particularly useful for extracting only the "core" text of an HTML page.
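A rough sketch of how it can be used (the file name is hypothetical, and the choice of extractor is an assumption; boilerpipeR ships several, such as DefaultExtractor and ArticleExtractor):
library(boilerpipeR)
# collapse one HTML file into a single string and extract its main body text
html_string <- paste(readLines("somefile.html"), collapse = "\n")  # hypothetical file
core_text <- DefaultExtractor(html_string)
cat(core_text)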