File compression and storage of HTML content - html

For HTML content retrieved via R, I wonder what (other) options I have with respect to either
file compression (maximum compression rate / minimum file size; the time it takes to compress is of secondary importance) when saving the content to disk, or
most efficiently storing the content (by whatever means, OS filesystem or DBMS).
My current finding is that gzfile offers the best compression rate in R. Can I do better? For example, I tried getting rid of unnecessary whitespace in the HTML code before saving, but it seems gzfile already takes care of that, as I don't end up with smaller files in comparison.
Extended curiosity question:
How do search engines handle this problem? Or are they throwing away the code as soon as it has been indexed and thus something like this is not relevant for them?
Illustration
Getting example HTML code:
url_current <- "http://cran.at.r-project.org/web/packages/available_packages_by_name.html"
html <- readLines(url(url_current))
Saving to disk:
path_txt <- file.path(tempdir(), "test.txt")
path_gz <- gsub("\\.txt$", ".gz", path_txt)
path_rdata <- gsub("\\.txt$", ".rdata", path_txt)
path_rdata_2 <- gsub("\\.txt$", "_raw.rdata", path_txt)
write(html, file=path_txt)
write(html, file=gzfile(path_gz, "w"))
save(html, file=path_rdata)
html_raw <- charToRaw(paste(html, collapse="\n"))
save(html_raw, file=path_rdata_2)
Trying to remove unnecessary whitespace:
html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
path_gz_2 <- gsub("\\.txt$", "_2.gz", path_txt)
write(html_2, gzfile(path_gz_2, "w"))
html_2 <- gsub("\\n", "", html_2)
path_gz_3 <- gsub("\\.txt$", "_3.gz", path_txt)
write(html_2, gzfile(path_gz_3, "w"))
Resulting file sizes:
files <- list.files(dirname(path_txt), full.names=TRUE)
fsizes <- file.info(files)$size
names(fsizes) <- sapply(files, basename)
> fsizes
       test.gz     test.rdata       test.txt      test_2.gz      test_3.gz 
        164529         183818         849647         164529         164529 
test_raw.rdata 
        164608 
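For comparison, R also ships bzip2 and xz connections; the following sketch (an addition to the original test, reusing the paths defined above) checks whether they beat gzfile on this file. xz at its highest level usually wins on text, at the cost of compression time:
path_bz <- gsub("\\.txt$", ".bz2", path_txt)
path_xz <- gsub("\\.txt$", ".xz", path_txt)
## bzip2 and xz analogues of the gzfile() call above
write(html, file = bzfile(path_bz, "w"))
write(html, file = xzfile(path_xz, "w", compression = 9))
file.info(c(path_gz, path_bz, path_xz))$size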
Checking validity of processed HTML code:
require("XML")
html_parsed <- htmlParse(html)
> xpathSApply(html_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML
html_2_parsed <- htmlParse(readLines(gzfile(path_gz_2)))
> xpathSApply(html_2_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML
html_3_parsed <- htmlParse(readLines(gzfile(path_gz_3)))
> xpathSApply(html_3_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML

html_2 <- gsub(">\\s*<", "", html)
strips away the > and < characters along with the whitespace. Instead try:
html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
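A quick illustration of the difference on a toy string (not from the original post):
html_x <- "<p>  <b>"
gsub(">\\s*<", "", html_x)            ## "<pb>"   -- tag delimiters are destroyed
gsub("(>)\\s*(<)", "\\1\\2", html_x)  ## "<p><b>" -- only the whitespace is removed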

Related

R - Issue with the DOM of the Danish parliament (webscraping)

I've been working on a webscraping project for the political science department at my university.
The Danish parliament is very transparent about its democratic process and uploads all legislative documents to its website. I've been crawling all pages going back to 2008. Right now I'm parsing the information into a dataframe, and I've run into an issue that I have not been able to resolve so far.
If we look at the DOM, we can see that most of the objects are named div.tingdok-normal. The number of objects varies between 16 and 19. To parse the information correctly for my dataframe, I tried to grep out the necessary parts according to patterns. However, the issue is that sometimes my patterns match more than once, and I don't know how to tell R that I only want the first match.
For the sake of an example, I include some code:
library(RCurl)   # getURL
library(rvest)   # read_html, html_nodes, html_text, %>%
final.url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
to.save <- getURL(final.url)
p <- read_html(to.save)
normal <- p %>% html_nodes("div.tingdok-normal > span") %>% html_text(trim = TRUE)
tomatch <- c("Forkastet regeringsforslag", "Forkastet privat forslag", "Vedtaget regeringsforslag", "Vedtaget privat forslag")
type <- unique(grep(paste(tomatch, collapse = "|"), normal, value = TRUE))
Maybe you can help me with that.
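A minimal way to keep only the first match, assuming that is all you need, is to index the grep() result, since grep() returns hits in document order:
first_type <- grep(paste(tomatch, collapse = "|"), normal, value = TRUE)[1]  ## NA if nothing matches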
My understanding is that you want to extract the text of the webpage, because the "tingdok-normal" are related to the text. I was able to get the text of the webpage with the following code. Also, the following code identifies the position of the first "regex hit" of the different patterns to match.
library(pagedown)
library(pdftools)
library(stringr)
pagedown::chrome_print("https://www.ft.dk/samling/20161/lovforslag/l154/index.htm",
"C:/.../danish.pdf")
text <- pdftools::pdf_text("C:/.../danish.pdf")
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1:nb_Tomatch)
{
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(text, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = text,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}
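A more compact variant, assuming you only need the matched substring and not its position: stringr::str_extract() returns the first hit per page directly.
## first match of each pattern in every element of text (NA where absent)
list_Text_first <- lapply(tomatch, function(p) stringr::str_extract(text, p))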
Here is another approach:
library(RDCOMClient)
library(stringr)
library(rvest)
url <- "https://www.ft.dk/samling/20161/lovforslag/l154/index.htm"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)
Sys.sleep(5)
doc <- IEApp$Document()
html_Content <- doc$documentElement()$innerText()
tomatch <- c("(A|a)ftalen", "(O|o)pholdskravet")
nb_Tomatch <- length(tomatch)
list_Position <- list()
list_Text <- list()
for(i in 1:nb_Tomatch)
{
  # Locates the first hit of the regex
  # To locate all regex hits, use stringr::str_locate_all
  list_Position[[i]] <- stringr::str_locate(html_Content, pattern = tomatch[i])
  list_Text[[i]] <- stringr::str_sub(string = html_Content,
                                     start = list_Position[[i]][1, 1],
                                     end = list_Position[[i]][1, 2])
}

How to read a <li> table in a webpage

I have debugged the program many times to get the result as follows:
url 研究所知识库列表
/handle/1471x/1 力学研究所
/handle/1471x/8865 半导体研究所
However, no matter what parameters I use, the result is not correct. The content of this table is part of the basis for my further analysis, and I am quite anxious about it. I'm sincerely looking forward to your help.
## download community-list ---the 1st level of IR Grid
#loading webpage and analyzing
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
# get table specs
tableNodes <- getNodeSet(com_parsed, "//table")
com_tb<-readHTMLTable(tableNodes[[8]], header=TRUE)
# get External links
xpath <- "//a/@href"
getHTMLExternalFiles(tableNodes[[8]], xpQuery = xpath)
It is unclear exactly what you want your end result to look like, but if you modify your XPath statements a bit to take advantage of the DOM structure, you can get something like this:
library(XML)
community_url<-"http://www.irgrid.ac.cn/community-list"
com_source <- readLines(community_url, encoding = "UTF-8")
com_parsed <- htmlTreeParse(com_source, encoding = "UTF-8", useInternalNodes = TRUE)
list_header <- xpathSApply(com_parsed, '//table[.//li]//h1', xmlValue)
hrefs <- xpathSApply(com_parsed, '//li[@class="communityLink"]//@href', function(x) unname(x))
display_text <- xpathSApply(com_parsed, '//li[@class="communityLink"]//a', xmlValue)
table_data <- cbind(display_text, hrefs)
colnames(table_data) <- c(list_header, "url")
table_data
The console output caused Stack Overflow to flag this answer as spam, so the original answer included a screenshot of it instead.

Persisting HTML documents to disk

I am trying to save about 300 HTML objects to disk using R.
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
xml2::write_xml(read_html_test1, "testwrite.html")
read_html <- xml2::read_html("testwrite.html")
But this will eventually save about 300 separate files to disk. Ideally, what I would like is to save a single R object to disk that contains these 300 documents.
Converting each document to text before saving does not work, for some reason. For example, the following will produce a weird (unhelpful) error:
str_html <- as.character(read_html_test1)
xml2::read_html(str_html)
If I try to use the output of xml2::read_html(), it is a pointer to a C structure, and therefore it will not persist to disk.
Any suggestions for a hack to make this work...?
I managed it with the httr package, whose content function can take an as = "text" argument, which stops it from parsing the HTML.
library(xml2)
library(httr)
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
# use `GET` to make the request, and pull out the html with `content`; returns text string
x <- content(GET(str_url), as = 'text')
# make a list of html documents to save
list_xs <- list(x, x)
# save list with `saveRDS`
saveRDS(list_xs, 'test.rds')
Now to see if it works:
# read in rds file we saved
saved_html <- readRDS('test.rds')
# parse the second element in it with `xml2::read_html`
saved_x_parsed <- read_html(saved_html[[2]])
# and let's see...
saved_x_parsed
# {xml_document}
# <html>
# [1] <head><title>\n\tNew Zealand holiday homes, baches and vacation homes for rent.\ ...
# [2] <body id="ctl00_Body" class="Page-List">\n    <div class="SatNavBarPlaceholder"/>&#13 ...
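To extend this to the roughly 300 pages from the question, a sketch along the same lines (urls is assumed to be your vector of page addresses, and the one-second delay is only a politeness assumption):
pages <- lapply(urls, function(u) {
  Sys.sleep(1)                    ## small delay between requests
  content(GET(u), as = "text")    ## fetch and keep as raw text
})
saveRDS(pages, "pages.rds")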
How to save R objects to disk:
Save R Objects
I took your example code and produced working, human-readable, R-loadable output as follows:
str_url <- "https://www.holidayhouses.co.nz/Browse/List.aspx?page=1"
read_html_test1 <- xml2::read_html(str_url)
str_html <- as.character(read_html_test1)
x <- xml2::read_html(str_html)
save(x, file = "c:\\temp\\text.txt", compress = FALSE, ascii = TRUE)

R HTML clean up - how to get rid of strange characters in output?

I'm using R to clean up HTML files stored on my hard drive and then export them as txt files. However, in the output text files I see a lot of strange character sequences such as <U+0093>, <U+0094>, etc. It seems to me that quote marks or bullet points (or maybe some other characters) are not parsed/displayed correctly. How do I fix this issue?
Here is the original HTML file
Below is the code I've been using:
library(bitops)
library(RCurl)
library(XML)
rawHTML <- paste(readLines("2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n")
doc = htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
write.table(plain.text, file="2488.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
If you just need the text, you can do a conversion to ASCII with iconv. Also, you don't need to use write.table for this, as writeLines will do nicely:
library(bitops)
library(RCurl)
library(XML)
rawHTML <- paste(readLines("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm"), collapse="\n")
doc <- htmlParse(rawHTML, asText=TRUE, encoding="UTF-8")
plain.text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
writeLines(iconv(plain.text, to="ASCII"), "~/Dropbox/2488wl.txt")
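One caveat worth adding here: iconv() returns NA for any element it cannot fully convert unless you supply sub=, and the //TRANSLIT extension approximates characters (e.g. smart quotes) instead of dropping them:
## transliterate where possible, replace anything unmappable with a space
clean <- iconv(plain.text, from = "UTF-8", to = "ASCII//TRANSLIT", sub = " ")
writeLines(clean, "~/Dropbox/2488tl.txt")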
You could also use rvest (you still need iconv):
library(xml2)
library(rvest)
pg <- read_html("~/Dropbox/2488-R20130221-C20121229-F22-0-1.htm")
target <- "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]"
pg %>%
  html_nodes(xpath = target) %>%
  html_text() %>%
  iconv(to = "ASCII") %>%
  writeLines("~/Dropbox/2488rv.txt")
You can also avoid pipes if you want to:
converted <- iconv(html_text(html_nodes(pg, xpath=target)), to="ASCII")
writeLines(converted, "~/Dropbox/2488rv.txt")

How to click links onto the next page using RCurl?

I am trying to scrape this table from this website using RCurl. I am able to do this and put it into a nice dataframe using the code:
clinVar <- getURL("http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1")
docForm2 <- htmlTreeParse(clinVar,useInternalNodes = T)
xp_expr = "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq", "Phenotype","Clinical significance","Status", "Chr","Location")
However, I can only extract the data on the first page, and the table spans multiple pages. How do you access data on the next page? I have looked at the HTML code for the website and the region that the "Next" button exists in is here (I believe!):
<a name="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" href="#" sid="3" page="3" accesskey="k" id="EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.Page">Next ></a>
I would like to know how to access this link using getURL, postForm, etc. I think I should be doing something like the following to get data from the second page, but it's still just giving me the first page:
url <- "http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1"
clinVar <- postForm(url,
                    "EntrezSystem2.PEntrez.clinVar.clinVar_Entrez_ResultsPanel.Entrez_Pager.cPage" = "2")
docForm2 <- htmlTreeParse(clinVar,useInternalNodes = T)
xp_expr = "//table[@class='jig-ncbigrid docsum_table']/tbody/tr"
nodes = getNodeSet(docForm2, xp_expr)
extractedData <- xmlToDataFrame(nodes)
colnames(extractedData) <- c("Info","Gene", "Variation","Freq","Phenotype","Clinical significance","Status", "Chr","Location")
Thanks to anyone who can help.
I would use E-utilities to access data at NCBI instead.
url <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=brca1"
readLines(url)
[1] "<?xml version=\"1.0\" ?>"
[2] "<!DOCTYPE eSearchResult PUBLIC \"-//NLM//DTD eSearchResult, 11 May 2002//EN\" \"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd\">"
[3] "<eSearchResult><Count>1080</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_36649974_130.14.18.34_9001_1386348760_356908530</WebEnv><IdList>"
Pass the QueryKey and WebEnv to esummary and get the XML summary (these change with each esearch, so copy and paste the new keys into the url below)
url2 <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&query_key=1&WebEnv=NCID_1_36649974_130.14.18.34_9001_1386348760_356908530"
brca1 <- xmlParse(url2)
Next, view a single record and then extract the fields you need. You may need to loop through the set for tags that can have zero to many values; others, like the clinical significance description, always have exactly one value.
getNodeSet(brca1, "//DocumentSummary")[[1]]
table(xpathSApply(brca1, "//clinical_significance/description", xmlValue) )
                Benign  conflicting data from submitters  not provided  other 
                   129                                22             6      1 
            Pathogenic           probably not pathogenic  probably pathogenic  risk factor 
                   508                                68                   19           43 
Uncertain significance 
                   284 
Also, there are many packages wrapping the E-utilities on GitHub and BioC (rentrez, reutils, genomes and others). Using the genomes package from BioC, this simplifies to
brca1 <- esummary( esearch("brca1", db="clinvar"), parse=FALSE )
Using the e-utilities feature on the NCBI database, see http://www.ncbi.nlm.nih.gov/books/NBK25500/ for more details.
## use eSearch feature in eUtilities to search NCBI for ids corresponding to each row of data.
## note: to see all ids, not just the top 10, set retmax to a high number
## to get query id and web env info, set usehistory=y
library(RCurl)
library(XML)
baseSearch <- ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=") ## eSearch
db <- "clinvar" ## database to query
gene <- "BRCA1" ## gene of interest
query <- paste('[gene]+AND+"','clinsig pathogenic"','[Properties]+AND+"','single nucleotide variant"','[Type of variation]&usehistory=y&retmax=1110',sep="") ## query, see below for details
baseFetch <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=" ## base fetch
searchURL <- paste(baseSearch,db, "&term=",gene,query,sep="")
getSearch <- getURL(searchURL)
searchHTML <- htmlTreeParse(getSearch, useInternalNodes = T)  ## parse the fetched search result
nodes <- getNodeSet(searchHTML,"//querykey") ## this name "querykey" was extracted from the HTML source code for this page
querykey <- xmlToDataFrame(nodes)
nodes <- getNodeSet(searchHTML,"//webenv") ## this name "webenv" was extracted from the HTML source code for this page
webenv <- xmlToDataFrame(nodes)
fetchURL <- paste(baseFetch,db,"&query_key=",querykey,"&WebEnv=",webenv[[1]],"&rettype=docsum",sep="")
getFetch <- getURL(fetchURL)
fetchHTML <- htmlTreeParse(getFetch, useInternalNodes =T)
nodes <- getNodeSet(fetchHTML, "//position")
extractedDataAll <- xmlToDataFrame(nodes)
colnames(extractedDataAll) <- c("pathogenicSNPs")
print(extractedDataAll)
Please note, I found the query information by going to http://www.ncbi.nlm.nih.gov/clinvar/?term=BRCA1, selecting my filters (pathogenic, etc.) and then clicking the advanced button. The most recently applied filters should come up in the main box; I used this for the query.
ClinVar now offers XML download of the whole database so webscraping is not necessary.
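For completeness, a sketch of that route; the exact release file name under the ClinVar FTP area changes over time, so the URL below is an assumption to be checked against the directory listing first:
## hypothetical file name -- verify under ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/
xml_url <- "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz"
download.file(xml_url, "clinvar.xml.gz", mode = "wb")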