Encoding Issue in R htmlParse XML

Encoding Issue in R htmlParse XML - html

I try to scrape a website but can't handle this encoding issue:
# putting together the url:
search_str <- "allintitle:amphibian richness OR diversity"
url <- paste("http://scholar.google.at/scholar?q=",
search_str, "&hl=en&num=100&as_sdt=1,5&as_vis=1", sep = "")
# get content and parse it:
doc <- htmlParse(url)
# encoding isssue, like here..
xpathSApply(doc, '//div[#class="gs_a"]', xmlValue)
[1] "M Vences, M Thomasâ€¦ - â€¦ of the Royal â€¦, 2005 - rstb.royalsocietypublishing.org"
[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
[4] "LB Buckley, W Jetz - Proceedings of the Royal â€¦, 2007 - rspb.royalsocietypublishing.org"
[5] "MÃ RodrÃguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
[7] "D Vallan - Journal of Tropical Ecology, 2002 - Cambridge Univ Press"
[8] "MO RÃ¶del, R Ernst - Ecotropica, 2004 - gtoe.de"
# ...
any pointers?
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
[3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RCurl_1.91-1.1 bitops_1.0-4.1 XML_3.9-4.1
loaded via a namespace (and not attached):
[1] tools_2.15.1
> getOption("encoding")
[1] "native.enc"

This worked to some degree for me
doc <- htmlParse(url,encoding="UTF-8")
head(xpathSApply(doc, '//div[#class="gs_a"]', xmlValue))
#[1] "M Vences, M Thomas… - … of the Royal …, 2005 - rstb.royalsocietypublishing.org"
#[2] "PB Pearman - Conservation Biology, 1997 - Wiley Online Library"
#[3] "D Vallan - Biological Conservation, 2000 - Elsevier"
#[4] "LB Buckley, W Jetz - Proceedings of the Royal …, 2007 - rspb.royalsocietypublishing.org"
#[5] "MÁ Rodríguez, JA Belmontes, BA Hawkins - Acta Oecologica, 2005 - Elsevier"
#[6] "TJC Beebee - Biological Conservation, 1997 - Elsevier"
thou
xpathSApply(doc, '//div[#class="gs_a"]', xmlValue)[[81]]
was displaying incorrectly on my windows box for example.
switching to Font DotumChe using GUI preferences however showed it displaying correctly so it may just be a display issue not a parsing one.

Related

How to correctly identify html node

I want to scrape the price of a product on a webshop, but I struggle to correctly allocate the correct nodes to the price i want to scrape.
The relevant part of my code looks like this:
"https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/"%>%
read_html()%>%
html_nodes('span.woocommerce-Price-amount.amount')%>%
html_text()
When executing this code, I do get prices as a result, but not the ones i want (it shows the prices of other produts that are listed beneath.
How can I now correctly identify the node to the price of the product itself (375.-)

First: I don't know R.
This page uses JavaScript to add this price in HTML
but I don't know if rvest can run JavaScript.
But I found this value in <form data-product_variations="..."> as JSON
and I could display prices for all options:
data <- "https://www.surfdeal.ch/produkt/2019-aqua-marina-fusion-orange/" %>%
read_html() %>%
html_nodes('form.variations_form.cart') %>%
html_attr('data-product_variations') %>%
fromJSON
data$display_price
data$regular_price
data$image$title
Result:
> data$display_price
[1] 479 375 439 479 479
> data$display_regular_price
[1] 699 549 629 699 699
> data$image$title
[1] "aqua marina fusion bamboo padddel"
[2] "aqua marina fusion aluminium padddel"
[3] "aqua marina fusion carbon padddel"
[4] "aqua marina fusion hibi padddel"
[5] "aqua marina fusion silver padddel"
> colnames(data)
[1] "attributes" "availability_html" "backorders_allowed"
[4] "dimensions" "dimensions_html" "display_price"
[7] "display_regular_price" "image" "image_id"
[10] "is_downloadable" "is_in_stock" "is_purchasable"
[13] "is_sold_individually" "is_virtual" "max_qty"
[16] "min_qty" "price_html" "sku"
[19] "variation_description" "variation_id" "variation_is_active"
[22] "variation_is_visible" "weight" "weight_html"
[25] "is_bookable" "number_of_dates" "your_discount"
[28] "gtin" "your_delivery"
EDIT:
To work with page which uses JavaScript you may need other tools - like phantomjs
How to Scrape Data from a JavaScript Website with R | R-bloggers

How do I split a txt file by html tags or regex in order to save it as separate txt files in R?

I have the output of a LexisNexis batch download of news articles in both html and txt format. The file itself contains the headers, metadata, and body of several different news articles that I need to systematically separate and save as independent txt files. The head of the txt version looks like:
> head(textz, 100)
[1] "ï»¿"
[2] " 1 of 103 DOCUMENTS"
[3] ""
[4] ""
[5] " Foreign Affairs"
[6] ""
[7] " May 2013 - June 2013"
[8] ""
[9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"
[10] ""
[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"
[16] "Center of Excellence at Fort Benning, Georgia."
[17] ""
[18] "SECTION: Vol. 92 No. 4 PAGE: 129"
[19] ""
[20] "LENGTH: 2856 words"
[21] ""
[22] ""
[23] "Ever since World War II, the United States has depended on armored forces --"
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....
A snapshot of the html version looks like this:
<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>
The unique documents are separated by the "[0-9] of [0-9] DOCUMENTS" lines in each, but between the grep family and strsplit I have been unable to find a way to split the txt (or html) file in R in a way that cleanly separates the component articles and allows me to save them as independent txt files. A thorough search of other questions threads was either unhelpful or required use of Python. Any advice would be great!

library rvest makes it easy parse html. Your documents are not quite consistent with the <DOCFULL> and <DOC NUMBER > headers. The answer below uses your provided document extended to show the next document (104). You can use the lapply structure to do other things like write a text file per article. Note the css selector in html_nodes. There doesn't seem to be much structure in the html, but if you find some patterns you could target bits of each article with selectors.
library(rvest)
library(stringr)
articles <- str_replace_all(doc, "\\n", " ") %>% # remove new line to simplify
str_replace_all("<DOCFULL>\\s+\\-\\->", " " ) %>% # remove redundant header
strsplit("<DOC NUMBER=\\d+>") %>% # split on DOC NUMBER header
unlist() # to a vector
# drop the first empty result form the split
articles <- articles[-1]
# use lapply to travers all articles.
c2_texts <- lapply(articles, function (article) {
article %>%
read_html() %>% # character input parsed as html
html_nodes(css=".c2") %>% # find nodes with CSS selector, ex: c2
html_text() }) # extract text from within the node
c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"
# [2] "The New York Times"
# [3] " 26, 2011 Tuesday"
# [4] "Â "
# [5] "Â Late Edition - Final"
# [6] "By MIKE MULLEN. "
# [7] "Mike Mullen, a "
# [8] " is the chairman of the Joint Chiefs of Staff. "
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"
# [10] "794 words"
# [11] "Washington"
# [12] "THE military relationship between the United States and China is one of the worlds most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together. "
#
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"

To split the txt version, assuming text is in doc_text, and write each to a sequentially named files .txt, file2.txt etc.
lapply for file writing adapted from #P Lapointe
texts <- unlist(strsplit(doc_text, "\\s+\\d+\\sof\\s\\d+\\sDOCUMENTS") )
texts <- texts[-1] # drop the first empty split
lapply (1:length(texts), function(i){ write(texts[i], paste0("file", i, ".txt"))})

Convert HTML Entity to proper character R

Does anyone know of a generic function in r that can convert ä to its unicode character â? I have seen some functions that take in â, and convert it to a normal character. Any help would be appreciated. Thanks.
Edit: Below is a record of data, which I probably have over 1 million records. Is there an easier solution other than reading the data into a massive vector, and for each element, changing the records?
wine/name: 1999 Domaine Robert Chevillon Nuits St. Georges 1er Cru Les Vaucrains
wine/wineId: 43163
wine/variant: Pinot Noir
wine/year: 1999
review/points: N/A
review/time: 1337385600
review/userId: 1
review/userName: Eric
review/text: Well this is awfully gorgeous, especially with a nicely grilled piece of Copper River sockeye. Pine needle and piercing perfume move to a remarkably energetic and youthful palate of pure, twangy, red fruit. Beneath that is a fair amount of umami and savory aspect with a surprising amount of tannin. Lots of goodness here. Still quite young but already rewarding at this stage.
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!
Update:
Using the function stri_trans_general function will convert any Â to a correct lowercase character, and vapply results need to be assigned to save changes.
#cellartracker-10records is the test file to use
tester <- "/Users/petergensler/Desktop/Wine Analysis/cellartracker-10records.txt"
decode <- function(x) { xmlValue(getNodeSet(htmlParse(tester), "//p")[[1]]) }
#Using vector, as we want to iterate over the raw file for cleaning
poop <- vapply(tester, decode, character(1), USE.NAMES = FALSE)
#Now use stringi to convert all characters to correct characters poop
poop <- stringi::stri_trans_general(poop, "Latin-ASCII")
writeLines(poop, "wines.txt")

Here's one way via the XML package:
txt <- "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
The [[1]] bit is because getNodeSet() returns a list of parsed elements, even if there is only one element as is the case here.
This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.
If you want to run this for a character vector of length >1, then lapply() this:
txt <- rep(txt, 2)
decode <- function(x) {
xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)
or if you want it as a vector, vapply():
> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
For the multi-line example, use the original version, but you have to write the character vector back out to a file if you want it as a multiline document again:
txt <- "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"
out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
This gives me
> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"
Which if you write out using writeLines()
writeLines(out, "wines.txt")
You'll get a text file, which can be read in again using your other parsing code:
> readLines("wines.txt")
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
[2] "Riesling Spätlese"
[3] "wine/wineId: 3058"
[4] "wine/variant: Riesling"
[5] "wine/year: 2001"
[6] "review/points: N/A"
[7] "review/time: 1095120000"
[8] "review/userId: 1"
[9] "review/userName: Eric"
[10] "review/text: Hideously corked!"
And it is a file (from my BASH terminal)
$ cat wines.txt
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!

trying to Upload .rmd file to wordpress

I'm having trouble uploading .rmd file to wordpress. I'm not exactly sure what's going on but the error suggests I don't have privileges to remotely publish to wordpress even though from what I understand Wordpress allows remote publishing even for free accounts. I've searched all the wordpress R queries on stack overflow and nothing seems to work. Here's my work flow:
devtools:::install_github("duncantl/RWordPress", force=T)
library(RWordPress)
# Set login parameters (replace admin,password and blog_url!)
options(WordPressLogin = c(admin = 'password'), WordPressURL = 'blog_url/xmlrpc.php')
library(markdown)
library(knitr)
options(markdown.HTML.options = c(markdownHTMLOptions(default = T),"toc"))
# Upload plots: set knitr options
opts_knit$set(upload.fun = function(file){library(RWordPress);uploadFile(file)$url;})
postThumbnail <- RWordPress::uploadFile("File.rmd",overwrite = TRUE)
That produces the following error:
Error: faultCode: 401 faultString: You do not have permission to upload files.
I also tried the following:
knit2wp('fake.rmd', title = 'TITLE', publish = FALSE)
And that produces the same error.
Here's my session info:
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] ggplot2_2.1.0 rmarkdown_1.0 knitr_1.13
[4] markdown_0.7.7 RWordPress_0.2-3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.5 formatR_1.4
[3] plyr_1.8.3 bitops_1.0-6
[5] base64enc_0.1-3 tools_3.3.0
[7] digest_0.6.10 jsonlite_1.0
[9] evaluate_0.9 tibble_1.1
[11] gtable_0.2.0 viridisLite_0.1.3
[13] lattice_0.20-33 png_0.1-7
[15] DBI_0.4-1 mapproj_1.2-4
[17] proto_0.3-10 gridExtra_2.2.1
[19] dplyr_0.5.0 httr_1.2.1
[21] stringr_1.0.0 caTools_1.17.1
[23] RgoogleMaps_1.2.0.7 htmlwidgets_0.7
[25] maps_3.1.0 grid_3.3.0
[27] R6_2.1.2 jpeg_0.1-8
[29] plotly_4.1.0 XML_3.98-1.4
[31] RSelenium_1.4.2 RJSONIO_1.3-0
[33] sp_1.2-3 ggmap_2.6.1
[35] tidyr_0.5.1 reshape2_1.4.1
[37] magrittr_1.5 XMLRPC_0.3-0
[39] scales_0.4.0 htmltools_0.3.5
[41] assertthat_0.1 formattable_0.2
[43] colorspace_1.2-6 geosphere_1.5-1
[45] labeling_0.3 stringi_1.0-1
[47] RCurl_1.95-4.8 lazyeval_0.2.0
[49] munsell_0.4.3 rjson_0.2.15
I'd also like to note, I checked the password and username and they're both correct (if I enter incorrect information I get a different error indicating that). I've also gotten a similar error trying user written functions:
Error: faultCode: 401 faultString: Sorry, you are not allowed to publish posts on this site.
By the way, when I run getUsersBlogs() I get:
$isAdmin
[1] TRUE
$isPrimary
[1] TRUE
$url
[1] "https://blogname.wordpress.com/"
$blogid
[1] "115210981"
$blogName
[1] "Site Title"
$xmlrpc
[1] "https://blogname.wordpress.com/xmlrpc.php"

As implied by #Lloyd Christmas, the problem is with your specification of options. If you change "WordPressURL" to "WordpressURL", you'll probably be fine.

Extract contents within html tags using R

I am now trying to extract contents between specific html tags, e.g.:
<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22ADB%22&as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22AMS%22&as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67.</dd>
...
</dl>
link
I plan to extract contents within <h2> </h2> and contents within <dd> and </dd>. I searched the stackOverFlow for similar questions, but still cannot figure it out, is there anybody who has a simple way to solve this question using R?

This creates a two column matrix m whose first column is h2 and whose second column is associated dd values. Since there is no information in the question on the form of the input we have assumed that the input is a string Lines but the htmlTreeParse line can be changed appropriately if not. Try ?htmlTreeParse for more info.
library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
f <- function(x) cbind(h2 = xmlValue(x), dd = xpathSApply(x, "//dd", xmlValue))
L <- xpathApply(doc, "//h2", f)
m <- do.call(rbind, L)
Here we display the h2 column and the first 10 characters of the dd column:
> cbind(h2 = m[,1], dd = substr(m[,2], 1, 10))
h2 dd
[1,] "ADB" "Allgemeine"
[2,] "ADB" "American m"
[3,] "ADB" "Abbott, Ch"
[4,] "AMS" "Allgemeine"
[5,] "AMS" "American m"
[6,] "AMS" "Abbott, Ch"
[7,] "Abbott, C. C. 1861" "Allgemeine"
[8,] "Abbott, C. C. 1861" "American m"
[9,] "Abbott, C. C. 1861" "Abbott, Ch"
This is the input used above:
Lines <- '<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22ADB%22&as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22AMS%22&as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67.</dd>
</dl>'

Or, doing the scraping the proper way:
library(xml2)
library(rvest)
pg <- read_html("https://www.darwinproject.ac.uk/bibliography")
h2 <- html_text(html_nodes(pg, "dt > h2"))
head(h2)
## [1] "ADB" "AMS"
## [3] "Abbott, C. C. 1861" "Abich, O. H. W. 1841"
## [5] "Accum, Frederick. 1820" "Acevedo Moraga, Fernando. 1987"
dd <- html_text(html_nodes(pg, "dd"))
head(dd)
## [1] "Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912."
## [2] "American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27."
## [3] "Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67."
## [4] "Abich, Otto Hermann Wilhelm. 1841. Geologische Betrachtungen über die vulkanischen Erscheinungen und Bildungen in Unter- und Mittel-Italien. Braunschweig."
## [5] "Accum, Frederick. 1820. A treatise on the art of brewing, exhibiting the London practice of brewing porter, brown stout, ale, table beer, and various other kinds of malt liquors. London: Longman, Hurst, Rees, Orme, and Brown."
## [6] "Acevedo Moraga, Fernando. 1987. La Escuela de Minas de la Serena. In La Serena University, edited by Claudo Canut de Bon: 1–18. Chile."
I feel compelled to include a snippet from their ToS:
Subject to statutory allowances, extracts of material from the site may be accessed, downloaded and printed for your personal and non-commercial use and you may draw the attention of others within your organisation to material posted on the site. You may not:
use any part of the material on the site for direct or indirect commercial purposes or advantage without obtaining a licence to do so from the University or its licensors
you may not modify or alter the paper or digital copies of any material printed off or downloaded in any way
sell, resell, license, transfer, transmit, display in any form, perform, hire, lease or loan any content in whole or in part printed or downloaded from the site
systematically extract and/or re-utilise substantial parts of the content or material on the site
create and/or publish your own database that features substantial parts of this site.
If you print, copy, download or use any part of the site in breach of these terms of use, your right to use the site will cease immediately and you must at the option of the University return or destroy any copies of the material you have made.

htmlpattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
plain.text <- gsub(htmlpattern, "\\1", txt)
cat(plain.text)
Note : txt is html text

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Encoding Issue in R htmlParse XML - html

Related

How to correctly identify html node

How do I split a txt file by html tags or regex in order to save it as separate txt files in R?

Convert HTML Entity to proper character R

trying to Upload .rmd file to wordpress

Extract contents within html tags using R

Categories

Resources