Extracting plain text from html files in R [duplicate] - html

I'm trying to read web page source into R and process it as strings. I'm trying to take the paragraphs out and remove the html tags from the paragraph text. I'm running into the following problem:
I tried implementing a function to remove the html tags:
cleanFun=function(fullStr)
{
#find location of tags and citations
tagLoc=cbind(str_locate_all(fullStr,"<")[[1]][,2],str_locate_all(fullStr,">")[[1]][,1]);
#create storage for tag strings
tagStrings=list()
#extract and store tag strings
for(i in 1:dim(tagLoc)[1])
{
tagStrings[i]=substr(fullStr,tagLoc[i,1],tagLoc[i,2]);
}
#remove tag strings from paragraph
newStr=fullStr
for(i in 1:length(tagStrings))
{
newStr=str_replace_all(newStr,tagStrings[[i]][1],"")
}
return(newStr)
};
This works for some tags but not all tags, an example where this fails is following string:
test="junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
The goal would be to obtain:
cleanFun(test)="junk junk junk junk"
However, this doesn't seem to work. I thought it might be something to do with string length or escape characters, but I couldn't find a solution involving those.

This can be achieved simply through regular expressions and the grep family:
cleanFun <- function(htmlString) {
return(gsub("<.*?>", "", htmlString))
}
This will also work with multiple html tags in the same string!
This finds any instances of the pattern <.*?> in the htmlString and replaces it with the empty string "". The ? in .*? makes it non greedy, so if you have multiple tags (e.g., <a> junk </a>) it will match <a> and </a> instead of the whole string.

You can also do this with two functions in the rvest package:
library(rvest)
strip_html <- function(s) {
html_text(read_html(s))
}
Example output:
> strip_html("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"
Note that you should not use regexes to parse HTML.

Another approach, using tm.plugin.webmining, which uses XML internally.
> library(tm.plugin.webmining)
> extractHTMLStrip("junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk")
[1] "junk junk junk junk"

An approach using the qdap package:
library(qdap)
bracketX(test, "angle")
## > bracketX(test, "angle")
## [1] "junk junk junk junk"

It is best not to parse html using regular expressions. RegEx match open tags except XHTML self-contained tags
Use a package like XML. Source the html code in parse it using for example htmlParse and use xpaths to find the quantities relevant to you.
UPDATE:
To answer the OP's question
require(XML)
xData <- htmlParse('yourfile.html')
xpathSApply(xData, 'appropriate xpath', xmlValue)

It may be easier with sub or gsub ?
> test <- "junk junk<a href=\"/wiki/abstraction_(mathematics)\" title=\"abstraction (mathematics)\"> junk junk"
> gsub(pattern = "<.*>", replacement = "", x = test)
[1] "junk junk junk junk"

First, your subject line is misleading; there are no backslashes in the string you posted. You've fallen victim to one of the classic blunders: not as bad as getting involved in a land war in Asia, but notable all the same. You're mistaking R's use of \ to denote escaped characters for literal backslashes. In this case, \" means the double quote mark, not the two literal characters \ and ". You can use cat to see what the string would actually look like if escaped characters were treated literally.
Second, you're using regular expressions to parse HTML. (They don't appear in your code, but they are used under the hood in str_locate_all and str_replace_all.) This is another of the classic blunders; see here for more exposition.
Third, you should have mentioned in your post that you're using the stringr package, but this is only a minor blunder by comparison.

Related

With R and XPath, how do you remove format elements such as \n and \t from the results?

Using XML I can scrape the URL I need, but when I use xpathSApply on it, R returns unwanted \n and \t indicators (new lines and tabs). Here is an example:
doc <- htmlTreeParse("http://www.milesstockbridge.com/offices/", useInternal = TRUE) # scrape and parse an HTML site
xpathSApply(doc, "//div[#class='info']//h3", xmlValue)
[1] "\n\t\t\t\t\t\tBaltimore\t\t\t\t\t" "\n\t\t\t\t\t\tCambridge\t\t\t\t\t" "\n\t\t\t\t\t\tEaston\t\t\t\t\t" "\n\t\t\t\t\t\tFrederick\t\t\t\t\t"
[5] "\n\t\t\t\t\t\tRockville\t\t\t\t\t" "\n\t\t\t\t\t\tTowson\t\t\t\t\t" "\n\t\t\t\t\t\tTysons Corner\t\t\t\t\t" "\n\t\t\t\t\t\tWashington\t\t\t\t\t"
As explained in this question, regex functions can easily remove the unwanted format elements
how to delete the \n\t\t\t in the result from website data collection? but I would rather xpath do the work first, if possible (I have hundreds of these to parse).
Also, there are functions such as translate, apparently, as in this question:
Using the Translate function to remove newline characters in xml, but how do I ignore certain tags? as well as strip() that I saw in a Python question. I do not know which are available when using R and xpath.
It may be that a text() function helps, but I do not know how to include it in my xpathSApply expression. Likewise with normalize-space().
You just want the trim = TRUE argument in your xmlValue() call.
> xpathSApply(doc, "//div[#class='info']//h3", xmlValue, trim = TRUE)
#[1] "Baltimore" "Cambridge" "Easton"
#[4] "Frederick" "Rockville" "Towson"
#[7] "Tysons Corner" "Washington"

Find and replace Heading tags with regex in Notepad++

There's a OCR scanned book and there's a tool which converts the OCR'd PDF to XML but most of the XML tags are wrong so there's another tool to fix it. But I need to break the lines from <h1> to <h5>, 1. & 1.1. & 1.1.1. so its easy to re-tag using the tool.
The XML code looks like this:
`<h1>text</h1><h2>text</h3><h3>text</h3>"
and
1.text.2.text.3.text.1.1.text.1.1.1.text
And I need to break the lines like this using a Regex in notepad++.
<h1>text</h1>
<h2>text</h2>
<h3>text</h3>
and
1.text.
2.text.
3.text.
and
1.1.text.
1.1.1.text.
I used </h1>\s* to find an </h1>\n but it only breaks h1 tags. I need to break all "H" tags and 1., 2., 1.1., 1.1.1. tags too.
At the risk of getting downvoted, i think you may be better served by a parser. In the past when I've had to manage similar tasks, I would write a small script/program to parse the file and re-write it as needed. Parsing the xml first, and then reformatting using regex might be easier to accomplish your goal.
You can use this search and replace (if your h1, h2, ... tags don't contain other tags):
search: (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1
note: if you need Windows newlines, you must change \n with \r\n.
pattern details:
(?<!^) # not preceded by the begining of the string
( # open the capture group 1
<h[1-6][^<]* # <h, a digit between 1 to 6, all characters until
# the next < (to skip all the content between
# h1, h2... tags)
| # OR
(?<![0-9]\.)[0-9]+\. # one or more digits and a dot not preceded by a digit
# and a dot
) # close the capture group 1
$1 is a reference to the content of the capture group 1

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html:
<dt><a name="2004-10-25"><strong>October 25th</strong></a></dt>
<dd>
<p>
[Content]
</p>
</dd>
... and so on.
and here's the example script I'm studying:
#!/usr/bin/perl -w
use strict;
use XML::RSS;
use LWP::Simple;
use HTML::Entities;
my $rss = new XML::RSS (version => '1.0');
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
$rss->channel(title => "The more accurate diary. Really.",
link => $url,
description => "Telsa's diary of life with a hacker:"
. " the current ramblings");
foreach (split ('<dt>', $page))
{
if (/<a\sname="
([^"]*) # Anchor name
">
<strong>
([^>]*) # Post title
<\/strong><\/a><\/dt>\s*<dd>
(.*) # Body of post
<\/dd>/six)
{
$rss->add_item(title => $2,
link => "$url#$1",
description => encode_entities($3));
}
}
If you have a moment to better help me understand, my questions are:
how does the following line work:
([^"]*) # Anchor name
how does the following line work:
([^>]*) # Post title
what does the "six" mean in the following line:
</dd>/six)
Thanks so much in advance for all your help! I'm also researching the answers to my own questions at the moment, but was hoping someone could give me a boost!
how does the following line work...
([^"]*) # Anchor name
zero or more things which aren't ", captured as $1, $2, or whatever, depending on the number of brackets ( in we are.
how does the following line work...
([^>]*) # Post title
zero or more things which aren't >, captured as $1, $2, or whatever.
what does the "six" mean in the
following line...
</dd>/six)
s = match as single line (this just means that "." matches everything, including \n, which it would not do otherwise)
i = match case insensitive
x = ignore whitespace in regex.
x also makes it possible to put comments into the regex itself, so the things like # Post title there are just comments.
See perldoc perlre for more / better information. The link is for Perl 5.10. If you don't have Perl 5.10 you should look at the perlre document for your version of Perl instead.
[^"]* means "any string of zero or more characters that doesn't contain a quotation mark". This is surrounded by quotes making forming a quoted string, the kind that follows <a name=
[^>]* is similar to the above, it means any string that doesn't contain >. Note here that you probably mean [^<], to match until the opening < for the next tag, not including the actual opening.
that's a collection of php specific regexp flags. I know i means case insensitive, not sure about the rest.
The code is an extended regex. It allows you to put whitespace and comments in your regexes. See perldoc perlre and perlretut. Otherwise like normal.
Same.
The characters are regex modifiers.

Regular expression to match empty HTML tags that may contain embedded JSTL?

I'm trying to construct a regular expression to look for empty html tags that may have embedded JSTL. I'm using Perl for my matching.
So far I can match any empty html tag that does not contain JSTL with the following?
/<\w+\b(?!:)[^<]*?>\s*<\/\w+/si
The \b(?!:) will avoid matching an opening JTSL tag but that doesn't address the whether JSTL may be within the HTML tag itself (which is allowable). I only want to know if this HTML tag has no children (only whitespace or empty). So I'm looking for a pattern that would match both the following:
<div id="my-id">
</div>
<div class="<c:out var="${my.property}" />"></div>
Currently the first div matches. The second does not. Is it doable? I tried several variations using lookahead assertions, and I'm starting to think it's not. However, I can't say for certain or articulate why it's not.
Edit: I'm not writing something to interpret the code, and I'm not interested in using a parser. I'm writing a script to point out potential issues/oversights. And at this point, I'm curious, too, to see if there is something clever with lookaheads or lookbehinds that I may be missing. If it bothers you that I'm trying to "solve" a problem this way, don't think of it as looking for a solution. To me it's more of a challenge now, and an opportunity to learn more about regular expressions.
Also, if it helps, you can assume that the html is xhtml strict.
Try
<(\w+)(?:\s+\w+="[^"]+(?:"\$[^"]+"[^"]+)?")*>\s*</\1>
A short explanation:
< # match a '<'
(\w+) # match one or more a-z, A-Z, 0-9 or '_' and store it in group 1
(?: # open non-matching-group 1
\s+ # match one or more white space characters
\w+ # match one or more a-z, A-Z, 0-9 or '_'
=" # match '="'
[^"]+ # match one or more characters other than '"'
(?: # open non-matching-group 2
"\$ # match '"$'
[^"]+ # match one or more characters other than '"'
" # match '"'
[^"]+ # match one or more characters other than '"'
)? # close non-matching-group 2, and make it optional
" # match '"'
)* # close non-matching-group 1, and make repeat itself zero or more times
> # match '>'
\s* # match zero or more white space characters
</\1> # match '</X>' where `X` is what is captured in group 1
This works for both you examples but I am sure someone can construct html that you want to match but will not be matched by the regex. But after reading your 'edit', it seems you are aware of that.
It's not a good idea to use regexes for HTML as there are many constructs that cannot be matched by most regex systems. Also much HTML (as opposed to XHTML) has many difficult constructs. Suggest you use an HTML parser. [This has been frequently addressed on SO and the universal answer is don't use regex).
Using an HTML parser doesn't mean you're interpreting or running the content: it means you are transforming it from a string of characters into a nested object. HTML is not regular, so regular expressions aren't the best solution to this problem.
See the docs for HTML::TreeBuilder as a good place to start. Other good resources include HTML::Parser and of course this site. :)
Edit: I'll pretend that your question has nothing to do with HTML and is just an interesting regex puzzle, and as such will ponder it... ...[still thinking.. edit coming] (puzzle abandoned in the face of a really awesome solution presented above)
If you assume that your input is valid XML, as you say, my tool of choice would be XML::Twig.
based on what i have read i believe the (?: is a non-capturing group not a non-matching group, thus the comment on the regex should be changed.
A non-matching group would be (?!

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches with the closing </p> tag. Any hint on how to match with the closing tag as well?
Note, this is being performed on xhtml.
If you insist on using a regex, something like this will work in most cases:
# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
s{
< # opening angled bracket
(?>/?) # ratchet past optional /
(?:
[^pP] # non-p tag
| # ...or...
[pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
)
[^>]* # everything until closing angled bracket
> # closing angled bracket
}{}gx; # replace with nothing, globally
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
use strict;
use HTML::TokeParser;
my $parser = HTML::TokeParser->new('/some/file.html')
or die "Could not open /some/file.html - $!";
while(my $t = $parser->get_token)
{
# Skip start or end tags that are not "p" tags
next if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');
# Print everything else normally (see HTML::TokeParser docs for explanation)
if($t->[0] eq 'T')
{
print $t->[1];
}
else
{
print $t->[-1];
}
}
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
<HTML /
<HEAD /
<TITLE / > /
<P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
<html>
<head>
<title>
>
</title>
</head>
<body>
<p>
>
</p>
</body>
</html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
I came up with this:
<(?!\/?p(?=>|\s.*>))\/?.*?>
x/
< # Match open angle bracket
(?! # Negative lookahead (Not matching and not consuming)
\/? # 0 or 1 /
p # p
(?= # Positive lookahead (Matching and not consuming)
> # > - No attributes
| # or
\s # whitespace
.* # anything up to
> # close angle brackets - with attributes
) # close positive lookahead
) # close negative lookahead
# if we have got this far then we don't match
# a p tag or closing p tag
# with or without attributes
\/? # optional close tag symbol (/)
.*? # and anything up to
> # first closing tag
/
This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.
It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:
(<[^pP].*?>|</[^pP]>)
Verbose:
(
< # < opening tag
[^pP].*? # p non-p character, then non-greedy anything
> # > closing tag
| # ....or....
</ # </
[^pP] # a non-p tag
> # >
)
I used Xetius regex and it works fine. Except for some flex generated tags which can be : with no spaces inside. I tried ti fix it with a simple ? after \s and it looks like it's working :
<(?!\/?p(?=>|\s?.*>))\/?.*?>
I'm using it to clear tags from flex generated html text so i also added more excepted tags :
<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
With all the disclaimers about using regex to parse html, here is a simple way to do it.
#!/usr/bin/perl
$regex = '(<\/?p[^>]*>)|<[^>]*>';
$subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
($replaced = $subject) =~ s/$regex/$1/eg;
print $replaced . "\n";
See this live demo
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure perl must have some off-the-shelf libraries for manipulating HTML.
Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.
So as I say, I don't really think regexps are the right tool for the job.
Since HTML is not a regular language
HTML isn't but HTML tags are and they can be adequatly described by regular expressions.
Assuming that this will work in PERL as it does in languages that claim to use PERL-compatible syntax:
/<\/?[^p][^>]*>/
EDIT:
But that won't match a <pre> or <param> tag, unfortunately.
This, perhaps?
/<\/?(?!p>|p )[^>]+>/
That should cover <p> tags that have attributes, too.
You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.
The original regex can be made to work with very little effort:
<(?>/?)(?!p).+?>
The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
(That said I agree that generally parsing HTML with regexes is not the way to go).
Try this, it should work:
/<\/?([^p](\s.+?)?|..+?)>/
Explanation: it matches either a single letter except ā€œpā€, followed by an optional whitespace and more characters, or multiple letters (at least two).
/EDIT: I've added the ability to handle attributes in p tags.
This works for me because all the solutions above failed for other html tags starting with p such as param pre progress, etc. It also takes care of the html attributes too.
~(<\/?[^>]*(?<!<\/p|p)>)~ig
You should probably also remove any attributes on the <p> tag, since someone bad could do something like:
<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>
The easiest way to do this, is to use the regex people suggest here to search for &ltp> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.