regular expression to extract text from HTML - html

I would like to extract from a general HTML page, all the text (displayed or not).
I would like to remove
any HTML tags
Any javascript
Any CSS styles
Is there a regular expression (one or more) that will achieve that?

Remove javascript and CSS:
<(script|style).*?</\1>
Remove tags
<.*?>
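As a rough sketch of how those two patterns behave, here they are applied in Python. The function name strip_markup and the sample string are made up for illustration, and the DOTALL and IGNORECASE flags are an assumption on my part: without DOTALL the dot would not cross the line breaks inside a script block.

```python
import re

# Both patterns are the ones quoted above. DOTALL lets "." span newlines,
# IGNORECASE also catches spellings such as <SCRIPT> or <Style>.
SCRIPT_STYLE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
ANY_TAG = re.compile(r"<.*?>", re.DOTALL)


def strip_markup(html):
    """Drop script/style blocks first, then every remaining tag."""
    without_code = SCRIPT_STYLE.sub("", html)
    return ANY_TAG.sub("", without_code)


sample = "<p>Hi <b>there</b></p><script>var a = 1;\nvar b = 2;</script><style>p{}</style>"
print(strip_markup(sample))
```

The order matters: if the tags were stripped first, the script and style bodies would be left behind as text. Note that plain text containing both a "<" and a later ">" (for example "1 < 2 and 3 > 2") is also eaten by the second pattern.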

You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text> will work in a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
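A minimal sketch of the parser route, using Python's standard-library html.parser rather than Beautiful Soup so that it runs without extra packages. The names TextExtractor and extract_text are hypothetical, and joining the text nodes with single spaces is just one possible choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & for us
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<p>Hello &amp; <b>welcome</b></p>"
        "<script>if (a < b) { x(); }</script></body></html>")
print(extract_text(page))
```

The parser treats the contents of script and style elements as raw text, so the "<" inside the script does not confuse it, which is exactly the kind of case that trips up a naive RE.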

Needed a regex solution (in php) that would return the plain text just as well (or better than) PHPSimpleDOM, only much faster. Here is the solution that I came up with:
function plaintext($html)
{
// remove comments and any content found in the comment area (strip_tags only removes the actual tags).
$plaintext = preg_replace('#<!--.*?-->#s', '', $html);
// put a space between list items (strip_tags just removes the tags).
$plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
// remove all script and style tags
$plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
// remove br tags (missed by strip_tags)
$plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
// remove all remaining html
$plaintext = strip_tags($plaintext);
return $plaintext;
}
When I tested this on some complicated sites (forums seem to contain some of the tougher html to parse), this method returned the same result as PHPSimpleDOM plaintext, only much, much faster. It also handled the list items (li tags) properly, where PHPSimpleDOM did not.
As for the speed:
SimpleDom: 0.03248 sec.
RegEx: 0.00087 sec.
37 times faster!

Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:
//body//text()[not(ancestor::script)][not(ancestor::style)]

Using perl syntax for defining the regexes, a start might be:
!<body.*?>(.*)</body>!smi
Then applying the following replace to the result of that group:
!<script.*?</script>!!smi
!<[^>]+/[ \t]*>!!smi
!</?([a-z]+).*?>!!smi
/<!--.*?-->//smi
This of course won't format things nicely as a text file, but it will strip out all the HTML (mostly; there are a few cases where it might not work quite right). A better idea, though, is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text out of that.

The simplest way for simple HTML (example in Python):
text = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
import re
" ".join([t.strip() for t in re.findall(r"<[^>]+>|[^<]+",text) if not '<' in t])
Returns this:
'This is my> example HTML, containing tags'

Here's a function to remove even the most complex HTML tags.
function strip_html_tags( $text )
{
$text = preg_replace(
array(
// Remove invisible content
'#<head[^>]*?>.*?</head>#siu',
'#<style[^>]*?>.*?</style>#siu',
'#<script[^>]*?.*?</script>#siu',
'#<object[^>]*?.*?</object>#siu',
'#<embed[^>]*?.*?</embed>#siu',
'#<applet[^>]*?.*?</applet>#siu',
'#<noframes[^>]*?.*?</noframes>#siu',
'#<noscript[^>]*?.*?</noscript>#siu',
'#<noembed[^>]*?.*?</noembed>#siu',
// Add line breaks before & after blocks
'#<((br)|(hr))#iu',
'#</?((address)|(blockquote)|(center)|(del))#iu',
'#</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))#iu',
'#</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))#iu',
'#</?((table)|(th)|(td)|(caption))#iu',
'#</?((form)|(button)|(fieldset)|(legend)|(input))#iu',
'#</?((label)|(select)|(optgroup)|(option)|(textarea))#iu',
'#</?((frameset)|(frame)|(iframe))#iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0",
),
$text );
// Remove all remaining tags and comments and return.
return strip_tags( $text );
}

If you're using PHP, try Simple HTML DOM, available at SourceForge.
Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as &amp; (which is &).
Also, watch out for comments and Javascript, as I've found it's particularly annoying to deal with for regular expressions, and why I generally just prefer to let a free parser do all the work for me.
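To illustrate the point about leftover special characters, here is a small Python sketch that strips tags with a regular expression and then decodes entities with the standard html.unescape. The function name strip_then_unescape and the tag pattern are assumptions for the example, not taken from any particular html2text implementation.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")


def strip_then_unescape(markup):
    """Remove tags first, then turn entities such as &amp; into characters."""
    return html.unescape(TAG.sub("", markup))


print(strip_then_unescape("<p>Fish &amp; chips &lt;b&gt; &copy; 2009</p>"))
```

Decoding after stripping is deliberate: doing it the other way round would turn the escaped &lt;b&gt; into a real-looking tag, and the tag pattern would then delete text the author meant to display.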

I believe you can just do
document.body.innerText
Which will return the content of all text nodes in the document, visible or not.
[edit (olliej): sigh nevermind, this only works in Safari and IE, and i can't be bothered downloading a firefox nightly to see if it exists in trunk :-/ ]

Can't you just use the WebBrowser control available with C# ?
System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "<html><body>blah blah<b>foo</b></body></html>";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);

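string decoded = System.Web.HttpUtility.HtmlDecode(html); // html: the page source as a string
Regex objRegExp = new Regex("<(.|\n)+?>");
string replace = objRegExp.Replace(decoded, "");
replace = replace.Replace(k, string.Empty); // k: any other substring you want removed (not defined in the original)
replace = replace.Trim("\t\r\n ".ToCharArray()); // Trim returns a new string, so keep the result
Then assign the result to a label (label.Text = replace;) to see the output.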

Related

how to strip html whitespace and comment and javascript comments using nokogiri?

I'm looking for a way in nokogiri to strip out html whitespace & comment and javascript comment (/* */, //). I'm doing this not because of the size of the document. I'm playing around with rack middleware to do this job. I know I could do via regular expression, but i think it could be troublesome.
If not possible to do with nokogiri, please give me the best regular expression to strip out for the 2 above cases.
What I tried using regular expression:
response = @app.call(env)
body = response.last.body.gsub(/(\n|\t|\r)/, ' ').gsub(/>\s*</, '><').gsub(/<!--[^>]*-->/, ' ').squeeze(' ')
response.last.body = body
response
I think there should be a cleaner way to do rather than using regular expression.
Loofah is nice but it won't help you strip javascript comments.
This thread deals with stripping js comments but there seems to be much disagreement. I agree with the ones who say you should not do it. However if you wanted to try the accepted answer with loofah you might do:
require 'rubygems'
require "loofah"
scrubber = Loofah::Scrubber.new do |node|
node.content = node.content.strip if node.name == "text"
node.remove if node.name == "comment"
if node.cdata? && node.parent.name == "script"
node.content = node.content.gsub(/\/\*![^*]*\*+(?:[^*\/][^*]*\*+)*\//,'')
end
end
puts Loofah.fragment('<p> trim </p><!-- remove --><p> me </p><script>var x=0;/*! remove! */</script>').scrub!(scrubber)
# <p>trim</p><p>me</p><script>var x=0;</script>
Loofah might be what you are looking for:
https://github.com/flavorjones/loofah
I end up writing a middleware to handle this since there is no exact solution for this.
Here I use very strict regular expression to handle it.
Check the code on my github repo.

How do I put two spaces after every period in our HTML?

I need there to be two spaces after every period in every sentence in our entire site (don't ask).
One way to do it is to embark on manually adding a &nbsp; after every single period. This will take several hours.
We can't just find and replace every period, because we have concatenations in PHP and other cases where there is a period and then a space, but it's not in a sentence.
Is there a way to do this...and everything still work in Internet Explorer 6?
[edit] - The tricky part is that in the code, there are lines of PHP that include dots with spaces around them like this:
<?php echo site_url('/css/' . $some_name .'.css');?>
I definitely don't want extra spaces to break lines like that, so I would be happy adding two visible spaces after each period in all P tags.
As we all know, HTML collapses white space, but it only does this for display. The extra spaces are still there. So if the source material was created with two spaces after each period, then some of the substitution methods being suggested can be made to work reliably: search for "period-space-space" and replace it with something more suitable, like period-space-&emsp14;. Please note that you shouldn't use &nbsp; because it can prevent proper wrapping at margins. (If you're using ragged right, the margin change won't be noticeable as long as you use the nbsp BEFORE the space.)
You can also wrap each sentence in a span and use the :after selector to add a space and format it to be wide with "word-spacing". Or you can wrap the space between sentences itself in a span and style that directly.
I've written a javascript solution for blogger that does this on the fly, looks for period-space-space, and replaces it with a spanned, styled space that appears wider.
If however your original material doesn't include this sort of thing then you'll have to study up on sentence boundary detection algorithms (which are not so simple), and then modify one to also not trip over PHP code.
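A sketch of the period-space-space substitution in Python, under the assumption that the source really does contain two spaces after each sentence. It leaves anything inside a tag or a <?php ... ?> block alone, so PHP concatenation dots are not touched. The function name widen_sentence_gaps and the tokenising pattern are invented for this example, and the pattern is far from a full HTML or PHP tokenizer.

```python
import re

# Tags and PHP blocks are kept verbatim; only the text between them is rewritten.
TOKEN = re.compile(r"(<\?php.*?\?>|<[^>]+>)", re.DOTALL)


def widen_sentence_gaps(markup, wide_space="&emsp14;"):
    parts = TOKEN.split(markup)
    # With one capturing group, re.split returns text at even indexes
    # and the matched tags / PHP blocks at odd indexes.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(".  ", ". " + wide_space)
    return "".join(parts)


source = "<p>One.  Two.  Three. Four</p><?php echo site_url('/css/' .  $n); ?>"
print(widen_sentence_gaps(source))
```

Only the double-spaced gaps in the paragraph change; the single-spaced "Three. Four" and the PHP block come through untouched.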
You might be able to use the JavaScript split method or regex depending on the scope of the text.
Here's the split method:
var el = document.getElementById("mydiv");
if (el){
el.innerText = el.innerText.split(".").join(".\xA0 ");
}
Test case:
Hello world.Insert spaces after the period.Using the split method.
Result:
Hello world. Insert spaces after the period. Using the split method.
Have you thought about using an output buffer? ob_start($callback)
Not tested, but if you stick this before any output (or better yet, offload the function):
<?php
function processDots($buffer)
{
return (str_replace(".", ". ", $buffer));
}
ob_start("processDots");
?>
and add this to end of input:
<?php ob_end_flush(); ?>
Might just work :)
If you're not opposed to a "post processing"/"javascript" solution:
var nodes = $('*').contents().map(function(a, b) {
return (b.nodeType === Node.TEXT_NODE ? b : null);
});
$.each(nodes, function(i,node){
node.data = node.data.replace(/(\.\s)/g, '.\u00A0\u00A0');
});
Using jQuery for the sake of brevity, but not required.
p.s. I saw your comment about not all periods and a space are to be treated equal, but this is about as good as it gets. otherwise, you're going to need a lot better/more bullet-proof approach.
Incorporate something like this into your PHP file (assuming $html holds the page markup):
<?php $html = preg_replace('/\.\s+(?=[A-Z])/', '.&nbsp; ', $html); ?>
This searches for the beginning of each new sentence, as in period-space-space-capital or period-space-capital, and replaces the period and spaces with a period, a non-breaking space and a normal space. [note: the capital letter is not replaced]

Parse HTML Page For Links With Regex Using Perl [duplicate]

Possible Duplicate:
How can I remove external links from HTML using Perl?
Alright, i'm working on a job for a client right now who just switched up his language choice to Perl. I'm not the best in Perl, but i've done stuff like this before with it albeit a while ago.
There are lots of links like this:
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
(1992)</a>
I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seemed geared towards using =~ to match stuff rather than capture matches.
Thanks,
Cody
Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.
Or, consider the following simple example:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my @hrefs;
while ( my $anchor = $parser->get_tag('a') ) {
if ( my $href = $anchor->get_attr('href') ) {
push @hrefs, $href if $href =~ m!/en/subtitles/!;
}
}
print "$_\n" for @hrefs;
__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');"
class="bnone">Death Becomes Her
(1992)</a>
Output:
/en/subtitles/3586224/death-becomes-her-en
Don't use regexes. Use an HTML parser like HTML::TreeBuilder.
use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;
my @links = map { $_->attr('href') } $tree->look_down( _tag => 'a');
$tree = $tree->delete;
# Do stuff with @links
URLs like the one in your example can be matched with a regular expression like
($url) = /href=\"([^\"]+)\"/i
If the HTML ever uses single quotes (or no quotes) around a URL, or if there are ever quote characters in the URL, then this will not work quite right. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them but carry on if you are confident that the input will be well behaved.
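For comparison with the quoting caveat above, here is a sketch of the parser-based approach in Python's standard html.parser (not Perl, and not any of the modules named in this thread). HrefCollector, collect_hrefs and the required_prefix parameter are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Gather href values from <a> tags, optionally filtered by prefix."""

    def __init__(self, required_prefix=""):
        super().__init__()
        self.required_prefix = required_prefix
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and value.startswith(self.required_prefix):
                self.hrefs.append(value)


def collect_hrefs(html, required_prefix=""):
    collector = HrefCollector(required_prefix)
    collector.feed(html)
    collector.close()
    return collector.hrefs


page = ('<a href="/en/subtitles/1/a" onclick="reLink(\'/x\');">A</a> '
        "<a href='/en/subtitles/2/b'>B</a> "
        "<a href=/en/subtitles/3/c>C</a> "
        '<a href="/other">D</a>')
print(collect_hrefs(page, "/en/subtitles/"))
```

The parser hands over attribute values with the quoting already resolved, so the double-quoted, single-quoted and unquoted links in the sample are all picked up by the same code.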

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes, it requires recursive processing, potentially quite deep (or a fancy loop, of course); it is not going to be done with a regex. You could make a regex that handles a few levels of nesting, but not one that works on just any HTML file. This is because the parser has to remember which tags are open at any given point in the stream, and regexes aren't good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As @Pesto said, HTML isn't regular; you would have to build HTML grammar rules and apply them recursively.
If you are looking to fix HTML programmatically, I have used a component called html tidy with considerable success. There are builds for it for most languages (COM+, Dotnet, PHP etc...).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's too complex for a regular expression. Your problem is equivalent to checking an arithmetic expression for properly balanced brackets, which needs at least a pushdown automaton to succeed.
In your case you should split the HTML code into opening tags, closing tags and text nodes (e.g. with a regular expression). Store the result in a list. Then you can iterate through the node list and push every opening tag onto a stack. If you encounter a closing tag in your node list, you must check that the topmost stack entry is an opening tag of the same type, and pop it if so. Otherwise you have found the HTML syntax error you were looking for.
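The tokenize-then-stack procedure just described could be sketched in Python roughly as follows. The tag pattern, the VOID_TAGS set and the name find_tag_errors are assumptions made for the example; comments, CDATA and script contents are not handled.

```python
import re

# Matches opening, closing and self-closing tags; text between tags is skipped.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")

# Elements that never take a closing tag, so they are not pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def find_tag_errors(html):
    """Return (unclosed, stray): opening tags left on the stack and
    closing tags that did not match the innermost open tag.
    Each entry is a (tag name, character offset) pair."""
    stack = []
    stray = []
    for match in TAG_RE.finditer(html):
        is_closing, name, self_closing = match.group(1), match.group(2).lower(), match.group(3)
        if not is_closing:
            if not self_closing and name not in VOID_TAGS:
                stack.append((name, match.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            stray.append((name, match.start()))
    return stack, stray


print(find_tag_errors("<i><b>test<i>ing</i>"))
```

For the string from the question, the closing </i> pairs with the innermost <i>, so this approach reports the outer <i> and the <b> as unclosed, which differs slightly from the pairing the question had in mind.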
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
public static bool ValidateHtmlTags(string html)
{
string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
var stack = new Stack<Tuple<string, string>>();
bool valid = true;
foreach (Match match in regex.Matches(html))
{
string element = match.Value;
string beginTag = match.Groups[2].Value;
string endTag = match.Groups[4].Value;
if (beginTag == "")
{
// A closing tag is only valid if it matches the innermost open tag.
// Checking Count first avoids an exception from Peek() on an empty stack.
if (stack.Count > 0 && string.Equals(stack.Peek().Item1, endTag, StringComparison.OrdinalIgnoreCase))
stack.Pop();
else
{
valid = false;
break;
}
}
else if (!element.EndsWith("/>"))
{
// Write more informative message here if desired
string message = string.Format("Char({0})", match.Index);
stack.Push(new Tuple<string, string>(beginTag, message));
}
}
if (stack.Count > 0)
valid = false;
// Alternative return stack.Peek().Item2 for more informative message
return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Qt Regex matches HTML Tag InnerText

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0
HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
i did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = clipBoardData.mid(begin, end-begin);
The result includes the <pre's> but i found out thats even better ;)
I have to agree with the others. Drupal 6.x and older use regex to do a lot of work on the HTML data. It quickly breaks if you create pages of 64Kb or more. So using a DOM, or just indexOf() as you've done, is a better and much faster solution.
Now, for those interested in knowing more about regex: Qt's QRegularExpression (Qt 5 and later) uses Perl-compatible syntax, which means you can use the lazy operator there. The older QRegExp does not support lazy quantifiers such as .*?; with it you call setMinimal(true) to make every quantifier in the pattern non-greedy. In Perl-style syntax your regex would become:
(<pre>.*?</pre>)+
to get each one of the <pre> blocks in your code (although if you have only one, then the question mark and the plus are not required.) Note that no delimiters at the start and end of the regular expression are required here.
QRegExp re("(<pre>.*</pre>)+", Qt::CaseInsensitive);
re.setMinimal(true); // QRegExp equivalent of the lazy .*?
re.indexIn(html_input);
QStringList list = re.capturedTexts();
Now list should hold the matched <pre> block.
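To show what the lazy quantifier buys you when there is more than one block, here is the same idea sketched in Python rather than Qt. PRE_BLOCK and pre_contents are hypothetical names, and the pattern is widened slightly (an assumption) to tolerate attributes on the opening tag.

```python
import re

# Lazy .*? stops at the first </pre>; DOTALL lets it cross line breaks.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)


def pre_contents(html):
    """Return the inner text of every <pre> element, in document order."""
    return PRE_BLOCK.findall(html)


page = "<body><pre>first\nblock</pre><p>x</p><PRE class='c'>second</PRE></body>"
print(pre_contents(page))
```

With a greedy .* the single match would run from the first opening tag to the last closing tag and swallow the paragraph in between.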