How extract meaningful text from HTML - html

I would like to parse a html page and extract the meaningful text from it. Anyone knows some good algorithms to do this?
I develop my applications on Rails, but I think ruby is a bit slow in this, so I think if exists some good library in c for this it would be appropriate.
Thanks!!
PD: Please do not recommend anything with java
UPDATE:
I found this link text
Sadly, is in python

Use Nokogiri, which is fast and written in C, for Ruby.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.

Solutions integrating with Ruby
use Nokogiri as recommended by Amigable Clark kant
Use Hpricot
External Solutions
If your HTML is well-formed, you could use the Expat XML Parser for this.
For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).

Lynx is able to do this. This is open source if you want to take a look at it.

You should strip all angle-bracketed part from text and then collapse white-spaces.
In theory the < and > should not be there in other cases. Pages contain < and > everywhere instead of them.
Collapsing whitespaces: Convert all TAB, newline, etc to spaces, then replace every sequence of spaces to a single space.
UPDATE: And you should start after finding the <body> tag.

Related

Ignore HTML tags in QC REST API output

Using HP ALM REST API, we get the Memo fields embedded with HTML tags such as <html>, <span>, <body>, etc. Is there a way to suppress the same using any options?
Using the earlier OTA API, we had the option to use tdconnection.IgnoreHtmlFormat=True, which used to suppress these tags, but using REST API, I am unable to find an equivalent one. Any suggestions or should I build a parser myself after reading the output?
I personally don't know of a switch like that.
Alternatively you might try this:
How to Parse Only Text from HTML
On paper this seems quite nice.
This requires an extra step though. After getting the request you'ld have to run it through the proposed library to get the get the flat text. Shouldn't be more than a line of code I think.
Downside is there might be some stuff going south because you dump any formatting stored as HTML. Usually that isn't much though. Depends of the project and the people off course.

What are some good ways to parse HTML and CSS in Perl?

I have a project where my input files used to be XML. I'm now being asked to start processing HTML with embedded CSS instead, and I'd like to accomplish this as cleanly and with as few code changes as possible. I was using XML::LibXML to parse the XML files, but now that we're moving to HTML with CSS, I'm thinking I'll need to move to something else. That said, before I dig myself knee deep into silly decisions I'll likely regret, I wanted to ask here: what do you guys use for this kind of task?
The structures of the old XML and the new HTML input files are pretty similar, with both holding the same information. The HTML uses divs in place of the XML's text nodes, and holds its style information in style tags and attributes instead of separated xml attributes.
An example of the old XML is:
<text font="TimesNewRoman,BoldItalic" size="11.04" x="59" y="405" w="52"
h="12" bold="yes" italic="yes" cs="4.6" o_bbox="59,405;52,12"
o_size="11.04" o_cs="4.6">
Some text
</text>
An example of the new HTML is:
<div o="9ka" style="position:absolute;top:145;left:89;x-pdf-top:744;x-pdf-left:60;x-pdf-bottom:732;x-pdf-right:536;">
<span class="ft19" >
Some text
</span></nobr>
</div>
where "ft19" refers to a css style element from the top of the page of the format:
.ft19{ vertical-align:top;font-size:14px;x-pdf-font-size:14px;
font-family:Times;color:#000000;x-pdf-color:#000000;font-style:italic;
x-pdf-letter-spacing:0.83px;}
Basically, all I want is a parser that can read the stylistic elements of each node as attributes, so I could do something like:
my #texts_arr = $page_node->findnodes('text');
my $test_node = $texts_arr[1];
print "node\'s bold value is: " . $text_node->getAttribute('bold');
as I'm able to do with the XML. Does anything like that exist for parsing HTML? I'd really like to make sure I start this the right way instead of finding something that sort of does what I want on CPAN and realizing two months later that there was another module that was way better for what I'm trying to do.
Ideas?
The basic one I am aware of is HTML::Parser.
There is also a project that works with it, Marpa::HTML which is the work of the larger parser project Marpa, which parses any language that can be described in BNF, documented on the author's blog which is very interesting but much newer and experimental.
I also see that wildly successful WWW::Mechanize uses HTML::TokeParser, and it uses HTML::PullParser, so there's that too.
If you need something even more generic (and evil) you can look into "writing" your own using something like Text::Balanced (which has some nice methods for tags, not sure about tag properties though) or even Regexp::Grammars, but again this means reinventing the wheel somewhat, I would only choose these routes if the above don't do what you need.
Perhaps I haven't helped. Perhaps I have just done a literature search for you, but maybe one of these will work better for you than others.
Edit: one more parser for you, seems like it might do what you need HTML::Tree. Then look at methods like look_down from HTML::Element to act on the tree. I saw an example here.
It's not clear - is the Perl parsing for the purposes of doing the conversion to HTML (with embedded CSS)? If so, why not forget Perl and use XSLT which is designed to transform XML documents?

Howto remove HTML <a> tags in a CDATA element

I have HTML in a CDATA element (HTML is too crappy to be parsed) and I would like to remove <a href> tags, but keep text in the tags.
I'm searching around regex but still not find a good way to do that.
All advices are welcome!
You could remove anything from a string that looks like a HTML link via regex. Results heavily depend on your input, but replacing </?a\b[^>]*> with the empty string could get you pretty far.
In any case, handling HTML with regular expressions is crappy and ad-hoc. If your input data set is limited and well known and all you need to do is some throw-away one-time conversion code then crappy and ad-hoc may be enough and you could get away with it.
If you are developing code that is intended to be of the long-lived sort, you should definitely look into one of the avilable HTML parsers (BeautifulSoup for Python or the HTML Agility Pack for .NET come to mind) and not only handle your HTML in a structured way, but also fix it while you are at it.

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

A regular expression to remove a given (x)HTML tag from a string

Let's say I have a string holding a mess of text and (x)HTML tags. I want to remove all instances of a given tag (and any attributes of that tag), leaving all other tags and text along. What's the best Regex to get this done?
Edited to add: Oh, I appreciate that using a Regex for this particular issue is not the best solution. However, for the sake of discussion can we assume that that particular technical decision was made a few levels over my pay grade? ;)
Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead, there should be one available for your chosen language.
You might be able to get away with something like this:
</?tag[^>]*?>
But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
Use a parser instead :)
I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup when it doesn't make sense to use a full blown parser.
Of course there are times when a parser might be the best option, but if you are looking for a regex then:
<script[^>]*?>[\s\S]*?<\/script>
That would remove script tags and their contents. Make sure that you use case-insensitive matching.
If you don't want to remove the contents of the tag then you can use:
<\/?script[^>]*?>
An example of usage in javascript would be:
function stripScripts(markup) {
return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}
var safeText = stripScripts(textarea.value);
I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)
If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...
There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach.... you really shouldnt have to be handling HTML directly, anyway.
Off the top of my head, I'd say this will get you started in the right direction.
s/<TAG[^>]*>([^<]*)</TAG[^>]*>/\1
Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.
Corrected answer:
</?TAG\b[^>]*?>
Because Dans answer would remove <br />, but you want only <b>
Here's a regex I wrote for this purpose, it works in a few more situations:
</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>
While using regexes for parsing HTML is generally frowned upon or looked down on, you almost certainly don't want to write your own parser.
You could however use some inbuilt or library functions to achieve what you need.
JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
PHP has the DOM extension.
Python has the awesome Beautiful Soup
...and many more.