How to rewrite a stream of HTML tokens into a new document? - html

Suppose I have tokenized an HTML document. How could I transform it into a new document or apply other transformations to it?
For example, suppose I have this HTML:
<html>
<body>
<p><a href="/foo">text</a></p>
<p>Hello <span class="green">world</span></p>
</body>
</html>
What I have currently written is a tokenizer that outputs a stream of tokens. For this document they would be (written in pseudo code):
TAG_OPEN[html] TAG_OPEN[body] TAG_OPEN[p] TAG_OPEN[a] TAG_ATTRIBUTE[href]
TAG_ATTRIBUTE_VALUE[/foo] TEXT[text] TAG_CLOSE[a] TAG_CLOSE[p]
TAG_OPEN[p] TEXT[Hello] TAG_OPEN[span] TAG_ATTRIBUTE[class]
TAG_ATTRIBUTE_VALUE[green] TEXT[world] TAG_CLOSE[span] TAG_CLOSE[p]
TAG_CLOSE[body] TAG_CLOSE[html]
But now I have no idea how to use this stream to apply transformations.
For example, I would like to rewrite TAG_ATTRIBUTE_VALUE[/foo] in TAG_OPEN[a] TAG_ATTRIBUTE[href] to something else.
Another transformation I would like is to output the TAG_ATTRIBUTE[href] value in parentheses after the TAG_OPEN[a] element, for example,
<a href="/foo">text</a>
gets rewritten into
<a href="/foo">text</a>(/foo)
What is the general strategy for doing such transformations? There are many other transformations I would like to do, like stripping all tags and just leaving TEXT content, adding tags after some specific tags, etc.
Do I need to create a parse tree? I have never done that and don't know how to build one from a stream of tokens. Or is there another way?
Any suggestions are welcome.
And one more thing - I would like to learn all this parsing myself, so I am not looking for a library!
Thanks beforehand, Boda Cydo
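For transformations that only need local context, such as the href examples in the question, a parse tree is not strictly required: a small stateful filter over the token stream may be enough. The sketch below is only an illustration in Python; the (kind, value) tuple layout and the function names append_href_after_links and strip_tags are assumptions made for this example, not part of the tokenizer described above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Token = Tuple[str, str]  # (kind, value), mirroring the pseudo-code tokens


def append_href_after_links(tokens: Iterable[Token]) -> Iterator[Token]:
    """Pass tokens through and emit TEXT["(href)"] after each closing </a>."""
    open_tag: Optional[str] = None    # tag whose attribute list is being read
    attr_name: Optional[str] = None   # most recent attribute name
    hrefs: List[Optional[str]] = []   # href (or None) of every <a> still open
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            open_tag = value
            if value == "a":
                hrefs.append(None)
        elif kind == "TAG_ATTRIBUTE":
            attr_name = value
        elif kind == "TAG_ATTRIBUTE_VALUE":
            if open_tag == "a" and attr_name == "href":
                hrefs[-1] = value
        else:
            open_tag = None  # TEXT or TAG_CLOSE ends the attribute list
        yield (kind, value)
        if kind == "TAG_CLOSE" and value == "a" and hrefs:
            href = hrefs.pop()
            if href is not None:
                yield ("TEXT", "(" + href + ")")


def strip_tags(tokens: Iterable[Token]) -> Iterator[Token]:
    """Keep only TEXT tokens, i.e. drop all markup."""
    return (tok for tok in tokens if tok[0] == "TEXT")
```

Because each filter consumes and produces the same token shape, several of them can be chained before the stream is serialized back to HTML.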

If we can assume that the HTML is XML-compliant, then XSLT would be one way to go. But I assume that is out, since you seem to want to write your own parser (not sure why).
If you really want to write a parser (I'd write parse rules rather than my own parser engine), take a look at ANTLR and Microsoft's Oslo.

There are various ways of parsing/traversing an XML/HTML tree. Perhaps I can point you to:
http://razorsharpcode.blogspot.com/2009/10/combined-pre-order-and-post-order-non.html
If you want to do pre-order or post-order manipulation of DOM elements, you can use the algorithm described there.
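To make the tree-based route concrete, here is a minimal Python sketch of building a tree from a token stream with a stack and then walking it in pre-order. It is not the algorithm from the linked post (that one is non-recursive); the Element class, the (kind, value) token tuples and the helper names are assumptions for illustration, and the input is assumed to be well nested.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Tuple, Union


@dataclass
class Element:
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def build_tree(tokens: Iterable[Tuple[str, str]]) -> Element:
    """Build a tree from (kind, value) tokens; assumes well-nested tags."""
    root = Element("#document")
    stack = [root]          # stack[-1] is the element currently open
    attr_name = ""
    for kind, value in tokens:
        if kind == "TAG_OPEN":
            node = Element(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "TAG_ATTRIBUTE":
            attr_name = value
            stack[-1].attrs[attr_name] = ""
        elif kind == "TAG_ATTRIBUTE_VALUE":
            stack[-1].attrs[attr_name] = value
        elif kind == "TEXT":
            stack[-1].children.append(value)
        elif kind == "TAG_CLOSE" and len(stack) > 1:
            stack.pop()
    return root


def preorder_tags(node: Union[Element, str]) -> Iterator[str]:
    """Yield tag names parent-first (pre-order)."""
    if isinstance(node, str):
        return
    yield node.tag
    for child in node.children:
        yield from preorder_tags(child)


def text_content(node: Union[Element, str]) -> str:
    """Concatenate all text below a node, i.e. strip the tags."""
    if isinstance(node, str):
        return node
    return "".join(text_content(child) for child in node.children)
```

Once the tree exists, a transformation becomes an edit of attrs or children during the walk, followed by serialization.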

Related

RegEx to Filter some specific tags

I'm developing ASP code that reads external websites and parses them via the HTMLDocument interface (the "HTMLFILE" object) so that I can navigate their contents through the DOM structure. But some pages throw an error:
'htmlfile error 80070057 Invalid Argument.'
After a lot of research, I've discovered that some HTML tags are, for reasons I don't know, not rendered or handled correctly by the HTMLFILE object, which gives me that error.
Because ASP is so old and there isn't much material available today to investigate further, I'm convinced that I have to preprocess the page before sending it to the HTMLFILE object, and the best way I have come up with is to do it via RegEx.
But I'm facing some problems (partly because I don't have much practice).
I have to locate the HTML tag blocks that HTMLFILE does not accept so that I can remove them.
For example:
<head>
<script> ....... </script>
<style> ....... </style>
</head>
<body>
<iframe> ........ </iframe>
<div> ..... </div>
<table>.....</table>
I have to match the full script, style and iframe blocks, leaving the rest of the document intact.
Over the last few days I've done some research and have almost got it:
<(?:script|embed|object|frameset|frame|iframe|meta|style).+(.|\s)*?>$
I've also tried to match single-line tags (for example '<BR>'), but I'm totally confused now and there are some inconsistencies; for example, some lines that close tags are wrongly selected.
I know that the best way would be to discover why HTMLFILE is throwing the error, but the error gives no further information to debug it.
Thanks for all your time and patience.
Here is the regex candidate:
<(script|meta|style|embed|object|frameset|frame|iframe)[\s\S]*?<\/(script|meta|style|embed|object|frameset|frame|iframe)>
EDIT
Update with lazy match for [\s\S]*?
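As a rough illustration of the lazy-match idea, here is the same kind of pattern tried out with Python's re module (the ASP code would use its own regex object instead). Two things differ from the candidate above and are my assumptions: a backreference (\1) forces the closing tag to have the same name as the opening one, and void elements such as meta are left out, because they have no closing tag for a paired pattern to find. The name strip_blocks is hypothetical.

```python
import re

# Tags whose whole block (opening tag, content, closing tag) is removed.
BLOCK_TAGS = "script|style|iframe|object|frameset"

# [\s\S]*? is the lazy "anything, including newlines" from the answer;
# \1 requires the closing tag to match the opening tag's name.
BLOCK_RE = re.compile(
    rf"<({BLOCK_TAGS})\b[^>]*>[\s\S]*?</\1\s*>",
    re.IGNORECASE,
)


def strip_blocks(html: str) -> str:
    """Remove every matched block and leave the rest of the page intact."""
    return BLOCK_RE.sub("", html)
```

This still fails on nested blocks of the same tag and on closing-tag text inside script strings, so it is only suitable for pages you can inspect beforehand.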
Regex is not the best tool for this (take a look here), but if you really want to, I think that in simple cases you can also use a single regex for all tags, including nested ones:
(?=(<([^>]+)>([\s\S]*?)<\/\2>))
The 1st group holds the whole captured part, the 2nd group captures just the tag, and the 3rd group captures the content of the tag. The pattern doesn't actually consume any text; it is a lookahead that only captures fragments. However, you can probably get the start/end index of each capture and use it as you want.
Still, I think you should reconsider using regex; however, the syntax used above is quite useful, so it is worth knowing how to use it.

Adding metadata to markdown text

I'm working on software creating annotations and would like my main data structure to be based around markdown.
I was thinking of working with an existing markdown editor, but hacking it so that certain tags, i.e. [annotation-id-001]Sample text.[/annotation-id-001] did not show up as rendered HTML; the above would output Sample text. in an HTML preview and link to a separate annotation with the ID 001.
My question is, is this the most efficient way to represent this kind of metadata inside of a markdown document? Also, if a user wants to legitimately use something like "[annotation-id-001]" as text inside of their document, I assume that I would have to make that string syntax illegal, correct?
I don't know which Markdown parser you use, but you can approach your problem from different angles:
first, you can "hack" an existing parser to exclude your annotation tags from "classic" parsing and include them only in a certain mode
you can also use the internal "meta-data" information offered by certain parsers (like MultiMarkdown or MarkdownExtended) and write your annotations only as meta-data with a reference to their final place in the content
or, as mentioned by mb21, you can use simple link notation like [Sample text.](#annotation-id-001) or use footnotes like [Sample text.](^annotation-id-001) and put your annotations in footnotes.
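One low-effort variant of the first option is to leave the Markdown parser alone and pre-process the text before it is rendered. The Python sketch below assumes the [annotation-id-NNN]...[/annotation-id-NNN] syntax from the question; the function name extract_annotations and the return shape are made up for this example.

```python
import re
from typing import List, Tuple

# An opening tag, lazily matched content, and the matching closing tag.
ANNOTATION_RE = re.compile(r"\[(annotation-id-\d+)\](.*?)\[/\1\]", re.DOTALL)


def extract_annotations(markdown: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (text without annotation tags, [(annotation id, wrapped text)])."""
    spans: List[Tuple[str, str]] = []

    def unwrap(match: "re.Match[str]") -> str:
        spans.append((match.group(1), match.group(2)))
        return match.group(2)  # keep the text, drop the tags

    clean = ANNOTATION_RE.sub(unwrap, markdown)
    return clean, spans
```

Only complete open/close pairs are treated as annotations here, so a lone "[annotation-id-001]" typed by a user passes through as ordinary text; whether that is enough to avoid reserving the syntax depends on the editor.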

Grep and Extract Data in Perl

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud">*DATA_2*</td>
...
And then I would like to store a mapping DATA_2 => DATA_1 in a hash
Since it is HTML I think this could work for you?
https://metacpan.org/pod/XML::XPath
XPath is the way.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_content($content);  # parses a string; use parse_file($file_name) for a file
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
    my $data = $node->findvalue('.');  # the text content of the node
    print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools has a passable XPath tutorial here.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
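For comparison, the same extraction can be sketched without XPath using only Python's standard html.parser module. This is an analogue for illustration, not the Perl solution above; the class name CellCollector and the function map_ud_to_jumlah are hypothetical, and the code assumes each "jumlah" cell is followed by its "ud" cell, as in the question.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class CellCollector(HTMLParser):
    """Collect (class, text) for every <td class="jumlah"> or <td class="ud">."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Tuple[str, str]] = []
        self._current: Optional[str] = None  # class of the <td> being read
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            cls = dict(attrs).get("class")
            if cls in ("jumlah", "ud"):
                self._current = cls
                self._buf = []

    def handle_data(self, data):
        if self._current is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            self.cells.append((self._current, "".join(self._buf).strip()))
            self._current = None


def map_ud_to_jumlah(content: str) -> Dict[str, str]:
    """Build the DATA_2 => DATA_1 mapping from consecutive cell pairs."""
    parser = CellCollector()
    parser.feed(content)
    parser.close()
    mapping: Dict[str, str] = {}
    pending: Optional[str] = None  # last "jumlah" value not yet paired
    for cls, text in parser.cells:
        if cls == "jumlah":
            pending = text
        elif pending is not None:
            mapping[text] = pending
            pending = None
    return mapping
```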
Use HTML parsing modules as described in the answers to this question: HTML::TreeBuilder or HTML::Parser.
In theory you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with RegEx is a Bad Idea with capital letters: too easy to get wrong, too hard to get right, and impossible to get 100% correct since HTML is not a regular language.
You might try this module: HTML::TreeBuilder::XPath. The doc says:
This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

How to replace markup in html files stored on unix/solaris servers?

I'm looking for a way to grab a piece of markup that appears in 1000+ HTML files published on Unix servers (served via Apache) and replace the markup with either empty nodes or alternate HTML markup.
ex:
Find
<div id="someComponent"> .....{a bunch of interior markup} .... </div>
Replace with {empty}
ex 2:
Find </div></body>
Replace </div>{some HTML markup needed here}</body>
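If it is really simple (no parsing needed, the markup is well known, sits on a single line, and is never nested), the fastest way should be:
(In Zsh; in Bash 4+, run shopt -s globstar first and drop the trailing (.), which is a Zsh-only glob qualifier)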
perl -pi -e 's#<div class="toto">.*?</div>#<span>new content</span>#g' /path/to/files/**/*.html(.)
That should replace everything in each ...<div class="toto">.....</div>... with
...<span>new content</span>...
But beware it will NOT work for ...<div class="toto"> ... <div class="toto"> ... </div> ... </div> ....
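The nested case can be handled without a full parser by counting div depth from the opening tag until it returns to zero. The Python sketch below is one way to do that; the function name remove_div and its parameters are assumptions for illustration, and it presumes the strings "<div" and "</div>" do not appear inside comments, scripts or attribute values.

```python
import re

# Matches either an opening <div ...> or a closing </div>.
DIV_TOKEN = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)


def remove_div(html: str, opening_pattern: str, replacement: str = "") -> str:
    """Replace the first div whose opening tag matches opening_pattern,
    including any divs nested inside it."""
    start = re.search(opening_pattern, html, re.IGNORECASE)
    if start is None:
        return html
    depth = 0
    # Scan div tokens from the opening tag onward, tracking nesting depth.
    for token in DIV_TOKEN.finditer(html, start.start()):
        if token.group().startswith("</"):
            depth -= 1
        else:
            depth += 1
        if depth == 0:  # this </div> closes the div we started from
            return html[:start.start()] + replacement + html[token.end():]
    return html  # unbalanced markup: leave the input untouched
```

Applied to each file in turn, this covers the first example in the question (replace the someComponent div with nothing) even when that div contains other divs.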
One way to do it: use Python with BeautifulSoup to parse the HTML file, do replacement and write back.
If the markup is written in the same way in all the files, sed or perl will be much quicker than BeautifulSoup or the like, but it's also harder to make flexible in terms of various ways of expressing the same HTML markup in text form.
Do you have a more concrete example of what kind of markup you're looking for, and ideally how it might vary from file to file? Where in the file will it be? Also, is it okay to prettify or tidy the HTML in the process if necessary?
Oh, and are you running something on the server(s), or do you need code to spider the server to retrieve the HTML files for processing?

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
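The same "only touch text between tags" idea can be written more explicitly by splitting the string on tags and substituting in the text pieces only. This Python sketch relies on the same guarantee as the regex above (no angle brackets other than those delimiting tags); the function name highlight and the example URL in the usage note are made up.

```python
import re

# The capture group makes split() keep the tags in the result list.
TAG_SPLIT = re.compile(r"(<[^>]*>)")


def highlight(html: str, key: str) -> str:
    """Wrap every occurrence of key in <b>...</b>, skipping tag contents."""
    needle = re.compile(re.escape(key))
    parts = TAG_SPLIT.split(html)
    # With one capture group, even indexes are text and odd indexes are tags.
    for i in range(0, len(parts), 2):
        parts[i] = needle.sub(lambda m: "<b>" + m.group(0) + "</b>", parts[i])
    return "".join(parts)
```

For instance, highlight('This is the <a href="http://foo.example/">foo</a> link', "foo") leaves the href alone and bolds only the link text.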
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags
To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*