How to split up long HTML into "functions"

When writing a non-trivial static HTML page, the large-scale structure gets very hard to see, with the fine structure and the content all mixed into it. Scrolling several pages to find the closing tag that matches a given opening tag is frustrating. In general it feels messy, awkward, and hard to maintain... like writing a large program with no functions.
Of course, when I write a large program, I break it up hierarchically into smaller functions. Is there any way to do this with a large HTML file?
Basically I'm looking for a template system, where the content to be inserted into the template is just more HTML that's optionally (here's the important part) located in the same file.
That is, I want to be able to do something like what is suggested by this hypothetical syntax:
<html>
<head>{{head}}</head>
<body>
<div class="header">{{header}}</div>
<div class="navbar">{{navbar}}</div>
<div class="content">{{content}}</div>
<div class="footer">{{footer}}</div>
</body>
</html>
{{head}} =
<title>Hello World!</title>
{{styles}}
{{scripts}}
{{styles}} =
<link rel="stylesheet" type="text/css" href="style.css">
{{navbar}} =
...
...
... and so on...
Then presumably there would be a simple way to "compile" this to make a standard HTML file.
Are there any tools out there to allow writing HTML this way?
Most template engines require each include to be a separate file, which isn't useful.
UPDATE: GNU M4 seems to do exactly the sort of thing I'm looking for, but with a few caveats:
The macro definitions have to appear before they are used, when I'd rather they be after.
M4's syntax mixes very awkwardly with HTML. Since the file is no longer HTML, it can't be easily syntax checked for errors. The M4 processor is very forgiving and flexible, making errors in M4 files hard to find sometimes - the parser won't complain, or even notice, when what you wrote means something other than what you probably meant.
There's no way to get properly indented HTML out, making the output an unreadable mess. (Since production HTML might be minified anyway, that's not a major issue, and it can always be run through a formatter if it needs to be readable.)

This will parse your template example and do what you want.
perl -E 'my $pre=join("",<>); my ($body,%h)=split(/^\{\{(\w+)\}\}\s*=\s*$/m, $pre); while ($body =~ s/\{\{(\w+)\}\}/$h{$1}/ge) { if ($rec++>200) {die("Max recursion (200)!")}}; $body =~ s/(\{\{)-/$1/sg; $body =~ s/-(\}\})/$1/sg; print $body' htmlfiletoparse.html
And here's the script version.
File joshTplEngine.pl ;)
#!/usr/bin/perl
## get/read lines from files. Multiple input files are supported
my $pre = join("",<>);
## split files to body and variables %h = hash
## variable is [beginning of the line]{{somestring}} = [newline]
my ($body,%h)=split(/^\{\{# split on variable line and
(\w+) ## save name
\}\}
\s*=\s*$/xm, $pre);
## replace recursively all variables defined as {{somestring}}
while ($body =~ s/
\{\{
(\w+) ## name of the variable
\}\}
/ ##
$h{$1} ## all variables have been read into hash %h; $1 contains the variable name from the match
/gxe) {
## check for number of recursions, limit to 200
if ($rec++>200) {
die("Max recursion (200)!")
}
}
## replace {{- to {{ and -}} to }}
$body =~ s/({{)-/$1/sg;
$body =~ s/-(}})/$1/sg;
## end, print
print $body;
Usage:
joshTplEngine.pl /some/html/file.html [/some/other/file] | tee /result/dir/out.html
I hope this little snippet of Perl will let you do your templating.
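For readers who would rather not use Perl, the same idea can be sketched in Python. This is only an illustration of the approach above, not a port of the script: the function name `expand` and the `max_passes` parameter are made up here, undefined placeholders are replaced with an empty string, and the `{{-` / `-}}` escape handling is left out.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")
# A definition starts with a line of the form "{{name}} =".
DEFINITION = re.compile(r"^\{\{(\w+)\}\}\s*=\s*$", re.MULTILINE)


def expand(source: str, max_passes: int = 200) -> str:
    """Expand {{name}} placeholders using definitions found later in the same text."""
    # With one capture group, split() returns [body, name1, text1, name2, text2, ...].
    parts = DEFINITION.split(source)
    body = parts[0]
    defs = {name: text.strip() for name, text in zip(parts[1::2], parts[2::2])}

    passes = 0
    # Keep substituting until no placeholder is left, so nested definitions resolve.
    while PLACEHOLDER.search(body):
        if passes >= max_passes:
            raise RuntimeError(f"Max recursion ({max_passes})!")
        body = PLACEHOLDER.sub(lambda m: defs.get(m.group(1), ""), body)
        passes += 1
    return body


if __name__ == "__main__":
    import sys
    sys.stdout.write(expand(sys.stdin.read()))
```

Run as a filter, it reads the template file on standard input and writes the expanded HTML to standard output.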

Related

Get numbers, a given number of characters after a phrase, from HTML

Basically, I've opened an HTML file in perl, and wrote this line:
if(INFILE =~ \$txt_TeamNumber\) {
$teamNumber = \$txt_TeamNumber\
}
and I need to get the txt_TeamNumber, go 21 spaces forward, and get the next 1-5 numbers. Here is the part of the HTML file I'm trying to extract info from:
<td style="width: 25%;">Team Number:
</td>
<td style="width: 75%;">
<input name="ctl00$ContentPlaceHolder1$txt_TeamNumber" type="text" value="186" maxlength="5" readonly="readonly" id="ctl00_ContentPlaceHolder1_txt_TeamNumber" disabled="disabled" tabindex="1" class="aspNetDisabled" style="width:53px;">
</td>
This is a very good example of the benefits of using a ready-made parser.
A standard module for parsing HTML is HTML::TreeBuilder. Much of its power comes from HTML::Element, so keep that documentation page handy for reference.
The question doesn't say where the HTML comes from. For testing I put it in a file, wrapped in the needed tags, and load it from that file. I expect it actually comes from the internet, so please change the code accordingly.
use warnings;
use strict;
use Path::Tiny;
use HTML::TreeBuilder;
my $file = "snippet.html";
my $html = path($file)->slurp; # or open and slurp by hand
my $tree = HTML::TreeBuilder->new_from_content($html);
# find <input> elements whose 'name' attribute contains the literal $txt_TeamNumber
my @nodes = $tree->look_down(_tag => 'input', name => qr/\$txt_TeamNumber/);
foreach my $node (@nodes) {
my $val = $node->attr('value');
print "'value': $val\n";
}
This prints the line: 'value': 186. Note that we never have to parse anything by hand.
I assume that the 'name' attribute is identified by the literal $txt_TeamNumber, which is why the $ is escaped.
The code uses the excellent Path::Tiny to slurp the file. If there are issues with installing a module, just read the file by hand into a string (if it does come from a file and not from the internet).
See the docs and the many other examples available for the full utility of the HTML parsing modules used above. There are of course other approaches, made ready for use by other good modules. Please search for the right tool.
I strongly suggest staying clear of any idea of parsing HTML (or anything similar) with a regex.
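The same parser-based approach can be sketched with Python's standard library, for comparison. This is an illustration under assumptions rather than a translation of the Perl above: the class name `TeamNumberFinder` and the helper `team_numbers` are hypothetical, and the match is done on the end of the `name` attribute.

```python
from html.parser import HTMLParser


class TeamNumberFinder(HTMLParser):
    """Collect the 'value' of every <input> whose name ends in $txt_TeamNumber."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None for valueless attributes, hence the "or".
        name = attributes.get("name") or ""
        if tag == "input" and name.endswith("$txt_TeamNumber"):
            self.values.append(attributes.get("value"))


def team_numbers(html_text):
    finder = TeamNumberFinder()
    finder.feed(html_text)
    finder.close()
    return finder.values
```

Calling `team_numbers` on the contents of the page returns a list of the matching values as strings.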
Watch for variable scoping. You should be able to get it with a simple regexp capture:
# read INFILE line by line; \$ matches the literal dollar sign in the name
while (my $line = <INFILE>) {
if ($line =~ /\$txt_TeamNumber.*?value="(.*?)"/) {
$teamNumber = $1;
}
}

Pandoc: Include raw `tex` in `markdown` to `html` conversion

Here is a simple markdown code:
$$ \alpha = \beta\label{eqone}$$
Refer equation (\ref{eqone}).
When I convert this to html using
pandoc --mathjax file.markdown -o file.html
the \ref{eqone} is omitted in the html output since it is raw tex. Is there a workaround to include the raw tex in the html output?
I understand that I could have used:
(@eqone) $$\alpha=\beta$$
Refer equation ((@eqone)).
for equation numbering and referencing. This produces the number on the left side and also does not distinguish between figures, tables and equations.
However, mathjax numbering appears on the right like the standard tex output.
Any other workaround for proper equation numbering is also welcome.
Note: Following code needs to be added to the head of the generated html file to configure autonumbering in mathjax.
<script type="text/x-mathjax-config">
MathJax.Hub.Config({ TeX: { equationNumbers: {autoNumber: "all"} } });
</script>
You could try to put the \ref into a math environment, $\ref{eqone}$. And if the command is not defined in math, switch back to text, $\text{\ref{eqone}}$. Ugly, but it might work.

perl: strip html tags, manipulate text, and then return html tags to their original positions

I'm using the HTML::Strip module to remove all html tags from a file. I want to then manipulate the resulting text (stripped of html) and finally return the html tags to their original positions.
The text manipulation I'm doing requires breaking the text into arrays using split(/ /, $text). I then do some natural language processing of the resulting arrays (including adding new html tags to some key words). Once I'm finished processing the text, I'd like to return the original tags to their places while keeping the text manipulations I've done in the meantime intact.
I would be satisfied if I could simply remove all whitespace from within the original tags (since whitespace within tags is ignored by browsers). That way my NLProcessing could simply ignore words that are tags (contain '<' or '>').
I've tried diving into the guts of HTML::Strip (in an effort to modify it to my needs), but I can't understand what the following piece of code does:
my $stripped = $self->strip_html( $text );
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Seems like strip_html is a sub, but I can't find that sub anywhere.
Anyway thanks for any and all advice.
... the next day...
After a bit of back and forth with @amon, I have come up with a solution that I believe is sufficient for my purposes. amon pushed me in the right direction even though he recommended I not do what I've done anyway, haha.
It is a brutish method, but gets the job done satisfactorily. Gonna leave it here in the off chance that someone else has the same wishes as me and doesn't mind a quick and dirty solution:
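use HTML::Strip;
my $hs = HTML::Strip->new();
my $input = do { local $/; open my $fh, '<', 'text.html' or die $!; <$fh> }; # slurp text.html
my $stripped = $hs->parse($input);
$hs->eof;
So now I have two string variables. One is the html file I want to manipulate, and the other is the same file stripped of html.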
my @marks = split(/\s/, $stripped);
@marks = uniq(@marks);
Now I have a list of all non-HTMLtag-associated words that appear in my file.
$input = HTML::Entities::decode($input);
$input =~ s/\</ \</g;
$input =~ s/\>/\> /g;
$input =~ s/\n/ \n /g;
$input =~ s/\r/ \r /g;
$input =~ s/\t/ \t /g;
Now I've decoded my HTML-containing variable and have ensured that no word is up against a "<", a ">", or a non-space whitespace character.
foreach my $mark (@marks) { $input =~ s/ \Q$mark\E / TAQ+${mark}TAQ /g; }
$input =~ s/TAQ\+TAQ//g;
Now I've "tagged" each word with a "+" and have separated words from non-words with the TAQ delimiter. I can now split on TAQ and ignore any item that does not contain a "+" when performing my NLP and text manipulation. Once I'm done, I rejoin and strip all of the "+". Follow that with some clever encoding, remove all the extra spaces I inserted, and BAM! I've now got my NLProcessing completed, have manipulated the text, and still have all of my HTML in the right places.
There are a lot of caveats here, and I'm not going to go into all of them. Most problematic is the need to decode and then encode, coupled with the fact that HTML::Strip doesn't always strip all the javascript or invalid HTML. There are ways to work around that, but again I don't have room or time to discuss that here.
Thanks amon for your help, and I welcome any criticism or suggestions. I'm new to this.
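A different way to reach the same goal (tags stay where they are while only the text is manipulated) is to split the document into tag and non-tag pieces and transform only the non-tag ones. The Python sketch below is not the method described above. It is a simpler regex-based alternative, the names `TAG` and `transform_text` are invented for the example, and it shares the usual weakness of regex tag matching: a `>` inside an attribute value or a script will confuse it.

```python
import re

# Capturing group so that re.split() keeps the tags in the result.
TAG = re.compile(r"(<[^>]*>)")


def transform_text(html_text, fn):
    """Apply fn to every run of text between tags, leaving the tags untouched."""
    pieces = TAG.split(html_text)
    # With one capture group, even indices are text and odd indices are tags.
    return "".join(piece if i % 2 else fn(piece) for i, piece in enumerate(pieces))
```

Here `fn` stands in for whatever text manipulation is needed; it receives plain text only and can return new markup if desired.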
The module HTML::Strip uses the XS glue language to connect the Perl code with C code. You can find the XS file e.g. on (meta-)cpan. It includes a file strip_html.c that implements the actual algorithm. Due to the definitions in the XS file, a strip_html sub is available in the Perl code as part of the HTML::Strip package. Therefore, it can be invoked as a method on an appropriate object.
Explanation of that piece of code
my $stripped = $self->strip_html( $text );
This will invoke the C function on the contents of $text to strip all the HTML tags. The stripped data will then be assigned to $stripped.
if( $self->decode_entities && $_html_entities_p ) {
$stripped = HTML::Entities::decode($stripped);
}
Suffixing variable names with -p is a lispish tradition to indicate boolean variables (or predicates, in mathematics). Here, it indicates if HTML::Entities could be loaded: my $_html_entities_p = eval 'require HTML::Entities';. If the configuration option decode_entities was set to a true value, and HTML::Entities could be loaded, then entities will be decoded in the stripped data.
Example: given the input
<code> $x &lt; $y </code>
then stripping would produce
$x &lt; $y
and decoding the entities would make it
$x < $y
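The two steps (strip the tags, then decode the entities) can be mimicked in a few lines of Python to see the order of operations. This is only a rough sketch: the function name `strip_then_decode` is invented, and the naive regex merely stands in for the C stripping routine, which is more careful.

```python
import html
import re


def strip_then_decode(markup):
    # Step 1: remove anything that looks like a tag (naive stand-in for strip_html).
    stripped = re.sub(r"<[^>]*>", "", markup)
    # Step 2: turn entities such as &lt; back into the characters they represent.
    return html.unescape(stripped)
```

Decoding has to come after stripping; otherwise a decoded `<` could be mistaken for the start of a tag.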

vim: how to apply a *new* syntax highlighting (smarty) instead of html depending on file content?

I've installed (and it seems to work well) this "Smarty syntax highlighting" file here.
The problem is that it's linked to *.tpl files. I've got the HTML syntax highlighting as well.
Here's what I'd like to do: when opening HTML files, just check if there are some special Smarty characters like { (alphanum) $xx (alphanum) } or {* *}. If so, use "Smarty syntax highlighting" otherwise use "HTML syntax highlighting".
Any idea how I could do this?
Don't hesitate to change my subject to make it more generic, and my question as well.
Thank you very much!
Placing this in your vimfiles as ftdetect/smarty.vim should work:
autocmd BufNewFile,BufRead *.html call s:CheckForSmarty()
function! s:CheckForSmarty()
for n in range(1, line('$'))
let line = getline(n)
if line =~ '{.*$\k\+}' || line =~ '{\*.*\*}'
set filetype=smarty
return
endif
endfor
endfunction
Basically, every time you open an html file, the (script-local) function s:CheckForSmarty will be called. It will go through each line and test it against the two regular expressions you see. If one of them matches, the filetype is set to smarty and the function ends its execution. Otherwise, we let vim take care of the rest. You can tweak the regexes if they don't work well enough for you; I'm not really a smarty user, so I can't be sure they cover all use cases.
This may be slow on large html files; I've only tested it on small ones. If it turns out to be a problem, you can limit the script to only check the first 10 lines (this is how the htmldjango filetype is detected):
function! s:CheckForSmarty()
for n in range(1, line('$'))
if n > 10
return
endif
let line = getline(n)
if line =~ '{.*$\k\+}' || line =~ '{\*.*\*}'
set filetype=smarty
return
endif
endfor
endfunction
Another way to manually fix a speed problem is by placing a comment at the top of the file, like {* smarty *}. Vim will see the comment on the very first line, so there will be no reason to iterate through the rest of the file.
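To try the two patterns out on sample files outside Vim, the detection logic can be approximated in Python. This is a rough sketch, not an equivalent: Vim's `\k` (keyword characters) is approximated with `\w`, and the names `looks_like_smarty` and `max_lines` are invented for the example.

```python
import re
from itertools import islice

# Approximations of the Vim patterns '{.*$\k\+}' and '{\*.*\*}'.
SMARTY_VAR = re.compile(r"\{.*\$\w+\}")
SMARTY_COMMENT = re.compile(r"\{\*.*\*\}")


def looks_like_smarty(lines, max_lines=10):
    """Return True if one of the first max_lines lines contains a Smarty marker."""
    for line in islice(lines, max_lines):
        if SMARTY_VAR.search(line) or SMARTY_COMMENT.search(line):
            return True
    return False
```

Passing an open file object as `lines` reads at most `max_lines` lines from it.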
Since I am not too familiar with the smarty syntax you may want to adjust the regular expression of my example below.
function! s:CheckSmarty()
for i in range(1, min([10, line('$')]))
let line = getline(i)
if line =~ '{\*.\{-}\*}'
setl filetype=smarty
return
endif
endfor
endfunction
au BufNewFile,BufRead *.html,*.htm call s:CheckSmarty()
You can easily modify the number of lines to check for the smarty tags in every html file. It's important to use setlocal here in order to just modify the current buffer.

How can I strip HTML in a string using Perl?

Is there any easier way than this to strip HTML from a string using Perl?
$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;
I would appreciate a slimmed-down regular expression, e.g. something like this:
$Error_Msg =~ s|</?[b|h1|br]>||ig;
Is there an existing Perl function that strips any/all HTML from a string, even though I only need bolds, h1 headers and br stripped?
Assuming the code is valid HTML (no stray < or > operators)
$htmlCode =~ s|<.+?>||g;
If you need to remove only bolds, h1's and br's
$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;
And you might want to consider the HTML::Strip module
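The role of the `\b` in the second pattern is easy to check with a small experiment. The sketch below uses Python's `re` module rather than Perl, with an invented function name `strip_some_tags`, and uses `[^>]*` in place of `.*?` so that a tag is allowed to span a line break. It is case-insensitive, like the substitutions in the question.

```python
import re

# Matches <b>, </b>, <h1 ...>, </h1>, <br>, <br/> but not <body> or <base>.
SOME_TAGS = re.compile(r"</?(?:b|h1|br)\b[^>]*>", re.IGNORECASE)


def strip_some_tags(text):
    return SOME_TAGS.sub("", text)
```

Without the word boundary, the `b` alternative would also match the start of `<body>` and remove it.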
You should definitely have a look at the HTML::Restrict which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:
use HTML::Restrict;
my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'
I would recommend staying away from HTML::Strip because it breaks utf8 encoding.
From perlfaq9: How do I remove HTML from a string?
The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .
Here are some tricky cases that you should think about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out.
<B>You can't see me!</B>
-->
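To illustrate why a real parser copes with these cases better than a regex, here is a minimal text extractor built on Python's standard `html.parser` module. It is a sketch in a different language from the Perl modules named above, not a replacement for them: the names `TextExtractor` and `strip_html` are invented, and dropping the contents of `script` and `style` elements is a choice made for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping comments and script/style bodies."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; in the text.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data; they go to handle_comment instead.
        if not self.skip_depth:
            self.chunks.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Because the parser tracks comments and elements itself, commented-out markup never appears in the output.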