Simple macros for HTML

My HTML file contains the code &nbsp;&nbsp;&nbsp; in many places.
It is too short, and it doesn't really make sense to replace it with code like
<span class="three-spaces"></span>
I would like to replace it with something like
##TS##
or
%%TS%%
and the file should start with something like:
SET TS = "&nbsp;&nbsp;&nbsp;"
Is there any way to write the HTML this way? I am not looking for compiling a source file into a HTML. I am looking for a solution that allows directly writing macros into HTML files.
Later edit: I'm coming with another example:
I also need to transform
lnk(http://www.example.com)
into
<a target="_blank" href="http://www.example.com">http://www.example.com</a>

Instead of telling him WHY he should not do something, how about telling him HOW he could do it? Maybe his example is not an appropriate need for it, but there's other situations where being able to create a macro would be nice.
For example... I have an HTML page that I'm working on that deals with unit conversions, and quite often I'm having to type things like "cm/in" as "<sup>cm</sup>/<sub>in</sub>", or for volumes "cm3/in3" as "<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>". It would be really nice from a typing and readability standpoint if I could create macros that were just typed as "%%cm-per-in%%", "%%cc-per-cu-in%%" or something like that.
So, the line in the 'sed' file might look like this:
s/%%cc-per-cu-in%%/<sup>cm<sup>3<\/sup><\/sup>\/<sub>in<sup>3<\/sup><\/sub>/g
Since the "/" is a field separator for the substitute command, you need to explicitly quote it with the backslash character ("\") within the replacement portion of the substitute command.
The way that I have handled things like this in the past was to either write my own preprocessor to make the changes or if the "sed" utility was available, I would use it. So for this sort of thing, I would basically have a "pre-HTML" file that I edited and after running it through "sed" or the preprocessor, it would generate an HTML file that I could copy to the web server.
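As a concrete sketch of that workflow (the file names here are invented; use whatever layout suits you), the macros live in a small sed script and the "pre-HTML" file you edit is piped through it to produce the real HTML:

```shell
# macros.sed holds one substitution per macro ('/' in the replacement
# must be escaped because it is the s command's delimiter).
cat > macros.sed <<'EOF'
s/%%cm-per-in%%/<sup>cm<\/sup>\/<sub>in<\/sub>/g
s/%%cc-per-cu-in%%/<sup>cm<sup>3<\/sup><\/sup>\/<sub>in<sup>3<\/sup><\/sub>/g
EOF

# index.pre.html is the file you actually edit.
echo 'Scale factor: %%cm-per-in%%' > index.pre.html

# Generate the HTML file that gets copied to the web server.
sed -f macros.sed index.pre.html > index.html
```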
Now, you could create a javascript function that would do the text substitution for you, but in my opinion, it is not as nice looking as an actual preprocessor macro substitution. For example, to do what I was doing in the sed script, I would need to create a function that would take as a parameter the short form "nickname" for the longer HTML that would be generated. For example:
function S( x )
{
    if (x == "cc-per-cu-in") {
        document.write("<sup>cm<sup>3</sup></sup>/<sub>in<sup>3</sup></sub>");
    } else if (x == "cm-per-in") {
        document.write("<sup>cm</sup>/<sub>in</sub>");
    } else {
        document.write("<B>***MACRO-ERROR***</B>");
    }
}
And then use it like this:
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
This is a test of an error <SCRIPT>S("cc-per-in");</SCRIPT> for a
missing macro substitution.
This generates the following:
This is a test of cc-per-cu-in cm3/in3 and cm-per-in cm/in as an alternative to sed. This is a test of an error ***MACRO-ERROR*** for a missing macro substitution.
Yeah, it works, but it is not as readable as if you used a 'sed' substitution.
So, decide for yourself... Which is more readable...
This...
This is a test of cc-per-cu-in <SCRIPT>S("cc-per-cu-in");</SCRIPT> and
cm-per-in <SCRIPT>S("cm-per-in");</SCRIPT> as an alternative to sed.
Or this...
This is a test of cc-per-cu-in %%cc-per-cu-in%% and
cm-per-in %%cm-per-in%% as an alternative to sed.
Personally, I think the second example is more readable and worth the extra trouble to have pre-HTML files that get run through sed to generate the actual HTML files... But, as the saying goes, "Your mileage may vary"...
EDITED: One more thing that I forgot about in the initial post that I find useful when using a pre-processor for the HTML files -- Timestamping the file... Often I'll have a small timestamp placed on a page that says the last time it was modified. Instead of manually editing the timestamp each time, I can have a macro (such as "%%DATE%%", "%%TIME%%", "%%DATETIME%%") that gets converted to my preferred date/time format and put in the file.
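A minimal sketch of that timestamp macro, assuming a %%DATE%% placeholder and whatever date format you prefer:

```shell
# page.pre.html is the edited source; the name is illustrative.
echo 'Last modified: %%DATE%%' > page.pre.html

# Expand %%DATE%% at build time.
STAMP=$(date '+%Y-%m-%d')
sed "s/%%DATE%%/$STAMP/g" page.pre.html > page.html
```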
Since my background is in 'C' and UNIX, if I can't find a way to do something in HTML, I'll often just use one of the command line tools under UNIX or write a small 'C' program to do it. My HTML editing is always in 'vi' (or 'vim' on the PC) and I find that I am often creating tables for alignment of various portions of the HTML page. I got tired of typing all the TABLE, TR, and TD tags, so I created a simple 'C' program called 'table' that I can execute via the '!}' command in 'vi', similar to how you execute the 'fmt' command in 'vi'. It takes as parameters the number of rows & columns to create, whether the column cells are to be split across two lines, how many spaces to indent the tags, and the column widths and generates an appropriately indented TABLE tag structure. Just a simple utility, but saves on the typing.
Instead of typing this:
<TABLE>
    <TR>
        <TD width=200>
        </TD>
        <TD width=300>
        </TD>
    </TR>
    <TR>
        <TD>
        </TD>
        <TD>
        </TD>
    </TR>
    <TR>
        <TD>
        </TD>
        <TD>
        </TD>
    </TR>
</TABLE>
I can type this:
!}table -r 3 -c 2 -split -w 200 300
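The author's 'table' program is a private C utility, but a rough shell stand-in (rows and columns only; no -split or width handling, which are assumed flags of the original) shows the idea:

```shell
# Minimal sketch: emit an indented TABLE skeleton for R rows and C columns.
table() {
    rows=$1 cols=$2
    echo '<TABLE>'
    i=0
    while [ "$i" -lt "$rows" ]; do
        echo '    <TR>'
        j=0
        while [ "$j" -lt "$cols" ]; do
            echo '        <TD>'
            echo '        </TD>'
            j=$((j+1))
        done
        echo '    </TR>'
        i=$((i+1))
    done
    echo '</TABLE>'
}

table 3 2
```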
Now, with respect to the portion of the original question about being able to create a macro to do HTML links, that is also possible using 'sed' as a pre-processor for the HTML files. Let's say that you wanted to change:
%%lnk(www.stackoverflow.com)
to:
<a href="www.stackoverflow.com">www.stackoverflow.com</a>
you could create this line in the sed script file:
s/%%lnk(\(.*\))/<a href="\1">\1<\/a>/g
'sed' uses regular expressions and they are not what you might call 'pretty', but they are powerful if you know what you are doing.
One slight problem with this example is that it requires the macro to be on a single line (i.e. you cannot split the macro across lines) and if you call the macro multiple times in a single line, you get a result that you might not be expecting. Instead of doing the macro substitution multiple times, it assumes the argument to the macro starts with the first '(' of the first macro invocation and ends with the last ')' of the last macro invocation. I'm not a sed regular expression expert, so I haven't figured out how to fix this yet. For the multiple line portion though, a possible fix would be to replace all the LF characters in the file with some other special character that would not normally be used, run sed on that result, and then convert the special characters back to LF characters. Of course, the problem there is that the entire file would be a single line and if you are invoking the macro, it is going to have the results that I described above. I suspect awk would not have that problem, but I have never had a need to learn awk.
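To illustrate the multiple-invocation problem, and one common fix (restrict the capture to non-')' characters with [^)]* instead of the greedy .*):

```shell
line='See %%lnk(a.com) and %%lnk(b.com)'

# Greedy: the single capture swallows everything up to the LAST ')'.
printf '%s\n' "$line" | sed 's/%%lnk(\(.*\))/<a href="\1">\1<\/a>/g'
# -> See <a href="a.com) and %%lnk(b.com">a.com) and %%lnk(b.com</a>

# Disallowing ')' inside the capture handles both invocations.
printf '%s\n' "$line" | sed 's/%%lnk(\([^)]*\))/<a href="\1">\1<\/a>/g'
# -> See <a href="a.com">a.com</a> and <a href="b.com">b.com</a>
```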
Upon further reflection, I think there might be an easier solution to both the multi-line and multiple invocation of a macro on a single line -- the 'm4' macro preprocessor that comes with the 'C' compiler (e.g. gcc). I haven't tested it much to see what the downside might be, but it seems to work well enough for the tests that I have performed. You would define a macro as such in your pre-HTML file:
define(`LNK', `<a href="$1">$1</a>')
And yeah, it does use the backwards single quote character to start the text string and the normal single quote character to end the text string.
The only problem that I've found so far is that for the macro names, it only allows the characters 'A'-'Z', 'a'-'z', '0'-'9', and '_' (underscore). Since I prefer to type '-' instead of '_', that is a definite disadvantage to me.

Technically inline JavaScript with a <script> tag could do what you are asking. You could even look into the many templating solutions available via JavaScript libraries.
That would not actually provide any benefit, though. JavaScript changes what is ultimately displayed, not the file itself. Since your use case does not change the display it wouldn't actually be useful.
It would be more efficient to consider why the &nbsp;&nbsp;&nbsp; is appearing in the first place and fix that.

This …
My html file contains in many places the code
… is actually what is wrong in your file!
&nbsp; is not meant to be used for layout purposes; you should fix that and use CSS to lay things out correctly instead.
&nbsp; is meant to stop words that are separated by a space from breaking at the end of a line. For example, numbers and their unit: "5 liters" can end up with the "5" at the end of one line and "liters" on the next line.
To keep that together you would use 5&nbsp;liters. That's what you use &nbsp; for and nothing else, especially not for layout purposes.
To still answer your question:
HTML is a markup language not a programming language. That means it is descriptive/static and not functional/dynamic. If you try to generate HTML dynamically you would need to use something like PHP or JavaScript.

Just an observation from a novice. If everyone did as purists suggest (i.e. the right way), then the web would still be using the same coding conventions it was using 30 years ago. People do things, innovate, and create new ways, then new standards, and deprecate others all the time. The claim that "spaces are only for separating words...and nothing else" is silly. For many, many years, when people typed letters, they used one space between words and two spaces between end punctuation and the next sentence. That changed...yeah, things change. There is absolutely nothing wrong with using spaces and non-breaking spaces in ways which assist layout. It is neither useful nor elegant to use a long span with style over and over and over, rather than simple spaces. You can think it is, and your club of do-it-right folks might even agree. But...although "right", they are also being rather silly about it. Question: Will a page with 3 non-breaking spaces validate? Interesting.

Related

How to create "variables" for later search and replace (sort of like a merge-field) in HTML

I use a lot (20 or so) of what FrontPage 2003 (don't laugh!) calls "Site Parameters", which are essentially variables. I use them for the Number of Products, Phone Hours, etc.
I'm upgrading to Expression Web, which does not support changing or adding those Parameters. Also, I'd like to create variables for our breadcrumbs trail. (So a product page might have breadcrumbs: Products > This Product > Screenshot.)
So if we ever decide to rename Products I want to easily be able to replace it.
What I do not want to do:
Replace the value when the page is served up. That slows things down a bit and forces all pages to be .aspx, etc. I want to stick with plain html.
Replace the value using JavaScript (same reason, and a tiny % of browsers don't have .js enabled).
I was thinking:
we have <variable.products_count>20</variable.products_count>
But..... it's easy to get another tag and text in there, as happens in this example:
<variable.products_count> we have <strong>20</strong> </variable.products_count>
Now when I replace, I'm replacing the tags and "we have" as well.
What I do not want to do: Replace the value when the page is served up. That slows things down a bit and forces all pages to be .aspx, etc. I want to stick with plain html.
Technically you are correct; there will probably be a performance hit by using an executed versus static page. But the overhead of inserting a few dynamic values is so trivial that it shouldn't even factor into the decision making process.
At its simplest:
HTML
<body>
    <div>Good old HTML</div>
    <div>A dynamic value: <%= SiteParameters.Foo %></div>
</body>
c# (or VB.Net, or whatever you prefer)
public static class SiteParameters
{
    // This value could be pulled from a config file, a database, etc.
    public static readonly string Foo = "Hello World";
}
Your example with the we have problem can be easily re-written to fix the problem:
<variable.products_count> we have <strong>20</strong> </variable.products_count>
to:
we have <strong><variable.products_count></variable.products_count></strong>
Of course, I think the easier approach is to step slightly outside of HTML syntax and pick variable names that are unlikely to occur in your documentation:
we have <strong>##PRODUCTS_COUNT##</strong> ...
When you have several of these you can replace them all with one pass of a simple text-replacement tool such as sed(1):
sed -e 's/##PRODUCTS_COUNT##/20/g' \
    -e 's/##PRODUCTS##/Products/g' \
    -e 's/##SCREENSHOTS##/Screen Shots/g' < inputfile > output.html
To automate over many files, you'd probably want to write a little script:
s/##PRODUCTS_COUNT##/20/g;
s/##PRODUCTS##/Products/g;
s/##SCREENSHOTS##/Screen Shots/g;
and run it on all your files:
for f in pre/*.html ; do sed -f /path/to/script < "$f" > "${f/pre/post}" ; done
(The ${f/pre/post} replaces the pre with post; your post-processed files would be in the post/ directory.)

Regular expressions - finding and comparing the first instance of a word

I am currently trying to write a regular expression to pull links out of a page I have. The problem is the links need to be pulled out only if the links have 'stock' for example. This is an outline of what I have code wise:
<td class="prd-details">
<a href="somepage">
...
<span class="collect unavailable">
...
</td>
<td class="prd-details">
<a href="somepage">
...
<span class="collect available">
...
</td>
What I would like to do is pull out the links only if 'collect available' is in the tag. I have tried to do this with the regular expression:
(?s)prd-details[^=]+="([^"]+)" .+?collect{1}[^\s]+ available
However, on running it, it will find the first 'prd-details' class and keep going until it finds 'collect available', thereby taking the incorrect results. I thought by specifying the {1} after the word collect it would only use the first instance of the word it finds, but apparently I'm wrong. I've been trying to use different things such as positive and negative lookaheads but I can't seem to get anything to work.
Might anyone be able to help me with this issue?
Thanks,
Dan
You need an expression that knows "collect unavailable" is junk. You should be able to use a negative lookahead with your wildcard after the link capture. Something like:
prd-details[^=]+="([^"]+)"(.(?!collect un))+?collect available
This will collect any character after the link that isn't followed by "collect un". This should eliminate capturing the "collect unavailable" chunk along with "collect available".
I tested in C# treating the text as a single line. You may need a slightly different syntax and options depending on your language and regex library.
If you insist on doing this with regex, I recommend a 2-step split-then-check approach:
First, split into each prd-details.
Then, within each prd-details, see if it contains collect available
If yes, then pull out the href
This is easier than trying to do everything in one step. Easier to read, write, and maintain.
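That two-step idea can also be sketched with standard Unix tools instead of one big regex. This awk sketch (the sample markup and file name are invented here) remembers the most recent href and prints it only when the 'collect available' marker appears, assuming one tag per line as in the question:

```shell
cat > sample.html <<'EOF'
<td class="prd-details">
<a href="nope">
<span class="collect unavailable">
</td>
<td class="prd-details">
<a href="yes">
<span class="collect available">
</td>
EOF

# Track the last href seen; emit it when its block proves to be available.
# (Note "collect unavailable" does not contain "collect available".)
awk '
    match($0, /href="[^"]*"/) { last = substr($0, RSTART + 6, RLENGTH - 7) }
    /collect available/       { print last }
' sample.html
```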

How to remove all empty tags in X/HTML code at once?

For example, I want to remove all the empty tags highlighted in my editor (the original post included a screenshot).
You could use a regular expression in any editor that supports them. For instance, I tested this one in Dreamweaver:
<(?!\!|input|br|img|meta|hr)[^/>]*?>[\s]*?</[^>]*?>
Just make a search and replace all (with the regex as search string and nothing as replacement). Note however that this may remove necessary whitespace. If you just want to remove empty tags without anything in between,
<(?!\!|input|br|img|meta|hr)[^/>]*?></[^>]*?>
would be the way to go.
Update: You want to remove non-breaking spaces (&nbsp;) as well:
<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>
I did not verify this one - it should be OK though, try it out :-)
If this is only about quickly editing a file, and your editor supports regular expression replacement, you can use a regex like this:
<[^>]+></[^>]+>
Search for this regex, and replace with an empty string.
Note: This isn't safe in any way - don't rely on it, as it can find more things than just valid, empty tags. (It would also find <a></b> for example.) There is no safe way to do this with regexes - but if you check each replacement manually, you should be fine. If you need real safe replacement, then either you'll have to find an editor that supports this (JEdit may be a good bet, but I haven't checked), or you'll have to parse the file yourself - e.g. using XSLT.
What you're asking for sounds like a job for regular expressions. Many editors support regular expression find/replace. Personally, I'd probably do this from the command-line with Perl (sed would also work), but that's just me.
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
or if you're brave, edit the file in place:
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' -i file.html
This will remove:
<p></p>
<p id="foo"></p>
but not:
<p>hello world</p>
<p></a>
Warning: things like <img src="pic.png"></img> and <br></br> will also be removed. It's not obvious from your question, but I'll assume this is undesirable. Maybe you're not worried because you know all your images are declared like this <img src="pic.png"/>. Otherwise the regular expression will need to be modified to account for this, but I decided to start simple for an easier explanation...
It works by matching the opening tag: a literal < followed by the tag name (one or more characters which are not whitespace or > = [^\s>]+), any attributes (zero or more characters which aren't > = [^>]*), and then a literal >; and a closing tag with the same name: this takes advantage of the fact that we captured the tag name, so we can use a backreference = </\1>. The matches are then replaced with the empty string.
If the syntax/terminology used here is unfamiliar to you, I'm a fan of the perlre documentation page. Regular expression syntax in other languages should be very similar if not identical to this, so hopefully this will be useful even if you don't Perl :)
Oh, one more thing. If you have things like <div><p></p></div>, these will not be picked up all at once. You'll have to do multiple passes: the first will remove the <p></p>, leaving a <div></div> to be removed by the second. In Perl, the substitution operator returns the number of replacements made, so you can:
perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
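A quick check of that multi-pass behaviour on a nested case (assuming perl is available):

```shell
# One s///g pass leaves the outer <div></div> behind...
printf '%s\n' '<div><p></p></div>' | perl -pe 's|<([^\s>]+)[^>]*></\1>||g'
# -> <div></div>

# ...while the '1 while' loop keeps substituting until nothing matches.
printf '%s\n' '<div><p></p></div>' | perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g'
# -> (empty line)
```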

Extract all the links between specified html tags from an html file with sed

Well I must find a way to extract all the links between <div id="links"> and </table> tags.
And if there is more than one link, it should add '\n' character between the urls: "$URL1\n$URL2".
<div id="links">
<table>
<td>url</td>
<td>url</td>
</table>
<table>
..
</table>
</div>
The ones between <div> tag and the first </table> tag.
Are there any other ways besides sed then?
Thank you.
As is posted every day on SO: You can't process HTML with regular expressions. Can you provide some examples of why it is hard to parse XML and HTML with a regex?
That goes double for a tool as limited as sed, with its Basic Regular Expressions.
If the kind of input you have is very limited such that every link is in the exact same format, it might be possible, in which case you'd have to post an example of that format. But for general HTML pages, it can't be done.
ETA given your example: at the simplest level, since each URL is already on its own line, you could select the ones that look right and throw away the bits you don't want:
#!/bin/sed -f
s/^<td><a href="\(.*\)">.*<\/a><\/td>$/\1/p
d
However note that this would still leave URLs in their HTML-encoded form. If the script that produced this file is correctly HTML-encoding its URLs, you would then have to replace any instances of the lt/gt/quot/amp entity references back to their plain character form ‘<>"&’. In practice the only one of those you're likely to meet is &amp;, which is very common indeed in URLs.
But! That's not all the HTML-encoding that might have occurred. Maybe there are other HTML entity references in there, like eacute (which would be valid now we have IRIs), or numerical character references (in both decimal and hex). There are two million-odd potential forms of encoding for characters including Unicode... replacing each one individually in sed would be a massive exercise in tedium.
Whilst you could possibly get away with it if you know that the generator script will never output any of those, an HTML parser is still best really. (Or, if you know it's well-formed XHTML, you can use a simpler XML parser, which tends to be built in to modern languages' standard libraries.)
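Putting the two steps together for the simple, well-formed case (this sketch uses sed -n with /p instead of a separate d command, decodes only &amp; per the caveats above, and the sample input is made up):

```shell
printf '%s\n' '<td><a href="http://a.com/?x=1&amp;y=2">a</a></td>' |
    sed -n 's/^<td><a href="\(.*\)">.*<\/a><\/td>$/\1/p' |
    sed 's/&amp;/\&/g'
# -> http://a.com/?x=1&y=2
```

(In the second sed command, & is special in the replacement, so the literal ampersand is written \&.)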
If you have access to Python, I would recommend BeautifulSoup, a nice Python library for manipulating HTML. The following code collects links from a given resource, which is a full URL to a webpage like http://www.foo.com, and stores them in a file. Hope this helps.
import sys, os
from urllib import urlopen            # Python 2 / BeautifulSoup 3 era code
from BeautifulSoup import BeautifulSoup

fileLinksName = "links.dat"

if __name__ == "__main__":
    try:
        # get all links collected so far (start fresh if the file is missing)
        links = []
        if os.path.exists(fileLinksName):
            fileLinks = open(fileLinksName)
            links = fileLinks.read().split('\n')
            fileLinks.close()

        htmlFileSoup = BeautifulSoup(urlopen(sys.argv[1]).read())
        anchorList = htmlFileSoup.findAll('a')
        for htmlAnchor in anchorList:
            print htmlAnchor
            if htmlAnchor.get('href'):
                links.append(htmlAnchor['href'])

        # store the collected links
        fileLinks = open(fileLinksName, 'w')
        fileLinks.write('\n'.join(links))
        fileLinks.close()

        for link in links:
            print link
    except:
        print sys.exc_info()
        exit()
This might be possible if instead of trying to look at the tags you just look for the URLs.
If these are the only URLs in the page you can write a pattern to look for URLs between quotes, something like:
"[a-z]+://[^"]+"
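That quoted-URL pattern drops straight into grep -o (the -E flag provides the + quantifier; the sample input here is invented):

```shell
printf '%s\n' '<td><a href="http://a.com/x">a</a></td>' > page.html

# Print each quoted URL, then strip the surrounding quotes.
grep -Eo '"[a-z]+://[^"]+"' page.html | tr -d '"'
# -> http://a.com/x
```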
Do you have access to AWK? A combination of AWK and sed might do what you want, provided that:
The html is relatively simple
The html doesn't change suddenly (I mean in form, not in content)
The html is not excessively convoluted.
It's false that you can't process HTML with regular expressions. It's true that in the general case you can't process HTML (or XML) with regexes, because they allow arbitrary nesting and regexes don't do recursion well (or at all). But if your HTML is relatively 'flat', you can certainly do much with regexes.
I can't tell you exactly what to do, because I've forgotten what little AWK and sed I learned in college, but this strikes me as something doable:
Find the string <div id="links">
Now find the string <table>
Now find the string <td>...</td> and get a link from it (this is the regex part).
Append it to var $links
Until you find the string </table>
Finally, print $links separating each link with \n.
Again, this is just pseudocode for the simple case. But it might just work.
I mention AWK because, even if you don't have access to Perl, sed and AWK tend to be both installed.
Finally, for a pure sed solution, you could also take a look at this sed recipe and adapt it to your needs.

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the <a href="http://foo.com/">foo</a> link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great deal of users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is to make the search term boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of <foo> finding. Here is another <foo>.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
it matches all words between tags
To strip variable-size content out of even nested tags, you can use this regex, which is in fact a small recursive pattern (note: PCRE engine, which supports the (?1) subpattern recursion):
(?<=>)((?:\w+)(?:\s*))(?1)*