How can I use simple HTML output in a Perl script?

I am working on an assignment where I must use an HTML font tag to change the font color of a Perl output statement. I've followed the directions in the assignment, and this is what the code looks like:
print $FILE2 "<font color = 'red'> var1 </font>";
$FILE2 is the filehandle for the file I am writing to. var1 is a test variable.
However, there is no color change, and the syntax above is printed to the screen exactly as written. I'm not really sure what I am doing wrong. Can somebody please help me?
I have not included any headers other than strict and warnings.

Variables in Perl are always indicated by sigils. In this case you are dealing with scalars (single-value containers), and the sigil for a scalar variable is $. This sigil is also what allows the variable to be interpolated (replaced with its value) in a double-quoted string like the one you have. Without the sigil, var1 in your string is just literal text, so write $var1 instead. See perldata for more info.
As a side note, interpolating variables directly into HTML can be dangerous because they can contain HTML tags. If you want to avoid the possibility of the input adding tags of its own, pass the input through an HTML-escaper like encode_entities from HTML::Entities or xml_escape from Mojo::Util first. The usual way to construct HTML like this is with templates, and templating modules often provide such functionality built in.
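The escaping step is not specific to Perl. As a minimal sketch of the same idea in Python, using the standard library's html.escape (the helper name red_text and the sample input are made up for illustration):

```python
from html import escape

def red_text(value: str) -> str:
    """Wrap a value in a red font tag, escaping any markup it contains."""
    # escape() turns &, <, > and quote characters into entities, so the
    # value cannot introduce tags of its own.
    return "<font color='red'>" + escape(value) + "</font>"

print(red_text("5 < 6 & <b>bold</b>"))
```

The tag itself is written literally, and only the interpolated value goes through the escaper.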

Related

cgi/perl/html - what characters to escape when printing into html?

I got an input file that I need to print directly into an html page.
I did $inputfile =~ s/\n/<br>/g; Are there any other special characters I should be aware of maybe other than < and > when printing this $inputfile to html?
You absolutely should use HTML::Escape instead of doing some ill-conceived hackjob which will cause everyone who deals with your code (you included) to curse your name in the future.
It's simple - install HTML::Escape via CPAN, then use it thus:
use HTML::Escape qw(escape_html);
my $escaped_string = escape_html($string);
Note that if you want to preserve whitespace formatting, you should use a module for that as well, such as HTML::FromText. The code above will not automatically convert line breaks to <br> tags, because that is a completely different task from escaping unsafe characters to HTML entities.
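To make the distinction between escaping and line-break conversion concrete, here is a rough Python sketch of both steps done in order. It uses html.escape from the standard library; the function name text_to_html is hypothetical.

```python
from html import escape

def text_to_html(text: str) -> str:
    """Escape unsafe characters first, then turn line breaks into <br> tags."""
    escaped = escape(text)  # &, <, > and quote characters become entities
    # Normalise Windows and old Mac line endings before inserting <br>.
    escaped = escaped.replace("\r\n", "\n").replace("\r", "\n")
    return escaped.replace("\n", "<br>\n")

print(text_to_html("a<b\nc&d"))
```

The order matters: if the <br> tags were inserted first, the escaper would turn them into &lt;br&gt; and they would show up as literal text.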

Is it possible to write full css line with LESSCSS using variables?

Is there a way that you can write a full CSS line using a LESS variable?
For example, if I declare a variable:
@dir: test;
and then want that variable to be used within another variable:
@bg-img: url(http://example.com/@dir/img/bg.png);
When I try to use @bg-img in my CSS, it compiles like this:
body{background-image:url(http://example.com/@dir/img/bg.png);}
How do I get @dir to echo out as test?
I know that I can just replace @dir, but since I have various different microsites, it'll be quicker to change the one instance of @dir rather than go through and change every instance of the microsite name.
This is possible with interpolation, but I'm not certain whether it would work in your particular situation without seeing more code. However, give this a try:
body{background-image:url(http://example.com/@{dir}/img/bg.png);}
I found it in the docs, further down the page under String interpolation. Hope this helps.
EDIT: and escape, just as @Rob W mentions in the comments. That is also in the docs, just below interpolation.

How to parse the attributes value inside {{}} (curly braces) inside a infobox

Within an infobox on Wikipedia, some attribute values are also inside curly braces {{}}. Sometimes they contain links as well. I need the values inside the braces, as they are displayed on the Wikipedia web page.
I read that these are templates too. Can anyone give me a link or guide me on how to deal with them?
Double curly braces {{}} define a call to some kind of magic word, variable, parser function, or template. Help can be found on MediaWiki.org/.../Manual:Magic_words. The little lines that look like | are called pipes and are used as separators that allow the wikicore parsing engine to define parameters that can be used with the magic word, variable, parser function, or template.
Hopefully this will help everyone who comes across this very same issue.
Considering you will be parsing the infobox with PHP, you can use this:
http://www.mywiki.com/wiki/api.php?format=xml&action=query&titles=PAGE_TITLE_THAT_CONTAINS_AN_INFOBOX&prop=revisions&rvprop=content&rvgeneratexml=1
'rvgeneratexml' is being set to true (1), this will make the xml node <rev> generate an attribute "parsetree" containing the infobox information in XML format.
Then, in PHP, you can load the whole information (<api>everything including <rev></api>) with simpleXML:
$xml = simplexml_load_file($url);
Then you can load the template's information by getting the "parsetree" attribute and loading the string with:
$template = simplexml_load_string($xml->query->pages->page->revisions->rev->attributes()->parsetree);
$template = $template->template; // If more than 1 template, check template[0], [1], etc
Then, by using the correct structure, you can access the elements with something like:
if ($template->part[0]->name == 'name')
    $film = $template->part[0]->value;
Then, $film will contain the film's name (->name is the parameter's name, and ->value is its value).
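For readers not using PHP, the same walk over the parse tree can be sketched in Python with the standard library's xml.etree.ElementTree. The sample XML below is hypothetical: it only mimics the template/part/name/value shape described above, and a real parse tree may contain nested templates that this simple version does not descend into.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse tree, shaped like the structure described above.
SAMPLE_PARSETREE = (
    "<root><template><title>Infobox film</title>"
    "<part><name>name</name>=<value>Example Film</value></part>"
    "<part><name>director</name>=<value>Jane Doe</value></part>"
    "</template></root>"
)

def template_params(parsetree_xml: str) -> dict:
    """Return the first template's parameters as a name -> value dict."""
    root = ET.fromstring(parsetree_xml)
    template = root.find("template")  # first template only
    params = {}
    if template is None:
        return params
    for part in template.findall("part"):
        # findtext() returns only the element's direct text, not the
        # text of nested child elements.
        name = (part.findtext("name") or "").strip()
        value = (part.findtext("value") or "").strip()
        params[name] = value
    return params

print(template_params(SAMPLE_PARSETREE).get("name"))
```

Collecting every part into a dict avoids relying on the parameter sitting at a fixed index such as part[0].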

How can I extract the HREF value from an HTML link?

My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> yahoo.com.jp/
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is have my Perl script automatically extract the string inside the <A Href="">
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>", I thought the expected behavior is that output2 will be skipped since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning that by default it will match as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be a bit more exacting about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
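The difference between the three forms is easy to see on a line with two links. This is a small Python sketch (Python's re module treats .*, .*? and [^"]* the same way Perl does here); the sample line is invented for illustration.

```python
import re

# Invented sample line containing two links.
line = '<A HREF="first.html">one</A> <A HREF="second.html">two</A>'

# Greedy: .* runs to the LAST occurrence of "> on the line.
greedy = re.search(r'HREF="(.*)">', line).group(1)

# Non-greedy: .*? stops at the FIRST "> it can.
lazy = re.search(r'HREF="(.*?)">', line).group(1)

# Negated character class: cannot cross a double quote at all.
negated = re.search(r'HREF="([^"]*)"', line).group(1)

# findall with the negated class collects every link on the line.
all_links = re.findall(r'HREF="([^"]*)"', line, re.IGNORECASE)

print(greedy)
print(lazy)
print(negated)
print(all_links)
```

The greedy capture swallows everything from the first link's value through the start of the second link, while the other two stop at the first closing quote.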
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
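As an illustration of the parser-based approach, here is a minimal Python sketch using the standard library's html.parser, which plays a role comparable to HTML::Parser in Perl. The class and function names are hypothetical, and the sample input is an assumed reconstruction of a directory-listing line, not the asker's actual file.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased, so <A HREF=...> matches.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

def extract_hrefs(html_text):
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs

# Assumed sample input, including a stray closing tag on the second line.
sample = '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="example/">example/</A>\n</PRE><HR>'
print(extract_hrefs(sample))
```

A line without a link simply contributes nothing, so there is no stale capture left over from a previous match.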
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want .* because star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote. This will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.

How do I match text in HTML that's not inside tags?

Given a string like this:
This is the foo link
... and a search string like "foo", I would like to highlight all occurrences of "foo" in the text of the HTML -- but not inside a tag. In other words, I want to get this:
This is the <b>foo</b> link
However, a simple search-and-replace won't work, because it will match part of the URL in the <a> tag's href.
So, to express the above in the form of a question: How do I restrict a regex so that it only matches text outside of HTML tags?
Note: I promise that the HTML in question will never be anything pathological like:
<img title="Haha! Here are some angle brackets to screw you up: ><" />
Edit: Yes, of course I'm aware that there are complex libraries in CPAN that can parse even the most heinous HTML, and thus alleviate the need for such a regex. On many occasions, that's what I would use. However, this is not one of those occasions, since keeping this script short and simple, without external dependencies, is important. I just want a one-line regex.
Edit 2: Again, I know that Template::Refine::Fragment can parse all my HTML for me. If I were writing an application I would certainly use a solution like that. But this isn't an application. It's barely more than a shell script. It's a piece of disposable code. Being a single, self-contained file that can be passed around is of great value in this case. "Hey, run this program" is a much simpler instruction than, "Hey, install a Perl module and then run this-- wait, what, you've never used CPAN before? Okay, run perl -MCPAN -e shell (preferably as root) and then it's going to ask you a bunch of questions, but you don't really need to answer them. No, don't be afraid, this isn't going to break anything. Look, you don't need to answer every question carefully -- just hit enter over and over. No, I promise, it's not going to break anything."
Now multiply the above across a great many users who are wondering why the simple script they've been using isn't so simple anymore, when all that's changed is that the search term is now boldface.
So while Template::Refine::Fragment may be the answer to someone else's HTML parsing question, it's not the answer to this question. I just want a regular expression that works on the very limited subset of HTML that the script will actually be asked to parse.
If you can absolutely guarantee that there are no angle brackets in the HTML other than those used to open and close tags, this should work:
s%(>|\G)([^<]*?)($key)%$1$2<b>$3</b>%g
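Under the same guarantee (no stray angle brackets), a different technique also works in languages whose regex engine lacks \G: split the string on tags and replace only in the pieces between them. This is a Python sketch of that alternative, not a translation of the substitution above; the function name and the sample URL are made up.

```python
import re

def highlight(html_text: str, term: str) -> str:
    """Wrap occurrences of term in <b>...</b>, but only outside tags."""
    # The capturing group makes re.split keep the tags, so odd-numbered
    # pieces are tags and even-numbered pieces are the text between them.
    pieces = re.split(r"(<[^>]*>)", html_text)
    pattern = re.compile(re.escape(term))
    for i in range(0, len(pieces), 2):
        pieces[i] = pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", pieces[i])
    return "".join(pieces)

sample = '<a href="http://example.com/foo/">This is the foo link</a>'
print(highlight(sample, "foo"))
```

The occurrence of the term inside the href is left alone because it lives in a tag piece, which the loop never touches.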
In general, you want to parse the HTML into a DOM, and then traverse the text nodes. I would use Template::Refine for this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature ':5.10';
use Template::Refine::Fragment;
use Template::Refine::Utils qw(simple_replace);

my $frag = Template::Refine::Fragment->new_from_string('<p>Hello, world. This is a test of foo finding. Here is another foo.');
say $frag->process(
    simple_replace {
        my $n = shift;
        my $text = $n->textContent;
        $text =~ s/foo/<foo>/g;
        return XML::LibXML::Text->new($text);
    } '//text()',
)->render;
This outputs:
<p>Hello, world. This is a test of &lt;foo&gt; finding. Here is another &lt;foo&gt;.</p>
Anyway, don't parse structured data with regular expressions. HTML is not "regular", it's "context-free".
Edit: finally, if you are generating the HTML inside your program, and you have to do transformations like this on strings, "UR DOIN IT WRONG". You should build a DOM, and only serialize it when everything has been transformed. (You can still use TR, however, via the new_from_dom constructor.)
The following regex will match all text between tags or outside of tags:
<.*?>(.*?)<.*?>|>(.*?)<
Then you can operate on that as desired.
Try this one
(?=>)?(\w[^>]+?)(?=<)
It matches all words between tags.
To strip off the variable size contents from even nested tags you can use this regex that is in fact a mini-regular grammar for that. (note: PCRE machine)
(?<=>)((?:\w+)(?:\s*))(?1)*