Here is an example. How can I display this text on my HTML page as is?
I tried using &nbsp; for spaces and <br> for line breaks; <br> seemed to work, but &nbsp; did not.
-`
.o+`
`ooo/
`+oooo:
`+oooooo:
-+oooooo+:
`/:-:++oooo+:
`/++++/+++++++:
`/++++++++++++++:
`/+++ooooooooooooo/`
./ooosssso++osssssso+`
.oossssso-````/ossssss+`
-osssssso. :ssssssso.
:osssssss/ osssso+++.
/ossssssss/ +ssssooo/-
`/ossssso+/:- -:/+osssso+-
`+sso+:-` `.-/+oso:
`++:. `-/+/
.` `
Here is a code snippet to check how it's rendered:
<p>
-`
.o+`
`ooo/
`+oooo:
`+oooooo:
-+oooooo+:
`/:-:++oooo+:
`/++++/+++++++:
`/++++++++++++++:
`/+++ooooooooooooo/`
./ooosssso++osssssso+`
.oossssso-````/ossssss+`
-osssssso. :ssssssso.
:osssssss/ osssso+++.
/ossssssss/ +ssssooo/-
`/ossssso+/:- -:/+osssso+-
`+sso+:-` `.-/+oso:
`++:. `-/+/
.` `
</p>
Use the preformatted text element, <pre>: it renders text and whitespace exactly as written, in a monospaced font.
<pre>
-`
.o+`
`ooo/
`+oooo:
`+oooooo:
-+oooooo+:
`/:-:++oooo+:
`/++++/+++++++:
`/++++++++++++++:
`/+++ooooooooooooo/`
./ooosssso++osssssso+`
.oossssso-````/ossssss+`
-osssssso. :ssssssso.
:osssssss/ osssso+++.
/ossssssss/ +ssssooo/-
`/ossssso+/:- -:/+osssso+-
`+sso+:-` `.-/+oso:
`++:. `-/+/
.` `
</pre>
To make it more accessible for people using screen readers and other assistive technology, you could add an image ARIA role to the <pre> element and provide alternative text, like so:
<pre role="img" aria-label="ASCII art of an upward-sweeping, triangular arrow.">
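One detail the <pre> approach leaves open: whitespace is preserved, but characters such as < and & inside the art are still parsed as markup and need escaping. Below is a minimal sketch in Python (not a language used in this thread) that builds such a block from a string. The function name ascii_art_to_pre and the sample art are hypothetical, chosen only for illustration.

```python
from html import escape


def ascii_art_to_pre(art, label):
    """Wrap ASCII art in a <pre role="img"> block, escaping markup characters.

    The newline right after the opening <pre> tag is dropped by HTML parsers,
    so it does not add an empty first line to the rendered art.
    """
    return '<pre role="img" aria-label="{}">\n{}\n</pre>'.format(
        escape(label, quote=True),   # attribute value: quotes must be escaped too
        escape(art, quote=False),    # element content: only &, < and >
    )


# Hypothetical two-line art containing characters that would break raw HTML.
print(ascii_art_to_pre(" <o>\n/&\\", "demo"))
```

Leading spaces in the art are passed through untouched; only the three markup characters are rewritten.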
I have a file formatted like this one:
Eye color
<p class="ul">Eye color, color</p> <p class="ul1">blue, cornflower blue, steely blue</p> <p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1">musteline</p> <link rel="stylesheet" href="a.css">
</>
Each word within the <p class="ul1">, separated by a comma, should be wrapped in an <a> tag, like this:
Eye color
<p class="ul">Eye color, color</p> <p class="ul1"><a href="entry://blue">blue</a>, <a href="entry://cornflower blue">cornflower blue</a>, <a href="entry://steely blue">steely blue</a></p> <p class="ul1"><a href="entry://velvet brown">velvet brown</a></p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1"><a href="entry://musteline">musteline</a></p> <link rel="stylesheet" href="a.css">
</>
There could be one or several words within the <p class="ul1"> tag.
Is this possible in a Perl one-liner?
Thanks in advance. Any help is appreciated.
Parse the file using a module and iterate over the elements you need (<p> of class ul1). Extract those comma-separated phrases from each and wrap links around them; then replace the element with that new content. Write the changed tree out in the end.
Using HTML::TreeBuilder (with its workhorse HTML::Element)
use warnings;
use strict;
use feature 'say';
use HTML::Entities;
use HTML::TreeBuilder;
my $file = shift // die "Usage: $0 file\n";
my $tree = HTML::TreeBuilder->new_from_file($file);
foreach my $elem ($tree->look_down(_tag => "p", class => "ul1")) {
    my @new_content;
    for ($elem->content_list) {
        # Split the text on commas and wrap each phrase in a link
        my @w = split /\s*,\s*/;
        my $wrapped = join ", ",
            map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
        push @new_content, $wrapped;
    }
    $elem->delete_content;
    $elem->push_content( @new_content );
};
say decode_entities $tree->as_HTML;
In your case an element ($elem) will have one item in the content_list, so you don't have to collect modified content into an array (@new_content) but can process that one piece only, which simplifies the code. Working with a list as above doesn't hurt, of course.
I redirect the output of this program to an .html file. The generated file is quite frugal on newlines. If pretty HTML matters, make a pass with a tool like HTML::Tidy or HTML::PrettyPrinter.
In a one-liner? Nah, it's too much. And please don't use regex as there's trouble down the road; it needs close work to get it right, is easy to end up buggy, is sensitive to smallest details, and brittle for even slightest changes in input. And that's when it can do the job. There are reasons for libraries.
Another good tool for this job is Mojo::DOM. For example
use Mojo::DOM;
use Path::Tiny; # only to read the file into a string easily
my $html = path($file)->slurp;
my $dom = Mojo::DOM->new($html);
foreach my $elem ($dom->find('p.ul1')->each) {
    my @w = split /\s*,\s*/, $elem->text;
    my $new = join ', ',
        map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
    $elem->replace( $new );
}
say $dom;
Produces the same HTML as above (just nicer, and note no need to deal with entities).
Newer module versions provide the new_tag method, with which the additional link above is made as
my $new = join ', ',
    map { $elem->new_tag('a', 'href' => "entry://$_", $_) } @w;
which takes care of some subtle needs (HTML escaping for one). The main docs don't say when this method was added, see changelog (May 2018, so supposedly in v5.28; it works with my 5.29.2).
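To make the escaping point concrete, here is a small sketch of the split-and-wrap step in Python rather than Perl (a swap-in language for illustration only; it is not what the modules above do internally). The helper name wrap_phrases is hypothetical. The entry:// prefix is taken from the snippet above.

```python
from html import escape


def wrap_phrases(text, scheme="entry://"):
    """Split comma-separated phrases and wrap each one in an <a> link."""
    phrases = [p.strip() for p in text.split(",") if p.strip()]
    links = [
        # Escape both the attribute value and the link text, so that a
        # phrase containing &, < or a quote cannot break the markup.
        '<a href="{}">{}</a>'.format(escape(scheme + p, quote=True), escape(p))
        for p in phrases
    ]
    return ", ".join(links)


print(wrap_phrases("blue, cornflower blue"))
print(wrap_phrases("R&D"))
```

With plain string concatenation, as in the qq(...) versions, a phrase such as R&D would be written into the output unescaped.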
I padded the shown sample to this file for testing:
<!DOCTYPE html> <title>Eye color</title> <body>
<p class="ul">Eye color, color</p>
<p class="ul1">blue, cornflower blue, steely blue</p>
<p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css"></>
weasel
<p class="ul">weasel</p>
<p class="ul1">musteline</p> <link rel="stylesheet" href="a.css"></>
</body> </html>
Update: It's been clarified that the given markup snippet isn't merely a fragment of a presumably full HTML document but that it is a file (as stated) that stands as shown, as a custom format using HTML; apart from the required changes, the rest of it needs to be preserved.
A particularly unpleasant detail proves to be the </> part; each of HTML::TreeBuilder, Mojo::DOM, and XML::LibXML† discards it while parsing. I couldn't find a way to make them keep that piece.
It was Marpa::HTML that processed the whole fragment as required, changing what was asked while leaving alone the rest of it.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
use Marpa::HTML qw(html);
my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;
my $marpa = Marpa::HTML::html(
    \$html,
    {
        'p.ul1' => sub {
            return join ', ',
                map { qq(<a href="entry://$_">).$_.q(</a>) }
                split /\s*,\s*/, Marpa::HTML::contents();
        },
    }
);
say $$marpa;
The processing of the <p> tags of class ul1 is the same as before: split the content on commas, wrap each piece in an <a> tag, then join the pieces back with a comma and a space.
This prints (with added line-breaks and indentation for readability)
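Eye color
<p class="ul">Eye color, color</p>
<a href="entry://blue">blue</a>,
<a href="entry://cornflower blue">cornflower blue</a>,
<a href="entry://steely blue">steely blue</a>
<a href="entry://velvet brown">velvet brown</a>
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <a href="entry://musteline">musteline</a>
<link rel="stylesheet" href="a.css">
</>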
It is the overall approach of this module that is suited to a task like this:
Marpa::HTML is an extremely liberal HTML parser. Marpa::HTML does not reject any documents, no matter how poorly they fit the HTML standards.
Here it processed a custom piece of HTML-like markup, leaving things like </> in place.
† See this post for an example of very permissive processing of HTML with XML::LibXML
perl -0777 -MWeb::Query=wq -lne'
  my $w = wq $_; my $sep = ", ";
  $w->filter("p.ul1")->each(sub {
    my (undef, $e) = @_;
    $e->html(join $sep, map {
      qq(<a href="entry://$_">$_</a>)
    } split $sep, $e->text);
  });
  print $w->as_html;
'
One-liner:
cat text | perl -pE 's{<p class="ul1">\K.*?(?=<\/p>)}{ join ", ", map {qq|<a href="entry://$_">$_</a>|} split /, */, $& }eg'
From an awk script I want to generate a HTML file. My string could include characters like "<" and "&". Is there a short and proven function for awk which does the escaping?
Sure. Just call makeEntities() for each line ($0) you want to convert. Or modify it to accept an argument. I made this for working with the British National Corpus, which has a high degree of overlap with HTML entities, but not 100%, so if there are some exotic characters you need, you should verify that they are correct.
function makeEntities() {
    # The ampersand must be converted first; otherwise the "&" of every
    # entity produced below would itself be escaped again.
    gsub(/&/, "\\&amp;");
    gsub(/á/, "\\&aacute;");
    gsub(/Á/, "\\&Aacute;");
    gsub(/ă/, "\\&abreve;");
    gsub(/â/, "\\&acirc;");
    gsub(/´/, "\\&acute;");
    gsub(/æ/, "\\&aelig;");
    gsub(/Æ/, "\\&AElig;");
    gsub(/α/, "\\&agr;");
    gsub(/à/, "\\&agrave;");
    gsub(/ā/, "\\&amacr;");
    gsub(/Ā/, "\\&Amacr;");
    gsub(/ą/, "\\&aogon;");
    gsub(/å/, "\\&aring;");
    gsub(/Å/, "\\&Aring;");
    gsub(/ã/, "\\&atilde;");
    gsub(/ä/, "\\&auml;");
    gsub(/Ä/, "\\&Auml;");
    gsub(/β/, "\\&bgr;");
    gsub(/\\/, "\\&bsol;");
    gsub(/•/, "\\&bull;");
    gsub(/ć/, "\\&cacute;");
    gsub(/č/, "\\&ccaron;");
    gsub(/Č/, "\\&Ccaron;");
    gsub(/ç/, "\\&ccedil;");
    gsub(/Ç/, "\\&Ccedil;");
    gsub(/ĉ/, "\\&ccirc;");
    gsub(/✓/, "\\&check;");
    gsub(/ˆ/, "\\&circ;");
    gsub(/@/, "\\&commat;");
    gsub(/©/, "\\&copy;");
    gsub(/‐/, "\\&dash;");
    gsub(/ď/, "\\&dcaron;");
    gsub(/°/, "\\&deg;");
    gsub(/δ/, "\\&dgr;");
    gsub(/Δ/, "\\&Dgr;");
    gsub(/¨/, "\\&die;");
    gsub(/\$/, "\\&dollar;");
    gsub(/đ/, "\\&dstrok;");
    gsub(/é/, "\\&eacute;");
    gsub(/É/, "\\&Eacute;");
    gsub(/ě/, "\\&ecaron;");
    gsub(/ê/, "\\&ecirc;");
    gsub(/è/, "\\&egrave;");
    gsub(/È/, "\\&Egrave;");
    gsub(/ε/, "\\&egr;");
    gsub(/ē/, "\\&emacr;");
    gsub(/Ē/, "\\&Emacr;");
    gsub(/ę/, "\\&eogon;");
    gsub(/ð/, "\\&eth;");
    gsub(/ë/, "\\&euml;");
    gsub(/Ë/, "\\&Euml;");
    gsub(/♭/, "\\&flat;");
    gsub(/½/, "\\&frac12;");
    gsub(/⅓/, "\\&frac13;");
    gsub(/¼/, "\\&frac14;");
    gsub(/⅕/, "\\&frac15;");
    gsub(/⅙/, "\\&frac16;");
    gsub(/⅛/, "\\&frac18;");
    gsub(/⅔/, "\\&frac23;");
    gsub(/⅖/, "\\&frac25;");
    gsub(/¾/, "\\&frac34;");
    gsub(/⅗/, "\\&frac35;");
    gsub(/⅜/, "\\&frac38;");
    gsub(/⅘/, "\\&frac45;");
    gsub(/⅝/, "\\&frac58;");
    gsub(/⅞/, "\\&frac78;");
    gsub(/′/, "\\&ft;");
    gsub(/γ/, "\\&ggr;");
    gsub(/>/, "\\&gt;");
    gsub(/½/, "\\&half;");
    gsub(/ħ/, "\\&hstrok;");
    gsub(/í/, "\\&iacute;");
    gsub(/Í/, "\\&Iacute;");
    gsub(/î/, "\\&icirc;");
    gsub(/Î/, "\\&Icirc;");
    gsub(/ì/, "\\&igrave;");
    gsub(/ī/, "\\&imacr;");
    gsub(/″/, "\\&ins;");
    gsub(/¿/, "\\&iquest;");
    gsub(/ï/, "\\&iuml;");
    gsub(/Ï/, "\\&Iuml;");
    gsub(/ĺ/, "\\&lacute;");
    gsub(/Ĺ/, "\\&Lacute;");
    gsub(/\{/, "\\&lcub;");
    gsub(/≤/, "\\&le;");
    gsub(/λ/, "\\&lgr;");
    gsub(/_/, "\\&lowbar;");
    gsub(/\[/, "\\&lsqb;");
    gsub(/ł/, "\\&lstrok;");
    gsub(/Ł/, "\\&Lstrok;");
    gsub(/</, "\\&lt;");
    gsub(/—/, "\\&mdash;");
    gsub(/μ/, "\\&mgr;");
    gsub(/µ/, "\\&micro;");
    gsub(/·/, "\\&middot;");
    gsub(/ń/, "\\&nacute;");
    gsub(/ň/, "\\&ncaron;");
    gsub(/ņ/, "\\&ncedil;");
    gsub(/–/, "\\&ndash;");
    gsub(/ñ/, "\\&ntilde;");
    gsub(/Ñ/, "\\&Ntilde;");
    gsub(/#/, "\\&num;");
    gsub(/ó/, "\\&oacute;");
    gsub(/Ó/, "\\&Oacute;");
    gsub(/ô/, "\\&ocirc;");
    gsub(/œ/, "\\&oelig;");
    gsub(/ò/, "\\&ograve;");
    gsub(/Ω/, "\\&ohm;");
    gsub(/ō/, "\\&omacr;");
    gsub(/ø/, "\\&oslash;");
    gsub(/Ø/, "\\&Oslash;");
    gsub(/õ/, "\\&otilde;");
    gsub(/ö/, "\\&ouml;");
    gsub(/Ö/, "\\&Ouml;");
    gsub(/φ/, "\\&phgr;");
    gsub(/\+/, "\\&plus;");
    gsub(/±/, "\\&plusmn;");
    gsub(/£/, "\\&pound;");
    gsub(/ŕ/, "\\&racute;");
    gsub(/√/, "\\&radic;");
    gsub(/ř/, "\\&rcaron;");
    gsub(/Ř/, "\\&Rcaron;");
    gsub(/\}/, "\\&rcub;");
    gsub(/®/, "\\&reg;");
    gsub(/-/, "\\&rehy;");
    gsub(/\]/, "\\&rsqb;");
    gsub(/ś/, "\\&sacute;");
    gsub(/Ś/, "\\&Sacute;");
    gsub(/š/, "\\&scaron;");
    gsub(/Š/, "\\&Scaron;");
    gsub(/ş/, "\\&scedil;");
    gsub(/Ş/, "\\&Scedil;");
    gsub(/ŝ/, "\\&scirc;");
    gsub(/σ/, "\\&sgr;");
    gsub(/♯/, "\\&sharp;");
    gsub(/\//, "\\&shilling;");
    gsub(/∼/, "\\&sim;");
    gsub(/\//, "\\&sol;");
    gsub(/²/, "\\&sup2;");
    gsub(/ß/, "\\&szlig;");
    gsub(/ť/, "\\&tcaron;");
    gsub(/ţ/, "\\&tcedil;");
    gsub(/τ/, "\\&tgr;");
    gsub(/þ/, "\\&thorn;");
    gsub(/Þ/, "\\&THORN;");
    gsub(/×/, "\\&times;");
    gsub(/™/, "\\&trade;");
    gsub(/ú/, "\\&uacute;");
    gsub(/Ú/, "\\&Uacute;");
    gsub(/û/, "\\&ucirc;");
    gsub(/ù/, "\\&ugrave;");
    gsub(/ū/, "\\&umacr;");
    gsub(/¨/, "\\&uml;");
    gsub(/ů/, "\\&uring;");
    gsub(/ü/, "\\&uuml;");
    gsub(/Ü/, "\\&Uuml;");
    gsub(/\|/, "\\&verbar;");
    gsub(/ŵ/, "\\&wcirc;");
    gsub(/ý/, "\\&yacute;");
    gsub(/ŷ/, "\\&ycirc;");
    gsub(/¥/, "\\&yen;");
    gsub(/ÿ/, "\\&yuml;");
    gsub(/Ÿ/, "\\&Yuml;");
    gsub(/ź/, "\\&zacute;");
    gsub(/Ž/, "\\&Zcaron;");
    gsub(/ž/, "\\&zcaron;");
    gsub(/ż/, "\\&zdot;");
}
To escape the bare minimum you can do:
function escapeHtml(t)
{
# Must do this one first
    gsub(/&/, "\\&amp;", t);
    gsub(/</, "\\&lt;", t);
    gsub(/>/, "\\&gt;", t);
return t;
}
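The "must do this one first" comment is the important part of the minimal version. The sketch below restates the same three replacements in Python (a swap-in for illustration, not awk) next to a deliberately wrong ordering. Both function names are hypothetical.

```python
def escape_html(text):
    """Escape the bare minimum for HTML text content."""
    text = text.replace("&", "&amp;")   # must come first
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_html_wrong_order(text):
    """Same replacements with '&' handled last, to show the failure mode."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # The '&' that starts '&lt;' and '&gt;' is escaped a second time here.
    return text.replace("&", "&amp;")


print(escape_html("a < b & c"))
print(escape_html_wrong_order("a < b"))
```

In the wrong ordering, the entity produced for < is itself rewritten, so the browser would display the literal text &lt; instead of <.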
I have two files, XML and an HTML and need to extract data from these on certain patterns.
My XML file is pretty well formatted and I can use readline to read a line and search data between tags.
if($line =~ /\<tag1\>$varvalue\<\/tag1\>/)
However, my HTML file has some of the worst markup I have seen, and it looks like this:
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
Now from this file I need to pick data which is shown in bold.
I can use Perl regular expression to search data from this file.
RegEx match open tags except XHTML self-contained tags
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Using regular expressions to parse HTML: why not?
When you are done reading those come back :)
Edit : and to actually solve your problem take a look at this module :
http://perlmeme.org/tutorials/html_parser.html
A sample that parses an HTML file:
#!/usr/local/bin/perl
use HTML::TreeBuilder;
$tree = HTML::TreeBuilder->new;
$tree->parse_file('C:\Users\Stefanos\workspace\HTML_Parser_Test\test.html');
@divs = $tree->find('div');
$tree->delete;
In this example I just used your tags as the main body of an .html file. The divs are stored in the @divs array. Since ** is not an element, I have no idea which text you want to find, so I can't help you further.
P.S. I have never used this module but I just did it in 5 minutes so it is not so hard to parse the html file and find whatever you want..
A regex to match a specific tag and store its contents in $1:
if ($subject =~ m!<tagname[^>]*>(.*?)</tagname>!s) {
# Successful match
}
You will soon run into the limitations of this approach when you have nested elements, though.
Replace tagname with the actual tag, e.g. in your case i, a, span, or div. For div, however, the match stops at the first closing </div>, so for the outer div you get only part of its contents, which is not what you want.
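The nesting problem is easy to reproduce. Here is a minimal sketch in Python (a swap-in language; the pattern is the same one as above with div substituted for tagname). The sample string is a shortened, hypothetical version of the markup in the question.

```python
import re

# Same shape as the Perl pattern: non-greedy match up to the first closing tag.
TAG_RE = re.compile(r"<div[^>]*>(.*?)</div>", re.DOTALL)

html = (
    '<div class="theater">'
    '<div class="address">3323 South Hoover Street</div>'
    " after</div>"
)

match = TAG_RE.search(html)
# The match starts at the outer <div> but ends at the inner </div>,
# so the captured group holds only a fragment of the outer element.
print(match.group(1))
```

The captured text begins with the inner opening tag and omits everything after the first closing tag, which is why a parser is the safer tool once elements nest.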
Parsing XML and HTML using regular expressions is a fool's errand. There are many simple to use Perl modules for parsing HTML. Here is something using HTML::TokeParser::Simple. I've omitted the code to associate movies and showtimes with theaters (because I have no intention of building an appropriate input file):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);
my @theaters;
while (my $div = $parser->get_tag('div')) {
    my $class = $div->get_attr('class');
    next unless defined($class) and $class eq 'theater';
    my %record;
    # Text up to the next closing </a> and </i>, with tags skipped
    $record{theater} = $parser->get_text('/a');
    $record{address} = $parser->get_text('/i');
    # Trim surrounding whitespace
    s{(?:^\s+)|(?:\s+\z)}{} for values %record;
    push @theaters, \%record;
}
use YAML;
print Dump \@theaters;
__DATA__
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**University Village 3**</a></h2>
<div class="address">
<i>**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**</i>
</div>
</div>
<div class="mtitle">
<a href="/movie/dream-house-2011" title="Dream House" onmouseover="mB(event, 771204354);" >**Dream House**</a>
<span>**(PG-13 , 1 hr. 31 min.)**</span>
</div>
<div class="times">
**1:00 PM,**
</div>
<div class="theater">
<h2>
<a href="/showtimes/university-village-3" >**Some other theater*</a></h2>
<div class="address">
<i>**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**</i>
</div>
</div>
Output:
[sinan@macardy]:~/tmp> ./tt.pl
---
- address: '**3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321**'
theater: '**University Village 3**'
- address: '**1234 South Hoover Street, St Paul, MN 99999 | (999) 748-6321**'
theater: '**Some other theater*'
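For comparison, the same token-stream idea can be sketched with Python's standard-library html.parser (a swap-in language; this is not the Perl module used above). The class name TheaterParser and the shortened sample are hypothetical, and the sketch assumes each theater block contains one <a> and one <i>, as in the question.

```python
from html.parser import HTMLParser


class TheaterParser(HTMLParser):
    """Collect the text of <a> and <i> inside each <div class="theater">."""

    def __init__(self):
        super().__init__()
        self.theaters = []    # finished records
        self._record = None   # record being filled, or None outside a theater
        self._field = None    # "theater" inside <a>, "address" inside <i>
        self._depth = 0       # div nesting depth within the current theater

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._record is not None:
                self._depth += 1
            elif dict(attrs).get("class") == "theater":
                self._record = {"theater": "", "address": ""}
                self._depth = 1
        elif self._record is not None and tag == "a":
            self._field = "theater"
        elif self._record is not None and tag == "i":
            self._field = "address"

    def handle_endtag(self, tag):
        if tag in ("a", "i"):
            self._field = None
        elif tag == "div" and self._record is not None:
            self._depth -= 1
            if self._depth == 0:   # the theater div itself has closed
                self.theaters.append(self._record)
                self._record = None

    def handle_data(self, data):
        if self._record is not None and self._field:
            self._record[self._field] += data.strip()


SAMPLE = """
<div class="theater">
  <h2><a href="/showtimes/university-village-3" >University Village 3</a></h2>
  <div class="address">
    <i>3323 South Hoover Street, Los Angeles CA 90007 | (213) 748-6321</i>
  </div>
</div>
<div class="mtitle">
  <a href="/movie/dream-house-2011" title="Dream House">Dream House</a>
</div>
"""

parser = TheaterParser()
parser.feed(SAMPLE)
parser.close()
print(parser.theaters)
```

Tracking the div depth is what lets the nested address div close without ending the theater record early, the case the single-regex approach cannot handle.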