Escape HTML special characters in awk - html

From an awk script I want to generate an HTML file. My string could include characters like "<" and "&". Is there a short and proven function for awk which does the escaping?

Sure. Just call makeEntities() for each line ($0) you want to convert. Or modify it to accept an argument. I made this for working with the British National Corpus, which has a high degree of overlap with HTML entities, but not 100%, so if there are some exotic characters you need, you should verify that they are correct.
function makeEntities() {
    # The ampersand must be escaped first, otherwise the "&" of every
    # entity inserted below would itself be turned into "&amp;".
    gsub(/&/, "\\&amp;");
    gsub(/á/, "\\&aacute;");
    gsub(/Á/, "\\&Aacute;");
    gsub(/ă/, "\\&abreve;");
    gsub(/â/, "\\&acirc;");
    gsub(/´/, "\\&acute;");
    gsub(/æ/, "\\&aelig;");
    gsub(/Æ/, "\\&AElig;");
    gsub(/α/, "\\&agr;");
    gsub(/à/, "\\&agrave;");
    gsub(/ā/, "\\&amacr;");
    gsub(/Ā/, "\\&Amacr;");
    gsub(/ą/, "\\&aogon;");
    gsub(/å/, "\\&aring;");
    gsub(/Å/, "\\&Aring;");
    gsub(/ã/, "\\&atilde;");
    gsub(/ä/, "\\&auml;");
    gsub(/Ä/, "\\&Auml;");
    gsub(/β/, "\\&bgr;");
    gsub(/\\/, "\\&bsol;");
    gsub(/•/, "\\&bull;");
    gsub(/ć/, "\\&cacute;");
    gsub(/č/, "\\&ccaron;");
    gsub(/Č/, "\\&Ccaron;");
    gsub(/ç/, "\\&ccedil;");
    gsub(/Ç/, "\\&Ccedil;");
    gsub(/ĉ/, "\\&ccirc;");
    gsub(/✓/, "\\&check;");
    gsub(/ˆ/, "\\&circ;");
    gsub(/@/, "\\&commat;");
    gsub(/©/, "\\&copy;");
    gsub(/‐/, "\\&dash;");
    gsub(/ď/, "\\&dcaron;");
    gsub(/°/, "\\&deg;");
    gsub(/δ/, "\\&dgr;");
    gsub(/Δ/, "\\&Dgr;");
    gsub(/¨/, "\\&die;");
    gsub(/\$/, "\\&dollar;");
    gsub(/đ/, "\\&dstrok;");
    gsub(/é/, "\\&eacute;");
    gsub(/É/, "\\&Eacute;");
    gsub(/ě/, "\\&ecaron;");
    gsub(/ê/, "\\&ecirc;");
    gsub(/è/, "\\&egrave;");
    gsub(/È/, "\\&Egrave;");
    gsub(/ε/, "\\&egr;");
    gsub(/ē/, "\\&emacr;");
    gsub(/Ē/, "\\&Emacr;");
    gsub(/ę/, "\\&eogon;");
    gsub(/ð/, "\\&eth;");
    gsub(/ë/, "\\&euml;");
    gsub(/Ë/, "\\&Euml;");
    gsub(/♭/, "\\&flat;");
    gsub(/½/, "\\&frac12;");
    gsub(/⅓/, "\\&frac13;");
    gsub(/¼/, "\\&frac14;");
    gsub(/⅕/, "\\&frac15;");
    gsub(/⅙/, "\\&frac16;");
    gsub(/⅛/, "\\&frac18;");
    gsub(/⅔/, "\\&frac23;");
    gsub(/⅖/, "\\&frac25;");
    gsub(/¾/, "\\&frac34;");
    gsub(/⅗/, "\\&frac35;");
    gsub(/⅜/, "\\&frac38;");
    gsub(/⅘/, "\\&frac45;");
    gsub(/⅝/, "\\&frac58;");
    gsub(/⅞/, "\\&frac78;");
    gsub(/′/, "\\&ft;");
    gsub(/γ/, "\\&ggr;");
    gsub(/>/, "\\&gt;");
    gsub(/½/, "\\&half;");      # never matches: ½ already became &frac12; above
    gsub(/ħ/, "\\&hstrok;");
    gsub(/í/, "\\&iacute;");
    gsub(/Í/, "\\&Iacute;");
    gsub(/î/, "\\&icirc;");
    gsub(/Î/, "\\&Icirc;");
    gsub(/ì/, "\\&igrave;");
    gsub(/ī/, "\\&imacr;");
    gsub(/″/, "\\&ins;");
    gsub(/¿/, "\\&iquest;");
    gsub(/ï/, "\\&iuml;");
    gsub(/Ï/, "\\&Iuml;");
    gsub(/ĺ/, "\\&lacute;");
    gsub(/Ĺ/, "\\&Lacute;");
    gsub(/\{/, "\\&lcub;");
    gsub(/≤/, "\\&le;");
    gsub(/λ/, "\\&lgr;");
    gsub(/_/, "\\&lowbar;");
    gsub(/\[/, "\\&lsqb;");
    gsub(/ł/, "\\&lstrok;");
    gsub(/Ł/, "\\&Lstrok;");
    gsub(/</, "\\&lt;");
    gsub(/—/, "\\&mdash;");
    gsub(/μ/, "\\&mgr;");
    gsub(/µ/, "\\&micro;");
    gsub(/·/, "\\&middot;");
    gsub(/ń/, "\\&nacute;");
    gsub(/ň/, "\\&ncaron;");
    gsub(/ņ/, "\\&ncedil;");
    gsub(/–/, "\\&ndash;");
    gsub(/ñ/, "\\&ntilde;");
    gsub(/Ñ/, "\\&Ntilde;");
    gsub(/#/, "\\&num;");
    gsub(/ó/, "\\&oacute;");
    gsub(/Ó/, "\\&Oacute;");
    gsub(/ô/, "\\&ocirc;");
    gsub(/œ/, "\\&oelig;");
    gsub(/ò/, "\\&ograve;");
    gsub(/Ω/, "\\&ohm;");
    gsub(/ō/, "\\&omacr;");
    gsub(/ø/, "\\&oslash;");
    gsub(/Ø/, "\\&Oslash;");
    gsub(/õ/, "\\&otilde;");
    gsub(/ö/, "\\&ouml;");
    gsub(/Ö/, "\\&Ouml;");
    gsub(/φ/, "\\&phgr;");
    gsub(/\+/, "\\&plus;");
    gsub(/±/, "\\&plusmn;");
    gsub(/£/, "\\&pound;");
    gsub(/ŕ/, "\\&racute;");
    gsub(/√/, "\\&radic;");
    gsub(/ř/, "\\&rcaron;");
    gsub(/Ř/, "\\&Rcaron;");
    gsub(/\}/, "\\&rcub;");
    gsub(/®/, "\\&reg;");
    gsub(/-/, "\\&rehy;");
    gsub(/\]/, "\\&rsqb;");
    gsub(/ś/, "\\&sacute;");
    gsub(/Ś/, "\\&Sacute;");
    gsub(/š/, "\\&scaron;");
    gsub(/Š/, "\\&Scaron;");
    gsub(/ş/, "\\&scedil;");
    gsub(/Ş/, "\\&Scedil;");
    gsub(/ŝ/, "\\&scirc;");
    gsub(/σ/, "\\&sgr;");
    gsub(/♯/, "\\&sharp;");
    gsub(/\//, "\\&shilling;");
    gsub(/∼/, "\\&sim;");
    gsub(/\//, "\\&sol;");       # never matches: / already became &shilling; above
    gsub(/²/, "\\&sup2;");
    gsub(/ß/, "\\&szlig;");
    gsub(/ť/, "\\&tcaron;");
    gsub(/ţ/, "\\&tcedil;");
    gsub(/τ/, "\\&tgr;");
    gsub(/þ/, "\\&thorn;");
    gsub(/Þ/, "\\&THORN;");
    gsub(/×/, "\\&times;");
    gsub(/™/, "\\&trade;");
    gsub(/ú/, "\\&uacute;");
    gsub(/Ú/, "\\&Uacute;");
    gsub(/û/, "\\&ucirc;");
    gsub(/ù/, "\\&ugrave;");
    gsub(/ū/, "\\&umacr;");
    gsub(/¨/, "\\&uml;");        # never matches: ¨ already became &die; above
    gsub(/ů/, "\\&uring;");
    gsub(/ü/, "\\&uuml;");
    gsub(/Ü/, "\\&Uuml;");
    gsub(/\|/, "\\&verbar;");
    gsub(/ŵ/, "\\&wcirc;");
    gsub(/ý/, "\\&yacute;");
    gsub(/ŷ/, "\\&ycirc;");
    gsub(/¥/, "\\&yen;");
    gsub(/ÿ/, "\\&yuml;");
    gsub(/Ÿ/, "\\&Yuml;");
    gsub(/ź/, "\\&zacute;");
    gsub(/Ž/, "\\&Zcaron;");
    gsub(/ž/, "\\&zcaron;");
    gsub(/ż/, "\\&zdot;");
}

To escape the bare minimum you can do:
function escapeHtml(t)
{
    # Must do this one first
    gsub(/&/, "\\&amp;", t);
    gsub(/</, "\\&lt;", t);
    gsub(/>/, "\\&gt;", t);
    return t;
}
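For comparison, here is a minimal Python sketch of the same bare-minimum escaping. The function name escape_html is only illustrative, and Python is used here purely for comparison with the awk version. It shows why the ampersand has to be handled first: done last, it would re-escape the "&" of the entities that were just inserted.

```python
import html


def escape_html(text):
    """Escape only &, < and > (hypothetical helper mirroring the awk function)."""
    text = text.replace("&", "&amp;")  # must come first
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_html_wrong_order(text):
    """Same replacements with the ampersand last, to show the double escaping."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("&", "&amp;")  # also hits the '&' of &lt; and &gt;
    return text


sample = "a < b & c > d"
print(escape_html(sample))
print(escape_html_wrong_order(sample))
# The standard library gives the same result as the correct ordering:
print(html.escape(sample, quote=False))
```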

Related

Vue v-html not render \r \n \t

I have data like this :
data: {
content:"<p><span style=\"font-size:16px\">Berikut adalah beberapa pemberontakan yang pernah terjadi di daerah.</span></p>\r\n\r\n<p><span style=\"font-size:16px\"><strong>1. Pemberontakan Angkatan Perang Ratu Adil (APRA) </strong></span></p>\r\n\r\n<ul>\r\n\t<li><span style=\"font-size:16px\">di Bandung, pada 23 Januari 1950.</span></li>"
}
From the data I want to display it using Vue js.
This is my Vue js code:
<div class="row px-3" v-html="data.content"></div>
When the above code runs, \r, \n and \t don't seem to be rendered by Vue.js.
How can I get \r, \n and \t to be rendered by Vue.js?
\r, \n, and \t are not HTML markup; they are escape sequences from the JavaScript string literal, which become real carriage-return, newline, and tab characters in the data. Vue inserts them as-is, but the browser collapses such whitespace when it renders HTML, so they have no visible effect. You need to replace them with HTML that does what you want. For new lines, the <br> tag could be used, but traditionally people handle line breaks by wrapping their sections in paragraphs (<p>) or divs (<div>). For tabs, indentation is normally handled with CSS (margins or padding, or white-space: pre-wrap on the container to keep the whitespace as written); there is more to say about it than fits in a short answer here.
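As a rough illustration of "replace them with HTML", here is a small Python sketch that could run wherever the content string is prepared. The function name visible_whitespace and the choice of &emsp; for a tab are assumptions made for this example, not part of the question's code. For content that is already wrapped in block elements, as in the question, the extra breaks may not be wanted at all.

```python
def visible_whitespace(markup):
    """Turn line breaks and tabs into markup that a browser will show.

    Hypothetical helper: line endings are normalised to "\n", each tab
    becomes an em space entity, and each newline gets a <br> in front of it.
    """
    markup = markup.replace("\r\n", "\n").replace("\r", "\n")
    markup = markup.replace("\t", "&emsp;")
    return markup.replace("\n", "<br>\n")


print(visible_whitespace("first line\r\n\tindented line"))
```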
I don't have the complete code, but after first read :
Try
v-html="content"
and
data(){
return {
content: "<p><span style=\"font-size:16px\">Berikut adalah beberapa pemberontakan yang pernah terjadi di daerah.</span></p>\r\n\r\n<p><span style=\"font-size:16px\"><strong>1. Pemberontakan Angkatan Perang Ratu Adil (APRA) </strong></span></p>\r\n\r\n<ul>\r\n\t<li><span style=\"font-size:16px\">di Bandung, pada 23 Januari 1950.</span></li>"
}
}

Perl add <a></a> around words within an HTML/XML tag

I have a file formatted like this one:
Eye color
<p class="ul">Eye color, color</p> <p class="ul1">blue, cornflower blue, steely blue</p> <p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1">musteline</p> <link rel="stylesheet" href="a.css">
</>
Each word within the <p class="ul1"> separated by , should be wrapped in an <a> tag, like this:
Eye color
<p class="ul">Eye color, color</p> <p class="ul1"><a href="entry://blue">blue</a>, <a href="entry://cornflower blue">cornflower blue</a>, <a href="entry://steely blue">steely blue</a></p> <p class="ul1"><a href="entry://velvet brown">velvet brown</a></p> <link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <p class="ul1"><a href="entry://musteline">musteline</a></p> <link rel="stylesheet" href="a.css">
</>
There could be one or several words within the <p class="ul1"> tag.
Is this possible in Perl one-liner?
Thanks in advance. Any help is appreciated.
Parse the file using a module and iterate over the elements you need (<p> of class ul1). Extract those comma-separated phrases from each and wrap links around them; then replace the element with that new content. Write the changed tree out in the end.
Using HTML::TreeBuilder (with its workhorse HTML::Element)
use warnings;
use strict;
use feature 'say';
use HTML::Entities;
use HTML::TreeBuilder;

my $file = shift // die "Usage: $0 file\n";
my $tree = HTML::TreeBuilder->new_from_file($file);

foreach my $elem ($tree->look_down(_tag => "p", class => "ul1")) {
    my @new_content;
    for ($elem->content_list) {
        # split the text on commas and wrap each phrase in a link
        my @w = split /\s*,\s*/;
        my $wrapped = join ", ",
            map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
        push @new_content, $wrapped;
    }
    $elem->delete_content;
    $elem->push_content( @new_content );
};
say decode_entities $tree->as_HTML;
In your case an element ($elem) will have one item in the content_list so you don't have to collect modified content into an array (@new_content) but can process that one piece only, which simplifies the code. Working with a list as above doesn't hurt of course.
I redirect the output of this program to an .html file. The generated file is quite frugal on newlines. If pretty HTML matters make a pass with a tool like HTML::Tidy or HTML::PrettyPrinter.
In a one-liner? Nah, it's too much. And please don't use regex as there's trouble down the road; it needs close work to get it right, is easy to end up buggy, is sensitive to smallest details, and brittle for even slightest changes in input. And that's when it can do the job. There are reasons for libraries.
Another good tool for this job is Mojo::DOM. For example
use warnings;
use strict;
use feature 'say';
use Mojo::DOM;
use Path::Tiny; # only to read the file into a string easily

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;
my $dom = Mojo::DOM->new($html);
foreach my $elem ($dom->find('p.ul1')->each) {
    my @w = split /\s*,\s*/, $elem->text;
    my $new = join ', ',
        map { qq(<a href="entry://$_">).$_.q(</a>) } @w;
    $elem->replace( $new );
}
say $dom;
Produces the same HTML as above (just nicer, and note no need to deal with entities).
Newer module versions provide the new_tag method, with which the link above is made as
my $new = join ', ',
    map { $elem->new_tag('a', 'href' => "entry://$_", $_) } @w;
which takes care of some subtle needs (HTML escaping for one). The main docs don't say when this method was added, see changelog (May 2018, so supposedly in v5.28; it works with my 5.29.2).
I padded the shown sample to this file for testing:
<!DOCTYPE html> <title>Eye color</title> <body>
<p class="ul">Eye color, color</p>
<p class="ul1">blue, cornflower blue, steely blue</p>
<p class="ul1">velvet brown</p> <link rel="stylesheet" href="a.css"></>
weasel
<p class="ul">weasel</p>
<p class="ul1">musteline</p> <link rel="stylesheet" href="a.css"></>
</body> </html>
Update It's been clarified that the given markup snippet isn't merely a fragment of a presumably full HTML document but that it is a file (as stated) that stands as shown, as a custom format using HTML; apart from the required changes the rest of it need be preserved.
A particularly unpleasant detail proves to be the </> part; each of HTML::TreeBuilder, Mojo::DOM, and XML::LibXML† discards it while parsing. I couldn't find a way to make them keep that piece.
It was Marpa::HTML that processed the whole fragment as required, changing what was asked while leaving alone the rest of it.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;
use Marpa::HTML qw(html);

my $file = shift // die "Usage: $0 file\n";
my $html = path($file)->slurp;
my $marpa = Marpa::HTML::html(
    \$html,
    {
        'p.ul1' => sub {
            return join ', ',
                map { qq(<a href="entry://$_">).$_.q(</a>) }
                split /\s*,\s*/, Marpa::HTML::contents();
        },
    }
);
say $$marpa;
The processing of the <p> tags of class ul1 is the same as before: split the content on commas, wrap each piece in an <a> tag, then join the pieces back with a comma.
This prints (with added line-breaks and indentation for readability)
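Eye color
<p class="ul">Eye color, color</p>
<a href="entry://blue">blue</a>,
<a href="entry://cornflower blue">cornflower blue</a>,
<a href="entry://steely blue">steely blue</a>
<a href="entry://velvet brown">velvet brown</a>
<link rel="stylesheet" href="a.css">
</>
weasel
<p class="ul">weasel</p> <a href="entry://musteline">musteline</a>
<link rel="stylesheet" href="a.css">
</>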
It is the overall approach of this module that is suited for a task like this
Marpa::HTML is an extremely liberal HTML parser. Marpa::HTML does not reject any documents, no matter how poorly they fit the HTML standards.
Here it processed a custom piece of HTML-like markup, leaving things like </> in place.
† See this post for an example of very permissive processing of HTML with XML::LibXML
perl -0777 -MWeb::Query=wq -lne'
my $w = wq $_; my $sep = ", ";
$w->filter("p.ul1")->each(sub {
my (undef, $e) = @_;
$e->html(join $sep, map {
qq(<a href="entry://$_">$_</a>)
} split $sep, $e->text);
});
print $w->as_html;
'
One-liner:
cat text | perl -pE 's{<p class="ul1">\K.*?(?=<\/p>)}{ join ", ", map {qq|<a href="entry://$_">$_</a>|} split /, */, $& }eg'
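The split, wrap, and join step that all of these answers share can be sketched in Python as well. This is only an illustration and takes the same regex shortcut as the one-liner, so the earlier warnings about regexes on markup apply: it assumes the opening tag is written exactly as <p class="ul1"> and that such paragraphs contain plain text only. The function name link_phrases is made up for the example; the entry:// href scheme is the one used in the answers above.

```python
import html
import re

# Assumes the tag is spelled exactly like this, as in the sample file.
UL1_PARAGRAPH = re.compile(r'(<p class="ul1">)(.*?)(</p>)', re.DOTALL)


def link_phrases(markup):
    """Wrap each comma-separated phrase inside <p class="ul1"> in an <a> tag."""

    def wrap(match):
        phrases = [p.strip() for p in match.group(2).split(",") if p.strip()]
        links = []
        for phrase in phrases:
            # Normalise any existing entities, then escape once for both
            # the attribute value and the link text.
            safe = html.escape(html.unescape(phrase), quote=True)
            links.append('<a href="entry://{0}">{0}</a>'.format(safe))
        return match.group(1) + ", ".join(links) + match.group(3)

    return UL1_PARAGRAPH.sub(wrap, markup)


sample = '<p class="ul">Eye color, color</p> <p class="ul1">blue, steely blue</p>\n</>'
print(link_phrases(sample))
```

Because everything outside the matched paragraphs is left untouched, pieces such as </> pass through unchanged.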

Remove <p> tags - Regular Expression (Regex)

I have some HTML and the requirement is to remove only starting <p> tags from the string.
Example:
input: <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text1 Here</span></p><p style="margin: 50pt"><span style="font:XXXX">Text2 Here</span></p> <p style="display:inline; margin: 40pt;"><span style="font:XXXX;"> Text3 Here</span></p>the string goes on like that
desired output: <span style="font:XXXX;"> Text1 Here</span></p><span style="font:XXXX">Text2 Here</span></p><span style="font:XXXX;"> Text3 Here</span></p>
Is it possible using Regex? I have tried some combinations but not working. This is all a single string. Any advice appreciated.
I'm sure you know the warnings about using regex to match html. With these disclaimers, you can do this:
Option 1: Leaving the closing </p> tags
This first option leaves the closing </p> tags, but that's what your desired output shows. :) Option 2 will remove them as well.
PHP
$replaced = preg_replace('~<p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<p[^>]*>/g, "");
Python
replaced = re.sub("<p[^>]*>", "", yourstring)
<p matches the beginning of the tag
The negative character class [^>]* matches any character that is not a closing >
> closes the match
we replace all this with an empty string
Option 2: Also removing the closing </p> tags
PHP
$replaced = preg_replace('~</?p[^>]*>~', '', $yourstring);
JavaScript
replaced = yourstring.replace(/<\/?p[^>]*>/g, "");
Python
replaced = re.sub("</?p[^>]*>", "", yourstring)
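One caveat worth illustrating: the pattern <p[^>]*> also matches other tags whose names start with "p", such as <pre> or <param>. A small Python sketch of both options with a word boundary added after the tag name follows; the function names are made up for this example, and the usual warnings about regexes and HTML still apply.

```python
import re

# \b after "p" keeps <pre>, <param>, ... from matching.
OPENING_P = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
ANY_P = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)


def strip_opening_p(markup):
    """Option 1: remove opening <p ...> tags, keep the closing </p> tags."""
    return OPENING_P.sub("", markup)


def strip_all_p(markup):
    """Option 2: remove both opening and closing p tags."""
    return ANY_P.sub("", markup)


sample = '<p style="margin: 40pt;"><span>Text1</span></p><pre>kept</pre>'
print(strip_opening_p(sample))
print(strip_all_p(sample))
```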
This is a PCRE expression:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*<\/p>)/Ug
Replace each occurrence with $3 or just remove all occurrences of:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>/g
If you want to remove the closing tag as well:
/<p( *\w+=("[^"]*"|'[^']*'|[^ >]+))*>(.*)<\/p>/Ug

Passing double quotes to Jscript

insertText is a JavaScript function that accepts two string parameters
I need to pass two strings
first parameter:
<img src="
second
">
I just can't figure out how to pass a double quote as a parameter
This works
<a onClick="insertText('<em>', '</em>'); return false;">Italic</a>
This does not
<a onClick="insertText('<img src=/"', '/">'); return false;">Image</a>
Prints '); return false;">Image
You want to use \ rather than /
The escape character for JavaScript is \, not /. So try this:
<a onClick="insertText('<img src=\"', '\">'); return false;">Image</a>
Update:
The solution above doesn't work, because the double-quotes "belong" to the HTML and not to the JavaScript, so we can't escape them in the JavaScript code.
Use this instead:
<a onClick="insertText('<img src=\'', '\'>'); return false;">Image</a> // --> <img src='...'>
or
<a onClick='insertText("<img src=\"", "\">"); return false;'>Image</a> // --> <img src="...">
Since you are using jQuery, why don't you do it the jQuery way?
insertText = function(a, b) {
// your insertText implementation...
};
$('a').click(function() { // use the right selector, $('a') selects all anchor tags
insertText('<img src="', '">');
});
With this solution you can avoid the problems with the quotes.
Here's a working example: http://jsfiddle.net/jcDMN/
The Golden Rule for that is reversing the quotation, which means I use the single quotation ' inside the double quotation " and vice versa.
Also, you should use the backslash symbol to escape a special character like ' and ".
For example,
the following commands should work as they apply the rules mentioned above...
<a onClick="insertText('<em>', '</em>'); return false;">Italic</a>
or
<a onClick='insertText("<em>", "</em>"); return false;'>Italic</a>
or
<a onClick="insertText('<img src=\'', '\'>'); return false;">Image</a>
or
<a onClick='insertText("<img src=\"", "\">"); return false;'>Image</a>
I hope this helps you ...
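You need to escape it. Inside a double-quoted HTML attribute, a double quote has to be written as the entity &quot;; a backslash alone does not help, because the HTML parser ends the attribute at the first plain " it sees.
<a onClick="insertText('<img src=&quot;', '&quot;>'); return false;">Image</a>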
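When such markup is generated by a program, the two layers of quoting (JavaScript string literal first, HTML attribute second) can be applied mechanically. Below is a minimal Python sketch of that idea, assuming the page is produced server-side; the helper name onclick_attr is hypothetical. json.dumps yields a double-quoted string literal that JavaScript accepts, and html.escape then makes it safe inside a double-quoted attribute.

```python
import html
import json


def onclick_attr(function_name, *args):
    """Build an attribute value that calls function_name with string args.

    Hypothetical helper: each argument is first quoted as a JavaScript
    string literal, then the whole statement is escaped for use inside
    a double-quoted HTML attribute.
    """
    js_args = ", ".join(json.dumps(arg) for arg in args)
    statement = "{0}({1}); return false;".format(function_name, js_args)
    return html.escape(statement, quote=True)


attr = onclick_attr("insertText", '<img src="', '">')
print('<a onClick="{0}">Image</a>'.format(attr))
```

The browser decodes the entities before running the handler, so insertText receives the two strings with real double quotes in them.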

What is a tool that will syntax highlight using only HTML with class attributes?

I'm looking for a command line tool (or Perl module or VIM script or whatever) that will take some input files (such as XML or JavaScript files) and format them in HTML. I specifically want my output not to contain stuff like <span style="color: red"> or <font color=red> according to a particular colour scheme, rather it should use CSS class names to mark up the different syntactic parts of the file.
For example, if I had this file as input:
function f(x) {
return x + 1;
}
the kind of output I would like is:
<pre><span class=keyword>function</span> <span class=ident>f</span><span class=punc>{</span>
<span class=keyword>return</span> <span class=ident>x</span> <span class=op>+</span> <span class=numliteral>1</span><span class=punc>;</span>
<span class=punc>}</span></pre>
Does anyone know of such a tool?
Something like VIM's 2html.vim script, but outputting class="" attributes with the syntax highlight group names (like "Constant", "Identifier", "Statement", etc.) would be ideal.
Thanks,
Cameron
You can feed a file into GeSHi using PHP on the command line (or cURL your own local server or some other hack)
http://qbnz.com/highlighter/geshi-doc.html#basic-usage
There is buf2html.vim. Unfortunately, it uses non-semantic class names: See http://intrepid.perlmonk.org/apropos.vim/buf2html/current/myself.html
I think this is exactly what Vim's :TOhtml does if you
:let html_use_css = 1
Original:
function f(x) {
return x + 1;
}
output:
<pre>
<span class="Identifier">function</span> f(<span class="">x</span><span class="javaScriptParens">)</span><span class=""> </span><span class="Identifier">{</span>
<span class="Statement">return</span><span class=""> x + </span>1<span class="">;</span>
<span class="Identifier">}</span>
</pre>
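To make the requested output shape concrete, here is a toy Python sketch that marks up tokens with class names only. It is not one of the tools mentioned above and is nowhere near a real JavaScript lexer: the token categories (keyword, ident, numliteral, op, punc) are taken from the question's example, while the keyword list and the regular expressions are assumptions chosen for this illustration.

```python
import html
import re

# Order matters: keywords are tried before general identifiers.
TOKEN_RE = re.compile(
    r"""
    (?P<keyword>\b(?:function|return|var|if|else)\b)
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<numliteral>\d+(?:\.\d+)?)
  | (?P<op>[+\-*/=<>!]+)
  | (?P<punc>[{}()\[\];,.])
    """,
    re.VERBOSE,
)


def highlight(source):
    """Wrap each recognised token in <span class="..."> inside a <pre>."""
    pieces = []
    position = 0
    for match in TOKEN_RE.finditer(source):
        # Text between tokens (whitespace etc.) is copied through, escaped.
        pieces.append(html.escape(source[position:match.start()]))
        pieces.append(
            '<span class="{0}">{1}</span>'.format(
                match.lastgroup, html.escape(match.group())
            )
        )
        position = match.end()
    pieces.append(html.escape(source[position:]))
    return "<pre>" + "".join(pieces) + "</pre>"


print(highlight("function f(x) {\n  return x + 1;\n}"))
```

The colours then live entirely in a stylesheet keyed on those class names, which is the property the question asks for.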