cgi/perl/html - what characters to escape when printing into html?

I got an input file that I need to print directly into an html page.
I did $inputfile =~ s/\n/<br>/g; Are there any other special characters I should be aware of maybe other than < and > when printing this $inputfile to html?

You absolutely should use HTML::Escape instead of doing some ill-conceived hackjob which will cause everyone who deals with your code (you included) to curse your name in the future.
It's simple - install HTML::Escape via CPAN, then use it thus:
use HTML::Escape qw(escape_html);
my $escaped_string = escape_html($string);
Note that if you want to preserve whitespace formatting, you should use a module for that as well, such as HTML::FromText. The code above will not automagically convert line breaks to <br> tags, because that is a completely different job from escaping unsafe characters to HTML entities.
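For readers working outside Perl, the same two-step idea (escape the unsafe characters first, then convert line breaks) can be sketched with Python's standard library. This is only an illustration under stated assumptions: the helper name text_to_html is hypothetical and is not part of HTML::Escape or HTML::FromText.

```python
import html


def text_to_html(text: str) -> str:
    """Escape HTML-special characters, then turn line breaks into <br> tags.

    Hypothetical helper for illustration only.
    """
    # Escape first, so the <br> tags added afterwards are not escaped themselves.
    escaped = html.escape(text, quote=True)  # handles & < > " and '
    # Normalise Windows line endings, then mark each line break with <br>.
    return escaped.replace("\r\n", "\n").replace("\n", "<br>\n")


if __name__ == "__main__":
    print(text_to_html('a < b & "c"\nnext line'))
```

Doing the steps in the opposite order would escape the inserted <br> tags along with everything else.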

Related

write_html() method in fpdf not using font/encoding specified

I'm creating a PDF with a large collection of quotes that I've imported into python with docx2python, using html=True so that they have some tags. I've done some processing to them so they only really have the bold, italics, underline, or break tags. I've sorted them and am trying to write them onto a PDF using the fpdf library, specifically the pdf.write_html(quote) method. The trouble comes with several special characters I have, so I am hoping to encode the PDF to UTF-8. To write with .write_html(), I had to create a new class as shown in their readthedocs under the .write_html() method at the very bottom of the left hand side:
from fpdf import FPDF, HTMLMixin
class htmlFPDF(FPDF, HTMLMixin):
    pass
pdf = htmlFPDF()
pdf.add_page()
#set the overall PDF to utf-8 to preserve special characters
pdf.set_doc_option('core_fonts_encoding', 'utf-8')
pdf.write_html(quote)
The list of quotes that I have going into the pdf all appear with their special characters and the html tags (<u> or <i>) in the debugger, but after the .write_html() step they then show up in the pdf file with mojibake, even before being saved, as seen through debugger. An example being "dayÃ¢Â€ÂTMs demands", when it should be "day's demands" (the apostrophe is curled clockwise in the quote, but this textbox doesn't support).
I've tried updating the font I use by
pdf.add_font('NotoSans', '', 'NotoSans-Regular.ttf', uni=True)
pdf.set_font('NotoSans', '', size=12)
added after the .add_page() method, but this doesn't change the current font (or fix the mojibake) on the PDF unless I use the more common .write(text_height, quote) method, which renders the underline/italic tags into the PDF as literal text. The .write() method does preserve the special characters. I'm not really trying to change the font; I just want what's written onto the PDF to keep the special characters instead of turning them into mojibake.
I've also attempted some .encode/.decode action before going into the .write_html(), as well as attempted some methods from the ftfy library. And tried adding '' to the start of each quote to no effect.
If anyone has ideas for a way to iterate through each line on the PDF that'd be terrific, since then I could use ftfy to fix the mojibake. But ideally, it would be some other html tag at the start of each quote or a way to change the font/encoding of the .write_html() method, maybe in the class declaration?
Or am I at a dead end, and should I just split each quote on '<', use if statements to detect underline, italics, etc., and use the .write() method after all?

Converting docx to HTML works really badly with docx2python. I did this a few months ago. I recommend PyDocX. docx2python is good for extracting the content of a docx file, not for converting it into HTML.

Can't figure out a regex with line break - HTML

I have written a very simple regular expression to search within an HTML document for any tag - as we are modifying 40+ templates that have been edited by a WYSIWYG editor that was horrible. Basically, it added style="font... tags everywhere - so I want to delete them all.
The problem is, some of them have line breaks between the styles (like you would typically write CSS) - and I can't figure out how to include line breaks within my expression.
Here is what I have:
style="font(.*?)"
I am using textmate to search for it, and it works great except for styles that have hard line breaks in them.
Any help???

Use this regex: style="font([\s\S]*?)". The dot (.) does not match \n by default.

Putting (?s) at the front of your regex causes . to match newline as well

This is the most straightforward way to do it:
style="font([^"]*)"
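The three suggestions above can be compared side by side. The sketch below uses Python's re module rather than TextMate's regex engine, so flag support may differ in the editor; the sample markup and the dictionary labels are made up for illustration.

```python
import re

# Hypothetical sample: a style attribute with a hard line break inside it.
sample = '<p style="font-family: Arial;\n   font-size: 12px;" class="note">Hi</p>'

patterns = {
    "dot only": r'style="font(.*?)"',          # original attempt
    "any char class": r'style="font([\s\S]*?)"',
    "inline s flag": r'(?s)style="font(.*?)"',
    "negated class": r'style="font([^"]*)"',
}

for name, pattern in patterns.items():
    cleaned = re.sub(pattern, "", sample)
    status = "unchanged" if cleaned == sample else "attribute removed"
    print(f"{name:15} {status}")
```

In this engine only the first pattern leaves the attribute in place, because . stops at the newline. The negated class [^"]* needs no flag at all, which is why it tends to be the most portable of the three.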

extracting double quotes from html tags with a regex

I'm extracting some content from a website with this pattern:
([^+]+)
and it outputs
< img src=""http://www."" border=""0""/>
with double quotes. What is wrong with my query?

your problem only makes sense if you modify your regexp.
but first of all, beware:
in general, what you try to achieve is not feasible using regexes. they are the inappropriate tool to do it. you will not come up with a solution 100% correct using regexes.
having said this, try to replace ([^+]+) with (([^<!--]+([^<]|<[^!]|<![^-]|<!-[^-]))+). note that this regex assumes the following:
there are no html comments inside the message portion
there are no strings containing html comment openings inside the message portion
the message portion is a valid html fragment
(otherwise it would match eg. <!-<!-- / message -->)
you have been warned.
btw, the dquote doubling must be a standard escape mechanism of the imacro environment.

How to display XML source code using HTML with Emacs?

In the HTML file, I need to show some XML code. The problem is that I can't use
<pre>..</pre>
to show '<' and '>'.
What would be the solution for this problem?
ADDED
From the answers, replacing '<' and '>' with &lt; and &gt; can be a solution. I'm an Emacs user; are there Emacs tools/magic to do that automatically? I mean, I can use search and replace, but I expect Emacs can do it by 'select region' -> 'M-x replace_xml' or something.

You need to escape < as &lt; and & as &amp;. Optionally, for consistency, you can escape > as &gt;.
To do this automatically in Emacs, if you're in HTML mode, you can select the code that you would like to escape, and run M-x sgml-quote.

You need to replace < by &lt; and > by &gt;. How to do this depends on the server side language in question.
Update: as per your update: this is not programming related anymore. I think http://superuser.com is a better place to ask software related questions.

As already mentioned, you need to escape the XML. For robustness I would also escape single and double quotes too. Note that CDATA and <pre> can cause you problems if, for any reason, your XML document includes ]]> or </pre> in it.
You can get away with doing a straight string substitution for the escaping, but if you do, make sure you escape & to &amp; before doing any of the other escapes.
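A small sketch of why the order matters, assuming plain string substitution as described above. Both function names are hypothetical and exist only for the comparison.

```python
def escape_ampersand_first(text: str) -> str:
    # Escape & first, so the ampersands in the entities produced
    # for < and > are never touched again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_ampersand_last(text: str) -> str:
    # Wrong order: the & inside the freshly created &lt; and &gt;
    # is escaped a second time.
    return text.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")


xml = '<a href="?x=1&y=2">'
print(escape_ampersand_first(xml))  # &lt;a href="?x=1&amp;y=2"&gt;
print(escape_ampersand_last(xml))   # &amp;lt;a href="?x=1&amp;y=2"&amp;gt;
```

The second result would show up in the browser as the literal text &lt;a ...&gt; instead of the angle brackets.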

As others have noted, you need to escape the XML markup to display it in HTML.
Take a look at xmlverbatim stylesheet: It does that as well as pretty printing and colorizing.
If you google around there are several stylesheets to do similar formatting.

Select the region
Do M-% < RET &lt; RET !

Do a substitution using a programming language without Emacs.
For Python:
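import shutil
path = "snippet.xml"  # hypothetical file name
shutil.copy(path, path + ".bak")  # Make a copy just in case.
with open(path, encoding="utf-8") as f:  # Open file and read lines.
    lines = f.readlines()
# Escape & first so the entities produced for < and > are not escaped again.
lines = [line.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;") for line in lines]
with open(path, "w", encoding="utf-8") as f:  # Output to file.
    f.writelines(lines)
#Enjoy!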

How to remove all empty tags in X/HTML code in once?

for example :
I want to remove all highlighted tags
[screenshot: http://shup.com/Shup/299976/110220132930-My-Desktop.png]

You could use a regular expression in any editor that supports them. For instance, I tested this one in Dreamweaver:
<(?!\!|input|br|img|meta|hr)[^/>]*?>[\s]*?</[^>]*?>
Just make a search and replace all (with the regex as search string and nothing as replacement). Note however that this may remove necessary whitespace. If you just want to remove empty tags without anything in between,
<(?!\!|input|br|img|meta|hr)[^/>]*?></[^>]*?>
would be the way to go.
Update: You want to remove &nbsps as well:
<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>
I did not verify this one - it should be OK though, try it out :-)
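Since the last pattern was posted untested, here is a quick check of it in Python's re module. Dreamweaver's regex engine may behave differently, and the sample markup is invented for the test. It assumes the non-breaking space appears in the source as the literal entity &nbsp;.

```python
import re

# Variant that also treats whitespace and &nbsp; entities as "empty" content.
EMPTY_OR_BLANK = re.compile(
    r"<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>"
)

# Hypothetical sample covering each case mentioned in the answer.
sample = "<p>&nbsp;</p><br></br><div> \n </div><span>x</span>"

cleaned = EMPTY_OR_BLANK.sub("", sample)
print(cleaned)
```

The lookahead keeps <br></br> in place, while the paragraph holding only &nbsp; and the whitespace-only <div> are removed. Note that the closing part </[^>]*?> accepts any closing tag, so mismatched pairs would be removed as well.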

If this is only about quickly editing a file, and your editor supports regular expression replacement, you can use a regex like this:
<[^>]+></[^>]+>
Search for this regex, and replace with an empty string.
Note: This isn't safe in any way - don't rely on it, as it can find more things than just valid, empty tags. (It would also find <a></b> for example.) There is no safe way to do this with regexes - but if you check each replacement manually, you should be fine. If you need real safe replacement, then either you'll have to find an editor that supports this (JEdit may be a good bet, but I haven't checked), or you'll have to parse the file yourself - e.g. using XSLT.

What you're asking for sounds like a job for regular expressions. Many editors support regular expression find/replace. Personally, I'd probably do this from the command-line with Perl (sed would also work), but that's just me.
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
or if you're brave, edit the file in place:
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' -i file.html
This will remove:
<p></p>
<p id="foo"></p>
but not:
<p>hello world</p>
<p></a>
Warning: things like <img src="pic.png"></img> and <br></br> will also be removed. It's not obvious from your question, but I'll assume this is undesirable. Maybe you're not worried because you know all your images are declared like this <img src="pic.png"/>. Otherwise the regular expression will need to be modified to account for this, but I decided to start simple for an easier explanation...
It works by matching the opening tag: a literal < followed by the tag name (one or more characters which are not whitespace or > = [^\s>]+), any attributes (zero or more characters which aren't > = [^>]*), and then a literal >; and a closing tag with the same name: this takes advantage of the fact that we captured the tag name, so we can use a backreference = </\1>. The matches are then replaced with the empty string.
If the syntax/terminology used here is unfamiliar to you, I'm a fan of the perlre documentation page. Regular expression syntax in other languages should be very similar if not identical to this, so hopefully this will be useful even if you don't Perl :)
Oh, one more thing. If you have things like <div><p></p></div>, these will not be picked up all at once. You'll have to do multiple passes: the first will remove the <p></p>, leaving a <div></div> to be removed by the second. In Perl, the substitution operator returns the number of replacements made, so you can repeat it until nothing is left to replace:
perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
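The same repeat-until-nothing-changes loop can be sketched in Python, using the identical pattern with its backreference. The function name strip_empty_tags is hypothetical. Unlike perl -p, which processes the input one line at a time, this version works on the whole string at once.

```python
import re

# Opening tag (name captured), optional attributes, then a closing tag
# with the same name via the backreference \1.
EMPTY_TAG = re.compile(r"<([^\s>]+)[^>]*></\1>")


def strip_empty_tags(markup: str) -> str:
    """Remove empty tag pairs repeatedly until a pass makes no replacement."""
    while True:
        markup, count = EMPTY_TAG.subn("", markup)
        if count == 0:
            return markup


if __name__ == "__main__":
    print(strip_empty_tags('<div><p></p></div><p>hello world</p><p id="foo"></p>'))
```

re.subn returns the new string together with the number of substitutions, which plays the role of the Perl substitution operator's return value. The caveat about <img ...></img> and <br></br> applies here too.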
