Am I delusional about Html encoding? - html

I seem to be hitting a logical contradiction. I have a view with...
</br> in the markup and when I load the page it shows a new line. Inspecting the source I can see a </br>.
Then I put @Html.Raw("</br>") and in the source I get &lt;/br&gt;
However all the documentation says that by default razor will html encode all strings. So why does Html.Raw show an encoded string instead?
Shouldn't it be the other way around?

</br> is incorrect, you probably meant <br/>.
This being said, here's how it works:
<br/> generates <br/>
@("<br/>") generates &lt;br/&gt;
@Html.Raw("<br/>") generates <br/>
The Html.Raw helper is used when you do not want HTML-encoded output in the resulting HTML. By default the @ Razor function HTML-encodes its argument.
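The same distinction can be sketched outside Razor. The Python snippet below is only an illustration, not Razor's implementation: `render` and `Raw` are hypothetical names standing in for `@(...)` and `Html.Raw(...)`.

```python
import html


class Raw(str):
    """Hypothetical marker type: a string that must be emitted unencoded."""


def render(value):
    # Mimic a template engine that encodes by default and skips
    # encoding only for values explicitly marked as raw.
    if isinstance(value, Raw):
        return str(value)
    return html.escape(value, quote=False)


print(render("<br/>"))       # encoded by default
print(render(Raw("<br/>")))  # passed through untouched
```

The first call prints the entity-encoded text, which the browser displays literally; the second prints markup that the browser treats as a real tag.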

From W3schools (I know it ain't W3C official)
Differences Between HTML and XHTML
In HTML, the <br> tag has no end tag.
In XHTML, the <br> tag must be properly closed, like this: <br />.
I don't know about Razor and its @Html function, but you don't need the '/' at all.

Related

Is there a non-javascript/PHP way to write sample code that won't get evaluated? [duplicate]

I use the <pre> tag in my blog to post code. I know I have to change < to &lt; and > to &gt;. Are there any other characters I need to escape for correct HTML?
What happens if you use the <pre> tag to display HTML markup on your blog:
<pre>Use a <span style="background: yellow;">span tag with style attribute</span> to highlight words</pre>
This will pass HTML validation, but does it produce the expected result? No. The correct way is:
<pre>Use a &lt;span style="background: yellow;"&gt;span tag with style attribute&lt;/span&gt; to highlight words</pre>
Another example: if you use the pre tag to display some other language code, the HTML encoding is still required:
<pre>if (i && j) return;</pre>
This might produce the expected result but does it pass HTML validation? No. The correct way is:
<pre>if (i &amp;&amp; j) return;</pre>
Long story short, HTML-encode the content of a pre tag just the way you do with other tags.
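As a rough sketch of that rule, the helper below wraps arbitrary code in a pre tag after encoding it. It is written in Python for illustration; `pre_block` is a made-up name, and `html.escape` is the standard library function.

```python
import html


def pre_block(code):
    # Encode &, < and > so the browser shows the text instead of parsing it.
    return "<pre>" + html.escape(code, quote=False) + "</pre>"


print(pre_block("if (i && j) return;"))
print(pre_block('Use a <span style="background: yellow;">span</span> tag'))
```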
TL;DR
PHP: htmlspecialchars($html);
JavaScript(JS): Element.innerText = "<html>...";
Note that <pre> is just for styles, so you have to escape ALL HTML.
Only For You HTML "fossil"s: using <xmp> tag
This is not well known, but the <xmp> tag really does exist and even Chrome still supports it. Relying on it is NOT recommended, though: it's just for you HTML fossils, as a very simple way to handle your personal content, e.g. docs. Even the w3.org Wiki says in its example: "No, really. don't use it."
You can put ANY HTML (excluding </xmp> end tag) inside <xmp></xmp>
<xmp>
<html> <br> just any other html tags...
</xmp>
The proper version
Proper version could be considered to be HTML stored as a STRING and displayed with the help of some escaping function/mechanism.
Just remember one thing: strings in C-like languages are usually written between single or double quotes. If you wrap your string in double quotes, escape any double quotes inside it (probably with \); if you wrap it in single quotes, escape the single quotes (probably with \).
The most frequent - Server-side language escaping (ex. in PHP)
Server-side scripting languages often have some built-in function to escape HTML.
<?php
$html = "<html> <br> or just any other HTML"; //store html
echo htmlspecialchars($html); //display escaped html
?>
Note that in PHP 8.1 there was a change so you no longer have to specify ENT_QUOTES flag:
flags changed from ENT_COMPAT to ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401.
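Why the quote flag matters can be seen with a small comparison. This is a Python analogy using the standard `html.escape` function and its `quote` parameter, not a description of how PHP implements `htmlspecialchars`; the sample string is invented.

```python
import html

value = 'say "hi" & it\'s <ok>'

# quote=False leaves quotes alone: enough for text between tags.
text_safe = html.escape(value, quote=False)

# quote=True (the default) also encodes " and ', which is what an
# attribute value needs so it cannot close the attribute early.
attr_safe = html.escape(value, quote=True)

print(text_safe)
print('<input value="' + attr_safe + '">')
```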
The client-side way (example in JavaScript / JS&jQuery)
Similar approach as on server-side is achievable in client-side scripts.
Pure JavaScript
There is no function, but there is the default behavior, if you set element's innerText or node's textContent:
document.querySelector('.myTest').innerText = "<html><head>...";
document.querySelector('.myTest').textContent = "<html><head>...";
HTMLElement.innerText and Node.textContent are not the same thing! You can find out more about the difference in the MDN documentation for each property.
jQuery (a JS library)
jQuery has $jqueryEl.text() for this purpose:
$('.mySomething .test').text("<html><head></head><body class=\"test\">...");
Just remember the same thing as for server-side - in C-like languages, escape the quotes you've wrapped your string in.
For posting code within your markup, I suggest using the <code> tag. It works the same way as pre but would be considered semantically correct.
Otherwise, <code> and <pre> only need the angle brackets encoded.
Use this and don't worry about any of them.
<pre>
${fn:escapeXml('
<!-- all your code -->
')};
</pre>
You'll need a JSP page with the JSTL functions tag library for this to work; fn:escapeXml is a JSTL function, not jQuery.

Mozilla translates <br></br> as <br></br><br></br>

Our CMS outputs linebreaks as <br></br> (stupid, I know, but syntactically correct(?))
This translates to <br><br> in chrome and IE10 and to <br></br><br></br> in Firefox.
All browsers shows it as two linebreaks.
Why isn't <br></br> translated like <br /> or just <br>? Is there something I can do to make the browser interpret <br></br> as just one linebreak?
It's defined by the HTML5 specifications: http://www.w3.org/TR/html5/syntax.html#parsing-main-inbody (search the text An end tag whose tag name is "br" in the page).
If your document is parsed as an HTML document, each closing br tag will always be parsed as if it were an opening tag (creating an element with no content). But if you parse your document as an XHTML document, a <br></br> sequence will produce the same DOM tree as the <br/> tag. To have your document parsed as an XHTML document, you have to send it with the application/xhtml+xml MIME type.
More details are available in the specifications: http://www.w3.org/TR/html5/introduction.html#html-vs-xhtml; http://www.w3.org/TR/html5/the-xhtml-syntax.html.
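The effect of that rule can be imitated with a small counter. This is a sketch only: Python's `html.parser` is a tokenizer and does not implement the HTML5 tree builder, so the "end tag named br acts like a start tag" behaviour is emulated by hand here, and `BreakCounter` and `count_breaks` are hypothetical names.

```python
from html.parser import HTMLParser


class BreakCounter(HTMLParser):
    """Counts br elements the way the quoted HTML parsing rule would."""

    def __init__(self):
        super().__init__()
        self.breaks = 0

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.breaks += 1

    def handle_startendtag(self, tag, attrs):
        # <br/> and <br /> are one element; the slash is ignored in HTML.
        if tag == "br":
            self.breaks += 1

    def handle_endtag(self, tag):
        # Emulated rule: an end tag named "br" acts like a start tag.
        if tag == "br":
            self.breaks += 1


def count_breaks(markup):
    counter = BreakCounter()
    counter.feed(markup)
    counter.close()
    return counter.breaks


print(count_breaks("<br></br>"))  # two elements under the HTML rule
print(count_breaks("<br/>"))      # one element
```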
To answer your question... seeing as your CMS seems to output this abomination, if you need a quick fix...
Fiddle - http://jsfiddle.net/P6bDp/
Bonus CSS selector list - http://net.tutsplus.com/tutorials/html-css-techniques/the-30-css-selectors-you-must-memorize/
CSS
br + br {display: none;}
This will eliminate the second <br> and leave a single line break. Fix that CMS though :)
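If the CMS output can be post-processed before it reaches the browser (an assumption; the question does not say whether that is possible), a server-side alternative to the CSS rule is to collapse each pair into a single tag. A minimal Python sketch, with `collapse_paired_br` as a made-up helper name:

```python
import re

# Matches an opening <br> immediately followed by a closing </br>,
# allowing whitespace inside and between the two tags.
PAIRED_BR = re.compile(r"<br\s*>\s*</br\s*>", re.IGNORECASE)


def collapse_paired_br(markup):
    # Replace every <br></br> pair with one plain <br>.
    return PAIRED_BR.sub("<br>", markup)


print(collapse_paired_br("first line<br></br>second line"))
```

Unlike the CSS rule, this leaves two intentional consecutive <br> tags alone, because only the open/close pair is matched.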
Hey, <br></br> is not syntactically correct; you must write it as <br/> (look at the slash). It's a void element, so there is no need to close it like other elements. It closes itself.

Are XHTML self closing elements still valid in HTML5?

I was wondering if I can write self closing elements like in XHTML in HTML5, for example, <input type="email"> can be <input type="email" />, and will it still validate? And is this the correct way to code HTML5 web pages?
HTML5 can either be coded as XHTML, or as HTML 4. It's flexible that way.
As to which is the correct way, that's a preference. I suspect that many web designers into standards are used to XHTML and will probably continue to code that way.
You can go straight to: http://html5.validator.nu/ to validate your code, or if you have the right doctype, the official W3C site will use it for you.
Self-closing tags may lead to some parsing errors. Look at this:
<!DOCTYPE html>
<html>
<head><title>Title</title></head>
<body>
<div>
<p>
<div/>
</p>
</div>
</body>
</html>
While it is perfectly valid HTML4, it is invalid in HTML5.
W3C validation complains about <div/>:
Self-closing syntax (/>) used on a non-void HTML element. Ignoring the slash and treating as a start tag.
If the innermost self-closed div is treated as a start tag, it breaks the whole structure, so be careful.
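A quick way to spot this problem before running a full validator is to scan for self-closing syntax on non-void elements. The sketch below uses Python's standard `html.parser`; the void-element list is an assumption of this example (it includes `param`, which older specifications list), and `find_bad_self_closing` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Elements assumed to be void for this sketch; only these may end in "/>".
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class SelfClosingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.offenders = []

    def handle_startendtag(self, tag, attrs):
        # Called for tags written as <tag ... />.
        if tag not in VOID_ELEMENTS:
            self.offenders.append(tag)


def find_bad_self_closing(markup):
    checker = SelfClosingChecker()
    checker.feed(markup)
    checker.close()
    return checker.offenders


print(find_bad_self_closing("<div><p><div/></p></div><br/>"))
```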
Either will work, just try to be consistent.
Same goes for quoting attributes - I've read tutorials that discourage quoting one-word attribute values. I would quote them all, at least for consistency (unless you have a popular web app where every byte is precious).

HTML 5: Is it <br>, <br/>, or <br />?

I've tried checking other answers, but I'm still confused — especially after seeing W3schools HTML 5 reference.
I thought HTML 4.01 was supposed to "allow" single-tags to just be <img> and <br>. Then XHTML came along with <img /> and <br /> (where someone said that the space is there for older browsers).
Now I'm wondering how I'm supposed to format my code when practicing HTML 5.
Is it <br>, <br/> or <br />?
Simply <br> is sufficient.
The other forms are there for compatibility with XHTML; to make it possible to write the same code as XHTML, and have it also work as HTML. Some systems that generate HTML may be based on XML generators, and thus do not have the ability to output just a bare <br> tag; if you're using such a system, it's fine to use <br/>, it's just not necessary if you don't need to do it.
Very few people actually use XHTML, however. You need to serve your content as application/xhtml+xml for it to be interpreted as XHTML, and that will not work in old versions of IE - it will also mean that any small error you make will prevent your page from being displayed in browsers that do support XHTML. So, most of what looks like XHTML on the web is actually being served, and interpreted, as HTML. See Serving XHTML as text/html Considered Harmful for some more information.
I think this quote from the HTML 5 Reference Draft provides the answer:
3.2.2.2 Void Elements
The term void elements is used to designate elements that must be empty. These requirements only apply to the HTML syntax. In XHTML, all such elements are treated as normal elements, but must be marked up as empty elements.
These elements are forbidden from containing any content at all. In HTML, these elements have a start tag only. The self-closing tag syntax may be used. The end tag must be omitted because the element is automatically closed by the parser.
HTML Example:
A void element in the HTML syntax. This is not permitted in the XHTML syntax.
<hr>
Example:
A void element using the HTML- and XHTML-compatible self-closing tag syntax.
<hr/>
XHTML Example:
A void element using the XHTML-only syntax with an explicit end tag. This is not permitted for void elements in the HTML syntax.
<hr></hr>
In other words:
Invalid HTML 5: <IMG></IMG>
Valid HTML 5: <IMG>, <IMG/>
And while HTML forbids certain closing tags, xhtml requires them:
Invalid xhtml: <img>
Valid xhtml: <img></img> or <img/>
Other elements that are forbidden from having a closing tag in HTML:
Element | Valid HTML | Valid xhtml
AREA | <AREA> | <AREA></AREA>
BASE | <BASE> | <BASE></BASE>
BASEFONT | <BASEFONT> | <BASEFONT></BASEFONT>
BR | <BR> | <BR></BR>
COL | <COL> | <COL></COL>
FRAME | <FRAME> | <FRAME></FRAME>
HR | <HR> | <HR></HR>
IMG | <IMG> | <IMG></IMG>
INPUT | <INPUT> | <INPUT></INPUT>
ISINDEX | <ISINDEX> | <ISINDEX></ISINDEX>
LINK | <LINK> | <LINK></LINK>
META | <META> | <META></META>
PARAM | <PARAM> | <PARAM></PARAM>
The fact that HTML forbids certain closing tags, while xhtml requires them is xhtml's problem. If you're writing HTML, you follow the HTML rules.
At the same time, browsers gave up trying to enforce the standards, because everyone gets it wrong. It's not obvious:
that </BR> is forbidden
that </P> is optional
and </SPAN> is required
And then xhtml came along, with its XML rule that every element must have a closing tag, and people just assumed that HTML was the same thing. So the standards gave up, and were later revised to throw up their hands to the reality.
XML doesn't allow leaving tags open, so it makes <br> a bit worse than the other two. The other two are roughly equivalent with the second (<br/>) preferred for compatibility with older browsers. Actually, space before / is preferred for compatibility sake, but I think it only makes sense for tags that have attributes. So I'd say either <br/> or <br />, whichever pleases your aesthetics.
To sum it up: all three are valid with the first one (<br>) being a bit less "portable".
Edit: Now that we're all crazy about specs, I think it worth pointing out that according to dev.w3.org:
Start tags consist of the following parts, in exactly the following order:
A "<" character.
The element’s tag name.
Optionally, one or more attributes, each of which must be preceded by one or more space characters.
Optionally, one or more space characters.
Optionally, a "/" character, which may be present only if the element is a void element.
A ">" character.
In HTML (up to HTML 4): use <br>
In HTML 5: <br> is preferred, but <br/> and <br /> are also acceptable
In XHTML: <br /> is preferred. Can also use <br/> or <br></br>
Notes:
<br></br> is not valid in HTML 5, it will be thought of as two line breaks.
XHTML is case sensitive, HTML is not case sensitive.
For backward compatibility, some old browsers would parse XHTML as HTML and fail on <br/> but not <br />
Reference:
http://www.w3schools.com/tags/tag_br.asp
http://en.wikipedia.org/wiki/XHTML
According to the spec the expected form is <br> for HTML 5 but a closing slash is permitted.
I would recommend using <br /> for the following reasons:
1) Text and XML editors that highlight XML syntax in different colours will highlight properly with <br /> but this is not always the case if you use <br>
2) <br /> is backwards-compatible with XHTML and well-formed HTML (ie: XHTML) is often easier to validate for errors and debug
3) Some old parsers and some coding specs require the space before the closing slash (ie: <br /> instead of <br/>) such as the WordPress Plugin Coding spec: http://make.wordpress.org/core/handbook/coding-standards/html/
In my experience, I have never come across a case where using <br /> is problematic; however, there are many cases where <br/> or especially <br> might be problematic in older browsers and tools.
XML requires all tags to have a corresponding closing tag. So there is a special short-hand syntax for tags without inner contents.
HTML5 is not XML, so it should not pose such a requirement. Neither is HTML 4.01.
For instance, in HTML5 specs, all examples with br tag use <br> syntax, not <br/>.
UPD Actually, <br/> is permitted in HTML5. 9.1.2.1, 7.
If you're interested in comparability (not compatibility, but comparability) then I'd stick with <br />.
Otherwise, <br> is fine.
Both <br> and <br /> are acceptable in HTML5, but in the spirit of HTML, <br> should be used. HTML5 allows closing slashes in order to be more compatible with documents that were previously HTML 4.01 and XHTML 1.0, allowing easier migration to HTML5. Of course, <br/> is also acceptable, but to be compatible with some older browsers, there should be a space before the closing slash (/).
1) If you are outputting HTML on a regular website you can use <br> or <br/>; both are valid any time you are serving HTML5 as text/html.
2) If you are serving HTML5 as XHTML (i.e. content type application/xhtml+xml, with an XML declaration) then you must use a self-closing tag like so: <br/>. If you don't, some browsers may flat out refuse to render your page (Firefox in particular is very strict about rendering only valid xhtml+xml pages).
3) As noted in 1), <br/> is also valid for HTML5 that happens to be generated as XML but served as regular text/html without an XML declaration (such as from an XSL Transform that generates web pages, or something similar).
To clear up confusion: Putting a space before the slash isn't required in HTML5 and doesn't make any difference to how the page is rendered (if anyone can cite an example I'll retract this, but I don't believe it's true - but IE certainly does a lot of other odd things with all forms of <br> tags).
The excellent validator at http://validator.w3.org is really helpful for checking what's valid (although I'm not sure you can rely on it to also check content-type).
<br> is sufficient but in XHTML <br /> is preferred according to the WHATWG and according to the W3C.
To quote Section 8.1.2.1 of HTML 5.2 W3C Recommendation, 14 December 2017
Start tags must have the following format:
…
After the attributes, or after the tag name if there are no attributes, there may be one or more space characters. (Some attributes are required to be followed by a space. See §8.1.2.3 Attributes below.)
Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS character (/). This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing.
If you use Dreamweaver CS6, then it will autocomplete as <br />.
To validate your HTML file on W3C see : http://validator.w3.org/
<br> and <br/> render differently. Some browsers interpret <br/> as <br></br> and insert two line breaks
<br/> is the most appropriate one. This tag notation can also be used in Reactjs where a line break is required instead of <br>
In most cases HTML tags come in pairs, but a line break doesn't need a pair of tags. To indicate this, HTML uses the <br/> format. <br/> is the right one. Use that format.
<br> tag has no end tag in HTML
In XHTML, the <br> tag must be properly closed, like this: <br />
In XML every tag must be closed. XHTML is an extension of XML, hence all the rules of XML must be followed for valid XHTML. Hence even empty tags (nodes without child nodes) like <br> should be closed. XML has a short form called self-closing tags for empty nodes. You can write <br></br> as <br />. Hence in XHTML <br /> is used.
HTML is very lenient in this regard, and there is no such rule. So in HTML empty nodes like <br> <hr> <meta> etc are written without the closing forward slash.
HTML
<br>
<hr>
<meta name="keywords" content="">
<link rel="canonical" href="http://www.google.com/">
XHTML
<br />
<hr />
<meta name="keywords" content="" />
<link rel="canonical" href="http://www.google.com/" />
Not all tags can be self closed. For example, a tag like <script src="jQuery.min.js" /> is not allowed by XHTML DTD.
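The two serializations can be reproduced from a single document tree. The sketch below uses Python's standard `xml.etree.ElementTree`, whose `tostring` function accepts a `method` argument; the element contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Build <p>line one<br>line two</p> as a tree.
p = ET.Element("p")
p.text = "line one"
br = ET.SubElement(p, "br")
br.tail = "line two"

# Same tree, two output syntaxes.
as_html = ET.tostring(p, encoding="unicode", method="html")
as_xml = ET.tostring(p, encoding="unicode", method="xml")

print(as_html)  # void element written without a slash
print(as_xml)   # empty element written in self-closing form
```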
Well all I know is that <br /> gives a break with a white line and <br> just gives a break in some cases. This happened to me when I was setting up an IPN-script (PHP) and sent mails and checked the inbox for it. Don't know why, but I only got the message to look neat using both <br /> and <br>
Have a look at the mail here: http://snag.gy/cLxUa.jpg
The first two sections of text are separated by <br />, hence the whitespace lines; the last three rows of text at the bottom and the last section are separated by <br>, which just gives a new row.
Ummm.....does anyone know a SINGLE vendor, user-agent, or browser maker that has ever followed the W3C Specifications 100%??? So if HTML5 says it supports all three break element versions, you can bet the vendors support the same and even sloppier versions!
The ONLY thing that matters in this debate is to CONSISTENTLY use coding that also happens to follow XML specifications as well as HTML specifications when possible. That means you should use the correct XML version of the break tag and encourage all your team to do the same:
<br />
The same space-slash format should apply for the img, a, hr, and meta tags in your code. Why? Because:
It is backwards compatible with older XHTML user-agents / browsers
The browser vendors support the XML version anyway so the HTML5 specification is moot.
The sloppy implementations of most user-agents today, in the past, and in the future will accept it.
It allows your markup to be comparable with XML standards should you need to go back to creating XHTML/XML documents from your markup.
It's "good coding practice" for ALL WEB DEVELOPERS to keep using solid markup practices that follow XML, including coding in all lower case, quoted attributes, escaped XML characters, etc. etc. Why? In the future if you have to switch to XML data you automatically code and think in XML.
We can only hope that in the future World Wide Web, we move away from private vendor-implemented standards and go back to solid, reliable, verified markup that parses faster, moves data over the wires faster, and make our future Internet a more standardized medium using XML.
Old Netscape always needed the " /" space before the slash or it failed. Who cares about old browsers, right? But it's one more case for my version, which I still like :)
Besides, in the robotic and machine world that's here, where robots don't have the same Human-interface coding problems HTML5 solves for us, they will gladly go back to XML data systems and parse such UI web pages much faster when converted to XML data.
<br> and <br /> render differently in some browsers, so choosing either over the other isn't going to hurt your project, but do expect a bulk find..replace to affect the page render in some browsers, which may result in extra work for yourself or even embarrassment should the change affect nothing in your test browser, but break it in the preferred browser of your clients'.
I prefer <br> since it is what I have used since Erwise and Netscape Navigator (early web browsers), but there's no reason not to choose <br /> instead. It may be useful for some preprocessing, comparability, etc.
Even if your choice boils down to preferring the look of one over the other, or you (or your favourite HTML editor e.g. Dreamweaver) might like your code to be xml compliant. It's up to you.
A quick side note:
Not to be confused with br, but in addition you may also consider using wbr tags in your HTML: A word break opportunity tag, which specifies where in a text it would be ok to add a line-break.
For further reading, please have a read of the HTML5 spec.
Elements that have no end tag are called empty elements. In HTML 4 and HTML 5, their end tags are not required and can be omitted.
XHTML is strict about tags: every element must begin with a start tag and end with an end tag.

How can I parse and normalize HTML from different HTML generators?

This is an extension of this question. I'm trying to parse HTML snippets embedded in an XML backup of a Blogger blog and retag them with InDesign tags.
Blogger doesn't standardize the HTML for any of its posts, and the posts can be written in Word, Windows Live Writer, the native Blogger interface, or text editors, resulting in tons of different forms of HTML. Some posts don't mark paragraphs and only use double <br>s in between paragraphs—others use actual <p> tags.
What's the best way to parse this unstandard conglomeration of tags?
Additionally, each post is not a complete HTML file--just a snippet that gets inserted into a template—which means that there is no overall HTML structure to parse (<html><body></body></html>, etc.) Does that have any effect on XML/HTML parsing?
Here's some potential examples, mostly standard HTML, missing paragraphs:
This is a section of a blog post. It has links and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li><ul>
And another paragraph here...
<br>
<br/>
Etc.
The Word HTML looks like this - http://www.timeatlas.com/mos/images/stories/word_html_tags.png
HTML::Parser?
The HTML generated by Word is relatively easy to deal with. I would just get rid of all the tag attributes (unless you care about styles). That would leave you with fairly plain HTML which you can then style.
HTML::TokeParser::Simple can help make that relatively painless.
As for the other stuff, that will take some trial and error. I am going to think more about that and post later if I can think of something clever.
Later Update:
Well, here is something that makes me cringe a little but it seems to work:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use Text::Markdown qw( markdown );
my $html = read_file \*DATA;
$html =~ s{<br(?: ?/)?>}{\n\n}g;  # turn <br>, <br/> and <br /> into blank lines
print markdown( $html );
__DATA__
This is a section of a blog post. It has links and lists and stuff. Weee....
<br>
<br>
Here's a list
<br/>
<br />
<ul><li>Item 1</li><li>Item 2</li></ul>
And another paragraph here...
<br>
<br/>
Output:
<p>This is a section of a blog post. It has links and lists and
stuff. Weee....</p>
<p>Here's a list</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
<p>And another paragraph here...</p>
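The same idea can be sketched without Markdown: treat any run of two or more br tags as a paragraph boundary and wrap what lies between. This Python version is a simplified illustration, not a port of the Perl script; `paragraphs_from_breaks` is a hypothetical name, and unlike the Markdown approach it would also wrap block elements such as <ul> in <p>, so real posts need extra handling.

```python
import re

# Two or more consecutive <br>, <br/> or <br /> tags, with optional
# whitespace between them, mark a paragraph boundary.
BR_RUN = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def paragraphs_from_breaks(snippet):
    chunks = (chunk.strip() for chunk in BR_RUN.split(snippet))
    # Drop empty chunks left by leading or trailing break runs.
    return ["<p>" + chunk + "</p>" for chunk in chunks if chunk]


sample = "First paragraph\n<br>\n<br/>\nSecond one<br />\n<br>Third"
for paragraph in paragraphs_from_breaks(sample):
    print(paragraph)
```

A single <br> is left inside its paragraph, since it is more likely a deliberate line break than a paragraph separator.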
As I said in the other question, I like XML::Twig. It can handle both XML and HTML.
FWIW, I tend to use XML::LibXML for all my XML and HTML needs. Here is a one-liner that will convert a line of "bad" HTML into a well-formed XHTML document:
perl -MXML::LibXML -ne 'my $p = XML::LibXML->new->parse_html_string($_); print $p->toString'
In your case, you probably want to use the DOM to emit a new document that has the correct tags. This is straightforward; XML::LibXML uses the same W3C DOM that JavaScript does.
As an example, this input:
<p>Foo<p>Bar<br>Baz!
Gets translated into:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Foo</p><p>Bar<br/>Baz!
</p></body></html>
This is probably what you want, and remember, use the DOM to translate... don't worry about this printed representation.