I accept that for some symbols we have to use character entities. But what is the difference between using & and &amp;, or > and &gt;, for characters that are available on the keyboard?
Just for knowledge purpose.
Thanks in Advance.
The only characters you need to use character references for are < (start of tag), & (start of character reference), " (start/end of attribute value) and ' (ditto), and then only in places where they have special meaning.
e.g. < means "start of tag" in many parts of an HTML document, so you have to use &lt; if you want to express "less than symbol" instead.
Character entities are used to display reserved characters in HTML.
< and & are reserved characters.
My HTML script includes some special characters, for example, < (less than), or ' (single quote). For < (lt), the code represents it as &lt; and for ' (single quote), the code is &#39;.
The browser cannot show them correctly. Instead, it shows &lt; for < and &#39; for '. Is there any generic way to show these characters correctly? I don't want to replace all the &amp; with & in the source code. Thanks.
The entity name for < (less than) is &lt;
If you want to use two entities together, then each one has to include its own &, like
&amp;&lt;
To know more about entities : https://www.w3schools.com/html/html_entities.asp
HTML 4 states pretty clearly which characters should be escaped:
Four character entity references deserve special mention since they
are frequently used to escape special characters:
"&lt;" represents the < sign.
"&gt;" represents the > sign.
"&amp;" represents the & sign.
"&quot;" represents the " mark.
Authors wishing
to put the "<" character in text should use "&lt;" (ASCII decimal 60)
to avoid possible confusion with the beginning of a tag (start tag
open delimiter). Similarly, authors should use "&gt;" (ASCII decimal
62) in text instead of ">" to avoid problems with older user agents
that incorrectly perceive this as the end of a tag (tag close
delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid
confusion with the beginning of a character reference (entity
reference open delimiter). Authors should also use "&amp;" in
attribute values since character references are allowed within CDATA
attribute values.
Some authors use the character entity reference "&quot;" to encode
instances of the double quote mark (") since that character may be
used to delimit attribute values.
I'm surprised I can't find anything like this in HTML 5. With the help of grep the only non-XML mention I could find comes as an aside regarding the deprecated XMP element:
Use pre and code instead, and escape "<" and "&" characters as "&lt;" and "&amp;" respectively.
Could someone point to the official source on this matter?
The specification defines the syntax for normal elements as:
Normal elements can have text, character references, other elements, and comments, but the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand. Some normal elements also have yet more restrictions on what content they are allowed to hold, beyond the restrictions imposed by the content model and those described in this paragraph. Those restrictions are described below.
So you have to escape <, or & when followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. (Obviously, if you don’t want to terminate the attribute value there, escape the quotation mark.)
These rules don’t apply to <script> and <style>; you should avoid putting dynamic content in those. (If you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialization.)
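As a rough sketch of the JSON advice above, the replacements can be applied right after serialization. The function name json_for_script is made up for this illustration, and the sketch assumes the result is read as JavaScript inside a <script> element (\x3c is a JavaScript string escape, not strict JSON; \u003c would be the choice if a strict JSON parser reads the text).

```python
import json


def json_for_script(value):
    """Serialize value to JSON text that is safer to embed in a <script>.

    Hypothetical helper: applies the three replacements described above
    after JSON serialization.
    """
    # ensure_ascii=False keeps U+2028/U+2029 literal so the explicit
    # replacements below are what actually neutralize them.
    text = json.dumps(value, ensure_ascii=False)
    return (
        text.replace("<", "\\x3c")
        .replace("\u2028", "\\u2028")
        .replace("\u2029", "\\u2029")
    )


payload = json_for_script({"html": "</script><b>hi</b>"})
```

The point of the first replacement is that the serialized text can no longer contain "</script", so it cannot close the element early.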
From http://www.w3.org/html/wg/drafts/html/master/single-page.html#serializing-html-fragments
Escaping a string (for the purposes of the algorithm* above) consists
of running the following steps:
Replace any occurrence of the "&" character by the string "&amp;".
Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any
occurrences of the ">" character by the string "&gt;".
*Algorithm is the built-in serialization algorithm as called e.g. by the innerHTML getter.
Strictly speaking, this is not exactly an answer to your question, since it deals with serialization rather than parsing. But on the other hand, the serialized output is designed to be safely parsable. So, by implication, when writing markup:
The & character should be replaced by &amp;
Non-breaking spaces should be escaped as &nbsp; (surprise!...)
Within attributes, " should be escaped as &quot;
Outside of attributes, < should be escaped as &lt; and > should be escaped as &gt;
I'm intentionally writing "should", not "must", since parsers may be able to correct violations of the above.
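The quoted escaping steps are short enough to sketch directly. This is only an illustration of the steps as listed, not the browser's implementation; the name escape_html and the attribute_mode parameter are my own.

```python
def escape_html(text, attribute_mode=False):
    """Apply the quoted escaping steps in order (illustrative sketch)."""
    # The ampersand goes first, otherwise the references produced by the
    # later steps would themselves be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        text = text.replace('"', "&quot;")
    else:
        text = text.replace("<", "&lt;").replace(">", "&gt;")
    return text


in_text = escape_html("a < b & c")
in_attribute = escape_html('say "hi" <now>', attribute_mode=True)
```

Note that in attribute mode < and > are left alone, which matches the steps as quoted.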
Adding my voice to insist that things are not that easy -- strictly speaking:
HTML5 is a language specification
it could be serialized either as HTML or as XML
Case 1 : HTML serialization
(the most common)
If you serialize your HTML5 as HTML, "the text must not contain the character U+003C LESS-THAN SIGN (<) or an ambiguous ampersand."
An ambiguous ampersand is an "ampersand followed by one or more alphanumeric ASCII characters, followed by a U+003B SEMICOLON character (;)"
Furthermore, "the parsing of certain named character references in attributes happens even with the closing semicolon being omitted."
So, in that case editable && copy (notice the spaces around &&) is a valid construction in HTML5 serialized as HTML, as none of the ampersands is followed by a letter.
As a counter example: editable&&copy is not safe (even if this might work) as the last sequence &copy might be interpreted as the entity reference for ©
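One way to see the difference between the two spellings is Python's html.unescape, which decodes character references using HTML5 rules, including some named references without the closing semicolon. It is used here only as a stand-in for an HTML parser, not as a statement about any particular browser.

```python
import html

# Ampersands followed by a space or by another ampersand do not start
# a character reference, so the text comes back unchanged.
spaced = html.unescape("editable && copy")

# Here the second ampersand is followed by "copy", a legacy name that
# is recognized even without the trailing semicolon.
unspaced = html.unescape("editable&&copy")
```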
Case 2 : XML serialization
(the less common)
Here the classic XML rules apply. For example, each and every ampersand either in the text or in attributes should be escaped as &amp;.
In that case && (with or without spaces) is invalid XML. You should write &amp;&amp;
Tricky, isn't it ?
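The XML case can be checked with any XML parser. Here is a small sketch using Python's standard xml.etree.ElementTree; the choice of parser and the helper name is_well_formed are mine, purely for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


raw_ok = is_well_formed("<p>editable && copy</p>")
escaped_ok = is_well_formed("<p>editable &amp;&amp; copy</p>")
decoded = ET.fromstring("<p>editable &amp;&amp; copy</p>").text
```

The raw && is rejected regardless of the surrounding spaces, while the escaped form parses and decodes back to &&.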
Yes, I've googled it, and surprisingly got confusing answers.
One page says that < > & " are the only reserved characters in (X)HTML. No doubt, this makes sense.
This page says < > & " ' are the reserved characters in (X)HTML. A little confusing, but okay, this makes sense too.
And then comes this page which says < > & " © ° £ and non-breaking space (&nbsp;) are all reserved characters in (X)HTML. This makes no sense at all, and pretty much adds to my confusion.
Can someone knowledgeable, who actually knows this stuff, clarify which the reserved characters in (X)HTML actually are?
EDIT: Also, should all the reserved characters in code be escaped when wrapped in <pre> tag? or is it just these three -- < > & ??
The XHTML 1.0 specification states at http://www.w3.org/TR/2002/REC-xhtml1-20020801/#xhtml:
XHTML 1.0 [...] is a reformulation of the three HTML 4 document types as
applications of XML 1.0 [XML].
The XML 1.0 specification states at http://www.w3.org/TR/2008/REC-xml-20081126/#syntax:
Character Data and Markup: Text consists of intermingled character
data and markup. [...] The ampersand character (&) and the left angle
bracket (<) MUST NOT appear in their literal form, except when used as
markup delimiters, or within a comment, a processing instruction, or a
CDATA section. If they are needed elsewhere, they MUST be escaped
using either numeric character references or the strings "&amp;" and
"&lt;" respectively. The right angle bracket (>) may be represented
using the string "&gt;", and MUST, for compatibility, be escaped
using either "&gt;" or a character reference when it appears in the
string "]]>" in content, when that string is not marking the end of
a CDATA section.
This means that when writing the text parts of an XHTML document you must escape &, <, and >.
You can escape a lot more, e.g. &uuml; for umlaut u. You can as well state that the document is encoded in, for example, UTF-8 and write the byte sequence 0xc3bc instead to get the same umlaut u.
When writing the element parts (col. "tags") of the document, there are different rules. You have to take care of ", ' and a lot of rules concerning comments, CDATA and so on. There are also rules which characters can be used in element and attribute names. You can look it up in the XML specification, but in the end it comes down to: for element and attribute names, use letters, digits and "-"; do not use "_". For attribute values, you must escape & and (depending on the quote style) either ' or ".
If you use one of the many libraries to write XML / XHTML documents, somebody else has already taken care of this and you just have to tell the library to write text or elements. All the escaping is done in the background.
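As one example of such a library, here is a minimal sketch with Python's standard xml.etree.ElementTree. The element name, attribute, and strings are invented for the illustration; the point is only that the raw text goes in and the library decides how to escape it.

```python
import xml.etree.ElementTree as ET

# Build the element from plain, unescaped strings.
p = ET.Element("p", {"title": 'say "hi" & wave'})
p.text = "1 < 2 & 3 > 2"

# Serialization performs the escaping in the background.
markup = ET.tostring(p, encoding="unicode")

# Parsing the result gives the original strings back.
roundtrip = ET.fromstring(markup)
```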
Only < and & need to be escaped. Inside attributes, " or ' (depending on which quote style you use for the attribute's value) needs to be escaped, too.
<a href="#" onclick='here you can use " safely'></a>
By writing "(X)HTML", you are asking (at least) two different questions.
By the HTML rules, with "HTML" meaning any HTML version up to and including HTML 4.01, only "<" and "&" are reserved. The rules are somewhat complex. They should not appear literally except in their syntactic use in tags, entity references, and character references. But by the formal rules, they may appear literally e.g. in the context "A & B" or "A < B" (but A&B would be formally wrong, and so would A<B).
The XHTML rules, based on XML, are somewhat stricter, simpler: "<" and "&" are unconditionally reserved.
The ASCII quotation mark " and the ASCII apostrophe ' are not reserved, except in the very specific sense that a quoted attribute value must not literally contain the character used as quote, i.e. in "foo" the string foo must not contain " as such and in 'foo' the string foo must not contain ' as such.
The characters < > & " are reserved by XML format.
It means that you can use < and > chars only to define tags (<mytag></mytag>).
Double quotes (") are used to define values of attributes (<mytag attribute="value" />)
Ampersand (&) is used to write entities (&amp; is used when you actually want to write an ampersand, NOT &). Also, when you write a URL in your XML document, you should use &amp;, not just &: www.aaa.com?a=1&b=2 - is wrong; www.aaa.com?a=1&amp;b=2 - is good!
XHTML is based on XML, so what I have written applies to XHTML.
&copy; &deg; &pound; - These are not reserved chars. These are entities defined specifically for XHTML, not for XML.
In XML you can simply write ©. In XHTML you can also simply write ©, or use the entity &copy;, or the numeric entity &#x00A9;.
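A short sketch of both points, using Python's standard html module as a convenient escaper and decoder (my choice for the illustration; the URL is the one from the answer):

```python
import html

# Escape the ampersand before placing the URL in an attribute value.
url = "www.aaa.com?a=1&b=2"
escaped_url = html.escape(url, quote=True)
link = f'<a href="{escaped_url}">link</a>'

# The named entity and the numeric reference decode to the same
# character as writing the copyright sign directly.
named = html.unescape("&copy;")
numeric = html.unescape("&#x00A9;")
```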
In addition to the other answers, it might help to know that there are also forbidden characters: all control characters in ASCII and ISO-8859-1 except TAB, LF, and CR.
https://www.w3.org/MarkUp/html3/specialchars.html
Why do we use &amp; instead of & ?
What is the advantage ?
From the HTML 4.01 Specification:
Authors should use &amp; (ASCII
decimal 38) instead of & to avoid
confusion with the beginning of a
character reference (entity reference
open delimiter). Authors should also
use &amp; in attribute values since
character references are allowed
within CDATA attribute values.
Your question should be reversed!
You should always use &amp;, because that's the only way to create valid HTML.
Since the & character is used for entities (such as &amp; or &gt;), it must be escaped in order to write a literal &.
What are escape tags in html?
Are they &#34; &#60; &#62; to represent " < >?
And how do these work?
Is that hex, or what is it?
How is it made, and why aren't they just the characters themselves?
Here are some common entities. You do not need to use the full code - there are common aliases for frequently used entities. For example, you can use &lt; and &gt; to indicate less than and greater than symbols. &amp; is ampersand, etc.
EDIT: That should be - &lt; &gt; and &amp;
EDIT: Another common character is &nbsp; which is often used to represent tabs in <code> segments
How do these work?
Anything of the form &#num; is replaced with the character whose number in the ASCII table (more generally, its Unicode code point) matches num.
Is that hex, or what is it?
It's not hex; the number is the character's decimal number in the ASCII table. Check out an ASCII table and look at the Dec and HTML columns.
Why aren't they just the characters themselves?
Look at this example:
<div>Hey you can use </div> to end div tag!</div>
It would mess up the interpreter. It's a bad example, but you got the idea.
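To make the problem concrete, here is a small sketch that feeds both versions of that line to Python's standard html.parser and records what the parser sees. The Recorder class and the parse helper are made up for this illustration.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Collect the start tags, end tags, and text the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def parse(markup):
    recorder = Recorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


raw = parse("<div>Hey you can use </div> to end div tag!</div>")
escaped = parse("<div>Hey you can use &lt;/div&gt; to end div tag!</div>")
```

In the raw version the parser reports two end tags, so the div closes in the middle of the sentence. In the escaped version there is a single end tag and the text contains the literal </div>.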
Why can't you use escape characters like in programming languages?
I don't have an exact answer to that. But HTML, like XML, is a markup language and not a programming language, and the answer probably lies in the history of how markup languages became what they are now.
No, it's not hex, it's decimal. The hex equivalent is &#x3C;. But one usually uses &lt; (less-than) for < and &gt; for > instead.
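The relationship between the decimal form, the hex form, and the named entity can be sketched in a few lines of Python. The helper names decimal_ref and hex_ref are my own; html.unescape is the standard library decoder.

```python
import html


def decimal_ref(ch):
    """Build the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_ref(ch):
    """Build the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


less_than_decimal = decimal_ref("<")
less_than_hex = hex_ref("<")

# All three spellings decode to the same character.
decoded = [html.unescape(ref) for ref in (less_than_decimal, less_than_hex, "&lt;")]
```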
Here is the complete reference of html entities:
Complete HTML Entities
It is used for correct character formatting.
HTML has a set of special characters which browsers recognize as part of the HTML language itself. For example, browsers know that when they encounter a < character in the HTML code, they are to interpret it as the beginning of a tag. > is another example of a reserved character. Therefore use HTML character entities to avoid any problems; it is also correct practice.
Those escapes are decimal ascii escapes. You can see the values here. Take a look at the HTML column.