HTML Codes .. Names Vs Numbers - html

This might seem like a stupid question but I have always wondered: what would be the advantage of using HTML code names versus HTML code numbers? Is there a right place or a wrong place for each version?
By HTML codes I am referring to this..
http://ascii.cl/htmlcodes.htm
I know for validation purposes codes should be used, for example using &amp; or &#38; versus using a bare &. However I don't know when it would be right to use &amp; over &#38; ... or does it simply make no difference?

It makes no difference. The reason why, for example, &amp; was created was to make it easier for coders to remember and make code easier to read.
It just comes down to which one is easier for us (humans) to read.

Some terminology: A code like "&amp;" is properly called a character entity reference; a code like "&#38;" is a numeric character reference.
Together, we can refer to them all as "HTML entities." For a given code point, there is sometimes a character entity reference, but there is always a numeric character reference, which can be formed from the Unicode code point of the character. For instance, ℛ (U+211B) has the numeric character reference "&#8475;".
Generally it's the ASCII characters that have character entity references, but not always.
Character entity references are usually easier to read, but in a particular context a set of numeric character references might be clearer, for instance if you were writing a regular expression to match a certain block of Unicode characters.
When you say "for validation purposes codes should be used," I think you have in mind the rule that a bare ampersand is not valid HTML. That's specific to this character.
Update
An example where you have to use the numeric character reference: HTML 4 has no character entity reference for the single quote character, "'" (&apos; is defined by XML and HTML5, but not by HTML 4). A piece of JavaScript to scrub quote characters out of a string has to use the numeric character reference, &#39;.
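As a rough illustration of how a numeric character reference is formed from a code point, here is a minimal JavaScript sketch. The helper names toNumericReference and toHexReference are made up for this example and are not part of any library.

```javascript
// Hypothetical helper: decimal numeric character reference for the
// first code point of `ch`.
function toNumericReference(ch) {
  return '&#' + ch.codePointAt(0) + ';';
}

// Hypothetical helper: the same reference in hexadecimal form.
function toHexReference(ch) {
  return '&#x' + ch.codePointAt(0).toString(16).toUpperCase() + ';';
}

console.log(toNumericReference('&'));  // &#38;
console.log(toNumericReference("'"));  // &#39;
console.log(toNumericReference('ℛ'));  // &#8475;
console.log(toHexReference('ℛ'));      // &#x211B;
```

Both forms refer to the same character; which one to write is a readability choice.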

Using names is always preferable, as it is always more readable. Consider the following two equivalent pieces of code:
$location = "Faith Life Church";
$city = "Sarasota";
/*...*/
foreach ($stmt as $row) {
    foreach ($row as $variable => $value) {
        $variable = strtolower($variable);
        $$variable = $value;
    }
}
And
$v073124 = "Faith Life Church";
$v915431 = "Sarasota";
/*...*/
foreach ($v3245 as $v9825) {
    foreach ($v9825 as $v85423 => $v8245631) {
        $v85423 = strtolower($v85423);
        $$v85423 = $v8245631;
    }
}
Which would you consider more readable?

It's your choice. Some people may find the numeric codes easier to remember than the names.
BUT I think HTML 5 requires you to use the names wherever one exists. I'm really not sure about this, though.

Related

Why do some strings contain "&nbsp;" and some " ", when my input is the same (" ")?

My problem occurs when I try to use some data/strings in a p-element.
I start off with data like this:
data: function() {
  return {
    reportText: {
      text1: "This is some subject text",
      text2: "This is the conclusion",
    }
  }
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference: one p element uses &nbsp; and the other uses ordinary spaces, even though I started off with both strings using only ordinary spaces. I know &nbsp; and an ordinary space technically represent the same thing, but the problem with the &nbsp; string is that it gets treated as a string with one large word instead of multiple separate words. This screws up my layout and I can't solve this by using certain CSS properties (word-wrap etc.).
Other things I have tried:
Tried sanitizing the strings by using .replace("&nbsp;", " "), but that doesn't do anything. I assume this is because it basically is the same, so there is nothing to really replace. Same reason why I have to use blockcode on stackoverflow to make the distinction between &nbsp; and " ".
Logged the data from vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see strings with ordinary spaces.
So I have the following questions:
Why does this happen? I can't seem to find any logical explanation why it sometimes uses &nbsp; and sometimes uses ordinary spaces; it seems random, but I am sure I am missing something.
Any other things I could try to follow the path my string takes, so I can see where the transformation from " " to &nbsp; happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/\u00A0/g, ' ')
Explanation:
JavaScript typically allows the use of unicode characters in two ways: you can input the rendered character directly, or you can use a unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like &nbsp;).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.
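The replacement described above can be sketched as follows. This is a minimal example, and the helper names normalizeSpaces and revealNbsp are invented for illustration. The second helper is one possible way to make non-breaking spaces visible when logging, which helps with tracing where they enter the string.

```javascript
// Hypothetical helper: replace every no-break space (U+00A0) with an
// ordinary space (U+0020).
function normalizeSpaces(str) {
  return str.replace(/\u00A0/g, ' ');
}

// Hypothetical helper: show no-break spaces as the literal text \u00A0
// so they can be told apart from ordinary spaces in a log.
function revealNbsp(str) {
  return str.replace(/\u00A0/g, '\\u00A0');
}

// Sample input mixing both kinds of space (made up for this example).
const text = 'This\u00A0is\u00A0the conclusion';

console.log(revealNbsp(text));      // This\u00A0is\u00A0the conclusion
console.log(normalizeSpaces(text)); // This is the conclusion
```

Note that a regex literal takes a single backslash in \u00A0. Doubling it inside a character class would instead match the individual characters \, u, 0 and A.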

Potential pitfalls with my new markup language?

Something that's really bothered me about XHTML, and XML in general is why it's necessary to indicate what tag you're closing. Things like <b>bold <i>bold and italic</b> just italic</i> aren't legal anyways. Thus I think using {} makes more sense. Anyway, here's what I came up with:
doctype;
html
{
    head
    {
        title "my webpage"
        javascript '''
            // code here
            // single quotes do not allow variable substitution, like PHP
            // triple quotes can be used like Python
        '''
    }
    body
    {
        table {
            tr {
                td "cell 1"
                td "cell 2"
                td #var|filter1|filter2:arg
            }
        }
        p "variable #var in a string"
        p "variable #{var|withfilter}"
        input(type=password, value=secret); // attributes are specified like this
        br; // semi-colons are used on elements that don't have content
        p { "strings are" "automatically" "concatenated together" #andvars "too" }
    }
}
Tags that only contain one element do not need to be enclosed in braces (for example, in td "cell 1" the td is closed immediately after the text). Strings are output directly, except that double-quoted strings allow variable substitution and single-quoted strings do not. I'm adopting a filtering scheme similar to Django's. The thing I'm most concerned about, I think, is variable substitution in double quotes. I don't want people to have to open and close single quotes everywhere because things are being treated as vars that shouldn't be. I don't think the # character is very commonly used in code. I was going to use $ like PHP, but jQuery uses that, and I want to allow people to do substitutions in their JS too (of course, if they don't need to, they should use single quotes!)
Templates will use "dictionaries". By default, it uses this HTML dict, with familiar tags, but you can easily add your own. "Tags" may consist of not just one, but multiple HTML tags.
Still need to decide how to do loops and including partials...
Edit: Started an open source project, for those interested.
I believe you can get close to that with the syntax of the Tcl scripting language.
The thing I like the most about your idea is the removal of the (to me very) redundant information in the closing tags of markup that has its roots in SGML.
Another clean option IMO is to go down the road of using indentation to specify scope, eliminating braces altogether. With the assumption of a little editor support, I can imagine this happening.
I think it's possibly stifling that globally used specifications cater to the theoretical person using vi or Notepad to type out their markup...

Encoding rules for URL with the `javascript:` pseudo-protocol?

Is there any authoritative reference about the syntax and encoding of a URL for the pseudo-protocol javascript:? (I know it's not very well regarded, but anyway it's useful for bookmarklets).
First, we know that standard URLs follow the syntax:
scheme://username:password@domain:port/path?query_string#anchor
but this format doesn't seem to apply here. Indeed, it seems, it would be more correct to speak of URI instead of URL : here is listed the "unofficial" format javascript:{body}.
Now, then, which are the valid characters for such a URI (what are the escape/unescape rules) when embedding it in HTML?
Specifically, if I have the code of a javascript function and I want to embed it in a javascript: URI, which are the escape rules to apply?
Of course one could escape every non-alphanumeric character, but that would be overkill and make the code unreadable. I want to escape only the necessary characters.
Further, it's clear that it would be bad to use some urlencode/urldecode routine pair (those are for query string values), we don't want to decode '+' to spaces, for example.
My findings, so far:
First, there are the rules for writing a valid HTML attribute value: but here the standard only requires (if the attribute value is enclosed in quotes) an arbitrary CDATA (actually a %URI, but HTML itself does not impose additional validation at its level: any CDATA will validate).
Some examples:
<a href="javascript:alert('Hi!')"> (1)
<a href="javascript:if(a > b && 1 < 0) alert( b ? 'hi' : 'bye')"> (2)
<a href="javascript:if(a&gt;b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')"> (3)
Example (1) is valid. But also example (2) is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example 3 is valid XHTML 1.0 Strict).
Now, is example (2) a valid javascript: URI ? I'm not sure, but I'd say it's not.
From RFC 2396: a URI is subject to some additional restrictions and, in particular, the escape/unescape via %xx sequences. And some characters are always prohibited:
among them spaces and {}# .
The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don't have a 'query string', so the ? can be used as any non-special character). I assume javascript: URIs should be considered among them.
This would imply that the valid characters inside the 'body' of a javascript: URI are
a-zA-Z0-9
_.!~*'();?:@&=+$,/-
%hh : (escape sequence, with two hexadecimal digits)
with the additional restriction that it can't begin with /.
This still leaves out some "important" ASCII characters, for example
{}#[]<>^\
Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
But in other respects, it seems too restrictive. Braces and brackets, especially: I understand that they are normally used unescaped and browsers have no problems.
And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?
I still don't know if there are some standard functions to make this escape/unescape (in mainstream languages) or some sample code.
javascript: URLs are currently part of the HTML spec and are specified at https://html.spec.whatwg.org/multipage/browsing-the-web.html#the-javascript:-url-special-case
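For the "escape only the necessary characters" part of the question, here is a minimal JavaScript sketch based on the RFC 2396 character set discussed above. The function names toJavascriptUri and fromJavascriptUri are hypothetical. The sketch relies on the built-in encodeURI, which leaves letters, digits and ;/?:@&=+$,-_.!~*'()# untouched and percent-encodes everything else, so only # needs extra handling.

```javascript
// Hypothetical helper: wrap JavaScript source in a javascript: URI,
// percent-encoding only characters outside the RFC 2396 "uric" set.
function toJavascriptUri(source) {
  // encodeURI keeps '#', which RFC 2396 excludes, so encode it by hand.
  return 'javascript:' + encodeURI(source).replace(/#/g, '%23');
}

// Hypothetical helper: recover the source. decodeURIComponent only
// reverses %xx sequences, so '+' is not turned into a space.
function fromJavascriptUri(uri) {
  return decodeURIComponent(uri.replace(/^javascript:/, ''));
}

const source = "if (a) { alert('hi') }";
const uri = toJavascriptUri(source);
console.log(uri); // javascript:if%20(a)%20%7B%20alert('hi')%20%7D
console.log(fromJavascriptUri(uri) === source); // true
```

When the result is placed in an HTML attribute, any & it contains still has to be written as &amp; under the attribute-value rules discussed in the question; that step is separate from the URI escaping and is not done here.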

Single vs Double quotes (' vs ")

I've always used single quotes when writing my HTML by hand. I work with a lot of rendered HTML which always uses double quotes. This allows me to determine if the HTML was written by hand or generated. Is this a good idea?
What is the difference between the two? I know they both work and are supported by all modern browsers but is there a real difference where one is actually better than the other in different situations?
The w3 org said:
By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa. Authors may also use numeric character references to represent double quotes (&#34;) and single quotes (&#39;). For double quotes authors can also use the character entity reference &quot;.
So... seems to be no difference. Only depends on your style.
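The rule quoted above can be turned into a small helper when generating HTML. This is only a sketch: quoteAttr is a hypothetical name, and preferring double quotes as the default delimiter is an assumption, not something the spec requires.

```javascript
// Hypothetical helper: pick a delimiter for an attribute value following
// the rule quoted above, falling back to &quot; when both kinds of quote
// appear in the value.
function quoteAttr(value) {
  // A bare ampersand must be escaped whichever delimiter is used.
  const escaped = String(value).replace(/&/g, '&amp;');
  if (!escaped.includes('"')) return '"' + escaped + '"';
  if (!escaped.includes("'")) return "'" + escaped + "'";
  return '"' + escaped.replace(/"/g, '&quot;') + '"';
}

console.log('<input value=' + quoteAttr("user's choice") + '>');
console.log('<input value=' + quoteAttr('"User"') + '>');
console.log('<input value=' + quoteAttr('say "hi" to \'em') + '>');
```

The first call keeps double-quote delimiters, the second switches to single quotes, and the third has to escape the inner double quotes as &quot;.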
I use " as a top-tier and ' as a second tier, as I imagine most people do. For example
<a href="#" onclick="alert('Clicked!');">Click Me!</a>
In that example, you must use both, it is unavoidable.
Quoting Conventions for Web Developers
The Short Answer
In HTML, single quotes (') and double quotes (") are interchangeable; there is no difference.
But consistency is recommended, therefore we must pick a syntax convention and use it regularly.
The Long Answer
Web development often involves many programming languages: HTML, JS, CSS, PHP, ASP, RoR, Python, etc. Because of this we have many syntax conventions for different programming languages. Often habits from one language will follow us to other languages, even if they are not considered "proper", e.g. commenting conventions. Quoting conventions also fall into this category for me.
But I tend to use HTML tightly in conjunction with PHP. And in PHP there is a major difference between single quotes and double quotes. In PHP with double quotes "you can insert variables directly within the text of the string". (scriptingok.com) And when using single quotes "the text appears as it is". (scriptingok.com)
PHP takes longer to process double quoted strings. Since the PHP parser has to read the whole string in advance to detect any variable inside—and concatenate it—it takes longer to process than a single quoted string. (scriptingok.com)
 
Single quotes are easier on the server. Since PHP does not need to read the whole string in advance, the server can work faster and happier. (scriptingok.com)
Other things to consider
Frequency of double quotes within string. I find that I need to use double quotes (") within my strings more often than I need to use single quotes (') within strings. To reduce the number of character escapes needed I favor single quote delimiters.
It's easier to make a single quote. This is fairly self explanatory but to clarify, why press the SHIFT key more times than you have to.
My Convention
With this understanding of PHP I have set the convention (for myself and the rest of my company) that strings are to be delimited by single quotes by default for server optimization. Double quotes are used within the string if quotes are required, such as for JavaScript within an attribute, for example:
<button onClick='func("param");'>Press Me</button>
Of course if we are in PHP and want the parser to handle PHP variables within the string we should intentionally use double quotes. $a='Awesome'; $b = "Not $a";
Sources
Single quotes vs Double quotes in PHP. (n.d.). Retrieved November 26, 2014, from http://www.scriptingok.com/tutorial/Single-quotes-vs-double-quotes-in-PHP
If it's all the same, perhaps using single-quotes is better since it doesn't require holding down the shift key. Fewer keystrokes == less chance of repetitive strain injury.
Actually, the best way is the way Google recommends. Double quotes:
https://google.github.io/styleguide/htmlcssguide.xml?showone=HTML_Quotation_Marks#HTML_Quotation_Marks
See https://google.github.io/styleguide/htmlcssguide.xml?showone=HTML_Validity#HTML_Validity
Quoted Advice from Google: "Using valid HTML is a measurable baseline quality attribute that contributes to learning about technical requirements and constraints, and that ensures proper HTML usage."
In HTML I don't believe it matters whether you use " or ', but it should be used consistently throughout the document.
My own usage prefers that attributes/html use ", whereas all javascript uses ' instead.
This makes it slightly easier, for me, to read and check. If your use makes more sense for you than mine would, there's no need for change. But, to me, your code would feel messy. It's personal is all.
Using double quotes for HTML
i.e.
<div class="colorFont"></div>
Using single quotes for JavaScript
i.e.
$('#container').addClass('colorFont');
$('<div class="colorFont2"></div>');
I know LOTS of people wouldn't agree, but this is what I do and I really enjoy such a coding style: I actually don't use any quote in HTML unless it is absolutely necessary.
Example:
<form method=post action=#>
<fieldset>
<legend>Register here: </legend>
<label for=account>Account: </label>
<input id=account type=text name=account required><br>
<label for=password>Password: </label>
<input id=password type=password name=password required><br>
...
Double quotes are used only when there are spaces in the attribute values or whatever:
<form class="val1 val2 val3" method=post action=#>
...
</form>
I had an issue with Bootstrap where I had to use double quotes as single quotes didn't work.
class='row-fluid' made the last <span> fall below the other <span>s, rather than sitting nicely beside them on the far right. class="row-fluid" worked.
It makes no difference to the html but if you are generating html dynamically with another programming language then one way may be easier than another.
For example in Java the double quote is used to indicate the start and end of a String, so if you want to include a doublequote within the String you have to escape it with a backslash.
String s = "<a href=\"link\">a Link</a>";
You don't have such a problem with the single quote, therefore use of the single quote makes for more readable code in Java.
String s = "<a href='link'>a Link</a>";
Especially if you have to write html elements with many attributes.(Note I usually use a library such as jhtml to write html in Java, but not always practical to do so)
If you are writing ASP.NET then occasionally you have to use double quotes in Eval statements and single quotes for delimiting the values. This is mainly so that the C# inline code knows it's using a string in the Eval container rather than a character. Personally I'd only use one or the other as a standard and not mix them; it looks messy, that's all.
Using " instead of ' when:
<input value="user"/> //Standard html
<input value="user's choice"/> //Need to use single quote
<input onclick="alert('hi')"/> //When giving string as parameter for javascript function
Using ' instead of " when:
<input value='"User"'/> //Need to use double quote
var html = "<input name='username'/>" //When assigning html content to a javascript variable
I'm a newbie here, but I use single quote marks only when I need double quote marks inside them. In case that's not clear, here is an example:
<p align="center" title='One quote mark at the beginning so now I can
"cite".'> ... </p>
I hope I helped.
Lots of great insightful replies here! More than enough for anyone to make a clear and personal decision.
I would simply like to point out one thing that's always mattered to me.
And take this with a grain of salt!
Double quotes apply to strings that have more than a single phase such as "one two" rather than single quotes for 'one' or 'two'. This can be traced as far back as C and C++.
(reference here or do your own online search).
And that's truly the difference.
With this principle (this difference), parsing became possible for things such as "{{'a','b'},{'x','y'}}" or "/[^\r\n]*[\r\n]" (which needed to be space independent because it's expressional) or, more famously for HTML specifically, title = "Hello HTML!" or style = "font-family:arial; color:#FF0000;"
The funny thing here is that HTML (coming from SGML) commonly adopted double quotes due to expressional features even if it is a single character (e.g. number) or single phase string.
As NibblyPig pointed out quite well and straightforward:
" as a top-tier and ' as a second tier since "'a string here'" is valid and expected by W3 standards (which is for the web) and will most likely never change.
And for consistency, double quotes is wisely used, but only fully correct by preference.
In PHP using double quotes causes a slight decrease in performance because variable names are evaluated, so in practice, I always use single quotes when writing code:
echo "This will print you the value of $this_variable!";
echo 'This will literally say $this_variable with no evaluation.';
So you can write this instead;
echo 'This will show ' . $this_variable . '!';
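JavaScript does not work this way, though: single- and double-quoted strings behave identically there (only backtick template literals interpolate), so this very tiny improvement in performance, if that matters to you, applies to PHP only.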
Additionally, if you look all the way down to HTML spec 2.0, all the tags listed here;
W3 HTML DTD Reference
(Use doublequotes.) Consistency is important no matter which you tend to use more often.
Double quotes are used for strings (i.e., "this is a string") and single quotes are used for a character (i.e., 'a', 'b' or 'c'). Depending on the programming language and context, you can get away with using double quotes for a character but not single quotes for a string.
HTML doesn't care about which one you use. However, if you're writing HTML inside a PHP script, you should stick with double quotes as you will need to escape them (i.e., \"whatever\") to avoid confusing yourself and PHP.

Regex for Encoded HTML

I'd like to create a regex that will match an opening <a> tag containing an href attribute only:
<a href="doesntmatter.com">
It should match the above, but not match when other attributes are added:
<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">
Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:
&lt;a href=&quot;doesntmatter.com&quot; &gt;
But not match this:
&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert(&#39;do something evil with javascript.&#39;)&quot; &gt;
Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).
Thanks!
The initial regular expression that comes to mind is /<a href=".*?">/; a lazy expression (.*?) can be used to match the string between the quotes. However, as pointed out in the comments, because the regular expression is anchored by a >, it'll match the invalid tag as well, because a match is still made.
In order to get around this problem, you can use atomic grouping. Atomic grouping tells the regular expression engine, "once you have found a match for this group, accept it" -- this will solve the problem of the regex going back and matching the second string after not finding a > at the end of the href. The regular expression with an atomic group would look like:
/<a (?>href=".*?")>/
Which would look like the following when replacing the characters with their HTML entities:
/&lt;a (?>href=&quot;.*?&quot;)&gt;/
Hey! I had to do a similar thing recently. I recommend decoding the html first then attempt to grab the info you want. Here's my solution in C#:
// Requires: using System.Text.RegularExpressions;
private string getAnchor(string data)
{
    MatchCollection matches;
    // Verbatim string (@"...") so that "" stands for a literal double quote.
    string pattern = @"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
    Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
    string anchor = "";
    matches = myRegex.Matches(data);
    foreach (Match match in matches)
    {
        anchor += match.Groups["href"].Value.Trim() + "," + match.Groups["text"].Value.Trim();
    }
    return anchor;
}
I hope that helps!
I don't see how matching one is different from the other? You're just looking for exactly what you just wrote, making the portion that is doesntmatter.com the part you capture. I guess matching for anything until &quot; (not "?) can present a problem, but you do it like this in regex:
(?:(?!&quot;).)*
It essentially means:
Match the following group 0 or more times
Fail match if the following string is "&quot;"
Match any character (except new line unless DOTALL is specified)
The complete regular expression would be:
/&lt;a href=&quot;(?>(?:[^&]+|(?!&quot;).)*)&quot;&gt;/s
This is more efficient than using a non-greedy expression.
Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack.)
I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.
Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):
/&lt;a href=&quot;(?:[^&]+|(?!&quot;).)*+&quot;&gt;/s
As you can see it's slightly shorter.
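For engines without atomic groups or possessive quantifiers (JavaScript's built-in RegExp is one, as far as I know), the plain negative look-ahead form still gives the right answer, just with more backtracking on a failed match. A minimal sketch follows. The ^ and $ anchors, the optional space before &gt;, and the use of &#39; for the encoded single quote are assumptions made to fit the sample strings in the question, and isHrefOnlyAnchor is a made-up name.

```javascript
// The href value may contain anything except the encoded quote &quot;,
// so it cannot run past the end of the href attribute.
const hrefOnly = /^&lt;a href=&quot;(?:(?!&quot;).)*&quot; ?&gt;$/;

// Hypothetical helper: true only for an encoded <a> tag whose sole
// attribute is href.
function isHrefOnlyAnchor(encoded) {
  return hrefOnly.test(encoded);
}

const tagA = '&lt;a href=&quot;doesntmatter.com&quot; &gt;';
const tagB = '&lt;a href=&quot;doesntmatter.com&quot; ' +
  'onmouseover=&quot;alert(&#39;do something evil&#39;)&quot; &gt;';

console.log(isHrefOnlyAnchor(tagA)); // true
console.log(isHrefOnlyAnchor(tagB)); // false
```

The second tag fails because, after the closing &quot; of the href value, the pattern requires &gt; (optionally preceded by one space), and the look-ahead prevents the value itself from swallowing a later &quot;.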