Apart from readability, what is the difference between the HTML codes
and 
?
ASCII Code
HTML Entity 

Hexadecimal value
These are all referring to the same thing but are represented in different ways. They all translate to Unicode U+0000A LINE FEED (LF)
Example:
The number 2 can be represented using 1+1. It can also be represented using |sqrt(4)|. The result is the same, but using different syntaxes we can achieve the same result in different ways.
References:
https://theasciicode.com.ar/ascii-control-characters/line-feed-ascii-code-10.html
https://www.quackit.com/character_sets/unicode/co_controls_and_basic_latin_unicode_character_codes.cfm
https://www.w3schools.com/html/html_symbols.asp
https://www.w3schools.com/charsets/ref_html_ascii.asp
Related
I´m searching for a list of exponents like ¹²³ and so on and the same with letters. Note these still remain superscripted even in plain text.
Does something like these exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need of format tags such as <sup>/<sub>.
However (as of v14), not all letters have Unicode superscripts. Furthermore, they are scattered along different Unicode ranges, and are in fact used mainly for phonetic transcription. Additionally, they are used for compatibility purposes especially if the text does not support markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" subscripts are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. Add 0x2070 to all the non-Latin-1 Supplement superscripts to obtain the code point value of these digits. See this Wikipedia article and the official Unicode codepage segment.
Interesting notes
There are also subtle differences between <sup> subscripts and Unicode subscripts; Unicode subscripts are entirely different codepoints altogether, and some fonts professionally design subscripted letters because <sup> subscripts may look thin.
Compare x² with x2, similarly x⁺ with x+ (the first involves Unicode, the second is markup)
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format then as super-scripts if you are generating HTML.
As to find which exist, you just have to use an unicode-character searching resource and look for "superscript" to have a listing -
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
This question already has answers here:
Regex to match only letters
(20 answers)
Closed 8 years ago.
I'm trying to create a regex for a HTML5 input so a user can only insert alpha characters that may be in a name. So characters from a-z, but also including ö,ü,â,æ ... and so on whilst also allowing whitespace and hyphens .
I have played around with some pattens but nothing seems to work correctly, this is what I have so far: <input type="text" name="firstname" pattern="[a-zA-Z\x7f-\xff] " title="">
Does anyone have a quick answer for this?
Since the HTML5 pattern attribute uses the same regex syntax as JavaScript, there is no simple way to refer to all alphabetic characters. You would need to write a rather huge expression (and to update it as new alphabetic characters are added to Unicode). You would need to start from the Unicode character database and the definition of General Category of characters there, or rely on someone having done that for you.
However, for your practical purposes, testing for “alpha characters that may be in a name” is even more complex. There are non-alphabetic characters used in names, such as left single quotation mark (‘) in addition to normal quotation mark (’), and who knows what characters there might be? If this is about people’s real names, it is very difficult to impose restrictions that do not discriminate. If this is about user names in a system, for example, you can define the repertoire as you like, but [a-zA-Z\x7f-\xff] does not look adequate (it includes some control characters and some non-alphabetic characters and excludes many Latin letters commonly used in Europe).
There is a very simple method to apply all you RegEx logic(that one can apply easily in English) for any Language using Unicode.
For matching a range of Unicode Characters like all Alphabets [A-Za-z] we can use
[\u0041-\u005A] where \u0041 is Hex-Code for A and \u005A is Hex Code for Z
'matchCAPS leTTer'.match(/[\u0041-\u005A]+/g)
//output ["CAPS", "TT"]
In the same way we can use other Unicode characters or their equivalent Hex-Code according to their Hexadecimal Order (eg: \u0100–\u017FF) provided by unicode.org
Try: [À-ž] as an example of Range. Modify your Range according to your requirement.
It will match all characters between À and ž.
Sample regEx would be
/[A-Za-zÀ-ž\-\s]+/
For more Ref: Latin Unicode Character
While editing some old ColdFusion code I found a <td> which has a bgcolor property. The value of it was ##89969E. I copied the code to a HTML file and found out the color was different in ColdFusion.
Now, i noticed the double #, so i removed one and the color was the same. Why does the color change when adding/removing a #?
jsFiddle
As a basic premise, additional hashes are interpreted as a missing/erroneous value and so replaced with a zero, so ##89969E becomes #0089969E. Note that HEX codes can be as long as 8 digits following a hash (if aRGB), where the last two refer to transparency
Missing digits are treated as 0[...]. An incorrect digit is simply
interpreted as 0. For example the values #F0F0F0, F0F0F0, F0F0F, #FxFxFx and FxFxFx are all the same.
When color strings longer than 8 characters or shorter than 4
characters are used, things start to get strange.
However there are a lot of nuances - you can find out more about this here, and for some fairly entertaining results this creates, have a little read here
I have written one XSLT to transform xml to html. If input xml node contains only space then it inserts the space using following code.
<xsl:text> </xsl:text>
There is another numeric character which also does same thing as shown below.
<xsl:text> </xsl:text>
Is there any difference between these characters? Are there any examples where one of these will work and other will not?
Which one is recommended to add space?
Thanks,
Sambhaji
is a non-breaking space ( ).
is just the same, but in hexadecimal (in HTML entities, the x character shows that a hexadecimal number is coming). There is basically no difference, A0 and 160 are the same numbers in a different base.
You should decide whether you really need a non-breaking space, or a simple space would suffice.
It's the same. It's a numeric character reference.
A0 is the same number as 160. The first is in base 16 (hexadecimal) and the second is in base 10 (decimal, everyday base).
Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
Home
Home
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[#class and #title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookhead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensure that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
An first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solveable with assertions. But if you just want to extract the attributes this might already be sufficent.
If you want to match a permutation of a set of elements, you could use a combination of back references and zero-width
negative forward matching.
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The back references (\1, \2), let you refer to your previous matches, and the zero
width forward matching ((?!...) ) lets you negate a positional match, saying don't match if the
contained matches at this position. Combining the two makes sure that your match is a legit permutation
of the given elements, with each possibility only occuring once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
puts input.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/