Alternative to entering entity references in source code - html

Google's HTML/CSS Style Guide advises against using entity references:
Do not use entity references.
There is no need to use entity references like &mdash;, &rdquo;, or &#x263a;, assuming the same encoding (UTF-8) is used for files and editors as well as among teams.
<!-- Not recommended -->
The currency symbol for the Euro is “&eur;”.
<!-- Recommended -->
The currency symbol for the Euro is “€”.
I'm not sure I understand what it is that they are proposing. The only thing I can think of is that they are saying that you should be using your text editor's insert character command (e.g., in Atom, Ctrl-Shift-U, or in Emacs, C-x 8) to enter Unicode characters rather than typing in the literal entity references. Is that it?

The only thing I can think of is that they are saying that you should be using your text editor's insert character command […] rather than typing in the literal entity references. Is that it?
Yes, that's precisely what they're saying.
You don't write &#65; to insert the letter A, after all! There's no more reason to write &auml; for ä, or &hearts; for ♥, when those characters can be represented directly in the HTML file.
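A quick way to convince yourself that the two spellings are equivalent is to decode the references and compare. This is a minimal sketch using Python's standard `html` module; the sample pairs are illustrative and not taken from the style guide.

```python
import html

# Each pair holds an entity-reference spelling and the literal character
# that a UTF-8 source file could contain instead.
pairs = [
    ("&#65;", "A"),
    ("&auml;", "ä"),
    ("&hearts;", "♥"),
    ("&euro;", "€"),
]

for reference, literal in pairs:
    # html.unescape decodes both named and numeric character references.
    assert html.unescape(reference) == literal

# Characters with syntactic meaning in HTML are the exception:
# they still have to be escaped.
escaped = html.escape("<a & b>")
print(escaped)
```

The last two lines are a reminder that the advice covers ordinary text characters only; `<` and `&` still need their references.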

Related

Delimiters in BIML

I am migrating to Biml for SSIS integration of flat, Excel, and CSV files, and I want to know all the possible delimiters and text qualifiers I can use, since the documentation doesn't say much.
Basically, there are 3 options to choose from:
The “official” ENUM
Allowed values here are:
– CRLF
– CR
– LF
– Semicolon
– Comma
– Tab
– VerticalBar
– UnitSeparator
Use the Hex Code
If you know the ASCII code of your qualifier, you can use it, starting with “_x” and ending with “_”.
A ” would be described by “_x0022_”, for example.
Use the actual character (HTML encoded or escaped)
If you want to (for example) define a ” as your qualifier, you can do so. Just make sure, depending on how you use it, to either encode or escape it. When defining it as an actual Biml property, it has to be encoded.
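For the hex-code option, a small helper can produce the code for any character. This is only a sketch: it assumes the code is written as `_x`, four hexadecimal digits, then `_` (the form SSIS package XML uses for delimiters), and the function name `to_hex_code` is hypothetical.

```python
def to_hex_code(text: str) -> str:
    """Encode each character as _xHHHH_ (assumed form, see the note above)."""
    return "".join(f"_x{ord(ch):04X}_" for ch in text)


print(to_hex_code('"'))     # double quote as a text qualifier
print(to_hex_code("\r\n"))  # CRLF as a row delimiter
```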

OCR tesseract: trained data creation issue for special type of fonts (using Jtessboxeditor)

I am unable to create proper trained data for non-native Windows fonts, i.e., for CATIA drafting fonts.
Even if some of the alphanumeric characters are recognized, letters with broken (disconnected) shapes like "i" and "j", and special symbols like Ø (Phi), ° (degree), and ± (plus-minus), are not recognized properly. The box file values for them are wrong.
JTessBoxEditor is the tool we used to train and create trained data for Tesseract.
I would appreciate your assistance with this. Thanks.
I also need these 3 characters - though it might be too late to answer this.
This may not help in every situation, but the Norwegian .traineddata file does include the Ø (Phi) character, and it has helped me with this character.
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small. If the inside of the character is clearly visible in the image, Tesseract might be able to decipher it.
Now the most difficult one, the ± (plus-minus). I haven't cracked this one yet, and this may be a very woolly approach, but I noticed that the plus-minus is always recognized as a plain + (plus).
I can use this to my advantage.
I could use a Tesseract engine that exposes PageSegMode.SingleChar to detect each individual character, and use Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is. The characters can later be reassembled into a string.
Then I could run ImageMagick to compare how similar each plus character found is to a reference image of a plus and of a plus-minus. Whichever is more similar tells you which character it is.
With my approach, I still have to parse the recognized text and transform it into something usable.
The Ø (Phi) character, for example, may be detected as lower-case, but I want it upper-case.
Or the degree sign is detected as an apostrophe, but the expected result is the degree sign.
Another transformation applies to dimensions: a decimal may be incorrectly recognized with a comma, but I want the decimal separator to be a dot (1,99 becomes 1.99).
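The three clean-up rules above could be sketched as a small post-processing function. This is a rough illustration in Python rather than the .NET wrapper used above; the function name `normalize_ocr_text` and the exact rules (an apostrophe right after a digit means degree, a comma between digits means decimal point) are assumptions, not something Tesseract provides.

```python
import re


def normalize_ocr_text(text: str) -> str:
    """Apply the three clean-up rules to raw OCR output."""
    # A lower-case ø is wanted as upper-case Ø.
    text = text.replace("ø", "Ø")
    # An apostrophe directly after a digit is assumed to be a misread degree sign.
    text = re.sub(r"(?<=\d)'", "°", text)
    # A comma between two digits is assumed to be a decimal separator.
    text = re.sub(r"(?<=\d),(?=\d)", ".", text)
    return text


print(normalize_ocr_text("ø12,5 45'"))
```

Note that the comma rule would also rewrite thousands separators such as 1,000, so it only suits drawings where commas between digits are always decimals.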

Do ampersands still need to be encoded in URLs in HTML5?

I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
...
One should write:
...
Apparently, the former example shouldn't work, but browser error recovery means it does.
Is this still the case in HTML5?
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
The last case is the one of interest to you since your example has:
...
You have the character sequence
AMPERSAND
LATIN SMALL LETTER Y
EQUAL SIGN
Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
...
which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
The HTML5 spec goes to great lengths to say, in effect, "this ampersand is not beginning a character entity reference, so there's no reference here." Even so, you might run into URLs whose parameter names match named references (e.g., isin, part, sum, sub), which would result in parse errors, so IMHO you're better off escaping ampersands as &amp;. But of course, you only asked whether restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
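If you generate markup from code, escaping the ampersands costs nothing. Here is a minimal sketch using Python's standard `html` module; the URL is hypothetical, with a parameter named `sum` chosen because it matches one of the named references listed above.

```python
import html

# Hypothetical URL whose second parameter name collides with a named reference.
url = "https://example.com/page?x=1&sum=2"

# html.escape turns & into &amp; (and also escapes <, >, and quotes).
href_value = html.escape(url, quote=True)
anchor = f'<a href="{href_value}">link</a>'
print(anchor)

# Decoding the attribute value gives back the original URL.
assert html.unescape(href_value) == url
```

With the ampersand escaped, the attribute value no longer depends on which names happen to be character references.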
It would be interesting to see what validators can do.

Tesseract SetVariable tessedit_char_whitelist in another language

Tesseract's SetVariable whitelist works fine for English. For example, I use this to recognize only digits and letters in an image (excluding special characters such as &*^%!):
_ocr.SetVariable("tessedit_char_whitelist",
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
But I can't do the same thing for the Thai language:
_ocr.SetVariable("tessedit_char_whitelist","0123456789กขคงจฉ");
Is there a different principle? This does not work: instead of all the specified characters, I receive only digits in the output. Tesseract ignores all the Thai letters that I put into the whitelist.
How can I pass this variable correctly?
You might need to install the language package for Thai first. Please refer to the download list here: https://code.google.com/p/tesseract-ocr/downloads/list
Then you need to replace "eng" with "tha" in your code so that the new language data is used for OCR.
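The same two settings (language code and whitelist) can also be passed to the tesseract command-line tool, which is a handy way to check the Thai data outside the .NET wrapper. The sketch below only builds the argument list; the helper name `tesseract_args` and the file name `scan.png` are hypothetical, and it assumes the Thai language data is already installed.

```python
def tesseract_args(image_path: str, language: str, whitelist: str) -> list[str]:
    """Build an argument list for the tesseract command-line tool."""
    return [
        "tesseract", image_path, "stdout",
        "-l", language,                                # "tha" instead of "eng"
        "-c", f"tessedit_char_whitelist={whitelist}",  # restrict the output characters
    ]


args = tesseract_args("scan.png", "tha", "0123456789กขคงจฉ")
print(args)
# To run it: subprocess.run(args, capture_output=True, text=True)
```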

How to program Emacs to syntax highlight html character references specified numerically

In html major mode, Emacs is programmed to syntax highlight html character entity references (i.e., character references specified by name, e.g., &nbsp;) but not, for some reason, numeric character references (e.g., &#160; or &#xa0;). I guess this is a special case of the more general problem of customizing syntax highlighting in a given mode. I imagine it involves some use of regexes. Can someone give me some guidance on how to get started with this?
The following code snippet should help you:
(add-to-list 'sgml-font-lock-keywords-2
'("\\&#x?[0-9a-fA-F][0-9a-fA-F]*;?" . font-lock-variable-name-face))
It has to be evaluated after sgml-mode, which provides html-mode, has been loaded. You can force loading with the following command:
(require 'sgml-mode)
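To see what the pattern does and does not match, here is the same expression transcribed into Python's `re` syntax. It is an illustrative transcription for experimenting with the regex, not part of the Emacs setup, and the sample string is made up.

```python
import re

# Python transcription of the Emacs pattern above.
numeric_ref = re.compile(r"&#x?[0-9a-fA-F][0-9a-fA-F]*;?")

sample = "a&#160;b &#xa0; c &nbsp; d &#xa0"
matches = numeric_ref.findall(sample)
print(matches)
```

The named reference `&nbsp;` is skipped, as intended. The pattern is permissive, though: because the `x` is optional, it also accepts hex digits without the prefix, such as `&#ff;`.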