"Text run is not in Unicode Normalization Form C" using 〉 [duplicate] - html

While I was trying to validate my site, I got the following error:
Text run is not in Unicode Normalization Form C
A: What does it mean?
B: Can I fix it with Notepad++, and how?
C: If B is no, how can I fix this with free tools (not Dreamweaver)?

What does it mean?
From W3C:
In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).
világ = világ
The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.
There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The C stands for (pre-)composed, and the D for decomposed. The K stands for compatibility. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.
Besides "to improve interoperability", precomposed text usually looks better than decomposes text.
How can I fix this with free tools
By using the function equivalent to Python's text = unicodedata.normalize('NFC', text) in your favorite programming language.
(Or, if you weren't planning to write a program, your question should be moved to superuser or webmasters.)
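For example, a minimal sketch of that approach in Python 3 (the file name index.html is just a placeholder; assume the file is UTF-8):

import unicodedata

# Read the page, normalize the whole text to NFC, and write it back.
with open("index.html", encoding="utf-8") as f:
    text = f.read()

normalized = unicodedata.normalize("NFC", text)

# Sanity check: the decomposed and precomposed spellings of világ become identical.
assert unicodedata.normalize("NFC", "vila\u0301g") == "vil\u00e1g"

with open("index.html", "w", encoding="utf-8") as f:
    f.write(normalized)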

A. It means what it says (see dan04’s explanation for a brief answer and the Unicode Standard for a long one), but it simply indicates that the authors of the validator wanted to issue the warning. HTML5 rules do not require Normalization Form C (NFC); it is rather something generally favored by the W3C.
B. There is no need to fix anything, unless you decide that using NFC would actually be better. If you do, then there are various tools for automatic conversion to NFC, such as the free BabelPad editor. If you only need to deal with one character not in NFC, you can use character information repositories such as Fileformat.info character search to find out the canonical decomposition of the character and use it.
Whether you use NFC or not depends on many considerations and on the characters involved. As a rule, NFC works better, but in some cases, an alternative, non-NFC presentation produces more suitable rendering or works better in some specific processing.
For example, in a duplicate question, the reference &#x2126; has been reported as triggering the message. (The validator actually checks for characters entered as such references, too, instead of just performing a plain-text-level NFC check.) The reference stands for U+2126 OHM SIGN “Ω”, which is defined to be canonically equivalent to U+03A9 GREEK CAPITAL LETTER OMEGA “Ω”. The Unicode Standard explicitly says that the latter is the preferred character. It is also better covered in fonts. But if you have a special reason to use OHM SIGN, you can do that, without violating current HTML5 rules, and you can ignore the validator warning.
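You can check that equivalence yourself; a quick Python sketch (illustrative only):

import unicodedata

# NFC replaces U+2126 OHM SIGN with its canonical equivalent,
# U+03A9 GREEK CAPITAL LETTER OMEGA, the character Unicode prefers.
print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")         # True
print(unicodedata.name(unicodedata.normalize("NFC", "\u2126")))   # GREEK CAPITAL LETTER OMEGA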

Related

How to check whether a numeric encoded entity is a valid ISO8859-1 encoding?

Let's say I was given a random character reference like 〹. I need a way to check whether this is a valid encoding or not.
I think I can use the Charset library, but I can't fully wrap my mind around how to come up with a solution.
[This answer has been rewritten after further research.]
There's no simple answer to this using Charsets; see below for a complicated one.
There are simple answers using the character code, but it turns out to depend on exactly what you mean by ISO8859-1!
According to the Wikipedia page on ISO/IEC 8859-1, the character set ISO8859-1 defines only characters 32–126 and 160–255. So you could simply check for those ranges, e.g.:
fun Char.isISO8859_1() = this.toInt() in 32..126 || this.toInt() in 160..255
However, that same page also mentions the character set ISO-8859-1 (note the extra hyphen), which defines all 8-bit characters (0–255), assigning control characters to the extra ones. You could check for that with e.g.:
fun Char.isISO_8859_1() = this.toInt() in 0..255
ISO8859-1 includes all the printable characters, so if you only want to know whether a character has a defined glyph, you could use the former. However, these days most people tend to mean ISO-8859-1: that's what many web pages use (those which haven't yet moved on to UTF-8), and that's what the first 256 Unicode characters are defined as. So the latter will probably be more generally useful.
Both of the above methods are of course very short, simple, and efficient; but they only work for the one character set; and it's awkward hard-coding details of a character set, when library classes already have that information.
It seems that Charset objects are mainly aimed at encoding and decoding, so they don't provide a simple way to tell which characters are defined as such. But you can find out whether they can encode a given character. Here's the simplest way I found:
import java.nio.CharBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

fun Char.isIn(charset: Charset) =
    try {
        // Try to encode just this character; REPORT makes unmappable characters throw.
        charset.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(toString()))
        true
    } catch (x: CharacterCodingException) {
        false
    }
That's really inefficient, but will work for all Charsets.
If you try this for ISO_8859_1, you'll find that it can encode all 8-bit values, i.e. 0–255. So it's clearly using the full ISO-8859-1 definition.
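If you only need a quick one-off check outside the JVM, a rough Python equivalent of the same idea (an illustrative sketch, not a drop-in replacement for the Kotlin above) is to try encoding the referenced code point:

# 〹 in the question is the numeric reference for U+3039.
ch = chr(0x3039)
try:
    # Python's "iso-8859-1" codec maps all of 0-255, i.e. the hyphenated ISO-8859-1 definition.
    ch.encode("iso-8859-1")
    print("valid in ISO-8859-1")
except UnicodeEncodeError:
    print("not representable in ISO-8859-1")   # this branch runs for U+3039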

Why does gensim ignore underscores during preprocessing?

Going through the gensim source, I noticed that the simple_preprocess utility function clears all punctuation except for words starting with an underscore, _. Is there a reason for this?
def simple_preprocess(doc, deacc=False, min_len=2, max_len=15):
    tokens = [
        token for token in tokenize(doc, lower=True, deacc=deacc, errors='ignore')
        if min_len <= len(token) <= max_len and not token.startswith('_')
    ]
    return tokens
The underscore ('_') isn't typically meaningful punctuation, but is often considered a "word" character in programming and text-processing.
For example, common regular-expression syntax uses \w to indicate a "word character". Per https://www.regular-expressions.info/shorthand.html :
\w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits. In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. Again, Java, JavaScript, and PCRE match only ASCII characters with \w.
As such, it's often used in authoring, or in other text-preprocessing steps, to connect other groups of letters/numbers that should be kept together as a unit. Thus it's not often cleared with other true punctuation.
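A quick illustration in Python (any \w-based regex engine behaves the same way for ASCII input):

import re

# The underscore is a word character, so user_id stays together as one token,
# while real punctuation such as the comma and the dot splits tokens.
print(re.findall(r'\w+', "user_id, doc.title"))   # ['user_id', 'doc', 'title']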
The code you've referenced also does something else, different from your question about clearing punctuation: it drops word-tokens beginning with _.
I'm not sure why it does that; at some point that code may have been designed with some specific text format in mind where leading-underscore tokens were semantically unimportant formatting directives.
The simple_preprocess() function in gensim is just a quick-and-dirty baseline helpful for internal tests and compact beginner tutorials. It shouldn't be considered a "best practice".
Real projects should give more consideration to the kind of word-tokenization that makes sense for their data and purposes – and either look to libraries with more options, or custom approaches (which still need not be more than a few lines of Python), to implement tokenization that best suits their needs.
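As a concrete illustration of "a few lines of Python", here is a hedged sketch of a custom regex tokenizer (the function name and pattern are examples, not gensim API) that keeps leading-underscore tokens instead of dropping them:

import re

def my_tokenize(doc, min_len=2, max_len=15):
    # Treat runs of word characters (letters, digits, underscore) as tokens,
    # lowercase them, and keep only those within the length bounds.
    return [tok.lower() for tok in re.findall(r'\w+', doc)
            if min_len <= len(tok) <= max_len]

print(my_tokenize("Keep _private_tokens too, unlike simple_preprocess."))
# ['keep', '_private_tokens', 'too', 'unlike']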

Use of Parentheses in HTML with "href=tel:"

Surprised I can't find a definitive answer about this anywhere online: I am setting up a translated HTML page in French with a different contact number that begins with "+33 (0)". Since I can't personally test it on this number, a canonical question: can I get away with an anchor tag that begins <a href="tel:+33(0)..., i.e. has a number contained in parentheses with the remaining numbers following, and have the link work?
Good question. Finding a clear, authoritative answer regarding href="tel:" specifically is difficult.
RFC 3986 (Section 2.2) defines parentheses as reserved "sub-delims". This means that they may have special meaning when used in certain parts of the URL. The RFC says:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.
(Emphasis mine)
Basically, you can use any character in the US-ASCII character set in a URL. But, in some situations, parentheses are reserved for specific uses, and in those cases, they should be percent-encoded. Otherwise they can be left as is.
So, yes, you can use parentheses in href="tel:" links and they should work across all browsers. But as with any web standard in the real world, performance relies on each browser correctly implementing that standard.
However, regarding your example (<a href="tel:+33(0)...), I would steer clear of the format you have given, that is:
[country code]([substituted leading 0 for domestic callers])[area code][phone number]
While I was unable to find a definitive guide to how browsers handle such cases, I think you will find, as @DigitalJedi has pointed out, that some (perhaps all?) browsers will strip the parentheses and leave the number contained therein, ultimately resulting in an incorrect number, e.g.
+33 (0) 123 456 7890
...which may result in a call to +3301234567890.
Will this still work? Maybe? We're getting into phone number routing territory now.
Some browsers/devices may be smart enough to figure out what is intended and adapt accordingly, but I would play it safe and instead simply use:
[country code][area code][phone number], e.g.
+33 123 456 7890
or
(0) 123 456 7890
There is no downside (that I know of) to having your local users dial the international country code - it will result in the same call as if they had omitted it and substituted the leading zero.
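If you generate these links programmatically, a small hedged Python sketch along those lines (tel_href is a hypothetical helper, not part of any standard or library) would drop the parenthesized "(0)" and all formatting before building the href value:

import re

def tel_href(display_number):
    # Remove parenthesized groups such as the French "(0)" trunk prefix,
    # then keep only "+" and digits for the machine-dialable href value.
    cleaned = re.sub(r'\([^)]*\)', '', display_number)
    cleaned = re.sub(r'[^+\d]', '', cleaned)
    return 'tel:' + cleaned

print(tel_href('+33 (0) 1 23 45 67 89'))   # tel:+33123456789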
As a side note, according to the ITU's (International Telecommunication Union) E.123 recommendation, section 7.2,
The ( ) should not be used in an international number.
This recommendation concerns how phone numbers are written, but it is of some relevance in terms of the text that should be used when creating an href="tel:" link, and is the reason for the two alternative examples I have provided above.
(Credit to @NiKiZe for this semi-related info.)
Finally, here is some semi-related, useful information regarding browser treatment of telephone links: https://css-tricks.com/the-current-state-of-telephone-links/
The () in the href attribute should not be a problem, href-wise:
http://example.com/test(1).html
HOWEVER, if I use href="tel:+000 (1) 000 000", then after clicking the link the dialer on my phone shows +0001000000. This is tested and confirmed on an Android device.
This means that the parentheses are removed, as well as the spaces.
But this could still vary with the OS on the phone.
P.S.:
If you think the + in your number is an issue... I did test this too, and the + does not have any unexpected behavior.
MORE: href="tel:" and mobile numbers
According to ITU-T E.123, Section 7.2, "Use of parentheses":
The ( ) should not be used in an international number.
So (0) should not be included at all.
I remember reading that the correct format should be:
+33.1 23 45 67 89
but I can't find the source anymore.

In SQL tables, should I, for example, have "é" or should I have "e´"?

I have tried in vain to look up relevant questions; they are beyond my pay grade. I am not a professional. To explain this a bit more: in the HTML that I wrote, the em dash would be "& #151;" (that space inserted so it would not show up as an actual em dash). It ended up in the tables (someone else was doing that work) as "—". Those are not showing up correctly when searches are done using PHP; I only get the question-mark replacement character. I do have my SQL account set to Unicode.
Take a philosophical stand: The datastore (database table) should contain data, not some special encoding of the data.
The "data" is é
When you display that in HTML, you might need to convert it to the entity &eacute;. However, all modern browsers have no problem if the é is encoded as UTF-8.
If you choose to use "html entities", then have your application do the conversion after fetching é from the table. PHP has the function htmlentities() specifically for that task.
But, I still have not addressed what byte(s) are in the table to represent é. These days, you 'should' use UTF-8 (aka MySQL's utf8mb4). That would be the two hex bytes C3A9, which can be discovered using SELECT HEX(col) .... If you use the old default, latin1, the hex would show E9.
A related question is whether you should store html 'tags' or construct the html on the fly after fetching the data. So, let me give you three philosophies; you pick which to apply:
The table contains pure data; formatting, etc, is done after fetching and before delivering to the user's browser.
The table contains an 'opaque' image of what needs to be sent to the browser -- complete with tags, entities, etc. With this approach, you may as well call it a BLOB, not TEXT.
Some compromise between those. Note: The use of CSS can avoid too much hard-coding of formatting before storing into the database.
Also, the first choice is much cleaner for searching. This may lead you to pick it. However, another approach is to have two columns -- one aimed at delivering mostly-formatted output; the other for searching (tags removed, no entities, etc); it would be mostly text, but you probably could not generate a web page (with links, paragraphs, etc) from it.
é -- different strokes for different folks:
é in latin1 (not advised) -- hex E9, 1 byte
é in utf8 (utf8mb4) -- hex C3A9, 2 bytes
\u00E9 -- Unicode codepoint (escape notation) -- 6 bytes
&eacute; -- HTML entity (see PHP's htmlentities()) -- 8 bytes
%C3%A9 -- PHP's urlencode() (for URLs) -- 6 bytes
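For reference, a small Python sketch (illustrative only; PHP's htmlentities() and urlencode() behave analogously) showing those representations of é:

import unicodedata, urllib.parse

print('é'.encode('utf-8').hex())                  # c3a9 - the 2 bytes utf8mb4 stores
print('é'.encode('latin-1').hex())                # e9   - the 1 byte latin1 stores
print(urllib.parse.quote('é'))                    # %C3%A9 - URL-encoded UTF-8
print('é'.encode('ascii', 'xmlcharrefreplace'))   # b'&#233;' - numeric HTML reference
# The decomposed form from the question title: e followed by U+0301 COMBINING ACUTE ACCENT
print(unicodedata.normalize('NFD', 'é'))          # 'e\u0301'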
Responding to Comments
If entries_lists, entries_languages, and authors_entries are many:many mapping tables, please consider the several optimizations mentioned here.
Do not use utf8_encode. Instead, figure out what caused the values not to be encoded correctly, and/or not displayed correctly. Start with:
echo bin2hex($record['author']);
SELECT name, HEX(name) FROM authors WHERE ...
for some author with an accented letter.

Acronyms in CamelCase [closed]

I have a question about CamelCase. Suppose you have this acronym: Unesco = United Nations Educational, Scientific and Cultural Organization.
You should write: unitedNationsEducationalScientificAndCulturalOrganization
But what if you need to write the acronym? Something like:
getUnescoProperties();
Is it right to write it this way? getUnescoProperties() OR getUNESCOProperties();
There are legitimate criticisms of the Microsoft advice from the accepted answer.
Inconsistent treatment of acronyms/initialisms depending on number of characters:
playerID vs playerId vs playerIdentifier.
The question of whether two-letter acronyms should still be capitalized if they appear at the start of the identifier:
USTaxes vs usTaxes
Difficulty in distinguishing multiple acronyms:
i.e. USID vs usId (or parseDBMXML in Wikipedia's example).
So I'll post this answer as an alternative to the accepted answer. All acronyms should be treated consistently; acronyms should be treated like any other word. Quoting Wikipedia:
...some programmers prefer to treat abbreviations as if they were lower case words...
So re: OP's question, I agree with the accepted answer; this is correct: getUnescoProperties()
But I think I'd reach a different conclusion in these examples:
US Taxes → usTaxes
Player ID → playerId
So vote for this answer if you think two-letter acronyms should be treated like other acronyms.
Camel Case is a convention, not a specification. So I guess popular opinion rules.
(EDIT: Removing this suggestion that votes should decide this issue; as @Brian David says, Stack Overflow is not a "popularity contest", and this question was closed as "opinion based".)
Even though many prefer to treat acronyms like any-other word, the more common practice may be to put acronyms in all-caps (even though it leads to "abominations")
See "EDXML" in this XML schema
See "SFAS158" in this XBRL schema
Other Resources:
Note some people distinguish between abbreviation and acronyms
Note Microsoft guidelines distinguish between two-character acronyms, and "acronyms more than two characters long"
Note some people recommend to avoid abbreviations / acronyms altogether
Note some people recommend to avoid camelCase / PascalCase altogether
Note some people distinguish between "consistency" as "rules that seem internally inconsistent" (i.e. treating two-character acronyms different than three-character acronyms); some people define "consistency" as "applying the same rule consistently" (even if the rule is internally inconsistent)
Framework Design Guidelines
Microsoft Guidelines
Some guidelines Microsoft has written about camelCase are:
When using acronyms, use Pascal case or camel case for acronyms more than two characters long. For example, use HtmlButton or htmlButton. However, you should capitalize acronyms that consist of only two characters, such as System.IO instead of System.Io.
Do not use abbreviations in identifiers or parameter names. If you must use abbreviations, use camel case for abbreviations that consist of more than two characters, even if this contradicts the standard abbreviation of the word.
Summing up:
When you use an abbreviation or acronym that is two characters long, put them all in caps;
When the acronym is longer than two chars, use a capital for the first character.
So, in your specific case, getUnescoProperties() is correct.
To convert to CamelCase, there is also Google's (nearly) deterministic Camel case algorithm:
Beginning with the prose form of the name:
1. Convert the phrase to plain ASCII and remove any apostrophes. For example, "Müller's algorithm" might become "Muellers algorithm".
2. Divide this result into words, splitting on spaces and any remaining punctuation (typically hyphens). Recommended: if any word already has a conventional camel case appearance in common usage, split this into its constituent parts (e.g., "AdWords" becomes "ad words"). Note that a word such as "iOS" is not really in camel case per se; it defies any convention, so this recommendation does not apply.
3. Now lowercase everything (including acronyms), then uppercase only the first character of: each word, to yield upper camel case, or each word except the first, to yield lower camel case.
4. Finally, join all the words into a single identifier.
Note that the casing of the original words is almost entirely disregarded.
Following this algorithm, "XML HTTP request" is correctly transformed to XmlHttpRequest; XMLHTTPRequest would be incorrect.
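A minimal Python sketch of that algorithm (it skips the plain-ASCII conversion and the "split conventional camel case words" steps, so treat it as an approximation rather than Google's reference implementation):

import re

def camel_case(phrase, upper_first=True):
    # Split the prose form on spaces and remaining punctuation, lowercase
    # everything (including acronyms), then capitalize each word (optionally
    # skipping the first) and join into a single identifier.
    words = [w.lower() for w in re.split(r'[^A-Za-z0-9]+', phrase) if w]
    return ''.join(w.capitalize() if (i > 0 or upper_first) else w
                   for i, w in enumerate(words))

print(camel_case('XML HTTP request'))          # XmlHttpRequest
print(camel_case('XML HTTP request', False))   # xmlHttpRequest
print(camel_case('supports IPv6 on iOS'))      # SupportsIpv6OnIos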
getUnescoProperties() should be the best solution...
When possible, just follow pure camelCase; when you have acronyms, leave them in upper case where that stays readable, otherwise go camelCase.
Generally, in OO programming, variables should start with a lowercase letter (lowerCamelCase) and classes should start with an uppercase letter (UpperCamelCase).
When in doubt, just go pure camelCase ;)
parseXML is fine, and parseXml is also camelCase.
XMLHTTPRequest should be XmlHttpRequest or xmlHttpRequest; there is no good way to handle consecutive upper-case acronyms, and it is definitely not clear for all cases.
e.g.
How do you read the word HTTPSSLRequest: as HTTP + SSL, or HTTPS + SL (which doesn't mean anything, but still)? In that case, follow the camel case convention and go for httpSslRequest or httpsSlRequest; maybe it is no longer as nice, but it is definitely clearer.
There is the Airbnb JavaScript Style Guide on GitHub, with a lot of stars (~57.5k at the moment), which includes guidance about acronyms:
Acronyms and initialisms should always be all capitalized, or all
lowercased.
Why? Names are for readability, not to appease a computer algorithm.
// bad
import SmsContainer from './containers/SmsContainer';
// bad
const HttpRequests = [
// ...
];
// good
import SMSContainer from './containers/SMSContainer';
// good
const HTTPRequests = [
// ...
];
// also good
const httpRequests = [
// ...
];
// best
import TextMessageContainer from './containers/TextMessageContainer';
// best
const requests = [
// ...
];
In addition to what @valex has said, I want to recap a couple of things from the given answers for this question.
I think the general answer is: it depends on the programming language that you are using.
C Sharp
Microsoft has written some guidelines where it seems that HtmlButton is the right way to name a class for these cases.
JavaScript
JavaScript has some built-in names with acronyms, and it uses them all in upper case (but, funnily, not always consistently). Here are some examples:
encodeURIComponent
XMLHttpRequest
toJSON
toISOString
Currently I am using the following rules:
Capital case for acronyms: XMLHTTPRequest, xmlHTTPRequest, requestIPAddress.
Camel case for abbreviations: ID[entifier], Exe[cutable], App[lication].
ID is an exception, sorry but true.
When I see a capital letter I assume an acronym, i.e. a separate word for each letter. Abbreviations do not have separate words for each letter, so I use camel case.
XMLHTTPRequest is ambiguous, but it is a rare case and not that ambiguous in practice, so it's OK; rules and logic are more important than beauty.
The JavaScript Airbnb style guide talks a bit about this. Basically:
// bad
const HttpRequests = [ req ];
// good
const httpRequests = [ req ];
// also good
const HTTPRequests = [ req ];
Because I typically read a leading capital letter as a class, I tend to avoid that. At the end of the day, it's all preference.
Disclaimer: English is not my mother tongue, but I've thought about this problem for a long time, especially when using Node (camelCase style) to work with a database, where table field names should be snake_cased. This is my thinking:
There are two kinds of "acronyms" for a programmer:
in natural language, e.g. UNESCO;
in a programming language, e.g. tmc for textMessageContainer, which usually appears as a local variable.
In the programming world, all acronyms from natural language should be treated as words, for these reasons:
When we program, we should name a variable either in acronym style or in non-acronym style. So if we name a function getUNESCOProperties, it implies UNESCO is an acronym (otherwise it shouldn't be all uppercase letters), but evidently get and properties are not acronyms. So we would have to name this function either gunescop or getUnitedNationsEducationalScientificAndCulturalOrganizationProperties, and both are unacceptable.
Natural language is evolving continuously, and today's acronyms will become words tomorrow, but programs should be independent of this trend and stand forever.
By the way, in the most-voted answer, IO is an acronym in the programming-language sense (it stands for InputOutput), but I don't like the name, since I think such acronyms should only be used to name local variables, not top-level classes/functions, so InputOutput should be used instead of IO.
There is also another camelcase convention that tries to favor readability for acronyms by using either uppercase (HTML), or lowercase (html), but avoiding both (Html).
So in your case you could write getUNESCOProperties. You could also write unescoProperties for a variable, or UNESCOProperties for a class (the convention for classes is to start with uppercase).
This rule gets tricky if you want to put together two acronyms, for example for a class named XML HTTP Request. It would start with uppercase, but since XMLHTTPRequest would not be easy to read (is it XMLH TTP Request?), and XMLhttpRequest would break the camelcase convention (is it XM Lhttp Request?), the best option would be to mix case: XMLHttpRequest, which is actually what the W3C used. However, using this sort of naming is discouraged. For this example, HTTPRequest would be a better name.
Since the official English word for identification/identity seems to be ID, although it is not an acronym, you could apply the same rules there.
This convention seems to be pretty popular out there, but it's just a convention and there is no right or wrong. Just try to stick to a convention and make sure your names are readable.
UNESCO is a special case, as it is usually (in English) read as a word and not as an acronym - like UEFA, RADA, and BAFTA, and unlike BBC, HTML, and SSL.