What is considered best practice for adding special symbols to HTML? Using the symbol itself, e.g. ©, or its character reference, e.g. &copy;?
Example 1:
<p>Qualcomm©</p>
Example 2:
<p>Qualcomm&reg;</p>
Both have their pros and cons. There isn't a strongly established best practice.
Using a literal character:
Is easier to read
Doesn't require developers to remember the character reference code
Requires fewer bytes/characters to send over the network (or store in a database, which might be more significant).
Using a character reference:
May be easier to type (depending on the developer's keyboard)
Is immune to being screwed up by character encoding errors
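To make the trade-off concrete, here is a small Python sketch using only the standard library. It assumes the sample string from the question, and shows that the literal and its references decode to the same text, what each costs in bytes, and how a literal can be turned into a numeric reference when the output encoding can't be trusted.

```python
import html

literal = "Qualcomm©"

# Named and numeric references decode to the same character as the literal.
assert html.unescape("Qualcomm&copy;") == literal
assert html.unescape("Qualcomm&#169;") == literal

# Size on the wire: the literal in UTF-8 versus the named reference in ASCII.
literal_size = len("©".encode("utf-8"))
reference_size = len("&copy;".encode("ascii"))

# If the output encoding cannot be guaranteed, non-ASCII characters can be
# replaced by numeric references, which survive any ASCII-compatible encoding.
ascii_safe = literal.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(literal_size, reference_size, ascii_safe)
```

A conversion like the last line could run as a publishing step, so authors type literals and the served file stays pure ASCII.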
Related
In the article Better web typography in a few simple steps, it says
Talking about apostrophes, the correct sign for them is the right single quotation mark. A dead give-away for amateur typography is the presence of straight quotation marks, also called 'dumb quotes' by type-savvy designers.
I've been using these "dumb quotes" all along!
Now, when one is writing regular HTML (and not Markdown, which automatically produces apostrophes), how is one supposed to sanely write correct apostrophes? Am I just supposed to inject &rsquo; wherever a ' would go before? Is there a program that automatically does this?
How do professional web designers take care of this problem?
You have a couple of options here:
As was pointed out before, use either numeric or named HTML entities.
Write your HTML with straight apostrophes and then do a search and replace before publishing. This is workable, but could lead to unexpected replacements if you aren’t careful.
Insert the actual right single quotation mark using the appropriate keyboard sequence for your operating system (option-shift-] on a Mac or alt-0146 on a PC), and make sure to save and serve your HTML as UTF-8. That way you don't have to screw around with entity names, but it assumes a UTF-8 clean workflow.
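The search-and-replace option might look roughly like the Python sketch below. The function name curl_apostrophes and the "letter on both sides" rule are my own assumptions, not a standard tool. It leaves tags and attribute values alone, which is the kind of unexpected replacement you have to be careful about.

```python
import re

RIGHT_SINGLE_QUOTE = "\u2019"  # ’

# Splitting on this pattern keeps the tags in the result list (capturing group).
TAG = re.compile(r"(<[^>]*>)")
# A straight apostrophe with a word character on both sides, as in "don't".
INNER_APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")

def curl_apostrophes(html_source: str) -> str:
    """Replace in-word straight apostrophes in text, leaving tags untouched."""
    parts = TAG.split(html_source)
    for i, part in enumerate(parts):
        if not part.startswith("<"):  # text between tags, not a tag itself
            parts[i] = INNER_APOSTROPHE.sub(RIGHT_SINGLE_QUOTE, part)
    return "".join(parts)

print(curl_apostrophes("<p title='it'>I don't know</p>"))
```

This deliberately ignores harder cases such as trailing possessives ("dogs'"), opening versus closing quotes, and the contents of script, style, or pre elements, so treat it as a starting point only.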
I would like to program a parser for a Markup language similar to BBCode, Markdown, Wikisyntax etc. using a high-level language like Python or Perl. It should feature sectioning, code highlighting, automatic link creation, embedding images but allowing HTML for more complex formatting.
Has anyone done something similar, or worked closely with such systems, and could describe in general terms how this could be done efficiently?
Although efficiency is not really of concern for such a small system, it is generally favourable.
In particular I would like to learn if there is a more efficient way than using regular expressions for such a program.
For your general discussion…
You should start with the following blueprint:
You need to iterate character by character over the entire data.
You need to identify every character by its context, since it may be a tag opener ('<', '[' etc.) or just a literal character. This can be done with an escape flag, triggered by an escape character (as backslashes do in some languages). If you use that approach, you also need to check for an escaped escape character.
You may also need a flag telling you that you are inside a comment or special data section, which may have different escaping rules.
You need to build a tree-like structure, or at least a stack, for nested tags. This is why regexes are a bad idea: they not only add far too much overhead, they're also of no use if you want to find the correct closing tag for the second x (x = any tag) in the following snippet: <x><x><x></x><x><x></x></x><x></x><!-- </x> -->this one →</x><x></x></x>
We are currently working on an I18N project. I am wondering what the complications are of having non-ASCII characters in the URL. If it's not advisable, what are the alternatives for dealing with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, so I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as octets according to the UTF-8 character encoding; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. (...) For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
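As an illustration, Python's standard library follows the same rule, so the RFC's examples can be checked directly. The path and host name below are just sample values; the host-name part uses Python's built-in idna codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, unquote

# quote() encodes to UTF-8 first, then percent-encodes the octets that are
# not unreserved characters (by default it also leaves "/" untouched).
assert quote("A") == "A"
assert quote("À") == "%C3%80"      # LATIN CAPITAL LETTER A WITH GRAVE
assert quote("ア") == "%E3%82%A2"  # KATAKANA LETTER A

path = "/wiki/Fürth"
wire_form = quote(path)            # ASCII-only form sent over HTTP
display_form = unquote(wire_form)  # readable form a browser can show

# Host names are handled separately: IDNA maps them to an ASCII "xn--" form.
ascii_host = "nürnberg.de".encode("idna")

print(wire_form, display_form, ascii_host)
```

The round trip is lossless, which is why a browser can show the readable form while sending the ASCII form.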
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
I would say no. The reason is simple: if you rely on a worldwide audience, it would be a big problem for people to type your URL. I live in the "Cyrillic" world. It is possible to create Cyrillic URLs, but no one has succeeded with that, because even we are too lazy to switch keyboard layouts and have gotten used to typing Latin...
Update:
I can't speak to alternatives, but some languages have an informal or formal letter substitution; e.g. in German you can write Ö, but in a URL you might see OE instead. You could also consider English words, or words that sound similar (so people from your country can remember the spelling, and it won't trip up people from other countries).
It depends on the target users. For example, Nürnberg.de is also reachable as nuernberg.de to keep it easily accessible. For native German users the German keyboard layout is the default and has all four extra characters (ö, ä, ü, ß) available, and don't forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive ways of entering such characters; on a Mac, for example, pressing Alt+u (Option+u) adds an umlaut to the next character typed.
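If you want to offer an ASCII alias alongside the native spelling, the German substitutions can be generated mechanically. A minimal Python sketch, assuming the conventional ae/oe/ue/ss mapping (the table and the ascii_alias name are illustrative; other languages need their own tables):

```python
import unicodedata

# Conventional German ASCII substitutions.
GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def ascii_alias(name: str) -> str:
    """Return the ASCII spelling of a German name, e.g. for a second URL."""
    # Normalize first so that "u" + combining diaeresis becomes a single "ü".
    return unicodedata.normalize("NFC", name).translate(GERMAN_ASCII)

print(ascii_alias("nürnberg.de"))
```

Both spellings can then point at the same content, with one redirecting to the other.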
I was just wondering what are the complications of having the non-ascii characters in the URL.
But the way you phrased your question, it seems to be more about URIs than URLs: you are trying to combine a URN containing non-ASCII characters with a URI. There are no complications in that, as long as you know where and how to parse your URN on the server (for example, with a Django-based server it can be matched and handled using regexes in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax/JavaScript-based) evolution, almost everything runs in UTF-8; JavaScript's URI-encoding functions such as encodeURIComponent always use UTF-8. UTF-8 has thus evolved into a sort of standard. Stick with UTF-8, and you will hardly face any complications in parsing and working with URIs.
For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी. Irrespective of the encoding you type in the address bar, the browser will translate it to UTF-8 and send it to the server.
NOTE: besides UTF-8, some symbols are encoded using percent-encoding. More about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ASCII characters in a URL, but it's ugly because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp
Has anyone implemented a good system for ensuring that output is properly HTML-encoded where it makes sense? Maybe even something that recognizes when output should be URL-encoded or JSON-encoded instead?
The lazy approach — just encoding all inputs — causes problems when you want to send those inputs to a database, or to a block of JavaScript code. So something a little smarter is needed.
The tedious approach — putting the proper encoding function around each piece of data on the template — works, but it's easy for developers to forget to do it.
Is there a good approach that makes it easy for developers, and ensures that the right encoding is done? I was listening to one of the SO podcasts, and Joel tossed out an idea about using typed data to enforce a difference between HTML-encoded strings and non-encoded strings. Maybe that could be a starting point.
I'm looking more for a strategy than for an implementation in a particular language (although I'd be happy to hear about implementations that already exist and work).
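To make the typed-data idea concrete, here is how I imagine it could look in Python. This is only a sketch: SafeHtml, escape and render are hypothetical names I made up for illustration. Libraries such as MarkupSafe (used by Jinja2) work along similar lines.

```python
import html

class SafeHtml(str):
    """Marker type: text that is already escaped, or markup that is trusted."""

def escape(value) -> SafeHtml:
    """Escape anything that is not already marked as safe."""
    if isinstance(value, SafeHtml):
        return value  # never escape twice
    return SafeHtml(html.escape(str(value), quote=True))

def render(template: str, **values) -> SafeHtml:
    """Fill a trusted template; every value passes through escape()."""
    escaped = {key: escape(value) for key, value in values.items()}
    return SafeHtml(template.format(**escaped))

user_input = "<script>alert(1)</script>"
page = render("<p>{comment}</p>", comment=user_input)
print(page)
```

The point is that the developer no longer has to remember anything: plain strings are escaped by default, and only values explicitly wrapped in SafeHtml are passed through. Ordinary string operations on a SafeHtml value return a plain str, so the result errs on the side of being escaped again.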
EDIT: Here are some links I've found so far:
A type-based solution to the "strings problem"
String::Smart
Reducing XSS by way of Automatic Context-Aware Escaping in Template Systems
Secure String Interpolation in JS
Data that goes into your database probably should not have any escaping for HTML, JavaScript, or what have you. If you do include markup, you'll just have to strip it out if you decide to inject this data into a CSV file or PDF, etc...
Instead, whenever you query 'raw' data like this out of the database, escape the data at that time as appropriate to wherever you're injecting it; HTML, a JavaScript string, server-side scripting, etc.
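For example, in Python the same raw value would go through a different standard-library function depending on where it ends up (the sample value is made up):

```python
import html
import json
from urllib.parse import quote

raw = 'Tom & "Jerry" <3'  # stored in the database exactly like this

as_html = html.escape(raw)         # for element content or an attribute value
as_url_part = quote(raw, safe="")  # for a single path segment or query value
as_js_string = json.dumps(raw)     # for a quoted JavaScript/JSON string literal

print(as_html)
print(as_url_part)
print(as_js_string)
```

Note that the contexts nest: a JSON literal placed inside an inline script block still needs care, because json.dumps does not escape "<", so a value containing "</script>" would end the block early.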
If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding in order to read the file? Isn't that tag encoded the same way the file is? I am curious how you read that tag without knowing the encoding. I realize this is a solved problem. I am just curious how it's done.
Update 1
I don't get it. In UTF-16, won't each character take 2 bytes, not one, and be different from ASCII? For example, the character E in UTF-16 (U+0045) is 0xfeff0045. That is 0xfeff then 0x0045, but some encodings change the endianness of that. Do you have to figure it out by checking for 0xfeff and realizing that it can't be ASCII, or something?
Here's what W3C has to say about it:
The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
The encoding name is limited to ([A-Za-z0-9._] |'-'), so it's identical for any encoding based on ASCII or ISO-646 (e.g. ISO 8859-*, ISO 10646/Unicode).
Edit: There are still some ambiguities though. For example, you still need to have some idea of whether to read 8-, 16-, or 32-bit chunks at a time. A BOM helps here: the XML spec requires UTF-16 entities to begin with one, and its autodetection appendix also covers the cases where no BOM is present.
If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
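A rough Python sketch of that guessing process, loosely following the autodetection appendix of the XML spec. It is simplified: it only covers the UTF-8/16/32 families plus ASCII-compatible encodings, and the name sniff_xml_encoding is made up for the example.

```python
import codecs
import re

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]
# Without a BOM, an XML file starts with "<", which looks different per width.
WIDE_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]
DECLARED = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for prefix, name in BOMS + WIDE_PREFIXES:
        if data.startswith(prefix):
            return name
    # Otherwise assume an ASCII-compatible encoding: the declaration itself
    # uses only ASCII characters, so it can be read before the real encoding
    # is known.
    head = data[:200].decode("ascii", errors="replace")
    match = DECLARED.match(head)
    return match.group(1).lower() if match else "utf-8"  # XML's default

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
```

This also answers the UTF-16 follow-up: the leading 0xFEFF (or the zero bytes around "<") cannot occur at the start of an ASCII-compatible XML file, so the width and byte order are known before the declaration is read.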
For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)