How to use proper apostrophes in HTML instead of "dumb quotes"? - html

In the article Better web typography in a few simple steps, it says
Talking about apostrophes, the correct sign for them is the right single quotation mark. A dead give-away for amateur typography is the presence of straight quotation marks, also called 'dumb quotes' by type-savvy designers.
I've been using these "dumb quotes" all along!
Now, when one is writing regular HTML (and not Markdown, which automatically produces apostrophes), how is one supposed to sanely write correct apostrophes? Am I just supposed to inject ’ wherever a ' would go before? Is there a program that automatically does this?
How do professional web designers take care of this problem?

You have a couple of options here:
As was pointed out before, use either numeric or named HTML entities.
Write your HTML with straight apostrophes and then do a search and replace before publishing. This is workable, but could lead to unexpected replacements if you aren’t careful.
Insert the actual right single quotation mark using the appropriate keyboard sequence for your operating system (Option-Shift-] on a Mac, or Alt-0146 on a PC), and make sure to save and serve your HTML as UTF-8. That way you don't have to screw around with entity names, but it assumes a UTF-8 clean workflow.
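For the search-and-replace option, here is a minimal Python sketch of one way to keep the replacement away from markup. It is only an illustration: the function name smarten_apostrophes is made up, the tag-splitting regex is a simplification (it does not understand comments, script or style blocks), and it only converts an apostrophe that sits between two word characters, as in "don't".

```python
import re

# Splits the document into alternating text and tag chunks.
# Simplification: anything between "<" and ">" is treated as a tag.
TAG_SPLIT = re.compile(r"(<[^>]*>)")

# A straight apostrophe with a word character on both sides (don't, it's).
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def smarten_apostrophes(html_text: str) -> str:
    """Replace straight apostrophes with U+2019 outside of tags."""
    parts = TAG_SPLIT.split(html_text)
    # With one capturing group, even indices are text and odd indices are tags.
    for i in range(0, len(parts), 2):
        parts[i] = APOSTROPHE.sub("\u2019", parts[i])
    return "".join(parts)


if __name__ == "__main__":
    print(smarten_apostrophes("<p class='note'>I've been using these!</p>"))
```

The attribute quotes inside the tag are left alone because tags are never touched. Opening and closing single quotes around a phrase are deliberately skipped too, since deciding their direction needs more context than this sketch looks at.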

Related

Adding special symbols into HTML

What is considered best practice for adding special symbols to HTML: using the symbol itself, e.g. ©, or its character reference, e.g. &copy;?
Example 1:
<p>Qualcomm©</p>
Example 2:
<p>Qualcomm®</p>
Both have their pros and cons. There isn't a strongly established best practice.
Using a literal character:
Is easier to read
Doesn't require developers to remember the character reference code
Requires fewer bytes/characters to send over the network (or store in a database, which might be more significant).
Using a character reference:
May be easier to type (depending on the developer's keyboard)
Is immune to being screwed up by character encoding errors
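To make the trade-off concrete, here is a small Python sketch (the helper name misread_as_latin1 is made up for illustration). It shows that the literal character and its references decode to the same thing, that the literal is shorter in UTF-8, and what happens to each form when UTF-8 bytes are wrongly read as Latin-1.

```python
import html

literal = "\u00a9"   # the copyright sign itself
named = "&copy;"     # named character reference
numeric = "&#169;"   # numeric character reference


def misread_as_latin1(text: str) -> str:
    """Simulate a page saved as UTF-8 but decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    # Both references resolve to the same character as the literal.
    print(html.unescape(named) == literal, html.unescape(numeric) == literal)
    # Size on the wire: 2 bytes for the literal, 6 for the named reference.
    print(len(literal.encode("utf-8")), len(named.encode("utf-8")))
    # The literal turns into mojibake; the ASCII-only reference survives.
    print(misread_as_latin1(literal), misread_as_latin1(named))
```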

HTML entities to Hex equivalent

I have legacy XML files with HTML entities such as &mdash;. How can I convert these entities to their hex equivalents, such as &#x2014;? Is there an easy way to do this using a batch command or something else? I am not a high-level programmer, so any detailed help will be appreciated.
Just a simple find and replace may be all you need. Most, if not all text/code editors have a find/replace function.
Chances are that only a few character strings make up the majority of what you need to replace and, fortunately, they're all pretty unique, so accidental replacements are unlikely.
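If there turn out to be more than a handful of distinct entities, the find and replace can be scripted. Below is a rough Python sketch, assuming the files only use the HTML 4 entity names known to the standard library's html.entities table. The function name named_to_hex is made up for this example, and the five entities that XML itself predefines are left untouched on purpose.

```python
import re
from html.entities import name2codepoint

# These are valid in plain XML already, so they must not be rewritten.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_hex(text: str) -> str:
    """Rewrite named HTML entities as hexadecimal numeric references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or XML-native names alone
        return "&#x{:X};".format(name2codepoint[name])

    return ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(named_to_hex("<title>Before&mdash;after &amp; more&nbsp;text</title>"))
```

To process a file, read it as text, pass the contents through named_to_hex, and write the result to a new file so the original stays available for comparison.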

Is it advisable to have non-ascii characters in the URL?

We are currently working on an I18N project. I am wondering what the complications are of having non-ASCII characters in the URL. If it's not advisable, what are the alternatives for dealing with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, so I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as octets according to the UTF-8 character encoding; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. (...) For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
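As an illustration of the RFC 3986 rule quoted above, here is a short Python sketch using the standard library's urllib.parse. The wrapper names encode_path and decode_path are made up for this example; quote and unquote are the real functions and use UTF-8 by default.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URL path as UTF-8, keeping "/" as a separator."""
    return quote(path, safe="/")


def decode_path(encoded: str) -> str:
    """Reverse of encode_path: turn percent escapes back into text."""
    return unquote(encoded)


if __name__ == "__main__":
    # The three characters from the RFC example.
    for ch in ("A", "\u00c0", "\u30a2"):
        print(repr(ch), "->", quote(ch))
    # A whole path with a non-ASCII segment.
    print(encode_path("/wiki/F\u00fcrth"))
```

This is roughly what a browser does before sending the request, and what a framework undoes before handing the path to your code.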
I would say no. The reason is simple: if you rely on the worldwide public, it would be a big problem for people to type your URL. I live in the "Cyrillic" world, where it is possible to create Cyrillic URLs, but no one has succeeded with them, because even we are too lazy to switch keyboard layouts and are used to typing Latin...
Update:
I can't speak to alternatives, but some languages have a formal or informal letter substitution; e.g. in German you can write Ö, but in a URL you might see OE instead. You can also consider English words, or words with similar sounds (so people from your country can remember the spelling, and it won't hurt users from other countries).
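The letter-substitution idea could be sketched like this in Python. The mapping table and the function name ascii_fallback are assumptions made for illustration; they only cover the German characters discussed here, and other languages would need their own conventions.

```python
# Conventional German ASCII substitutes (assumed mapping, German only).
GERMAN_SUBSTITUTES = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})


def ascii_fallback(text: str) -> str:
    """Replace German umlauts and sharp s with their ASCII spellings."""
    return text.translate(GERMAN_SUBSTITUTES)


if __name__ == "__main__":
    print(ascii_fallback("N\u00fcrnberg.de").lower())
```

A site could register or route both spellings, so that users who type the ASCII form still reach the same page.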
It depends on the target users. For example, Nürnberg.de can also be reached at nuernberg.de to keep it easily accessible, even though the German keyboard layout is the default for native German users and makes all four extra characters (öäüß) available to them. Do not forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive ways of entering such characters; for example, pressing Alt+u on a Mac adds an umlaut to the next character typed.
I was just wondering what are the complications of having the non-ascii characters in the URL.
But the way you phrased it, your question seems to be more about the URI than the URL: you are trying to put non-ASCII characters into the URN part of a URI. There are no complications in that, as long as you know where and how to parse the URN on the server (for example, on a Django-based server it can be parsed and handled with a regex in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax/JavaScript-based) evolution, nearly everything runs in UTF-8; JavaScript's URI-encoding functions such as encodeURIComponent, for instance, always produce UTF-8. UTF-8 has thus evolved into a de facto standard. Stick to UTF-8 and you will hardly face any complications in parsing and working with URIs.
For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी: irrespective of how you type them in the address bar, the browser will translate them to UTF-8 and send them to the server.
NOTE: besides UTF-8, some symbols are encoded using percent-encoding. More about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ASCII characters in a URL, but it's ugly because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp

Technique for ensuring HTML- and URL-encoding

Has anyone implemented a good system for ensuring that output is properly HTML-encoded where it makes sense? Maybe even something that recognizes when output should be URL-encoded or JSON-encoded instead?
The lazy approach — just encoding all inputs — causes problems when you want to send those inputs to a database, or to a block of JavaScript code. So something a little smarter is needed.
The tedious approach — putting the proper encoding function around each piece of data on the template — works, but it's easy for developers to forget to do it.
Is there a good approach that makes it easy for developers, and ensures that the right encoding is done? I was listening to one of the SO podcasts, and Joel tossed out an idea about using typed data to enforce a difference between HTML-encoded strings and non-encoded strings. Maybe that could be a starting point.
I'm looking more for a strategy than for an implementation in a particular language (although I'd be happy to hear about implementations that already exist and work).
EDIT: Here are some links I've found so far:
A type-based solution to the "strings problem"
String::Smart
Reducing XSS by way of Automatic Context-Aware Escaping in Template Systems
Secure String Interpolation in JS
Data that goes into your database probably should not have any escaping for HTML, JavaScript, or what have you. If you do include markup, you'll just have to strip it out if you decide to inject this data into a CSV file or PDF, etc...
Instead, whenever you query 'raw' data like this out of the database, escape the data at that time as appropriate to wherever you're injecting it; HTML, a JavaScript string, server-side scripting, etc.
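Here is a rough Python sketch of that strategy combined with the typed-string idea from the question: store raw data, and escape per context at output time. The names SafeHtml, to_html, to_url_component and to_js_literal are all made up for illustration; only html.escape, urllib.parse.quote and json.dumps are real library calls.

```python
import html
import json
from urllib.parse import quote


class SafeHtml(str):
    """Marker type: text that has already been HTML-escaped."""


def to_html(value) -> SafeHtml:
    """Escape for HTML text or attribute values, but never twice."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(str(value)))


def to_url_component(value) -> str:
    """Escape for use inside a single URL path segment or query value."""
    return quote(str(value), safe="")


def to_js_literal(value) -> str:
    """Serialize as a JSON/JavaScript literal.

    Caveat: json.dumps does not escape "<", so extra care is still
    needed when the result is placed inside an inline <script> block.
    """
    return json.dumps(value)


if __name__ == "__main__":
    raw = 'Tom & "Jerry" <script>'  # stored unescaped, as in the database
    print(to_html(raw))
    print(to_url_component(raw))
    print(to_js_literal(raw))
```

Because to_html returns a distinct type, a template layer could refuse plain str values or escape them automatically, which is the part that keeps developers from forgetting.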

Syntax highlight design pattern

I'm looking for some good overviews of best practices and common patterns for enabling syntax highlighting in a textbox. It seems like a very common exercise; almost all languages have a UI control that enables syntax highlighting in different languages. I'm just curious to see if there is a common pattern of implementation.
Is everyone using regular expressions? Is there a repository for regular expressions that are commonly used in syntax highlighting scenarios?
Are there alternative/better approaches to syntax highlighting?
Update
Links to relevant resources about performing syntax highlighting in a given language or concepts related to syntax highlighting would be great. Lexing (lexical analysis) was brought up in an answer but without a link to learn more. Anything to help better understand this commonly solved problem would be great.
Lexical Analysis on Wikipedia
Regular expressions are definitely where most people start out. However, they can't really cope with many of the edge cases found in most languages: text that looks like a keyword can appear inside a string literal, and string literals in turn can contain escaped delimiters as well as special characters. The same goes for comments, etc.
Basically to do a good job of syntax highlighting you need to perform lexing of the source - parsing it with the application of language-specific heuristics to build a list of regions, where each region of the source is annotated with how it is to be styled.
As edits take place, you can again apply language rules to see how far this change can alter the presentation of a region. For example typing a letter inside a string literal simply makes the string literal region longer, but typing a closing quote truncates the region and turns the leftover part of it into code, subject to all the other lexing rules.
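To show what "a list of regions" might look like, here is a minimal Python sketch of a lexer for a made-up toy language ("#" line comments, double-quoted strings with backslash escapes, and a few keywords). The keyword set and the function name regions are assumptions for illustration. It rescans the whole source on each call and does not attempt the incremental re-lexing described above.

```python
import re

KEYWORDS = {"if", "else", "while", "return"}

# One alternative per token kind; the first alternative that matches wins.
TOKEN = re.compile(r'''
    (?P<comment>\#[^\n]*)                # comment runs to end of line
  | (?P<string>"(?:\\.|[^"\\\n])*"?)     # string; closing quote optional
  | (?P<word>[A-Za-z_]\w*)               # identifier or keyword
''', re.VERBOSE)


def regions(source: str):
    """Return (start, end, kind) tuples for the parts that need styling."""
    found = []
    for match in TOKEN.finditer(source):
        kind = match.lastgroup
        if kind == "word":
            if match.group() not in KEYWORDS:
                continue  # plain identifiers keep the default style
            kind = "keyword"
        found.append((match.start(), match.end(), kind))
    return found


if __name__ == "__main__":
    sample = 'if x: "return \\"if\\"" # while'
    for start, end, kind in regions(sample):
        print(kind, repr(sample[start:end]))
```

Because a string or comment is consumed as a single token, keyword-looking text inside it is never examined on its own, which is exactly the edge case that keyword-only regular expressions get wrong.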