replacing ' and " characters with proper quotation marks - html

I'm developing a simple site that doesn't use PHP. This the first website I've programmed from scratch and I'm trying to find the standard method for replacing ' and " (foot and inch marks) with proper apostrophe's and smart quotes . I've looked for an answer but mostly what I'm finding applies to PHP or other backend situations I don't recognize, or it seems to be an article written in the early 2000's. The only thing I know to do is to edit the actual content with the appropriate entity name, but what a pain that would be for a whole site. There must be a trick I don't know about? Did I mention this is my first website:)

Use a HTML editor that allows to load multiple files and do a multiple replace in all the files at once. I use Coffee Cup HTML Editor for such, albeit it is buggy.

I'm more a backend guy when it comes to handle smart typography (I'm the creator of JoliTypo) but another tool I know is Typesetters.js.
You have to install a custom font and a javascript file but you will end-up with a pretty good rendering / typography. I think it only handle English quotes too.

Related

mso-number-format for Accounting

So does anyone know if there is an mso-number-format to mimic the Accounting format in Excel (Negatives are put in parens, zeros are dashes, all have 2 decimals and everything gets a dollar sign waaaay to the left)
I have an html table that i am opening in excel that i would Love to have this format.
I found the following one online, but it doens't seem to work:
mso-number-format:\#\,\#\#0\.00_\)\;\[Black\]\\(\#\,\#\#0\.00\\)
Thanks
This is the string that I return from C# that gives a true Accounting format in Excel:
"\"_-$* #,##0.00_-;-$* #,##0.00_-;_-$* \\\" - \\\"??_-;_-#_-\""
Some important things to note when dealing with mso-number-format:
All custom formats must be surrounded
in double quotes
Any double quotes
'inside' the custom format must be
replaced by literal "escaped double
quotes", for Excel to interpret
later.
So, you want the actual markup to be outputted like this
"_-$* #,##0.00_-;-$* #,##0.00_-;_-$* \" - \"??_-;_-#_-"
Hope this helps
MSO is pretty proprietary to Microsoft, and hopefully this stuff won't be supported in the future so that UI Devs like me don't have to un-do it....fingers crossed.
My first inclination would be to dynamically build the excel spreadsheet with a tool like PHPExcel so that you have 100% control over formatting, calculations, etc the way that Excel is looking for it. Certainly there are variations of this software for the respective technology you have at your disposal (.net, java, etc)
Absent that solution, there are wonderful JQuery plugins such as this However, I'm not entirely sure this would suffice when you pull the html natively into Excel--it might not fire.
Do you have back-end technology available? Something like a regex replacement could quickly solve the problem with no ill effects on Excel.
From Microsoft's own website, CSS can't format numbers.

Minimize html, doubts and questions

Minimizing html is the only section on Google's Page Speed where there is still room for improvement.
My site is all dynamic and the HTML is already Deflated so there is no reason to put any more pressure on the server (I don't want to minimize pages real time before sending).
What I could do was to minimize the template files. My templates files are a mix of PHP and HTML so I've come up with some code that I think is pretty safe but would like to be community revised.
// this will loop trough all template files
// php is cleaned first so that line-comments will not interfere with the regex
$original = file_get_contents($dir.'/'.$file);
$php_clean = php_strip_whitespace($dir.'/'.$file);
$minimized = preg_replace('/\s+/', ' ', $php_clean);
This will make my template files as a single very long file alternated with some places where DB content is inserted. Google's homepage source looks more or less like what I get so I wonder if they follow a similar approach.
Question 1: Do you antecipate potencial problems?
Question 2: Is there anyway better (more efficient to do this)?
And please remember that I'm not trying to validate HTML as the templates are not valid HTML (header and footer are includes, for example).
Edit: Do take into consideration that the template files will be minimized on deploy. As CSS and Javascript files are minimized and compressed using YUI Compressure and Closure, the template files would be minimized like-wise, on deploy. Not on client-request.
Thank you.
Google's own Closure Templates (Soy) strips whitespace at the end of the line by default, and the template designer explicitly inserts a space using {sp}. This probably isn't a good enough reason to switch away from PHP, but I just wanted to bring it to your attention.
In addition, realize that HTML 4 allows you to exclude some tags, as recommended by the Page Speed documentation on minifying HTML (http://code.google.com/p/page-speed/wiki/MinifyHtml). You can exclude </p>, </td>, </tr>, etc. For a complete list of elements for which you can omit the end tag, search for "- O" in the HTML 4 DTD (http://www.w3.org/TR/REC-html40/sgml/dtd.html). You can even omit the <html>, <head>, <body>, and <tbody> tags entirely, as both start and end tags are optional ("O O" in the DTD).
You can also omit the quotes around attributes (http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.2) such as id, class (with a single class name), and type that have simple content (i.e., matches /^[-A-Za-z0-9._:]+$/). For attributes that have a single possible value, you can exclude the value (e.g., say simply checked rather than checked=checked).
Some people may find these tips repulsive because we've been conditioned for so many years to prepare for the upcoming world of simple LALR parsers for XHTML. Thus, tools like Dave Raggett's HTML Tidy generate HTML with proper closing tags and quotes around attribute values. But let's face it, all the browsers already have parsers that understand HTML 4, any new browser will use the HTML 5 parser rather than XHTML, and we should get comfortable writing HTML that is optimized for size.
That being said, besides a couple large companies like Google and Facebook, my guess is that page size is a negligible component of latency, so if you're optimizing your own site it's probably because of your own obsessive tendencies rather than performance.
White space can be significant (e.g. in pre elements).
When I had a particularly large page (i.e. large enough that there was a benefit in minifying the HTML) I used HTML Tidy and cached the results.
tidy -c -n -omit -ashtml -utf8 --doctype strict \
--drop-proprietary-attributes yes --output-bom no \
--wrap 0
I think you'll end up running into issues with load time with this approach, as the get contents, strip whitespace, and preg replace calls are going to take a lot longer to do than whatever bandwidth the minified HTML is saving you.
I've been running tests on all my sites for a couple of weeks and I can say that this method is pretty consistent. It will only affect template content, so there is little risk of messing up with unknown <pre> or similar.
It is run before deploy so there is no impact on server - actually there should be a little speed up as the file becomes smaller.
Do remember that all content that comes from the database will not suffer any influence as, like said before, this runs before deploy and on template files only.
The method seams solid enough to pass it into production.
If anything goes wrong I'll post it here.

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain html tags, instead of all? Like either White or Black listing tags?
Example of what this might look like, of what I'm trying to say would be:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually pretty hard problem. Especially the script tag can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it . It's superp!
You can have a look Sanitize as well(Seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
Preventing XSS attacks is serious business, follow hrnt's and consider that there is probably an order of magnitude more exploits than that possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm the in the process of evaluating sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for it's robustness—scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord, but Nokogiri and specifically Loofah seem to be a little more peformant, more actively maintained, and definitely more flexible and Ruby-ish.
Update I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Santize (which has recently been updated to use nokogiri by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

Tools to reduce generated HTML size

I'm using google docs, and some templates we are using were created using MS-Office.
The resulting HTML is fat and ugly, and the 500KB per doc limitation on google makes some cleanup mandatory.
I was able to find redundant "style" attributes and move them to some CSS class, and rename the most redundant classes names to shorter ones, which makes me save about 50% of the original size.
Are you aware of some existing tools/scripts/lib which could do this painful job for me, or at least help me to write this magic tool ?
Thanks in advance !
EDIT: I gave a try to both tidy, demoronizer and "manual rewrite":
- Input : 140Kb
- Tidy'ed : 110Kb
- Demoronized : 135Kb
So my favorite answer will be "rewrite it!"
Thanks !
MS-Office makes crappy HTML, period. You're better of spending time rebuilding the HTML from the original text than trying to walk through that minefield.
I made a few macros that do some search/replace functions on Word to do basic things like wrap <p> tags around paragraphs and stuff like that, then re-markup the whole thing from scratch.
You could try tidy it will clean up many things.
Without commenting on its name, I could mention demoronizer, which the author describes as:
...a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications.
YMMV.
One of my favourite utilties now is actually Windows Live Writer - it does a neat job of stripping rubbish out of Word doc files. Some might disagree but I use it quite often!

How do you handle translation of text with markup?

I'm developing multi-language support for our web app. We're using Django's helpers around the gettext library. Everything has been surprisingly easy, except for the question of how to handle sentences that include significant HTML markup. Here's a simple example:
Please log in to continue.
Here are the approaches I can think of:
Change the link to include the whole sentence. Regardless of whether the change is a good idea in this case, the problem with this solution is that UI becomes dependent on the needs of i18n when the two are ideally independent.
Mark the whole string above for translation (formatting included). The translation strings would then also include the HTML directly. The problem with this is that changing the HTML formatting requires changing all the translation.
Tightly couple multiple translations, then use string interpolation to combine them. For the example, the phrase "Please %s to continue" and "log in" could be marked separately for translation, then combined. The "log in" is localized, then wrapped in the HREF, then inserted into the translated phrase, which keeps the %s in translation to mark where the link should go. This approach complicates the code and breaks the independence of translation strings.
Are there any other options? How have others solved this problem?
Solution 2 is what you want. Send them the whole sentence, with the HTML markup embedded.
Reasons:
The predominant translation tool, Trados, can preserve the markup from inadvertent corruption by a translator.
Trados can also auto-translate text that it has seen before, even if the content of the tags have changed (but the number of tags and their position in the sentence are the same). At the very least, the translator will give you a good discount.
Styling is locale-specific. In some cases, bold will be inappropriate in Chinese or Japanese, and italics are less commonly used in East Asian languages, for example. The translator should have the freedom to either keep or remove the styles.
Word order is language-specific. If you were to segment the above sentence into fragments, it might work for English and French, but in Chinese or Japanese the word order would not be correct when you concatenate. For this reason, it is best i18n practice to externalize entire sentences, not sentence fragments.
2, with a potential twist.
You certainly could localize the whole string, like:
loginLink=Please log in to continue
However, depending on your tooling and your localization group, they might prefer for you to do something like:
// tokens in this string add html links
loginLink=Please {0}log in{1} to continue
That would be my preferred method. You could use a different substitution pattern if you have localization tooling that ignores certain characters. E.g.
loginLink=Please %startlink%log in%endlink% to continue
Then perform the substitution in your jsp, servlet, or equivalent for whatever language you're using ...
Disclaimer: I am not experienced in internationalization of software myself.
I don't think this would be good in any case - just introduces too much coupling …
As long as you keep formatting sparse in the parts which need to be translated, this could be okay. Giving translators the possibility to give special words importance (by either making them a link or probably using <strong /> emphasis sounds like a good idea. However, those translations with (X)HTML possibly cannot be used anywhere else easily.
This sounds like unnecessary work to me …
If it were me, I think I would go with the second approach, but I would put the URI into a formatting parameter, so that this can be changed without having to change all those translations.
Please log in to continue.
You should keep in mind that you may need to teach your translators a basic knowledge of (X)HTML if you go with this approach, so that they do not screw up your markup and so that they know what to expect from that text they write. Anyhow, this additional knowledge might lead to a better semantic markup, because, as mentioned above, texts could be translated and annotated with (X)HTML to reflect local writing style.
What ever you do keep the whole sentence as one string. You need to understand the whole sentece to translate it correctly.
Not all words should be translated in all languages: e.g. in Norwegian one doesn't use "please" (we can say "vær så snill" literally "be so kind" but when used as a command it sounds too forceful) so the correct norwegian vould be:
"Logg inn for å fortsette" lit.: "Log in to continue" or
"Fortsett ved å logge inn" lit.: "Continue by to log in" etc.
You must allow completely changing the order, e.g. in a fictional demo language:
"Für kontinuer Loggen bitte ins" (if it was real) lit.: "To continue log please in"
Some language may even have one single word for (most of) this sentence too...
I'll recommend solution 1 or possibly "Please %{startlink}log in%{endlink} to continue" this way the translator can make the whole sentence a link if that's more natural, and it can be completely restructured.
Interesting question, I'll be having this problem very soon. I think I'll go for 2, without any kind of tricky stuff. HTML markup is simple, urls won't move anytime soon, and if anything is changed a new entry will be created in django.po, so we get a chance to review the translation ( ex: a script should check for empty translations after makemessages ).
So, in template :
{% load i18n %}
{% trans 'hello world' %}
... then, after python manage.py makemessages I get in my django.po
#: templates/out.html:3
msgid "hello world"
msgstr ""
I change it to my needs
#: templates/out.html:3
msgid "hello world"
msgstr "bonjour monde"
... and in the simple yet frequent cases I'll encounter, it won't be worth any further trouble. The other solutions here seems quite smart but I don't think the solution to markup problems is more markup. Plus, I want to avoid too much confusing stuff inside templates.
Your templates should be quite stable after a while, I guess, but I don't know what other trouble you expect. If the content changes over and over, perhaps that content's place is not inside the template but inside a model.
Edit: I just checked it out in the documentation, if you ever need variables inside a translation, there is blocktrans.
Makes no sense, how would you translate "log in"?
I don't think many translators have experience with HTML (the regular non-HTML-aware translators would be cheaper)
I would go with option 3, or use "Please %slog in%s to continue" and replace the %s with parts of the link.