I am using magicsuggest as an auto-complete plugin in a web2py web application. I define a list variable dt=['张','李'] in model/db.py. The elements of the list are Chinese characters. However, when I embedded the variable in the HTML with {{=XML(dt)}}, following the magicsuggest manual, the Chinese characters were garbled. After several days of searching, I found that the list with Chinese characters was encoded as hex escapes in the HTML. I know something is wrong with the encoding/decoding. Could someone help me display the Chinese characters correctly in the HTML?
XML() is meant to take a string, not a list of strings. If you pass it something other than a string, it will first be converted to a string, so your code is equivalent to {{=XML(str(dt))}}, and you'll notice that in Python, str(['张','李']) yields "['\\xe5\\xbc\\xa0', '\\xe6\\x9d\\x8e']".
Instead, you can do {{=XML(dt[0])}}, and you will see the first character in the list displayed properly.
If you want to display a comma separated list surrounded by brackets, you can do:
{{=json.dumps(dt, encoding="UTF-8", ensure_ascii=False)}}
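The answer above targets Python 2, where json.dumps accepted an encoding argument. As a minimal sketch, assuming Python 3 (where that argument no longer exists and str is already Unicode), the same idea looks roughly like this. The variable names other than dt are only illustrative.

```python
import json

dt = ["张", "李"]

# Each character is three bytes in UTF-8; these are the \x.. values
# that showed up in the generated page.
utf8_bytes = [s.encode("utf-8") for s in dt]

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(dt)

# ensure_ascii=False keeps the characters themselves in the output.
readable = json.dumps(dt, ensure_ascii=False)

print(utf8_bytes)
print(escaped)
print(readable)
```

In a web2py view the resulting string would probably still need to go through XML() so that the quotes are not HTML-escaped, but that depends on how the template is set up.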
Related
I am doing a project that involves searching words in the Arabic script on Wiktionary, and when I do a GET request on certain word pages, I get something like this for example:
title="\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9">\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x85\xd8\xa7\xd9\x84\xd9\x8a\xd8\xa9</a></li>\n<li><a href="/wiki/%D8%B1%D8%A3%D8%B3%D9%8A"
This corresponds to the following URL: https://en.wiktionary.org/wiki/%D8%B1%D8%A3%D8%B3%D9%8A.
Does anyone know what the \xd8 or %D8 encodings are called? I want to say they are hex codes, but I have already looked up hex codes for the Arabic script and they certainly are not these.
The percent signs you see in the URL are used to substitute characters that aren't allowed in URLs, such as reserved characters like "/", ":" and "&" and non-ASCII characters. This is called percent-encoding - https://en.m.wikipedia.org/wiki/Percent-encoding
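The "\xd8..." sequences are hexadecimal escapes for individual bytes. Here they are the UTF-8 encoding of the Arabic characters: each Arabic letter takes two bytes in UTF-8, which is why the values don't match the Unicode code points you looked up. That's assuming the HTML you showed uses UTF-8 encoding.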
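As a minimal sketch in Python 3, assuming the page is UTF-8, the byte escapes and the percent escapes from the question decode to the same Arabic text. The variable names are illustrative.

```python
from urllib.parse import unquote

# The bytes behind the last link in the question, written as \x escapes.
raw = b"\xd8\xb1\xd8\xa3\xd8\xb3\xd9\x8a"

# Decoding the bytes as UTF-8 gives the Arabic word.
from_bytes = raw.decode("utf-8")

# Percent-decoding the URL path gives the same word (unquote assumes UTF-8).
from_url = unquote("%D8%B1%D8%A3%D8%B3%D9%8A")

# The Unicode code points, which are what Arabic "hex code" charts list.
code_points = [hex(ord(ch)) for ch in from_bytes]

print(from_bytes, from_url, code_points)
```

Each pair of bytes (for example D8 B1) maps to one code point (U+0631), so the byte values and the code chart values differ even though they describe the same letter.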
I wish to print the tab character with the format function. I can achieve this with ~C and then placing #\tab as an argument to format, but this seems a bit verbose as for a newline one can simply place a ~% in the string.
What is the most commonly used practice for printing tabs with the format function?
Thanks for all the help!
There is no notation for the tab character in FORMAT.
There are several choices, but none is really good:
Use #\tab (or a variable set to the character) as the argument, as you mention. That is okay for me.
Embed a literal tab character in the string. This may break with some editor settings, where the editor replaces tabs with spaces. It's also not directly visible.
Use a function in a format string, which writes a tab character.
Use a reader macro to introduce extended string syntax. Probably not bad. Maybe one even exists already. There was a post on comp.lang.lisp with an example.
I'm parsing some HTML using Beautiful Soup, and occasionally the HTML it returns includes some special characters, such as &mdash; (long dash) and &reg; (registered symbol).
I'm currently storing this html as a string in my db as is, and as a result when I display these variables in my templates the special characters appear as they do above. I've tried unescaping the characters using {{ variable|safe }} but that didn't work.
What is the right way to store, and then display, these types of special characters in Django?
What you're looking for is here:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Entity Conversion
You'll want to use the convertEntities parameter and encode them as unicode.
The final line should be something like
decodedString=unicode(BeautifulStoneSoup(encodedString,convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
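The line above uses the Beautiful Soup 3 API on Python 2. As a rough sketch of the same decoding step, assuming Python 3 and swapping in the standard library's html.unescape instead of convertEntities, it could look like this. The sample string and variable names are made up for illustration.

```python
import html

# Hypothetical scraped text that still contains HTML entities.
encoded_string = "Caf&eacute; &mdash; Brand&reg;"

# html.unescape turns named and numeric entities into Unicode characters.
decoded_string = html.unescape(encoded_string)

print(decoded_string)
```

Storing the decoded Unicode text in the database means the template only has to output it in a page served as UTF-8.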
To display them again
"Your string with a long dash in it".encode('ascii', 'xmlcharrefreplace')
When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)
%AE is not valid in HTML-safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Windows encoding from Word (windows-1252 or something like that). It will not convert to HTML-safe text unless you do some sort of translation in the middle.
The byte AE is the ISO-8859-1 representation of the registered trademark sign. If you don't see anything, then apparently the URL decoder is using another charset to URL-decode it. In UTF-8, for example, a lone AE byte is not a valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or to convert the existing data to be URL-encoded using UTF-8.
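A minimal Python 3 sketch of the two fixes just described, using urllib.parse from the standard library. The question does not say which server-side language is in use, so this only illustrates the charset difference.

```python
from urllib.parse import quote, unquote

# Decoding %AE as ISO-8859-1 yields the registered trademark sign.
as_latin1 = unquote("%AE", encoding="iso-8859-1")

# Decoding it as UTF-8 (the default) fails: a lone 0xAE byte is invalid,
# so unquote substitutes the U+FFFD replacement character.
as_utf8 = unquote("%AE")

# The UTF-8 percent-encoding of the same sign is two bytes long.
utf8_form = quote("\u00ae")

# %20 works under either charset because a space is plain ASCII.
space = unquote("%20")

print(as_latin1, repr(as_utf8), utf8_form, repr(space))
```

This is why %20 appears to work while %AE does not: ASCII bytes mean the same thing in both charsets, and bytes above 7F do not.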
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities, i.e. use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
What server side language are you using? Check for a URL Decode function.
If you are using php you can use urldecode() but you should be careful about + characters.
I am looking for a list of characters and symbols for use in HTML, in PDF or image format. It could be some sort of cheat-sheet. Basically I want a reference list for use in HTML, for example for replacing '&' with '&amp;'. I have found the list at http://www.w3schools.com/tags/ref_entities.asp but I would appreciate it if anyone could point me to a PDF or image version of the list.
Regards
There is a complete list in the specification but, with the exception of <, &, and " or ', you should be able to use any character directly in UTF-8 (which results in much more readable documents).
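To illustrate the point that only a handful of characters need escaping, here is a small sketch assuming Python 3 and its standard html module. The sample text is invented.

```python
import html

# Hypothetical text mixing markup-significant characters with a symbol.
text = 'Tom & Jerry\u00ae <b>"5 < 6"</b>'

# html.escape replaces &, <, > and quotes; the registered sign is left
# as a literal character, which is fine in a page served as UTF-8.
escaped = html.escape(text)

print(escaped)
```

Everything other than those few characters passes through unchanged, so a full entity table is mostly needed when the page cannot be served as UTF-8.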