How to print non-ascii characters in SBCL Common Lisp - output

Assuming I have such character stored in variable character, how do I print it?
For example GREEK_SMALL_LETTER_XI with code 958.
(format t "~a" character) would just give ?

The OP mentioned in a comment that he was moving to Linux. In SBCL 1.4.15.Debian (and I presume on other Linuxes) Unicode characters are only printed as characters (as opposed to codes) by the (format) function, and not by (print).
Example:
(print (code-char 26159)) produces "#\U662F"
which shows the character's Unicode code point in hexadecimal,
while
(format T "~a" (code-char 26159)) produces "是"
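The link between the decimal code 26159 and the U662F shown by (print) is only a base conversion. A quick cross-check, using Python purely as a calculator (this is not part of the SBCL answer, and the variable names are illustrative):

```python
# Cross-check: decimal code passed to code-char vs. the hex form shown by print.
code = 26159
hex_form = format(code, "X")   # uppercase hexadecimal, no prefix
char = chr(code)               # the character with that code point

print(hex_form)
print(ord(char) == code)
```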

Related

Remove backslash from nested json [duplicate]

When I create a string containing backslashes, they get duplicated:
>>> my_string = "why\does\it\happen?"
>>> my_string
'why\\does\\it\\happen?'
Why?
What you are seeing is the representation of my_string created by its __repr__() method. If you print it, you can see that you've actually got single backslashes, just as you intended:
>>> print(my_string)
why\does\it\happen?
The string below has three characters in it, not four:
>>> 'a\\b'
'a\\b'
>>> len('a\\b')
3
You can get the standard representation of a string (or any other object) with the repr() built-in function:
>>> print(repr(my_string))
'why\\does\\it\\happen?'
Python represents backslashes in strings as \\ because the backslash is an escape character - for instance, \n represents a newline, and \t represents a tab.
This can sometimes get you into trouble:
>>> print("this\text\is\not\what\it\seems")
this ext\is
ot\what\it\seems
Because of this, there needs to be a way to tell Python you really want the two characters \n rather than a newline, and you do that by escaping the backslash itself, with another one:
>>> print("this\\text\is\what\you\\need")
this\text\is\what\you\need
When Python returns the representation of a string, it plays safe, escaping all backslashes (even if they wouldn't otherwise be part of an escape sequence), and that's what you're seeing. However, the string itself contains only single backslashes.
More information about Python's string literals can be found at: String and Bytes literals in the Python documentation.
As Zero Piraeus's answer explains, using single backslashes like this (outside of raw string literals) is a bad idea.
But there's an additional problem: in the future, it will be an error to use an undefined escape sequence like \d, instead of meaning a literal backslash followed by a d. So, instead of just getting lucky that your string happened to use \d instead of \t so it did what you probably wanted, it will definitely not do what you want.
As of 3.6, it already raises a DeprecationWarning, although most people don't see those. It will become a SyntaxError in some future version.
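A minimal sketch of the two spellings that avoid the warning, assuming you want literal backslashes: double each backslash, or use a raw string literal. The variable names are illustrative.

```python
# Two equivalent ways to write a string containing literal backslashes.
escaped = "why\\does\\it\\happen?"   # each \\ in the source is one backslash character
raw = r"why\does\it\happen?"         # raw literal: backslashes are taken as-is

print(escaped == raw)         # both literals denote the same string
print(len(raw))               # length counts each backslash once
print(raw.count("\\"))        # number of backslash characters in the string
```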
In many other languages, including C, using a backslash that doesn't start an escape sequence means the backslash is ignored.
In a few languages, including Python, a backslash that doesn't start an escape sequence is a literal backslash.
In some languages, to avoid confusion about whether the language is C-like or Python-like, and to avoid the problem with \Foo working but \foo not working, a backslash that doesn't start an escape sequence is illegal.

Is there a way to encode spaces inside a Tcl string?

There is a third party software that linked with Tcl library and is parsing given tcl script.
One of its commands takes a file name or a list of file names. When there are spaces in a file name or path it chokes: it treats the name as a list of names, even if I enclose it in curly braces or double quotes.
Can a string/path with spaces in it be encoded in a way that other Tcl commands will still interpret correctly?
The following are the standard escape sequences for a space in Tcl:
“\ ” — a backslash followed by a space.
“\040” — a backslash followed by a 0 and the octal for 32.
“\x20” — a backslash followed by a x and the two-digit hexadecimal for 32. (Only really suitable if the character following is not a hex digit because of a bug in many versions of Tcl.)
“\u0020” — a backslash followed by a u and the four-digit hexadecimal for 32.
“\U000020” — a backslash followed by a U and the six-digit hexadecimal for 32. (Introduced in Tcl 8.6 as part of migration path to supporting newer Unicode characters.)
One of those might work in your situation. Or might if you quote the backslash with another backslash enough times; you're effectively playing “guess how badly wrong someone's software is” at this point. (It sounds like stuff is going through an ill-advised eval somewhere. Maybe several times. Maybe different amounts on different code paths. That's awful if it is true…)
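Python is not Tcl, but its string literals happen to accept the same octal, \x and \u notations, so it can serve as a rough sanity check that these escapes all name character 32. This is only an analogue: it says nothing about how the third-party tool re-parses its arguments, and the "backslash followed by a space" form is Tcl-specific and is left out.

```python
# Analogue in Python (not Tcl): each of these escapes denotes code point 32.
candidates = {
    "octal \\040": "\040",
    "hex \\x20": "\x20",
    "unicode \\u0020": "\u0020",
}

for label, value in candidates.items():
    # Print the notation, its code point, and whether it equals a plain space.
    print(label, ord(value), value == " ")
```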

Replace non-ASCII characters with SGML entity codes with Emacs

I have an HTML file with a few non-ASCII characters, say encoded in UTF-8 or UTF-16. To save the file in ASCII, I would like to replace them with their (SGML/HTML/XML) entity codes. So for example, every ë should become &#235; and every ◊ should become &#9674;. How do I do that?
I use Emacs as an editor. I'm sure it has a function to do the replace, but I cannot find it. What am I missing? Or how do I implement it myself?
I searched high and low, but it seems Emacs (or at least version 24.3.1) doesn't have such a function, nor could I find one anywhere else.
Based on a similar (but different) function I did find, I implemented it myself:
(defun html-nonascii-to-entities (string)
  "Replace any non-ASCII characters in STRING with HTML (actually SGML) entity codes."
  (mapconcat
   (lambda (char)
     ;; Keep characters 8..126 as they are; encode everything else numerically.
     (if (and (<= 8 char)
              (<= char 126))
         (char-to-string char)
       (format "&#%02d;" char)))
   string
   ""))
(defun html-nonascii-to-entities-region (region-begin region-end)
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive "r")
  (save-excursion
    (let ((escaped (html-nonascii-to-entities (buffer-substring region-begin region-end))))
      (delete-region region-begin region-end)
      (goto-char region-begin)
      (insert escaped))))
I'm no Elisp guru at all, but this works!
I also found find-next-unsafe-char to be of value.
Edit: an interactive version!
(defun query-replace-nonascii-with-entities ()
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive)
  (perform-replace "[^[:ascii:]]"
                   `((lambda (data count)
                       (format "&#%02d;" ; Hex: "&#x%x;"
                               (string-to-char (match-string 0)))))
                   t t nil))
There is a character class which includes exactly the ASCII character set. You can use a regexp that matches its complement to find occurrences of non-ASCII characters, and then replace them with their codes using elisp:
M-x replace-regexp RET
[^[:ascii:]] RET
\,(concat "&#" (number-to-string (string-to-char \&)) ";") RET
So when, for example, á is matched: \& is "á", string-to-char converts it to ?á (= the number 225), and number-to-string converts that to "225". Then, concat concatenates "&#", "225" and ";" to get "&#225;", which replaces the original match.
Surround these commands with C-x ( and C-x ), and apply C-x C-k n and M-x insert-kbd-macro as usual to make a function out of them.
To see the elisp equivalent of calling this function interactively, run the command and then press C-x M-: (Repeat complex command).
A simpler version, which doesn't take into account the active region, could be:
(while (re-search-forward "[^[:ascii:]]" nil t)
  (replace-match (concat "&#"
                         (number-to-string (string-to-char (match-string 0)))
                         ";")))
(This uses the recommended way to do search + replace programmatically.)
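For comparison outside Emacs, the same match-and-replace idea can be sketched in Python. The function name nonascii_to_entities and the sample string are made up for illustration; the second half uses the standard "xmlcharrefreplace" codec error handler, which gives the same result.

```python
import re

def nonascii_to_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return re.sub(r"[^\x00-\x7f]", lambda m: f"&#{ord(m.group(0))};", text)

# "cafá ◊", written with escapes so the snippet itself stays ASCII.
sample = "caf\u00e1 \u25ca"
converted = nonascii_to_entities(sample)
print(converted)   # caf&#225; &#9674;

# The built-in error handler performs the same substitution while encoding.
builtin = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(builtin == converted)
```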
I think you are looking for iso-iso2sgml

Convert UTF8 to ASCII using lazarus

I am reading some strings from a text file, the problem is that the strings are UTF8 and contain characters that I wish to remove such as: Ă
A not-so-easy solution would be to replace each occurrence of the illegal characters, but because I am lazy I want a simpler solution.
So far I tried this :
line := Utf8ToAnsi(line);
Where line is my UTF8-encoded string ... I even tried declaring line as UTF8String ...
Is there a viable solution in this matter? Thanks
A not-so-easy solution would be to replace each occurrence of the
illegal characters, but because I am lazy I want a simpler solution
I developed a function that replaces each diacritical character occurrence to a similar ASCII character, e.g: Á -> A, Ç -> C, ã -> a, and so on. Please take a look at this link.
HTH
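The linked function is not reproduced here. As a rough illustration of the same idea (Á -> A, Ç -> C, ã -> a), here is a sketch in Python rather than Lazarus, based on Unicode decomposition; the function name strip_diacritics is an assumption, not the answerer's code.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Map accented letters to their base letters and drop anything still non-ASCII."""
    # NFD splits a character such as "Á" into "A" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Remove the combining marks, keeping the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Characters with no ASCII base letter are dropped rather than guessed.
    return base.encode("ascii", "ignore").decode("ascii")

# Á, Ç, ã and Ă, written with escapes so the snippet stays ASCII.
print(strip_diacritics("\u00c1\u00c7\u00e3\u0102"))
```

Letters without a decomposition (ß or ø, for example) are simply removed by this sketch, so a hand-written replacement table is still needed if those matter.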

charset-utf8 and character entities

I am proposing to convert my windows-1252 XHTML web pages to UTF-8.
I have the following character entities in my coding:
&#39; — apostrophe,
&#9658; — right pointer,
&#9668; — left pointer.
If I change the charset and save the pages as UTF-8 using my editor:
the apostrophe remains in as a character entity;
the pointers are converted to symbols within the code (presumably because the entities are not supported in UTF-8?).
Questions:
If I understand UTF-8 correctly, you don't need to use the entities and can type characters directly into the code. In that case, is it safe for me to replace #39 with a typed-in apostrophe?
Is it correct that the editor has placed the pointer symbols directly into my code, and will these be displayed reliably in modern browsers? It seems to be OK. Presumably, I can't revert to the entities anyway if I use UTF-8?
Thanks.
It's charset, not chartset.
1) It depends on where the apostrophe is used. It is a valid ASCII character as well, so depending on the character's purpose (whether it is for display only, inside a DOMText node, or used in code) you may or may not be able to use a literal apostrophe.
2) If your editor is a modern editor, it will be using UTF-8 byte sequences instead of single-byte chars to represent text. Most of the characters used in code are plain ASCII (and ASCII is a subset of UTF-8), so those characters take up one byte. Other characters take up two, three or even four bytes. They will still be displayed to you as one character, but the relation between character and byte is no longer one-to-one.
Anyway, since all valid ASCII characters are exactly the same in ASCII, UTF-8 and even windows-1252, you should not see any problems using UTF-8. And you can still use numeric and named entities because they are written in those valid characters. You just don't have to.
P.S. All modern browsers can do UTF-8 just fine, but our definitions of "modern" may vary.
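The byte-level claims above can be checked with a short Python sketch. The variable names are illustrative, and "cp1252" is Python's codec name for windows-1252.

```python
# ASCII characters encode to the same single byte in all three encodings.
apostrophe = "'"
same_bytes = (
    apostrophe.encode("ascii")
    == apostrophe.encode("utf-8")
    == apostrophe.encode("cp1252")
)

# Non-ASCII characters differ: byte counts vary, and cp1252 cannot hold some at all.
e_diaeresis = "\u00eb"   # ë
pointer = "\u25ba"       # ►
utf8_lengths = (len(e_diaeresis.encode("utf-8")), len(pointer.encode("utf-8")))

try:
    pointer.encode("cp1252")
    pointer_fits_cp1252 = True
except UnicodeEncodeError:
    pointer_fits_cp1252 = False

print(same_bytes, utf8_lengths, pointer_fits_cp1252)
```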
Entities have three purposes: Encoding characters it isn't possible to encode in the character encoding used (not relevant with UTF-8), encoding characters it is not convenient to type on a given keyboard, and encoding characters that are illegal unescaped.
&#9658; should always produce ► no matter what the encoding. If it doesn't, it's a bug elsewhere.
► directly in the source is fine in UTF-8. You can do either that or the entity, and it makes no difference.
' is fine in most contexts, but not some. The following are both allowed:
<span title="Jon's example">This is Jon's example</span>
But it would have to be encoded in:
<span title='Jon&#39;s example'>This is Jon's example</span>
because otherwise it would be taken as the ' that ends the attribute value.
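If the markup is generated by a program, the escaping can be left to a library. A minimal sketch with Python's standard html module follows; the variable names are illustrative, and note that html.escape emits the hexadecimal form &#x27; for the apostrophe, which is equivalent to &#39;.

```python
import html

title = "Jon's example"

# quote=True also escapes ' and ", which is what an attribute value needs.
attr_value = html.escape(title, quote=True)
# quote=False leaves the apostrophe alone, which is fine in element text.
text_value = html.escape(title, quote=False)

markup = f"<span title='{attr_value}'>This is {text_value}</span>"
print(markup)

# Going the other way: numeric references decode to the same characters.
print(html.unescape("&#39;") == "'")
```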
Use entities if you copy/paste content from a word processor or if the code is an XML dialect. Use a macro in your text-editor to find/replace the common ones in one shot. Here is a simple list:
Half: ½ => &frac12;
Acute Accent: é => &eacute;
Ampersand: & => &amp;
Apostrophe: ’ => &apos;
Backtick: ‘ => &grave;
Backslash: \ => &bsol;
Bullet: • => &bull;
Dollar Sign: $ => &dollar;
Cents Sign: ¢ => &cent;
Ellipsis: … => &hellip;
Emdash: — => &mdash;
Endash: – => &ndash;
Left Quote: “ => &ldquo;
Right Quote: ” => &rdquo;
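Instead of an editor macro, the find/replace can also be scripted. The sketch below uses Python's html.entities.codepoint2name table, which covers the HTML 4.01 names only, so a few entries in the list above (for example the backslash and dollar names) are not in it. The function name to_named_entities is hypothetical.

```python
from html.entities import codepoint2name

def to_named_entities(text: str) -> str:
    """Replace non-ASCII characters with named entities, falling back to numeric ones."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII is left untouched (including &, which this sketch does not escape).
            out.append(ch)
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            # No HTML 4.01 name: use a decimal numeric reference instead.
            out.append(f"&#{cp};")
    return "".join(out)

# "½ café — ►", written with escapes so the snippet stays ASCII.
print(to_named_entities("\u00bd caf\u00e9 \u2014 \u25ba"))
```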
References
XML Entity Names