Convert UTF-8 to ASCII using Lazarus - FreePascal

I am reading some strings from a text file. The problem is that the strings are UTF-8 encoded and contain characters that I wish to remove, such as: Ă
A tedious solution would be for me to replace each occurrence of the illegal characters, but because I am lazy I want a simpler solution.
So far I have tried this:
line := Utf8ToAnsi(line);
where line is my UTF-8 encoded string... I even tried declaring line as UTF8String...
Is there a viable solution for this? Thanks

A tedious solution would be for me to replace each occurrence of
the illegal characters, but because I am lazy I want a simpler solution
I developed a function that replaces each occurrence of a diacritical character with a similar ASCII character, e.g. Á -> A, Ç -> C, ã -> a, and so on. Please take a look at this link.
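The linked function is not reproduced here, but the idea is easy to sketch; here is a rough Python illustration (decompose with NFD, then drop the combining marks), offered only to show the approach:

import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD splits e.g. 'Á' into 'A' plus a combining acute accent;
    # dropping the combining marks leaves the plain ASCII letter.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("Á Ç ã Ă"))   # A C a A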
HTH

Related

How can I find out which character a certain string is encoding in my database?

Recently I exported parts of my MySQL database, and noticed that the text had several strange characters in it. For example, the string â€™ often appeared.
When trying to find out what this meant, I found the Stack Overflow question: Character Encoding and the â€™ Issue. From that question I now know that the string â€™ stands for a quote.
But how can I find out more generally what a string of characters stands for? For example, the letter Â often appears in my database as well, and is actually causing me a problem now on a certain page; to solve the problem, I would like to know what that character means.
I've looked at several tables showing character encodings, but haven't been able to figure out how to use these tables to see why â€™ means ', or, more importantly for me, what Â stands for. I'd be very grateful if someone could point me in the right direction.
The latin1 encoding for ’ is (in hex) 92.
The utf8 encoding for ’ is E28099.
But you pasted in C3A2E282ACE284A2, which is the "double encoding" of that apostrophe.
What apparently happened is that you had ’ in the client; the client was generating utf8 encodings. But your connection parameters to MySQL said "latin1". So, your INSERT statement dutifully treated it as 3 latin1 characters E2 80 99 (visually â€™), and converted each one to utf8, hex C3A2 E282AC E284A2.
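For what it's worth, the whole round trip can be reproduced in a few lines of Python; cp1252 stands in for MySQL's latin1 here (they agree on these bytes), and this is only an illustration of the mechanism:

s = "\u2019"                       # ’ RIGHT SINGLE QUOTATION MARK
utf8_bytes = s.encode("utf-8")     # b'\xe2\x80\x99'

# The connection claimed latin1, so each UTF-8 byte was re-read as its own
# character (E2 -> a-circumflex, 80 -> euro sign, 99 -> trademark sign) ...
mojibake = utf8_bytes.decode("cp1252")          # 'â€™'

# ... and each of those characters was then encoded to UTF-8 when stored:
print(mojibake.encode("utf-8").hex().upper())   # C3A2E282ACE284A2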
Read about "double encoding" in Trouble with UTF-8 characters; what I see is not what I stored
Meanwhile, browsers tend to be forgiving about double-encoding, or else it might have shown â€™
latin1 characters are each 1 byte (2 hex digits). utf8/utf8mb4 characters are 1-to-4 bytes; some 2-byte and 3-byte encodings showed up in your exercise.
As for Â... Go to http://mysql.rjweb.org/doc.php/charcoll#8_bit_encodings and look at the second table there. Notice how the first two columns have lots of things starting with Â. In latin1, that is hex C2. In utf8, many punctuation marks are encoded as 2 bytes: C2xx. For example, the copyright symbol, © is utf8 hex C2A9, which is misinterpreted as Â©.
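The same single-step misreading explains the Â; a tiny Python check (again with cp1252 standing in for MySQL's latin1):

copyright_sign = "\u00a9"                    # ©
utf8_bytes = copyright_sign.encode("utf-8")  # b'\xc2\xa9'

# Read back as latin1/cp1252, the two bytes become two characters:
print(utf8_bytes.decode("cp1252"))           # Â©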

How to disable neo4j-import quotation checking

I am trying to import a large CSV dataset into Neo4j using the neo4j-import tool. Quotation marks are not used anywhere, and therefore I get parsing errors when using --quote ", --quote ', --quote ´ and the like. Even choosing very rare Unicode characters doesn't help with this multi-gigabyte CSV, because it also contains Arabic letters, math symbols and everything else you can imagine.
So: Is there a way to disable the quotation checking completely?
Perhaps it would be useful to have the import tool accept character configuration values specified as ASCII codes. If so, then you could specify --quote \0 and no character would match. That would also be useful for specifying other special characters in general, I'd guess.
You need to make sure the CSV file uses quotation marks, since they allow the tool to reliably determine when strings end.
Any string in your data file might contain the delimiter character (a comma, by default). Even if there were a way to turn off quotation checking, the tool would treat every delimiter character as the end of a field. Therefore, any string field that happened to contain the delimiter character would be terminated prematurely, causing errors.
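To see why, here is a small Python illustration (not neo4j-import itself) of what happens to a field that contains the delimiter once quote handling is turned off:

import csv, io

row = ["n1", "Doe, John", "Person"]

# Written with quoting, the embedded comma survives the round trip:
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
line = buf.getvalue()
print(line.strip())                        # n1,"Doe, John",Person
print(next(csv.reader([line])))            # ['n1', 'Doe, John', 'Person']

# Read with quote handling disabled, the same line splits into 4 fields:
print(next(csv.reader([line], quoting=csv.QUOTE_NONE)))
# ['n1', '"Doe', ' John"', 'Person']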

How to detect Thai language in a SQL query

I have a string column in a table, and some of those strings contain Thai text. An example of a Thai string is:
อักษรไทย
Is there such way to query/find a string like this in a column?
You could search for strings that start with a character in the Thai Unicode block (i.e. between U+0E01 and U+0E5B):
WHERE string BETWEEN 'ก' AND '๛'
Of course this won't include strings that start with some other character and go on to include Thai language, such as those that start with a number. For that, you would have to use a much less performant regular expression:
WHERE string RLIKE '[ก-๛]'
Note however the warning in the manual:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
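If it is easier to test outside the database, the same Thai-block check is simple in a client language; a minimal Python sketch (how you fetch the rows is up to you):

import re

# Same range as the SQL above: Thai block U+0E01 (ก) .. U+0E5B (๛)
THAI = re.compile("[\u0e01-\u0e5b]")

def contains_thai(s: str) -> bool:
    # True if the string contains at least one Thai character
    return THAI.search(s) is not None

print(contains_thai("อักษรไทย"))    # True
print(contains_thai("hello 123"))   # False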
You can do some back and forth conversion between character sets.
where convert(string, 'AL32UTF8') =
convert(convert(string, 'TH8TISASCII'), 'AL32UTF8', 'TH8TISASCII' )
will be true if string is made only of Thai and ASCII characters, so if you add
AND convert(string, 'AL32UTF8') != convert(string, 'US7ASCII')
you filter out the strings made only of ASCII and you get the strings made of Thai.
Unfortunately, this will not work if your strings contain something outside of ASCII and Thai.
Note: some of the convert calls may be superfluous, depending on your database's default encoding.
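The same keep-Thai-plus-ASCII-but-not-pure-ASCII logic can be sanity-checked outside the database; a rough Python sketch using the TIS-620 codec (assuming, as with TH8TISASCII, that your Thai data fits that character set):

def fits_thai_plus_ascii(s: str) -> bool:
    # True if every character can be encoded as TIS-620 (Thai + ASCII)
    try:
        s.encode("tis-620")
        return True
    except UnicodeEncodeError:
        return False

def is_thai_string(s: str) -> bool:
    # Thai + ASCII only, and not pure ASCII
    return fits_thai_plus_ascii(s) and not s.isascii()

print(is_thai_string("อักษรไทย"))   # True
print(is_thai_string("hello"))      # False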

special characters with Net::Twitter::Lite

I am trying to send characters like ü, ä, ß, à and so on to Twitter. If I use Unicode characters in my scripts, they come out wrong on Twitter. If I use HTML entities (which is possible in Twitter's web interface and which used to work previously), I now see &uuml; rather than "ü" in the post. Is there a parameter or something that I have to set? Some call to encode/decode? I am using:
use Net::Twitter::Lite::WithAPIv1_1;
I find myself checking out the test suites of Perl modules quite often, as they are a good source of examples.
Net::Twitter expects decoded characters, not encoded bytes. So, sending
encoded utf8 to Net::Twitter will result in double encoded data.
Source: https://metacpan.org/source/MMIMS/Net-Twitter-Lite-0.12006/t/unicode.t
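The distinction between decoded characters and encoded bytes is the same in any language; a short Python illustration (not Net::Twitter itself) of what the resulting double encoding looks like:

text = "ü"                     # decoded character: what the module expects
raw = text.encode("utf-8")     # encoded bytes: b'\xc3\xbc'

# Handing over already-encoded bytes means they get encoded a second time,
# and the post then shows mojibake instead of the umlaut:
double = raw.decode("cp1252").encode("utf-8")
print(double.decode("utf-8"))  # Ã¼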
Try
use encoding 'utf8';
at the beginning of your script. Sometimes this is the solution to many UTF-8 problems.

charset-utf8 and character entities

I am proposing to convert my windows-1252 XHTML web pages to UTF-8.
I have the following character entities in my coding:
&#39; — apostrophe,
&#9658; — right pointer,
&#9668; — left pointer.
If I change the charset and save the pages as UTF-8 using my editor:
the apostrophe remains as a character entity;
the pointers are converted to symbols within the code (presumably because the entities are not supported in UTF-8?).
Questions:
If I understand UTF-8 correctly, you don't need to use the entities and can type characters directly into the code. In which case, is it safe for me to replace &#39; with a typed-in apostrophe?
Is it correct that the editor has placed the pointer symbols directly into my code, and will these be displayed reliably in modern browsers? It seems to be OK. Presumably I can't revert to the entities anyway, if I use UTF-8?
Thanks.
It's charset, not chartset.
1) It depends on where the apostrophe is used. It is a valid ASCII character as well, so depending on the character's intention (whether it is for display only, inside a DOMText node, or used in code) you may or may not be able to use a literal apostrophe.
2) If your editor is a modern editor, it will be using UTF-8 sequences rather than single bytes to represent text. Most of the characters used in code are plain ASCII (and ASCII is a subset of UTF-8), so those characters take up one byte. Other characters may take up two, three or even four bytes; they will still be displayed to you as one character, but the relation between character and byte is no longer one-to-one.
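A quick way to see those byte counts (a Python sketch; any tool that shows the raw bytes will do):

for ch in ["a", "é", "►", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())
# a 1 61          plain ASCII: one byte, identical in ASCII and UTF-8
# é 2 c3a9        two bytes
# ► 3 e296ba      three bytes
# 😀 4 f09f9880    four bytes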
Anyway, since all valid ASCII characters are exactly the same in ASCII, UTF-8 and even windows-1252, you should not see any problems using UTF-8. And you can still use numeric and named entities, because they are written using those valid characters. You just don't have to.
P.S. All modern browsers can do UTF-8 just fine, but our definitions of "modern" may vary.
Entities have three purposes: Encoding characters it isn't possible to encode in the character encoding used (not relevant with UTF-8), encoding characters it is not convenient to type on a given keyboard, and encoding characters that are illegal unescaped.
&#9658; should always produce ► no matter what the encoding. If it doesn't, it's a bug elsewhere.
► directly in the source is fine in UTF-8. You can do either that or the entity, and it makes no difference.
' is fine in most contexts, but not in some. The following are both allowed:
<span title="Jon's example">This is Jon's example</span>
But it would have to be encoded in:
<span title='Jon&#39;s example'>This is Jon's example</span>
because otherwise it would be taken as the ' that ends the attribute value.
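If the markup is generated by code, this escaping can be automated; for instance, Python's html.escape (used here purely as an illustration) escapes the apostrophe when quote=True:

from html import escape

value = "Jon's example"

# quote=True (the default) also escapes " and ', so the result is safe
# inside a single- or double-quoted attribute value:
attr = escape(value, quote=True)    # Jon&#x27;s example
text = escape(value, quote=False)   # Jon's example (only &, <, > escaped)

print("<span title='%s'>%s</span>" % (attr, text))
# <span title='Jon&#x27;s example'>Jon's example</span>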
Use entities if you copy/paste content from a word processor or if the code is an XML dialect. Use a macro in your text editor to find/replace the common ones in one shot (a scripted sketch of the same idea follows the list below). Here is a simple list:
Half: ½ => &frac12;
Acute Accent: é => &eacute;
Ampersand: & => &amp;
Apostrophe: ’ => &#39;
Backtick: ‘ => &#96;
Backslash: \ => &#92;
Bullet: • => &bull;
Dollar Sign: $ => &#36;
Cents Sign: ¢ => &cent;
Ellipsis: … => &hellip;
Emdash: — => &mdash;
Endash: – => &ndash;
Left Quote: “ => &ldquo;
Right Quote: ” => &rdquo;
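The find/replace macro can also be scripted; here is a minimal Python sketch over a hand-picked mapping (the entity choices simply mirror the list above and can be extended):

# Word-processor characters and the entities to substitute for them
ENTITIES = {
    "½": "&frac12;", "é": "&eacute;", "’": "&#39;",
    "‘": "&#96;",    "•": "&bull;",   "¢": "&cent;",
    "…": "&hellip;", "—": "&mdash;",  "–": "&ndash;",
    "“": "&ldquo;",  "”": "&rdquo;",
}

def entify(text: str) -> str:
    # Escape & first so the entities inserted below are not re-escaped
    text = text.replace("&", "&amp;")
    for char, entity in ENTITIES.items():
        text = text.replace(char, entity)
    return text

print(entify("“Fish & Chips” – only ½ price…"))
# &ldquo;Fish &amp; Chips&rdquo; &ndash; only &frac12; price&hellip;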
References
XML Entity Names