Problems with character sets

I have MS Access 2010 forms linked to a MySQL 5 (utf8) database.
I have the following data stored in a varchar field:
"Jaros&#322;aw Kot"
MS Access just displays this raw, as opposed to converting it to:
Jarosław Kot
Can anyone offer assistance?
Thanks Paul

The notation &#322; is a character reference in SGML, HTML, and XML. There is in general no reason to expect any software to treat it as anything but a literal string of six characters “&”, “#”, etc., unless the software is interpreting the data as SGML, HTML, or XML.
So if you have data stored so that &#322; should be interpreted as a character reference, then you should convert the data, statically or dynamically. The specifics depend on what the actual data looks like: for example, do all the constructs use decimal notation (not hexadecimal), and is it certain that all numbers are to be interpreted as Unicode numbers for characters?

If I understand you correctly, you can use the Replace function:
Replace("Jaros&#322;aw Kot", "&#322;", "ł")

Assuming that your MySQL database character set is effectively set to UTF-8 and that all inserts and updates are UTF-8 compatible (I do not know that much about MySQL, but SQL Server has some specific syntax rules for UTF-8-compliant data...), you could then convert the available HTML data to plain UTF-8 data.
You will not have any problem finding a conversion table (here for example), and, if you are lucky, you could even find a conversion function...
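As an illustration of such a conversion function, here is a minimal sketch in Python rather than VBA. It relies on html.unescape from the standard library, which decodes decimal references such as &#322;, hexadecimal ones such as &#x142;, and also named entities such as &amp;. The helper name decode_character_references is hypothetical, and whether decoding named entities is acceptable depends on the actual data.

```python
import html


def decode_character_references(text: str) -> str:
    """Replace HTML/XML character references with the characters they name.

    Hypothetical helper: it simply delegates to html.unescape, so named
    entities (for example &amp;) are decoded as well as numeric ones.
    """
    return html.unescape(text)


raw = "Jaros&#322;aw Kot"
print(decode_character_references(raw))
```

A one-off conversion of the stored column with a function like this is the "static" option; decoding on every read is the "dynamic" one.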

Related

Matching MS Access accented characters - collation in MS Access

My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom line questions are: "exactly" what encodings ("sort orders") is Office 365 Access supporting in its most recent releases, and is VBA using the same character set, or will working with accented characters in VBA experience translations or issues when being used within MS Access?

You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note that new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions of SQL Server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (the Unicode compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on different computers and different charsets. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always a BSTR which uses UTF-16 characters.
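To make the "two bytes per character" wording concrete, the following sketch (Python, not VBA, and only about UTF-16 itself, not about Access internals) encodes the example name to UTF-16 and counts bytes. It also shows why UTF-16 is not the same as UCS-2: a character outside the Basic Multilingual Plane takes two 16-bit units. The sample strings are chosen for illustration.

```python
# "Dvořák" written with escapes so the snippet does not depend on file encoding.
name = "Dvo\u0159\u00e1k"
utf16_name = name.encode("utf-16-le")  # little-endian UTF-16 without a BOM
print(len(name), len(utf16_name))      # characters vs. bytes

# A character outside the BMP (U+1F600) needs a surrogate pair in UTF-16.
emoji = "\U0001F600"
utf16_emoji = emoji.encode("utf-16-le")
print(len(emoji), len(utf16_emoji))
```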
Regarding sort orders/collations (how strings are compared):
Access has no full support for collations, and no specific case sensitive/case insensitive and accent sensitive/accent insensitive collations.
It does support different sort orders, which determine how strings are sorted and which characters are considered equal. An outdated list can be found here. Using the Object Browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use code page 65001 (= UTF-8), but I haven't seen docs for them or experimented with them.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Nearly all VBA host applications support only two: Option Compare Binary, where any difference makes the strings unequal and sorts are case sensitive, and Option Compare Text, which uses the local language settings to compare strings. Access adds a third, Option Compare Database, which uses the database sort order to compare strings.
Note that not all functions support all Unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters not in the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx, which has options to do case-insensitive, diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr. Passing a string as a String will automatically convert it from a BSTR to a pointer to a null-terminated string in the system code page. See this answer for a basic example of how to call a WinAPI function for a Unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc.
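For readers who want to see what a diacritic-insensitive comparison does, here is a rough sketch in Python. It is not CompareStringEx and not anything Access provides: it swaps in a different technique, Unicode NFD decomposition followed by dropping combining marks, using only the standard unicodedata module. The function names are hypothetical, and this approach does not handle letters that have no decomposition (such as "ø" or "ł").

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ignoring_case_and_accents(a: str, b: str) -> bool:
    """Compare two strings after removing accents and folding case."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print("Dvorak" == "Dvo\u0159\u00e1k")                                 # plain comparison
print(equal_ignoring_case_and_accents("Dvorak", "Dvo\u0159\u00e1k"))  # accents ignored
```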

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the text written with :
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then to insert it in my database I do :
text= str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It is working for one of my text areas, and for the other ones it puts weird characters in my db instead.
Any hint appreciated on this !
RESOLVED :
sorry for the disturbance, apparently I had some fields in my database that weren't up to date and this was blocking the process of encoding somehow.

You are doing a lot of encoding, decoding, and reencoding which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like SQLAlchemy, you probably don't need to do anything. If you use MySQLdb directly, make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQt is a Unicode value. I don't know PyQt. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string which is probably encoded in UTF-8, so you can decode it with theresult.decode('utf8'), which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQT to MySQL.
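The "decode once at the boundary, then pass Unicode everywhere" idea can be sketched as follows. This is Python 3, and sqlite3 from the standard library is used purely as a stand-in for the MySQL connection so that the snippet is self-contained; with MySQLdb the connection would instead be opened with the charset argument described above. The helper to_text and the table name notes are hypothetical.

```python
import sqlite3


def to_text(value, encoding="utf-8"):
    """Return a Unicode string, decoding only if a byte string was given."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in database; an assumption made only to keep the example runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# Simulates a UI widget that hands back UTF-8 encoded bytes.
from_widget = "\u00e9, \u00e0, \u00e8".encode("utf-8")

# Decode once, then let the driver handle the Unicode string via a parameter.
conn.execute("INSERT INTO notes (body) VALUES (?)", (to_text(from_widget),))
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
print(stored)
```

No chain of encode and decode calls is needed: the only conversion is the single decode of the incoming bytes.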

MySql collecting database only English but not other

I have a comment section and a submission form that any of my members can use.
If a member posts in English, I receive an email update and the comment is posted without any problem. But if they use a language other than English, for example Thai, then all the words, say สวัสดี, appear as ??????
I don't know why. I checked my php.ini file and the encoding is set to UTF-8, and the MySQL collation is set to UTF-8 as well. I made sure the meta tag is set to UTF-8 in the .html/.php files too, but the problem persists.
Any suggestions on what else I missed in the configuration?

Make sure you are using multibyte safe string functions or you might be losing your UTF-8 encoding.
From the PHP mbstring manual:
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (A byte is made up of eight bits. Each bit can contain only two distinct values, one or zero. Because of this, a byte can only represent 256 unique values (two to the power of eight)). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.
When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
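The effect described in the manual can be reproduced in a few lines. The sketch below uses Python rather than PHP, with the Thai word from the question written as escapes; cutting the UTF-8 bytes at an arbitrary offset is the analogue of applying a byte-oriented string function, while slicing the decoded string is the analogue of a multibyte-aware one. The helper name cut_is_valid_utf8 is hypothetical.

```python
# The Thai greeting from the question, as six code points.
text = "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35"
data = text.encode("utf-8")  # each of these code points takes 3 bytes in UTF-8
print(len(text), len(data))


def cut_is_valid_utf8(raw: bytes, n: int) -> bool:
    """Report whether the first n bytes still form valid UTF-8."""
    try:
        raw[:n].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(cut_is_valid_utf8(data, 3))  # cut on a character boundary
print(cut_is_valid_utf8(data, 4))  # cut in the middle of the second character
print(text[:2])                    # slicing by character never splits a character
```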

I just found out what was causing the problem.
In php.ini, the line mbstring.internal_encoding was set to something else, so I set it to UTF-8 and, like magic, everything now works!

MySQL charset needed

I'm developing an application for learning a native language. I need to store some characters such as 'ẽũ'. My database is set to the utf-8 charset with the default collation, as is the table affected by these characters.
The problem is when I try to add a row using a regular SQL insert:
INSERT INTO text(spanish,guarani) VALUES('text','ẽũ');
This throws a warning:
Warning Code : 1366 Incorrect string value: '\xE1\xBA\xBD\xC5\xA9' for column 'guarani' at row 1
And the result is "??" where there are those characters.
Question: These characters are not covered by the UTF-8 charset? Which one do I need?
Note: Same problem with latin-1
Thanks.

QUICK!!! Read http://www.joelonsoftware.com/articles/Unicode.html
It is required reading.
Once you have read that, you should ask yourself:
What encoding is the connection using?
What locale is the collation using? (If applicable.)
What encoding is the SQL statement in?
What encoding are the string literals in?
What encoding is the html form presented in?

As the other answer says, you really should read and understand the basics of Unicode.
It's not difficult (you can grasp it in one day), it's required knowledge for almost every programmer (and certainly for you), it's not ephemeral knowledge, and it will make your life simpler and happier.
These characters are not covered by the UTF-8 charset?
UTF-8 is an encoding of Unicode, and Unicode covers (practically) every character. MySQL's 'utf8' encoding, on the other hand, is not true UTF-8: it leaves some characters out (those outside the BMP). But that is not your problem here.
http://www.fileformat.info/info/unicode/char/1ebd/index.htm
http://www.fileformat.info/info/unicode/char/169/index.htm
You can see there that your two characters are valid Unicode and are inside the BMP (hence MySQL's crippled 'utf8' should support them), and you can even see their UTF-8 encoding. And, as you see, \xE1\xBA\xBD\xC5\xA9 seems just right. So the problem seems to be elsewhere. Are you sure your DB is utf8?
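The check described above can be repeated without the lookup site. This short Python sketch decodes the bytes quoted in the warning and prints, for each resulting character, its code point, its UTF-8 bytes, and whether it lies inside the BMP (code point at most U+FFFF). Only standard string and bytes methods are used.

```python
# Bytes exactly as reported in warning 1366.
reported = b"\xE1\xBA\xBD\xC5\xA9"
decoded = reported.decode("utf-8")
print(decoded)

for ch in decoded:
    code_point = f"U+{ord(ch):04X}"
    utf8_hex = ch.encode("utf-8").hex()
    in_bmp = ord(ch) <= 0xFFFF
    print(code_point, utf8_hex, in_bmp)
```

If the bytes decode cleanly like this, the client sent correct UTF-8, which points the search at the connection, table, or column character set.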

How do I find the character encoding of a ms access database?

How do I find out the character encoding for the tables in my MS Access 2003 database?
For example:
Windows-1252
ISO 8859-1
US-ASCII

Is there something not working with CurrentDB.CollatingOrder? I don't know where you look up the value of the resulting number, but in my American DBs, it returns 1033, which is quite familiar as the American English character set.
Ah, yes, if I go into the Object Browser in the VBE and search for CollatingOrder, one of the results shows an ENUM called CollatingOrderEnum, and by clicking on each in turn, you can see its value.
DBEngine(0)(0).CollatingOrder is the same property, and can be used with DAO from outside Access. There is, perhaps, a way to get it with ADO/OLEDB, but I don't use either of them so can't point you in the right direction there.

Beginning with Access 2000 (which was based on Jet 4.0), Access databases store text data internally as Unicode. So if your database file really is an Access 2003 database, then the DAO, ODBC, and OLEDB access methods should all return Unicode strings.
