Matching MS Access accented characters - collation in MS Access

My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom line questions are: "exactly" what encodings ("sort orders") is Office 365 Access supporting in its most recent releases, and is VBA using the same character set, or will working with accented characters in VBA experience translations or issues when being used within MS Access?

You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note that new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions of SQL Server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (the Unicode Compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on computers that use a different charset. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always BSTRs, which hold UTF-16 code units.
Regarding sort orders/collations (how strings are compared):
Access has no full support for collations, and no specific case sensitive/case insensitive and accent sensitive/accent insensitive collations.
It does support different sort orders, which determine how strings should be sorted and which characters are equal. An outdated list can be found here. Using the object browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use codepage 65001 (= UTF-8) but I haven't seen docs or experimented with it.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Nearly all VBA applications only support two: Option Compare Binary, where any difference between two strings makes them unequal and sorts are case sensitive, and Option Compare Text, which uses the local language settings to compare strings. For Access, there's a third, Option Compare Database, which uses the database sort order to compare strings.
Note that not all functions support all Unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters not in the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx which has options to do case-insensitive diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr. Passing a string as a String will automatically convert it from a BSTR to a pointer to a null-terminated string in the system codepage. See this answer for a basic example of how to call a WinAPI function with a Unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc.
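To make concrete what a case-insensitive, diacritic-insensitive comparison does, here is a minimal sketch in Python rather than VBA. It does not call CompareStringEx; it swaps in Unicode normalization (decompose, drop combining marks, case-fold) as an approximation, and the helper names strip_accents and loosely_equal are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and diacritics."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print(loosely_equal("Dvorak", "Dvořák"))  # True
print(loosely_equal("Dvorak", "Dvorac"))  # False
```

Letters that have no canonical decomposition, such as "ł", pass through unchanged, so this approach is narrower than a real linguistic collation.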

Related

MS Access VBA - how to read utf8 text independently from Windows system locale

I have a MySQL table with a text field. It contains a hyperlink, and it is encoded in utf8 (utf8-unicode-ci collation). I want to open the hyperlink programmatically from VBA.
The text field may contain characters like "őűö", which are not present in western European codepage (1252), but available in central European (1250).
My first attempt was to run a pass-through query, read the field value into a VBA string, and open it with Application.FollowHyperlink. It works when the Windows system locale (the default codepage for non-Unicode applications in regional settings) is Hungarian (codepage 1250), and fails when the system locale is German (codepage 1252). The VBA string contains a value converted to the codepage specified by the system locale. So "C:\tükörtűrő" will be read as "C:\tukorturo".
I am not allowed to fix the system locale on 100+ computers. So, how to do it right?
Edit:
Lessons learned:
Debug.Print doesn't support Unicode – as stated by Erik von Asmuth. The displayed text in the debug window is misleading.
Application.FollowHyperlink can handle Unicode.
The real problem was a link health check right before opening the link, where I used the built-in GetAttr(), which depends on system locale settings. I have replaced it with GetFileAttributesW(), and everything seems to work now. Some credit goes here to Bonnie West. (https://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=74264&lngWId=1)

VBA and Access use UTF-16 internally, not the system codepage, so this shouldn't be a problem at all. Pass-through queries should just work. However:
You need to use the MySQL Unicode driver, not the MySQL ANSI driver.
Not all VBA functions support Unicode characters. For example, MsgBox is ANSI only, and will cast unavailable characters to either question marks or the closest equivalent ANSI character.
The VBA code itself is not Unicode. You can see this answer for an approach to set strings to characters that are unavailable in the codepage used by VBA.
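The difference between UTF-16 storage and a codepage conversion can be illustrated with a small sketch, written in Python rather than VBA. It assumes Python's cp1250 and cp1252 codecs as stand-ins for the Windows codepages. Note that Python substitutes "?" for unavailable characters, whereas Windows may instead pick the closest equivalent, as described above.

```python
path = "C:\\tükörtűrő"

# UTF-16 keeps every character; each one here takes 2 bytes.
utf16 = path.encode("utf-16-le")
print(len(path), len(utf16))  # 12 24

# Codepage 1250 (Central European) can represent the whole string.
round_trip = path.encode("cp1250").decode("cp1250")
print(round_trip == path)  # True

# Codepage 1252 (Western European) has no ű or ő, so those are lost.
lossy = path.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)  # C:\tükört?r?
```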

TCL: what is the difference between the format, binary format, scan and binary scan commands?

Can anyone explain the differences between scan and binary scan, and between format and binary format?
I am getting confused by the binary commands.

To understand the difference between command sets manipulating binary and string data you have to understand the distinction between these two kinds of data.
In Tcl, as in many (most?) high-level languages, strings are rather abstract — that is, they are described in pretty high-level terms. Particularly in Tcl, strings are defined to have the following properties:
They contain characters from the Unicode repertoire.
The Tcl runtime provides the set of standard commands to operate on strings — such as indexing, searching, appending to, extracting a substring etc.
Note that many things are left out from this definition:
The encoding in which these Unicode characters are stored.
How exactly they are stored (NUL-terminated arrays? linked lists of unsigned longs? something else?).
(To put it into a more interesting perspective, Tcl is able to transparently change the underlying representations of strings it manages — between UTF-8 and UTF-16 encoded sequences. But here we're talking about the reference Tcl implementation, and other implementations (such as Jacl for instance) are free to do something else completely.)
The same approach is used to manipulate all the other kinds of data in the Tcl interpreter. Say, integer numbers are stored using native platform "integers" (roughly "as in C") but they are transparently upgraded into arbitrary sized integers if an arithmetic operation is about to overflow the platform-sized result.
So long as you don't leave the comfortable world of the Tcl interpreter, this is all you should know about the data types it manages. But now there's the outside world. In it, abstract concepts which are Tcl strings do not exist. Say, if you need to communicate to some other program over a network socket or by means of using a file or whatever other kind of media, you have to get down to the level of exact layouts of raw bytes which are described by "wire protocols" and file formats or whatever applies to your case. This is where "binaries" come into play: they allow you to precisely specify how the data is laid out so that it's ready to be transferred to the outside world or be consumed from it — binary format makes these "binaries" and binary scan reads them.
Note that certain Tcl commands for working with the outside world are "smart by default": for instance, the open command, which opens files, by default assumes they are textual and are encoded in the default system encoding (which is deduced, broadly speaking, from the environment). You can then use the chan configure command (or fconfigure in older versions of Tcl) to either change this encoding or completely inhibit conversions by specifying that the channel is in "binary mode". The same applies to EOL conversions.
Note also that there are specialized packages for Tcl that effectively hide the complexities of working with a particular wire/file format. To present one example, the tdom package works with XML; when you manipulate XML using this package, you're not concerned with how exactly XML must be represented when, say, saved to a file — you just work with tdom's objects, native Tcl strings etc.

The docs are pretty good and contain examples:
scan: http://www.tcl.tk/man/tcl8.6/TclCmd/scan.htm
format: http://www.tcl.tk/man/tcl8.6/TclCmd/format.htm
binary scan: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M42
binary format: http://www.tcl.tk/man/tcl8.6/TclCmd/binary.htm#M16
Maybe you could ask a more specific question?

The format command assembles strings of characters, the binary format command assembles strings of bytes. The scan and binary scan commands do the reverse, extracting information from character strings and byte strings respectively.
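As a rough analogy in Python rather than Tcl: str.format plays the role of format, while struct.pack and struct.unpack play roles similar to binary format and binary scan. The names are Python's, not Tcl's, and the field layout (a big-endian 16-bit value followed by a 32-bit value) is an arbitrary example.

```python
import struct

# Character string: text meant to be read as characters.
text = "{:05d} items".format(42)
print(text)  # 00042 items

# Byte string: an exact layout, here big-endian unsigned 16-bit + 32-bit.
packet = struct.pack(">HI", 1, 42)
print(packet.hex())  # 00010000002a

# The reverse direction recovers the numbers from the raw bytes.
version, count = struct.unpack(">HI", packet)
print(version, count)  # 1 42
```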
Note that Tcl happens to map byte strings neatly onto character strings where the characters are in the range \u0000–\u00FF, and there are other operations for getting information into and out of binary strings that are sometimes relevant. Most notably, encoding convertto and encoding convertfrom: encoding convertto formats a string as a sequence of bytes that represent that string in a given encoding (an operation which can lose information) and encoding convertfrom goes in the opposite direction.
So what encoding are Tcl's strings really in? Well, none really. Or many. The logical level works with character sequences exclusively, and the implementation will actually move things back and forth (mostly between a variant of UTF-8 and UCS-2, though with optimisations for handling byte strings via arrays of unsigned char) as necessary. While this is not always perfectly efficient, most code never notices what's going on due to the type-caching used.
If you have Tcl 8.6, you can peek behind the covers to observe the types with an unsupported command:
# Output is human-readable; experiment to see what it says for you
puts [tcl::unsupported::representation $MyString]
Don't use this to base functional decisions on; Tcl is very happy to mutate types out from under your feet. But it can help when finding out why your code is unexpectedly slow. (Note also that types attach to values, and not to variables.)

MS-Access VBA magically converting unicode strings?

First, I admit not being a VB expert, but I was asked to check our database system taking care of handling the languages of our application. The issue is that some characters with accent seem to magically be converted without them.
For example, the Polish word "przesunąć" will be stored as "przesunac" in the record field at the time of the call to Recordset.MoveNext. "Unicode Compression" is set to true on that column, but I doubt it's related. I'm trying to find out what makes this magic conversion because I don't want it.

Someone stated at http://www.pcreview.co.uk/forums/no-unicode-dao-recordset-t1102041.html that "the Recordset contains correct data but that the Debugger window and Tooltips can't display Unicode strings". Interesting. Dumb, but interesting.
Fine, but why are the strings in ANSI in the file? Well, the next post in the same thread reads "If you want to write in Unicode with VBA, my feeling would be that you must write in binary mode; not in Text mode." This led me to http://accessblog.net/2007/06/how-to-write-out-unicode-text-files-in.html where I got my final answer.
Case solved.

Problems with character sets

I have MS Access 2010 forms linking to a MySQL 5 (utf8) database.
I have the following data stored in a varchar field:
"Jaros&#322;aw Kot"
MS Access just displays this raw, as opposed to converting it to:
Jarosław Kot
Can anyone offer assistance?
Thanks Paul

The notation &#322; is a character reference in SGML, HTML, and XML. There is in general no reason to expect any software to treat it as anything but a literal of six characters “&”, “#”, etc., unless the software is interpreting the data as SGML, HTML, or XML.
So if you have data stored so that &#322; should be interpreted as a character reference, then you should convert the data, statically or dynamically. The specifics depend on what actual data there is: for example, do all the constructs use decimal notation (not hexadecimal), and is it certain that all numbers are to be interpreted as Unicode numbers for characters?
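As an illustration of converting such data, here is a minimal sketch in Python rather than VBA, using the standard library's html.unescape. It assumes the stored text really is meant to be read as HTML character references.

```python
import html

raw = "Jaros&#322;aw Kot"

# html.unescape replaces numeric and named character references
# with the characters they stand for.
decoded = html.unescape(raw)

print(len(raw), len(decoded))  # 17 12
print(decoded == "Jaros\u0142aw Kot")  # True
```

Text that contains no character references is returned unchanged.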

If I understand you correctly you can use Replace function:
Replace("Jaros&#322;aw Kot", "&#322;", "ł")

Assuming that your mySQL database character set is effectively set to UTF8 and that all inserts and updates are utf8 compatible (I do not know that much about mySQL, but SQL Server has some specific syntax rules for utf8-compliant data...), you could then convert the available HTML data to plain UTF8 data.
You will not have any problem finding some conversion table (here for example), and, if you are lucky, you could even find a conversion function...

How do I find the character encoding of a ms access database?

How do I find out the character encoding for the tables in my MS Access 2003 database?
For example:
Windows-1252
ISO 8859-1
US-ASCII

Is there something not working with CurrentDB.CollatingOrder? I don't know where you look up the value of the resulting number, but in my American DBs, it returns 1033, which is quite familiar as the locale ID (LCID) for American English.
Ah, yes, if I go into the Object Browser in the VBE and search for CollatingOrder, one of the results shows an ENUM called CollatingOrderEnum, and by clicking on each in turn, you can see its value.
DBEngine(0)(0).CollatingOrder is the same property, and can be used with DAO from outside Access. There is, perhaps, a way to get it with ADO/OLEDB, but I don't use either of them so can't point you in the right direction there.

Beginning with Access 2000 (which was based on Jet 4.0), Access databases store text data internally as Unicode. So if your database file really is an Access 2003 database then the DAO, ODBC, and OLEDB access methods should all return Unicode strings.
