MS-Access VBA magically converting unicode strings? - ms-access

First, I admit not being a VB expert, but I was asked to check our database system taking care of handling the languages of our application. The issue is that some characters with accent seem to magically be converted without them.
For example, the Polish word "przesunąć" will be stored as "przesunac" in the record field at the time of the call to Recordset.MoveNext. "Unicode Compression" is set to true on that column, but I doubt it's related. I'm trying to find out what makes this magic conversion because I don't want it.

Someone stated at http://www.pcreview.co.uk/forums/no-unicode-dao-recordset-t1102041.html that " the Recordset contains correct data but that the Debugger window and Tooltips can't display Unicode strings". Interesting. Dumb, but interesting.
Fine, but why are the strings in ANSI in the file? Well, the next post in the same thread reads "If you want to write in Unicode with VBA, my feeling would be that you must
write in binary mode; not in Text mode." This lead me to http://accessblog.net/2007/06/how-to-write-out-unicode-text-files-in.html where I got my final answer.
Case solved.

Related

Matching MS Access accented characters - collation in MS Access

My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom line questions are: "exactly" what encodings ("sort orders") is Office 365 Access supporting in its most recent releases, and is VBA using the same character set, or will working with accented characters in VBA experience translations or issues when being used within MS Access?
You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions SQL server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (unicode compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on different computers and different charsets. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always a BSTR which uses UTF-16 characters.
Regarding sort orders/collations (how strings are compared):
Access has no full support for collations, and no specific case sensitive/case insensitive and accent sensitive/accent insensitive collations.
It does support different sort orders, which determines how strings should be sorted and which characters are equal. An outdated list can be found here. Using the object browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use codepage 65001 (= UTF-8) but I haven't seen docs or experimented with it.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Nearly all VBA applications only support two: Option Compare Binary, any inequality is an unequal string and sorts are case sensitive, and Option Compare Text, use the local language settings to compare strings. For Access, there's a third, Option Compare Database, use the database sort order to compare strings.
Note that not all functions support all unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters not in the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx which has options to do case-insensitive diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr, passing strings as a string will automatically convert them from a BSTR to a pointer to a null-terminated string in the system codepage. See this answer for a basic example how to call a WinAPI function for a unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc.

Unreadable Text in JSON

I've been working a bit with some files from Minecraft Dungeons, which were extracted using QuickBMS and made available here: https://minecraft.fandom.com/wiki/Minecraft_Wiki:Minecraft_Dungeons_game_files
In the "data" folder, there are a bunch of json-files, which I believe contain a list of textures associated with any given stage of the game. There is, however, a problem. When opened, it reads like any json-file, it has a bunch of names and values, but some of the values are not human-readable, they instead show up as a string of seemingly unrelated characters. Here an example:
"walkable-plane" : "eNpjYSEOMIMAOp+ZmQmND1fEjF2AiQldAJsWDEPRXUKkowkDAM/qA6o=",
Now, given that these are exclusively characters, and not error signs or something of the sorts, I'm assuming this is an encoding issue. Of course, I don't know for sure, Or I wouldn't be asking this in the first place, but the file as it appears in the text is UTF-8, and it obviously doesn't produce a usable result. So, if anyone knows what exactly this is, and how I could extract information from it, I'd be really thankful.

Writing "latin1" encoded byte array with escape characters

I'm working on a database import/export process in VB.NET which writes data from a MySQL (5.5) database to a plain text file. The application reads the data to a DataTable, then goes through the rows/columns to actually write the data to the OutputFile (System.IO.StreamWriter object). The encoding on the tables in this database is Latin1. There is a MediumBlob field in one of the tables I've been using for testing which contains image files stored as a byte array.
In my attempts to validate the output from my application, I've exported the data directly from the database using the MySQL Workbench, then compared that with the results I get when I write the same data from my application. In the direct export from MySQL Workbench, I see some of these bytes are exported with the backslash. When I read the data through my application, however, this escape character does not appear. Viewed through Notepad++, it clearly shows some distinct differences between the two output results (see screenshot).
Obviously, while apparently very similar, the two are not completely identical. My application is not including the backslashes for escaped characters, and some characters such as NULL are coming out differently altogether. My code for writing this field to the file is:
OutputFile.Write("'" & System.Text.Encoding.GetEncoding(28591).GetString(CType(COPYRow(ColumnIndex), Byte())) & "'")
There doesn't appear to be an overload for the GetString method that allows me to specify an escape character, so I'm wondering if there's another way that, using this method, I can ensure the characters are correctly encoded, including escape characters.
I'm "assuming" that this method should also work in general when I start working with my PostgreSQL database, but with possibly a different encoding. I'm trying to build things as "generic" as possible, but I'll have to worry about specifying encodings at run-time instead of hard-coding them later.
EDIT
I just ran across another SO question, which might point me in the right direction: Convert a Unicode string to an escaped ASCII string. Obviously, it might take a bit more work to get it right, but this looks like the closest thing to what I'm trying to accomplish.

encoding issues between python and mysql

I have a weird encoding problem from my PyQt app to my mysql database.
I mean weird in the sense that it works in one case and not the other ones, even though I seem to be doing the exact same thing for all.
My process is the following:
I have some QFocusOutTextEdit elements in which I write text possibly containing accents and stuff (é,à,è,...)
I get the text written with :
text = self.ui.text_area.toPlainText()
text = text.toUtf8()
Then to insert it in my database I do :
text= str(text).decode('unicode_escape').encode('iso8859-1').decode('utf8')
I also set the character set of my database, the specific tables and the specific columns of the table to utf8.
It is working for one my text areas, and for the other ones it puts weird characters instead in my db.
Any hint appreciated on this !
RESOLVED :
sorry for the disturbance, apparently I had some fields in my database that weren't up to date and this was blocking the process of encoding somehow.
You are doing a lot of encoding, decoding, and reencoding which is hard to follow even if you know what all of it means. You should try to simplify this down to just working natively with Unicode strings. In Python 3 that means str (normal strings) and in Python 2 that means unicode (u"this kind of string").
Arrange for your connection to the MySQL database to use Unicode on input and output. If you use something high-level like Sqlalchemy, you probably don't need to do anything. If you use MySQLdb directly make sure you pass charset="UTF8" (which implies use_unicode) to the connect() method.
Then make sure the value you are getting from PyQT is a unicode value. I don't know PyQT. Check the type of self.ui.text_area or self.ui.text_area.toPlainText(). Hopefully it is already a Unicode string. If yes: you're all set. If no: it's a byte string which is probably encoded in UTF-8 so you can decode it with theresult.decode('utf8') which will give you a Unicode object.
Once your code is dealing with all Unicode objects and no more encoded byte strings, you don't need to do any kind of encoding or decoding anymore. Just pass the strings directly from PyQT to MySQL.

Best HTML encoder for Delphi?

Seems like my data is getting corrupted when using HTTPapp.HTMLEncode( string ): String;
HTMLEncode( 'Jo&hn D<oe' ); // returns 'Jo&am'
This is not correct, and is corrupting my data. Does anyone have suggestions for VCL components that work better? Other than spending my time encoding all the cases
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
Update
After understanding more about HTML, I have found there is no need to encode the other characters referenced in my link. You would only need to know about the four HTML reserved characters being
&,<,>,"
The issue with the VCL HTTPApp.HTMLEncode( ) function is because of the buffer size and the new Delphi 2009/2010 specifications for default Unicode string types, this can be fixed the way that #mason says below, or it can be fixed with a call to WideFormatBuf( ) instead of the FormatBuf( ) that is currently in use.
Replacing the <, >, &, and " characters in a string is trivial. You could thus easily write your own routine for this. (And if your HTML page is UTF-8, there is absolutely no reason to encode any other characters, such as U+222B (the integral sign).)
But if you wish to stick to the Delphi RTL, then you can have a look at HTTPUtil.HTMLEscape with the exactly same signature as HTTPApp.HTMLEncode.
Or, have a look at this SO question.
You're probably using Delphi 2009 or 2010. It looks to me like they forgot to update HTMLEncode for Unicode. It's passing the wrong buffer lengths to FormatBuf.
The HTMLEncode routine is basically right, aside from that, and it's pretty short. You could probably just make your own copy. Everywhere it calls FormatBuf, it gives 5 parameters. The second and fourth are integer values. Double both of them in each call, (there are only four of them), and then it will work.
Also, you ought to open a QC report on this so it will get fixed.
Small hint: do not convert single quote (') to &apos; - some browsers do not understand this code because &apos; is not valid HTML
For details, see: "The Curse of &apos;" and "XHTML and '"
(Both Delphi units mentioned do not convert single quotes).