I'm trying to read some data from an SQL Server 2008 database into an Excel 2007 spreadsheet with C#, using this connection string:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=foo.xlsx;Extended Properties='Excel 12.0 XML;HDR=YES'
One of the columns in the database is a VARCHAR(1000). When I try recreating the schema in the spreadsheet, it seems like Excel's VARCHAR only supports up to 255. This page suggests that the "Total number of characters that a cell can contain" is around 32K, so in principle, it should be possible to get a longer string in.
Is there a simple way to work around the 255 char limit?
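For reference, a minimal C# sketch of the schema-creation step described above (the sheet name Data and its columns are invented; in Jet/ACE SQL, LONGTEXT is the spelling for a Memo-style text column, though whether the Excel driver honours widths beyond 255 characters is exactly what's at issue here):

// Requires: using System.Data.OleDb;
var connStr = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=foo.xlsx;" +
              "Extended Properties='Excel 12.0 XML;HDR=YES'";
using (var conn = new OleDbConnection(connStr))
{
    conn.Open();
    // CREATE TABLE against the workbook adds a worksheet named Data
    using (var cmd = new OleDbCommand(
        "CREATE TABLE [Data] (Id INT, Body LONGTEXT)", conn))
    {
        cmd.ExecuteNonQuery();
    }
}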
Although XLOPER12 now supports strings up to 32,767 Unicode characters, xlfEvaluate (and other Excel C API functions) remain limited to 255 characters in Excel 2010. They return xltypeErr if passed an XLOPER12 with a string longer than 255 characters.
All strings the user sees in Excel have for many versions now been stored internally as Unicode strings. Unicode worksheet strings can be up to 32,767 (2^15 - 1) characters in length and can contain any valid Unicode character.
When the C API was first introduced, worksheet strings were byte strings limited in length to 255 characters, and the C API reflected these limitations. With Excel 2007, the C API was updated to handle Excel's long Unicode strings. This means that DLL functions registered in the right way can accept Unicode arguments and return Unicode strings.
Note:
Byte strings are still fully supported in the C API for backward compatibility, but they retain the same 255-character limit. There is no easy solution other than truncating the string or dividing it into multiple cells; a sketch of the latter follows.
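A minimal C# sketch of the splitting workaround (connStr is the connection string from the question; longText stands in for the VARCHAR(1000) value; the [Data$] sheet with PartNo and PartText headers is an assumption):

// Requires: using System; using System.Collections.Generic; using System.Data.OleDb;
static IEnumerable<string> Chunk(string value, int size = 255)
{
    // Yield successive pieces that each fit under the 255-character limit.
    for (int i = 0; i < value.Length; i += size)
        yield return value.Substring(i, Math.Min(size, value.Length - i));
}

using (var conn = new OleDbConnection(connStr))
{
    conn.Open();
    int part = 1;
    foreach (string chunk in Chunk(longText))
    {
        // One row per chunk; OleDb parameters are positional.
        using (var cmd = new OleDbCommand(
            "INSERT INTO [Data$] (PartNo, PartText) VALUES (?, ?)", conn))
        {
            cmd.Parameters.AddWithValue("p1", part++);
            cmd.Parameters.AddWithValue("p2", chunk);
            cmd.ExecuteNonQuery();
        }
    }
}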
My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom-line questions are: "exactly" what encodings ("sort orders") does Office 365 Access support in its most recent releases, and does VBA use the same character set, or can accented characters be translated or mangled when VBA strings are used within MS Access?
You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note that new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions of SQL Server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (the Unicode compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on computers with different charsets. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always BSTRs, which use UTF-16 characters.
Regarding sort orders/collations (how strings are compared):
Access has no full collation support: there are no specific case-sensitive/case-insensitive or accent-sensitive/accent-insensitive collations.
It does support different sort orders, which determine how strings should be sorted and which characters are equal. An outdated list can be found here. Using the object browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use code page 65001 (= UTF-8), but I haven't seen docs for them or experimented with them.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Nearly all VBA applications support only two: Option Compare Binary (any byte difference makes strings unequal, and sorts are case sensitive) and Option Compare Text (use the local language settings to compare strings). Access adds a third, Option Compare Database, which uses the database sort order to compare strings.
Note that not all functions support all Unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters outside the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx which has options to do case-insensitive diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr; passing a String directly will automatically convert it from a BSTR to a pointer to a null-terminated string in the system codepage. See this answer for a basic example of how to call a WinAPI function for a Unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc. A C# rendering of the same call is sketched below.
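For illustration, here is the same CompareStringEx call made from C# via P/Invoke rather than a VBA Declare (constants taken from winnls.h; NORM_IGNORENONSPACE is, as far as I know, the flag that gives diacritic insensitivity):

// Requires: using System; using System.Runtime.InteropServices;
const uint NORM_IGNORECASE = 0x1;      // winnls.h
const uint NORM_IGNORENONSPACE = 0x2;  // ignore diacritics
const int CSTR_EQUAL = 2;              // CompareStringEx return value for equal strings

[DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
static extern int CompareStringEx(
    string lpLocaleName, uint dwCmpFlags,
    string lpString1, int cchCount1,
    string lpString2, int cchCount2,
    IntPtr lpVersionInformation, IntPtr lpReserved, IntPtr lParam);

// -1 tells the API the strings are null-terminated.
int result = CompareStringEx("en-US", NORM_IGNORECASE | NORM_IGNORENONSPACE,
    "Dvorak", -1, "Dvořák", -1, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
Console.WriteLine(result == CSTR_EQUAL); // expected: True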
I'm working on a database import/export process in VB.NET which writes data from a MySQL (5.5) database to a plain text file. The application reads the data to a DataTable, then goes through the rows/columns to actually write the data to the OutputFile (System.IO.StreamWriter object). The encoding on the tables in this database is Latin1. There is a MediumBlob field in one of the tables I've been using for testing which contains image files stored as a byte array.
In my attempts to validate the output from my application, I've exported the data directly from the database using MySQL Workbench, then compared that with the results I get when I write the same data from my application. In the direct export from MySQL Workbench, I see that some of these bytes are exported with a backslash escape. When I read the data through my application, however, this escape character does not appear. Viewed through Notepad++, there are some distinct differences between the two output results (see screenshot).
Obviously, while apparently very similar, the two are not completely identical. My application is not including the backslashes for escaped characters, and some characters such as NULL are coming out differently altogether. My code for writing this field to the file is:
OutputFile.Write("'" & System.Text.Encoding.GetEncoding(28591).GetString(CType(COPYRow(ColumnIndex), Byte())) & "'")
There doesn't appear to be an overload for the GetString method that allows me to specify an escape character, so I'm wondering if there's another way that, using this method, I can ensure the characters are correctly encoded, including escape characters.
I'm "assuming" that this method should also work in general when I start working with my PostgreSQL database, but with possibly a different encoding. I'm trying to build things as "generic" as possible, but I'll have to worry about specifying encodings at run-time instead of hard-coding them later.
EDIT
I just ran across another SO question, which might point me in the right direction: Convert a Unicode string to an escaped ASCII string. Obviously, it might take a bit more work to get it right, but this looks like the closest thing to what I'm trying to accomplish.
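For what it's worth, here is a C# sketch of byte-level escaping that mimics MySQL's dump conventions (the exact escape set is an assumption to verify against your Workbench output: NUL, newline, carriage return, Ctrl-Z, quotes and backslash, as handled by mysql_real_escape_string):

// Requires: using System.Text;
static string EscapeLatin1Bytes(byte[] data)
{
    var sb = new StringBuilder(data.Length);
    foreach (byte b in data)
    {
        switch (b)
        {
            case 0x00: sb.Append("\\0"); break;   // NUL
            case 0x0A: sb.Append("\\n"); break;   // line feed
            case 0x0D: sb.Append("\\r"); break;   // carriage return
            case 0x1A: sb.Append("\\Z"); break;   // Ctrl-Z
            case (byte)'\'': sb.Append("\\'"); break;
            case (byte)'"': sb.Append("\\\""); break;
            case (byte)'\\': sb.Append("\\\\"); break;
            default:
                sb.Append((char)b); // Latin-1 maps each byte 1:1 to code points 0-255
                break;
        }
    }
    return sb.ToString();
}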
When I execute a query against the MySQL Connector/C library, all of the data I get back appears to be in straight char * format, including numerical data types.
For example, if I execute a query that returns 4 columns, all of which are INTEGER in MySQL, rather than getting the values back as native binary integers, I'm actually getting back ASCII-encoded character bytes, where a value of 1 arrives as a byte with the numeric value 49 (ASCII for '1').
Is this accurate, or am I just missing something completely?
Do I really need to then atoi that returned byte into an int in my code or is there a mechanism to get the native C data types out of the MySQL client directly?
I guess my real question is: is the mysql_store_result structure converting that data to ASCII encoded representations in a way that can be bypassed by my application code?
I believe the data is sent on the wire as text in the MySQL protocol (I just confirmed this with Wireshark). That means mysql_store_result() is not converting the data; it simply passes the data on as it was received. MySQL actually sends integers as text. I agree, this always seemed like an odd design to me as well.
MySQL originally only offered the Text Protocol that you are currently using, in which (as you note) results are encoded as strings. MySQL v4.1 (released in April 2003) introduced the Prepared Statement protocol, which (amongst other things) transmits results in a binary format.
See C API Prepared Statements for more information on how to use the latter protocol with Connector/C.
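The same distinction surfaces in the .NET connector as well. For example, with Connector/NET in C# (a sketch; note that whether Prepare() actually uses the binary protocol can depend on connector settings such as Ignore Prepare):

// Requires: using MySql.Data.MySqlClient; connStr is assumed.
using (var conn = new MySqlConnection(connStr))
{
    conn.Open();
    using (var cmd = new MySqlCommand("SELECT a, b, c, d FROM t", conn))
    {
        cmd.Prepare(); // requests the prepared statement (binary) protocol
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // The provider decodes text or binary results for you;
                // no atoi-style parsing in application code.
                int a = reader.GetInt32(0);
            }
        }
    }
}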
I have MS Access 2010 forms linking to a mySQL5 (utf8) database.
I have the following data stored in a varchar field:
"Jarosław Kot"
MS Access just displays this raw, as opposed to converting it to:
Jarosław Kot
Can anyone offer assistance?
Thanks Paul
The notation &#322; is a character reference in SGML, HTML, and XML. There is in general no reason to expect any software to treat it as anything but a literal of six characters “&”, “#”, etc., unless the software is interpreting the data as SGML, HTML, or XML.
So if you have data stored so that &#322; should be interpreted as a character reference, then you should convert the data, statically or dynamically. The specifics depend on what actual data there is - for example, do all the constructs use decimal notation (not hexadecimal), and is it certain that all numbers are to be interpreted as Unicode numbers for characters?
If I understand you correctly, you can use the Replace function:
Replace("Jarosław Kot", "ł", "ł")
Assuming that your MySQL database character set is effectively set to UTF-8 and that all inserts and updates are UTF-8 compatible (I do not know that much about MySQL, but SQL Server has some specific syntax rules for UTF-8-compliant data...), you could then convert the available HTML data to plain UTF-8 data.
You will not have any problem finding some conversion table (here for example), and, if you are lucky, you could even find a conversion function...
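If you end up converting the stored values outside Access, decoding numeric character references is built into .NET; a one-line sketch in C#:

// Requires: using System.Net;
string decoded = WebUtility.HtmlDecode("Jaros&#322;aw Kot"); // -> "Jarosław Kot"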
I have a column of type char(32) where I want to store an MD5 hash key. The problem is that I've used SQL to update the existing records using the HashBytes() function, which creates values like:
:›=k! ©úw"5Ýâ‘<\
but when I do the insert via .NET it comes through as
3A9B3D6B2120A9FA772235DDE2913C5C
What do I need to do to get these to match up? Is it the encoding?
HashKey isn't a SQL function; did you mean HASHBYTES? Some actual code would help. SQL appears to be computing the raw binary hash and displaying it as ASCII characters.
.NET is computing the hash, then converting it to hexadecimal (or so it appears). CHAR(32) isn't a good way to store raw binary data; you would want to use the BINARY type.
An Example in SQL:
SELECT SUBSTRING(sys.fn_varbintohexstr(HASHBYTES('MD5',0x2040)),3, 32)
And an Example in .NET:
// Requires: using System; using System.Security.Cryptography;
using (MD5 md5 = MD5.Create())
{
    var data = new byte[] { 0x20, 0x40 };  // the 0x2040 sample data
    var hashed = md5.ComputeHash(data);    // raw 16-byte MD5 hash
    var hexHash = BitConverter.ToString(hashed).Replace("-", ""); // hex, no separators
    Console.Out.WriteLine("hexHash = {0}", hexHash);
}
These will both produce the same value. (Where 0x2040 is sample data).
You can either store the hexadecimal data as CHAR(32), or the raw bytes as BINARY(16). Storing the binary data is twice as space-efficient as storing it as hex. What you should not be doing is storing the binary data as CHAR(16).
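If you go the BINARY(16) route, pass the raw hash bytes as a binary parameter from .NET rather than a string (a sketch; the table and column names are invented):

// Requires: using System.Data; using System.Data.SqlClient;
// 'hashed' is the 16-byte array from md5.ComputeHash above.
using (var cmd = new SqlCommand(
    "INSERT INTO HashTable (HashValue) VALUES (@hash)", conn))
{
    cmd.Parameters.Add("@hash", SqlDbType.Binary, 16).Value = hashed;
    cmd.ExecuteNonQuery();
}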
It's not clear what you mean by "when I do the insert via .NET", but you shouldn't be storing binary data just in its raw form, as it looks like you're doing using HashKey(). (Do you definitely mean HashKey, by the way? I can't find a reference for it, but there's HashBytes...)
Two common options are to encode the raw binary data as hex - which it looks like you're doing in the second case - or to use base64. Either way should be easy from .NET (Base64 marginally easier, using Convert.ToBase64String) and you probably just need to find the equivalent SQL Server function.
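On the .NET side the base64 option is a one-liner; a 16-byte MD5 hash becomes a 24-character string (including padding), so CHAR(24) would hold it:

// 'hashed' is the raw 16-byte MD5 hash.
string base64Hash = Convert.ToBase64String(hashed);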
MD5 is typically stored in hex encoding. I'd guess that your HashKey() SQL function is not hex-encoding the MD5 hash; rather, it's just returning the raw ASCII characters representing the hash. But your .NET method is hex-encoding. If you store your MD5 hashes consistently as hex (or not - up to you, but hex is the usual choice), then the results between the two should always be consistent.
For example, the : symbol from your SQL hash is the first character returned from HashKey(). In the .NET method, the first 2 characters are 3A. Hex 3A is 58 in decimal, and ASCII code 58 is the colon (:) character. Similarly, you can work your way through each of the other characters and do the hex conversion.
See any ASCII code table for reference, e.g. http://www.asciitable.com/