Handling Japanese characters in query to MS access database (in RStudio) - ms-access

I would like to know how to handle Japanese characters in a query to a Microsoft Access database. I am trying to run a query that selects variable (column) names written in Japanese, using the function sqlQuery from the RODBC package in R.
I am working with Windows. My version of RStudio is 1.1.383, and my version of Access is 14.0.7015.1000 (32-bit).
I think R understands the Japanese characters in my query, but when I try to actually carry out the query I get the following error message:
> query <- "SELECT [LOA-FTD_1_5_1_CALCULATE_LOA_query].月日 FROM [LOA-FTD_1_5_1_CALCULATE_LOA_query]"
> sqlQuery(channel,query)
[1] "42000 -3100 [Microsoft][ODBC Microsoft Access Driver] Syntax error in query expression '[LOA-FTD_1_5_1_CALCULATE_LOA_query].<U+6708><U+65E5>'."
[2] "[RODBC] ERROR: Could not SQLExecDirect 'SELECT [LOA-FTD_1_5_1_CALCULATE_LOA_query].<U+6708><U+65E5> FROM [LOA-FTD_1_5_1_CALCULATE_LOA_query]'"
Here, 月日 was converted into <U+6708> and <U+65E5> in the error message. These are the Unicode code points of the two characters, so I guess the string is sent to MS Access in a Unicode encoding such as UTF-8, but MS Access is then unable to read it? Is MS Access even part of the process of carrying out the query?
So it must be an encoding issue, where RStudio and MS Access do not understand each other. When I looked at similar issues with Japanese characters, the problem was usually to display values in a table. Here the variable names are in Japanese, so the query does not work at all.
I am quite lost, so I am open to any idea or remark.
Thank you.

I found an answer that works for me in this post.
The trick (at least in my case) was to set locale to Japanese_Japan.932 before any data importing.
Here is the code for this command:
Sys.setlocale("LC_ALL", locale = "Japanese_Japan.932")
Then I imported my data from Access without having to change encoding, and the Japanese characters are displayed correctly in the resulting data frame. Moreover, this allows Japanese characters in the query to be understood.
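To illustrate why the locale matters here, the short Python sketch below compares how the column name 月日 can be represented in different code pages. It is only an illustration of the encoding mechanics, not of what RODBC does internally; cp1252 is an assumed stand-in for a Western Windows locale, since the original locale is not stated in the question.

```python
# Illustration only: how the column name 月日 behaves in different code pages.
name = "月日"

utf8_bytes = name.encode("utf-8")    # 3 bytes per character in UTF-8
cp932_bytes = name.encode("cp932")   # Windows Japanese code page, 2 bytes per character


def representable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


print(len(utf8_bytes), len(cp932_bytes))
print(representable(name, "cp932"))    # Japanese code page can hold the name
print(representable(name, "cp1252"))   # assumed Western code page cannot
```

If the session's native encoding cannot hold the characters, they have to be escaped or replaced before they reach the driver, which is consistent with the <U+6708><U+65E5> text in the error message.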

Related

MS Access CREATE TABLE "WITH COMPRESSION" syntax?

I'm trying to write a CREATE TABLE statement for Microsoft Access (to be executed via a C# / .NET app using an OleDbConnection), utilizing the WITH COMPRESSION attribute to cause character columns (TEXT) to be created using single-byte characters rather than Unicode double-byte characters, as documented on MSDN here.
The WITH COMPRESSION attribute can be used only with the CHARACTER and MEMO (also known as TEXT) data types and their synonyms.
The WITH COMPRESSION attribute was added for CHARACTER columns because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft® Jet databases that contain predominately character data, this could mean that the database file would nearly double in size when converted to the Microsoft Access database engine format. However, Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS) can easily be compressed to a single byte. If you define a CHARACTER column with this attribute, data will automatically be compressed as it is stored and uncompressed when retrieved from the column.
When I try to execute the following statement (which I believe to be syntactically correct per MSDN) via an OleDbConnection, I get a syntax error.
CREATE TABLE [Foo] ([COL1] TEXT(255) WITH COMPRESSION)
Likewise, executing the same statement directly within MS Access 2013 as a query gives a syntax error at WITH.
Executing
CurrentProject.Connection.Execute("CREATE TABLE [Foo1] ([COL1] TEXT(255) WITH COMPRESSION)")
from Access VBA does work, however.
If I take out the WITH COMPRESSION attribute, the statement executes without error both via OleDb and directly in MS Access.
Any ideas what I'm doing wrong?
My problem turned out to be a syntax error that wasn't reflected properly in my original question.
However, solving that problem revealed that the documentation for MS Access CREATE TABLE on MSDN https://msdn.microsoft.com/en-us/library/office/ff837200.aspx is incorrect regarding the sequence of attributes for the CREATE TABLE statement. According to the documentation, the syntax is:
CREATE [TEMPORARY] TABLE table (field1 type [(size)] [NOT NULL] [WITH COMPRESSION | WITH COMP] [index1] [, field2 type [(size)] [NOT NULL] [index2] [, …]] [, CONSTRAINT multifieldindex [, …]])
but in fact, [WITH COMPRESSION | WITH COMP] must appear before [NOT NULL] or you get a syntax error.
Additionally, it's not possible to execute the CREATE TABLE statement using the WITH COMPRESSION attribute from a query directly within MS Access. You have to either use VBA or (as in my case) an external program via OleDbConnection.
My experience with "WITH COMPRESSION" and MS-ACCESS 2013
Impossible to run such a script from query window.
Possible from VBA but with limitations:
currentdb.Execute "... WITH COMPRESSION" -> "Syntax error in CREATE TABLE"
CurrentProject.Connection.Execute "..." -> OK
I confirm what "Mr. T" says: WITH COMPRESSION must appear before NOT NULL
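The attribute order described above can be sketched as a small statement builder. This is a minimal Python sketch that only assembles the SQL text; the helper names column_definition and create_table_sql are hypothetical, and the resulting string would still have to be executed through a route that accepts WITH COMPRESSION (for example OleDb or CurrentProject.Connection, as reported above).

```python
def column_definition(name: str, size: int, compressed: bool = True,
                      not_null: bool = False) -> str:
    """Build one TEXT column definition for Access DDL (hypothetical helper).

    WITH COMPRESSION is emitted before NOT NULL, which is the order
    reported to work in the answers above.
    """
    parts = [f"[{name}]", f"TEXT({size})"]
    if compressed:
        parts.append("WITH COMPRESSION")
    if not_null:
        parts.append("NOT NULL")
    return " ".join(parts)


def create_table_sql(table: str, columns: list) -> str:
    """Join column definitions into a CREATE TABLE statement."""
    return f"CREATE TABLE [{table}] ({', '.join(columns)})"


print(create_table_sql("Foo", [column_definition("COL1", 255, not_null=True)]))
```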

Getting 2 different lengths for a text field in perl from a DBI query

I have encrypted data in a mysql table stored as a text field.
Everything was originally written in Windows perl and that still works without issue.
My problem is that I am running the same code on Linux and when I query the table the text result in perl tells me it is longer (which causes my decryption to blow up since it is too long).
This happens running the same script so I know there is not a code difference.
Mysql server is 5.1.63 running on OpenSuSE Linux 11.4 x64.
Linux perl is v5.12.3
Windows perl is 5.10.1
The field in question is defined as text, utf8_general_ci, and when I access it via JDBC the data reports 128 bytes.
The SQL in question is simple (pruned down to just what matters here):
my $gatherSQL = "select
table.encryptedText from action.theTable table
where table.custno=" . $dbHandle->quote($custno);
my $getHandle = $dbHandle->prepare($gatherSQL);
$getHandle->execute();
my $arrayRef = $getHandle->fetchall_arrayref();
# each row is an array reference; column 0 holds the encrypted text
foreach my $myrow (@$arrayRef)
{
$type = $$myrow[0];
}
$getHandle->finish();
# DB handle is opened with a simple
my $workSQLhandle = DBI->connect("dbi:mysql:$dataDB:$dataServer:$dataPort", $userToUse, $pwToUse);
return($workSQLhandle);
When I run the code in Windows (through a samba share) I get a field length of 128 (which decrypts).
The same code on the same machine, run from a command prompt, tells me the same return string is 193 chars long (and won't decrypt).
I did a compare of the results coming back and they are identical, but Perl tells me one is longer than the other.
Any thoughts on how I can address this and what the root cause is?
Check whether MySQL or Perl is doing some translation on the text, e.g. run SELECT LENGTH(table.encryptedText) to see what MySQL thinks the length is. Encrypted text tends to look like binary garbage, and if you store it in a TEXT-type field, it will be subject to automatic charset translation. Encrypted data should go into BLOB fields, which are otherwise identical to TEXT but are not charset-sensitive.
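As a rough illustration of how charset translation can change the length of binary data, the Python sketch below treats each byte as a Latin-1 character and re-encodes it as UTF-8. The sample bytes are made up and chosen so that the numbers mirror the ones in the question; this shows the mechanism only and does not confirm that it is the cause in the asker's setup.

```python
def length_after_charset_translation(raw: bytes) -> int:
    """Length of raw after being read as Latin-1 text and re-encoded as UTF-8.

    Bytes below 0x80 stay one byte long; bytes at or above 0x80 become
    two bytes each, so binary data grows.
    """
    return len(raw.decode("latin-1").encode("utf-8"))


# Made-up "ciphertext": 63 bytes below 0x80 and 65 bytes at or above 0x80.
sample = bytes(range(63)) + bytes(range(128, 193))

print(len(sample))                               # original byte count
print(length_after_charset_translation(sample))  # byte count after translation
```

A BLOB column avoids this because its contents are never passed through a character set conversion.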

MSSQL to MySQL migration - char encoding issues with UCS-2 surrogate pairs, how can I remove these from MSSQL database?

I have been tasked with migrating a Microsoft SQL Server 2005 database to MySQL 5.6 (both are database servers running locally) and would really appreciate some help.
-MSSQL source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.
-MySQL target database wants the character set UTF-8
I decided to use the database migration toolkit in the latest version of MySQL Workbench. At first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the MSSQL database.
The migration toolkit copytable program did not provide a very useful error message: "Error during charset conversion of wstring: No error". It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).
When this surrogate pair was removed from the MSSQL database the row was migrated successfully to MySQL.
Here is the problem:
I have tried to search for these characters in a test MSSQL table (this chartest table is just various test strings in an nvarchar field) to prepare a replacement script and keep getting strange results... I must be doing something incorrectly.
Searching for
SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)
will return any surrogate pair character (whether or not it uses DC83), but obviously only if it is the only character (or part of the pair) in that field. This isn't a big deal, since I would like to remove any instance of these anyway (I don't like to remove data like this, but I think we can afford it).
Searching for
SELECT * FROM chartest WHERE text LIKE '%' + (NCHAR(0xdc83))+ '%'
will return every row, regardless of whether it even has a Unicode character present in the field, let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?
I have also tried setting the target database, table, and field character set to UCS-2, but it does not seem to make a difference.
I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time...
I would appreciate any suggestions very much! Please let me know if there is any information I have left out.
I had this error, and I have now discovered the source of the problem. I had a hard time finding it, so maybe this will be useful to someone, even though I realize my problem and workaround may not match the OP's original trouble exactly.
I am migrating data from MSSQL to MySQL, and the content being migrated is HTML content from Sitecore CMS (the target CMS is Drupal, btw).
I found that I get this error when the conversion hits records that contain Instagram embeds. Instagram embeds work by copying the embedded post data into the embed code (instead of loading it asynchronously; even the image is included as base64 CSS), and people tend to put a lot of emoji in their image descriptions. Emoji are encoded as 4 bytes in UTF-8, but MySQL's utf8 character set only allows characters of up to 3 bytes.
My initial error from running wbcopytables.exe (which is the non-GUI way of doing Migration Wizard in MySQL Workbench) was the
Error during charset conversion of wstring: No error
but upgrading MySQL Workbench to a recent version (from 5.something to 6.x) makes the error a bit more descriptive, naming the table and column (alas, not the row):
ERROR: Could not successfully convert UCS-2 string to UTF-8 in table
[MyDatabase].[dbo].[MyTable] (column MyColumn).
Original string: ...
Anyway - a solution *could* be to use utf8mb4, which would allow for the emoji. Read more here.
But it looks like it's a bad idea to do this in e.g. my case with Drupal.
So the solution I ended up with was simply to strip these characters in my migration script. There is no point in keeping them for users of the site in question, since they are displayed as rectangles on the webpage anyway. Since you can't search-and-replace with regex in SQL Server, I processed the data using a DAL and C# .NET, and I found the help here (thanks a ton, Jon Skeet): it turns out there is a regex pattern for matching one half of a surrogate pair in UTF-16. See below (and use the pattern in another language if needed).
var noUcs2SurrogatePairsString = Regex.Replace(stringWithUcs2SurrogatePairs, @"\p{Cs}", string.Empty);
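For readers working outside .NET, a comparable clean-up can be sketched in Python. Python strings hold whole code points rather than UTF-16 code units, so instead of matching surrogate halves this sketch drops every character above U+FFFF (the ones that need a surrogate pair in UTF-16 and 4 bytes in UTF-8), plus any lone surrogate code points. The function names are hypothetical.

```python
def needs_four_utf8_bytes(ch: str) -> bool:
    """True for characters that MySQL's 3-byte utf8 character set cannot store."""
    return len(ch.encode("utf-8", errors="surrogatepass")) == 4


def strip_supplementary(text: str) -> str:
    """Remove characters outside the Basic Multilingual Plane and lone surrogates."""
    return "".join(
        ch for ch in text
        if ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
    )


caption = "photo \U0001F600 caption"
print(strip_supplementary(caption))
```

As with the C# version, this discards data, so it only fits cases where losing those characters is acceptable.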
I had a very similar problem today and found that it was caused by empty strings. I replaced them with NULLs (or a value representing no data) and the migration worked fine.
I solved it just by editing the "import data script.cmd": where it reads columns "As NVARCHAR", I replaced those with "VARCHAR" only.
Note: my table columns were already of VARCHAR type, so for some reason the migration script improperly cast them to a Unicode (NVARCHAR) type.
This issue has now been resolved. I used user Remus Rusanu's suggestion here for finding the rows with these surrogate pair characters using CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:
UPDATE test
SET a = SUBSTRING(a, 1, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)
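The reason for the odd-index check can be shown with a small Python sketch that works on raw UTF-16LE bytes, which is assumed here to be the layout the VARBINARY cast exposes. The byte pair 0x83 0xDC counts only when it starts on a character boundary; the same two bytes can also straddle two unrelated characters. The function name find_code_unit is hypothetical.

```python
def find_code_unit(utf16le: bytes, unit: int) -> int:
    """Return the 0-based character index of a 16-bit code unit, or -1.

    Only even byte offsets are checked, which corresponds to the odd
    1-based CHARINDEX values accepted by the UPDATE statement above.
    """
    needle = unit.to_bytes(2, "little")   # 0xDC83 -> bytes 0x83 0xDC
    for offset in range(0, len(utf16le) - 1, 2):
        if utf16le[offset:offset + 2] == needle:
            return offset // 2
    return -1


# U+100083 is the code point whose UTF-16 surrogate pair is DBC0 DC83.
with_pair = ("ab" + chr(0x100083)).encode("utf-16-le")
# Here the bytes 0x83 0xDC appear only across a character boundary.
misaligned = "\u8300\u00dc".encode("utf-16-le")

print(find_code_unit(with_pair, 0xDC83))
print(find_code_unit(misaligned, 0xDC83))
```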

Spanish characters in SQL select

I'm working on a Spanish language website where some text is stored in a MS SQL 2008 database table.
The text is stored in the db table with characters such as á, í and ñ.
When I retrieve the data, the characters don't display on the page.
This is probably a very simple fix but please educate me.
You must use Unicode instead of ANSI strings and functions, and must choose a web page encoding that has the required character set. Some searches on those terms will yield all you need. Look up content type 1252 and 8859 as well in case you get stuck (examples, not answers).
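One common way such characters get mangled is a mismatch between the encoding used to write the bytes and the one used to read them. The Python sketch below shows that mechanism with UTF-8 and Windows-1252 as assumed examples; the question does not say which encodings are actually involved.

```python
text = "año"

stored_bytes = text.encode("utf-8")          # bytes as written by a UTF-8 producer
misread = stored_bytes.decode("cp1252")      # same bytes read as Windows-1252
read_correctly = stored_bytes.decode("utf-8")

print(misread)          # garbled: ñ turns into two unrelated characters
print(read_correctly)   # matches the original text
```

Declaring the same encoding at every step (database column type, connection, and page content type) avoids this kind of mismatch.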

Characters entered from foreign users showing as "?"

I'm working on a site that has users from other countries. For the most part we get English text, but sometimes people use special characters such as Chinese characters or an accented E. These characters display as "?" when shown on the site.
The site has a UTF-8 charset declaration and the SQL Server database field is Nvarchar. I did a test by going to Google translate and having it translate "Good morning" into Japanese. When I copied the resulting Kanji to my site and saved it myself it worked fine.
What could be causing this issue? I'm guessing it's because the text is being entered in a charset that is not UTF-8. Will accept-charset="UTF-8" resolve the issue? If not what can I do? Even if there is no way to fix existing bad data can I prevent this issue in the future?
SQL Server 7.0 and SQL Server 2000 use a different Unicode encoding (UCS-2) and do not recognize UTF-8 as valid character data.
See the following knowledge base article for dealing with storing/retrieving UTF-8 data in a MS SQL Server database: http://support.microsoft.com/kb/232580
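A literal "?" usually means that a character was converted to an encoding that has no slot for it somewhere between the form and the database. The Python sketch below reproduces that effect with Windows-1252 as an assumed single-byte code page; it illustrates the symptom and is not a diagnosis of the site in question.

```python
def survives(text: str, codec: str) -> bool:
    """True if text can pass through the given code page without loss."""
    return text.encode(codec, errors="replace").decode(codec) == text


greeting = "おはよう"

# A lossy conversion substitutes "?" for every character it cannot represent.
lossy = greeting.encode("cp1252", errors="replace").decode("cp1252")

print(lossy)
print(survives(greeting, "cp1252"))
print(survives(greeting, "utf-16-le"))
```

Once the replacement has happened, the original characters cannot be recovered from the stored "?" values.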