Accented characters not correctly imported with BULK INSERT - sql-server-2008

I am importing a source CSV file. I don't know the source encoding, and I can only see either � (ANSI encoding) or � (UTF8-without-BOM encoding) when I open the file with Notepad++ (related question).
This file has been imported to the database mssql-2008 using bulk insert:
DECLARE @bulkinsert NVARCHAR(2000)
SET @bulkinsert =
N'BULK INSERT #TempData FROM ''' +
@FilePath +
N''' WITH (FIRSTROW = 2,FIELDTERMINATOR = ''","'',ROWTERMINATOR =''\n'')'
EXEC sp_executesql @bulkinsert
This is then copied to the regular table1 from #tempData in a column1 (varchar()). Now when I look into this table1 I see some ? in place of those characters.
I have tried to cast to nvarchar() but it does not help.
When I dug into what those characters really are, with the help of the link we download at the same time, I saw that the characters were é, ä, å and so on.
I could use REPLACE to fix the data, but I would need to write ugly code that looks at individual word patterns, so that seems difficult.
database/table collation: SQL_Latin1_General_CP1_CI_AS
column1(Varchar(80))
Can I change these characters to English-like characters or the original characters instead of ? marks?
I have looked at Collation and Unicode Support, which did not help me. I understood what it says about encoding, but it did not tell me what to do. I have looked through most of the posts here on Stack Overflow; there are some posts about this, but none matched my case.
I am unable to figure out where the problem lies.

In my case I can fix the encoding problem with the CODEPAGE option:
BULK
INSERT #CSV
FROM 'D:\XY\xy.csv'
WITH
(
CODEPAGE = 'ACP',
DATAFILETYPE ='char',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
)
Possible values:
CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' }
You can find more information about the option here:
BULK INSERT
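Choosing a CODEPAGE value is easier once you know how the file is actually encoded. The following is a minimal Python sketch, not part of the original answer, that guesses the encoding from the raw bytes. The function name `guess_encoding` is hypothetical, and the fallback to `cp1252` is an assumption that only suits files produced on Western European Windows systems.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for raw file bytes."""
    # A byte order mark is the most reliable hint when present.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Strict decoding fails on most single-byte accented text.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        return "cp1252"


if __name__ == "__main__":
    print(guess_encoding("é".encode("utf-8")))
    print(guess_encoding("é".encode("cp1252")))
```

A result of `cp1252` would point towards CODEPAGE = 'ACP' (or '1252') on a server whose ANSI code page is 1252, while a UTF-8 or UTF-16 result means the file needs different handling.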

It was answered in the comment. Did you try it?
http://msdn.microsoft.com/en-us/library/ms189941.aspx
Option DATAFILETYPE ='widenative'
Based on the comment from Esailiga: did the text get truncated before or after the bulk import? I agree it sounds like the CSV file itself is single byte. Unicode requires option DATAFILETYPE ='widenative'. If the CSV file is single byte, there is no magic translation back.
What is too bad is é is extended ASCII and supported with SQL char so more evidence the problem is at the CSV.
SELECT CAST('é' AS char(1))
notice this works as extended ASCII (<255)
Sounds like you need to go back to the source.
The ? in SQL is unknown. Same as � in notepad.
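To make the two symptoms concrete, here is a small Python sketch (an illustration, not something from the original thread). It assumes the file holds single-byte Windows-1252 text. It shows how reading such a byte as UTF-8 produces �, and how a character with no slot in the target code page degrades to ?.

```python
# Assumption: the CSV stores "Café" as single-byte Windows-1252.
single_byte = "Café".encode("cp1252")

# Reading those bytes as UTF-8 cannot succeed for the accented letter,
# so a tolerant decoder substitutes U+FFFD (the � seen in editors).
as_utf8 = single_byte.decode("utf-8", errors="replace")

# Storing an unmappable character in a single-byte column loses it:
# the replacement becomes a plain question mark.
back_to_varchar = as_utf8.encode("cp1252", errors="replace")

if __name__ == "__main__":
    print(single_byte)
    print(as_utf8)
    print(back_to_varchar)
```

Once the data has reached the ? stage the original byte is gone, which is why the fix has to happen at import time or at the source.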

I still cannot believe that after all these years Microsoft has not fixed this obvious bug. There should be no problem with èéêë etc. because they are all ASCII (<255). This question is posed over and over again on many sites and has yet to be answered.
My data is in a table in Excel. Having generated the INSERT INTO statements, the table is parsed a second time looking for ASCII > 'z' and generating an UPDATE table SET column statement to overwrite the imported data. Cumbersome but workable.

I've done it! After all these years and we were all looking in the wrong place. No work needed no rewriting scripts...
The problem lies with SSMS... if you "New Query" by right-clicking on "Queries" you get to rename the file but not create it that is done for you...
But... if you "Ctrl+N" you get a new query window to edit but no file is created... So you save it yourself and choose encoding on the save button... towards the bottom of the list you'll find UTF-8(without signature) codepage 65001
And that is it...
script after script open a new query window with "ctrl+N" copy and paste from an existing query and save as directed above. And as if by magic it works
If like me you have tables in Excel... parse the table writing the output to the 1st column of a new workbook with 1 sheet in it and then saveas and choose utf-8 encoding
Speed things up with a template file containing a comment "-- utf-8" something like that. save it as utf-8 and use a file listing of *.sql pasted into excel to concatenate a list of
=concatenate("ren templatefile.txt ", char(34), a1, char(34))
in b1 and drop it down
After all these years of manual solutions I am literally sweating with excitement at the discovery. Thank you for getting me so upset

Related

How to save UTF-8 data in MySQL database in Livecode app

I'm trying to save some data gathered from fields in MySQL db. Text contains some Polish characters, but Livecode sends all Polish chars as '?'. Here's part of my code:
Declare variable
put the unicodeText of field "Title" into tTitle
put uniEncode(tTitle, "UTF8") into tTitle
Send this to db:
put "UPDATE magazyn SET NAZWA='" & tTitle & "'" into tSQLStatement
revExecuteSQL gConnectionID,tSQLStatement, "SET NAMES 'utf8'"
For example, word "łąka" is saved as "??ka". I've tried uniEncode, uniDecode, everything is going wrong.
Don't use any encoders/decoders. They will only add to the confusion.
When trying to use utf8/utf8mb4, if you see question marks (regular ones, not black diamonds), these are the things to check:
The bytes to be stored are not encoded as utf8. Fix this. (Getting rid of the encoders may fix it.)
The column in the database is CHARACTER SET utf8 (or utf8mb4). Fix this.
Also, check that the connection during reading is utf8. I don't know the details of "Livecode"; look in its documentation. If you can't find anything, execute this SQL after connecting: SET NAMES utf8.
Problem solved! Here's code:
get the unicodeText of field "Title"
put unidecode(it,"polish") into tTitle
It saves the Polish characters in a strange form, but for reading them back I'm using this code:
set the unicodetext of fld "List" to uniencode(tList,"polish")
tList variable contains all data gathered from MySQL
Ensure the column in the database is set to utf8 encoding.
Starting with LiveCode 7, all text in fields is Unicode, specifically UTF-16. Before you send the text out to any external file or datastore, you need to encode it as UTF-8 (or whatever format you want to store it in). Use the LiveCode textEncode() function for this:
put textEncode(field "Title","utf-8") into tTitle
put "UPDATE magazine SET nazwa = :1" into tSQLStatement
revExecuteSQL gConnectionID, tSQLStatement, "tTitle"
Note: It's also a good idea to use the :N variable substitution method to reduce the risk of SQL code injection attacks.
When you read the data from the database use textDecode to convert back to UTF-16:
put textDecode(tRawDataFromDB,"UTF-8") into tTitle
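The placeholder approach mentioned above is not specific to LiveCode. As a rough illustration, the Python sketch below uses the standard `sqlite3` module as a stand-in for MySQL (an assumption made only so the example runs without a server). The table name `magazyn` and column `nazwa` come from the question; everything else is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE magazyn (nazwa TEXT)")

titles = ["łąka", "x'; DROP TABLE magazyn; --"]
for title in titles:
    # The value travels separately from the SQL text, so quotes inside it
    # are stored as data and never parsed as SQL.
    conn.execute("INSERT INTO magazyn (nazwa) VALUES (?)", (title,))

stored = [row[0] for row in conn.execute("SELECT nazwa FROM magazyn ORDER BY rowid")]

if __name__ == "__main__":
    print(stored)
```

Both the Polish text and the hostile-looking string come back exactly as they were supplied.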

MySQL Exporting Arabic/Persian Characters

I'm new to MySQL and I'm working with it through phpMyAdmin.
My problem is that I have imported some tables from .sql files into a database with the UTF8_general_ci collation, and they contain some Arabic or Persian characters. However, when I export the data into an Excel file, it appears as follows:
The original value: أحمد الكمالي
The exported value: أحمد  الكمالي
I have searched for this issue and tried to solve it by setting the output and the server connection to the same format, UTF8_general_ci. But for some reason I don't know, phpMyAdmin doesn't allow me to change to that format; it forces me to choose this one: UTF8mb4_general_ci
Anyway, when I export the data, I make sure the format is UTF8, but it still appears like that.
How can I solve or fix it?
Note: Here are some screenshots if you want to check organized by numbers.
http://www.megafileupload.com/rbt5/Screenshots.rar
I found an easier way: you can rebuild the Excel file with the correct characters.
Export your data from MySQL normally in CSV format.
Open a new Excel workbook and go to the Data tab.
Select "From Text". If you cannot find it, it is under "Get External Data".
Select your file.
Change the file origin to Unicode (UTF-8) and select Next ("Delimited" is checked by default).
Select the Comma delimiter and press Finish.
You will see your language's characters displayed correctly.
Mojibake. Probably...
The bytes you have in the client are correctly encoded in utf8mb4 (good).
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8mb4.)
The column in the tables may or may not have been CHARACTER SET utf8mb4, but it should have been that.
(utf8 and utf8mb4 work equally well for Arabic/Persian.)
Please provide more details if this explanation does not suffice.
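The mismatch described above can be reproduced outside MySQL. This Python sketch is only an illustration: it uses Python's `latin-1` codec as a stand-in for a connection or viewer that assumes a single-byte charset, which is an assumption about where the misreading happens.

```python
original = "أحمد الكمالي"

# The client holds correctly encoded UTF-8 bytes.
wire_bytes = original.encode("utf-8")

# Something downstream treats each byte as one Latin-1 character,
# so every Arabic letter turns into two unrelated symbols.
mojibake = wire_bytes.decode("latin-1")

# Decoding the same bytes with the right charset shows the data is intact.
correct = wire_bytes.decode("utf-8")

if __name__ == "__main__":
    print(mojibake)
    print(correct)
```

The stored bytes are the same in both cases; only the charset assumed when reading them differs, which is why fixing the connection or import setting is enough.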

Escape semicolon; MYSQL for Excel

I want to import data from an excel sheet into a MySQL database with the MySQL for Excel plugin. In some cells are texts with semicolons and I already figured out this causes a SQL error. I tried escaping the semicolons with backslash but I still get the error message. How can I escape the semicolon?
Kai,
this behaviour is purely the fault of MySQL for Excel, and seems to be a bug.
In the meantime, if you are not keen on changing your Excel data as suggested by others there is a workaround:
In your MySQL-for-Excel window click Options and then select Preview SQL statements before they are sent to the server and Accept.
Then proceed as normal with export / append data using the Add-in, but when a Review SQL script window appears, copy the contents into a different SQL tool (MySQL workbench, HeidiSQL, SQLWorkbench etc), and run. Then click cancel in the Mysql-for-Excel popups, and refresh the query if necessary.
Also: feel free to report the bug at: http://bugs.mysql.com/
Replace the semicolon with some unique text e.g. [SEMICOLON].
Next import the data to SQL and run something like
UPDATE your_table
SET your_field = REPLACE(your_field, '[SEMICOLON]', ';')
WHERE your_field LIKE '%[SEMICOLON]%'
I think all you need to do is consider the requirements Excel has when it imports data from CSV files (the parsing rules are probably the same or similar)
In your case, if a field contains any special characters, just quote the values with double quotes before importing the content in Excel.
So:
-- In MySQL, || is logical OR by default, so CONCAT is used here
UPDATE your_table
SET your_field = CONCAT('"', your_field, '"')
WHERE your_field LIKE '%,%'
The following rules should apply:
Fields containing a line-break, double-quote, and/or commas should be quoted
Any field may be quoted (with double quotes)
A (double) quote character in a field must be represented by two (double) quote characters.
More details: Wikipedia: Comma-separated values
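The quoting rules above are what standard CSV libraries implement. The sketch below uses Python's `csv` module to illustrate them for semicolon-delimited data. The helper names `to_csv_line` and `from_csv_line` are hypothetical, and whether MySQL for Excel honours the same rules is not confirmed here.

```python
import csv
import io


def to_csv_line(fields, delimiter=";"):
    """Serialize one row, quoting only the fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(fields)
    return buf.getvalue()


def from_csv_line(line, delimiter=";"):
    """Parse one serialized row back into a list of fields."""
    return next(csv.reader(io.StringIO(line, newline=""), delimiter=delimiter))


if __name__ == "__main__":
    row = ["a;b", 'say "hi"', "plain"]
    line = to_csv_line(row)
    print(line)
    print(from_csv_line(line))
```

Fields containing the delimiter or a double quote are wrapped in double quotes, embedded quotes are doubled, and the plain field is left alone.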

Problem with charset

I have a MySQL database in UTF-8 format, but the characters inside the database are ISO-8859-1 (ISO-8859-1 strings are stored in UTF-8). I've tried recode, but it only converted, e.g., ü to Ã¼. Does anybody out there have a solution?
If you tried to store ISO-8859-1 characters in a database that is set to UTF-8, you just managed to corrupt your "special characters", as MySQL would retrieve the bytes from the database and try to assemble them as UTF-8 rather than ISO-8859-1. The only way to read the data correctly is to use a script which does something like:
ResultSet rs = ...
byte[] b = rs.getBytes( COLUMN_NAME );
String s = new String( b, "ISO-8859-1" );
This would ensure you get the bytes (which came from a ISO-8859-1 string from what you said) and then you can assemble them back to ISO-8859-1 string.
The other problem as well -- what do you use to "view" the strings in the database -- is it not the case that your console doesn't have the right charset to display those characters rather than the characters being stored wrongly?
NOTE: Updated the above after the last comment
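A related case is text that was encoded twice, so that ü shows up as two characters. The Python sketch below is a hedged illustration of undoing that, assuming the extra step was a Latin-1 misreading of UTF-8 bytes. The function name `fix_double_encoding` is hypothetical, and strings that do not fit the pattern are returned unchanged.

```python
def fix_double_encoding(text: str) -> str:
    """Undo one round of UTF-8 bytes misread as Latin-1, if applicable."""
    try:
        # Map each character back to its original byte, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not double-encoded (or not repairable this way): leave it alone.
        return text


if __name__ == "__main__":
    for sample in ["Ã¼", "ü", "plain"]:
        print(repr(sample), "->", repr(fix_double_encoding(sample)))
```

It is worth running something like this on a copy of the data first, because a string that only happens to look double-encoded would also be rewritten.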
I just went through this. The biggest part of my solution was exporting the database to .csv and Find / Replace the characters in question. The character at issue may look like a space, but copy it directly from the cell as your Find parameter.
Once this is done - and missing this is what took me all morning:
Save the file as CSV ( MS-DOS )
Excellent post on the issue
Source of MS-DOS idea

SQL 2008 BULK INSERT with Spanish characters

I'm using SQL BULK INSERT from a CSV file with some Spanish names like Zuñiga. The CSV file is in UTF-8 format (as far as I know).
These show up in the table in one of these two formats:
For NVARCHAR - Zu├▒iga
for VARCHAR - Zuñiga
The command I'm using is
BULK INSERT temp_table FROM '<some CSV file>' WITH (CODEPAGE = 'RAW',
DATAFILETYPE = 'char', FIELDTERMINATOR = ',',ROWTERMINATOR = '\n',FIRSTROW = 2)
I also tested all variations of CODEPAGE and DATAFILETYPE, with similar results.
UPDATE
I saved the CSV as Unicode (using Notepad's Save As) and that fixed the problem. But I need some kind of automatic solution. I would prefer to fix the SQL so that it handles the file, rather than preprocess the CSV.
You cannot use CODEPAGE = 'RAW'; you need to specify the proper code page so that the file reader understands the content of the file. If the file is truly UTF-8, then you should set the code page to 65001.
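If the server does not accept code page 65001 (as far as I know, BULK INSERT only supports it from SQL Server 2016 onward, so treat this as something to verify for your version), the manual "save as Unicode" step from the question can be automated. The Python sketch below is one possible way to do that, not the asker's or answerer's method. The function names and file paths are hypothetical. It rewrites a UTF-8 file as UTF-16 LE with a byte order mark, the format used with DATAFILETYPE = 'widechar'.

```python
import codecs
from pathlib import Path


def utf8_bytes_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes (with or without BOM) as UTF-16 LE with BOM."""
    text = data.decode("utf-8-sig")  # strips a UTF-8 BOM if one is present
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def convert_file(src: Path, dst: Path) -> None:
    """Convert a whole CSV file; suitable for files that fit in memory."""
    dst.write_bytes(utf8_bytes_to_utf16le(src.read_bytes()))


if __name__ == "__main__":
    sample = "id,name\r\n1,Zuñiga\r\n".encode("utf-8")
    converted = utf8_bytes_to_utf16le(sample)
    print(converted[:2])               # the UTF-16 LE byte order mark
    print(converted.decode("utf-16"))  # round-trips to the original text
```

The converted file would then be loaded with DATAFILETYPE = 'widechar' in place of the CODEPAGE setting.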