I'm using SQL BULK INSERT from a CSV file with some Spanish names like Zuñiga. The CSV file is in UTF-8 format (as far as I know).
These show up in the table in one of these two formats:
For NVARCHAR - Zu├▒iga
For VARCHAR - Zuñiga
The command I'm using is
BULK INSERT temp_table FROM '<some CSV file>'
WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
I also tested all variations of CODEPAGE and DATAFILETYPE, with similar results.
UPDATE
Saving the CSV (using Notepad's Save As) as Unicode fixed the problem, but I need some kind of automatic solution. I'd prefer to fix the SQL to handle it rather than preprocess the CSV.
You cannot use CODEPAGE = 'RAW'; you need to specify the proper code page so that the file reader understands the content of the file. If the file is truly UTF-8, then you should set the code page to 65001.
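For example, a minimal sketch based on the command in the question; note that UTF-8 code page 65001 is only accepted by newer SQL Server builds (roughly SQL Server 2014 SP2 / 2016 and later), so it is assumed here that your version supports it:
BULK INSERT temp_table FROM '<some CSV file>'
WITH (CODEPAGE = '65001',        -- UTF-8; older SQL Server versions reject this value
      DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\n',
      FIRSTROW = 2)
The target column also needs to be able to hold the characters (NVARCHAR, or VARCHAR under a collation whose code page contains them).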
I have a column of data, let's call it bank_date, that I receive from an external vendor as a csv file every day. As such the dates in that column show as '1/1/2020'.
I am trying to upload that raw CSV file directly to SQL daily. We used to store bank_date as text, but we have converted it to a DATE data type, and now it keeps zeroing out every time, with some sort of truncation / "datetime value incorrect" error.
I have now tested 17 different variations using STR_TO_DATE (mostly), CAST, and CONVERT, and feel like I'm close, but I'm not quite getting the syntax right.
For reference, I did find two other workarounds that are successful, but my boss specifically wants the data uploaded and converted directly through the import process (not manipulating the raw CSV data) for safety reasons:
Workaround 1: Convert the CSV date column to the YYYY-MM-DD format and save the file. The issue with this is that if you open that CSV file again, the date format auto-changes back to the standard mm/dd/yyyy. If someone doesn't know to watch out for this and reopens the CSV file to double-check something, they're going to get an error when they upload, and the problem is not easy to identify.
Workaround 2: Create an extra dummy_date column in the table, formatted as a text data type, and upload as normal. Then copy the data into the correct bank_date column using STR_TO_DATE as follows: UPDATE dummy_date SET bank_date = STR_TO_DATE(dummy_date, '%c/%e/%Y'); The issue with this is that it creates extra unnecessary data that can confuse other people who may not know that one of the columns is not intended for querying.
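Assuming dummy_date is just an extra text column on bank_table, that UPDATE would presumably take the form:
UPDATE bank_table
SET bank_date = STR_TO_DATE(dummy_date, '%c/%e/%Y');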
Here is my current code:
USE database_name;
LOAD DATA LOCAL INFILE 'C:/Users/Shelly/Desktop/Date Import.csv'
INTO TABLE bank_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS
(bank_date, bank_amount)
SET bank_date = str_to_date(bank_date,'%Y-%m-%d');
The "SET" line is what I cannot work out on syntax to convert a csv's 1/5/2020' to SQL's 2020-1-5 format. Every test I've made either produces 0000-00-00 or nulls the column cells. I'm thinking maybe I need to tell SQL how to understand the csv's format in order for it to know how to convert it. Newbie here and stuck.
You need to specify the format of the date as it appears in the file, not the "target" format you want stored:
SET bank_date = str_to_date(bank_date,'%c/%e/%Y');
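Putting it together with the statement from the question, a common pattern is to read the raw text into a user variable first and convert it in the SET clause (the @raw_bank_date variable below is just an illustrative name):
USE database_name;
LOAD DATA LOCAL INFILE 'C:/Users/Shelly/Desktop/Date Import.csv'
INTO TABLE bank_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS
(@raw_bank_date, bank_amount)
SET bank_date = STR_TO_DATE(@raw_bank_date, '%c/%e/%Y');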
I have a CSV file in Azure Data Lake that, when opened with Notepad++, looks something like this:
a,b,c
d,e,f
g,h,i
j,"foo
bar,baz",l
Upon inspection in Notepad++ (view all symbols) it shows me this:
a,b,c[CR][LF]
d,e,f[CR][LF]
g,h,i[CR][LF]
j,"foo[LF]
[LF]
bar,baz",l[CR][LF]
That is to say, normal Windows carriage return and line feed stuff after each row, with the exception that for one of the columns someone inserted a fancy story such as:
foo
bar, baz
My T-SQL code to ingest the CSV looks like this:
COPY INTO
dbo.SalesLine
FROM
'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
ROWTERMINATOR = '0x0d', -- Tried \n \r\n , 0x0d0a here
FILE_TYPE = 'CSV',
FIELDQUOTE = '"',
FIELDTERMINATOR = ',',
CREDENTIAL = (IDENTITY = 'Managed Identity') --Used to access datalake
)
But the query doesn't work. The common error message in SSMS is:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (NAME) in data file
I have no option to correct the faulty rows in the data lake or modify the CSV in any way.
Obviously the real file is much larger, with real data, but I took a simple example.
How can I modify or rescript the TSQL code to correct the CSV when it is being read?
I recreated a similar file and uploaded it to my data lake, and serverless SQL pool seemed to manage just fine:
SELECT *
FROM
OPENROWSET(
BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0'
) AS [result]
My results:
It probably seems like a bit of a workaround, but if the improved parser in serverless makes light work of problems like this, why not make use of the whole suite that is Azure Synapse Analytics? You could use a serverless query as the source in a Copy activity in a Synapse pipeline and load it into your dedicated SQL pool; that would have the same outcome as using the COPY INTO command.
In the past I've done things like writing special parsing routines, loading the file up as one column and splitting it in the database, or using regular expressions, but if there's a simple solution, why not use it?
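For completeness, the "load it as one column and split it in the database" fallback looks roughly like the sketch below. Everything here uses made-up object names and assumes that '|' and the single-quote character never occur in the data; the idea is to choose a field terminator and field quote that can't appear, so each physical record lands in a single wide column, and the real parsing happens afterwards in T-SQL.
CREATE TABLE dbo.SalesLineRaw (RawLine NVARCHAR(MAX));

COPY INTO dbo.SalesLineRaw (RawLine)
FROM 'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIELDTERMINATOR = '|',     -- assumed not to occur in the data
    FIELDQUOTE = '''',         -- likewise assumed absent, so embedded " are left alone
    ROWTERMINATOR = '\r\n',    -- real records end in CR LF; the bare LFs stay inside RawLine
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)
The commas and quotes inside RawLine then have to be unpicked with CHARINDEX/SUBSTRING (or similar) in T-SQL, which is the cumbersome part.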
I viewed my test file via an online hex editor; maybe I'm missing something:
I am working on some benchmarks and need to compare the ORC, Parquet, and CSV formats. I have exported TPC-H (SF1000) to ORC-based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other data types (e.g. bigint) to text and store the column type in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable at the SF1000 scale factor. Another disadvantage is that my Hive metastores wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain varchar columns only.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar), CAST(name AS varchar), CAST(comment AS varchar)
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
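For example, a benchmark filter on regionkey against the CSV table would presumably need a cast back to a numeric type:
SELECT name
FROM region_csv
WHERE CAST(regionkey AS bigint) = 1;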
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables by using the TEXTFILE format with ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. TEXTFILE normally uses the ^A character (ASCII 1) as the field separator, but when it is set to ',' the output resembles the structure of CSV.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be stored correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text is copied to the file unmodified, without escaping quotes or commas. To be proper CSV, it should have been written like this:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
I am importing a source CSV file. I don't know the source encoding, and I can only see either � (ANSI encoding) or � (UTF-8-without-BOM encoding) when I open the file with Notepad++ (related question).
This file has been imported into the MS SQL 2008 database using BULK INSERT:
DECLARE @bulkinsert NVARCHAR(2000)
SET @bulkinsert =
N'BULK INSERT #TempData FROM ''' +
@FilePath +
N''' WITH (FIRSTROW = 2, FIELDTERMINATOR = ''","'', ROWTERMINATOR = ''\n'')'
EXEC sp_executesql @bulkinsert
The data is then copied from #TempData into column1 (varchar) of the regular table1. Now when I look into table1 I see some ? in place of those characters.
I have tried casting to nvarchar, but it does not help.
When I dug into what those characters really are, with help from the link we download at the same time, I saw that the characters were é, ä, å and so on.
I could use REPLACE to fix the data, but that would mean writing some ugly code that looks at individual word patterns and replaces them, which seems difficult.
database/table collation: SQL_Latin1_General_CP1_CI_AS
column1 (varchar(80))
Can I change these characters to English-like characters, or recover the original characters, instead of getting ? marks?
I have looked at Collation and Unicode Support, which did not help me; I understood what it says about encoding, but it did not tell me what to do. I have looked at many posts here on Stack Overflow; there are some posts about this, but they did not match my problem.
I am unable to figure out where the problem lies.
In my case I can fix the encoding problem with the CODEPAGE option:
BULK
INSERT #CSV
FROM 'D:\XY\xy.csv'
WITH
(
CODEPAGE = 'ACP',
DATAFILETYPE ='char',
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
FIRSTROW = 2
)
Possible values:
CODEPAGE = { 'ACP' | 'OEM' | 'RAW' | 'code_page' }
You can find more information about the option here:
BULK INSERT
It was answered in the comment. Did you try it?
http://msdn.microsoft.com/en-us/library/ms189941.aspx
Option DATAFILETYPE ='widenative'
Based on the comment from Esailiga: did the text get truncated before or after the bulk import? I agree it sounds like the CSV file itself is single-byte. Unicode requires the option DATAFILETYPE = 'widenative'. If the CSV file is single-byte, there is no magic translation back.
What is too bad is that é is extended ASCII and is supported by SQL char, so that is more evidence the problem is in the CSV.
SELECT CAST('é' AS char(1))
Notice this works because é is extended ASCII (< 255).
Sounds like you need to go back to the source.
The ? in SQL means an unknown character, the same as � in Notepad.
I still cannot believe that after all these years Microsoft has not fixed this obvious bug. There should be no problem with è, é, ê, ë, etc., because they are all extended ASCII (< 255). This question is posed over and over again on many sites and has yet to be answered.
My data is in a table in Excel. Having generated the INSERT INTO statements, the table is parsed a second time looking for ASCII > 'z', generating an UPDATE table SET column statement to overwrite the imported data. Cumbersome but workable.
I've done it! After all these years, and we were all looking in the wrong place. No work needed, no rewriting scripts...
The problem lies with SSMS... if you choose "New Query" by right-clicking on "Queries", you get to rename the file but not create it; that is done for you...
But... if you press Ctrl+N you get a new query window to edit, but no file is created... So you save it yourself and choose the encoding from the Save button... towards the bottom of the list you'll find UTF-8 (without signature), code page 65001.
And that is it...
Script after script: open a new query window with Ctrl+N, copy and paste from an existing query, and save as directed above. And as if by magic it works.
If, like me, you have tables in Excel... parse the table, writing the output to the first column of a new workbook with one sheet in it, then Save As and choose UTF-8 encoding.
Speed things up with a template file containing a comment like "-- utf-8"; save it as UTF-8 and use a file listing of *.sql pasted into Excel to concatenate a list of
=concatenate("ren templatefile.txt ", char(34), a1, char(34))
in B1 and fill it down.
After all these years of manual solutions I am literally sweating with excitement at the discovery. Thank you for getting me so upset
I am trying to import a CSV file into an MS SQL database with the BULK INSERT method. The problem is that it contains special characters (the Norwegian letters æ, ø, and å), and after the insert is run they get replaced by characters I don't know the encoding of.
To be more specific, ø is replaced with °, å is replaced with Õ and æ is replaced with µ.
I also tried to convert them to UTF-8 before inserting, but I understood that the BULK INSERT method doesn't support this. The respective UTF-8 encodings for æøå then ended up with something like +©.
I have also tried to use the wizard import function, but since I have identity on one of the columns, the import will just insert a 0 for every record, rendering the import useless for copying.
Does anyone know how I could set the encoding when running the BULK INSERT, as it works perfectly with the identity column? I am using MS SQL Server Management Studio 2008.
I think there are two ways:
Specify CODEPAGE = RAW in your BULK INSERT command (see MSDN).
Create a format file and specify a collation for each column.
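For the second option, a rough sketch (all names and paths here are placeholders): generate a format file, set the collation for each affected column in it, and point BULK INSERT at it. KEEPIDENTITY keeps the identity values supplied by the file instead of generating new ones.
-- Generate a non-XML format file first (placeholder names), e.g.:
--   bcp MyDb.dbo.ImportTable format nul -c -f C:\import\ImportTable.fmt -T -S myserver
-- then edit the collation column in that file for the affected fields.
BULK INSERT dbo.ImportTable
FROM 'C:\import\data.csv'
WITH (
    FORMATFILE = 'C:\import\ImportTable.fmt',
    FIRSTROW = 2,
    KEEPIDENTITY    -- keep the identity values from the file
)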