SQL COPY INTO is unable to parse a CSV due to unexpected line feeds in some fields; ROWTERMINATOR and FIELDQUOTE parameters do not work - csv

I have a CSV file in Azure Data Lake that, when opened with Notepad++, looks something like this:
a,b,c
d,e,f
g,h,i
j,"foo
bar,baz",l
Upon inspection in Notepad++ (View All Symbols) it shows me this:
a,b,c[CR][LF]
d,e,f[CR][LF]
g,h,i[CR][LF]
j,"foo[LF]
[LF]
bar,baz",l[CR][LF]
That is to say, the normal Windows carriage return and line feed after each row, with the exception that for one of the columns someone inserted a fancy multi-line story like this:
foo
bar, baz
My T-SQL code to ingest the CSV looks like this:
COPY INTO dbo.SalesLine
FROM 'https://mydatalakeblablabla/folders/myfile.csv'
WITH (
    ROWTERMINATOR = '0x0d', -- Tried \n, \r\n and 0x0d0a here
    FILE_TYPE = 'CSV',
    FIELDQUOTE = '"',
    FIELDTERMINATOR = ',',
    CREDENTIAL = (IDENTITY = 'Managed Identity') -- Used to access the data lake
)
But the query doesn't work. The common error message in SSMS is:
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (NAME) in data file
I have no option to correct the faulty rows in the data lake or modify the CSV in any way.
Obviously it is a much larger file with real data, but I took a simple example.
How can I modify or rescript the T-SQL code to correct the CSV as it is being read?

I recreated a similar file and uploaded it to my data lake, and serverless SQL pool seemed to manage it just fine:
SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) AS [result]
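For completeness, a hedged variant with an explicit schema; the column names here are made up, since the sample file has no header row:

SELECT *
FROM OPENROWSET(
    BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/badFile.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0'
) WITH (
    col1 VARCHAR(100),
    col2 VARCHAR(4000), -- the column that carries the embedded line feeds
    col3 VARCHAR(100)
) AS [result]

Either form should behave the same; the point is parser version 2.0 honouring the quoted line feeds.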
It probably seems like a bit of a workaround, but if the improved parser in serverless is making light work of problems like this, then why not make use of the whole suite that is Azure Synapse Analytics? You could use the serverless query as a source in a Copy activity in a Synapse pipeline and load it into your dedicated SQL pool; that would have the same outcome as using the COPY INTO command.
In the past I've done things like writing special parsing routines, loading the file up as one column and splitting it in the database (a sketch of that follows below), or using regular expressions, but if there's a simple solution, why not use it?
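For reference, a minimal sketch of that load-as-one-column approach in T-SQL; the staging table name, the file path, and the choice of '|' as a field terminator assumed not to occur in the data are all hypothetical:

-- Stage each physical [CR][LF]-terminated line into a single wide column.
CREATE TABLE dbo.Staging_RawCsv (LineText NVARCHAR(MAX));

BULK INSERT dbo.Staging_RawCsv
FROM 'C:\temp\myfile.csv' -- hypothetical accessible copy of the file
WITH (
    ROWTERMINATOR = '\r\n', -- split rows on [CR][LF] only, so bare [LF]s stay embedded
    FIELDTERMINATOR = '|'   -- assumed absent from the data: no field splitting
);

-- Rows that captured the multi-line value still contain bare [LF] characters;
-- strip them before parsing the line into real columns.
UPDATE dbo.Staging_RawCsv
SET LineText = REPLACE(LineText, CHAR(10), ' ')
WHERE LineText LIKE '%' + CHAR(10) + '%';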
I viewed my test file via an online hex editor; maybe I'm missing something.

Related

SSIS Data flow task (SQL -> CSV) data formatting issue

In SSIS, I have a data flow task that gets data from SQL and saves it in a CSV file (using a flat file connection manager).
In this, I have a column which will give me values like the following (case 1 and case 2 are just for reference):
Case 1: 2333,376677
Case 2: E37777,77777
In the first case, I am not getting the comma in the CSV; it is giving me 2333376677.
In case 2, it is working fine.
In the connection manager, I am delimiting with the text qualifier set to ".
Any suggestions on how to fix this?

How can I import a .TSV file with the "Get Data" command using SPSS syntax?

I'm importing a .TSV file, with the first row being the variable names and the first column as IDs, into SPSS using syntax, but I keep getting a 'Failure opening file' error in my output. This is my code so far:
GET DATA
/TYPE=TXT
/FILE=\filelocation\filename.tsv
/DELCASE=LINE
/DELIMTERS="/t"
/QUALIFIER=''
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
/MAP
RESTORE.
CACHE.
EXECUTE.
SAVE OUTFILE = "newfile.sav"
I think I'm having an issue in the delimiters or qualifier subcommands. Wondering if I should also include the variables under the VARIABLES subcommand. Any advice would be helpful. Thanks!
The GET DATA command you cite above has an empty /VARIABLES= subcommand.
If you used the "File -> Import Data -> Text Data" wizard, it would have populated this subcommand for you. If you are writing the GET DATA command syntax yourself, then you'd have to supply that list of field names yourself.

SSIS - Reading from a multi part text file

I'm trying to build an SSIS package that reads from a text file and outputs into another text file. The catch is that the file I'm trying to read from has multiple sections and I can't find anything that shows how to do that.
The file looks like this:
[sectionA]
key1=value1
key2=value2
key3=value3
[sectionB]
key4=value4
key5=value5
key6=value6
I started with a couple of tutorials that read from a flat file source but the data gets pulled into an equally ugly table. Hoping someone has some input on this.
The SSIS Flat File Connection is built for speed, so it doesn't allow for niceties like that.
I would still use the Flat File Connection but just load all the data into a single, wide NVARCHAR column in a SQL table. I would add an IDENTITY column to that table to get a relative Row Number.
Then I would add downstream tasks using SQL to select by section, e.g. for Section A rows:
WHERE File_Row_Number > ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionA]' )
AND File_Row_Number < ( SELECT MIN ( File_Row_Number ) FROM Staging_Table WHERE nvarchar_column = '[sectionB]' )
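Put together, a sketch of the staging table and the Section A query (names per the answer above; the NVARCHAR width is an assumption):

CREATE TABLE dbo.Staging_Table (
    File_Row_Number INT IDENTITY(1,1) PRIMARY KEY,
    nvarchar_column NVARCHAR(4000)
);

-- After the Flat File load, pull just the Section A key=value rows:
SELECT nvarchar_column
FROM dbo.Staging_Table
WHERE File_Row_Number > (SELECT MIN(File_Row_Number) FROM dbo.Staging_Table WHERE nvarchar_column = '[sectionA]')
  AND File_Row_Number < (SELECT MIN(File_Row_Number) FROM dbo.Staging_Table WHERE nvarchar_column = '[sectionB]');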
If the split requirements are as simple as those shown, I might attempt them in SQL, e.g.
How do I split a string so I can access item x?
But I would probably lean towards using Strings.Split in a Script Task, where the code will be simpler and safer.
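If the SQL route is enough, here is a minimal sketch of splitting the key=value rows on the first '=' with CHARINDEX and SUBSTRING (reusing the hypothetical staging names from above):

SELECT
    LEFT(nvarchar_column, CHARINDEX('=', nvarchar_column) - 1)            AS KeyName,
    SUBSTRING(nvarchar_column, CHARINDEX('=', nvarchar_column) + 1, 4000) AS KeyValue
FROM dbo.Staging_Table
WHERE nvarchar_column LIKE '%=%'; -- skips the [section...] header rows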

How to copy csv data file to Amazon RedShift?

I'm trying to migrate some MySQL tables to Amazon Redshift, but I've met some problems.
The steps are simple:
1. Dump the MySQL table to a csv file
2. Upload the csv file to S3
3. Copy the data file to RedShift
Error occurs in step 3:
The SQL command is:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ',' csv;
The error info:
An error occurred when executing the SQL command: copy TABLE_A from
's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx ERROR: COPY CSV is
not supported [SQL State=0A000] Execution time: 0.53s 1 statement(s)
failed.
I don't know if there are any limitations on the format of the CSV file, say the delimiters and quotes; I cannot find it in the documentation.
Can anyone help?
The problem was finally resolved by using:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS
'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' delimiter ','
removequotes;
More information can be found here: http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
Now Amazon Redshift supports the CSV option for the COPY command. It's better to use this option to import CSV-formatted data correctly. The format is shown below.
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV;
The default delimiter is a comma ( , ) and the default quote character is a double quote ( " ). You can also import TSV-formatted data with the CSV and DELIMITER options, like this:
COPY [table-name] FROM 's3://[bucket-name]/[file-path or prefix]'
CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' CSV DELIMITER '\t';
There is a disadvantage to using the old way (DELIMITER and REMOVEQUOTES): REMOVEQUOTES does not support having a newline or a delimiter character within an enclosed field. If the data can include these kinds of characters, you should use the CSV option.
See the following link for the details.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
If you want to save yourself some code, or you have a very basic use case, you can use Amazon Data Pipeline.
It starts a spot instance and performs the transformation within the Amazon network, and it's a really intuitive tool (but very simple, so you can't do complex things with it).
You can try this:
copy TABLE_A from 's3://ciphor/TABLE_A.csv' CREDENTIALS 'aws_access_key_id=xxxx;aws_secret_access_key=xxxx' csv;
CSV itself means comma-separated values, so there is no need to provide a delimiter. Please refer to this link:
http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format
I always use this code:
COPY clinical_survey
FROM 's3://milad-test/clinical_survey.csv'
iam_role 'arn:aws:iam::123456789123:role/miladS3xxx'
CSV
IGNOREHEADER 1
;
Description:
1- COPY the name of your file stored in S3
2- FROM the address of the file
3- iam_role is a substitute for CREDENTIALS. Note that the iam_role should be defined in the IAM management menu of your console, and then in the trust menu it should be assigned to the user as well (that is the hardest part!)
4- CSV uses the comma delimiter
5- IGNOREHEADER 1 is a must! Otherwise it will throw an error. (It skips one row of the CSV, treating it as a header.)
Since the resolution has already been provided, I'll not repeat the obvious.
However, in case you receive some further error that you're not able to figure out, simply execute this on your workbench while connected to any of the Redshift accounts:
select * from stl_load_errors [where ...];
stl_load_errors contains all the Amazon Redshift load errors in historical fashion; a normal user can view details corresponding to his/her own account, but a superuser has access to all of them.
The details are captured elaborately at:
Amazon STL Load Errors Documentation
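For example, a query along these lines surfaces the most recent failures with the columns that usually matter (stl_load_errors is a standard Redshift system table):

select starttime, filename, line_number, colname, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;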
A little late to comment, but it can be useful:
You can use an open source project to copy tables directly from mysql to redshift - sqlshift.
It only requires Spark, and if you have YARN then it can also be used.
Benefit: it will automatically decide the distkey and interleaved sortkey using the primary key.
It looks like you are trying to load a local file into a Redshift table.
The CSV file has to be on S3 for the COPY command to work.
If you can extract data from the table to a CSV file, you have one more scripting option. You can use a Python/boto/psycopg2 combo to script your CSV load to Amazon Redshift.
In my MySQL_To_Redshift_Loader I do the following:
Extract data from MySQL into a temp file:
loadConf = [db_client_dbshell, '-u', opt.mysql_user, '-p%s' % opt.mysql_pwd, '-D', opt.mysql_db_name, '-h', opt.mysql_db_server]
...
q="""
%s %s
INTO OUTFILE '%s'
FIELDS TERMINATED BY '%s'
ENCLOSED BY '%s'
LINES TERMINATED BY '\r\n';
""" % (in_qry, limit, out_file, opt.mysql_col_delim,opt.mysql_quote)
from subprocess import Popen, PIPE  # needed for the two-process pipe below

p1 = Popen(['echo', q], stdout=PIPE, stderr=PIPE, env=env)
p2 = Popen(loadConf, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
...
Compress and load the data to S3 using the boto Python module and multipart upload:
import boto
from boto.s3.key import Key  # imports assumed by this excerpt

conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucket_name)
k = Key(bucket)
k.key = s3_key_name
k.set_contents_from_file(file_handle, cb=progress, num_cb=20,
                         reduced_redundancy=use_rr)
Use the psycopg2 COPY command to append the data to the Redshift table:
sql="""
copy %s from '%s'
CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s'
DELIMITER '%s'
FORMAT CSV %s
%s
%s
%s;""" % (opt.to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,opt.delim,quote,gzip, timeformat, ignoreheader)

SQL 2008 BULK INSERT with Spanish characters

I'm using SQL BULK INSERT from a CSV file with some Spanish names like Zuñiga. The CSV file is in UTF-8 format (as far as I know).
These show up in the table in one of these two formats:
For NVARCHAR - Zu├▒iga
for VARCHAR - Zuñiga
The command I'm using is:
BULK INSERT temp_table FROM '<some CSV file>'
WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
I was also testing all variations of CODEPAGE and DATAFILETYPE, with similar results.
UPDATE
Saved the CSV (using Notepad's Save As) as Unicode, and that fixed the problem. But I need some kind of automatic solution; I'd prefer to fix the SQL to handle it and not to preprocess the CSV.
You cannot use CODEPAGE = 'RAW'; you need to specify the proper code page so that the file reader understands the content of the file. If the file is truly UTF-8 then you should set the code page to 65001.
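As a hedged sketch, the original command with only the code page changed would look like this. Note that CODEPAGE = '65001' (UTF-8) is only accepted by newer SQL Server versions (2014 SP2 / 2016 onwards); on SQL 2008 the usual fallback is saving the file as UTF-16 and using DATAFILETYPE = 'widechar':

BULK INSERT temp_table FROM '<some CSV file>'
WITH (CODEPAGE = '65001',  -- UTF-8; rejected by SQL Server 2008
      DATAFILETYPE = 'char',
      FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)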