AWS Data Pipeline RedShift "delimiter not found" error - csv

I'm working on a data pipeline. In one of the steps a CSV from S3 is consumed by a Redshift DataNode. My Redshift table has 78 columns, which I checked with:
SELECT COUNT(*) FROM information_schema.columns WHERE table_name = 'my_table';
After the RedshiftCopyActivity fails, the 'stl_load_errors' table shows a "Delimiter not found" (1214) error for line number 1, for column namespace (the second column, varchar(255)) at position 0. The consumed CSV line looks like this:
0,my.namespace.string,2119652,458031,S,60,2015-05-02,2015-05-02 14:51:02,2015-05-02 14:51:14.0,1,Counter,1,Counter 01,91,Chaymae,0,,,,227817,1,Dine In,5788,2015-05-02 14:51:02,2015-05-02 14:51:27,17.45,0.00,0.00,17.45,,91,Chaymae,0,0.00,12,M,A,-1,13,F,0,0,2,2.50,F,1094055,Coleslaw Md Upt,8,Sonstige,900,Sides,901,Sides,0.00,0.00,0,,,0.0000,0,0,,,0.00,0.0000,0.0000,0,,,0.00,0.0000,,1,Woche Counter,127,Coleslaw Md Upt,2,2.50
After a simple replacement ("," to "\n") I get 78 lines, so it looks like the data should match... I'm stuck on this. Does anyone know how I can find more information about the error, or see a solution?
EDIT
Query:
select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();
returns 0 rows.
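Querying stl_load_errors directly, without the join to stl_loaderror_detail, can still surface the raw line and the reported position for recent loads; a minimal diagnostic sketch:
select starttime, filename, line_number, colname, position, err_code, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;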

I figured it out and maybe it will be useful for someone else:
There were in fact two problems.
My first field in the Redshift table was of type INT IDENTITY(1,1), and in the CSV I had a 0 value there. After removing the first column from the CSV, everything was copied without a problem, even without a specified column mapping, provided that...
the DELIMITER ',' commandOption was added to the S3ToRedshiftCopyActivity to force the comma. Without it, Redshift recognized the dot in the namespace (my.namespace.string) as the delimiter.
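For reference, once DELIMITER ',' is passed through commandOptions, the COPY that the activity runs looks roughly like the sketch below (the table name, bucket path, and IAM role are placeholders, not taken from the pipeline above):
-- placeholder table, S3 path, and role for illustration only
COPY my_table
FROM 's3://my-bucket/path/to/file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER ',';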

You need to add FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt'. The AWS documentation does not call this out clearly. Please note that this only works when your data is in JSON form, like:
{"attr1": "val1", "attr2": "val2"}
{"attr1": "val1", "attr2": "val2"}
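A minimal sketch of such a COPY, assuming a hypothetical table, bucket prefix, and IAM role (the jsonpaths file is the one mentioned above):
-- placeholder table, S3 prefix, and role for illustration only
COPY my_table
FROM 's3://yourbucketname/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS JSON 's3://yourbucketname/aJsonPathFile.txt';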

Related

BCP CHAR value to Snowflake

I am trying to create a BCP file with a | delimiter and then load it into a Snowflake table.
Issue:
In SQL Server there are columns defined as CHAR(4) that have values like "sss",
so when I do the BCP export the value is padded to a length of 4 ("sss ") and loaded into Snowflake that way.
Because of this our reports are failing: they do something like where column="SSS", but due to the trailing space in Snowflake the correct rows are not showing up.
We do not want to change our reports. So, is there a way that BCP can handle the padding or trimming of these columns?
Note that there are 24 tables, each with around 130+ columns, so I can't go and put TRIM functions on each CHAR column.
If your BCP file is maintaining the trailing space, then Snowflake will retain it too, as long as the field is enclosed by a " or ' and FIELD_OPTIONALLY_ENCLOSED_BY is set accordingly. You may also want to make sure your TRIM_SPACE option is correctly set on the file format definition for your COPY INTO command.
If your BCP file isn't maintaining the space and you can't figure out how to get that to work, you could force the space back in during the COPY INTO command with some string functions in your SELECT, or you could create a view that applies the same string functions and have your report work from the view.
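For the TRIM_SPACE route, a minimal sketch of a pipe-delimited file format and COPY INTO, assuming the BCP output is not quote-enclosed (the stage, table, and format names are placeholders):
-- placeholder format, table, and stage names for illustration only
CREATE OR REPLACE FILE FORMAT bcp_pipe_format
  TYPE = CSV
  FIELD_DELIMITER = '|'
  TRIM_SPACE = TRUE;

COPY INTO my_table
FROM @my_stage/my_bcp_file.dat
FILE_FORMAT = (FORMAT_NAME = 'bcp_pipe_format');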
So, is there a way that BCP can handle the padding or trimming of these columns?
Yes, but not with a switch or option. The correct way to handle this is to set your data types up front. As someone mentioned in the comments on your question, the query that creates the BCP output should use VARCHAR(4) instead of CHAR(4). BCP is giving you what you asked of it. The way to avoid the trailing whitespace is to use VARCHAR.
Seems like a fairly quick "find and replace" against the scripted-out query objects would work fine, but you know your situation best.
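A hedged sketch of such an export query (the table and column names are hypothetical); note that a plain CAST from CHAR(4) keeps the pad space, so the column is trimmed explicitly before the cast:
-- hypothetical table and columns; RTRIM drops the CHAR(4) pad space,
-- the CAST just documents the intended width of the exported field
SELECT
    CAST(RTRIM(status_code) AS VARCHAR(4)) AS status_code,
    other_column
FROM dbo.my_table;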
Additionally, "trim" won't work, FYI. Even if the value of the field was only "SSS" (as in your example), if the result/column is defined as CHAR(4) you will get 4 bytes of data, with a blank in the 4th place, since you only had 3 bytes of data. TRIM will work during the query... the padded " " you are getting is placed there by the copy out. The way to correct this is to set your data types as you need up front.
Unless someone knows of a better way in Snowflake (I'm not familiar with it), the only other option is to manipulate the file in between SQL Server and Snowflake: replace " |" with "|"... but... blech.
This is a known "issue" with BCP. The "solution" is to use the queryout option, which means you must include a query with every export. But the data are the way they are.
Eg: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/88c258fe-d1a6-4f3a-9dac-40388d04e9c7/remove-space-in-columns-on-bcp-out?forum=transactsql
But this is really a Snowflake problem, because Snowflake has its own default CHAR semantics.
You get a warning in the documentation under String & Binary Data Types, but that doesn't tell the whole truth.
The following executed on Oracle (and apparently MSSQL? MySQL?) will select the aaa line:
CREATE TABLE C AS SELECT CAST('aaa ' AS CHAR(4)) t FROM DUAL;
SELECT * FROM C WHERE t = 'aaa';
but won't on Snowflake, unless you create the column with COLLATION:
CREATE OR REPLACE TABLE C (t CHAR(4) COLLATE 'en_US-rtrim');
INSERT INTO C VALUES('aaa ');
SELECT * FROM C WHERE t = 'aaa';
Unfortunately, you can't ALTER the collation after creation, which would have been convenient after a COPY INTO <table>.
PS: Mike Walton's answer is better, TRIM_SPACE is much cleaner than COLLATE.

How to change the field name in BigQuery?

I have a nested JSON to upload in Big Query.
{
  "status": {
    "sleep": "12333",
    "wake": "3837"
  }
}
After inserting it into BigQuery, I am getting the field names as:
status_sleep and status_wake
I require the field names to be separated by a delimiter like '.' (or any other delimiter):
status.sleep and status.wake
Please suggest how to add the field delimiter. I checked that there is a field delimiter key for uploading the data in CSV format.
After you insert data with the above schema, you have a record named status with two fields in it: status.sleep and status.wake.
When you query it as
SELECT * FROM yourtable
without providing aliases, you will get output columns named status_sleep and status_wake, because dot notation is reserved for referencing nested data.
But you can still reference your data with dots, as below:
SELECT status.sleep as sleep, status.wake as wake FROM yourtable

Removing single quotes from a flat file when loading to Hive

Hey, I'm creating a Hive external table over my flat file data.
The data in my flat file is something like this:
'abc',3,'xyz'
When I load it into the Hive table, the results still contain the single quotes.
But I want it to be something like this:
abc,3,xyz
Is there any way to do this?
I can think of two ways to get the desired result.
1. Use the existing string functions available in Hive: SUBSTR and LENGTH.
select SUBSTR("\'abc\'",2,length("\'abc\'")-2) , SUBSTR("\'3\'",2,length("\'3\'")-2) , SUBSTR("\'xyz\'",2,length("\'xyz\'")-2)
Generalized query:
select SUBSTR(col1,2,length(col1)-2) , SUBSTR(col2,2,length(col2)-2) , SUBSTR(col3,2,length(col3)-2)
NOTE: Hive's SUBSTR method expects the string index to start at 1, not 0.
2. Write your own UDF to chop the first and last character off every string.
How do you convert a million rows?
Let's assume you have a table (named "staging") with 3 columns and 1 million records.
If you run the query below, you will have a new table, "final", with no single quotes at the start or end of the values.
INSERT INTO final SELECT SUBSTR(col1,2,length(col1)-2) , SUBSTR(col2,2,length(col2)-2) , SUBSTR(col3,2,length(col3)-2) from staging
Once the above query finishes, you will have your desired result in the "final" table.

How to replace all occurrences of matching string in a database table using ColdFusion

I'm working with an MS Access database, using one particular table. Scattered throughout the table, at varying positions in the date columns (which themselves can be in varying orders as a result of the data import), is the text "Not known". I want to replace occurrences of that text string across the whole data table.
The only way I can think of doing it is to export to CSV format, do a REReplace, and then import the data again, but I would like to know if there is a 'slicker' way.
The columns contain data imported from a CSV file, so all the columns are text; they can contain a mix of "date string", text, numbers (as strings), and null.
You can use Replace(); it follows the basic T-SQL implementation:
http://msdn.microsoft.com/en-us/library/ms186862.aspx
Here is an example I did updating the customers table of the Northwind sample database:
update customers set Customers.[Job Title] = replace( Customers.[Job Title], 'Purchasing', 'Manufacturing');
So to distill it into a generic example:
update TABLENAME set FIELD =
replace( FIELD, 'STRING_TO_REPLACE', 'STRING_TO_REPLACE_WITH' )
That updates the entire table in one statement. Be careful ;)
You can do this in Access by running the Edit > Replace command. If you need to do this in code, you can open a recordset, loop through the records, and for each field run:
rst.fields(i) = replace(rst.fields(i), "Not known", "Something")
This is how it works in VBA; I believe you can do something similar in ColdFusion.
Why not just open the CSV file in Notepad++ (or similar) and do a Find/Replace?

How do I get SSIS Data Flow to put '0.00' in a flat file?

I have an SSIS package with a Data Flow that takes an ADO.NET data source (just a small table), executes a select * query, and outputs the query results to a flat file (I've also tried just pulling the whole table and not using a SQL select).
The problem is that the data source pulls a column that is a Money datatype, and if the value is not zero, it comes into the text flat file just fine (like '123.45'), but when the value is zero, it shows up in the destination flat file as '.00'. I need to know how to get the leading zero back into the flat file.
I've tried various datatypes for the output (in the Flat File Connection Manager), including currency and string, but this seems to have no effect.
I've tried a case statement in my select, like this:
CASE WHEN columnValue = 0 THEN
'0.00'
ELSE
columnValue
END
(still results in '.00')
I've tried variations on that like this:
CASE WHEN columnValue = 0 THEN
convert(decimal(12,2), '0.00')
ELSE
convert(decimal(12,2), columnValue)
END
(Still results in '.00')
and:
CASE WHEN columnValue = 0 THEN
convert(money, '0.00')
ELSE
convert(money, columnValue)
END
(results in '.0000000000000000000')
This silly little issue is killin' me. Can anybody tell me how to get a zero Money datatype database value into a flat file as '0.00'?
I was having the exact same issue, and soo's answer worked for me. I sent my data into a derived column transform (in the Data Flow Transform toolbox). I added the derived column as a new column of data type Unicode String ([DT_WSTR]), and used the following expression:
Price < 1 ? "0" + (DT_WSTR,6)Price : (DT_WSTR,6)Price
I hope that helps!
Could you use a Derived Column to change the format of the value? Did you try that?
I used the advanced editor to change the column from double-precision float to decimal and then set the Scale to 2.
Since you are exporting to a text file, just export the data preformatted.
You can do it in the query or create a derived column, whatever you are more comfortable with.
I chose to make the column 15 characters wide. If you import into a system that expects numbers those zeros should be ignored...so why not just standardize the field length?
A simple solution in SQL is as follows:
select
cast(0.00 as money) as col1
,cast(0.00 as numeric(18,2)) as col2
,right('000000000000000' + cast( 0.00 as varchar(10)), 15) as col3
go
col1 col2 col3
--------------------- -------------------- ---------------
.0000 .00 000000000000.00
Simply replace '0.00' with your column name, and don't forget to add the FROM table_name, etc.
It is good to use a derived column, and you need to check the condition as well:
pricecheck <=0 ? "0" + (DT_WSTR,10)pricecheck : (DT_WSTR,10)pricecheck
An alternative way is to use a VB script.
Ultimately what I ended up doing was using the FORMAT() function.
CAST(FORMAT(balance, '0000000000.0000') AS varchar(30)) AS "balance"
This does have some significant CPU performance impact (often at least an order of magnitude) due to the way SQL Server implements that function, but nothing worked easier, more correctly, or more consistently for me. I was working with less than 100,000 rows and the package executes no more than once an hour. Going from 100ms to 1000ms just wasn't a big deal in my situation.
The FORMAT() function returns an nvarchar(4000) by default, so I also cast it back to a varchar of appropriate size since my output file needed to be in Windows-1252 encoding. Transcoding text is much more obnoxious in SSIS than it has any right to be.