Detecting the value 0x in a varchar column sourced from Excel - sql-server-2008

I have a sql table that gets populated via SQLBulkCopy from Excel. The copy down is done using the Microsoft ACE drivers.
I had a problem with one particular file - when it was loaded down to sql, some of the columns (which appear empty in excel) contained an odd value.
For example, running this sql:
SELECT
CONVERT(VARBINARY(10),MyCol),
LEN(MyCol)
FROM MyTab
would return
0x, 0
i.e. - converting the value in the column to varbinary shows something, but doing length of the varchar shows no length. I realise that the value shown is the stem of a hex value, but its weird that its gets there, and how hard it is to detect.
Obviously I can just clear out the cells in Excel, but I really need to detect this automatically as end users will have the same issue. It is causing issues further down the line when the data gets processed. Its quite hard to trace the problem back from its eventual symptoms to being this issue in the source.
Other than the above conversion to varbinary to output in SSMS I've not come up with a way of detecting these values, either in Excel or via a SQL script to remove them.
Any ideas?

This may help you:
-- Conversion from hex string to varbinary:
DECLARE #hexstring VarChar(MAX);
SET #hexstring = 'abcedf012439';
SELECT CAST('' AS XML).Value('xs:hexBinary( substring(sql:variable("#hexstring"), sql:column("t.pos")) )', 'varbinary(max)')
FROM (SELECT CASE SubString(#hexstring, 1, 2) WHEN '0x' THEN 3 ELSE 0 END) AS t(pos)
GO
-- Conversion from varbinary to hex string:
DECLARE #hexbin VarBinary(MAX);
SET #hexbin = 0xabcedf012439;
SELECT '0x' + CAST('' AS XML).Value('xs:hexBinary(sql:variable("#hexbin") )', 'varchar(max)');
GO
One method is to add a new column, convert the data, drop the
old column and rename the new column to the old name.

As Martin points out above, 0x is what you get when you convert an empty string. eg:
SELECT CONVERT(VARBINARY(10),'')
So the problem of detecting it obviously goes away.
I have to assume that there is some rubbish in the excel cell, that is being filtered out in the process of the write down by either the ACE driver or the SQLBulkCopy. Because there was something in the field originally, the value written is empty instead of null.
In order to make sure that everything is consistent in the data we'll need to do a post process to switch all empty values to nulls so that the next lots of scripts work.

Related

How to convert date from csv file into integer

I have to send data from csv into SQL DB.
Problem starts when I try to convert data into Int. It wasnt my idea and I really cant do much with this datatype. When I'm trying to achieve this problem pop up:
Data Conversion 2: Data conversion failed while converting column
"pr_czas" (387) to column "C pr_dCz_id" (14). The conversion returned
status value 2 and status text "The value could not be converted
because of a potential loss of data.".
Tried already to ignore this problem but then another problems came up so there is no other way than solving this.
I have to convert this data from csv file which is str 50 into int 4
It must be int4. One of the requirements Dont know what t odo.
This is data I'm trying to put into int4. Look on pr_czas
This is data's datatype
Before I tried to do same thing with just DD.MM.YYYY but got same result...
Given an input column named [pr_czas] that contain string values that look like 31.01.2020 00:00 which appears to be a formatted date time represented in the format "DD.mm.YYYY HH:MM", I would like to express that as a whole number DDMMYYHHMM
Add a derived column to your data flow and call this new_pr_czas
The logic I'm going to use is a series of REPLACE statements and cast the final result to an integer. Replace the period, replace the colon and the space - all with nothing
(DT_I8)REPLACE(REPLACE(REPLACE([pr_czas], ".", ""), ":", ""), " ", "")
This is an easy case but things to note.
An integer/int32/I4 has a maximum value of 2 billion.
310120200000 is too large to fit into that space so you would need to make that an bigint/int64/I8. If I remember your previous question, you were having troubles with a lookup task so this data type mismatch might hurt you there.
The other thing to be aware of is that leading zeros will be dropped when converted to a number because they are not significant. If you need to retain the leading zeros, then you're working with string data type. This is an advantage to working with the ISO standard but if your data expects DD, then far be it for me to say otherwise.
If you need to slice your date into another format, then you'll want to have a few derived columns. The first one will generate a string column for each piece of pr_czas - year, month, day, hour and minute. You'll use the substring method for this and findstring to find the period space and colon.
The next data flow will be used to put those string pieces back into the new format and cast that to I8. Why? Because you can't debug doing it all in one shot but you can put a data viewer between two derived columns to figure out where a slice went awry.

BCP CHAR value to Snowflake

I am trying to create a BCP file with | delimiter and then load it to a snowflake table.
Issue:
in SQL server there are columns defined as CHAR(4) and have values "sss"
so when i do BCP the its being padded to length of 4 "sss " and being loaded to snowflake
due to which our reports are failing because they do something like where column="SSS" but due to trailing space in snowflake the correct columns are not showing up.
we do not want to change our reports. So, is there a way that BCP can handle the padding or trimming of these columns?
note that there 24 tables and each have around 130+ columns so i cant go and put Trim functions on each char column
If your BCP file is maintaining the trailing space, then Snowflake will retain it, too, as long as the field is being FIELD_OPTIONALLY_ENCLOSED_BY a " or '. You may also want to make sure your TRIM_SPACE option is correctly set on your format definition for your COPY INTO command.
If your BCP file isn't maintaining the space and you can't figure out how to get that to work, you could force the space back in during the COPY INTO command with some string functions in your SELECT, or you could create a view for your report that does the same set of string functions to force the space for your report to work from.
So, is there a way that BCP can handle the padding or trimming of these columns?
Yes, but not by some switch or option. The correct way to handle this is to set your datatypes up front. As someone mentioned in comments to your question, your query that is creating BCP output should use VARCHAR(4) instead of CHAR(4). BCP is giving you what you asked of it. They way to avoid whitespace is to use varchar.
Seems like a fairly quick "find and replace" against scripted out query objects would work fine but you know your situation best.
Additionally, "trim" wont work - FYI. Even if the value of the field was only "SSS" (as in your example); if the result/column is defined as CHAR(4) you will get 4 bytes of data and a blank in the 4th place since you only had 3 bytes of data. Trim will work during the query... the padded " " you are getting is placed there by the copy out. The way to correct this is to set your data types as you need up front.
Unless someone knows of a better way in snowflake (im not familiar with it) the only other option is to manipulate the file inbetween SQL and Snowflake. replace " |" with "|"... but... blech.
This is a known "issue" with BCP. The "solution" is to use the queryout option, which means you must include a query with every export. But the data are the way they are.
Eg: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/88c258fe-d1a6-4f3a-9dac-40388d04e9c7/remove-space-in-columns-on-bcp-out?forum=transactsql
But this is really a Snowflake problem, because Snowflake has its own default CHAR semantics.
You get a warning in the documentation String & Binary Data Types but that doesn't tell the whole truth.
The following executed on Oracle (and apparently MSSQL? MySQL?) will select the aaa line:
CREATE TABLE C AS SELECT CAST('aaa ' AS CHAR(4)) t FROM DUAL;
SELECT * FROM C WHERE t = 'aaa';
but won't on Snowflake, unless you create the column with COLLATION:
CREATE OR REPLACE TABLE C (t CHAR(4) COLLATE 'en_US-rtrim');
INSERT INTO C VALUES('aaa ');
SELECT * FROM C WHERE t = 'aaa';
Unfortunately, you can't ALTER the collation after creation, which would have been convenient after a COPY INTO <table>.
PS: Mike Walton's answer is better, TRIM_SPACE is much cleaner than COLLATE.

Why does SSIS TOKEN function fail to count adjacent column delimiters?

I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN().
This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example:
1^Apple^0001^01/01/2010^Anteater^A1
2^Banana^0002^03/15/2010^Bear^B2
3^Cranberry^0003^4/15/2010^Crow^C3
If these strings are in a column called OldImportRecord, the delimiter is a caret (as shown), and we wish to put the fifth field into a Derived Column, we would use an expression like:
TOKEN(OldImportRecord,"^",5)
This returns Anteater, Bear, Crow, etc. In fact, we can create Derived Columns for each of the fields in this record (note that the index is one-based), change them as needed, and then build another delimited record for export.
Here's the problem. What if some of our data includes some empty strings (or Nulls rendered as empty strings)?
4^^0004^6/15/2010^Duck^D4
The TOKEN() fails to count the adjacent column delimiters, which throws off the column count. Now it only sees five columns instead of six columns. Our TOKEN(OldImportRecord,"^",5) returns "D4" instead of the intended "Duck". When we extract the fourth column, we wind up trying to put "Duck" into a Date column, and all sorts of fun ensues.
Here's a partial workaround:
TOKEN(REPLACE(OldImportRecord,"^^","^ ^"),"^",5)
Notice this misses every second delimiter pair, so it will fail for a string like "5^^^^Emu^E5", which looks like"5^ ^^ ^Emu^E5" after the REPLACE(). The column count is still wrong.
So here's my full workaround. This includes two nested REPLACE statements(), an RTRIM() to remove the superfluous spaces, and a DT_STR cast because I would like to keep the result in VARCHAR:
(DT_STR,255,1252)RTRIM(TOKEN(REPLACE(REPLACE(OldImportRecord,"^^","^ ^"),"^^","^ ^"),"^",5))
I am posting this for information, since others may also run into this problem.
Does anyone have a better workaround, or even a real solution?
Reason for the issue:
TOKEN method in SSIS uses the implementation of strtok function in C++. I gathered this information while reading the book Microsoft® SQL Server® 2012 Integration Services. It is mentioned as note on page 113 (I like this book! Lots of nice information.).
I searched for the implementation of strtok function and I found the following links.
INFO: strtok(): C Function -- Documentation Supplement - The code sample in this link shows that the function does ignore consecutive delimiter characters.
The answers to the following SO questions point out that strtok function is designed to ignore consecutive delimiters.
Need to know when no data appears between two token separators using strtok()
strtok_s behaviour with consecutive delimiters
I think that the TOKEN and TOKENCOUNT functions are working as per design but whether that is how SSIS should behave might be a question for the Microsoft SSIS team.
Original Post - Above section is an update:
I created a simple package in SSIS 2012 based on your data inputs. As you had described in your question, the TOKEN function does not behave as intended. I agree with you that the function doesn't seem to work. This post is not an answer to your original issue.
Here is an alternative way to write the expression in a relatively simpler fashion. This will only work if the last segment in your input record will always have a value (say A1, B2, C3 etc.).
Expression can be rewritten as:
This statement will take the input record as the parameter, the delimiter caret (^) as the second parameter. The third parameter calculates the total number segments in the records when split by the delimiter. If you have data in the last segment, you are guaranteed to have two segments. You can then subtract 1 to fetch the penultimate segment.
(DT_STR,50,1252)TOKEN(OldImportRecord,"^",TOKENCOUNT(OldImportRecord,"^") - 1)
I created a simple package with data flow task. OLE DB source retrieves the data and the derived transformation parses and splits the data as per the screenshot below. The output is then inserted into the destination table. You can see the source and destination tables in the last screenshot. Destination table has two columns. The first column stores the penultimate segment data and the segments count based on the delimiter (which again isn't correct). You can notice that the last record didn't fetch the correct results. If the last record didn't have the value 8, then the above expression will fail because the expression will evaluate to zero index.
Hope that helps to simplify your expression.
If you don't hear from anyone else, I would recommend logging this issue in Microsoft Connect website.
Create table and populate scripts:
CREATE TABLE [dbo].[SourceTable](
[OldImportRecord] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[DestinationTable](
[NewImportRecord] [varchar](50) NOT NULL,
[CaretCount] [int] NOT NULL
) ON [PRIMARY]
GO
INSERT INTO dbo.SourceTable (OldImportRecord) VALUES
('1^Apple^0001^01/01/2010^Anteater^A1'),
('2^Banana^0002^03/15/2010^Bear^B2'),
('3^Cranberry^0003^4/15/2010^Crow^C3'),
('4^^0004^6/15/2010^Duck^D4'),
('5^^^^Emu^E5'),
('6^^^^Geese^F6'),
('^^^^Pheasant^G7'),
('8^^^^Sparrow^');
GO
Derived column transformation inside data flow task:
Data in source and destination tables:
Not only does TOKEN skip adjacent delimiters, it also skips leading and trailing delimiters as well. So, using your example, if you had a field "good" field that looks like this:
1^Apple^0001^01/01/2010^Anteater^A1
Followed by one with adjacent and leading delimiters like this:
^^^0004^6/15/2010^Duck^
TOKENCOUNT would only find two delimiters and you'd end up with 0004 assigned to Token1, 6/15/2010 for Token2, and Duck for Token3.
I used a different kind of replace. Rather than placing spaces between adjacent delimiters, which wouldn't help with leading or training, I used replace to surround the delimiters with characters I absolutely wouldn't find in my text. The following Expression works well for me. It's wordy, but it is what it is.
(DT_STR,255,1252)REPLACE(TOKEN(REPLACE(OldImportRecord,"^","~^~"),"^",1),"~","")
Of course, you'd replace the number 1 with whatever Token you wanted and adjust the cast according to your needs. Hope that helps.

SSIS how to convert string (DT_STR) to money (DT_CY) when source has more than 2 decimals

I have a source flat file with values such as 24.209991, but they need to load to SQL Server as type money. In the DTS (which I am converting from), that value comes across as 24.21. How do I convert that field in SSIS?
Right now, I am just changing the type from DT_STR to DT_CY, and it gives a run error of 'Data conversion failed. The data conversion for column "Col003" returned status value 2 and status text "The value could not be converted because of a potential loss of data.".'
Do I use a Data Conversion task? And then what?
I've also tried setting the source output column to DT_NUMERIC, and then convert that to DT_CY, with the same result.
I've also tried using Derived Columns, casting the DT_STR field Col003 to (DT_NUMERIC,10,2)Col003 and then casting that to (DT_CY)Col003_Numeric. That's getting a cast error.
The flat file defaults to all fields being DT_STR. Use the Advanced option on editing the connection to have the numeric field as float (DT_R4). Then, in the advanced editing of the Flat File Source (on the Data Flow tab), set that output column to money (DT_CY).
Then, the field will convert without any additional conversions. The issue was leaving the source file definition as DT_STR.
If you don't have any null value use Data Conversion, and make sure you don't have any funny character (e.g. US$200 produce error)
If you have null or empty fields in your field and you are using Flat file source, make sure that you tick "Return null value from source.."
Another trick I have used is something like: (taxvalue != "" ? taxvalue : NULL(DT_WSTR,50)). in Derived Column transformation (you can just replace the field)
Generally SSIS doesn't convert empty strings to money properly.
For some reason in my scenario, the OLE DB Destination actually was configured to accept a DT_CY. However, casting to this format (no matter the length of the input and destination data, and no matter wether or not the data was NULL when it arrived) always caused the same issue.
After adding data viewers, I can conclude that this has something to do with the locale. Here in Denmark, we use comma (,) as decimal delimiters and dots (.) as thousands-delimiters, instead of the opposite.
This means that a huge number like 382,939,291,293.38 would (after the conversion to DT_CY) look like 382.939.291.293,38. Even though I highly doubted that it could be the issue, I decided to do the opposite of what I originally had intended.
I decided to go to the advanced settings of my OLE DB Destination and change the DT_CY column's type to DT_STR instead. Then, I added a Derived Column transformation, and entered the following expression to transform the column before the data would arrive at the destination.
REPLACE(SUBSTRING(Price, 2, 18), ",", ".") where Price was the column's name.
To my big surprise, this solved the problem, since I figured out that my OLE DB Destination was now sending the data as a string, which the SQL Server understood perfectly fine.
I am certain that this is a bug! I was using SQL Server 2008, so it might have been solved in later editions. However, I find it quite critical that such an essential thing is not working correctly!

How do I get SSIS Data Flow to put '0.00' in a flat file?

I have an SSIS package with a Data Flow that takes an ADO.NET data source (just a small table), executes a select * query, and outputs the query results to a flat file (I've also tried just pulling the whole table and not using a SQL select).
The problem is that the data source pulls a column that is a Money datatype, and if the value is not zero, it comes into the text flat file just fine (like '123.45'), but when the value is zero, it shows up in the destination flat file as '.00'. I need to know how to get the leading zero back into the flat file.
I've tried various datatypes for the output (in the Flat File Connection Manager), including currency and string, but this seems to have no effect.
I've tried a case statement in my select, like this:
CASE WHEN columnValue = 0 THEN
'0.00'
ELSE
columnValue
END
(still results in '.00')
I've tried variations on that like this:
CASE WHEN columnValue = 0 THEN
convert(decimal(12,2), '0.00')
ELSE
convert(decimal(12,2), columnValue)
END
(Still results in '.00')
and:
CASE WHEN columnValue = 0 THEN
convert(money, '0.00')
ELSE
convert(money, columnValue)
END
(results in '.0000000000000000000')
This silly little issue is killin' me. Can anybody tell me how to get a zero Money datatype database value into a flat file as '0.00'?
I was having the exact same issue, and soo's answer worked for me. I sent my data into a derived column transform (in the Data Flow Transform toolbox). I added the derived column as a new column of data type Unicode String ([DT_WSTR]), and used the following expression:
Price < 1 ? "0" + (DT_WSTR,6)Price : (DT_WSTR,6)Price
I hope that helps!
Could you use a Derived Column to change the format of the value? Did you try that?
I used the advanced editor to change the column from double-precision float to decimal and then set the Scale to 2:
Since you are exporting to text file, just export data preformatted.
You can do it in the query or create a derived column, whatever you are more comfortable with.
I chose to make the column 15 characters wide. If you import into a system that expects numbers those zeros should be ignored...so why not just standardize the field length?
A simple solution in SQL is as follows:
select
cast(0.00 as money) as col1
,cast(0.00 as numeric(18,2)) as col2
,right('000000000000000' + cast( 0.00 as varchar(10)), 15) as col3
go
col1 col2 col3
--------------------- -------------------- ---------------
.0000 .00 000000000000.00
Simply replace '0.00' with your column name and don't forget to add the FROM table_name, etc..
It is good to use derived column and need to check the condition as well
pricecheck <=0 ? "0" + (DT_WSTR,10)pricecheck : (DT_WSTR,10)pricecheck
or alternative way is to use vb script
Ultimately what I ended up doing was using the FORMAT() function.
CAST(FORMAT(balance, '0000000000.0000') AS varchar(30)) AS "balance"
This does have some significant CPU performance impact (often at least an order of magnitude) due to the way SQL Server implements that function, but nothing worked easier, more correctly, or more consistently for me. I was working with less than 100,000 rows and the package executes no more than once an hour. Going from 100ms to 1000ms just wasn't a big deal in my situation.
The FORMAT() function returns an nvarchar(4000) by default, so I also cast it back to a varchar of appropriate size since my output file needed to be in Windows-1252 encoding. Transcoding text is much more obnoxious in SSIS than it has any right to be.