I have an SSIS package which should take data from a flat file (txt).
One of the fields should be an unsigned integer, and I need to load it into a SQL table.
In the "Flat File Connection Manager Editor" I set the "Format" of the flat file to Fixed width (I don't have any delimiters, only a spec file with column lengths).
The field I am talking about should be 4 characters long (according to the definition),
but in some values I get the "}" sign as the 4th character, for example: "010}".
I trusted the definition and tried to load this value into an unsigned integer with no luck.
Does anyone recognize such a format?
If you do, how can I load it into the proper data type?
Thank you in advance.
Oren.
There are several things that could be going wrong on your import. First you have to know the encoding of your original file:
How can I detect the encoding/codepage of a text file
The encoding will determine the actual size in bytes of your characters and, more importantly, how each character is stored. A Unicode string of four characters can be anywhere from four to sixteen bytes (maybe more if you have combining characters), depending on the encoding. An int is usually four bytes (DT_I4), but SSIS offers integer types up to eight bytes (DT_I8). So when you load your unknown number of bytes into the predetermined unsigned int, some bytes might get truncated and you end up with garbage values.
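To make the size point concrete, here is a quick T-SQL illustration (a sketch; DATALENGTH returns the number of bytes a value occupies):

-- The same four characters take different byte counts in different encodings:
SELECT DATALENGTH('abcd');   -- 4 bytes in a single-byte code page (varchar)
SELECT DATALENGTH(N'abcd');  -- 8 bytes in UTF-16 (nvarchar), two bytes per character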
If you don't know or can't find the encoding, I would assume it is UTF-8, but that's really not good practice. This is a little bit about it: http://en.wikipedia.org/wiki/UTF-8
You can also take a look at the character tables for different encodings (UTF-8, UTF-16, ...) and look for the "}" character and its matching value. That might give you a hint as to why it is showing up.
Then your flat file source should match that encoding. Check (or uncheck) the Unicode check box, or pick a matching "Code Page". Then load the value of that column into a string (of the right encoding), not an unsigned int.
Finally, when you know you have the right value, you can use a "Data Conversion" to cast it to an unsigned int, or anything else really.
EDIT: The "Data Conversion" will, as per its name, convert your imported value. That may not work, depending on how the original file was written. The "Derived Column" cast is your other option; it won't change the actual value, it just tells the engine to interpret those bits as another type (unsigned int).
If I understand your question right:
One way is to use a Derived Column transformation. Choose to add a new column in it.
If the data you are fetching comes in as the DT_WSTR data type, you can use the following expression to replace '}' with '' and then cast the result to the type of the field you want to load (here I am using DT_I4):
(DT_I4)REPLACE(character_expression, searchstring, replacementstring)
For example, assuming the incoming column is named MyColumn: (DT_I4)REPLACE(MyColumn, "}", "")
Then map the new column to the destination.
Hope it helps.
I'm searching for cases in MySQL/MariaDB where the value transmitted when storing will differ from the value that can be retrieved later on. I'm only interested in fields with non-binary string data types like VARCHAR and *TEXT.
I'd like to get a more comprehensive understanding of how much a stored value can be trusted. This is especially interesting for cases where the output just lacks certain characters (like the escape-character example below), as that is specifically dangerous when validating.
So, this boils down to: Can you create an input string (and/or define an environment) where this doesn't output <value> in the second statement?
INSERT INTO t SET v = <value>, id = 1; -- succeeds
SELECT v FROM t WHERE id = 1;
Things I can think of:
strings containing escaping (\a → a)
truncated if too long
character encoding of the table not supporting the input
Whether something fails silently probably also depends on how strict the SQL mode is set (as with the last two examples; see the sketch below).
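For instance, the truncation case can be reproduced like this (a sketch, assuming t has a column v VARCHAR(5) alongside id):

SET sql_mode = '';                        -- non-strict mode
INSERT INTO t SET v = 'abcdefgh', id = 1; -- succeeds, with only a warning
SELECT v FROM t WHERE id = 1;             -- returns 'abcde', not the transmitted value
SET sql_mode = 'STRICT_ALL_TABLES';       -- in strict mode the same INSERT errors instead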
Thanks a lot in advance for your input!
You can generally trust all databases to do what the standards prescribe. With strings and integers it is simple, because the server saves the binary representation of that number or character in your chosen character set.
Double- and single-precision (floating-point) values are different, because they can't always be saved exactly, so rounding comes into play (see how decimal fractions are represented in binary floating point).
That also follows the standards, but you have to account for it.
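A minimal sketch of the floating-point surprise in MySQL (table and column names are made up):

CREATE TABLE f (v FLOAT);
INSERT INTO f VALUES (0.1);
-- May return no rows: 0.1 has no exact binary representation, so the
-- stored FLOAT differs slightly from the decimal literal it is compared to.
SELECT * FROM f WHERE v = 0.1;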
A source database field of type INT is read through an OLE DB Source. It is eventually written to a Flat File Destination. The destination Flat File Connection Manager > Advanced page reports it as a four-byte signed integer [DT_I4].
This data type made me think it indicated binary. Clearly, it does not. I was surprised that it was not the more generic numeric [DT_NUMERIC].
I changed this type setting to single-byte signed integer [DT_I1]. I expected this to fail, but it did not. The process produced the same result, even though the value of the field was always > 127. Why did this not fail?
Some of the values that are produced are
1679576722
1588667638
1588667638
1497758544
1306849450
1215930367
1215930367
1023011178
1932102084
Clearly, outside the range of a single-byte signed integer [DT_I1].
As a related question, is it possible to output binary data to a flat file? If so, what settings and where should be used?
Data type validation
I think this issue is related to the connection manager that is used, since data type validation (outside the pipeline) is not done by Integration Services; it is done by the service provider:
OLEDB for Excel and Access
SQL Database Engine for SQL Server
...
When it comes to the Flat File connection manager, it doesn't guarantee any data type consistency, since all values are stored as text. As an example, add a Flat File connection manager, select a text file that contains names, change the columns' data types to Date, and go to the Columns preview tab: it will show all columns without any issue. The connection manager only takes care of the row delimiter, column delimiter, text qualifier, and the other common properties used to read from a flat file (similar to the TextFieldParser class in VB.NET).
The only case where data types may cause an exception is when you are using a Flat File source, because the Flat File source creates External Columns with the metadata defined in the Flat File connection manager and links them to the original columns (you can see this when you open the Advanced Editor of the Flat File source). When SSIS tries to read from the flat file source, the External Columns will throw the exception.
Binary output
You should convert the column into binary within the package and map it to the destination column. As an example, you can use a Script Component to do that:
public override void myInput_ProcessInputRow(myInputBuffer Row)
{
    // Encode the incoming string column as UTF-8 and assign the
    // resulting byte array to the DT_BYTES output column.
    Row.ByteValues = System.Text.Encoding.UTF8.GetBytes(Row.name);
}
I haven't tested whether this will work with a Derived Column or a Data Conversion transformation.
References
Converting Input to (DT_BYTES,20)
DT Bytes in SSIS
After re-reading the question to make sure it matched my proof-edits, I realized that it doesn't appear that I answered your question - sorry about that. I have left the first answer in case it is helpful.
SSIS does not appear to enforce destination metadata; however, it will enforce source metadata. I created a test file with values ranging from -127 to 400. I tested this with the following scenarios:
Test 1: Source and destination flat file connection managers with signed 1 byte data type.
Result 1: Failed
Test 2: Source is 4 byte signed and destination is 1 byte signed.
Result 2: Pass
SSIS's pipeline metadata validation only cares about the input metadata matching the width of the pipeline; it appears not to care what the output is. It does, though, offer you the ability to set the destination metadata to whatever the downstream consumer expects, so that it can check and provide a warning if the destination's (i.e., SQL Server's) metadata matches or not.
This was an unexpected result - I expected it to fail as you did. Intuitively, though, the fact that it did not fail makes sense: since we are writing to a CSV file, there is no real metadata to enforce. But if we hook this to a SQL Server destination and the metadata doesn't match, then SQL Server will frown upon the out-of-bounds data (see my other answer).
Now, I would still set the metadata of the output to match what is in the pipeline, as this has important consequences for distinguishing string versus numeric data types. If you set a datetime as an integer, there will be no text qualifier, which may cause an error in the next input process. Conversely, if you set an integer as a varchar, it would get a text qualifier.
I think the fact that destination metadata is not enforced is a bit of a weak link in SSIS. But it can be negated by just setting it to match the pipeline buffer, which is done automatically assuming it is the last task dropped onto the design surface. That said, if you update the metadata in the pipeline after development is complete, you are in for a real treat: some tasks have to be opened and closed, while others have to be deleted and re-created, in order to propagate the new metadata through the entire pipeline.
Additional Information
TL;DR: TinyInt is stored as an unsigned data type in SQL Server, which means it supports values between 0 and 255. So a value greater than 127 is acceptable - up to 255. Anything over that will result in an error.
The byte size indicates the maximum number of possible combinations where the signed/unsigned indicates whether or not the range is split between positive and negative values.
1 byte = TinyInt in SQL Server
1 byte is 8 bits = 256 combinations
Signed Range: -128 to 127
Unsigned Range: 0 to 255
It is important to note that SQL Server does not support signing the data types directly. What I mean here is that there is no way to set the integer data types (i.e., TinyInt, Int, and BigInt) as signed or unsigned.
TinyInt is unsigned
Int and BigInt are signed
See reference below: Max Size of SQL Server Auto-Identity Field
If we attempt to set a TinyInt to any value that is outside of the unsigned range (e.g., -1 or 256), then we get an arithmetic overflow error (screenshot omitted; the Int and BigInt error messages are analogous at their own bounds). This is why you were able to set a value greater than 127.
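A quick way to see this for yourself (a sketch using a hypothetical temp table):

CREATE TABLE #t (v TINYINT);
INSERT INTO #t VALUES (255);  -- succeeds: top of the unsigned range
INSERT INTO #t VALUES (256);  -- fails with an arithmetic overflow error
INSERT INTO #t VALUES (-1);   -- fails with an arithmetic overflow error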
With respect to Identity columns, if we declare an Identity column as Int (i.e., 32 bit ~= 4.3 billion combinations) and set the seed to 0 with an increment of 1, then SQL Server will only go to 2,147,483,647 rows before it stops, which is the maximum signed value. But, we are short by half the range. If we set the seed to -2,147,483,648 (don't forget to include 0 in the range) then SQL Server will increment through the full range of combinations before stopping.
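For example (hypothetical table), seeding the identity at the minimum signed value exposes the full range:

CREATE TABLE dbo.FullRange (
    id INT IDENTITY(-2147483648, 1) PRIMARY KEY,  -- start at the signed minimum
    payload VARCHAR(50)
);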
References:
SSIS Data Types and Limitations
Max Size of SQL Server Auto-Identity Field
I am making a table of users where I will store all their info: username, password, etc. My question is: Is it better to store usernames in VARCHAR with a utf-8 encoded table or in CHAR. I am asking because char is only 1 byte and utf-8 encodes up to 3 bytes for some characters and I do not know whether I might lose data. Is it even possible to use CHAR in that case or do I have to use VARCHAR?
In general, the rule is to use CHAR under the following circumstances:
You have short codes that are the same length (think state abbreviations).
Sometimes when you have short codes that might differ in length, but you can count the characters on one hand.
When powers-that-be say you have to use CHAR.
When you want to demonstrate how padding with spaces at the end of the string causes unexpected behavior (see the sketch below).
In other cases, use VARCHAR(). In practice, users of the database don't expect a bunch of spaces at the end of strings.
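To see the padding point in practice, here is a MySQL sketch (table and column names are made up). By default, CHAR pads on storage and silently strips trailing spaces on retrieval, while VARCHAR keeps them:

CREATE TABLE u (c CHAR(10), v VARCHAR(10));
INSERT INTO u VALUES ('bob  ', 'bob  ');  -- both values end in two spaces
SELECT LENGTH(c), LENGTH(v) FROM u;       -- 3 and 5: CHAR dropped the trailing spaces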
I have a project that imports a TSV file with a field set as text stream (DT_TEXT).
When I have invalid rows that get redirected, the DT_TEXT fields from my invalid rows get appended to the first following valid row.
Here's my test data:
Tab-delimited input file: ("tsv IN")
CatID Descrip
y "desc1"
z "desc2"
3 "desc3"
CatID is set as an integer (DT_I8)
Descrip is set as a text stream (DT_TEXT)
Here's my basic Data Flow Task:
(I apologize, I can't post images until my rep is above 10 :-/ )
So my 2 invalid rows get redirected, and my 3rd row goes to success,
But here is my "Success" output:
"CatID","Descrip"
"3","desc1desc2desc3"
Is this a bug when using DT_TEXT fields? I am fairly new to SSIS, so maybe I misunderstand the use of text streams. I chose to use DT_TEXT as I was having truncation issues with DT_STR.
If it's helpful, my tsv fail output is below:
Flat File Source Error Output Column,ErrorCode,ErrorColumn
x "desc1"
,-1071607676,10
y "desc2"
,-1071607676,10
Thanks in advance.
You should really try and avoid using the DT_TEXT, DT_NTEXT or DT_IMAGE data types within SSIS fields as they can severely impact dataflow performance. The problem is that these types come through not as a CLOB (Character Large OBject), but as a BLOB (Binary Large OBject).
For reference see:
CLOB: http://en.wikipedia.org/wiki/Character_large_object
BLOB: http://en.wikipedia.org/wiki/BLOB
Difference: Help me understand the difference between CLOBs and BLOBs in Oracle
Using DT_TEXT you cannot just pull out the characters as you would from a large array. This type is represented as an array of bytes and can store any type of data, which in your case is not needed and is creating problems concatenating your fields. (I recreated the problem in my environment)
My suggestion would be to stick to the DT_STR for your description, giving it a large OutputColumnWidth. Make it large enough so no truncation will occur when reading from your source file and test it out.
Here's the type of text my encryption function throws out:
I generated several strings and they're never bigger than 50 characters, but I would like to give it 75 characters in mysql. I tried using varchar, but the string gets cut off because it doesn't like some characters. Any idea what data type I should use?
If it's binary data (it looks like it is), you probably should store it in a BLOB.
You can use a blob, but for short data, that will make your selects slow.
Use binary(75) or varbinary(75).
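A minimal sketch (the 75-byte width comes from the question; table and column names are made up):

CREATE TABLE tokens (
    id INT AUTO_INCREMENT PRIMARY KEY,
    cipher VARBINARY(75)  -- raw bytes, no character-set interpretation
);
INSERT INTO tokens (cipher) VALUES (UNHEX('DEADBEEF'));  -- store the binary directly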