I am using SQL Server Integration Services (SSIS) to perform an incremental data load, comparing a hash of the to-be-imported row data with a hash of the existing row data. I am using this:
http://ssismhash.codeplex.com/
to create the SHA512 hash for comparison. When trying to compare data import hash and existing hash from database using a Conditional Split task (expression is NEW_HASH == OLD_HASH) I get the following error upon entering the expression:
The data type "DT_BYTES" cannot be used with binary operator "==". The type of one or both of the operands is not supported for the operation. To perform this operation, one or both operands need to be explicitly cast with a cast operator.
Attempts at casting each column to a string (DT_WSTR, 64) before comparison have resulted in a truncation error.
Is there a better way to do this, or am I missing some small detail?
Thanks
Have you tried expanding the length beyond 64? I believe DT_BYTES is valid up to 8000 bytes. I verified that the following are legal cast destinations for DT_BYTES, based on the Books Online article:
DT_I4
DT_UI4
DT_I8
DT_UI8
DT_STR
DT_WSTR
DT_GUID
DT_IMAGE
I also ran a test in BIDS and verified it had no problem comparing the values once I cast them to a sufficiently long data type.
SHA512 is a bit much as your chances of actually colliding are 1 in 2^256. SHA512 always outputs 512 bits which is 64 bytes. I have a similar situation where I check the hash of an incoming binary file. I use a Lookup Transformation instead of a Conditional Split.
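As a rough illustration of the sizes involved (a sketch in Python, not SSIS; the input value is hypothetical, and it assumes the bytes end up rendered as hexadecimal text when converted to a string), a 64-byte digest needs 128 characters, which is why a 64-character string column is too short:

```python
import hashlib

row = b"example row data"  # hypothetical row content

digest = hashlib.sha512(row).digest()  # raw binary hash
hex_text = digest.hex()                # two hex characters per byte

digest_len = len(digest)      # size of the binary value in bytes
hex_len = len(hex_text)       # size of the hex string in characters
fits_in_64 = hex_len <= 64    # would a 64-character column hold it?

print(digest_len, hex_len, fits_in_64)
```

Under that assumption, a cast target of at least 128 characters would be needed for the string comparison.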
This post is older but in order to help other users...
The answer is that in SSIS you cannot compare binary data using the == operator.
What I've seen is that people will most often convert (and store) the hashed value as varchar or nvarchar which can be compared in SSIS.
I believe the other users have answered your issue with "truncation" correctly.
I'm searching for cases in MySQL/MariaDB where the value transmitted when storing will differ from the value that can be retrieved later on. I'm only interested in fields with non-binary string data types like VARCHAR and *TEXT.
I'd like to get a more comprehensive understanding on how much a stored value can be trusted. This would especially be interesting for cases where the output just lacks certain characters (like with the escape character example below) as this is specifically dangerous when validating.
So, this boils down to: Can you create an input string (and/or define an environment) where this doesn't output <value> in the second statement?
INSERT INTO t SET v = <value>, id = 1; // success
SELECT v FROM t WHERE id = 1;
Things I can think of:
strings containing escaping (\a → a)
truncated if too long
character encoding of the table not supporting the input
Whether something fails silently probably also depends on how strict the SQL mode is (as with the last two examples).
Thanks a lot in advance for your input!
You can trust that databases do what the standards prescribe. With strings and integers it is simple, because the database saves the binary representation of that number or character in your chosen character set.
Decimal, double and single values are different: floating-point types cannot store many decimal fractions exactly, and DECIMAL rounds to its declared scale, so the value you read back may differ in its fractional part; see decimal representation.
That also follows standards, but you have to account for it.
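The floating-point part of this can be seen outside the database too. The following is a small Python sketch (not MySQL) of the same binary-representation effect, with an exact decimal type shown for contrast:

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact binary representation.
total = 0.1 + 0.2
float_matches = (total == 0.3)

# Exact decimal arithmetic on the same literals.
exact = Decimal("0.1") + Decimal("0.2")
decimal_matches = (exact == Decimal("0.3"))

# The value actually held by the double nearest to 0.1.
stored = Decimal(0.1)

print(float_matches, decimal_matches, stored)
```

The same reasoning applies to FLOAT and DOUBLE columns: what is read back is the nearest representable value, not necessarily the literal that was sent.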
A source database field of type INT is read through an OLE DB Source. It is eventually written to a Flat File Destination. The destination Flat File Connection Manager > Advanced page reports it as a four-byte signed integer [DT_I4].
This data type made me think it indicated binary. Clearly, it does not. I was surprised that it was not the more generic numeric [DT_NUMERIC].
I changed this type setting to single-byte signed integer [DT_I1]. I expected this to fail, but it did not. The process produced the same result, even though the value of the field was always > 127. Why did this not fail?
Some of the values that are produced are
1679576722
1588667638
1588667638
1497758544
1306849450
1215930367
1215930367
1023011178
1932102084
Clearly, outside the range of a single-byte signed integer [DT_I1].
As a related question, is it possible to output binary data to a flat file? If so, what settings and where should be used?
Data types validation
I think this issue is related to the connection manager that is used, since the data type validation (outside the pipeline) is not done by Integration services, it is done by the service provider:
OLEDB for Excel and Access
SQL Database Engine for SQL Server
...
The flat file connection manager doesn't guarantee any data type consistency, since all values are stored as text. As an example, try adding a flat file connection manager and selecting a text file that contains names, then change the column data types to Date and go to the Columns preview tab: it will show all columns without any issue. It only takes care of the row delimiter, column delimiter, text qualifier and the common properties used to read from a flat file (similar to the TextFieldParser class in VB.NET).
The only case where data types may cause an exception is when you are using a Flat File Source, because the Flat File Source creates external columns with the metadata defined in the flat file connection manager and links them to the original columns (you can see this when you open the Advanced Editor of the Flat File Source). When SSIS tries to read from the Flat File Source, the external columns will throw the exception.
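To illustrate the distinction (an analogy in Python, not SSIS internals), writing a number as text only needs its digits, whereas actually packing it into a single signed byte is where a range check would fail:

```python
import struct

value = 1679576722  # one of the values from the question

# Text output: the declared width is irrelevant, only the digits are written.
as_text = str(value)

# Binary output: one signed byte ("b") can only hold -128..127.
try:
    struct.pack("b", value)
    packed_ok = True
except struct.error:
    packed_ok = False

print(as_text, packed_ok)
```

This is consistent with the flat file destination accepting the DT_I1 setting without complaint, since nothing is ever packed into one byte.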
Binary output
You should convert the column into binary within the package and map it to the destination column. As example you can use a script component to do that:
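public override void myInput_ProcessInputRow(myInputBuffer Row)
{
    // ByteValues is assumed to be a binary output column added to the component;
    // encode the string column as UTF-8 bytes.
    Row.ByteValues = System.Text.Encoding.UTF8.GetBytes(Row.name);
}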
I haven't tried whether this will work with a Derived Column or Data Conversion transformation.
References
Converting Input to (DT_BYTES,20)
DT Bytes in SSIS
After re-reading the question to make sure it matched my proof-edits, I realized that it doesn't appear that I answered your question - sorry about that. I have left the first answer in case it is helpful.
SSIS does not appear to enforce destination metadata; however, it will enforce source metadata. I created a test file with values ranging from -127 to 400 and tested it with the following scenarios:
Test 1: Source and destination flat file connection managers with signed 1 byte data type.
Result 1: Failed
Test 2: Source is 4 byte signed and destination is 1 byte signed.
Result 2: Pass
SSIS's pipeline metadata validation only cares that the metadata of the input matches the width of the pipeline. It appears not to care what the output is. It does, however, let you set the destination metadata to whatever the downstream system expects, so that it can check and warn you when the destination's (i.e., SQL Server's) metadata does not match.
This was an unexpected result - I expected it to fail as you did. Intuitively, the fact that it did not fail still makes sense. Since we are writing to a CSV file, then there is no way to control what the required metadata is. But, if we hook this to a SQL Server destination and the metadata doesn't match, then SQL Server will frown upon the out of bounds data (see my other answer).
Now, I would still set the metadata of the output to match what it is in the pipeline, as this matters for distinguishing string from numeric data types. So, if you set a datetime as an integer then there will be no text qualifier, which may cause an error in the next input process. Conversely, you could have the same problem by setting an integer to a varchar, which means it would get a text qualifier.
I think the fact that destination metadata is not enforced is a bit of a weak link in SSIS. But, it can be negated by just setting it to match the pipeline buffer, which is done automatically assuming it is the last task that is dropped to the design. With that being said, if you update the metadata on the pipeline after development is complete then you are in for a real treat with getting the metadata updated throughout the entire pipeline because some tasks have to be opened and closed while others have to be deleted and re-created in order to update the metadata.
Additional Information
TL;DR: TinyInt is stored as an unsigned data type in SQL Server, which means it supports values between 0 and 255. So a value greater than 127 is acceptable, up to 255. Anything over that will result in an error.
The byte size indicates the maximum number of possible combinations where the signed/unsigned indicates whether or not the range is split between positive and negative values.
1 byte = TinyInt in SQL Server
1 byte is 8 bits = 256 combinations
Signed Range: -128 to 127
Unsigned Range: 0 to 255
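The ranges above follow directly from the bit width. A minimal Python sketch (the helper name int_range is made up for illustration):

```python
def int_range(bits, signed):
    """Return (min, max) for an integer of the given width."""
    if signed:
        # One bit pattern in two is negative: -2^(n-1) .. 2^(n-1) - 1
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # All patterns are non-negative: 0 .. 2^n - 1
    return 0, 2 ** bits - 1

print(int_range(8, signed=True))    # one byte, signed
print(int_range(8, signed=False))   # one byte, unsigned
print(int_range(32, signed=True))   # four bytes, signed
```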
It is important to note that SQL Server does not support signing the data types directly. What I mean here is that there is no way to set the integer data types (i.e., TinyInt, Int, and BigInt) as signed or unsigned.
TinyInt is unsigned
Int and BigInt are signed
See reference below: Max Size of SQL Server Auto-Identity Field
If we attempt to set a TinyInt to any value that is outside of the Unsigned Range (e.g., -1 or 256), then we get the following error message:
This is why you were able to set a value greater than 127.
Int Error Message:
BigInt Error Message:
With respect to Identity columns, if we declare an Identity column as Int (i.e., 32 bit ~= 4.3 billion combinations) and set the seed to 0 with an increment of 1, then SQL Server will only count up to 2,147,483,647, the maximum signed value, before it stops. That leaves us short by half the range. If we set the seed to -2,147,483,648 (don't forget to include 0 in the range) then SQL Server will increment through the full range of combinations before stopping.
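The arithmetic behind "short by half the range" can be checked with a small Python sketch (identity_capacity is a hypothetical helper, not a SQL Server function):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # signed 32-bit bounds

def identity_capacity(seed, increment=1, max_value=INT_MAX):
    """Count the values available from seed up to max_value inclusive."""
    if seed > max_value:
        return 0
    return (max_value - seed) // increment + 1

from_zero = identity_capacity(0)        # values 0 .. INT_MAX
from_min = identity_capacity(INT_MIN)   # values INT_MIN .. INT_MAX

print(from_zero, from_min, from_min // from_zero)
```

Seeding at the minimum value doubles the number of identity values available before overflow.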
References:
SSIS Data Types and Limitations
Max Size of SQL Server Auto-Identity Field
If I execute a query against the MySQL Connector/C library the data I'm getting back all appears to be in straight char * format, including numerical data types.
For example, if I execute a query that returns 4 columns, all of which are INTEGER in MySQL, rather than getting back 4 bytes worth of data (each byte representing a single column row value), I'm actually getting back 4 ASCII encoded character bytes, where 1 is actually a byte with the numeric value 49 in it (ASCII for 1).
Is this accurate or am I just missing something completely?
Do I really need to then atoi that returned byte into an int in my code or is there a mechanism to get the native C data types out of the MySQL client directly?
I guess my real question is: is the mysql_store_result structure converting that data to ASCII encoded representations in a way that can be bypassed by my application code?
I believe the data is sent on the wire as text in the MySQL protocol (I just confirmed this with Wireshark). So that means mysql_store_result() is not converting the data, it's just simply passing the data on as it was received. MySQL actually sends integers as text. I agree this always seemed like an odd design to me as well.
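To make the text encoding concrete, here is a small Python sketch (not the Connector/C API; the field value is a made-up example) of what a text-encoded integer looks like as bytes and how the client has to convert it:

```python
field = b"1"  # hypothetical column value as received in text form

first_byte = field[0]                 # numeric value of the ASCII digit
value = int(field.decode("ascii"))    # the conversion atoi would do in C

print(first_byte, value)
```

In C the equivalent step is the atoi (or strtol) call mentioned in the question.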
MySQL originally only offered the Text Protocol that you are currently using, in which (as you note) results are encoded as strings. MySQL v4.1 (released in April 2003) introduced the Prepared Statement protocol, which (amongst other things) transmits results in a binary format.
See C API Prepared Statements for more information on how to use the latter protocol with Connector/C.
In mysql, if I do something like
round((amount * '0.75'),2)
it seems to work just as it does without the single quotes around 0.75. Is there a difference in how MySQL processes this?
In the hope to close out this question, here's a link that explains type conversion in expression evaluation: https://dev.mysql.com/doc/refman/5.5/en/type-conversion.html
When an operator is used with operands of different types, type
conversion occurs to make the operands compatible. Some conversions
occur implicitly. For example, MySQL automatically converts numbers to
strings as necessary, and vice versa.
mysql> SELECT 1+'1';
-> 2
In your case, MySQL sees arithmetic and performs implicit conversion on any string contained in the expression. There is going to be some overhead in converting a string to a number, but it's negligible. My preference is to explicitly type out a number instead of quoting it. That approach has helped me with code clarity and maintainability.
I have a column of type char(32) where I want to store an MD5 hash key. The problem is that I've used SQL to update the existing records using the HashBytes() function, which creates values like
:›=k! ©úw"5Ýâ‘<\
but when I do the insert via .NET it comes through as
3A9B3D6B2120A9FA772235DDE2913C5C
What do I need to do to get these to match up? Is it the encoding?
HashKey isn't a SQL function, did you mean HASHBYTES? Some actual code would help. SQL appears to be computing the raw binary hash and displaying it as ASCII characters.
.NET is computing the hash, then converting it to hexadecimal (or so it appears). CHAR(32) isn't a good way to store raw binary data, you would want to use the BINARY type.
An Example in SQL:
SELECT SUBSTRING(sys.fn_varbintohexstr(HASHBYTES('MD5',0x2040)),3, 32)
And an Example in .NET:
using (MD5 md5 = MD5.Create())
{
var data = new byte[] { 0x20, 0x40 };
var hashed = md5.ComputeHash(data);
var hexHash = BitConverter.ToString(hashed).Replace("-", "");
Console.Out.WriteLine("hexHash = {0}", hexHash);
}
These will both produce the same value. (Where 0x2040 is sample data).
You can either store the hexadecimal data as CHAR(32), or as BINARY(16). Storing the binary data is twice as space efficient as storing it as hex. What you should not be doing is storing the binary data as CHAR(16).
It's not clear what you mean by "when I do the insert via .NET" - but you shouldn't be storing binary data just in a raw form, as it looks like you're doing using HashKey(). (Do you definitely mean HashKey by the way? I can't find a reference for it, but there's HashBytes...)
Two common options are to encode the raw binary data as hex - which it looks like you're doing in the second case - or to use base64. Either way should be easy from .NET (Base64 marginally easier, using Convert.ToBase64String) and you probably just need to find the equivalent SQL Server function.
MD5 is typically stored in hex encoding. I'd guess that your hashkey() SQL function is not hex encoding the MD5 hash; rather, it's just returning the ASCII characters representing the hash. But your .NET method is hex encoding. If you store your MD5 hashes consistently as hex (or not - up to you, but they are usually stored as hex), then the results between the two should always be consistent.
For example, the : symbol from your SQL hash is the first character returned from HashKey(). In the .NET method, the first 2 characters are 3A. 3A in hex is 58 in decimal. ASCII code 58 is the colon (:) character. Similarly, you can work your way through each of the other characters and do the hex conversion.
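The same correspondence can be checked mechanically. The following Python sketch decodes the hex string from the question back into raw bytes; the Windows-1252 code page used to display them is an assumption about how the client rendered the CHAR data:

```python
hex_hash = "3A9B3D6B2120A9FA772235DDE2913C5C"  # value from the question

raw = bytes.fromhex(hex_hash)      # the 16 raw MD5 bytes
first_byte = raw[0]                # 0x3A
as_text = raw.decode("cp1252")     # assumed display code page

print(len(raw), first_byte, chr(first_byte))
print(as_text)
```

Both representations carry the same 16 bytes; only the encoding differs, which is why the hex form needs 32 characters.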
See any ASCII codes table for reference, i.e. http://www.asciitable.com/