SSIS Lookup transformation lookup key not working properly

I have a lookup transformation that looks up field ContainerSrcID_Barcode DT_WSTR(20) from table A against ContainerSrcID DT_WSTR(40) from table B, and rows are sent to an error table if they do not match. I've noticed that values that exist in both of those key fields are still being kicked out to the error table. The only difference I can see between the two keys is their declared length. Is there another reason why values that are present in both keys are considered not matched?

The length of the fields doesn't matter. If the lookup is not finding matches, then the data doesn't actually match. You probably have whitespace or special non-ASCII characters (carriage returns, etc.) in one of the fields and not the other.
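A quick way to check for that on the database side is to compare trimmed values and character vs. byte lengths (this is only a diagnostic sketch; TableA and TableB stand in for your actual source and reference tables):

SELECT a.ContainerSrcID_Barcode,
       LEN(a.ContainerSrcID_Barcode)        AS char_len_no_trailing_spaces,
       DATALENGTH(a.ContainerSrcID_Barcode) AS byte_len    -- nvarchar: 2 bytes per character, trailing spaces included
FROM   TableA AS a
LEFT JOIN TableB AS b
       ON LTRIM(RTRIM(a.ContainerSrcID_Barcode)) = LTRIM(RTRIM(b.ContainerSrcID))
WHERE  b.ContainerSrcID IS NULL;    -- rows that still fail to match even after trimming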

Related

MySQL using MATCH AGAINST for long unique values (8.0.27)

I have a situation where we're storing long unique IDs (up to 200 characters) as single TEXT entries in our database. We're using a FULLTEXT index for speed, and it works great for the smaller GUID-style entries. The problem is it won't work for entries longer than 84 characters because of the limit on innodb_ft_max_token_size, which apparently cannot be set above 84. This means any entries of more than 84 characters are omitted from the index.
Sample Entries (actual data from different sources I need to match):
AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==
AND
<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=#t2.msgid.quoramail.com>
AND
["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]
So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?
What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?
Here's a summary of the discussion in comments:
The IDs have multiple formats: either a single token of variable length up to 200 characters, or an "array", i.e. a JSON-formatted document containing multiple tokens. These entries come from different sources, and the format is outside of your control.
The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters, so it cannot be used to search for longer tokens.
You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support cases of JSON arrays of multiple tokens. You can't use a B-tree index to search for words in the middle of a string. Nor can you use an index with the LIKE predicate to match a substring using a wildcard in the front of the pattern.
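As an illustration of that limitation (table and column names here are only placeholders):

ALTER TABLE docs
  ADD COLUMN id_text VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin,
  ADD INDEX idx_id_text (id_text);

-- Can use the B-tree index (exact match or left-anchored prefix):
SELECT * FROM docs WHERE id_text = 'ca97826d-3bea-4986-b112-782ab312aq23';
SELECT * FROM docs WHERE id_text LIKE 'ca97826d%';

-- Cannot use it (wildcard at the front, e.g. a token buried inside a JSON array string):
SELECT * FROM docs WHERE id_text LIKE '%782ab312aq23%';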
Therefore to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split it into individual tokens and store each one on a row by itself. This means writing some code to transform the content you receive as IDs before inserting it into the database.
MySQL 8.0.17 supports a new kind of index on a JSON array, called a Multi-Value Index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index. But this also would require writing some code to transform the singular form of id's into a JSON array.
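A minimal sketch of that multi-valued index approach, assuming all IDs are normalized into a JSON array column (table and column names are illustrative):

CREATE TABLE id_store (
  pk     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  tokens JSON NOT NULL,
  INDEX idx_tokens ((CAST(tokens AS CHAR(200) ARRAY)))
);

-- Singular IDs get wrapped as one-element arrays on insert:
INSERT INTO id_store (tokens)
VALUES ('["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7"]');

-- A lookup that can use the multi-valued index:
SELECT * FROM id_store
WHERE  'ca97826d-3bea-4986-b112-782ab312aaf7' MEMBER OF (tokens);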
The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.
Create a new table with two columns: the token, as VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin (Base64 needs case sensitivity), plus the id of the corresponding row in your main table.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in each incoming value and add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also INSERT new rows into this table whenever new data arrives.
Now a simple B-tree lookup plus a join will handle all your lookups.
TEXT does not work with ordinary indexes, but VARCHAR up to some limit does. 200 characters of ascii is only 200 bytes, well below the 3072-byte limit.
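Put together, that side table and lookup might look roughly like this (assuming the main table has an integer id; all names are illustrative):

CREATE TABLE id_token (
  token   VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  main_id BIGINT UNSIGNED NOT NULL,        -- points back at the row in the main table
  PRIMARY KEY (token),
  INDEX (main_id)
) ENGINE=InnoDB;

-- One row per token; a JSON array from the source becomes several rows here.
-- The lookup is then a plain B-tree point query plus a join:
SELECT m.*
FROM   id_token t
JOIN   main_table m ON m.id = t.main_id
WHERE  t.token = 'ca97826d-3bea-4986-b112-782ab312aq23';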

Why would I use ID in MySQL when I can search with the username?

In many tutorials about MySQL, an ID is created automatically when a user makes an account. Later on, the ID is used to look up or update that profile.
Question: Why would I use ID in MySQL when I can search with the username?
I can use the username to search in a MySQL table too, so what are the pros and cons when using an ID?
UPDATE:
Many thanks for your reactions!
So let's say a user wants to log in on a website. He will provide a username and password. But in my code I first have to run a query to find the ID, because the user doesn't know the ID. Is this correct, or is there another way to do it?
What if I store the user's ID in a cookie, and when the user logs in I first check whether that ID belongs to the given username, and then check whether the password is correct? After that I can use the ID for queries. Is that a good idea? Of course I will use prepared statements for all of this.
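For example, is a single query like this (table and column names are just an example) the usual approach, so the ID comes back as part of the login check?

SELECT id, password_hash
FROM   users
WHERE  username = ?;
-- verify the password hash in application code, then keep the returned id for later queries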
Please refer to this post.
1 - It's faster. A JOIN on an integer is much quicker than a JOIN on a string field or combination of fields. It's more efficient to compare integers than strings.
2 - It's simpler. It's much easier to map relations based on a single numeric field than on a combination of other fields of varying data types.
3 - It's data-independent. If you match on the ID you don't need to worry about the relation changing. If you match on a name, what do you do if their name changes (i.e. marriage)? If you match on an address, what if someone moves?
4 - It's more efficient. If you cluster on an (auto-incrementing) int field, you reduce fragmentation and reduce the overall size of the data set. This also simplifies the indexes needed to cover your relations.
From "an ID which is made automatically" I assume you are talking about an integer column having the attribute AUTO_INCREMENT.
Several reasons a numeric auto-incremented PK is better than a string PK:
A value of type INT is stored in 4 bytes; a string uses 1 to 4 bytes per character, depending on the charset and the character (plus 1 or 2 extra bytes that store the actual string length for VARCHAR types). Unless your string column contains only 2-3 ASCII characters, an INT always takes less space than a string; this affects the next two entries in this list.
The primary key is an index, and any index is used to speed up the search for rows in the table. The search is done by comparing the searched value with the values stored in the index. Comparing integral numeric values (INT vs. INT) requires a single CPU operation and works very fast. Comparing string values is harder: the corresponding characters from the two strings are compared taking into account the characteristics of their encoding, collation, upper/lower case, etc.; usually more than one pair of characters needs to be compared. This takes many CPU operations and is much slower than comparing INTs.
The InnoDB storage engine keeps a copy of the PK value in every secondary index of the table. If the table has no primary key (and no suitable unique index), InnoDB internally creates a hidden auto-incremented row ID and uses that instead. This means you don't waste any database space by adding an explicit ID column (or, put the other way around, you don't save anything by not adding it).
Why does InnoDB work this way? Read the previous item again.
The PK of a table usually migrates as a FK into related tables. This means the value of the PK column of each row in the first table is duplicated into the FK field of the related table (think of the classic example of an employee who works in a department: the department_id column of department is duplicated into the employee table). Here the column type affects both the space used and the speed (the FK is typically used in JOIN, WHERE and GROUP BY clauses).
Here is one reason out of many.
If the username really is the primary key of your relation, using a surrogate key (ID) is at the very least a space optimization. During normalization your relation may be split into several tables. Replacing the username (VARCHAR(30)) with an ID (INT) as the foreign key in the related tables can save a lot of space.
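A rough sketch of that point (the schema is illustrative only): the 4-byte id, not the VARCHAR(30) username, is what gets repeated in every related table.

CREATE TABLE users (
  id       INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  username VARCHAR(30) NOT NULL,
  UNIQUE KEY uq_username (username)      -- the username stays unique, but is stored only once
);

CREATE TABLE posts (
  id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  user_id INT UNSIGNED NOT NULL,         -- 4 bytes per row instead of up to ~31 bytes
  body    TEXT,
  FOREIGN KEY (user_id) REFERENCES users (id)
);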

string mismatch error while joining two files

I am joining two files. One file is an extraction from a table (in0 port) with a record format like this: utf8 string("\x01", maximum_length=3).
The other file is a normal text file (in1 port) with a record format like this: ascii string(3).
While joining I am getting the error below:
Field "company" in key specifier for input in1 has type "ascii string(3)",
but field "kg3_company_cd" in key specifier for input in0 has type "utf8 string("\x01", maximum_length=3)".
This join may be attempted in spite of the type mismatch by
setting configuration variable AB_ALLOW_DANGEROUS_KEY_CASTING to true.
However, typically the input streams will have been hash-partitioned on
the join keys of different types, making it unlikely that all equal join.
The issue is that a utf8 string and an ascii string use different underlying bytes to represent the same value. The error message you're receiving is warning you that if your join runs in parallel, the hash-partitioning algorithm has likely sent matching key values from each flow to different partitions, because the underlying data representing the "equal" strings is different. For example: if both flows have 3 records each where the key field values are ("A", "AB", "ABC"), key "AB" may be on partition 0 for one flow but partition 7 for the other. Your join component runs one instance per partition and expects the data to be partitioned correctly, so the instance for partition 0 will see key "AB" on one flow but not the other. If it's an inner join, the output will contain only the matching key records that were coincidentally sent to the same partition.
You should pick which string encoding you want and ensure both flows have matching encoding before the join. Just add a Reformat prior to the join to convert one side.

What does the number after the field name mean in this mysql index?

We're looking at the indexes of a few tables in MySQL and noticed something. There are several indexes that contain the type field, but some of them have a number in parentheses after the field name, while others don't.
For example, the index for keyname node_status_type includes type in the list of fields, but there is no number. node_title_type, on the other hand, includes type (4). What does the (4) mean?
That means it is indexing only the first X characters (or bytes, for binary string types) of the field.
It is actually good practice to use column index prefixes in cases where the data will be unique (or distinguishable for your purposes) within the first X characters/bytes; it keeps index sizes down.
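For example, such an index could have been declared like this (the node table and column names are inferred from the index names in the question, so treat this purely as an illustration):

ALTER TABLE node ADD INDEX node_title_type (title, type(4));   -- indexes only the first 4 characters of type
ALTER TABLE node ADD INDEX node_status_type (status, type);    -- indexes the full type column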
Check the documentation for more information with regards to prefix index usage on different field data types.

Is there a way to get SQL Server to automatically do selects on hash values of nvarchar fields?

I'm not sure how to better phrase this question so it's possible I missed a previously asked question. Feel free to close this and point me to the correct one if it exists.
I have a table with two important columns (that is to say, it has many more, but only two are pertinent to this question). The first column is a GUID (an ID) and the second is an nvarchar (storing a URL). The combination of the ID and the URL has to be unique: the same GUID can appear in several rows and so can the same URL, but no two rows may have the same GUID and URL.
Currently, before every INSERT, I do a SELECT to see if a row with the same ID and URL already exists. However, it looks like lookups on the nvarchar column are slow. Therefore I'm thinking of adding an extra column that is filled in with the SHA1 hash of the URL upon insertion. Then we would only do a lookup on the smaller hash (varbinary?), which I assume will be significantly faster than before.
Is there a way to get SQL Server 2008 to automatically store the hash and do a lookup against that hash value instead of the actual text? I'm assuming the indexes are B-trees, so what I'm asking is for SQL Server to build the B-tree from the hash values of the text in the nvarchar field, and when a SELECT is run, it should calculate the hash and look that up in the tree. Is this possible?
If you do lookups on your (id, url) fields, do you have an index on those two columns? If not, add one and see if that speeds up your lookups enough.
If not: yes, you can definitely get this functionality automagically - the magic word is: computed column.
In SQL Server, you can have columns that compute their values automatically, based on a formula you provide. This can be either just a simple arithmetic formula, or you can call a stored function to compute the value.
In order to make this fast for your checks, you would have to make that computed column PERSISTED; then you can index it, too. This excludes larger-scale computations: the formula has to be clear, concise, and deterministic.
So, do this:
ALTER TABLE dbo.YourTable
ADD HashValue AS CAST(HASHBYTES('SHA1', CAST(ID AS VARCHAR(36)) + Url) AS VARBINARY(20)) PERSISTED
Now your table has a new HashValue column (call it whatever you like), and you can select that value and inspect it.
Next put an index on that new column
CREATE NONCLUSTERED INDEX IX_Hash_YourTable
ON dbo.YourTable(HashValue)
Now your lookup should be flying!
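For completeness, a lookup that can seek on that index might look like this (parameter names are illustrative):

SELECT 1
FROM   dbo.YourTable
WHERE  HashValue = CAST(HASHBYTES('SHA1', CAST(@ID AS VARCHAR(36)) + @Url) AS VARBINARY(20))
  AND  ID = @ID AND Url = @Url;   -- re-check the real columns to guard against hash collisions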
Could you just put a unique constraint on those two columns and perform the insert inside of a TRY/CATCH block?
It would save you the extra work of calculating the hash and the extra space of storing it.
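Something along these lines, assuming the Url column fits within SQL Server's 900-byte index key limit (the constraint name is illustrative):

ALTER TABLE dbo.YourTable
  ADD CONSTRAINT UQ_YourTable_Id_Url UNIQUE (ID, Url);
-- duplicate inserts then fail with error 2627/2601, which the TRY/CATCH block can handle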
You can have a trigger that calculates the hash on INSERT and UPDATE and fills it in if required.
In terms of stopping duplicate inserts, just add a unique index on those columns.
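A rough sketch of that trigger idea, assuming a plain UrlHash VARBINARY(20) column already exists on the table (all names are illustrative; recursive triggers are off by default, so the UPDATE inside the trigger won't re-fire it):

CREATE TRIGGER trg_YourTable_UrlHash
ON dbo.YourTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- fill in the hash for the rows just inserted or updated
    UPDATE t
    SET    UrlHash = HASHBYTES('SHA1', i.Url)
    FROM   dbo.YourTable AS t
    JOIN   inserted AS i ON t.ID = i.ID AND t.Url = i.Url;
END;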