string mismatch error while joining two files - ab-initio

I am joining two files. One file is an extraction from a table (in0 port) with a record format like this: utf8 string("\x01", maximum_length=3).
The other file is a normal text file (in1 port) with a record format like this: ascii string(3).
While joining, I am getting the error below:
Field "company" in key specifier for input in1 has type "ascii string(3)",
but field "kg3_company_cd" in key specifier for input in0 has type "utf8 string("\x01", maximum_length=3)".
This join may be attempted in spite of the type mismatch by
setting configuration variable AB_ALLOW_DANGEROUS_KEY_CASTING to true.
However, typically the input streams will have been hash-partitioned on
the join keys of different types, making it unlikely that all equal join.

The issue is that a utf8 string and an ascii string use different underlying byte representations for the same logical value. The error message is warning you that if your join runs in parallel, the hash-partitioning algorithm has likely sent matching key values from each flow to different partitions, because the bytes representing the "equal" strings differ. For example, if both flows have 3 records each with key values ("A", "AB", "ABC"), key "AB" may land on partition 0 for one flow but partition 7 for the other. Your join component runs one instance per partition and expects the data to be partitioned consistently, so the instance for partition 0 will see key "AB" on one flow but not the other. If it's an inner join, the output will contain only those matching-key records that happened to be sent to the same partition.
You should pick which string encoding you want and ensure both flows have matching encodings before the join. Just add a Reformat upstream of the join to convert one side.

Related

MySQL using MATCH AGAINST for long unique values (8.0.27)

I have a situation where we're storing long unique IDs (up to 200 characters) as single TEXT entries in our database. We're using a FULLTEXT index for speed, and it works great for the smaller GUID-style entries. The problem is it won't work for entries longer than 84 characters due to the limitation of innodb_ft_max_token_size, which apparently cannot be set above 84. This means any entries of more than 84 characters are omitted from the index.
Sample Entries (actual data from different sources I need to match):
AQMkADk22NgFmMTgzLTQ3MzEtNDYwYy1hZTgyLTBiZmU0Y2MBNDljMwBGAAADVJvMxLfANEeAePRRtVpkXQcAmNmJjI_T7kK7mrTinXmQXgAAAgENAAAAmNmJjI_T7kK7mrTinXmQXgABYpfCdwAAAA==
AND
<j938ir9r-XfrwkECA8Bxz6iqxVth-BumZCRIQ13On_inEoGIBnxva8BfxOoNNgzYofGuOHKOzldnceaSD0KLmkm9ET4hlomDnLu8PBktoi9-r-pLzKIWbV0eNadC3RIxX3ERwQABAgA=#t2.msgid.quoramail.com>
AND
["ca97826d-3bea-4986-b112-782ab312aq23","ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]
So what are my options here? Is there any way to get the unique strings of 160 (or so) characters working with a FULLTEXT index?
What's the most efficient Index I can use for large string values without spaces (up to 200 characters)?
Here's a summary of the discussion in comments:
The IDs have multiple formats: either a single token of variable length up to 200 characters, or even an "array," a JSON-formatted document with multiple tokens. These entries come from different sources, and the format is outside of your control.
The FULLTEXT index implementation in MySQL has a maximum token size of 84 characters, so it cannot be used to search for longer tokens.
You could use a conventional B-tree index (not FULLTEXT) to index longer strings, up to 3072 bytes in current versions of MySQL. But this would not support cases of JSON arrays of multiple tokens. You can't use a B-tree index to search for words in the middle of a string. Nor can you use an index with the LIKE predicate to match a substring using a wildcard in the front of the pattern.
Therefore to use a B-tree index, you must store one token per row. If you receive a JSON array, you would have to split this into individual tokens and store each one on a row by itself. This means writing some code to transform the content you receive as id's before inserting them into the database.
MySQL 8.0.17 supports a new kind of index on a JSON array, called a multi-valued index. If you could store all your tokens as a JSON array, even those that are received as single tokens, you could use this type of index (see the sketch after this summary). But this also would require writing some code to transform the singular form of IDs into a JSON array.
The bottom line is that there is no single solution for indexing the text if you must support any and all formats. You either have to suffer with non-optimized searches, or else you need to find a way to modify the data so you can index it.
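For the multi-valued-index option mentioned above, here is a minimal sketch, assuming MySQL 8.0.17+; the table and column names (entries, tokens) are illustrative, not from the original question:
-- Every ID, even a singular one, is stored as a JSON array.
CREATE TABLE entries (
  id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  tokens JSON NOT NULL,
  -- Multi-valued index over the elements of the JSON array.
  INDEX idx_tokens ( (CAST(tokens AS CHAR(200) ARRAY)) )
);
INSERT INTO entries (tokens)
VALUES ('["ca97826d-3bea-4986-b112-782ab312aq23"]'),
       ('["ca97826d-3bea-4986-b112-782ab312aaf7","ca97826d-3bea-4986-b112-782ab312a326"]');
-- MEMBER OF lets the optimizer use the multi-valued index for the lookup.
SELECT id FROM entries
WHERE 'ca97826d-3bea-4986-b112-782ab312aaf7' MEMBER OF (tokens);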
Create a new table with 2 columns: a VARCHAR(200) CHARSET ascii COLLATE ascii_bin column to hold the token (BASE64 needs case sensitivity), plus an id referencing the row in your main table.
That table may have multiple rows for one row in your main table.
Use some simple parsing to find the string (or strings) in your table to add them to this new table.
PRIMARY KEY(that-big-column)
Update your code to also do the INSERT of new rows for new data.
Now a simple B-tree lookup plus a join will handle all your lookups.
TEXT columns cannot be indexed in full (only a prefix), but VARCHAR up to some limit can be. 200 characters of ascii is only 200 bytes, well below the 3072-byte limit.
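A minimal sketch of that side table, assuming InnoDB and illustrative names (id_tokens, main_table, main_id):
CREATE TABLE id_tokens (
  token   VARCHAR(200) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  main_id BIGINT UNSIGNED NOT NULL,   -- points back to the row in your main table
  PRIMARY KEY (token),                -- B-tree index on the big string
  INDEX (main_id)
) ENGINE=InnoDB;
-- A point lookup is a plain B-tree search on the PRIMARY KEY, then a join back.
SELECT m.*
FROM id_tokens AS t
JOIN main_table AS m ON m.id = t.main_id
WHERE t.token = 'ca97826d-3bea-4986-b112-782ab312aq23';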

How are mysql data types processed at the moment of table creation?

I was wondering whether MySQL actually does something with a column's data type at the moment of table creation. I understand that MySQL needs it when inserting data, to know what is allowed in the column. But at the moment of table creation, does MySQL allocate different areas of memory or something like that? Or are data types only required at table creation for the sake of future INSERT statements?
The datatypes are stored for use with queries.
During an INSERT, the data for the row being inserted is laid out based on the datatypes. INT will use 4 bytes for a binary integer. VARCHAR(40) will be laid out as a length plus up to 40 characters for a string. DATE takes 3 bytes in a certain format. Etc.
Most datatypes go in (via INSERT) and come out (via SELECT) as strings. So the string '2020-12-31', when used in a DATE, is turned into the 3-byte internal format.
If you try to put the string '123xyz' into INT, it converts that string to an integer, and gets 123. (This example is usually considered wrong, but that's what is done.)
When you JOIN two tables, the datatypes of the columns you are joining on should be the same. If they are different datatypes, then one is converted to the other if possible.
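A small sketch illustrating the conversions described above (assuming MySQL; the exact behavior with '123xyz' depends on the SQL mode):
-- The data types recorded at CREATE TABLE time are applied on INSERT.
CREATE TABLE t (n INT, d DATE, s VARCHAR(40));
-- The DATE string is converted to the 3-byte internal format on INSERT.
INSERT INTO t VALUES (42, '2020-12-31', 'hello');
-- String-to-integer conversion keeps the leading digits and raises a warning;
-- strict SQL mode (the default in modern MySQL) rejects such an INSERT with an error.
SELECT CAST('123xyz' AS UNSIGNED);   -- returns 123 (with a warning)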

SSIS Lookup transformation lookup key not working properly

I have a Lookup transformation that looks up the field ContainerSrcID_Barcode DT_WSTR(20) from table A against ContainerSrcID DT_WSTR(40) from table B, and rows are redirected to an error table if they do not match. I've noticed that values that exist in both of those key fields are being kicked out to the error table. The only difference I see between the two keys is the length. Is there another reason why values that are present in both keys are considered not matched?
The length of the fields doesn't matter. If the lookup is not finding matches, then the data doesn't actually match. Probably you have whitespace characters or special non-ascii characters (like carriage returns, etc) in one of the fields and not the other.
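A hedged T-SQL check for that, using illustrative names (TableA, ContainerSrcID_Barcode); it flags rows with leading/trailing spaces or embedded control characters:
SELECT ContainerSrcID_Barcode,
       LEN(ContainerSrcID_Barcode)        AS char_len,   -- trailing spaces not counted
       DATALENGTH(ContainerSrcID_Barcode) AS byte_len    -- 2 bytes per character if NVARCHAR
FROM TableA
WHERE ContainerSrcID_Barcode <> LTRIM(RTRIM(ContainerSrcID_Barcode))
   OR ContainerSrcID_Barcode LIKE '%' + CHAR(13) + '%'   -- carriage return
   OR ContainerSrcID_Barcode LIKE '%' + CHAR(10) + '%'   -- line feed
   OR ContainerSrcID_Barcode LIKE '%' + CHAR(9)  + '%';  -- tab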

MySQL CAST() causes significant performance hit

So I ran the following in the MySQL console as a control test to see what was holding back the speed of my query.
SELECT bbva_deductions.ded_code, SUBSTRING_INDEX(bbva_deductions.employee_id, '-' , -1) AS tt_emplid,
bbva_job.paygroup, bbva_job.file_nbr, bbva_deductions.ded_amount
FROM bbva_deductions
LEFT JOIN bbva_job
ON CAST(SUBSTRING_INDEX(bbva_deductions.employee_id, '-' , -1) AS UNSIGNED) = bbva_job.emplid LIMIT 500
It consistently took around 4 seconds to run (which seems very high for only 500 rows). Simply removing the CAST within the JOIN decreased that to just 0.01 seconds.
In this context, why on earth is CAST so slow?
(The question included screenshots of the EXPLAIN output for this query, the same query without the CAST, and EXPLAIN EXTENDED.)
As documented under How MySQL Uses Indexes:
MySQL uses indexes for these operations:
[ deletia ]
To retrieve rows from other tables when performing joins. MySQL can use indexes on columns more efficiently if they are declared as the same type and size. In this context, VARCHAR and CHAR are considered the same if they are declared as the same size. For example, VARCHAR(10) and CHAR(10) are the same size, but VARCHAR(10) and CHAR(15) are not.
Comparison of dissimilar columns may prevent use of indexes if values cannot be compared directly without conversion. Suppose that a numeric column is compared to a string column. For a given value such as 1 in the numeric column, it might compare equal to any number of values in the string column such as '1', ' 1', '00001', or '01.e1'. This rules out use of any indexes for the string column.
In your case, you are attempting to join on a comparison between a substring (of a string column in one table) and a string column in another table. An index can be used for this operation, however the comparison is performed lexicographically (i.e. treating the operands as strings, even if they represent numbers).
By explicitly casting one side to an integer, the comparison is performed numerically (as desired) - but this requires MySQL to implicitly convert the type of the string column and therefore it is unable to use that column's index.
You have hit this road bump because your schema is poorly designed. You should strive to ensure that all columns:
are encoded using the data types that are most relevant to their content; and
contain only a single piece of information — see Is storing a delimited list in a database column really that bad?
At the very least, your bbva_job.emplid should be an integer; and your bbva_deductions.employee_id should be split so that its parts are stored in separate (appropriately-typed) columns. With appropriate indexes, your query will then be considerably more performant.
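A hedged sketch of that fix, reusing the column names from the query above; the new emplid column on bbva_deductions is illustrative, and it assumes bbva_job.emplid is (or has been converted to) an indexed integer column:
-- Store the numeric part of employee_id in its own integer column.
ALTER TABLE bbva_deductions
  ADD COLUMN emplid INT UNSIGNED NULL;
UPDATE bbva_deductions
SET emplid = CAST(SUBSTRING_INDEX(employee_id, '-', -1) AS UNSIGNED);
-- With both sides integers, the join can use the index on bbva_job.emplid
-- without any per-row CAST.
SELECT d.ded_code, d.emplid AS tt_emplid, j.paygroup, j.file_nbr, d.ded_amount
FROM bbva_deductions AS d
LEFT JOIN bbva_job AS j ON d.emplid = j.emplid
LIMIT 500;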

Conversion of strings with binary values

This is a question of converting strings from DB2 to SQL Server.
On DB2 you can have a column that contains a mix of string and binary data (e.g. using REDEFINES in COBOL to combine string and decimal values into a single DB2 column).
This can have unpredictable results during data replication, as the binary zero (0x00) is treated as a string terminator (in the C family of languages).
Both SQL Server and DB2 are able to store binary zero in the middle of fixed length char columns without any issue.
Does anyone have any experience with this problem? The way I see it, the only way to fix it is to amend the COBOL program and the database schema: if you have a column of 14 chars where the first 10 are a string and the last 4 a decimal, split it into two columns, one for each part.
If you want to just transfer the data 1:1, I'd create a binary(x) field of equal length, or varbinary(x) in case the length differs.
If you need to easily access the stored string and decimal values, you could create a number of computed columns that extract the string/decimal values from the binary(x) field and represent them as normal columns. This would allow you to do an easy 1:1 migration while having simple and strongly typed access to the contents.
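A minimal T-SQL sketch of that computed-column idea, with an assumed 14-byte layout (first 10 bytes a string; the last 4 bytes are treated here as a plain 4-byte integer, whereas a real COBOL packed decimal would need its own decoding):
CREATE TABLE dbo.LegacyData
(
    raw_value binary(14) NOT NULL,
    -- First 10 bytes exposed as a fixed-length character string.
    text_part AS CAST(SUBSTRING(raw_value, 1, 10) AS char(10)),
    -- Last 4 bytes exposed as an integer (illustrative only; adjust to the real layout).
    num_part  AS CAST(SUBSTRING(raw_value, 11, 4) AS int)
);
-- Embedded 0x00 bytes are preserved in the binary column.
INSERT INTO dbo.LegacyData (raw_value) VALUES (0x414200434445464748490000007B);
SELECT text_part, num_part FROM dbo.LegacyData;   -- num_part = 123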
The optimal way would be to create strongly typed columns on the SQL Server database and then perform the actual migration either in COBOL or whatever script/system is used to perform the one time migration. You could still store a binary(x) to save the original value, in case a conversion error occurs, or you need to present the original value to the COBOL system.