MySQL - any way to join to table using a mask?

The column I want to join my two tables on has conceptually the same meaning in both, but the wording may differ. Is there any way to match the rows based on a few characters of a word?
Example:
Table 1 has HAZMAT and table 2 has ADDITIONAL HAZARDOUS. Say I want to use LEFT(column, 3) on table 1 to select the HAZ part and match it to the cells in table 2 that contain that part.
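One possible way to express this kind of match is a join whose condition is a LIKE pattern built from the first three characters. The sketch below runs in SQLite through Python only so that it is self-contained; the table and column names (table1, table2, label) are made up for illustration. In MySQL the pattern would be written as CONCAT('%', LEFT(t1.label, 3), '%').

```python
import sqlite3

# In-memory SQLite database standing in for the two MySQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (label TEXT);   -- hypothetical names
    CREATE TABLE table2 (label TEXT);
    INSERT INTO table1 VALUES ('HAZMAT');
    INSERT INTO table2 VALUES ('ADDITIONAL HAZARDOUS'), ('FRAGILE');
""")

# Join every table1 row to the table2 rows that contain its 3-character prefix.
# MySQL spelling of the condition: t2.label LIKE CONCAT('%', LEFT(t1.label, 3), '%')
matches = conn.execute("""
    SELECT t1.label, t2.label
    FROM table1 t1
    JOIN table2 t2
      ON t2.label LIKE '%' || substr(t1.label, 1, 3) || '%'
""").fetchall()

print(matches)
```

A pattern with a leading wildcard cannot use an ordinary index, so a join like this scans the second table for every row of the first, and a short prefix such as HAZ can easily over-match.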

Related

MS Access SQL Unequal join for 3 or more tables

I'm thinking of switching to using temp tables and vba.
I want to do this: I have multiple tables whose fields may or may not contain items that are in a one-to-many or one-to-one relationship. I know what those relationships are (and will create multiple queries accordingly). What I'm hunting for is each value that DOES NOT EXIST in any other table. To give an example:
Say we have 3 single column tables, table 1 is {x, y, z}, table 2 is {a, x, z}, and table 3 is {a,b,x,y,z}, the result will be b for t3 (yes I need to know where the error is). Pretty much, I want to use the unequal wizard but for 3 or more tables.
I may want to look for any item that exists in some but not all other tables. If you want to speak on that, it would be helpful, but I think that is strictly in the vba realm.
I think the challenge here is the open-endedness of the problem you are trying to solve. Varying column names, table names, and uniqueness thresholds across all tables would make it a bit more difficult. In the way I show below, I don't think it would be the most efficient, query-wise, but would be relatively easy to script. The following code assumes values in the tables are unique within each table.
There are 3 queries total:
qry_001_TableValues_ALL:
SELECT Table1.MyValue, "Table1" AS Source
FROM Table1
UNION
SELECT Table2.MyValue, "Table2" AS Source
FROM Table2
UNION
SELECT Table3.MyValue, "Table3" AS Source
FROM Table3;
qry_002_TableValues_Unique:
SELECT qry_001_TableValues_ALL.MyValue
FROM qry_001_TableValues_ALL
GROUP BY qry_001_TableValues_ALL.MyValue
HAVING (((Count(qry_001_TableValues_ALL.MyValue))=1));
qry_003_TableValues_UniqueWithSource:
SELECT qry_002_TableValues_Unique.MyValue, qry_001_TableValues_ALL.Source
FROM qry_002_TableValues_Unique INNER JOIN qry_001_TableValues_ALL
ON qry_002_TableValues_Unique.MyValue = qry_001_TableValues_ALL.MyValue;
The first query is the one you would need to script out if the columns or tables changed. It looks across all tables and builds a list of every value from the specified field, tagged with the name of its source table. The second query aggregates that list and keeps only the values with a count of 1, which means that across all tables involved there is only one instance of each value returned. You can script a change to the HAVING clause here to find values that appear in x tables instead. The final query joins those values back to the first query to determine the source table; it is the one you run to get the final report of the values you are looking for and where they reside.
Hope this is in the ballpark of what you are trying to do.
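As a rough, runnable illustration of the three-query approach, here is a sketch that uses SQLite through Python instead of Access, so the syntax differs slightly from what Access would accept. The view name all_values and the helper values_in_at_most are invented for this example; the data is the three-table example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (MyValue TEXT);
    CREATE TABLE Table2 (MyValue TEXT);
    CREATE TABLE Table3 (MyValue TEXT);
    INSERT INTO Table1 VALUES ('x'), ('y'), ('z');
    INSERT INTO Table2 VALUES ('a'), ('x'), ('z');
    INSERT INTO Table3 VALUES ('a'), ('b'), ('x'), ('y'), ('z');

    -- Step 1: every value tagged with its source table (hypothetical view name).
    CREATE VIEW all_values AS
        SELECT MyValue, 'Table1' AS Source FROM Table1
        UNION SELECT MyValue, 'Table2' FROM Table2
        UNION SELECT MyValue, 'Table3' FROM Table3;
""")

def values_in_at_most(max_tables):
    """Steps 2 and 3: keep values found in at most max_tables tables,
    then join back to the tagged list to recover where they live."""
    return conn.execute("""
        SELECT a.MyValue, a.Source
        FROM all_values a
        JOIN (SELECT MyValue
              FROM all_values
              GROUP BY MyValue
              HAVING COUNT(*) <= ?) u ON u.MyValue = a.MyValue
        ORDER BY a.MyValue, a.Source
    """, (max_tables,)).fetchall()

only_one = values_in_at_most(1)      # values that exist in exactly one table
some_not_all = values_in_at_most(2)  # values missing from at least one of the 3 tables
print(only_one)
print(some_not_all)
```

Changing the threshold passed to the HAVING clause is what covers the "exists in some but not all tables" case mentioned in the question.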

MySQL Join on 2 tables based on text field

I have two MySQL tables: "list" and "more"
The "list" table has upwards of 100,000 results.
The "more" table has upwards of 50,000 results.
The data from these tables cannot be combined.
The issue is, I have a column called "short_title" in both tables. They are both VARCHAR(255) and will contain a string like "short-title-here".
I'm using a simple query like:
SELECT L.title, M.more_info
FROM `list` L, `more` M
WHERE M.short_title = L.short_title
Since there are so many results in each table, and I need to be matching the results based on the "short_title" column which is a text field, it makes the queries EXTREMELY slow.
There is an INDEX on the "short_title" column in the "list" table, and the "short_title" column in my "more" table is UNIQUE
Is there anything I can do to the column (example: making them fulltext) that will make these queries faster?
Thank you in advance
**** UPDATE ****
I've changed my query to INNER JOIN the two tables.
The results of the explain query can be found here:
Try indexing the short_title fields.
This is your query (written using explicit joins):
SELECT L.title, M.more_info
FROM list L
JOIN more M ON M.short_title = L.short_title;
You state that short_title is being stored as text. You can create a regular index on it by doing:
create index more_short_title on more(short_title(255));
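The 255 is the prefix length to be indexed, counted in characters for nonbinary string types such as TEXT and VARCHAR (and in bytes for binary types). This should speed up the queries.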
Also, with a name like short_title, why not just use a varchar() column rather than text?
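To check whether a join like this actually uses an index, it helps to look at the query plan. The sketch below does that in SQLite through Python as a stand-in for MySQL, so the plan output is formatted differently from MySQL's EXPLAIN; the index names and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE list (id INTEGER PRIMARY KEY, title TEXT, short_title VARCHAR(255));
    CREATE TABLE more (id INTEGER PRIMARY KEY, more_info TEXT, short_title VARCHAR(255));
    -- Hypothetical index names mirroring the INDEX / UNIQUE setup in the question.
    CREATE INDEX idx_list_short_title ON list (short_title);
    CREATE UNIQUE INDEX idx_more_short_title ON more (short_title);
    INSERT INTO list (title, short_title) VALUES
        ('First Title', 'first-title'), ('Second Title', 'second-title');
    INSERT INTO more (more_info, short_title) VALUES
        ('details about the first', 'first-title');
""")

join_sql = """
    SELECT L.title, M.more_info
    FROM list L
    JOIN more M ON M.short_title = L.short_title
    ORDER BY L.title
"""
rows = conn.execute(join_sql).fetchall()

# The fourth column of each EXPLAIN QUERY PLAN row is the human-readable step.
plan = [step[3] for step in conn.execute("EXPLAIN QUERY PLAN " + join_sql)]
for step in plan:
    print(step)
```

If no step in the plan mentions an index, the join is falling back to scanning both tables, which is the situation the index is meant to avoid.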

MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard

I recently implemented the UDFs of the Damerau–Levenshtein algorithms into MySQL, and was wondering if there is a way to combine the fuzzy matching of the Damerau–Levenshtein algorithm with the wildcard searching of the Like function? If I have the following data in a table:
ID | Text
---------------------------------------------
1 | let's find this document
2 | let's find this docment
3 | When the book is closed
4 | The dcument is locked
I want to run a query that would incorporate the Damerau–Levenshtein algorithm...
select text from tablename where damlev('Document',tablename.text) <= 5;
...with a wildcard match to return IDs 1, 2, and 4 in my query. I'm not sure of the syntax or if this is possible, or whether I would have to approach this differently. The above select statement works fine in isolation, but it does not work on individual words. I would have to change the above SQL to...
select text from tablename where
damlev('let''s find this document',tablename.text) <= 5;
...which of course returns just ID 2. I'm hoping there is a way to combine the fuzzy and wildcard matching if I want all records returned that have the word "document", or variations of it, appearing anywhere within the Text field.
In working with person names, and doing fuzzy lookups on them, what worked for me was to create a second table of words. Also create a third table that is an intersect table for the many to many relationship between the table containing the text, and the word table. When a row is added to the text table, you split the text into words and populate the intersect table appropriately, adding new words to the word table when needed. Once this structure is in place, you can do lookups a bit faster, because you only need to perform your damlev function over the table of unique words. A simple join gets you the text containing the matching words.
A query for a single word match would look something like this:
SELECT T.* FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
JOIN Text AS T ON T.TextId = I.TextId
WHERE damlev('document',W.Word) <= 5
and two words would look like this (off the top of my head, so may not be exactly correct):
SELECT T.* FROM Text AS T
JOIN (SELECT I.TextId, COUNT(I.WordId) AS MatchCount FROM Words AS W
JOIN Intersect AS I ON I.WordId = W.WordId
WHERE damlev('john',W.Word) <= 2
OR damlev('smith',W.Word) <=2
GROUP BY I.TextId) AS Matches ON Matches.TextId = T.TextId
AND Matches.MatchCount = 2
The advantage here, at the cost of some database space, is that you only have to apply the time-expensive damlev function to the unique words, which will probably only number in the tens of thousands regardless of the size of your table of text. This matters because the damlev UDF will not use indexes: it scans the entire table it is applied to and computes a value for every row. Scanning just the unique words should be much faster. The second advantage is that damlev is applied at the word level, which seems to be what you are asking for. A third advantage is that you can expand the query to support searching on multiple words, and can rank the results by grouping the matching intersect rows on TextId and ranking on the count of matches.
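A minimal sketch of the word-table idea, using SQLite through Python with the four sample rows from the question. Everything here is an assumption made for illustration: the table names (texts, words, text_words), the pure-Python damlev function standing in for the MySQL UDF (it computes the optimal string alignment variant of Damerau–Levenshtein), the whitespace tokenizer, and the threshold of 2.

```python
import sqlite3

def damlev(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

conn = sqlite3.connect(":memory:")
conn.create_function("damlev", 2, damlev)  # stand-in for the MySQL UDF
conn.executescript("""
    CREATE TABLE texts (text_id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE text_words (text_id INTEGER, word_id INTEGER,
                             PRIMARY KEY (text_id, word_id));
""")

documents = {
    1: "let's find this document",
    2: "let's find this docment",
    3: "When the book is closed",
    4: "The dcument is locked",
}
for text_id, body in documents.items():
    conn.execute("INSERT INTO texts VALUES (?, ?)", (text_id, body))
    for word in set(body.lower().split()):  # naive tokenizer, an assumption
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT word_id FROM words WHERE word = ?", (word,)).fetchone()[0]
        conn.execute("INSERT INTO text_words VALUES (?, ?)", (text_id, word_id))

# damlev only runs once per unique word, not once per text row.
matching_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT t.text_id
    FROM words w
    JOIN text_words tw ON tw.word_id = w.word_id
    JOIN texts t ON t.text_id = tw.text_id
    WHERE damlev(?, w.word) <= 2
    ORDER BY t.text_id
""", ("document",))]
print(matching_ids)
```

With a per-word threshold this small, only "document", "docment" and "dcument" qualify, which is the word-level behaviour the question asks for.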

SQL - Comparing text(combinations) on 100million table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a field that stores from 3 up to 16 different "combinations" (varchar). A combination is a 4-digit number, a colon, and a character (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (they can go up to 16). This field is in every row of the table, never a null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). The thing that needs to be done here is when you send '0001:A/0002:A/0003:C' via program(or parameter in SQL), that it checks every row and see if they have the same "group". Now the catch here is that it has to go both ways and it has to be done "quick", and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all fields that share 3-16 of the same combinations and return the rows. This 3-16 can be specified via a parameter, but the problem is that you would need to find all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see you can send 0002:A as the first parameter.
But you cannot have indexing because a combination can be on any place in a string, and you can send different combinations that are not "attached" (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all fields that has the same 3-16 fields
and it has to go both ways, if you send "0001:A/0002:A/0003:C" it has to find the row above + similar rows(all that contain all the parameters).
Some things/options I tried:
Doing LIKE for all sent combinations is not practical and is too slow.
Giving the field a full-text index isn't an option (I don't know exactly why).
One of the few things that could work would be some "hash" type of encoding for the fields, calculated by the program, and then searching for all identical "hashes". (I don't know how you would do that, given that a hash would generate different values for similar texts; maybe some hash written exactly for this purpose.)
Making a new field and calculating/writing (can be done on insert) all possible combinations, then checking via SQL/program whether they share the same % of combinations. But I don't know how you can store 10080 combinations (in the case of 16) in a "varchar" effectively, or via some hash code, and then know which of them are similar.
There is another catch, this table is in usage almost 24/7, doing combinations to check if they are the same in SQL is too slow because the table is too big, it can be done via program or something, but I don't have any clue on how could you store this in a new row that you would know somehow that they are the same. It is a possibility that you would calculate combinations, storing them via some hash code or something on each row insert, calculating "hash" via program, and checking the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to be executed really fast, under 1 minute, because there could be new inserts into the table, that you would need to check.
The whole point of this would be to see if there are any similar combinations in SQL already and blocking any new combination that would be "similar" for inserting.
I have been dealing with that problem for 3 days now without any possible solution, the thing that was the closest is different type of insert/hash like, but I don't know how could that work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: it should have each group in its own cell to be normalized. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on the disk space too by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 -- lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
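To make the normalized layout concrete, here is a small sketch that loads the four example rows from the question into the three tables and runs the similarity query. It uses SQLite through Python rather than the asker's database, and the helper functions store and similar are invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE GRP_SEQUENCE_HEADER (ID INTEGER PRIMARY KEY, DESCRIPTION TEXT);
    CREATE TABLE GRP (ID INTEGER PRIMARY KEY, GROUP_TXT CHAR(6) UNIQUE);
    CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
        GROUP_ID INTEGER,
        GROUP_SEQUENCE_HEADER_ID INTEGER,
        GROUP_SEQUENCE_HEADER_ORDER INTEGER,
        PRIMARY KEY (GROUP_ID, GROUP_SEQUENCE_HEADER_ID));
""")

def store(header_id, sequence):
    """Split a packed sequence on '/' and store one row per group."""
    conn.execute("INSERT INTO GRP_SEQUENCE_HEADER VALUES (?, ?)", (header_id, sequence))
    for order, group in enumerate(sequence.split("/")):
        conn.execute("INSERT OR IGNORE INTO GRP (GROUP_TXT) VALUES (?)", (group,))
        group_id = conn.execute(
            "SELECT ID FROM GRP WHERE GROUP_TXT = ?", (group,)).fetchone()[0]
        conn.execute("INSERT INTO GRP_GRP_SEQUENCE_HEADER VALUES (?, ?, ?)",
                     (group_id, header_id, order))

def similar(sequence, min_shared=3, max_shared=16):
    """Return (header id, shared group count) for sequences sharing enough groups."""
    groups = sequence.split("/")
    placeholders = ",".join("?" * len(groups))
    return conn.execute(f"""
        SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
        FROM GRP_GRP_SEQUENCE_HEADER ggsh
        JOIN GRP g ON ggsh.GROUP_ID = g.ID
        WHERE g.GROUP_TXT IN ({placeholders})
        GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
        HAVING COUNT(1) BETWEEN ? AND ?
        ORDER BY ggsh.GROUP_SEQUENCE_HEADER_ID
    """, (*groups, min_shared, max_shared)).fetchall()

store(1, "0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A")
store(2, "0001:A/0002:A/0003:C")
store(3, "0001:A/0002:A/0003:A/0006:C")
store(4, "0701:A/0709:A/0711:C/0712:A/0713:A")

short_hits = similar("0001:A/0002:A/0003:C")
tail_hits = similar("0701:A/0709:A/0711:C/0712:A/0713:A")
print(short_hits, tail_hits)
```

Row 3 shares only two groups with the first search string (its third group is 0003:A, not 0003:C), so it falls below the lower boundary and is not returned.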
EDIT
Rethinking it a bit more, you could even break up the group into the two parts, but as I seem to understand, you always have full groups to deal with, so it doesn't seem to be necessary.
EDIT2 If you want to speed the process up even more, I'd recommend translating the groups into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it by 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in an even better performance - of course with the new structure.
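The encoding described above can be sketched in a few lines. The function names encode and decode are made up for this example, and the input is assumed to be in the "NNNN:X" form from the question with X between A and E.

```python
def encode(group):
    """Map 'NNNN:X' to an integer: the number shifted left by 4 bits,
    plus the letter read as a hex digit (A=10 ... E=14)."""
    number, letter = group.split(":")
    return (int(number) << 4) + int(letter, 16)

def decode(value):
    """Inverse of encode: the low 4 bits are the letter, the rest is the number."""
    return f"{value >> 4:04d}:{value & 0xF:X}"

for group in ("0001:A", "0002:B", "0343:D", "9999:E"):
    print(group, "-->", encode(group))
```

Because the letter always fits in the low 4 bits, the mapping is reversible, which is what makes it a bijection between groups and integers.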
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece on getting Back to Basics to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into those discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns; you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented, you would benefit immensely from normalizing your data. But can you not cheat and create a temp table that holds the key and explodes your column on the "/", so that you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested):
SELECT
t1.key
, t2.key
, COUNT(*)
FROM
temp_table t1
JOIN temp_table t2
ON t2.combination = t1.combination
AND t2.key <> t1.key
JOIN ( SELECT t3.key, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.key) t4
ON t4.key = t1.key
GROUP BY
t1.key
, t2.key
, t4.cnt
HAVING
COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
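The explode step and the subset check can be tried out on the two sample rows with a short sketch. It uses SQLite through Python rather than the asker's database; the column name row_key (instead of key) and the index name are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_table (row_key TEXT, combination TEXT)")

# The two packed rows from the example above.
source_rows = {
    "KEY": "0001:A/0002:A/0003:A/0006:C",
    "KEY1": "0001:A/0002:A/0003:A",
}
# Explode each packed value on '/' into one row per combination.
for row_key, packed in source_rows.items():
    conn.executemany("INSERT INTO temp_table VALUES (?, ?)",
                     [(row_key, combination) for combination in packed.split("/")])
conn.execute("CREATE INDEX idx_combination ON temp_table (combination)")

# A key is contained in another key when the number of combinations they
# share equals the total number of combinations the first key has.
subset_pairs = conn.execute("""
    SELECT t1.row_key, t2.row_key, COUNT(*) AS shared
    FROM temp_table t1
    JOIN temp_table t2
      ON t2.combination = t1.combination AND t2.row_key <> t1.row_key
    JOIN (SELECT row_key, COUNT(*) AS cnt
          FROM temp_table GROUP BY row_key) t4
      ON t4.row_key = t1.row_key
    GROUP BY t1.row_key, t2.row_key, t4.cnt
    HAVING COUNT(*) = t4.cnt
""").fetchall()
print(subset_pairs)
```

Only the pair (KEY1, KEY) comes back, because KEY has a fourth combination (0006:C) that KEY1 lacks.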
I guess I can recommend to build special "index".
It will be quite big but you will achieve superspeedy results.
Let's consider this task as searching a set of symbols.
There are design conditions.
The symbols are made by pattern "NNNN:X", where NNNN is number [0001-9999] and X is letter [A-E].
So we have 5 * 9999 = 49995 symbols in alphabet.
Maximum length of words with this alphabet is 16.
We can build for each word set of combinations of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As symbols are sorted in words we have only 2^N-1 combinations (15 for 4 symbols).
For 16-symbols word there are 2^16 - 1 = 65535 combinations.
So we create an additional index-organized table like this one:
create table spec_ndx(combination varchar2(100), original_value varchar2(100))
Performance will be excellent, at the price of storage overhead: in the worst case there will be 65535 "index" records for each record in the original table.
So for a 100-million-row table we would get an index table of roughly 6.5 trillion rows.
But if the values are short, the size of the "special index" shrinks drastically.
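The enumeration of sub-combinations described above can be sketched with itertools. The function name sub_combinations and its separator parameter are made up for this illustration; it assumes the symbols of a word are kept in sorted order, as stated in the answer.

```python
from itertools import combinations

def sub_combinations(symbols, separator="/"):
    """Return every non-empty, order-preserving subset of the sorted symbols,
    each joined into one string suitable for storing in the index table."""
    ordered = sorted(symbols)
    result = []
    for size in range(1, len(ordered) + 1):
        for subset in combinations(ordered, size):
            result.append(separator.join(subset))
    return result

abcd = sub_combinations("abcd", separator="")
tokens = sub_combinations("0001:A/0002:A/0003:C".split("/"))
print(len(abcd), abcd)
print(len(tokens), tokens)
```

The count is 2^N - 1 for N symbols, so the worst case of 16 symbols produces 65535 index entries for a single original row.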

Merge columns in MySQL SELECT

I have a table that stores a default configuration and a table that stores a user configuration. I can join the two tables and get all the info I need however I was hoping there might be a cleaner way to overwrite one column with the other when a value exists in the second column.
Example:
Current query result:
id defaultValue userValue
1 one ONE
2 two
3 three THREE
4 four
Desired query result:
id value
1 ONE
2 two
3 THREE
4 four
Maybe there isn't a good way to do this... Thought I'd ask though as it's probably faster to do it in MySQL if a method exists than to do it in PHP.
You can use COALESCE() for this:
SELECT id, COALESCE(uservalue,defaultvalue) AS value
FROM table
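For a self-contained illustration of COALESCE() combined with the join, here is a sketch using SQLite through Python and the sample data from the question. The table names default_config and user_config are assumptions, since the question does not name its tables, and a missing user value is assumed to be NULL rather than an empty string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table names; the question does not give them.
    CREATE TABLE default_config (id INTEGER PRIMARY KEY, defaultValue TEXT);
    CREATE TABLE user_config (id INTEGER PRIMARY KEY, userValue TEXT);
    INSERT INTO default_config VALUES (1, 'one'), (2, 'two'), (3, 'three'), (4, 'four');
    INSERT INTO user_config VALUES (1, 'ONE'), (3, 'THREE');
""")

# COALESCE returns its first non-NULL argument, so the user value wins
# whenever the LEFT JOIN finds one, and the default is used otherwise.
merged = conn.execute("""
    SELECT d.id, COALESCE(u.userValue, d.defaultValue) AS value
    FROM default_config d
    LEFT JOIN user_config u ON u.id = d.id
    ORDER BY d.id
""").fetchall()
print(merged)
```

COALESCE only skips NULLs. If unset user values are stored as empty strings instead, wrapping the column as NULLIF(u.userValue, '') before passing it to COALESCE would treat them the same way.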