Problems writing a query to match records between tables with complex matching criteria - sql-server-2008

I have a matching algorithm that I need to build on a large dataset in a SQL Server 2008 database. This is a minimal example of the kind of thing I need.
Say I have table1 and table2 each with just four columns, unique_id, col1, col2 and col3.
CREATE TABLE table1
(
unique_id VARCHAR(4),
col1 INT,
col2 INT,
col3 INT
);
INSERT INTO table1
VALUES ('ADAF', '2', '4', '17'),
('WSDA', '1', null, '12');
GO
CREATE TABLE table2
(
unique_id VARCHAR(4),
col1 INT,
col2 INT,
col3 INT
);
INSERT INTO table2
VALUES ('QWAS', '2', '4', '17'),
('FDFR', '3', '4', '17'),
('LKPY', '2', '4', null),
('FGDA', '1', null, '12'),
('GAPU', '1', '3', '12');
For all the records in table1, I want to return the unique IDs in table1 and the unique_ids in table2 where at least two out of three of the variables in col1, col2 and col3 are the same. It is OK if one of the variables does not match because it contains a null, but if any of the three columns contain a non-null value that does not match its counterpart in the other table then the match is automatically invalidated.
So in this example in table1 record ID ADAF will return QWAS (exact match) and LKPY as the non-null values match, but not FDFR because there is a non-match in col1.
WSDA will return FGDA and also GAPU as the null to 3 does not count as a mismatch.
I can solve this (inefficiently) by using a union query to get matches on the three columns into a temp table then linking this back to the original data with joins using another union query to get any invalidated matches, then running a query returning all matches, net of any invalidated matches.
However, my real world application needs me to match 3 out of 6 variables, with approximately 10 million rows in my record set matching to a much larger set of possible matches, so I need something more efficient.

You can get the results you want by JOINing the tables on matching values on each column while also allowing for NULL values to act as a match, and then counting the number of actual matches and requiring that to be at least 2:
SELECT t1.unique_id AS t1_id, t2.unique_id AS t2_id
FROM table1 t1
JOIN table2 t2 ON (t1.col1 = t2.col1 OR t1.col1 IS NULL OR t2.col1 IS NULL)
AND (t1.col2 = t2.col2 OR t1.col2 IS NULL OR t2.col2 IS NULL)
AND (t1.col3 = t2.col3 OR t1.col3 IS NULL OR t2.col3 IS NULL)
AND CASE WHEN t1.col1 = t2.col1 THEN 1 ELSE 0 END
+ CASE WHEN t1.col2 = t2.col2 THEN 1 ELSE 0 END
+ CASE WHEN t1.col3 = t2.col3 THEN 1 ELSE 0 END
>= 2
Output (for your sample data)
t1_id t2_id
ADAF QWAS
ADAF LKPY
WSDA FGDA
WSDA GAPU
Demo on dbfiddle
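For anyone who wants to try the logic without a SQL Server instance, here is a minimal sketch that runs the same join against the sample data in an in-memory SQLite database through Python's sqlite3 module. SQLite is swapped in purely for illustration, and the `find_matches` helper and its row lists are hypothetical names, not part of the original answer. It says nothing about performance at 10 million rows.

```python
import sqlite3

# Sample rows from the question (None stands for SQL NULL).
table1_rows = [("ADAF", 2, 4, 17), ("WSDA", 1, None, 12)]
table2_rows = [
    ("QWAS", 2, 4, 17),
    ("FDFR", 3, 4, 17),
    ("LKPY", 2, 4, None),
    ("FGDA", 1, None, 12),
    ("GAPU", 1, 3, 12),
]

def find_matches(rows1, rows2):
    """Return (t1_id, t2_id) pairs with at least two equal columns
    and no pair of non-null values that disagree."""
    con = sqlite3.connect(":memory:")
    for name in ("table1", "table2"):
        con.execute(
            f"CREATE TABLE {name} (unique_id TEXT, col1 INT, col2 INT, col3 INT)"
        )
    con.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows1)
    con.executemany("INSERT INTO table2 VALUES (?, ?, ?, ?)", rows2)
    query = """
        SELECT t1.unique_id, t2.unique_id
        FROM table1 t1
        JOIN table2 t2
          -- a NULL on either side never invalidates the match
          ON (t1.col1 = t2.col1 OR t1.col1 IS NULL OR t2.col1 IS NULL)
         AND (t1.col2 = t2.col2 OR t1.col2 IS NULL OR t2.col2 IS NULL)
         AND (t1.col3 = t2.col3 OR t1.col3 IS NULL OR t2.col3 IS NULL)
          -- count only real (non-null) equalities
         AND (CASE WHEN t1.col1 = t2.col1 THEN 1 ELSE 0 END
            + CASE WHEN t1.col2 = t2.col2 THEN 1 ELSE 0 END
            + CASE WHEN t1.col3 = t2.col3 THEN 1 ELSE 0 END) >= 2
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

matches = find_matches(table1_rows, table2_rows)
```

For the 3-out-of-6 case the same shape should carry over: six null-tolerant conditions in the ON clause and six CASE terms summed and compared with 3.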

Related

MySQL Removing duplicates based on condition and multiple columns combinations

I have a table in MySQL as below:
ID, COL1, COL2, VALUE
'1', 'OBJ1', 'OBJ2', '5'
'2', 'OBJ1', 'OBJ2', '1'
'3', 'OBJ2', 'OBJ1', '3'
'4', 'OBJ3', 'OBJ1', '4'
'5', 'OBJ3', 'OBJ4', '6'
The relation between col1 and col2 is independent of position, i.e. OBJ1 in col1 and OBJ2 in col2 is the same as OBJ1 in col2 and OBJ2 in col1. This means that OBJ1 and OBJ2 share a relationship.
So the pair OBJ1 and OBJ2 currently has the values 1, 5 and 3.
I want to keep only distinct pairs, i.e. OBJ1, OBJ2 should occur only once in the table, and OBJ2, OBJ1 should not appear either.
Importantly, I want to retain only the row with HIGHEST value.
The result I want is thus:
ID, COL1, COL2, VALUE
'1', 'OBJ1', 'OBJ2', '5'
'4', 'OBJ3', 'OBJ1', '4'
'5', 'OBJ3', 'OBJ4', '6'
What is the best and efficient way of doing this? I have over 10 million rows.
I have searched in many forums/Google but cannot find the exact answer I am looking for..
Try this:
SELECT t1.ID, t1.COL1, t1.COL2, t1.VALUE
FROM mytable AS t1
JOIN (
SELECT LEAST(COL1, COL2) AS C1,
GREATEST(COL1, COL2) AS C2,
MAX(VALUE) AS max_Value
FROM mytable
GROUP BY LEAST(COL1, COL2),
GREATEST(COL1, COL2)
) AS t2 ON LEAST(t1.COL1, t1.COL2) = t2.C1 AND GREATEST(t1.COL1, t1.COL2) = t2.C2 AND t1.VALUE = t2.max_Value
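As a quick check of the idea, here is a small sketch that runs the order-independent pair query on the sample rows in an in-memory SQLite database via Python's sqlite3. This is an illustration under assumptions, not MySQL itself: SQLite has no LEAST/GREATEST, so its two-argument scalar MIN/MAX stand in for them, and the join normalizes both sides of the pair so that a row stored as (OBJ3, OBJ1) still matches its group.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (ID INT, COL1 TEXT, COL2 TEXT, VALUE INT)")
con.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, "OBJ1", "OBJ2", 5),
        (2, "OBJ1", "OBJ2", 1),
        (3, "OBJ2", "OBJ1", 3),
        (4, "OBJ3", "OBJ1", 4),
        (5, "OBJ3", "OBJ4", 6),
    ],
)

# Two-argument MIN/MAX are SQLite's scalar equivalents of LEAST/GREATEST;
# one-argument MAX(VALUE) is the usual aggregate.
kept = con.execute(
    """
    SELECT t1.ID, t1.COL1, t1.COL2, t1.VALUE
    FROM mytable AS t1
    JOIN (
        SELECT MIN(COL1, COL2) AS C1,
               MAX(COL1, COL2) AS C2,
               MAX(VALUE)      AS max_value
        FROM mytable
        GROUP BY MIN(COL1, COL2), MAX(COL1, COL2)
    ) AS t2
      ON MIN(t1.COL1, t1.COL2) = t2.C1
     AND MAX(t1.COL1, t1.COL2) = t2.C2
     AND t1.VALUE = t2.max_value
    ORDER BY t1.ID
    """
).fetchall()
con.close()
```

Note that if two rows of the same pair tie on the highest value, both are returned, so a tie-breaker (for example the lowest ID) would be needed in that case.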
You could use an IN clause with a grouped subselect.
To also handle the problem of the order-independent pairs,
you should first normalize the data so that the smaller value is always in COL1:
select
id
, case when col1 <= col2 then col1 else col2 end COL1
, case when col1 > col2 then col1 else col2 end COL2
, value
from start_table
The query then becomes:
SELECT t1.ID, t1.COL1, t1.COL2, t1.VALUE
FROM (
select
id
, case when col1 <= col2 then col1 else col2 end COL1
, case when col1 > col2 then col1 else col2 end COL2
, value
from start_table
) t1
where (col1, col2, value) in (
select col1, col2, max(value)
FROM (
select
id
, case when col1 <= col2 then col1 else col2 end COL1
, case when col1 > col2 then col1 else col2 end COL2
, value
from start_table
) mytable
group by col1, col2
)
or using an inner join
SELECT t1.ID, t1.COL1, t1.COL2, t1.VALUE
FROM (
select
id
, case when col1 <= col2 then col1 else col2 end COL1
, case when col1 > col2 then col1 else col2 end COL2
, value
from start_table
) t1
inner join
(
select col1, col2, max(value) as value
FROM (
select
id
, case when col1 <= col2 then col1 else col2 end COL1
, case when col1 > col2 then col1 else col2 end COL2
, value
from start_table
) mytable
group by col1, col2
) T2 on t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.value = t2.value
Rebuild the table so that no dups are allowed; in the process, get rid of the dups. (And get rid of the apparently useless id.)
CREATE TABLE new (
col1 ...,
col2 ...,
`value` ...,
PRIMARY KEY(col1, col2),
INDEX(col2, col1, `value`)
) ENGINE=InnoDB;
INSERT INTO new (col1, col2, `value`)
SELECT LEAST(col1, col2),
GREATEST(col1, col2),
`value`
FROM real -- "real" stands for the name of your existing table
ON DUPLICATE KEY UPDATE
`value` := GREATEST(`value`, VALUES(`value`));
RENAME TABLE real TO old,
new TO real;
DROP TABLE old;
In the future, you will need this for INSERTing/UPDATEing new rows:
INSERT INTO real (col1, col2, `value`)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE
`value` := GREATEST(`value`, VALUES(`value`));
(This assumes you want to increase value whenever it is already in the table.)
These changes save space and improve speed (important for 10M rows): getting rid of id, having optimal indexes, using InnoDB, etc.

Error Code: 1060. Duplicate column name

I keep getting Error Code: 1060:
Duplicate column name 'NULL'
Duplicate column name '2016-08-04 01:25:06'
Duplicate column name 'john'
I need to insert a row in which several fields have the same value, but MySQL rejects it with the errors above. Presumably the subquery cannot select the same column name more than once; if so, is there another way of writing the query? Below is my current code:
INSERT INTO test.testTable SELECT *
FROM (SELECT NULL, 'hello', 'john', '2016-08-04 01:25:06', 'john'
, '2016-08-04 01:25:06', NULL, NULL) AS tmp
WHERE NOT EXISTS (SELECT * FROM test.testTable WHERE message= 'hello' AND created_by = 'john') LIMIT 1
My Column:
(id, message, created_by, created_date, updated_by, updated_date, deleted_by, deleted_date)
Please assist, thanks.
Your duplicate column names are coming from your subquery. You select null, john, and 2016-08-04 01:25:06 multiple times. Provide the columns you are selecting with names/aliases:
INSERT INTO test.testTable
SELECT *
FROM (SELECT NULL as col1, 'hello' as col2,
'john' as col3, '2016-08-04 01:25:06' as col4,
'john' as col5, '2016-08-04 01:25:06' as col6,
NULL as col7, NULL as col8) AS tmp
WHERE NOT EXISTS (SELECT *
FROM test.testTable
WHERE message= 'hello' AND created_by = 'john')
LIMIT 1
I'm not sure LIMIT 1 is useful here, since you are only selecting a single row to potentially insert.
You are using a subquery. Because you don't give the columns aliases, MySQL has to choose aliases for you -- and it chooses the formulas used for the definition.
You can write the query without the subquery:
INSERT INTO test.testTable( . . .)
SELECT NULL, 'hello', 'john', '2016-08-04 01:25:06', 'john',
'2016-08-04 01:25:06', NULL, NULL
FROM dual
WHERE NOT EXISTS (SELECT 1
FROM test.testTable tt
WHERE tt.message = 'hello' AND tt.created_by = 'john'
);
If you do use a subquery in the SELECT, then use correlation clauses in the WHERE subquery:
INSERT INTO test.testTable( . . .)
SELECT *
FROM (SELECT NULL as col1, 'hello' as message, 'john' as created_by,
'2016-08-04 01:25:06' as date, 'john' as col2,
'2016-08-04 01:25:06' as col3, NULL as col4, NULL as col5
) t
WHERE NOT EXISTS (SELECT 1
FROM test.testTable tt
WHERE tt.message = t.message AND
tt.created_by = t.created_by
);
In addition, the LIMIT 1 isn't doing anything because you only have one row.

Insert into table select where not exists affecting 0 rows

I'm trying to insert into a table with the following syntax:
INSERT INTO table1(
col1, col2, col3)
SELECT distinct
col1, col2, getDate()
FROM table2 WHERE NOT EXISTS(
SELECT 1 FROM table1, table2
WHERE ((table1.col1 = table2.col1) or (table1.col1 is null and table2.col1 is null))
AND ((table1.col2 = table2.col2) or (table1.col2 is null and table2.col2 is null)))
But when I run the query, it shows (0 row(s) affected).
The SELECT statement within the NOT EXISTS statement returns the correct number of rows that I don't want inserted. If I try to insert into the table without the WHERE NOT EXISTS statement, it inserts everything. I only want to insert rows that are not already in table1.
Try this:
INSERT INTO table1(col1, col2, col3)
SELECT distinct col1, col2, getDate()
FROM table2 WHERE NOT EXISTS(
SELECT 1 FROM table1
WHERE ((table1.col1 = table2.col1) or (table1.col1 is null and table2.col1 is null))
AND ((table1.col2 = table2.col2) or (table1.col2 is null and table2.col2 is null)))
You can optimize this query quite a bit, but as a quick fix you could change:
SELECT 1 FROM table1, table2
to:
SELECT 1 FROM table1
This will tie the outer table2 into your subquery.
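To see why the extra table2 in the inner FROM matters, here is a small sketch using Python's sqlite3 with an in-memory database. SQLite is used only for illustration (so `datetime('now')` replaces `getDate()`), and the `insert_missing` helper and the sample rows are hypothetical. With `FROM table1, table2` the inner table2 shadows the outer one, so the subquery is uncorrelated and is true for every outer row as soon as any pair matches; with `FROM table1` it is evaluated per outer row.

```python
import sqlite3

def insert_missing(correlated):
    """Run the INSERT ... WHERE NOT EXISTS and return how many rows were added.

    correlated=False reproduces the question's inner "FROM table1, table2";
    correlated=True uses the fixed inner "FROM table1".
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (col1 INT, col2 INT, col3 TEXT)")
    con.execute("CREATE TABLE table2 (col1 INT, col2 INT)")
    con.execute("INSERT INTO table1 VALUES (1, 10, 'old')")
    # (1, 10) already exists in table1; (2, NULL) does not.
    con.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, 10), (2, None)])
    before = con.execute("SELECT COUNT(*) FROM table1").fetchone()[0]

    inner_from = "table1" if correlated else "table1, table2"
    con.execute(
        f"""
        INSERT INTO table1 (col1, col2, col3)
        SELECT DISTINCT col1, col2, datetime('now')
        FROM table2
        WHERE NOT EXISTS (
            SELECT 1 FROM {inner_from}
            WHERE (table1.col1 = table2.col1
                   OR (table1.col1 IS NULL AND table2.col1 IS NULL))
              AND (table1.col2 = table2.col2
                   OR (table1.col2 IS NULL AND table2.col2 IS NULL)))
        """
    )
    after = con.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
    con.close()
    return after - before

rows_added_original = insert_missing(correlated=False)
rows_added_fixed = insert_missing(correlated=True)
```

In this sketch the original form adds nothing, while the corrected form adds only the (2, NULL) row.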

MYSQL INSERT IF SUMs > CONSTANT

I'm trying to insert a record if a sum of 3 user columns from 2 tables exceeds a constant.
I've searched all over, found you can't put user variables in IFs, WHERE's etc. Found you can't put SUMs in IFs, WHERE's etc. I'm at a total loss. Here's an example of my earlier bad code before unsuccessfully trying to use SUMs in WHEREs, if it helps:
SELECT SUM(num1) INTO @mun1 FROM table1 WHERE user = '0';
SELECT SUM(num2) INTO @mun2 FROM table1 WHERE user = '0';
SELECT SUM(num3) INTO @mun3 FROM table2 WHERE column1 = 'd' AND user = '0';
SET @mun4 = @mun1 - @mun2 - @mun3;
INSERT INTO table2 (user, column1, column2) VALUES ('0', 'd', '100') WHERE @mun4 >= 100;
Try this:
INSERT INTO table2 (user, column1, column2)
select '0', 'd', '100'
from dual
where (SELECT SUM(num1 - num2) FROM table1 WHERE user = '0') -
(SELECT SUM(num3) FROM table2 WHERE column1 = 'd' AND user = '0') > 100;
This is a case of the general solution for a "insert if condition" problem:
insert into ... select ... where condition
The select will only return rows if the condition is true, and importantly, will return no rows if false - meaning the insert only happens if the condition is true, otherwise nothing happens.
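The pattern can be tried out with a short sketch in Python's sqlite3 (an in-memory database, used here only for illustration; SQLite allows a SELECT with a WHERE but no FROM, so `dual` is not needed). The table layout, the sample numbers and the `insert_if_balance_allows` helper are assumptions made up for the demo. The comparison follows the question's `num1 - num2 - num3 >= 100`, and the COALESCE is an addition of mine so that an empty table2 counts as 0 rather than making the whole condition NULL.

```python
import sqlite3

def insert_if_balance_allows(threshold):
    """Insert one row into table2 only when SUM(num1 - num2) - SUM(num3)
    reaches the threshold; return the number of rows inserted (0 or 1)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (user TEXT, num1 INT, num2 INT)")
    con.execute(
        "CREATE TABLE table2 (user TEXT, column1 TEXT, column2 INT, num3 INT)"
    )
    # Made-up data: SUM(num1 - num2) = 330 and SUM(num3) = 80, so the total is 250.
    con.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?)", [("0", 300, 50), ("0", 100, 20)]
    )
    con.execute("INSERT INTO table2 VALUES ('0', 'd', 100, 80)")
    before = con.execute("SELECT COUNT(*) FROM table2").fetchone()[0]

    # The SELECT yields one row when the condition holds and none otherwise,
    # so the INSERT either adds exactly one row or does nothing.
    con.execute(
        """
        INSERT INTO table2 (user, column1, column2)
        SELECT '0', 'd', 100
        WHERE (SELECT SUM(num1 - num2) FROM table1 WHERE user = '0')
            - (SELECT COALESCE(SUM(num3), 0) FROM table2
               WHERE column1 = 'd' AND user = '0')
            >= ?
        """,
        (threshold,),
    )
    after = con.execute("SELECT COUNT(*) FROM table2").fetchone()[0]
    con.close()
    return after - before

inserted_low = insert_if_balance_allows(100)
inserted_high = insert_if_balance_allows(1000)
```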
This is the same as @Bohemian's answer, but you have to add a LIMIT clause to stop multiple records being inserted, since the SELECT may return multiple rows:
INSERT INTO table2 (user, column1, column2)
SELECT '0', 'd', '100'
FROM dual
WHERE
(SELECT SUM(num1 - num2) FROM table1 WHERE user = '0') -
(SELECT SUM(num3) FROM table2 WHERE column1 = 'd' AND user = '0') >
100
LIMIT 1

Updating values of SQL column with values from the same column

Please look at the image above. I need to fill in the null values of TID (the third column in the table) using the value from the populated rows that precede each run of nulls.
So in the above example, I need rows 44-57 to be 040, rows 60-87 to be 077, etc. One pattern that could be used is that column 2 has INS in the string on the rows where the value in column 3 changes, so I was thinking about using DATA LIKE 'INS%' in some way.
Please let me know what you think of the problem and any possible solutions.
Thanks!
DECLARE @x TABLE
(Column1 INT, Column2 VARCHAR(64), TID VARCHAR(10));
INSERT @x VALUES
(42, 'INS{whatever}', '040'),
(43, 'somethingelse', '040'),
(44, 'somethingelse', NULL),
(45, 'somethingelse', NULL),
(46, 'somethingelse', NULL),
(47, 'somethingelse', NULL),
(48, 'somethingelse', NULL),
(49, 'INS{whatever}', '077'),
(50, 'somethingelse', '077'),
(51, 'somethingelse', NULL),
(52, 'somethingelse', NULL);
;WITH x AS (SELECT i = Column1, TID, rn = ROW_NUMBER() OVER (ORDER BY Column1)
FROM @x WHERE Column2 LIKE 'INS%'
),
y AS (SELECT x.TID, s = x.i, e = COALESCE(x2.i, 2000000000)
FROM x LEFT OUTER JOIN x AS x2 ON x.rn = x2.rn -1
)
UPDATE src SET TID = y.TID
FROM @x AS src
INNER JOIN y ON src.Column1 > y.s AND src.Column1 < y.e;
SELECT * FROM @x;
This assumes that:
The first two columns in your sample were duplicated (I ignore the first listed)
Col1 is a primary key
Values are to be assigned as you described based on ascending values in Col1
Performance might be bad to very bad on large tables
Performance would improve with suitable indexing (on Col1 and Col3)
Substitute in your table and column names, and check for minor typos.
UPDATE mt
set Col3 = mt2.Col3
from MyTable mt
inner join (-- For each row that has no Col3, find the nearest earlier row that has one
select t1.Col1, max(t2.Col1) EarlierValueHere
from MyTable t1
inner join MyTable t2
on t2.Col1 < t1.Col1
and t2.Col3 is not null
where t1.Col3 is null
group by t1.Col1) earlier
on earlier.Col1 = mt.Col1
inner join MyTable mt2
on mt2.Col1 = earlier.EarlierValueHere
Another query you might use:
update t set TID = X.NonNullTID
from [YourTable] t
join
(select
t1.Column1, t1.Column2, t1.TID,
(select top 1 tid from [YourTable]
where TID is not null and Column1 <= t1.Column1
order by Column1 desc) as NonNullTID
from [YourTable] t1) X
on X.Column1 = t.Column1
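The "take the nearest earlier non-null TID" idea behind these queries can be checked with a small sketch in Python's sqlite3. SQLite is used only so the example is runnable anywhere: it has no TOP or UPDATE ... FROM join of the SQL Server kind, so a correlated subquery with ORDER BY ... LIMIT 1 stands in for them. The table name `x` and the rows follow the sample data from the first answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE x (Column1 INT PRIMARY KEY, Column2 TEXT, TID TEXT)")
con.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [
        (42, "INS{whatever}", "040"),
        (43, "somethingelse", "040"),
        (44, "somethingelse", None),
        (45, "somethingelse", None),
        (46, "somethingelse", None),
        (47, "somethingelse", None),
        (48, "somethingelse", None),
        (49, "INS{whatever}", "077"),
        (50, "somethingelse", "077"),
        (51, "somethingelse", None),
        (52, "somethingelse", None),
    ],
)

# For every row with a NULL TID, copy the TID of the closest earlier row
# (by Column1) that has one. The inner copy of the table is aliased p,
# so x.Column1 refers to the row being updated.
con.execute(
    """
    UPDATE x
    SET TID = (SELECT p.TID
               FROM x AS p
               WHERE p.TID IS NOT NULL
                 AND p.Column1 < x.Column1
               ORDER BY p.Column1 DESC
               LIMIT 1)
    WHERE TID IS NULL
    """
)
filled = dict(con.execute("SELECT Column1, TID FROM x"))
con.close()
```

On a large table the correlated lookup runs once per NULL row, so an index on Column1 is probably worth having.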