How can I compare if 2 tables have the same data? - mysql

If I have 2 tables and want to find if they have the same data, what is the most straightforward way to do it in MySQL?
I have read about doing a correlated subquery and UNION ALL, but that query is about 2 pages (!) long and I cannot really follow what it is doing. There must be an easier way.
Even something like having MySQL copy the table data to files and running vimdiff on them would do (I am not sure that this is even possible, is it? Just thinking out loud).
UPDATE
I am interested only in the table data, not the structure. I am adding this to clarify an ambiguous comment I made.

If you just want to tell whether the tables are identical or not as efficiently as possible, use this query:
SELECT 1 FROM (
SELECT * FROM table1
UNION ALL
SELECT * FROM table2
) t
GROUP BY col1, col2, col3
HAVING count(*) = 1
LIMIT 1
List all the columns in GROUP BY to compare the entire table.
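If the result is an empty set, the two tables are identical. This assumes that no row is duplicated within a single table; a row that appears twice in one table and never in the other has a count of 2 and would go undetected.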
If you want to see the differences, use this query:
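-- MIN(tname) is used because tname is not in the GROUP BY (required under ONLY_FULL_GROUP_BY)
SELECT MIN(tname) AS tname, col1, col2, col3 FROM (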
SELECT 'table1' tname, col1, col2, col3 FROM table1
UNION ALL
SELECT 'table2' tname, col1, col2, col3 FROM table2
) t
GROUP BY col1, col2, col3
HAVING count(*) = 1
List the same columns in the inner SELECT as in the GROUP BY, plus a column to distinguish the two tables.
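As a rough, runnable sketch of the identical-or-not test above, the same UNION ALL / GROUP BY / HAVING query can be tried against an in-memory SQLite database from Python. SQLite is only a stand-in for MySQL here, and the helper name `tables_differ` and the sample rows are assumptions made for illustration.

```python
import sqlite3

def tables_differ(conn, t1, t2, cols):
    """Return True if some row occurs exactly once across both tables.

    Table and column names are interpolated directly, so only pass trusted names.
    """
    col_list = ", ".join(cols)
    sql = f"""
        SELECT 1 FROM (
            SELECT {col_list} FROM {t1}
            UNION ALL
            SELECT {col_list} FROM {t2}
        ) t
        GROUP BY {col_list}
        HAVING COUNT(*) = 1
        LIMIT 1
    """
    return conn.execute(sql).fetchone() is not None

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INT, col2 TEXT, col3 TEXT);
    CREATE TABLE table2 (col1 INT, col2 TEXT, col3 TEXT);
    INSERT INTO table1 VALUES (1, 'a', 'x'), (2, 'b', 'y');
    INSERT INTO table2 VALUES (1, 'a', 'x'), (2, 'b', 'y');
""")
cols = ["col1", "col2", "col3"]

same_before = not tables_differ(conn, "table1", "table2", cols)
# Change one value so the tables no longer hold the same data.
conn.execute("UPDATE table2 SET col3 = 'z' WHERE col1 = 2")
same_after = not tables_differ(conn, "table1", "table2", cols)

print(same_before, same_after)  # True False
```

The LIMIT 1 lets the database stop at the first unmatched group, which is why this form suits a plain yes/no check.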

Just throwing this out there, you could emulate a full outer join and then return the rows where just the right or the left side is null.
select t1.*
from table1 t1
LEFT OUTER JOIN table2 t2
ON t1.col1 = t2.col1
AND t1.col2 = t2.col2
AND ...
WHERE t2.id is null
UNION
select t2.*
from table2 t2
LEFT OUTER JOIN table1 t1
ON t2.col1 = t1.col1
AND t2.col2 = t1.col2
AND ...
WHERE t1.id is null
MySQL has no FULL OUTER JOIN, so the two LEFT OUTER JOINs combined with UNION emulate one: each half returns the rows that have no matching row in the other table.
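A small sketch of this two-sided anti-join, run against SQLite from Python purely for illustration. The `src` label column, the two-column schema, and the sample data are assumptions and not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, col1 INT, col2 TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, col1 INT, col2 TEXT);
    INSERT INTO table1 (col1, col2) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 (col1, col2) VALUES (1, 'a'), (3, 'c'), (4, 'd');
""")

# Each half keeps the rows whose partner is missing on the other side
# (the joined id is NULL only when no match was found).
only_in_one = conn.execute("""
    SELECT 'table1' AS src, t1.col1, t1.col2
    FROM table1 t1
    LEFT OUTER JOIN table2 t2
      ON t1.col1 = t2.col1 AND t1.col2 = t2.col2
    WHERE t2.id IS NULL
    UNION ALL
    SELECT 'table2' AS src, t2.col1, t2.col2
    FROM table2 t2
    LEFT OUTER JOIN table1 t1
      ON t2.col1 = t1.col1 AND t2.col2 = t1.col2
    WHERE t1.id IS NULL
    ORDER BY src
""").fetchall()

print(only_in_one)  # [('table1', 2, 'b'), ('table2', 4, 'd')]
```

An empty result would mean every row found a partner in the other table. Note that columns holding NULL never compare equal with `=`, so such rows would be reported as differences.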

Use the following query:
SELECT c1 = cjoin AND c2 = cjoin equiv
FROM (SELECT COUNT(*) c1 FROM Table1) t1,
(SELECT COUNT(*) c2 FROM Table2) t2,
(SELECT COUNT(*) cjoin
FROM Table1 t1
JOIN Table2 t2
ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 ...) tjoin
Assuming the tables have a unique key, this will return equiv = 1 if the tables are equal. It doesn't show the differences, it's just a binary test.
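To see the binary test in action, here is a minimal sketch using SQLite from Python as a stand-in for MySQL. The function name `tables_equal`, the derived-table aliases, and the sample rows are illustrative assumptions; the UNIQUE constraints reflect the unique-key assumption stated above.

```python
import sqlite3

def tables_equal(conn):
    """Both row counts must equal the row count of the all-column join."""
    row = conn.execute("""
        SELECT c1 = cjoin AND c2 = cjoin AS equiv
        FROM (SELECT COUNT(*) AS c1 FROM Table1) a,
             (SELECT COUNT(*) AS c2 FROM Table2) b,
             (SELECT COUNT(*) AS cjoin
              FROM Table1 t1
              JOIN Table2 t2
                ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3) j
    """).fetchone()
    return bool(row[0])

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (col1 INT, col2 TEXT, col3 TEXT, UNIQUE (col1, col2, col3));
    CREATE TABLE Table2 (col1 INT, col2 TEXT, col3 TEXT, UNIQUE (col1, col2, col3));
    INSERT INTO Table1 VALUES (1, 'a', 'x'), (2, 'b', 'y');
    INSERT INTO Table2 VALUES (1, 'a', 'x'), (2, 'b', 'y');
""")

equal_before = tables_equal(conn)
# Remove one row so the counts no longer line up.
conn.execute("DELETE FROM Table2 WHERE col1 = 2")
equal_after = tables_equal(conn)

print(equal_before, equal_after)  # True False
```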

I was reading SQL Cookbook by A. Molinaro when I came across a solution.
It is based on a table
emp(empno,ename,job,mgr,hiredate,sal,comm,deptno)
and a view
V
which has the same columns but different rows. The columns mgr and comm might be NULL; the other columns cannot.
The solution in the book is very long and does not show all differences, although this was the stated problem in 3.7.
I came up with my own solution, which is shorter and shows all differences (that is, all rows that have different counts in the two tables).
select * from
# those which are contained in the (distinct) union of (col1,col2,...,coln, count) of both tables:
( select empno,ename,job,mgr,hiredate,comm,deptno, count(*) cnt from emp group by empno,ename,job,mgr,hiredate,comm,deptno
union
select empno,ename,job,mgr,hiredate,comm,deptno, count(*) cnt from V group by empno,ename,job,mgr,hiredate,comm,deptno
) as unionOfBoth
where (empno,ename,job,mgr,hiredate,comm,deptno,cnt)
not in
# those which are contained in the intersection of both tables with the equal number of counts:
( select e.empno,e.ename,e.job,e.mgr,e.hiredate,e.comm,e.deptno,e.cnt
from
(select empno, ename,job,mgr,hiredate,comm,deptno, count(*) cnt from emp group by empno,ename,job,mgr,hiredate,comm,deptno) e,
(select empno, ename,job,mgr,hiredate,comm,deptno, count(*) cnt from V group by empno,ename,job,mgr,hiredate,comm,deptno) v
where
e.empno = v.empno
and e.ename = v.ename
and e.job = v.job
and ifnull(e.mgr,0) = ifnull(v.mgr,0)
and e.hiredate = v.hiredate
and e.deptno = v.deptno
and ifnull(e.comm,0) = ifnull(v.comm,0)
and e.cnt = v.cnt
);
Basically, you count the distinct rows in each table and do a UNION (not UNION ALL) to get the temporary table unionOfBoth. Then you remove the rows that both tables have in common.
Here, a row r1 from table t1 and a row r2 from table t2 are considered the same if
(r1, count of r1 in t1) = (r2, count of r2 in t2), which is equivalent to r1 = r2 (on all columns) and (count of r1 in t1) = (count of r2 in t2).
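The (row, count) definition of equality can also be illustrated outside SQL. The sketch below swaps in a different technique: it fetches the rows into Python and compares them with `collections.Counter` instead of running the query above. The three-column schema, the use of a plain table named V in place of the view, and the sample data are assumptions.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (empno INT, ename TEXT, comm INT);
    CREATE TABLE V   (empno INT, ename TEXT, comm INT);
    INSERT INTO emp VALUES (1, 'SMITH', NULL), (2, 'ALLEN', 300), (2, 'ALLEN', 300);
    INSERT INTO V   VALUES (1, 'SMITH', NULL), (2, 'ALLEN', 300), (3, 'WARD', 500);
""")

def row_counts(conn, table):
    """Map each distinct row to the number of times it occurs in the table."""
    rows = conn.execute(f"SELECT empno, ename, comm FROM {table}").fetchall()
    return Counter(rows)

emp_counts = row_counts(conn, "emp")
v_counts = row_counts(conn, "V")

# A (row, count) pair present on only one side is a difference.
differences = sorted(set(emp_counts.items()) ^ set(v_counts.items()), key=repr)

for row, cnt in differences:
    print(row, cnt)
```

Because Python compares `None` with `None` as equal, NULL values in mgr-like columns need no special handling here, whereas the SQL version relies on `ifnull` for that.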

If the tables are small enough, you can export both tables as CSV files, then copy one of the tables and paste it side by side with the other. You can then go row by row and check whether the values match.
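The export-and-compare idea can be automated rather than done by eye. The following is a minimal sketch, assuming the data is reachable from Python; it uses SQLite in place of MySQL, and the helper name `dump_csv` and the sample rows are made up for illustration. With MySQL itself, the export step could presumably be done with `SELECT ... INTO OUTFILE` and the resulting files compared with a tool such as vimdiff.

```python
import csv
import difflib
import io
import sqlite3

def dump_csv(conn, table, order_by):
    """Return the table as a list of CSV lines, sorted so the diff is stable."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    cur = conn.execute(f"SELECT * FROM {table} ORDER BY {order_by}")
    writer.writerow([d[0] for d in cur.description])  # header row
    writer.writerows(cur)
    return buf.getvalue().splitlines()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INT, col2 TEXT);
    CREATE TABLE table2 (col1 INT, col2 TEXT);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO table2 VALUES (1, 'a'), (2, 'B');
""")

diff = list(difflib.unified_diff(
    dump_csv(conn, "table1", "col1"),
    dump_csv(conn, "table2", "col1"),
    "table1", "table2", lineterm="",
))
print("\n".join(diff))
```

An empty diff means the sorted exports are identical. Sorting by a deterministic key matters, because row order is otherwise not guaranteed.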

Related

Merge rows in mysql based on condition

I am trying to merge rows based on a condition in MySQL.
I have a table as shown below:
I want to merge row 1 into row 2 (the row where the attendance count is larger)
and need the result to look like this:
I was trying to divide the dataset into 2 parts using the query below
select
a.student_id,a.school_id,a.name,a.grant,a.classification,a.original_classification,a.consent_type
from (
select * from school_temp where original_classification='all' and availability='implicit')a
join(select * from school_temp where original_classification!='all' and availability!='implicit')b
on a.student_id = b.student_id and a.school_id=b.school_id and a.name=b.name
But I am unable to merge the rows and get the total attendance count.
Please help; I am badly stuck on this.
Split this into two queries that you combine with UNION.
The first joins the implicit row with the row with the highest attendance among the explicit rows for each student. See Retrieving the last record in each group - MySQL for how that works. Add the two attendance_count values together to combine the attendances (the column has to be qualified, since it exists on both sides of the join).
The second query in the UNION gets all the rows that don't have the highest attendance.
WITH explicit as (
SELECT *
FROM school_temp
WHERE original_classification!='all' and availability!='implicit'
)
SELECT a.student_id, a.school_id, a.name, a.attendance_count + b.attendance_count AS attendance_count,
b.grant, b.classification, b.original_classification, b.consent_type
FROM school_temp AS a
JOIN (
SELECT t1.*
FROM explicit AS t1
JOIN (
SELECT student_id, school_id, name, MAX(attendance_count) AS max_attendance
FROM explicit AS t2
GROUP BY student_id, school_id, name
) AS t2 ON t1.student_id = t2.student_id AND t1.school_id = t2.school_id AND t1.name = t2.name AND t1.attendance_count = t2.max_attendance
) AS b ON a.student_id = b.student_id and a.school_id=b.school_id and a.name=b.name
WHERE a.original_classification = 'all' AND a.availability = 'implicit'
UNION ALL
SELECT t1.*
FROM explicit AS t1
JOIN (
SELECT student_id, school_id, name, MAX(attendance_count) AS max_attendance
FROM explicit AS t2
GROUP BY student_id, school_id, name
) AS t2 ON t1.student_id = t2.student_id AND t1.school_id = t2.school_id AND t1.name = t2.name AND t1.attendance_count < t2.max_attendance
I've used a CTE to give a name to the subquery that gets all the explicit rows. If you're using MySQL 5.x, you'll need to replace explicit with that subquery throughout the query. Or you could define it as a view.

How to find rows in mySQL DB which have no duplicates?

I have a table:
--id---name---col1--col2--col3...-colN--created.
--1---myName---col1--col2--col3...-colN--created1.
--2---myName---col1--col2--col3...-colN--created2.
--3---myOtherName---Othercol1--Othercol2--Othercol3...-OthercolN--created3.
The id and created fields are unique.
The rest of the rows have duplicates: exactly the same set of values (name+col1+col2+col3+..+colN).
However, a few rows are completely unique. How could I find them (row 3 in my example)?
You can use NOT EXISTS and a correlated subquery selecting rows from the same table with a different ID but equal values.
SELECT *
FROM elbat t1
WHERE NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.id <> t1.id
AND t2.col1 = t1.col1
AND t2.col2 = t1.col2
AND t2.col3 = t1.col3
...
AND t2.coln = t1.coln);
You can group by the fields that must be unique and then select the rows where the count equals one.
SELECT *
FROM
mytable
INNER JOIN (
SELECT MIN(id) AS id
FROM
mytable
GROUP BY
col1, col2, col3
HAVING
COUNT(*) = 1
) t
ON mytable.id = t.id;
There are a number of solutions. Depending on the amount of data and performance requirements you could add indexes and test a couple of solutions to get the optimal results.
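Both approaches from this thread (NOT EXISTS and GROUP BY with a count of one) can be tried side by side. This is only a sketch run against SQLite from Python; the table name elbat is taken from the first answer, while the reduced column set and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elbat (id INTEGER PRIMARY KEY, name TEXT, col1 TEXT, col2 TEXT);
    INSERT INTO elbat (name, col1, col2) VALUES
        ('myName', 'a', 'b'),
        ('myName', 'a', 'b'),
        ('myOtherName', 'x', 'y');
""")

# Approach 1: keep rows for which no other row (different id) has the same values.
via_not_exists = conn.execute("""
    SELECT t1.id, t1.name
    FROM elbat t1
    WHERE NOT EXISTS (SELECT 1
                      FROM elbat t2
                      WHERE t2.id <> t1.id
                        AND t2.name = t1.name
                        AND t2.col1 = t1.col1
                        AND t2.col2 = t1.col2)
""").fetchall()

# Approach 2: group by the value columns and keep groups that contain one row.
via_group_by = conn.execute("""
    SELECT MIN(id) AS id, name
    FROM elbat
    GROUP BY name, col1, col2
    HAVING COUNT(*) = 1
""").fetchall()

print(via_not_exists, via_group_by)  # [(3, 'myOtherName')] [(3, 'myOtherName')]
```

The two forms differ when the compared columns contain NULL: GROUP BY treats NULLs as one group, while `=` in the correlated subquery never matches them.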

Left Join with Null Script Efficiency Explanation Needed

Why would I use a LEFT JOIN in SQL in a FROM clause and attach a WHERE clause where the entity "is null"? I was told this is a very efficient script and I should learn the methodology behind it.
For example:
FROM
something
LEFT JOIN aRow a AND bRow b AND cRow c AND dRow d
WHERE
bRow.b IS NULL;
This kind of construct is used when you specifically want to know something like "a list of all customers who have never ordered anything" :
SELECT
customers.*
FROM
customers
LEFT JOIN
orders
ON
orders.customerid = customers.id
WHERE
orders.id IS NULL
Or to quote an old manager of mine: "Can you get the database to give me a list of everything that isn't in the database?"
Me> "Sure, can you give me a list of what things the database should tell you it doesn't have?"
Him> "How am I supposed to know that?"
This really is a fairly generic, non-RDBMS-specific question. The logic will apply to pretty much any flavor of SQL. And this is a technique that anyone who works with data queries should be familiar with.
For all intents and purposes (and moving past the flawed syntax in the OP), this is the same query as:
SELECT *
FROM table1
WHERE table1.col1 NOT IN (
SELECT table2.col1 FROM table2 WHERE table2.col2 = <filterHere>
)
When you are dealing with a couple of hundred rows, you may not see a significant difference in performance. But once both tables hold a few million rows, you will often see a significant performance improvement with
SELECT table1.*
FROM table1
LEFT OUTER JOIN table2 ON table1.col1 = table2.col1
AND table2.col2 = 42
WHERE table2.id IS NULL
Let's illustrate what is happening with these queries.
Create test tables.
CREATE TABLE table1 (col1 int, col2 varchar(10)) ;
INSERT INTO table1 ( col1, col2 )
VALUES (1,'a')
, (2,'b')
, (3,'c')
, (4,'d') ;
CREATE TABLE table2 (col1 int, col2 varchar(10)) ;
INSERT INTO table2 ( col1, col2 )
VALUES (1,'a')
, (3,'c') ;
This gives us
table1
col1 col2
1 a
2 b
3 c
4 d
table2
col1 col2
1 a
3 c
Now we want the rows that are in table1 but not in table2.
SELECT t1.col1, t1.col2
FROM table1 t1
WHERE t1.col1 NOT IN (
SELECT t2.col1 FROM table2 t2
)
We can't SELECT anything from table2, because that table is only referenced inside the subquery and is not part of the outer query. It's not available to us.
This breaks down to
SELECT t1.col1, t1.col2
FROM table1 t1
WHERE t1.col1 NOT IN ( 1,3 )
Which further breaks down to
SELECT t1.col1, t1.col2
FROM table1 t1
WHERE t1.col1 <> 1
AND t1.col1 <> 3
These queries give us
col1 col2
2 b
4 d
That's a subquery broken down into 2 different comparisons, joined by AND, to filter our results.
So let's look at a JOIN. We want all of the records on the left side, and only those on the right side that match. So
SELECT t1.col1 AS t1_col1, t1.col2 AS t1_col2, t2.col1 AS t2_col1, t2.col2 AS t2_col2
FROM table1 t1
LEFT OUTER JOIN table2 t2 ON t1.col1 = t2.col1
With a JOIN, both tables are available to our SELECT, so we can see which records in table2 match up to those in table1. The above gives us
t1_col1 t1_col2 t2_col1 t2_col2
1 a 1 a
2 b NULL NULL
3 c 3 c
4 d NULL NULL
With the extra data, we can see that the rows with col1 values 2 and 4 have no match in table2. We can now keep only those rows with a simple WHERE clause.
SELECT t1.col1, t1.col2
FROM table1 t1
LEFT OUTER JOIN table2 t2 ON t1.col1 = t2.col1
WHERE t2.col1 IS NULL
Giving us
col1 col2
2 b
4 d
There's no subquery and just one condition in the filter. Plus, this can allow the engine's optimizer to build a more efficient query plan.
It's impossible to see a difference in performance when we're only dealing with a couple of rows, but multiply these tables by a few million rows, and a JOIN can turn out to be noticeably faster.
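The walkthrough above can be reproduced with the same test tables. The sketch below runs both forms against SQLite from Python (a stand-in for whichever engine is in use) and also adds one assumed extra row with a NULL in table2.col1, to show a case where the two queries stop being interchangeable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INT, col2 VARCHAR(10));
    INSERT INTO table1 (col1, col2) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    CREATE TABLE table2 (col1 INT, col2 VARCHAR(10));
    INSERT INTO table2 (col1, col2) VALUES (1, 'a'), (3, 'c');
""")

NOT_IN = """
    SELECT t1.col1, t1.col2
    FROM table1 t1
    WHERE t1.col1 NOT IN (SELECT t2.col1 FROM table2 t2)
    ORDER BY t1.col1
"""
ANTI_JOIN = """
    SELECT t1.col1, t1.col2
    FROM table1 t1
    LEFT OUTER JOIN table2 t2 ON t1.col1 = t2.col1
    WHERE t2.col1 IS NULL
    ORDER BY t1.col1
"""

not_in_rows = conn.execute(NOT_IN).fetchall()
anti_join_rows = conn.execute(ANTI_JOIN).fetchall()
print(not_in_rows, anti_join_rows)  # both: [(2, 'b'), (4, 'd')]

# Assumed extra row: a NULL in the subquery column changes the NOT IN result,
# because "x NOT IN (..., NULL)" is never true.
conn.execute("INSERT INTO table2 (col1, col2) VALUES (NULL, 'n')")
not_in_with_null = conn.execute(NOT_IN).fetchall()
anti_join_with_null = conn.execute(ANTI_JOIN).fetchall()
print(not_in_with_null, anti_join_with_null)  # [] and [(2, 'b'), (4, 'd')]
```

So besides performance, the LEFT JOIN ... IS NULL form is also the more predictable one when the joined column is nullable.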

Using SQL to find all possible combinations of column variables

I have a table in SQL which has N columns. Call them "Col1", "Col2", ..., "ColN". I can find out how many unique elements there are in Col1 by the query:
select count(distinct Col1) from mytable
and I can do this, independently for each column. Assuming I have M_1 unique elements in Col1, M_2 in Col2, etc., what single command can I use to find the total number of all possible combinations for my dataset? That is, what single query would calculate (M_1*M_2*...*M_N) for me?
PS: very new to SQL here, so I'm not sure if this matters - but I am using MySQL Workbench on Windows.
SELECT COUNT(*)
FROM (SELECT DISTINCT col1 FROM YourTable) AS t1
CROSS JOIN (SELECT DISTINCT col2 FROM YourTable) AS t2
CROSS JOIN (SELECT DISTINCT col3 FROM YourTable) AS t3
...
CROSS JOIN calculates the cross product between the given tables.
Another way to write it would be:
SELECT COUNT(DISTINCT t1.col1, t2.col2, t3.col3, ...)
FROM YourTable AS t1
CROSS JOIN YourTable AS t2
CROSS JOIN YourTable AS t3
...
But probably the simplest would be:
SELECT COUNT(DISTINCT col1)*COUNT(DISTINCT col2)*COUNT(DISTINCT col3)*...
FROM YourTable
This doesn't require computing any cross-products, so it should be most efficient. If you have indexes on the columns, it won't even have to read the table data, it can all be done using the indexes.
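As a quick check that the product of distinct counts agrees with the CROSS JOIN version, here is a small sketch run against SQLite from Python. The three-column YourTable and its rows are assumed sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (col1 TEXT, col2 TEXT, col3 INT);
    INSERT INTO YourTable VALUES
        ('a', 'x', 1), ('a', 'y', 1), ('b', 'x', 2), ('b', 'x', 1);
""")

# Product of the per-column distinct counts: 2 * 2 * 2.
product = conn.execute("""
    SELECT COUNT(DISTINCT col1) * COUNT(DISTINCT col2) * COUNT(DISTINCT col3)
    FROM YourTable
""").fetchone()[0]

# Same number obtained by materialising every combination with CROSS JOIN.
cross = conn.execute("""
    SELECT COUNT(*)
    FROM (SELECT DISTINCT col1 FROM YourTable) AS t1
    CROSS JOIN (SELECT DISTINCT col2 FROM YourTable) AS t2
    CROSS JOIN (SELECT DISTINCT col3 FROM YourTable) AS t3
""").fetchone()[0]

print(product, cross)  # 8 8
```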

Merging two SQL Server tables conditionally into a third table

Clearly, I am not a SQL guy, so I have to ask for help on the following rather simple task.
I have two SQL Server 2008 tables: t1 and t2 with many identical columns and a key column (entry_ID). T2 has rows that do not exist in t1 but should.
I want to merge those rows from t2 that do not exist in t1 but I also do not want any rows from t2 that already exist in t1. I would like the result set to fill a new t3.
I have looked at many solutions online but can't find the solution to the above scenario.
Thank you.
There are a number of ways to do it; you could use UNION ALL or an OUTER JOIN.
Assuming you are using Entry_ID to find identical records, and Entry_ID is unique within each table, here is an OUTER JOIN method:
This gets you your recordset, T1 and T2 merged:
SELECT
CASE
WHEN T1.Entry_ID IS NULL THEN 'T2'
WHEN T2.Entry_ID IS NULL THEN 'T1'
ELSE 'Both'
END SourceTable,
COALESCE(T1.Entry_ID,T2.Entry_ID) As Entry_ID,
COALESCE(T1.Col1, T2.Col1) As Col1,
COALESCE(T1.Col2, T2.Col2) As Col2,
COALESCE(T1.Col3, T2.Col3) As Col3,
COALESCE(T1.Col4, T2.Col4) As Col4
FROM T1 FULL OUTER JOIN T2
ON T1.Entry_ID = T2.Entry_ID
ORDER BY COALESCE(T1.Entry_ID,T2.Entry_ID)
This inserts it into T3:
INSERT INTO T3 (Entry_ID, Col1, Col2, Col3, Col4)
SELECT
COALESCE(T1.Entry_ID,T2.Entry_ID) As Entry_ID,
COALESCE(T1.Col1, T2.Col1) As Col1,
COALESCE(T1.Col2, T2.Col2) As Col2,
COALESCE(T1.Col3, T2.Col3) As Col3,
COALESCE(T1.Col4, T2.Col4) As Col4
FROM T1 FULL OUTER JOIN T2
ON T1.Entry_ID = T2.Entry_ID
Again, note that Entry_ID needs to be unique within each table, since it is used to match rows between the tables.
Also note the columns from the select line up with the column list in the insert statement - the order of the columns in the physical table doesn't matter, the INSERT and SELECT just have to line up.
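The answer also mentions UNION ALL as an alternative. A minimal sketch of that variant follows, run against SQLite from Python rather than SQL Server: all rows of T1, plus only those T2 rows whose Entry_ID is not already in T1, are inserted into T3. The single data column and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (Entry_ID INTEGER PRIMARY KEY, Col1 TEXT);
    CREATE TABLE T2 (Entry_ID INTEGER PRIMARY KEY, Col1 TEXT);
    CREATE TABLE T3 (Entry_ID INTEGER PRIMARY KEY, Col1 TEXT);
    INSERT INTO T1 VALUES (1, 't1-a'), (2, 't1-b');
    INSERT INTO T2 VALUES (2, 't2-b'), (3, 't2-c');

    -- Everything from T1, then the T2 rows that T1 is missing.
    INSERT INTO T3 (Entry_ID, Col1)
    SELECT Entry_ID, Col1 FROM T1
    UNION ALL
    SELECT T2.Entry_ID, T2.Col1
    FROM T2
    WHERE NOT EXISTS (SELECT 1 FROM T1 WHERE T1.Entry_ID = T2.Entry_ID);
""")

merged = conn.execute("SELECT Entry_ID, Col1 FROM T3 ORDER BY Entry_ID").fetchall()
print(merged)  # [(1, 't1-a'), (2, 't1-b'), (3, 't2-c')]
```

Unlike the COALESCE version, this never mixes column values from both tables in one output row: when an Entry_ID exists in both, the T1 row is taken as a whole.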