mySQL: How to identify duplicates based on four fields

I have read a few posts on SO on how to delete duplicates by comparing a table with another instance of itself. However, I don't want to delete the duplicates; I want to compare them.
E.g., I have the fields "id", "sold_price", "bruksareal", "kommunenr", "Gårdsnr", "Bruksnr", "Festenr", "Seksjonsnr". All fields are int.
I want to identify the rows that are duplicates/identical (the same bruksareal, kommunenr, gårdsnr, bruksnr, festenr and seksjonsnr). If rows are identical, I want to give them a unique reference number.
I believe this will make it easier to identify the rows that I later want to compare on other fields (such as "sold_price", "sold_date", etc.).
I'm open to suggestions if you believe my approach is wrong...

Join the table to itself across all the fields, then use an EXISTS query, such as:
Update Table1
Set reference = UUID()
Where exists (
Select tb1.id
from Table1 tb1 inner join Table1 tb2 on
tb1.Field1 = tb2.Field1 AND
tb1.Field2 = tb2.Field2 AND
etc
Where tb1.Id = Table1.Id
And tb1.Id != tb2.Id
)
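Actually, you can simplify this to just a join. In MySQL the join goes directly after UPDATE, before SET (the UPDATE ... FROM form is SQL Server syntax, and MySQL also refuses to update a table that is read in a subquery of the same statement):
Update Table1
inner join Table1 tb2 on
Table1.Field1 = tb2.Field1 AND
Table1.Field2 = tb2.Field2 AND
etc
Set Table1.reference = UUID()
Where Table1.Id != tb2.Id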
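One caveat with UUID(): it is evaluated for each updated row, so the rows of one duplicate group end up with different references. If the goal is a single reference shared by every row in a group, one option is to derive it from the group itself, for example the smallest id in the group. Below is a minimal sketch of that idea, not the answerer's method. It uses Python's sqlite3 module as a stand-in for MySQL, and the table name `sales`, the two-column key and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table; only two of the key columns are shown.
    CREATE TABLE sales (
        id INTEGER PRIMARY KEY,
        sold_price INTEGER,
        bruksareal INTEGER,
        kommunenr INTEGER,
        reference INTEGER
    );
    INSERT INTO sales (id, sold_price, bruksareal, kommunenr) VALUES
        (1, 100, 50, 301),
        (2, 120, 50, 301),
        (3, 200, 75, 301),
        (4, 130, 50, 301);
""")

# Give every row in a duplicate group the smallest id of that group as its
# reference; rows without a duplicate keep reference = NULL.
conn.execute("""
    UPDATE sales
    SET reference = (
        SELECT MIN(s2.id) FROM sales s2
        WHERE s2.bruksareal = sales.bruksareal
          AND s2.kommunenr = sales.kommunenr
    )
    WHERE EXISTS (
        SELECT 1 FROM sales s3
        WHERE s3.bruksareal = sales.bruksareal
          AND s3.kommunenr = sales.kommunenr
          AND s3.id != sales.id
    )
""")

references = dict(conn.execute("SELECT id, reference FROM sales ORDER BY id"))
print(references)
```

Rows 1, 2 and 4 share the same key values and therefore the same reference, while row 3 is left alone. In MySQL itself the two subqueries would have to be expressed through a join or a derived table, since an UPDATE there cannot select from its own target table in a subquery.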

Depending on where you want to do this, I would go for a hash implementation. On every insert, calculate a hash of the relevant columns (in a trigger, perhaps) and store it in its own column. After that you can find the duplicated rows very easily, because they share a hash value. If you index that column the queries should be pretty fast, but remember that it is still not an int column, so it will get a little slower over time.
After this you can do whatever you please with the duplicated records, without very expensive queries on the database.
Later edit: Make sure that you convert NULL values into some defined value, since some MySQL functions such as MD5 simply return NULL if the operand is NULL. The same goes for CONCAT: if any operand is NULL, it returns NULL (this does not apply to CONCAT_WS, which skips NULL arguments instead).
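To make the NULL handling concrete, here is a small sketch of the hashing idea in Python rather than in a MySQL trigger. The `NULL_MARKER` value, the `row_hash` helper and the sample rows are assumptions for illustration; the point is only that NULLs are mapped to an explicit placeholder and that a separator keeps adjacent values from running together.

```python
import hashlib

# Assumed placeholder for NULL; any string that cannot occur in the data works.
NULL_MARKER = "NULL"

def row_hash(values):
    """Hash a tuple of column values, mapping None to an explicit marker."""
    parts = [NULL_MARKER if v is None else str(v) for v in values]
    # The separator keeps (1, 23) and (12, 3) from producing the same string.
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

# id -> (bruksareal, kommunenr, gardsnr, bruksnr, festenr, seksjonsnr)
rows = {
    1: (50, 301, 12, 4, None, None),
    2: (50, 301, 12, 4, None, None),
    3: (50, 301, 12, 4, 0, None),
}

# Group row ids by hash; any group with more than one id is a set of duplicates.
groups = {}
for row_id, fields in rows.items():
    groups.setdefault(row_hash(fields), []).append(row_id)

duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)
```

Row 3 differs from the other two only in having 0 where they have NULL, and the explicit marker is what keeps it out of their group.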

Related

MySQL: The fastest way to split a table based on a condition

I have two tables:
1) a list of all parameter ids, with the set of parameters each parameter id belongs to
2) data that includes some of the parameter ids, plus additional data such as timestamps and values.
I'm designing a data-warehouse-like system. But instead of a summary table that stores precalculated values (which doesn't really make sense in my case), I am trying to decrease the amount of data the different reporting scripts have to look through to get their results.
I want to transfer every row in table2 into a table for its set of parameters, so that in the end I have "summary tables", one for each set of parameters. Which parameter belongs to which set is stored in table1.
Is there a faster way than to loop over every entry from table1, get @param_id = ... and @tablename = ... and do an INSERT INTO @tablename SELECT * FROM table2 WHERE parameter_id = @param_id? I read that a "set-based approach" would be faster (and better) than the procedural approach, but I don't quite get how that would work in my case.
Any help is appreciated!

Don't do it. Your 3rd table would be redundant with the original two tables. Instead do a JOIN between the two tables whenever you need pieces from both.
SELECT t1.foo, t2.bar
FROM t1
JOIN t2 ON t1.x = t2.x
WHERE ...;

How to write a mysql join query using IN()

Can you help me write my MySQL join query? Here is what I have so far:
SELECT * FROM table1 LEFT JOIN table2 ON table2.id IN
(table1.comma_separated_ids) WHERE table1.id = [some id]
where table1.comma_separated_ids is a VARCHAR column containing a list of comma separated IDs (integers) that relate to IDs in table2.
The above query returns only one row, when it should return one row for every id in table1.comma_separated_ids that has a matching row in table2.
What I'm actually trying to do is a little more complex but it's hard to explain so I'm starting here. Any help?

In MySQL, you cannot pass a comma-separated list as a single argument to IN. It is treated as one string, not as a list of values.
You can use find_in_set():
SELECT *
FROM table1 LEFT JOIN
table2
ON find_in_set(table2.id, table1.comma_separated_ids)
WHERE table1.id = XXX;
However, the bigger issue is that you are storing ids in a comma-separated list. These should be in a separate junction table, with one row per id. It is bad enough to store lists in strings; storing integer ids is even worse.
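As a rough illustration of the junction-table layout, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. The table name `table1_table2`, its columns and the sample data are hypothetical; only table1 and table2 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, label TEXT);
    -- Hypothetical junction table: one row per (table1 row, table2 id) pair,
    -- replacing the comma_separated_ids column.
    CREATE TABLE table1_table2 (
        table1_id INTEGER NOT NULL REFERENCES table1(id),
        table2_id INTEGER NOT NULL REFERENCES table2(id),
        PRIMARY KEY (table1_id, table2_id)
    );
    INSERT INTO table1 VALUES (1, 'first'), (2, 'second');
    INSERT INTO table2 VALUES (10, 'a'), (20, 'b'), (30, 'c');
    INSERT INTO table1_table2 VALUES (1, 10), (1, 30), (2, 20);
""")

# The list lookup becomes an ordinary equality join that an index can serve.
labels = [row[0] for row in conn.execute("""
    SELECT table2.label
    FROM table1
    LEFT JOIN table1_table2 AS link ON link.table1_id = table1.id
    LEFT JOIN table2 ON table2.id = link.table2_id
    WHERE table1.id = ?
    ORDER BY table2.id
""", (1,))]
print(labels)
```

With this layout the database can also enforce that every stored id really exists in table2, which a string column cannot do.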

Is this query well written? I am fairly new at this and am wondering if there is a better way to write it

UPDATE table1
INNER JOIN table2
ON table1.var1=table2.var1
SET table1.var2=table2.var2
My table has about 975,000 rows in it and I know this will take a while no matter what. Is there any better way to write this?
Thanks!

If the standard case is that table1.var2 is already equal to table2.var2, you may end up with an inflated write count, as the database may still update all those rows with no functional change in value.
You may get better performance by updating only those rows whose value differs from the one you want.
UPDATE table1
INNER JOIN table2
ON table1.var1=table2.var1
SET table1.var2=table2.var2
WHERE (table1.var2 is null and table2.var2 is not null OR
table1.var2 is not null and table2.var2 is null OR
table1.var2 <> table2.var2)
Edit: Never mind... MySQL only writes a row when a value actually changes, unlike some other RDBMSs (MS SQL, for example).

Your query:
UPDATE table1 INNER JOIN
table2
ON table1.var1 = table2.var1
SET table1.var2 = table2.var2;
A priori, this looks fine. The major issue that I can see would be a 1-many relationship from table1 to table2. In that case, multiple rows from table2 might match a given row from table1. MySQL assigns an arbitrary value in such a case.
You could fix this by choosing one value, such as the min():
UPDATE table1 INNER JOIN
(select var1, min(var2) as var2
from table2
group by var1
) t2
ON table1.var1 = t2.var1
SET table1.var2 = t2.var2;
For performance reasons, you should have an index on table2(var1, var2). By including both columns in the index, the query will be able to use the index only and not have to fetch rows directly from the table.
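To see what collapsing to one value per key does, here is a small sketch with Python's sqlite3 module standing in for MySQL. The sample rows and the index name are made up, and the update is written with correlated subqueries instead of MySQL's UPDATE ... JOIN so that it runs on SQLite; the effect on the data is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (var1 INTEGER PRIMARY KEY, var2 INTEGER);
    CREATE TABLE table2 (var1 INTEGER, var2 INTEGER);
    CREATE INDEX idx_table2_var1_var2 ON table2 (var1, var2);
    INSERT INTO table1 VALUES (1, NULL), (2, NULL), (3, 99);
    -- var1 = 1 matches two rows, so a plain join would be ambiguous.
    INSERT INTO table2 VALUES (1, 7), (1, 5), (2, 8);
""")

# Pick MIN(var2) per var1 so each table1 row gets exactly one, repeatable value.
# Rows with no match in table2 (var1 = 3) are left untouched.
conn.execute("""
    UPDATE table1
    SET var2 = (SELECT MIN(t2.var2) FROM table2 t2 WHERE t2.var1 = table1.var1)
    WHERE EXISTS (SELECT 1 FROM table2 t2 WHERE t2.var1 = table1.var1)
""")

result = conn.execute("SELECT var1, var2 FROM table1 ORDER BY var1").fetchall()
print(result)
```

Whether MIN() is the right choice depends on the data; the point is only that the value written no longer depends on which matching row the engine happens to visit.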

match multiple columns to multiple-row subquery?

I'm very much still learning MySQL (I am still really only comfortable with basic queries, count, order by, etc.). It is very likely that this question has been asked before; however, either I don't know what to search for, or I'm too much of a novice to understand the answers:
I have two tables:
tb1 (a,b,path)
tb2 (a,b,value)
I would like to make a query that returns "path" for each row in tb1 whose a,b matches a different query on tb2. In bad mysql, it would be something like:
select
path
from tb1
where
a=(select a from tb2 where value < v1)
and
b=(select b from tb2 where value < v1);
However, this doesn't work, as the subqueries return multiple values. Note that replacing = with IN is not good enough, as that would also be true for combinations of a, b values that are not returned by select a,b from tb2 where value < v1.
Basically, I have identified an interesting area in (a,b)-space based on tb2, and would like to study the behavior of tb1 within that area (if that makes it any clearer).
thank you :)

This is a job for an INNER JOIN on both a and b:
SELECT
path
FROM
tb1
INNER JOIN tb2 ON tb1.a = tb2.a AND tb1.b = tb2.b
/* add your condition to the WHERE clause */
WHERE tb2.value < v1
The use cases for subqueries in the SELECT list or WHERE clause can very often be handled with some type of JOIN instead. The join will frequently be faster than the subquery, because a SELECT or WHERE subquery may need to be executed for each row returned, rather than only once.
Beyond the MySQL documentation on JOINs, I would also recommend Jeff Atwood's "A Visual Explanation of SQL Joins".
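The difference between two independent IN tests and a join on both columns is easy to see on a few rows. This is a sketch using Python's sqlite3 module in place of MySQL; the sample data and the threshold `v1 = 10` are assumptions chosen so that the two queries disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb1 (a INTEGER, b INTEGER, path TEXT);
    CREATE TABLE tb2 (a INTEGER, b INTEGER, value INTEGER);
    INSERT INTO tb1 VALUES (1, 1, 'p11'), (1, 2, 'p12'), (2, 1, 'p21'), (2, 2, 'p22');
    -- Only the pairs (1, 1) and (2, 2) fall below the threshold used below.
    INSERT INTO tb2 VALUES (1, 1, 3), (2, 2, 4), (1, 2, 50);
""")
v1 = 10

# Two independent IN tests accept any a and any b seen in tb2, in any combination.
loose = [r[0] for r in conn.execute("""
    SELECT path FROM tb1
    WHERE a IN (SELECT a FROM tb2 WHERE value < ?)
      AND b IN (SELECT b FROM tb2 WHERE value < ?)
    ORDER BY path
""", (v1, v1))]

# Joining on both columns keeps a and b tied to the same tb2 row.
paired = [r[0] for r in conn.execute("""
    SELECT tb1.path FROM tb1
    INNER JOIN tb2 ON tb1.a = tb2.a AND tb1.b = tb2.b
    WHERE tb2.value < ?
    ORDER BY tb1.path
""", (v1,))]

print(loose)
print(paired)
```

The IN version lets through the mixed pairs (1, 2) and (2, 1), which is exactly the problem described in the question; the join returns only the paths for (1, 1) and (2, 2).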

INNER JOIN will do the trick.
You just need two ON criteria in order to match both the a and b values, like so:
SELECT path
FROM tb1
INNER JOIN tb2 ON tb1.a = tb2.a AND tb1.b = tb2.b
WHERE tb2.value < v1

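You can limit each subquery to a single row this way, although it then matches only one (a, b) pair and, without an ORDER BY, the two subqueries are not guaranteed to pick the same tb2 row: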
select
path
from tb1
where
a=(select a from tb2 where value < v1 LIMIT 1)
and
b=(select b from tb2 where value < v1 LIMIT 1);

Use ORDER BY 'x' with a JOIN, but keep rows that don't have a value for 'x'

This is a simplified version of a relatively complex problem that my colleagues and I can't quite get our heads around.
Consider two tables, table_a and table_b. In our CMS, table_a holds metadata for all the data stored in the database, and table_b holds some more specific information: for simplicity's sake, a title and a date column.
At the moment our query looks like:
SELECT *
FROM `table_a` LEFT OUTER JOIN `table_b` ON (table_a.id = table_b.id)
WHERE table_a.col = 'value'
ORDER BY table_b.date ASC
LIMIT 0,20
This degrades badly when table_a has a large number of rows. If the JOIN is changed to a RIGHT OUTER JOIN (which triggers MySQL to use the INDEX set on table_b.date), the query is far quicker, but it doesn't produce the same results (because if table_b.date doesn't have a value, the row is ignored).
This becomes an issue in our CMS because if the user sorts on the date column, any rows that don't have a date set yet disappear from the interface. That makes for a confusing UI and makes it difficult to add dates to the rows that are missing them.
Is there a solution that will:
1) Use table_b.date's INDEX so that the query will scale better
2) Somehow retain those rows in table_b that don't have a date set so that a user can enter the data

I'm going to second ArtoAle's comment. Since the ORDER BY applies to a NULL value in the outer join for rows missing from table_b, those rows will be out of order anyway.
The simulated outer join is the ugly part, so let's look at that first. MySQL (before version 8.0.31) doesn't have EXCEPT, so you need to write the query in terms of EXISTS. The subquery has to refer back to the outer table_a row, otherwise it tests whether any match exists at all:
SELECT table_a.col1, table_a.col2, table_a.col3, ... NULL as table_b_col1, NULL as ...
FROM
table_a
WHERE
NOT EXISTS (SELECT 1 FROM table_b WHERE table_b.id = table_a.id);
This should be UNION ALLed with the original query rewritten as an inner join. The UNION ALL is needed to preserve the original order.
This sort of query is probably going to be dog-slow no matter what you do, because there won't be an index that readily supports a "foreign key not present" sort of query. It basically boils down to an index scan on table_a.id with a lookup (or maybe a parallel scan) for the corresponding row in table_b.id.
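Putting the two halves together might look roughly like the following sketch, which runs on Python's sqlite3 module rather than MySQL. The sample rows are made up, and the extra `missing` column is an assumption added here so that the ordering of the two halves is explicit instead of relying on the order in which UNION ALL happens to return them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, col TEXT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, title TEXT, date TEXT);
    CREATE INDEX idx_table_b_date ON table_b (date);
    INSERT INTO table_a VALUES (1, 'value'), (2, 'value'), (3, 'value'), (4, 'other');
    INSERT INTO table_b VALUES (1, 'late', '2011-05-01'), (3, 'early', '2011-01-15');
""")

rows = conn.execute("""
    -- First half: rows that have a table_b partner (the inner join).
    SELECT table_a.id AS id, table_b.date AS date, 0 AS missing
    FROM table_a INNER JOIN table_b ON table_a.id = table_b.id
    WHERE table_a.col = 'value'
    UNION ALL
    -- Second half: rows with no table_b partner, padded with NULL.
    SELECT table_a.id, NULL, 1
    FROM table_a
    WHERE table_a.col = 'value'
      AND NOT EXISTS (SELECT 1 FROM table_b WHERE table_b.id = table_a.id)
    ORDER BY missing, date
    LIMIT 20
""").fetchall()
print(rows)
```

Here the dateless row comes last; the original LEFT JOIN with ORDER BY ... ASC would list it first in MySQL, so flip the `missing` flag if that placement matters. Whether MySQL actually uses the index on table_b.date for the first half is something to confirm with EXPLAIN on real data.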

So we ended up implementing a different solution. While the results were not as good as using an INDEX, it still provided a nice speed boost of around 25%.
We removed the JOIN and instead used an ORDER BY subquery:
SELECT *
FROM `table_a`
WHERE table_a.col = 'value'
ORDER BY (
SELECT date
FROM table_b
WHERE id = table_a.id
) ASC
LIMIT 0,20
