Finding "duplicate" rows that differ in one column - mysql

I have a table like the following in MySQL 5.1:
+--------------+----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+----------------+------+-----+---------+----------------+
| log_id | int(11) | NO | PRI | NULL | auto_increment |
| date | datetime | NO | MUL | NULL | |
| date_millis | int(3) | NO | | NULL | |
| eib_address | varchar(20) | NO | | NULL | |
| ip_address | varchar(15) | NO | | NULL | |
| value | decimal(20,10) | NO | MUL | NULL | |
| application | tinyint(4) | NO | | NULL | |
| phys_address | varchar(20) | NO | | NULL | |
| orig_log_id | bigint(20) | NO | | NULL | |
+--------------+----------------+------+-----+---------+----------------+
In this table, log_id and orig_log_id are always unique. It is possible that two rows may have duplicate values for any of the other fields, though. Ignoring the *log_id fields, our problem is that two rows may be identical in all other columns, but have differing values for value. I am trying to figure out the correct SQL query to identify when two (or more) rows have identical values for date, date_millis and eib_address, but different values for value, log_id and orig_log_id. So far, I've been able to come up with a query that accomplishes the first clause in my previous sentence:
SELECT main.*
FROM sensors_log main
INNER JOIN
(SELECT date, date_millis, eib_address
FROM sensors_log
GROUP BY date, date_millis, eib_address
HAVING count(eib_address) > 1) dupes
ON main.date = dupes.date
AND main.date_millis = dupes.date_millis
AND main.eib_address = dupes.eib_address;
However, I can't seem to figure out when value differs. I at least know that just throwing AND main.value != dupes.value into the ON clause doesn't do it!

I think it's a bit simpler than you're trying to make it. Try this:
SELECT *
FROM SENSORS_LOG s1
INNER JOIN SENSORS_LOG s2
ON (s2.DATE = s1.DATE AND
s2.DATE_MILLIS = s1.DATE_MILLIS AND
s2.EIB_ADDRESS = s1.EIB_ADDRESS)
WHERE s1.VALUE <> s2.VALUE AND
s1.LOG_ID <> s2.LOG_ID AND
s1.ORIG_LOG_ID <> s2.ORIG_LOG_ID;
Share and enjoy.

Maybe I mistook the problem, but can't you just perform a COUNT like this?
SELECT date, date_millis, eib_address, count(*) as nr_dupes
FROM sensors_log
GROUP BY date, date_millis, eib_address
HAVING count(*) > 1
or
SELECT date, date_millis, eib_address,
group_concat(value), group_concat(log_id), group_concat(orig_log_id)
FROM sensors_log
GROUP BY date, date_millis, eib_address
HAVING count(*) > 1
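Building on the queries above, one way to keep only the groups whose value actually differs is to count distinct values in the HAVING clause. The following is a minimal sketch rather than a tested MySQL solution: it uses Python's built-in sqlite3 module as a stand-in for MySQL, a cut-down sensors_log table, and made-up sample rows.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; table is trimmed to the relevant columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensors_log (
        log_id      INTEGER PRIMARY KEY,
        date        TEXT NOT NULL,
        date_millis INTEGER NOT NULL,
        eib_address TEXT NOT NULL,
        value       REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO sensors_log (date, date_millis, eib_address, value) VALUES (?, ?, ?, ?)",
    [
        ("2010-01-01 00:00:00", 100, "1/1/1", 20.5),  # same key, different value
        ("2010-01-01 00:00:00", 100, "1/1/1", 21.0),
        ("2010-01-01 00:00:00", 200, "1/1/2", 7.0),   # same key, same value
        ("2010-01-01 00:00:00", 200, "1/1/2", 7.0),
        ("2010-01-01 00:00:01", 300, "1/1/3", 3.0),   # unique row
    ],
)

# COUNT(DISTINCT value) > 1 keeps only keys that carry at least two different values.
conflicts = conn.execute("""
    SELECT main.log_id, main.eib_address, main.value
    FROM sensors_log main
    INNER JOIN (
        SELECT date, date_millis, eib_address
        FROM sensors_log
        GROUP BY date, date_millis, eib_address
        HAVING COUNT(DISTINCT value) > 1
    ) dupes
      ON main.date = dupes.date
     AND main.date_millis = dupes.date_millis
     AND main.eib_address = dupes.eib_address
    ORDER BY main.log_id
""").fetchall()
print(conflicts)
```

Only the two rows for address 1/1/1 come back; the pair with identical values and the unique row are filtered out. COUNT(DISTINCT ...) is also available in MySQL, so the inner query should carry over unchanged.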

Related

MYSQL (MariaDB) - Invalid use of group function

I have two tables called addresses and house_sales
addresses
+-------------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| house_number_name | varchar(150) | NO | | NULL | |
| address_line1 | varchar(150) | NO | MUL | NULL | |
| address_line2 | varchar(150) | YES | | NULL | |
| address_line3 | varchar(150) | YES | MUL | NULL | |
| town_city | varchar(150) | NO | MUL | NULL | |
| district | varchar(150) | YES | MUL | NULL | |
| county | varchar(150) | YES | MUL | NULL | |
| post_code | varchar(8) | NO | MUL | NULL | |
| updated_at | datetime | NO | | NULL | |
| created_at | datetime | NO | | NULL | |
+-------------------+------------------+------+-----+---------+----------------+
house_sales
+---------------+------------------------------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------------------------------------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| address_id | int(11) unsigned | NO | MUL | NULL | |
| price | int(11) unsigned | NO | MUL | NULL | |
| date | datetime | NO | MUL | NULL | |
| updated_at | datetime | NO | | NULL | |
| created_at | datetime | NO | | NULL | |
+---------------+------------------------------------------------------------+------+-----+---------+----------------+
I'm trying to select all the addresses grouped by address_line1 and then get the average price for that street. The query works, but I only want to select streets where there is more than one house on the same street. However, when I add AND count(*) > 1, I get the error "Invalid use of group function". Below is the query:
SELECT count(*) as total_sales, avg(price) as average_price, `address_line1`, `town_city`
FROM `house_sales` `hs`
LEFT JOIN `addresses` `a` ON `hs`.`address_id` = `a`.`id`
WHERE `town_city` = 'London'
AND count(*) > 1
GROUP BY `address_line1`
ORDER BY `average_price` desc
I'm not sure why I'm getting this error. I've tried a subquery so I can use HAVING but haven't got this to work. Any help or pointers would be appreciated.
You need a having clause to filter on the aggregate expression:
SELECT count(*) as total_sales, avg(price) as average_price, `address_line1`, `town_city`
FROM `house_sales` `hs`
LEFT JOIN `addresses` `a` ON `hs`.`address_id` = `a`.`id`
WHERE `town_city` = 'London'
GROUP BY `address_line1`, `town_city`
HAVING count(*) > 1
ORDER BY `average_price` desc
MySQL extends the SQL standard by allowing the use of aliases in the having clause, so you can also do:
having total_sales > 1
Side notes:
as commented by jarlh, it is a good practice to qualify (prefix) all column names with the table they belong to
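it is also a good practice to put all non-aggregated columns in the group by clause (I added town_city, which was missing in your original query) - newer versions of MySQL enable ONLY_FULL_GROUP_BY by default and generally reject queries that select non-aggregated columns missing from the group by clause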
quoting all identifiers is usually not necessary (unless they contain special characters)
There are two ways to go here. One would be to add town_city to the GROUP BY list:
SELECT
address_line1,
town_city,
COUNT(*) AS total_sales,
AVG(price) AS average_price
FROM house_sales hs
LEFT JOIN addresses a ON hs.address_id = a.id
WHERE town_city = 'London'
GROUP BY address_line1, town_city
HAVING COUNT(*) > 1
ORDER BY average_price DESC;
The other would be to just keep your current query but remove town_city from the select list, since you are restricting to just London anyway.
SELECT
address_line1,
COUNT(*) AS total_sales,
AVG(price) AS average_price
FROM house_sales hs
LEFT JOIN addresses a ON hs.address_id = a.id
WHERE town_city = 'London'
GROUP BY address_line1
HAVING COUNT(*) > 1
ORDER BY average_price DESC;
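To see the difference between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation), here is a small sketch. It runs against SQLite through Python's sqlite3 module rather than MariaDB, uses trimmed versions of the two tables, and the street names and prices are invented for illustration.

```python
import sqlite3

# SQLite stands in for MariaDB; tables are reduced to the columns the query needs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, address_line1 TEXT, town_city TEXT);
    CREATE TABLE house_sales (id INTEGER PRIMARY KEY, address_id INTEGER, price INTEGER);
    INSERT INTO addresses VALUES (1, 'Baker Street', 'London'),
                                 (2, 'Baker Street', 'London'),
                                 (3, 'Abbey Road',   'London'),
                                 (4, 'Baker Street', 'Leeds');
    INSERT INTO house_sales (address_id, price)
    VALUES (1, 500000), (2, 700000), (3, 900000), (4, 100000);
""")

streets = conn.execute("""
    SELECT a.address_line1, COUNT(*) AS total_sales, AVG(hs.price) AS average_price
    FROM house_sales hs
    JOIN addresses a ON hs.address_id = a.id
    WHERE a.town_city = 'London'   -- row filter: applied before grouping
    GROUP BY a.address_line1
    HAVING COUNT(*) > 1            -- group filter: applied after aggregation
    ORDER BY average_price DESC
""").fetchall()
print(streets)
```

With this sample data only Baker Street in London survives: the Leeds sale is removed by WHERE, and Abbey Road is removed by HAVING because it has a single sale.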

How do I select rows from my MySql table that occurred according to a max date?

I’m using MySQL 5.5.37. I have a table with the following structure …
mysql> desc event;
+------------------------+-------------+------+-----+-------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+-------------+------+-----+-------------------+-------+
| ID | varchar(32) | NO | PRI | NULL | |
| EVENT_ID | varchar(32) | NO | UNI | NULL | |
| ORGANIZATION_ID | varchar(32) | NO | MUL | NULL | |
| DATE_PROCESSED | timestamp | NO | | CURRENT_TIMESTAMP | |
| EVENT_DATA | longtext | YES | | NULL | |
+------------------------+-------------+------+-----+-------------------+-------+
I would like to select the row corresponding to the last event processed for each organization. I only need the organization_id and the event_id for each organization. However, I’m not sure how to build this query. I have this so far
mysql> select organization_id, max(date_processed) from event group by organization_id
But I’m not sure how to use this to deduce the event_id. Any help is greatly appreciated.
You can join against a subquery that finds the latest date_processed per organization:
Select e.event_id, e.organization_id,
xx.maxdate
from event e join (
select organization_id,
max(date_processed) as maxdate
from event
group by organization_id) xx on e.organization_id = xx.organization_id
and e.date_processed = xx.maxdate;
SELECT organization_id, event_id
FROM `event`
WHERE (organization_id, date_processed) IN (
SELECT organization_id, max(date_processed)
FROM `event`
GROUP BY organization_id
)
;
Be wary of the event table name: EVENT is a keyword in more recent versions of MySQL.
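A quick way to check the "latest row per group" logic is to run it on a few rows. This is only a sketch: it uses SQLite via Python's sqlite3 module instead of MySQL 5.5, a trimmed event table, and invented IDs and timestamps. The event IDs are deliberately not in date order, so picking the highest event_id would not give the latest event.

```python
import sqlite3

# SQLite stands in for MySQL; timestamps are ISO strings, so MAX() orders them correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        id              TEXT PRIMARY KEY,
        event_id        TEXT UNIQUE NOT NULL,
        organization_id TEXT NOT NULL,
        date_processed  TEXT NOT NULL
    );
    INSERT INTO event VALUES
        ('1', 'evt-z', 'org-a', '2014-01-01 10:00:00'),
        ('2', 'evt-b', 'org-a', '2014-03-01 10:00:00'),
        ('3', 'evt-c', 'org-b', '2014-02-01 10:00:00');
""")

# Find the latest date per organization, then join back to fetch that row's event_id.
latest = conn.execute("""
    SELECT e.organization_id, e.event_id
    FROM event e
    JOIN (SELECT organization_id, MAX(date_processed) AS maxdate
          FROM event
          GROUP BY organization_id) xx
      ON e.organization_id = xx.organization_id
     AND e.date_processed  = xx.maxdate
    ORDER BY e.organization_id
""").fetchall()
print(latest)
```

For org-a this returns evt-b (the later event), not evt-z. If two events of the same organization share the same maximum date_processed, both rows are returned, so an extra tie-breaker would be needed in that case.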

HQL/MySQL for listing distincts and duplicates

I have a list of 20,000+ objects. These objects have an FK to a table called title. Two tipps are considered duplicates if they are linked to the same title and they belong to the same package (tipp_pkg_fk, this is a parameter).
I need a list of all objects, with the duplicates listed together. For example:
tippA.title.name = "One"
tippB.title.name = "Two"
tippC.title.name = "Two"
Ideally from the above I will get a list result like this: [[tippA],[tippB,tippC]]
I am not sure how to do this. I have made an attempt (first in MySQL so I can test it, then I'll change it to HQL):
select tipp.tipp_id, 1 as sortOrder
from (select distinct a.tipp_id as id
from title_instance_package_platform a, title_instance_package_platform b
where a.tipp_pkg_fk= 1 and b.tipp_pkg_fk = 1 and a.tipp_ti_fk = b.tipp_ti_fk) duplicates,
title_instance_package_platform tipp
where tipp.tipp_id != duplicates.id
union all
select duplicates.id, 2 as sortOrder
from (select distinct a.tipp_id as id
from title_instance_package_platform a , title_instance_package_platform b
where a.tipp_pkg_fk = 1 and b.tipp_pkg_fk=1 and a.tipp_ti_fk = b.tipp_ti_fk) duplicates
order by sortOrder, id;
This ran for 330 seconds, then MySQL Workbench showed the "fetching" message and my computer started to struggle. The idea is that I first select all the IDs that are not duplicates, then select all the IDs that are duplicates, and then merge and order them so that they appear together. I am looking for the most efficient way to do this, as I will be executing this query several times during an overnight job.
For my TIPP model, the following are part of the mapping:
static mapping = {
pkg column:'tipp_pkg_fk', index: 'tipp_idx'
title column:'tipp_ti_fk', index: 'tipp_idx'
}
+-----------------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------------------+--------------+------+-----+---------+----------------+
| tipp_id | bigint(20) | NO | PRI | NULL | auto_increment |
| tipp_version | bigint(20) | NO | | NULL | |
| tipp_pkg_fk | bigint(20) | NO | MUL | NULL | |
| tipp_plat_fk | bigint(20) | NO | MUL | NULL | |
| tipp_ti_fk | bigint(20) | NO | MUL | NULL | |
| date_created | datetime | NO | | NULL | |
| last_updated | datetime | NO | | NULL | |
+-----------------------------+--------------+------+-----+---------+----------------+
+-----------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+-----+---------+----------------+
| ti_id | bigint(20) | NO | PRI | NULL | auto_increment |
| ti_version | bigint(20) | NO | | NULL | |
| date_created | datetime | NO | | NULL | |
| ti_imp_id | varchar(255) | NO | MUL | NULL | |
| last_updated | datetime | NO | | NULL | |
| ti_title | varchar(1024) | YES | | NULL | |
| ti_key_title | varchar(1024) | YES | | NULL | |
| ti_norm_title | varchar(1024) | YES | | NULL | |
| sort_title | varchar(1024) | YES | | NULL | |
+-----------------+---------------+------+-----+---------+----------------+
Update
After some changes it is working:
select tipp.tipp_id as id, 1 as sortOrder
from
title_instance_package_platform tipp
where tipp.tipp_id not in (select distinct a.tipp_id as id
from title_instance_package_platform a, title_instance_package_platform b
where a.tipp_pkg_fk= 1 and b.tipp_pkg_fk = 1 and a.tipp_ti_fk = b.tipp_ti_fk)
union all
select duplicates.id as id, 2 as sortOrder
from (select distinct a.tipp_id as id
from title_instance_package_platform a , title_instance_package_platform b
where a.tipp_pkg_fk = 1 and b.tipp_pkg_fk=1 and a.tipp_ti_fk = b.tipp_ti_fk) duplicates
order by sortOrder, id;
I still haven't got the duplicates grouped together though, instead everything comes as a list, which means I still need to group them.
Can you do your select from the other side?
Select all titles and packages and list all tipps for these, keeping only those where a tipp exists (count > 0), and bundle these together to get the array you showed.
Seems like you could compute both the dups and the non-dups at the same time. Something like
SELECT ( a.tipp_ti_fk = b.tipp_ti_fk ) AS sortOrder,
a.tipp_id as id
from title_instance_package_platform a ,
title_instance_package_platform b
where a.tipp_pkg_fk = 1
and b.tipp_pkg_fk = 1
You might need a DISTINCT.
This composite index would help:
INDEX(tipp_pkg_fk, tipp_ti_fk, tipp_id)
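Another option, not taken from the answers above, is to skip the self-join entirely and let GROUP BY collect the tipps per title, which gives the nested list directly. The sketch below is an assumption-laden illustration: SQLite via Python's sqlite3 module replaces MySQL/HQL, the table is cut down to three columns, and the sample rows mirror the tippA/tippB/tippC example.

```python
import sqlite3

# SQLite stands in for MySQL; only the columns relevant to grouping are kept.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE title_instance_package_platform (
        tipp_id     INTEGER PRIMARY KEY,
        tipp_pkg_fk INTEGER NOT NULL,
        tipp_ti_fk  INTEGER NOT NULL
    );
    CREATE INDEX tipp_idx
        ON title_instance_package_platform (tipp_pkg_fk, tipp_ti_fk, tipp_id);
    INSERT INTO title_instance_package_platform VALUES
        (1, 1, 10),  -- tippA, title "One"
        (2, 1, 20),  -- tippB, title "Two"
        (3, 1, 20),  -- tippC, title "Two"
        (4, 2, 20);  -- different package, ignored
""")

# One row per title; singles (n = 1) sort before duplicate groups (n > 1).
rows = conn.execute("""
    SELECT tipp_ti_fk, COUNT(*) AS n, GROUP_CONCAT(tipp_id) AS ids
    FROM title_instance_package_platform
    WHERE tipp_pkg_fk = ?
    GROUP BY tipp_ti_fk
    ORDER BY n, tipp_ti_fk
""", (1,)).fetchall()

# Split the concatenated IDs back into lists; sort because concat order is not guaranteed.
groups = [sorted(int(i) for i in ids.split(",")) for _, _, ids in rows]
print(groups)
```

This is a single pass over the package's rows instead of a join of the table with itself. In MySQL, GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so very large duplicate groups would need that limit raised or the grouping done in application code.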

Calculate average of values between 2 columns sql

I have a table called validation_errors that looks like this:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| link | varchar(200) | NO | MUL | NULL | |
| message | varchar(500) | NO | | | |
| explanation | mediumtext | NO | | NULL | |
| type | varchar(50) | NO | | | |
| subtype | varchar(50) | NO | | | |
| message_id | varchar(50) | NO | | | |
+-------------+--------------+------+-----+---------+----------------+
Link table looks like this:
+-----------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| link | varchar(200) | NO | PRI | NULL | |
| visited | tinyint(1) | NO | | 0 | |
| validated | tinyint(1) | NO | | 0 | |
+-----------+--------------+------+-----+---------+-------+
I wish to calculate the average number of validation errors per page per topdomain.
I have a query that can fetch the amount of pages per topdomain:
SELECT substr(link, - instr(reverse(link), '.')) as domain , count(*) as count
FROM links
GROUP BY domain
ORDER BY count desc
limit 30;
And I have an SQL query that can fetch the number of validation errors per top domain:
SELECT substr(link, - instr(reverse(link), '.')) as domain ,count(*) as count
FROM validation_errors
GROUP BY domain
ORDER BY count desc
limit 30;
What I now need to do is combine them into one query and divide the results of one column by the other, and I can't figure out how to do it.
Any help would be greatly appreciated.
First, use substring_index(), rather than your construct. Here is the query to join them together:
select domain, sum(numviews) as numviews, sum(numerrors) as numerrors,
sum(numerrors) / nullif(sum(numviews), 0) as error_rate
from ((SELECT substring_index(link, '.', -1) as domain , count(*) as numviews, 0 as numerrors
FROM links
GROUP BY domain
) UNION ALL
(SELECT substring_index(link, '.', -1) as domain , 0, count(*)
FROM validation_errors
GROUP BY domain
)
) d
GROUP BY domain;
With both variables, I don't know which 30 you want to choose, so I haven't included an order by.
Note that this doesn't use a join, it uses union all with aggregation. This ensures that you will get all domains, even those with no views and those with no errors.
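The UNION ALL plus aggregation pattern can be tried out on a handful of rows. This is a rough sketch under several assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the Python function substring_index is a hypothetical stand-in registered because SQLite has no SUBSTRING_INDEX, and the links are invented.

```python
import sqlite3

def substring_index(text, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX, which SQLite lacks."""
    parts = text.split(delim)
    if count > 0:
        return delim.join(parts[:count])
    if count < 0:
        return delim.join(parts[count:])
    return ""

conn = sqlite3.connect(":memory:")
conn.create_function("substring_index", 3, substring_index)
conn.executescript("""
    CREATE TABLE links (link TEXT PRIMARY KEY);
    CREATE TABLE validation_errors (id INTEGER PRIMARY KEY, link TEXT NOT NULL);
    INSERT INTO links VALUES ('a.com'), ('b.com'), ('c.org'), ('e.net');
    INSERT INTO validation_errors (link) VALUES
        ('a.com'), ('a.com'), ('a.com'), ('c.org'), ('d.io'), ('d.io');
""")

# Each branch contributes one of the two counts and a zero for the other;
# the outer query sums them per domain. 1.0 * avoids SQLite's integer division.
rows = conn.execute("""
    SELECT domain,
           SUM(numviews)  AS numviews,
           SUM(numerrors) AS numerrors,
           1.0 * SUM(numerrors) / NULLIF(SUM(numviews), 0) AS error_rate
    FROM (SELECT substring_index(link, '.', -1) AS domain,
                 COUNT(*) AS numviews, 0 AS numerrors
          FROM links
          GROUP BY domain
          UNION ALL
          SELECT substring_index(link, '.', -1) AS domain,
                 0 AS numviews, COUNT(*) AS numerrors
          FROM validation_errors
          GROUP BY domain) d
    GROUP BY domain
    ORDER BY domain
""").fetchall()
for row in rows:
    print(row)
```

The io domain has errors but no pages, so NULLIF turns the zero divisor into NULL and the rate comes back as NULL instead of raising a division error; the net domain has pages but no errors and gets a rate of 0.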

MySQL query to match date and null between two tables

I have two MySQL-tables like this:
desc students;
+---------------------------+---------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+---------------------------+---------------+------+-----+---------+
| student_id | int(11) | NO | PRI | NULL |
| student_firstname | varchar(255) | NO | | NULL |
| student_lastname | varchar(255) | NO | | NULL |
+---------------------------+---------------+------+-----+---------+
desc studentabsence;
+---------------------------+-------------+------+-----+---------+
| Field | Type | Null | Key | Default |
+---------------------------+-------------+------+-----+---------+
| student_absence_id | int(11) | NO | PRI | NULL |
| student_id | int(11) | YES | | NULL |
| student_absence_startdate | date | YES | | NULL |
| student_absence_enddate | date | YES | | NULL |
| student_absence_type | varchar(45) | YES | | NULL |
+---------------------------+-------------+------+-----+---------+
Then I have this MySQL- query to list students.
Query:
SELECT s.student_id, s.student_firstname, s.student_lastname,
a.student_absence_startdate, a.student_absence_enddate, a.student_absence_type
FROM students s LEFT JOIN studentabsence a ON a.student_id = s.student_id
Whenever a student has absence information, this is displayed in the columns
a.student_absence_startdate, a.student_absence_enddate, a.student_absence_type
Sometimes a student has two or more rows in the table studentabsence, and then he is listed two or more times.
My question is if there is any way to be more specific in the query. I would like to list all students from db.students and if there is a row in db.studentabsence with a date between startdate and enddate (for example 2012-07-30) list the student one time with this absence information. Only if there is a match on date.
So something like...
... WHERE (a.student_absence_startdate OR a.student_absence_enddate) IS NULL OR
'2012-07-30' BETWEEN a.student_absence_startdate AND
a.student_absence_enddate ...
It's kinda hard to explain so let me know if you need more information...
I think that you can arrange it with a JOIN on a subselect/subview :
SELECT s.student_id, s.student_firstname, s.student_lastname,
a.student_absence_startdate, a.student_absence_enddate, a.student_absence_type
FROM students s
LEFT JOIN
(SELECT * FROM studentabsence a1 WHERE ('2012-07-30' BETWEEN a1.student_absence_startdate AND a1.student_absence_enddate) ) a
ON a.student_id = s.student_id
I'd use parameters with default values (01/01/1900 00:00:00), like this:
AND ( a.student_absence_startdate >= #P_startdate OR #P_startdate = '01/01/1900 00:00:00' )
AND ( a.student_absence_enddate <= #P_enddate OR #P_enddate = '01/01/1900 00:00:00' )
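An equivalent way to express the date match is to put the BETWEEN condition directly into the ON clause of the LEFT JOIN, so that students without a matching absence are still listed with NULLs. The sketch below assumes SQLite via Python's sqlite3 module in place of MySQL, ISO-formatted date strings, and invented student and absence rows.

```python
import sqlite3

# SQLite stands in for MySQL; ISO date strings compare correctly as text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        student_firstname TEXT,
        student_lastname  TEXT
    );
    CREATE TABLE studentabsence (
        student_absence_id INTEGER PRIMARY KEY,
        student_id INTEGER,
        student_absence_startdate TEXT,
        student_absence_enddate   TEXT,
        student_absence_type      TEXT
    );
    INSERT INTO students VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
    INSERT INTO studentabsence VALUES
        (1, 1, '2012-07-01', '2012-07-05', 'sick'),
        (2, 1, '2012-07-28', '2012-08-02', 'holiday');
""")

# The date test lives in ON, not WHERE, so unmatched students keep their row.
rows = conn.execute("""
    SELECT s.student_id, s.student_firstname, a.student_absence_type
    FROM students s
    LEFT JOIN studentabsence a
           ON a.student_id = s.student_id
          AND ? BETWEEN a.student_absence_startdate AND a.student_absence_enddate
    ORDER BY s.student_id
""", ("2012-07-30",)).fetchall()
print(rows)
```

Ann appears once with the absence that covers 2012-07-30, and Bob appears once with NULL. A student with two overlapping absences on the chosen date would still be listed twice.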