MySQL: deduplicate records in a single query

I have the following table:
CREATE TABLE `relations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`relationcode` varchar(25) DEFAULT NULL,
`email_address` varchar(100) DEFAULT NULL,
`firstname` varchar(100) DEFAULT NULL,
`latname` varchar(100) DEFAULT NULL,
`last_contact_date` varchar(25) DEFAULT NULL,
PRIMARY KEY (`id`)
)
This table contains duplicates: relations with exactly the same relationcode and email_address. A relation can be in there twice or even 10 times.
I need a query that selects the ids of all records but, for relations that appear more than once, returns only the record with the most recent last_contact_date.
I'm more familiar with Oracle than MySQL. In Oracle I would be able to do it this way:
select * from (
select row_number () over (partition by relationcode order by to_date(last_contact_date,'dd-mm-yyyy')) rank,
id,
relationcode,
email_address ,
last_contact_date
from RELATIONS)
where rank = 1
But I can't figure out how to modify this query to work in MySQL. I'm not even sure it's possible to do the same thing in a single query in MySQL.
Any ideas?

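The normal way to do this is a subquery that finds the latest date per group, joined back against the table (the outer columns are qualified because relationcode and email_address exist on both sides of the join):
SELECT RELATIONS.id, RELATIONS.relationcode, RELATIONS.email_address, RELATIONS.firstname, RELATIONS.latname, RELATIONS.last_contact_date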
FROM RELATIONS
INNER JOIN
(
SELECT relationcode, email_address, MAX(last_contact_date) AS latest_contact_date
FROM RELATIONS
GROUP BY relationcode, email_address
) Sub1
ON RELATIONS.relationcode = Sub1.relationcode
AND RELATIONS.email_address = Sub1.email_address
AND RELATIONS.last_contact_date = Sub1.latest_contact_date
It is possible to manually generate the kind of rank that your Oracle query uses using variables. Bit messy though!
SELECT id, relationcode, email_address, firstname, latname, last_contact_date
FROM
(
SELECT id, relationcode, email_address, firstname, latname, last_contact_date, @seq:=IF(@relationcode = relationcode AND @email_address = email_address, @seq + 1, 1) AS seq, @relationcode := relationcode, @email_address := email_address
FROM
(
SELECT id, relationcode, email_address, firstname, latname, last_contact_date
FROM RELATIONS
CROSS JOIN (SELECT @seq:=0, @relationcode := '', @email_address :='') Sub1
ORDER BY relationcode, email_address, last_contact_date DESC
) Sub2
) Sub3
WHERE seq = 1
This uses a subquery to initialise the variables. The sequence number is incremented if the relation code and email address are the same as in the previous row; otherwise it is reset to 1. Either way it is stored in a field. The outer select then checks the sequence number (as a field, not as the variable name) and returns only the records where it is 1.
Note that I have done this as multiple subqueries, partly to make it clearer to you, but also to try to force the order in which MySQL executes it. MySQL does not guarantee the order in which it evaluates expressions involving user variables, which could cause problems. It never has for me, but with subqueries I would hope to force the order.
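On MySQL 8.0 and later, ROW_NUMBER() is available directly, so the Oracle-style query carries over without variables. Below is a minimal sketch of that form. It runs against an in-memory SQLite database through Python's sqlite3 module purely so that it is self-contained; the sample rows are invented, and last_contact_date is assumed to hold ISO-formatted (YYYY-MM-DD) values rather than the dd-mm-yyyy strings from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL 8.0+ here; the SELECT uses only
# standard window-function syntax. Sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relations (
    id INTEGER PRIMARY KEY,
    relationcode TEXT,
    email_address TEXT,
    last_contact_date TEXT  -- assumption: ISO format, so text order = date order
);
INSERT INTO relations VALUES
    (1, 'R1', 'a@example.com', '2014-01-05'),
    (2, 'R1', 'a@example.com', '2014-03-09'),
    (3, 'R1', 'a@example.com', '2013-12-31'),
    (4, 'R2', 'b@example.com', '2014-02-01');
""")

# Number the rows inside each (relationcode, email_address) group, newest
# first, and keep only the first row of every group.
latest_ids = [row[0] for row in conn.execute("""
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY relationcode, email_address
                   ORDER BY last_contact_date DESC, id DESC
               ) AS rn
        FROM relations
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
""")]
print(latest_ids)
```

The extra id DESC in the window ordering is only a tie-breaker, so that two rows with the same date still yield exactly one survivor per group.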

Here is a method that will work in both MySQL and Oracle. It rephrases the question as: Get me all rows from relations where the relationcode has no larger last_contact_date.
It works something like this:
select r.*
from relations r
where not exists (select 1
from relations r2
where r2.relationcode = r.relationcode and
r2.last_contact_date > r.last_contact_date
);
With the appropriate indexes, this should be pretty efficient in both databases.
Note: This assumes that last_contact_date is stored as a date, not as a string (as in your table example). Storing dates as strings is just a really bad idea and you should fix your data structure.
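The question defines a duplicate by both relationcode and email_address, and two rows in a group may share the same date. A hedged variant of the NOT EXISTS query that covers both points is sketched below, again using SQLite via Python only to make it runnable. The rows are invented and the dates are assumed to be ISO-formatted; the id comparison is an added tie-breaker, not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE relations (
    id INTEGER PRIMARY KEY,
    relationcode TEXT,
    email_address TEXT,
    last_contact_date TEXT  -- assumption: ISO format (YYYY-MM-DD)
);
INSERT INTO relations VALUES
    (1, 'R1', 'a@example.com',     '2014-01-05'),
    (2, 'R1', 'a@example.com',     '2014-03-09'),
    (3, 'R1', 'a@example.com',     '2014-03-09'),  -- same date as id 2
    (4, 'R1', 'other@example.com', '2013-06-01'),  -- same code, other email
    (5, 'R2', 'b@example.com',     '2014-02-01');
""")

# Keep a row only if no other row in the same (relationcode, email_address)
# group is newer, or equally new with a higher id.
kept_ids = [row[0] for row in conn.execute("""
    SELECT r.id
    FROM relations r
    WHERE NOT EXISTS (
        SELECT 1
        FROM relations r2
        WHERE r2.relationcode = r.relationcode
          AND r2.email_address = r.email_address
          AND (r2.last_contact_date > r.last_contact_date
               OR (r2.last_contact_date = r.last_contact_date
                   AND r2.id > r.id))
    )
    ORDER BY r.id
""")]
print(kept_ids)
```

Rows whose relationcode or email_address is NULL never match the equality tests, so this form keeps all of them; whether that is the desired behaviour depends on the data.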

Related

MYSQL ERROR CODE: 1288 - can't update with join statement

Thanks for past help.
While doing an update using a join, I am getting 'Error Code: 1288. The target table _____ of the UPDATE is not updatable' and can't figure out why. I can update the table with a simple update statement (UPDATE sales.customerABC Set contractID = 'x';) but can't using a join like this:
UPDATE (
SELECT * #where '*' contains columns a.uniqueID and a.contractID
FROM sales.customerABC
WHERE contractID IS NULL
) as a
LEFT JOIN (
SELECT uniqueID, contractID
FROM sales.tblCustomers
WHERE contractID IS NOT NULL
) as b
ON a.uniqueID = b.uniqueID
SET a.contractID = b.contractID;
If I change that UPDATE statement into a SELECT such as:
SELECT * FROM (
SELECT *
FROM opwSales.dealerFilesCTS
WHERE pcrsContractID IS NULL
) as a
LEFT JOIN (
SELECT uniqueID, pcrsContractID
FROM opwSales.dealerFileLoad
WHERE pcrsContractID IS NOT NULL
) as b
ON a."Unique ID" = b.uniqueID;
the result table would contain these columns:
a.uniqueID, a.contractID, b.uniqueID, b.contractID
59682204, NULL, NULL, NULL
a3e8e81d, NULL, NULL, NULL
cfd1dbf9, NULL, NULL, NULL
5ece009c, , 5ece009c, B123
5ece0d04, , 5ece0d04, B456
5ece7ab0, , 5ece7ab0, B789
cfd21d2a, NULL, NULL, NULL
cfd22701, NULL, NULL, NULL
cfd23032, NULL, NULL, NULL
I pretty much have all database privileges and can't find restrictions with the table reference data. Can't find much information online concerning the error code, either.
Thanks in advance guys.

You cannot update a sub-select because it's not a "real" table - MySQL cannot easily determine how the sub-select assignment maps back to the originating table.
Try:
UPDATE customerABC
JOIN tblCustomers USING (uniqueID)
SET customerABC.contractID = tblCustomers.contractID
WHERE customerABC.contractID IS NULL AND tblCustomers.contractID IS NOT NULL
Notes:
you can use an inner JOIN instead of a LEFT JOIN, since you want uniqueID to exist and not be null in both tables. A LEFT JOIN would generate extra rows with NULLs for tblCustomers, only to have them shot down by the requirement that tblCustomers.contractID be not NULL. Since they allow more stringent restrictions on indexes, inner JOINs tend to be more efficient than LEFT JOINs.
since the field has the same name in both tables you can replace ON (a.field1 = b.field1) with the USING (field1) shortcut.
you obviously strongly want a covering index with (uniqueID, contractID) on both tables to maximize efficiency
this is so not going to work unless you have "real" tables for the update. The "tblCustomers" may be a view or a subselect, but customerABC may not. You might need a more complicated JOIN to pull out a complex WHERE which might be otherwise hidden inside a subselect, if the original 'SELECT * FROM customerABC' was indeed a more complex query than a straight SELECT. What this boils down to is, MySQL needs a strong unique key to know what it needs to update, and it must be in a single table. To reliably update more than one table I think you need two UPDATEs inside a properly write-locked transaction.

SQL - Column in field list is ambiguous

I have two tables, BOOKING and WORKER: one table for the workers and one to keep track of what each worker has to do in a time frame (a booking). I'm trying to check whether there is an available worker for a job, so I query BOOKING to check whether any workers are available between the requested start and end dates. However, I get stuck on the next part, which is returning the list of workers that do have that time available. I read that I could join the tables based on a shared column, so I tried an inner join on the WORKER_NAME column, but when I do this I get an ambiguous-column error. This leads me to believe I misunderstood the concept. Does anyone understand what I'm trying to do and know how to do it, or know why I get the error below? Thanks!
CREATE TABLE WORKER (
ID INT NOT NULL AUTO_INCREMENT,
WORKER_NAME varchar(80) NOT NULL,
WORKER_CODE INT,
WORKER_WAGE INT,
PRIMARY KEY (ID)
)
CREATE TABLE BOOKING (
ID INT NOT NULL AUTO_INCREMENT,
WORKER_NAME varchar(80) NOT NULL,
START DATE NOT NULL,
END DATE NOT NULL,
PRIMARY KEY (ID)
)
query
SELECT *
FROM WORKERS
INNER JOIN BOOKING
ON WORKER_NAME = WORKER_NAME
WHERE (START NOT BETWEEN '2010-10-01' AND '2010-10-10')
ORDER BY ID
#1052 - Column 'WORKER_NAME' in on clause is ambiguous

In your query, the column "worker_name" exists in two tables; in this case, you must reference the table name as part of the column identifier. The same applies to ID in the ORDER BY clause.
SELECT *
FROM WORKERS
INNER JOIN BOOKING
ON workers.WORKER_NAME = booking.WORKER_NAME
WHERE (START NOT BETWEEN '2010-10-01' AND '2010-10-10')
ORDER BY workers.ID

In your query, the WORKER_NAME and ID columns exist in both tables, where WORKER_NAME retains the same meaning and ID is re-purposed; in this case, you must either specify that you are using WORKER_NAME as the join search condition or 'project away' (rename or omit) the duplicate ID column.
Because the ID columns are AUTO_INCREMENT, I assume (hope!) they have no business meaning. Therefore, they could both be omitted, allowing a natural join that will cause duplicate columns to be 'projected away'. This is one of those situations where one wishes SQL had a WORKER ( ALL BUT ( ID ) ) type syntax; instead, one is required to do it longhand. It might be easier in the long run to opt for a consistent naming convention and rename the columns to WORKER_ID and BOOKING_ID respectively.
You would also need to identify a business key to order on e.g. ( START, WORKER_NAME ):
SELECT *
FROM
( SELECT WORKER_NAME, WORKER_CODE, WORKER_WAGE FROM WORKER ) AS W
NATURAL JOIN
( SELECT WORKER_NAME, START, END FROM BOOKING ) AS B
WHERE ( START NOT BETWEEN '2010-10-01' AND '2010-10-10' )
ORDER BY START, WORKER_NAME;
This is good, but it's returning the start and end times as well. I just want the WORKER rows. I can't take the start and end out, because then SQL doesn't recognize the WHERE clause.
Two approaches spring to mind: push the where clause to the subquery:
SELECT *
FROM
( SELECT WORKER_NAME, WORKER_CODE, WORKER_WAGE FROM WORKER ) AS W
NATURAL JOIN
( SELECT WORKER_NAME, START, END
FROM BOOKING
WHERE START NOT BETWEEN '2010-10-01' AND '2010-10-10' ) AS B
ORDER BY START, WORKER_NAME;
Alternatively, replace SELECT * with a list of columns you want to SELECT:
SELECT WORKER_NAME, WORKER_CODE, WORKER_WAGE
FROM
( SELECT WORKER_NAME, WORKER_CODE, WORKER_WAGE FROM WORKER ) AS W
NATURAL JOIN
( SELECT WORKER_NAME, START, END FROM BOOKING ) AS B
WHERE START NOT BETWEEN '2010-10-01' AND '2010-10-10'
ORDER BY START, WORKER_NAME;
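For the underlying goal in the question (listing the workers who are free in a requested window), a join on bookings only returns workers who have bookings at all. One possible approach, sketched below under stated assumptions, is to ask for workers with no overlapping booking. The sketch runs on SQLite via Python's sqlite3 so that it is self-contained; the columns are renamed to start_date and end_date (hypothetical names, chosen to avoid the keywords START and END), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE worker (
    id INTEGER PRIMARY KEY,
    worker_name TEXT NOT NULL
);
CREATE TABLE booking (
    id INTEGER PRIMARY KEY,
    worker_name TEXT NOT NULL,
    start_date TEXT NOT NULL,  -- hypothetical name for START
    end_date TEXT NOT NULL     -- hypothetical name for END
);
INSERT INTO worker VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO booking VALUES
    (1, 'Ann', '2010-10-03', '2010-10-05'),  -- inside the requested window
    (2, 'Bob', '2010-09-20', '2010-09-30'),  -- ends before it
    (3, 'Bob', '2010-10-11', '2010-10-15');  -- starts after it
""")

requested_start, requested_end = "2010-10-01", "2010-10-10"

# A booking overlaps the window when it starts on or before the window's end
# and ends on or after the window's start. A worker is available when no
# such booking exists for them.
available = [row[0] for row in conn.execute("""
    SELECT w.worker_name
    FROM worker AS w
    WHERE NOT EXISTS (
        SELECT 1
        FROM booking AS b
        WHERE b.worker_name = w.worker_name
          AND b.start_date <= ?
          AND b.end_date >= ?
    )
    ORDER BY w.worker_name
""", (requested_end, requested_start))]
print(available)
```

Every column is qualified with its table alias, so the ambiguity error cannot occur, and workers with no bookings at all (Cy in this sample) are still listed.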

This error occurs when you reference a field that exists in more than one of the joined tables, so you must qualify it with the table name or alias. For instance, in the example below I write cod.coordinator so that the DBMS knows which coordinator column I want:
SELECT project__number, surname, firstname, cod.coordinator FROM coordinators AS co JOIN hub_applicants AS ap ON co.project__number = ap.project_id JOIN coordinator_duties AS cod ON co.coordinator = cod.email

MySQL Query Fixing/Optimisation for my configuration table

I have a MySQL table that holds the configuration of my project. Each configuration change creates a new entry, so that I have a history of all changes and who made them.
CREATE TABLE `configurations` (
`name` varchar(255) NOT NULL,
`value` text NOT NULL,
`lastChange` datetime NOT NULL,
`changedBy` bigint(32) NOT NULL,
KEY `lastChange` (`lastChange`),
KEY `name` (`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
INSERT INTO `configurations` (`name`, `value`, `lastChange`, `changedBy`) VALUES
('activePageLimit', 'activePageLimit-old-value', '2016-01-06 12:25:05', 1096775260340178),
('activePageLimit', 'activePageLimit-new-value', '2016-01-06 12:27:57', 1096775260340178),
('customerLogo', 'customerLogo-old-value', '2016-02-06 00:00:00', 1096775260340178),
('customerLogo', 'customerLogo-new-value', '2016-01-07 00:00:00', 1096775260340178);
Right now I have a problem with my select query, which should return every name with its latest value (ordered by lastChange).
| name | value | lastChange |
|-----------------|---------------------------|---------------------------|
| customerLogo | customerLogo-new-value | January, 07 2016 00:00:00 |
| activePageLimit | activePageLimit-new-value | January, 06 2016 12:27:57 |
My current Query is:
SELECT `name`, `value`, `lastChange`
FROM (
SELECT `name`, `value`, `lastChange`
FROM `configurations`
ORDER BY `lastChange` ASC
) AS `c`
GROUP BY `name` DESC
But unfortunately this does not always return the right values, and I don't like using a subquery; there has to be a cleaner and faster way to do this.
I also created a SQL Fiddle for you as a playground: http://sqlfiddle.com/#!9/f1dc9/1/0
Is there any other clever solution I missed?

Your method is documented to return indeterminate results (because you have columns in the select that are not in the group by).
Here are three alternatives. The first is standard SQL, using an explicit aggregation to get the most recent change.
SELECT c.*
FROM configurations c JOIN
(SELECT `name`, MAX(`lastChange`) as maxlc
FROM `configurations`
GROUP BY name
) mc
ON c.name = mc.name and c.lastChange = mc.maxlc ;
The second is also standard SQL, using not exists:
select c.*
from configurations c
where not exists (select 1
from configurations c2
where c2.name = c.name and c2.lastchange > c.lastchange
);
The third uses a hack which is available in MySQL (and it assumes that the value does not have any commas in this version and is not too long):
select name, max(lastchange),
substring_index(group_concat(value order by lastchange desc), ',', 1) as value
from configurations
group by name
order by name;
Use this version carefully, because it is prone to error (for instance, the intermediate group_concat() result could exceed a MySQL parameter, which would then have to be re-set).
There are other methods -- such as using variables. But these three should be sufficient for you to consider your options.

If we want to avoid SUBQUERY the only other option is JOIN
SELECT cc.name, cc.value, cc.lastChange FROM configurations cc
JOIN (
SELECT name, value, lastChange
FROM configurations
ORDER BY lastChange ASC
) c on c.value = cc.value
GROUP BY cc.name DESC

You have two requirements: a historical log, and a "state". Keep them in two different tables, in spite of that providing redundant information.
That is, have one table that faithfully records who changed what when.
Have another table that faithfully specifies the current state for the configuration.
Plan A: INSERT into the Log and UPDATE the State whenever anything happens.
Plan B: UPDATE the State and use a TRIGGER to write to the Log.
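Plan B can be sketched roughly as follows. The table names config_state and config_log, their columns, and the sample values are all hypothetical, and the sketch runs on SQLite through Python's sqlite3 so that it is self-contained. MySQL's trigger syntax differs (it needs FOR EACH ROW, and a client such as the mysql CLI typically needs a DELIMITER change for multi-statement bodies), so treat this as an outline of the idea rather than MySQL-ready DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Current state: exactly one row per configuration name.
CREATE TABLE config_state (
    name TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    changed_by INTEGER NOT NULL
);
-- History: one row per change, written only by the triggers below.
CREATE TABLE config_log (
    name TEXT NOT NULL,
    value TEXT NOT NULL,
    changed_by INTEGER NOT NULL,
    changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER config_state_insert AFTER INSERT ON config_state
BEGIN
    INSERT INTO config_log (name, value, changed_by)
    VALUES (NEW.name, NEW.value, NEW.changed_by);
END;
CREATE TRIGGER config_state_update AFTER UPDATE ON config_state
BEGIN
    INSERT INTO config_log (name, value, changed_by)
    VALUES (NEW.name, NEW.value, NEW.changed_by);
END;
""")

# The application only ever touches config_state.
conn.execute("INSERT INTO config_state VALUES ('activePageLimit', '10', 1)")
conn.execute(
    "UPDATE config_state SET value = '25', changed_by = 2 "
    "WHERE name = 'activePageLimit'"
)

current = conn.execute(
    "SELECT value FROM config_state WHERE name = 'activePageLimit'"
).fetchone()[0]
history = [row[0] for row in conn.execute(
    "SELECT value FROM config_log ORDER BY rowid"
)]
print(current, history)
```

Reading the current configuration then becomes a plain primary-key lookup on the state table, with no "latest row per name" query needed.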

SQL alternative to sub-query in FROM

I have a table containing user to user messages. A conversation has all messages between two users. I am trying to get a list of all the different conversations and display only the last message sent in the listing.
I am able to do this with a SQL sub-query in FROM.
CREATE TABLE `messages` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`from_user_id` bigint(20) DEFAULT NULL,
`to_user_id` bigint(20) DEFAULT NULL,
`type` smallint(6) NOT NULL,
`is_read` tinyint(1) NOT NULL,
`is_deleted` tinyint(1) NOT NULL,
`text` longtext COLLATE utf8_unicode_ci NOT NULL,
`heading` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at_utc` datetime DEFAULT NULL,
`read_at_utc` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
);
SELECT * FROM
(SELECT * FROM `messages` WHERE TYPE = 1 AND
(from_user_id = 22 OR to_user_id = 22)
ORDER BY created_at_utc DESC
) tb
GROUP BY from_user_id, to_user_id;
SQL Fiddle:
http://www.sqlfiddle.com/#!2/845275/2
Is there a way to do this without a sub-query?
(writing a DQL which supports sub-queries only in 'IN')

You seem to be trying to get the last contents of messages to or from user 22 with type = 1. Your method is explicitly not guaranteed to work, because the extra columns (not in the group by) can come from arbitrary rows. As explained in the documentation:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values within each group the server chooses.
The query that you want is more along the lines of this (assuming that you have an auto-incrementing id column for messages):
select m.*
from (select m.from_user_id, m.to_user_id, max(m.id) as max_id
from messages m
where m.type = 1 and (m.from_user_id = 22 or m.to_user_id = 22)
group by m.from_user_id, m.to_user_id
) lm join
messages m
on lm.max_id = m.id;
Or this:
select m.*
from messages m
where m.type = 1 and (m.from_user_id = 22 or m.to_user_id = 22) and
not exists (select 1
from messages m2
where m2.type = m.type and m2.from_user_id = m.from_user_id and
m2.to_user_id = m.to_user_id and
m2.created_at_utc > m.created_at_utc
);
For this latter query, an index on messages(type, from_user_id, to_user_id, created_at_utc) would help performance.
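Both queries above treat 22 → 7 and 7 → 22 as separate groups. If a conversation should cover both directions, one option is to group by "the other user" and to keep the subquery inside IN, which is the only place the question says subqueries are allowed. The sketch below is an illustration on SQLite via Python's sqlite3 with invented rows; it assumes, like the first query above, that a higher id means a later message, and whether DQL accepts a CASE expression in GROUP BY has not been checked here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    from_user_id INTEGER,
    to_user_id INTEGER,
    type INTEGER NOT NULL,
    text TEXT NOT NULL
);
INSERT INTO messages VALUES
    (1, 22, 7, 1, 'hi'),
    (2, 7, 22, 1, 'hello back'),   -- same conversation, other direction
    (3, 22, 9, 1, 'ping'),
    (4, 9, 22, 2, 'other type'),   -- filtered out by type
    (5, 7, 9, 1, 'not user 22');   -- filtered out by user
""")

# For user 22, the conversation partner is whichever side is not 22.
# MAX(id) per partner picks the newest message of each conversation.
last_messages = conn.execute("""
    SELECT m.id, m.text
    FROM messages AS m
    WHERE m.id IN (
        SELECT MAX(id)
        FROM messages
        WHERE type = 1 AND (from_user_id = 22 OR to_user_id = 22)
        GROUP BY CASE WHEN from_user_id = 22
                      THEN to_user_id
                      ELSE from_user_id
                 END
    )
    ORDER BY m.id
""").fetchall()
print(last_messages)
```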

Since this is a rather specific type of data query which goes outside common ORM use cases, DQL isn't really fit for this - it's optimized for walking well-defined relationships.
For your case however Doctrine fully supports native SQL with result set mapping. Using a NativeQuery with ResultSetMapping like this you can easily use the subquery this problem requires, and still map the results on native Doctrine entities, allowing you to still profit from all caching, usability and performance advantages.
Samples found here.

If you mean to get all conversations and all their last messages, then a subquery is necessary.
SELECT a.* FROM messages a
INNER JOIN (
SELECT
MAX(created_at_utc) as max_created,
from_user_id,
to_user_id
FROM messages
GROUP BY from_user_id, to_user_id
) b ON a.created_at_utc = b.max_created
AND a.from_user_id = b.from_user_id
AND a.to_user_id = b.to_user_id
And you could append the where condition as you like.
THE SQL FIDDLE.

I don't think your original query was even doing this correctly. Not sure what the GROUP BY was being used for other than maybe try to only return a single (unpredictable) result.
Just add a limit clause:
SELECT * FROM `messages`
WHERE `type` = 1 AND
(`from_user_id` = 22 OR `to_user_id` = 22)
ORDER BY `created_at_utc` DESC
LIMIT 1
For optimum query performance you need indexes on the following fields:
type
from_user_id
to_user_id
created_at_utc

Getting multiple values from a query

I wrote a query to get values from 3 tables, but it returns multiple rows. Can anyone tell me where I went wrong?
select c.CompanyName,cd.FedTaxID,cd.EmailAddress,cd.PhoneNumber
from tblcustomerdetail cd,tblcustomer c
where c.FedTaxID in (
select FedTaxID
from tblcustomer
where CustomerID in (
select LOginID
from tbluserlogindetail
where UserName like "pa%" and RoleTypeID='20'
)
)
and cd.FedTaxID in (
select FedTaxID
from tblcustomer
where CustomerID in (
select LOginID
from tbluserlogindetail
where UserName like "pa%" and RoleTypeID='20'
)
);
The relations are as follows.
My 3 tables are `tbluserlogindetail`, `tblcustomerdetail` and `tblcustomer`.
1) Initially I get the `LoginID` from `tbluserlogindetail` based on the `UserName`.
2) Next, based on the `LoginID`, I get the `FedTaxID` from `tblcustomerdetail`.
3) Next, based on the `FedTaxID`, I get the required details from `tblcustomer`.

SELECT
tblcustomer.CompanyName,
tblcustomerdetail.FedTaxID,
tblcustomerdetail.EmailAddress,
tblcustomerdetail.PhoneNumber
FROM tbluserlogindetail, tblcustomer, tblcustomerdetail
WHERE
tbluserlogindetail.LOginID = tblcustomer.CustomerID
AND tblcustomer.FedTaxID = tblcustomerdetail.FedTaxID
AND tbluserlogindetail.UserName LIKE 'pa%'
AND tbluserlogindetail.RoleTypeID = '20'
Try something like this.
Nested IN subqueries can perform poorly in MySQL; joining the tables directly avoids them. See:
MySQL - SELECT WHERE field IN (subquery) - Extremely slow why?
