I'm experimenting with a query that I'll use for pruning two related MySQL tables. I'll be using it to delete all but the most recent entries.
This query behaves exactly as I expect:
SELECT
O.id AS O_id,
T.id AS T_id
FROM
rt.ObjectCustomFieldValues AS O
LEFT JOIN rt.Transactions AS T
ON O.id = T.NewReference
WHERE
O.Disabled = 1
AND O.CustomField = 58
AND O.ObjectId = 202784
AND T.id NOT IN (
SELECT
id
FROM
(
SELECT
id
FROM
Transactions
WHERE
Field = 58
AND ObjectId = 202784
ORDER BY
Created DESC
LIMIT 5
) Test
)
For the rows containing ObjectId 202784, I get the ObjectCustomFieldValues ids and the Transactions ids for all but the most recent 5 items.
Now how do I turn this into a general query that I can run over all rows instead of specifying the ObjectId manually?
To summarize, for field id 58, I want to iterate all ObjectId values and for each one, delete all but the most recent ObjectCustomFieldValues and Transactions.
You can view schema details here:
https://github.com/bestpractical/rt/blob/stable/etc/schema.mysql#L112
and here:
https://github.com/bestpractical/rt/blob/stable/etc/schema.mysql#L328
If your rows are not INSERTed with a timestamp such as UNIX_TIMESTAMP(), then depending on how the rest of your database is structured, working out which entries are the most recent could be difficult. If you store a UNIX_TIMESTAMP() with each row, you can ORDER BY it reliably no matter what.
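As a rough sketch of how the per-ObjectId query might be generalized, a window function can number each object's transactions from newest to oldest, so that everything past the fifth is a pruning candidate. This assumes MySQL 8.0 or later (where ROW_NUMBER() is available) and that Created is a reliable ordering column; the alias rn and the tie-breaker on id are illustrative choices, not part of the original query.

```sql
-- Sketch only: assumes MySQL 8.0+ (window functions) and that
-- Transactions.Created orders rows reliably; id is used as a tie-breaker.
SELECT
    O.id AS O_id,
    T.id AS T_id
FROM rt.ObjectCustomFieldValues AS O
JOIN (
    SELECT
        id,
        ObjectId,
        NewReference,
        -- rn = 1 is the newest transaction for each ObjectId
        ROW_NUMBER() OVER (
            PARTITION BY ObjectId
            ORDER BY Created DESC, id DESC
        ) AS rn
    FROM rt.Transactions
    WHERE Field = 58
) AS T
    ON O.id = T.NewReference
   AND O.ObjectId = T.ObjectId
WHERE O.Disabled = 1
  AND O.CustomField = 58
  AND T.rn > 5;   -- everything except the 5 most recent per ObjectId
```

Running this as a SELECT first and comparing its output for ObjectId 202784 against the hand-written query is a cheap way to check the logic before turning it into a DELETE.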
I am trying to make the following query run faster than 180 secs:
SELECT
x.di_on_g AS deviceid, SUM(1) AS amount
FROM
(SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN ("many (~1500) comma separated IDs coming from my code"))
GROUP BY g.device_id , g.guide_type_id) x
GROUP BY x.di_on_g
ORDER BY amount;
Screenshot from EXPLAIN:
https://ibb.co/da5oAF
Even if I run the subquery as separate query it is still very slow...:
SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN ("many (~1500) comma separated IDs coming from my code"))
Screenshot from EXPLAIN:
ibb.co/gJHRVF
I have indexes on g.device_id and on other appropriate places.
Indexes:
SHOW INDEX FROM guide;
ibb.co/eVgmVF
SHOW INDEX FROM operator_guide_type;
ibb.co/f0TTcv
SHOW INDEX FROM operator_device;
ibb.co/mseqqF
All IDs are integers. I tried creating a new temp table for the IDs that come from my code and JOINing that table instead of using the slow IN clause, but that didn't make the query much faster (only about 10 secs faster).
None of the tables have more than 300,000 rows and the mysql configuration is good.
And the visual plan:
Query Plan
Any help will be appreciated !
Let's focus on the subquery. The main problem is "inflate-deflate", but I will get to that in a moment.
Add the composite index:
INDEX(locale_id, operator_id, device_id)
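One way to add that index and then re-check the plan might look like the following. The index name idx_locale_operator_device and the literal device IDs are placeholders made up for illustration; only the column order comes from the suggestion above.

```sql
-- Hypothetical index name; the column order follows the suggestion above.
ALTER TABLE guide
    ADD INDEX idx_locale_operator_device (locale_id, operator_id, device_id);

-- Re-run EXPLAIN afterwards to see whether the new index is picked.
EXPLAIN
SELECT g.device_id
FROM guide g
WHERE g.locale_id = 1
  AND g.operator_id = 1
  AND g.device_id IN (101, 102, 103);  -- placeholder IDs
```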
Why the duplicated "1" in the following?
g.operator_id IN (1 , 1)
Why does the GROUP BY have 2 columns, when you select only 1? Is there some reason for using GROUP BY instead of DISTINCT? (The latter seems to be your intent.)
The only reason for these
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
would be to verify that there are guides and devices in those other tables. Is that correct? Are these the PRIMARY KEYs, hence unique: ogt.guide_type_id and od.device_id? If so, why do you need the GROUP BY? Based on the EXPLAIN, it sounds like both of those are related 1:many. So...
SELECT g.device_id AS di_on_g
FROM guide g
WHERE EXISTS( SELECT * FROM operator_guide_type WHERE guide_type_id = g.guide_type_id )
AND EXISTS( SELECT * FROM operator_device WHERE device_id = g.device_id )
AND g.operator_id IN (1)
AND g.locale_id = 1
AND g.device_id IN (...)
Notes:
The GROUP BY is no longer needed.
The "inflate-deflate" of JOIN + GROUP BY is gone. The Explain points this out -- 139K rows inflated to 61M -- very costly.
EXISTS is a "semijoin", meaning that it does not collect all matches, but stops when it finds any match.
"the mysql configuration is good" -- How much RAM do you have? What Engine is the table? What is the value of innodb_buffer_pool_size?
I'm attempting to build a list of results based on three joins.
I have created a table of leads. As my sales team takes action on the leads, they attach event note records to them. One lead can have many notes. Each note has a timestamp and also a date/time field where they can set a future date in order to schedule callbacks and appointments.
I have no trouble building the list with all my leads associated with their respective event notes, but what I want to do in this particular case is query a smaller list of leads, each associated with only the event note containing the "newest"/highest value in the date_time column.
I've been digging around, especially here on Stack, for the last couple of days attempting to get the desired result from my statements. I get either all of the lead records with all of their associated event note records, or I get just 1 row, no matter what I use: (GROUP BY date_time ASC LIMIT 1) or (ORDER BY date_time ASC LIMIT 1). I've even tried to build a view with only the highest scheduled record for each lead.id.
SELECT
rr_leads.id AS 'Lead',
rr_leads.first,
rr_leads.last,
rr_leads.company,
rr_leads.phone,
rr_leads.email,
rr_leads.city,
rr_leads.zip,
rr_leads.status,
z.noteid,
z.taskid,
z.scheduled,
z.event
FROM rr_leads
LEFT JOIN
(
SELECT
rr_lead_notes.lead_id,
rr_lead_notes.id AS 'noteid',
rr_lead_tasks.id AS 'taskid',
rr_lead_notes.date_time AS 'scheduled',
rr_lead_notes.task_note,
rr_lead_tasks.task_step AS 'event'
FROM rr_lead_notes
LEFT JOIN rr_lead_tasks
ON rr_lead_notes.task_note = rr_lead_tasks.task_step
AND rr_lead_notes.id IS NOT NULL
AND rr_lead_notes.task_note IS NOT NULL
GROUP BY rr_lead_notes.id DESC
) z
ON rr_leads.id = z.lead_id
WHERE rr_leads.id IS NOT NULL
AND z.noteid IS NOT NULL
ORDER BY rr_leads.id DESC
Here is the general idea of getting data associated with a most recent event. You can adjust for your particular situation.
select yourfields
from table1 join othertables etc
join
(select id, max(time_stamp) maxts
from table1
where whatever
group by id) temp on table1.id = temp.id
and table1.time_stamp = maxts
where whatever
Make sure the where clauses in your main query and subquery are the same.
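Applied to the tables in the question, the pattern might look roughly like this. It is a sketch that assumes rr_lead_notes.lead_id references rr_leads.id and that date_time is the column defining "newest"; the aliases l, n, latest and max_dt are made up for illustration, and the join to rr_lead_tasks is left out for brevity.

```sql
-- Sketch: one row per lead, carrying only the note with the highest date_time.
SELECT
    l.id        AS lead_id,
    l.first,
    l.last,
    n.id        AS noteid,
    n.date_time AS scheduled,
    n.task_note
FROM rr_leads AS l
JOIN (
    -- newest date_time per lead
    SELECT lead_id, MAX(date_time) AS max_dt
    FROM rr_lead_notes
    GROUP BY lead_id
) AS latest
    ON latest.lead_id = l.id
JOIN rr_lead_notes AS n
    ON n.lead_id   = latest.lead_id
   AND n.date_time = latest.max_dt
ORDER BY l.id DESC;
```

If two notes for the same lead share the same date_time, both rows come back; an extra tie-breaker (for example the highest note id) would be needed to force exactly one.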
OK, so I have the following schema and query, which is very slow (when using real data) because of the ORDER BY:
http://sqlfiddle.com/#!2/5e7bb/10
As per the MySQL manual: "You are joining many tables, and the columns in the ORDER BY are not all from the first nonconstant table that is used to retrieve rows. (This is the first table in the EXPLAIN output that does not have a const join type.)"
But I still need to sort by that column. How should I do this?
UPDATE: since the fiddle was updated :
SELECT
cpa.product_id,
cp.product_internal_ref,
cp.product_name,
cpa.product_sale_price,
cpa.is_product_service,
cpa.product_service_price
FROM
catalog_products_attributes cpa
JOIN
catalog_products cp ON cp.product_id = cpa.product_id
WHERE
cpa.product_id IN (
SELECT
product_id
FROM
catalog_products_categories
WHERE
category_id = 41
)
ORDER BY
cpa.product_service_price DESC
I've been working with this SQL problem for about 2 days now and suspect I'm very close to resolving the issue but just can't seem to find a solution that completely works.
What I'm attempting to do is a selective join on two tables called application_info and application_status that are used to store information about open access journal article funding requests.
application_info has general information about the applicant and uses an auto indexing field called Application_ID as a key field. application_status is used to track the ongoing information about the status of the application (received, under review, funded, denied, withdrawn, etc.) as well as status of the journal article (submitted, accepted, resubmitted, published or rejected) and contains both an Application_ID field and an auto indexing field called Status_ID along with a status text and status date field.
Because we want to keep a running log of application, article, and funding status changes we don't want to overwrite existing rows in the application_status with updated values, but instead want to only show the most recent status values. Because an application will eventually have more than one status change this creates a need to apply some sort of limit on the inner join of the status data to the application data so that only one row is returned for each application ID.
Here's an example of what I am attempting to do in a query that currently throws an error:
-- simplified example
SELECT
application_info.*,
artstatus.Status_ID AS Article_Status_ID,
artstatus.Application_ID AS Article_Application_ID,
artstatus.Status_State_Date AS Article_Status_State_Date,
artstatus.Status_State_Text AS Article_Status_State_Text
FROM application_info
LEFT JOIN (
SELECT
Status_ID,
Application_ID,
Status_State_Text,
Status_State_Date,
Status_State_InitiatedBy,
Status_State_ChangebBy,
Status_State_Notes
FROM application_status
WHERE Status_State_Text LIKE 'Article Status%'
AND Application_ID = application_info.Application_ID -- how to pass the current application_info.Application_ID from the ON clause to here?
-- and Application_ID = 29 -- this would be an option for specific IDs, but not an option for getting a complete list of application IDs with status
-- GROUP BY Application_ID -- reduces the sub query to 1 row (Yeah!) but returns the first row encountered before the ORDER BY comes into play
ORDER BY Status_ID DESC
-- a GROUP BY after the ORDER BY might resolve the issue if we could do a sort first
LIMIT 1 -- only want to get the first (most recent) row, only works correctly if passing an Application_ID
) AS artstatus
ON application_info.Application_ID = artstatus.Application_ID
-- WHERE application_info.Application_ID = 29 -- need to get all IDs with status values as well as for specific ID requests
;
Eliminating the AND Application_ID = application_info.Application_ID portion of the sub query along with the LIMIT causes the select to work, but returns a row for every status for a given application ID. I've tried using MIN/MAX operators but have noticed that, when they work, they return unpredictable rows from the application_status table.
I've also attempted to do sub selects in the ON section of the join, but don't know how to make that work because the end result would always need to return an Application_ID (can both Application_ID and Status_ID be returned and used?).
Any hints on how to get this to work as I'm intending? Can this even be done?
Further edit: working query below. The key was to move the sub query in the join one level deeper and then return just a single status ID.
-- simplified example (now working)
SELECT
application_info.*,
artstatus.Status_ID AS Article_Status_ID,
artstatus.Application_ID AS Article_Application_ID,
artstatus.Status_State_Date AS Article_Status_State_Date,
artstatus.Status_State_Text AS Article_Status_State_Text
FROM application_info
LEFT JOIN (
SELECT
Status_ID,
Application_ID,
Status_State_Text,
Status_State_Date,
Status_State_InitiatedBy,
Status_State_ChangebBy,
Status_State_Notes
FROM application_status AS artstatus_int
WHERE
-- sub query moved one level deeper so current join Application_ID can be passed
-- order by and limit can now be used
Status_ID = (
SELECT status_ID FROM application_status WHERE Application_ID = artstatus_int.Application_ID
AND status_State_Text LIKE 'Article Status%'
ORDER BY Status_ID DESC
LIMIT 1
)
ORDER BY Application_ID, Status_ID DESC
-- no need for GROUP BY or LIMIT here because only one row is returned per Application_ID
) AS artstatus
ON application_info.Application_ID = artstatus.Application_ID
-- WHERE application_info.Application_ID = 29 -- works for specific application ID as well
-- more LEFT JOINS follow
;
You can't have a correlated subquery in the from clause.
Try this idea instead:
select <whatever>
from (select a.*,
(select max(s.status_id)
from application_status s
where s.application_id = a.application_id
) as maxstatusid
from application_info a
) a left outer join
application_status aps
on aps.status_id = a.maxstatusid
. . .
That is, put the correlated subquery in the select clause to get the most recent status. Then join this in to the status table to get other information. And, finish the query with other details.
You seem pretty adept at your SQL skills, so it doesn't seem necessary to rewrite the whole query for you.
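For readers who want to see the idea filled in, one possible completion using the column names from the question is sketched below. It assumes, as the question's own ORDER BY Status_ID DESC does, that a higher Status_ID means a more recent status; the aliases ai, s, a and aps are illustrative.

```sql
-- Sketch: the correlated subquery sits in the SELECT list of the derived
-- table, then the outer join fetches the remaining status columns.
SELECT
    a.*,
    aps.Status_ID         AS Article_Status_ID,
    aps.Status_State_Date AS Article_Status_State_Date,
    aps.Status_State_Text AS Article_Status_State_Text
FROM (
    SELECT
        ai.*,
        (SELECT MAX(s.Status_ID)
         FROM application_status AS s
         WHERE s.Application_ID = ai.Application_ID
           AND s.Status_State_Text LIKE 'Article Status%'
        ) AS maxstatusid
    FROM application_info AS ai
) AS a
LEFT JOIN application_status AS aps
    ON aps.Status_ID = a.maxstatusid;
```

Applications with no matching status row get a NULL maxstatusid, so the LEFT JOIN still returns them with NULL status columns.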
I have a table in my database to store user data. I found a defect in the code that adds data to this table: if a network timeout occurs, the code updates the next user's data with the previous user's data. I've addressed this defect, but I need to clean the database. I've added a flag to indicate the rows that need to be ignored, and my goal is to set these flags accordingly for duplicates. In some cases, though, duplicate values may actually be legitimate, so I am more interested in finding several users with the same data (i.e., more than 2 users).
Here's an example (tablename = Data):
id  user_id  data1  data2  data3  datetime      flag
1   usr1     3      2      2      2012-02-16..  0
2   usr2     3      2      2      2012-02-16..  0
3   usr3     3      2      2      2012-02-16..  0
In this case, I'd like to mark the 1 and 2 id flags as 1 (to indicate ignore). Since we know usr1 was the original datapoint (assuming the oldest dates are earlier in the list).
At this point there are so many entries in the table that I'm not sure the best way to identify the users that have duplicate entries.
I'm looking for a mysql command to identify the problem data first and then I'll be able to mark the entries. Could someone guide me in the right direction?
Well, first select duplicate data with their min user id:
CREATE TEMPORARY TABLE duplicates
SELECT MIN(user_id) AS user_id, data1, data2, data3 -- alias so the UPDATE can reference duplicates.user_id
FROM data
GROUP BY data1, data2, data3
HAVING COUNT(*) > 1 -- at least two rows
AND COUNT(*) = COUNT(DISTINCT user_id) -- all user_ids must be different
AND TIMESTAMPDIFF(MINUTE, MIN(`datetime`), MAX(`datetime`)) <= 45;
(I'm not sure if I used TIMESTAMPDIFF properly.)
Now we can update the flag in those rows where user_id is different:
UPDATE duplicates
INNER JOIN data ON data.data1 = duplicates.data1
AND data.data2 = duplicates.data2
AND data.data3 = duplicates.data3
AND data.user_id != duplicates.user_id
SET data.flag = 1;
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
If there are duplicate times involved, maybe try this
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,`datetime`,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3,`datetime`
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
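Since the question asks to identify the problem data before marking anything, a read-only query along these lines could be run first. It is a sketch that assumes the table and column names from the example (Data, user_id, data1, data2, data3, datetime); the output aliases user_count, first_seen and last_seen are made up for illustration.

```sql
-- Sketch: list value combinations shared by more than 2 distinct users,
-- without modifying any rows.
SELECT
    data1,
    data2,
    data3,
    COUNT(DISTINCT user_id) AS user_count,
    MIN(`datetime`)         AS first_seen,
    MAX(`datetime`)         AS last_seen
FROM Data
GROUP BY data1, data2, data3
HAVING COUNT(DISTINCT user_id) > 2
ORDER BY user_count DESC;
```

Reviewing this list by hand should make it easier to decide which groups are genuine duplicates before running any of the UPDATE statements.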