MySQL: only keep the highest value per post id per day - mysql

I have a database that looks like this:
post metrics minutes (there is only data for post id 1 in this example):
| post id | date updated local  | reach |
|---------|---------------------|-------|
| 1       | 2018-01-01 01:00:00 | 10    |
| 1       | 2018-01-01 01:05:00 | 20    |
| 1       | 2018-01-01 01:15:00 | 22    |
| 1       | 2018-01-01 16:05:00 | 100   |
| 1       | 2018-01-02 03:00:00 | 121   |
| 1       | 2018-01-02 21:00:00 | 140   |
| 1       | 2018-01-04 01:00:00 | 147   |
My system fetches data for all posts every 5 minutes and inserts the results into the above table if the reach differs from the last value stored for that post (this prevents storing huge amounts of identical data).
Now there are thousands of posts and the table has started to grow out of control, making my website a lot slower when loading data from this table.
So I decided I can reduce the data by only keeping the last row per post per day, i.e. delete all rows that do not have the max date updated local for that post on that day. The result would be:
| post id | date updated local  | reach |
|---------|---------------------|-------|
| 1       | 2018-01-01 16:05:00 | 100   |
| 1       | 2018-01-02 21:00:00 | 140   |
| 1       | 2018-01-04 01:00:00 | 147   |
I came up with:
DELETE FROM `post metrics minutes`
WHERE EXISTS (
    SELECT *
    FROM `post metrics minutes` pmmtemp
    WHERE pmmtemp.`post id` = `post metrics minutes`.`post id`
      AND pmmtemp.`date updated local` > `post metrics minutes`.`date updated local`
      AND DATE(pmmtemp.`date updated local`) = DATE(`post metrics minutes`.`date updated local`)
);
But this gives me the following error:
Error Code: 1093. Table 'post metrics minutes' is specified twice, both as a target for 'DELETE' and as a separate source for data
Hope anyone can help me out!

MySQL cannot DELETE from or UPDATE a table that is also used as a subquery source in the same statement.
One could create a temp table of post ids to delete.
But marking the records first works too; this way the two queries do not interfere with each other.
For the nested table, instead of FROM tablename I use FROM (SELECT * FROM tablename), which acts as a temporary copy of the table.
Here I abuse the reach column as a deletion marker.
UPDATE `post metrics minutes` p
SET p.reach = -1
WHERE EXISTS (
    SELECT *
    FROM (SELECT * FROM `post metrics minutes`) pmmtemp
    WHERE pmmtemp.`post id` = p.`post id`
      AND pmmtemp.`date updated local` > p.`date updated local`
      AND DATE(pmmtemp.`date updated local`) = DATE(p.`date updated local`)
);
DELETE FROM `post metrics minutes`
WHERE reach = -1;
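The mark-then-delete pattern above can be exercised end to end. The sketch below is a stand-in using SQLite via Python's sqlite3 module (an assumption on my part: table and column names are simplified to snake_case, and SQLite doesn't share MySQL's error 1093, but the two-step flow is identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post_metrics_minutes (post_id INTEGER, date_updated_local TEXT, reach INTEGER);
INSERT INTO post_metrics_minutes VALUES
  (1, '2018-01-01 01:00:00', 10),
  (1, '2018-01-01 01:05:00', 20),
  (1, '2018-01-01 01:15:00', 22),
  (1, '2018-01-01 16:05:00', 100),
  (1, '2018-01-02 03:00:00', 121),
  (1, '2018-01-02 21:00:00', 140),
  (1, '2018-01-04 01:00:00', 147);
""")

# Step 1: mark every row that has a later row for the same post on the same day.
# The derived table (SELECT * FROM ...) mirrors the trick from the answer.
conn.execute("""
UPDATE post_metrics_minutes SET reach = -1
WHERE EXISTS (
    SELECT 1
    FROM (SELECT * FROM post_metrics_minutes) t
    WHERE t.post_id = post_metrics_minutes.post_id
      AND t.date_updated_local > post_metrics_minutes.date_updated_local
      AND DATE(t.date_updated_local) = DATE(post_metrics_minutes.date_updated_local)
)
""")
# Step 2: delete the marked rows.
conn.execute("DELETE FROM post_metrics_minutes WHERE reach = -1")

rows = conn.execute(
    "SELECT * FROM post_metrics_minutes ORDER BY date_updated_local"
).fetchall()
print(rows)
```

Only the last row per post per day survives, matching the desired result table in the question.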

As per my comment, it's often quicker to create a new table with the desired rows, then drop the old table and replace it with the new one.
My column/table names may be very slightly different from yours, but something like...
CREATE TABLE my_new_table AS
SELECT x.*
FROM my_old_table x
JOIN ( SELECT post_id, MAX(dt) dt
       FROM my_old_table
       GROUP BY post_id, DATE(dt) ) y
  ON y.post_id = x.post_id
 AND y.dt = x.dt;
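As a sanity check, the rebuild-and-swap can be sketched with SQLite through Python's sqlite3 (a stand-in, not MySQL; the my_old_table/my_new_table names come from the answer, and the tiny data set is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_old_table (post_id INTEGER, dt TEXT, reach INTEGER);
INSERT INTO my_old_table VALUES
  (1, '2018-01-01 01:00:00', 10),
  (1, '2018-01-01 16:05:00', 100),
  (1, '2018-01-02 21:00:00', 140);

-- Keep only the last row per post per day.
CREATE TABLE my_new_table AS
SELECT x.*
FROM my_old_table x
JOIN (SELECT post_id, MAX(dt) AS dt
      FROM my_old_table
      GROUP BY post_id, DATE(dt)) y
  ON y.post_id = x.post_id AND y.dt = x.dt;

-- Swap the tables (in MySQL you could use RENAME TABLE instead).
DROP TABLE my_old_table;
ALTER TABLE my_new_table RENAME TO my_old_table;
""")

rows = conn.execute("SELECT * FROM my_old_table ORDER BY dt").fetchall()
print(rows)
```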

Related

How to optimize an update query for multiple rows using MySQL and PHP

I have a table that has around 80,000 records. It has 4 columns:
| id    | code | size | qty |
|-------|------|------|-----|
| 1     | 4735 | M    | 5   |
| 2     | 8452 | L    | 2   |
| ...   | ...  | ...  | ... |
| 81456 | 9145 | XS   | 13  |
The code column is unique.
I have to update the qty twice a day.
For that I'm using this query:
UPDATE stock SET qty = CASE id
WHEN 1 THEN 10
WHEN 2 THEN 8
...
WHEN 2500 THEN 20
END
WHERE id IN (1,2,...,2500);
I am splitting the query to update 2500 stocks at a time using PHP.
Here is (in seconds) how much it takes for each 2500 stocks to update:
[0]7.11
[1]11.30
[2]19.86
[3]27.01
[4]36.25
[5]44.21
[6]51.44
[7]61.03
[8]71.53
[9]81.14
[10]89.12
[11]99.99
[12]111.46
[13]121.86
[14]131.19
[15]136.94
[END]137
As you can see, it takes between 5 and 9 seconds to update 2500 products, which I think is quite a lot.
What can I change to speed things up?
Thank you!
Because the times seem to be getting longer the further along you get, I'd expect you need an index on the id field, as it looks suspiciously like it's doing a full table scan. You can create the index something like this:
CREATE INDEX my_first_index ON stock(id);
(I am having to add this as an answer because I can't make comments, I know it is more of a comment!!)
** EDIT **
I re-read and see your issue is bigger. I still think there is a chance that putting an index on id would fix it, but a better solution would be to have a new table for the id-to-quantity mappings; let's call it qty_mapping:
| id   | qty |
|------|-----|
| 1    | 10  |
| 2    | 8   |
| ...  | ... |
| 2500 | 20  |
Make sure to index id, and then you can change your update to:
UPDATE stock SET qty = (SELECT qm.qty FROM qty_mapping qm WHERE qm.id = stock.id);
It should be able to update the whole 80,000 records in next to no time.
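To illustrate the mapping-table approach, here is a small sketch with SQLite via Python's sqlite3 (a stand-in for MySQL; the stock and qty_mapping names come from the answer, and the two-row data set is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (id INTEGER PRIMARY KEY, code TEXT, size TEXT, qty INTEGER);
INSERT INTO stock VALUES (1, '4735', 'M', 5), (2, '8452', 'L', 2);

-- Load the new quantities into an indexed mapping table
-- (INTEGER PRIMARY KEY provides the index here).
CREATE TABLE qty_mapping (id INTEGER PRIMARY KEY, qty INTEGER);
INSERT INTO qty_mapping VALUES (1, 10), (2, 8);

-- One correlated-subquery update replaces the long CASE chain.
-- The WHERE clause keeps unmapped rows from being set to NULL.
UPDATE stock
SET qty = (SELECT qm.qty FROM qty_mapping qm WHERE qm.id = stock.id)
WHERE id IN (SELECT id FROM qty_mapping);
""")

rows = conn.execute("SELECT id, qty FROM stock ORDER BY id").fetchall()
print(rows)
```

The design point: a bulk insert into the mapping table plus one set-based UPDATE avoids parsing a 2500-branch CASE expression per batch.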

Skipping row for each unique column value

I have a table from which I would like to extract all of the column values for all rows. However, the query needs to be able to skip the first entry for each unique value of id_customer. It can be assumed that there will always be at least two rows containing the same id_customer.
I've compiled some sample data which can be found here: http://sqlfiddle.com/#!9/c85b73/1
The results I would like to achieve are something like this:
id_customer | id_cart | date
----------- | ------- | -------------------
1 | 102 | 2017-11-12 12:41:16
2 | 104 | 2015-09-04 17:23:54
2 | 105 | 2014-06-05 02:43:42
3 | 107 | 2011-12-01 11:32:21
Please let me know if any more information or a better explanation is required; I expect it's quite a niche solution.
One method is:
select c.*
from carts c
where c.date > (select min(c2.date) from carts c2 where c2.id_customer = c.id_customer);
If your data is large, you want an index on carts(id_customer, date).
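The min-date filter can be checked with a small SQLite sketch via Python's sqlite3 (the fiddle's data isn't reproduced above, so the rows below are assumed values chosen to be consistent with the expected result):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE carts (id_customer INTEGER, id_cart INTEGER, date TEXT);
INSERT INTO carts VALUES
  (1, 101, '2016-01-01 00:00:00'),  -- first entry for customer 1: skipped
  (1, 102, '2017-11-12 12:41:16'),
  (2, 103, '2013-01-01 00:00:00'),  -- first entry for customer 2: skipped
  (2, 104, '2015-09-04 17:23:54'),
  (2, 105, '2014-06-05 02:43:42'),
  (3, 106, '2010-01-01 00:00:00'),  -- first entry for customer 3: skipped
  (3, 107, '2011-12-01 11:32:21');
""")

# Keep every row whose date is later than that customer's earliest date.
rows = conn.execute("""
    SELECT c.*
    FROM carts c
    WHERE c.date > (SELECT MIN(c2.date) FROM carts c2
                    WHERE c2.id_customer = c.id_customer)
    ORDER BY c.id_customer, c.id_cart
""").fetchall()
print([r[1] for r in rows])
```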

Detecting variations in a data set

I have a data set with this structure:
ContractNumber | MonthlyPayment | Duration | StartDate | EndDate
One contract number can occur many times, as this data set is a consolidation of different reports with the same structure.
Now I want to filter out the contract numbers whose MonthlyPayment, Duration, StartDate, or EndDate differ between rows.
Example (note that Contract Number is not a Primary key):
ContractNumber | MonthlyPayment | Duration | StartDate | EndDate
001 | 500 | 12 | 01.01.2015 | 31.12.2015
001 | 500 | 12 | 01.01.2015 | 31.12.2015
001 | 500 | 12 | 01.01.2015 | 31.12.2015
002 | 1500 | 24 | 01.01.2014 | 31.12.2017
002 | 1500 | 24 | 01.01.2014 | 31.12.2017
002 | 1500 | 24 | 01.01.2014 | 31.12.2018
With this sample data set, I would need a query that retrieves 002: 001 stays the same and does not change, but 002 changes over time.
Besides writing a VBA script running over an Excel file, I don't have any solid idea of how to solve this with SQL.
My first idea is an SQL approach with grouping, where identical values are grouped together but differing ones are not. I am currently experimenting with this one. My attempt so far:
1.) Have the usual table
2.) Create a second table / query with this structure:
ContractNumber | AVG(MonthlyPayment) | AVG(Duration) | AVG(StartDate) | AVG(EndDate)
Which I created with Grouping.
E.G.
Table 1.)
ContractNumber | MonthlyPayment
1 | 10
1 | 10
1 | 20
2 | 300
2 | 300
2 | 300
Table 2.)
ContractNumber | AVG(MonthlyPayment)
1 | 13.3
2 | 300
3.) Now I want to find the distinct contract numbers where the value (in this example only MonthlyPayment) does not equal the average; it should be equal, otherwise we have a variation which I need to find.
Do you have any idea how I could solve this? Otherwise I would start writing a VBA or Python script. I have the data set in CSV, so for now I could also do it with MySQL, Power BI or Excel.
I need to perform this analysis only once, so I don't need a full solution; the queries can be split into different steps.
Much appreciated! Thank you very much.
To find all contract numbers with differences, use:
select ContractNumber
from (
    select distinct ContractNumber, MonthlyPayment, Duration, StartDate, EndDate
    from MyTable
) x
group by ContractNumber
having count(*) > 1
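A quick way to check the DISTINCT-then-HAVING idea is to run it against the sample contracts. The sketch below uses SQLite via Python's sqlite3 (a stand-in for MySQL; dates are converted to ISO format, otherwise the query is the one above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MyTable (ContractNumber TEXT, MonthlyPayment INTEGER,
                      Duration INTEGER, StartDate TEXT, EndDate TEXT);
INSERT INTO MyTable VALUES
  ('001', 500, 12, '2015-01-01', '2015-12-31'),
  ('001', 500, 12, '2015-01-01', '2015-12-31'),
  ('001', 500, 12, '2015-01-01', '2015-12-31'),
  ('002', 1500, 24, '2014-01-01', '2017-12-31'),
  ('002', 1500, 24, '2014-01-01', '2017-12-31'),
  ('002', 1500, 24, '2014-01-01', '2018-12-31');  -- EndDate differs
""")

# Collapse exact duplicates first; any contract left with more than one
# distinct row must vary in at least one column.
rows = conn.execute("""
    SELECT ContractNumber FROM (
        SELECT DISTINCT ContractNumber, MonthlyPayment, Duration, StartDate, EndDate
        FROM MyTable
    ) x
    GROUP BY ContractNumber
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)
```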

MySQL Append-Only Model - Clean up query using DELETE

We're currently moving our tables over to an append-only model to increase write performance by avoiding UPDATE and DELETE, with a memcached front end for SELECTs.
All rows are timestamped, with the latest row being selected using MAX(timestamp).
This works well, although over time the table fills up with old, irrelevant data. We could write a simple
DELETE FROM table WHERE timestamp < XXXX
but that would also delete rows that simply haven't been updated in the last XX amount of time, removing that id from the table completely rather than just its old rows.
A very simple example schema and data to demonstrate is provided below
---------------------------
| id | INT |
| name | VARCHAR |
| timestamp | TIMESTAMP |
---------------------------
Initial data
-------------------------------------------
| id | name | timestamp |
-------------------------------------------
| 1 | Trevor | 1 |
| 2 | Mike | 1 |
-------------------------------------------
Should a user's name be updated, a row with the user's new name will be appended, not updated in place.
-------------------------------------------
| id | name | timestamp |
-------------------------------------------
| 1 | Trevor | 1 |
| 2 | Mike | 1 |
| 1 | Trev | 60 |
-------------------------------------------
Using a simple DELETE query to remove rows older than 60 seconds (a real case would be more like an hour or even a day) would delete the first Trevor row as intended, but it would also delete the only record of Mike.
-------------------------------------------
| id | name | timestamp |
-------------------------------------------
| 1 | Trev | 60 |
-------------------------------------------
We need it to only DELETE distinct ID rows which are older than XX, so we would be left with both users even though Mike hasn't updated his name and his timestamp is older than XX amount of time.
-------------------------------------------
| id | name | timestamp |
-------------------------------------------
| 2 | Mike | 1 |
| 1 | Trev | 60 |
-------------------------------------------
We could go through each ID, get the latest timestamp, then DELETE all rows older than that timestamp however as the table gets more users this process will take longer.
Is there any SQL query which could, preferably in one or 2 queries clean up the table as described above?
Thanks
I'm not an expert on MySQL but I believe this query should do the trick:
DELETE t1
FROM table1 AS t1, table1 AS t2
WHERE t1.id = t2.id
  AND t1.timestamp < t2.timestamp;
You could add those 60 minutes to t1.timestamp so it will only delete rows older than 60 minutes.
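MySQL's multi-table DELETE syntax isn't portable, but the same self-join logic can be phrased with EXISTS and tested in SQLite via Python's sqlite3 (a stand-in; in MySQL the query from the answer applies as written):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, name TEXT, timestamp INTEGER);
INSERT INTO table1 VALUES (1, 'Trevor', 1), (2, 'Mike', 1), (1, 'Trev', 60);
""")

# Drop any row that has a newer row for the same id; the newest row per id
# survives even when it is older than the cutoff.
conn.execute("""
    DELETE FROM table1
    WHERE EXISTS (SELECT 1 FROM table1 t2
                  WHERE t2.id = table1.id
                    AND t2.timestamp > table1.timestamp)
""")

rows = conn.execute("SELECT * FROM table1 ORDER BY id").fetchall()
print(rows)
```

Both Mike's only row and Trevor's latest row remain, which is the requested behavior.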

MySql subselects and counts based on counts

Here is model of my table structure. Three tables.
possibilities

| pid | category  |
|-----|-----------|
| 1   | animal    |
| 2   | vegetable |
| 3   | mineral   |

realities

| rid | pid | item  | status  |
|-----|-----|-------|---------|
| 1   | 1   | dog   | 1 (yes) |
| 2   | 1   | fox   | 1       |
| 3   | 1   | cat   | 1       |
| 4   | 2   | apple | 2 (no)  |
| 5   | 1   | mouse | 1       |
| 7   | 1   | bat   | 2       |

measurements

| mid | rid | meas | date       |
|-----|-----|------|------------|
| 1   | 1   | 3    | 2012-01-01 |
| 2   | 3   | 2    | 2012-01-05 |
| 3   | 1   | 13   | 2012-02-02 |
| 4   | 3   | 24   | 2012-02-15 |
| 5   | 2   | 5    | 2012-02-16 |
| 6   | 6   | 4    | 2012-02-17 |
What I'm after is a result showing a series of counts based on measurement ranges for a particular entry from the "possibilities" table, where the status of the related "realities" row is 1 (meaning it's currently being tracked); only the most recent measurement for each reality is relevant.
Here is an example result I'm looking for using animal as the possibility.
-----------------------
| 0-9 | 10-19 | 20-29 |
|---------------------|
| 2 | 1 | 1 |
-----------------------
So, in this example, the apple row was not counted because it isn't an animal, nor was bat because its status is set to no (meaning don't measure), and only the most recent measurements were used to determine the counts.
I currently have a workaround in my real-world use, but it doesn't follow good database normalization. In my realities table I have a current_meas column that gets updated whenever a new measurement is entered in the measurements table. Then I only need the first two tables, and a single SELECT statement with a bunch of embedded SUM statements that use an IF to test whether the value is between 0-9, for example. It gives me exactly what I want; however, my app has evolved to the point where this convenience has become a problem in other areas.
So, the question is: is there a more elegant way to do this in one statement? Subselects? Temporary tables? Getting the counts is the heart of what the app is about.
This is a PHP5, MySQL5, jQuery 1.8 based webapp, in case that gives me some more options. Thanks in advance. I love the stack and hope to help back as much as it has helped me.
Here's one approach
Create a temp table holding the most recent measurement per reality:
CREATE TEMPORARY TABLE RecentMeasurements
SELECT m.*
FROM Measurements m
INNER JOIN (SELECT rid, MAX(mid) AS max_id FROM Measurements GROUP BY rid) x
  ON x.max_id = m.mid;
then do your query:
SELECT <your counting logic>
FROM Realities r
INNER JOIN RecentMeasurements rm ON rm.rid = r.rid
WHERE r.status = 1 AND r.pid = 1;
Here is what I ended up doing based on the two answers suggested:
1. I created a temporary table of realities that belong to one possibility (animals) and whose status is 1 (yes).
2. I created a second temporary table that takes the individual realities from the first temp table and finds the most recent measurement for each one.
3. From this second table I do a SELECT that gives me the breakdown of counts in ranges.
When I tried it with just one temp table, the query would take 5-10 seconds per possibility. In my real-world use I currently have 30 possibilities (a script loops through each one and generates these temp tables and selects), well over 1,000 realities (600 active on any given day, 100 added per month), and over 21,000 measurements (20-30 added daily). That just wasn't working for me, so breaking it up into smaller pools to draw from got the whole report running in under 3-4 seconds.
Here is the MySQL stuff with my real-world table and column names.
//Delete the temporary tables in advance
$delete_np_prod = 'DROP TABLE IF EXISTS np_infreppool';
mysql_query($delete_np_prod) or die ("Drop NP Prod Error " . mysql_error ());
$delete_np_max = 'DROP TABLE IF EXISTS np_maxbrixes';
mysql_query($delete_np_max) or die ("Drop NP Max Error " . mysql_error ());
//Make a temporary table to hold the totes of this product at North Plains that are active
$create_np_prod_pool_statement = 'CREATE TEMPORARY TABLE np_infreppool
SELECT inf_row_id FROM infusion WHERE formid = ' . $active_formids["formid"] . ' AND location = 1 AND status = 1';
mysql_query($create_np_prod_pool_statement) or die ("Prod Error " . mysql_error ());
//Make a temporary table to hold the tote with its most recent brix value attached to it.
$create_np_maxbrix_pool_statement = 'CREATE TEMPORARY TABLE np_maxbrixes
SELECT b.inf_row_id AS inf_row_id, b.brix AS brix from brix b, np_infreppool pool WHERE b.inf_row_id = pool.inf_row_id AND b.capture_date = (SELECT max(capture_date) FROM brix WHERE inf_row_id = pool.inf_row_id )';
mysql_query($create_np_maxbrix_pool_statement) or die ("Brix Error " . mysql_error ());
//Get the counts for the selected form from NP
$get_report_np = "SELECT
SUM(IF(brix BETWEEN 0 AND 4,1,0)) as '0-4',
SUM(IF(brix BETWEEN 5 AND 9,1,0)) as '5-9',
SUM(IF(brix BETWEEN 10 AND 14,1,0)) as '10-14',
SUM(IF(brix BETWEEN 15 AND 19,1,0)) as '15-19',
SUM(IF(brix BETWEEN 20 AND 24,1,0)) as '20-24',
SUM(IF(brix BETWEEN 25 AND 29,1,0)) as '25-29',
SUM(IF(brix BETWEEN 30 AND 34,1,0)) as '30-34',
SUM(IF(brix BETWEEN 35 AND 39,1,0)) as '35-39',
SUM(IF(brix BETWEEN 40 AND 44,1,0)) as '40-44',
SUM(IF(brix BETWEEN 45 AND 49,1,0)) as '45-49',
SUM(IF(brix BETWEEN 50 AND 54,1,0)) as '50-54',
SUM(IF(brix BETWEEN 55 AND 59,1,0)) as '55-59',
SUM(IF(brix BETWEEN 60 AND 64,1,0)) as '60-64',
SUM(IF(brix BETWEEN 65 AND 69,1,0)) as '65-69',
SUM(IF(brix >=70, 1, 0)) as 'Over 70'
FROM np_maxbrixes";
$do_get_report_np = mysql_query($get_report_np);
$got_report_np = mysql_fetch_array($do_get_report_np);
UPDATE
I got it to work in a single SELECT statement without using temporary tables and it works faster. Using my sample schema above, here is how it looks.
SELECT
SUM(IF(m.meas BETWEEN 0 AND 4,1,0)) as '0-4',
SUM(IF(m.meas BETWEEN 5 AND 9,1,0)) as '5-9',
SUM(IF(m.meas BETWEEN 10 AND 14,1,0)) as '10-14',
SUM(IF(m.meas BETWEEN 15 AND 19,1,0)) as '15-19',
SUM(IF(m.meas BETWEEN 20 AND 24,1,0)) as '20-24',
SUM(IF(m.meas BETWEEN 25 AND 29,1,0)) as '25-29',
SUM(IF(m.meas BETWEEN 30 AND 34,1,0)) as '30-34',
SUM(IF(m.meas BETWEEN 35 AND 39,1,0)) as '35-39',
SUM(IF(m.meas BETWEEN 40 AND 44,1,0)) as '40-44',
SUM(IF(m.meas BETWEEN 45 AND 49,1,0)) as '45-49',
SUM(IF(m.meas BETWEEN 50 AND 54,1,0)) as '50-54',
SUM(IF(m.meas BETWEEN 55 AND 59,1,0)) as '55-59',
SUM(IF(m.meas BETWEEN 60 AND 64,1,0)) as '60-64',
SUM(IF(m.meas BETWEEN 65 AND 69,1,0)) as '65-69',
SUM(IF(m.meas >=70, 1, 0)) as 'Over 70'
FROM measurements m, realities r
WHERE r.status = 1 AND r.pid = " . $_GET['pid'] . " AND r.rid = m.rid AND m.date = (SELECT max(date) FROM measurements WHERE rid = r.rid)
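The single-statement version can be verified against the sample schema. The sketch below uses SQLite via Python's sqlite3 as a stand-in (SUM over a boolean expression replaces MySQL's SUM(IF(...)); note the sample measurements table lists mid 6 against rid 6, which has no matching reality, so rid 5 is assumed here to reproduce the question's expected 2/1/1 counts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE realities (rid INTEGER, pid INTEGER, item TEXT, status INTEGER);
INSERT INTO realities VALUES
  (1, 1, 'dog', 1), (2, 1, 'fox', 1), (3, 1, 'cat', 1),
  (4, 2, 'apple', 2), (5, 1, 'mouse', 1), (7, 1, 'bat', 2);

CREATE TABLE measurements (mid INTEGER, rid INTEGER, meas INTEGER, date TEXT);
-- mid 6 assigned to rid 5 (mouse); an assumption, see lead-in above.
INSERT INTO measurements VALUES
  (1, 1, 3, '2012-01-01'), (2, 3, 2, '2012-01-05'),
  (3, 1, 13, '2012-02-02'), (4, 3, 24, '2012-02-15'),
  (5, 2, 5, '2012-02-16'), (6, 5, 4, '2012-02-17');
""")

# Band counts over only the most recent measurement of each tracked animal.
row = conn.execute("""
    SELECT SUM(m.meas BETWEEN 0 AND 9)   AS "0-9",
           SUM(m.meas BETWEEN 10 AND 19) AS "10-19",
           SUM(m.meas BETWEEN 20 AND 29) AS "20-29"
    FROM measurements m
    JOIN realities r ON r.rid = m.rid
    WHERE r.status = 1 AND r.pid = 1
      AND m.date = (SELECT MAX(date) FROM measurements WHERE rid = r.rid)
""").fetchone()
print(row)
```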