Optimizing query that selects on the result of a group by - mysql

I have a table that contains pipeline job data. A pipeline is composed of many jobs that run independently, and each of them can finish at its own pace. Once a pipeline is finished, it is archived by setting one of its columns to 1. I want to get the list of jobs belonging to the pipelines whose jobs are all in the "Done" state.
Let's say that my table looks like (sample data shown):
mysql> select id, pipeline, archived, state from jobs where archived=0 limit 4;
+---------+-----------+----------+-------+
| id      | pipeline  | archived | state |
+---------+-----------+----------+-------+
| 8572387 | pipeline1 |        0 | Done  |
| 8572388 | pipeline1 |        0 | Done  |
| 8572389 | pipeline2 |        0 | Done  |
| 8572390 | pipeline2 |        0 | Fail  |
+---------+-----------+----------+-------+
4 rows in set (0.00 sec)
I managed to get the list of failed pipelines:
mysql> select distinct(pipeline) from jobs where archived=0 group by pipeline, state having state!='Done';
+-----------+
| pipeline |
+-----------+
| pipeline2 |
+-----------+
1 row in set (0.01 sec)
And I even managed to get the answer I'm looking for (real data shown):
select j1.id
from jobs j1
where j1.archived=0
and j1.pipeline not in ( select distinct(j2.pipeline)
from jobs j2
where j2.archived=0
group by j2.pipeline, j2.state having j2.state!='Done'
);
+---------+
| id      |
+---------+
| 8583200 |
| 8583201 |
| 8583202 |
| 8583203 |
.
.
.
| 8584305 |
| 8584306 |
+---------+
1107 rows in set (18.77 sec)
My issue is that the first query runs in 0.01s on the real data, but as soon as I wrap it in the outer select, the time goes up dramatically. This last query took 19s, with 2 failed pipelines out of a total of 4, each one having around 500 jobs.
When I run this against a full dataset with hundreds of pipelines... it takes far too long.
I'm sure it can be done a lot quicker, in less than 1s, but I cannot manage to get it right :-( Where is my query getting stuck?
For reference, the query plan is:
mysql> describe select j1.id from jobs j1 where j1.archived=0 and j1.pipeline not in (select distinct(j2.pipeline) from jobs j2 where j2.archived=0 group by j2.pipeline, j2.state having j2.state!='Done');
+----+--------------------+-------+------+---------------+----------+---------+-------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+------+---------------+----------+---------+-------+------+----------------------------------------------+
| 1 | PRIMARY | j1 | ref | archived | archived | 2 | const | 2306 | Using where |
| 2 | DEPENDENT SUBQUERY | j2 | ref | archived | archived | 2 | const | 2306 | Using where; Using temporary; Using filesort |
+----+--------------------+-------+------+---------------+----------+---------+-------+------+----------------------------------------------+
2 rows in set (0.00 sec)

You could rewrite it to something like this.
A combined index on (pipeline, archived, state) should speed this up.
The order of the index columns is vital and depends on the cardinality of your data, so play with it to see which order gives the best results:
SELECT j1.id
FROM jobs j1
WHERE j1.archived = 0
  AND NOT EXISTS (SELECT 1
                  FROM jobs j2
                  WHERE j2.pipeline = j1.pipeline
                    AND j2.archived = 0
                    AND j2.state != 'Done')
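For reference, the suggested index could be created like this (the index name is arbitrary):
-- Assumed composite index from the suggestion above; pick any name you like.
ALTER TABLE jobs ADD INDEX idx_pipeline_archived_state (pipeline, archived, state);
With such an index in place, the correlated NOT EXISTS lookup should be resolvable per pipeline from the index alone, instead of re-reading all non-archived rows for every outer row.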

Related

When inserting a new record in an existing table it goes up instead of down

I have an already-created table to which I want to add an extra row. When I add the extra row, it appears at the top of the result; I want that row at the bottom.
MariaDB [armydetails]> insert into armydetails values('r05','Shishir','Bhujel','Jhapa','9845678954','male','1978-6-7','1994-1-3','ran5','Na11088905433');
Query OK, 1 row affected (0.17 sec)
MariaDB [armydetails]> select * from armydetails;
+-------+---------+---------+-----------+------------+--------+------------+------------+--------+----------------+
| regNo | fName | lName | address | number | gender | DOB | DOJ | rankID | accountNo |
+-------+---------+---------+-----------+------------+--------+------------+------------+--------+----------------+
| r05 | Shishir | Bhujel | Jhapa | 9845678954 | male | 1978-06-07 | 1994-01-03 | ran5 | Na11088905433 |
| ro1 | Milan | Katwal | Dharan | 9811095122 | Male | 1970-01-03 | 1990-01-01 | ran1 | Na11984567823 |
| ro2 | Hari | Yadav | Kathmandu | 9810756436 | male | 1980-06-07 | 2000-05-06 | ran2 | Na119876678543 |
| ro3 | Khrisna | Neupane | Itahari | 9864578934 | male | 1980-02-02 | 2001-01-07 | ran3 | Na11954437890 |
| ro4 | Lalit | Rai | Damak | 9842376547 | male | 1989-05-09 | 2005-01-02 | ran4 | Na11064553221 |
+-------+---------+---------+-----------+------------+--------+------------+------------+--------+----------------+
5 rows in set (0.00 sec)
MariaDB [armydetails]>
The SQL:2011 publication of ISO/IEC 9075 says:
In general, rows in a table are unordered; however, rows in a table are ordered if the table is the result of a query expression that immediately contains an ORDER BY clause.
In a SQL database, there is no underlying, default ordering for records. A relational database basically stores a table as a bunch of unordered records.
When records are SELECTed without an ORDER BY clause, they come out in an undefined order that is in no way guaranteed to be consistent across subsequent queries (including the very same query being executed several times). This is true for MySQL and for other RDBMSs.
The only way to properly order records is to use an ORDER BY clause, like:
select * from armydetails order by regNo
Suggested reading: Tom Kyte's blog: Order in the Court!
You can simply add an ORDER BY clause to your statement as follows:
SELECT * FROM armydetails ORDER BY regNO DESC;

How to speed up Group by query

I have a MySQL query that takes 30 seconds to run. There are more than 3 million rows in the table.
Here is the db structure:
text (VARCHAR(64)),
kpi1 (INT),
kpi2 (INT),
position (DECIMAL),
date (DATE),
device (VARCHAR(32))
Here is the query :
select date, sum(kpi1), sum(kpi2) FROM `table_name` GROUP BY date ;
EXPLAIN gives me this result:
ID | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | extra
1 | SIMPLE | table_name | NULL | index | UNIQUE,DATE | DATE | 3 | NULL | 3316480 | 100.00 | NULL
I have an index on date.
Here is the result with profiling:
mysql> show profile for query 1;
+----------------------+-----------+
| Status | Duration |
+----------------------+-----------+
| starting | 0.000080 |
| checking permissions | 0.000011 |
| Opening tables | 0.000021 |
| init | 0.000023 |
| System lock | 0.000011 |
| optimizing | 0.000007 |
| statistics | 0.000021 |
| preparing | 0.000019 |
| Sorting result | 0.000007 |
| executing | 0.000005 |
| Sending data | 32.814836 |
| end | 0.000011 |
| query end | 0.000009 |
| closing tables | 0.000009 |
| freeing items | 0.000082 |
| cleaning up | 0.000013 |
+----------------------+-----------+
16 rows in set, 1 warning (0,00 sec)
Any ideas?
If the data for historical dates is static (as in, not changing because the date / activity is already done), then this is a perfect example of when to use a summary table. Create a separate table that holds nothing but the date and the aggregates you need. Populate it for all days prior to the current one, so that at the end of each day (for example via a daily scheduled job) you only insert the sums for the prior day. You could even include the count of records, something like:
insert into MyDailySummaryTable
  ( Date, kpi1Sum, kpi2Sum, numRecs )
select date,
       sum(kpi1) kpi1Sum,
       sum(kpi2) kpi2Sum,
       count(*)  numRecs
FROM `table_name`
where date < curdate()
GROUP BY date;
Then, for each day after the initial load:
insert into MyDailySummaryTable
  ( Date, kpi1Sum, kpi2Sum, numRecs )
select date,
       sum(kpi1) kpi1Sum,
       sum(kpi2) kpi2Sum,
       count(*)  numRecs
FROM `table_name`
where date = date_add( curdate(), interval -1 day )
GROUP BY date;
If your "date" field has timestamp information too, you may need to adjust the query to ignore the time portions.
Then, when trying to run your totals, you can just query from the MyDailySummaryTable directly and have instant results.
You could even expand the aggregate table to include counts per device, in case you ever want to find tracking info for one specific device too.
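To make the approach concrete, here is a rough sketch. The summary-table definition, its column types, and the event name are assumptions, and the MySQL event scheduler is only one way to run the nightly load (a cron job works just as well):
-- Assumed summary-table definition; column types are guesses based on the source table.
CREATE TABLE MyDailySummaryTable (
  Date    DATE   NOT NULL PRIMARY KEY,
  kpi1Sum BIGINT NOT NULL,
  kpi2Sum BIGINT NOT NULL,
  numRecs INT    NOT NULL
);

-- One possible nightly load: a scheduled event (requires event_scheduler=ON).
CREATE EVENT daily_kpi_rollup
  ON SCHEDULE EVERY 1 DAY
  STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 5 MINUTE
DO
  INSERT INTO MyDailySummaryTable (Date, kpi1Sum, kpi2Sum, numRecs)
  SELECT date, SUM(kpi1), SUM(kpi2), COUNT(*)
  FROM `table_name`
  WHERE date = CURDATE() - INTERVAL 1 DAY   -- or the time-aware range shown above
  GROUP BY date;

-- Reporting then reads the small summary table, optionally adding today's live rows:
SELECT Date, kpi1Sum, kpi2Sum FROM MyDailySummaryTable
UNION ALL
SELECT date, SUM(kpi1), SUM(kpi2)
FROM `table_name`
WHERE date >= CURDATE()
GROUP BY date;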

mysql query, different performance between = and IN

Why is there such a difference in execution time between these two queries, even though they retrieve roughly the same number of rows from the same table?
select cognome, nome, lingua, count(*)
from archivio.utente
where cognome in ('rossi','pecchia','pirono')
group by cognome, nome, lingua;
…
…
…
| Rossi | Mario | it | 1 |
| Pironi | Luigi | it | 1 |
| Pecchia | Fabio | it | 1 |
+----------------------+---------+--------+----------+
779 rows in set (0.03 sec)
select cognome, nome, lingua, count(*)
from archivio.utente
where nome='corrado'
group by cognome, nome, lingua;
…
…
…
| Rossi | Mario | it | 1 |
| Pironi | Luigi | it | 1 |
| Pecchia | Fabio | it | 1 |
+----------------------+---------+--------+----------+
737 rows in set (0.47 sec)
From the MySQL documentation on EXPLAIN join types:
https://dev.mysql.com/doc/refman/5.7/en/explain-output.html#explain-join-types
For the IN query (the range access type):
Only rows that are in a given range are retrieved, using an index to select the rows.
The key column in the output row indicates which index is used.
For the = query (the ALL access type):
A full table scan is done for each combination of rows.
So in one case all rows are read and compared, while in the other only a range of rows is read via the index.
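One way to verify this on your own data (a sketch; it assumes the table and column names from the question and that nome is currently not indexed):
-- Compare the access type EXPLAIN reports for each query.
EXPLAIN SELECT cognome, nome, lingua, COUNT(*)
FROM archivio.utente
WHERE cognome IN ('rossi','pecchia','pirono')
GROUP BY cognome, nome, lingua;

EXPLAIN SELECT cognome, nome, lingua, COUNT(*)
FROM archivio.utente
WHERE nome = 'corrado'
GROUP BY cognome, nome, lingua;

-- If the second EXPLAIN shows type = ALL, an index on nome should let it use ref access instead:
ALTER TABLE archivio.utente ADD INDEX idx_nome (nome);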

Optimizing / improving a slow mysql query - indexing? reorganizing?

First off, I've looked at several other questions about optimizing SQL queries, but I'm still unclear about what is causing the problem in my situation. I read a few articles on the topic as well and have tried implementing a couple of possible solutions, as I'll describe below, but nothing has yet worked or even made an appreciable dent in the problem.
The application is a nutrition tracking system - users enter the foods they eat and based on an imported USDA database the application breaks down the foods to the individual nutrients and gives the user a breakdown of the nutrient quantities on a (for now) daily basis.
Here's a PDF of the abbreviated database schema, and here it is as a (perhaps poor quality) JPG. I made this in OpenOffice - if there are suggestions for better ways to visualize a database, I'm open to suggestions on that front as well! The blue tables are directly from the USDA, and the green and black tables are ones I've made. I've omitted a lot of data in order to not clutter things up unnecessarily.
Here's the query I'm trying to run that takes a very long time:
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM
(SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (nutr_no <100000
OR nutr_no IN
(SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'))
) as listing
LEFT JOIN
(SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no ) as data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units
So I know that's rather complex - The first select gets a listing of all the nutrients that the user consumed within the given date range, and the second fills in all the quantities.
When I run them separately, the first query is really fast, but the second is slow and gets very slow when the date range gets large. The join makes the whole thing ridiculously slow. I know that the 'main' problem is the join between these two derived tables, and I could get rid of it and basically do the join by hand in PHP much faster, but I'm not convinced that's the whole story.
For example: for 1 month of data, the query takes about 8 seconds, which is slow but not completely terrible. Separately, the queries take ~0.01 and ~2 seconds respectively. 2 seconds still seems high to me.
If I try to retrieve a year's worth of data, it takes several (>10) minutes to run the whole query, which is problematic - the client-server connection sometimes times out, and in any case I don't want to sit there with a spinning 'please wait' icon. Mainly, I feel like there's a problem because it takes far more than 12x as long to retrieve 12x more information, when it should take at most about 12x as long if I were doing things right.
Here's the 'explain' for each of the slow queries: (the whole thing, and just the second half).
Whole thing:
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5053 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 4341 | |
| 4 | DERIVED | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 4 | DERIVED | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 4 | DERIVED | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 4 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 4 | DERIVED | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
| 2 | DERIVED | meals | range | day_ind | day_ind | 9 | NULL | 30 | Using where |
| 2 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | Using where |
| 3 | DEPENDENT SUBQUERY | nutr_rights | index_subquery | users_userid,nutr_def_nutr_no | nutr_def_nutr_no | 19 | func | 1 | Using index; Using where |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
10 rows in set (2.82 sec)
Second chunk (data):
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 1 | SIMPLE | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 1 | SIMPLE | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 1 | SIMPLE | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
5 rows in set (0.00 sec)
I've ANALYZEd all the tables involved in the query, and added an index on the datetime field that joins meals and food_entries; I called it 'day_ind'. I hoped that would accelerate things, but it didn't seem to make a difference. I also tried removing the 'sum' function, as I understand that having a function in the query can often mean a full table scan, which is obviously much slower. Unfortunately, removing the 'sum' didn't seem to make a difference either (well, about 3-5%, but not the order of magnitude that I'm looking for).
I would love any suggestions and will be happy to provide any more information you need to help diagnose and improve this problem. Thanks in advance!
There are a few access types of ALL in your EXPLAIN, which suggest full table scans and hence temporary tables being created. You could add the missing indexes if they are not there already.
Sorting and GROUP BY are usually the performance killers; you can adjust MySQL memory settings to avoid physical I/O to the temporary tables if you have extra memory available.
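For instance (a sketch; the values are placeholders, size them to the spare memory you actually have):
-- In-memory temporary tables are capped by the smaller of these two variables;
-- raising both can keep GROUP BY / filesort temp tables off disk.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SET SESSION tmp_table_size      = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;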
Lastly, try to make sure the data types of the join attributes match, i.e. that data.date_time and listing.date_time have the same type and format.
Hope that helps.
Okay, so I eventually figured out what I'm gonna end up doing. I couldn't make the 'data' query any faster - that's still the bottleneck. But now I've made it so the total query process is pretty close to linear, not exponential.
I split the query into two parts and made each one into a temporary table. Then I added an index for each of those temp tables and did the join separately afterwards. This made the total execution time for 1 month of data drop from 8 to 2 seconds, and for 1 year of data from ~10 minutes to ~30 seconds. Good enough for now, I think. I can work with that.
Thanks for the suggestions. Here's what I ended up doing:
create table listing (
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (
nutr_no <100000 OR nutr_no IN (
SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'
)
)
);
create table data (
SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no
);
create index joiner on data(nutr_no, date_time);
create index joiner on listing(nutr_no, date_time);
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM listing
LEFT JOIN data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units;
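If the intermediate tables are only needed for the duration of the session, a variant of the same idea (just a sketch) is to declare them as TEMPORARY and drop them afterwards, so they never collide between concurrent users:
-- Same approach, but the tables disappear with the connection.
CREATE TEMPORARY TABLE listing AS
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
  AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
  AND (nutr_no < 100000
       OR nutr_no IN (SELECT nutr_def_nutr_no
                      FROM nutr_rights
                      WHERE nutr_rights.users_userid = '2'));

ALTER TABLE listing ADD INDEX joiner (nutr_no, date_time);

-- ... build `data` the same way, run the join as above, then clean up:
DROP TEMPORARY TABLE IF EXISTS listing, data;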

Time complexity/MySQL performance analysis

Set up (MySQL):
create table inRelation(
party1 integer unsigned NOT NULL,
party2 integer unsigned NOT NULL,
unique (party1,party2)
);
insert into inRelation(party1,party2) values(1,2),(1,3),(2,3),(1,4),(2,5),(3,5),(1,6),(1,7),(2,7),(5,7);
mysql> select * from inRelation a
-> join inRelation b on a.party2=b.party1
-> join inRelation c on b.party2=c.party1
-> where a.party1=1 and c.party2=7;
+--------+--------+--------+--------+--------+--------+
| party1 | party2 | party1 | party2 | party1 | party2 |
+--------+--------+--------+--------+--------+--------+
|      1 |      2 |      2 |      5 |      5 |      7 |
|      1 |      3 |      3 |      5 |      5 |      7 |
+--------+--------+--------+--------+--------+--------+
2 rows in set (0.00 sec)
mysql> explain select * from inRelation a
-> join inRelation b on a.party2=b.party1
-> join inRelation c on b.party2=c.party1
-> where a.party1=1 and c.party2=7;
+----+-------------+-------+--------+---------------+--------+---------+---------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+--------+---------+---------------------+------+-------------+
| 1 | SIMPLE | b | index | party1 | party1 | 8 | NULL | 10 | Using index |
| 1 | SIMPLE | a | eq_ref | party1 | party1 | 8 | const,news.b.party1 | 1 | Using index |
| 1 | SIMPLE | c | eq_ref | party1 | party1 | 8 | news.b.party2,const | 1 | Using index |
+----+-------------+-------+--------+---------------+--------+---------+---------------------+------+-------------+
This is a BFS solution for my previous post:
Challenge,how to implement an algorithm for six degree of separation?
But what's the complexity of it? Suppose there are n records in total.
Assume there are N vertices and E edges. For every table there can be a join between every pair of vertices, and all the vertices need to be checked for equality, so the worst-case performance will be O(N + E).
Updated:
If you are considering MySQL, there are a lot of things that affect the complexity: if you have a primary-key index on the field, a B-tree index will be used; if it's a normal unclustered index, a hash index will be used. There are different costs for each of these data structures.
From your other question, I see these are your requirements:
1. Calculate the path from UserX to UserY
2. For UserX,calculate all users that is no more than 3 steps away.
For the first one, the best thing is to apply Dijkstra's algorithm, construct the result in Java, and then store it in a table. Note that adding every new node requires complete reprocessing.
Another solution would be to use recursive SQL, introduced in the SQL:1999 standard, to create a view containing the path from UserX to UserY. Let me know if you need some references for recursive queries.
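For illustration, recursive SQL in this spirit (a sketch; it needs MySQL 8.0+ for WITH RECURSIVE, the start node and depth cap are assumptions, and tracking the actual path text would need an extra column) could look like:
-- Parties reachable from party 1 in at most 3 steps, using the inRelation table above.
-- The depth cap keeps the recursion finite.
WITH RECURSIVE reachable (party, depth) AS (
    SELECT party2, 1
    FROM inRelation
    WHERE party1 = 1
  UNION ALL
    SELECT r.party2, reachable.depth + 1
    FROM reachable
    JOIN inRelation r ON r.party1 = reachable.party
    WHERE reachable.depth < 3
)
SELECT DISTINCT party FROM reachable;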
For the second one, the query you have written works perfectly.