Having trouble with a query. Here is the outline -
Table structure:
CREATE TABLE `world` (
`placeRef` int NOT NULL,
`forenameRef` int NOT NULL,
`surnameRef` int NOT NULL,
`incidence` int NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb3;
ALTER TABLE `world`
ADD KEY `surnameRef_forenameRef` (`surnameRef`,`forenameRef`),
ADD KEY `forenameRef_surnameRef` (`forenameRef`,`surnameRef`),
ADD KEY `forenameRef` (`forenameRef`,`placeRef`);
COMMIT;
This table contains data like the following and has over 600,000,000 rows:
placeRef  forenameRef  surnameRef  incidence
1         1            2           100
2         1            3           600
This represents the number of people with a given forename-surname combination in a place.
I would like to be able to query all the forenames that a surname is attached to, and then perform another search for the places where those forenames exist, with the summed incidence. For example: get all the forenames of people who have the surname "Smith"; then find everywhere those forenames occur, grouped by place and with the summed incidence. I can do this with the following query:
SELECT placeRef, SUM( incidence )
FROM world
WHERE forenameRef IN
(
SELECT DISTINCT forenameRef
FROM world
WHERE surnameRef = 214488
)
GROUP BY world.placeRef
However, this query takes about a minute to execute and will take more time if the surname being searched for is common.
The root problem is that performing a range query with a GROUP BY doesn't utilize the full index.
Any suggestions how the speed could be improved?
In my experience, if your query has a range condition (i.e. any kind of predicate other than = or IS NULL), the column for that condition is the last column in your index that can be used to optimize search, sort, or grouping.
In other words, suppose you have an index on columns (a, b, c).
The following uses all three columns. It is able to optimize the ORDER BY c, because all rows matching the specific values of a and b are by definition tied, and those matching rows are already in order by c, so the ORDER BY is a no-op.
SELECT * FROM mytable WHERE a = 1 AND b = 2 ORDER BY c;
But the next example only uses columns a, b. The ORDER BY needs to do a filesort, because the index is not in order by c.
SELECT * FROM mytable WHERE a = 1 AND b > 2 ORDER BY c;
A similar effect is true for GROUP BY. The following uses a, b for row selection, and it can also optimize the GROUP BY using the index, because the rows for each distinct value of c are guaranteed to be grouped together in the index. So it can count the rows for each value of c, and when it's done with one group, it is assured there will be no more rows later with that value of c.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
But the range condition spoils that. The rows for each value of c are not grouped together; they may be scattered across the different matching values of b.
SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
In this case, MySQL can't optimize the GROUP BY in this query. It must use a temporary table to count the rows per distinct value of c.
MySQL 8.0.13 introduced a new type of optimizer behavior, the Skip Scan Range Access Method. But as far as I know, it only applies to range conditions, not ORDER BY or GROUP BY.
It's still true that if you have a range condition, this spoils the index optimization of ORDER BY and GROUP BY.
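You can see the difference with EXPLAIN. A quick check, assuming the (a, b, c) index above (the exact Extra text varies by MySQL version):
EXPLAIN SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b = 2 GROUP BY c;
-- Extra typically shows "Using index": the GROUP BY rides on the index order
EXPLAIN SELECT c, COUNT(*) FROM mytable WHERE a = 1 AND b > 2 GROUP BY c;
-- Extra typically shows "Using temporary": the range on b spoils the GROUP BY optimization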
Unless I don't understand the task, it seems like this works:
SELECT placeRef, SUM( incidence )
FROM world
WHERE surnameRef = 214488
GROUP BY placeRef;
Give it a try.
It would benefit from a composite index in this order:
INDEX(surnameRef, placeRef, incidence)
Is incidence being updated a lot? If so, leave it off the index.
You should consider moving from MyISAM to InnoDB. It will need a suitable PK, probably
PRIMARY KEY(placeRef, surnameRef, forenameRef)
and it will take 2x-3x the disk space.
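A minimal sketch of that migration, assuming (placeRef, surnameRef, forenameRef) is unique so it can serve as the primary key (the secondary index name is arbitrary):
ALTER TABLE world
  ENGINE = InnoDB,
  ADD PRIMARY KEY (placeRef, surnameRef, forenameRef),
  ADD INDEX surnameRef_placeRef_incidence (surnameRef, placeRef, incidence);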
I have created the query below, but it takes a long time to fetch the result. I have combined two queries with a UNION statement. How can I optimize this query in MySQL?
SELECT COUNT(*) AS count
FROM (
    SELECT id, createdUser, patientName, patientFirstName, patientLastName, tagnames
    FROM vw_tagged_forms v1
    WHERE v1.tenant_id = 91
      AND CASE WHEN v1.patsiteId IS NOT NULL
               THEN v1.patsiteId IN (151,2937,1450,1430,2746,1431,1472,1438,2431,1428)
               ELSE v1.patsiteId IS NULL END
    GROUP BY COALESCE(`message_grp_id`, `id`)
    UNION
    SELECT id, createdUser, patientName, patientFirstName, patientLastName, tagnames
    FROM vw_tagged_forms_logs v2
    WHERE tenant_id = 91
      AND CASE WHEN v2.patsiteId IS NOT NULL
               THEN v2.patsiteId IN (151,2937,1450,1430,2746,1431,1472,1431)
               ELSE v2.patsiteId IS NULL END
) a
Simplification: I think that
CASE WHEN x IS NOT NULL
THEN x IN (...)
ELSE x IS NULL
END
can be simplified to either of these:
x IN (...) OR x IS NULL
or
COALESCE(x IN (...), TRUE)
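Applied to the first branch of the UNION, the rewrite would look something like this (a sketch; the second branch is analogous):
SELECT id, createdUser, patientName, patientFirstName, patientLastName, tagnames
FROM vw_tagged_forms v1
WHERE v1.tenant_id = 91
  AND ( v1.patsiteId IN (151,2937,1450,1430,2746,1431,1472,1438,2431,1428)
        OR v1.patsiteId IS NULL )
GROUP BY COALESCE(`message_grp_id`, `id`)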
Bug? (The IN lists are not the same; was this an oversight?)
INDEX (for performance): Add this composite index, with the columns in the order given:
INDEX(tenant_id, patsiteId)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.
DISTINCT, id, etc:
Are the id values different for the two tables? If they are AUTO_INCREMENT and the rows were independently INSERTed, I would expect them to be different. If, on the other hand, you copy a row from v1 to v2, then I would expect the ids to match.
That leads to the question of "what are you counting?" If the ids are expected to match, then SELECT only id. If they were independently generated, then leave id out of the SELECTs; keep the other 5 columns.
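A sketch of the two variants (WHERE clauses elided as ...; which one is right depends on how the ids are generated):
-- if the ids are expected to match between the two tables:
SELECT COUNT(*) AS count
FROM ( SELECT id FROM vw_tagged_forms WHERE ...
       UNION
       SELECT id FROM vw_tagged_forms_logs WHERE ... ) a;
-- if the ids were generated independently, leave id out:
SELECT COUNT(*) AS count
FROM ( SELECT createdUser, patientName, patientFirstName, patientLastName, tagnames
       FROM vw_tagged_forms WHERE ...
       UNION
       SELECT createdUser, patientName, patientFirstName, patientLastName, tagnames
       FROM vw_tagged_forms_logs WHERE ... ) a;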
I'm working with a MySql table. The table contains numeric values and unix date values only. Each row has a unique id, columns containing ids related to other parts of the db, totals (like downloads per day) and the date that the row was inserted. My queries need to get the latest date for each id combination, in order to get the downloads for that day. One row is inserted each day for each id combination, and the index spans all of the ids and the date.
I have found through testing that it is quicker to perform two queries to get the exact row I want under certain circumstances. I would like a second opinion about this.
Here is a scenario that is very fast and uses the index:
SELECT * FROM foo WHERE A = 1 AND B = 1 AND mydate BETWEEN 123456789 AND 134567890 ORDER BY mydate DESC LIMIT 1
(The index is A, B, mydate)
Here is one that is very slow and doesn't use the index:
SELECT * FROM foo WHERE A IN (1, 2) AND B = 1 AND mydate BETWEEN 123456789 AND 134567890 GROUP BY A, B ORDER BY mydate DESC
This returns the correct result but doesn't use the index and is very slow. In reality, this simple example might use the index, but something like A IN(1,2,3,4,5,6,7,8,....10000) AND B IN (1,2,3,4,5,... 10000) doesn't, and that's what I need to cater for.
Here is where it gets interesting.
The following uses the index and is very fast:
SELECT *, MAX(mydate) FROM foo WHERE A IN (1,2,3,4,5,6,7,8,....10000) AND B IN (1,2,3,4,5,6,7,8,....10000) AND mydate BETWEEN 123456789 AND 134567890 GROUP BY A, B
The rows returned contain each unique combination of ids and the MAX of mydate for each combination. But, the row returned for each combination isn't necessarily the one with the corresponding MAX(mydate), and therefore does not necessarily give the correct downloads of that day. The MAX value is the correct value for that specific combination though, so my second query can be specific and use the index. Assuming A was 1, B was 1 and the MAX(mydate) equalled 1235555555 for that specific id combination, then I can execute
SELECT * FROM foo WHERE A = 1 AND B = 1 AND mydate = 1235555555
This second query returns the specific row I want, uses the index and is therefore fast.
I do have to do a foreach with php, so there's a processing overhead there, but it's still significantly quicker than trying to get MySQL to do all the work.
Another benefit is that all of these simple queries execute as separate MySQL processes.
It just doesn't feel right, am I missing something?
I have a query like this:
DELETE FROM doublon WHERE id IN
( SELECT id FROM `doublon` WHERE `id` NOT IN
    ( SELECT id
      FROM `doublon`
      GROUP BY etablissement_id, amenities_id
      HAVING COUNT(etablissement_id) > 1 AND COUNT(amenities_id) > 1
      UNION
      SELECT id
      FROM `doublon`
      GROUP BY etablissement_id, amenities_id
      HAVING COUNT(etablissement_id) = 1 AND COUNT(amenities_id) = 1
    )
)
My table 'doublon' is structured like this:
id
etablissement_id
amenities_id
The table structure looks like this:
http://hpics.li/bbb5eda
I have 2 million rows and the query is too slow; it takes many hours.
Does anybody know how to optimize this query so that it executes faster?
SqlFiddle
Your query is not correct in the first place. But keep reading; it's possible that by the end of the answer I will have discovered the reason you need such a strange query.
Let's discuss the last subquery:
Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) = 1 and Count(amenities_id) = 1
You can use a column in the SELECT clause of a query that has GROUP BY only if at least one of the following happens:
it is present in the GROUP BY clause too;
it is used as an argument of an aggregate function;
the value of that column is functionally dependent on the values of the columns that are present in the GROUP BY clause; for example, if a column that has a UNIQUE index is present (or all the columns that are present in a UNIQUE index of the table).
The column id doesn't fit any of the cases above¹. This makes the query illegal according to the SQL specification.
MySQL, however, accepts it and struggles to produce a result set for it but it says in the documentation:
... the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
The HAVING clause contains Count(etablissement_id) and Count(amenities_id). When etablissement_id and amenities_id are both not-NULL then these two expressions have the same value and that is the same as COUNT(*) (the number of rows in the group). And it is always greater than 0 (a group cannot contain 0 rows).
For the groups generated when etablissement_id or amenities_id is NULL, the corresponding COUNT() returns 0. This also applies when both are NULL at the same time.
Using this information, this query returns the ids of rows whose combination (etablissement_id, amenities_id) is unique in the table (the groups contain only one row) and both fields are not NULL.
The other GROUP BY query (that is UNION-ed with this one) returns indeterminate values from the groups of rows whose combination (etablissement_id, amenities_id) is not unique in the table (and both fields are not NULL), as explained in the fragment quoted from the documentation.
It seems the UNION picks one (random) id from each group of (etablissement_id, amenities_id) where both etablissement_id and amenities_id are not-NULL. The outer SELECT intends to ignore the ids chosen by the UNION and provide to DELETE the rest of them.
(I think the intermediate SELECT is not even needed, you could use its WHERE clause in the DELETE query).
The only reason I could imagine you need to run this query is that table doublon is the correspondence table of a many-to-many relationship that was created without a UNIQUE index on (etablissement_id, amenities_id) (the FOREIGN KEY columns imported from the related tables).
If this is your intention then there are simpler ways to achieve this goal.
I would create a duplicate of the doublon table with the correct structure, then use an INSERT ... SELECT query with DISTINCT to copy the needed values from the old table. Then I would swap the tables and remove the old one.
The queries:
# Create the new table
CREATE TABLE `doublon_fixed` LIKE `doublon`;
# Add the needed UNIQUE INDEX
ALTER TABLE `doublon_fixed`
ADD UNIQUE INDEX `etablissement_amenities`(`etablissement_id`, `amenities_id`);
# Copy the needed values
INSERT INTO `doublon_fixed` (`etablissement_id`, `amenities_id`)
SELECT DISTINCT `etablissement_id`, `amenities_id`
FROM `doublon`;
# Swap the tables
RENAME TABLE `doublon` TO `doublon_old`, `doublon_fixed` TO `doublon`;
# Remove the old table
DROP TABLE `doublon_old`;
The RENAME query performs the renames atomically, from left to right. It is useful for avoiding downtime.
Notes:
¹ If the id column is functionally dependent on the (etablissement_id, amenities_id) pair then all the groups produced by the UNION-ed queries contain a single row. The first SELECT won't produce any result and the second SELECT will return the entire table.
If I am not wrong, this should work:
DELETE FROM doublon
WHERE id IN (SELECT id
FROM doublon
WHERE id NOT IN (SELECT id
FROM doublon
GROUP BY etablissement_id,
amenities_id
HAVING Count(etablissement_id) >= 1
AND Count(amenities_id) >= 1))
I have a MySQL table:
CREATE TABLE mytable (
id INT NOT NULL AUTO_INCREMENT,
other_id INT NOT NULL,
expiration_datetime DATETIME,
score INT,
PRIMARY KEY (id)
)
I need to run a query of the form:
SELECT * FROM mytable
WHERE other_id=1 AND expiration_datetime > NOW()
ORDER BY score LIMIT 10
If I add this index to mytable:
CREATE INDEX order_by_index
ON mytable ( other_id, expiration_datetime, score);
Would MySQL be able to use the entire order_by_index in the query above?
It seems like it should be able to, but then according to MySQL's documentation: "The index can also be used even if the ORDER BY does not match the index exactly, as long as all of the unused portions of the index and all the extra ORDER BY columns are constants in the WHERE clause."
The above passage seems to suggest that the index would only be used for a constant query, while mine is a range query.
Can anyone clarify whether the index would be used in this case? If not, is there any way I could force the use of the index?
Thanks.
MySQL will use the index to satisfy the where clause, and will use a filesort to order the results.
It can't use the index for the order by because you are not comparing expiration_datetime to a constant. Therefore, the rows being returned will not always all have a common prefix in the index, so the index can't be used for the sort.
For example, consider a sample set of 4 index records for your table:
a) [1,'2010-11-03 12:00',1]
b) [1,'2010-11-03 12:00',3]
c) [1,'2010-11-03 13:00',2]
d) [2,'2010-11-03 12:00',1]
If I run your query at 2010-11-03 11:00, it will return rows a, b, and c. In score order that is a, c, b, which is not the order in which they appear in the index. Thus MySQL needs to do an extra pass to sort the results and can't use the index for the ORDER BY in this case.
Can anyone clarify if index would be used in this case? If not, any way I could force the use of index?
You have a range in the filtering condition and an ORDER BY that does not match the range.
These conditions cannot be served with a single index.
To choose which index to create, you need to run these queries:
SELECT COUNT(*)
FROM mytable
WHERE other_id = 1
AND (score, id) <
(
SELECT score, id
FROM mytable
WHERE other_id = 1
AND expiration_datetime > NOW()
ORDER BY
score, id
LIMIT 9, 1 -- the 10th row; the row comparison requires a single-row subquery
)
and
SELECT COUNT(*)
FROM mytable
WHERE other_id = 1
AND expiration_datetime >= NOW()
and compare their outputs.
If the second query yields about the same number of rows as the first one, or more, then you should use an index on (other_id, score) (and let MySQL filter on expiration_datetime).
If the second query yields significantly fewer rows than the first one, you should use an index on (other_id, expiration_datetime) (and let MySQL sort on score).
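For reference, the two candidate indexes (the names are arbitrary):
-- serves the ORDER BY; MySQL filters on expiration_datetime after the index scan
CREATE INDEX idx_other_score ON mytable (other_id, score);
-- serves the range filter; MySQL then filesorts on score
CREATE INDEX idx_other_expiration ON mytable (other_id, expiration_datetime);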
This article might be interesting to you:
Choosing index
Sounds like you've already checked the documentation and set up the index. Use EXPLAIN and see...
EXPLAIN SELECT * FROM mytable
WHERE other_id=1 AND expiration_datetime > NOW()
ORDER BY score LIMIT 10
I'm getting performance problems when LIMITing a MySQL SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet on how the selection of the rows to show should be as compact as possible: just the ids. Producing the complete results should then fetch all the data you want for only the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statement, or if against hope it doesn't improve anything, it might be worthwhile to break this single statement into multiple statements and capture the ids into a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate indexing table if your table already has a primary key. If so, then you can order by this primary key and step through a page at a time (:PAGE_SIZE rows, starting after the last id seen):
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC LIMIT :PAGE_SIZE;
Another optimisation would be not to use SELECT * but just the ID so that it can simply read the index and doesn't have to then locate all the data (reduce IO overhead). If you need some of the other columns then perhaps you could add these to the index so that they are read with the primary key (which will most likely be held in memory and therefore not require a disc lookup) - although this will not be appropriate for all cases so you will have to have a play.
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. The problem had two parts to fix. First I had to use an inner select in my FROM clause that did my limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the FROM part of my query; the full builder call looked roughly like this:
$query = DB::query()
    ->select(
        'titles.id',
        'title_eisbns_concat.eisbns_concat',
        'titles.pub_symbol',
        'titles.title',
        'titles.subtitle',
        'titles.contributor1',
        'titles.publisher',
        'titles.epub_date',
        'titles.ebook_price',
        'publisher_licenses.id as pub_license_id',
        'license_types.shortname',
        $coversQuery
    )
    ->from($subQuery)
    ->leftJoin('titles', 't.id', '=', 'titles.id')
    ->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
    ->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
    ->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
    ->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id');
The first time I created this query I had used OFFSET and LIMIT in MySQL. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySQL hasn't sped up OFFSET, but BETWEEN seems to reel it back in.