Optimizing Select SQL request with millions of entries - mysql

I'm working on a table with around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key from another table). Here is my actual request:
SELECT * FROM billing bill
WHERE bill.billing_value not like 'not_ok%'
  AND (SELECT bill2.billing_id
       FROM billing bill2
       WHERE bill2.subscription_id = bill.subscription_id
       ORDER BY bill2.billing_id ASC LIMIT 1
      ) = bill.billing_id;
This request works correctly when I put a small LIMIT on it, but I cannot seem to run it over the whole table.
Is there a way I could optimise it somehow? Or do things in another way?
Table indexes and structure: (not included here)

This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
    select *, row_number() over (partition by subscription_id order by billing_id) as rownum
    from billing
    where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available in MySQL 8.0, so if you haven't upgraded yet, you must do so to use this function.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a virtual column with an index to optimize that condition:
alter table billing
    add column ok tinyint(1) as (billing_value not like 'not_ok%'),
    add index (ok);
select *
from (
    select *, row_number() over (partition by subscription_id order by billing_id) as rownum
    from billing
    where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, adding a column first_ok with an index, which is true only on the rows you need to fetch (the first row per subscription_id whose billing value does not start with 'not_ok'). But you must maintain this new column manually, and you risk it being wrong if you don't. This is a denormalized design, but one tailored to the query you want to run.
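For illustration, a rough sketch of that idea (the column name, types and the refresh statement are all assumptions; the flag has to be re-computed whenever billing rows change):
alter table billing
    add column first_ok tinyint(1) not null default 0,
    add index (first_ok);
-- mark the earliest non-'not_ok' billing row of each subscription
update billing b
join (
    select subscription_id, min(billing_id) as first_billing_id
    from billing
    where billing_value not like 'not_ok%'
    group by subscription_id
) f on f.first_billing_id = b.billing_id
set b.first_ok = 1;
Fetching the rows you need then becomes a simple indexed lookup: select * from billing where first_ok = 1;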

I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
                     from billing
                     group by subscription_id)
  and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer query gets the rest of the billing record.
If performance is an issue, I'd add the billing_id field to the third index, so you get an index on (subscription_id, billing_id). This will help the inner query.
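For example, something along these lines (the index name is just a placeholder):
alter table billing add index idx_subscription_billing (subscription_id, billing_id);
With that composite index, the group-wise min(billing_id) can be read straight out of the index.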

Related

Can I "Order" table "ranks" without locking it ? (don't care for errors)

I have a big table of users with "credits". Every few minutes I want to rank the users by credits.
The problem is that this operation takes time and locks the entire table.
Since I don't care if there is a temporary error in the ranks, is there a way to perform this operation without locking the table?
UPDATE users SET userrank = (@r := @r + 1) ORDER BY credits DESC
I wouldn't recommend storing the rank in the table itself. This is derived information that can easily be computed on the fly when needed. Maintaining such information is also expensive: every time the table is modified (updated, inserted into, or deleted from), you potentially need to re-rank all the rows.
An alternative option is to use a view. If you are running MySQL 8.0, you can use window functions:
create view v_users as
select u.*, rank() over(order by credits) rn
from users u
Note that rank() assigns the same number to ties.
In earlier versions, one alternative is a correlated subquery (user variables are not supported in views):
create view v_users as
select u.*, 1 + (select count(*) from users u1 where u1.credits > u.credits) rn
from users u
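Either way, you then read the rank from the view whenever you need it, for example:
select * from v_users order by rn;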
How many columns are in that table? Are you running other queries on that table that don't need 'rank' and are being blocked? If so, consider having a separate table with just 3 columns -- userid, rank, credits.
That way the blocking query would work on a smaller table (hence somewhat faster) and would not block as many things.
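A rough sketch of that split (column types are assumptions; the ranking UPDATE then only touches the narrow table):
create table user_ranks (
    userid int not null primary key,
    credits int not null,
    userrank int default null
);
-- refresh the narrow table every few minutes, then rank only that table
replace into user_ranks (userid, credits)
    select userid, credits from users;
set @r := 0;
update user_ranks set userrank = (@r := @r + 1) order by credits desc;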
Please provide SHOW CREATE TABLE; there may be other tips to help with your problem.

SUM of differences between selective rows in table

I have a table with call records. Each call has a 'state' CALLSTART and CALLEND, and each call has a unique 'callid'. Also for each record there is a unique autoincrement 'id'. Each row has a MySQL TIMESTAMP field.
In a previous question I asked for a way to calculate the total of seconds of phone calls. This came to this SQL:
SELECT SUM(TIME_TO_SEC(differences))
FROM
(
    SELECT SEC_TO_TIME(TIMESTAMPDIFF(SECOND, MIN(timestamp), MAX(timestamp))) as differences
    FROM table
    GROUP BY callid
) x
Now I would like to know how to do this, only for callid's that also have a row with the state CONNECTED.
Screenshot of table: http://imgur.com/gmdeSaY
Use a having clause:
SELECT SUM(difference)
FROM (SELECT callid, TIMESTAMPDIFF(SECOND, MIN(timestamp), MAX(timestamp)) as difference
      FROM table
      GROUP BY callid
      HAVING SUM(state = 'Connected') > 0
     ) c;
If you only want the difference in seconds, I simplified the calculation a bit.
EDIT: (for Mihai)
If you put in:
HAVING state in ('Connected')
Then the value of state comes from an arbitrary row for each callid. Not all the rows, just an arbitrary one. You might or might not get lucky. As a general rule, avoid using the MySQL extension that allows "bare" columns in the select and having clauses, unless you really use the feature intentionally and carefully.
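If you want a condition that reads more like the IN version but stays deterministic, aggregate the state instead of using it bare, e.g.:
HAVING MAX(state = 'Connected') = 1
which looks at every row in the group rather than an arbitrary one.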

How to query a table with over 200 million rows?

I have a table USERS with only one column, USER_ID. There are more than 200M of these IDs; they are not consecutive and are not ordered. The table has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google BigQuery, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) What is the row number for a particular USER_ID (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It runs fast, but it returns row=1 because the counter only runs over the rows matched by the WHERE clause.
SELECT USER_ID, @row := @row + 1 as row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to give a result. Why? Isn't it supposed to be fast if there's an index?
In Big Query I didn't find a way, because LIMIT init, num_rows doesn't exist there.
I could order the table into a new one and add a column called RANK that numbers the USER_IDs, with an INDEX on it. But it would be a mess if I wanted to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
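For instance:
EXPLAIN SELECT count(user_id) FROM USERS WHERE USER_ID <= 100001366260516;
It should report a range access on USER_ID_INDEX rather than a full table scan.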
For (2). Your question: "Why? Isn't it supposed to be fast if it has an index?". Yes, it will use the index. But then it has to count up to row 150,000,000,000 with an index scan. Hmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast, and it will take time - more time still if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
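For example, a rough sketch of that (the row_num column is an assumption, and it has to be rebuilt whenever USER_IDs are inserted or deleted):
ALTER TABLE USERS ADD COLUMN row_num BIGINT DEFAULT NULL, ADD INDEX (row_num);
SET @r := 0;
UPDATE USERS SET row_num = (@r := @r + 1) ORDER BY USER_ID;
-- (1) row number for a given USER_ID and (2) USER_ID at a given row both become index lookups
SELECT row_num FROM USERS WHERE USER_ID = 100001366260516;
SELECT USER_ID FROM USERS WHERE row_num = 150000000;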

How to avoid running the same expensive MySql query twice in pagination?

Assume a houses table with lots of fields, a related images table, and 3 other related tables. I have an expensive query that retrieves all the houses data, with all data from the related tables. Do I need to run the same expensive MySql query twice in the case of pagination: once for the current result page and once to get the total number of records?
I'm using server-side pagination with Limit 0,10, and need to return the total number of houses along with the data. It doesn't make sense to me to run the same expensive query with the count(*) function, just because I'm limiting the result-set for pagination.
Is there another way to instruct MySQL to count the whole query, but bring back only the current pagination data?
I hope my question is clear...
thanks
I don't know MySql but for many dbs, I think you'll find that the cost of running it twice isn't as high as you'd suspect - if you do it in such a way that the db's optimization engine sees the two queries as having a lot in common.
Running
select count(1) from (
    select some_fields, row_number() over (order by field) as rownum
    from some_table
) t
and then
select * from (
    select some_fields, row_number() over (order by field) as rownum
    from some_table
) t
where rownum between :startRow and :endRow
order by rownum
This also has the advantage of you being able to maintain the query in just one place with two different wrappers around it, 1 for paging and 1 for getting the total count.
Just as a side note, the best optimization you can do is make sure you send the exact same query to the db every time. In other words, if the user can change the sort or change what fields they can query on, bake it all into the same query. E.g:
select some_fields,
       case
           when :sortField = 'ID' and :sortType = 'asc'
               then row_number() over (order by id)
           when :sortField = 'ID' and :sortType = 'desc'
               then row_number() over (order by id desc)
       end as rownum
from some_table
where (:searchType = 'name'
       and last_name like :lastName and first_name like :firstName)
   or (:searchType = 'customerType'
       and customer_type = :customer_type)
cfquery has a recordcount variable that might be useful. You can also use the startrow and maxrows attributes of cfoutput to control how many records get displayed. Finally, you can cache the query results in coldfusion so you don't have to run it against the database each time.

How can I speed up a MySQL query with a large offset in the LIMIT clause?

I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
    seq_no int not null auto_increment,
    id int not null,
    primary key(seq_no),
    unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet saying that the selection of the rows to show should be as compact as possible: just the ids. Producing the complete result should in turn fetch all the data you want, but only for the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.*
from table A
inner join (select id from table order by whatever limit m, n) B
        on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statement, or if, against hope, it doesn't improve anything, it might be worthwhile to break this single statement into multiple statements and capture the ids in a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate index if your table already has one. If so, then you can order by this primary key and then use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID, so that it can simply read the index and doesn't then have to locate all the data (reducing IO overhead). If you need some of the other columns, then perhaps you could add them to the index so that they are read together with the primary key (which will most likely be held in memory and therefore not require a disc lookup) - although this will not be appropriate for all cases, so you will have to have a play.
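For example, assuming you also need a name column (purely a placeholder), a covering index lets the paging query be answered from the index alone:
ALTER TABLE myBigTable ADD INDEX idx_id_name (id, name);
-- satisfied entirely from the index, no row lookups needed
SELECT id, name FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC LIMIT 10;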
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. There were two parts to the fix. First I had to use an inner select in my FROM clause that did the limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the from part of my query:
$results = DB::query()
    ->select(
        'titles.id',
        'title_eisbns_concat.eisbns_concat',
        'titles.pub_symbol',
        'titles.title',
        'titles.subtitle',
        'titles.contributor1',
        'titles.publisher',
        'titles.epub_date',
        'titles.ebook_price',
        'publisher_licenses.id as pub_license_id',
        'license_types.shortname',
        $coversQuery // raw select expression built elsewhere
    )
    ->from($subQuery)
    ->leftJoin('titles', 't.id', '=', 'titles.id')
    ->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
    ->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
    ->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
    ->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id')
    ->get();
The first time I created this query I had used OFFSET and LIMIT in MySql. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySql hasn't sped up OFFSET, but BETWEEN seems to reel it back in.
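For reference, the raw SQL shape this builds is roughly the following (column list trimmed; :startId and :endId are the page bounds computed in the application):
select titles.*, license_types.shortname
from ( select id from titles where id between :startId and :endId order by title ) as t
left join titles on titles.id = t.id
left join organizations on organizations.symbol = titles.pub_symbol
left join title_eisbns_concat on title_eisbns_concat.title_id = titles.id
left join publisher_licenses on publisher_licenses.org_id = organizations.id
left join license_types on license_types.id = publisher_licenses.license_type_id;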