I have a table USERS with only one column, USER_ID. There are more than 200M IDs; they are not consecutive and they are not ordered. The column has an index, USER_ID_INDEX. I have the DB in MySQL and also in Google BigQuery, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) What is the row number for a particular USER_ID (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It runs fast but it returns row=1, because the counting applies only to the filtered result set.
SELECT USER_ID, @row := @row + 1 AS row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) t WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to return a result. Why? Isn't it supposed to be fast if it has an index?
In BigQuery, I didn't find a way, because LIMIT offset, num_rows doesn't even exist.
I could order the table into a new one and add a column called RANK that numbers the USER_IDs, with an INDEX on it. But it will be a mess if I want to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2). Your question: "Why? Isn't it supposed to be fast if it has an index?". Yes, it will use the index. But then it has to count up to row 150,000,000,000 with an index scan. Hmmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It will take time, and more time still if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
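A minimal sketch of that idea, using a separate lookup table (the table and column names here are assumptions, and the table has to be rebuilt whenever rows are added or removed):

CREATE TABLE USERS_RANKED (
    user_rank BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    USER_ID   BIGINT NOT NULL,
    UNIQUE KEY (USER_ID)
);

-- populate in USER_ID order so user_rank matches the ordered row number
INSERT INTO USERS_RANKED (USER_ID)
SELECT USER_ID FROM USERS ORDER BY USER_ID;

-- (1) row number for a given USER_ID
SELECT user_rank FROM USERS_RANKED WHERE USER_ID = 100001366260516;

-- (2) USER_ID at a given row
SELECT USER_ID FROM USERS_RANKED WHERE user_rank = 150000000;

Both lookups then become index lookups instead of scans.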
Related
Running a query in SQL takes a lot of time.
There are 240,000,000 total rows and 7,700,000 unique rows.
I'm trying to find the users whose average daily step count is between 3000 and 4000.
select count(distinct user_id)
from (SELECT user_id, ROUND(AVG(IF(steps>'0',steps,NULL)),0) AS `Average Steps`
      FROM `step_activity`.`step_activities`
      WHERE user_id between '1100001' and '9999999'
      GROUP BY user_id
      HAVING `Average Steps` between '3000' and '4000') as custlt3k;
I just want to know the total number of users.
The GROUP BY in the derived table (inner query) makes user_id distinct. Hence, the DISTINCT is not needed. Change it to simply COUNT(*).
Please provide SHOW CREATE TABLE so we can see whether you have an index starting with user_id, which might be beneficial. Also, how many rows are in step_activities and how many rows have user_id between '1100001' AND '9999999'? Comparing those will determine whether the index will even be used.
The task requires all the rows, at least the rows with that range of users, to be read, and "grouped".
This index may help: INDEX(user_id, steps) because that would be a "covering" index.
Another thing to consider -- don't store any rows with steps = 0. After all, they are being thrown out in this query. (Maybe there are other columns in the row that you need to keep?)
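Putting those suggestions together, a rewritten query could look like this (the index name is an assumption):

ALTER TABLE step_activities ADD INDEX idx_user_steps (user_id, steps);

SELECT COUNT(*) AS total_users
FROM (SELECT user_id
      FROM step_activities
      WHERE user_id BETWEEN 1100001 AND 9999999
      GROUP BY user_id
      HAVING ROUND(AVG(IF(steps > 0, steps, NULL)), 0) BETWEEN 3000 AND 4000
     ) AS per_user;

The covering index lets the GROUP BY read only the index instead of the full rows.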
I'm working on a table containing around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key from another table). Here is my actual query:
SELECT * FROM billing bill WHERE bill.billing_value not like 'not_ok%'
AND
(SELECT bill2.billing_id
FROM billing bill2
WHERE bill2.subscription_id = bill.subscription_id
ORDER BY bill2.billing_id ASC LIMIT 1
)= bill.billing_id;
This query works correctly when I put a small LIMIT on it, but I cannot seem to run it for the whole database.
Is there a way I could optimise it somehow? Or do things in another way?
Table structure and indexes (screenshots omitted):
This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available in MySQL 8.0, so if you haven't upgraded yet, you must do so to use this function.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a generated (virtual) column with an index to optimize that condition:
alter table billing
add column ok tinyint(1) as (billing_value not like 'not_ok%'),
add index (ok);
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, adding a column first_ok with an index, which is true only on the rows you need to fetch (the first row per subscription_id whose billing value does not start with 'not_ok'). But you must maintain this new column manually, and you risk it being wrong if you don't. This is a denormalized design, but tailored to the query you want to run.
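A rough sketch of that denormalization (the column and index names are assumptions; the backfill below is only an illustration, and keeping the flag correct on later inserts and deletes is up to the application):

ALTER TABLE billing
    ADD COLUMN first_ok TINYINT(1) NOT NULL DEFAULT 0,
    ADD INDEX (first_ok);

-- backfill: flag the first billing row per subscription when its value is ok
UPDATE billing b
JOIN (SELECT subscription_id, MIN(billing_id) AS first_id
      FROM billing
      GROUP BY subscription_id) f ON b.billing_id = f.first_id
SET b.first_ok = (b.billing_value NOT LIKE 'not_ok%');

-- the original query then becomes a simple index lookup
SELECT * FROM billing WHERE first_ok = 1;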
I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
from billing
group by subscription_id)
and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer query gets the rest of the billing record.
If performance is an issue, I'd add the billing_id field to the third index, so you get an index on (subscription_id, billing_id). This will help the inner query.
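For reference, that index could be added like this (the index name is an assumption):

ALTER TABLE billing ADD INDEX idx_subscription_billing (subscription_id, billing_id);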
I have a simple table (MySQL, MyISAM):
EDIT: This is a 50M-record table; adding a new index isn't really something we can do.
actions (PRIMARY action_id, user_id, item_id, created_at)
Indices:
action_id (action_id, PRIMARY)
user (user_id)
item_user (user_id, item_id)
created_user (user_id, created_at)
And the query:
SELECT count(distinct item_id) as c_c from actions where user_id = 1
The explain:
1 SIMPLE action ref user_id,created_user user_id 4 const 1415
This query takes around 7 seconds to run for users with over 1k entries. Any way to improve this?
I've tried the following and they are all worse:
SELECT count(*) from actions where user_id =1 group by item_id
SELECT count(item_id) from actions USE INDEX (item_user) where user_id = 1 group by item_id
Can you test the following:
SELECT count(*) as c_c
from (
SELECT distinct item_id
from actions where user_id = 1
) as T1
If you're using PHP, you can simplify your query to the following:
SELECT distinct item_id
FROM actions
WHERE user_id = 1
and then use mysql_num_rows to get the number of rows in your result.
Another option you could try, although it requires more work, is to:
1- Create another table that will hold the total number of items found for each user_id. That means a table with two columns: one is the user_id and the other is the total of items found in your previous table.
2- Schedule a job to run, every hour for instance, that updates the table with the totals from the `actions` table (a sketch follows this list). At that point you can just query your newly created table like this:
SELECT total
FROM actions_total
WHERE user_id = 1
This will be much faster when you need your final result because you're dealing with a single row instead of thousands. The drawback here is that you may not get an accurate result, depending on how often you run your job.
3- In case you decide not to use a job, you can still use the newly created table, but you will need to update (increment/decrement) its total each time you insert into or delete from the `actions` table.
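Here is a minimal sketch of that summary-table idea (the table and column names are assumptions, matching the actions_total query above):

CREATE TABLE actions_total (
    user_id INT NOT NULL PRIMARY KEY,
    total   INT NOT NULL DEFAULT 0
);

-- scheduled job (option 2): rebuild the per-user totals from the actions table
REPLACE INTO actions_total (user_id, total)
SELECT user_id, COUNT(DISTINCT item_id)
FROM actions
GROUP BY user_id;

After that, the single-row SELECT on actions_total shown above is all the application needs at read time.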
N.B: Just trying to help
How can I SELECT the last row in a MySQL table?
I'm INSERTing data and I need to retrieve a column value from the previous row.
There's an auto_increment in the table.
If you want the last of all the rows in the table, then this is finally the time where MAX(id) is the right answer! Kind of:
SELECT fields FROM table ORDER BY id DESC LIMIT 1;
Keep in mind that tables in relational databases are just sets of rows. And sets in mathematics are unordered collections. There is no first or last row; no previous row or next row.
You'll have to sort your set of unordered rows by some field first, and then you are free to iterate through the resultset in the order you defined.
Since you have an auto incrementing field, I assume you want that to be the sorting field. In that case, you may want to do the following:
SELECT *
FROM your_table
ORDER BY your_auto_increment_field DESC
LIMIT 1;
See how we're first sorting the set of unordered rows by the your_auto_increment_field (or whatever you have it called) in descending order. Then we limit the resultset to just the first row with LIMIT 1.
You can combine the two queries suggested by @spacepille into a single query that looks like this:
SELECT * FROM `table_name` WHERE id=(SELECT MAX(id) FROM `table_name`);
It should work blazing fast, but on InnoDB tables it's a fraction of a millisecond slower than ORDER BY + LIMIT.
On tables with many rows, two queries are probably faster...
SELECT @last_id := MAX(id) FROM table;
SELECT * FROM table WHERE id = @last_id;
Almost every database table has an auto_increment column (generally id).
If you want the last of all the rows in the table,
SELECT columns FROM table ORDER BY id DESC LIMIT 1;
OR
You can combine two queries into a single query that looks like this:
SELECT columns FROM table WHERE id=(SELECT MAX(id) FROM table);
Or keep it simple and use PDO::lastInsertId:
http://php.net/manual/en/pdo.lastinsertid.php
Many answers here say the same (order by your auto increment), which is OK, provided you have an autoincremented column that is indexed.
On a side note, if you have such a field and it is the primary key, there is no performance penalty for using ORDER BY versus SELECT MAX(id). The primary key is how the data is ordered in the database files (for InnoDB at least), and the RDBMS knows where that data ends, so it can optimize "order by id + limit 1" to be the same as reaching for MAX(id).
Now the road less traveled is when you don't have an autoincremented primary key. Maybe the primary key is a natural key, which is a composite of 3 fields...
Not all is lost, though. From a programming language you can first get the number of rows with
SELECT Count(*) - 1 AS rowcount FROM <yourTable>;
and then use the obtained number in the LIMIT clause
SELECT * FROM orderbook2
LIMIT <number_from_rowcount>, 1
Unfortunately, MySQL will not allow a sub-query or a user variable in the LIMIT clause.
If you want the most recently added one, add a timestamp and select ordered in reverse order by highest timestamp, limit 1. If you want to go by ID, sort by ID. If you want to use the one you JUST added, use mysql_insert_id.
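For example (the table and timestamp column names here are assumptions):

SELECT * FROM your_table ORDER BY created_at DESC LIMIT 1;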
You can use an OFFSET in a LIMIT command:
SELECT * FROM aTable LIMIT 1 OFFSET 99
If your table has 100 rows, this returns the last row without relying on a primary key.
Without ID in one query:
SELECT * FROM table_name LIMIT 1 OFFSET (SELECT COUNT(*) - 1 FROM table_name)
SELECT * FROM adds where id=(select max(id) from adds);
This query is used to fetch the last record in your table.
I'm getting performance problems when LIMITing a mysql SELECT with a large offset:
SELECT * FROM table LIMIT m, n;
If the offset m is, say, larger than 1,000,000, the operation is very slow.
I do have to use limit m, n; I can't use something like id > 1,000,000 limit n.
How can I optimize this statement for better performance?
Perhaps you could create an indexing table which provides a sequential key relating to the key in your target table. Then you can join this indexing table to your target table and use a where clause to more efficiently get the rows you want.
#create table to store sequences
CREATE TABLE seq (
seq_no int not null auto_increment,
id int not null,
primary key(seq_no),
unique(id)
);
#create the sequence
TRUNCATE seq;
INSERT INTO seq (id) SELECT id FROM mytable ORDER BY id;
#now get 1000 rows from offset 1000000
SELECT mytable.*
FROM mytable
INNER JOIN seq USING(id)
WHERE seq.seq_no BETWEEN 1000000 AND 1000999;
If records are large, the slowness may be coming from loading the data. If the id column is indexed, then just selecting it will be much faster. You can then do a second query with an IN clause for the appropriate ids (or could formulate a WHERE clause using the min and max ids from the first query.)
slow:
SELECT * FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
fast:
SELECT id FROM table ORDER BY id DESC LIMIT 10 OFFSET 50000
SELECT * FROM table WHERE id IN (1,2,3...10)
There's a blog post somewhere on the internet on how you should best do this: make the selection of the rows to show as compact as possible (just the ids), and then produce the complete results by fetching all the data you want for only the rows you selected.
Thus, the SQL might be something like (untested, I'm not sure it actually will do any good):
select A.* from table A
inner join (select id from table order by whatever limit m, n) B
on A.id = B.id
order by A.whatever
If your SQL engine is too primitive to allow this kind of SQL statements, or it doesn't improve anything, against hope, it might be worthwhile to break this single statement into multiple statements and capture the ids into a data structure.
Update: I found the blog post I was talking about: it was Jeff Atwood's "All Abstractions Are Failed Abstractions" on Coding Horror.
I don't think there's any need to create a separate index if your table already has a primary key. If so, then you can order by this primary key and then use values of the key to step through:
SELECT * FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC;
Another optimisation would be not to use SELECT * but just the ID, so that it can simply read the index and doesn't have to then locate all the data (reducing IO overhead). If you need some of the other columns, then perhaps you could add these to the index so that they are read with the primary key (which will most likely be held in memory and therefore not require a disc lookup), although this will not be appropriate for all cases, so you will have to have a play.
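For instance, a covering index could look like this (the extra column names are assumptions; include only the columns your query actually needs):

ALTER TABLE myBigTable ADD INDEX idx_id_covering (id, name, created_at);
SELECT id, name, created_at FROM myBigTable WHERE id > :OFFSET ORDER BY id ASC LIMIT 1000;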
Paul Dixon's answer is indeed a solution to the problem, but you'll have to maintain the sequence table and ensure that there are no row gaps.
If that's feasible, a better solution would be to simply ensure that the original table has no row gaps, and starts from id 1. Then grab the rows using the id for pagination.
SELECT * FROM table A WHERE id >= 1 AND id <= 1000;
SELECT * FROM table A WHERE id >= 1001 AND id <= 2000;
and so on...
I ran into this problem recently. There were two parts to the fix. First, I had to use an inner select in my FROM clause that did my limiting and offsetting for me on the primary key only:
$subQuery = DB::raw("( SELECT id FROM titles WHERE id BETWEEN {$startId} AND {$endId} ORDER BY title ) as t");
Then I could use that as the FROM part of my query (the columns listed below are the arguments to the query builder's select() call):
'titles.id',
'title_eisbns_concat.eisbns_concat',
'titles.pub_symbol',
'titles.title',
'titles.subtitle',
'titles.contributor1',
'titles.publisher',
'titles.epub_date',
'titles.ebook_price',
'publisher_licenses.id as pub_license_id',
'license_types.shortname',
$coversQuery
)
->from($subQuery)
->leftJoin('titles', 't.id', '=', 'titles.id')
->leftJoin('organizations', 'organizations.symbol', '=', 'titles.pub_symbol')
->leftJoin('title_eisbns_concat', 'titles.id', '=', 'title_eisbns_concat.title_id')
->leftJoin('publisher_licenses', 'publisher_licenses.org_id', '=', 'organizations.id')
->leftJoin('license_types', 'license_types.id', '=', 'publisher_licenses.license_type_id')
The first time I created this query I had used OFFSET and LIMIT in MySQL. This worked fine until I got past page 100; then the offset started getting unbearably slow. Changing that to BETWEEN in my inner query sped it up for any page. I'm not sure why MySQL hasn't sped up OFFSET, but BETWEEN seems to reel it back in.