SQL query optimization in MySQL Workbench

Running this query in MySQL takes a lot of time.
There are 240,000,000 total rows and 7,700,000 unique user_ids.
I am trying to count the users whose average daily step count is between 3000 and 4000.
SELECT COUNT(DISTINCT user_id)
FROM ( SELECT user_id,
              ROUND(AVG(IF(steps > '0', steps, NULL)), 0) AS `Average Steps`
       FROM `step_activity`.`step_activities`
       WHERE user_id BETWEEN '1100001' AND '9999999'
       GROUP BY user_id
       HAVING `Average Steps` BETWEEN '3000' AND '4000'
     ) AS custlt3k;
I just want to know the total number of users.

The GROUP BY in the derived table (inner query) makes user_id distinct, so the DISTINCT is not needed. Change it to simply COUNT(*).
Please provide SHOW CREATE TABLE so we can see whether you have an index starting with user_id, which might be beneficial. Also: how many rows are in step_activities, and how many of them have user_id between '1100001' and '9999999'? Comparing those numbers will determine whether the index will even be used.
The task requires reading, and grouping, all the rows (at least the rows in that range of user_ids).
This index may help: INDEX(user_id, steps), because it would be a "covering" index.
Another thing to consider -- don't store any rows with steps = 0 at all, since they are being thrown out in this query. (Maybe there are other columns in those rows that you need to keep?)
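Putting those suggestions together, here is a sketch (COUNT(*), unquoted numeric literals assuming the columns are numeric, and the covering index; the index name idx_user_steps is made up):

ALTER TABLE step_activity.step_activities
    ADD INDEX idx_user_steps (user_id, steps);  -- hypothetical name; "covering" for the query below

SELECT COUNT(*)
FROM ( SELECT user_id
       FROM step_activity.step_activities
       WHERE user_id BETWEEN 1100001 AND 9999999
       GROUP BY user_id
       HAVING AVG(IF(steps > 0, steps, NULL)) BETWEEN 3000 AND 4000
     ) AS t;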

Related

Is there a way to avoid sorting for a query with WHERE and ORDER BY?

I am looking to understand how a query with both WHERE and ORDER BY can be indexed properly. Say I have a query like:
SELECT *
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3
With an index on date_created, it seems like the execution plan will prefer to use the PRIMARY key and then sort the results itself. This seems to be very slow when it needs to sort a large number of results.
I was reading through this guide on indexing for ordered queries which mentions an almost identical example and it mentions:
If the database uses a sort operation even though you expected a pipelined execution, it can have two reasons: (1) the execution plan with the explicit sort operation has a better cost value; (2) the index order in the scanned index range does not correspond to the order by clause.
This makes sense to me but I am unsure of a solution. Is there a way to index my particular query and avoid an explicit sort or should I rethink how I am approaching my query?
The Optimizer is caught between a rock and a hard place.
Plan A: Use an index starting with id; collect however many rows that is; sort them; then deliver only 3. The downside: if the list is large and the ids are scattered, it could take a long time to find all the candidates.
Plan B: Use an index starting with date_created, filtering on id until it gets 3 items. The downside: what if it has to scan all the rows before it finds 3?
If you know that the query will always work better with one query plan than the other, you can use an "index hint". But, when you get it wrong, it will be a slow query.
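For example, if you were certain that scanning in date_created order always wins, the hint would look something like this (a sketch; idx_date_created is a hypothetical index name):

SELECT *
FROM users FORCE INDEX (idx_date_created)  -- hypothetical index name
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3;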
A partial answer... If * contains bulky columns, both approaches may be hauling around stuff that will eventually be tossed. So, let's minimize that:
SELECT u.*
FROM ( SELECT id
       FROM users
       WHERE id IN (1, 2, 3, 4)
       ORDER BY date_created
       LIMIT 3              -- not repeated
     ) AS x
JOIN users AS u USING(id)
ORDER BY date_created;      -- repeated
Together with
INDEX(date_created, id),
INDEX(id, date_created)
Hopefully, the Optimizer will pick one of those "covering" indexes to perform the "derived table" (subquery). If so, that part will be performed fairly efficiently. Then the JOIN will look up the rest of the columns for the 3 desired rows.
If you want to discuss further, please provide
SHOW CREATE TABLE.
How many ids you are likely to have.
Why you are not already JOINing to another table to get the ids.
Approximately how many rows in the table.
Your best bet might be to write this in a more complicated way:
SELECT u.*
FROM ( ( SELECT u.*
         FROM users u
         WHERE id = 1
         ORDER BY date_created
         LIMIT 3
       )
       UNION ALL
       ( SELECT u.*
         FROM users u
         WHERE id = 2
         ORDER BY date_created
         LIMIT 3
       )
       UNION ALL
       ( SELECT u.*
         FROM users u
         WHERE id = 3
         ORDER BY date_created
         LIMIT 3
       )
       UNION ALL
       ( SELECT u.*
         FROM users u
         WHERE id = 4
         ORDER BY date_created
         LIMIT 3
       )
     ) u
ORDER BY date_created
LIMIT 3;
Each of the subqueries will now use an index on users(id, date_created). The outer query is then sorting at most 12 rows, which should be trivial from a performance perspective.
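For completeness, the index those subqueries rely on could be created like this (a sketch; the index name is made up):

ALTER TABLE users ADD INDEX idx_id_created (id, date_created);  -- hypothetical name

With it, each leg reads at most the first 3 matching index entries before its LIMIT kicks in.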
You could create a composite index on (id, date_created) - that will give the engine the option of using an index for both steps - but the optimiser may still choose not to.
If there aren't many rows in your table, or it thinks the result set will be small, it's quicker to sort after the fact than it is to traverse the index tree.
If you really think you know better than the optimiser (which you don't), you can use index hints to tell it what to do, but this is almost always a bad idea.

Using index with IN clause and ordering by primary key

I am having a problem with the following task in MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 - CREATED, 1 - APPROVED, 2 - REJECTED).
Now, the application usually needs to filter for a concrete enterprise, department, and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The ORDER BY is required, since I have to provide the user with the most recent records. For this query I have created an index on (enterprise, department, status), and everything works fine. However, for some privileged users the status filter should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index on (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values for status, but how quick will the sorting by primary key be? Will it take the most recent 10 values for each status and then merge them, or will it first merge the ids for all statuses together and only then take the first ten (which I guess would be much slower)?
All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, and will thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but it will not be able to use id for the ORDER BY, nor will it be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry; performance should still be fine.
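To confirm which case you are in, prefix the query with EXPLAIN and look at the Extra column (a sketch using the third query):

EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21 AND status IN (0,1,2)
ORDER BY id DESC LIMIT 0,10;

If Using filesort shows up, a sort was needed; if it is absent, the index order was used directly.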
My second index is not as good for the third query.
See my Index Cookbook.
"How quick will sorting by id be"? That depends on two things.
Whether the sort can be avoided (see above);
How many rows in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.

How to query a table with over 200 million rows?

I have a table USERS with only one column, USER_ID. There are more than 200M IDs; they are not consecutive and they are not ordered. The table has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google BigQuery, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) What is the row number for a particular USER_ID (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It goes fast but it returns row=1, because the row counting is applied to the filtered result set.
SELECT USER_ID, @row := @row + 1 AS row
FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) AS t
WHERE USER_ID = 100002034141760;
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)?
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to give a result. Why? Isn't it supposed to be fast if it has an index?
In BigQuery, I didn't find a way to do this, because LIMIT init, num_rows doesn't even exist.
I could order the table into a new one and add a RANK column that numbers the USER_IDs, with an INDEX on it. But it would be a mess if I wanted to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2), your question was: "Why? Isn't it supposed to be fast if it has an index?" Yes, it will use the index. But then it has to count up to row 150,000,000,000 using an index scan. Hmmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It will take time, and more time still if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT `row`, USER_ID
FROM ( SELECT USER_ID, ROW_NUMBER() OVER (ORDER BY USER_ID) AS `row`
       FROM USERS.USER_ID
     ) AS t
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
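For example, a precomputed rank column could look like this (a sketch assuming MySQL 8+ for ROW_NUMBER(); the column and index names are made up):

ALTER TABLE USERS ADD COLUMN rank_no BIGINT;  -- hypothetical column name

UPDATE USERS u
JOIN ( SELECT USER_ID, ROW_NUMBER() OVER (ORDER BY USER_ID) AS rn
       FROM USERS ) t ON t.USER_ID = u.USER_ID
SET u.rank_no = t.rn;

CREATE INDEX idx_rank_no ON USERS (rank_no);  -- hypothetical index name

Both questions then become indexed point lookups:

SELECT rank_no FROM USERS WHERE USER_ID = 100001366260516;  -- (1) row number for an id
SELECT USER_ID FROM USERS WHERE rank_no = 150000000;        -- (2) id at a given row

As you noted, though, inserts and deletes invalidate the numbering, so it has to be refreshed.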

SELECT COUNT(*) with GROUP BY slow for large table

I have a table with around 100 million rows consisting of three columns (all INT):
id | c_id | l_id
Even though I use indices, even a basic
select count(*), c_id
from table
group by c_id;
takes 16 seconds (MyISAM) to 25 seconds (InnoDB) to complete.
Is there any way to speed this up without tracking the count in a separate table (e.g. by using triggers)?
/edit: all columns have indices
See the execution plans for possible ways to do the same queries (SqlFiddle).
SELECT COUNT(id) will be faster if c_id is not indexed, as in the test set I have provided.
Otherwise you should use COUNT(*), since the index optimization may not be used in the query.
It also depends on the number of rows in the DB and the ENGINE type, since MySQL will decide what is better based on those as well.
You should always look at the execution plan of a query before executing it, by typing EXPLAIN before the SELECT.
I have to say that in most cases on big datasets, COUNT(*) and COUNT(id) should result in the same execution plan.
It's not the COUNT(*) that causes the performance issue, but the grouping of 100 million rows.
You should add an index on the c_id column.
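For example (a sketch; the question calls the table just "table", so the name here is a placeholder):

ALTER TABLE t ADD INDEX idx_c_id (c_id);  -- hypothetical table and index names

The GROUP BY can then be satisfied by scanning that index instead of the full rows.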

Best approach to getting count of distinct values in MySQL

I have this query:
select count(distinct User_ID) from Web_Request_Log where Added_Timestamp like '20110312%' and User_ID Is Not Null;
User_ID and Added_Timestamp are indexed.
The query is painfully slow (we have millions of records and the table is growing fast).
I've read all the posts I could find here about COUNT and DISTINCT, but they seem to be mostly syntax-related. I'm interested in optimization, and I'm wondering whether I'm using the right tool for the job.
I can use an intermediate counter table to summarize overall hits, but I'd like a way to do this that would allow me to easily generate ad-hoc 'range' queries; i.e., what is the distinct visitor count for last week, or last month.
I did some tests to see whether GROUP BY can help, and it seems it can.
On table A with ~8M records and ~340K distinct records for a given non-indexed field:
GROUP BY 17 seconds
COUNT(DISTINCT ..) 21 seconds
On table A with ~2M records and ~50K distinct records for a given indexed field:
GROUP BY 200 ms
COUNT(DISTINCT ..) 2.5 seconds
This is MySQL with the InnoDB engine, BTW.
I can't find any relevant documentation though, and I wonder whether that comparison depends on the data (how many duplicates there are).
For your table, the GROUP BY query will look like this:
SELECT COUNT(t.c)
FROM ( SELECT 1 AS c
       FROM Web_Request_Log
       WHERE Added_Timestamp LIKE '20110312%'
         AND User_ID IS NOT NULL
       GROUP BY User_ID
     ) AS t;
Try it and let us know if it's quicker :)
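Beyond that, a composite index could make the derived query a covering-index scan (a sketch; the index name is made up, and it assumes Added_Timestamp is stored as a string, given the LIKE pattern):

ALTER TABLE Web_Request_Log
    ADD INDEX idx_ts_user (Added_Timestamp, User_ID);  -- hypothetical name

The range filter and the grouping column are then both available from the index alone, avoiding row lookups, though the GROUP BY may still need a temporary table.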