Distinct Count Index - MySQL

I have a simple table (MySQL, MyISAM):
EDIT: This is a 50M record table, adding a new index isn't really something we can do.
actions (PRIMARY action_id, user_id, item_id, created_at)
Indexes:
action_id (action_id, PRIMARY)
user (user_id)
item_user (user_id, item)
created_user (user_id, created_at)
And the query:
SELECT count(distinct item_id) as c_c from actions where user_id = 1
The explain:
1 SIMPLE action ref user_id,created_user user_id 4 const 1415
This query takes around 7 seconds to run for users with over 1k entries. Any way to improve this?
I've tried the following and they are all worse:
SELECT count(*) from actions where user_id =1 group by item_id
SELECT count(item_id) from actions USE INDEX (item_user) where user_id = 1 group by item_id

Can you test the following:
SELECT count(*) as c_c
from (
SELECT distinct item_id
from actions where user_id = 1
) as T1

In case you're using PHP, can you simplify your query to the following:
SELECT distinct item_id
FROM actions
WHERE user_id = 1
and then use mysql_num_rows to get the number of rows in your result?
Another option, although it requires more work, is to:
1- Create another table that holds the total for each user_id. It needs two columns: the user_id and the total number of items found for that user in the `actions` table.
2- Schedule a job to run periodically (every hour, for instance) and update that table with the totals computed from the `actions` table. At this point you can just query your newly created table like this:
SELECT total
FROM actions_total
WHERE user_id = 1
This will be much faster when you need the final result, because you're reading a single row instead of thousands. The drawback is that the result may be stale, depending on how often the job runs.
3- If you decide not to use a job, you can still use the new table, but you will need to update (increment/decrement) its total each time you insert into or delete from the `actions` table. Because the total is a distinct count, adjust it only when the inserted item_id is new for that user, or when the deleted row was the last one for that item_id.
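Options 1 and 2 above could be sketched roughly as follows. This is only an outline under assumptions: the table name actions_total comes from the query above, but the column types are guesses (the EXPLAIN key length of 4 suggests user_id is an INT), and the refresh still scans the whole actions table, so it belongs in an off-peak job rather than in the request path.

```sql
-- Summary table: one row per user (column types are assumptions).
CREATE TABLE actions_total (
    user_id INT NOT NULL PRIMARY KEY,
    total   INT NOT NULL
);

-- Periodic refresh, run from cron or a MySQL EVENT.
-- REPLACE overwrites the existing row for each user_id.
REPLACE INTO actions_total (user_id, total)
SELECT user_id, COUNT(DISTINCT item_id)
FROM actions
GROUP BY user_id;

-- The lookup then reads a single row by primary key.
SELECT total
FROM actions_total
WHERE user_id = 1;
```

Note that REPLACE does not remove rows for users who no longer have any actions; if that can happen, the job would have to delete those rows separately.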
N.B: Just trying to help

Related

Optimizing Select SQL request with millions of entries

I'm working on a table with around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key to another table). Here is my actual query:
SELECT * FROM billing bill WHERE bill.billing_value not like 'not_ok%'
AND
(SELECT bill2.billing_id
FROM billing bill2
WHERE bill2.subscription_id = bill.subscription_id
ORDER BY bill2.billing_id ASC LIMIT 1
)= bill.billing_id;
This query works correctly when I put a small LIMIT on it, but I can't get it to complete for the whole table.
Is there a way to optimise it, or to do this another way?
Table indexes and structure:
Indexes:
This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available in MySQL 8.0, so if you haven't upgraded yet, you must do so to use this function.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a virtual column with an index to optimize that condition:
alter table billing
add column ok tinyint(1) as (billing_value not like 'not_ok%'),
add index (ok);
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, you could add an indexed column first_ok that is true only on the rows you need to fetch (the first row per subscription_id without 'not_ok' as the billing value). But you must maintain this column manually, and you risk it being wrong if you don't. This is a denormalized design, but tailored to the query you want to run.
I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
from billing
group by subscription_id)
and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer select fetches the rest of the billing record.
If performance is an issue, I'd add the billing_id field to the third index, so you get an index on (subscription_id, billing_id). This will help the inner query.
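As a rough illustration of that index suggestion: the sketch below adds a separate composite index rather than altering the existing one, because the index definitions are not shown here. The name idx_subscription_billing is made up, and whether the optimizer actually picks the index depends on your data and MySQL version, so check the plan.

```sql
-- Hypothetical index name; assumes no equivalent index exists yet.
ALTER TABLE billing
    ADD INDEX idx_subscription_billing (subscription_id, billing_id);

-- Inspect the plan of the inner query. With an index on
-- (subscription_id, billing_id) the Extra column may report
-- "Using index for group-by", meaning MIN() is read from the index.
EXPLAIN
SELECT MIN(billing_id)
FROM billing
GROUP BY subscription_id;
```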

MySQL - Get previous row with a same identifier

I need help constructing a MySQL statement that finds previous rows in the same table.
My data looks like this:
history_id (auto increment), object_id (exists multiple times), timestamp, ...
example:
1, 2593, 2018-08-07 09:37:21
2, 2593, 2018-08-07 09:52:54
3, 15, 2018-08-07 10:41:15
4, 2593, 2018-08-07 09:57:36
Some properties of this data:
the higher the auto increment gets the later the timestamp is for the same object id
it is possible that there is only one row for one object_id at all
the combination of object_id and timestamp is always unique, no duplicates are possible
For every row I need to find the immediately preceding row with the same object_id.
I found this post: https://dba.stackexchange.com/questions/24014/how-do-i-get-the-current-and-next-greater-value-in-one-select and worked through the examples but I was not able to solve my problem.
I just tested around a bit and got to this point:
SELECT
i1.history_id,
i1.object_id,
i1.timestamp AS state_time,
i2.timestamp AS previous_time
FROM
history AS i1
LEFT JOIN (
select timestamp as timestamp,history_id as history_id,object_id as object_id
from history
group by object_id
) AS i2 on i2.object_id = i1.object_id and i2.history_id < i1.history_id
Now I only need to cut off the subquery so that I only get the highest value of history_id for each row, but it's not working when I use LIMIT 1, because then I get only one value in total.
Do you have any idea how to solve this problem? Or do you know a better, more efficient technique?
Performance matters here because I have 3.1 million rows and growing.
Thank you!
The best direction is to use a window function. A simple lag(timestamp) with a proper ORDER BY clause would do the job. See here: https://dev.mysql.com/doc/refman/8.0/en/window-function-descriptions.html#function_lag
But if all you need is
to cut off the subquery so that I only get the highest value of history_id for each row, but it's not working when I use LIMIT 1
Then change subquery from
select timestamp as timestamp,history_id as history_id,object_id as object_id
from history
group by object_id
to
select object_id as object_id, MAX(history_id) as history_id, MAX(timestamp) as timestamp
from history
group by object_id
In general, you should not SELECT columns that are not in the GROUP BY clause unless they are wrapped in an aggregate function.
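For the window-function route mentioned at the start of this answer, a minimal sketch could look like the following. It assumes MySQL 8.0 or later and the column names from the question, and it orders by history_id because the question states that a higher history_id always means a later timestamp for the same object_id.

```sql
-- Requires MySQL 8.0+ (window functions).
SELECT
    history_id,
    object_id,
    `timestamp` AS state_time,
    LAG(`timestamp`) OVER (
        PARTITION BY object_id   -- only look at rows of the same object
        ORDER BY history_id      -- "previous" = next-lower history_id
    ) AS previous_time           -- NULL when the object has no earlier row
FROM history
ORDER BY history_id;
```

With the sample rows from the question, history_id 2 would get the timestamp of row 1 as previous_time, row 4 would get the timestamp of row 2, and rows 1 and 3 would get NULL. An index on (object_id, history_id) would likely help on a table of this size, but that is an assumption to verify with EXPLAIN.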

sql order by not working with group by only

I have a stock_activity table with multiple records attached to a single item_id. Note that item_id is a foreign key in this table; I am tracking items moving in and out of inventory. Now I want to retrieve the last activity record stored in the table. I have written a query that is supposed to return the last record, but it returns the first one.
Columns are:
activity_id pk
item_id fk
balance int(11)
Here is my query:
SELECT DISTINCT(item_id),balance
FROM `stock_activity`
GROUP BY (item_id)
ORDER BY(activity_id) DESC
Remember: if a column that doesn't belong to the grouping key is referenced without any aggregation, the statement is not valid.
Here is a little formula to overcome this problem:
SELECT * FROM
(
SELECT * FROM `table`
ORDER BY AnotherColumn
) t1
GROUP BY SomeColumn
;
Modify your query like this, and hopefully it will work:
SELECT * FROM(
SELECT DISTINCT(item_id),balance
FROM `stock_activity`
ORDER BY(activity_id) DESC
) t1
GROUP BY (item_id)
This is a common problem. You want the groupwise maximum (or minimum) of a column. Fortunately, such an example exists right in the tutorial section of the manual.
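Applied to the table in the question, one groupwise-maximum form might look like the sketch below. It assumes that a higher activity_id means a later activity (the question implies this but does not state it), and the aliases sa, latest and last_activity_id are made up for the example.

```sql
-- For each item, find its highest activity_id, then join back
-- to fetch the balance stored on exactly that row.
SELECT sa.item_id, sa.balance
FROM stock_activity AS sa
JOIN (
    SELECT item_id, MAX(activity_id) AS last_activity_id
    FROM stock_activity
    GROUP BY item_id
) AS latest
  ON latest.item_id = sa.item_id
 AND latest.last_activity_id = sa.activity_id;
```

Unlike sorting inside a subquery and grouping afterwards, this form does not rely on the server preserving any row order.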

Suggest an optimised mysql query

I have a table of user transactions. I need to select the users whose transactions total more than 100,000 in a single day. Currently I gather all the user ids and execute
SELECT sum(amt) as amt from users where date = date("Y-m-d") AND user_id=id;
for each id, checking whether amt > 100k or not.
Since it's a large table, this takes a long time to execute. Can someone suggest an optimised query?
This will do:
SELECT sum(amt) as amt, user_id from users
where date = date("Y-m-d")
GROUP BY user_id
HAVING sum(amt) > 100000; -- 1 lakh = 100,000
What about filtering the records first and then applying the sum, like below:
select SUM(amt),user_id from (
SELECT amt,user_id from users where user_id=id AND date = date("Y-m-d")
)tmp
group by user_id having sum(amt)>100000
What datatype is amt? If it's anything but a basic integral type (e.g. int, long, number, etc.) you should consider converting it. Decimal types are faster than they used to be, but integral types are faster still.
Consider adding indexes on the date and user_id field, if you haven't already.
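A sketch of what such an index might look like, together with a single query that replaces the per-user loop. The index name idx_date_user_amt is made up, CURDATE() stands in for the PHP date("Y-m-d") call, and the example assumes the date column is a DATE rather than a DATETIME.

```sql
-- Hypothetical index name. The equality-filtered column comes first,
-- then the grouping column; including amt lets the query be answered
-- from the index alone.
ALTER TABLE users
    ADD INDEX idx_date_user_amt (`date`, user_id, amt);

-- One pass over today's rows, for all users at once.
SELECT user_id, SUM(amt) AS total_amt
FROM users
WHERE `date` = CURDATE()
GROUP BY user_id
HAVING total_amt > 100000;
```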
You can combine the aggregation and filtering in a single query...
SELECT SUM(Amt) as amt
FROM users
WHERE date=date(...)
AND user_id=id
GROUP BY user_id
HAVING amt > 1
The only optimization that can be done in your query is by applying primary key on user_id column to speed up filtering.
As far as other answers posted which say to apply GROUP BY on filtered records, it won't have any effect as WHERE CLAUSE is executed first in SQL logical query processing phases.
Check here
You could use MySQL sub-queries to let MySQL handle all the iterations. For example, you could structure your query like this:
select user_data.user_id, user_data.total_amt from
(
select sum(amt) as total_amt, user_id from users where date = date("Y-m-d") AND user_id=id
) as user_data
where user_data.total_amt > 100000;

How to query a table with over 200 million rows?

I have a table USERS with only one column USER_ID. These IDs are more than 200M, they are not consecutive and are not ordered. It has an index USER_ID_INDEX on that column. I have the DB in MySQL and also in Google Big Query, but I haven't been able to get what I need in any of them.
I need to know how to query these 2 things:
1) Which is the row number for a particular USER_ID (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It goes fast but it returns row=1 because the row counting is from the data-set.
SELECT USER_ID, @row:=@row+1 as row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In Big Query:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to return a result. Why? Isn't it supposed to be fast if it has an index?
In Big Query, I didn't find a way, because LIMIT init, num_rows doesn't even exist.
I could order the table in a new one, and add a column called RANK that orders the USER_ID, with an INDEX on it. But it will be a mess if I want to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain, but it should just be doing a scan of the index.
For (2), your question was: "Why? Isn't it supposed to be fast if it has an index?" Yes, it will use the index. But then it has to count up to row 150,000,000,000 using an index scan. Hmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It will take time, and more time if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
from USERS.USER_ID )
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
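One way to store the ordering as a column is a separate ranked table, sketched below. This assumes MySQL 8.0 or later for ROW_NUMBER(); the names users_ranked, rnk and idx_user_id are made up, and the BIGINT types are a guess based on the size of the IDs in the question. As the question already points out, the table has to be rebuilt (or renumbered) whenever rows are added or removed.

```sql
-- Hypothetical ranked copy of USERS; rnk starts at 1.
CREATE TABLE users_ranked (
    rnk     BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    user_id BIGINT UNSIGNED NOT NULL,
    KEY idx_user_id (user_id)
);

-- One-off (or periodic) build: a single ordered pass over USERS.
INSERT INTO users_ranked (rnk, user_id)
SELECT ROW_NUMBER() OVER (ORDER BY USER_ID), USER_ID
FROM USERS;

-- (1) Row number of a given USER_ID: lookup via idx_user_id.
SELECT rnk FROM users_ranked WHERE user_id = 100001366260516;

-- (2) USER_ID at a given row: primary-key lookup.
--     The row number here is just an illustrative value.
SELECT user_id FROM users_ranked WHERE rnk = 150000000;
```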