Best approach to getting count of distinct values in MySQL

I have this query:
select count(distinct User_ID)
from Web_Request_Log
where Added_Timestamp like '20110312%'
  and User_ID is not null;
User_ID and Added_Timestamp are indexed.
The query is painfully slow (we have millions of records and the table is growing fast).
I've read all the posts I could find here about COUNT and DISTINCT, but they seem to be mostly syntax-related. I'm interested in optimization, and I'm wondering if I'm using the right tool for the job.
I can use an intermediate counter table to summarize overall hits, but I'd like a way to do this that would allow me to easily generate ad-hoc 'range' queries; i.e., what is the distinct visitor count for last week, or last month.
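For example, a summary table along these lines (hypothetical schema) would support those range queries, at the cost of a nightly rollup job:

-- Hypothetical daily rollup: one row per (day, visitor), filled by a nightly job
CREATE TABLE Daily_Visitors (
    visit_date DATE NOT NULL,
    User_ID    INT  NOT NULL,
    PRIMARY KEY (visit_date, User_ID)
);

-- A week's distinct visitors would then be:
SELECT COUNT(DISTINCT User_ID)
FROM Daily_Visitors
WHERE visit_date BETWEEN '2011-03-06' AND '2011-03-12';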

Did some tests to see if GROUP BY can help and it seems it can.
On a table with ~8M records and ~340K distinct values for a given non-indexed field:
GROUP BY: 17 seconds
COUNT(DISTINCT ..): 21 seconds
On a table with ~2M records and ~50K distinct values for a given indexed field:
GROUP BY: 200 ms
COUNT(DISTINCT ..): 2.5 seconds
This is MySQL with the InnoDB engine, BTW.
I can't find any relevant documentation on this, though, and I wonder whether the comparison depends on the data (how many duplicates there are).
For your table, the GROUP BY query will look like this:
SELECT COUNT(t.c)
FROM (
    SELECT 1 AS c
    FROM Web_Request_Log
    WHERE Added_Timestamp LIKE '20110312%'
      AND User_ID IS NOT NULL
    GROUP BY User_ID
) AS t
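A composite index covering both the filter and the grouping might also be worth testing (hypothetical index name; this assumes Added_Timestamp is a string column, since it is filtered with LIKE):

ALTER TABLE Web_Request_Log ADD INDEX idx_ts_user (Added_Timestamp, User_ID);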
Try it and let us know if it's quicker :)

Related

SQL query optimization in Workbench

Running this query takes a lot of time.
There are 240,000,000 total rows and 7,700,000 unique rows.
I am trying to find the users whose average daily step count is between 3000 and 4000.
select count(distinct user_id)
from (
    SELECT user_id, ROUND(AVG(IF(steps > '0', steps, NULL)), 0) AS `Average Steps`
    FROM `step_activity`.`step_activities`
    WHERE user_id BETWEEN '1100001' AND '9999999'
    GROUP BY user_id
    HAVING `Average Steps` BETWEEN '3000' AND '4000'
) as custlt3k;
I just want to know the total number of such users.
The GROUP BY in the derived table (inner query) already makes user_id distinct, so the DISTINCT is not needed. Change it to simply COUNT(*).
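That is, the query becomes:

SELECT COUNT(*)
FROM (
    SELECT user_id, ROUND(AVG(IF(steps > '0', steps, NULL)), 0) AS `Average Steps`
    FROM `step_activity`.`step_activities`
    WHERE user_id BETWEEN '1100001' AND '9999999'
    GROUP BY user_id
    HAVING `Average Steps` BETWEEN '3000' AND '4000'
) AS custlt3k;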
Please provide SHOW CREATE TABLE so we can see whether you have an index starting with user_id, which might be beneficial. Also, how many rows are in step_activities, and how many have user_id between '1100001' and '9999999'? Comparing those numbers will determine whether the index will even be used.
The task requires all the rows, or at least the rows in that range of users, to be read and "grouped".
This index may help: INDEX(user_id, steps) because that would be a "covering" index.
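In DDL terms (hypothetical index name):

ALTER TABLE `step_activity`.`step_activities` ADD INDEX idx_user_steps (user_id, steps);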
Another thing to consider -- don't store any rows with steps = 0. After all, they are being thrown out in this query. (Maybe there are other columns in the row that you need to keep?)

How to find repeated values in a MySQL table with 30 million rows

In MySQL, I have a table with two columns (id, uuid), into which I inserted 30 million rows. (The uuid values can repeat.)
Now I want to find the repeated values in the table using SQL, but the query takes too much time.
I want to search the whole table, but that takes too long, so I tried querying the first million rows, and it took 8 seconds.
Then I tried with 10 million rows and it took 5 minutes;
with 20 million rows, the server seemed to die.
select count(uuid) as cnt
from uuid_test
where id between 1 and 1000000
group by uuid
having cnt > 1;
Can anyone help me optimize this SQL? Thanks.
Try this query:
SELECT uuid, count(*) cnt FROM uuid_test GROUP BY 1 HAVING cnt>1;
Hope it helps.
Often the fastest way to find duplicates uses a correlated subquery rather than aggregation:
select ut.*
from uuid_test ut
where exists (select 1
              from uuid_test ut2
              where ut2.uuid = ut.uuid and
                    ut2.id <> ut.id
             );
This can take advantage of an index on uuid_test(uuid, id).
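For reference (hypothetical index name):

CREATE INDEX idx_uuid_id ON uuid_test (uuid, id);

With that index in place, each EXISTS probe is an index-only lookup rather than a table scan.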

MySQL GROUP BY and SUM performance issues

We have just one table with millions of rows, and this query, as it stands, takes 138 seconds to run on a Linux server with SSD drives and a 25G buffer pool.
I am wondering if anyone could suggest improvements to the MySQL settings or to the query itself that would reduce the run time. Only about 8 large member_ids have this performance problem; the rest run in under 5 seconds. We run multiple summary tables like this for rollup reporting.
select *
from (
    SELECT distinct account_name AS source, SUM(royalty_amount) AS total_amount
    FROM royalty_stream
    WHERE member_id = '1050705'
      AND deleted = 0
      AND period_year_quarter >= '2016_Q1'
      AND period_year_quarter <= '2016_Q2'
    GROUP BY account_name
    ORDER BY total_amount desc
    LIMIT 1
) a
I see a few obvious improvements.
Subselects
Don't use a subselect. This isn't a huge deal, but it makes little sense to add the overhead here.
Using Distinct
Is the distinct really needed here? Since you're grouping, it should be unnecessary overhead.
Data Storage Practices
Your period_year_quarter evaluation is going to be a hurdle. String comparisons are among the slower things you can do, unfortunately. If you have the ability to update the data structure, I would highly recommend breaking period_year_quarter into two separate integer fields: one for the year, one for the quarter.
Is royalty_amount actually stored as a number, or are you making the database implicitly convert it every time? If so (a surprisingly common mistake), converting it to a numeric type will also help.
Indexing
You haven't explained what indexes are on this table. I'm hoping that you at least have one on member_id. If not, it should certainly be indexed.
I would further recommend an index on (member_id, period_year_quarter). If you took my advice from the previous section, that should be (member_id, year, quarter).
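If you go that route, the restructuring might look something like this (a sketch; it assumes every period_year_quarter value follows the '2016_Q1' pattern, and the new column and index names are made up):

ALTER TABLE royalty_stream
    ADD COLUMN period_year SMALLINT UNSIGNED,
    ADD COLUMN period_quarter TINYINT UNSIGNED;

-- Backfill from the existing string column ('2016_Q1' -> 2016, 1)
UPDATE royalty_stream
SET period_year    = CAST(SUBSTRING(period_year_quarter, 1, 4) AS UNSIGNED),
    period_quarter = CAST(SUBSTRING(period_year_quarter, 7, 1) AS UNSIGNED);

ALTER TABLE royalty_stream
    ADD INDEX idx_member_period (member_id, period_year, period_quarter);

With the subselect and the DISTINCT removed, the query itself becomes: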
select
    account_name as source,
    sum(royalty_amount) as total_amount
from royalty_stream
where
    member_id = '1050705'
    and deleted = 0
    and period_year_quarter between '2016_Q1' and '2016_Q2'
group by account_name
order by total_amount desc
limit 1
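And if you split period_year_quarter as suggested, the string range turns into plain integer predicates (again using the hypothetical columns from above):

select
    account_name as source,
    sum(royalty_amount) as total_amount
from royalty_stream
where
    member_id = '1050705'
    and deleted = 0
    and period_year = 2016
    and period_quarter between 1 and 2
group by account_name
order by total_amount desc
limit 1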

What SQL indexes to put on a big table

I have two big tables from which I mostly select, but complex queries with 2 joins are extremely slow.
The first table is GameHistory, in which I store a record for every finished game (the 15 games themselves live in a separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store a record for every player who participated in a given game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, COUNT(ghp.player_id) AS num_games_today
FROM `GameHistory` AS gh
INNER JOIN GameHistoryParticipants AS ghp ON gh.id = ghp.history_id
INNER JOIN Players AS p ON p.id = ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW()) = 0
  AND gh.game_id = 'scrabble'
GROUP BY ghp.player_id
ORDER BY COUNT(ghp.player_id) DESC
LIMIT 10
The first table has 1.5 million records and the second one 3.5 million.
What indexes should I put on them? (I tried some and it was all slow.)
You are only interested in today's records, yet you search the whole GameHistory table with TIMESTAMPDIFF to detect them. Even if you have an index on that column, it cannot be used, because you apply a function to the field.
You should have an index on both fields, game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
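In DDL terms, a composite index like this (hypothetical name) lets both predicates be resolved from the index, with the equality column first:

ALTER TABLE GameHistory ADD INDEX idx_game_date (game_id, date_end);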
It would be even better to have an index on date_end's date part rather than on the full datetime value. That is not possible in MySQL, however, so consider adding another column, trunc_date_end, holding the date part alone, which you'd fill with a BEFORE INSERT trigger. With an index on trunc_date_end and game_id, you should find the desired records in no time:
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
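A sketch of that setup (untested; the column, index, and trigger names are made up):

-- New date-only column, backfilled once, then maintained by a trigger
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;

UPDATE GameHistory SET trunc_date_end = DATE(date_end);

ALTER TABLE GameHistory ADD INDEX idx_trunc_game (trunc_date_end, game_id);

CREATE TRIGGER gh_set_trunc_date
BEFORE INSERT ON GameHistory
FOR EACH ROW
SET NEW.trunc_date_end = DATE(NEW.date_end);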
Add EXPLAIN at the beginning of your query, then run it in a database viewer (e.g., SQLyog) and you will see the details of how the query executes. Look at the 'rows' column, which shows how many rows each step examines, and index the table columns that the EXPLAIN result shows touching large numbers of rows.
I think my explanation is kinda messy; you can ask for clarification.

MySQL query using WHERE clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`, `Stock`.`Description`,
       `TComponent_Status`.`component`, `TComponent_Status`.`certificate`,
       `TComponent_Status`.`status`, `TComponent_Status`.`date_created`
FROM Stock, TBOM, TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
  AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in its columns, so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
The only thing I can really think of is adding an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields from Stock, the engine won't need to visit the full data row of each stock record; both parts are in the index, so it can use the index alone. Additionally, since you are doing a DISTINCT, having the index available helps optimize the DISTINCT as well.
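In DDL terms (hypothetical index name):

ALTER TABLE Stock ADD INDEX idx_stock_cover (ProductNumber, Description);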
Now, the other issue is time. Since you are doing a distinct from stock to product to product status, you are asking for all 24 million TBOM items (assuming a bill of materials), and since each BOM component could have multiple status records, you are getting every BOM row for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
       Stock.ProductNumber,
       Stock.Description,
       JustThese.component,
       JustThese.certificate,
       JustThese.`status`,
       JustThese.date_created
FROM (
       SELECT DISTINCT
              TCS.component,
              TCS.certificate,
              TCS.`status`,
              TCS.date_created
       FROM TComponent_Status TCS
       WHERE TCS.date_created >= 'some date you want to limit based upon'
     ) AS JustThese
JOIN TBOM
  ON JustThese.component = TBOM.Component
JOIN Stock
  ON TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
(date_created, component, certificate, status)
so the WHERE clause would be optimized, and the DISTINCT would be too, since those pieces are already part of the index.
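That is (hypothetical index name):

ALTER TABLE TComponent_Status
    ADD INDEX idx_tcs_cover (date_created, component, certificate, status);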
But the way you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Scale that across 24 million rows and it's definitely not going to look good.