What SQL indexes to put for big table - mysql

I have two big tables from which I mostly select but complex queries with 2 joins are extremely slow.
First table is GameHistory in which I store records for every finished game (I have 15 games in separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store a record for every player who participated in a given game.
Fields: player_id, history_id, is_winner
Query to get top players today is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
First table has 1.5 million records and the second one 3.5 million.
What indexes should I put ? (I tried some and it was all slow)

You are only interested in today's records. However, you scan the whole GameHistory table with TIMESTAMPDIFF to find them. Even if date_end is indexed, that index cannot be used, because the column is wrapped in a function.
You should have one composite index covering both game_id and date_end, in that order (the equality column first, then the range column). Then compare date_end directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
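Putting the index and the rewritten WHERE clause together, a minimal sketch could look like the following. The index names are made up, and the second index on GameHistoryParticipants is my own assumption (the question does not say which indexes already exist). Note that CURDATE() is the same as DATE(NOW()), and that this selects calendar-day rows, whereas TIMESTAMPDIFF(DAY, ...) = 0 matched anything less than 24 hours away.

```sql
-- Composite index: equality column (game_id) first, range column (date_end) second.
CREATE INDEX idx_gh_game_date ON GameHistory (game_id, date_end);

-- Assumed helper index so the join to participants is a lookup by history_id.
CREATE INDEX idx_ghp_history_player ON GameHistoryParticipants (history_id, player_id);

SELECT p.nickname, COUNT(*) AS num_games_today
FROM GameHistory AS gh
INNER JOIN GameHistoryParticipants AS ghp ON gh.id = ghp.history_id
INNER JOIN Players AS p ON p.id = ghp.player_id
WHERE gh.game_id = 'scrabble'
  AND gh.date_end >= CURDATE()                    -- start of today
  AND gh.date_end <  CURDATE() + INTERVAL 1 DAY   -- start of tomorrow
GROUP BY ghp.player_id, p.nickname
ORDER BY num_games_today DESC
LIMIT 10;
```

Because the date bounds are plain comparisons on the bare column, the optimizer is free to use the composite index as a range scan instead of reading all 1.5 million rows.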
It would be even better to index only the date part of date_end rather than the full date and time. Older MySQL versions cannot index an expression directly (MySQL 5.7 added indexable generated columns, and 8.0.13 added functional key parts). So consider adding another column, trunc_date_end, holding the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time.
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
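A rough sketch of the trigger approach, assuming date_end is a DATETIME or TIMESTAMP column. The trigger and index names are hypothetical; only trunc_date_end comes from the text above.

```sql
-- Date-only copy of date_end.
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE NULL;

-- Backfill the rows that already exist (one-off).
UPDATE GameHistory SET trunc_date_end = DATE(date_end);

-- Keep the column filled for new rows.
CREATE TRIGGER gh_trunc_date_end_bi
BEFORE INSERT ON GameHistory
FOR EACH ROW
SET NEW.trunc_date_end = DATE(NEW.date_end);

-- Both predicates are now equality checks against this index.
CREATE INDEX idx_gh_game_truncdate ON GameHistory (game_id, trunc_date_end);
```

If date_end can change after a row is inserted, a matching BEFORE UPDATE trigger would be needed as well. On MySQL 5.7 or later, a stored generated column defined as DATE(date_end) could probably replace the trigger altogether.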

Put EXPLAIN at the beginning of your query and run it in a database client (e.g. SQLyog) to see how MySQL executes it. Look at the 'rows' column, which shows an estimate of how many rows are examined for each table. For the tables with large row estimates, index the columns used to join or filter them.
(My explanation is a bit messy, so feel free to ask for clarification.)
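For instance, applied to the query from the question, it would look roughly like this (only the EXPLAIN keyword is added):

```sql
EXPLAIN
SELECT p.nickname, COUNT(ghp.player_id) AS num_games_today
FROM GameHistory AS gh
INNER JOIN GameHistoryParticipants AS ghp ON gh.id = ghp.history_id
INNER JOIN Players AS p ON p.id = ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW()) = 0
  AND gh.game_id = 'scrabble'
GROUP BY ghp.player_id
ORDER BY COUNT(ghp.player_id) DESC
LIMIT 10;
```

Besides 'rows', the 'type' and 'key' columns are worth reading: type = ALL means a full table scan, and key = NULL means no index was chosen for that table.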

Related

How to count related id on second table quicker in mysql?

I need to know how many orders were made for each product within a day, by product id. I tried selecting all the product_today.id values and counting each of them in the second table via product_today_order.hid. I now have 20k+ rows of data, and this query alone takes 10s+.
Is there any way to make the query faster?
SELECT t.id,(select count(o.hid) from product_today_order o where o.hid=t.id) as zid
FROM product_today t
where date(t.dtime)='2021-11-26'
group by t.id
5 tips:
Probably the main slowdown is the un-sargable date(t.dtime)='...'. Change that to
WHERE t.dtime >= '2021-11-26'
AND t.dtime < '2021-11-26' + INTERVAL 1 DAY
Also, get rid of the GROUP BY. It is unnecessary (if t.id is the PRIMARY KEY).
Do you have an index on t that starts with dtime?
Do you need to check o.hid for being not-NULL? If not, simply say COUNT(*).
Do you have an index on o that starts with hid?
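Taken together, the tips could look something like this. The index names are invented, and the sketch assumes t.id is the PRIMARY KEY of product_today and that dtime is a DATETIME column.

```sql
-- Index that starts with dtime, for the date range.
CREATE INDEX idx_pt_dtime ON product_today (dtime);

-- Index that starts with hid, for the per-product count.
CREATE INDEX idx_pto_hid ON product_today_order (hid);

SELECT t.id,
       (SELECT COUNT(*)
          FROM product_today_order AS o
         WHERE o.hid = t.id) AS zid
FROM product_today AS t
WHERE t.dtime >= '2021-11-26'
  AND t.dtime <  '2021-11-26' + INTERVAL 1 DAY;
```

COUNT(*) gives the same result as COUNT(o.hid) here, since o.hid = t.id can only match rows where hid is not NULL.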

Fastest way to do MySQL query select with WHERE clauses on an unindexed table

I have a very big unindexed table called table with rows like this:
IP entrypoint timestamp
171.128.123.179 /page-title/?kw=abc 2016-04-14 11:59:52
170.45.121.111 /another-page/?kw=123 2016-04-12 04:13:20
169.70.121.101 /a-third-page/ 2016-05-12 09:43:30
I want the fastest query that, given 30 IPs and one date, searches rows as far back as a week before that date and returns the most recent row containing "?kw=" for each IP. So I want DISTINCT entrypoints, but only the most recent one.
I'm stuck on this. I know it's a relatively simple INNER JOIN, but I don't know the fastest way to do it.
By the way: I can't add an index right now because the table is very big and sits on a db that serves a website. I'm going to replace it with an indexed table, don't worry.
Rows from the table
SELECT ...
FROM very_big_unindexed_table t
only within the past week...
WHERE t.timestamp >= NOW() + INTERVAL - 1 WEEK
that contains '?kw=' in the entry point
AND t.entrypoint LIKE '%?kw=%'
only the latest row for each IP. There's a couple of approaches to that. A correlated subquery on a very big unindexed table is going to eat your lunch and your lunch box. And without an index, there's no getting around a full scan of the table and a "Using filesort" operation.
Given the unfortunate circumstances, our best bet for performance is likely going to be getting the set whittled down as small as we can, and then perform the sort, and avoid any join operations (back to that table) and avoid correlated subqueries.
So, let's start with something like this, to return all of the rows from the past week with '?kw=' in entry point. This is going to be full scan of the table, and a sort operation...
SELECT t.ip
, t.timestamp
, t.entrypoint
FROM very_big_unindexed_table t
WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
AND t.entrypoint LIKE '%?kw=%'
ORDER BY t.ip DESC, t.timestamp DESC
We can use an unsupported trick with user-defined variables. (The MySQL Reference Manual specifically warns against using a pattern like this, because the behavior is (officially) undefined. Unofficially, the optimizer in MySQL 5.1 and 5.5 (at least) is very predictable.)
I think this is about as good as you are going to get if the rows from the past week are a significant subset of the entire table. This will create a sizable intermediate result set (derived table) if a lot of rows satisfy the predicates.
SELECT q.ip
     , q.entrypoint
     , q.timestamp
  FROM (
         SELECT IF(t.ip = @prev_ip, 0, 1) AS new_ip
              , @prev_ip := t.ip AS ip
              , t.timestamp AS timestamp
              , t.entrypoint AS entrypoint
           FROM (SELECT @prev_ip := NULL) i
          CROSS
           JOIN very_big_unindexed_table t
          WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
            AND t.entrypoint LIKE '%?kw=%'
          ORDER BY t.ip DESC, t.timestamp DESC
       ) q
 WHERE q.new_ip
Execution of that query will require (in terms of what's going to take the time)
a full scan of the table (there's no way to get around that)
a sort operation (again, there's no way around that)
materializing a derived table containing all of the rows that satisfy the predicates
a pass through the derived table to pull out the "latest" row for each IP
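On MySQL 8.0 or later, a window function is a supported alternative to the user-variable trick. This is a different technique from the one above, shown only as a sketch: the date and the IP list are placeholder values (the question mentions 30 IPs and one given date), and it still needs the full scan and a sort.

```sql
SELECT ranked.ip, ranked.entrypoint, ranked.`timestamp`
FROM (
    SELECT t.ip,
           t.entrypoint,
           t.`timestamp`,
           -- Number each IP's rows from newest to oldest.
           ROW_NUMBER() OVER (PARTITION BY t.ip
                              ORDER BY t.`timestamp` DESC) AS rn
    FROM very_big_unindexed_table AS t
    WHERE t.`timestamp` >= DATE '2016-05-12' - INTERVAL 1 WEEK
      AND t.`timestamp` <  DATE '2016-05-12' + INTERVAL 1 DAY
      AND t.ip IN ('171.128.123.179', '170.45.121.111', '169.70.121.101')
      AND t.entrypoint LIKE '%?kw=%'
) AS ranked
WHERE ranked.rn = 1;   -- keep only the latest row per IP
```

Filtering on the IP list inside the derived table keeps the set to be sorted as small as possible, which is the same idea as whittling the rows down before the sort.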

Maximize efficiency of SQL SELECT statement

Assume that we have a very large table, for example 3000 rows of data.
We need to select all the rows whose status field is < 4.
We know that the relevant rows will be at most 2 months old (of course, each row has a date column).
Is this query the most efficient?
SELECT * FROM database.tableName WHERE status<4
AND DATE< '".date()-5259486."' ;
(date() - php , 5259486 - two months.)...
Assuming you're storing dates as DATETIME, you could try this:
SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH)
Also, for optimizing search queries you could use EXPLAIN ( http://dev.mysql.com/doc/refman/5.6/en/explain.html ) like this:
EXPLAIN [your SELECT statement]
Another point where you can tweak response times is by carefully placing appropriate indexes.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.
Here are some explanations & tutorials on MySQL indexes:
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
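However, keep in mind that TIMESTAMP is more compact than DATETIME: before MySQL 5.6.4 the former takes 4 bytes and the latter 8 (from 5.6.4 on, DATETIME takes 5 bytes plus any fractional-seconds storage). They hold equivalent information, except for timezone handling and TIMESTAMP's narrower range.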
3,000 rows of data is not large for a database. In fact, it is on the small side.
The query:
SELECT *
FROM database.tableName
WHERE status < 4 ;
Should run pretty quickly on 3,000 rows, unless each row is very, very large (say 10k). You can always put an index on status to make it run faster.
The query suggested by cassi.iup makes more sense:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH);
It will perform better with a composite index on status, date. My question is: do you want all rows with a status below 4, or do you want all rows with a status below 4 in the past two months? In the first case, you would have to continually change the query. You would be better off with:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < date('2013-06-19');
(as of the date when I am writing this.)
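A sketch of the composite index mentioned above, with EXPLAIN to check whether it is actually picked. tableName is the placeholder name from the question (the `database.` prefix is dropped here, because DATABASE is a reserved word and would need backticks), and the index name is invented.

```sql
CREATE INDEX idx_status_date ON tableName (status, `DATE`);

EXPLAIN
SELECT *
FROM tableName
WHERE status < 4
  AND `DATE` < NOW() - INTERVAL 2 MONTH;
```

Since both conditions are ranges, only the leading column (status) can bound the index scan; the date condition is then checked against the index entries within that range. On a 3,000-row table the optimizer may well prefer a full scan anyway.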

mysql query using where clause with 24 million rows

SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically table TBOM HAS :
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in the columns so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
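In SQL, that could look roughly like this (index names are made up). A plain, non-unique CREATE INDEX accepts duplicate values, so the duplicates in TBOM should not block it; presumably the command that failed was a UNIQUE or PRIMARY KEY definition.

```sql
CREATE INDEX idx_tbom_component ON TBOM (Component);
CREATE INDEX idx_tbom_product   ON TBOM (Product);
CREATE INDEX idx_tcs_component  ON TComponent_Status (component);
CREATE INDEX idx_stock_prodnum  ON Stock (ProductNumber);

-- Same query written with explicit joins; drop DISTINCT if duplicates are acceptable.
SELECT DISTINCT s.ProductNumber, s.Description,
       cs.component, cs.certificate, cs.status, cs.date_created
FROM Stock AS s
JOIN TBOM AS b ON b.Product = s.ProductNumber
JOIN TComponent_Status AS cs ON cs.component = b.Component;
```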
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields in the query, the engine won't need to go to the full data row of each stock record; both parts are in the index, so it can use that. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCTness should also help.
Now, the other issue is time. Since you are doing a distinct from stock to product to product status, you are asking for all 24 million TBOM items (I assume bill of materials), and since each BOM component could have multiple statuses created, you are getting every BOM for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like
( date_created, component, certificate, status ) as the index. This way, the WHERE clause would be optimized, and DISTINCT would be too, since the pieces are already part of the index.
But, as you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 or 1,000 entries in your result set. Take this and span it across 24 million rows, and it's definitely not going to look good.
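The two indexes suggested in this answer, written out as statements. The index names are hypothetical, and if any of these columns are long VARCHARs the key-length limit may force prefix lengths, which I can't tell from the question.

```sql
-- Covers the inner query: date_created for the WHERE, the rest for DISTINCT.
CREATE INDEX idx_tcs_created_cover
    ON TComponent_Status (date_created, component, certificate, status);

-- Covers the two Stock columns the query reads.
CREATE INDEX idx_stock_pn_desc
    ON Stock (ProductNumber, Description);
```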

Best approach to getting count of distinct values in MySQL

I have this query:
select count(distinct User_ID) from Web_Request_Log where Added_Timestamp like '20110312%' and User_ID Is Not Null;
User_ID and Added_Timestamp are indexed.
The query is painfully slow (we have millions of records and the table is growing fast).
I've read all the posts I could find about count and distinct, here, but they seem to be mostly syntax related. I'm interested in optimization and I'm wondering if I'm using the right tool for the job.
I can use an intermediate counter table to summarize overall hits, but I'd like a way to do this that would allow me to easily generate ad-hoc 'range' queries; i.e., what is the distinct visitor count for last week, or last month.
Did some tests to see if GROUP BY can help and it seems it can.
On table A with ~8M records and ~340K distinct records for a given non-indexed field:
GROUP BY 17 seconds
COUNT(DISTINCT ..) 21 seconds
On table A with ~2M records and ~50K distinct records for a given indexed field:
GROUP BY 200 ms
COUNT(DISTINCT ..) 2.5 seconds
This is MySQL with the InnoDB engine, BTW.
I can't find any relevant documentation though, and I wonder if that comparison is dependent on the data (how many duplicates there are).
For your table, the GROUP BY query will look like this:
SELECT COUNT(t.c)
FROM (SELECT 1 AS c
FROM Web_Request_Log
WHERE Added_Timestamp LIKE '20110312%'
AND User_ID IS NOT NULL
GROUP BY User_ID
) AS t
Try it and let us know if it's quicker :)
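One more thing that might be worth testing, assuming Added_Timestamp is a DATETIME or TIMESTAMP column (the question doesn't say): LIKE '20110312%' compares the value as a string, which likely keeps the index on Added_Timestamp from being used as a range. A plain range avoids that and also covers the ad-hoc "last week" or "last month" queries by just changing the bounds. The index name below is made up.

```sql
-- Optional: range column first, User_ID included so the index covers the query.
CREATE INDEX idx_wrl_added_user ON Web_Request_Log (Added_Timestamp, User_ID);

-- Distinct visitors for one day.
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Timestamp >= '2011-03-12'
  AND Added_Timestamp <  '2011-03-13'
  AND User_ID IS NOT NULL;

-- Distinct visitors for the seven days ending on that day.
SELECT COUNT(DISTINCT User_ID)
FROM Web_Request_Log
WHERE Added_Timestamp >= '2011-03-06'
  AND Added_Timestamp <  '2011-03-13'
  AND User_ID IS NOT NULL;
```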