Maximize efficiency of SQL SELECT statement - mysql

Assume that we have a very large table, for example 3,000 rows of data.
We need to select all the rows whose field status < 4.
We know that the relevant rows will be from at most 2 months ago (of course, each row has a date column).
Is this query the most efficient?
SELECT * FROM database.tableName WHERE status < 4
AND DATE < '".(time() - 5259486)."';
(time() is PHP; 5259486 seconds is roughly two months.)

Assuming you're storing dates as DATETIME, you could try this:
SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH)
Also, for optimizing search queries you could use EXPLAIN ( http://dev.mysql.com/doc/refman/5.6/en/explain.html ) like this:
EXPLAIN [your SELECT statement]
Another point where you can tweak response times is by carefully placing appropriate indexes.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.
Here are some explanations & tutorials on MySQL indexes:
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
However, keep in mind that using TIMESTAMP instead of DATETIME is more efficient; the former is 4 bytes; the latter is 8. They hold equivalent information (except for timezone issues).
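For illustration, here is a rough sketch combining both suggestions, assuming the table and column names from the question (the index name idx_status is only a placeholder):
-- check the plan MySQL chooses for the rewritten query
EXPLAIN SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH);
-- add an index on the status column so MySQL does not scan every row
ALTER TABLE database.tableName ADD INDEX idx_status (status);
If the EXPLAIN output still shows type: ALL (a full table scan), the index is not being used.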

3,000 rows of data is not large for a database. In fact, it is on the small side.
The query:
SELECT *
FROM database.tableName
WHERE status < 4 ;
Should run pretty quickly on 3,000 rows, unless each row is very, very large (say 10k). You can always put an index on status to make it run faster.
The query suggested by cassi.iup makes more sense:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH);
It will perform better with a composite index on (status, date). My question is: do you want all rows with a status less than 4, or do you want all rows with a status less than 4 in the past two months? In the first case, you would have to continually change the query. You would be better off with:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < date('2013-06-19');
(as of the date when I am writing this.)
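For reference, a minimal sketch of that composite index, keeping the question's table and column names (the index name idx_status_date is only illustrative):
ALTER TABLE database.tableName ADD INDEX idx_status_date (status, DATE);
EXPLAIN will then show whether the optimizer picks it up for the combined status/date filter.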

Related

AWS RDS MySQL: simple query with composite index and date range takes too long to execute on ~8 million rows of data

The query is really simple, i.e.
SELECT
col1 , date_col
FROM table USE INDEX (device_date_col)
WHERE
device_id = "some_value"
AND date_col BETWEEN "2020-03-16 00:00:00" AND "2020-04-16 00:00:00"
LIMIT 1000000;
but it takes 30 to 60 seconds to return the result when run for the first time. After that it returns the result in under 10 seconds. Another problem is that when I change the device_id it again takes a long time. I cannot understand why this happens despite using proper indexing.
We know that API Gateway has a 30-second limit; because of this our API hits a timeout. It started happening suddenly today.
The main goal is to retrieve minutely data; it returns less data but still takes a long time, i.e.
....
AND col1 IS NOT NULL
GROUP BY
DATE(date_col),
HOUR(date_col),
MINUTE(date_col)
Below is some useful info:
AWS RDS instance db.m4.large (2 vCPUs and 8 GB RAM).
MySql version 5.6.x
composite index on date_col and device_col
using InnoDB
table has no id field (primary key)
total rows in table are 7.5 million
each device has data every 3 seconds
the query returns around 600k rows
EXPLAIN shows the query is using the index
UPDATE
MySQL Workbench shows that when I run the query without GROUP BY it takes 2 seconds to execute but > 30 seconds to fetch, and when I use GROUP BY the server takes > 30 seconds to execute but 2 seconds to fetch.
I think we need more:
CPU to process the data when using GROUP BY
RAM to fetch all the data (without GROUP BY)
The image below showed the query response without GROUP BY. Look at the Duration/Fetch time.
(original query)
SELECT col1 , date_col
FROM table USE INDEX (device_date_col)
WHERE device_id = "some_value"
AND date_col BETWEEN "2020-03-16 00:00:00"
AND "2020-04-16 00:00:00"
limit 1000000 ;
Discussion of INDEX(device_id, date_col, col1)
Start an index with the '=' column(s), namely device_id. This focuses the search somewhat.
Within that, further focus on the date range. So, add date_col to the index. You now have the optimal index for the WHERE.
Tack on all the other columns showing up anywhere in the SELECT, provided there are not too many and none are TEXT columns. Now you have a "covering" index. This allows the query to be performed using just the index's BTree, thereby giving a further boost in speed.
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
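A minimal sketch of that covering index, using the column names from the question and a placeholder table name (the real table name was given only as table; the index name is illustrative):
ALTER TABLE your_table
ADD INDEX idx_dev_date_col1 (device_id, date_col, col1); -- '=' column, then range column, then the selected column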
Other notes
LIMIT without ORDER BY is usually not meaningful -- you are at risk of getting a random set of rows.
That BETWEEN includes an extra midnight. I suggest
AND date_col >= "2020-03-16"
AND date_col < "2020-03-16" + INTERVAL 1 MONTH
Remove the USE INDEX -- It may help today, but it could hurt tomorrow, when the data changes or the constants change.
LIMIT 1000000 -- This could choke some clients. Do you really need that many rows? Perhaps more processing could be done in the database?
Adding on the GROUP BY -- Could there be two values for col1 within some of the minutes? Which value of col1 will you get? Consider MAX(col1), ANY_VALUE(col1), or GROUP_CONCAT(DISTINCT col1).
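Putting those notes together, a sketch of the per-minute query on MySQL 5.6 (placeholder table name; MAX is used because ANY_VALUE only exists in 5.7+):
SELECT DATE(date_col) AS d,
HOUR(date_col) AS h,
MINUTE(date_col) AS m,
MAX(col1) AS col1 -- one value per minute; change the aggregate if you need something else
FROM your_table
WHERE device_id = "some_value"
AND col1 IS NOT NULL
AND date_col >= "2020-03-16"
AND date_col < "2020-03-16" + INTERVAL 1 MONTH -- half-open range, no extra midnight
GROUP BY DATE(date_col), HOUR(date_col), MINUTE(date_col);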

Optimizing MySQL table for selecting many rows in date range

I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8gb RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date, or year-month
I've had some performance increase by splitting the date and time portion. The advantage is that you can then quickly grab all data from a date by looking at the date field, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can allow for some delays, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept that the numbers are a bit old, you can just pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date;
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
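A minimal sketch of the summary-table idea with IODKU maintenance; the table and column names here are illustrative and would need adapting to your schema:
CREATE TABLE daily_summary (
user_id INT NOT NULL,
request_date DATE NOT NULL,
total_amount BIGINT NOT NULL,
PRIMARY KEY (user_id, request_date)
);
-- run for each inserted row (or from a trigger / periodic batch job)
INSERT INTO daily_summary (user_id, request_date, total_amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE total_amount = total_amount + VALUES(total_amount);
-- hybrid read: whole days from the summary table, plus today's raw rows
SELECT (SELECT COALESCE(SUM(total_amount), 0) FROM daily_summary
WHERE user_id = ? AND request_date >= ? AND request_date < CURDATE())
+ (SELECT COALESCE(SUM(amount), 0) FROM raw_table
WHERE user_id = ? AND request_date >= CURDATE() AND request_date <= ?) AS total;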
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.

SQL query using a date field to ensure I use my index

I have a table with a date column named day. The way I have it indexed is with a composite key:
KEY user_id (user_id,day)
I want to make sure I use the index properly when I make a query that selects every row for a user_id from the beginning of the month to a given day in the month. For example, let's say I want to query every day from the beginning of the month until today. What's the best way to write my query to ensure that I hit the index? Here's what I have so far:
select * from table_name
WHERE user_id = 1
AND (day between DATE_FORMAT(NOW() ,'%Y-%m-01') AND NOW() )
Use EXPLAIN to check whether your query is using the index you've created.
For example
explain
select * from table_name
WHERE user_id = 1
AND (day between DATE_FORMAT(NOW() ,'%Y-%m-01') AND NOW() )
should give you the details of the query plan chosen by the MySQL optimizer.
In the query plan, possible_keys should include user_id (the index) ---> the indexes that could be used to fetch the result set from the table.
The key field should show user_id (the index) ---> the index that is actually used to get the results.
Extra ---> additional information (Using index: the whole result set can be fetched from the index file itself; Using where: the query uses the index to filter the criteria; and so on...)
In your query you have the constant user_id = 1, which helps reduce the number of records to be scanned even though there is a range check on day, so the query will use the index provided the user_id column does not have a high percentage of duplicate values.
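Purely for illustration (hypothetical table name and numbers, not real output), a plan that is hitting the index might look roughly like this:
id: 1
select_type: SIMPLE
table: table_name
type: range
possible_keys: user_id
key: user_id
key_len: 7
ref: NULL
rows: 28
Extra: Using where
Here key confirms the user_id index is actually used, and rows is only the optimizer's estimate.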

What SQL indexes to add for a big table

I have two big tables from which I mostly select, but complex queries with 2 joins are extremely slow.
The first table is GameHistory, in which I store records for every finished game (I have 15 games in a separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store records for every player who participated in a certain game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
The first table has 1.5 million records and the second one 3.5 million.
What indexes should I add? (I tried some and it was all slow.)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, because you apply a function to the field.
You should have an index on both fields game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
It would be even better to have an index on date_end's date part rather than on the full datetime value. This is not possible in MySQL, however. So consider adding another column trunc_date_end for the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time.
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
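A sketch of both suggestions, using the table and columns from the question (index names are illustrative; if date_end can be updated, you'd also want a before-update trigger):
-- simple version: composite index on the two filter columns
ALTER TABLE GameHistory ADD INDEX idx_game_date_end (game_id, date_end);
-- refined version: a separate date-only column kept in sync by a trigger
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;
UPDATE GameHistory SET trunc_date_end = DATE(date_end); -- backfill existing rows
CREATE TRIGGER gh_before_insert BEFORE INSERT ON GameHistory FOR EACH ROW
SET NEW.trunc_date_end = DATE(NEW.date_end);
ALTER TABLE GameHistory ADD INDEX idx_trunc_date_game (trunc_date_end, game_id);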
Add the EXPLAIN command at the beginning of your query, then run it in a database viewer (e.g. SQLyog) and you will see the details about the query. Look at the 'rows' column and you will see different integer values. Now, index the table columns indicated in the EXPLAIN result that show large row counts.
I think my explanation is kinda messy; you can ask for clarification.

Faster way of retrieving aggregate data from large table?

I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.
The indices on the table are on url and datetime.
I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;
The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?
Tens of millions of rows per day is quite a lot.
Assuming:
only 10 million new records per day;
your table contains only the columns that you mention in your question;
url is of type TEXT with a mean (Punycode) length of ~77 characters;
pageviews is of type INT;
int_views is of type INT;
ext_views is of type INT; and
datetime is of type DATETIME
then each day's data will occupy around 9.9 × 10⁸ bytes, which is almost 1 GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comment beneath) without partitioning on Windows or Linux, then a limit of a few GiB is not uncommon; which implies the table will reach its capacity well within a working week!
As @Gordon Linoff mentioned, you should partition your table; however, each table has a limit of 1024 partitions. With 1 partition/day (which would be eminently sensible in your case), you will be limited to storing under 3 years of data in a single table before the partitions start getting reused.
I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as @Ben explained, a composite index on (datetime, url) would help (I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query); and, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:
CREATE TABLE news_2012 (
INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;
CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables (an alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements as UNION views are neither updatable nor insertable):
DROP TABLE news;
CREATE TABLE news (
date DATE,
INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
You can then run your above query (along with any other) on this merge table:
SELECT url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM news
WHERE date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT 10;
A few points:
As the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
You're querying 23 hours of data not 24. You probably want to use an explicit less than, <, from midnight the next day to avoid missing anything.
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= '2012-08-29 00:00:00'
AND datetime < '2012-08-30 00:00:00'
GROUP BY url
ORDER BY sum(pageviews) DESC
LIMIT 10;
You could index this on datetime, url, pageviews, int_views, ext_views but I think that would be overkill; so, if the index isn't too big datetime, url seems like a good way to go. The only way to be certain is to test it and decide if any performance improvements in querying are worth the extra time taken in index maintenance.
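A sketch of that index; the url(100) prefix length is an assumption carried over from the earlier example, since a TEXT column cannot be indexed in full:
ALTER TABLE news ADD INDEX idx_datetime_url (datetime, url(100));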
As Gordon just mentioned in the comments, you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all your queries are based at the day level, it sounds like you might need to create a new partition each day.