I have a semi-large (10,000,000+ record) credit card transaction database that I need to query regularly. I have managed to optimise most queries to be sub 0.1 seconds but I'm struggling to do the same for sub-queries.
The purpose of the following query is to obtain the number of "inactive" credit cards (credit cards that have not made a card transaction in the last x days / weeks) for both the current user's company, and all companies (so as to form a comparison).
The sub-query first obtains the last card transaction of every credit card, and the parent query then removes any expired credit cards and groups the cards by their associated company and by whether or not they are deemed "inactive" (the (UNIX_TIMESTAMP() - (14 * 86400)) expression is used in place of a PHP time calculation).
SELECT
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllActiveCards,
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions
JOIN
(
SELECT
CardSerialNumberID,
MAX(CardTransactions.Timestamp) AS LastActivity,
CardTransactions.CompanyID
FROM CardTransactions
GROUP BY
CardTransactions.CardSerialNumberID, CardTransactions.CompanyID
) lastCardTransactions
ON
CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
CardTransactions.Timestamp = lastCardTransactions.LastActivity AND
CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()
The indexes in use are on CardSerialNumberID, CompanyID, Timestamp for the inner query, and CardSerialNumberID, Timestamp, CardExpiryTimestamp, CompanyID for the outer query.
The query takes around 0.4 seconds to execute on repeated runs, but the initial run can be as slow as 0.9 - 1.1 seconds, which is a big problem when loading a page with 4-5 queries of this type.
One thought I did have was to calculate the overall inactive card number in a routine separate to this, perhaps run daily. This would allow me to adjust this query to only pull records for a single company, thus reducing the dataset and bringing the query time down. However, this is only really a temporary fix, as the database will continue to grow until the same amount of data is being analysed anyway.
Note: The fields in the query above have been renamed to make them more generic, as the specific subject this query is used on is quite complex. As such there is no DB schema to give (and if there were, you'd need a dataset of 10,000,000+ records to test the query anyway). I'm looking for a conceptual fix more than for anyone to actually give me an adjusted query.
Any help is very much appreciated!
You're querying the transactions table twice, so your query works on something of the size Transactions x Transactions, which might be big.
One idea would be to monitor all credit cards for the last x days/weeks and save the inactive ones in an extra table INACTIVE_CARDS that gets updated every day (add a field with the number of days of inactivity). Then you could limit the SELECT in your subquery to search only in INACTIVE_CARDS:
SELECT
CardSerialNumberID,
MAX(Transactions.Timestamp) AS LastActivity,
Transactions.CompanyID
FROM Transactions
WHERE CardSerialNumberID IN (SELECT CardSerialNumberID FROM INACTIVE_CARDS)
GROUP BY
Transactions.CardSerialNumberID, Transactions.CompanyID
Of course a card might have become active in the last hour, but you don't need to check all transactions for that.
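A minimal sketch of that daily refresh, assuming INACTIVE_CARDS holds a CardSerialNumberID plus a DaysInactive column (the table layout and the 14-day threshold here are my assumptions):
-- hypothetical nightly rebuild: cards with no transaction in the last 14 days
DELETE FROM INACTIVE_CARDS;
INSERT INTO INACTIVE_CARDS (CardSerialNumberID, DaysInactive)
SELECT
    CardSerialNumberID,
    FLOOR((UNIX_TIMESTAMP() - MAX(Timestamp)) / 86400) AS DaysInactive
FROM Transactions
GROUP BY CardSerialNumberID
HAVING MAX(Timestamp) < UNIX_TIMESTAMP() - (14 * 86400);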
Please use different "aliases" for the two instances of Transactions. What you have is confusing to read.
The inner GROUP BY:
SELECT CardSerialNumberID, CompanyID, MAX(Timestamp)
FROM CardTransactions
GROUP BY CardSerialNumberID, CompanyID
Now this index is good ("covering") for the inner:
INDEX(CardSerialNumberID, CompanyID, Timestamp)
Recommend testing (timing) the subquery by itself.
For the outside query:
INDEX(CardSerialNumberID, Timestamp, -- for JOINing (prefer this order)
CardExpiryTimestamp, CompanyID) -- covering (in this order)
Please move CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP() to a WHERE clause. It is helpful to the reader that the ON clause contain only the conditions that tie the two tables together. The WHERE contains any additional filtering. (The Optimizer will run this query the same, regardless of where you put that clause.)
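For example (a skeleton only; the SELECT list and subquery stay exactly as they are in the original):
FROM CardTransactions
JOIN ( ... ) lastCardTransactions
ON
CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
CardTransactions.Timestamp = lastCardTransactions.LastActivity
WHERE CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()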
Oh. Can that filter be applied in the subquery? It will make the subquery run faster. (It may impact the optimal INDEX, so I await your answer.)
I have assumed that most rows have not "expired". If they have, then other techniques may be better.
For much better performance, look into building and maintaining summary tables of the info. Or, perhaps, rebuild (daily) a table with these stats. Then reference the summary table instead of the raw data.
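One possible shape for such a summary table, rebuilt by a daily job (all names here are assumptions, since no schema was given), so the page then reads 4-5 rows from it instead of scanning the transaction history:
CREATE TABLE CardActivitySummary (
    SummaryDate   DATE NOT NULL,
    CompanyID     INT NOT NULL,
    InactiveCards INT UNSIGNED NOT NULL,
    ActiveCards   INT UNSIGNED NOT NULL,
    PRIMARY KEY (SummaryDate, CompanyID)
);
INSERT INTO CardActivitySummary (SummaryDate, CompanyID, InactiveCards, ActiveCards)
SELECT
    CURDATE(),
    CompanyID,
    SUM(LastActivity <  UNIX_TIMESTAMP() - (14 * 86400)),
    SUM(LastActivity >= UNIX_TIMESTAMP() - (14 * 86400))
FROM
(
    -- last activity per card per company; the expiry filter is omitted here for brevity
    SELECT CardSerialNumberID, CompanyID, MAX(Timestamp) AS LastActivity
    FROM CardTransactions
    GROUP BY CardSerialNumberID, CompanyID
) lastCardTransactions
GROUP BY CompanyID;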
If that does not work, consider building a temp table with the "4-5" queries' info at the start of the web page, then feed those queries off the temp table.
Rather than repetitively calculating the "14 days ago" and current UNIX_TIMESTAMP() values inside the query, follow the advice of
https://code.tutsplus.com/tutorials/top-20-mysql-best-practices--net-7855
and, prior to the SELECT, compute them once in PHP with code similar to:
$uts = time();
$uts_14d = $uts - (14 * 86400);
then substitute the $uts_14d and $uts variables into the 5 places they appear in your query.
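If you would rather stay entirely in SQL instead of interpolating PHP variables, MySQL user variables give a similar effect (just a sketch; the variable names are my own):
SET @uts = UNIX_TIMESTAMP();
SET @uts_14d = @uts - (14 * 86400);
-- then use @uts_14d in the four SUM(IF(...)) comparisons and @uts in the expiry check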
Related
I have two big tables from which I mostly SELECT, but complex queries with 2 joins are extremely slow.
The first table is GameHistory, in which I store records for every finished game (I have 15 games in a separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store a record for every player who participated in a certain game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
First table has 1.5 million records and the second one 3.5 million.
What indexes should I add? (I tried some and it was still all slow.)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, because you apply a function to the field.
You should have an index on both fields game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
It would be even better to have an index on date_end's date part rather than on the whole time-carrying date_end. This is not possible in MySQL, however. So consider adding another column trunc_date_end for the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time.
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
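A sketch of that extra column, trigger, and index (the trigger and index names are my own):
ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;
-- backfill existing rows once
UPDATE GameHistory SET trunc_date_end = DATE(date_end);
-- keep it filled for new rows
CREATE TRIGGER gh_set_trunc_date_end
BEFORE INSERT ON GameHistory
FOR EACH ROW
    SET NEW.trunc_date_end = DATE(NEW.date_end);
CREATE INDEX idx_trunc_date_end_game ON GameHistory (trunc_date_end, game_id);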
Add the EXPLAIN keyword at the beginning of your query, then run it in a database viewer (e.g. SQLyog) and you will see details about how the query is executed. Look at the 'rows' column: it shows how many rows each step has to examine. Then index the table columns indicated in the EXPLAIN result for the steps that examine large numbers of rows.
I think my explanation is kinda messy; you can ask for clarification.
Assume that we have a very large table, for example 3,000 rows of data.
We need to select all the rows whose status field is < 4.
We know that the relevant rows will be at most 2 months old (of course each row has a date column).
Is this query the most efficient?
SELECT * FROM database.tableName WHERE status<4
AND DATE< '".date()-5259486."' ;
(date() is PHP; 5259486 seconds is two months.)
Assuming you're storing dates as DATETIME, you could try this:
SELECT * FROM database.tableName
WHERE status < 4
AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH)
Also, for optimizing search queries you could use EXPLAIN ( http://dev.mysql.com/doc/refman/5.6/en/explain.html ) like this:
EXPLAIN [your SELECT statement]
Another point where you can tweak response times is by carefully placing appropriate indexes.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data.
Here are some explanations & tutorials on MySQL indexes:
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
However, keep in mind that using TIMESTAMP instead of DATETIME is more efficient; the former is 4 bytes; the latter is 8. They hold equivalent information (except for timezone issues).
3,000 rows of data is not large for a database. In fact, it is on the small side.
The query:
SELECT *
FROM database.tableName
WHERE status < 4 ;
Should run pretty quickly on 3,000 rows, unless each row is very, very large (say 10k). You can always put an index on status to make it run faster.
The query suggested by cassi.iup makes more sense:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < DATE_SUB(NOW(), INTERVAL 2 MONTH);
It will perform better with a composite index on status, date. My question is: do you want all rows with a status less than 4, or do you want all rows with a status less than 4 in the past two months? In the first case, you would have to continually change the query. You would be better off with:
SELECT *
FROM database.tableName
WHERE status < 4 AND DATE < date('2013-06-19');
(as of the date when I am writing this.)
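To back the composite index suggestion above, the DDL would look something like this (the index name is my own):
ALTER TABLE tableName ADD INDEX idx_status_date (`status`, `DATE`);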
SELECT DISTINCT `Stock`.`ProductNumber`,`Stock`.`Description`,`TComponent_Status`.`component`, `TComponent_Status`.`certificate`,`TComponent_Status`.`status`,`TComponent_Status`.`date_created`
FROM Stock , TBOM , TComponent_Status
WHERE `TBOM`.`Component` = `TComponent_Status`.`component`
AND `Stock`.`ProductNumber` = `TBOM`.`Product`
Basically, table TBOM has:
24,588,820 rows
The query is ridiculously slow, and I'm not too sure what I can do to make it better. I have indexed all the other tables in the query, but TBOM has a few duplicates in the columns so I can't even run that command. I'm a little baffled.
To start, index the following fields:
TBOM.Component
TBOM.Product
TComponent_Status.component
Stock.ProductNumber
Not all of the above indexes may be necessary (e.g., the last two), but it is a good start.
Also, remove the DISTINCT if you don't absolutely need it.
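As concrete DDL, those suggestions would look roughly like this (the index names are my own; note that plain, non-unique indexes are fine on columns that contain duplicates):
ALTER TABLE TBOM
    ADD INDEX idx_tbom_component (Component),
    ADD INDEX idx_tbom_product (Product);
ALTER TABLE TComponent_Status ADD INDEX idx_tcs_component (component);
ALTER TABLE Stock ADD INDEX idx_stock_productnumber (ProductNumber);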
The only thing I can really think of is having an index on your Stock table on
(ProductNumber, Description)
This can help in two ways. Since you are only using those two fields from Stock in the query, the engine won't be required to go to the full data row of each stock record; both parts are in the index, so it can use the index alone. Additionally, you are doing DISTINCT, so having the index available to help optimize the DISTINCT should also help.
Now, the other issue for time: since you are joining from stock to product to product status with a DISTINCT, you are asking for all 24 million TBOM items (I assume a bill of materials), and because each BOM component could have had multiple statuses created, you are getting every BOM row for EVERY component change.
If what you are really looking for is something like the most recent change of any component item, you might want to do it in reverse... Something like...
SELECT DISTINCT
Stock.ProductNumber,
Stock.Description,
JustThese.component,
JustThese.certificate,
JustThese.`status`,
JustThese.date_created
FROM
( select DISTINCT
TCS.Component,
TCS.Certificate,
TCS.`status`,
TCS.date_created
from
TComponent_Status TCS
where
TCS.date_created >= 'some date you want to limit based upon' ) as JustThese
JOIN TBOM
on JustThese.Component = TBOM.Component
JOIN Stock
on TBOM.Product = Stock.ProductNumber
If this is the case, I would ensure an index on the component status table, something like (date_created, component, certificate, status). This way, the WHERE clause would be optimized, and the DISTINCT would be too, since the pieces are already part of the index.
But the way you currently have it, if you have 10 TBOM entries for a single "component", and that component has 100 changes, you now have 10 * 100 = 1,000 entries in your result set. Scale that across 24 million rows and it's definitely not going to look good.
I'm working on a Web app to display some analytics data from a MySQL database table. I expect to collect data from about 10,000 total users at the most. This table is going to have millions of records per user.
I'm considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data from the last 30 days, the query will be 30 times as long as it is above; 60 days likewise, etc. The user will have the ability to select the number of days e.g. 30, 60, 90, and a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
Is this the most performant way of retrieving this data, or is there a better way of getting all the time series data I need in one SQL query?! How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?!
Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? (Being that this is analytics data it really should be real-time, but I'm afraid of overloading the server with queries for the same data even when there are no changes.)
Any assistance would be appreciated.
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would have one partition by time (by day, week, or month) and an index on the users. Your query should probably look more like:
select date(datetime), count(*)
from t
where userid = 1 and datetime between DATE1 and DATE2
group by date(datetime)
You can then pivot this, either in an outer query or in an application.
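Here is a rough sketch of the partitioning side, using the question's my_table / customer_id / datetime_added names (the partition boundaries are illustrative only, and note that MySQL requires the partitioning column to be part of every unique key, so the primary key may need adjusting):
-- index on the user plus the datetime for per-user date-range lookups
ALTER TABLE my_table ADD INDEX idx_customer_datetime (customer_id, datetime_added);
-- monthly range partitions on the datetime column
ALTER TABLE my_table
PARTITION BY RANGE (TO_DAYS(datetime_added)) (
    PARTITION p201301 VALUES LESS THAN (TO_DAYS('2013-02-01')),
    PARTITION p201302 VALUES LESS THAN (TO_DAYS('2013-03-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);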
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
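A sketch of that daily summarization (the table name and layout are assumptions; only customer_id, datetime_added and status_id come from the question):
CREATE TABLE daily_counts (
    customer_id INT NOT NULL,
    day DATE NOT NULL,
    status_id TINYINT NOT NULL,
    cnt INT UNSIGNED NOT NULL,
    PRIMARY KEY (customer_id, day, status_id)
);
-- run once a day to roll up yesterday's rows
INSERT INTO daily_counts (customer_id, day, status_id, cnt)
SELECT customer_id, DATE(datetime_added), status_id, COUNT(*)
FROM my_table
WHERE datetime_added >= CURDATE() - INTERVAL 1 DAY
  AND datetime_added < CURDATE()
GROUP BY customer_id, DATE(datetime_added), status_id;
The chart queries can then read from daily_counts, which has at most a few rows per user per day.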
I've spent a few hours playing around with this one, without success so far.
I'm outputting a very large query, and trying to split it into chunks before processing the data. This query will basically run every day, and one of the fields ('last_checked') will be used to ensure the same data isn't processed more than once a day.
Here's my existing query;
<cfquery name="getprice" maxrows="100">
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
ORDER BY ID ASC
</cfquery>
I then run a cfoutput query on the results to do various updates. The table currently holds just over 100,000 records and is starting to struggle to process everything in one hit, hence the need to split it into chunks.
My intention is to cfschedule it to run every so often (I'll increase the maxrows and probably have it run every 15 minutes, for example). However, I need it to only return results that haven't been updated within the last 24 hours - this is where I'm getting stuck.
I know MySQL has its own DATEDIFF and TIMEDIFF functions, but I don't seem to be able to grasp the syntax for them - if indeed they're applicable to my use case (the docs seem to contradict themselves in that regard, or at least the ones I've read do).
Any pointers very much appreciated!
Try this with MySQL first:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 24 HOUR
ORDER BY ID ASC
I would caution you against using maxrows=100 in your cfquery. This will still return the full recordset to CF from the database, and only then will CF filter out all but the first 100 rows. When you are dealing with a 100,000 row dataset, then this is going to be hugely expensive. Presumably, your filter for only the last 24 hours will dramatically reduce the size of your base result set, so perhaps this won't really be a big problem. However, if you find that even by limiting your set to those changed within the last 24 hours you still have a very large set of records to work with, you could change the way you do this to work much more efficiently. Instead of using CF to filter your results, have MySQL do it using the LIMIT keyword in your query:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 1 DAY
ORDER BY ID ASC
LIMIT 0,100
You could also easily step between "pages" of 100 rows by adding an offset value before the row count: LIMIT 300, 100 would return rows 301-400 of your result set. Doing the paging this way will be much faster than offloading it to CF.