I'm working on a Web app to display some analytics data from a MySQL database table. I expect to collect data from about 10,000 total users at the most. This table is going to have millions of records per user.
I'm considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data for the last 30 days, the query will be 30 times as long as the one above; likewise for 60 days, and so on. The user will be able to select the number of days, e.g. 30, 60, 90, or a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
Is this the most performant way of retrieving this data, or is there a better way of getting all the time-series data I need in one SQL query? How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?!
Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? (Being that this is analytics data, it really should be real-time, but I'm afraid of overloading the server with queries for the same data even when there are no changes.)
Any assistance would be appreciated.
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would partition by time (day, week, or month) and put an index on the user id. Your query should probably look more like:
select date(datetime), count(*)
from t
where userid = 1 and datetime between DATE1 and DATE2
group by date(datetime)
You can then pivot this, either in an outer query or in an application.
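If you want the per-status counts as separate columns (as in your original query), one way to pivot is conditional aggregation; here is a rough sketch using the column names from your query, with an arbitrary 30-day range:

select date(datetime_added) as day,
       sum(status_id = 1) as data_point_1,   -- MySQL treats the boolean as 0/1
       sum(status_id = 0) as data_point_2
from my_table
where customer_id = 1
  and datetime_added >= '2013-01-20'
  and datetime_added <  '2013-02-19'
group by date(datetime_added);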
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
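A minimal sketch of such a daily summary (the table and column names are placeholders, not a prescribed schema), refreshed once a day:

create table daily_counts (
    customer_id int not null,
    day date not null,
    status_id tinyint not null,
    cnt int not null,
    primary key (customer_id, day, status_id)
);

-- run once per day, e.g. just after midnight, to summarize yesterday
insert into daily_counts (customer_id, day, status_id, cnt)
select customer_id, date(datetime_added), status_id, count(*)
from my_table
where datetime_added >= curdate() - interval 1 day
  and datetime_added <  curdate()
group by customer_id, date(datetime_added), status_id;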
Related
I am running the simple query below; execution time is 1 sec but fetch time is 30 sec. It contains 100,000 records in total.
SELECT id, referrer, timestamp
FROM masterstats_innodb
WHERE video = 1869 AND timestamp between '2011-10-01' and '2021-01-21';
An index is created on the video and timestamp columns, and a range partition has even been created on the timestamp column. Can anything be done to fetch the result faster?
Please provide SHOW CREATE TABLE.
Plan A: INDEX(video, timestamp)
Plan B - slightly better because of being "covering":
INDEX(video, timestamp, referrer, id)
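For reference, either plan could be added with DDL along these lines (the index names are arbitrary):

ALTER TABLE masterstats_innodb
    ADD INDEX idx_video_ts (video, timestamp);                       -- Plan A
ALTER TABLE masterstats_innodb
    ADD INDEX idx_video_ts_cov (video, timestamp, referrer, id);     -- Plan B, covering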
PARTITIONing will not help the performance of this query any more than indexing.
You say "it" contains 100K rows -- are you referring to the table? Or the just the number of rows returned. If 'table', then the index will help. If the 'resultset', then you are constrained by having to send so many rows. What will the client do with 100K rows?? Can the server condense the data (eg summarize it in some way)?
I have a semi-large (10,000,000+ record) credit card transaction database that I need to query regularly. I have managed to optimise most queries to be sub 0.1 seconds but I'm struggling to do the same for sub-queries.
The purpose of the following query is to obtain the number of "inactive" credit cards (credit cards that have not made a card transaction in the last x days / weeks) for both the current user's company, and all companies (so as to form a comparison).
The sub-query first obtains the last card transaction of all credit cards, and then the parent query removes any expired credit cards and groups the cards based on their associated company and whether or not they are deemed "inactive" (the (UNIX_TIMESTAMP() - (14 * 86400)) is used in place of a PHP time calculation).
SELECT
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllActiveCards,
SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions
JOIN
(
SELECT
CardSerialNumberID,
MAX(CardTransactions.Timestamp) AS LastActivity,
CardTransactions.CompanyID
FROM CardTransactions
GROUP BY
CardTransactions.CardSerialNumberID, CardTransactions.CompanyID
) lastCardTransactions
ON
CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
CardTransactions.Timestamp = lastCardTransactions.LastActivity AND
CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()
The indexes in use are on CardSerialNumberID, CompanyID, Timestamp for the inner query, and CardSerialNumberID, Timestamp, CardExpiryTimestamp, CompanyID for the outer query.
The query takes around 0.4 seconds to execute when done multiple times, but the initial run can be as slow as 0.9 - 1.1 seconds, which is a big problem when loading a page with 4-5 of these types of query.
One thought I did have was to calculate the overall inactive card number in a routine separate to this, perhaps run daily. This would allow me to adjust this query to only pull records for a single company, thus reducing the dataset and bringing the query time down. However, this is only really a temporary fix, as the database will continue to grow until the same amount of data is being analysed anyway.
Note: the fields in the query above have been modified to make them more generic, as the specific subject this query is used on is quite complex. As such there is no DB schema to give (and if there were, you'd need a dataset of 10,000,000+ records to test the query anyway). I'm looking for a conceptual fix rather than an adjusted query.
Any help is very much appreciated!
You're querying the transactions table twice, so the intermediate result has a size of up to Transactions × Transactions rows, which might be big.
One idea would be to monitor all credit cards for the last x days/weeks and save them in an extra table INACTIVE_CARDS that gets updated every day (add a field with the number of days of inactivity). Then you could limit the SELECT in your subquery to just search in INACTIVE_CARDS
SELECT
CardSerialNumberID,
MAX(Transactions.Timestamp) AS LastActivity,
Transactions.CompanyID
FROM Transactions
WHERE CardSerialNumberID IN (SELECT CardSerialNumberID FROM INACTIVE_CARDS)
GROUP BY
Transactions.CardSerialNumberID, Transactions.CompanyID
Of course a card might have become active in the last hour, but you don't need to check all transactions for that.
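A rough sketch of what that extra table and its daily refresh could look like (the table layout and the 14-day threshold are my assumptions, based on the query above):

CREATE TABLE INACTIVE_CARDS (
    CardSerialNumberID INT NOT NULL PRIMARY KEY,
    DaysInactive INT NOT NULL
);

-- refreshed once a day, e.g. from a cron job
TRUNCATE TABLE INACTIVE_CARDS;
INSERT INTO INACTIVE_CARDS (CardSerialNumberID, DaysInactive)
SELECT CardSerialNumberID,
       FLOOR((UNIX_TIMESTAMP() - MAX(Timestamp)) / 86400)
FROM CardTransactions
GROUP BY CardSerialNumberID
HAVING MAX(Timestamp) < UNIX_TIMESTAMP() - (14 * 86400);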
Please use different "aliases" for the two instances of Transactions. What you have is confusing to read.
The inner GROUP BY:
SELECT card_sn, company, MAX(ts)
FROM Trans
GROUP BY card_sn, company
Now this index is good ("covering") for the inner query:
INDEX(CardSerialNumberID, CompanyID, Timestamp)
Recommend testing (timing) the subquery by itself.
For the outside query:
INDEX(CardSerialNumberID, Timestamp, -- for JOINing (prefer this order)
CardExpiryTimestamp, CompanyID) -- covering (in this order)
Please move CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP() to a WHERE clause. It is helpful to the reader that the ON clause contain only the conditions that tie the two tables together. The WHERE contains any additional filtering. (The Optimizer will run this query the same, regardless of where you put that clause.)
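That is, the same conditions, just rearranged:

ON  CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID
AND CardTransactions.Timestamp = lastCardTransactions.LastActivity
WHERE CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()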
Oh. Can that filter be applied in the subquery? It will make the subquery run faster. (It may impact the optimal INDEX, so I await your answer.)
I have assumed that most rows have not "expired". If they have, then other techniques may be better.
For much better performance, look into building and maintaining summary tables of the info. Or, perhaps, rebuild (daily) a table with these stats. Then reference the summary table instead of the raw data.
If that does not work, consider building a temp table with the "4-5" info at the start of the web page, then feed the page off that temp table.
Rather than repetitively calculating the 14-day offset and the current UNIX_TIMESTAMP(), follow the advice of
https://code.tutsplus.com/tutorials/top-20-mysql-best-practices--net-7855
and, prior to the SELECT, run code similar to:
$uts_14d = UNIX_TIMESTAMP() - (14 * 86400);
$uts = UNIX_TIMESTAMP();
and substitute the $uts_14d and $uts variables into the five lines of your query that use those expressions.
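For example, the affected lines would become something like this (a hypothetical substitution, assuming the query string is built in PHP):

SUM(IF(LastActivity <  $uts_14d, 1, 0)) AS AllInactiveCards,
SUM(IF(LastActivity >= $uts_14d, 1, 0)) AS AllActiveCards,
...
AND CardTransactions.CardExpiryTimestamp > $uts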
I have two big tables from which I mostly select, but complex queries with 2 joins are extremely slow.
The first table is GameHistory, in which I store records for every finished game (I have 15 games in a separate table).
Fields: id, date_end, game_id, ..
The second table is GameHistoryParticipants, in which I store a record for every player who participated in a given game.
Fields: player_id, history_id, is_winner
The query to get today's top players is very slow (20+ seconds).
Query:
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10
The first table has 1.5 million records and the second one 3.5 million.
What indexes should I add? (I tried some and it was all still slow.)
You are only interested in today's records. However, you search the whole GameHistory table with TIMESTAMPDIFF to detect those records. Even if you have an index on that column, it cannot be used, because you apply a function to the column.
You should have an index on both fields game_id and date_end. Then ask for the date_end value directly:
WHERE gh.date_end >= DATE(NOW())
AND gh.date_end < DATE_ADD(DATE(NOW()), INTERVAL 1 DAY)
AND gh.game_id = 'scrabble'
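Such an index could be created with something like the following (equality column game_id first, then the range column date_end; the index name is arbitrary):

ALTER TABLE GameHistory ADD INDEX idx_game_date (game_id, date_end);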
It would be even better to have an index on date_end's date part rather than on the full datetime value. This is not possible in MySQL, however. So consider adding another column, trunc_date_end, for the date part alone, which you'd fill with a before-insert trigger. Then you'd have an index on trunc_date_end and game_id, which should help you find the desired records in no time.
WHERE gh.trunc_date_end = DATE(NOW())
AND gh.game_id = 'scrabble'
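A minimal sketch of that approach (the column, trigger, and index names are assumptions):

ALTER TABLE GameHistory ADD COLUMN trunc_date_end DATE;

-- backfill existing rows once
UPDATE GameHistory SET trunc_date_end = DATE(date_end);

CREATE TRIGGER gh_before_insert
BEFORE INSERT ON GameHistory
FOR EACH ROW SET NEW.trunc_date_end = DATE(NEW.date_end);

CREATE INDEX idx_trunc_game ON GameHistory (trunc_date_end, game_id);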
Add the EXPLAIN command at the beginning of your query, then run it in a database viewer (e.g. SQLyog) and you will see details about how the query is executed. Look at the 'rows' column: it shows how many rows MySQL estimates it has to examine at each step. Then index the table columns used in the steps that examine a large number of rows.
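For example, prefixing the query from the question:

EXPLAIN
SELECT p.nickname, count(ghp.player_id) as num_games_today
FROM `GameHistory` as gh
INNER JOIN GameHistoryParticipants as ghp ON gh.id=ghp.history_id
INNER JOIN Players as p ON p.id=ghp.player_id
WHERE TIMESTAMPDIFF(DAY, gh.date_end, NOW())=0 AND gh.game_id='scrabble'
GROUP BY ghp.player_id ORDER BY count(ghp.player_id) DESC LIMIT 10;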
I think my explanation is a bit messy; feel free to ask for clarification.
I would like to perform some request in mysql that i know will be really slow:
I have 3 tables:
Users:
id, username, email
Question:
id, date, question
Answer
id_question, id_user, response, score
And I would like to compute some statistics, like the top X users with the best score (sum of all their scores) for all time or for a given period (the last month, for example). Or it could be the users ranked between 100th and 110th.
I will have thousands of users and hundreds of questions, so the queries could be very slow, since I'll need to order by the sum of scores, limit to a given range, and sometimes select only some questions depending on the date, ...
I would like to know if there are methods to optimize these queries!
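For instance, the kind of query I expect to need (a rough sketch using the tables above, with an arbitrary date filter) would be:

SELECT u.username, SUM(a.score) AS total_score
FROM Users u
INNER JOIN Answer a ON a.id_user = u.id
INNER JOIN Question q ON q.id = a.id_question
WHERE q.date >= '2013-01-01'       -- optional date filter
GROUP BY u.id, u.username
ORDER BY total_score DESC
LIMIT 10 OFFSET 100;               -- e.g. users ranked 101st to 110th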
If you have a lot of data, there is no other choice: you can only optimize by creating a new table that summarizes the data per day, week, or month. For example, summarize each user's scores by week and stamp each row with that week's date, or do the same by month. The longer the range each summary row covers, the faster your query will run.
For archived statistics, you can create tables that store rankings that won't change any more (last year, last month, last day). Try to pre-calculate as many statistics as possible in such tables, and put indexes on id_user, date, type_of_ranking...
Try to limit subqueries as much as possible.
I'm creating a site where all of the users have a score that is updated everyday. I can easily create rankings from this score, however I'd like to be able to create a "Hot" list of the week or month, etc..
My brute-force design would be: each day, for every user, calculate their score and put it into the "Scores" table. So every day the Scores table would grow by the number of users. I could then rank users by their score deltas over whatever time period.
While I believe this would technically work, I feel like there has to be a more sophisticated way of doing this, right? Or not? I feel like a Scores table that grows every day by the number of users can't be the way other sites are doing it.
You get the most flexibility by not storing any snapshots of score at all. Instead, record incremental scores, as they happen.
If you have tables like this:
USER
user_id
name
personal_high_score
{anything else that you store once per user}
SCORE_LOG
score_log_id
user_id (FK to USER)
date_time
scored_points
Now you can get a cumulative score for a user as of any point in time with a simple query like:
select sum(scored_points)
from SCORE_LOG
where user_id = #UserID
and date_time <= #PointInTime
You can also easily get top ranking scorers for a time period with something like:
select
user_id
, sum(scored_points)
from SCORE_LOG
where date_time >= #StartOfPeriod
and date_time <= #EndOfPeriod
group by
user_id
order by
sum(scored_points) desc
limit 5
If you get to production and find that you're having performance issues in practice, then you could consider denormalizing a snapshot of whatever statistics make sense. The problem with these snapshot statistics is that they can get out of sync with your source data, so you'll need a strategy for recalculating the snapshots periodically.
It's pretty much a truism (consider it a corollary of Murphy's Law) that if you have two sources of truth you'll eventually end up with two "truths".
Barranka was on the right track with his comment, you need to make sure you are not duplicating any of the data wherever possible.
However, if you want to be able to look back at an old user's score, or pick out a day and see who was top at that point (i.e. dynamic reporting), then you will need to record each score separately alongside a date. Having a separate table for this would be useful, as you could derive the daily score from the existing user data via SQL and insert it into that table whenever you want.
The decision you have to make is how many user records you want to maintain in the history, and for how long. I have written the below with the idea that the "hot list" would be the top 5 users; you could have a CRON job or scheduled task running each day/month to run the inserts and also clean out very old data.
Users
id
username
score
score_ranking
id
user_id (we normalise by using the id rather than all the user info)
score_at_the_time
date_of_ranking
So to generate a single day's ranking you could insert into this table. Something like:
INSERT INTO
`score_ranking` (`user_id`, `score_at_the_time`, `date_of_ranking`)
SELECT
`id`, `score`, CURDATE()
FROM
`users`
ORDER BY
`score` DESC
LIMIT
5
To read the data for a specific date (or date range) you could then do:
SELECT * FROM score_ranking
WHERE date_of_ranking = 'somedate'
ORDER BY score_at_the_time DESC