Optimize SQL queries for statistics - MySQL

I would like to run some queries in MySQL that I know will be really slow:
I have 3 tables:
Users:
id, username, email
Question:
id, date, question
Answer:
id_question, id_user, response, score
And I would like to do some statistics, like the top X users with the best score (sum of all their scores) for all time or for a given period (last month, for example). Or it could be the users between the 100th and 110th positions.
I will have thousands of users and hundreds of questions, so the queries could be very slow, since I'll need to order by the sum of scores, limit to a given range, and sometimes only select certain questions depending on the date, ...
I would like to know if there are methods to optimize these queries!
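For reference, a baseline version of the ranking query I have in mind might look like this (the date range and the offset are just examples):
-- Top users by total score for questions in a date range,
-- returning positions 101-110
SELECT u.id, u.username, SUM(a.score) AS total_score
FROM Users u
JOIN Answer a   ON a.id_user = u.id
JOIN Question q ON q.id = a.id_question
WHERE q.date >= '2013-01-01' AND q.date < '2013-02-01'
GROUP BY u.id, u.username
ORDER BY total_score DESC
LIMIT 10 OFFSET 100;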

If you have a lot of data there is really no other choice: you can only optimize it by creating a new table that summarizes the data every day, week, or month. For example, summarize each user's scores per week and stamp each row with that week's date, or do the same per month. The longer the period each summary row covers, the faster your query will run.
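A minimal sketch of that idea, assuming the schema from the question (the weekly_scores table and its columns are made up for illustration):
-- Hypothetical weekly summary: one row per user per week
CREATE TABLE weekly_scores (
    id_user     INT  NOT NULL,
    week_start  DATE NOT NULL,
    total_score INT  NOT NULL,
    PRIMARY KEY (id_user, week_start)
);

-- Run once per week (e.g. from a cron job) for the week that just ended
INSERT INTO weekly_scores (id_user, week_start, total_score)
SELECT a.id_user,
       DATE_SUB(CURDATE(), INTERVAL 7 DAY),
       SUM(a.score)
FROM Answer a
JOIN Question q ON q.id = a.id_question
WHERE q.date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  AND q.date <  CURDATE()
GROUP BY a.id_user;
A monthly ranking then only has to sum four or five small rows per user instead of scanning every answer.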

For archived statistics, you can create tables that store rankings that will no longer change (last year, last month, last day). Try to precalculate as many statistics as possible in such tables, and put indexes on id_user, date, type_of_ranking, ...
Try to limit subqueries as much as possible.
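For example, such an archive table could be shaped roughly like this (the table and column names are only illustrative):
-- Hypothetical archive of frozen rankings
CREATE TABLE ranking_archive (
    id_user         INT  NOT NULL,
    ranking_date    DATE NOT NULL,
    type_of_ranking ENUM('day','month','year') NOT NULL,
    user_rank       INT  NOT NULL,
    total_score     INT  NOT NULL,
    PRIMARY KEY (type_of_ranking, ranking_date, user_rank),
    KEY idx_user_date (id_user, ranking_date)
);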

Related

How do I do a bulk addition in a SQL relation (based on query results from another relation)?

How would I do this question in SQL (taken from Glassdoor):
You have a table with date, user_id, song_id and count. It shows, at the end of each day, how many times in her history a user has listened to a given song. So count is a cumulative sum.
You have to update this on a daily basis based on a second table that records in real time when a user listens to a given song. Basically, at the end of each day, you go to this second table and pull a count of each user/song combination and then add this count to the first table that has the lifetime count.
I particularly do not know how to update a table in such a bulk/massive/looping way, and would appreciate the MySQL code to achieve something like that. I haven't written the code because I do not know how to do such a large-scale addition in an efficient manner.
It sounds like you don't add anything to the existing counts; you insert a new record for that day with the total play count as of that day. Old records in the history table are not updated.
At the end of each day, you run this:
INSERT INTO playhistory
SELECT CURDATE(), user_id, song_id, count(*)
FROM individualplays
GROUP BY user_id, song_id
The individualplays table holds the user and song IDs for all time. If a new user plays the same song 10 times today, the count(*) will be 10. Tomorrow, if she plays that song another 5 times, the count will be 15.
If you cannot guarantee to run the query right at the end of the day, your individualplays table needs the date and time that a song was played; then, at any time the day after, you can update your history table thus:
INSERT INTO playhistory
SELECT DATE_SUB(CURDATE(), INTERVAL 1 DAY), user_id, song_id, count(*)
FROM individualplays
WHERE playdate < CURDATE()
GROUP BY user_id, song_id
It's a shame you're using MySQL, actually, because more powerful RDBMSs can build the history entirely out of the individualplays table dynamically through analytic/window functions: devices that can do things like counting all the rows from the start of time up to the current row, per user/song. You can simulate these in MySQL, but it's pretty nasty; it basically involves joining the individualplays table to itself on user_id = user_id, song_id = song_id and playdate.
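For what it's worth, in an RDBMS with window functions (which now includes MySQL 8.0+), a sketch of that running count per user/song might look like this (table and column names follow the queries above):
-- Daily play counts first, then a running total per user/song up to each day
SELECT d.play_day,
       d.user_id,
       d.song_id,
       SUM(d.plays) OVER (PARTITION BY d.user_id, d.song_id
                          ORDER BY d.play_day) AS lifetime_plays
FROM (
    SELECT DATE(playdate) AS play_day, user_id, song_id, COUNT(*) AS plays
    FROM individualplays
    GROUP BY DATE(playdate), user_id, song_id
) AS d;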

MySQL alternative to INDEX BY for efficient grouping of database data on a non-unique field

My MySQL database table has multiple entries with the following structure:
id, title, date, time
There are presently 30 entries in the table and some of those share a common date.
What I'm trying to accomplish is retrieving the database data in such a way that will group them under common dates. So, all entries that share the same date will be grouped in an array indexed by that common date.
In another post, I learnt INDEX BY is great for what I'm trying to achieve but it works only/best on unique fields.
So, I am just curious if there is anything else that could help efficiently group my database entries.
SELECT date, GROUP_CONCAT(title)
FROM tbl
GROUP BY date
ORDER BY date;
Don't worry about performance until you have thousands of rows.

Count record views for a time period

I have a DB with a lot of records (of articles), and currently I keep track of how many times each record has been viewed by counting the views, so I can sort on something like "see the top 5 most viewed articles".
This is done with a column of integers, and whenever the record is retrieved, the integer count increases by 1.
This works fine but since the counting system is very simple, I can only see views of "all time".
I would like to have something like "see the top 5 most viewed articles this week".
The only way I can think of is to have a separate table which makes a record with the article Id and Date whenever an article is viewed, and then make a SELECT statement for a limited time period.
This could easily work, but at the same time the table would be very large in no time.
Is there any better way of accomplishing the same thing? I've seen this sorting criterion on many websites, but I don't know how it is achieved.
Any thoughts or comments?
Thanks in advance :)
Instead of a row for each view of each article, you could have a row per day. When an article is viewed, you would do:
INSERT INTO article_views (article_id, date, views)
VALUES (#article, CURRENT_DATE(), 1)
ON DUPLICATE KEY UPDATE views = views + 1;
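Note that ON DUPLICATE KEY UPDATE only works here if there is a unique key covering (article_id, date); a possible table definition (details beyond those two columns are assumptions) would be:
CREATE TABLE article_views (
    article_id INT  NOT NULL,
    date       DATE NOT NULL,
    views      INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (article_id, date)  -- the "duplicate key" the upsert relies on
);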
Then to get the top 5 articles viewed in the past week:
SELECT article_id, SUM(views) total_views
FROM article_views
WHERE date > NOW() - INTERVAL 7 DAY
GROUP BY article_id
ORDER BY total_views DESC
LIMIT 5
To keep the table from growing too large, you can delete old records periodically.

Optimizing MySQL Data Retrieval for Time Series Application

I'm working on a web app to display some analytics data from a MySQL database table. I expect to collect data from about 10,000 total users at the most. This table is going to have millions of records per user.
I'm considering giving each user their own table, but more importantly I want to figure out how to optimize data retrieval.
I get data from the database table using a series of SELECT COUNT queries for a particular day. An example is below:
SELECT * FROM
(SELECT COUNT(id) AS data_point_1 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '1') AS col_1
CROSS JOIN
(SELECT COUNT(id) AS data_point_2 FROM my_table WHERE customer_id = '1' AND datetime_added LIKE '2013-01-20%' AND status_id = '0') AS col_2
CROSS JOIN ...
When I want to retrieve data from the last 30 days, the query will be 30 times as long as it is above; 60 days likewise, etc. The user will have the ability to select the number of days e.g. 30, 60, 90, and a custom range.
I need the data for a time series chart. Just to be clear, data for each day could range from thousands of records to millions.
My question is:
Is this the most performant way of retrieving this data, or is there a better way of getting all the time series data I need in one SQL query? How is this going to work when a user needs data from the last 2 years, i.e. a MySQL query that is potentially over a thousand lines long?
Should I consider caching the retrieved data (using memcache, for example) for extended periods of time, e.g. an hour or more, to reduce server load? (Being that this is analytics data, it really should be real-time, but I'm afraid of overloading the server with queries for the same data even when there are no changes.)
Any assistance would be appreciated.
First, you should not put each user in a separate table. You have other options that are not nearly as intrusive on your application.
You should consider partitioning the data. Based on what you say, I would partition by time (by day, week, or month) and put an index on the user column. Your query should probably look more like:
select date(datetime), count(*)
from t
where userid = 1 and datetime between DATE1 and DATE2
group by date(datetime)
You can then pivot this, either in an outer query or in an application.
I would also suggest that you summarize the data on a daily basis, so your analyses can run on the summarized tables. This will make things go much faster.
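A sketch of the partitioning idea, assuming the table and datetime_added column from the question, and assuming the primary key includes datetime_added (MySQL requires the partitioning column to be part of every unique key); the partition names and ranges are just examples:
-- Monthly range partitions on the datetime column
ALTER TABLE my_table
PARTITION BY RANGE COLUMNS (datetime_added) (
    PARTITION p2013_01 VALUES LESS THAN ('2013-02-01'),
    PARTITION p2013_02 VALUES LESS THAN ('2013-03-01'),
    PARTITION pmax     VALUES LESS THAN (MAXVALUE)
);
With a range predicate on the datetime column, MySQL can then prune the scan down to the relevant partitions.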

Mysql database design for storing user scores over time

I'm creating a site where all of the users have a score that is updated everyday. I can easily create rankings from this score, however I'd like to be able to create a "Hot" list of the week or month, etc..
My brute-force design would be: each day, for every user, calculate their score and put it into the "Scores" table. So every day the Scores table would grow by the number of users. I could then rank users by their score deltas over whatever time period.
While I believe this would technically work, I feel like there has to be a more sophisticated way of doing this, right? Or not? I feel like a Scores table that grows every day by the number of users can't be the way other sites are doing it.
You get the most flexibility by not storing any snapshots of score at all. Instead, record incremental scores, as they happen.
If you have tables like this:
USER
user_id
name
personal_high_score
{anything else that you store once per user}
SCORE_LOG
score_log_id
user_id (FK to USER)
date_time
scored_points
Now you can get a cumulative score for a user as of any point in time with a simple query like:
select sum(scored_points)
from SCORE_LOG
where user_id = #UserID
and date_time <= #PointInTime
You can also easily get top ranking scorers for a time period with something like:
select
user_id
, sum(scored_points)
from SCORE_LOG
where date_time >= #StartOfPeriod
and date_time <= #EndOfPeriod
group by
user_id
order by
sum(scored_points) desc
limit 5
If you get to production and find that you're having performance issues in practice, then you could consider denormalizing a snapshot of whatever statistics make sense. The problem with these snapshot statistics is that they can get out of sync with your source data, so you'll need a strategy for recalculating the snapshots periodically.
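If you do take that route, a periodic rebuild of the snapshot from SCORE_LOG is one such strategy; a minimal sketch, assuming a hypothetical user_score_snapshot table:
-- Rebuild the snapshot from the source of truth
TRUNCATE TABLE user_score_snapshot;
INSERT INTO user_score_snapshot (user_id, total_points)
SELECT user_id, SUM(scored_points)
FROM SCORE_LOG
GROUP BY user_id;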
It's pretty much a truism (consider it a corollary of Murphy's Law) that if you have two sources of truth you'll eventually end up with two "truths".
Barranka was on the right track with his comment: you need to make sure you are not duplicating data wherever possible.
However, if you want to be able to go back to an old user's score, or pick out a day and see who was on top at a certain point (i.e. dynamic reporting), then you will need to store each record separately alongside a date. Having a separate table for this is useful, as you can derive the daily score from the existing user data via SQL and insert it into that table whenever you want.
The decision you have to make is how many user records you want to maintain in the history, and for how long. I have written the below with the idea that the "hot list" is the top 5 users; you could have a cron job or scheduled task running each day/month to run the inserts and also clean out very old data.
Users
id
username
score
score_ranking
id
user_id (we normalise by using the id rather than all the user info)
score_at_the_time
date_of_ranking
So to generate a single day's ranking, you could insert into this table with something like:
INSERT INTO
`score_ranking` (`user_id`, `score_at_the_time`, `date_of_ranking`)
SELECT
`id`, `score`, CURDATE()
FROM
`users`
ORDER BY
`score` DESC
LIMIT
5
To read the data for a specific date (or date range) you could then do:
SELECT * FROM score_ranking
WHERE date_of_ranking = 'somedate'
ORDER BY score_at_the_time DESC