I use MySQL for my database and I do some processing on the database side to make things easier for my application.
The queries I run used to be very fast, but recently my database has grown a lot and they have become very, very slow.
My application mainly computes statistics and has to fetch data from many related tables.
Here is an example:
tbl_game
+----+--------+----------+-----------+
| id | winner | duration | endedAt   |
+----+--------+----------+-----------+
| 1  | 1      | 1200     | timestamp |
| 2  | 0      | 1200     | timestamp |
| 3  | 1      | 1200     | timestamp |
| 4  | 1      | 1200     | timestamp |
+----+--------+----------+-----------+
winner is either 0 or 1, indicating which team won the game
duration is the length of the game in seconds
tbl_game_player
+--------+----------+------------+-------+--------+
| gameId | playerId | playerSlot | frags | deaths |
+--------+----------+------------+-------+--------+
| 1      | 100      | 1          | 24    | 50     |
| 1      | 150      | 2          | 32    | 52     |
| 1      | 101      | 3          | 26    | 62     |
| 1      | 109      | 4          | 48    | 13     |
| 1      | 123      | 5          | 24    | 52     |
| 1      | 135      | 6          | 30    | 30     |
| 1      | 166      | 7          | 28    | 48     |
| 1      | 178      | 8          | 52    | 96     |
| 1      | 190      | 9          | 12    | 75     |
| 1      | 106      | 10         | 68    | 25     |
+--------+----------+------------+-------+--------+
The rows shown are only for the first game, with id 1.
Each game has 10 player slots, where slots 1-5 are team 0 and slots 6-10 are team 1.
There are more columns in my real tables; this is just to give an overview.
So I need to calculate the statistics of each player across all games. I created a view to accomplish this, and it works fine when I have little data.
Here is an example:
+--------------------------------------------------------------------------+
| gameId | playerId | frags | deaths | actions | team | percent | isWinner |
|--------+----------+-------+--------+---------+------+---------+----------|
actions = frags + deaths
percent = (actions / sum(actions of all players in the same team)) * 100
team is 0 if playerSlot is in 1-5 and 1 if it is in 6-10
isWinner is 1 if the player's team equals winner, otherwise 0
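For example, using the sample rows above: player 100 in game 1 has actions = 24 + 50 = 74; team 0 (slots 1-5) has total actions = (24+32+26+48+24) + (50+52+62+13+52) = 383, so percent = 74 / 383 * 100 ≈ 19.3. Player 100 is in slot 1, so team = 0, and since game 1 was won by team 1, isWinner = 0.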
This is just one calculation, and I have many others to perform. My database has more than 1 million records, and the queries are very slow.
Here is the query for the view above:
SELECT
    tgp.gameId,
    tgp.playerId,
    tgp.frags,
    tgp.deaths,
    tgp.frags + tgp.deaths AS actions,
    IF(playerSlot IN (1,2,3,4,5), 0, 1) AS team,
    ((SELECT actions) / tgpx.totalActions) * 100 AS percent,
    IF((SELECT team) = tg.winner, 1, 0) AS isWinner
FROM tbl_game_player tgp
INNER JOIN tbl_game tg ON tgp.gameId = tg.id
INNER JOIN (
    SELECT
        gameId,
        SUM(frags) AS totalFrags,
        SUM(deaths) AS totalDeaths,
        SUM(frags) + SUM(deaths) AS totalActions,
        IF(playerSlot IN (1,2,3,4,5), 0, 1) AS team
    FROM tbl_game_player
    GROUP BY gameId, team
) tgpx ON tgp.gameId = tgpx.gameId AND team = tgpx.team
It's quite obvious that indexes don't help you here¹, because you want all of the data from both tables. You even want the data from tbl_game_player twice, once aggregated and once not aggregated. So there are millions of records to read and join. Your query is fine, and I see no real way to improve it.
¹ Of course you should always have indexes on primary and foreign keys, so the DBMS can make use of them in joins (e.g. there should be an index on tbl_game_player(gameId)).
So your options lie outside the query:
Hardware (obviously).
Add a computed column for the team to tbl_game_player, so at least you save its evaluation when querying.
Partitions. One partition per team, so the aggregates can be calculated separately.
Pre-computed data: Add a table tbl_game_team holding the sums and fill it with triggers (see the sketch after this list). That way you don't have to compute the aggregates in your query.
Data warehouse table: Make a table holding the complete result. Fill it with triggers or at intervals.
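To illustrate the pre-computed-data option, here is a minimal sketch. The table tbl_game_team, the trigger name, and its column names are assumptions rather than part of the original schema, and a complete solution would also need matching UPDATE/DELETE triggers:

-- Hypothetical summary table holding per-game, per-team totals
CREATE TABLE tbl_game_team (
    gameId       INT     NOT NULL,
    team         TINYINT NOT NULL,  -- 0 or 1, derived from playerSlot
    totalFrags   INT     NOT NULL DEFAULT 0,
    totalDeaths  INT     NOT NULL DEFAULT 0,
    totalActions INT     NOT NULL DEFAULT 0,
    PRIMARY KEY (gameId, team)
);

-- Keep the totals up to date as player rows are inserted
DELIMITER $$
CREATE TRIGGER trg_game_player_ai
AFTER INSERT ON tbl_game_player
FOR EACH ROW
BEGIN
    INSERT INTO tbl_game_team (gameId, team, totalFrags, totalDeaths, totalActions)
    VALUES (NEW.gameId,
            IF(NEW.playerSlot IN (1,2,3,4,5), 0, 1),
            NEW.frags, NEW.deaths, NEW.frags + NEW.deaths)
    ON DUPLICATE KEY UPDATE
        totalFrags   = totalFrags   + NEW.frags,
        totalDeaths  = totalDeaths  + NEW.deaths,
        totalActions = totalActions + NEW.frags + NEW.deaths;
END$$
DELIMITER ;

The statistics query could then join tbl_game_team on (gameId, team) instead of recomputing the aggregates in the derived table.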
Setting up indexes would speed up your queries. Queries can take a while to run when there are a lot of results, but this is definitely a start.
For large databases a MySQL INDEX can be very helpful with speed problems. An index can be created on a table to find data more quickly and efficiently, so you should create one. You can learn more about MySQL indexes here: http://www.w3schools.com/sql/sql_create_index.asp
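As a concrete illustration of that advice applied to the schema in the question (the index name is made up, and, as the previous answer points out, an index cannot avoid reading every row for this particular aggregation):

CREATE INDEX idx_game_player_gameId ON tbl_game_player (gameId);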
Related
I have a table where I store a player ID and how many points they have; at the moment the only index is on the player ID.
Like this:
+----------+--------+
| playerID | points |
+----------+--------+
| 1 | 14 |
| 2 | 18 |
| 3 | 0 |
| 4 | 55 |
+----------+--------+
At the moment I have ~100k players
What I want is, given a player ID, to find what rank they are in terms of points. I've got this query so far, but execution times are high (> 0.5 seconds), depending on how few points the player has.
SELECT playerID, count(playerID)
FROM playerRanks
WHERE points >=
(SELECT points
FROM playerRanks
WHERE playerID = '3')
Which will return this
+----------+------+
| playerID | rank |
+----------+------+
| 3 | 2 |
+----------+------+
I've tried adding an index on points, but whilst that helps in the EXPLAIN output, it doesn't help execution times. Is there a better way of optimizing this, or is indexing points the best option? Alternatively, should I change the query?
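For reference, the index on points that I added looks like this (the index name is arbitrary):

ALTER TABLE playerRanks ADD INDEX idx_points (points);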
I have read all the arguments: Tell SQL what you want, not how to get it. Use set-based approaches instead of procedural logic. Avoid cursors and loops at all costs.
Unfortunately, I have been racking my brain for weeks and I can't figure out how to come up with a set-based approach to generating an iterative COUNT for sequential subsets of chronologically ordered data.
Here is the specific application of the problem I am working on.
I do football-related research using a database that comprises many years of play-by-play data, which is of course arranged chronologically by year, game, and play. The database is loaded onto a web server running MySQL 5.0.
The fields I need for this particular problem are contained in the core table. Here is some sample data from the relevant part of the table:
GID | PID | OFF | DEF | QTR | MIN | SEC | PTSO | PTSD
--------------------------------------------------------
121 | 2455 | ARI | CHI | 2 | 4 | 30 | 17 | 10
121 | 2456 | ARI | CHI | 2 | 4 | 15 | 17 | 10
121 | 2457 | ARI | CHI | 2 | 3 | 53 | 17 | 10
121 | 2458 | ARI | CHI | 2 | 3 | 31 | 20 | 10
The columns represent, respectively: unique game identifier, unique play identifier, which team is on offense for that play, which team is on defense for that play, the quarter and time the play occurred, and the offense's and defense's scores going into the play. In other words, in (hypothetical) game 121, the Arizona Cardinals scored a field goal on play 2457 (i.e., going into play 2458).
What I want to do is go through several years' worth of data game by game, second by second, and count the number of times any possible score differential occurred at any given elapsed time. The following query arranges the data by seconds elapsed and score differential:
SELECT core.GID, core.PID, core.QTR, core.MIN, core.SEC, core.PTSO, core.PTSD,
((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS secEl,
core.PTSO - core.PTSD AS oDif, (core.PTSO - core.PTSD) * -1 AS dDif
FROM core
ORDER BY secEl ASC, oDif ASC;
The result looks something like this:
GID | PID | OFF | DEF | QTR | MIN | SEC | PTSO | PTSD | secEl | oDif | dDif
---------------------------------------------------------------------------------
616 | 100022 | CHI | MIN | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
617 | 100169 | HOU | DAL | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
618 | 100224 | PHI | SEA | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
619 | 100303 | JAX | NYJ | 1 | 15 | 00 | 0 | 0 | 0 | 0 | 0
Although that looks pretty, my goal is not to sort the data chronologically. Rather, I want to step sequentially through every one of the 4,500 possible seconds (four 15-minute quarters plus one 15-minute overtime period) in an NFL game and count the number of times every score differential has ever occurred in each one of those seconds.
In other words, I don't want to count just the number of times a team has been up by, say, 21 points at 1,800 seconds elapsed (i.e., the start of the second quarter) between 2002 and 2013. I want to count the number of times a team has been up by 21 points at any point in a game. On top of that, I want to do this for every score differential that has ever occurred (i.e., -50, -49, -48, ..., 0, 1, 2, ... 48, 49, 50, ...) for every second of every game.
This would be relatively easy to accomplish with a series of nested loops, but it wouldn't be the most reusable of code.
What I want to do is construct set logic that will COUNT the instances of each score differential that has occurred at every second of time elapsed without using loops or cursors. The results would be tabulated as follows:
secondsElapsed | scoreDif | Occurrences
-----------------------------------------
10 | -1 | 12
10 | 0 | 125517
10 | 1 | 0
10 | 2 | 3
Here is a sample query for getting the total number of instances of a specific score differential (+21) at a specific time point (3,000 seconds elapsed):
SELECT ((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS timeElapsed,
(core.PTSO - core.PTSD) AS diff, COUNT(core.PTSO - core.PTSD) AS occurrences
FROM core
WHERE ((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) = 3000
AND ABS(core.PTSO - core.PTSD) = 21
That query returns the following results:
timeElapsed | diff | occurrences
----------------------------------
3000 | 21 | 5
Now I want to generalize this query to count the instances of every differential at every second elapsed.
Your description is rather confusing, but if you want to "COUNT all of the possible score differentials for every possible second without using loops or cursors", then I would do something like this:
1) Build a work table (either a temporary table or a table data type) and fill it with the time increments you want, e.g.
QTR | MIN | SEC |
1 | 00 | 01
1 | 00 | 02
..
1 | 01 | 59
1 | 02 | 00
1 | 02 | 01
1 | 02 | 02
..
4 | 15 | 59
2) You then use this as the basis of your query. Cross join a list of the games you are interested in with the work table, to give you a table of every game and every time increment in that game.
3) Then left join your query above back onto the result of (2).
With this result set you can then look at a whole game and sum/count as necessary without having to loop.
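A rough MySQL sketch of those three steps might look like the following. The work-table name all_seconds is an assumption, a single elapsed-seconds column is used instead of QTR/MIN/SEC for brevity, and the table still has to be populated with the values 0 through 4499 by whatever means is convenient:

-- 1) Work table holding every possible second of a game
CREATE TABLE all_seconds (secEl INT PRIMARY KEY);

-- 2) Cross join every game with every second,
-- 3) then left join the play-by-play query back in and count
SELECT s.secEl,
       p.oDif AS scoreDif,
       COUNT(p.PID) AS occurrences
FROM all_seconds s
CROSS JOIN (SELECT DISTINCT GID FROM core) g
LEFT JOIN (
    SELECT core.GID, core.PID,
           ((core.QTR - 1) * 900 + (900 - (core.MIN * 60 + core.SEC))) AS secEl,
           core.PTSO - core.PTSD AS oDif
    FROM core
) p ON p.GID = g.GID AND p.secEl = s.secEl
GROUP BY s.secEl, p.oDif;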
Not sure if this will cure your problem, but you could try using ROW_NUMBER() over a partition...
SELECT aColumn, ROW_NUMBER() OVER (PARTITION BY <column> ORDER BY <column>) AS rowNum FROM aTable
I did it using a sub-query and two variables: one to define the time point and another to define the point difference.
The query returns the diff, the number of times the offensive side had it, the number of times the defensive side had it, and the total.
SET @Diff = 7;
SET @Seconds = 1530;
SELECT ABS(core.PTSO - core.PTSD) AS diff,
       SUM(CASE WHEN core.PTSO - core.PTSD <= 0 THEN 1 ELSE 0 END) OffensiveTimes,
       SUM(CASE WHEN core.PTSO - core.PTSD >= 0 THEN 1 ELSE 0 END) DefensiveTimes,
       SUM(1) TotalTimes
FROM (SELECT core.GID, core.PID, core.QTR, core.MIN, core.SEC, core.PTSO, core.PTSD,
             ((core.QTR - 1) * 900 + (900-(core.MIN * 60 + core.SEC))) AS secEl,
             core.PTSO - core.PTSD AS oDif, (core.PTSO - core.PTSD) * -1 AS dDif
      FROM core
     ) core
WHERE secEl = @Seconds AND ABS(core.PTSO - core.PTSD) = @Diff
GROUP BY ABS(core.PTSO - core.PTSD);
This returns the following for the small dataset you gave:
7 diff, 0 OffensiveTimes, 1 DefensiveTimes, 1 Times
Hope that was what you were looking for :)
I have a table named questions as follows:
+----+---------------------------------------------------------+----------+
| id | title | category |
+----+---------------------------------------------------------+----------+
| 89 | Tinker or work with your hands? | 2 |
| 54 | Sketch, draw, paint? | 3 |
| 53 | Express yourself clearly? | 4 |
| 77 | Keep accurate records? | 6 |
| 32 | Efficient? | 6 |
| 52 | Make original crafts, dinners, school or work projects? | 3 |
| 70 | Be elected to office or make your opinions heard? | 5 |
| 78 | Take photographs? | 3 |
| 84 | Start your own political campaign? | 5 |
| 9 | Free spirit or a rebel? | 3 |
| 38 | Lead a group? | 5 |
| 71 | Work in groups? | 4 |
| 2 | Helpful? | 4 |
| 4 | Mechanical? | 6 |
| 14 | Responsible? | 6 |
| 66 | Pitch a tent, an idea? | 1 |
| 62 | Write useful business letters? | 5 |
| 28 | Creative? | 3 |
| 68 | Perform experiments? | 2 |
| 10 | Like to figure things out? | 2 |
+----+---------------------------------------------------------+----------+
I have a SQL query to get one random record from each category. Can anyone convert this MySQL query to a Rails ActiveRecord query (without using Question.find_by_sql)? The MySQL query works absolutely fine, but I need an ActiveRecord query because of dependencies in my further steps.
Here is the MySQL query:
SELECT t.id, title as question, category
FROM
(
SELECT
(
SELECT id
FROM questions
WHERE category = t.category
ORDER BY RAND()
LIMIT 1
) id
FROM questions t
GROUP BY category
) q JOIN questions t
ON q.id = t.id
Thank You for your consideration!
When things get crazy, one has to reach out for Arel:
It is intended to be a framework framework; that is, you can build
your own ORM with it, focusing on innovative object and collection
modeling as opposed to database compatibility and query generation.
So what we want to do is let Arel create the query for us. The approach used here: the questions table is left joined with a randomized version of itself:
q_normal = Arel::Table.new("questions")
q_random = Arel::Table.new("questions").project(Arel.sql("*")).order("RAND()").as("q2")
Time to left join
query = q_normal.join(q_random, Arel::Nodes::OuterJoin).on(q_normal[:category].eq(q_random[:category])).group(q_normal[:category]).order(q_random[:category])
Now you can use which columns you want using project, e.g.:
query.project(q_normal[:id])
The only way I can think of to do this requires a good bit of application code. I don't think there's a way of accessing the RAND() functionality in MySQL (or equivalent in other DB technologies) using ActiveRecord. Here's what I came up with:
counts = Question.group(:category_id).count(:id)
offsets = {}
counts.each do |cat_id, count|
offsets[cat_id] = rand(count)
end
random_questions = []
offsets.each do |cat_id, offset|
random_questions.push(Question.where(:category_id => cat_id).offset(offset).first)
end
I have a data table that I use to do some calculations. The resulting data set after calculations looks like:
+------------+-----------+------+----------+
| id_process | id_region | type | result |
+------------+-----------+------+----------+
| 1 | 4 | 1 | 65.2174 |
| 1 | 5 | 1 | 78.7419 |
| 1 | 6 | 1 | 95.2308 |
| 1 | 4 | 1 | 25.0000 |
| 1 | 7 | 1 | 100.0000 |
+------------+-----------+------+----------+
On the other hand, I have another table that contains a set of ranges used to classify the calculation results. The ranges table looks like:
+----------+-------+-----+--------+
| id_level | start | end | status |
+----------+-------+-----+--------+
| 1        | 0     | 75  | Danger |
| 2        | 76    | 90  | Alert  |
| 3        | 91    | 100 | Good   |
+----------+-------+-----+--------+
I need to do a query that adds the corresponding 'status' column to each value when doing the calculations. Currently, I do that by adding the following field to the calculation query:
select
...,
...,
[math formula] as result,
(select status
from ranges r
where result between r.start and r.end) status
from ...
where ...
It works OK, but when I have a lot of rows (more than 200K), the calculation query becomes slow.
My question is: is there some way to get that 'status' value without doing that subquery?
Has anyone worked on something similar before?
Thanks
Yes, you are looking for a subquery and join:
select s.*, r.status
from (select s.*
      from <your query here>
     ) s left outer join
     ranges r
     on s.result between r.start and r.end
Explicit joins often optimize better than nested select. In this case, though, the ranges table seems pretty small, so this may not be the performance issue.
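For instance, if the calculated rows were available from a table or view, say calc_results with the columns shown in the question (the name is hypothetical), the pattern would look like this:

SELECT c.id_process, c.id_region, c.type, c.result, r.status
FROM calc_results c
LEFT OUTER JOIN ranges r
    ON c.result BETWEEN r.start AND r.end;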
Suppose I have the following database setup (a simplified version from what I actually have):
Table: news_posting (500,000+ entries)
| --------------------------------------------------------------|
| posting_id | name | is_active | released_date | token |
| 1 | posting_1 | 1 | 2013-01-10 | 123 |
| 2 | posting_2 | 1 | 2013-01-11 | 124 |
| 3 | posting_3 | 0 | 2013-01-12 | 125 |
| --------------------------------------------------------------|
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)
Table: news_category (500 entries)
| ------------------------------|
| category_id | name |
| 1 | category_1 |
| 2 | category_2 |
| 3 | category_3 |
| ------------------------------|
PRIMARY category_id
Table: news_cat_match (1,000,000+ entries)
| ------------------------------|
| category_id | posting_id |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 2 | 2 |
| 3 | 2 |
| 1 | 3 |
| 2 | 3 |
| ------------------------------|
UNIQUE idx (category_id, posting_id)
My task is as follows. I must get a list of 50 latest news postings (at some offset) that are active, that are before today's date, and that are in one of the 20 or so categories that are specified in the request. Before I choose the 50 news postings to return, I must sort the appropriate news postings by token in descending order. My query is currently similar to the following:
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
With just one specified category_id the query does not involve a filesort and is reasonably fast, because it does not have to remove duplicate results. However, EXPLAIN on the above query with multiple category_id values shows that a filesort is needed, and the query is extremely slow on my data set.
Is there any way to optimize the table setup and/or the query?
I was able to get the above query to run even faster than the single-category version by rewriting it as follows:
SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
SELECT ncm.posting_id
FROM news_cat_match ncm
WHERE ncm.posting_id = np.posting_id
AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
LIMIT 1
)
ORDER BY np.token DESC LIMIT 50
This now takes under a second on my data set.
The sad part is that this is even faster than if there is just one category_id specified. That's because the subset of news items is bigger than with just one category_id, so it finds the results more quickly.
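As an aside, the correlated EXISTS above looks up news_cat_match by posting_id first, while the existing UNIQUE idx leads with category_id; an additional index that leads with posting_id, along these lines, might serve that probe more directly (whether it actually pays off on this data set is an assumption):

ALTER TABLE news_cat_match ADD INDEX idx_posting_category (posting_id, category_id);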
Now my next question is whether this can be optimized for cases where a category has only a few postings that are spread out in time.
The following is still pretty slow on my development machine. Although it's fast enough on the production server, I would like to optimize this if possible.
SELECT DISTINCT posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
Does anyone have any further suggestions?