I'm trying to write a query that will find the most consecutive of something in my database. This has led to me trying out variables, which I've never really used before.
The problem is that my query gives exactly the result I think it should, but when I use it as a subquery inside another query, it all seems to go to pot once I add the GROUP BY/ORDER BY clauses.
Is this normal, and if so what tends to be the solution? Or have I made a simple mistake?
The results of my subquery are perfect, and all I'm trying to do in the outer query is select the maximum of the "consecutive" column that I've created. This column takes the form of
@r := IF(nFound = nThis, @r + 1, 0)
I.e. it simply counts up 1 for each row that fits my where/order arrangement, and resets to 0 if a match isn't found.
I was hoping that the subquery results would be "set" and simply used as the values before being used in the main query.
I liken this to excel; sometimes you want to "paste as values" rather than copying all of the formulas across, if you get what I mean. Is there a simple way to do so in MySQL?
I wondered if creating a view might "solidify" the data set, but then found out variables aren't allowed in views!
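For what it's worth, a rough, untested sketch of the "paste as values" idea: materialise the variable-based rows into a temporary table first, then aggregate from the stored values. Here ordered_rows is a made-up stand-in for whatever ordered subquery produces nFound and nThis.

CREATE TEMPORARY TABLE tmp_streaks AS
SELECT @r := IF(nFound = nThis, @r + 1, 0) AS nConsec,
       src.*
FROM (SELECT @r := 0) vars, ordered_rows src;   -- ordered_rows is hypothetical

SELECT MAX(nConsec) FROM tmp_streaks;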
EDIT
OK, here's the query. It's not pretty, but I've been hacking around and trying lots of things. If you remove the last 2 lines and the MAX function it works fine; with them it only returns a single row rather than 10 rows.
I've never used a cross join before today either; virtually everything I do normally seems to be just "JOIN" or "LEFT JOIN"s, but today it seemed necessary.
Basically the idea is to retrieve the maximum number of chronologically consecutive events that each person has been present at. Feel free to amend as you see fit!
The "P.person < 10" was just a test. There are in fact thousands of people, but if I tried to do it on everyone at once it was sitting and doing nothing for ages - the crossjoin getting too big, I assume?
SET @r = 0;

SELECT person, MAX(nConsec) FROM (
    SELECT @r := IF(nFound = person, @r + 1, 0) AS nConsec,
           test.*
    FROM (SELECT P.person, event, tDate, MAX(C.person) AS nFound
          FROM PEOPLE P
          CROSS JOIN EVENTS E
          LEFT JOIN COMPETITORS C ON C.event = E.event AND C.person = P.person
          WHERE P.person < 10
            AND tDate < NOW()
          GROUP BY P.person, event, tDate
          ORDER BY P.person ASC, tDate ASC
         ) test
) test2
GROUP BY person
ORDER BY MAX(nConsec) DESC
EDIT 2
OK, I've no idea how, but while changing some things to preserve a bit of anonymity I seem to have inadvertently fixed my own code... A pleasant surprise, but annoying that no amount of ctrl-Z and ctrl-shift-Z seems to show me what I was doing wrong in the first place!
Any opinion/advice on the mess I've got is still appreciated. I'm sure I can do something cleverer that doesn't use a cross join. There are about 30,000 rows in "people" and 1,000 in "events", and about 500 competitors per event, so I can see why a cross join gives me issues (15 billion rows, I make that...). The query takes 0.6 seconds for those 10 IDs that I picked out, and 34 seconds if I raise it to 1,000 IDs.
What does this do for you:
SELECT person, MAX(nConsec) AS numConsecutive FROM (
    SELECT person, COUNT(*) AS nConsec FROM (
        SELECT @r := @r + (COALESCE(@person, P.person) <> P.person) AS consecutive,
               @person := P.person AS person
        FROM (
            SELECT @r := 0, @person := NULL
        ) vars
        JOIN PEOPLE P
        JOIN EVENTS E
        LEFT JOIN COMPETITORS C
            ON C.person = P.person
            AND C.event = E.event
        ORDER BY tDate
    ) ordered
    GROUP BY person, consecutive
) runs
GROUP BY person
Modified from code found at http://www.dancewithgrenades.com/blog/mysql-consecutive-row-streaks.
Note that if you're counting across multiple people, you need to keep track of the person you're counting for (the @person variable). I think this should run quicker, though, mostly due to the lack of GROUPing in the innermost subquery, which was probably having a large impact on performance. If performance still isn't good enough, I'd suggest creating a column in PEOPLE to hold this consecutive attendance value, modifying the query to work on only one person at a time, and running the query for different sets of users at different times to update the value in PEOPLE.
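A rough sketch of that caching idea (the maxConsec column and the batching scheme are assumptions, not part of the original schema; the derived table here would really be the streak query above limited to one batch of people -- a simple attendance count stands in for it to keep the sketch short):

ALTER TABLE PEOPLE ADD COLUMN maxConsec INT NOT NULL DEFAULT 0;

UPDATE PEOPLE P
JOIN (
    SELECT C.person, COUNT(*) AS numConsecutive   -- stand-in for the real streak query
    FROM COMPETITORS C
    WHERE C.person BETWEEN 1 AND 1000             -- one batch of ids per scheduled run
    GROUP BY C.person
) streaks ON streaks.person = P.person
SET P.maxConsec = streaks.numConsecutive;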
Oh, and as far as CROSS JOINs go: in MySQL, CROSS JOIN is equivalent to INNER JOIN, which is equivalent to JOIN. You've used cross joins before, you just didn't realize it. ;)
Related
If I use GROUP BY then I will get just 1 row per group. For example
Sessions table: SessionId (other things)
Actions table: ActionId, SessionId, (other things)
With:
SELECT S.*, A.*
FROM ActionList A
JOIN SessionList S ON A.SessionId = S.SessionId
WHERE 1 /* various criteria to filter */
ORDER BY S.SessionId DESC, ActionId DESC;
Thus showing me the most recent session at the top. Now I want to look at only sessions with 2 or more actions.
If I use GROUP BY A.SessionId then I can get COUNT(ActionId) and use HAVING to look at only rows with the required count, but I won't get both (or more) rows, just the one.
I suspect I can do this by JOINing a table of SessionIds and the count of action IDs, but I'm fairly new to joins (I could do this via a subquery and ANY).
If a view would help, I would create a view of the form:
SELECT SessionId, COUNT(*) FROM Actions GROUP BY SessionId;
Or put this in brackets and JOIN on it (but I confess I'd have to look up 3-table joins).
What is the neatest way to do this?
Also, is this where "foreign keys" come into play? That'd probably stop the "ambiguity errors" I get if I don't qualify SessionId. I've avoided them for fear of TRIGGERs, and I also didn't know about JOINs and just used subqueries until recently. I've realised it is stupid to avoid things that were added to help.
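For reference, declaring that relationship might look like the sketch below (table and column names are taken from the query above; the constraint name is invented). Note that a foreign key enforces integrity, but you would still qualify SessionId in queries to avoid ambiguity errors.

ALTER TABLE ActionList
  ADD CONSTRAINT fk_actionlist_session
  FOREIGN KEY (SessionId) REFERENCES SessionList (SessionId);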
Additionally, I'm quite timid with joins because I know what they do in the worst case: if I JOIN a table with m rows to another with n, I could end up with m*n rows. That could be VERY large! I'm dealing with large tables (as in: the schema won't fit in RAM large), so that is quite scary. I do know MySQL optimises well (able to move stuff from HAVING to WHERE and so forth), but still!
If you want to look at sessions with two or more actions, then use a join:
select sl.*
from SessionList sl join
(select SessionId, count(*) as cnt
from Actions
group by SessionId
) a
on sl.SessionId = a.SessionId and cnt > 1;
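An equivalent (untested) variant puts the two-or-more filter in a HAVING clause inside the derived table, which keeps the join condition purely on the key:

select sl.*
from SessionList sl join
     (select SessionId
      from Actions
      group by SessionId
      having count(*) > 1
     ) a
     on sl.SessionId = a.SessionId;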
I've been trying to structure a massive query, and I have succeeded and been able to actually finish the query. However, I went from my dev environment (small database) to testing on the live environment (big database), and I've run into performance problems.
I think the answer can be found here: https://dba.stackexchange.com/a/16376
But is there really no way around it? The reason I am even putting the subqueries in a VIEW is that they have more complex constructs.
Example of the VIEWS / queries:
pjl view:
(SELECT `pj`.`id` AS `id`,`pj`.`globalId` AS `globalId`,`pj`.`date` AS `date`,`pj`.`serverId` AS `serverId`,`pj`.`playerId` AS `playerId`,'playerjoins' AS `origin`
FROM `playerjoins` `pj`)
UNION ALL
(SELECT `pl`.`id` AS `id`,`pl`.`globalId` AS `globalId`,`pl`.`date` AS `date`,`pl`.`serverId` AS `serverId`,`pl`.`playerId` AS `playerId`,'playerleaves' AS `origin`
FROM `playerleaves` `pl`)
ll_below view:
SELECT `ll`.`id` AS `id`,`ll`.`globalId` AS `globalId`,`ll`.`date` AS `date`,`ll`.`serverId` AS `serverId`,`ll`.`gamemodeId` AS `gamemodeId`,`ll`.`mapId` AS `mapId`,`pjl`.`origin` AS `origin`,`pjl`.`date` AS `pjldate`,`pjl`.`playerId` AS `playerId`
FROM `pjl`
JOIN `levelsloaded` `ll`
ON `pjl`.`date` <= `ll`.`date`
the, now simple, query:
SELECT * FROM
(
(SELECT * FROM ll_below WHERE playerId = 976) llbelow
INNER JOIN
(SELECT id, MAX(pjldate) AS maxdate FROM ll_below WHERE playerId = 976 GROUP BY id) llbelow_inner
ON llbelow.id = llbelow_inner.id AND llbelow.pjldate = llbelow_inner.maxdate
)
WHERE origin = 'playerjoins'
ORDER BY date DESC
I could put everything in one big query, but in my eyes it then becomes a big mess.
I also know why the performance is being hit so hard: MySQL cannot use the MERGE algorithm for the pjl view because there is a UNION ALL in it. If I put the WHERE playerId = 976 clauses in the correct places, the performance hit goes away, but I'd also have a query consisting of 50 lines or so.
Can someone please suggest what to do if I want performance and a query that is still concise?
This clause:
WHERE origin = 'playerjoins'
Means that you didn't need to do a UNION at all, since you're not using any of the rows from pl by the end of the query.
You're right that the view is likely forcing a temporary table instead of using the merge algorithm.
UNION ALL also creates its own temporary table. This case is optimized in MySQL 5.7.3 (still pre-alpha as of this writing), according to Bug #50674 Do not create temporary tables for UNION ALL.
Also, the GROUP BY is probably creating a third level of temporary table.
I see you're also doing a greatest-n-per-group operation, to match the rows with the max date per id. There are different solutions for this type of operation, which don't use a subquery. See my answers for example:
Retrieving the last record in each group
Fetch the row which has the Max value for a column
Depending on the number of rows and other conditions, I've seen both solutions for greatest-n-per-group queries give better performance. So you should test both solutions and see which is better given the state and size of your data.
I think you should unravel the views and unions and subqueries. See if you can apply the various WHERE conditions (like playerId=976) directly against the base tables before doing joins and aggregates. That should greatly reduce the number of examined rows, and avoid the multiple layers of temp tables caused by the view and union and group by.
Re your comment:
The query you seem to want is the most recent join per level for one specific player.
Something like this:
SELECT ll.id,
ll.globalId,
ll.date AS leveldate,
ll.serverId,
ll.gamemodeId,
ll.mapId,
pj.date AS joindate,
pj.playerId
FROM levelsloaded AS ll
INNER JOIN playerjoins AS pj
ON pj.date <= ll.date
LEFT OUTER JOIN playerjoins AS pj2
ON pj.playerId = pj2.playerId AND pj2.date <= ll.date AND pj.date < pj2.date
WHERE pj.playerId = 976
AND pj2.playerId IS NULL
ORDER BY joindate DESC
(I have not tested this query, but it should get you started.)
Bill is absolutely correct... your views don't even really provide any benefit. I've tried to build something for you, but my interpretation may not be exactly correct. Start by asking yourself IN SIMPLE WORDS: what am I trying to get? Here is what I came up with.
I'm looking for a single player (hence your player ID = 976). I'm also only considering the PLAYERJOINS instance (not the player leaving, which knocks out the union part). For this player, I want the most recent date they joined a game. From that date as the baseline, I want all levelsloaded records that were created at or after the maximum date joined.
So, the first query is nothing but the maximum date for player 976 from the playerjoins table; who cares about anything else, or any other user. The ID here is the same as it would be in the levelsloaded table via the join, so getting that player ID and the same levelsloaded ID for the same person is, IMO, pointless. Then get the rest of the details from levelsloaded on/after the max date for the same person, and order by whatever.
If my interpretation of your query is incorrect, offer obvious clarification for adjustments.
SELECT
ll.id,
ll.globalId,
ll.`date`,
ll.serverId,
ll.gamemodeId,
ll.mapId,
'playerjoins' as origin,
playerMax.MaxDate AS pjldate
FROM
( SELECT MAX( pj.`date` ) as MaxDate
FROM playerjoins pj
where pj.id = 976 ) playerMax
JOIN levelsloaded ll
ON ll.id = 976
AND playerMax.MaxDate <= ll.`date`
I have an SQL query (see below) that returns exactly what I need, but when run through phpMyAdmin it takes anywhere from 0.0009 seconds to 0.1149 seconds, and occasionally all the way up to 7.4983 seconds.
Query:
SELECT
e.id,
e.title,
e.special_flag,
CASE WHEN a.date >= '2013-03-29' THEN a.date ELSE '9999-99-99' END as date,
CASE WHEN a.date >= '2013-03-29' THEN a.time ELSE '99-99-99' END as time,
cat.lastname
FROM e_table as e
LEFT JOIN a_table as a ON (a.e_id=e.id)
LEFT JOIN c_table as c ON (e.c_id=c.id)
LEFT JOIN cat_table as cat ON (cat.id=e.cat_id)
LEFT JOIN m_table as m ON (cat.name=m.name AND cat.lastname=m.lastname)
JOIN (
SELECT DISTINCT innere.id
FROM e_table as innere
LEFT JOIN a_table as innera ON (innera.e_id=innere.id AND
innera.date >= '2013-03-29')
LEFT JOIN c_table as innerc ON (innere.c_id=innerc.id)
WHERE (
(
innera.date >= '2013-03-29' AND
innera.flag_two=1
) OR
innere.special_flag=1
) AND
innere.flag_three=1 AND
innere.flag_four=1
ORDER BY COALESCE(innera.date, '9999-99-99') ASC,
innera.time ASC,
innere.id DESC LIMIT 0, 10
) AS elist ON (e.id=elist.id)
WHERE (a.flag_two=1 OR e.special_flag) AND e.flag_three=1 AND e.flag_four=1
ORDER BY a.date ASC, a.time ASC, e.id DESC
Explain Plan:
The question is:
Which part of this query could be causing the wide range of difference in performance?
To specifically answer your question: it's not a specific part of the query that's causing the wide range of performance. That's MySQL doing what it's supposed to do - being a Relational Database Management System (RDBMS), not just a dumb SQL wrapper around comma separated files.
When you execute a query, the following things happen:
The query is compiled to a 'parametrized' query, eliminating all variables down to the pure structural SQL.
The compilation cache is checked to see whether a recent, usable execution plan exists for the query.
The query is compiled into an execution plan if needed (this is what EXPLAIN shows).
For each execution plan element, the memory caches are checked to see whether they contain fresh and usable data; otherwise the intermediate data is assembled from the master table data.
The final result is assembled by putting all the intermediate data together.
What you are seeing is that when the query costs 0.0009 seconds, the cache was fresh enough to supply all the data together, and when it peaks at 7.5 seconds either something changed in the queried tables, or other queries pushed the in-memory cache data out, or the DBMS had some other reason to suspect it needed to recompile the query or fetch all the data again. Some of the other variation probably comes down to whether the indexes used are still cached freshly enough in memory.
In short: the query is ridiculously slow; you're just sometimes lucky that caching makes it appear fast.
To solve this, I'd recommend looking into 2 things:
First and foremost - a query this size should not have a single line in its execution plan reading "No possible keys". Research how indexes work, make sure you realize the impact of MySQL's limitation of using a single index per joined table, and tweak your database so that each line of the plan has an entry under 'key'.
Secondly, review the query itself. DBMSs are at their fastest when all they have to do is combine raw data. Programmatic elements like CASE and COALESCE are often useful, but they force the database to evaluate more at runtime than just raw table data. Try to eliminate such statements, or move them to the business logic as post-processing on the retrieved data.
Finally, never forget that MySQL is actually a rather stupid DBMS. It is optimized for performance on the simple data-fetching queries most websites require, and for those generic problems it is often faster than SQL Server and Oracle. Once you start complicating things with functions, CASEs, huge join or matching conditions, etc., the competitors frequently have better-optimized query compilers. So when MySQL starts becoming slow on a specific query, consider splitting it into two or more smaller queries so it doesn't become confused, and do some post-processing in PHP or whatever language you are calling from. I've seen many cases where this increased performance a LOT, just by not confusing MySQL, especially where subqueries were involved (as in your case). The fact that your subquery is a derived table, and not just a subquery, is particularly known to complicate things for MySQL beyond what it copes with well.
Let's start with the fact that both your outer and inner queries work with the "e" table, with a minimum requirement of flag_three = 1 AND flag_four = 1 (regardless of your inner query's ((x AND y) OR z) condition). Also, your outer WHERE clause references a.flag_two with no NULL handling, which forces your LEFT JOIN to actually become an (INNER) JOIN. It also appears every "e" record MUST have a category, since you look up cat.lastname with no COALESCE() if none is found; that makes sense, as it appears to be a lookup-table reference. As for the m_table and c_table, you are not selecting or filtering on anything from them, so they can be removed completely.
Would the following query get you the same results?
select
e1.id,
e1.Title,
e1.Special_Flag,
e1.cat_id,
coalesce( a1.date, '9999-99-99' ) ADate,
coalesce( a1.time, '99-99-99' ) ATime,
cat.LastName
from
e_table e1
LEFT JOIN a_table as a1
ON e1.id = a1.e_id
AND a1.flag_two = 1
AND a1.date >= '2013-03-29'
JOIN cat_table as cat
ON e1.cat_id = cat.id
where
e1.flag_three = 1
and e1.flag_four = 1
and ( e1.special_flag = 1
OR a1.id IS NOT NULL )
order by
IF( a1.id is null, 2, 1 ),
ADate,
ATime,
e1.ID Desc
limit
0, 10
The main WHERE clause keeps only rows that have the "three and four" flags set to 1, plus either the special flag set or a valid "a" record on/after the given date in question.
From that, it's a simple ORDER BY and LIMIT.
As for getting the date and time, it appears you only want records on/after the date to be included, and otherwise to ignore them (they are old and not applicable, so you don't want to see them).
In the ORDER BY, I test first for a NULL "a" ID. If it is NULL, we know the date will have been forced to '9999-99-99' and the time to '99-99-99', and we want those rows pushed to the bottom (hence 2); otherwise there IS an "a" record and you want those first (hence 1). Then sort by the date/time respectively, and then by ID descending in case there are many within the same date/time.
Finally, to help on the indexes, I would ensure your "e" table has an index on
( id, flag_three, flag_four, special_flag ).
For the "a" table, index on
(e_id, flag_two, date)
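In DDL form those suggestions would be something like the following (the index names are just placeholders):

ALTER TABLE e_table ADD INDEX idx_e_flags (id, flag_three, flag_four, special_flag);
ALTER TABLE a_table ADD INDEX idx_a_eid_flag_date (e_id, flag_two, date);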
I'm developing a scoreboard of sorts. The table structure is ID, UID, points, with UID being linked to a user's account.
Now, I have this working somewhat, but I need one specific thing for this query to be pretty much perfect: picking a user based on rank.
I'll show you the SQL.
SELECT *, @rownum := @rownum + 1 AS `rank` FROM
(SELECT * FROM `points_table` `p`
ORDER BY `p`.`points` DESC
LIMIT 1)
`user_rank`,
(SELECT @rownum := 0) `r`, `accounts_table` `a`, `points_table` `p`
WHERE `a`.`ID` = `p`.`UID`
It's simple to have it pick people out by UID, but that's no good. I need this to pull the user by their rank (which is a, um, fake field ^_^' created on the fly). This is a bit too complex for me, as my SQL knowledge only stretches to simple queries; I have never delved into aliases or nested queries, so you'll have to explain fairly simply so I can get a grasp.
I think there are two problems here. From what I can gather, you want to do a join on two tables, order them by points and then return the nth record.
I've put together an UNTESTED query. The inner query does a join on the two tables and the outer query specifies that only a specific row is returned.
This example returns the 4th row.
SELECT * FROM
(SELECT *, @rownum := @rownum + 1 AS rank
FROM `points_table` `p`
JOIN `accounts_table` `a` ON a.ID = p.UID,
(SELECT @rownum := 0) r
ORDER BY `p`.`points` DESC) mytable
WHERE rank = 4
Hopefully this works for you!
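As a side note, if all you ever need is the single user at a given rank (and not the rank number itself in the result), a plain OFFSET can do the same job without the row-number variable; this untested version should also return the 4th-ranked row:

SELECT `p`.*, `a`.*
FROM `points_table` `p`
JOIN `accounts_table` `a` ON a.ID = p.UID
ORDER BY `p`.`points` DESC
LIMIT 3, 1;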
I've made a change to the answer which should hopefully resolve that problem. Incidentally, whether you use PHP or MySQL to get the rank, you are still putting a heavy strain on resources. Before MySQL can calculate the rank it must build a table of every user and then order them, so you are just moving the work from one area to another. As the number of users increases, so too will the query execution time, regardless of your solution. MySQL will probably take slightly longer to perform the calculations, which is why PHP is probably the more suitable place for them. But I also know from experience that sometimes extraneous details prevent you from having a completely elegant solution. Hope the altered code works.
The following query gets the info that I need. However, I noticed that as the tables grow, my code gets slower and slower. I'm guessing it is this query. Can this be written a different way to make it more efficient? I've heard a lot about using joins instead of subqueries; however, I don't "get" how to do it.
SELECT * FROM
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T
LEFT OUTER JOIN subtransactions AS S
ON T.id = S.transid
GROUP BY T.id
ORDER BY T.position) AS result2
WHERE MAXid = id
Your code:
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T [<--- here ]
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T [<--- here ]
Notice the [<---- here ] marks I added to your code.
The first T is not in any way related to the second T. They have the same correlation alias, they refer to the same table, but they're entirely independent selects and results.
So what you're doing in the first, uncorrelated, subquery is getting the max id for all positions in transactions.
And then you're joining all transaction.position.max(id)s to result2 (which happens to be a join of all transaction.positions to subtransactions). (The internal ORDER BY is pointless and costly too, but that's not the main problem.)
You're joining every transaction.position.max(id) to every row that result2 selects, so you're getting a Cartesian join: every result1 joined to every result2, unconditionally (nothing tells the database, for example, that they ought to be joined by (max) id or by position). So if you have ten unique position.max(id)s in transactions, you're getting 100 rows; 1000 unique positions, a million rows; etc.
On edit, after getting home: OK, you're not Cartesianing; the "WHERE MAXid = id" does join result1 to result2. But you're still rolling up all the rows of transactions in both queries.
When you want to write a complicated query like this, it's a lot easier if you compose it out of simpler views. In particular, you can test each view on its own to make sure you're getting reasonable results, and then just join the views.
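For instance (untested, and the view name is invented), the per-position maximum could live in its own view and then be joined back to the detail table:

CREATE VIEW latest_transactions AS
SELECT T.position, MAX(T.id) AS MAXid
FROM transactions AS T
GROUP BY T.position;

SELECT T.*
FROM latest_transactions AS lt
JOIN transactions AS T ON T.id = lt.MAXid;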
I would split the query into smaller chunks, probably using a stored proc. For example, get the max ids from transactions and put them in a table variable, then join that with subtransactions. This will make it easier for you and the compiler to work out what is going on.
Also, without knowing what indexes are on your tables, it is hard to offer more advice.
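MySQL doesn't have table variables as such, but a temporary table inside the procedure gives the same effect; a rough, untested sketch:

CREATE TEMPORARY TABLE max_ids AS
SELECT MAX(T.id) AS MAXid
FROM transactions AS T
GROUP BY T.position;

-- then join the small temp table back to the detail tables
SELECT T.*, IFNULL(SUM(S.shares), 0) AS subtrans_shares
FROM max_ids AS m
JOIN transactions AS T ON T.id = m.MAXid
LEFT JOIN subtransactions AS S ON S.transid = T.id
GROUP BY T.id;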
Put a benchmark function in the code, then time each section of the code to determine where the slowdown is happening. Often the slowdown happens in a different query than you first guess. Determine the correct query that needs to be optimized before posting to Stack Overflow.
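If you'd rather time it on the MySQL side, the session profiler gives a per-query and per-stage breakdown (available in the 5.x series; a quick, untested sketch):

SET profiling = 1;
-- run the suspect query (or queries) here
SHOW PROFILES;               -- lists the recent queries with their durations
SHOW PROFILE FOR QUERY 1;    -- stage-by-stage timing for query #1 in that list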