MySQL ranking in presence of indexes using variables - mysql
I'm using the classic trick of @N := @N + 1 to get the rank of items on some ordered column. Before ordering, I need to filter out some values from the base table by inner joining it with another table. So the query looks like this:
SET @N = 0;
SELECT
@N := @N + 1 AS rank,
fa.id,
fa.val
FROM
table1 AS fa
INNER JOIN table2 AS em
ON em.id = fa.id
AND em.type = 'A'
ORDER BY fa.val;
The issue: if I don't have an index on em.type, everything works fine, but if I put an index on em.type then hell breaks loose, and the rank values, instead of following the ordering on the val column, come out in the order the rows are stored in the em table.
Here are sample outputs.
Without the index:
rank id val
1 05F8C7 55050.000000
2 05HJDG 51404.733458
3 05TK1Z 46972.008208
4 05F2TR 46900.000000
5 05F349 44433.412847
6 06C2BT 43750.000000
7 0012X3 42000.000000
8 05MMPK 39430.399658
9 05MLW5 39054.046383
10 062D20 35550.000000
With the index:
rank id val
480 05F8C7 55050.000000
629 05HJDG 51404.733458
1603 05TK1Z 46972.008208
466 05F2TR 46900.000000
467 05F349 44433.412847
3534 06C2BT 43750.000000
15 0012X3 42000.000000
1109 05MMPK 39430.399658
1087 05MLW5 39054.046383
2544 062D20 35550.000000
I believe the use of indexes should be completely transparent, and output should not be affected by it. Is this a bug in MySQL?
This "trick" was a bomb waiting to explode. A clever optimizer will evaluate a query as it sees fits, optimizing for speed - that's why it's called optimizer. I don't think this use of MySQL variables was documented to work as you expect it to work, but it was working.
It was working, that is, up until recent improvements in the MariaDB optimizer. It will probably break in mainstream MySQL as well, since there are several optimizer improvements in the (yet to be released, still in beta) 5.6 version.
What you can do (until MySQL implements window functions) is use a self-join and a grouping. Results will be consistent, no matter what future improvements are made in the optimizer. The downside is that it may not be very efficient:
SELECT
COUNT(*) AS rank,
fa.id,
fa.val
FROM
table1 AS fa
INNER JOIN table2 AS em
ON em.id = fa.id
AND em.type = 'A'
INNER JOIN
table1 AS fa2
INNER JOIN table2 AS em2
ON em2.id = fa2.id
AND em2.type = 'A'
ON fa2.id <= fa.id
-- assuming that `id` is the Primary Key of the table
GROUP BY fa.id
ORDER BY fa.val ;
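On MySQL 8.0+ or MariaDB 10.2+, which do implement window functions, the same ranking can be written directly. A minimal sketch, assuming the table and column names from the question (note that rank is a reserved word in MySQL 8.0.2+, hence the backticks):
SELECT
ROW_NUMBER() OVER (ORDER BY fa.val) AS `rank`,
fa.id,
fa.val
FROM
table1 AS fa
INNER JOIN table2 AS em
ON em.id = fa.id
AND em.type = 'A'
ORDER BY fa.val;
Here the ranking is defined by the OVER clause itself, not by row evaluation order, so no index or optimizer change can reorder it.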
Related
SQL to club records in sequence
I have data in a MySQL table that looks like this:
Key value
A   1
A   2
A   3
A   6
A   7
A   8
A   9
B   1
B   2
and I want to group it based on continuous sequences. The data is sorted in the table. The desired result:
Key min max
A   1   3
A   6   9
B   1   2
I tried googling it but couldn't find a solution. Can someone please help me with this?
This is way easier with a modern DBMS that supports window functions, but you can find the upper bounds by checking that there is no successor. In the same way, you can find the lower bounds via the absence of a predecessor. By combining the lowest upper bound for each lower bound we get the intervals.
select low.keyx, low.valx, min(high.valx)
from (select t1.keyx, t1.valx
      from t t1
      where not exists (select 1 from t t2
                        where t1.keyx = t2.keyx
                          and t1.valx = t2.valx + 1)) as low
join (select t3.keyx, t3.valx
      from t t3
      where not exists (select 1 from t t4
                        where t3.keyx = t4.keyx
                          and t3.valx = t4.valx - 1)) as high
  on low.keyx = high.keyx
 and low.valx <= high.valx
group by low.keyx, low.valx;
I changed your identifiers since value is a reserved word. Using a window function is way more compact and efficient. If at all possible, consider upgrading to MySQL 8+; it is superior to 5.7 in so many aspects. We can create a group by looking at the difference between valx and an enumeration of the vals; if there is a gap, the difference increases. Then we simply pick min and max for each group:
select keyx, min(valx), max(valx)
from (select keyx, valx,
             valx - row_number() over (partition by keyx order by valx) as grp
      from t) as tt
group by keyx, grp;
Optimizing Parameterized MySQL Queries
I have a query with a number of parameters which, if I run it from MySQLWorkbench, takes around a second. If I take this query, get rid of the parameters, and instead substitute the values into the query, it takes about 22 seconds to run, the same as if I convert the query to a parameterized stored procedure and run that (it then also takes about 22 seconds).
I've enabled profiling on MySQL and I can see a few things there. For example, it shows the number of rows examined, and there's an order-of-magnitude difference (20,000 vs 400,000), which I assume is the reason for the 20x increase in processing time. The other difference in the profile is that the parameterized query sent from MySQLWorkbench still has the parameters in it (e.g. where limit < @lim) while in the sproc the values have been substituted (where limit < 300).
I've tried this a number of different ways. I'm using JetBrains's DataGrip (as well as MySQLWorkbench), and that works like MySQLWorkbench (sends through the @ parameters). I've tried executing the queries and the sproc from MySQLWorkbench, DataGrip, Java (JDBC) and .NET. I've also tried prepared statements in Java, but I can't get anywhere near the performance of sending the 'raw' SQL to MySQL.
I feel like I'm missing something obvious here but I don't know what it is. The query is relatively complex: it has a CTE, a couple of sub-selects and a couple of joins, but as I said it runs quickly straight from MySQL. My main question is why the query is 20x faster in one format than another. Does the way the query is sent to MySQL have anything to do with this (the '@' values sent through), and can I replicate this in a stored procedure?
Updated 1st Jan
Thanks for the comments. I didn't post the query originally as I'm more interested in the general concepts around the use of variables/parameters and how I could take advantage of that (or not). Here is the original query:
with tmp_bat as (
    select bd.MatchId, bd.matchtype, bd.playerid, bd.teamid, bd.opponentsid,
           bd.inningsnumber, bd.dismissal, bd.dismissaltype, bd.bowlerid,
           bd.fielderid, bd.score, bd.position, bd.notout, bd.balls, bd.minutes,
           bd.fours, bd.sixes, bd.hundred, bd.fifty, bd.duck, bd.captain,
           bd.wicketkeeper,
           m.hometeamid, m.awayteamid, m.matchdesignator, m.matchtitle,
           m.location, m.tossteamid, m.resultstring, m.whowonid, m.howmuch,
           m.victorytype, m.duration, m.ballsperover, m.daynight, m.LocationId
    from (select *
          from battingdetails
          where matchid in (select id
                            from matches
                            where id in (select matchid from battingdetails)
                              and matchtype = @match_type)) as bd
         join matches m on m.id = bd.matchid
         join extramatchdetails emd1 on emd1.MatchId = m.Id and emd1.TeamId = bd.TeamId
         join extramatchdetails emd2 on emd2.MatchId = m.Id and emd2.TeamId = bd.TeamId
)
select players.fullname name,
       teams.teams team,
       '' opponents,
       players.sortnamepart,
       innings.matches,
       innings.innings,
       innings.notouts,
       innings.runs,
       HS.score highestscore,
       HS.NotOut,
       CAST(TRUNCATE(innings.runs / (CAST((Innings.Innings - innings.notOuts) AS DECIMAL)), 2) AS DECIMAL(7, 2)) 'Avg',
       innings.hundreds,
       innings.fifties,
       innings.ducks,
       innings.fours,
       innings.sixes,
       innings.balls,
       CONCAT(grounds.CountryName, ' - ', grounds.KnownAs) Ground,
       '' Year,
       '' CountryName
from (select count(case when inningsnumber = 1 then 1 end) matches,
             count(case when dismissaltype != 11 and dismissaltype != 14 then 1 end) innings,
             LocationId,
             playerid,
             MatchType,
             SUM(score) runs,
             SUM(notout) notouts,
             SUM(hundred) Hundreds,
             SUM(fifty) Fifties,
             SUM(duck) Ducks,
             SUM(fours) Fours,
             SUM(sixes) Sixes,
             SUM(balls) Balls
      from tmp_bat
      group by MatchType, playerid, LocationId) as innings
     JOIN players ON players.id = innings.playerid
     join grounds on Grounds.GroundId = LocationId and grounds.MatchType = innings.MatchType
     join (select pt.playerid, t.matchtype, GROUP_CONCAT(t.name SEPARATOR ', ') as teams
           from playersteams pt
                join teams t on pt.teamid = t.id
           group by pt.playerid, t.matchtype) as teams
          on teams.playerid = innings.playerid and teams.matchtype = innings.MatchType
     JOIN (SELECT playerid, LocationId, MAX(Score) Score, MAX(NotOut) NotOut
           FROM (SELECT battingdetails.playerid,
                        battingdetails.score,
                        battingdetails.notout,
                        battingdetails.LocationId
                 FROM tmp_bat as battingdetails
                      JOIN (SELECT battingdetails.playerid,
                                   battingdetails.LocationId,
                                   MAX(battingdetails.Score) AS score
                            FROM tmp_bat as battingdetails
                            GROUP BY battingdetails.playerid,
                                     battingdetails.LocationId,
                                     battingdetails.playerid) AS maxscore
                           ON battingdetails.score = maxscore.score
                          AND battingdetails.playerid = maxscore.playerid
                          AND battingdetails.LocationId = maxscore.LocationId) AS internal
           GROUP BY internal.playerid, internal.LocationId) AS HS
          ON HS.playerid = innings.playerid and hs.LocationId = innings.LocationId
where innings.runs >= @runs_limit
order by runs desc, KnownAs, SortNamePart
limit 0, 300;
Wherever you see '@match_type', I substitute a value ('t'). This query takes ~1.1 secs to run. The query with the hard-coded values rather than the variables takes ~3.5 secs (see the other note below). The EXPLAIN for this query gives this:
1,PRIMARY,<derived7>,,ALL,,,,,219291,100,Using temporary; Using filesort
1,PRIMARY,players,,eq_ref,PRIMARY,PRIMARY,4,teams.playerid,1,100,
1,PRIMARY,<derived2>,,ref,<auto_key3>,<auto_key3>,26,"teams.playerid,teams.matchtype",11,100,Using where
1,PRIMARY,grounds,,ref,GroundId,GroundId,4,innings.LocationId,1,10,Using where
1,PRIMARY,<derived8>,,ref,<auto_key0>,<auto_key0>,8,"teams.playerid,innings.LocationId",169,100,
8,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
8,DERIVED,<derived14>,,ref,<auto_key0>,<auto_key0>,13,"battingdetails.PlayerId,battingdetails.LocationId,battingdetails.Score",10,100,Using index
14,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
7,DERIVED,t,,ALL,PRIMARY,,,,3323,100,Using temporary; Using filesort
7,DERIVED,pt,,ref,TeamId,TeamId,4,t.Id,65,100,
2,DERIVED,<derived3>,,ALL,,,,,349893,100,Using temporary
3,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where
3,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
3,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
3,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
3,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
3,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
and the EXPLAIN for the query with the hardcoded values looks like this:
1,PRIMARY,<derived8>,,ALL,,,,,20097,100,Using temporary; Using filesort
1,PRIMARY,players,,eq_ref,PRIMARY,PRIMARY,4,HS.PlayerId,1,100,
1,PRIMARY,grounds,,ref,GroundId,GroundId,4,HS.LocationId,1,100,Using where
1,PRIMARY,<derived2>,,ref,<auto_key0>,<auto_key0>,30,"HS.LocationId,HS.PlayerId,grounds.MatchType",17,100,Using where
1,PRIMARY,<derived7>,,ref,<auto_key0>,<auto_key0>,46,"HS.PlayerId,innings.MatchType",10,100,Using where
8,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
8,DERIVED,m,,eq_ref,"PRIMARY,LocationId",PRIMARY,4,matches.Id,1,100,
8,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
8,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
8,DERIVED,<derived14>,,ref,<auto_key2>,<auto_key2>,4,m.LocationId,17,100,
8,DERIVED,battingdetails,,ref,"PlayerId,TeamId,Score,MatchId,match_team",MatchId,8,"matches.Id,maxscore.PlayerId",1,3.56,Using where
8,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
14,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
14,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
14,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
14,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
14,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
14,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
7,DERIVED,t,,ALL,PRIMARY,,,,3323,100,Using temporary; Using filesort
7,DERIVED,pt,,ref,TeamId,TeamId,4,t.Id,65,100,
2,DERIVED,matches,,ALL,PRIMARY,,,,114162,10,Using where; Using temporary
2,DERIVED,m,,eq_ref,PRIMARY,PRIMARY,4,matches.Id,1,100,
2,DERIVED,emd1,,ref,"PRIMARY,TeamId",PRIMARY,4,matches.Id,1,100,Using index
2,DERIVED,emd2,,eq_ref,"PRIMARY,TeamId",PRIMARY,8,"matches.Id,emd1.TeamId",1,100,Using index
2,DERIVED,battingdetails,,ref,"TeamId,MatchId,match_team",match_team,8,"emd1.TeamId,matches.Id",15,100,
2,DERIVED,battingdetails,,ref,MatchId,MatchId,4,matches.Id,31,100,Using index; FirstMatch(battingdetails)
Pointers as to ways to improve my SQL are always welcome (I'm definitely not a database person), but I'd still like to understand whether I can use the SQL with the variables from code, and why that improves the performance by so much.
Update 2, 1st Jan
AAArrrggghhh. My machine rebooted overnight and now the queries are generally running much quicker. It's still 1 sec vs 3 secs, but the 20x slowdown does seem to have disappeared.
In your WITH construct, aren't you overthinking your select in (select in (select in ...)), overstating what could just be simplified to the with innings I have in my solution?
Also, you were joining to the extraMatchDetails table TWICE, joined on the same conditions on match and team, but you never utilized either of those tables in the WITH CTE, rendering that component useless, doesn't it? However, the MATCH table has homeTeamID and AwayTeamID, which is what I THINK your actual intent was. Also, your WITH CTE is pulling many columns not needed or used in the subsequent return, such as Captain and WicketKeeper.
So, I have restructured... pre-query the batting details once up front and summarized; then you should be able to join off that. Hopefully this MIGHT be a better fit, function and performance for your needs.
with innings as (
    select bd.matchId,
           bd.matchtype,
           bd.playerid,
           m.locationId,
           count(case when bd.inningsnumber = 1 then 1 end) matches,
           count(case when bd.dismissaltype in (11, 14) then null else 1 end) innings,
           SUM(bd.score) runs,
           SUM(bd.notout) notouts,
           SUM(bd.hundred) Hundreds,
           SUM(bd.fifty) Fifties,
           SUM(bd.duck) Ducks,
           SUM(bd.fours) Fours,
           SUM(bd.sixes) Sixes,
           SUM(bd.balls) Balls
    from battingDetails bd
         join Match m on bd.MatchID = m.MatchID
    where matchtype = @match_type
    group by bd.matchId, bd.matchType, bd.playerid, m.locationId
)
select p.fullname playerFullName,
       p.sortnamepart,
       CONCAT(g.CountryName, ' - ', g.KnownAs) Ground,
       t.team,
       i.matches,
       i.innings,
       i.runs,
       i.notouts,
       i.hundreds,
       i.fifties,
       i.ducks,
       i.fours,
       i.sixes,
       i.balls,
       CAST(TRUNCATE(i.runs / (CAST((i.Innings - i.notOuts) AS DECIMAL)), 2) AS DECIMAL(7, 2)) 'Avg',
       hs.maxScore,
       hs.maxNotOut,
       '' opponents,
       '' Year,
       '' CountryName
from innings i
     JOIN players p ON i.playerid = p.id
     join grounds g on i.locationId = g.GroundId and i.matchType = g.matchType
     join (select pt.playerid, t.matchtype, GROUP_CONCAT(t.name SEPARATOR ', ') team
           from playersteams pt
                join teams t on pt.teamid = t.id
           group by pt.playerid, t.matchtype) as t
          on i.playerid = t.playerid and i.MatchType = t.matchtype
     join (select i2.playerid,
                  i2.locationid,
                  max(i2.runs) maxScore,
                  max(i2.notouts) maxNotOut
           from innings i2
           group by i2.playerid, i2.LocationId) HS
          on i.playerid = HS.playerid AND i.locationid = HS.locationid
where i.runs >= @runs_limit
order by i.runs desc, g.KnownAs, p.SortNamePart
limit 0, 300;
Now, I know that you stated that after the server reboot performance is better, but really, what you DO have appears to be a really overbloated query.
Not sure this is the correct answer, but I thought I'd post it in case other people have the same issue. The issue seems to be the use of CTEs in a stored procedure. I have a query that creates a CTE and then uses that CTE 8 times. If I run this query using interpolated variables it takes about 0.8 sec; if I turn it into a stored procedure and use the stored procedure parameters, it takes about a minute (between 45 and 63 seconds) to run!
I've found a couple of ways of fixing this. One is to use multiple temporary tables (8 in this case), as MySQL cannot re-use a temp table in a query. This gets the query time right down, but it just doesn't feel like a maintainable or scalable solution. The other fix is to leave the variables in place and assign them from the stored procedure parameters; this also has no real performance issues. So my sproc looks like this:
create procedure bowling_individual_career_records_by_year_for_team_vs_opponent(IN team_id INT, IN opponents_id INT)
begin
    set @team_id = team_id;
    set @opponents_id = opponents_id;
    # use these variables in the SQL below
    ...
end
Not sure this is the best solution, but it works for me and keeps the structure of the SQL the same as it was previously.
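For completeness, calling such a procedure is then unchanged from any other sproc; a trivial usage sketch (the IDs are made up for illustration):
CALL bowling_individual_career_records_by_year_for_team_vs_opponent(12, 34);
One thing to keep in mind with this pattern: user-defined @variables are session-scoped in MySQL, so they persist after the call returns and are shared with anything else the same connection runs.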
query optimization for mysql
I have the following query, which takes about 28 seconds on my machine. I would like to optimize it and know if there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id,
       rr1.t1_value,
       rr2.t0_value
from (select r1.person_id,
             avg(r1.avg_normalized_value1) as t1_value
      from (select ma1.person_id,
                   mn1.store_name,
                   avg(mn1.normalized_value) as avg_normalized_value1
            from matrix_report1 ma1,
                 matrix_normalized_notes mn1
            where ma1.final_value = 1
              and (mn1.normalized_value != 0.2 and mn1.normalized_value != 0.0)
              and ma1.user_id = mn1.user_id
              and ma1.request_id = mn1.request_id
              and ma1.request_id = 4
            group by ma1.person_id, mn1.store_name) r1
      group by r1.person_id) rr1,
     (select r2.person_id,
             avg(r2.avg_normalized_value) as t0_value
      from (select ma.person_id,
                   mn.store_name,
                   avg(mn.normalized_value) as avg_normalized_value
            from matrix_report1 ma,
                 matrix_normalized_notes mn
            where ma.final_value = 0
              and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0)
              and ma.user_id = mn.user_id
              and ma.request_id = mn.request_id
              and ma.request_id = 4
            group by ma.person_id, mn.store_name) r2
      group by r2.person_id) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on the request_id and final_value (0 or 1). Is there a way to simplify it for optimization? And it would be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much. There are about 4,907,424 rows in the matrix_report1 table and 335,740 rows in the matrix_normalized_notes table. These tables will grow as we get more requests.
First, the others are right about learning to format your samples better. Trying to explain in plain language what you are trying to do also helps, and sample data with expected results is better still.
That said, I think the query can be significantly simplified. Your two subqueries are almost completely identical, with the exception of the one field final_value = 1 or 0 respectively. Since each query results in one record per person_id, you can just do the average based on a CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on (request_id, final_value, user_id). Your matrix_normalized_notes table should have an index on (request_id, user_id, store_name, normalized_value). Since your outer query is averaging per-store averages, you do need to keep it nested. The following should help:
SELECT r1.person_id,
       avg(r1.ANV1) as t1_value,
       avg(r1.ANV0) as t0_value
from (select ma1.person_id,
             mn1.store_name,
             avg(case when ma1.final_value = 1 then mn1.normalized_value end) as ANV1,
             avg(case when ma1.final_value = 0 then mn1.normalized_value end) as ANV0
      from matrix_report1 ma1
           JOIN matrix_normalized_notes mn1
                ON ma1.request_id = mn1.request_id
               AND ma1.user_id = mn1.user_id
               AND NOT mn1.normalized_value in (0.0, 0.2)
      where ma1.request_id = 4
        AND ma1.final_value in (0, 1)
      group by ma1.person_id, mn1.store_name) r1
group by r1.person_id
Notice the inner query is pulling all transactions with a final value of either zero OR one, but each AVG is based on a case/when of the respective final value. When the condition is NOT the 1 or 0 respectively, the result is NULL and is thus not considered when the average is computed. So at this point it is already grouped on a per-person, per-store basis with ANV1 and ANV0 set. Now, roll these values up directly per person regardless of the store. Again, NULL values are not considered as part of the average computation, so if store "A" doesn't have a value in ANV1, it does not skew the results; similarly if store "B" doesn't have a value in the ANV0 result.
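As a sketch, the two composite indexes suggested above could be created like this (the index names are illustrative, everything else comes from the answer):
CREATE INDEX ix_report_req_final_user
    ON matrix_report1 (request_id, final_value, user_id);
CREATE INDEX ix_notes_req_user_store_value
    ON matrix_normalized_notes (request_id, user_id, store_name, normalized_value);
Putting request_id first lets both lookups start from the constant request_id = 4, and the trailing columns let the join and the normalized_value filter be satisfied from the index.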
Using MAX of a field to select a record returns the MAX of the whole table, ignoring the condition
I'm writing a stored procedure for a report, and I'm trying to get only those records with the highest value of a certain field (accumulated amount). The thing is, I can't seem to find the solution to this; the only solution I've come up with is using an extra condition, but the problem is that the field changes every month (period) and not all the records are updated, yet I need to retrieve them all... (if an asset is fully depreciated there won't be any more records for that asset in the table). I'm sorry if this is confusing, I'll try my best to explain.
The report needs to have, for each registered supplier, a list of the assets they supply, their description, their current location, their price, and how much money still needs to be depreciated from each asset. So, what I'm doing is first getting the list of suppliers, then getting the list of assets associated with a location (using cursors), then trying to calculate how much money remains to be depreciated. There's a table called 'DEPRECIACIONES' that stores the asset, the period, and how much money has been depreciated from that asset for each period, for each asset that hasn't been completely depreciated.
The problem comes when I try to calculate the MAX amount of money depreciated for an asset and then select the row for that item that has that MAX amount. I'm sure I'm doing something wrong; my T-SQL and general database knowledge is not good and I'm trying to learn by myself. I've uploaded the schema, tables and the stored procedure that throws the wrong output here: http://sqlfiddle.com/#!3/78c32
The right output should be something like this:
Proveedor | Activo  | Descripcion | Ubicacion Actual | Costo Adquisicion | Saldo sin depreciar     | Periodo
Supplier  | Asset   | Description | Current Location | Cost              | Money to be depreciated | Period
-------------------------------------------------------------------------------------------------------------
Monse     | ActivoT | texthere    | 1114             | 2034.50           | RANDOM NUMBER HERE      | RandomP
Monse     | cesart  | texthere    | 4453             | 4553.50           | RANDOM NUMBER HERE      | RandomP
nowlast   | activ   | texthere    | 4453             | 1234.65           | RANDOM NUMBER HERE      | RandomP
nowlast   | augusto | texthere    | 4450             | 4553.50           | RANDOM NUMBER HERE      | RandomP
Sara      | Activo  | texthere    | 1206             | 746.65            | RANDOM NUMBER HERE      | RandomP
I'd really appreciate you telling me what I'm doing wrong (which is probably a lot) and how to fix it. Thank you in advance.
Good skills in giving complete information via SQLFiddle. I don't have a complete answer for you, but this may help.
Firstly, ditch the cursor - it's hard to debug and possibly slow. Refactor to a SELECT statement. This is my attempt, which should be logically equivalent to your code:
SELECT p.Proveedor,
       a.Activo,
       a.Descripcion,
       Ubi.Ubicacion,
       saldo_sin_depreciar = a.Costo_adquisicion - d.Monto_acumulado,
       d.Periodo
FROM PROVEEDORES p
     INNER JOIN ACTIVOS_FIJOS a ON a.Proveedor = p.Proveedor
     INNER JOIN DEPRECIACIONES d ON a.Activo = d.Activo
     INNER JOIN (SELECT MAX(d1.Monto_acumulado) AS MaxMonto
                 FROM DEPRECIACIONES d1
                      INNER JOIN DEPRECIACIONES d2 ON d1.Monto_acumulado = d2.Monto_acumulado
                ) MaxAe ON d.Monto_acumulado = MaxAe.MaxMonto
     INNER JOIN ACTIVO_UBICACION Ubi ON a.activo = ubi.activo
     INNER JOIN (SELECT activo,
                        ubicacion,
                        Fecha_Ubicacion,
                        RowNum = row_number() OVER (partition BY activo
                                                    ORDER BY abs(datediff(dd, Fecha_Ubicacion, getdate())))
                 FROM ACTIVO_UBICACION
                ) UbU ON UbU.ubicacion = Ubi.Ubicacion
WHERE -- a.Activo IS NOT NULL AND
      UbU.RowNum = 1
ORDER BY p.Proveedor
COMMENTS
I've moved the WHERE criteria that define the joins up into ON clauses in the table list; that makes it easier to see how you are joining the tables. Note that all the joins are INNER, which may not be what you want - you may need some LEFT JOINs; I don't understand the logic well enough to say. Also, in your cursor procedure the Ubi and UbU parts don't seem to explicitly join with the rest of the tables, so I've pencilled in an INNER JOIN on the activo column, as this is the way the tables join in the FK relationship. In your cursor code you would effectively get a CROSS JOIN, which is probably wrong and also expensive to run. The WHERE clause a.Activo IS NOT NULL is not required, because the INNER JOIN ensures it. Hope this helps you sort it out.
I ended up using another query for the cursor and fixed the problem. It's probably not optimal, but it works. Whenever I learn more database-related stuff I'll optimize it. Here's the new query:
DECLARE P CURSOR STATIC FOR
    SELECT a.Proveedor, actub.activo, actub.ubicacion
    FROM [SISACT].PROVEEDORES p,
         [SISACT].ACTIVOS_FIJOS a,
         (SELECT activo,
                 ubicacion,
                 Fecha_Ubicacion,
                 row_number() OVER (partition BY activo
                                    ORDER BY abs(datediff(dd, Fecha_Ubicacion, getdate()))) AS RowNum
          FROM [SISACT].ACTIVO_UBICACION) actub
    WHERE RowNum = 1
      AND a.Proveedor = p.Proveedor
      AND actub.activo = a.Activo
OPEN P
FETCH NEXT FROM P INTO @p, @a, @u
WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @activo = a.Activo,
           @descripcion = a.Descripcion,
           @costo_adquisicion = a.Costo_adquisicion,
           @saldo_depreciado = MaxAe.MaxMonto,
           @periodo = d.Periodo
    FROM [SISACT].ACTIVOS_FIJOS a,
         [SISACT].DEPRECIACIONES d,
         SISACT.PROVEEDORES pro,
         SISACT.ACTIVO_UBICACION actu,
         (SELECT MAX(d1.Monto_acumulado) AS MaxMonto
          FROM [SISACT].DEPRECIACIONES d1
               INNER JOIN [SISACT].DEPRECIACIONES d2 ON d1.Monto_acumulado = d2.Monto_acumulado
          WHERE d1.Activo = @a
            AND d2.Activo = @a) MaxAe
    WHERE a.Activo = d.Activo
      AND a.Activo = @a
      AND d.Activo = @a
      AND a.Proveedor = @p
      AND actu.Activo = @a
      AND actu.Ubicacion = @u
    SET @saldo_sin_depreciar = @costo_adquisicion - @saldo_depreciado
    FETCH NEXT FROM P INTO @p, @a, @u
END
CLOSE P
DEALLOCATE P
Need Help streamlining a SQL query to avoid redundant math operations in the WHERE and SELECT
*Hey everyone, I am working on a query and am unsure how to make it process as quickly as possible and with as little redundancy as possible. I am really hoping someone here can help me come up with a good way of doing this. Thanks in advance for the help!*
Okay, so here is what I have, as best I can explain it. I have simplified the tables and the math to just get across what I am trying to understand. Basically I have a smallish table that never changes and will always only have 50k records, like this:
Values_Table
ID        Value1  Value2
1         2       7
2         2       7.2
3         3       7.5
4         33      10
...50000  44      17.2
And a couple of tables that constantly change and are rather large, e.g. a potential of up to 5 million records:
Flags_Table
Index         Flag1  Type
1             0      0
2             0      1
3             1      0
4             1      1
...5,000,000  1      1
Users_Table
Index         Name     ASSOCIATED_ID
1             John     1
2             John     1
3             Paul     3
4             Paul     3
...5,000,000  Richard  2
I need to tie all 3 tables together. The most results that are likely to ever be returned from the small table is somewhere in the neighborhood of 100. The large tables are joined on the index, and these are then joined to the Values_Table ON Values_Table.ID = Users_Table.ASSOCIATED_ID... That part is easy enough.
Where it gets tricky for me is that I need to return, as quickly as possible, a list limited to 10 results where Value1 and Value2 are mathematically operated on to produce a new_value, where that new_value is less than 10, the result is sorted by that new_value, and any other WHERE conditions I need can be applied to the flags. I do need to be able to move along the limit, e.g. LIMIT 0,10 / 11,10 / 21,10, etc. In a subsequent (or the same, if possible) query I need to get the top 10 count of all types that matched that criteria before the limit was applied.
So, for example, I want to join all of these and return anything where Value1 + Value2 < 10, AND I also need the count. So what I want is:
Index    Name     Flag1  New_Value
1        John     0      9
2        John     0      9
5000000  Richard  1      9.2
The second response would be:
ID (not index)  Count
1               2
2               1
I tried this a few ways and ultimately came up with the following somewhat ugly query:
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
ORDER BY New_Value
LIMIT 0,10
And then for the count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
Being able to filter on the different flags and such in my WHERE clause is important; that may sound stupid to comment on, but I mention it because, from what I could see, a quicker method would have been to use the HAVING clause, but I don't believe that will work in certain instances depending on what I want to use my WHERE clause to filter against.
And when filtering using the flags table:
SELECT INDEX, NAME, Flag1, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
AND Flag1 = 0
ORDER BY New_Value
LIMIT 0,10
...and the filtered count:
SELECT ID, COUNT(TYPE) as Count, (Value1 * some_variable + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.Index = Users_Table.Index
WHERE (Value1 * some_variable + Value2) < 10
AND Flag1 = 0
GROUP BY TYPE
ORDER BY New_Value
LIMIT 0,10
That works fine but has to run the math multiple times for each row, and I get the nagging feeling that it is also running the math multiple times on the same row in the Values_Table. My thought was that I should just get only the valid responses from Values_Table first and then join those to the other tables to cut down on the processing; with how SQL optimizes things, though, I wasn't sure whether it might not already be doing that. I know I could use a HAVING clause to only run the math once if I did it that way, but I am uncertain how I would then best join things.
My questions are:
Can I avoid running that math twice and still make the query work (or, I suppose, if there is a good way to make the first one work as well, that would be great)?
What is the fastest way to do this, as this is something that will be running very often?
It seems like this should be painfully simple, but I am just missing something stupid. I contemplated pulling into a temp table and then joining that table to itself, but that seems like I would trade math for iterations against the table and still end up slow. Thank you all for your help, and please let me know if I need to clarify anything here!
** To clarify on a question: I can't use a third column with the values pre-calculated because in reality the math is much more complex than addition; I just simplified it for illustration's sake.
Do you have a benchmark query to compare against? It usually doesn't work to try to outsmart the optimizer. If you have acceptable performance from a starting query, you can see where the extra work is being expended (indicated by disk reads, cache consumption, etc.) and focus on that. Avoid the temptation to break the query into pieces and solve those individually; that's an antipattern, and it especially includes temp tables. Redundant math is usually fine - what hurts is disk activity. I've never seen a query that needed its CPU work on pure calculations reduced.
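If you want to measure rather than guess, MySQL's session counters give a quick read on disk and temp-table activity; a minimal sketch using standard status variables (nothing here is specific to this schema):
FLUSH STATUS;
-- run the candidate query here, then inspect what it cost:
SHOW SESSION STATUS LIKE 'Handler_read%';  -- rows the storage engine had to touch
SHOW SESSION STATUS LIKE 'Created_tmp%';   -- temporary tables, in memory and on disk
Comparing these counters between two variants of the query shows where the real work goes far more reliably than wall-clock time alone.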
Gather your results and put them in a temp table (MySQL syntax; note the filter has to repeat the expression, since an alias can't be referenced in WHERE):
CREATE TEMPORARY TABLE TempTable AS
SELECT Users_Table.`Index`, Name, Type, ID, Flag1,
       (Value1 + Value2) as New_Value
FROM Values_Table
JOIN Users_Table ON ASSOCIATED_ID = ID
JOIN Flags_Table ON Flags_Table.`Index` = Users_Table.`Index`
WHERE (Value1 + Value2) < 10
ORDER BY New_Value
LIMIT 0,10;
Return the result for the first query:
SELECT `Index`, Name, Flag1, New_Value
FROM TempTable;
Return the results for the count per ID:
SELECT ID, COUNT(Type) AS Count
FROM TempTable
GROUP BY ID;
Is there any chance you can add a third column to Values_Table with the pre-calculated value? Even if the result of your calculation depends on other variables, you could run the calculation for the whole table, but only when those variables change.
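A minimal sketch of that idea, assuming the table and column names from the question (the index name and the example multiplier 3 are illustrative; the UPDATE would re-run only when some_variable changes):
ALTER TABLE Values_Table ADD COLUMN New_Value DECIMAL(10,2);
CREATE INDEX ix_values_new_value ON Values_Table (New_Value);
-- refresh whenever some_variable changes, e.g. to 3:
UPDATE Values_Table SET New_Value = Value1 * 3 + Value2;
-- queries can then filter and sort on the stored value directly:
SELECT ID, New_Value FROM Values_Table WHERE New_Value < 10 ORDER BY New_Value;
Since the table only has 50k rows and rarely changes, the refresh is cheap, and the index lets the main queries resolve the New_Value < 10 filter and the ORDER BY without recomputing the math per row.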