MySQL max-value – when and how often

If I have a query like
SELECT (MAX(b.A)/a.A) AS c
FROM a
INNER JOIN b
ON a.b_id = b.id
GROUP BY a.id
Does MySQL evaluate "MAX(b.A)" for every row, or only once?
I'm just interested in whether there is room for performance improvement or not.
Thanks!
UPDATE
OK, let's move on to a real-world example: I want to calculate the proportional value of a user's likes compared to the maximum likes of any user.
The query that only reads the max value of users.likes (which is indexed) takes 0.0003s:
SELECT MAX(likes)
FROM users
So now I know the value of max-user-likes; let's say it's 10000. Then I could query like this, which takes 0.0007s:
SELECT (users.likes/10000) AS weight
FROM posts
INNER JOIN users ON posts.author_id = users.id
So one would expect the two queries combined to take something like 0.0003s + 0.0007s, but it takes 0.3s:
SELECT (users.likes/(SELECT MAX(likes) FROM users)) AS weight
FROM posts
INNER JOIN users ON posts.author_id = users.id
So something still seems wrong with my database - any suggestions?

Without a GROUP BY clause, the result would only have one row, you couldn't know from which row the value of a.A comes, and MAX(b.A) would be evaluated only once.
With a GROUP BY clause, as in your query, MAX(b.A) is evaluated once per group.

In general, the expression inside an aggregate function is evaluated and checked for NULL for each row. There could certainly be an optimization for MIN and MAX when walking an index, but I doubt it.
BTW, you can easily check this by executing MAX(id) on a large table. You will see that the execution time is about the same as for COUNT(id) (and can be much higher than COUNT(*), depending on the engine).
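For the updated real-world example, one way to make sure the max is computed only once is to join against a one-row derived table. This is just a sketch using the posts/users tables and likes column from the question; whether it actually helps depends on your MySQL version and indexes:
-- compute MAX(likes) once in a derived table, then reuse it for every row
SELECT (u.likes / m.max_likes) AS weight
FROM posts p
INNER JOIN users u ON p.author_id = u.id
CROSS JOIN (SELECT MAX(likes) AS max_likes FROM users) m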

Related

INNER JOIN condition in WHERE clause or ON clause?

I mistyped a query today, but it still worked and gave the intended result. I meant to run this query:
SELECT e.id FROM employees e JOIN users u ON u.email=e.email WHERE u.id='139840'
but I accidentally ran this query
SELECT e.id FROM employees e JOIN users u ON u.email=e.email AND u.id='139840'
(note the AND instead of WHERE in the last clause)
and both returned the correct employee id from the user id.
What is the difference between these two queries? Does the second form only join the rows of the two tables that meet the criteria, whereas the first one joins the entire tables and then filters? Is one more or less efficient than the other? Is there something else I am missing?
Thanks!
For inner joins like this, they are logically equivalent. However, you can run into situations where a condition in the join clause means something different from the same condition in the where clause.
As a simple illustration, imagine you do a left join like so:
select x.id
from x
left join y
on x.id = y.id
;
Here we're taking all the rows from x, regardless of whether there is a matching id in y. Now let's say our join condition grows - we're not just looking for matches in y based on the id but also on id_type.
select x.id
from x
left join y
on x.id = y.id
and y.id_type = 'some type'
;
Again this gives all the rows in x regardless of whether there is a matching (id, id_type) in y.
This is very different, though:
select x.id
from x
left join y
on x.id = y.id
where y.id_type = 'some type'
;
In this situation, we're picking all the rows of x and trying to match them to rows from y. For rows with no match in y, y.id_type will be null. Because of that, y.id_type = 'some type' isn't satisfied, so those rows are discarded, which effectively turns this into an inner join.
Long story short: for inner joins it doesn't matter where the conditions go but for outer joins it can.
In the case of an INNER JOIN, the two queries are semantically the same, meaning they are guaranteed to have the same results. If you were using an OUTER join, the meaning of the two queries could be very different, with different results.
Performance-wise, I would expect that these two queries would result in the same execution plan. However, the query engine might surprise you. The only way to know is to view the execution plans for the two queries.
The optimizer will treat them the same. You can do an EXPLAIN to prove it to yourself.
Therefore, write the one that is clearer.
SELECT e.id
FROM employees e JOIN users u ON u.email=e.email
WHERE u.id='139840'
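For instance, using the employees/users tables from the question, you could compare the plans MySQL picks for the two forms:
-- condition in the WHERE clause
EXPLAIN SELECT e.id FROM employees e JOIN users u ON u.email=e.email WHERE u.id='139840';
-- condition in the ON clause
EXPLAIN SELECT e.id FROM employees e JOIN users u ON u.email=e.email AND u.id='139840';
If the optimizer really does treat them the same, the two plans will be identical.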
If it were an outer join instead of inner, you'd get unintended results, but when using an inner join it makes no real difference whether you use additional join criteria instead of a WHERE clause.
Performance-wise they are most likely identical, but you can't be certain without checking.
I brought this up with my colleagues on our team at work. This response is a bit SQL Server centered rather than MySQL; however, the optimizers should operate similarly.
Some thoughts:
Essentially, if you have to add a WHERE clause, additional scans are done to verify equality for each condition (this grows with each AND and with the size of the dataset; with an OR, the decision is made at the first true condition). If you have a single id to look up, as in the example given, it is extremely quick; conversely, if you have to find all of the records that belong to a company or department, it becomes murkier because you may have many matching records. Applying the equality condition as part of the join is far more effective when working with an AuditLog or EventLog table that has zillions of rows; you would not really see large benefits on small tables (around 200,000 rows or so).
From: Allesandro Alpi
http://suxstellino.wordpress.com/2013/01/07/sql-server-logical-query-processing-summary/
From: Itzik Ben-Gan
http://tsql.solidq.com/books/insidetsql2008/Logical%20Query%20Processing%20Poster.pdf

SQL: IN clause with subquery taking too long compared to IN clause with actual data

I have 3 SQL queries as given:
select student_id from user where user_id = 4;  -- returns 35
select * from student where student_id in (35);
select * from student where student_id in (select student_id from user where user_id =4);
The first two queries each take less than 0.5 seconds, but the third, which is the second with the first embedded as a subquery, takes around 8 seconds.
I have indexed the tables according to my needs, but the time is not going down.
Can someone please give me a solution or explain this behaviour?
Thanks!
Actually, MySQL executes the inner query last; it scans the indexes first. MySQL rewrites the subquery so that the inner query becomes fully dependent on the outer one.
For example, it runs select * from student (which, depending on your data, can return many rows) and then applies the inner condition user_id = 4 to that result.
The dev team has been working on this problem, and it should be "solved" in 6.0: http://dev.mysql.com/doc/refman/5.5/en/optimizing-subqueries.html
EDIT:
In your case, you should use a JOIN method.
Not with a subquery, but why don't you use a join here?
select s.*
from student s
inner join user u
    on s.student_id = u.student_id
where u.user_id = 4;
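Another common rewrite, assuming the same student/user tables and columns as in the question, is a semi-join via EXISTS, which also avoids duplicate rows if user could match a student more than once:
-- return each student that has a matching user row with user_id = 4
select s.*
from student s
where exists (
    select 1
    from user u
    where u.student_id = s.student_id
      and u.user_id = 4
);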

MySQL Joins, Group By, and Ordering the Group By Choice

Is it possible to order the rows that GROUP BY chooses in a MySQL query without using a subquery? I'm finding that, with my large dataset, the subquery adds a significant amount of load time to my query.
Here is a similar situation: how to sort order of LEFT JOIN in SQL query?
This is my code that works, but it takes way too long to load:
SELECT tags.contact_id, n.last
FROM tags
LEFT JOIN ( SELECT * FROM names ORDER BY timestamp DESC ) n
ON (n.contact_id=tags.contact_id)
WHERE tags.tag='$tag'
GROUP BY tags.contact_id
ORDER BY n.last ASC;
I can get a fast result doing a simple join with the table name directly, but then GROUP BY gives me the first row of the joined table, not the last row.
I'm not really sure what you're trying to do. Here are some of the problems with your query:
selecting n.last even though it is neither in the GROUP BY clause nor an aggregate value; MySQL allows this, but it's really not a good idea to rely on it
needlessly sorting a table before joining, instead of just joining
the subquery isn't really doing anything
I would suggest carefully writing down the desired query results, i.e. "I want the contact id and latest date for each tag" or something similar. It's possible that will lead to a natural, easy-to-write and semantically correct query that is also more efficient than what you showed in the OP.
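For example, if the goal is "the most recent names row for each contact with this tag", one common pattern is to join against a per-contact MAX(timestamp). This is only a sketch, assuming the tags/names tables and timestamp column from the question:
SELECT t.contact_id, n.last
FROM tags t
JOIN names n
    ON n.contact_id = t.contact_id
JOIN (SELECT contact_id, MAX(timestamp) AS latest
      FROM names
      GROUP BY contact_id) latest_names
    ON latest_names.contact_id = n.contact_id
   AND latest_names.latest = n.timestamp
WHERE t.tag = '$tag'
ORDER BY n.last ASC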
To answer the question "is it possible to order a GROUP BY query": yes, it's quite easy, and here's an example:
select a, b, sum(c) as `c sum`
from <table_name>
group by a,b
order by `c sum`
You are doing a LEFT JOIN on contact ID, which implies you want all tag contacts regardless of whether a match is found in the names table. Is that really the case, or will the tags table always have a matching names contact ID record? Additionally, your column n.last: is this the person's last name, or the last time something was done (which I would assume is actually the timestamp)?
So, that being said, I would just do a simple direct join:
SELECT DISTINCT
t.contact_id,
n.last
FROM
tags t
JOIN names n
ON t.contact_id = n.contact_id
WHERE
t.tag = '$tag'
ORDER BY
n.last ASC

How do I join one table onto another where userid = userid but only for that date?

I'm looking to take the total time a user worked on each batch at his workstation, the total estimated work that was completed, the amount the user was paid, and how many failures the user has had for each day this year. If I can join all of this into one query then I can use it in excel and format things nicely in pivot tables and such.
EDIT: I've realized that it is only possible to do this with multiple queries, so I have narrowed my scope down to this:
SELECT batch_log.userid,
batches.operation_id,
SUM(TIME_TO_SEC(ramses.batch_log.time_elapsed)),
SUM(ramses.tasks.estimated_nonrecurring + ramses.tasks.estimated_recurring),
DATE(start_time)
FROM batch_log
JOIN batches ON batch_log.batch_id=batches.id
JOIN ramses.tasks ON ramses.batch_log.batch_id=ramses.tasks.batch_id
JOIN protocase.tblusers on ramses.batch_log.userid = protocase.tblusers.userid
WHERE DATE(ramses.batch_log.start_time) > "2011-01-01"
AND protocase.tblusers.active = 1
GROUP BY userid, batches.operation_id, start_time
ORDER BY start_time, userid ASC
The cross join was causing the problem.
No, in general a Having clause is used to filter the results of your Group by - for example, only reporting those who were paid for more than 24 hours in a day (HAVING SUM(ramses.timesheet_detail.paidTime) > 24). Unless you need to perform filtering of aggregate results, you shouldn't need a having clause at all.
Most of those conditions should be moved into a where clause, or as part of the joins, for two reasons - 1) Filtering should in general be done as soon as possible, to limit the work the query needs to perform. 2) If the filtering is already done, restating it may cause the query to perform additional, unneeded work.
From what I've seen so far, it appears that you're trying to roll things up by day - try changing the last column in the GROUP BY clause to DATE(ramses.batch_log.start_time); otherwise you're grouping by (what I assume is) a full timestamp.
EDIT:
About schema names - yes, you can name them in the from and join sections. Often, too, the query may be able to resolve the needed schemas based on some default search list (how or if this is set up depends on your database).
Here is how I would have reformatted the query:
SELECT tblusers.userid, operations.name AS name,
SUM(TIME_TO_SEC(batch_log.time_elapsed)) AS time_elapsed,
SUM(tasks.estimated_nonrecurring + tasks.estimated_recurring) AS total_estimated,
SUM(timesheet_detail.paidTime) as hours_paid,
DATE(start_time) as date_paid
FROM tblusers
JOIN batch_log
ON tblusers.userid = batch_log.userid
AND DATE(batch_log.start_time) >= "2011-01-01"
JOIN batches
ON batch_log.batch_id = batches.id
JOIN operations
ON operations.id = batches.operation_id
JOIN tasks
ON batches.id = tasks.batch_id
JOIN timesheet_detail
ON tblusers.userid = timesheet_detail.userid
AND batch_log.start_time = timesheet_detail.for_day
AND DATE(timesheet_detail.for_day) = DATE(start_time)
WHERE tblusers.departmentid = 8
GROUP BY tblusers.userid, name, DATE(batch_log.start_time)
ORDER BY date_paid ASC
Of particular concern is the batch_log.start_time = timesheet_detail.for_day line, which is comparing (what are implied to be) timestamps. Are these really equal? I expect that one or both of these should be wrapped in a date() function.
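For example (just a sketch, assuming both columns really hold datetimes), that join might need to become:
JOIN timesheet_detail
    ON tblusers.userid = timesheet_detail.userid
   AND DATE(batch_log.start_time) = DATE(timesheet_detail.for_day)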
As for why you may be getting unexpected data - you appear to have eliminated some of your join conditions. Without knowing the exact setup and use of your database, I cannot give the exact reason for your results (or even say whether they are wrong), but I think the fact that you join to the operations table without any join condition is probably to blame - if there are 2 records in that table, it will double all of your previous results, and it looks like there may be 12. You also removed operations.name from the GROUP BY clause, which may or may not give you the results you want. I would look into the rest of your table relationships and see if there are any further restrictions that need to be made.

How do I optimize this query?

The following query gets the info that I need. However, I noticed that as the tables grow, my code gets slower and slower, and I'm guessing it is this query. Can this be written a different way to make it more efficient? I've heard a lot about using joins instead of subqueries; however, I don't "get" how to do it.
SELECT * FROM
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T
LEFT OUTER JOIN subtransactions AS S
ON T.id = S.transid
GROUP BY T.id
ORDER BY T.position) AS result2
WHERE MAXid = id
Your code:
(SELECT MAX(T.id) AS MAXid
FROM transactions AS T [<--- here ]
GROUP BY T.position
ORDER BY T.position) AS result1,
(SELECT T.id AS id, T.symbol, T.t_type, T.degree, T.position, T.shares, T.price, T.completed, T.t_date,
DATEDIFF(CURRENT_DATE, T.t_date) AS days_past,
IFNULL(SUM(S.shares), 0) AS subtrans_shares,
T.shares - IFNULL(SUM(S.shares),0) AS due_shares,
(SELECT IFNULL(SUM(IF(SO.t_type = 'sell', -SO.shares, SO.shares )), 0)
FROM subtransactions AS SO WHERE SO.symbol = T.symbol) AS owned_shares
FROM transactions AS T [<--- here ]
Notice the [<---- here ] marks I added to your code.
The first T is not in any way related to the second T. They have the same correlation alias, they refer to the same table, but they're entirely independent selects and results.
So what you're doing in the first, uncorrelated, subquery is getting the max id for all positions in transactions.
And then you're joining all transaction.position.max(id)s to result2 (which happens to be a join of all transaction positions to subtransactions). (And the internal ORDER BY is pointless and costly, too, but that's not the main problem.)
You're joining every transaction.position.max(id) to every (whatever result 2 selects).
So you're getting a Cartesian join - every result1 joined to every result2, unconditionally (nothing tells the database, for example, that they ought to be joined by (max) id or by position).
So if you have ten unique position.max(id)s in transactions, you're getting 100 rows; 1000 unique positions, a million rows; etc.
On Edit, after getting home: OK, you're not getting a Cartesian product after all - the "WHERE MAXid = id" does join result1 to result2. But you're still rolling up all rows of transactions in both subqueries.
When you want to write a complicated query like this, it's a lot easier if you compose it out of simpler views. In particular, you can test each view on its own to make sure you're getting reasonable results, and then just join the views.
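For instance, here is a minimal sketch of that approach using the transactions/subtransactions tables and columns from the question (the owned_shares subquery is left out for brevity, and this is untested, so treat it as a starting point rather than a drop-in replacement):
-- a small view that can be tested on its own: the latest transaction id per position
CREATE VIEW latest_transaction AS
SELECT position, MAX(id) AS max_id
FROM transactions
GROUP BY position;

-- then join the view back to the detail tables instead of pairing two unrelated derived tables
SELECT t.id, t.symbol, t.position, t.shares,
       IFNULL(SUM(s.shares), 0) AS subtrans_shares,
       t.shares - IFNULL(SUM(s.shares), 0) AS due_shares
FROM latest_transaction lt
JOIN transactions t ON t.id = lt.max_id
LEFT OUTER JOIN subtransactions s ON s.transid = t.id
GROUP BY t.id
ORDER BY t.position;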
I would split the query into smaller chunks, probably using a stored procedure. For example, get the max ids from transactions and put them in a table variable, then join that with subtransactions. This will make it easier for you and the optimizer to work out what is going on.
Also, without knowing what indexes are on your tables, it is hard to offer more advice.
Put a benchmark function in the code, then time each section to determine where the slowdown is happening. Oftentimes the slowdown happens in a different query than you first guess. Determine which query actually needs to be optimized before posting to Stack Overflow.
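For instance, on the MySQL side you can time individual statements with the session profiler (a quick sketch; the profiler is deprecated in recent versions but still works for this kind of check):
-- enable profiling for the current session, run the suspect queries, then compare timings
SET profiling = 1;
SELECT MAX(id) FROM transactions;  -- candidate query
SHOW PROFILES;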