What SQL indexes should I add for this bloated query?

I'm wondering if indexes will speed this query up. It took 9 seconds the last time I checked. The traffic table has about 300k rows, listings and users 5k rows. I'm open to ridicule/humiliation too, if this is just a crappy query altogether. I wrote it long ago.
It's supposed to get the listings with the most page views (traffic). Let me know if the explanation is lacking.
SELECT traffic_listingid AS listing_id,
COUNT(traffic_listingid) AS genuine_hits,
COUNT(DISTINCT traffic_ipaddress) AS distinct_ips,
users.username,
listings.listing_address,
listings.datetime_created,
DATEDIFF(NOW(), listings.datetime_created) AS listing_age_days
FROM traffic
LEFT JOIN listings
ON traffic.traffic_listingid = listings.listing_id
LEFT JOIN users
ON users.id = listings.seller_id
WHERE traffic_genuine = 1
AND listing_id IS NOT NULL
AND username IS NOT NULL
AND DATEDIFF(NOW(), traffic_timestamp) < 24
GROUP BY traffic_listingid
ORDER BY distinct_ips DESC
LIMIT 10
P.S.
ENGINE=MyISAM /
MySQL Server 4.3

Sidenotes:
1. You have
LEFT JOIN listings
ON traffic.traffic_listingid = listings.listing_id
...
WHERE ...
AND listing_id IS NOT NULL
This condition cancels the LEFT JOIN. Change your query into:
INNER JOIN listings
ON traffic.traffic_listingid = listings.listing_id
and remove the listing_id IS NOT NULL from the WHERE conditions.
The same thing applies to LEFT JOIN users and username IS NOT NULL.
2. The check on traffic_timestamp:
DATEDIFF(NOW(), traffic_timestamp) < 24
makes it difficult for the index to be used. Change it into something like this that can use an index
(and check that my version is equivalent, I may have mistakes):
traffic_timestamp >= CURRENT_DATE() - INTERVAL 23 DAY
3. The COUNT(non-nullable-column) is equivalent to COUNT(*). Change the:
COUNT(traffic_listingid) AS genuine_hits,
to:
COUNT(*) AS genuine_hits,
because it's a bit faster in MySQL (although I'm not sure about that for version 4.3)
For the index question, you should have at least an index on every column that is used for joining. Adding one more for the traffic_timestamp will probably help, too.
If you tell us in which tables the traffic_ipaddress and traffic_timestamp are, and what the EXPLAIN EXTENDED shows, someone may have a better idea.
Reading the query again, it seems that it's actually a GROUP BY on table traffic only, and the other 2 tables are used to fetch reference data. So the query is equivalent to a (traffic GROUP BY)-join-listings-join-users. Not sure if that helps in your old MySQL version, but it may be good to have both versions of the query and test whether one runs faster on your system.
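A sketch of that equivalent version, folding in all three sidenotes (untested, so check that it returns the same rows on your data; note the inner LIMIT runs before the joins, so a listing with no matching user row could leave you with fewer than 10 results):
SELECT t.listing_id,
t.genuine_hits,
t.distinct_ips,
users.username,
listings.listing_address,
listings.datetime_created,
DATEDIFF(NOW(), listings.datetime_created) AS listing_age_days
FROM (SELECT traffic_listingid AS listing_id,
COUNT(*) AS genuine_hits, -- sidenote 3
COUNT(DISTINCT traffic_ipaddress) AS distinct_ips
FROM traffic
WHERE traffic_genuine = 1
AND traffic_timestamp >= CURRENT_DATE() - INTERVAL 23 DAY -- sidenote 2
GROUP BY traffic_listingid
ORDER BY distinct_ips DESC
LIMIT 10) AS t
INNER JOIN listings ON listings.listing_id = t.listing_id -- sidenote 1
INNER JOIN users ON users.id = listings.seller_id
ORDER BY t.distinct_ips DESC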

Indexes should always be put on columns you use in the WHERE clause.
In this case listing_id looks like a good option, as well as users.id, seller_id and traffic_timestamp.
Use EXPLAIN EXTENDED in front of your query to see what MySQL recommends (it shows how many rows are touched and which indexes are used).
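A sketch of the corresponding DDL (index names invented; listings.listing_id and users.id are presumably already primary keys):
ALTER TABLE traffic ADD INDEX idx_traffic_listing (traffic_listingid),
ADD INDEX idx_traffic_ts (traffic_timestamp);
-- a composite index on (traffic_genuine, traffic_timestamp) may serve the WHERE clause even better
ALTER TABLE listings ADD INDEX idx_listings_seller (seller_id);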

Related

Optimizing/Alternative to slow MySQL query with multiple calculations, joins and order bys

I have a complex MySQL query which contains multiple calculations and joins to retrieve a list of contractors by location. The contractors table (users) contains 100,000+ rows and is growing. The issue I'm having is that the query takes over 1.5 seconds to execute, which causes a significant delay to the page load.
I have found that removing the ORDER BY clause increases the speed significantly (< 0.05s). After looking through other related questions on Stack Overflow, I understand why this is the case but have not yet found a viable solution.
It's also worth noting I have added indexes as suggested in other posts, but I believe you cannot optimize any further when sorting on calculated columns. (Please correct me if I'm wrong.)
Here is the query (I have removed several columns and joins for simplicity but still the query takes the same time to execute):
SELECT `users`.`id`,
`users`.`username`,
IF (Max(up.premium_expires_at) > Now(), 1, 0) AS `is_premium`,
IF (users.last_online_at >= Now() - INTERVAL 30 day, 1, 0) AS `recent_login`,
IF (da.id IS NOT NULL, 1, 0) AS `is_available`,
( 3959 * Acos(Cos(Radians(53.80592)) * Cos(Radians(lat)) * Cos(Radians(lng) - Radians(-1.53834))
+ Sin(Radians(53.80592)) * Sin(Radians(lat))) ) AS `distance`
FROM `users`
INNER JOIN `users_places` AS `up`
ON `users`.`id` = `up`.`user_id`
INNER JOIN `places` AS `mp`
ON `users`.`place_id` = `mp`.`id`
LEFT JOIN `users_dates_available` AS `da`
ON `da`.`user_id` = `users`.`id`
AND `from` <= Curdate()
AND `to` >= Curdate()
LEFT JOIN (SELECT user_id,
Sum(score) AS score
FROM users_feedback
WHERE status = 1
GROUP BY user_id) AS feedback
ON `users`.`id` = `feedback`.`user_id`
WHERE `users`.`status` = 1
AND `users`.`approved` = 1
GROUP BY `users`.`id`
HAVING `distance` < 50
ORDER BY `is_premium` DESC,
`recent_login` DESC
LIMIT 5
And here are the results of EXPLAIN (posted as a screenshot, not reproduced here).
So I guess my question is: What is the quickest way to display this data on a web page?
What I've tried:
The query is part of a Laravel application. I've tried running the query without the ORDER BY and sorting in PHP instead. However, execution times remain slow.
Running the query without the LEFT JOINs showed significant improvements to speed. However, the query must use LEFT JOINs for the calculations in the SELECT clause (we're checking for NULL values).
Using views: query speeds were the same with a pre-compiled view.
The only other option I can think of is to create a temporary table which contains all the calculated fields and query this. However this will not store the 'distance' column as this is specific to the user running the query and I will still be sorting by a calculated column.
Is there another option or another way to optimize this query that I'm missing? Thanks
The query seems not to use feedback, so remove the LEFT JOIN. That will save on some wasted effort.
Similarly, places seems to be useless, except as an existence test.
Which table are lat and lng in? (I cannot finish my analysis without knowing what table each column is in.)
Are from and to of datatype DATE? If so, the WHERE clause involving them seems to say "anytime today". Is that correct?
Get some of that cleaned up. After that, I may be able to suggest moving one of the Joins until after the GROUP BY and LIMIT. Or maybe the GROUP BY can be eliminated.
Some indexes that might be useful:
users: INDEX(status, approved, id, username, last_online_at, place_id)
up: INDEX(user_id, premium_expires_at)
da: INDEX(user_id, id)
users_feedback: INDEX(status, user_id)
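In DDL form those would be something like this (index names are made up; verify the column order against your actual query with EXPLAIN):
ALTER TABLE users ADD INDEX idx_users_cover (status, approved, id, username, last_online_at, place_id);
ALTER TABLE users_places ADD INDEX idx_up_user (user_id, premium_expires_at);
ALTER TABLE users_dates_available ADD INDEX idx_da_user (user_id, id);
ALTER TABLE users_feedback ADD INDEX idx_fb_status (status, user_id);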
Distance
The problem with distance queries is that the simple SELECT requires checking every row in the table. This is slow. I have a blog on what to do to improve performance for the general "find nearest" problem. It discusses 5 approaches, starting with the least efficient (what your code does): http://mysql.rjweb.org/doc.php/find_nearest_in_mysql

Query Speed Issue with NOT EXISTS condition

I have a query that works, but it is slow. Is there a way to speed this up? Basically I have a table with timecard entries, and then a second table with time breakdowns of that entry, related by the TimecardID. What I am looking for is time blocks that have no breakdowns. I thought that cutting the criteria down to 2 months would speed it up. Thanks for your help.
SELECT * FROM Timecards
WHERE NOT EXISTS (SELECT TimeCardID FROM TimecardBreakdown WHERE Timecards.ID = TimecardBreakdown.TimeCardID)
AND Status <> 0
AND DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
It seems you want to know the TimecardIDs which do not exist in the TimecardBreakdown table, in which case you can use the left outer join.
SELECT a.*
FROM Timecards a
LEFT OUTER JOIN TimecardBreakdown b ON a.ID = b.TimeCardID
WHERE b.TimeCardID IS NULL
This would get rid of the subquery (which is expensive) and use join (which is more efficient).
MySQL stinks at doing correlated subqueries fast. Try to make your subqueries independent and join them. You can use the LEFT JOIN ... IS NULL pattern to replace WHERE NOT EXISTS.
SELECT tc.*
FROM Timecards tc
LEFT JOIN TimecardBreakdown tcb ON tc.ID = tcb.TimeCardId
WHERE tc.DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
AND tc.Status <> 0
AND tcb.TimeCardId IS NULL
Some optimization points.
First, if you can change tc.Status <> 0 to tc.Status > 0 it makes an index range scan possible on that column.
Second, when you're optimizing stuff, SELECT * is considered harmful. Instead, if you can give the names of just the columns you need, things will be quicker. The database server has to sling around all the data you ask for; it can't tell if you're going to ignore some of it.
Third, this query will be helped by a compound index on Timecards (DateIn, Status, ID). That compound index can be used to do the heavy lifting of satisfying your query conditions.
That's called a covering index; it contains the data needed to satisfy much of your query. If you were to index just the DateIn column, then the query handler would have to bounce back to the main table to find the values of Status and ID. When those columns appear in the index, it saves that extra operation.
If you SELECT a certain set of columns rather than doing SELECT *, including those columns in the covering index can dramatically improve query performance. That's one of several reasons SELECT * is considered harmful.
(Some makes and models of DBMS have ways to specify lists of columns to ride along on indexes without actually indexing them. MySQL requires you to index them. But covering indexes still help.)
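A sketch of that compound index, plus one on the breakdown table so the LEFT JOIN ... IS NULL probe is an index lookup rather than a table scan (index names are invented):
ALTER TABLE Timecards ADD INDEX idx_tc_date_status (DateIn, Status, ID);
ALTER TABLE TimecardBreakdown ADD INDEX idx_tcb_timecard (TimeCardID);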
Read this: http://use-the-index-luke.com/

Is this the best approach to this complex MySQL multi table query?

I'm building a complex multi-table MySQL query, and even though it works, I'm wondering whether I could make it simpler.
The idea behind it is this: using the Events table that logs all site interaction, select the ID, Title, and Slug of the 10 most popular blog posts, ordered by the most hits descending.
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content, events
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND events.page_url REGEXP '^/posts/[0-9]'
AND content.id = events.content_id
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
Blog post URLs have the following format:
/posts/2013-05-16-hello-world
As I mentioned it seems to work, but I'm sure I could be doing this cleaner.
Thanks,
The condition on created and the condition on page_url are both range conditions. You can get index-assistance for only one range condition per table in a SQL query, so you have to pick one or the other to index.
I would create an index on the events table over two columns (content_id, created).
ALTER TABLE events ADD KEY (content_id, created);
I'm assuming that restricting by created date is more selective than restricting by page_url, because I assume "/posts/" is going to match a large majority of the events.
After narrowing down the matching rows by created date, the page-url condition will have to be handled by the SQL layer, but hopefully that won't be too inefficient.
There is no performance difference between SQL-89 ("comma-style") join syntax and SQL-92 JOIN syntax. I do recommend SQL-92 syntax because it's more clear and it supports outer joins, but performance is not a reason to use it. The SQL query optimizer supports both join styles.
Temporary table and filesort are often costly for performance. This query is bound to create a temporary table and use a filesort, because you're using GROUP BY and ORDER BY against different columns. You can only hope that the temp table will be small enough to fit within your tmp_table_size limit (or increase that value). But that won't help if content.title or content.slug are BLOB/TEXT columns, the temp table will be forced to be spooled on disk anyway.
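If you want to check whether the temp table is actually spilling to disk, something like this helps (session scope; the 64 MB figure is only an example, not a recommendation):
SHOW SESSION STATUS LIKE 'Created_tmp_disk_tables'; -- note the value, run the query, check again
SET SESSION tmp_table_size = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024; -- the in-memory limit is the smaller of the two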
Instead of a regular expression, you can use the left function:
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content
JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND LEFT(events.page_url, 7) = '/posts/'
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
But that's just off the top of my head, and without a fiddle, untested. The JOIN suggestion, made in the comment, is also good and has been reflected in my answer.

Improve JOIN query speed

I have this simple join that works great but is HORRIBLY slow, I think because the tech table is very large. There are many instances of each uid, since the table tracks a timestamp per uid, hence the DISTINCT. What is the best way to speed this query up?
SELECT DISTINCT tech.uid,
listing.empno,
listing.firstname,
listing.lastname
FROM tech,
listing
WHERE tech.uid = listing.empno
ORDER BY listing.empno ASC
First, add an index to tech.uid and listing.empno on their respective tables.
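For example (index names assumed; skip a statement if the column is already indexed or is the primary key):
ALTER TABLE tech ADD INDEX idx_tech_uid (uid);
ALTER TABLE listing ADD INDEX idx_listing_empno (empno);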
After you are sure the indexes are there, you can try to re-write your query like this:
SELECT DISTINCT tech.uid, listing.EmpNo, listing.FirstName, listing.LastName
FROM listing INNER JOIN tech ON tech.uid = listing.EmpNo
ORDER BY listing.EmpNo ASC;
If it's still not fast enough, put the word EXPLAIN before the query to get some hints about the execution plan of the query.
EXPLAIN SELECT DISTINCT tech.uid, listing.EmpNo, listing.FirstName, listing.LastName
FROM listing INNER JOIN tech ON tech.uid = listing.EmpNo
ORDER BY listing.EmpNo ASC;
Post the EXPLAIN results so we can get better insight.
Hope it helps,
This is a very simple query. The only thing you can do in SQL is add indexes on the fields used in the JOIN/WHERE and ORDER BY clauses (tech.uid, listing.empno), if there are no indexes already.
If there are JOIN fields with NULL values, they may ruin your performance. You should filter them out in the WHERE clause (WHERE tech.uid IS NOT NULL AND listing.empno IS NOT NULL). If there are many rows with a JOIN on a NULL field, that data may produce a Cartesian-product-like result containing an enormous number of rows.
You may also change the MySQL configuration. There are many options useful for performance tuning, like key_buffer_size, sort_buffer_size, tmp_table_size, max_heap_table_size, read_buffer_size, etc.
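These can be changed at runtime if you have the SUPER privilege; the values below are placeholders to show the syntax, not recommendations, so tune them against your own workload:
SET GLOBAL key_buffer_size = 256 * 1024 * 1024; -- MyISAM index cache
SET GLOBAL sort_buffer_size = 2 * 1024 * 1024; -- per-connection sort buffer
SET GLOBAL tmp_table_size = 64 * 1024 * 1024; -- in-memory temp table limit
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
SET GLOBAL read_buffer_size = 1 * 1024 * 1024; -- sequential scan buffer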

How do I join one table onto another where userid = userid but only for that date?

I'm looking to take the total time a user worked on each batch at his workstation, the total estimated work that was completed, the amount the user was paid, and how many failures the user has had for each day this year. If I can join all of this into one query then I can use it in excel and format things nicely in pivot tables and such.
EDIT: I've realized that it is only possible to do this in multiple queries, so I have narrowed my scope down to this:
SELECT batch_log.userid,
batches.operation_id,
SUM(TIME_TO_SEC(ramses.batch_log.time_elapsed)),
SUM(ramses.tasks.estimated_nonrecurring + ramses.tasks.estimated_recurring),
DATE(start_time)
FROM batch_log
JOIN batches ON batch_log.batch_id=batches.id
JOIN ramses.tasks ON ramses.batch_log.batch_id=ramses.tasks.batch_id
JOIN protocase.tblusers on ramses.batch_log.userid = protocase.tblusers.userid
WHERE DATE(ramses.batch_log.start_time) > "2011-01-01"
AND protocase.tblusers.active = 1
GROUP BY userid, batches.operation_id, start_time
ORDER BY start_time, userid ASC
The cross join was causing the problem.
No, in general a Having clause is used to filter the results of your Group by - for example, only reporting those who were paid for more than 24 hours in a day (HAVING SUM(ramses.timesheet_detail.paidTime) > 24). Unless you need to perform filtering of aggregate results, you shouldn't need a having clause at all.
Most of those conditions should be moved into a where clause, or as part of the joins, for two reasons - 1) Filtering should in general be done as soon as possible, to limit the work the query needs to perform. 2) If the filtering is already done, restating it may cause the query to perform additional, unneeded work.
From what I've seen so far, it appears that you're trying to roll things up by the day. Try changing the last column in the group by clause to date(ramses.batch_log.start_time); otherwise you're grouping by (what I assume is) a timestamp.
EDIT:
About schema names - yes, you can name them in the from and join sections. Often, too, the query may be able to resolve the needed schemas based on some default search list (how or if this is set up depends on your database).
Here is how I would have reformatted the query:
SELECT tblusers.userid, operations.name AS name,
SUM(TIME_TO_SEC(batch_log.time_elapsed)) AS time_elapsed,
SUM(tasks.estimated_nonrecurring + tasks.estimated_recurring) AS total_estimated,
SUM(timesheet_detail.paidTime) as hours_paid,
DATE(start_time) as date_paid
FROM tblusers
JOIN batch_log
ON tblusers.userid = batch_log.userid
AND DATE(batch_log.start_time) >= "2011-01-01"
JOIN batches
ON batch_log.batch_id = batches.id
JOIN operations
ON operations.id = batches.operation_id
JOIN tasks
ON batches.id = tasks.batch_id
JOIN timesheet_detail
ON tblusers.userid = timesheet_detail.userid
AND batch_log.start_time = timesheet_detail.for_day
AND DATE(timesheet_detail.for_day) = DATE(start_time)
WHERE tblusers.departmentid = 8
GROUP BY tblusers.userid, name, DATE(batch_log.start_time)
ORDER BY date_paid ASC
Of particular concern is the batch_log.start_time = timesheet_detail.for_day line, which is comparing (what are implied to be) timestamps. Are these really equal? I expect that one or both of these should be wrapped in a date() function.
As for why you may be getting unexpected data - you appear to have eliminated some of your join conditions. Without knowing the exact setup and use of your database, I cannot give the exact reason for your results (or even able to say they are wrong), but I think the fact that you join to the operations table without any join condition is probably to blame - if there are 2 records in that table, it will double all of your previous results, and it looks like there may be 12. You also removed operations.name from the group by clause, which may or may not give you the results you want. I would look into the rest of your table relationships, and see if there are any further restrictions that need to be made.