I'm building a complex multi-table MySQL query, and even though it works, I'm wondering whether I could make it simpler.
The idea behind it is this: using the Events table, which logs all site interaction, select the ID, Title, and Slug of the 10 most popular blog posts, ordered by hits descending.
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content, events
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND events.page_url REGEXP '^/posts/[0-9]'
AND content.id = events.content_id
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
Blog post URLs have the following format:
/posts/2013-05-16-hello-world
As I mentioned, it seems to work, but I'm sure I could be doing this more cleanly.
Thanks,
The condition on created and the condition on page_url are both range conditions. You can get index-assistance for only one range condition per table in a SQL query, so you have to pick one or the other to index.
I would create an index on the events table over two columns (content_id, created).
ALTER TABLE events ADD KEY (content_id, created);
I'm assuming that restricting by created date is more selective than restricting by page_url, because I assume "/posts/" is going to match a large majority of the events.
After narrowing down the matching rows by created date, the page-url condition will have to be handled by the SQL layer, but hopefully that won't be too inefficient.
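If you want to verify which condition the optimizer actually uses the index for, prepend EXPLAIN and look at the key and Extra columns. A sketch against the query above (untested, and written with SQL-92 joins as recommended below):
EXPLAIN
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content
JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
  AND events.page_url REGEXP '^/posts/[0-9]'
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10;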
There is no performance difference between SQL-89 ("comma-style") join syntax and SQL-92 JOIN syntax. I do recommend SQL-92 syntax because it's more clear and it supports outer joins, but performance is not a reason to use it. The SQL query optimizer supports both join styles.
Temporary tables and filesorts are often costly for performance. This query is bound to create a temporary table and use a filesort, because you're using GROUP BY and ORDER BY against different columns. You can only hope that the temp table will be small enough to fit within your tmp_table_size limit (or increase that value). But that won't help if content.title or content.slug are BLOB/TEXT columns; in that case the temp table is forced onto disk anyway.
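If you want to check or raise those limits for your session while testing, a minimal sketch (the 64 MB figure is purely illustrative; the effective in-memory limit is the smaller of the two variables):
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SET SESSION tmp_table_size = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024;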
Instead of a regular expression, you can use the LEFT() function:
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content
JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
  AND LEFT(events.page_url, 7) = '/posts/'
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
But that's just off the top of my head, and without a fiddle, untested. The JOIN suggestion, made in the comment, is also good and has been reflected in my answer.
Related
I have a query that works, but it is slow. Is there a way to speed it up? Basically, I have a table with timecard entries, and a second table with time breakdowns of each entry, related by TimecardID. What I am looking for is timecard entries that have no breakdowns. I thought that cutting the criteria down to 2 months would speed it up. Thanks for your help
SELECT * FROM Timecards
WHERE NOT EXISTS (SELECT TimeCardID FROM TimecardBreakdown WHERE Timecards.ID = TimecardBreakdown.TimeCardID)
AND Status <> 0
AND DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
It seems you want to know the TimecardIDs which do not exist in the TimecardBreakdown table, in which case you can use the left outer join.
SELECT a.*
FROM Timecards a
LEFT OUTER JOIN TimecardBreakdown b ON a.ID = b.TimeCardID
WHERE b.TimeCardID IS NULL
This would get rid of the subquery (which is expensive) and use join (which is more efficient).
MySQL stinks at executing correlated subqueries quickly. Try to make your subqueries independent and join them. You can use the LEFT JOIN ... IS NULL pattern to replace WHERE NOT EXISTS.
SELECT tc.*
FROM Timecards tc
LEFT JOIN TimecardBreakdown tcb ON tc.ID = tcb.TimeCardId
WHERE tc.DateIn >= CURRENT_DATE() - INTERVAL 2 MONTH
AND tc.Status <> 0
AND tcb.TimeCardId IS NULL
Some optimization points.
First, if you can change tc.Status <> 0 to tc.Status > 0 it makes an index range scan possible on that column.
Second, when you're optimizing stuff, SELECT * is considered harmful. Instead, if you can give the names of just the columns you need, things will be quicker. The database server has to sling around all the data you ask for; it can't tell if you're going to ignore some of it.
Third, this query will be helped by a compound index on Timecards (DateIn, Status, ID). That compound index can be used to do the heavy lifting of satisfying your query conditions.
That's called a covering index; it contains the data needed to satisfy much of your query. If you were to index just the DateIn column, then the query handler would have to bounce back to the main table to find the values of Status and ID. When those columns appear in the index, it saves that extra operation.
If you SELECT a certain set of columns rather than doing SELECT *, including those columns in the covering index can dramatically improve query performance. That's one of several reasons SELECT * is considered harmful.
(Some makes and models of DBMS have ways to specify lists of columns that ride along on indexes without actually being indexed. MySQL requires you to index them. But covering indexes still help.)
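For reference, a minimal sketch of the indexes discussed above (the index names are arbitrary, and the second statement assumes TimecardBreakdown.TimeCardID is not already indexed; it helps the LEFT JOIN ... IS NULL probe):
ALTER TABLE Timecards ADD INDEX idx_datein_status_id (DateIn, Status, ID);
ALTER TABLE TimecardBreakdown ADD INDEX idx_timecardid (TimeCardID);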
Read this: http://use-the-index-luke.com/
I am running the query below to retrieve the unique latest result based on a date field within the same table. But this query takes more and more time as the table grows. Any suggestion to improve this is welcome.
select t2.*
from (
    select (
        select id
        from ctc_pre_assets ti
        where ti.ctcassettag = t1.ctcassettag
        order by ti.createddate desc
        limit 1
    ) lid
    from (
        select distinct ctcassettag
        from ctc_pre_assets
    ) t1
) ro, ctc_pre_assets t2
where t2.id = ro.lid
order by id
Our table may contain the same row multiple times, each with a different timestamp. My objective is, based on a single column (for example assettag), to retrieve a single row for each assettag with the latest timestamp.
It's simpler, and probably faster, to find the newest date for each ctcassettag and then join back to find the whole row that matches.
This does assume that no ctcassettag has multiple rows with the same createddate, in which case you can get back more than one row per ctcassettag.
SELECT ctc_pre_assets.*
FROM ctc_pre_assets
INNER JOIN (
    SELECT ctcassettag, MAX(createddate) AS createddate
    FROM ctc_pre_assets
    GROUP BY ctcassettag
) newest
    ON newest.ctcassettag = ctc_pre_assets.ctcassettag
    AND newest.createddate = ctc_pre_assets.createddate
ORDER BY ctc_pre_assets.id
EDIT: To deal with multiple rows with the same date.
You haven't actually said how to pick which row you want in the event that multiple rows are for the same ctcassettag on the same createddate. So, this solution just chooses the row with the lowest id from amongst those duplicates.
SELECT ctc_pre_assets.*
FROM ctc_pre_assets
WHERE ctc_pre_assets.id = (
    SELECT lookup.id
    FROM ctc_pre_assets lookup
    WHERE lookup.ctcassettag = ctc_pre_assets.ctcassettag
    ORDER BY lookup.createddate DESC, lookup.id ASC
    LIMIT 1
)
This does still use a correlated sub-query, which is slower than a simple nested-sub-query (such as my first answer), but it does deal with the "duplicates".
You can change the rules on which row to pick by changing the ORDER BY in the correlated sub-query.
It's also very similar to your own query, but with one less join.
Nested queries are known to take longer than conventional queries. Can you prepend EXPLAIN to the query and post the results here? That will help us analyse exactly which query/table is taking longer to respond.
Check whether the table has indexes. Unindexed tables are not advisable (unless there is an obvious reason to leave them unindexed) and are alarmingly slow in executing queries.
On the contrary, I think the best approach is to avoid writing nested queries altogether. Better: run each of the queries separately and then use the results (in array or list format) in the second query, as sketched below.
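A minimal sketch of that two-step approach (untested; the tag/date pairs in step 2 are placeholders that your application would supply from the results of step 1):
-- step 1: collect the newest createddate per tag
SELECT ctcassettag, MAX(createddate) AS maxdate
FROM ctc_pre_assets
GROUP BY ctcassettag;
-- step 2: feed those pairs back in as constants from the application
SELECT *
FROM ctc_pre_assets
WHERE (ctcassettag, createddate) IN (('TAG-1', '2016-01-01 10:00:00'),
                                     ('TAG-2', '2016-02-03 09:30:00'));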
First some questions that you should at least ask yourself, but maybe also give us an answer to improve the accuracy of our responses:
Is your data normalized? If yes, maybe you should make an exception to avoid this brutal subquery problem
Are you using indexes? If yes, which ones, and are you using them to the fullest?
Some suggestions to improve the readability and maybe performance of the query:
- Use joins
- Use group by
- Use aggregators
Example (untested, so might not work, but should give an impression):
SELECT t2.*
FROM (
SELECT id AS lid
FROM ctc_pre_assets
GROUP BY ctcassettag
HAVING createddate = max(createddate)
ORDER BY ctcassettag DESC
) ro
INNER JOIN ctc_pre_assets t2 ON t2.id = ro.lid
ORDER BY id
Using normalization is great, but there are a few caveats where normalization causes more harm than good. This seems like such a situation, but without your tables in front of me, I can't tell for sure.
Using distinct the way you are doing, I can't help but get the feeling you might not get all relevant results - maybe someone else can confirm or deny this?
It's not that subqueries are all bad, but they tend to create massive scalability issues if written incorrectly. Make sure you use them the right way (google it?)
Indexes can potentially save you a bunch of time - if you actually use them. It's not enough to set them up; you have to write queries that actually use your indexes. Google this as well.
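For the query in this question, a composite index like the one below would let the per-tag latest-date lookup be answered from the index (a sketch; the index name is arbitrary):
ALTER TABLE ctc_pre_assets ADD INDEX idx_tag_created (ctcassettag, createddate);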
I have a very big unindexed table called table with rows like this:
IP entrypoint timestamp
171.128.123.179 /page-title/?kw=abc 2016-04-14 11:59:52
170.45.121.111 /another-page/?kw=123 2016-04-12 04:13:20
169.70.121.101 /a-third-page/ 2016-05-12 09:43:30
I want to make the fastest query that, given 30 IPs and one date, will search rows as far back as a week before that date and return the most recent row that contains "?kw=" for each IP. So I want DISTINCT entrypoints, but only the most recent one for each IP.
I'm stuck on this. I know it's a relatively simple INNER JOIN, but I don't know the fastest way to do it.
By the way: I can't add an index right now because the table is very big and on a db that serves a live website. I'm going to replace it with an indexed table, don't worry.
Rows from the table
SELECT ...
FROM very_big_unindexed_table t
only within the past week...
WHERE t.timestamp >= NOW() + INTERVAL - 1 WEEK
that contains '?kw=' in the entry point
AND t.entrypoint LIKE '%?kw=%'
only the latest row for each IP. There are a couple of approaches to that. A correlated subquery on a very big unindexed table is going to eat your lunch and your lunch box. And without an index, there's no getting around a full scan of the table and a "Using filesort" operation.
Given the unfortunate circumstances, our best bet for performance is likely going to be getting the set whittled down as small as we can, and then perform the sort, and avoid any join operations (back to that table) and avoid correlated subqueries.
So, let's start with something like this, to return all of the rows from the past week with '?kw=' in entry point. This is going to be full scan of the table, and a sort operation...
SELECT t.ip
, t.timestamp
, t.entry_point
FROM very_big_unindexed_table t
WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
AND t.entrypoint LIKE '%?kw=%'
ORDER BY t.ip DESC, t.timestamp DESC
We can use an unsupported trick with user-defined variables. (The MySQL Reference Manual specifically warns against using a pattern like this, because the behavior is officially undefined. Unofficially, the optimizer in MySQL 5.1 and 5.5, at least, is very predictable.)
I think this is going to be about as good as you are going to get, if the rows from the past week make up a significant subset of the entire table. This is going to create a sizable intermediate result set (derived table) if there are a lot of rows that satisfy the predicates.
SELECT q.ip
, q.entrypoint
, q.timestamp
FROM (
SELECT IF(t.ip = @prev_ip, 0, 1) AS new_ip
, @prev_ip := t.ip AS ip
, t.timestamp AS timestamp
, t.entrypoint AS entrypoint
FROM (SELECT @prev_ip := NULL) i
CROSS
JOIN very_big_unindexed_table t
WHERE t.timestamp >= NOW() + INTERVAL -1 WEEK
AND t.entrypoint LIKE '%?kw=%'
ORDER BY t.ip DESC, t.timestamp DESC
) q
WHERE q.new_ip
Execution of that query will require (in terms of what's going to take the time):
- a full scan of the table (there's no way to get around that)
- a sort operation (again, there's no way around that)
- materializing a derived table containing all of the rows that satisfy the predicates
- a pass through the derived table to pull out the "latest" row for each IP
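Once you swap in the indexed replacement table you mentioned, an index like the one below would give the optimizer something to work with for both the range filter and the per-IP ordering (a sketch; table and column names are taken from your sample data):
ALTER TABLE very_big_unindexed_table ADD INDEX idx_ip_timestamp (ip, timestamp);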
I have a problem optimizing a really slow SQL query. I think it is an index problem, but I can't find which index I have to apply.
This is the query:
SELECT
cl.ID, cl.title, cl.text, cl.price, cl.URL, cl.ID AS ad_id, cl.cat_id,
pix.file_name, area.area_name, qn.quarter_name
FROM classifieds cl
/*FORCE INDEX (date_created) */
INNER JOIN classifieds_pix pix ON cl.ID = pix.classified_id AND pix.picture_no = 0
INNER JOIN zip_codes zip ON cl.zip_id = zip.zip_id AND zip.area_id = 132
INNER JOIN area_names area ON zip.area_id = area.id
LEFT JOIN quarter_names qn ON zip.quarter_id = qn.id
WHERE
cl.confirmed = 1
AND cl.country = 'DE'
AND cl.date_created <= NOW() - INTERVAL 1 DAY
ORDER BY cl.date_created DESC
LIMIT 7
MySQL takes about 2 seconds to get the result, and starts working on pix.picture_no, but if I force the index on "date_created" the query goes much faster, taking only 0.030 s. The problem is that the "INNER JOIN zip_codes..." is not always in the query, and when it is not, the forced index makes the query slow again.
I've been thinking of building the query conditionally in PHP, but I would like to know what the problem with the indexes is.
Here are several suggestions on how to optimize your query.
NOW Function - You're using the NOW() function in your WHERE clause. Instead, I recommend using a constant date/timestamp, to allow the value to be cached and optimized. Otherwise, the value of NOW() may be evaluated for each row in the WHERE clause. If you need a dynamic value, an alternative to a constant is to add the value from the application (for example, calculate the current timestamp and inject it into the query as a constant before executing the query).
To test this recommendation before implementing this change, just replace NOW() with a constant timestamp and check for performance improvements.
Indexes - In general, I would suggest adding an index that contains all columns of your WHERE clause, in this case: confirmed, country, date_created. Start with the column that cuts the amount of data down the most and move forward from there. Note that it's the column order within the index that matters, not the order of the conditions in the WHERE clause.
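To test that, something like the statement below, assuming no such index exists yet (the index name is arbitrary):
ALTER TABLE classifieds ADD INDEX idx_confirmed_country_created (confirmed, country, date_created);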
I used EverSQL SQL Query Optimizer to get these recommendations (disclaimer: I'm a co-founder of EverSQL and humbly provide these suggestions).
I would actually have a compound index on all elements of your where such as
(country, confirmed, date_created)
Having country first would keep the optimized index subset to one country first, then within that, those that are confirmed, and finally the date range itself. Don't query on just the date index alone. Since you are ordering by date, the index should be able to optimize that too.
Add explain in front of the query and run it again. This will show you the indexes that are being used.
See: 13.8.2 EXPLAIN Statement
And for an explanation of explain see MySQL Explain Explained. Or: Optimizing MySQL: Queries and Indexes
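For example, a trimmed-down sketch against the query above (the full query works the same way):
EXPLAIN
SELECT cl.ID, cl.title
FROM classifieds cl
WHERE cl.confirmed = 1
  AND cl.country = 'DE'
  AND cl.date_created <= NOW() - INTERVAL 1 DAY
ORDER BY cl.date_created DESC
LIMIT 7;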
I have the following query which is a little expensive (currently 500ms):
SELECT * FROM events AS e, event_dates AS ed
WHERE e.id=ed.event_id AND ed.start >= DATE(NOW())
GROUP BY e.modified_datetime, e.id
ORDER BY e.modified_datetime DESC,e.created_datetime DESC
LIMIT 0,4
I have been trying to figure out how to speed it up and noticed that changing ed.start >= DATE(NOW()) to ed.start = DATE(NOW()) runs the query in 20ms. Can anyone help me with ways to speed up this date comparison? Would it help to calculate DATE(NOW()) before running the query?
EDIT: does this help? Output from the EXPLAIN statement:
BEFORE
table=event_dates
type=range
rows=25962
ref=null
extra=using where; Using temporary; Using filesort
AFTER
table=event_dates
type=ref
rows=211
ref=const
extra=Using temporary; Using filesort
SELECT * FROM events AS e
INNER JOIN event_dates AS ed ON (e.id=ed.event_id)
WHERE ed.start >= DATE(NOW())
GROUP BY e.modified_datetime, e.id
ORDER BY e.modified_datetime DESC,e.created_datetime DESC
LIMIT 0,4
Remarks
Please don't use implicit SQL-89 join syntax; it is an SQL anti-pattern.
Make sure you have an index on all fields used in the join, in the where, in the group by and the order by clauses.
Don't do select * (another anti-pattern), explicitly state the fields you need instead.
Try using InnoDB instead of MyISAM; InnoDB has more optimization tricks for select statements, especially if you only select indexed fields.
For MyISAM tables try using REPAIR TABLE tablename.
For InnoDB that's not an option, but forcing the tabletype from MyISAM to InnoDB will obviously force a full rebuild of the table and all indexes.
Group by implicitly sorts the rows in ASC order; try changing the group by to group by -e.modified_datetime, e.id to minimize the reordering needed by the order by clause. (Not sure about this point; I would like to know the result.)
For reference, using , notation for joins is poor practice AND has been a cause for poor execution plans.
SELECT
*
FROM
events AS e
INNER JOIN
event_dates AS ed
ON e.id=ed.event_id
WHERE
ed.start >= DATE(NOW())
GROUP BY
e.modified_datetime,
e.id
ORDER BY
e.modified_datetime DESC,
e.created_datetime DESC
LIMIT 0,4
Why = is faster than >= is simply because >= is a range of values, not one specific value. It's like saying "get me every page in the book from page 101 onwards" instead of "get me page 101". It's more intensive by definition, especially as your query then involves aggregating and sorting many more records.
In terms of optimisation, your best option is to ensure relevant indexes...
event_dates:
- an index just on start should be sufficient
events:
- an index on id will dramatically improve the join performance
- adding modified_datetime and created_datetime to that index may help
You are probably missing indexes on the fields you are grouping and searching on. Please provide us with the output of: SHOW INDEXES FROM events and SHOW INDEXES FROM event_dates
If there are no indexes then you can add them:
ALTER TABLE events ADD INDEX(modified_datetime);
ALTER TABLE events ADD INDEX(created_datetime);
ALTER TABLE event_dates ADD INDEX(start);
Also be sure you have them on id fields. But here you would probably like to have them as primary keys.
Calculating DATE(NOW()) in advance will not have any impact on performance; it's computed only once per query, not for each row. But you have 2 different queries (one with >=, another with =). It seems natural that the first one (>=) takes longer to execute, since it returns many more rows. Also, the optimizer may choose a different execution plan for it compared to the query with =, for example a full table scan instead of an index seek/scan.
You can do something like this:
SET @currentdate = NOW();
Then change your code to use the @currentdate variable: ed.start >= @currentdate