OK, so we have a rather large database on a WordPress website. Every month or so we run a query that removes old logs from the database, and I've been using the query below to accomplish it.
I essentially look at the total rows that need removing and keep running this query until the logs are all gone.
However, with a LIMIT of 1000 the query takes around 30 seconds to complete. The posts table contains around 400,000 entries and on this occasion, around 60,000 of these need removing.
Any help is much appreciated! I'm a novice when it comes to SQL, so this query will probably hurt some people's eyes, but please be gentle, I'm learning! :)
-- collect up to 1000 old log post IDs into a temp table
CREATE TEMPORARY TABLE LOGSTOCLEAN
SELECT
    WTCUP.ID
FROM
    wp_posts WTCUP
    LEFT JOIN wp_term_relationships WTCUTR ON WTCUP.ID = WTCUTR.object_id
    LEFT JOIN wp_term_taxonomy WTCUTT ON WTCUTR.term_taxonomy_id = WTCUTT.term_taxonomy_id
WHERE
    WTCUTT.term_id IN (10)
    AND WTCUP.post_date < DATE_SUB(NOW(), INTERVAL 45 DAY)
GROUP BY
    WTCUP.ID
LIMIT 1000;

-- then delete those rows from wp_posts
DELETE FROM wp_posts
WHERE wp_posts.ID IN (SELECT ID FROM LOGSTOCLEAN)
LIMIT 1000;
Rather than a row limit, a time cutoff would allow the oldest rows to be removed, which we might hope are co-located by clustering or partitioning, or at least quickly locatable by an index. Select the minimum date-time plus some interval and delete everything older than that.
Partitioning by day means you can just clean/drop one partition at a time. Your RDBMS may vary! Dropping a day (partition?) every day at midnight seems sensible!
PARTITIONing is out of the question since you are purging only rows with term_id=10, and that test is not in wp_posts.
It sounds like there are more than 1000 rows to remove on a typical day? Then consider removing only, say, 300. But do it every hour instead of every month. A "cron" job or a MySQL "Event" could take care of running that automatically. (If there are more than 300 in an hour, don't worry, the next invocation will take care of more.)
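For reference, a hedged sketch of what such an hourly MySQL Event might look like; the event name, the 300-row batch size, and the derived-table wrapper around the LIMITed SELECT (MySQL does not allow LIMIT directly inside an IN subquery) are my assumptions, not part of the original setup:
-- requires the event scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT purge_old_term10_logs
ON SCHEDULE EVERY 1 HOUR
DO
  DELETE FROM wp_posts
  WHERE ID IN (
    SELECT ID FROM (
      SELECT WTCUP.ID
      FROM wp_posts WTCUP
      JOIN wp_term_relationships WTCUTR ON WTCUP.ID = WTCUTR.object_id
      JOIN wp_term_taxonomy WTCUTT ON WTCUTR.term_taxonomy_id = WTCUTT.term_taxonomy_id
      WHERE WTCUTT.term_id = 10
        AND WTCUP.post_date < NOW() - INTERVAL 45 DAY
      LIMIT 300                 -- small batch; the next run picks up the rest
    ) AS batch                  -- derived table works around the LIMIT-in-IN restriction
  );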
You have not really asked a Question. Is what you are doing too slow? Interferes with other activities? Too 'manual'? Or something else?
If it is a performance question, then please provide SHOW CREATE TABLE for each table.
Related
Which is the faster way to execute a query that updates a lot of rows?
The first example query will set the points column to 0 for every row whose last_visit is 7 or more days old.
Afterwards, in both cases, an additional query updates all the rows whose last_visit falls within the last 7 days.
The table currently has about 140,000 rows. The first query updates 110,000 rows and the second 140,000.
UPDATE the_table SET points = 0 where DATE(last_visit) <= DATE_SUB(CURDATE(), INTERVAL 7 DAY) AND type is NULL
or
UPDATE the_table SET points = 0 where type is NULL
Your two UPDATEs do very different things and both use WHERE clauses, so any speed comparison is useless. The first checks both for a 7-day period AND the type being NULL, while the second only checks the second condition. They can potentially affect vastly different amounts of data (which your edit shows).
Asking which is faster is akin to saying "I have a dump truck and a Ferrari. Which is faster?" - the answer depends on whether you're going to move 10 tons of sand or go zero to 60 to merge into highway traffic. Asking which UPDATE performs better doesn't make any more sense - it depends on which rows you actually want to UPDATE. Use the one that does what you really want to do and stop worrying about which is faster.
Before doing an UPDATE or DELETE that will affect a lot of rows, it's always a good idea to run a SELECT using the same WHERE clause to see if the data that is going to be updated is what you expect. You'll appreciate it the first time you realize that you were about to execute an UPDATE with the wrong conditions that would have caused major problems or a DELETE that would have lost valuable data.
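For instance, a quick sanity check for the first UPDATE above could use the same table and WHERE clause from the question:
-- preview how many rows the first UPDATE would touch before running it
SELECT COUNT(*)
FROM the_table
WHERE DATE(last_visit) <= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  AND type IS NULL;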
[site_list] ~100,000 rows... 10mb in size.
site_id
site_url
site_data_most_recent_record_id
[site_list_data] ~ 15+ million rows and growing... about 600mb in size.
record_id
site_id
site_connect_time
site_speed
date_checked
columns in bold are unique index keys.
I need to return the 50 most recently updated sites AND the recent data that goes with them - connect time, speed, date...
This is my query:
SELECT SQL_CALC_FOUND_ROWS
site_list.site_url,
site_list_data.site_connect_time,
site_list_data.site_speed,
site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
ON site_list.site_data_most_recent_record_id = site_list_data.record_id
ORDER BY site_list_data.date_checked DESC
LIMIT 50
Without the ORDER BY and SQL_CALC_FOUND_ROWS (I need it for pagination), the query takes about 1.5 seconds; with those it takes over 2 seconds, which is not good enough because the page where this data will be shown gets 20K+ pageviews/day and this query is apparently too heavy (the server almost dies when I put it live) and too slow.
MySQL experts, how would you do this? What if the table got to 100 million records? Caching this huge result into a temp table every 30 seconds is the only other solution I've got.
You need to add a heuristic to gate the query and get reasonable performance. As written, it is effectively sorting your site_list_data table by date descending -- the ENTIRE table.
So, if you know that the top 50 will be within the last day or week, add a "and date_checked > <boundary_date>" to the query. Then it should reduce the overall result set first, and THEN sort it.
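A rough sketch of that gated query, assuming a one-week window is always wide enough to contain the newest 50 rows (tune the interval to your data):
SELECT site_list.site_url,
       site_list_data.site_connect_time,
       site_list_data.site_speed,
       site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
       ON site_list.site_data_most_recent_record_id = site_list_data.record_id
WHERE site_list_data.date_checked > NOW() - INTERVAL 7 DAY
ORDER BY site_list_data.date_checked DESC
LIMIT 50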
SQL_CALC_FOUND_ROWS is slow; use a separate COUNT query instead.
A couple of observations.
Both ORDER BY and SQL_CALC_FOUND_ROWS are going to add to the cost of your query. ORDER BY clauses can potentially be improved with appropriate indexing -- do you have an index on your date_checked column? This could help.
What is your exact need for SQL_CALC_FOUND_ROWS? Consider replacing this with a separate query that uses COUNT instead. This can be vastly better assuming your Query Cache is enabled.
And if you can use COUNT, consider replacing your LEFT JOIN with an INNER JOIN as this will help performance as well.
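For example, the pagination total could come from a separate query (counting site_list here is an assumption; count whatever set your pagination actually covers):
-- cheaper than SQL_CALC_FOUND_ROWS, and cacheable by the Query Cache
SELECT COUNT(*) FROM site_list;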
Good luck.
I've spent a few hours playing around with this one, without success so far.
I'm outputting a very large query, and trying to split it into chunks before processing the data. This query will basically run every day, and one of the fields ('last_checked') will be used to ensure the same data isn't processed more than once a day.
Here's my existing query;
<cfquery name="getprice" maxrows="100">
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
ORDER BY ID ASC
</cfquery>
I then run a cfoutput query on the results to do various updates. The table currently holds just over 100,000 records and is starting to struggle to process everything in one hit, hence the need to split it into chunks.
My intention is to cfschedule it to run every so often (I'll increase the maxrows and probably have it run every 15 minutes, for example). However, I need it to only return results that haven't been updated within the last 24 hours - this is where I'm getting stuck.
I know MySQL has its own DateDiff and TimeDiff functions, but I don't seem to be able to grasp the syntax for them - if indeed they're applicable for my use (the docs seem to contradict themselves in that regard, or at least the ones I've read).
Any pointers very much appreciated!
Try this with MySQL first:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 24 HOUR -- only rows not checked in the last 24 hours
ORDER BY ID ASC
I would caution you against using maxrows=100 in your cfquery. This will still return the full recordset to CF from the database, and only then will CF filter out all but the first 100 rows. When you are dealing with a 100,000 row dataset, that is going to be hugely expensive.

Presumably, your last_checked filter will dramatically reduce the size of your base result set, so perhaps this won't really be a big problem. However, if you find that even by limiting your set to rows not checked within the last 24 hours you still have a very large set of records to work with, you could change the way you do this to work much more efficiently. Instead of using CF to filter your results, have MySQL do it using the LIMIT keyword in your query:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 1 DAY
ORDER BY ID ASC
LIMIT 0,100
You could also easily step between "pages" of 100 rows by adding an offset value before the LIMIT count: LIMIT 300, 100 would return rows 301-400 of your result set. Doing the paging this way will be much faster than offloading it to CF.
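For instance, the second chunk under the same filter would simply shift the offset (this assumes the last_checked condition shown above):
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 1 DAY
ORDER BY ID ASC
LIMIT 100, 100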
I have a game application in which users answer questions, and the rating is based on the time elapsed answering those questions.
I am trying to build a query that returns the rating for the top 20 players. The game has some stages, and I need to retrieve the players who have played all stages (assume the number of stages is 5).
This is what I have written:
SELECT `usersname` , `time`
FROM `users`
WHERE `users`.`id`
IN (
SELECT `steps`.`user_id`
FROM `steps`
GROUP BY `steps`.`user_id`
HAVING COUNT( `steps`.`id` ) = 5
)
ORDER BY `time` ASC
LIMIT 20
In the inner SELECT I am selecting all user_id-s who have played 5 stages (steps). The query works correctly, but it's horribly slow: it takes about a minute and a half to execute. Can you provide some tips on optimizing it? The inner SELECT returns about 2000 rows.
Feel free to ask me if you need additional information.
Try with JOIN, instead of IN (SELECT ...):
SELECT usersname , `time`
FROM users
JOIN
( SELECT steps.user_id
FROM steps
GROUP BY steps.user_id
HAVING COUNT(*) = 5
) grp
ON grp.user_id = users.id
ORDER BY `time` ASC
LIMIT 20
Assuming that you have an index on users.time, which is the first obvious optimization, replacing HAVING with WHERE in the inner query may be worth a try.
The query optimizer might do this already if you are lucky, but you cannot rely on it, and strictly to the specification, HAVING runs after fetching every record whereas WHERE prunes them before.
If that does not help, simply having a counter in the users table that increments for every stage completed might speed things up, eliminating the sub-query. This will make completing a stage minimally slower (but this won't happen a million times per second!), but it will be very fast to query only the users who have completed all 5 stages (especially if you have an index on that field).
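A hedged sketch of that counter idea; the column name stages_completed and the composite index are assumptions:
-- add a counter that is bumped each time a user completes a stage
ALTER TABLE users
  ADD COLUMN stages_completed TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_stages_time (stages_completed, `time`);

-- on stage completion (application side), something like:
UPDATE users SET stages_completed = stages_completed + 1 WHERE id = 123;

-- the leaderboard then needs no sub-query at all
SELECT usersname, `time`
FROM users
WHERE stages_completed = 5
ORDER BY `time` ASC
LIMIT 20;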
Also, using memcached or some similar caching technology may be worthwhile for something like a highscore, which is typically "not necessarily 100% accurate to the second, changing slowly, queried billions of times" data.
If memcached is not an option, even writing the result to a temp file and re-using that for 1-2 seconds (or even longer) would be an option. Nobody will notice. Even if you cache highscores for as long as 1-2 minutes, still nobody will take offense because that is just "how long it takes".
I think you should use WHERE instead of HAVING. Also, in my opinion you should do this in a stored function: run the inner query, store the results, and run the outer query based on the results of your inner query.
This use case may benefit from de-normalization. There is no need to search through all 2000 user records just to determine whether a user belongs in the top 20.
Create a Top_20_Users table.
After the 5th stage check if user's time is less than any in the Top_20_Users table. If yes, update the slowest/worst record.
Things you can do with this.
Since the Top_20_Users table will be so small, add a field for stage and include Top 20 times for each stage as well as for all five stages completed.
Let the Top_20_Users table grow. A history of all top 20 users ever, their times and the date when that time was good enough to be a top 20. Show trends as users learn the game and the top 20 times get better and better.
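A possible shape for that table (hypothetical column names and types; stage 0 here stands for the full five-stage run):
CREATE TABLE Top_20_Users (
  user_id     INT NOT NULL,
  stage       TINYINT NOT NULL DEFAULT 0,   -- 0 = all five stages combined
  best_time   TIME NOT NULL,
  achieved_on DATETIME NOT NULL,
  INDEX idx_stage_time (stage, best_time)   -- fast lookup of the current worst top-20 time
);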
Here's the query (the largest table has about 40,000 rows)
SELECT
Course.CourseID,
Course.Description,
UserCourse.UserID,
UserCourse.TimeAllowed,
UserCourse.CreatedOn,
UserCourse.PassedOn,
UserCourse.IssuedOn,
C.LessonCnt
FROM
UserCourse
INNER JOIN
Course
USING(CourseID)
INNER JOIN
(
SELECT CourseID, COUNT(*) AS LessonCnt FROM CourseSection GROUP BY CourseID
) C
USING(CourseID)
WHERE
UserCourse.UserID = 8810
If I run this, it executes very quickly (.05 seconds roughly). It returns 13 rows.
When I add an ORDER BY clause at the end of the query (ordering by any column) the query takes about 10 seconds.
I'm using this database in production now, and everything is working fine. All my other queries are speedy.
Any ideas of what it could be? I ran the query in MySQL's Query Browser, and from the command line. Both places it was dead slow with the ORDER BY.
EDIT: Tolgahan ALBAYRAK solution works, but can anyone explain why it works?
maybe this helps:
SELECT * FROM (
SELECT
Course.CourseID,
Course.Description,
UserCourse.UserID,
UserCourse.TimeAllowed,
UserCourse.CreatedOn,
UserCourse.PassedOn,
UserCourse.IssuedOn,
C.LessonCnt
FROM
UserCourse
INNER JOIN
Course
USING(CourseID)
INNER JOIN
(
SELECT CourseID, COUNT(*) AS LessonCnt FROM CourseSection GROUP BY CourseID
) C
USING(CourseID)
WHERE
UserCourse.UserID = 8810
) AS t ORDER BY CourseID
Is the column you're ordering by indexed?
Indexing drastically speeds up ordering and filtering.
You are selecting from "UserCourse" which I assume is a joining table between courses and users (Many to Many).
You should index the column that you need to order by, in the "UserCourse" table.
Suppose you want to "order by CourseID", then you need to index it on UserCourse table.
Ordering by any other column that is not present in the joining table (i.e. UserCourse) may require further denormalization and indexing on the joining table to be optimized for speed;
In other words, you need to have a copy of that column in the joining table and index it.
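A minimal sketch of the index being described, assuming the ORDER BY is on CourseID (the index name is made up):
-- covers the WHERE UserID = 8810 filter and returns rows already in CourseID order
ALTER TABLE UserCourse
  ADD INDEX idx_usercourse_user_course (UserID, CourseID);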
P.S.
The answer given by Tolgahan Albayrak, although correct for this question, would not produce the desired result, in cases where one is doing a "LIMIT x" query.
Have you updated the statistics on your database? I ran into something similar on mine where I had 2 identical queries whose only difference was a capital letter; one returned in half a second and the other took nearly 5 minutes. Updating the statistics resolved the issue.
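In MySQL, refreshing the statistics would be something like:
-- rebuild index statistics for the tables involved in the query
ANALYZE TABLE UserCourse, Course, CourseSection;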
I realise this answer is too late, but I have just had a similar problem: adding ORDER BY increased the query time from seconds to 5 minutes. Having tried most other suggestions for speeding it up, I noticed that the /tmp files for this query were getting to be 12G. I changed the query so that a varchar(20000) field being returned was wrapped in TRIM(), and performance dramatically improved (back to seconds). So I guess it's worth checking whether you are returning large varchars as part of your query and, if so, processing them (maybe substring(x, 1, length(x))?) if you don't want to trim them.
Query was returning 500k rows and the /tmp file indicated that each row was using about 20k of data.
A similar question about composite indexes and how ORDER BY works was asked before; it might help you as well.
Today I ran into the same kind of problem. As soon as I sorted the result set by a field from a joined table, the whole query became horribly slow and took more than a hundred seconds.
The server was running MySQL 5.0.51a, and by chance I noticed that the same query ran as fast as it always should have on a server with MySQL 5.1. Comparing the EXPLAIN output for the query showed that the usage and handling of indexes has obviously changed a lot (at least from 5.0 to 5.1).
So if you encounter such a problem, maybe the solution is simply to upgrade your MySQL version.