Using ColdFusion DateDiff within a MySQL Query?

I've spent a few hours playing around with this one, without success so far.
I'm outputting a very large query, and trying to split it into chunks before processing the data. This query will basically run every day, and one of the fields ('last_checked') will be used to ensure the same data isn't processed more than once a day.
Here's my existing query:
<cfquery name="getprice" maxrows="100">
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
ORDER BY ID ASC
</cfquery>
I then run a cfoutput query on the results to do various updates. The table currently holds just over 100,000 records and is starting to struggle to process everything in one hit, hence the need to split it into chunks.
My intention is to cfschedule it to run every so often (I'll increase the maxrows and probably have it run every 15 minutes, for example). However, I need it to only return results that haven't been updated within the last 24 hours - this is where I'm getting stuck.
I know MySQL has its own DateDiff and TimeDiff functions, but I don't seem to be able to grasp the syntax for them - if indeed they're applicable for my use (the docs seem to contradict themselves in that regard - or at least the ones I've read).
Any pointers very much appreciated!

Try this with MySQL first:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 24 HOUR
ORDER BY ID ASC
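For reference, the same filter can be written a couple of other ways in MySQL; here is a quick sketch (I'm assuming last_checked is a DATETIME/TIMESTAMP column, and the IS NULL test is only needed if some rows have never been checked at all):
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source = 'api'
  -- rows last checked more than a day ago, plus never-checked rows (assumption)
  AND (last_checked < DATE_SUB(NOW(), INTERVAL 1 DAY) OR last_checked IS NULL)
ORDER BY ID ASC
-- roughly equivalent alternative: AND TIMESTAMPDIFF(HOUR, last_checked, NOW()) >= 24
DATE_SUB(NOW(), INTERVAL 1 DAY) and current_timestamp - INTERVAL 24 HOUR evaluate to the same boundary, so pick whichever reads better to you.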

I would caution you against using maxrows=100 in your cfquery. This will still return the full recordset to CF from the database, and only then will CF filter out all but the first 100 rows. When you are dealing with a 100,000 row dataset, that is going to be hugely expensive. Presumably, your filter for only the last 24 hours will dramatically reduce the size of your base result set, so perhaps this won't really be a big problem.
However, if you find that even after limiting your set to rows changed within the last 24 hours you still have a very large set of records to work with, you can do this much more efficiently. Instead of using CF to filter your results, have MySQL do it using the LIMIT keyword in your query:
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source='api'
AND last_checked < current_timestamp - INTERVAL 1 DAY
ORDER BY ID ASC
LIMIT 0,100
You could also easily step through "pages" of 100 rows by adding an offset value before the LIMIT: LIMIT 300, 100 would return rows 301-400 of your result set. Doing the paging this way will be much faster than offloading it to CF.
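As a rough sketch of what one of those pages would look like (the offset here is just an illustration; your CF code would supply or compute it):
-- fourth "page" of the filtered set: skip 300 rows, return the next 100
SELECT ID, source, last_checked, price
FROM product_prices
WHERE source = 'api'
  AND last_checked < current_timestamp - INTERVAL 1 DAY
ORDER BY ID ASC
LIMIT 300, 100
One thing worth noting: if your update step sets last_checked on every row it processes, those rows drop out of the WHERE filter on the next run anyway, so each scheduled run could simply keep using LIMIT 0, 100 with no offset bookkeeping at all.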

Related

SQL IN Query performance - better split it or not

I get up to 1000 IDs from another server to display to visitors, so I have to use an IN query like:
SELECT * FROM `table` WHERE `id` IN (23221, 42422, 2342342....) -- and so on, up to 1000
Let's say 1/3 of the visitors will look through all 1000 IDs, while 2/3 of them will only look at the first 50.
What would be better for performance/workload: one query for all 1000 IDs, or splitting them into about 20 queries of 50 IDs each? So when the first 50 have been viewed, query for the next 50, and so on.
EDIT:
I don't need to use LIMIT when splitting, which means each query would contain at most 50 IDs. So what's better: one query with 1000 IDs at once, or 20 queries of 50 IDs each?
EDIT:
OK, to ask it briefly and more directly: aren't 1000 IDs in one query too many? I have read here (How to optimize an SQL query with many thousands of WHERE clauses) that tons of WHERE/OR conditions are bad.
Let's say 1/3 of the visitors will look through all 1000 IDs, while 2/3 of them will only look at the first 50.
So you want to optimize the response based on your assumption about how visitors will use it.
What would be better for performance/workload: one query for all 1000 IDs, or splitting them into about 20 queries of 50 IDs each? So when the first 50 have been viewed, query for the next 50, and so on.
Yes, you are correct: you should limit the returned response.
This is one example of how you can implement your requirement (I don't know much MySQL, but this is how you could get the desired result).
SELECT * FROM `table` WHERE `id` IN (23221, 42422, 2342342....)
order by `id`
LIMIT 10 OFFSET 10
If it was SQL Server:
CREATE PROCEDURE sp_SomeName
    @id VARCHAR(8000),
    @skip INT,
    @take INT
AS
BEGIN
    SELECT * FROM some_table WHERE id IN (23221, 42422, 2342342....)
    ORDER BY id
    OFFSET @skip ROWS          -- if 0, start selecting from the first row
    FETCH NEXT @take ROWS ONLY -- if 10, then at most 10 rows are returned
END
What the above query will do is: it will get all the data for the posted IDs, then order it by id in ascending order. Then from there it will choose just the first 10/50/100; the next time, it will choose the next 10/50/100, or whatever your take and skip values are. Hope this helps :)
You can look at the answer provided here:
MySQL Data - Best way to implement paging?
With the LIMIT clause you can return only a portion of the result, and by changing the parameters in the LIMIT clause you can re-use the query.
Do know that unless you use an ORDER BY, a SQL server does not always return the same records in the same order. In other words, if a record is momentarily unavailable to read because of a concurrent update while the database server can read the next record, it will fetch that next record instead (to give a result as soon as possible). I do not know for sure whether LIMIT forces a database server to take some sort of order into consideration (I am not that familiar with MySQL).
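To make the ordering point concrete, here is a minimal paging sketch (table and column names taken from the question; the ID list is abbreviated). The ORDER BY is what makes the two pages non-overlapping and repeatable:
-- page 1: first 50 matched rows, in a stable order
SELECT * FROM `table`
WHERE `id` IN (23221, 42422, 2342342 /* ... up to 1000 IDs */)
ORDER BY `id`
LIMIT 50 OFFSET 0;

-- page 2: the next 50 -- same query, only the OFFSET changes
SELECT * FROM `table`
WHERE `id` IN (23221, 42422, 2342342 /* ... up to 1000 IDs */)
ORDER BY `id`
LIMIT 50 OFFSET 50;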

MySQL optimization problems with LIMIT keyword

I'm trying to optimize a MySQL query. The below query runs great as long as there are greater than 15 entries in the database for a particular user.
SELECT activityType, activityClass, startDate, endDate, activityNum, count(*) AS activityType
FROM (
SELECT activityType, activityClass, startDate, endDate, activityNum
FROM ActivityX
WHERE user=?
ORDER BY activityNum DESC
LIMIT 15) temp
WHERE startDate=? OR endDate=?
GROUP BY activityType
When there are fewer than 15 entries, the performance is terrible: my timing is roughly 25 ms vs. 4000 ms. (I need "15" to ensure I get all the relevant data.)
I found these interesting sentences:
"LIMIT N" is the keyword and N is any number starting from 0, putting 0 as the limit does not return any records in the query. Putting a number say 5 will return five records. If the records in the specified table are less than N, then all the records from the queried table are returned in the result set. [source: guru99.com]
To get around this problem, I'm using a heuristic to guess if the number of entries for a user is small - if so, I use a different query that takes about 1500 ms.
Is there anything I'm missing here? I can not use an index since the data is encrypted.
Thanks much,
Jon
I think an index on ActivityX(user, ActivityNum) will solve your problem.
I am guessing that you have an index on (ActivityNum) alone, and the optimizer is trying to figure out whether it should use that index; this is what causes the threshold behaviour you are seeing. The composite index is a better match for the query.
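A sketch of that composite index (the index name is arbitrary, and this assumes user and activityNum are plain indexable columns):
-- covers both the WHERE user = ? filter and the ORDER BY activityNum DESC
CREATE INDEX idx_activityx_user_activitynum
    ON ActivityX (`user`, activityNum);
With this in place, MySQL can walk the user's entries in activityNum order and stop after 15 rows, instead of sorting that user's entire history first.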

Is it faster to update the whole table, or to use WHERE clause?

Which is the faster way to execute a query that updates a lot of rows?
The first query example will set the points column to 0 for every row whose last_visit was 7 or more days ago.
Afterwards, in both cases, there is an additional query that writes to all rows whose last_visit was within the last 7 days.
The table currently has about 140,000 rows; the first query updates 110,000 rows and the second 140,000.
UPDATE the_table SET points = 0 where DATE(last_visit) <= DATE_SUB(CURDATE(), INTERVAL 7 DAY) AND type is NULL
or
UPDATE the_table SET points = 0 where type is NULL
Your two UPDATES do far different things and both use WHERE clauses, so any speed comparison is useless. The first checks both for a 7 day period AND the type being NULL, while the second only checks the second condition. They can potentially affect vastly different amounts of data (which your edit shows).
Asking which is faster is akin to saying "I have a dump truck and a Ferrari. Which is faster?" - the answer depends on whether you're going to move 10 tons of sand or go zero to 60 to merge into highway traffic. Your UPDATE performance doesn't make any more sense - it depends on which rows you actually want to UPDATE. Use the one that does what you really want to do and stop worrying about which is faster.
Before doing an UPDATE or DELETE that will affect a lot of rows, it's always a good idea to run a SELECT using the same WHERE clause to see if the data that is going to be updated is what you expect. You'll appreciate it the first time you realize that you were about to execute an UPDATE with the wrong conditions that would have caused major problems or a DELETE that would have lost valuable data.
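For the queries above, that dry run might look something like this (a sketch; use whichever WHERE clause you actually intend to run):
-- preview how many rows the first UPDATE would touch
SELECT COUNT(*)
FROM the_table
WHERE DATE(last_visit) <= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
  AND type IS NULL;
If the count (or a sample of the rows via SELECT *) looks right, run the UPDATE with the same WHERE clause.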

how group by having limit works

Can someone explain how the construction GROUP BY + HAVING + LIMIT actually works? MySQL query:
SELECT
id,
avg(sal)
FROM
StreamData
WHERE
...
GROUP BY
id
HAVING
avg(sal)>=10.0
AND avg(sal)<=50.0
LIMIT 100
The query without the LIMIT and HAVING clauses executes in about 7 seconds; with LIMIT, it returns instantly if the condition covers a large amount of data, or in ~7 seconds otherwise.
The documentation says that LIMIT executes after HAVING, which executes after GROUP BY, which would mean the query should always take ~7 seconds. Please help me figure out what is actually being limited by the LIMIT clause.
Using LIMIT 100 simply tells MySQL to return only the first 100 records from your result set. Assuming that you are measuring the query time as the round trip from Java, then one component of the query time is the network time needed to move the result set from MySQL across the network. This can take a considerable time for a large result set, and using LIMIT 100 should reduce this time to zero or near zero.
Things are logically applied in a certain pipeline in SQL:
Table expressions are generated and executed (FROM, JOIN)
Rows filtered (WHERE)
Projections and aggregations applied (column list, aggregates, GROUP BY)
Aggregations filtered (HAVING)
Results limited (LIMIT, OFFSET)
Now these may be composed into a different execution order by the planner if that is safe but you always get the proper data out if you think through them in this order.
So group by groups, then these are filtered with having, then the results of that are truncated.
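One caveat worth adding: without an ORDER BY, which 100 qualifying groups you get back is not defined. If the "first 100" needs to be deterministic, order explicitly, e.g. (a sketch based on the query in the question):
SELECT id, AVG(sal)
FROM StreamData
WHERE ...                                    -- same filter as in the question
GROUP BY id
HAVING AVG(sal) >= 10.0 AND AVG(sal) <= 50.0
ORDER BY id                                  -- makes the returned 100 groups deterministic
LIMIT 100
Be aware that an explicit ORDER BY may force the server to produce and sort all qualifying groups before the LIMIT applies, so it can trade the instant response back for the ~7 second one.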
As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS. The number of rows can then be retrieved with SELECT FOUND_ROWS(). See Section 13.14, "Information Functions".
http://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html
This effectively means that if your table has a rather hefty number of rows, the server doesn't need to look at all of them. It can stop as soon as it has found 100, because it knows that's all you need.

mysql left join with a VERY large table - super slow

[site_list] ~100,000 rows... 10mb in size.
site_id
site_url
site_data_most_recent_record_id
[site_list_data] ~ 15+ million rows and growing... about 600mb in size.
record_id
site_id
site_connect_time
site_speed
date_checked
columns in bold are unique index keys.
I need to return 50 most recently updated sites AND the recent data that goes with it - connect time, speed, date...
This is my query:
SELECT SQL_CALC_FOUND_ROWS
site_list.site_url,
site_list_data.site_connect_time,
site_list_data.site_speed,
site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
ON site_list.site_data_most_recent_record_id = site_list_data.record_id
ORDER BY site_list_data.date_checked DESC
LIMIT 50
Without the ORDER BY and SQL_CALC_FOUND_ROWS (I need it for pagination), the query takes about 1.5 seconds; with them it takes 2 seconds or more, which is not good enough because the page where this data will be shown gets 20K+ pageviews/day, and this query is apparently too heavy (the server almost dies when I put it live) and too slow.
MySQL experts, how would you do this? What if the table got to 100 million records? Caching this huge result into a temp table every 30 seconds is the only other solution I've got.
You need to gate the query with a heuristic to get reasonable performance: as written, it is effectively sorting your site_list_data table by date descending -- the ENTIRE table.
So, if you know that the top 50 will be within the last day or week, add an "AND date_checked > <boundary_date>" condition to the query. Then it should reduce the overall result set first, and THEN sort it.
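Roughly like this (the one-week boundary is only an example value; pick whatever window you know will always contain the top 50):
SELECT SQL_CALC_FOUND_ROWS
       site_list.site_url,
       site_list_data.site_connect_time,
       site_list_data.site_speed,
       site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
       ON site_list.site_data_most_recent_record_id = site_list_data.record_id
WHERE site_list_data.date_checked > NOW() - INTERVAL 7 DAY  -- the gate
ORDER BY site_list_data.date_checked DESC
LIMIT 50
Note that filtering on the joined table like this effectively turns the LEFT JOIN into an INNER JOIN, which is fine here since you only want sites that actually have recent data.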
SQL_CALC_FOUND_ROWS is slow; use COUNT instead. Take a look here
A couple of observations.
Both ORDER BY and SQL_CALC_FOUND_ROWS add to the cost of the query. ORDER BY clauses can potentially be improved with appropriate indexing -- do you have an index on your date_checked column? This could help.
What is your exact need for SQL_CALC_FOUND_ROWS? Consider replacing this with a separate query that uses COUNT instead. This can be vastly better assuming your Query Cache is enabled.
And if you can use COUNT, consider replacing your LEFT JOIN with an INNER JOIN as this will help performance as well.
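Putting those suggestions together, a rough sketch (the index name is illustrative, and this assumes date_checked is not already indexed):
-- index to support ORDER BY date_checked DESC ... LIMIT 50
CREATE INDEX idx_sld_date_checked ON site_list_data (date_checked);

-- page query: INNER JOIN, no SQL_CALC_FOUND_ROWS
SELECT sl.site_url, sld.site_connect_time, sld.site_speed, sld.date_checked
FROM site_list sl
INNER JOIN site_list_data sld
        ON sl.site_data_most_recent_record_id = sld.record_id
ORDER BY sld.date_checked DESC
LIMIT 50;

-- separate, cacheable count for the pagination links
SELECT COUNT(*)
FROM site_list sl
INNER JOIN site_list_data sld
        ON sl.site_data_most_recent_record_id = sld.record_id;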
Good luck.