I have a system that checks websites for certain data at set frequencies. Each website has its own check frequency in the crawl_frequency column. This value is in days.
I have a table like this
CREATE TABLE `websites` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`domain` VARCHAR(191) NOT NULL COLLATE 'utf8mb4_unicode_ci',
`crawl_frequency` TINYINT(3) UNSIGNED NOT NULL DEFAULT '3',
`last_crawled_start` TIMESTAMP NULL DEFAULT NULL,
PRIMARY KEY (`id`)
)
I want to run queries to find new websites to check at their specified check frequency/interval. At the moment I have this query which works fine if the crawl_frequency for a website is set to one day.
SELECT domain
FROM websites
WHERE last_crawled_start <= (now() - INTERVAL 1 DAY)
LIMIT 1
Is there any way in a MySQL query to use the value in the crawl_frequency column for each row in the WHERE clause?
For example, I'd like to do something like:
SELECT domain
FROM websites
WHERE last_crawled_start <= (now() - INTERVAL {{INSERT VALUE OF CRAWL FREQUENCY FOR THIS PARTICULAR WEBSITE}} DAY)
LIMIT 1
You can do it like so:
SELECT domain
FROM websites
WHERE last_crawled_start <= NOW() - INTERVAL crawl_frequency DAY
LIMIT 1
Yes, really.
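A minimal sketch of the same idea, assuming you also want websites that have never been crawled (last_crawled_start IS NULL) to be picked up. The ORDER BY is likewise an assumption, added so the most overdue website comes first:

```sql
-- Rows with a NULL last_crawled_start never satisfy the <= comparison,
-- so they are included explicitly (an assumption about the desired behaviour).
SELECT domain
FROM websites
WHERE last_crawled_start IS NULL
   OR last_crawled_start <= NOW() - INTERVAL crawl_frequency DAY
ORDER BY last_crawled_start   -- NULLs sort first in ascending order in MySQL
LIMIT 1;
```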
You can try the DATEDIFF function, like this:
SELECT domain FROM websites
WHERE DATEDIFF(NOW(), last_crawled_start) > crawl_frequency
LIMIT 1;
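One thing to keep in mind: DATEDIFF() compares only the date parts of its arguments, so it counts calendar-day boundaries rather than elapsed 24-hour periods. A small illustration with made-up literal timestamps:

```sql
-- The two timestamps are two minutes apart, but fall on different calendar days.
SELECT DATEDIFF('2024-01-02 00:01:00', '2024-01-01 23:59:00') AS calendar_days,       -- 1
       TIMESTAMPDIFF(DAY, '2024-01-01 23:59:00', '2024-01-02 00:01:00') AS full_days; -- 0
```

Because DATEDIFF(NOW(), last_crawled_start) wraps the column in a function, an index on last_crawled_start (if you add one) is unlikely to be used for this filter.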
Everything I read about MySQL said the interval can't be a variable, but you can use another function, e.g.:
SELECT * FROM websites
WHERE
(unix_timestamp() - unix_timestamp(last_crawled_start))/86400.0 > crawl_frequency
My table is defined as following:
CREATE TABLE `tracking_info` (
`tid` int(25) NOT NULL AUTO_INCREMENT,
`tracking_customer_id` int(11) NOT NULL DEFAULT '0',
`tracking_content` text NOT NULL,
`tracking_type` int(11) NOT NULL DEFAULT '0',
`time_recorded` int(25) NOT NULL DEFAULT '0',
PRIMARY KEY (`tid`),
KEY `time_recorded` (`time_recorded`),
KEY `tracking_idx` (`tracking_customer_id`,`tracking_type`,
`time_recorded`,`tid`)
) ENGINE=MyISAM
The table contains about 150 million records. Here is the query:
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE FROM_UNIXTIME(time_recorded) > DATE_SUB( NOW( ) ,
INTERVAL 90 DAY )
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10
It takes about a minute to run the query even without ORDER BY. Any thoughts? Thanks in advance!
First, refactor the query so it's sargable.
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(DATE_SUB( NOW( ) , INTERVAL 90 DAY ))
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10;
Then add this multi-column index:
ALTER TABLE tracking_info
ADD INDEX cust_time (tracking_customer_id, time_recorded DESC);
Why will this help?
It compares the raw data in a column with a constant, rather than using the FROM_UNIXTIME() function to transform all the data in that column of the table. That makes the query sargable.
The query planner can random-access the index I suggest to the first eligible row, then read ten rows sequentially from the index and look up what it needs from the table, then stop.
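One way to check whether the planner behaves as described is to run EXPLAIN on the rewritten query. This is only a sketch: it assumes the cust_time index above has been created, and the exact plan depends on your MySQL version and data.

```sql
EXPLAIN
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY)
  AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 10;
-- Things to look for: cust_time in the "key" column,
-- and no "Using filesort" in the "Extra" column.
```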
You can rephrase the query to isolate time_recorded, as in:
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 90 DAY))
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10
Then, the following index will probably make the query faster:
create index ix1 on tracking_info (tracking_customer_id, time_recorded);
There are 3 things to do:
Change to InnoDB.
Add INDEX(tracking_customer_id, time_recorded)
Rephrase to time_recorded > UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY)
Non-critical notes:
int(25) -- the "25" has no meaning. You get a 4-byte signed number regardless.
There are datatypes DATETIME and TIMESTAMP; consider using one of them instead of an INT that represents seconds since the epoch. (It would be messy to change, so don't bother.)
When converting to InnoDB, the size on disk will double or triple.
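The first two steps can be sketched as follows. The index name cust_time_idx is made up for this example, and on a table of about 150 million rows each ALTER rewrites or scans the whole table, so expect it to take a while:

```sql
-- 1. Convert the table to InnoDB (rebuilds the whole table).
ALTER TABLE tracking_info ENGINE=InnoDB;

-- 2. Add the composite index: equality column first, range column second.
--    cust_time_idx is a hypothetical name.
ALTER TABLE tracking_info
  ADD INDEX cust_time_idx (tracking_customer_id, time_recorded);
```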
I have a date table, which has a column date (PK). The CREATE script is here:
CREATE TABLE date_table (
date DATE
,year INT(4)
,month INT(2)
,day INT(2)
,month_pad VARCHAR(2)
,day_pad VARCHAR(2)
,month_name VARCHAR(10)
,year_month_index INT(6)
,year_month_hypname VARCHAR(7)
,year_month_name VARCHAR(15)
,week_day_index INT(1)
,day_name VARCHAR(9)
,week INT(2)
,week_interval VARCHAR(13)
,weekend_fl INT(1)
,quarter_num INT(1)
,quarter_num_pad VARCHAR(2)
,quarter_name VARCHAR(2)
,year_quarter_index INT(6)
,year_quarter_name VARCHAR(7)
,PRIMARY KEY (date)
);
Now I would like to select rows from this table with dynamic values, using functions such as LAST_DAY() or DATE_SUB(DATE_FORMAT(SYSDATE(),'%Y-01-01'), INTERVAL X YEAR), etc.
When one of my queries failed and didn't execute in 30 secs, I knew something was fishy, and it looks like the reason is that the index on the primary key column is not used. Here are my results (sorry for using an image instead of copying the queries, but I thought it's concise enough for this purpose, and the queries are short/simple enough):
First of all, it's strange that BETWEEN works differently from using >= and <=. Secondly, it looks like the index is only used for constant values. If you look closely, you can see that on the right side (where >= and <= are used), it shows ~9K rows, which is half of the rows in the table (the table has about 18K rows, with dates from 2000-01-01 to 2050-12-31).
SYSDATE() returns the time at which it executes. This differs from the behavior for NOW(), which returns a constant time that indicates the time at which the statement began to execute. (Within a stored function or trigger, NOW() returns the time at which the function or triggering statement began to execute.)
-- https://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html#function_sysdate
That is, the Optimizer does not see this as a "constant". Otherwise, the Optimizer eagerly evaluates any "constant expressions", then tries to take advantage of knowing the value.
See also the sysdate_is_now option.
Bottom line: Don't use SYSDATE() for normal datetime usage; use NOW() or CURDATE().
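The difference is easy to see with SLEEP(), which pauses between the two calls. A small sketch (the column aliases are arbitrary):

```sql
-- NOW() is fixed when the statement starts, so both columns show the same time.
SELECT NOW() AS before_sleep, SLEEP(2), NOW() AS after_sleep;

-- SYSDATE() is evaluated each time it is called, so the two values differ by
-- about 2 seconds (unless the server runs with the sysdate_is_now option).
SELECT SYSDATE() AS before_sleep, SLEEP(2), SYSDATE() AS after_sleep;
```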
Looks like if I use CURRENT_DATE() (or NOW()) instead of SYSDATE(), it's working. Both of these queries:
SELECT *
FROM date_table t
WHERE 1 = 1
AND t.ddate >= LAST_DAY(CURRENT_DATE()) AND t.ddate <= LAST_DAY(CURRENT_DATE());
SELECT *
FROM date_table t
WHERE 1 = 1
AND t.ddate >= LAST_DAY(NOW()) AND t.ddate <= LAST_DAY(NOW());
Give the same result, which is this:
I will accept my answer as a solution, but I'm still looking for an explanation. I thought it might have something to do with SYSDATE() not being a DATE, but NOW() is also not a DATE...
EDIT: Forgot to add, BETWEEN is also working as I see.
We have a logging table which grows as new events happen. At the moment we have around 120,000 rows of log events stored.
The events table looks like this:
CREATE TABLE `EVENTS` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`EVENT` varchar(255) NOT NULL,
`ORIGIN` varchar(255) NOT NULL,
`TIME_STAMP` TIMESTAMP NOT NULL,
`ADDITIONAL_REMARKS` json DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=137007 DEFAULT CHARSET=utf8
Additional_Remarks is a JSON field because different applications log into this table and can add more information to the event which happened. I did not want to put any data structure here, because this information can be different. For example one project management application can log:
ID, "new task created", "app", NOW(), {"project": {"id": 1}, "creator": {"id": 1}}
While other applications do not have projects or creator, but maybe cats and owners they want to store in the Additional_Remarks field.
Queries can use the Additional_Remarks field to filter information for one specific application like:
SELECT
DISTINCT(ADDITIONAL_REMARKS->"$.project.id") as 'project',
COUNT(CASE WHEN EVENT = 'new task created' THEN 1 END) AS 'new_task'
FROM EVENTS
WHERE DATE(TIME_STAMP) >= DATE(NOW()) - INTERVAL 30 DAY
AND ORIGIN = "app"
GROUP BY project
ORDER BY new_task DESC
LIMIT 10;
Output EXPLAIN query:
'1', 'SIMPLE', 'EVENTS', NULL, 'ALL', NULL, NULL, NULL, NULL, '136459', '100.00', 'Using where; Using temporary; Using filesort'
With this query I get the top 10 projects with the most created tasks for the last 30 days. It works fine, but the query gets slower and slower as our table grows. With 120,000 rows it needs over 30 seconds.
Do you know any way to improve the speed? The newest information in the table, with the highest id, is more important than older entries. Often I look only for entries from the last X days. It would be useful to stop the query at the first entry that is older than the X days in the WHERE clause, as all further entries are even older.
If TIME_STAMP is indexed, wrapping it in the DATE() function prevents the index from being used, because the predicate is no longer sargable.
WHERE DATE(TIME_STAMP) >= DATE(NOW()) - INTERVAL 30 DAY
can be rewritten as:
WHERE TIME_STAMP >= DATE(NOW()) - INTERVAL 30 DAY
Do you know any way to improve the speed?
The only way I can see to speed up the query is a multi-column index on TIME_STAMP and ORIGIN, like so: ALTER TABLE EVENTS ADD KEY timestamp_origin (TIME_STAMP, ORIGIN); together with my query adjustment above.
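As a possible variation (an assumption, not something measured on this table): composite indexes generally work best with the equality column first and the range column second, so an index on (ORIGIN, TIME_STAMP) may be worth comparing with EXPLAIN. The index name below is made up:

```sql
-- Hypothetical index name; equality column (ORIGIN) first, range column (TIME_STAMP) second.
ALTER TABLE EVENTS ADD KEY origin_timestamp (ORIGIN, TIME_STAMP);

-- Check which index the optimizer picks for the 30-day filter.
EXPLAIN
SELECT COUNT(*)
FROM EVENTS
WHERE ORIGIN = 'app'
  AND TIME_STAMP >= CURDATE() - INTERVAL 30 DAY;
```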
EDIT
A derived table may also improve query speed, because it will use the new index.
SELECT
ADDITIONAL_REMARKS->"$.project.id" AS 'project',
COUNT(CASE WHEN EVENT = 'new task created' THEN 1 END) AS 'new_task'
FROM (
SELECT
*
FROM EVENTS
WHERE
TIME_STAMP >= DATE(NOW()) - INTERVAL 30 DAY
AND
ORIGIN = "app"
)
AS events_within_30_days
GROUP BY project
ORDER BY new_task DESC
LIMIT 10;
An inner select that already reduces the number of rows cut the query time from 30 sec to 0.05 sec.
It looks like:
SELECT
ADDITIONAL_REMARKS->"$.project.id" AS 'project',
COUNT(CASE WHEN EVENT = 'new task created' THEN 1 END) AS 'new_task'
FROM (
SELECT *
FROM EVENTS WHERE
EVENT = 'new task created'
AND TIME_STAMP >= DATE(NOW()) - INTERVAL 30 DAY
AND ORIGIN = "app" ) AS events_within_30_days
GROUP BY project
ORDER BY new_task DESC
LIMIT 10;
I have an SQL query that randomly selects 1200 tweets from 40 million records; each must have been retweeted at least 50 times, and its tweetDate must be more than 4 days old. The query pasted below works, but it takes 40 minutes. Is there a faster version of it?
SELECT
originalTweetId, Count(*) as total, tweetContent, tweetDate
FROM
twitter_gokhan2.tweetentities
WHERE
originalTweetId IS NOT NULL
AND originalTweetId <> - 1
AND isRetweet = true
AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
GROUP BY originalTweetId
HAVING total > 50
ORDER BY RAND()
limit 0 , 1200;
---------------------------------------------------------------
Table creation sql is like:
CREATE TABLE `tweetentities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tweetId` bigint(20) NOT NULL,
`tweetContent` varchar(360) DEFAULT NULL,
`tweetDate` datetime DEFAULT NULL,
`userId` bigint(20) DEFAULT NULL,
`userName` varchar(100) DEFAULT NULL,
`retweetCount` int(11) DEFAULT '0',
`keyword` varchar(500) DEFAULT NULL,
`isRetweet` bit(1) DEFAULT b'0',
`isCompleted` bit(1) DEFAULT b'0',
`applicationId` int(11) DEFAULT NULL,
`latitudeData` double DEFAULT NULL,
`longitude` double DEFAULT NULL,
`originalTweetId` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index` (`originalTweetId`),
KEY `index3` (`applicationId`),
KEY `index2` (`tweetId`),
KEY `index4` (`userId`),
KEY `index5` (`userName`),
KEY `index6` (`isRetweet`),
KEY `index7` (`tweetDate`),
KEY `index8` (`originalTweetId`),
KEY `index9` (`isCompleted`),
KEY `index10` (`tweetContent`(191))
) ENGINE=InnoDB AUTO_INCREMENT=41501628 DEFAULT CHARSET=utf8mb4$$
You are, of course, summarizing a huge number of records, then randomizing them. This kind of thing is hard to make fast. Going back to the beginning of time makes it worse. Searching on a null condition just trashes it.
If you want this to perform reasonably, you must get rid of the IS NOT NULL selection. Otherwise, it will perform badly.
But let us try to find a reasonable solution. First, let's get the originalTweetId values we need.
SELECT MIN(id) originalId,
MIN(tweetDate) tweetDate,
originalTweetId,
Count(*) as total
FROM twitter_gokhan2.tweetentities
WHERE originalTweetId <> -1
/*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
AND isRetweet = true
AND tweetDate < CURDATE() - INTERVAL 4 DAY
AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
GROUP BY originalTweetId
HAVING total >= 50
This summary query gives us the lowest id number and date in your database for each subject tweet.
To get this to run fast, we need a compound index on (originalTweetId, isRetweet, tweetDate, id). The query will do a range scan of this index on tweetDate, which is about as fast as you can hope for. Debug this query, both for correctness and performance, then move on.
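Creating that index might look like this. The index name retweet_summary is made up, and building it on roughly 40 million rows will take some time:

```sql
-- Hypothetical index name. InnoDB secondary indexes already carry the primary key,
-- so the trailing id is harmless but not strictly required.
ALTER TABLE twitter_gokhan2.tweetentities
  ADD INDEX retweet_summary (originalTweetId, isRetweet, tweetDate, id);
```

Note that the existing `index` and `index8` keys both cover only originalTweetId, so one of them is redundant.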
Now do the randomization. Let's do this with the minimum amount of data we can, to avoid sorting some enormous amount of stuff.
SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
FROM (
SELECT MIN(id) originalId,
MIN(tweetDate) tweetDate,
originalTweetId,
Count(*) as total
FROM twitter_gokhan2.tweetentities
WHERE originalTweetId <> -1
/*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
AND isRetweet = true
AND tweetDate < CURDATE() - INTERVAL 4 DAY
AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
GROUP BY originalTweetId
HAVING total >= 50
) AS retweets
ORDER BY randomOrder
LIMIT 1200
Great. Now we have a list of 1200 tweet ids and dates in random order. Now let's go get the content.
SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
FROM (
/* that whole query above */
) AS a
JOIN twitter_gokhan2.tweetentities AS b ON (a.originalId = b.id)
ORDER BY a.randomOrder
See how this goes? Use a compound index to do your summary, and do it on the minimum amount of data. Then do the randomizing, then go fetch the extra data you need.
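Put together, the three steps could look like the sketch below. It assumes the 30-day lower bound is acceptable for your use case, and it exposes originalId from the middle query so the final join has something to match on:

```sql
SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
FROM (
    -- Step 2: attach a random sort key to the small summary set and keep 1200 rows.
    SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
    FROM (
        -- Step 1: summarize retweets per original tweet.
        -- originalTweetId <> -1 is never true for NULL, so NULL ids drop out as well.
        SELECT MIN(id)        AS originalId,
               MIN(tweetDate) AS tweetDate,
               originalTweetId,
               COUNT(*)       AS total
        FROM twitter_gokhan2.tweetentities
        WHERE originalTweetId <> -1
          AND isRetweet = true
          AND tweetDate < CURDATE() - INTERVAL 4 DAY
          AND tweetDate > CURDATE() - INTERVAL 30 DAY
        GROUP BY originalTweetId
        HAVING total >= 50
    ) AS retweets
    ORDER BY randomOrder
    LIMIT 1200
) AS a
-- Step 3: fetch the content only for the 1200 chosen rows.
JOIN twitter_gokhan2.tweetentities AS b ON a.originalId = b.id
ORDER BY a.randomOrder;
```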
You're selecting a huge number of records by selecting every record older than 4 days old....
Since the query takes a huge amount of time, why not simply prepare the results using an independent script which runs repeatedly in the background....
You might be able to make the assumption that if it's a retweet, the originalTweetId cannot be null/-1
Just to clarify... did you really mean to query everything OLDER than 4 days???
AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
OR... Did you mean you only wanted to aggregate RECENT TWEETS WITHIN the last 4 days? To me, tweets that happened 2 years ago would be worthless to current events... If that's the case, you might be better off just changing to
AND (tweetDate >= DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
See if this isn't a bit faster than 40 minutes:
Test first without the commented lines, then re-add them to compare the performance impact. (ORDER BY RAND() in particular is known to be slow on large result sets.)
SELECT
originalTweetId,
total,
-- tweetContent, -- may slow things somewhat
tweetDate
FROM (
SELECT
originalTweetId,
COUNT(*) AS total,
-- tweetContent, -- may slow things somewhat
MIN(tweetDate) AS tweetDate,
MAX(isRetweet) AS isRetweet
FROM twitter_gokhan2.tweetentities
GROUP BY originalTweetId
) AS t
WHERE originalTweetId > 0
AND isRetweet
AND tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY)
AND total > 50
-- ORDER BY RAND() -- very likely to slow performance,
-- test with and without...
LIMIT 0, 1200;
PS - hopefully originalTweetId is indexed
table:
--duedate timestamp
--submissiondate timestamp
--blocksreq numeric
--file clob
--email varchar2(60)
Each entry is a file that will take blocksreq blocks to complete. There are 8 blocks allotted per day (though this could be changed later). Before I insert into the table, I want to make sure there are enough free blocks to complete it between NOW() and #duedate.
I was thinking of the following, but I think I am doing it wrong:
R1 = select DAY(), #blocksperday - sum(blocksreq) as free
from table
where #duedate between NOW() and #duedate
group by DAY()
order by DAY() desc
R2 = select sum(a.free) from R1 as a;
if(R2[0] <= #blocksreq){ insert into table; }
Pardon the partial pseudocode.
SQL FIDDLE: http://sqlfiddle.com/#!2/5bda5
Warning: my SQL Fiddle has garbage code, as I don't know how to make a lot of test cases or how to set the duedate to NOW()+5 days.
Something like this? (I wasn't sure how partial days were handled, so I ignored that part.)
CREATE TABLE `DatTable` (
`duedate` datetime DEFAULT NULL,
`submissiondate` datetime DEFAULT NULL,
`blocksreq` smallint(6) DEFAULT NULL
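);
-- Values for the new entry (MySQL user variables start with @)
SET @duedate:='2012-10-15';
SET @submissiondate:=CURRENT_TIMESTAMP;
SET @blocksreq:=5;
-- Insert only if the blocks of entries still due after the submission date leave room for the new entry (8 blocks per day)
INSERT INTO DatTable(duedate,submissiondate,blocksreq)
SELECT @duedate,@submissiondate,@blocksreq
FROM DatTable AS b
WHERE duedate > @submissiondate
HAVING COALESCE(SUM(blocksreq),0) <= DATEDIFF(@duedate,@submissiondate)*8-@blocksreq;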