I have a table with posts and I want to generate a graph that shows how many posts were made in the last 30 minutes, in the 30 minutes before that, and so on. The posts are selected by their post_handler and post_status.
The table structure looks like this.
CREATE TABLE IF NOT EXISTS `posts` (
`post_title` varchar(255) NOT NULL,
`post_content` text NOT NULL,
`post_date_added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`post_handler` varchar(255) NOT NULL,
`post_status` tinyint(4) NOT NULL,
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`),
KEY `post_status` (`post_status`),
KEY `post_status_2` (`post_status`,`id`),
KEY `post_handler` (`post_handler`),
KEY `post_date_added` (`post_date_added`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=2300131 ;
The results I'd like to receive, sorted by post_date_added:
period_start period_end posts
2011-12-06 19:23:44 2011-12-06 19:53:44 10
2011-12-06 19:53:44 2011-12-06 20:23:44 39
2011-12-06 20:23:44 2011-12-06 20:53:44 40
Right now I use a solution where I have to run this query many times over and then insert the data into another table from the PHP script.
SELECT COUNT(*) FROM posts WHERE post_handler = 'test' AND post_status = 1 AND post_date_added BETWEEN '2011-12-06 19:23:44' AND '2011-12-06 19:53:44'
Do you know any other solution? Is there any way to run a query that also inserts results into the database, all in one query?
It's fairly easy to group by distinct time units such as hour, minute or day. If you want to group by hour, a possible query might look like this:
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND post_status = 1
GROUP BY _Date;
(run this with a mysql query tool of your choice to see the output).
However, if you want 30 minutes as the base of your groups, the SQL part gets trickier. For this special case, since you only have to divide each hour into two subsets, maybe work with this approach:
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
"00" AS "semihour",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND DATE_FORMAT(post_date_added,"%i") < 30
AND post_status = 1
GROUP BY _Date
UNION
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
"30" AS "semihour",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND DATE_FORMAT(post_date_added,"%i") >= 30
AND post_status = 1
GROUP BY _Date;
Again, run this with a MySQL query tool of your choice to see the output. You could also compute the buckets arithmetically, working with CASE or IF and such, but personally I'd group by either hour or minute just to keep the SQL part much simpler.
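The arithmetic alternative mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's sqlite3 module as a stand-in for MySQL, the sample rows and the reference time are made up, and the column name periods_ago is hypothetical. The idea is to divide each post's age in seconds by 1800 and group by the integer result, so one query covers every 30-minute window counting back from the reference time. In MySQL the bucket expression would presumably be something like FLOOR(TIMESTAMPDIFF(SECOND, post_date_added, NOW()) / 1800).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_date_added TEXT, post_handler TEXT, post_status INTEGER)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("2011-12-06 19:30:00", "test", 1),
        ("2011-12-06 19:50:00", "test", 1),
        ("2011-12-06 20:00:00", "test", 1),
        ("2011-12-06 20:40:00", "test", 1),
        ("2011-12-06 20:45:00", "other", 1),  # different handler, ignored
        ("2011-12-06 20:50:00", "test", 0),   # wrong status, ignored
    ],
)

# periods_ago = number of whole 30-minute windows between the post and the
# reference time. Integer division of the age in seconds does the bucketing.
QUERY = """
SELECT (CAST(strftime('%s', :now) AS INTEGER)
        - CAST(strftime('%s', post_date_added) AS INTEGER)) / :window AS periods_ago,
       COUNT(*) AS posts
FROM posts
WHERE post_handler = 'test'
  AND post_status = 1
  AND post_date_added <= :now
GROUP BY periods_ago
ORDER BY periods_ago DESC
"""

buckets = conn.execute(
    QUERY, {"now": "2011-12-06 20:53:44", "window": 1800}
).fetchall()
print(buckets)
```

Windows that contain no posts produce no row at all, so the graphing code would have to treat a missing bucket as zero.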
To directly add those numbers into your graph database, use this syntax:
INSERT INTO yourtable (yourfields)
SELECT ...
More details about this can be found here in the MySQL documentation.
In (very) brief: yes, you can insert the results of a query into another table. Take a look at INSERT ... SELECT here: http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
Essentially, you'd just change what you have to something like
INSERT INTO post_statistics_table (period_start, period_end, posts)
SELECT ?, ?, COUNT(*) FROM posts
WHERE post_handler = 'test'
AND post_status = 1
AND post_date_added BETWEEN ? AND ?
and then fill in the four ?s with the same two DATETIMEs, repeated. ($from, $to, $from, $to)
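As a rough sketch of that parameter binding, here is the same statement run from Python against an in-memory SQLite database (an assumption made only so the example is self-contained; the original setup is PHP and MySQL). The helper name record_period and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (post_date_added TEXT, post_handler TEXT, post_status INTEGER);
CREATE TABLE post_statistics_table (period_start TEXT, period_end TEXT, posts INTEGER);
INSERT INTO posts VALUES ('2011-12-06 19:30:00', 'test', 1),
                         ('2011-12-06 19:45:00', 'test', 1),
                         ('2011-12-06 20:10:00', 'test', 1);
""")

def record_period(conn, start, end):
    """Count matching posts in [start, end] and store the count in one statement."""
    conn.execute(
        """INSERT INTO post_statistics_table (period_start, period_end, posts)
           SELECT ?, ?, COUNT(*) FROM posts
           WHERE post_handler = 'test' AND post_status = 1
             AND post_date_added BETWEEN ? AND ?""",
        (start, end, start, end),  # the same two values, repeated
    )

record_period(conn, "2011-12-06 19:23:44", "2011-12-06 19:53:44")
record_period(conn, "2011-12-06 19:53:44", "2011-12-06 20:23:44")

stats = conn.execute(
    "SELECT period_start, period_end, posts FROM post_statistics_table "
    "ORDER BY period_start"
).fetchall()
print(stats)
```

Note that BETWEEN includes both endpoints, so a post stamped exactly on a boundary would be counted in two adjacent periods.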
Related
I have a query that I'm testing on my database, but for some weird reason, and randomly, it returns a different set of results. Interestingly, there are only two distinct result-sets that it returns, from thousands of rows, and the query will randomly return one or the other, but nothing else.
Is there a reason the query only returns one of two datasets? Query and schema below.
My goal is to select the fastest laps for a given track, in a given time period, but only the fastest lap for each user (so there are always 10 different users in the top 10).
Most of the time the correct results are returned, but randomly, a totally different result set is returned.
SELECT `lap`.`ID`, `lap`.`qualificationTime`, `lap`.`userId`
FROM `lap`
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
Schema:
CREATE TABLE IF NOT EXISTS `lap` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`userId` int(11) DEFAULT NULL,
`trackId` int(11) DEFAULT NULL,
`raceDateTime` datetime NOT NULL,
`qualificationTime` decimal(7,4) DEFAULT '0.0000',
`isTestLap` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`)
(DB create script trimmed of un-needed columns)
You are using a (mis)feature of MySQL called hidden columns. As others have pointed out, you are allowed to put columns in the select statement that are not in the group by. But, the returned values are arbitrary, and not even guaranteed to be the same from one run to the next.
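The solution is to find the minimum qualification time for each user, then join this information back to get the other fields. Here is one way:
select l.*
from (SELECT userId, min(qualificationTime) as minqf
FROM lap
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
) lu join
lap l
-- match on the user as well, otherwise another user's lap with the same time would join
on lu.userId = l.userId and lu.minqf = l.qualificationTime
-- repeat the filters so a lap from another track or period cannot match
WHERE (l.trackID =4)
AND (l.raceDateTime >= "2013-07-25 10:00:00")
AND (l.raceDateTime < "2013-08-04 23:59:59")
AND (l.isTestLap =0)
ORDER BY l.`qualificationTime` ASC
LIMIT 10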
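To see the join-back idea on a tiny data set, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The rows are invented, the date and test-lap filters are left out for brevity, and the join matches on both the user and that user's best time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lap (ID INTEGER PRIMARY KEY, userId INTEGER,
                  trackId INTEGER, qualificationTime REAL);
INSERT INTO lap (ID, userId, trackId, qualificationTime) VALUES
  (1, 10, 4, 61.5),
  (2, 10, 4, 59.2),   -- user 10's fastest lap on track 4
  (3, 11, 4, 60.1),   -- user 11's fastest lap on track 4
  (4, 11, 4, 63.0),
  (5, 12, 5, 50.0);   -- different track, ignored
""")

# The derived table finds each user's best time; joining it back on
# (userId, time) recovers the full row that holds that time.
fastest = conn.execute("""
SELECT l.ID, l.userId, l.qualificationTime
FROM lap AS l
JOIN (SELECT userId, MIN(qualificationTime) AS minqf
      FROM lap
      WHERE trackId = 4
      GROUP BY userId) AS lu
  ON lu.userId = l.userId AND lu.minqf = l.qualificationTime
WHERE l.trackId = 4
ORDER BY l.qualificationTime ASC
LIMIT 10
""").fetchall()
print(fastest)
```

If a user sets the same best time twice, both laps come back, so a further tie-break (for example on the lowest ID) may be needed.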
You are selecting lap.ID, lap.qualificationTime and lap.userId, but you are not GROUPing BY them. You can only select fields you group by, or else aggregate functions on the other fields (MIN, MAX, AVG, etc). Otherwise, results are undefined.
I think you mean that the values for lap.ID and lap.qualificationTime are sometimes different. That is expected behaviour for MySQL: because you group by userId, the values returned for the other fields are not determined. MySQL may return values from whichever row of the group it happens to read first or last.
I would check something like this:
SELECT `l1`.`qualificationTime`, `l1`.`userId`,
(SELECT l2.ID FROM `lap` AS l2 WHERE l2.`userId` = l1.userId AND
l2.qualificationTime = min(l1.`qualificationTime`))
FROM `lap` AS `l1`
WHERE (l1.trackID =4)
AND (l1.raceDateTime >= "2013-07-25 10:00:00")
AND (l1.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
It's likely to be your ORDER BY on a decimal entity, and how the DB stores this and then retrieves it.
What I have is a table statsitieken with an ip, a hash of the browser info, the url visited and the last-visited date as a timestamp.
What I could compile from different sources led to the query below. The only problem is that it takes forever (9 minutes) to complete on a table with about 15000 rows, so it is very inefficient.
I think I'm going about this the wrong way, but I can't find a decent post or tutorial on how to use the results of a select as the basis for getting the results I want.
What I simply want is an overview of every entry in the table whose hash belongs to a visitor that has visited more than 25 pages in the last 12 hours.
CREATE TABLE IF NOT EXISTS `statsitieken` (
`hash` varchar(35) NOT NULL,
`ip` varchar(24) NOT NULL,
`visits` int(11) NOT NULL,
`lastvisit` int(11) NOT NULL,
`browserinfo` text NOT NULL,
`urls` text NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the query I have tried to compile so far.
SELECT * FROM `database`.`statsitieken` WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
group by hash
having count(urls) > 25
order by urls)
I need this to run in a decent time, like < 1 second, which should be possible in my opinion...
I suggest trying this modified query. The subquery is now computed only once instead of being run for each record returned:
SELECT s.*
FROM `database`.`statsitieken` s, (SELECT hash
FROM `database`.`statsitieken`
WHERE `lastvisit` > UNIX_TIMESTAMP(DATE_SUB(NOW(),INTERVAL 12 HOUR))
GROUP BY hash
HAVING COUNT(urls)>25) tmp
WHERE s.`hash`=tmp.`hash`
ORDER BY s.urls
Be sure you have indexes on the following fields:
hash to speed up the GROUP BY and WHERE
urls to speed up the ORDER BY
A derived table with an INNER JOIN is often faster than an IN subquery in MySQL. Try this optimized query, which keeps the GROUP BY and HAVING inside the derived table so that every row of the matching hashes is returned:
SELECT a.*
FROM statsitieken a
INNER JOIN (SELECT hash
FROM statsitieken
WHERE lastvisit > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
GROUP BY hash
HAVING COUNT(urls) > 25
) b
ON a.hash = b.hash
ORDER BY a.urls;
For better performance of this select query you should add indexes like this:
ALTER TABLE statsitieken ADD KEY ix_hash(hash);
ALTER TABLE statsitieken ADD KEY ix_lastvisit(lastvisit);
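For a quick check of the derived-table approach on a small data set, here is a sketch using Python's sqlite3 module instead of MySQL (an assumption made purely so it runs anywhere). The helper busy_visitors, its cutoff and min_pages parameters and the sample rows are hypothetical; in the real query the cutoff would be the unix timestamp of 12 hours ago and the threshold 25.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statsitieken (hash TEXT, lastvisit INTEGER, urls TEXT)")
conn.execute("CREATE INDEX ix_hash ON statsitieken (hash)")
conn.execute("CREATE INDEX ix_lastvisit ON statsitieken (lastvisit)")
conn.executemany(
    "INSERT INTO statsitieken VALUES (?, ?, ?)",
    [
        ("a", 1000, "/old"),  # older than the cutoff, still listed for hash a
        ("a", 5000, "/p1"),
        ("a", 5100, "/p2"),
        ("a", 5200, "/p3"),
        ("b", 5300, "/p1"),   # only one recent visit, filtered out
    ],
)

def busy_visitors(conn, cutoff, min_pages):
    """Return every row whose hash has more than min_pages visits after cutoff."""
    return conn.execute(
        """
        SELECT s.hash, s.lastvisit, s.urls
        FROM statsitieken AS s
        JOIN (SELECT hash
              FROM statsitieken
              WHERE lastvisit > :cutoff
              GROUP BY hash
              HAVING COUNT(urls) > :min_pages) AS busy
          ON busy.hash = s.hash
        ORDER BY s.urls
        """,
        {"cutoff": cutoff, "min_pages": min_pages},
    ).fetchall()

rows = busy_visitors(conn, cutoff=4000, min_pages=2)
print(rows)
```

The grouping happens once inside the derived table, and the outer join only looks up rows by hash, which is where the index on hash helps.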
WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
You are "subquerying" (I don't know if that word exists :P, I mean doing a subquery) against the same table. Why not apply the condition directly?
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
I have a creative query request, with a few examples of my own.
I have a table that logs user's hits with the following fields:
id unique value for each logged hit
referrer text value of a URL
date integer value of unix timestamp
unique a string identifying users uniquely (md5 of IP + salt, basically)
(Note that I realize that using "unique" as a field name ended up being a terrible design choice, but putting it in backticks has helped avoid any issues...)
I would like a query which returns a list of uniques and their first referrer.
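If you are looking for the first referrer by date for each user, you can do something like the following. Note that it relies on MySQL returning the first row of each group, which is not guaranteed: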
CREATE TEMPORARY TABLE tmp_hits
SELECT
`unique`
, `date`
, `referrer`
FROM log_table
ORDER BY `date` ASC
;
SELECT
`unique`
, `referrer`
FROM tmp_hits
GROUP BY `unique`
;
If you don't have a hit_id field, you'll have to use the pair of (unique, date) as a row identifier. You should be able to get what you are looking for with something like this.
SELECT `referrer` FROM `hits` h1 INNER JOIN
(SELECT `unique`, MIN(`date`) AS `date` FROM `hits` GROUP BY `unique`) h2
ON h1.`unique` = h2.`unique` AND h1.`date` = h2.`date`
GROUP BY `referrer`
If you have a primary key you didn't mention, like hit_id, it gets a bit shorter and saves you from the rare case that two hits occur from the same user in the same second:
SELECT `referrer` FROM `hits` h1 INNER JOIN
(SELECT MIN(`hit_id`) AS `hit_id` FROM `hits` GROUP BY `unique`) h2
ON h1.`hit_id` = h2.`hit_id`
GROUP BY `referrer`
In both cases, the last GROUP BY is just to remove dups in your final result set.
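Since the question asks for each unique together with its first referrer, a variant that also returns the `unique` column may be closer to the goal. The following is a minimal sketch, assuming SQLite (through Python's sqlite3) in place of MySQL; the alias first_date and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits (id INTEGER PRIMARY KEY, referrer TEXT, "
    "`date` INTEGER, `unique` TEXT)"
)
conn.executemany(
    "INSERT INTO hits (referrer, `date`, `unique`) VALUES (?, ?, ?)",
    [
        ("http://b.example/", 200, "u1"),
        ("http://a.example/", 100, "u1"),  # earliest hit of u1
        ("http://c.example/", 150, "u2"),  # only hit of u2
    ],
)

# The derived table gives the earliest date per user; the join picks the
# referrer stored on the row with exactly that date.
first_referrers = conn.execute(
    """
    SELECT h1.`unique`, h1.referrer
    FROM hits AS h1
    JOIN (SELECT `unique`, MIN(`date`) AS first_date
          FROM hits
          GROUP BY `unique`) AS h2
      ON h1.`unique` = h2.`unique` AND h1.`date` = h2.first_date
    ORDER BY h1.`unique`
    """
).fetchall()
print(first_referrers)
```

As noted above, two hits by the same user in the same second would both match, so joining on the minimum id instead avoids that case.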
I'm running the following query to get the stats for a user, based on which I pay them.
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, (sum(hit_uniques)/1000)*hit_paylevel as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user
The table in question looks like this:
CREATE TABLE IF NOT EXISTS `daily_hits` (
`hit_itemid` varchar(255) NOT NULL,
`hit_mainid` int(11) NOT NULL,
`hit_user` int(11) NOT NULL,
`hit_date` date NOT NULL,
`hit_hits` int(11) NOT NULL DEFAULT '0',
`hit_uniques` int(11) NOT NULL,
`hit_embed` int(11) NOT NULL,
`hit_paylevel` int(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`hit_itemid`,`hit_date`),
KEY `hit_user` (`hit_user`),
KEY `hit_mainid` (`hit_mainid`,`hit_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The problem in the calculation has to do with the hit_paylevel which acts as a multiplier. Default is one, the other option is 2 or 3, which essentially doubles or triples the earnings for that day.
If I loop through the days, the daily day_earnings is correct; it's just that when I group them, it calculates everything as paylevel 1. This happens if the user was paylevel 1 in the beginning and was later upgraded to a higher level. If the user is pay level 2 from the start, it also calculates everything correctly.
Shouldn't this be sum(hit_uniques * hit_paylevel) / 1000?
Like @Denis said:
Change the query to
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, sum(hit_uniques * hit_paylevel) / 1000 as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user;
Why this fixes the problem
Multiplying by hit_paylevel outside the sum first adds up all hit_uniques and then multiplies the total by a single, arbitrarily chosen hit_paylevel from the group.
Not what you want. If you put both columns inside the sum, MySQL pairs each row's hit_uniques with that row's hit_paylevel.
The dangers of group by
This is an important thing to remember on MySQL.
The group by clause works differently from other databases.
On MSSQL (or Oracle or PostgreSQL) you would have gotten an error
non-aggregate expression must appear in group by clause
Or words to that effect.
In your original query hit_paylevel is not in an aggregate (sum) and it's also not in the group by clause, so MySQL just picks a value at random.
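A small worked example makes the difference concrete. The numbers are invented: one day with 1000 uniques at paylevel 1 and one day with 2000 uniques at paylevel 2. The sketch uses Python's sqlite3 module rather than MySQL, only so that it is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_hits (hit_user INTEGER, hit_uniques INTEGER, hit_paylevel INTEGER)"
)
days = [(7, 1000, 1), (7, 2000, 2)]  # (user, uniques, paylevel) per day
conn.executemany("INSERT INTO daily_hits VALUES (?, ?, ?)", days)

# Correct: multiply per row, then sum -> (1000*1 + 2000*2) / 1000
earnings = conn.execute(
    "SELECT SUM(hit_uniques * hit_paylevel) / 1000.0 "
    "FROM daily_hits WHERE hit_user = 7 GROUP BY hit_user"
).fetchone()[0]

# Wrong: sum first, then multiply by whichever paylevel the server picks.
total_uniques = sum(uniques for _, uniques, _ in days)
possible_wrong = sorted(
    total_uniques / 1000 * level for level in {lvl for _, _, lvl in days}
)

print(earnings)        # per-row multiplication
print(possible_wrong)  # what the original formula could return
```

Neither of the values the original formula can produce matches the per-row result, which is why the upgrade from paylevel 1 to 2 showed up as wrong earnings.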
Consider a table like this:
CREATE TABLE `amoreAgentTST01` (
`moname` char(64) NOT NULL DEFAULT '',
`updatetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`data` longblob,
PRIMARY KEY (`moname`,`updatetime`)
I have a query to find the oldest record for each distinct 'moname', but only if there are multiple records for this 'moname':
SELECT moname, updatetime FROM amoreAgentTST01 a
WHERE (SELECT count(*) FROM amoreAgentTST01 x WHERE x.moname = a.moname) > 1
AND a.updatetime = (SELECT min(updatetime) FROM amoreAgentTST01 y WHERE y.moname = a.moname) ;
My question is: how can I do the same but select the X oldest values?
I now simply run this, delete the oldest values and rerun it... which is not so nice.
My second question is: what do you think of the above query? Can it be improved? Is there any obvious bad practice?
Thank you in advance for your advice and help.
Barth
Would something like this work (untested):
SELECT moname, MIN(updatetime) FROM amoreAgentTST01
GROUP BY moname HAVING COUNT(moname)>1
Edit - the above is meant only as a replacement for your existing code, so it doesn't directly answer your question.
I think something like this should work for your main question:
SELECT moname, updatetime FROM amoreAgentTST01
GROUP BY moname, updatetime
HAVING COUNT(moname)>1
ORDER BY updatetime LIMIT 0, 10
Edit - sorry, the above won't work because it's returning only 10 records for all the monames - rather than the 10 oldest for each. Let me have a think.
One more go at this (admittedly, this one looks a bit convoluted):
SELECT a.moname, a.updatetime FROM amoreAgentTST01 a
WHERE EXISTS
(SELECT * FROM amoreAgentTST01 b
WHERE a.moname = b.moname AND a.updatetime = b.updatetime
ORDER BY b.updatetime LIMIT 0, 10)
AND (SELECT COUNT(*) FROM amoreAgentTST01 x WHERE x.moname = a.moname) > 1
I should add that if there is an ID column (generally the primary key), then that should be used for the sub-query joins for improved performance.
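One caveat about the EXISTS ... LIMIT form: the subquery matches each row with itself, so it does not actually limit the rows per moname. A correlated count is one alternative way to get the X oldest rows per group: a row is kept when fewer than X rows of the same moname are older than it. The sketch below is untested on MySQL and runs on SQLite via Python's sqlite3 as a stand-in; the helper oldest_per_moname and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amoreAgentTST01 (moname TEXT, updatetime TEXT, "
    "PRIMARY KEY (moname, updatetime))"
)
conn.executemany(
    "INSERT INTO amoreAgentTST01 VALUES (?, ?)",
    [
        ("m1", "2012-01-01 00:00:00"),
        ("m1", "2012-01-02 00:00:00"),
        ("m1", "2012-01-03 00:00:00"),
        ("m1", "2012-01-04 00:00:00"),
        ("m2", "2012-01-05 00:00:00"),  # single record, excluded
        ("m3", "2012-01-06 00:00:00"),
        ("m3", "2012-01-07 00:00:00"),
    ],
)

def oldest_per_moname(conn, n):
    """Return the n oldest rows of every moname that has more than one row."""
    return conn.execute(
        """
        SELECT a.moname, a.updatetime
        FROM amoreAgentTST01 AS a
        WHERE (SELECT COUNT(*) FROM amoreAgentTST01 AS x
               WHERE x.moname = a.moname) > 1
          AND (SELECT COUNT(*) FROM amoreAgentTST01 AS b
               WHERE b.moname = a.moname
                 AND b.updatetime < a.updatetime) < :n
        ORDER BY a.moname, a.updatetime
        """,
        {"n": n},
    ).fetchall()

oldest = oldest_per_moname(conn, 2)
print(oldest)
```

The correlated counts run once per row, so on a large table this may be slow; the primary key on (moname, updatetime) should help both subqueries.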