MySQL - Finding a user's first referrer

I have a creative query request, with a few examples of my own.
I have a table that logs users' hits with the following fields:
id unique value for each logged hit
referrer text value of a URL
date integer value of unix timestamp
unique a string identifying users uniquely (md5 of IP + salt, basically)
(Note that I realize that using "unique" as a field name ended up being a terrible design choice, but putting it in backticks has helped avoid any issues...)
I would like a query which returns a list of uniques and their first referrer.

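If you are looking for each user's first referrer by date, you can do something like this. Note that it relies on GROUP BY returning the first row of each group, which MySQL does not guarantee: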
CREATE TEMPORARY TABLE tmp_hits
SELECT
`unique`
, `date`
, `referrer`
FROM log_table
ORDER BY `date` ASC
;
SELECT
`unique`
, `referrer`
FROM tmp_hits
GROUP BY `unique`
;

If you don't have a hit_id field, you'll have to use the pair of (unique, date) as a row identifier. You should be able to get what you are looking for with something like this.
SELECT h1.`unique`, h1.`referrer` FROM `hits` h1 INNER JOIN
(SELECT `unique`, MIN(`date`) AS `date` FROM `hits` GROUP BY `unique`) h2
ON h1.`unique` = h2.`unique` AND h1.`date` = h2.`date`
GROUP BY h1.`unique`, h1.`referrer`
If you have a primary key you didn't mention, like hit_id, it gets a bit shorter and saves you from the rare case that two hits occur from the same user in the same second:
SELECT h1.`unique`, h1.`referrer` FROM `hits` h1 INNER JOIN
(SELECT MIN(`hit_id`) AS `hit_id` FROM `hits` GROUP BY `unique`) h2
ON h1.`hit_id` = h2.`hit_id`
GROUP BY h1.`unique`, h1.`referrer`
In both cases, the last GROUP BY is just to remove dups in your final result set.
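Since the question does list an id column, another way to write this is with NOT EXISTS, using id only as a tie-breaker for hits that share a timestamp. This is a minimal sketch, assuming the table is called `hits` as in the queries above and that `id` grows in insertion order; neither is confirmed by the question.

```sql
-- Sketch: one row per user, holding the referrer of that user's earliest hit.
-- Assumptions: table name `hits`; `id` increases with insertion order.
SELECT h.`unique`, h.`referrer`
FROM `hits` h
WHERE NOT EXISTS (
    SELECT 1
    FROM `hits` earlier
    WHERE earlier.`unique` = h.`unique`
      AND (earlier.`date` < h.`date`
           -- same second: fall back to the lower id
           OR (earlier.`date` = h.`date` AND earlier.`id` < h.`id`))
);
```

Because every row is compared against all earlier rows of the same user, exactly one row per user survives, so no final GROUP BY is needed. An index on (`unique`, `date`, `id`) would probably help here.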

Related

SQL alternative to sub-query in FROM

I have a table containing user to user messages. A conversation has all messages between two users. I am trying to get a list of all the different conversations and display only the last message sent in the listing.
I am able to do this with a SQL sub-query in FROM.
CREATE TABLE `messages` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`from_user_id` bigint(20) DEFAULT NULL,
`to_user_id` bigint(20) DEFAULT NULL,
`type` smallint(6) NOT NULL,
`is_read` tinyint(1) NOT NULL,
`is_deleted` tinyint(1) NOT NULL,
`text` longtext COLLATE utf8_unicode_ci NOT NULL,
`heading` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at_utc` datetime DEFAULT NULL,
`read_at_utc` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
);
SELECT * FROM
(SELECT * FROM `messages` WHERE TYPE = 1 AND
(from_user_id = 22 OR to_user_id = 22)
ORDER BY created_at_utc DESC
) tb
GROUP BY from_user_id, to_user_id;
SQL Fiddle:
http://www.sqlfiddle.com/#!2/845275/2
Is there a way to do this without a sub-query?
(writing a DQL which supports sub-queries only in 'IN')
You seem to be trying to get the last message to or from user 22 with type = 1. Your method is explicitly not guaranteed to work, because the extra columns (not in the GROUP BY) can come from arbitrary rows. As explained in the documentation:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values within each group the server chooses.
The query that you want is more along the lines of this (assuming that you have an auto-incrementing id column for messages):
select m.*
from (select m2.from_user_id, m2.to_user_id, max(m2.id) as max_id
from messages m2
where m2.type = 1 and (m2.from_user_id = 22 or m2.to_user_id = 22)
group by m2.from_user_id, m2.to_user_id
) lm join
messages m
on lm.max_id = m.id;
Or this:
select m.*
from messages m
where m.type = 1 and (m.from_user_id = 22 or m.to_user_id = 22) and
not exists (select 1
from messages m2
where m2.type = m.type and m2.from_user_id = m.from_user_id and
m2.to_user_id = m.to_user_id and
m2.created_at_utc > m.created_at_utc
);
For this latter query, an index on messages(type, from_user_id, to_user_id, created_at_utc) would help performance.
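The index mentioned above could be created along these lines. The index name `idx_messages_conversation` is made up for illustration; the column list is the one given in the answer.

```sql
-- Hypothetical index name; the columns follow the suggestion above.
CREATE INDEX idx_messages_conversation
    ON messages (type, from_user_id, to_user_id, created_at_utc);
```

Running EXPLAIN on the NOT EXISTS query before and after adding it should show whether the optimizer actually picks it up.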
Since this is a rather specific type of data query which goes outside common ORM use cases, DQL isn't really fit for this - it's optimized for walking well-defined relationships.
For your case however Doctrine fully supports native SQL with result set mapping. Using a NativeQuery with ResultSetMapping like this you can easily use the subquery this problem requires, and still map the results on native Doctrine entities, allowing you to still profit from all caching, usability and performance advantages.
Samples found here.
If you mean to get all conversations and all their last messages, then a subquery is necessary.
SELECT a.* FROM messages a
INNER JOIN (
SELECT
MAX(created_at_utc) as max_created,
from_user_id,
to_user_id
FROM messages
GROUP BY from_user_id, to_user_id
) b ON a.created_at_utc = b.max_created
AND a.from_user_id = b.from_user_id
AND a.to_user_id = b.to_user_id
And you could append the where condition as you like.
THE SQL FIDDLE.
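One thing the grouped queries above have in common is that they treat messages from A to B and from B to A as two separate groups. If a conversation is meant to cover both directions, as the question suggests, the pair can be normalised with LEAST and GREATEST. This is only a sketch: it still uses a derived table, it assumes `id` increases over time, and rows where either user id is NULL all fall into one group because LEAST and GREATEST return NULL for them.

```sql
-- Sketch: last message per conversation for user 22, both directions combined.
-- Assumption: a higher id means a later message.
SELECT m.*
FROM messages m
INNER JOIN (
    SELECT MAX(id) AS max_id
    FROM messages
    WHERE type = 1
      AND (from_user_id = 22 OR to_user_id = 22)
    -- (22, 7) and (7, 22) land in the same group
    GROUP BY LEAST(from_user_id, to_user_id),
             GREATEST(from_user_id, to_user_id)
) lm ON lm.max_id = m.id
ORDER BY m.created_at_utc DESC;
```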
I don't think your original query was even doing this correctly. Not sure what the GROUP BY was being used for other than maybe try to only return a single (unpredictable) result.
Just add a limit clause:
SELECT * FROM `messages`
WHERE `type` = 1 AND
(`from_user_id` = 22 OR `to_user_id` = 22)
ORDER BY `created_at_utc` DESC
LIMIT 1
For optimum query performance you need indexes on the following fields:
type
from_user_id
to_user_id
created_at_utc

Strange query results from MySQL

I have a query that I'm testing on my database, but for some weird reason, and randomly, it returns a different set of results. Interestingly, there are only two distinct result-sets that it returns, from thousands of rows, and the query will randomly return one or the other, but nothing else.
Is there a reason the query only returns one of two datasets? Query and schema below.
My goal is to select the fastest laps for a given track, in a given time period, but only the fastest lap for each user (so there are always 10 different users in the top 10).
Most of the time the correct results are returned, but randomly, a totally different result set is returned.
SELECT `lap`.`ID`, `lap`.`qualificationTime`, `lap`.`userId`
FROM `lap`
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
Schema:
CREATE TABLE IF NOT EXISTS `lap` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`userId` int(11) DEFAULT NULL,
`trackId` int(11) DEFAULT NULL,
`raceDateTime` datetime NOT NULL,
`qualificationTime` decimal(7,4) DEFAULT '0.0000',
`isTestLap` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`)
(DB create script trimmed of un-needed columns)
You are using a (mis)feature of MySQL called hidden columns. As others have pointed out, you are allowed to put columns in the select statement that are not in the group by. But, the returned values are arbitrary, and not even guaranteed to be the same from one run to the next.
The solution is to find the minimum qualification time for each user. Then join this information back to get the other fields. Here is one way:
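select l.*
from (SELECT userId, min(qualificationTime) as minqf
FROM lap
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
) lu join
lap l
-- match on the user as well as the time, so another user's identical time is not picked up
on lu.userId = l.userId and lu.minqf = l.qualificationTime
-- repeat the filters so only laps from the same track and period qualify
WHERE (l.trackID =4)
AND (l.raceDateTime >= "2013-07-25 10:00:00")
AND (l.raceDateTime < "2013-08-04 23:59:59")
AND (l.isTestLap =0)
ORDER BY l.`qualificationTime` ASC
LIMIT 10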
You are selecting lap.ID, lap.qualificationTime and lap.userId, but you are not GROUPing BY them. You can only select fields you group by, or else aggregate functions on the other fields (MIN, MAX, AVG, etc). Otherwise, results are undefined.
I think you mean that the values of lap.ID and lap.qualificationTime sometimes differ. That is expected behaviour for MySQL: you group by userId, so the values returned for the other fields are not defined. MySQL may take them from whichever row of the group it happens to read first or last.
I would check something like this:
SELECT `l1`.`qualificationTime`, `l1`.`userId`,
(SELECT l2.ID FROM `lap` AS l2 WHERE l2.`userId` = l1.userId AND
l2.qualificationTime = min(l1.`qualificationTime`))
FROM `lap` AS `l1`
WHERE (l1.trackID =4)
AND (l1.raceDateTime >= "2013-07-25 10:00:00")
AND (l1.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
It's likely to be your ORDER BY on a decimal entity, and how the DB stores this and then retrieves it.
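On MySQL 8.0 or later, the "fastest lap per user" requirement from the question can also be written with a window function. This is a sketch against the trimmed schema shown above, not something from the original answers; breaking ties on the lower ID is my own assumption.

```sql
-- Sketch (needs MySQL 8.0+): rank each user's laps, keep the best one, then take the top 10.
SELECT ID, qualificationTime, userId
FROM (
    SELECT l.ID, l.qualificationTime, l.userId,
           ROW_NUMBER() OVER (PARTITION BY l.userId
                              ORDER BY l.qualificationTime ASC, l.ID ASC) AS rn
    FROM lap l
    WHERE l.trackId = 4
      AND l.raceDateTime >= '2013-07-25 10:00:00'
      AND l.raceDateTime <  '2013-08-04 23:59:59'
      AND l.isTestLap = 0
) ranked
WHERE rn = 1          -- exactly one row per user, even when times tie
ORDER BY qualificationTime ASC
LIMIT 10;
```

Unlike the join on the minimum time, this cannot return two rows for a user who set the same best time twice.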

How to format this mysql Query

What I have is a table statistieken with an ip, hash of browser info, url visited and last visited date in timestamp.
What I could compile from different sources led to this query. The only problem is that it takes forever (9 minutes) to complete on a table with about 15000 rows, so it is very inefficient.
I think I'm going about this the wrong way, but I can't find a decent post or tutorial on how to use the results of a select as the basis for getting the results I want.
What I simply want is an overview of every entry in the table whose hash belongs to a visitor that has visited more than 25 pages in the last 12 hours.
CREATE TABLE IF NOT EXISTS `statsitieken` (
`hash` varchar(35) NOT NULL,
`ip` varchar(24) NOT NULL,
`visits` int(11) NOT NULL,
`lastvisit` int(11) NOT NULL,
`browserinfo` text NOT NULL,
`urls` text NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the query I have tried to compile so far.
SELECT * FROM `database`.`statsitieken` WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
group by hash
having count(urls) > 25
order by urls)
I need this to compile in a decent time, like < 1 second which should be possible in my opinion...
I suggest trying this modified query. The subquery is now computed only once instead of being run for each record returned:
SELECT s.*
FROM `database`.`statsitieken` s, (SELECT hash
FROM `database`.`statsitieken`
WHERE `lastvisit` > UNIX_TIMESTAMP(DATE_SUB(NOW(),INTERVAL 12 HOUR))
GROUP BY hash
HAVING COUNT(urls)>25) tmp
WHERE s.`hash`=tmp.`hash`
ORDER BY s.urls
Be sure you have indexes on the following fields:
hash to speed up the GROUP BY and WHERE
urls to speed up the ORDER BY
A derived table with an INNER JOIN is often faster than an IN subquery, especially on older MySQL versions. Try this optimized query:
SELECT a.*
FROM statsitieken a
INNER JOIN (SELECT hash
FROM statsitieken
WHERE lastvisit > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
GROUP BY hash
HAVING COUNT(urls) > 25
) b
ON a.hash = b.hash
ORDER BY a.urls;
For better performance of this select query you should add indexes as:
ALTER TABLE statsitieken ADD KEY ix_hash(hash);
ALTER TABLE statsitieken ADD KEY ix_lastvisit(lastvisit);
WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
You are "subquerying" (i don't know if exists that word :P, 'doing a subquery') in the same table, why not to:
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
do it directly?

Generate statistics in MySQL

I have a table with posts and I want to generate a graph that shows how many posts were made in the last 30 minutes, in the 30 minutes before that, and so on. The posts are selected by their post_handler and post_status.
The table structure looks like this.
CREATE TABLE IF NOT EXISTS `posts` (
`post_title` varchar(255) NOT NULL,
`post_content` text NOT NULL,
`post_date_added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`post_handler` varchar(255) NOT NULL,
`post_status` tinyint(4) NOT NULL,
`id` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`),
KEY `post_status` (`post_status`),
KEY `post_status_2` (`post_status`,`id`),
KEY `post_handler` (`post_handler`),
KEY `post_date_added` (`post_date_added`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=2300131 ;
The results I'd like to receive, sorted by post_date_added.
period_start period_end posts
2011-12-06 19:23:44 2011-12-06 19:53:44 10
2011-12-06 19:53:44 2011-12-06 20:23:44 39
2011-12-06 20:23:44 2011-12-06 20:53:44 40
Right now I use a solution where I have to run this query many times over, and then insert the data into another table from the PHP script.
SELECT COUNT(*) FROM posts WHERE post_handler = 'test' AND post_status = 1 AND post_date_added BETWEEN '2011-12-06 19:23:44' AND '2011-12-06 19:53:44'
Do you know any other solution? Is there any way to run a query that also inserts results into the database, all in one query?
It's fairly easy to group by distinct time units, like hour, minute, day or whatever. If you want to group by hour, a possible query might look like this:
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND post_status = 1
GROUP BY _Date;
(run this with a mysql query tool of your choice to see the output).
However, if you want to use 30 minutes as the basis of your grouping, the SQL part gets trickier. For this particular case, since you only have to divide each hour into two subsets, this approach may work:
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
"00" AS "semihour",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND DATE_FORMAT(post_date_added,"%i") < 30
AND post_status = 1
GROUP BY _Date
UNION
SELECT DATE_FORMAT(post_date_added,"%Y-%m-%d %H") AS "_Date",
"30" AS "semihour",
COUNT(*)
FROM posts
WHERE post_handler = 'test'
AND DATE_FORMAT(post_date_added,"%i") >= 30
AND post_status = 1
GROUP BY _Date;
Again, run this with a MySQL query tool of your choice to see the output. You could also compute the bucket arithmetically with CASE or IF, but personally I'd group by either hour or minute just to keep the SQL part simpler.
To directly add those numbers into your graph database, use this syntax:
INSERT INTO yourtable (yourfields)
SELECT ...
More details about this can be found here in the MySQL documentation.
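As an alternative to the UNION of two half-hour queries, the bucket can be computed arithmetically from the Unix timestamp. This is a sketch, not the answerer's method: the buckets are aligned to :00 and :30 rather than to "now", and periods without any posts simply do not appear in the result.

```sql
-- Sketch: count posts per 30-minute bucket (1800 seconds).
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(post_date_added) / 1800) * 1800)        AS period_start,
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(post_date_added) / 1800) * 1800 + 1800) AS period_end,
    COUNT(*) AS posts
FROM posts
WHERE post_handler = 'test'
  AND post_status = 1
GROUP BY period_start, period_end
ORDER BY period_start;
```

Prefixing this SELECT with an INSERT INTO line, as in the syntax above, would store the counts in one statement.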
In (very) brief: yes, you can insert the results of a query into another table. Take a look at INSERT ... SELECT here: http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
Essentially, you'd just change what you have to something like
INSERT INTO post_statistics_table (period_start, period_end, posts)
SELECT ?, ?, COUNT(*) FROM posts
WHERE post_handler = 'test'
AND post_status = 1
AND post_date_added BETWEEN ? AND ?
and then fill in the four ?s with the same two DATETIMEs, repeated. ($from, $to, $from, $to)

Doing some calculations in mysql, numbers off when using GROUP BY

I'm running the following query to get the stats for a user, based on which I pay them.
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, (sum(hit_uniques)/1000)*hit_paylevel as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user
The table in question looks like this:
CREATE TABLE IF NOT EXISTS `daily_hits` (
`hit_itemid` varchar(255) NOT NULL,
`hit_mainid` int(11) NOT NULL,
`hit_user` int(11) NOT NULL,
`hit_date` date NOT NULL,
`hit_hits` int(11) NOT NULL DEFAULT '0',
`hit_uniques` int(11) NOT NULL,
`hit_embed` int(11) NOT NULL,
`hit_paylevel` int(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`hit_itemid`,`hit_date`),
KEY `hit_user` (`hit_user`),
KEY `hit_mainid` (`hit_mainid`,`hit_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The problem in the calculation has to do with the hit_paylevel which acts as a multiplier. Default is one, the other option is 2 or 3, which essentially doubles or triples the earnings for that day.
If I loop through the days, the daily day_earnings is correct; it's just that when I group them, it calculates everything as paylevel 1. This happens if the user was paylevel 1 in the beginning and was later upgraded to a higher level. If the user is paylevel 2 from the start, it also calculates everything correctly.
Shouldn't this be sum(hit_uniques * hit_paylevel) / 1000?
Like @Denis said:
Change the query to
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, sum(hit_uniques * hit_paylevel) / 1000 as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user;
Why this fixes the problem
Multiplying by hit_paylevel outside the sum first sums all hit_uniques and then picks an arbitrary hit_paylevel from the group to multiply the total by.
Not what you want. If you put both columns inside the sum, MySQL pairs each row's hit_uniques with that row's hit_paylevel.
The dangers of group by
This is an important thing to remember on MySQL.
The group by clause works differently from other databases (unless the ONLY_FULL_GROUP_BY SQL mode is enabled, which is the default from MySQL 5.7.5 onward).
On MSSQL (or Oracle or PostgreSQL) you would have gotten an error
non-aggregate expression must appear in group by clause
Or words to that effect.
In your original query hit_paylevel is not in an aggregate (sum) and it's also not in the group by clause, so MySQL just picks a value at random.
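If you also want to see how the month splits across pay levels, grouping by hit_paylevel itself avoids the arbitrary pick altogether. This is a small sketch against the table above; the user ID 123 is a placeholder, not a value from the question.

```sql
-- Sketch: one row per pay level, so hit_paylevel is well defined in every group.
SELECT hit_paylevel,
       SUM(hit_uniques)                       AS unique_hits,
       SUM(hit_uniques) / 1000 * hit_paylevel AS earnings
FROM daily_hits
WHERE hit_user = 123                 -- placeholder user ID
  AND hit_date >= '2011-05-01'
  AND hit_date <  '2011-06-01'
GROUP BY hit_paylevel;
```

Adding up the earnings column over the returned rows should give the same total as sum(hit_uniques * hit_paylevel) / 1000.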