MySQL query to find number of unique new visitors per week - mysql

I have 3 tables in MySQL:
1) page (id, title)
2) visitor (id, name)
3) page_visit (page_id, visitor_id, timestamp_of_visit)
Visitors can visit pages multiple times, across several days. Hence, while we will have one row for a page, and one row for a visitor, we can have several page_visit rows, each with a timestamp of the visit.
I'm trying to find the number of unique visitors, by week. I know how to get the 'by week count' query for non-uniques (i.e. 'how many visitors did I see each week'). I'm not sure how to pick the unique visitors by week, though, with the visitor showing up on the list ONLY the first time they are ever seen.
----------- ----------- ----------------------------
| page | | visitor | | page_visit |
----------- ----------- ----------------------------
|id |title| |id |name | |pid|vid|timestamp of visit|
----------- ----------- ----------------------------
| 1 | p1 | | 1 | v1 | | 1 | 1 | 02-18-2016:08:30 |
| 2 | p2 | | 2 | v2 | | 1 | 1 | 02-18-2016:10:00 |
| 3 | p3 | | 3 | v3 | | 1 | 3 | 02-20-2016:23:45 |
| 4 | p4 | | 4 | v4 | | 2 | 3 | 02-22-2016:07:30 |
| 5 | p5 | | 5 | v5 | | 3 | 1 | 02-23-2016:08:30 |
| 6 | p6 | | 6 | v6 | | 3 | 6 | 02-24-2016:09:30 |
What the result set should show:
------------------------
| results |
------------------------
| Week of | Net new |
------------------------
| 02-15-2016 | 2 |
| 02-22-2016 | 1 |
As mentioned, I can figure out how to show ALL visitors by week. I'm not sure how to get the unique visitors.
I tried doing a min(timestamp of visit), but, based on where I tried it, it returned the lowest timestamp across all rows (understandably...).
Any help would be much appreciated!

This is a tricky question when you first encounter it. It requires two levels of aggregation. The first gets the first visit for each visitor, the second summarizes by time. The following does the summary by day:
select date(minvd), count(*) as numvisitors
from (select vid, min(visitdate) as minvd
from page_visit pv
group by vid
) v
group by date(minvd)
order by date(minvd);
Translating to weeks is always a bit tricky -- do they begin on Mondays? End on Saturdays? On Fridays? (I've seen all of these.) However, the above is additive, so you can just add all the values for a given week to get your value.

In case you want to do this without a subquery:
SELECT
<week>,
COUNT(DISTINCT PV.vid)
FROM
Page_Visit PV
LEFT OUTER JOIN Page_Visit PV2 ON
PV2.vid = PV.vid AND
PV2.visit_date < PV.visit_date
WHERE
PV2.vid IS NULL
GROUP BY
<week>
As Gordon mentions, how you determine the week can be tricky. Just add in that calculation where you see <week>. Personally, I like to use a Calendar table for that kind of functionality, but it's up to you. You can run any expressions directly against PV.visit_date to determine it.

Related

MySql add relationships without creating dupes

I created a table (t_subject) like this
| id | description | enabled |
|----|-------------|---------|
| 1 | a | 1 |
| 2 | b | 1 |
| 3 | c | 1 |
And another table (t_place) like this
| id | description | enabled |
|----|-------------|---------|
| 1 | d | 1 |
| 2 | e | 1 |
| 3 | f | 1 |
Right now data from t_subject is used for each of t_place records, to show HTML dropdowns, with all the results from t_subject.
So I simply do
SELECT * FROM t_subject WHERE enabled = 1
Now just for one of t_place records, one record from t_subject should be hidden.
I don't want to simply delete it with javascript, since I want to be able to customize all of the dropdowns if anything changes.
So the first thing I though was to add a place_id column to t_subject.
But this means I have to duplicate all of t_subject records, I would have 3 of each, except one that would have 2.
Is there any way to avoid this??
I thought adding an id_exclusion column to t_subject so I could duplicate records only whenever a record is excluded from another id from t_place.
How bad would that be?? This way I would have no duplicates, so far.
Hope all of this makes sense.
While you only need to exclude one course, I would still recommend setting up a full 'place-course' association. You essentially have a many-to-many relationship, despite not explicitly linking your tables.
I would recommend an additional 'bridging' or 'associative entity' table to represent which courses are offered at which places. This new table would have two columns - one foreign key for the ID of t_subject, and one for the ID of t_place.
For example (t_place_course):
| place_id | course_id |
|----------|-----------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 3 | 1 |
| 3 | 3 |
As you can see in my example above, place 3 doesn't offer course 2.
From here, you can simply query all of the courses available for a place by querying the place_id:
SELECT * from t_place_course WHERE place_id = 3
The above will return both courses 1 and 3.
You can optionally use a JOIN to get the other information about the course or place, such as the description:
SELECT `t_course`.`description`
FROM `t_course`
INNER JOIN `t_place_course`
ON `t_course`.`id` = `t_place_course`.`course_id`
INNER JOIN `t_place`
ON `t_place`.`id` = `place_id`

Only return an ordered subset of the rows from a joined table

Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`uid` = dat.`uid`
WHERE rel.`user_coach_id` = 1
ORDER BY val.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1 | 9 | 4 | 15 | foo | bar | ...
| 1 | 7 | 3 | 12 | oof | rab | ...
| 1 | 6 | 4 | 11 | ofo | abr | ...
| 1 | 4 | 4 | 5 | foo | bra | ...
(And so on. Of course time are not integers in reality but to keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id but still have those ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to only return 1 row per user, but then the ordering doesn't work and it seems kind of unpredictable which row will be in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
To limit the number of rows in each GROUP is complicated. It is probably best done with an #variable to count, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints of how to do such.

Creating Temporary Column Order By Slow

So basically what I'm trying to do is get gained experience, ordering it, then only displaying top 5 or 50. Now note that I'm not SQL expert but I have knowledge of indexes as well as file sorting. The query that I have is filesorting most likely due to "gained_xp" not being a index-- let alone even a column as it's only temporary. There's no clear explanation how to fix this as I'm trying to contain it all in one query. I'm trying to sort nearly 13k rows with that number only expanding. I'd also need the number of rows to be dynamic as well as the time since. Any help would be appreciated. Thank you
Explain Output: Using where; Using temporary; Using filesort
Indexes include: time userid override overallXp overallLevel overallRank
The closest I've gotten to order all rows (which never ends up completing and ends in a mysql reboot) are:
SELECT FROM_UNIXTIME(time, '%Y-%m-%d'), t_u.userid as uid, MAX(t_u.OverallXP)-(SELECT overallXP FROM track_updates WHERE `userid` = t_u.userid AND `time`>'1394323200' ORDER BY id ASC LIMIT 1) as gained_xp
FROM track_updates t_u
WHERE t_u.time>'1394323200'
GROUP BY t_u.userid
If I'd run a query that selects only one user and works correctly is:
SELECT FROM_UNIXTIME(time, '%Y-%m-%d'), (t_u.overallXP)-(SELECT overallXP FROM track_updates WHERE `userid`='1' ORDER BY `id` ASC LIMIT 1) as gained_xp, t_u.userid
FROM track_updates t_u
WHERE t_u.userid='1' AND t_u.time>'1393632000'
ORDER BY t_u.time DESC
LIMIT 1
Sample Data per request:
____________________________________________________________________
| id | userid | time | overallLevel | overallXP | overallRank |
| 1 | 1 | 1394388114 | 1 | 1 | 1 |
| 2 | 1 | 1394389114 | 2 | 10 | 1 |
| 3 | 2 | 1394388114 | 1 | 1 | 2 |
| 4 | 2 | 1394389114 | 1 | 5 | 2 |
| 5 | 2 | 1394390114 | 2 | 7 | 2 |
Output (most recent time; gained xp current-initial; ordered by gain_xp):
____________________________________________
| id | time | userid | gained_xp |
| 1 | March 9th 2014 | 1 | 9 |
| 2 | March 9th 2014 | 2 | 6 |

MySQL: optimize query for scoring calculation

I have a data table that I use to do some calculations. The resulting data set after calculations looks like:
+------------+-----------+------+----------+
| id_process | id_region | type | result |
+------------+-----------+------+----------+
| 1 | 4 | 1 | 65.2174 |
| 1 | 5 | 1 | 78.7419 |
| 1 | 6 | 1 | 95.2308 |
| 1 | 4 | 1 | 25.0000 |
| 1 | 7 | 1 | 100.0000 |
+------------+-----------+------+----------+
By other hand I have other table that contains a set of ranges that are used to classify the calculations results. The range tables looks like:
+----------+--------------+---------+
| id_level | start | end | status |
+----------+--------------+---------+
| 1 | 0 | 75 | Danger |
| 2 | 76 | 90 | Alert |
| 3 | 91 | 100 | Good |
+----------+--------------+---------+
I need to do a query that add the corresponding 'status' column to each value when do calculations. Currently, I can do that adding the following field to calculation query:
select
...,
...,
[math formula] as result,
(select status
from ranges r
where result between r.start and r.end) status
from ...
where ...
It works ok. But when I have a lot of rows (more than 200K), calculation query become slow.
My question is: there is some way to find that 'status' value without do that subquery?
Some one have worked on something similar before?
Thanks
Yes, you are looking for a subquery and join:
select s.*, r.status
from (select s.*
from <your query here>
) s left outer join
ranges r
on s.result between r.start and r.end
Explicit joins often optimize better than nested select. In this case, though, the ranges table seems pretty small, so this may not be the performance issue.

SQL 'COUNT' not returning what I expect, and somehow limiting results to one row

Some background: an 'image' is part of one 'photoshoot', and may be a part of zero or many 'galleries'. My tables:
'shoots' table:
+----+--------------+
| id | name |
+----+--------------+
| 1 | Test shoot |
| 2 | Another test |
| 3 | Final test |
+----+--------------+
'images' table:
+----+-------------------+------------------+
| id | original_filename | storage_location |
+----+-------------------+------------------+
| 1 | test.jpg | store/test.jpg |
| 2 | test.jpg | store/test.jpg |
| 3 | test.jpg | store/test.jpg |
+----+-------------------+------------------+
'shoot_images' table:
+----------+----------+
| shoot_id | image_id |
+----------+----------+
| 1 | 1 |
| 1 | 2 |
| 3 | 3 |
+----------+----------+
'gallery_images' table:
+------------+----------+
| gallery_id | image_id |
+------------+----------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
| 4 | 1 |
+------------+----------+
What I'd like to get back, so I can say 'For this photoshoot, there are X images in total, and these images are featured in Y galleries:
+----+--------------+-------------+---------------+
| id | name | image_count | gallery_count |
+----+--------------+-------------+---------------+
| 3 | Final test | 1 | 1 |
| 2 | Another test | 0 | 0 |
| 1 | Test shoot | 2 | 4 |
+----+--------------+-------------+---------------+
I'm currently trying the SQL below, which appears to work correctly but only ever returns one row. I can't work out why this is happening. Curiously, the below also returns a row even when 'shoots' is empty.
SELECT shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
ORDER BY shoots.id DESC
Thanks for taking the time to look at this :)
You are missing the GROUP BY clause:
SELECT
shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
GROUP BY 1, 2 -- Added this line
ORDER BY shoots.id DESC
Note: The SQL standard allows GROUP BY to be given either column names or column numbers, so GROUP BY 1, 2 is equivalent to GROUP BY shoots.id, shoots.name in this case. There are many who consider this "bad coding practice" and advocate always using the column names, but I find it makes the code a lot more readable and maintainable and I've been writing SQL since before many users on this site were born, and it's never cause me a problem using this syntax.
FYI, the reason you were getting one row before, and not getting and error, is that in mysql, unlike any other database I know, you are allowed to omit the group by clause when using aggregating functions. In such cases, instead of throwing a syntax exception, mysql returns the first row for each unique combination of non-aggregate columns.
Although at first this may seem abhorrent to SQL purists, it can be incredibly handy!
You should look into the MySQL function group by.