Controlling what value appears in a column while doing a join - mysql

This one's kind of complicated, so hopefully I can make it clear.
I have two tables:
views:
+---------------------+-------------+------------------+
| time | remote_host | referer |
+---------------------+-------------+------------------+
| 0000-00-00 00:00:00 | 10.0.13.2 | http://foo.com/a |
| 0000-00-00 00:00:00 | 10.0.13.1 | http://foo.com/b |
| 0000-00-00 00:00:00 | 10.0.13.2 | http://moo.com |
| 0000-00-00 00:00:00 | 10.0.13.2 | http://hi.com |
| 0000-00-00 00:00:00 | 10.0.13.1 | http://foo.com/c |
+---------------------+-------------+------------------+
test_websites:
+----+----------------+------+
| id | url | name |
+----+----------------+------+
| 1 | http://foo.com | |
| 2 | http://moo.com | |
+----+----------------+------+
I have a query that very nearly does what I want:
SELECT COUNT(*) as count, remote_host, url FROM test_websites
JOIN views ON referer LIKE CONCAT(url, '%')
GROUP BY test_websites.url
ORDER BY count DESC LIMIT 10;
Results look like this:
+-------+-------------+----------------+
| count | remote_host | url |
+-------+-------------+----------------+
| 3 | 10.0.13.2 | http://foo.com |
| 1 | 10.0.13.2 | http://moo.com |
+-------+-------------+----------------+
To explain, I'm trying to get the top 10 viewed websites, however the website URLs are defined in test_websites. Since http://foo.com is an entry in test_websites, all entries that start with http://foo.com should be counted as "one website." Hence the join is based on a LIKE condition, and it's correctly counting 3 for http://foo.com in the results.
So, the problem is that I want remote_host to be the value that appears most often among the rows in views whose referer starts with http://foo.com. In this case, two of the rows in views that start with http://foo.com have 10.0.13.1 as the remote_host, so the results should show 10.0.13.1 in the remote_host column, not the remote_host from the first matching row, as happens now.
Thanks.

UPDATED
Please try the following corrected query:
SELECT
COUNT(*) as count,
(
SELECT A.remote_host
FROM views AS A
WHERE A.referer LIKE CONCAT(test_websites.url, '%')
GROUP BY A.remote_host
ORDER BY COUNT(1) DESC
LIMIT 1
) AS max_count_remote_host,
test_websites.url
FROM
test_websites
JOIN views ON views.referer LIKE CONCAT(test_websites.url, '%')
GROUP BY
test_websites.url
ORDER BY
count DESC LIMIT 10;
Here you could find a working SQL Fiddle example.
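To check the query against the sample data from the question, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. That substitution is an assumption: CONCAT(url, '%') is written as url || '%' because older SQLite builds have no CONCAT, and the alias names hits and top_remote_host are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE views (time TEXT, remote_host TEXT, referer TEXT);
CREATE TABLE test_websites (id INTEGER PRIMARY KEY, url TEXT, name TEXT);
INSERT INTO views VALUES
  ('0000-00-00 00:00:00', '10.0.13.2', 'http://foo.com/a'),
  ('0000-00-00 00:00:00', '10.0.13.1', 'http://foo.com/b'),
  ('0000-00-00 00:00:00', '10.0.13.2', 'http://moo.com'),
  ('0000-00-00 00:00:00', '10.0.13.2', 'http://hi.com'),
  ('0000-00-00 00:00:00', '10.0.13.1', 'http://foo.com/c');
INSERT INTO test_websites VALUES
  (1, 'http://foo.com', ''),
  (2, 'http://moo.com', '');
""")

# Same shape as the answer's query: the correlated subquery picks, per
# website, the remote_host with the most matching rows in views.
query = """
SELECT COUNT(*) AS hits,
       (SELECT a.remote_host
          FROM views AS a
         WHERE a.referer LIKE test_websites.url || '%'
         GROUP BY a.remote_host
         ORDER BY COUNT(*) DESC
         LIMIT 1) AS top_remote_host,
       test_websites.url
  FROM test_websites
  JOIN views ON views.referer LIKE test_websites.url || '%'
 GROUP BY test_websites.url
 ORDER BY hits DESC
 LIMIT 10
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (3, '10.0.13.1', 'http://foo.com')
# (1, '10.0.13.2', 'http://moo.com')
```

If two hosts tie for the most rows, LIMIT 1 picks one of them arbitrarily unless a tie-breaker is added to the inner ORDER BY.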

Related

SQL: How to get last date in groupby for getting unique records?

I am working with data where I have to use multiple joins, and I figured out that one of the tables is producing duplicates: I applied GROUP BY on the dates as well, and because the dates differ my query returns duplicate values.
I wrote the following query:
SELECT
ll.ID,
ll.EST_DT
gg.col1 ,
ll.EST_CLAIM_DT,
gg.col2
FROM table gg
inner join
(select substr(ID,1,instr(ID,'-',7)-1) EST_ID,
max(est_dt) as EST_DT,
max(EST_CLAIM_DT) as EST_CLAIM_DT
from table group by substr(gg.ID,1,instr(ID,'-',7)-1)) ll
on substr(ID,1,instr(gg.ID,'-',7)-1)=substr(ll.ID,1,instr(ll.ID,'-',7)-1)
GROUP BY
ll.ID,
ll.EST_DT
gg.col1 ,
ll.EST_CLAIM_DT,
gg.col2
Table looks like this:
+-----------------+------------+----------------+------+------+
| ID | est_date | est_claimed_dt | col1 | col2 |
+-----------------+------------+----------------+------+------+
| EST-U-1040452-1 | 28/02/2019 | 28/02/2019 | 50 | 50 |
| EST-U-1040452-2 | 5/10/2020 | 5/10/2020 | 50 | 50 |
+-----------------+------------+----------------+------+------+
Desired output
+---------+-----------+----------------+------+------+
| ID | est_date | est_claimed_dt | col1 | col2 |
+---------+-----------+----------------+------+------+
| 1040452 | 5/10/2020 | 5/10/2020 | 50 | 50 |
+---------+-----------+----------------+------+------+
I get this error as well
Negative sub string length not allowed
P.S. I have searched SO for this issue and it helped, but I couldn't get it to work.

Why should I write the rest of columns into GROUP BY when there is an aggregate function?

I have this table structure:
// mytable
+----+------+-------+-------------+
| id | type | score | unix_time |
+----+------+-------+-------------+
| 1 | 1 | 5 | 1463508841 |
| 2 | 1 | 10 | 1463508842 |
| 3 | 2 | 5 | 1463508843 |
| 4 | 1 | 5 | 1463508844 |
| 5 | 2 | 15 | 1463508845 |
| 6 | 1 | 10 | 1463508846 |
+----+------+-------+-------------+
And here is my query:
SELECT SUM(score), unix_time
FROM mytable
WHERE 1
GROUP BY type
And here is the output:
+-------+-------------+
| score | unix_time |
+-------+-------------+
| 30 | 1463508841 |
| 20 | 1463508843 |
+-------+-------------+
OK, all fine. There is just one thing: experienced people suggest that I add unix_time to the GROUP BY. They believe that doing so is fundamental to grouping and aggregate functions.
But why should I really put an (almost) unique column into GROUP BY? If I do that, each row becomes a separate group and there will be a lot of useless extra rows:
+-------+-------------+
| score | unix_time |
+-------+-------------+
| 30 | 1463508841 |
| 30 | 1463508842 |
| 20 | 1463508843 |
| 30 | 1463508844 |
| 20 | 1463508845 |
| 30 | 1463508846 |
+-------+-------------+
See? There are a lot of extra rows. So why is doing that the standard thing? Why does everybody tell me that MySQL works without it but no other database does? I really don't understand why I should do that.
Could someone please make this clear for me and explain how GROUP BY works exactly? Is it different from my understanding?
Not having unix_time in the GROUP BY clause is a non-standard MySQL hack that I would totally stay away from. The values of unix_time across all the rows with the same type are completely different. How do you know which unix_time should appear?
In your example, you seem perfectly content to use a completely arbitrary value of unix_time per group.
However this is a recipe for disaster. What does it even mean to pick some totally arbitrary value from a group? What if the unix_times were spread out by days or weeks or even years? Which one would you take then?
The reason the pros are telling you to put it in the group by clause is so that the result makes sense! Another approach is to leave unix_time out of the select completely, as the result you are getting shouldn't be relied upon.
Maybe you need something like this:
SELECT type,
SUM(score) as sum_of_score,
MIN(unix_time) as start_unix_time,
MAX(unix_time) as end_unix_time
FROM mytable
WHERE 1
GROUP BY type
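To make the difference concrete, here is a small sketch that loads the question's table into an in-memory SQLite database (used here only as a convenient stand-in for MySQL) and runs both groupings. It shows that grouping by type alone gives one row per type, while adding unix_time to the GROUP BY makes every row its own group, so each "sum" is just that row's score.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, type INTEGER,
                      score INTEGER, unix_time INTEGER);
INSERT INTO mytable VALUES
  (1, 1,  5, 1463508841),
  (2, 1, 10, 1463508842),
  (3, 2,  5, 1463508843),
  (4, 1,  5, 1463508844),
  (5, 2, 15, 1463508845),
  (6, 1, 10, 1463508846);
""")

# One row per type; unix_time is summarised explicitly with MIN/MAX,
# so there is no ambiguity about which value is shown.
per_type = conn.execute("""
    SELECT type, SUM(score), MIN(unix_time), MAX(unix_time)
      FROM mytable
     GROUP BY type
     ORDER BY type
""").fetchall()
print(per_type)
# [(1, 30, 1463508841, 1463508846), (2, 20, 1463508843, 1463508845)]

# Grouping by unix_time as well: every row is its own group.
per_row = conn.execute("""
    SELECT SUM(score), unix_time
      FROM mytable
     GROUP BY type, unix_time
     ORDER BY unix_time
""").fetchall()
print(per_row)
# [(5, 1463508841), (10, 1463508842), (5, 1463508843),
#  (5, 1463508844), (15, 1463508845), (10, 1463508846)]
```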

MySQL: optimize query for scoring calculation

I have a data table that I use to do some calculations. The resulting data set after calculations looks like:
+------------+-----------+------+----------+
| id_process | id_region | type | result |
+------------+-----------+------+----------+
| 1 | 4 | 1 | 65.2174 |
| 1 | 5 | 1 | 78.7419 |
| 1 | 6 | 1 | 95.2308 |
| 1 | 4 | 1 | 25.0000 |
| 1 | 7 | 1 | 100.0000 |
+------------+-----------+------+----------+
On the other hand, I have another table that contains a set of ranges used to classify the calculation results. The ranges table looks like:
+----------+-------+-----+--------+
| id_level | start | end | status |
+----------+-------+-----+--------+
| 1        | 0     | 75  | Danger |
| 2        | 76    | 90  | Alert  |
| 3        | 91    | 100 | Good   |
+----------+-------+-----+--------+
I need a query that adds the corresponding 'status' column to each value when doing the calculations. Currently, I can do that by adding the following field to the calculation query:
select
...,
...,
[math formula] as result,
(select status
from ranges r
where result between r.start and r.end) status
from ...
where ...
It works OK. But when I have a lot of rows (more than 200K), the calculation query becomes slow.
My question is: is there some way to find that 'status' value without that subquery?
Has anyone worked on something similar before?
Thanks
Yes, you are looking for a subquery and join:
select s.*, r.status
from (<your query here>
) s left outer join
ranges r
on s.result between r.start and r.end
Explicit joins often optimize better than nested selects. In this case, though, the ranges table seems pretty small, so this may not be the source of the performance issue.
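As a rough illustration of the join, the sketch below runs it in SQLite (a stand-in for MySQL). The results table stands in for the calculation query, and the range columns are renamed to lo and hi, both of which are assumptions made only for this example. One extra value, 75.5, is added to show a side effect of integer range bounds: it falls between 75 and 76, so the LEFT JOIN returns no status for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "results" stands in for the output of the calculation query.
CREATE TABLE results (id INTEGER PRIMARY KEY, result REAL);
-- lo/hi are hypothetical names for the question's start/end columns.
CREATE TABLE ranges (id_level INTEGER, lo REAL, hi REAL, status TEXT);
INSERT INTO results (result) VALUES
  (65.2174), (78.7419), (95.2308), (25.0), (100.0), (75.5);
INSERT INTO ranges VALUES
  (1,  0,  75, 'Danger'),
  (2, 76,  90, 'Alert'),
  (3, 91, 100, 'Good');
""")

rows = conn.execute("""
    SELECT s.result, r.status
      FROM results AS s
      LEFT JOIN ranges AS r
        ON s.result BETWEEN r.lo AND r.hi
     ORDER BY s.id
""").fetchall()
for result, status in rows:
    print(result, status)

statuses = [status for _, status in rows]
# ['Danger', 'Alert', 'Good', 'Danger', 'Good', None]
```

If fractional results can land in such gaps, half-open ranges (result >= lo AND result < hi, with contiguous bounds) would avoid the missing status.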

SQL 'COUNT' not returning what I expect, and somehow limiting results to one row

Some background: an 'image' is part of one 'photoshoot', and may be a part of zero or many 'galleries'. My tables:
'shoots' table:
+----+--------------+
| id | name |
+----+--------------+
| 1 | Test shoot |
| 2 | Another test |
| 3 | Final test |
+----+--------------+
'images' table:
+----+-------------------+------------------+
| id | original_filename | storage_location |
+----+-------------------+------------------+
| 1 | test.jpg | store/test.jpg |
| 2 | test.jpg | store/test.jpg |
| 3 | test.jpg | store/test.jpg |
+----+-------------------+------------------+
'shoot_images' table:
+----------+----------+
| shoot_id | image_id |
+----------+----------+
| 1 | 1 |
| 1 | 2 |
| 3 | 3 |
+----------+----------+
'gallery_images' table:
+------------+----------+
| gallery_id | image_id |
+------------+----------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
| 4 | 1 |
+------------+----------+
What I'd like to get back, so I can say 'For this photoshoot, there are X images in total, and these images are featured in Y galleries':
+----+--------------+-------------+---------------+
| id | name | image_count | gallery_count |
+----+--------------+-------------+---------------+
| 3 | Final test | 1 | 1 |
| 2 | Another test | 0 | 0 |
| 1 | Test shoot | 2 | 4 |
+----+--------------+-------------+---------------+
I'm currently trying the SQL below, which appears to work correctly but only ever returns one row. I can't work out why this is happening. Curiously, the below also returns a row even when 'shoots' is empty.
SELECT shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
ORDER BY shoots.id DESC
Thanks for taking the time to look at this :)
You are missing the GROUP BY clause:
SELECT
shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
GROUP BY 1, 2 -- Added this line
ORDER BY shoots.id DESC
Note: MySQL (like PostgreSQL and some other databases) allows GROUP BY to be given column positions as well as column names, so GROUP BY 1, 2 is equivalent to GROUP BY shoots.id, shoots.name in this case. This is a vendor extension rather than part of the SQL standard. There are many who consider it "bad coding practice" and advocate always using the column names, but I find it makes the code a lot more readable and maintainable. I've been writing SQL since before many users on this site were born, and using this syntax has never caused me a problem.
FYI, the reason you were getting one row before, and not getting an error, is that an aggregate query without a GROUP BY clause treats all rows as a single group, so it always returns exactly one row, even when 'shoots' is empty. In addition, MySQL (when the ONLY_FULL_GROUP_BY SQL mode is disabled) lets you select non-aggregated columns such as shoots.id and shoots.name in such a query; instead of raising an error, it returns their values from an arbitrary row of the group.
Although at first this may seem abhorrent to SQL purists, it can be incredibly handy!
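Here is a small sketch of both effects, run in SQLite as a stand-in for MySQL (an assumption; the single-group behaviour of aggregates without GROUP BY is the same). It first shows the one-row result on an empty table, then runs the corrected query on the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shoots (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE shoot_images (shoot_id INTEGER, image_id INTEGER);
CREATE TABLE gallery_images (gallery_id INTEGER, image_id INTEGER);
""")

# Empty table: without GROUP BY there is always exactly one (single-group)
# row; with GROUP BY there are no groups, hence no rows.
no_group = conn.execute("SELECT COUNT(*) FROM shoots").fetchall()
with_group = conn.execute(
    "SELECT id, COUNT(*) FROM shoots GROUP BY id").fetchall()
print(no_group, with_group)  # [(0,)] []

conn.executescript("""
INSERT INTO shoots VALUES (1, 'Test shoot'), (2, 'Another test'), (3, 'Final test');
INSERT INTO shoot_images VALUES (1, 1), (1, 2), (3, 3);
INSERT INTO gallery_images VALUES (1, 1), (1, 2), (2, 3), (3, 1), (4, 1);
""")

rows = conn.execute("""
    SELECT shoots.id,
           shoots.name,
           COUNT(DISTINCT shoot_images.image_id)     AS image_count,
           COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
      FROM shoots
      LEFT JOIN shoot_images   ON shoots.id = shoot_images.shoot_id
      LEFT JOIN gallery_images ON shoot_images.image_id = gallery_images.image_id
     GROUP BY shoots.id, shoots.name
     ORDER BY shoots.id DESC
""").fetchall()
for row in rows:
    print(row)
# (3, 'Final test', 1, 1)
# (2, 'Another test', 0, 0)
# (1, 'Test shoot', 2, 3)
```

Note that 'Test shoot' gets a gallery_count of 3 rather than the 4 in the desired output, because gallery 1 contains both of its images and DISTINCT counts it once. Counting gallery appearances instead, with COUNT(gallery_images.gallery_id), would give 4 on this data.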
You should look into the MySQL GROUP BY clause.

MySQL using GROUP BY to group by multiple columns

I'd like to use GROUP BY multiple columns, I think it's best to start with an example:
SELECT
eventsviews.eventId,
showsActive.showId,
showsActive.venueId,
COUNT(*) AS count
FROM eventsviews
INNER JOIN events ON events.eventId = eventsviews.eventId
INNER JOIN showsActive ON showsActive.eventId = eventsviews.eventId
WHERE events.status = 1
GROUP BY showsActive.venueId, showsActive.showId, showsActive.eventId
ORDER BY count DESC
LIMIT 100;
Output:
| *eventId* | *showId* | *venueId* | *count* |
+-----------+----------+-----------+---------+
[...snip...]
| 95 | 92099 | 9770 | 32 |
| 95 | 105472 | 10702 | 32 |
| 3804 | 41225 | 8165 | 17 |
| 3804 | 41226 | 8165 | 17 |
| 923 | 2866 | 5451 | 14 |
| 923 | 20184 | 5930 | 14 |
[...snip...]
What I would like instead:
| *eventId* | *showId* | *venueId* | *count* |
+-----------+----------+-----------+---------+
| 95 | 92099 | 9770 | 32 |
| 3804 | 41226 | 8165 | 17 |
| 923 | 20184 | 5930 | 14 |
So, I want my data grouped by eventId, but only once for each showId and venueId ...
I actually have a SQL query that does that, but it has 8 subqueries and is as slow as a T-Ford ... And since this is executed on every page load, speeding things up looks like a good idea!
There are a few questions like this, and I've tried many different things, but I've been at this query for an hour and I can't seem to get it to work as I want :-(
Thanks!
You probably want either a MIN or a MAX on showId, and then leave it out of the GROUP BY. I can't tell which, because your "preferred" result set has both.
If you want your data grouped by eventId, group just by eventId and you'll get one row per event. Note that COUNT(*) then counts every joined row, so it may be larger than the per-show counts shown above.
This is a MySQL feature (?): it allows you to select non-aggregated columns that are not in the GROUP BY, in which case it returns their values from an arbitrary row of each group (often, but not guaranteed to be, the first one it reads). In PostgreSQL a similar result is achieved with DISTINCT ON, which is not available in MySQL.
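One portable way to get a single row per event without relying on that behaviour is sketched below, in SQLite as a stand-in for MySQL. Several things here are assumptions rather than facts from the question: the viewId column, the sample rows, the uniqueness of showId, the choice of MIN(showId) as the show to keep, and leaving out the events.status filter for brevity. Views are counted in a derived table before joining, so the count is not multiplied by the number of shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical minimal schema; viewId is an assumed column.
CREATE TABLE eventsviews (viewId INTEGER PRIMARY KEY, eventId INTEGER);
CREATE TABLE showsActive (showId INTEGER PRIMARY KEY, venueId INTEGER,
                          eventId INTEGER);
INSERT INTO eventsviews VALUES (1, 95), (2, 95), (3, 95), (4, 3804), (5, 3804);
INSERT INTO showsActive VALUES
  (92099,  9770,   95),
  (105472, 10702,  95),
  (41225,  8165, 3804),
  (41226,  8165, 3804);
""")

# Naive version: grouping the joined rows by eventId multiplies each
# event's view count by its number of shows.
naive = conn.execute("""
    SELECT ev.eventId, COUNT(*) AS n
      FROM eventsviews AS ev
      JOIN showsActive AS s ON s.eventId = ev.eventId
     GROUP BY ev.eventId
     ORDER BY n DESC
""").fetchall()
print(naive)  # [(95, 6), (3804, 4)]

# Count views per event first, pick one show per event (lowest showId),
# then join back to fetch that show's venueId.
rows = conn.execute("""
    SELECT v.eventId, s.showId, s.venueId, v.views
      FROM (SELECT eventId, COUNT(*) AS views
              FROM eventsviews GROUP BY eventId) AS v
      JOIN (SELECT eventId, MIN(showId) AS showId
              FROM showsActive GROUP BY eventId) AS pick
        ON pick.eventId = v.eventId
      JOIN showsActive AS s ON s.showId = pick.showId
     ORDER BY v.views DESC
     LIMIT 100
""").fetchall()
print(rows)  # [(95, 92099, 9770, 3), (3804, 41225, 8165, 2)]
```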