I'm not sure I picked the correct title, but I did my best to explain what I am trying to do. I am just learning about joins and I have two tables that I am trying to combine in a certain way, but they both have WHERE clauses.
I started out by building both SELECT statements separately. Here is my first one, from the table "shipping_zones":
SELECT MIN(cal_zone) AS output_zone
FROM (
SELECT carrier, dest_zip, origin_zip, zone, MIN(zone) OVER(PARTITION BY carrier) as cal_zone
FROM shipping_zones z
WHERE (origin_zip = 402 OR origin_zip = 950) AND dest_zip = 015
) as t
WHERE zone=cal_zone;
This returns:
+-------------+
| output_zone |
+-------------+
| 5 |
+-------------+
My second table is: "shipping_prices" and my query is:
SELECT carrier, speed, zone, min_price
FROM (SELECT carrier, zone, speed, price, MIN(price) OVER(PARTITION BY speed) as min_price
FROM shipping_prices
WHERE total_wt = 66 and zone = 6
) t
WHERE price=min_price
ORDER BY speed DESC;
and the result is:
+---------+-------+------+-----------+
| carrier | speed | zone | min_price |
+---------+-------+------+-----------+
| fedex | slow | 6 | 45.66 |
| usps | med | 6 | 96.05 |
| usps | fast | 6 | 347.15 |
+---------+-------+------+-----------+
What I want to do is "pass" the value for output_zone from the first query as an "argument" into the 2nd query. I put the word argument in quotes because I'm not sure that is the correct word.
I think the best way to accomplish this in SQL is to use a join, correct? I understand the basic syntax of a join, but I'm a bit lost because of the clauses I'm using in both queries (WHERE, MIN, ORDER BY, etc.).
EDIT: This data is being queried with Impala; it was created in MySQL before being imported into HDFS with Hive.
EDIT2: I should also mention that the "shipping_prices" table already has a field in it called "zone". So I guess I wouldn't be "passing" it so much as using its value from the output of the first query to find the appropriate tuples in the "shipping_prices" table.
Any help or tips would be appreciated.
You can simply put your first query into a zone IN (first_query) predicate to replace zone = 6.
The code will look like this:
SELECT carrier, speed, zone, min_price
FROM (SELECT carrier, zone, speed, price, MIN(price) OVER(PARTITION BY speed) as min_price
FROM shipping_prices
WHERE total_wt = 66
and zone in (
SELECT MIN(cal_zone) AS output_zone
FROM (
SELECT carrier, dest_zip, origin_zip, zone, MIN(zone) OVER(PARTITION BY carrier) as cal_zone
FROM shipping_zones z
WHERE (origin_zip = 402 OR origin_zip = 950) AND dest_zip = 015
) as t
WHERE zone=cal_zone
)
) t
WHERE price=min_price
ORDER BY speed DESC;
It seems you are using MySQL 8.0 (Development Release). The MySQL engine does reasonable query optimization and will most likely rewrite both the IN and JOIN forms to the same plan. Check this URL for the details: Convert IN to JOIN
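For comparison, here is a sketch of the same logic written as an explicit JOIN, using the table and column names from the queries above. Note that the zone filter stays inside the inner subquery, so the MIN(price) window is still computed only over the selected zone:

```sql
SELECT carrier, speed, zone, min_price
FROM (SELECT sp.carrier, sp.zone, sp.speed, sp.price,
             MIN(sp.price) OVER (PARTITION BY sp.speed) AS min_price
      FROM shipping_prices sp
      -- join the single-row "output_zone" result instead of hard-coding zone = 6
      JOIN (SELECT MIN(cal_zone) AS output_zone
            FROM (SELECT zone,
                         MIN(zone) OVER (PARTITION BY carrier) AS cal_zone
                  FROM shipping_zones
                  WHERE (origin_zip = 402 OR origin_zip = 950)
                    AND dest_zip = 015) z
            WHERE zone = cal_zone) oz
        ON sp.zone = oz.output_zone
      WHERE sp.total_wt = 66) t
WHERE price = min_price
ORDER BY speed DESC;
```

Either form should produce the same result; with a one-row subquery the IN version is usually just as readable.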
Related
I am trying to get the 3 most recent successful (success = 1) records per proxy and then see their average response time.
I have manipulated the results so that the average response is always 2ms.
I have 20,000 records in this table right now, but I plan on having 1-2 million. It takes 40 seconds with just 20,000 records, so I need to optimize this query.
Here is the fiddle: http://sqlfiddle.com/#!9/dc91eb/1/0
The fiddle contains my indices too, so I am open to adding more indices if needed.
SELECT proxy,
Avg(a.responsems) AS avgResponseMs,
COUNT(*) as Count
FROM proxylog a
WHERE
a.success = 1
AND ( (SELECT Count(0)
FROM proxylog b
WHERE ( ( b.success = a.success )
AND ( b.proxy = a.proxy )
AND ( b.datetime >= a.datetime ) )) <= 3 )
GROUP BY proxy
ORDER BY avgResponseMs
Here is the result of EXPLAIN
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 1 | PRIMARY | a | index | NULL | proxy | 61 | NULL | 19110 | Using where; Using temporary; Using filesort |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 2 | DEPENDENT SUBQUERY | b | ref | proxy,datetime | proxy | 52 | wwwim_iroom.a.proxy | 24 | Using where; Using index |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
Before you suggest window functions: I am using MariaDB 10.1.21, which is roughly MySQL 5.6 AFAIK.
An index on (success, proxy, datetime, responsems) should help. success, proxy and datetime are the columns shared between both queries. datetime should come after the other two, because it is used to filter a range whereas the other two filter on a point. responsems comes last as this is the column the calculation is done on. That way the needed values can be taken directly from the index.
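In MySQL/MariaDB syntax, that index would be created roughly like this (the index name is my own choice):

```sql
ALTER TABLE proxylog
    ADD INDEX idx_success_proxy_datetime_ms (success, proxy, datetime, responsems);
```

With all four columns in the index, the query can be answered from the index alone (a covering index), avoiding lookups into the table rows.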
And please edit the question to include the DDL and DML in the question itself. The fiddle might be down some day, making the question useless for future readers.
I was able to mimic ROW_NUMBER and follow @Gordon Linoff's answer:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (
SELECT
@row_number := CASE
    WHEN @g = proxy
    THEN @row_number + 1
    ELSE 1
END AS RN,
@g := proxy g,
pl.*
FROM proxyLog pl,
     (SELECT @g := 0, @row_number := 0) as t
WHERE pl.success = 1
ORDER BY proxy,datetime DESC
) pl
WHERE RN <= 3
GROUP BY proxy
ORDER BY avgResponseMs
From your comment back to my question, I think I know what your problem is.
If you have a proxy that has 900 requests, your first row still counts 900 (at or greater), the second counts 899, the third 898, and so on. That is what is killing your performance. Now add millions of records to that and it will choke the crud out of your query.
What you may want to do is have a max date applied to the first one you are querying against where it makes reasonable sense. If you have proxy requests such as times are (and all are success values)
8:00:00
8:00:18
8:00:57
9:02:12
9:15:27
Do you really care about the success time between 8:00:57 and 9:02 and 9:15? If a computer is getting pounded with activity in one hour vs light activity in another, is that really a fair assessment of success times?
What you MAY want is some (at your discretion) cutoff time, such as within 3 minutes. What if someone does not even resume work through a proxy for some time? Is that really relevant? Again, your discretion:
AND ( a.datetime <= b.datetime and b.datetime < date_add( a.datetime, interval 5 minute )) )) <= 3 )
And the <= 3 is not giving you what I THINK you expect. Again, your innermost COUNT(*) is counting all records >= a.datetime, so it would not be until you were at the end of a given batch of proxy times that you would get these counts.
So are you looking for the HISTORICAL average times, or only the most recent 3 time cycles for a given proxy? What you are requesting and what you are querying may be two completely different things.
You may want to edit your original post to clarify. I'll end here until I hear back, to possibly offer additional assistance.
I would advise you to try writing the query using window functions:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (SELECT pl.*,
ROW_NUMBER() OVER (PARTITION BY pl.proxy ORDER BY datetime DESC) as seqnum
FROM proxylog pl
WHERE pl.success = 1
) pl
WHERE seqnum <= 3
GROUP BY proxy
ORDER BY avgResponseMs;
For this, you want an index on proxylog(success, proxy, datetime, responsems).
In older versions, I would replace your version of the subquery with:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM proxylog pl
WHERE pl.success = 1 AND
      pl.datetime >= ANY (SELECT pl2.datetime
                          FROM proxylog pl2
                          WHERE pl2.success = pl.success AND
                                pl2.proxy = pl.proxy
                          ORDER BY pl2.datetime DESC
                          LIMIT 1 OFFSET 2
                         )
GROUP BY proxy
ORDER BY avgResponseMs;
The index you want for this is the same as above.
I have a table that contains custom user analytics data. I was able to pull the number of unique users with a query:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users'
FROM `events`
WHERE client_id = 123
And this will return 16728
This table also has a column of type DATETIME that I would like to group the counts by. However, if I add a GROUP BY to the end of it, everything groups properly it seems except the totals don't match. My new query is this:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users', DATE(server_stamp) AS 'date'
FROM `events`
WHERE client_id = 123
GROUP BY DATE(server_stamp)
Now I get the following values:
|-----------------------------|
| unique_users | date |
|---------------|-------------|
| 2650 | 2019-08-26 |
| 3486 | 2019-08-27 |
| 3475 | 2019-08-28 |
| 3631 | 2019-08-29 |
| 3492 | 2019-08-30 |
|-----------------------------|
Totaling 16734. I tried using a subquery to get the distinct users, then count and group in the main query, but no luck there. Any help would be greatly appreciated. Let me know if further information would help diagnose the issue.
A user who is connected with events on multiple days (e.g. a session that starts before midnight and ends afterwards) will be counted once for each of those days in the new query. This is because the first query performs the DISTINCT over all rows at once, while the second only removes duplicates inside each group. Identical values in different groups stay untouched.
So when you combine a DISTINCT in the select clause with a GROUP BY, the GROUP BY is executed before the DISTINCT. Without any further restrictions, you therefore cannot assume that the COUNT(DISTINCT user_id) of the first query equals the sum of the COUNT(DISTINCT user_id) values over all groups of the second.
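A quick way to see the overlap (a sketch against the same `events` table) is to count the users who appear on more than one day:

```sql
-- users counted once per day in the grouped query,
-- but only once in the overall DISTINCT count
SELECT COUNT(*) AS multi_day_users
FROM (SELECT user_id
      FROM `events`
      WHERE client_id = 123
      GROUP BY user_id
      HAVING COUNT(DISTINCT DATE(server_stamp)) > 1) t;
```

The extra days contributed by these users should account for the difference between the grouped total of 16734 and the overall distinct count of 16728.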
Xandor is absolutely correct. If a user logged in on 2 different days, there is no way your 2nd query can avoid counting them twice. If you need the data grouped by date, you can try the query below:
SELECT COUNT(user_id) AS 'unique_users', MIN_DATE AS 'date'
FROM (SELECT user_id, MIN(DATE(server_stamp)) MIN_DATE -- Might be MAX
      FROM `events`
      WHERE client_id = 123
      GROUP BY user_id) X
GROUP BY MIN_DATE;
I searched existing questions but couldn't find an answer that helps me, or a similar question.
I need to get the info of the customers who made their last purchase between two dates:
+--------+------------+------------+
| client | amt | date |
+--------+------------+------------+
| 1 | 2440.9100 | 2014-02-05 |
| 1 | 21640.4600 | 2014-03-11 |
| 2 | 6782.5000 | 2014-03-12 |
| 2 | 1324.6600 | 2014-05-28 |
+--------+------------+------------+
For example, if I want to know all the customers who made their last purchase between 2014-02-11 and 2014-03-16, the result must be:
+--------+------------+------------+
| client | amt | date |
+--------+------------+------------+
| 1 | 21640.4600 | 2014-03-11 |
+--------+------------+------------+
It can't be client number 2, because they made a purchase on 2014-05-28.
I tried:
SELECT MAX(date)
FROM table
GROUP BY client
but that only gets the latest date per client. I don't know if a function exists that can help. Thanks.
Well, I don't know how to mark this question as resolved, but this worked for me to complete the original query:
SELECT client, MAX(date)
FROM table
GROUP BY client
HAVING MAX(date) BETWEEN date1 AND date2
Thanks to all who took a minute to help me with my problem,
special thanks to Ollie Jones and Peter Pei Guo.
Something in this format; replace date1 and date2 with the real values:
SELECT client, max(date)
from table
group by client
having max(date) between date1 AND date2
There is more than one way to do this. Here is one of them.
select * from
(
select client, max(date) maxdate
from table
group by client ) temp
where maxdate between '2014-02-11' and '2014-03-16'
This will allow you to grab the amount column of the applicable rows as well:
select t.*
from tbl t
join (select client, max(date) as last_date
from tbl
group by client
having max(date) between date1 and date2) v
on t.client = v.client
and t.date = v.last_date
I had to change the field "Date" to "TheDate" since date is a reserved word. I assume you are using SQL? My table name is Table1. You need to group records:
SELECT Table1.Client, Sum(Table1.Amt) AS SumOfAmt, Table1.TheDate
FROM Table1
GROUP BY Table1.Client, Table1.TheDate
HAVING (((Table1.TheDate) Between #2/11/2014# And #3/16/2014#));
Query Results:
Client SumOfAmt TheDate
1 21640 03/11/14
2 6792 03/12/14
You may want to get yourself a copy of MS Access. You can generate SQL statements using their query builder which I used to generate this SQL. When I make a post here I will always test it first to make sure it works! I have never written even 1 line of SQL code, but have executed thousands of them from within MS Access.
Good luck,
Dan
Once again I need your help ;). I have a lot of data and my MySQL requests are getting slower and slower, so I want to combine the queries I need into one command.
My example DB structure:
|product  |opinion (pos/neg)|reason   |
_______________________________________
|milk     | pos             | good    |
|milk     | pos             | fresh   |
|chocolate| neg             | old     |
|milk     | neg             | from cow|
So I need, for each different product (GROUP BY), its total count and the count of pos opinions. I want output like this:
|product  |count|pos count|
___________________________
|milk     |  3  |    2    |
|chocolate|  1  |    0    |
I hope my explanation was good enough ;)
Here is what I have so far. I wrote two commands:
SELECT COUNT(*) AS pos_count FROM table WHERE product = "milk" AND opinion = "pos" GROUP BY `opinion`
And the second one:
SELECT product, COUNT(*) FROM table GROUP BY `product`
I don't know how to join these two requests; maybe it is impossible? In my real code I also have an additional category column, and I use a WHERE in the second command too.
select product,
       count(*) as total_count,
       sum(case when opinion = 'pos' then 1
                else 0 end) as pos_count
from the_table
group by product;
SELECT product,
COUNT(*) TotalCount,
SUM(opinion = 'pos') POSCount
FROM tableName
GROUP BY product
SUM(opinion = 'pos') is MySQL-specific syntax that counts rows based on the result of boolean arithmetic. If you want it to be more RDBMS-friendly, use CASE:
SUM(CASE WHEN opinion = 'pos' THEN 1 ELSE 0 END)
SQLFiddle Demo
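Putting the portable CASE expression back into the full query (same tableName as above):

```sql
SELECT product,
       COUNT(*) AS TotalCount,
       SUM(CASE WHEN opinion = 'pos' THEN 1 ELSE 0 END) AS POSCount
FROM tableName
GROUP BY product;
```

This version runs unchanged on most databases, since CASE is standard SQL while summing a bare boolean expression is not.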
My table stores revision data for my CMS entries. Each entry has an ID and a revision date, and there are multiple revisions:
Table: old_revisions
+----------+---------------+-----------------------------------------+
| entry_id | revision_date | entry_data |
+----------+---------------+-----------------------------------------+
| 1 | 1302150011 | I like pie. |
| 1 | 1302148411 | I like pie and cookies. |
| 1 | 1302149885 | I like pie and cookies and cake. |
| 2 | 1288917372 | Kittens are cute. |
| 2 | 1288918782 | Kittens are cute but puppies are cuter. |
| 3 | 1288056095 | Han shot first. |
+----------+---------------+-----------------------------------------+
I want to transfer some of this data to another table:
Table: new_revisions
+--------------+----------------+
| new_entry_id | new_entry_data |
+--------------+----------------+
| | |
+--------------+----------------+
I want to transfer entry_id and entry_data to new_entry_id and new_entry_data. But I only want to transfer the most recent version of each entry.
I got as far as this query:
INSERT INTO new_revisions (
new_entry_id,
new_entry_data
)
SELECT
entry_id,
entry_data,
MAX(revision_date)
FROM old_revisions
GROUP BY entry_id
But I think the problem is that I'm trying to insert 3 columns of data into 2 columns.
How do I transfer the data based on the revision date without transferring the revision date as well?
You can use the following query:
insert into new_revisions (new_entry_id, new_entry_data)
select o1.entry_id, o1.entry_data
from old_revisions o1
inner join
(
select max(revision_date) maxDate, entry_id
from old_revisions
group by entry_id
) o2
on o1.entry_id = o2.entry_id
and o1.revision_date = o2.maxDate
See SQL Fiddle with Demo. This query gets the max(revision_date) for each entry_id and then joins back to your table on both the entry_id and the max date to get the rows to be inserted.
Please note that the subquery only returns the entry_id and date; this is because we want the GROUP BY to cover exactly the items in the select list that are not in an aggregate function. MySQL uses an extension to the GROUP BY clause that allows columns in the select list to be omitted from the GROUP BY and aggregates, but this can cause unexpected results. Including only the columns needed by the aggregate and the GROUP BY ensures that the result is the value you want (see MySQL Extensions to GROUP BY).
From the MySQL Docs:
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. ... You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values the server chooses.
If you want to enter the last entry you need to filter it before:
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id;
Then use this as a subquery to filter the data you need:
insert into new_revisions (new_entry_id, new_entry_data)
select o.entry_id, o.entry_data
from old_revisions as o
inner join (
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id
) as a on o.entry_id = a.entry_id and o.revision_date = a.maxDate