I'm trying to run an SQL query on Vertica but I can't find a way to get the results I need.
Let's say I have a table showing:
productID
campaignID (ID of the sales campaign)
calendarYearWeek (calendar week when the campaign was active; campaigns are usually active for 5 days)
countryOrigin (in which country was the product sold, as it's international sales)
valueLocal (price in local currency)
What I need to do is to find products sold in different countries and compare their prices between markets.
Sometimes the campaigns are available only in one country, sometimes in more, so to avoid having hundreds of thousands of unnecessary rows that I can't compare to others, I want to distill only those products that were available in more than 1 countryOrigin.
What's important - a product can be available in different campaigns with a different price.
That's why in my SELECT statement I added a new column:
calendarYearWeek||productID||campaignID AS uniqueItem - that way I know that I'm checking the price only for a specific product in a specific campaign during a specific week of year.
The table is also joined with another table to get exchange rates etc., so it's also GROUPed BY, so in each row I have a price and average exchange rate for a given uniqueItem in a specific country.
If I run this query, it works but even just for this year it gives me several million results, most of which I don't need because these are products sold only in one country and I need to compare prices across different markets.
So what I thought I need is to assign to each row a number of times a uniqueItem value appears in the whole table. If it's 1 - then the product is sold only in one country and I don't have to care about it. If it's 2 or 3 - this is what I need. Then I can filter out the unnecessary results in the WHERE clause ( > 1) and I can work on a smaller, better data set.
I tried different combinations of COUNT, and I tried ROW_NUMBER + OVER(PARTITION BY), which works only partially: when a product is available in 2 or more countries it numbers the rows, but I still cannot filter out "1" because then I'd lose the "first" country on the list. I thought about MATCH_RECOGNIZE, but I've never used it before and I think it's not available in Vertica.
Sorry if it's messy, but I'm not really advanced in SQL and English is not my native language.
Do you have any ideas how to get only the data I need?
What I have now is:
SELECT
a.originCountry,
a.calendarYearWeek,
a.productID,
a.campaignId,
a.valueLocal,
ROUND(AVG(b.exchange_rate),4),
a.calendarYearWeek||a.productID||a.campaignID AS uniqueItem
FROM table1 a
LEFT JOIN table2 b
ON a.reportDate = b.reportDate
AND a.originCountry = b.originCountry
WHERE a.originCountry IN ('ES', 'DE', 'FR')
GROUP BY 3, 4, 7, 1, 5, 2
ORDER BY 3, 4, 1
----------
I need some sample data, so I made up a few rows.
To join with table1, you need to find, in a subselect or a common table expression, the identifying grouping columns of those combinations that occur more than once.
You need to formulate the average as an OLAP (window) function if you want the country back in the report.
WITH
-- input, don't use in final query ..
table1(originCountry,calendarYearWeek,productID,campaignId,valuelocal,reportDate) AS (
SELECT 'ES',202203,43,142,100.50, DATE '2022-01-19'
UNION ALL SELECT 'DE',202203,43,142,135.00, DATE '2022-01-19'
UNION ALL SELECT 'FR',202203,43,142, 98.75, DATE '2022-01-19'
UNION ALL SELECT 'ES',202203,44,147,198.75, DATE '2022-01-19'
UNION ALL SELECT 'DE',202203,44,147,205.00, DATE '2022-01-19'
UNION ALL SELECT 'FR',202203,44,147,198.75, DATE '2022-01-19'
UNION ALL SELECT 'es',202203,49,150, 1.25, DATE '2022-01-19'
)
,
table2(originCountry,reportDate,exchange_rate) AS (
SELECT 'ES',DATE '2022-01-19', 1
UNION ALL SELECT 'DE',DATE '2022-01-19', 1
UNION ALL SELECT 'FR',DATE '2022-01-19', 1
)
-- end of input; real query starts here, replace following comma with "WITH" ..
,
-- you need the unique ident grouping values to join with ..
selgrp AS (
SELECT
a.calendarYearWeek
, a.productID
, a.campaignId
FROM table1 a
GROUP BY
a.calendarYearWeek
, a.productID
, a.campaignId
HAVING COUNT(*) > 1
-- chk calendarYearWeek | productID | campaignId
-- chk ------------------+--------+--------
-- chk 202203 | 43 | 142
-- chk 202203 | 44 | 147
)
SELECT
a.originCountry
, a.calendarYearWeek
, a.productID
, a.campaignId
, a.valueLocal
, AVG(b.exchange_rate) OVER w::NUMERIC(9,4) AS avg_exch_rate
-- a.calendarYearWeek||a.productID||a.campaignID AS uniqueItem
FROM table1 a
JOIN selgrp USING(calendarYearWeek,productID,campaignId)
LEFT JOIN table2 b
ON a.reportDate = b.reportDate
AND a.originCountry = b.originCountry
WHERE UPPER(a.originCountry) IN ('ES', 'DE', 'FR')
WINDOW w AS (PARTITION BY a.calendarYearWeek,a.productID,a.campaignID)
ORDER BY 3, 4, 1
-- out originCountry | calendarYearWeek | productID | campaignId | valueLocal | avg_exch_rate
-- out ---------------+------------------+-----------+------------+------------+---------------
-- out DE | 202203 | 43 | 142 | 135.00 | 1.0000
-- out ES | 202203 | 43 | 142 | 100.50 | 1.0000
-- out FR | 202203 | 43 | 142 | 98.75 | 1.0000
-- out DE | 202203 | 44 | 147 | 205.00 | 1.0000
-- out ES | 202203 | 44 | 147 | 198.75 | 1.0000
-- out FR | 202203 | 44 | 147 | 198.75 | 1.0000
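One detail of the selgrp step may be worth checking: COUNT(*) counts rows, not countries, so a combination with two rows from the same country would also pass HAVING COUNT(*) > 1. Below is a minimal sketch of a variant that counts distinct countries instead. It runs on SQLite through Python's sqlite3 module rather than on Vertica, and the duplicated 'ES' rows for product 49 are made-up data added only to show the difference.

```python
import sqlite3

# In-memory SQLite stand-in for the Vertica table (assumption: same column names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (originCountry TEXT, calendarYearWeek INTEGER,
                     productID INTEGER, campaignId INTEGER, valueLocal REAL);
INSERT INTO table1 VALUES
  ('ES', 202203, 43, 142, 100.50),
  ('DE', 202203, 43, 142, 135.00),
  ('FR', 202203, 43, 142,  98.75),
  -- hypothetical: two rows for the same item, but only one country
  ('ES', 202203, 49, 150,   1.25),
  ('ES', 202203, 49, 150,   1.30);
""")

rows = conn.execute("""
WITH selgrp AS (
  SELECT calendarYearWeek, productID, campaignId
  FROM table1
  GROUP BY calendarYearWeek, productID, campaignId
  -- count countries, not rows, so product 49 is excluded
  HAVING COUNT(DISTINCT originCountry) > 1
)
SELECT a.originCountry, a.productID, a.valueLocal
FROM table1 a
JOIN selgrp USING (calendarYearWeek, productID, campaignId)
ORDER BY a.productID, a.originCountry
""").fetchall()

print(rows)
```

Only the three rows of product 43 come back. With HAVING COUNT(*) > 1 the two 'ES' rows of product 49 would be returned as well.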
I am trying to compare the sum of two column values to a third column's value, then display a string literal as the result. My query is further below.
Here's my schema with sample data:
Player_Games
------------------------
Game1 | Game2 | Game3
------------------------
20 | 13 | 45
------------------------
14 | 27 | 25
------------------------
18 | 17 | 36
------------------------
20 | 20 | 29
------------------------
32 | 10 | 33
------------------------
SELECT
CASE
WHEN((
SELECT SUM(Game1 + Game2) as total FROM Player_Games
)
<
(
SELECT Game3 FROM Player_Games
))
THEN "Expected_Performance"
END as Result
FROM Player_Games;
Expected Result
Expected Performance
NULL
Expected Performance
NULL
NULL
However, an error is thrown: ERROR 1242 (21000) at line 4: Subquery returns more than 1 row
What am I missing here? Do I need to GROUP BY Game3?
You don't need SUM here, just a CASE expression:
SELECT
CASE
WHEN Game1 + Game2 < Game3
THEN 'Expected_Performance'
END as Result
FROM Player_Games
You can just check using a CASE expression whether the sum of game1 and game2 is less than game3 then add the needed text.
Query
select
case when game1 + game2 < game3
then 'Expected performance'
else null end as result
from player_games;
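As a quick check of the row-wise CASE approach against the sample data, here is a minimal sketch using Python's sqlite3 module (SQLite instead of MySQL, which is an assumption for convenience). The id column is hypothetical; it is added only so the output order can be pinned down with ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- id is a hypothetical key, not part of the original schema
CREATE TABLE Player_Games (id INTEGER PRIMARY KEY, Game1 INT, Game2 INT, Game3 INT);
INSERT INTO Player_Games (Game1, Game2, Game3) VALUES
  (20, 13, 45), (14, 27, 25), (18, 17, 36), (20, 20, 29), (32, 10, 33);
""")

# The comparison is evaluated once per row; no subquery or SUM is involved.
results = [r[0] for r in conn.execute("""
SELECT CASE WHEN Game1 + Game2 < Game3 THEN 'Expected_Performance' END AS Result
FROM Player_Games
ORDER BY id
""")]

print(results)
```

Rows where the condition is false yield NULL (None in Python), because a CASE without ELSE returns NULL.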
SELECT CASE WHEN t1.Game3 > t2.total THEN "Expected_Performance" END
FROM Player_Games t1
CROSS JOIN (SELECT SUM(Game1 + Game2) as total FROM Player_Games) t2
On MySQL 8+:
SELECT CASE WHEN Game3 > SUM(Game1 + Game2) OVER () THEN "Expected_Performance" END
FROM Player_Games
If you need the sum not over the whole table but only within a single row, then:
SELECT CASE WHEN Game3 > Game1 + Game2 THEN "Expected_Performance" END
FROM Player_Games
PS. There is no guarantee that the order of the output rows matches the source order (in any of these queries), so you should add a column to the output list that identifies each row.
I have data of an event with duration (say, eating a meal at a restaurant) and I want to know for any given hour how many events were taking place. The data looks like this:
Event | Start Time | End Time
-----------------------------------------
1 | 12:03 | 14:20
2 | 12:30 | 12:50
3 | 13:05 | 14:45
4 | 14:01 | 14:49
I also have "Duration" available as an alternative to "End Time". The result I'm looking for would be like this:
Hour | Count
-----------------------
12 | 2
13 | 2
14 | 3
During hour 12, there were two events happening (1 & 2), hour 13 also had two events (1 & 3) and hour 14 had three events (1, 3, & 4).
I can do this programmatically with a loop. I can count when the events start (or end) in SQL. But I'd really like to bridge the gap and do this in SQL, but I can't think of a way.
One possible solution (works with MySQL v5.6+ and SQLite3):
create table hours(Hour int);
insert into hours values
(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),
(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23);
create table log(Event int,StartTime varchar(5),EndTime varchar(5));
insert into log values
(1,'12:03','14:20'),
(2,'12:30','12:50'),
(3,'13:05','14:45'),
(4,'14:01','14:49');
-- ------------------------------------------------------------------------------
select Hour,count(Event) Count
from log join hours
on Hour between substr(StartTime,1,2) and substr(EndTime,1,2)
group by Hour;
If you are running MySQL 8.0, you could use UNION ALL, window functions and aggregation, like so:
select hr, sum(sum(cnt)) over(order by hr) cnt
from (
select hour(start_time) hr, 1 cnt from mytable
union all select hour(end_time) + 1, -1 from mytable
) t
group by hr
Demo on DB Fiddle:
hr | cnt
-: | --:
12 | 2
13 | 2
14 | 3
15 | 0
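The idea behind this query is a running sum of deltas: each event contributes +1 at its start hour and -1 at the hour after its end hour, and the cumulative sum of those deltas is the number of events active in each hour. The sketch below reproduces it on SQLite through Python's sqlite3 module. That choice is an assumption: SQLite has no hour() function, so the hour is taken with substr on 'HH:MM' text, and window functions need SQLite 3.25 or later. The aggregation is split into two CTEs for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (event INT, start_time TEXT, end_time TEXT);
INSERT INTO mytable VALUES
  (1, '12:03', '14:20'), (2, '12:30', '12:50'),
  (3, '13:05', '14:45'), (4, '14:01', '14:49');
""")

rows = conn.execute("""
WITH deltas AS (
  -- +1 when an event starts, -1 in the hour after it ends
  SELECT CAST(substr(start_time, 1, 2) AS INTEGER) AS hr, 1 AS cnt FROM mytable
  UNION ALL
  SELECT CAST(substr(end_time, 1, 2) AS INTEGER) + 1, -1 FROM mytable
),
per_hour AS (
  SELECT hr, SUM(cnt) AS net FROM deltas GROUP BY hr
)
-- running total of the net changes = events active during each hour
SELECT hr, SUM(net) OVER (ORDER BY hr) AS cnt
FROM per_hour
ORDER BY hr
""").fetchall()

print(rows)
```

Note that only hours in which some event starts or ends appear in the result; an hour with ongoing events but no start or end would be missing unless it is filled in from an hours table.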
If you do not have MySQL 8, then create a table hour:
CREATE TABLE hour (
hr INT PRIMARY KEY
);
INSERT INTO hour(hr) VALUES
(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23);
And then:
select h.hr, count(*) as cnt from hour h
join mytable m on h.hr between hour(m.Start_Time) and hour(m.End_Time)
group by hr
order by hr
;
See Db-Fiddle
I'm having trouble understanding how the MAX function works.
Here is my table MD_board :
idPlayer | matchday | total
---------+----------+-------
1 | 7 | 354
---------+----------+-------
2 | 7 | 122
---------+----------+-------
3 | 7 | 672
---------+----------+-------
1 | 8 | 452
---------+----------+-------
2 | 8 | 90
---------+----------+-------
3 | 8 | 654
---------+----------+-------
I want to have the max total and the idPlayer of the matchday 8. But the query is a mystery to me.
I tried the simple query :
SELECT MAX(total), idPlayer FROM MD_board WHERE matchday=8
The max value returned is good ( 654 ) but the idPlayer is wrong ( 1 ).
I tried a lot of other queries but I'm unable to get the correct result :(
I'm not really comfortable about more complex queries, so, if you could help ...
There are three idPlayer values for matchday = 8! You are probably using MySQL, which allows such invalid queries, and you should be aware that it returns an arbitrary idPlayer value. You could therefore get a different idPlayer tomorrow than you get today for the same query.
You probably want player with highest total in a specific matchday:
SELECT *
FROM MD_board MD1
WHERE matchday=8 and total =
(
SELECT MAX(total)
FROM MD_board MD2
WHERE MD2.matchday = MD1.matchday
)
Your query:
SELECT MAX(total), idPlayer FROM MD_board WHERE matchday=8
By applying MAX without GROUP BY you aggregate your data to one result row. You select the maximum total, which is 654 for that day, and the idPlayer for the day. But there is no single player for that day; there are three different ones. This is invalid SQL according to the SQL standard, but MySQL (when ONLY_FULL_GROUP_BY is not enabled) lets this slip and returns one of the three players arbitrarily.
If you want more data from the record(s) with the maximum total, then select those records again:
SELECT *
FROM MD_board
WHERE matchday = 8
AND total = (SELECT MAX(total) FROM MD_board WHERE matchday = 8);
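To see that the subquery form picks the right player, here is a minimal sketch that loads the sample table into SQLite through Python's sqlite3 module (SQLite rather than MySQL is an assumption made so the example is self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MD_board (idPlayer INT, matchday INT, total INT);
INSERT INTO MD_board VALUES
  (1, 7, 354), (2, 7, 122), (3, 7, 672),
  (1, 8, 452), (2, 8, 90),  (3, 8, 654);
""")

# First find the maximum total for matchday 8, then fetch the row(s) that have it.
rows = conn.execute("""
SELECT idPlayer, matchday, total
FROM MD_board
WHERE matchday = 8
  AND total = (SELECT MAX(total) FROM MD_board WHERE matchday = 8)
""").fetchall()

print(rows)
```

If two players tie for the maximum, both rows are returned, which is usually what you want for a leaderboard.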
Try this:
SELECT M.*
FROM MD_board M
JOIN (
SELECT matchday, MAX(total) AS total
FROM MD_board
GROUP BY matchday
) D ON D.matchday = M.matchday AND D.total = M.total
I'm trying to get a list of the *usedpc values across multiple similar columns, ordered descending to get the worst offenders. Also, I need to select only the values from the most recent timestamp for each sys_id.
Example data:
Sys_id | timestamp | disk0_usedpc | disk1_usedpc | disk2_usedpc
---
1 | 2016-05-06 15:24:10 | 75 | 45 | 35
1 | 2016-04-06 15:24:10 | 70 | 40 | 30
2 | 2016-05-06 15:24:10 | 23 | 28 | 32
3 | 2016-05-06 15:24:10 | 50 | 51 | 55
Desired result (assuming limit 2 for example):
1 | 2016-05-06 15:24:10 | disk0_usedpc | 75
3 | 2016-05-06 15:24:10 | disk2_usedpc | 55
I know I can get the max from each column using greatest, max and group timestamp to get only the latest values, but I can't figure out how to get the whole ordered list (not just max/greatest from each column, but the "5 highest values across all 3 disk columns").
EDIT: I set up a SQLFiddle page:
http://sqlfiddle.com/#!9/82202/1/0
EDIT2: I'm very sorry about the delay. I was able to get all three solutions to work, thank you. If @PetSerAl can put his solution in an answer, I'll mark it as accepted, as this solution allowed me to very smoothly customise further.
You can join the vm_disk table with a three-row table to create a separate row for each of your disks. Then, as you now have one row per disk, you can easily filter or sort them.
select
`sys_id`,
`timestamp`,
concat('disk', `disk`, '_usedpc') as `name`,
case `disk`
when 0 then `disk0_usedpc`
when 1 then `disk1_usedpc`
when 2 then `disk2_usedpc`
end as `usedpc`
from
`vm_disk` join
(
select 0 as `disk`
union all
select 1
union all
select 2
) as `t`
where
(`sys_id`, `timestamp`) in (
select
`sys_id`,
max(`timestamp`)
from `vm_disk`
group by `sys_id`
)
order by `usedpc` desc
limit 5
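The unpivot-by-cross-join idea can be tried on the sample data with the sketch below. It uses SQLite through Python's sqlite3 module instead of MySQL (an assumption), so concat is replaced by the || operator and the latest-row filter is written as a join on a grouped subquery. Timestamps are stored as ISO-formatted text, so MAX compares them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vm_disk (sys_id INT, timestamp TEXT,
                      disk0_usedpc INT, disk1_usedpc INT, disk2_usedpc INT);
INSERT INTO vm_disk VALUES
  (1, '2016-05-06 15:24:10', 75, 45, 35),
  (1, '2016-04-06 15:24:10', 70, 40, 30),
  (2, '2016-05-06 15:24:10', 23, 28, 32),
  (3, '2016-05-06 15:24:10', 50, 51, 55);
""")

rows = conn.execute("""
SELECT v.sys_id, v.timestamp,
       'disk' || t.disk || '_usedpc' AS name,
       CASE t.disk
         WHEN 0 THEN v.disk0_usedpc
         WHEN 1 THEN v.disk1_usedpc
         WHEN 2 THEN v.disk2_usedpc
       END AS usedpc
FROM vm_disk v
-- three-row helper table: one output row per (source row, disk)
CROSS JOIN (SELECT 0 AS disk UNION ALL SELECT 1 UNION ALL SELECT 2) t
-- keep only the most recent row of each sys_id
JOIN (SELECT sys_id, MAX(timestamp) AS timestamp
      FROM vm_disk GROUP BY sys_id) latest
  ON latest.sys_id = v.sys_id AND latest.timestamp = v.timestamp
ORDER BY usedpc DESC
LIMIT 2
""").fetchall()

print(rows)
```

With LIMIT 2 this returns the two rows from the desired result in the question. Because every disk has its own row, one system can appear several times among the worst offenders.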
Maybe something like this would work... I know it may look pretty redundant but it could save overhead caused by doing multiple joins to the same table:
SELECT md.Sys_id,
       md.timestamp,
       -- >= instead of > so that ties pick one of the tied disks
       -- rather than falling through to the ELSE branch
       CASE
         WHEN md.disk0_usedpc >= md.disk1_usedpc
          AND md.disk0_usedpc >= md.disk2_usedpc
           THEN 'disk0_usedpc'
         WHEN md.disk1_usedpc >= md.disk2_usedpc
           THEN 'disk1_usedpc'
         ELSE 'disk2_usedpc'
       END AS pcname,
       CASE
         WHEN md.disk0_usedpc >= md.disk1_usedpc
          AND md.disk0_usedpc >= md.disk2_usedpc
           THEN md.disk0_usedpc
         WHEN md.disk1_usedpc >= md.disk2_usedpc
           THEN md.disk1_usedpc
         ELSE md.disk2_usedpc
       END AS pcusage
FROM mydatabase md
-- HAVING MAX(md.timestamp) only tests the value for truthiness;
-- this filter keeps the latest row per Sys_id instead
WHERE (md.Sys_id, md.timestamp) IN (
  SELECT Sys_id, MAX(timestamp)
  FROM mydatabase
  GROUP BY Sys_id
)
ORDER BY pcusage DESC
Try this:
select
t1.sys_id, t1.`timestamp`,
case locate(greatest(disk0_usedpc ,disk1_usedpc ,disk2_usedpc), concat_ws(',' ,disk0_usedpc ,disk1_usedpc ,disk2_usedpc))
when 1 then 'disk0_usedpc'
when 1 + length(concat(disk0_usedpc, ',')) then 'disk1_usedpc'
when 1 + length(concat(disk0_usedpc, ',', disk1_usedpc, ',')) then 'disk2_usedpc'
end as usedpc,
greatest(disk0_usedpc ,disk1_usedpc ,disk2_usedpc) as amount
from yourtable t1
join (
select max(`timestamp`) as `timestamp`, sys_id
from yourtable
group by sys_id
) t2 on t1.sys_id = t2.sys_id and t1.`timestamp` = t2.`timestamp`
order by t1.`timestamp` desc
-- limit 2
SQLFiddle Demo
How it works: the subquery gets the latest row for each sys_id (one of many ways to do that). Then you need the greatest of disk0_usedpc, disk1_usedpc and disk2_usedpc; as you wrote in your question, the function greatest does that, so greatest(disk0_usedpc ,disk1_usedpc ,disk2_usedpc) as amount gives you the amount.
You also want that column's name; for this I used locate together with concat and concat_ws (which saves writing the separator, here a comma, between every pair of values).
Let's take row 1 | 2016-05-06 15:24:10 | 75 | 45 | 35 as an example:
concat_ws(',' ,disk0_usedpc ,disk1_usedpc ,disk2_usedpc) gives us "75,45,35"; in this string 75 starts at index 1, 45 at index 4 and 35 at index 7.
As you can see, locate(greatest(disk0_usedpc ,disk1_usedpc ,disk2_usedpc), concat_ws(',' ,disk0_usedpc ,disk1_usedpc ,disk2_usedpc)) returns 1, so the greatest column is disk0_usedpc, as expected.
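The index arithmetic in that walkthrough can be replayed outside the database. The following is only an illustrative Python sketch of the same string positions, not MySQL code: str.find is zero-based, so 1 is added to mimic the one-based result of locate, and the variable names are made up for this example.

```python
values = [75, 45, 35]  # disk0_usedpc, disk1_usedpc, disk2_usedpc of the example row
names = ["disk0_usedpc", "disk1_usedpc", "disk2_usedpc"]

joined = ",".join(str(v) for v in values)  # like concat_ws(',', ...)
greatest = str(max(values))                # like greatest(...)
position = joined.find(greatest) + 1       # one-based, like locate(...)

# One-based start position of each value inside the joined string,
# matching the three WHEN branches of the query.
offsets = [
    1,
    1 + len(f"{values[0]},"),
    1 + len(f"{values[0]},{values[1]},"),
]

name = names[offsets.index(position)]
print(joined, position, offsets, name)
```

For this row the offsets are 1, 4 and 7, and the greatest value is found at position 1, so the name resolves to disk0_usedpc.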