I want to do a merge operation in DBT.
my target table is nested. I wrote the code in 2 steps.
In the 1st stage, I bring it to the level I want in my target table.
In part 2, I merge it. but since I use expect in the select step, I see 6 joins running in the back.
Actually, there should be one full join from the first temp and one merge needle. What can I use instead of Except?
Do you have any suggested solution?
Query:
{{config(
materialized='incremental',
unique_key='customer_id'
)}}
WITH scd_target_array_tmp1 AS(
select ifnull(r.customer_id,l.customer_id) AS customer_id,
ifnull(R.customerNO,l.customerNO) as customerNO,
ARRAY_AGG(STRUCT(ifnull(l.name,r.name) as name ,ifnull(l.start_date,r.start_date) as start_date,
case when r.customer_id is null then coalesce(l.end_date,current_date()) end as end_date,
coalesce(r.is_current_record,'0')as is_current_record,ifnull(r.table_key,l.table_key) as table_key)) AS authors
from(
select p.customer_id,
p.customerNO, b.name as name,
b.start_date as start_date,
b.end_date as end_date,
b.is_current_record,
b.table_key
from presales-sandbox-346209.jaffle_shop.scd_target_array p, unnest(authors)b
) L
full join(
select customer_id,
customerNO,Name,
current_date() as start_date,
cast(null as DATE) as end_date,
'1'as is_current_record,
to_hex(sha256(to_json_string(struct(customer_id,customerNO,Name)))) as table_key
from presales-sandbox-346209.stripe.scd_table
) R on l.customer_id = R.customer_id and l.table_key=r.table_key
GROUP BY ifnull(r.customer_id,l.customer_id), ifnull(R.customerNO,l.customerNO)
)
SELECT customer_id,
customerNO,
authors
FROM scd_target_array_tmp1 a
{% if is_incremental() %}
WHERE customer_id IN (SELECT customer_id
FROM scd_target_array_tmp1 a , UNNEST(authors)b
WHERE b. table_key in (select b.table_key from scd_target_array_tmp1 , unnest(authors) b
except distinct
select b.table_key from presales-sandbox-346209.jaffle_shop.scd_target_array, unnest(authors) b
)
)
{% endif %}
I used expect in the select step in the last part of the query. What else can be?
This image shows how my raw table looks like:
Following are the conditions to get the transposed table from the image below:
Each row has a unique id
We only need columns for groups A,B,C in the group field and not others.
There could be single or multiple id for group A for the same app id, I need to get those rows for which date is minimum.
There could be single or multiple id for group B and C for the same app id, I need to get those rows for which date is maximum
The image below shows how my final table should look like:
Each row has a unique id
We only need columns for groups A,B,C in the group field and not others.
add this to your query
WHERE `GROUP` IN ('A','B','C')
There could be single or multiple id for group A for the same app id, I need to get those rows for which date is minimum.
add somewhere after the SELECT:
MIN(date) OVER (PARTIITON BY appid)
There could be single or multiple id for group B and C for the same app id, I need to get those rows for which date is maximum
change the added option on point 3 to:
CASE WHEN `group` IN ('B','C')
THEN MAX(date) OVER (PARTIITON BY appid)
ELSE MIN(date) OVER (PARTIITON BY appid)
END
Maybe this helps you to try and take a serious request of solving this yourself (and learn from it) in stead of asking for a solution and then do copy/paste...
BTW: Naming fiels with reserved words, like GROUP and DATE is not a very smart thing to do. A better name for the column GROUP might be CategoryGroup (or whatever this group is referring to)
I took a different approach to this. The SQL is longer but I think it's more auditable.
The main logic point is that I broke A and BC into 2 different subqueries, and used QUALIFY ROW_NUMBER() to choose the correct row, based on either ASC or DESC per your requirements.
I know you are using mysql and this might not work since I don't have an instance to test this one, but here is the SQL I got from building this logic in Rasgo, which I tested on Snowflake and it worked.
-- This splits the data into group A only
WITH CTE_A AS (
SELECT
*
FROM
{{ your_table }}
WHERE
my_group = 'A'
),
-- This splits the data into group B and C only
CTE_B AS (
SELECT
*
FROM
{{ your_table }}
WHERE
my_group IN('B', 'C')
),
-- Selecting from A only, it keeps the most recent row ASCENDING
CTE_A_FIRST AS (
SELECT
*
FROM
CTE_A QUALIFY ROW_NUMBER() OVER (
PARTITION BY APP_ID,
MY_GROUP
ORDER BY
MY_DATE ASC
) = 1
),
-- Selecting from A only, it keeps the most recent row DESCENDING
CTE_B_LAST AS (
SELECT
*
FROM
CTE_B QUALIFY ROW_NUMBER() OVER (
PARTITION BY APP_ID,
MY_GROUP
ORDER BY
MY_DATE DESC
) = 1
),
-- Here we just union A and BC back to one another
CTE_ABC AS (
SELECT
ID,
APP_ID,
MY_DATE,
MY_GROUP,
SCORE1,
SCORE2
FROM
CTE_B_LAST
UNION ALL
SELECT
ID,
APP_ID,
MY_DATE,
MY_GROUP,
SCORE1,
SCORE2
FROM
CTE_B
),
-- We pivot the date horizontally so we get a date for A B C
-- the MIN does not matter, since at this point, we only have 1
CTE_PVT_DATE AS (
SELECT
APP_ID,
B,
C,
A
FROM
(
SELECT
APP_ID,
MY_DATE,
MY_GROUP
FROM
CTE_ABC
) PIVOT (
MIN (MY_DATE) FOR MY_GROUP IN ('B', 'C', 'A')
) as p (APP_ID, B, C, A)
),
-- We pivot the SCORE1 horizontally so we get a date for A B C
-- the MIN does not matter, since at this point, we only have 1
CTE_PVT_SCORE1 AS (
SELECT
APP_ID,
B,
C,
A
FROM
(
SELECT
APP_ID,
SCORE1,
MY_GROUP
FROM
CTE_ABC
) PIVOT (
MIN (SCORE1) FOR MY_GROUP IN ('B', 'C', 'A')
) as p (APP_ID, B, C, A)
),
-- We pivot the SCORE2 horizontally so we get a date for A B C
-- the MIN does not matter, since at this point, we only have 1
CTE_PVT_SCORE2 AS (
SELECT
APP_ID,
B,
C,
A
FROM
(
SELECT
APP_ID,
SCORE2,
MY_GROUP
FROM
CTE_ABC
) PIVOT (
MIN (SCORE2) FOR MY_GROUP IN ('B', 'C', 'A')
) as p (APP_ID, B, C, A)
),
-- We join the subqueries above together on the APP_IDs
CTE_JOINED AS (
SELECT
t0.*,
t1.APP_ID as SCORE1_APP_ID,
t1.B as SCORE1_B,
t1.C as SCORE1_C,
t1.A as SCORE1_A,
t2.APP_ID as SCORE2_APP_ID,
t2.B as SCORE2_B,
t2.C as SCORE2_C,
t2.A as SCORE2_A
FROM
CTE_PVT_DATE t0
INNER JOIN CTE_PVT_SCORE1 t1 ON t0.APP_ID = t1.APP_ID
INNER JOIN CTE_PVT_SCORE2 t2 ON t0.APP_ID = t2.APP_ID
)
-- The final select is really just renaming ...
-- the magic has already happened
SELECT
A AS DATE_A,
B AS DATE_B,
C AS DATE_C,
APP_ID,
SCORE1_B,
SCORE1_C,
SCORE1_A,
SCORE2_B,
SCORE2_C,
SCORE2_A
FROM
CTE_JOINED
I'll roll out my attempt along several steps and then show you the full solution made up of these steps, so that you can understand it piece by piece, given the following definition of your input table:
CREATE TABLE tab(
id INT,
app_id INT,
date VARCHAR(20),
group VARCHAR(20),
score1 INT,
score2 INT
);
STEP 1. Formatting date using a proper DATE format ("YYYY-MM-DD"). For this purpose the function STR_TO_DATE can come in handy.
WITH formatted_tab AS (
SELECT id,
app_id,
STR_TO_DATE(date, '%m/%d/%Y') AS date,
group,
score1,
score2
FROM tab
)
STEP 2. Extracting the useful dates according to the group field. As long as you treat group "A" differently with respect to group "B" and "C" specifically, the idea here is to address each group with a different query, where
in the former case the MIN aggregation function is applied,
in the latter case the MAX aggregation function is applied,
Then the two output result sets are combined with a UNION operation.
(
SELECT app_id,
MIN(date) AS date,
group
FROM formatted_tab
WHERE group IN ('A')
GROUP BY app_id,
group
UNION
SELECT app_id,
MAX(date) AS date,
group
FROM formatted_tab
WHERE group IN ('B', 'C')
GROUP BY app_id,
group
) needed_dates
STEP 3. Getting back scores corresponding to group and date field. This is done with a simple INNER JOIN between the last generated table and the formatted table.
(
SELECT needed_dates.*,
formatted_tab.score1,
formatted_tab.score2
FROM needed_dates
INNER JOIN formatted_tab
ON needed_dates.app_id = formatted_tab.app_id
AND needed_dates.date = formatted_tab.date
AND needed_dates.group = formatted_tab.group
) needed_infos
STEP 4. Pivoting the table exploiting MySQL tools like:
the IF statement to retrieve the values corresponding to a specific group
the MAX aggregation function, to aggregate on the same group
These tools are applied for each group you specified ('A', 'B' and 'C').
SELECT app_id,
MAX(IF(group='A', date , NULL)) AS date_groupA,
MAX(IF(group='B', date , NULL)) AS date_groupB,
MAX(IF(group='C', date , NULL)) AS date_groupC,
MAX(IF(group='A', score1, NULL)) AS score1_groupA,
MAX(IF(group='A', score2, NULL)) AS score2_groupA,
MAX(IF(group='B', score1, NULL)) AS score1_groupB,
MAX(IF(group='B', score2, NULL)) AS score2_groupB,
MAX(IF(group='C', score1, NULL)) AS score1_groupC,
MAX(IF(group='C', score2, NULL)) AS score2_groupC
FROM needed_infos
GROUP BY app_id
Full attempt. This is the combination of the previous snippets. The only difference is the presence of backticks for the field names, that avoid MySQL to misunderstand them with MySQL private keywords like "date" (indicating the DATE type), "group" (use as keyword in the GROUP BY clause) or similar.
WITH `formatted_tab` AS (
SELECT `id`,
`app_id`,
STR_TO_DATE(`date`, '%m/%d/%Y') AS `date`,
`group`,
`score1`,
`score2`
FROM `tab`
)
SELECT `app_id`,
MAX(IF(`group`='A', `date` , NULL)) AS date_groupA,
MAX(IF(`group`='B', `date` , NULL)) AS date_groupB,
MAX(IF(`group`='C', `date` , NULL)) AS date_groupC,
MAX(IF(`group`='A', `score1`, NULL)) AS score1_groupA,
MAX(IF(`group`='A', `score2`, NULL)) AS score2_groupA,
MAX(IF(`group`='B', `score1`, NULL)) AS score1_groupB,
MAX(IF(`group`='B', `score2`, NULL)) AS score2_groupB,
MAX(IF(`group`='C', `score1`, NULL)) AS score1_groupC,
MAX(IF(`group`='C', `score2`, NULL)) AS score2_groupC
FROM ( SELECT needed_dates.*,
formatted_tab.score1,
formatted_tab.score2
FROM ( SELECT `app_id`,
MIN(`date`) AS `date`,
`group`
FROM `formatted_tab`
WHERE `group` IN ('A')
GROUP BY `app_id`,
`group`
UNION
SELECT `app_id`,
MAX(`date`) AS `date`,
`group`
FROM `formatted_tab`
WHERE `group` IN ('B', 'C')
GROUP BY `app_id`,
`group`
) needed_dates
INNER JOIN formatted_tab
ON needed_dates.app_id = formatted_tab.app_id
AND needed_dates.date = formatted_tab.date
AND needed_dates.group = formatted_tab.group
) needed_infos
GROUP BY `app_id`
You'll find a tested SQL Fiddle here.
Quick overview, I have worked out a mysql query but need to optimize the performance.
My original post was here but its gone cold and im getting desperate to elaborate on some of the suggestions which I tried to implement. So its not a dupe post but it is related.
Here is the query that takes 45 seconds plus, the group by on the second sub query really slows things down.
SELECT * FROM
(
SELECT DISTINCT email,
title,
first_name,
last_name,
'chauntry' AS source,
post_code AS postcode
FROM chauntry
WHERE mailing_indicator = 1
) AS x
JOIN
(
SELECT email,
Avg(amount_paid) AS avg_paid,
Count(*) AS no_times_booked,
Count(DISTINCT( Date_format(added, '%M %Y') )) AS unique_months
FROM chauntry
WHERE added >= Now() - INTERVAL 1 year
GROUP BY email
) AS y
ON x.email = y.email
Based on the index suggestions from here I looked around for a few examples of indexing and came up with the below
ALTER TABLE `chauntry`
ADD INDEX(`mailing_indicator`, `email`);
ALTER TABLE `chauntry`
ADD INDEX covering_index (`added`, `email`, `amount_paid`);
This makes no difference to the query time and im not sure if what im doing is even close as up until now I have had no need to use indexing.
suggestions welcome on how to index my table correctly or how to modify the query.
Out of curiousity, does this query do what you want?
SELECT email, title, first_name, last_name, 'chauntry' AS source,
post_code AS postcode,
Avg(amount_paid) AS avg_paid,
Count(*) AS no_times_booked,
Count(DISTINCT( Date_format(added, '%M %Y') )) AS unique_months
FROM chauntry
WHERE added >= Now() - INTERVAL 1 year
GROUP BY email, title, first_name, last_name, post_code
HAVING SUM(mailing_indicator = 1) > 0;
It would seem to follow the same logic as your query, except that the mailing indicator would need to have been set in the past year.
Why use JOIN on subselects to same table?
I would try this:
SELECT email,
title,
first_name,
last_name,
'chauntry' AS source,
post_code AS postcode
Avg(amount_paid) AS avg_paid,
Count(*) AS no_times_booked,
Count(DISTINCT( Date_format(added, '%M %Y') )) AS unique_months
FROM chauntry
WHERE
mailing_indicator = 1 and
added >= Now() - INTERVAL 1 year
GROUP BY email
Also I don't think you need any index with query like this, maybe on added and email, but you already added them.
Minor play.
The average of the amount_paid is the biggest problem. If you are prepared to put up with the possibility of an inaccuracy for this figure then you could maybe average the distinct values of the amount_paid field. This WILL give the wrong value under certain circumstances (ie, if you had 100 bookings, 99 at $1 and 1 at $100 the average would be given as $50.50 rather than $1.99), but if the amount paid is never repeated then this may be acceptable.
Otherwise you can probably use a join of the table against itself. To get the no_times_booked you can count the DISTINCT unique identifiers of the table (I have assumed id here).
SELECT c1.email,
c1.title,
c1.first_name,
c1.last_name,
'chauntry' AS source,
c1.post_code AS postcode
Avg(DISTINCT c2.amount_paid) AS avg_paid,
Count(DISTINCT c2.id) AS no_times_booked,
Count(DISTINCT( Date_format(c2.added, '%M %Y') )) AS unique_months
FROM chauntry c1
INNER JOIN chauntry c2
ON c1.email = c2.email
WHERE c1.mailing_indicator = 1
AND c2.added >= Now() - INTERVAL 1 year
GROUP BY c1.email,
c1.title,
c1.first_name,
c1.last_name,
source,
c1.post_code
I have a table called receiving with 4 columns:
id, date, volume, volume_units
The volume units are always stored as a value of either "Lbs" or "Gals".
I am trying to write an SQL query to get the sum of the volumes in Lbs and Gals for a specific date range. Something along the lines of: (which doesn't work)
SELECT sum(p1.volume) as lbs,
p1.volume_units,
sum(p2.volume) as gals,
p2.volume_units
FROM receiving as p1, receiving as p2
where p1.volume_units = 'Lbs'
and p2.volume_units = 'Gals'
and p1.date between "2012-01-01" and "2012-03-07"
and p2.date between "2012-01-01" and "2012-03-07"
When I run these queries separately the results are way off. I know the join is wrong here, but I don't know what I am doing wrong to fix it.
SELECT SUM(volume) AS total_sum,
volume_units
FROM receiving
WHERE `date` BETWEEN '2012-01-01'
AND '2012-03-07'
GROUP BY volume_units
You can achieve this in one query by using IF(condition,then,else) within the SUM:
SELECT SUM(IF(volume_units="Lbs",volume,0)) as lbs,
SUM(IF(volume_units="Gals",volume,0)) as gals,
FROM receiving
WHERE `date` between "2012-01-01" and "2012-03-07"
This only adds volume if it is of the right unit.
This query will display the totals for each ID.
SELECT s.`id`,
CONCAT(s.TotalLbsVolume, ' ', 'lbs') as TotalLBS,
CONCAT(s.TotalGalVolume, ' ', 'gals') as TotalGAL
FROM
(
SELECT `id`, SUM(`volume`) as TotalLbsVolume
FROM Receiving a INNER JOIN
(
SELECT `id`, SUM(`volume`) as TotalGalVolume
FROM Receiving
WHERE (volume_units = 'Gals') AND
(`date` between '2012-01-01' and '2012-03-07')
GROUP BY `id`
) b ON a.`id` = b.`id`
WHERE (volume_units = 'Lbs') AND
(`date` between '2012-01-01' and '2012-03-07')
GROUP BY `id`
) s
this is a cross join with no visible condition on the join, i don't think you meant that
if you want to sum quantities you don't need to join at all, just group as zerkms did
You can simply group by date and volume_units without self-join.
SELECT date, volume_units, sum(volume) sum_vol
FROM receving
WHERE date between "2012-01-01" and "2012-03-07"
GROUP BY date, volume_units
Sample test:
select d, vol_units, sum(vol) sum_vol
from
(
select 1 id, '2012-03-07' d, 1 vol, 'lbs' vol_units
union
select 2 id, '2012-03-07' d, 2 vol, 'Gals' vol_units
union
select 3 id, '2012-03-08' d, 1 vol, 'lbs' vol_units
union
select 4 id, '2012-03-08' d, 2 vol, 'Gals' vol_units
union
select 5 id, '2012-03-07' d, 10 vol, 'lbs' vol_units
) t
group by d, vol_units