Using sqlContext.sql for sub queries in scala - mysql

I have written the following Hive code and need to translate it for use in Scala. From what I understand, we need to use sqlContext.sql.
The examples available online only show simple select statements. For example, to run a simple SQL query in Scala:
val tableA = sqlContext.sql("Select * from game");
I can't seem to use the same syntax for the code below. What is the syntax to translate the code below to fit the above usage?
DROP TABLE ADW.TERA_BARCODE_LOOKUP_TABLE_RAW ;
CREATE TABLE ADW.TERA_BARCODE_LOOKUP_TABLE_RAW AS
SELECT CAST(BRCDE_REF_I AS STRING) AS BARCODE,
MAX(TRIM(GST_SRC_ID)) AS GST_SRC_ID,MAX(SRC_ACTV_TS) AS SRC_ACTV_TS
FROM
(SELECT RANKED.*
FROM
(SELECT BRCDE_REF_I,GST_SRC_ID,SRC_ACTV_TS,
RANK() over (partition by BRCDE_REF_I ORDER BY SRC_ACTV_TS DESC) AS RANK
FROM
ADW.GST_SRC_ID_BRCDE_LKUP_TABLE X
WHERE UPPER(X.CURR_ACTV_F) = 'Y' AND TRIM(X.GST_SRC_ID) IN
(SELECT TRIM(GST_SRC_I) FROM ADW.CANDIDATE_GST_ID_SRC_TABLE GROUP BY TRIM(GST_SRC_I))
) RANKED
WHERE RANKED.RANK = 1 ) X
GROUP BY BRCDE_REF_I ;

There are two SQL commands, and because Apache Hive doesn't support BEGIN and COMMIT, your best option is to run them as two separate commands.
You have not posted your error. I assumed you might also get an error on DROP TABLE, so I changed it to DROP TABLE IF EXISTS.
Also, in Scala, multiline strings must be wrapped in """, not ".
sqlContext.sql("DROP TABLE IF EXISTS ADW.TERA_BARCODE_LOOKUP_TABLE_RAW")
sqlContext.sql("""
CREATE TABLE ADW.TERA_BARCODE_LOOKUP_TABLE_RAW AS
SELECT CAST(BRCDE_REF_I AS STRING) AS BARCODE,
MAX(TRIM(GST_SRC_ID)) AS GST_SRC_ID,MAX(SRC_ACTV_TS) AS SRC_ACTV_TS
FROM
(SELECT RANKED.*
FROM
(SELECT BRCDE_REF_I,GST_SRC_ID,SRC_ACTV_TS,
RANK() over (partition by BRCDE_REF_I ORDER BY SRC_ACTV_TS DESC) AS RANK
FROM
ADW.GST_SRC_ID_BRCDE_LKUP_TABLE X
WHERE UPPER(X.CURR_ACTV_F) = 'Y' AND TRIM(X.GST_SRC_ID) IN
(SELECT TRIM(GST_SRC_I) FROM ADW.CANDIDATE_GST_ID_SRC_TABLE GROUP BY TRIM(GST_SRC_I))
) RANKED
WHERE RANKED.RANK = 1 ) X
GROUP BY BRCDE_REF_I
""")
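The same "one statement per call" idea can be sketched in Python. This is only an illustration under stated assumptions: the standard-library sqlite3 module stands in for the Hive/Spark engine, the table name barcode_lookup and its contents are made up, and the helper split_statements is hypothetical (it splits naively on ';' and would break on semicolons inside string literals).

```python
import sqlite3


def split_statements(script: str) -> list[str]:
    """Naively split a SQL script on ';' (assumes no ';' inside string literals)."""
    return [part.strip() for part in script.split(";") if part.strip()]


# Hypothetical two-statement script, shaped like the DROP + CREATE pair above.
script = """
DROP TABLE IF EXISTS barcode_lookup;
CREATE TABLE barcode_lookup AS
SELECT 'A1' AS barcode, 'G7' AS gst_src_id;
"""

conn = sqlite3.connect(":memory:")
statements = split_statements(script)
for statement in statements:
    # conn.execute accepts a single statement, so each one is sent separately.
    conn.execute(statement)

rows = conn.execute("SELECT barcode, gst_src_id FROM barcode_lookup").fetchall()
```

The triple-quoted string plays the same role here as the """ string in the Scala answer: it lets the SQL span several lines unchanged.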

Related

Is there a way I can optimize this query to make it shorter?

I am trying to make the code shorter and simpler. The code is working. I want to move the inner queries into a CTE or temp table to make it shorter. How do I go about this?
Create OR REPLACE view piper.v_da_areas_per_site(logistics_id, streams_label, city_id, type) as
SELECT
dataset.logistics_id,
dataset.streams_label,
dataset.city_id,
FROM
(SELECT
DISTINCT r.logistics_id,
r.streams_label,
r.city_id
FROM
top.distributions r
JOIN (
SELECT
distributions.distribution_id,
max(distributions.event_time) AS event_time
FROM
top.distributions distributions
WHERE
distributions.stream_type = 'DA'
AND distributions.distribution_space = 'DaFilterName'
GROUP BY
distributions.distribution_id
) m ON r.distribution_id = m.distribution_id
AND r.event_time = m.event_time
AND current_date >= r.distribution_start_time
AND r.distribution_end_time >= current_date
AND r.stream_type = 'DA'
AND r.distribution_space = 'DaFilterName'
AND (
r.logistics_id IN (
SELECT
DISTINCT dev_class_hub_list.class_hub
FROM
piper.dev_class_hub_list
WHERE
dev_class_hub_list.is_3p = 'N'
)
)
)dataset;
These indexes may help:
distributions: INDEX(stream_type, distribution_space,
distribution_id, event_time)
distributions: INDEX(distribution_id) -- (unless is PK)
dev_class_hub_list: INDEX(is_3p, class_hub)
Try changing
AND ( r.logistics_id IN (
SELECT DISTINCT dev_class_hub_list.class_hub
FROM dev_class_hub_list
WHERE dev_class_hub_list.is_3p = 'N' ) )
to
AND EXISTS ( SELECT 1
FROM dev_class_hub_list AS dchl
WHERE dchl.is_3p = 'N'
AND dchl.class_hub = r.logistics_id )
The DISTINCT may be causing an extra de-dup pass; EXISTS is a "semi-join", so it stops when it finds the first match. The Optimizer may turn one of these into the other. To check, please run
EXPLAIN SELECT ...;
SHOW WARNINGS; -- to see the transformations performed
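To see that the IN and EXISTS forms return the same rows, here is a small sketch using Python's standard-library sqlite3 as a stand-in engine. The table contents are invented for illustration, and only the columns needed for this one condition are included; it says nothing about how MySQL's optimizer will actually plan either form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE distributions (distribution_id INTEGER, logistics_id TEXT);
CREATE TABLE dev_class_hub_list (class_hub TEXT, is_3p TEXT);
INSERT INTO distributions VALUES (1, 'H1'), (2, 'H2'), (3, 'H3');
-- 'H1' appears twice on purpose: neither form should duplicate output rows.
INSERT INTO dev_class_hub_list VALUES ('H1', 'N'), ('H1', 'N'), ('H2', 'Y'), ('H3', 'N');
""")

in_form = """
SELECT r.distribution_id
FROM distributions r
WHERE r.logistics_id IN (SELECT DISTINCT class_hub
                         FROM dev_class_hub_list
                         WHERE is_3p = 'N')
ORDER BY r.distribution_id
"""

exists_form = """
SELECT r.distribution_id
FROM distributions r
WHERE EXISTS (SELECT 1
              FROM dev_class_hub_list AS dchl
              WHERE dchl.is_3p = 'N'
                AND dchl.class_hub = r.logistics_id)
ORDER BY r.distribution_id
"""

in_rows = conn.execute(in_form).fetchall()
exists_rows = conn.execute(exists_form).fetchall()
```

Both queries keep distributions 1 and 3 and drop distribution 2, whose hub has is_3p = 'Y'.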
Range tests like this are notoriously difficult to optimize:
AND current_date >= r.distribution_start_time
AND r.distribution_end_time >= current_date
As for CTE -- I need to understand what the SELECT is trying to do.
As for VIEW -- Views are "syntactic sugar"; they tend to be no better than the equivalent SELECT. Will you be adding WHERE or other clauses when you SELECT from this VIEW? They may or may not be efficiently folded into the resulting Select.
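For the CTE part of the question, one possible shape is sketched below. It assumes the inner query is meant to pick the latest event per distribution_id, which the question does not confirm. It runs on Python's standard-library sqlite3 with invented rows, drops the schema prefixes, and leaves out the current_date range tests to keep the toy data small. The CTE name latest is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE distributions (
    distribution_id INTEGER, event_time TEXT, logistics_id TEXT,
    streams_label TEXT, city_id INTEGER, stream_type TEXT, distribution_space TEXT);
CREATE TABLE dev_class_hub_list (class_hub TEXT, is_3p TEXT);
INSERT INTO distributions VALUES
    (1, '2024-01-01', 'H1', 'old', 10, 'DA', 'DaFilterName'),
    (1, '2024-02-01', 'H1', 'new', 10, 'DA', 'DaFilterName'),
    (2, '2024-01-15', 'H2', 'x',   20, 'DA', 'DaFilterName');
INSERT INTO dev_class_hub_list VALUES ('H1', 'N'), ('H2', 'Y');
""")

query = """
WITH latest AS (              -- the former inline "m" subquery
    SELECT distribution_id, MAX(event_time) AS event_time
    FROM distributions
    WHERE stream_type = 'DA'
      AND distribution_space = 'DaFilterName'
    GROUP BY distribution_id
)
SELECT DISTINCT r.logistics_id, r.streams_label, r.city_id
FROM distributions r
JOIN latest m
  ON r.distribution_id = m.distribution_id
 AND r.event_time = m.event_time
WHERE r.stream_type = 'DA'
  AND r.distribution_space = 'DaFilterName'
  AND EXISTS (SELECT 1
              FROM dev_class_hub_list AS dchl
              WHERE dchl.is_3p = 'N'
                AND dchl.class_hub = r.logistics_id)
"""

result = conn.execute(query).fetchall()
```

The outer "dataset" wrapper from the original view adds nothing once the join lives at the top level, so the sketch omits it. Whether this is faster than the original is something only EXPLAIN on the real tables can tell.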

How to find which values are not in a column from the list?(SQL) [duplicate]

I have a list of values:
('WEQ7EW', 'QWE7YB', 'FRERH4', 'FEY4B', .....)
and a dist table with a dist_name column.
I need to write a SQL query that returns the values from the list that don't exist in the dist_name column.
You need to use a left join. This requires creating a derived table with the values you care about. Here is typical syntax:
select v.val
from (values ('WEQ7EW'), ('QWE7YB'), ('FRERH4'), ('FEY4B')
) v(val) left join
t
on t.col = v.val
where t.col is null;
Not all databases support the values() table constructor, but all allow some method for creating a derived table. In MySQL, this looks like:
select v.val
from (select 'WEQ7EW' as val union all
select 'QWE7YB' as val union all
select 'FRERH4' as val union all
select 'FEY4B' as val
) v left join
t
on t.col = v.val
where t.col is null;
You would typically put this list of values in a derived table, and then use not exists. In MySQL:
select v.dist_name
from (
select 'WEQ7EW' as dist_name
union all select 'QWE7YB'
union all ...
) v
where not exists (select 1 from dist d where d.dist_name = v.dist_name)
Or if you are running a very recent version (8.0.19 or higher), you can use the VALUES ROW() syntax:
select v.dist_name
from (values row('WEQ7EW'), row('QWE7YB'), ...) v(dist_name)
where not exists (select 1 from dist d where d.dist_name = v.dist_name)
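When the list comes from application code, the derived table can be built with bound parameters instead of pasted literals. The sketch below uses Python's standard-library sqlite3 as a stand-in engine, so the syntax is SQLite's (a CTE over VALUES) rather than MySQL's VALUES ROW(). The helper missing_values and the sample rows in dist are hypothetical.

```python
import sqlite3


def missing_values(conn: sqlite3.Connection, candidates: list[str]) -> list[str]:
    """Return the candidates that do not appear in dist.dist_name."""
    if not candidates:
        return []
    # One "(?)" row per candidate; the values themselves are bound, not inlined.
    placeholders = ", ".join("(?)" for _ in candidates)
    sql = f"""
        WITH v(dist_name) AS (VALUES {placeholders})
        SELECT v.dist_name
        FROM v
        WHERE NOT EXISTS (SELECT 1 FROM dist d WHERE d.dist_name = v.dist_name)
    """
    return [row[0] for row in conn.execute(sql, candidates)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dist (dist_name TEXT)")
conn.executemany("INSERT INTO dist VALUES (?)", [("WEQ7EW",), ("FRERH4",)])

missing = missing_values(conn, ["WEQ7EW", "QWE7YB", "FRERH4", "FEY4B"])
```

With this sample data, the two values absent from dist are QWE7YB and FEY4B. Very long lists may hit the engine's limit on bound parameters, in which case loading them into a temporary table is the usual alternative.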
SELECT TRIM(TRAILING ',' FROM result) result
FROM ( SELECT @tmp:=REPLACE(@tmp, CONCAT(words.word, ','), '') result
FROM words, (SELECT @tmp:='WEQ7EW,QWE7YB,FRERH4,FEY4B,') arg
) perform
ORDER BY LENGTH(result) LIMIT 1;
The list of values to check is passed as a CSV string with a trailing comma and no spaces around the commas ('WEQ7EW,QWE7YB,FRERH4,FEY4B,' in the code shown); every value that exists in the table is removed from it.
If the CSV contains duplicates of a value that exists in the table, all copies are removed, whereas duplicates that are not removed are not compacted. The relative order of the values stays unchanged.
Remember that this query performs a full table scan, so it will be slow on huge tables.

Recursive CTEs to pull BOM (Bill of Material)

I need some assistance or pointers on CTEs.
I am trying to extract a Bill of Material using a recursive CTE query. The query works and pulls all the data. My struggle is that many parts have newer versions at different levels, and I want to grab only the newest versions. Currently my query grabs everything. I have a version column.
I tried a few different things, such as using the max function within the CTE, but I got an error saying GROUP BY and HAVING cannot be part of a recursive CTE.
I also tried using a subquery, but I didn't get the right result.
WITH BOM (
Parent
,Child
,Qty
,Childrev
,LEVEL
,sort
)
AS (
SELECT Parent
,cast(RTRIM(Child) AS NVARCHAR(max))
,Qty
,Childrev
,0 AS LEVEL
,cast(RTRIM(Child) AS NVARCHAR(max))
FROM Bomtable
UNION ALL
SELECT BOM.Parent
,cast(RTRIM(Bomtable.Child) AS NVARCHAR(max))
,Bomtable.Qty
,BOM.Childrev
,LEVEL + 1
,CAST(BOM.Sort + '..... ' + RTRIM(Bomtable.Child) AS NVARCHAR(max))
FROM BOM
INNER JOIN Bomtable ON Bomtable.Parent = BOM.Child
WHERE BOM.Parent = main product
ORDER BY SORT
)
I know I do not fully understand your data model. However, try replacing your BOM and Bomtable tables with a derived table like this, which gives you one row per Child with the greatest Childrev value without using GROUP BY.
SELECT *
FROM (
SELECT *
, ROW_NUMBER() OVER (PARTITION BY Child ORDER BY Childrev DESC) AS ROW_NBR
FROM BOM
) AS x
WHERE x.ROW_NBR = 1;
Here is the documentation for the OVER clause.
Noel
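The ROW_NUMBER filter above can be tried on toy data with Python's standard-library sqlite3 (window functions need SQLite 3.25 or newer). The rows below are invented, the derived table reads straight from Bomtable, and, as in the answer, the partition is by Child only; whether that matches the real data model is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Bomtable (Parent TEXT, Child TEXT, Qty INTEGER, Childrev INTEGER)"
)
conn.executemany(
    "INSERT INTO Bomtable VALUES (?, ?, ?, ?)",
    [
        ("P", "A", 1, 1),  # older revision of A
        ("P", "A", 1, 2),  # newest revision of A
        ("P", "B", 2, 1),
        ("A", "C", 4, 3),  # older revision of C
        ("A", "C", 4, 5),  # newest revision of C
    ],
)

latest = conn.execute("""
    SELECT Parent, Child, Qty, Childrev
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Child ORDER BY Childrev DESC) AS row_nbr
        FROM Bomtable
    ) AS x
    WHERE x.row_nbr = 1
    ORDER BY Child
""").fetchall()
```

Each child keeps only its highest Childrev row. A derived table like this could then take the place of Bomtable inside the recursive CTE, since the ranking happens before the recursion rather than inside it.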

Randomise rows for each group and select N rows with 2 different criteria

(SELECT schemename, message FROM RandomMessagesSet where type = 'ES' ORDER BY RAND())
UNION ALL (SELECT schemename, message FROM RandomMessagesSet where type = 'HE' ORDER BY RAND()) ORDER BY schemename;
This gives the list of all the messages with their scheme names. Is there a way to get 3 messages of type "ES" and 2 of type "HE" for each schemename?
This is not homework but part of a research problem that will feed into designing a user study. I tried using LIMIT and JOIN after looking at most of the posts here, but I am still stuck.
Please help me. Your help would assist me in designing the second-to-last experiment for my PhD.
EDIT: Thanks to the most empathetic person who downvoted this. You should try doing a PhD yourself to get the feel of it.
I'm afraid I cannot provide sample data due to the nature of the research work.
In MySQL 8.0 and above, we can use window functions. We partition over an expression that concatenates schemename and type.
Try:
SELECT dt.schemename,
dt.type,
dt.message
FROM
(
SELECT
schemename,
type,
message,
ROW_NUMBER() OVER (PARTITION BY CONCAT(schemename, '-', type)
ORDER BY RAND()) AS row_num
FROM RandomMessagesSet
) AS dt
WHERE (dt.type = 'ES' AND dt.row_num <= 3) OR
(dt.type = 'HE' AND dt.row_num <= 2)
ORDER BY dt.schemename
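The per-group limits can be checked on generated data with Python's standard-library sqlite3 (SQLite 3.25 or newer). Two things differ from the MySQL query and are substitutions made for this sketch: SQLite's RANDOM() replaces RAND(), and the partition lists schemename and type as two columns instead of concatenating them, which groups the rows the same way. The sample messages are made up.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RandomMessagesSet (schemename TEXT, type TEXT, message TEXT)"
)
# Two schemes, each with 5 'ES' and 4 'HE' messages (invented data).
rows = [
    (scheme, msg_type, f"{scheme}-{msg_type}-{i}")
    for scheme in ("S1", "S2")
    for msg_type, count in (("ES", 5), ("HE", 4))
    for i in range(count)
]
conn.executemany("INSERT INTO RandomMessagesSet VALUES (?, ?, ?)", rows)

sample = conn.execute("""
    SELECT dt.schemename, dt.type, dt.message
    FROM (
        SELECT schemename, type, message,
               ROW_NUMBER() OVER (PARTITION BY schemename, type
                                  ORDER BY RANDOM()) AS row_num
        FROM RandomMessagesSet
    ) AS dt
    WHERE (dt.type = 'ES' AND dt.row_num <= 3)
       OR (dt.type = 'HE' AND dt.row_num <= 2)
    ORDER BY dt.schemename
""").fetchall()

# Count how many messages were kept for each (schemename, type) pair.
per_group = Counter((scheme, msg_type) for scheme, msg_type, _ in sample)
```

Which messages are picked changes on every run, but the counts per group stay at 3 for 'ES' and 2 for 'HE'.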

How to find position of User in query?

I have the following query (I am using MySQL):
session.query(UserModel).order_by(desc(UserModel.age)).all()
I have a user id. How do I find the position of that specific id in the ordered array?
(I could return all rows and iterate, but is there a quicker way to solve this at the database level? It needs to run fast.)
The database must support window functions to do this. The raw SQL query looks like this:
SELECT pos FROM
(SELECT id, row_number() OVER (ORDER BY age DESC) AS pos FROM user) AS sub
WHERE sub.id = :id;
In SQLAlchemy:
from sqlalchemy import func, desc
user_id = 42
sub = (session
.query(
UserModel.id,
func.row_number().over(order_by=desc(UserModel.age)).label('pos'))
.subquery())
pos = session.query(sub.c.pos).filter(sub.c.id==user_id).scalar()
Note that the returned index is 1-based.
Among popular RDBMSs this works in PostgreSQL, Oracle, and MSSQL, and also in MySQL 8.0+ and SQLite 3.25+; older MySQL and SQLite versions lack window functions.
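Where window functions are not available, a counting subquery is a common fallback: the position is one plus the number of users with a greater age. The sketch below uses Python's standard-library sqlite3 with invented rows; the helper position_by_age is hypothetical and is not part of SQLAlchemy. Unlike row_number(), users with equal ages share the same position here.

```python
import sqlite3
from typing import Optional


def position_by_age(conn: sqlite3.Connection, user_id: int) -> Optional[int]:
    """1-based position of user_id when users are ordered by age descending."""
    row = conn.execute(
        """
        SELECT (SELECT COUNT(*) FROM user AS u WHERE u.age > t.age) + 1
        FROM user AS t
        WHERE t.id = ?
        """,
        (user_id,),
    ).fetchone()
    # No row means the id does not exist.
    return row[0] if row is not None else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?)",
    [(1, 30), (2, 45), (3, 22), (42, 38)],
)

pos = position_by_age(conn, 42)
```

With this data the order by age descending is ids 2, 42, 1, 3, so user 42 is in position 2. An index on age lets the count be answered from the index rather than a full scan.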