Calculating unmatching rows in partitioned table in hive - mysql

I have a use case where I have to calculate the unmatched rows (excluding matching records) between two different partitions of a partitioned Hive table.
Let's suppose there is a table called test which is partitioned on the column as_of_date. To get the unmatched rows I tried two options:
1.)
select count(x.item_id)
from
(select coalesce(test_new.item_id, test_old.item_id) as item_id
from
(select item_id from test where as_of_date = '2019-03-10') test_new
full outer join
(select item_id from test where as_of_date = '2019-03-09') test_old
on test_new.item_id = test_old.item_id
where coalesce(test_new.item_id,0) != coalesce(test_old.item_id,0)) as x;
2.) I create a view first and then query it:
create view test_diff as
select coalesce(test_new.item_id, test_old.item_id) as item_id, coalesce(test_new.as_of_date, date_add(test_old.as_of_date, 1)) as as_of_date
from test test_new
full outer join test test_old
on (test_new.item_id = test_old.item_id and date_sub(test_new.as_of_date, 1) = test_old.as_of_date)
where coalesce(test_new.item_id,0) != coalesce(test_old.item_id,0);
Then I run this query:
select count(distinct item_id) from test_diff where as_of_date = '2019-03-10';
The two approaches return different counts; with the second option I get a lower count. Can anyone suggest why the counts differ?

Assuming you have taken care of the test_new/test_old filtering (as_of_date = '2019-03-10') in the second option:
In the first option you count with count(x.item_id), whereas in the second option you use count(distinct item_id). The distinct is most likely what reduces the count in the latter option.
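One way to confirm this is to compute both aggregates over the same derived set and compare them. A minimal sketch, reusing the query from option 1; if the two numbers differ, duplicate item_id values explain the gap:
select count(x.item_id)          as cnt_all,
       count(distinct x.item_id) as cnt_distinct
from
  (select coalesce(test_new.item_id, test_old.item_id) as item_id
   from
     (select item_id from test where as_of_date = '2019-03-10') test_new
   full outer join
     (select item_id from test where as_of_date = '2019-03-09') test_old
   on test_new.item_id = test_old.item_id
   where coalesce(test_new.item_id, 0) != coalesce(test_old.item_id, 0)) as x;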

Related

MYSQL GROUP BY on 2 Tables with MAX

I have two MySQL tables:
record table:
and
race table:
I want to select the records from the first table grouped by id_Race, but keeping only the MAX from the "secs" column.
I tried the following but it didn't work:
$query = "SELECT rec.RecordsID,rec.id_Athlete,rec.date_record,rec.id_Race,rec.placeevent,rec.mins,rec.secs,rec.huns,rec.distance,rec.records_text,r.name,MAX(rec.secs)
FROM records AS rec INNER JOIN race AS r ON r.RaceID=rec.id_Race WHERE (id_Athlete=$u_athlete) GROUP BY rec.id_Race;";
($u_athlete is a variable I get from $_SESSION.)
Can you help me with that?
Thank you.
When you use an aggregate function like MAX and select all fields, you are forced to include every selected field that is not covered by the MAX in the GROUP BY clause.
Instead, you can use a window function like ROW_NUMBER that partitions by id_Race and orders by the secs column in descending order (so that the highest value of secs gets row_number = 1).
Afterwards you can select the rows which have row_number = 1 and the id_Athlete you pass in via the variable.
SELECT
    rec.RecordsID,
    rec.id_Athlete,
    rec.date_record,
    rec.id_Race,
    rec.placeevent,
    rec.mins,
    rec.secs,
    rec.huns,
    rec.distance,
    rec.records_text,
    race.name
FROM
(
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id_Race ORDER BY secs DESC) AS rn
    FROM
        record
) rec
INNER JOIN
    race race
ON
    race.RaceID = rec.id_Race
WHERE
    rec.rn = 1
AND
    rec.id_Athlete = $u_athlete;
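Note that ROW_NUMBER requires MySQL 8.0 or later. On older versions, a hedged alternative (assuming the records and race tables from the question, and ignoring ties on secs) is a correlated subquery that keeps only each race's highest secs value:
SELECT rec.RecordsID, rec.id_Athlete, rec.date_record, rec.id_Race,
       rec.placeevent, rec.mins, rec.secs, rec.huns, rec.distance,
       rec.records_text, r.name
FROM records rec
INNER JOIN race r ON r.RaceID = rec.id_Race
WHERE rec.id_Athlete = $u_athlete
  AND rec.secs = (SELECT MAX(rec2.secs)
                  FROM records rec2
                  WHERE rec2.id_Race = rec.id_Race);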

MySQL SELECT query that counts left joined rows takes too long

Does anyone know how to optimize this query?
SELECT planbook.*,
COUNT(pb_unit_id) AS total_units,
COUNT(pb_lsn_id) AS total_lessons
FROM planbook
LEFT JOIN planbook_unit ON pb_unit_pb_id = pb_id
LEFT JOIN planbook_lesson ON pb_lsn_pb_id = pb_id
WHERE pb_site_id = 1
GROUP BY pb_id
The slow part is getting the total number of matching units and lessons. I have indexes on the following fields (and others):
planbook.pb_id
planbook_unit.pb_unit_pb_id
planbook_lesson.pb_lsn_pb_id
My only objective is to get the total number of matching units and lessons along with the details of each planbook row.
However, this query is taking around 35 seconds. I have 1625 records in planbook, 13,693 records in planbook_unit, and 122,950 records in planbook_lesson.
Any suggestions?
Edit: EXPLAIN results (output not included here)
SELECT planbook.*,
( SELECT COUNT(*) FROM planbook_unit
WHERE pb_unit_pb_id = planbook.pb_id ) AS total_units,
( SELECT COUNT(*) FROM planbook_lesson
WHERE pb_lsn_pb_id = planbook.pb_id ) AS total_lessons
FROM planbook
WHERE pb_site_id = 1
planbook: INDEX(pb_site_id)
planbook_unit: INDEX(pb_unit_pb_id)
planbook_lesson: INDEX(pb_lsn_pb_id)
Looking at your query, you should add an index for
table planbook, column pb_site_id
and possibly a composite one for
table planbook, columns (pb_site_id, pb_id)
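A minimal sketch of the suggested indexes (the index names are illustrative):
CREATE INDEX idx_planbook_site ON planbook (pb_site_id);
-- Composite index covering both the filter column and the join/grouping key:
CREATE INDEX idx_planbook_site_pbid ON planbook (pb_site_id, pb_id);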

merging SQL statements and how can it affect processing time

Let's assume I have the following tables:
items table
item_id|view_count
item_views table
view_id|item_id|ip_address|last_view
What I would like to do is:
If the last view of the item with a given item_id by a given ip_address was an hour or more ago, I would like to increment the item's view_count in the items table, and then get the item's view count as the result. Here is how I would normally do it:
q = SELECT count(*) FROM item_views WHERE item_id='item_id' AND ip_address='some_ip' AND last_view < current_time-60*60
if(q==1) then q = UPDATE items SET view_count = view_count+1 WHERE item_id='item_id'
//and finally get view_count of item
q = SELECT view_count FROM items WHERE item_id='item_id'
Here I used 3 SQL queries. How can I merge them into one SQL query? And how would that affect the processing time? Would it be faster or slower than the previous method?
I don't think your logic is correct for what you say you want. The query:
SELECT count(*)
FROM item_views
WHERE item_id='item_id' AND
ip_address='some_ip' AND
last_view < current_time-60*60
is counting the number of views longer ago than your time frame. I think you want:
last_view > current_time-60*60
and then have if q = 0 on the next line.
MySQL is pretty good with the performance of not exists, so the following should work well:
update items
set view_count = view_count+1
WHERE item_id='item_id' and
not exists (select 1
from item_views
where item_id='item_id' AND
ip_address='some_ip' AND
last_view > current_time-60*60
)
It will work much better with an index on item_views(item_id, ip_address, last_view) and an index on item(item_id).
In MySQL scripting, you could then write:
. . .
set view_count = (@q := view_count+1)
. . .
This would also give you the variable you are looking for.
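Putting those pieces together, a minimal sketch of the whole flow (assuming last_view is a DATETIME column, using NOW() - INTERVAL 1 HOUR for the one-hour window, and treating the literal item_id / ip_address values as placeholders):
SET @q := NULL;

UPDATE items
SET view_count = (@q := view_count + 1)
WHERE item_id = 'item_id'
  AND NOT EXISTS (SELECT 1
                  FROM item_views
                  WHERE item_id = 'item_id'
                    AND ip_address = 'some_ip'
                    AND last_view > NOW() - INTERVAL 1 HOUR);

-- If no row was updated, @q stays NULL, so fall back to the current stored count.
SELECT COALESCE(@q,
                (SELECT view_count FROM items WHERE item_id = 'item_id')) AS view_count;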
update items target
inner join (
    select item_id
    from item_views
    where item_id = 'item_id'
      and ip_address = 'some_ip'
      and last_view < current_time - 60*60
) ref on ref.item_id = target.item_id
set target.view_count = target.view_count + 1;
You can only combine the update statement with the condition by using a join, as in the example above; you'll still need a separate select statement.
It may be slower on a very large data set and/or an unindexed table.

Need Help Speeding up an Aggregate SQLite Query

I have a table defined like the following...
CREATE TABLE actions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    end BOOLEAN,
    type VARCHAR(15) NOT NULL,
    subtype_a VARCHAR(15),
    subtype_b VARCHAR(15)
);
I'm trying to query for the last end action of some type that happened on each unique (subtype_a, subtype_b) pair, similar to a GROUP BY (except SQLite doesn't specify which row a GROUP BY is guaranteed to return).
On an SQLite database of about 1MB, the query I have now can take upwards of two seconds, but I need to speed it up to take under a second (since this will be called frequently).
example query:
SELECT * FROM actions a_out
WHERE id =
(SELECT MAX(a_in.id) FROM actions a_in
WHERE a_out.subtype_a = a_in.subtype_a
AND a_out.subtype_b = a_in.subtype_b
AND a_in.status IS NOT NULL
AND a_in.type = "some_type");
If it helps, I know all the unique possibilities for a (subtype_a,subtype_b)
eg:
(a,1)
(a,2)
(b,3)
(b,4)
(b,5)
(b,6)
Beginning with version 3.7.11, SQLite guarantees which record is returned in a group:
Queries of the form: "SELECT max(x), y FROM table" returns the value of y on the same row that contains the maximum x value.
So greatest-n-per-group can be implemented in a much simpler way:
SELECT *, max(id)
FROM actions
WHERE type = 'some_type'
GROUP BY subtype_a, subtype_b
Is this any faster?
select *
from actions
where id in (select max(id)
             from actions
             where type = 'some_type'
             group by subtype_a, subtype_b);
This is the greatest-n-per-group problem that comes up frequently on StackOverflow.
Here's how I solve it:
SELECT a_out.* FROM actions a_out
LEFT OUTER JOIN actions a_in ON a_out.subtype_a = a_in.subtype_a
AND a_out.subtype_b = a_in.subtype_b
AND a_out.id < a_in.id
WHERE a_out.type = 'some_type' AND a_in.id IS NULL
If you have an index on (type, subtype_a, subtype_b, id) this should run very fast.
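For reference, a minimal sketch of that covering index (the index name is illustrative):
CREATE INDEX idx_actions_type_subtypes_id
    ON actions (type, subtype_a, subtype_b, id);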
See also my answers to similar SQL questions:
Fetch the row which has the Max value for a column
Retrieving the last record in each group
SQL join: selecting the last records in a one-to-many relationship
Or this brilliant article by Jan Kneschke: Groupwise Max.

Why is this SQL query with subquery very slow?

I have this query:
select *
from transaction_batch
where id IN
(
select MAX(id) as id
from transaction_batch
where status_id IN (1,2)
group by status_id
);
The inner query runs very fast (less than 0.1 seconds) to get two IDs, one for status 1 and one for status 2; the outer query then selects based on the primary key, so it is indexed. EXPLAIN says it's scanning 135k rows using where only, and I cannot for the life of me figure out why this is so slow.
The inner query is run separately for every row of your table, over and over again.
As there is no reference to the outer query in the inner query, I suggest you split the two queries and simply insert the results of the inner query into the WHERE clause.
select b.*
from transaction_batch b
inner join (
select max(id) as id
from transaction_batch
where status_id in (1, 2)
group by status_id
) bm on b.id = bm.id
My first post here, sorry about the lack of formatting.
I had a performance problem, shown below:
90 sec: WHERE [Column] LIKE (SELECT [Value] FROM [Table])   -- dynamic, slow
1 sec:  WHERE [Column] LIKE ('A','B','C')                   -- hardcoded, fast
1 sec:  WHERE @CSV LIKE CONCAT('%', [Column], '%')          -- solution, below
I tried joining rather than subquerying.
I also tried a hardcoded CTE.
I lastly tried a temp table.
None of these standard options worked, and I was not willing to use the sp_execute option.
The only solution that worked was:
DECLARE @CSV nvarchar(max) = (SELECT STRING_AGG([Value], ',') FROM [Table]);
-- This yields @CSV = 'A,B,C'
...
WHERE @CSV LIKE CONCAT('%', [Column], '%')