Speed up MySQL query using a split function - mysql

I am trying to speed up a MySQL query.
From a column called "MISC", I first have to extract a "traceID" value that is then used to match rows of another table.
Example of the MISC column:
PFFCC_Strip/fkk49322/PMethod=Diners/CardType=Diners/9999******9999/2010/TraceId=7122910
I am extracting the value "7122910" as traceID and finding the corresponding row with a left join. Since the TraceId value is unique, only one matching row should be present in each table.
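For reference, the extraction expression can be checked on its own; with the sample MISC value above it behaves like this:
SELECT SUBSTRING_INDEX(
         'PFFCC_Strip/fkk49322/PMethod=Diners/CardType=Diners/9999******9999/2010/TraceId=7122910',
         'TraceId=', -1) AS traceID;
-- returns '7122910'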
I cannot set indexes on the tables to speed up the process. Is there any approach that could make this query run faster? As it is, it takes several seconds to run, which is not workable.
select *
from (select TraceID, PP, UDef2, Payment_Method, Approved, TransactionID, Amount
      from pr) pr
left join (select PAYMENT_ID as Payment_ID_omega,
                  TRANSACTION_TYPE,
                  REQUESTED_AMOUNT,
                  AMOUNT,
                  `STATUS` as StatusRef_omega,
                  REQUEST_DATE,
                  Agent,
                  if(locate('TraceId=', MISC) > 0, SUBSTRING_INDEX(MISC, 'TraceId=', -1), '') as traceID
           from BankingActivity) omega
       on pr.TraceID = omega.traceID
having (REQUEST_DATE BETWEEN DATE_ADD(DATE(NOW()), INTERVAL -1 DAY) AND NOW())
ORDER BY pr.TraceID DESC

You can place your filter inside the subquery, before the join; that should make a difference. You should also have indexes on pr(TraceID) and BankingActivity(REQUEST_DATE, traceID). For a more optimised query, please post the execution plan.
select *
from (select TraceID
           , PP
           , UDef2
           , Payment_Method
           , Approved
           , TransactionID
           , Amount
      from pr) pr
left join (select PAYMENT_ID as Payment_ID_omega
                , TRANSACTION_TYPE
                , REQUESTED_AMOUNT
                , AMOUNT
                , `STATUS` as StatusRef_omega
                , REQUEST_DATE
                , Agent
                , if(locate('TraceId=', MISC) > 0, SUBSTRING_INDEX(MISC, 'TraceId=', -1), '') as traceID
           from BankingActivity
           where REQUEST_DATE BETWEEN DATE_ADD(DATE(NOW()), INTERVAL -1 DAY) AND NOW()) omega
       on pr.TraceID = omega.traceID
ORDER BY pr.TraceID DESC
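If adding indexes ever becomes an option, one way to make the join cheap would be an indexed generated column holding the extracted TraceId (a hypothetical sketch, assuming MySQL 5.7+; the column name and the 32-character width are assumptions):
-- Hypothetical: persist the extracted TraceId and index it (MySQL 5.7+)
ALTER TABLE BankingActivity
  ADD COLUMN trace_id VARCHAR(32)
      AS (IF(LOCATE('TraceId=', MISC) > 0,
             SUBSTRING_INDEX(MISC, 'TraceId=', -1), '')) STORED,
  ADD INDEX idx_trace_id (trace_id);
The join could then be written against BankingActivity.trace_id directly, instead of computing the expression for every row on every run.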


Find the nearest date from entered date in SQL, both ways

I have a problem: my task is to find the date nearest to a given date, looking both ways, older or younger. But I have no idea how; I'm new to SQL, and googling didn't help.
create proc Task
(@Date date)
as
begin
    select top(1) p.FirstName, p.LastName, e.BirthDate, e.JobTitle
    from HumanResources.Employee e
    join Person.Person p
      on p.BusinessEntityID = e.BusinessEntityID
    where e.BirthDate > @Date
end
I started something like this, and then got stuck.
Always remember: TOP without ORDER BY doesn't make much sense; add an ORDER BY that is ascending (your BirthDate > @Date comparison asks for all birthdates after the given date, so TOP(1) ordered by birthdate ascending gives the earliest birthdate that is greater than your variable).
Then take the entire query, paste it again, put UNION ALL between the two, and in this second query flip your ORDER BY to descending and your comparison to less-than.
You thus end up with a query that chooses the smallest value greater than and the largest value less than your variable, i.e. the nearest ones to it.
Consider whether you should be using >= and <= if a date that is bang on meets the specification.
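A minimal sketch of that approach (T-SQL, assuming the AdventureWorks-style schema from the question; each half is wrapped in a derived table, since ORDER BY is not allowed directly inside a UNION branch, and the outer TOP (1) reduces the two candidates to the single nearest row):
SELECT TOP (1) u.FirstName, u.LastName, u.BirthDate, u.JobTitle
FROM (
    -- earliest birthdate on or after @Date
    SELECT nxt.*
    FROM (SELECT TOP (1) p.FirstName, p.LastName, e.BirthDate, e.JobTitle
          FROM HumanResources.Employee e
          JOIN Person.Person p ON p.BusinessEntityID = e.BusinessEntityID
          WHERE e.BirthDate >= @Date
          ORDER BY e.BirthDate ASC) nxt
    UNION ALL
    -- latest birthdate before @Date
    SELECT prv.*
    FROM (SELECT TOP (1) p.FirstName, p.LastName, e.BirthDate, e.JobTitle
          FROM HumanResources.Employee e
          JOIN Person.Person p ON p.BusinessEntityID = e.BusinessEntityID
          WHERE e.BirthDate < @Date
          ORDER BY e.BirthDate DESC) prv
) u
ORDER BY ABS(DATEDIFF(day, u.BirthDate, @Date));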
I would not use functions in ORDER BY (the server will not be able to use indexes).
Instead, I'd go for a two-query solution.
It could be wrapped in an SP something like this (MySQL version):
CREATE FUNCTION `Task`(
    `aDate` DATE
)
RETURNS INT
BEGIN
    -- Clear the session variables so values from a previous call cannot leak in
    SET @id_next = NULL, @birthdate_next = NULL, @id_prev = NULL, @birthdate_prev = NULL;
    SELECT
        `BusinessEntityID`
      , `BirthDate`
    INTO
        @id_next
      , @birthdate_next
    FROM
        `Employee`
    WHERE
        `BirthDate` >= aDate
    ORDER BY
        `BirthDate` ASC
    LIMIT
        1
    ;
    IF @birthdate_next IS NULL THEN
        SELECT
            `BusinessEntityID`
          , `BirthDate`
        INTO
            @id_prev
          , @birthdate_prev
        FROM
            `Employee`
        WHERE
            `BirthDate` < aDate
        ORDER BY
            `BirthDate` DESC
        LIMIT
            1
        ;
    ELSE
        IF DATEDIFF(@birthdate_next, aDate) > 1 THEN
            SELECT
                `BusinessEntityID`
              , `BirthDate`
            INTO
                @id_prev
              , @birthdate_prev
            FROM
                `Employee`
            WHERE
                `BirthDate` < aDate
                AND `BirthDate` > DATE_SUB(aDate, INTERVAL DATEDIFF(@birthdate_next, aDate) DAY)
            ORDER BY
                `BirthDate` DESC
            LIMIT
                1
            ;
        END IF;
    END IF;
    CASE
        WHEN @id_prev IS NULL AND @id_next IS NULL THEN RETURN NULL;
        WHEN @id_prev IS NULL THEN RETURN @id_next;
        WHEN @id_next IS NULL THEN RETURN @id_prev;
        WHEN DATEDIFF(@birthdate_next, aDate) < DATEDIFF(aDate, @birthdate_prev) THEN RETURN @id_next;
        ELSE RETURN @id_prev;
    END CASE;
END
So in some cases only a single query (the first one) would be executed.
The query will use an index on BirthDate.
If the first query's difference from the specified date is less than 2 days, the second query will not be executed at all (the second query is more expensive, as it is ordered DESC).
It is possible to further simplify the SP; however, I'm keeping it "as is" so it is easier to understand.
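Calling the function would then look something like this (the input date is just a hypothetical example):
SELECT Task('1980-06-15') AS nearest_business_entity_id;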
Use datediff() to get the duration between the two dates. Since you don't care whether the date is in the future or in the past, use abs() to get the absolute value of the duration. Then order by the absolute duration and take the top one record.
I'm not sure if you're really on MySQL or on SQL Server. The TOP (1) indicates SQL Server, the tag says MySQL.
Here's the MySQL version:
SELECT p.firstname,
p.lastname,
e.birthdate,
e.jobtitle
FROM humanresources.employee e
INNER JOIN person.person p
ON p.businessentityid = e.businessentityid
ORDER BY abs(datediff(e.birthdate, @date))
LIMIT 1;
And here for SQL Server:
SELECT TOP (1)
p.firstname,
p.lastname,
e.birthdate,
e.jobtitle
FROM humanresources.employee e
INNER JOIN person.person p
ON p.businessentityid = e.businessentityid
ORDER BY abs(datediff(day, e.birthdate, @date));
May need some tweaks depending on the actual data types you're using.
Edit:
Addressing fifoniks's concern, here is a version that could perform better if the respective indexes exist (on humanresources.employee.birthdate, optimally once ascending and once descending).
It first gets the union of the nearest record in the future of @date (including @date itself) and the analogous record from the past, hopefully using the indexes along the way; each half is wrapped in a derived table, since ORDER BY is only allowed inside a UNION branch together with TOP in a derived table. From these two records, the one with the lowest absolute duration to @date is picked. Then person gets joined.
SELECT p.firstname,
       p.lastname,
       y.birthdate,
       y.jobtitle
FROM (SELECT TOP (1)
             x.businessentityid,
             x.birthdate,
             x.jobtitle
      FROM (SELECT nxt.*
            FROM (SELECT TOP (1)
                         e.businessentityid,
                         e.birthdate,
                         e.jobtitle
                  FROM humanresources.employee e
                  WHERE e.birthdate >= @date
                  ORDER BY e.birthdate ASC) nxt
            UNION ALL
            SELECT prv.*
            FROM (SELECT TOP (1)
                         e.businessentityid,
                         e.birthdate,
                         e.jobtitle
                  FROM humanresources.employee e
                  WHERE e.birthdate <= @date
                  ORDER BY e.birthdate DESC) prv) x
      ORDER BY abs(datediff(day, x.birthdate, @date)) ASC) y
INNER JOIN person.person p
        ON p.businessentityid = y.businessentityid;

Get Data According to Group by date field

Here is my table. It has a field `type`, where 1 means income and 2 means expense.
Now the requirement: for example, two transactions were made on 2-10-2018, and I want data as follows.
Expected Output
id   created_date   total_amount
1    1-10-18        10
2    2-10-18        20   (sums only the income transactions made on the 2nd)
3    3-10-18        10
and so on...
It should return a new field which contains only the income transactions made on a particular day.
What I have tried is:
SELECT * FROM `transaction` WHERE type = 1 ORDER BY created_date ASC
UNION
SELECT()
-- but it won't work
SELECT created_date,amount,status FROM
(
SELECT COUNT(amount) AS totalTrans FROM transaction WHERE created_date = created_date
) x
transaction
You can also see the schema here: http://sqlfiddle.com/#!9/6983b9
You can Count() the total number of expense transactions using the conditional function If() on a group of created_date.
Similarly, you can Sum() the amount of expenses using If() on a created_date.
Try the following:
SELECT
`created_date`,
SUM(IF (`type` = 2, `amount`, 0)) AS total_expense_amount,
COUNT(IF (`type` = 2, `id`, NULL)) AS expense_count
FROM
`transaction`
GROUP BY `created_date`
ORDER BY `created_date` ASC
Do you just want a WHERE clause?
SELECT t.created_date, SUM(amount) as total_amount
FROM transaction t
WHERE type = 2
GROUP BY t.created_date
ORDER BY created_date ASC ;

Optimizing cohort analysis on Google BigQuery

I'm attempting to perform a cohort analysis on a very large table. I have a test table with ~30M rows (over double that in production). The query fails in BigQuery with "resources exceeded...", and it's a tier 18 query (tier 1 is $5, so it's a $90 query!).
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
From my understanding of BigQuery, the joins are causing a major performance hit because each BigQuery node needs to process them. The table is partitioned by day, which I'm not yet making use of in this query, but I know it will still need to be optimized.
How can this query be optimized or exclude the use of joins to allow BigQuery to process more efficiently in parallel?
Step #1
Try below
I moved the JOIN on cohort_active_user_count outside the inner SELECT, as I think it is one of the main reasons the query is expensive. And as you can see, it uses JOIN instead of LEFT JOIN for that one, as LEFT is not needed here.
Please test and let us know the result.
WITH cohort_active_user_count AS (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE.bot_id AS bot_id
FROM BQ_TABLE
LEFT JOIN BQ_TABLE AS future_message
ON BQ_TABLE.bot_user_id = future_message.bot_user_id
AND BQ_TABLE.created_at < future_message.created_at
AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
AND BQ_TABLE.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
Step #2
The "further optimization" below is based on the assumption that your BQ_TABLE is raw data with multiple entries for the same user_id/bot_id on the same day, which makes the LEFT JOIN in the inner SELECT much more expensive.
I propose aggregating that first, as is done below. In addition to drastically reducing the size of the JOIN, it also eliminates all those conversions from TIMESTAMP to DATE on each joined row.
WITH BQ_TABLE_AGG AS (
SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
FROM BQ_TABLE
GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
SELECT
created_at,
COUNT(DISTINCT bot_user_id) AS COUNT,
bot_id AS bot_id
FROM BQ_TABLE_AGG
GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
cohort_size.count AS active_users, retained_users,
retained_users / cohort_size.count AS retention, t.bot_id
FROM (
SELECT
BQ_TABLE_AGG.created_at AS created_at,
DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
BQ_TABLE_AGG.bot_id AS bot_id
FROM BQ_TABLE_AGG
LEFT JOIN BQ_TABLE_AGG AS future_message
ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
AND BQ_TABLE_AGG.created_at < future_message.created_at
AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
AND BQ_TABLE_AGG.bot_id = future_message.bot_id
GROUP BY 1, 2, bot_id
HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size
ON t.created_at = cohort_size.created_at
AND t.bot_id = cohort_size.bot_id
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id
If you don't want to enable a higher billing tier given the costs, here are a couple of suggestions that might help to reduce the CPU requirements:
Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but then again you won't get unmatched rows like you would with LEFT JOINs.
Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it's less CPU-intensive and may be "good enough" depending on your needs.
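For instance, the distinct-user count in the cohort CTE could be swapped for the approximate version (a sketch in BigQuery standard SQL; the result is an estimate rather than an exact count):
SELECT
  DATE(created_at, '-05:00') AS created_at,
  APPROX_COUNT_DISTINCT(bot_user_id) AS count,  -- estimate; cheaper than COUNT(DISTINCT ...)
  bot_id
FROM `BQ_TABLE`
GROUP BY created_at, bot_id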
You could also consider manually breaking the query into stages of computation, e.g. write the WITH clause statement to a table, then use that in the subsequent query. I don't know what the specific cost tradeoffs would be, though.
Why is it tagged MySQL?
In MySQL, I would change
max(cohort_size.count) as active_users, -- all equal in group
to
( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,
and remove the JOIN to that table. Without doing this, you risk inflating the COUNT(...) values!
Also move the division to get retention into the outside query.
Once you have done that, you can also move the other JOIN into a subquery:
( SELECT count(distinct future_message.bot_user_id)
FROM ... WHERE ... ) as retained_users,
I would have these indexes. Note that created_at needs to be last.
cohort_active_user_count: INDEX(bot_id, created_at)
future_message: INDEX(bot_id, bot_user_id, created_at)

SQL query for counting multiple strings with one output

I have a database containing certain strings, such as '{TICKER|IBM}', to which I will refer as ticker-strings. My target is to count the number of ticker-strings per day, for multiple strings.
My database table 'tweets' includes the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh:mm:ss) and 'processed_text'. The ticker-strings, such as '{TICKER|IBM}', are within the 'processed_text' column.
At the moment, I have a working SQL query for counting one ticker-string (thanks to the help of other Stackoverflow-ers). What I would like is a SQL query in which I can count multiple strings (besides '{TICKER|IBM}', also '{TICKER|GOOG}' and '{TICKER|BAC}', for instance).
The working SQL query for counting one ticker-string is as follows:
SELECT d.date, IFNULL(t.count, 0) AS tweet_count
FROM all_dates AS d
LEFT JOIN (
SELECT COUNT(DISTINCT tweet_id) AS count, DATE(created_at) AS date
FROM tweets
WHERE processed_text LIKE '%{TICKER|IBM}%'
GROUP BY date) AS t
ON d.date = t.date
The eventual output should thus give a column with the date, a column with {TICKER|IBM}, a column with {TICKER|GOOG} and one with {TICKER|BAC}.
I was wondering whether this is possible and whether you have a solution for it? I have more than 100 different ticker-strings. Of course, doing them one by one is an option, but a very time-consuming one.
If I understand correctly, you can do this with conditional aggregation:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG, coalesce(BAC, 0) AS BAC
FROM all_dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
I'd return the specified resultset like this, adding expressions to the SELECT list for each "ticker" I want returned as a separate column:
SELECT d.date
, IFNULL(SUM(t.processed_text LIKE '%{TICKER|IBM}%' ),0) AS `cnt_ibm`
, IFNULL(SUM(t.processed_text LIKE '%{TICKER|GOOG}%'),0) AS `cnt_goog`
, IFNULL(SUM(t.processed_text LIKE '%{TICKER|BAC}%' ),0) AS `cnt_bac`
, IFNULL(SUM(t.processed_text LIKE '%{TICKER|...}%' ),0) AS `cnt_...`
FROM all_dates d
LEFT
JOIN tweets t
ON t.created_at >= d.date
AND t.created_at < d.date + INTERVAL 1 DAY
GROUP BY d.date
NOTES: The expressions within the SUM aggregates above are evaluated as booleans, so they return 1 (if true), 0 (if false), or NULL. I'd avoid wrapping the created_at column in a DATE() function and use a range scan instead, especially if a predicate is added (WHERE clause) that restricts the values of `date` being returned from `all_dates`.
As an alternative, expressions like this will return an equivalent result:
, SUM(IF(t.processed_text LIKE '%{TICKER|IBM}%' ,1,0)) AS `cnt_ibm`

MySQL Query: How to get values per category?

I have a huge table with millions of records that stores stock values by timestamp. The structure is as below:
Stock, timestamp, value
goog,1112345,200.4
goog,112346,220.4
Apple,112343,505
Apple,112346,550
I would like to query this table by timestamp. If the timestamp matches, all corresponding stock records should be returned; if there is no record for a stock at that timestamp, the immediately previous one should be returned. In the example above, if I query by timestamp=1112345 then the query should return 2 records:
goog,1112345,200.4
Apple,112343,505 (immediately previous record)
I have tried several different ways to write this query but with no success, and I'm sure I'm missing something. Can someone help, please?
SELECT `Stock`, `timestamp`, `value`
FROM `myTable`
WHERE `timestamp` = 1112345
UNION ALL
(SELECT `Stock`, `timestamp`, `value`
 FROM `myTable`
 WHERE `timestamp` < 1112345
 ORDER BY `timestamp` DESC
 LIMIT 1)
Something like select Stock, timestamp, value from thisTbl where timestamp = ?, filling in the timestamp with whatever it should be? Your demo query is available on this fiddle.
I don't think there is an easy way to do this query. Here is one approach:
select tprev.*
from (select s.stock,
             (select timestamp
              from t
              where t.stock = s.stock and timestamp <= <whatever>
              order by timestamp desc
              limit 1
             ) as prevtimestamp
      from (select distinct stock
            from t
           ) s
     ) s join
     t tprev
     on tprev.stock = s.stock and tprev.timestamp = s.prevtimestamp
This gets the previous-or-equal timestamp for each stock and then joins back to the table. If you have an index on (stock, timestamp), this may be rather fast.
Another phrasing of it uses group by:
select tprev.*
from (select t.stock,
             max(timestamp) as prevtimestamp
      from t
      where timestamp <= YOURTIMESTAMP
      group by t.stock
     ) s join
     t tprev
     on tprev.stock = s.stock and tprev.timestamp = s.prevtimestamp
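For the sample data in the question, plugging the example timestamp into that second form might look like this (a sketch, assuming the myTable name from the first answer):
SELECT tprev.*
FROM (SELECT stock,
             MAX(`timestamp`) AS prevtimestamp
      FROM myTable
      WHERE `timestamp` <= 1112345
      GROUP BY stock
     ) s
JOIN myTable tprev
  ON tprev.stock = s.stock
 AND tprev.`timestamp` = s.prevtimestamp;
-- expected rows: (goog, 1112345, 200.4) and (Apple, 112343, 505)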