Subquery with max value in a big table SQL - mysql

I'm trying to write a query to get the start date of a person's last work experience and also the date they left the company (in some cases that value is null because the person is still working at the company).
I have something like:
SELECT r.idcurriculum, r.startdate, r.lastdate FROM (
SELECT idcurriculum, max(startdate) as startdate
FROM workexperience
GROUP BY idcurriculum) as s
INNER JOIN workexperience r on (r.idcurriculum = s.idcurriculum)
The structure should come out something like this:
idcurriculum | startdate | lastdate
1234 | 2010-05-01| null
2532 | 2005-10-01| 2010-02-28
5234 | 2011-07-01| 2013-10-31
1025 | 2012-04-01| 2014-03-31
I tried running that query but I had to stop it because it was taking too long. The workexperience table weighs approximately 20GB. I don't know if the query is wrong; I only let it run for 10 minutes.
Help will be much appreciated.

You might try rephrasing the query as:
select we.*
from workexperience we
where not exists (select 1
from workexperience we2
where we2.idcurriculum = we.idcurriculum and
we2.startdate > we.startdate
);
Important: for performance reasons you need a composite index on idcurriculum, startdate:
create index idx_workexperience_idcurriculum_startdate on workexperience(idcurriculum, startdate);
The logic of the query is: "Get me all rows from workexperience where there is no row for the same idcurriculum that has a larger startdate". That is a fancy way of saying "get me the maximum".
With the group by, MySQL has to do an aggregation, which would typically involve sorting the data -- expensive on 20 Gbytes. With this method, it can look up the results using the index, which should be faster.
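To sanity-check the logic on a small scale, here is a minimal sketch that runs the NOT EXISTS query against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), and the older rows for 2532 and 5234 are invented for illustration. It says nothing about performance on a 20GB table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workexperience (
        idcurriculum INTEGER,
        startdate    TEXT,
        lastdate     TEXT
    );
    -- Composite index so the correlated lookup can be answered from the index.
    CREATE INDEX idx_workexperience_idcurriculum_startdate
        ON workexperience (idcurriculum, startdate);
""")
conn.executemany(
    "INSERT INTO workexperience VALUES (?, ?, ?)",
    [
        (1234, "2010-05-01", None),
        (2532, "2001-03-01", "2005-09-30"),  # older job, invented for the demo
        (2532, "2005-10-01", "2010-02-28"),
        (5234, "2008-01-01", "2011-06-30"),  # older job, invented for the demo
        (5234, "2011-07-01", "2013-10-31"),
    ],
)

# Keep a row only if no later startdate exists for the same idcurriculum.
latest = conn.execute("""
    SELECT we.idcurriculum, we.startdate, we.lastdate
    FROM workexperience we
    WHERE NOT EXISTS (
        SELECT 1
        FROM workexperience we2
        WHERE we2.idcurriculum = we.idcurriculum
          AND we2.startdate > we.startdate
    )
    ORDER BY we.idcurriculum
""").fetchall()

for row in latest:
    print(row)
```

Only the most recent job per idcurriculum should survive, with lastdate left as NULL (None in Python) for someone still employed.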

As an alternative to Gordon's answer you could also write the query as:
SELECT we.*
FROM workexperience we
LEFT JOIN workexperience we2
ON we2.idcurriculum = we.idcurriculum
AND we2.startdate > we.startdate
WHERE we2.idcurriculum IS NULL;
Note, however, that if several rows in a group share the maximum startdate, all of them are returned.
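The tie behaviour is easy to reproduce. The sketch below, again using sqlite3 as an assumed stand-in for MySQL and with invented rows for idcurriculum 1025, has two jobs starting on the same latest day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workexperience (idcurriculum INTEGER, startdate TEXT, lastdate TEXT)"
)
conn.executemany(
    "INSERT INTO workexperience VALUES (?, ?, ?)",
    [
        (1025, "2012-04-01", "2014-03-31"),
        (1025, "2012-04-01", None),          # second job starting the same day
        (1025, "2009-01-01", "2012-03-31"),
    ],
)

# Anti-join: keep rows for which no later startdate exists in the same group.
tied = conn.execute("""
    SELECT we.idcurriculum, we.startdate, we.lastdate
    FROM workexperience we
    LEFT JOIN workexperience we2
           ON we2.idcurriculum = we.idcurriculum
          AND we2.startdate > we.startdate
    WHERE we2.idcurriculum IS NULL
""").fetchall()

print(len(tied))  # both rows with the tied maximum startdate come back
```

If exactly one row per person is required, a tie-breaker on some unique column would be needed. The question does not show such a column, so that part is left open here.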

Related

Filter large number of records on mysql when using INNER JOIN with two fields

I'm working on an existing database with millions of inserts per day. The database design itself is pretty bad, and filtering records from it takes a huge amount of time. We are in the process of moving this to an ELK cluster, but in the meantime I have to filter some records for immediate use.
I have two tables like this
table - log_1
datetime | id | name | ip
2017-01-01 01:01:00 | 12345 | sam | 192.168.100.100
table - log_2
datetime | mobile | id
2017-01-01 01:01:00 | 999999999 | 12345
I need to filter my data using ip from log_1 and datetime on both log_1 and log_2. To do that I use the query below:
SELECT log_1.datetime, log_1.id, log_1.name, log_1.ip, log_2.datetime, log_2.mobile, log_2.id
FROM log_1
INNER JOIN log_2
ON log_1.id = log_2.id AND log_1.datetime = log_2.datetime
where log_1.ip = '192.168.100.100'
limit 100
Needless to say, this takes forever to retrieve results with such a large number of records. Is there a better way to do the same thing without waiting so long for MySQL to respond? In other words, how can I optimize my query against such a large database?
The database is not production; it's just for analytics.
First of all, your current LIMIT clause is fairly meaningless, because the query has no ORDER BY clause. It is not clear which 100 records you want to retain. So, you might want to use something like this:
SELECT
l1.datetime,
l1.id,
l1.name,
l1.ip,
l2.datetime,
l2.mobile,
l2.id
FROM log_1 l1
INNER JOIN log_2 l2
ON l1.id = l2.id AND l1.datetime = l2.datetime
WHERE
l1.ip = '192.168.100.100'
ORDER BY
l1.datetime DESC
LIMIT 100;
This would return the 100 most recent matching records. As for speeding up this query, one way to at least make the join faster would be to add the following index on the log_2 table:
CREATE INDEX idx ON log_2 (datetime, id, mobile);
Assuming MySQL chooses to use this index, it should make the join much faster, because each id and datetime value can be looked up in a B-tree instead of doing a full scan of the entire table. Note that the index also covers the mobile column, which is needed in the select.
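As a rough illustration of the ordered query and the covering index, here is a small sketch using Python's sqlite3 module in place of MySQL (an assumption for the sake of a runnable example). Every row except the first one in each table is invented. The printed plan is SQLite's, not MySQL's, so treat it only as a hint of what to look for in MySQL's own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE log_1 (datetime TEXT, id INTEGER, name TEXT, ip TEXT);
    CREATE TABLE log_2 (datetime TEXT, mobile TEXT, id INTEGER);
    -- Covers the join columns plus mobile, as suggested above.
    CREATE INDEX idx ON log_2 (datetime, id, mobile);
""")
conn.executemany("INSERT INTO log_1 VALUES (?, ?, ?, ?)", [
    ("2017-01-01 01:01:00", 12345, "sam", "192.168.100.100"),
    ("2017-01-02 09:30:00", 12345, "sam", "192.168.100.100"),  # invented
    ("2017-01-02 09:30:00", 67890, "ann", "10.0.0.7"),         # invented
])
conn.executemany("INSERT INTO log_2 VALUES (?, ?, ?)", [
    ("2017-01-01 01:01:00", "999999999", 12345),
    ("2017-01-02 09:30:00", "999999999", 12345),  # invented
    ("2017-01-02 09:30:00", "888888888", 67890),  # invented
])

query = """
    SELECT l1.datetime, l1.id, l1.name, l1.ip, l2.mobile
    FROM log_1 l1
    INNER JOIN log_2 l2
            ON l1.id = l2.id AND l1.datetime = l2.datetime
    WHERE l1.ip = ?
    ORDER BY l1.datetime DESC
    LIMIT 100
"""
rows = conn.execute(query, ("192.168.100.100",)).fetchall()
for row in rows:
    print(row)

# Show which access path SQLite picks; the exact wording varies by version.
for step in conn.execute("EXPLAIN QUERY PLAN " + query, ("192.168.100.100",)):
    print(step)
```

The ORDER BY makes the LIMIT deterministic: the newest matching rows come first.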
Can you try this:
1. Create an index on the id column of both tables if one does not already exist (this will take time).
2. Create two temp tables, log_1_tmp and log_2_tmp, and fill them as below. Note that log_2 has no ip column, so its rows are selected by joining to the already filtered log_1_tmp:
Query 1 - insert into log_1_tmp select * from log_1 where log_1.ip = '192.168.100.100'
Query 2 - insert into log_2_tmp select l2.* from log_2 l2 inner join log_1_tmp t on l2.id = t.id and l2.datetime = t.datetime
3. Run your query on the two temp tables; the where condition can then be removed from the query.
See if this works.

Moving average query MS Access

I am trying to calculate the moving average of my data. I have googled and found many examples on this site and others but am still stumped. For each selected record, I need to calculate the average of the previous 5 flows for that record's product.
My Table looks like the following:
TMDT Prod Flow
8/21/2017 12:01:00 AM A 100
8/20/2017 11:30:45 PM A 150
8/20/2017 10:00:15 PM A 200
8/19/2017 5:00:00 AM B 600
8/17/2017 12:00:00 AM A 300
8/16/2017 11:00:00 AM A 200
8/15/2017 10:00:31 AM A 50
I have been trying the following query:
SELECT b.TMDT, b.Flow, (SELECT AVG(Flow) as MovingAVG
FROM(SELECT TOP 5 *
FROM [mytable] a
WHERE Prod="A" AND [a.TMDT]< b.TMDT
ORDER BY a.TMDT DESC))
FROM mytable AS b;
When I try to run this query I get an input prompt for b.TMDT. Why is b.TMDT not being pulled from mytable?
Should I be using a different method altogether to calculate my moving averages?
I would like to add that I started with another method that works but is extremely slow. It runs fast enough for tables with 100 records or less. However, if the table has more than 100 records it feels like the query comes to a screeching halt.
Original method below.
I created two queries for each product code (There are 15 products): Q_ProdA_Rank and Q_ProdA_MovAvg
Q_ProdA_RanK (T_ProdA is a table with Product A's information):
SELECT a.TMDT, a.Flow, (Select count(*) from [T_ProdA]
where TMDT<=a.TMDT) AS Rank
FROM [T_ProdA] AS a
ORDER BY a.TMDT DESC;
Q_ProdA_MovAvg
SELECT b.TMDT, b.Flow, Round((Select sum(Flow) from [Q_PRodA_Rank] where
Rank between b.Rank-1 and (b.Rank-5))/IIf([Rank]<5,Rank-1,5),0) AS
MovingAvg
FROM [Q_ProdA_Rank] AS b;
The problem is that you're using a nested subquery, and as far as I know (can't find the right site for the documentation at the moment), variable scope in subqueries is limited to the direct parent of the subquery. This means that for your nested query, b.TMDT is outside of the variable scope.
Edit: As this is an interesting problem, and a properly asked question, here is the full SQL answer. It's somewhat more complex than your attempt, but should run more efficiently.
It contains a nested subquery that first lists the 5 previous flows per TMDT and Prod, then averages them, and then joins that result back to the actual query.
SELECT A.TMDT, A.Prod, B.MovingAverage
FROM MyTable AS A LEFT JOIN (
SELECT JoinKeys.TMDT, JoinKeys.Prod, Avg(Top5.Flow) As MovingAverage
FROM (
SELECT JoinKeys.TMDT, JoinKeys.Prod, Top5.Flow
FROM MyTable As JoinKeys INNER JOIN MyTable AS Top5 ON JoinKeys.Prod = Top5.Prod
WHERE Top5.TMDT In (
SELECT TOP 5 A.TMDT FROM MyTable As A WHERE JoinKeys.Prod = A.Prod AND A.TMDT < JoinKeys.TMDT ORDER BY A.TMDT DESC
)
)
GROUP BY JoinKeys.TMDT, JoinKeys.Prod
) AS B
ON A.Prod = B.JoinKeys.Prod AND A.TMDT = B.JoinKeys.TMDT
While in my previous version I advocated a VBA approach, this is probably more efficient, only more difficult to write and adjust.
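For checking the query's output on a small sample, a plain reference calculation can help. The sketch below is not Access SQL or VBA. It is a Python cross-check on the rows from the question, and it assumes that a row with fewer than 5 earlier flows averages whatever is available and that a row with none gets no value. The function name moving_averages is made up for this example.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (TMDT, Prod, Flow).
rows = [
    ("8/21/2017 12:01:00 AM", "A", 100),
    ("8/20/2017 11:30:45 PM", "A", 150),
    ("8/20/2017 10:00:15 PM", "A", 200),
    ("8/19/2017 5:00:00 AM", "B", 600),
    ("8/17/2017 12:00:00 AM", "A", 300),
    ("8/16/2017 11:00:00 AM", "A", 200),
    ("8/15/2017 10:00:31 AM", "A", 50),
]

def moving_averages(rows, window=5):
    """Average of up to `window` earlier flows per product; None if there are none."""
    parsed = sorted(
        (datetime.strptime(ts, "%m/%d/%Y %I:%M:%S %p"), prod, flow)
        for ts, prod, flow in rows
    )
    history = defaultdict(list)  # prod -> flows seen so far, oldest first
    result = {}
    for ts, prod, flow in parsed:
        previous = history[prod][-window:]
        result[(ts, prod)] = sum(previous) / len(previous) if previous else None
        history[prod].append(flow)
    return result

averages = moving_averages(rows)
for (ts, prod), avg in sorted(averages.items()):
    print(ts, prod, avg)
```

Comparing these numbers with the query result for a handful of rows is a quick way to confirm that the TOP 5 subquery picks the most recent earlier records rather than the oldest ones.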

Retrieve rows that have a first entry in 2014 in MySQL

I want to retrieve all rows from a table that have their first entry on or after 01/01/2014 but no later than 31/12/2014
Example of the table:
OID FK_OID Treatment Trt_DATE
1 100 19304 2011-05-24
2 100 19304 2011-08-01
3 100 19306 2014-03-05
4 200 19305 2012-02-02
5 300 19308 2014-01-20
6 400 19308 2014-06-06
For example, I would like to pull all entries that STARTED treatment in 2014. So above I would like to extract FK_OIDs 300 and 400 because their first entry is in 2014, but I would like to omit FK_OID 100 because they have 2 entries prior to 2014.
How do I go about this? I can extract all entries within a date range, but that brings back all entries for that date range and doesn't omit anyone who has an entry prior to the start of the range. It just returns their first entry in 2014.
For those who need to see that I have tried something, see below.
I am not an experienced coder and this is the best I can manage with the knowledge I have.
SELECT
mod,
(select NHSNum from person p
WHERE
p.oid = t.fk_oid) as 'NHS'
FROM
timeline t
Where trt_date BETWEEN '2014-01-01' AND '2014-12-31'
ORDER BY trt_date ASC
This returns every treatment for 2014 regardless of whether it is the first ever one for that person. I want to omit anyone from this list who has had treatment before 01/01/2014 as well as only return the first treatment per person. For example, this code returns all treatments for all people in 2014. I only want their first one and only if it is their first one ever.
Thanks.
create table aThing
( oid int auto_increment primary key,
fk_oid int not null,
treatment int not null,
trt_date date not null
);
insert aThing (fk_oid,treatment,trt_date) values
(100, 19304, '2011-05-24'),
(100, 19304, '2011-08-01'),
(100, 19306, '2014-03-05'),
(200, 19305, '2012-02-02'),
(300, 19308, '2014-01-20'),
(400, 19308, '2014-06-06');
select fk_oid,dt
from
( select fk_oid,min(trt_date) as dt
from aThing
group by fk_oid
) xDerived
where year(dt)=2014;
+--------+------------+
| fk_oid | dt |
+--------+------------+
| 300 | 2014-01-20 |
| 400 | 2014-06-06 |
+--------+------------+
The inner part, the nested one, becomes a derived table and is given the name xDerived. This means that even though it is just a result set, making it a derived table lets it be referred to by name. So it is not a physical table, but a derived one, or virtual one.
That derived table is a very simple group by with an aggregate function. It says: for every fk_oid, bring back one row and only one row, with its minimum value for trt_date.
So if you have 10 million rows in the table called aThing, but only 17 distinct values for fk_oid, it will return only 17 rows, each row being the minimum trt_date for its fk_oid.
Once that is achieved, the outer wrapper says: just show me those two columns, but with a year check. The reason the wrapper is needed takes a little explaining, so bear with me.
The short version: I had to get the min into an alias, and that alias is only usable in a where clause once it has been resolved inside a derived table.
An alias of an aggregate column, such as dt, is not available in the where clause of the same query. By wrapping the query in a derived table, the alias becomes an ordinary column of that derived table, and the outer query can then filter on it.
So I can't access it directly in the where clause of its own query, but when I wrap it in an envelope (a derived table), I can access it from the outside.
I may try to explain it better later, but I would have to show the alternative attempts and the syntax errors that would result.
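To make the point about aggregates and the where clause concrete, here is a small sketch that uses Python's sqlite3 module as a stand-in for MySQL (an assumption, so the exact error text differs). It shows that filtering on the aggregate in WHERE fails, that the derived-table form works, and that a HAVING clause is another standard way to filter on an aggregate. The HAVING variant is not part of the answer above and is added only for comparison. A date range replaces year(dt)=2014 because SQLite has no year() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE aThing (
        oid       INTEGER PRIMARY KEY AUTOINCREMENT,
        fk_oid    INTEGER NOT NULL,
        treatment INTEGER NOT NULL,
        trt_date  TEXT    NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO aThing (fk_oid, treatment, trt_date) VALUES (?, ?, ?)",
    [
        (100, 19304, "2011-05-24"),
        (100, 19304, "2011-08-01"),
        (100, 19306, "2014-03-05"),
        (200, 19305, "2012-02-02"),
        (300, 19308, "2014-01-20"),
        (400, 19308, "2014-06-06"),
    ],
)

# An aggregate cannot be used in WHERE, which is evaluated before grouping.
try:
    conn.execute(
        "SELECT fk_oid FROM aThing WHERE MIN(trt_date) >= '2014-01-01' GROUP BY fk_oid"
    )
    where_failed = False
except sqlite3.Error:
    where_failed = True

# Derived table: the aggregate becomes a plain column the outer WHERE can use.
derived = conn.execute("""
    SELECT fk_oid, dt
    FROM (SELECT fk_oid, MIN(trt_date) AS dt FROM aThing GROUP BY fk_oid) AS xDerived
    WHERE dt BETWEEN '2014-01-01' AND '2014-12-31'
    ORDER BY fk_oid
""").fetchall()

# HAVING: filters groups after aggregation, with no wrapper query.
having = conn.execute("""
    SELECT fk_oid, MIN(trt_date) AS dt
    FROM aThing
    GROUP BY fk_oid
    HAVING MIN(trt_date) BETWEEN '2014-01-01' AND '2014-12-31'
    ORDER BY fk_oid
""").fetchall()

print(where_failed, derived, having)
```

Both working forms should return fk_oid 300 and 400 with their first treatment dates.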
There's probably a more elegant solution, but this seems to satisfy the requirement...
SELECT x.*
FROM my_table x
JOIN
( SELECT fk_oid
, MIN(trt_date) min_date
FROM my_table
GROUP
BY fk_oid
HAVING min_date >= '2014-01-01'
) a
ON a.fk_oid = x.fk_oid
LEFT
JOIN my_table b
ON b.fk_oid = a.fk_oid
AND b.trt_date > '2014-12-31'
WHERE b.oid IS NULL;
Having a few more years of experience with this, I decided to revisit it. The solution I now use regularly is:
SELECT t1.column1, t1.column2
FROM MyTable AS t1
LEFT OUTER JOIN MyTable AS t2
ON t1.fkoid = t2.fkoid
AND (t1.date > t2.date
OR (t1.date = t2.date AND t1.oid > t2.oId))
WHERE t2.fkoid IS NULL AND t1.date >= '2014-01-01' AND t1.date <= '2014-12-31'

Mixing HAVING with CASE OR Analytic functions in MySQL (PartitionQualify(?

I have a SELECT query that returns some fields like this:
Date | Campaign_Name | Type | Count_People
Oct | Cats | 1 | 500
Oct | Cats | 2 | 50
Oct | Dogs | 1 | 80
Oct | Dogs | 2 | 50
The query uses aggregation and I only want to include results where when Type = 1 then ensure that the corresponding Count_People is greater than 99.
Using the example table, I'd like to have two rows returned: Cats. Where Dogs is type 1 it's excluded because it's below 100, in this case where Dogs = 2 should be excluded also.
Put another way, if type = 1 is less than 100 then remove all records of the corresponding campaign name.
I started out trying this:
HAVING CASE WHEN type = 1 THEN COUNT(DISTINCT Count_People) > 99 END
I used Teradata earlier in the year and remember working on a query that used an analytic function, "Qualify PartitionBy". I suspect something along those lines is what I need? Do I need to base the exclusion on an aggregation before the main query is run?
How would I do this in MySQL? Am I making sense?
Now that I understand the question, I think your best bet will be a subquery to determine which date/campaign combinations of a type=1 have a count_people greater than 99.
SELECT
<table>.date,
<table>.campaign_name,
<table>.type,
count(distinct count_people) as count_people
FROM
(
SELECT
date,
campaign_name
FROM
<table>
WHERE type=1
GROUP BY 1,2
HAVING count(distinct count_people) > 99
) type1
LEFT OUTER JOIN <table> ON
type1.campaign_name = <table>.campaign_name AND
type1.date = <table>.date
WHERE <table>.type IN (1,2)
GROUP BY 1,2,3
The subquery here only returns campaign/date combinations where type=1 AND the count_people is greater than 99. It uses a LEFT JOIN back to the table to ensure that only those campaign/date combinations make it into the result set.
The WHERE on the main query keeps the results to only Types 1 and 2, which you stated was already a filter in place (though not mentioned in the question, it was stated in a comment to a previous answer).
Based on your comments on the answer by @JNevill, I think you will have no option but to use subselects to pre-filter the record set you are dealing with. HAVING limits you to the group currently being evaluated, so there is no way to compare against other groups in the set this way.
So have a look at something like this:
SELECT
full_data.date AS date,
full_data.campaign_name AS campaign_name,
full_data.type AS type,
full_data.people_count AS people_count
FROM
(
SELECT
date,
campaign_name,
type,
COUNT(people) AS people_count
FROM table
WHERE type IN (1,2)
GROUP BY date, campaign_name, type
) AS full_data
LEFT JOIN
(
SELECT
date,
campaign_name,
COUNT(people) AS people_count
FROM table
WHERE type = 1
GROUP BY date, campaign_name
HAVING people_count < 100
) AS filter
ON
full_data.date = filter.date
AND full_data.campaign_name = filter.campaign_name
WHERE
filter.date IS NULL
AND filter.campaign_name IS NULL
The first subselect is basically your current query without any attempt at using HAVING to filter out results. The second subselect finds all date/campaign name combos whose type 1 rows have people_count < 100; those are then used as a filter against the full data set.
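Here is a minimal sketch of that anti-join pattern, run with Python's sqlite3 module instead of MySQL (an assumption for the sake of a runnable example). The raw table contacts, its one-row-per-person layout, and the helper add_people are hypothetical, since the question only shows the aggregated output. The alias small_type1 replaces filter to avoid clashing with a keyword in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical raw table: one row per person reached by a campaign.
conn.execute(
    "CREATE TABLE contacts (date TEXT, campaign_name TEXT, type INTEGER, people INTEGER)"
)

def add_people(campaign, type_, count):
    """Insert `count` distinct people for one campaign/type in October."""
    conn.executemany(
        "INSERT INTO contacts VALUES ('Oct', ?, ?, ?)",
        [(campaign, type_, person) for person in range(count)],
    )

add_people("Cats", 1, 500)
add_people("Cats", 2, 50)
add_people("Dogs", 1, 80)
add_people("Dogs", 2, 50)

kept = conn.execute("""
    SELECT full_data.date, full_data.campaign_name,
           full_data.type, full_data.people_count
    FROM (
        SELECT date, campaign_name, type, COUNT(people) AS people_count
        FROM contacts
        WHERE type IN (1, 2)
        GROUP BY date, campaign_name, type
    ) AS full_data
    LEFT JOIN (
        -- Campaigns whose type 1 count is too small; these get excluded.
        SELECT date, campaign_name
        FROM contacts
        WHERE type = 1
        GROUP BY date, campaign_name
        HAVING COUNT(people) < 100
    ) AS small_type1
        ON full_data.date = small_type1.date
       AND full_data.campaign_name = small_type1.campaign_name
    WHERE small_type1.date IS NULL
    ORDER BY full_data.campaign_name, full_data.type
""").fetchall()

for row in kept:
    print(row)
```

With the sample numbers from the question, both Cats rows should remain and both Dogs rows should be dropped, because the Dogs type 1 count is below 100.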

count rows where date is equal but separated by name

I think it will be easiest to start with the table I have and the result I am aiming for.
Name | Date
A | 03/01/2012
A | 03/01/2012
B | 02/01/2012
A | 02/01/2012
B | 02/01/2012
A | 02/01/2012
B | 01/01/2012
B | 01/01/2012
A | 01/01/2012
I want the result of my query to be:
Name | 01/01/2012 | 02/01/2012 | 03/01/2012
A | 1 | 2 | 2
B | 2 | 2 | 0
So basically I want to count the number of rows that have the same date, but for each individual name. So a simple group by of dates won't do because it would merge the names together. And then I want to output a table that shows the counts for each individual date using php.
I've seen answers suggest something like this:
SELECT
NAME,
SUM(CASE WHEN GRADE = 1 THEN 1 ELSE 0 END) AS GRADE1,
SUM(CASE WHEN GRADE = 2 THEN 1 ELSE 0 END) AS GRADE2,
SUM(CASE WHEN GRADE = 3 THEN 1 ELSE 0 END) AS GRADE3
FROM Rodzaj
GROUP BY NAME
so I imagine there would be a way for me to tweak that but I was wondering if there is another way, or is that the most efficient?
I was thinking that perhaps the while loop could output just one specific name and date each time along with the count, so the first result would be A,01/01/2012,1 then A,02/01/2012,2 - A,03/01/2012,2 - B,01/01/2012,2 and so on. Perhaps that would be doable through a different technique, but I'm not sure if something like that is possible or whether it would be efficient.
So I'm basically looking to see if anyone has any ideas that are a bit outside the box for this and how they would compare.
I hope I explained everything well enough and thanks in advance for any help.
You have to include two columns in your GROUP BY:
SELECT name, date, COUNT(*) AS count
FROM your_table
GROUP BY name, date
This will get the counts of each name -> date combination in row-format. Since you also wanted to include a 0 count if the name didn't have any rows on a certain date, you can use:
SELECT a.name,
b.date,
COUNT(c.name) AS date_count
FROM (SELECT DISTINCT name FROM your_table) a
CROSS JOIN (SELECT DISTINCT date FROM your_table) b
LEFT JOIN your_table c ON a.name = c.name AND
b.date = c.date
GROUP BY a.name,
b.date
You're asking for a "pivot". Basically, it is what it is. The real problem with a pivot is that the column names must adapt to the data, which is impossible to do with SQL alone.
Here's how you do it:
SELECT
Name,
SUM(`Date` = '01/01/2012') AS `01/01/2012`,
SUM(`Date` = '02/01/2012') AS `02/01/2012`,
SUM(`Date` = '03/01/2012') AS `03/01/2012`
FROM mytable
GROUP BY Name
Note the cool way you can SUM() a condition in MySQL: because true is 1 and false is 0 in MySQL, summing a condition is equivalent to counting the number of times it is true.
It is not more efficient to use an inner group by first.
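Since the column list has to adapt to the data, the pivot query is usually generated by application code. The sketch below shows one way to do that, using Python and sqlite3 rather than PHP and MySQL (both substitutions are assumptions made to keep the example self-contained). It reads the distinct dates first and then builds one SUM(condition) column per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (Name TEXT, Date TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ("A", "03/01/2012"), ("A", "03/01/2012"), ("B", "02/01/2012"),
    ("A", "02/01/2012"), ("B", "02/01/2012"), ("A", "02/01/2012"),
    ("B", "01/01/2012"), ("B", "01/01/2012"), ("A", "01/01/2012"),
])

# Step 1: discover the column headings from the data.
dates = [d for (d,) in conn.execute("SELECT DISTINCT Date FROM mytable ORDER BY Date")]

# Step 2: build one SUM(condition) column per date. The compared values are
# bound as parameters; the alias is quoted because it contains slashes.
# This assumes the date strings contain no double quotes.
columns = ", ".join(f'SUM(Date = ?) AS "{d}"' for d in dates)
sql = f"SELECT Name, {columns} FROM mytable GROUP BY Name ORDER BY Name"

pivot = conn.execute(sql, dates).fetchall()
for row in pivot:
    print(row)
```

Sorting works here only because the sample strings happen to sort in date order. With a real DATE column the ordering would be reliable.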
Just in case anyone is interested in what was the best method:
Zane's second suggestion was the slowest, I loaded in a third of the data I did for the other two and it took quite a while. Perhaps on smaller tables it would be more efficient, and although I am not working with a huge table roughly 28,000 rows was enough to create significant lag, with the between clause dropping the result to about 4000 rows.
Bohemian's answer gave me the least amount to code, I threw in a loop to create all the case statements and it worked with relative ease. The benefit of this method was the simplicity, besides creating the loop for the cases, the results come in without the need for any php tricks, just simple foreach to get all the columns. Recommended for those not confident with php.
However, I found Zane's first suggestion the quickest performing and despite the need for extra php coding it seems I will be sticking with this method. The disadvantage of this method is that it only gives the dates that actually have data, so creating a table with all the dates becomes a bit more complicated. What I did was create a variable that keeps track of what date it is supposed to be compared to the table column which is reset on each table row, when the result of the query is equal to that date it echoes the value otherwise it does a while loop echoing table cells with 0 until the dates do match. It also had to do a check to see if the 'Name' value is still the same and if not it would switch to the next row after filling in any missing cells with 0 to the end of that row. If anyone is interested in seeing the code you can message me.
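The zero-filling step can be sketched without the date-tracking variable by using a dictionary lookup instead. This is a different technique from the PHP loop described above, written in Python for illustration only. The function name fill_missing is made up, and it assumes the full list of dates is known up front.

```python
def fill_missing(rows, all_dates):
    """Turn (name, date, count) rows into one list of counts per name,
    using 0 wherever a name has no row for a date."""
    table = {}
    for name, date, count in rows:
        # Start every new name with a zero for each expected date.
        table.setdefault(name, dict.fromkeys(all_dates, 0))[date] = count
    return {name: [counts[d] for d in all_dates] for name, counts in table.items()}

# Output shape of the double GROUP BY query: only combinations that exist.
grouped = [
    ("A", "01/01/2012", 1), ("A", "02/01/2012", 2), ("A", "03/01/2012", 2),
    ("B", "01/01/2012", 2), ("B", "02/01/2012", 2),
]
all_dates = ["01/01/2012", "02/01/2012", "03/01/2012"]

grid = fill_missing(grouped, all_dates)
for name, counts in grid.items():
    print(name, counts)
```

Each list lines up with all_dates, so rendering a table row becomes a simple loop over the counts.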
Results of the two methods over 3 months of data (a column for each day, so roughly 90 case statements), ~12,000 rows out of 28,000:
Bohemian's Pivot - ~0.158s (highest seen ~0.36s)
Zane's Double Group by - ~0.086s (highest seen ~0.15s)