MySQL select only new records

How to write a MySQL query to achieve this task?
Table: writers
w_id w_name
---------------
1 Michael
2 Samantha
3 John
---------------
Table: articles
a_id w_id timestamp a_name
----------------------------------------
1 1 0000000001 PHP programming
2 3 0000000003 Other programming languages
3 3 0000000005 Another article
4 2 0000000015 Web design
5 1 0000000020 MySQL
----------------------------------------
I need to SELECT only those writers whose first article was published no earlier than 0000000005 (only writers who have published at least one article can be selected).
In this example the result would be:
2 Samantha
SQL code can be tested here http://sqlfiddle.com/#!2/7a308

Untested, but close:
SELECT w.w_id, w.w_name, MIN(a.timestamp) as min_time
from writers w
JOIN articles a on w.w_id = a.w_id
GROUP BY w.w_id, w.w_name
HAVING min_time >= 5

Here's one approach, using an inline view (or "derived table" as MySQL calls it) to get the earliest timestamp for each writer:
SELECT w.w_id
, w.w_name
-- , e.earliest_timestamp
FROM writers w
LEFT
JOIN ( SELECT a.w_id
, MIN(a.timestamp) AS earliest_timestamp
FROM articles a
GROUP BY a.w_id
) e
ON e.w_id = w.w_id
WHERE e.earliest_timestamp >= '0000000005'
ORDER BY w.w_id
This may not be the most efficient approach, but you can run just the query in the inline view (aliased as e) to see what it returns. We can then reference the result set from that query like we do a table (with some restrictions.)
(Other approaches can make better use of suitable indexes.)
I'm unclear on the datatype of the timestamp column. The SQL above assumes it's a character datatype (the comparison with '0000000005' then relies on every value being zero-padded to the same width). If it's integer rather than character, the WHERE clause could look like this:
WHERE e.earliest_timestamp >= 5
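To check the derived-table approach against the sample data from the question, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so it runs without a server), stores timestamp as an integer, and uses a plain JOIN because the WHERE clause discards writers without articles anyway. The variable name first_article_cutoff is made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE writers  (w_id INTEGER PRIMARY KEY, w_name TEXT);
    CREATE TABLE articles (a_id INTEGER PRIMARY KEY, w_id INTEGER,
                           timestamp INTEGER, a_name TEXT);
    INSERT INTO writers VALUES (1, 'Michael'), (2, 'Samantha'), (3, 'John');
    INSERT INTO articles VALUES
        (1, 1,  1, 'PHP programming'),
        (2, 3,  3, 'Other programming languages'),
        (3, 3,  5, 'Another article'),
        (4, 2, 15, 'Web design'),
        (5, 1, 20, 'MySQL');
""")

first_article_cutoff = 5  # hypothetical name; "not earlier than 5"

# Derived table e holds the earliest timestamp per writer; the outer
# query keeps only writers whose earliest article is at or after the cutoff.
rows = conn.execute("""
    SELECT w.w_id, w.w_name
      FROM writers w
      JOIN ( SELECT a.w_id, MIN(a.timestamp) AS earliest_timestamp
               FROM articles a
              GROUP BY a.w_id
           ) e
        ON e.w_id = w.w_id
     WHERE e.earliest_timestamp >= ?
     ORDER BY w.w_id
""", (first_article_cutoff,)).fetchall()

print(rows)  # [(2, 'Samantha')]
```

John is excluded because his first article (timestamp 3) predates the cutoff, even though a later one (timestamp 5) does not.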

Related

How to make an inner join while maintaining unique rows

I have a ternary relationship in which I establish the relation between Offers, Profiles, and Skills. The ternary relationship table, called ternary for example, has the IDs of the three tables as primary key. It could look something like this:
id_Offer - id_Profile - id_Skill
1 - 1 - 1
1 - 1 - 2
1 - 1 - 3
1 - 2 - 1
2 - 1 - 1
2 - 3 - 2
2 - 1 - 3
2 - 5 - 1
[and so on; there would be more rows for each id_Offer from Offer, but I want to limit the example]
So I have 2 offers in total, with a number of profiles in each one.
The table Offer looks something like this:
Offer - business_name
1 - business-1
2 - business-1
3 - business-1
4 - business-1
5 - business-2
6 - business-2
7 - business-2
8 - business-3
So when I do a query like
select distinct id_offer, business_name, COUNT(*)
FROM Offer
GROUP BY business_name
Order by COUNT(*);
I get that for business-1 I have 4 offers.
Now if I want to take into account the offers for some Profile, I have to make a join with my ternary relationship. But even if I do something as simple as the following
select distinct business_name
from Offer
INNER JOIN ternary ON Offer.id_Offer = ternary.id_Offer
WHERE business_name = 'business-1'
GROUP BY business_name
No matter what I put in the GROUP BY, or whether I write DISTINCT or not, I do not get what I want. In reality business-1 has 4 offers, but right now only two of them appear in ternary, so the query should return 2 unique offers for this name when no profile filter is applied.
Instead I get 8 offers, because that is how many rows in ternary match those id_Offer values.
How should this be done? If I need no filters I can simply look at the Offer table alone. But what if I need to filter by id_skill or id_Profile AND want to return the business_name?
I have seen solutions such as this, but I cannot make them work. I do not understand what the ? is or what it is called, so I cannot look it up or check whether MariaDB handles it the same way. When I try to build that query for my data I get:
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '? ORDER BY COUNT(*) DESC' at line 1
But as I said, it is kind of hard to look for '?' as an... Operator? Function?
There are two basic solutions.
SELECT
o.business_name,
COUNT(DISTINCT o.id_offer) AS unique_offers
FROM
Offer AS o
INNER JOIN
ternary AS t
ON t.id_Offer = o.id_Offer
WHERE
o.business_name = 'business-1'
AND t.id_profile IN (1, 2, 3, 5)
GROUP BY
o.business_name
That's the simplest to write and think about. But it can also be quite intensive, because you're still joining each row in offer to 4 rows in ternary, creating 8 rows to aggregate and process through DISTINCT.
The "better" (in my opinion) route is to filter then aggregate the ternary table in a sub-query.
SELECT
o.business_name,
COUNT(*) AS unique_offers
FROM
Offer AS o
INNER JOIN
(
SELECT id_Offer
FROM ternary
WHERE id_profile IN (1, 2, 3, 5)
GROUP BY id_Offer
)
AS t
ON t.id_Offer = o.id_Offer
WHERE
o.business_name = 'business-1'
GROUP BY
o.business_name
This ensures that t only ever has one row for any given offer. This in turn means that each row in offer only ever joins to one row in t; no duplication. That in turn means there is no need to use COUNT(DISTINCT), which relieves some overhead (by moving it to the inner query's GROUP BY).
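As a rough check of both approaches on the question's sample data, here is a sketch using Python's sqlite3 module in place of MariaDB (an assumption, chosen so the example runs anywhere). It also touches on the ? from the question: that character is a parameter placeholder for prepared statements, filled in by the client library at execution time, which is why typing it literally into the command-line client produces a syntax error. The variable names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MariaDB (assumption)
conn.executescript("""
    CREATE TABLE Offer (id_Offer INTEGER PRIMARY KEY, business_name TEXT);
    CREATE TABLE ternary (id_Offer INTEGER, id_Profile INTEGER, id_Skill INTEGER,
                          PRIMARY KEY (id_Offer, id_Profile, id_Skill));
    INSERT INTO Offer VALUES
        (1, 'business-1'), (2, 'business-1'), (3, 'business-1'), (4, 'business-1'),
        (5, 'business-2'), (6, 'business-2'), (7, 'business-2'), (8, 'business-3');
    INSERT INTO ternary VALUES
        (1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1),
        (2, 1, 1), (2, 3, 2), (2, 1, 3), (2, 5, 1);
""")

business = ("business-1",)  # value bound to the ? placeholder in each query

# Plain join: one output row per matching ternary row, so offers repeat.
joined_rows = conn.execute("""
    SELECT COUNT(*) FROM Offer AS o
      JOIN ternary AS t ON t.id_Offer = o.id_Offer
     WHERE o.business_name = ?
""", business).fetchone()[0]

# Approach 1: join first, then collapse duplicates with COUNT(DISTINCT).
count_distinct = conn.execute("""
    SELECT COUNT(DISTINCT o.id_Offer) FROM Offer AS o
      JOIN ternary AS t ON t.id_Offer = o.id_Offer
     WHERE o.business_name = ? AND t.id_Profile IN (1, 2, 3, 5)
""", business).fetchone()[0]

# Approach 2: reduce ternary to one row per offer before joining.
pre_grouped = conn.execute("""
    SELECT COUNT(*) FROM Offer AS o
      JOIN ( SELECT id_Offer FROM ternary
              WHERE id_Profile IN (1, 2, 3, 5)
              GROUP BY id_Offer ) AS t
        ON t.id_Offer = o.id_Offer
     WHERE o.business_name = ?
""", business).fetchone()[0]

print(joined_rows, count_distinct, pre_grouped)  # 8 2 2
```

The plain join reproduces the 8 rows from the question, while both suggested queries report the 2 offers that actually appear in ternary.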
Are you saying that you want to see offers for a particular business, but you want to limit these according to certain profiles or skills?
We limit query results in the WHERE clause. If we want to look up data in another table, we use IN or EXISTS. For instance:
select *
from offer
where business_name = 'business-1'
and id_offer in
(
select id_offer
from ternary
where id_profile = 1
and id_skill = 2
);

Mixing HAVING with CASE or analytic functions in MySQL (QUALIFY / PARTITION BY?)

I have a SELECT query that returns some fields like this:
Date | Campaign_Name | Type | Count_People
Oct | Cats | 1 | 500
Oct | Cats | 2 | 50
Oct | Dogs | 1 | 80
Oct | Dogs | 2 | 50
The query uses aggregation, and I only want to include a campaign when the Count_People of its Type = 1 row is greater than 99.
Using the example table, I'd like two rows returned: both Cats rows. The Dogs type 1 row is excluded because it's below 100, and in that case the Dogs type 2 row should be excluded as well.
Put another way, if type = 1 is less than 100 then remove all records of the corresponding campaign name.
I started out trying this:
HAVING CASE WHEN type = 1 THEN COUNT(DISTINCT Count_People) > 99 END
I used Teradata earlier in the year and remember working on a query that used the analytic construct "QUALIFY ... PARTITION BY". I suspect something along those lines is what I need? Do I need to base the exclusion on an aggregation computed before the main query runs?
How would I do this in MySQL? Am I making sense?
Now that I understand the question, I think your best bet will be a subquery to determine which date/campaign combinations of a type=1 have a count_people greater than 99.
SELECT
<table>.date,
<table>.campaign_name,
<table>.type,
count(distinct count_people) as count_people
FROM
(
SELECT
date,
campaign_name
FROM
<table>
WHERE type=1
GROUP BY 1,2
HAVING count(distinct count_people) > 99
) type1
LEFT OUTER JOIN <table> ON
type1.campaign_name = <table>.campaign_name AND
type1.date = <table>.date
WHERE <table>.type IN (1,2)
GROUP BY 1,2,3
The subquery here only returns campaign/date combinations where type=1 and count_people is greater than 99. It uses a LEFT JOIN back to the table to ensure that only those campaign/date combinations make it into the result set.
The WHERE on the main query keeps the results to only Types 1 and 2, which you stated was already a filter in place (though not mentioned in the question, it was stated in a comment to a previous answer).
Based on your comments on the answer by @JNevill, I think you will have no option but to use subselects to pre-filter the record set you are dealing with. HAVING limits you to the group currently being evaluated; there is no way to compare against other groups in the set in this manner.
So have a look at something like this:
SELECT
full_data.date AS date,
full_data.campaign_name AS campaign_name,
full_data.type AS type,
full_data.people_count AS people_count
FROM
(
SELECT
date,
campaign_name,
type,
COUNT(people) AS people_count
FROM table
WHERE type IN (1,2)
GROUP BY date, campaign_name, type
) AS full_data
LEFT JOIN
(
SELECT
date,
campaign_name,
COUNT(people) AS people_count
FROM table
WHERE type = 1
GROUP BY date, campaign_name
HAVING people_count < 100
) AS filter
ON
full_data.date = filter.date
AND full_data.campaign_name = filter.campaign_name
WHERE
filter.date IS NULL
AND filter.campaign_name IS NULL
The first subselect is basically your current query without any attempt at using HAVING to filter out results. The second subselect finds all date/campaign name combos whose type = 1 people_count is below 100; the LEFT JOIN combined with the IS NULL checks then removes those combos from the full data set.
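Here is a small runnable sketch of that exclusion pattern (LEFT JOIN plus IS NULL, often called an anti-join). It runs on Python's sqlite3 module rather than MySQL, and the raw table responses, with one row per person, is a made-up stand-in for the unnamed source table. The row counts are generated to match the numbers in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.execute(
    "CREATE TABLE responses (date TEXT, campaign_name TEXT, type INTEGER, people INTEGER)"
)

# Hypothetical raw data: one row per person, sized to match the question's counts.
sizes = {("Cats", 1): 500, ("Cats", 2): 50, ("Dogs", 1): 80, ("Dogs", 2): 50}
for (campaign, type_), n in sizes.items():
    conn.executemany(
        "INSERT INTO responses VALUES ('Oct', ?, ?, ?)",
        [(campaign, type_, person) for person in range(n)],
    )

# full_data: the plain aggregation. too_small: campaigns whose type 1
# count is under 100. Rows of full_data with a match in too_small are dropped.
rows = conn.execute("""
    SELECT full_data.date, full_data.campaign_name,
           full_data.type, full_data.people_count
      FROM ( SELECT date, campaign_name, type, COUNT(people) AS people_count
               FROM responses
              WHERE type IN (1, 2)
              GROUP BY date, campaign_name, type
           ) AS full_data
      LEFT JOIN
           ( SELECT date, campaign_name
               FROM responses
              WHERE type = 1
              GROUP BY date, campaign_name
             HAVING COUNT(people) < 100
           ) AS too_small
        ON full_data.date = too_small.date
       AND full_data.campaign_name = too_small.campaign_name
     WHERE too_small.campaign_name IS NULL
     ORDER BY full_data.campaign_name, full_data.type
""").fetchall()

print(rows)  # [('Oct', 'Cats', 1, 500), ('Oct', 'Cats', 2, 50)]
```

Both Dogs rows disappear, including the type 2 row, because the exclusion is keyed on date and campaign name rather than on type.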

Subquery with max value in a big table SQL

I'm trying to make a query to get the date of last work experience of a person and also the date they left the company (in some cases that value is null because the person is still working on the company).
I have something like:
SELECT r.idcurriculum, r.startdate, r.lastdate FROM (
SELECT idcurriculum, max(startdate) as startdate
FROM workexperience
GROUP BY idcurriculum) as s
INNER JOIN workexperience r on (r.idcurriculum = s.idcurriculum)
The structure should come out something like this:
idcurriculum | startdate | lastdate
1234 | 2010-05-01| null
2532 | 2005-10-01| 2010-02-28
5234 | 2011-07-01| 2013-10-31
1025 | 2012-04-01| 2014-03-31
I tried running that query but I had to stop it because it was taking too long. The workexperience table weighs approx. 20 GB. I don't know if the query is wrong; I only ran it for 10 minutes.
Help will be much appreciated.
You might try rephrasing the query as:
select we.*
from workexperience we
where not exists (select 1
from workexperience we2
where we2.idcurriculum = we.idcurriculum and
we2.startdate > we.startdate
);
Important: for performance reasons you need a composite index on idcurriculum, startdate:
create index idx_workexperience_idcurriculum_startdate on workexperience(idcurriculum, startdate);
The logic of the query is: "Get me all rows from workexperience where there is no row for the same idcurriculum that has a larger startdate". That is a fancy way of saying "get me the maximum".
With the group by, MySQL has to do an aggregation, which would typically involve sorting the data -- expensive on 20 Gbytes. With this method, it can look up the results using the index, which should be faster.
As an alternative to Gordon's answer you could also write the query as:
SELECT we.*
FROM workexperience we
LEFT JOIN workexperience we2
ON we2.idcurriculum = we.idcurriculum
AND we2.startdate > we.startdate
WHERE we2.idcurriculum IS NULL;
You can run into problems, however, when several rows in a group share the maximum startdate: each of them is returned.
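To compare the two formulations side by side, here is a sketch on a tiny made-up sample, using Python's sqlite3 module instead of MySQL (an assumption; it says nothing about performance on a 20 GB table, only that both queries select the same rows). The older job rows are invented so that each idcurriculum has more than one entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE workexperience (idcurriculum INTEGER, startdate TEXT, lastdate TEXT);
    CREATE INDEX idx_workexperience_idcurriculum_startdate
        ON workexperience (idcurriculum, startdate);
    INSERT INTO workexperience VALUES
        (1234, '2008-01-01', '2010-04-30'),  -- invented earlier job
        (1234, '2010-05-01', NULL),
        (2532, '2001-03-01', '2005-09-30'),  -- invented earlier job
        (2532, '2005-10-01', '2010-02-28');
""")

# Keep a row only if no other row for the same person starts later.
not_exists_rows = conn.execute("""
    SELECT we.idcurriculum, we.startdate, we.lastdate
      FROM workexperience we
     WHERE NOT EXISTS ( SELECT 1 FROM workexperience we2
                         WHERE we2.idcurriculum = we.idcurriculum
                           AND we2.startdate > we.startdate )
     ORDER BY we.idcurriculum
""").fetchall()

# Same idea as a self-join: rows with no later partner have NULL in we2.
left_join_rows = conn.execute("""
    SELECT we.idcurriculum, we.startdate, we.lastdate
      FROM workexperience we
      LEFT JOIN workexperience we2
        ON we2.idcurriculum = we.idcurriculum
       AND we2.startdate > we.startdate
     WHERE we2.idcurriculum IS NULL
     ORDER BY we.idcurriculum
""").fetchall()

print(not_exists_rows)
# [(1234, '2010-05-01', None), (2532, '2005-10-01', '2010-02-28')]
```

The dates are stored as ISO-formatted text here, which compares correctly as strings; in MySQL a DATE column would be the natural choice.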

MySQL - Combining two select statements into one result with LIMIT efficiently

For a dating application, I have a few tables that I need to query for a single output with a LIMIT 10 of both queries combined. It seems difficult to do at the moment, even though it's not an issue to query them separately, but the LIMIT 10 won't work as the numbers are not exact (ex. not LIMIT 5 and LIMIT 5, one query may return 0 rows, while the other 10, depending on the scenario).
members table
member_id | member_name
------------------------
1 Herb
2 Karen
3 Megan
dating_requests
request_id | member1 | member2 | request_time
----------------------------------------------------
1 1 2 2012-12-21 12:51:45
dating_alerts
alert_id | alerter_id | alertee_id | type | alert_time
-------------------------------------------------------
5 3 2 platonic 2012-12-21 10:25:32
dating_alerts_status
status_id | alert_id | alertee_id | viewed | viewed_time
-----------------------------------------------------------
4 5 2 0 0000-00-00 00:00:00
Imagine you are Karen and just logged in, you should see these 2 items:
1. Herb requested a date with you.
2. Megan wants a platonic relationship with you.
In one query with a LIMIT of 10. Instead here are two queries that need to be combined:
1. Herb requested a date with you.
-> query = "SELECT dr.request_id, dr.member1, dr.member2, m.member_name
FROM dating_requests dr
JOIN members m ON dr.member1=m.member_id
WHERE dr.member2=:loggedin_id
ORDER BY dr.request_time LIMIT 5";
2. Megan wants a platonic relationship with you.
-> query = "SELECT da.alert_id, da.alerter_id, da.alertee_id, da.type,
da.alert_time, m.member_name
FROM dating_alerts da
JOIN dating_alerts_status das ON da.alert_id=das.alert_id
AND da.alertee_id=das.alertee_id
JOIN members m ON da.alerter_id=m.member_id
WHERE da.alertee_id=:loggedin_id AND da.type='platonic'
AND das.viewed='0' AND das.viewed_time<da.alert_time
ORDER BY da.alert_time LIMIT 5";
Again, sometimes both tables may be empty, or 1 table may be empty, or both full (where LIMIT 10 kicks in) and ordered by time. Any ideas on how to get a query to perform this task efficiently? Thoughts, advice, chimes, optimizations are welcome.
You can combine multiple queries with UNION, but only if the queries have the same number of columns. Ideally the columns are the same, not only in data type, but also in their semantic meaning; however, MySQL doesn't care about the semantics and will handle differing datatypes by casting up to something more generic - so if necessary you could overload the columns to have different meanings from each table, then determine what meaning is appropriate in your higher level code (although I don't recommend doing it this way).
When the number of columns differs, or when you want to achieve a better/less overloaded alignment of data from two queries, you can insert dummy literal columns into your SELECT statements. For example:
SELECT t.cola, t.colb, NULL, t.colc, NULL FROM t;
You could even have some columns reserved for the first table and others for the second table, such that they are NULL elsewhere (but remember that the column names come from the first query, so you may wish to ensure they're all named there):
SELECT a, b, c, d, NULL AS e, NULL AS f, NULL AS g FROM t1
UNION ALL -- specify ALL because default is DISTINCT, which is wasted here
SELECT NULL, NULL, NULL, NULL, a, b, c FROM t2;
You could try aligning your two queries in this fashion, then combining them with a UNION operator; by applying LIMIT to the UNION, you're close to achieving your goal:
(SELECT ...)
UNION
(SELECT ...)
LIMIT 10;
The only issue that remains is that, as presented above, 10 or more records from the first table will "push out" any records from the second. However, we can utilise an ORDER BY in the outer query to solve this.
Putting it all together:
(
SELECT
dr.request_time AS event_time, m.member_name, -- shared columns
dr.request_id, dr.member1, dr.member2, -- request-only columns
NULL AS alert_id, NULL AS alerter_id, -- alert-only columns
NULL AS alertee_id, NULL AS type
FROM dating_requests dr JOIN members m ON dr.member1=m.member_id
WHERE dr.member2=:loggedin_id
ORDER BY event_time LIMIT 10 -- save ourselves performing excessive UNION
) UNION ALL (
SELECT
da.alert_time AS event_time, m.member_name, -- shared columns
NULL, NULL, NULL, -- request-only columns
da.alert_id, da.alerter_id, da.alertee_id, da.type -- alert-only columns
FROM
dating_alerts da
JOIN dating_alerts_status das USING (alert_id, alertee_id)
JOIN members m ON da.alerter_id=m.member_id
WHERE
da.alertee_id=:loggedin_id
AND da.type='platonic'
AND das.viewed='0'
AND das.viewed_time<da.alert_time
ORDER BY event_time LIMIT 10 -- save ourselves performing excessive UNION
)
ORDER BY event_time
LIMIT 10;
Of course, now it's up to you to determine what type of row you're dealing with as you read each record in the resultset (suggest you test request_id and/or alert_id for NULL values; alternatively one could add an additional column to the results that explicitly states from which table each record originated, but it should be equivalent provided those id columns are NOT NULL).
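The combined query can be tried on the question's sample rows with the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, which forces one change: SQLite does not accept parenthesised SELECTs with their own ORDER BY and LIMIT as UNION members, so each branch is wrapped in a subquery instead. The source and item_id columns are made-up names illustrating the "explicit origin column" variant mentioned above, and the column list is shortened for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL (assumption)
conn.executescript("""
    CREATE TABLE members (member_id INTEGER PRIMARY KEY, member_name TEXT);
    CREATE TABLE dating_requests (request_id INTEGER PRIMARY KEY, member1 INTEGER,
                                  member2 INTEGER, request_time TEXT);
    CREATE TABLE dating_alerts (alert_id INTEGER PRIMARY KEY, alerter_id INTEGER,
                                alertee_id INTEGER, type TEXT, alert_time TEXT);
    CREATE TABLE dating_alerts_status (status_id INTEGER PRIMARY KEY, alert_id INTEGER,
                                       alertee_id INTEGER, viewed INTEGER,
                                       viewed_time TEXT);
    INSERT INTO members VALUES (1, 'Herb'), (2, 'Karen'), (3, 'Megan');
    INSERT INTO dating_requests VALUES (1, 1, 2, '2012-12-21 12:51:45');
    INSERT INTO dating_alerts VALUES (5, 3, 2, 'platonic', '2012-12-21 10:25:32');
    INSERT INTO dating_alerts_status VALUES (4, 5, 2, 0, '0000-00-00 00:00:00');
""")

# Each branch is limited to 10 rows on its own, then the union is
# ordered and limited again so neither source can crowd out the other.
rows = conn.execute("""
    SELECT event_time, member_name, source, item_id, type FROM (
        SELECT dr.request_time AS event_time, m.member_name,
               'request' AS source, dr.request_id AS item_id, NULL AS type
          FROM dating_requests dr
          JOIN members m ON dr.member1 = m.member_id
         WHERE dr.member2 = :loggedin_id
         ORDER BY dr.request_time LIMIT 10
    )
    UNION ALL
    SELECT event_time, member_name, source, item_id, type FROM (
        SELECT da.alert_time AS event_time, m.member_name,
               'alert' AS source, da.alert_id AS item_id, da.type AS type
          FROM dating_alerts da
          JOIN dating_alerts_status das
            ON das.alert_id = da.alert_id AND das.alertee_id = da.alertee_id
          JOIN members m ON da.alerter_id = m.member_id
         WHERE da.alertee_id = :loggedin_id
           AND da.type = 'platonic'
           AND das.viewed = 0
           AND das.viewed_time < da.alert_time
         ORDER BY da.alert_time LIMIT 10
    )
    ORDER BY event_time
    LIMIT 10
""", {"loggedin_id": 2}).fetchall()  # 2 = Karen

for row in rows:
    print(row)
```

For Karen this yields Megan's alert first and Herb's request second, since the outer ORDER BY sorts both sources by event time.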

GROUP BY does not remove duplicates

I have coded a watchlist system. In the overview of a user's watchlist they see a list of records; however, the list shows duplicates, while the database holds exactly the correct number of rows.
I've tried GROUP BY watch.watch_id and GROUP BY rec.record_id, but none of the groupings I've tried removes the duplicates. I'm not sure what I'm doing wrong.
SELECT watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
FROM
(
watchlist watch
LEFT OUTER JOIN records rec ON rec.record_id = watch.record_id
LEFT OUTER JOIN members usr ON rec.user_id = usr.user_id
)
WHERE watch.user_id = 1
GROUP BY watch.watch_id
LIMIT 0, 25
The watchlist table looks like this:
+----------+---------+-----------+------------+
| watch_id | user_id | record_id | watch_date |
+----------+---------+-----------+------------+
| 13 | 1 | 22 | 1314038274 |
| 14 | 1 | 25 | 1314038995 |
+----------+---------+-----------+------------+
GROUP BY does not "remove duplicates". GROUP BY allows for aggregation. If all you want is to combine duplicated rows, use SELECT DISTINCT.
If you need to combine rows that are duplicate in some columns, use GROUP BY, but you need to specify what to do with the other columns. You can either omit them (by not listing them in the SELECT clause) or aggregate them (using functions like SUM, MIN, and AVG). For example:
SELECT watch.watch_id, COUNT(rec.street_number), MAX(watch.watch_date)
... GROUP by watch.watch_id
EDIT
The OP asked for some clarification.
Consider the "view" -- all the data put together by the FROMs and JOINs and the WHEREs -- call that V. There are two things you might want to do.
First, you might have completely duplicate rows that you wish to combine:
a b c
- - -
1 2 3
1 2 3
3 4 5
Then simply use DISTINCT
SELECT DISTINCT * FROM V;
a b c
- - -
1 2 3
3 4 5
Or, you might have partially duplicate rows that you wish to combine:
a b c
- - -
1 2 3
1 2 6
3 4 5
Those first two rows are "the same" in some sense, but clearly different in another sense (in particular, they would not be combined by SELECT DISTINCT). You have to decide how to combine them. You could discard column c as unimportant:
SELECT DISTINCT a,b FROM V;
a b
- -
1 2
3 4
Or you could perform some kind of aggregation on them. You could add them up:
SELECT a,b, SUM(c) "tot" FROM V GROUP BY a,b;
a b tot
- - ---
1 2 9
3 4 5
You could pick the smallest value:
SELECT a,b, MIN(c) "first" FROM V GROUP BY a,b;
a b first
- - -----
1 2 3
3 4 5
Or you could take the mean (AVG), the standard deviation (STD), and any of a bunch of other functions that take a bunch of values for c and combine them into one.
What isn't really an option is just doing nothing. If you just list the ungrouped columns, the DBMS will either throw an error (Oracle does that -- the right choice, imo) or pick one value more or less at random (MySQL). But as Dr. Peart said, "When you choose not to decide, you still have made a choice."
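The DISTINCT and GROUP BY variants above can be run against the partially duplicated V with this sketch. It uses Python's sqlite3 module rather than MySQL or Oracle (an assumption, so it runs without a server); V is created as an ordinary table, and the helper query and the alias smallest are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in (assumption)
conn.executescript("""
    CREATE TABLE V (a INTEGER, b INTEGER, c INTEGER);
    INSERT INTO V VALUES (1, 2, 3), (1, 2, 6), (3, 4, 5);
""")

def query(sql):
    """Run a statement and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Rows differ in c, so DISTINCT over all columns removes nothing.
distinct_all = query("SELECT DISTINCT * FROM V ORDER BY a, b, c")
# Dropping c makes the first two rows identical.
distinct_ab = query("SELECT DISTINCT a, b FROM V ORDER BY a, b")
# Grouping by (a, b) and stating how to combine c.
totals = query("SELECT a, b, SUM(c) AS tot FROM V GROUP BY a, b ORDER BY a, b")
firsts = query("SELECT a, b, MIN(c) AS smallest FROM V GROUP BY a, b ORDER BY a, b")

print(distinct_all)  # [(1, 2, 3), (1, 2, 6), (3, 4, 5)]
print(distinct_ab)   # [(1, 2), (3, 4)]
print(totals)        # [(1, 2, 9), (3, 4, 5)]
print(firsts)        # [(1, 2, 3), (3, 4, 5)]
```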
While SELECT DISTINCT may indeed work in your case, it's important to note why what you have is not working.
You're selecting fields that are outside of the GROUP BY. Although MySQL allows this (when ONLY_FULL_GROUP_BY is not enabled), the values it returns for the non-GROUP BY fields are undefined.
If you wanted to do this with a GROUP BY try something more like the following:
SELECT watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
FROM
(
watchlist watch
LEFT OUTER JOIN records rec ON rec.record_id = watch.record_id
LEFT OUTER JOIN members usr ON rec.user_id = usr.user_id
)
WHERE watch.watch_id IN (
SELECT watch_id FROM watchlist WHERE user_id = 1
GROUP BY watch_id)
LIMIT 0, 25
I would never recommend using SELECT DISTINCT; it can be really slow on big datasets.
Try using things like EXISTS.
You are grouping by watch.watch_id and you have two results, which have different watch IDs, so naturally they would not be grouped.
Also, from the results displayed, they have different records. That looks like a perfectly valid, expected result. If you are trying to select only distinct values, then you don't want to GROUP; you want to select distinct values.
SELECT DISTINCT ...
If you say your watchlist table is unique, then one (or both) of the other tables either (a) has duplicates, or (b) is not unique by the key you are using.
To suppress duplicates in your results, either use DISTINCT as @Laykes says, or try
GROUP BY watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
It sort of sounds like you expect all 3 tables to be unique by their keys, though. If that is the case, you are simply masking some other problem with your SQL by trying to retrieve distinct values.