Better way to exclude duplicates without creating new table - mysql

I have a query that uses a subquery to detect if an item in a joined table has a duplicate record, and if so the data is not pulled into the parent query:
select
(f.listing_datetime) as datetime,
round(avg(f.listing_price), 0) as price,
round(avg(f.listing_sqft), 0) as sqft,
round(avg(f.listing_p_per_sqft), 2) as p_per_ft,
f.listing_neighborhood, count(*) as points
from (
select
a.listing_datetime, a.listing_price, a.listing_sqft, a.listing_p_per_sqft,
a.listing_neighborhood, i.listing_tokens, count(i.listing_tokens) as c
from
agg_cl_data as a
left join incoming_cl_data_desc as i
on a.listing_url = i.listing_url
where a.listing_datetime between curdate() - interval 30 day and curdate()
group by i.listing_tokens
having c < 2
) as f
group by day(f.listing_datetime), f.listing_neighborhood
order by f.listing_datetime;
As you can see, by using a simple way to deal with dupes with the HAVING clause, I'm actually losing the original record that was stored because any aggregated record with great than 2 is thrown out. Is there a better way to do this so that I don't lose some of the data, WITHOUT creating a new table that would be queried against?

If you want to remove duplicate rows then use DISTINCT clause. If you want to find out duplicate based on partitioning on a particular column use the ROW_NUMBER window function.
On first glance, your subquery is invalid since you are grouping by one column and not using any other aggregate function in the other columns.
select distinct
a.listing_datetime, a.listing_price, a.listing_sqft, a.listing_p_per_sqft,
a.listing_neighborhood, i.listing_tokens
from
agg_cl_data as a
left join incoming_cl_data_desc as i
on a.listing_url = i.listing_url
where a.listing_datetime between curdate() - interval 30 day and curdate()

Try using 'distinct' instead if 'having' in subquery. You will get each url only once without loosing it, even if there were two entries for it.
So your code should be:
select DISTINCT a.listing_datetime, ...
and then no 'having' in the end.

Related

How to count related id on second table quicker in mysql?

I need to know how many orders made to each product within a day by their ids. I tried select all the product_today.id. And count each of them from the second table - product_today_order.hid. I'm now have 20k+ rows of data. It took me 10s+ only this query.
Is there any way to make the query faster?
SELECT t.id,(select count(o.hid) from product_today_order o where o.hid=t.id) as zid
FROM product_today t
where date(t.dtime)='2021-11-26'
group by t.id
5 tips:
Probably the main slowdown is the un-sargable date(t.dtime)='...'. Change that to
WHERE t.dtime >= '2021-11-26'
AND t.dtime < '2021-11-26' + INTERVAL 1 DAY
Also, get rid of the GROUP BY. It is unnecessary (if t.id is the PRIMARY KEY).
Do you have an index on t that starts with dtime?
Do you need to check o.hid for being not-NULL? If not, simply say COUNT(*).
Do you have an index on o that starts with hid?

Mysql DISTINCT with more than one column (remove duplicates)

My database is called: (training_session)
I try to print out some information from my data, but I do not want to have any duplicates. I do get it somehow, may someone tell me what I do wrong?
SELECT DISTINCT athlete_id AND duration FROM training_session
SELECT DISTINCT athlete_id, duration FROM training_session
It works perfectly if i use only one column, but when I add another. it does not work.
I think you misunderstood the use of DISTINCT.
There is big difference between using DISTINCT and GROUP BY.
Both have some sort of goal, but they have different purpose.
You use DISTINCT if you want to show a series of columns and never repeat. That means you dont care about calculations or group function aggregates. DISTINCT will show different RESULTS if you keep adding more columns in your SELECT (if the table has many columns)
You use GROUP BY if you want to show "distinctively" on a certain selected columns and you use group function to calculate the data related to it. Therefore you use GROUP BY if you want to use group functions.
Please check group functions you can use in this link.
https://dev.mysql.com/doc/refman/8.0/en/group-by-functions.html
EDIT 1:
It seems like you are trying to get the "latest" of a certain athlete, I'll assume the current scenario if there is no ID.
Here is my alternate solution:
SELECT a.athlete_id ,
( SELECT b.duration
FROM training_session as b
WHERE b.athlete_id = a.athlete_id -- connect
ORDER BY [latest column to sort] DESC
LIMIT 1
) last_duration
FROM training_session as a
GROUP BY a.athlete_id
ORDER BY a.athlete_id
This syntax is called IN-SELECT subquery. With the help of LIMIT 1, it shows the topmost record. In-select subquery must have 1 record to return or else it shows error.
MySQL's DISTINCT clause is used to filter out duplicate recordsets.
If your query was SELECT DISTINCT athlete_id FROM training_session then your output would be:
athlete_id
----------
1
2
3
4
5
6
As soon as you add another column to your query (in your example, the column called duration) then each record resulting from your query are unique, hence the results you're getting. In other words the query is working correctly.

Specific where clause in Mysql query

So i have a mysql table with over 9 million records. They are call records. Each record represents 1 individual call. The columns are as follows:
CUSTOMER
RAW_SECS
TERM_TRUNK
CALL_DATE
There are others but these are the ones I will be using.
So I need to count the total number of calls for a certain week in a certain Term Trunk. I then need to sum up the number of seconds for those calls. Then I need to count the total number of calls that were below 7 seconds. I always do this in 2 queries and combine them but I was wondering if there were ways to do it in one? I'm new to mysql so i'm sure my syntax is horrific but here is what I do...
Query 1:
SELECT CUSTOMER, SUM(RAW_SECS), COUNT(*)
FROM Mytable
WHERE TERM_TRUNK IN ('Mytrunk1', 'Mytrunk2')
GROUP BY CUSTOMER;
Query 2:
SELECT CUSTOMER, COUNT(*)
FROM Mytable2
WHERE TERM_TRUNK IN ('Mytrunk1', 'Mytrunk2') AND RAW_SECS < 7
GROUP BY CUSTOMER;
Is there any way to combine these two queries into one? Or maybe just a better way of doing it? I appreciate all the help!
There are 2 ways of achieving the expected outcome in a single query:
conditional counting: use a case expression or if() function within the count() (or sum()) to count only specific records
use self join: left join the table on itself using the id field of the table and in the join condition filter the alias on the right hand side of the join on calls shorter than 7 seconds
The advantage of the 2nd approach is that you may be able to use indexes to speed it up, while the conditional counting cannot use indexes.
SELECT m1.CUSTOMER, SUM(m1.RAW_SECS), COUNT(m1.customer), count(m2.customer)
FROM Mytable m1
LEFT JOIN Mytable m2 ON m1.id=m2.id and m2.raw_secs<7
WHERE TERM_TRUNK IN ('Mytrunk1', 'Mytrunk2')
GROUP BY CUSTOMER;

Order By on date field starting in a middle point of the dates range

I have a table "A" with a "date" field. I want to make a select query and order the rows with previous dates in a descending order, and then, the rows with next dates in ascending order, all in the same query. Is it possible?
For example, table "A":
id date
---------------------
a march-20
b march-21
c march-22
d march-23
e march-24
I'd like to get, having as a starting date "march-22", this result:
id date
---------------------
c march-22
b march-21
a march-20
d march-23
e march-24
In one query, because I'm doing it with two of them and it's slow, because the only difference is the sorting, and the joins I have to do are a bit "heavy".
Thanks a lot.
You could use something like this -
SELECT *
FROM test
ORDER BY IF(
date <= '2012-03-22',
DATEDIFF('2000-01-01', date),
DATEDIFF(date, '2000-01-01')
);
Here is a link to a test on SQL Fiddle - http://sqlfiddle.com/#!2/31a3f/13
That's wrong, sorry :(
From documentation:
However, use of ORDER BY for individual SELECT statements implies nothing about the order in which the rows appear in the final result because UNION by default produces an unordered set of rows. Therefore, the use of ORDER BY in this context is typically in conjunction with LIMIT, so that it is used to determine the subset of the selected rows to retrieve for the SELECT, even though it does not necessarily affect the order of those rows in the final UNION result. If ORDER BY appears without LIMIT in a SELECT, it is optimized away because it will have no effect anyway.
This should do the trick. I'm not 100% sure about adding an order in a UNION...
SELECT * FROM A where date <= now() ORDER BY date DESC
UNION SELECT * FROM A where date > now() ORDER BY date ASC
I think the real question here is how to do the joining once. Create a temporary table with the result of joining, and make the 2 selects from that table. So it will be be time consuming only on creation (once) not on select query (twice).
CREATE TABLE tmp SELECT ... JOIN -- do the heavy duty here
With this you can make the two select statenets as you originally did.

SQL Server: Selecting DateTime and grouping by Date

This simple SQL problem is giving me a very hard time. Either because I'm seeing the problem the wrong way or because I'm not that familiar with SQL. Or both.
What I'm trying to do: I have a table with several columns and I only need two of them: the datetime when the entry was created and the id of the entry. Note that the hours/minutes/seconds part is important here.
However, I want to group my selection according to the DATE part only. Otherwise all groups will most likely have 1 element.
Here's my query:
SELECT MyDate as DateCr, COUNT(Id) as Occur
FROM MyTable tb WITH(NOLOCK)
GROUP BY CAST(tb.MyDate as Date)
ORDER BY DateCr ASC
However I get the following error from it:
Column "MyTable.MyDate" is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
If I don't do the cast in the GROUP BY, everything fine. If I cast MyDate to DATE in the SELECT and keep the CAST from GROUP BY, everything fine once more. Apparently it wants to keep the same DATE or DATETIME format in the GROUP BY as in the SELECT.
My approach can be completely wrong so I am not necessarily looking to fix the above query, but to find the proper way to do it.
LE: I get the above error on line 1.
LE2: On a second look, my question indeed is not very explicit. You can ignore the above approach if it is completely wrong. Below is a sample scenario
Let me tell you what I need: I want to retrieve (1) the DateTime when each entry was created. So if I have 20 entries, then I want to get 20 DateTimes. Then if I have multiple entries created on the same DAY, I want the number of those entries. For example, let's say I created 3 entries on Monday, 1 on Tuesday and 2 today. Then from my table I need the datetimes of these 6 entries + the number of entries which were created on each day (3 for 19/03/2012, 1 for 20/03/2012 and 2 for 21/03/2012).
I'm not sure why you're objecting to performing the CONVERT in both the SELECT and the GROUP BY. This seems like a perfectly logical way to do this:
SELECT
DateCr = CONVERT(DATE, MyDate),
Occur = COUNT(Id)
FROM dbo.MyTable
GROUP BY CONVERT(DATE, MyDate)
ORDER BY DateCr;
If you want to keep the time portion of MyDate in the SELECT list, why are you bothering to group? Or how do you expect the results to look? You'll have a row for every individual date/time value, where the grouping seems to indicate you want a row for each day. Maybe you could clarify what you want with some sample data and example desired results.
Also, why are you using NOLOCK? Are you willing to trade accuracy for a haphazard turbo button?
EDIT adding a version for the mixed requirements:
;WITH d(DateCr,d,Id) AS
(
SELECT MyDate, d = CONVERT(DATE, MyDate), Id
FROM dbo.MyTable)
SELECT DateCr, Occur = (SELECT COUNT(Id) FROM d AS d2 WHERE d2.d = d.d)
FROM d
ORDER BY DateCr;
Even though this is an old post, I thought I would answer it. The solution below will work with SQL Server 2008 and above. It uses the over clause, so that the individual lines will be returned, but will also count the rows grouped by the date (without time).
SELECT MyDate as DateCr,
COUNT(Id) OVER(PARTITION BY CAST(tb.MyDate as Date)) as Occur
FROM MyTable tb WITH(NOLOCK)
ORDER BY DateCr ASC
Darren White