Count each customer once in this query - mysql

I have two tables: one is a list of store locations (with lat/long) and the other is a customer list (with address lat/long). What I need is a query that shows how many customers are within certain ranges from each store. The goal is to have each customer counted once, in the distance range that is closest to a store. That is, each customer should only be counted once. For example, if they are 2 miles from one store and 5 miles from another, then only count them as being associated with the first store.
The query below is supposed to roll all this up so that, basically, I can see how far customers are from their nearest store.
This is what my query looks like:
SELECT CASE
         WHEN dist <   8046.0 THEN 1
         WHEN dist <  16093.0 THEN 2
         WHEN dist <  40233.0 THEN 3
         WHEN dist <  80467.0 THEN 4
         WHEN dist < 160934.0 THEN 5
       END AS grp, count(*)
FROM  (SELECT s.id, s.identifier, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
       FROM full_data_for_testing_deid_2 c, demo_locations_table s) AS loc_dist
GROUP BY grp
And here's the result:
| Count | grp |
|---------|------|
| 2860 | 1 |
| 4858 | 2 |
| 12735 | 3 |
| 11432 | 4 |
| 23950 | 5 |
| 1002970 | null |
There are only 32048 customers in my database, so this isn't quite working right. If it were, I'd expect the counts to grow as the distance ranges widen, but in my results there are more customers in group 3 than in group 4, which shouldn't be the case. In addition, groups 1-5 should add up to 32048, as each customer should only be counted once.
Any thoughts on how to adjust this such that each customer is only counted once?

To count each customer only once (in Postgres 9.3+):
SELECT CASE
WHEN s.dist < 8046.0 THEN 1
WHEN s.dist < 16093.0 THEN 2
WHEN s.dist < 40233.0 THEN 3
WHEN s.dist < 80467.0 THEN 4
WHEN s.dist < 160934.0 THEN 5
END AS grp
, count(*)
FROM full_data_for_testing_deid_2 c
, LATERAL (
SELECT s.id, s.identifier, ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
FROM demo_locations_table s
ORDER BY dist
LIMIT 1
) s
GROUP BY 1;
This takes every customer exactly once and finds the closest location to go with it before aggregating.
But I don't think ST_Distance_Sphere() uses a GiST index on the_geom.
Consider ST_DWithin() instead if performance is an issue.
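Not part of the original answer, but as a sketch of that idea, ST_DWithin() could pre-filter inside the LATERAL subquery. The geography casts and the 160934 m cutoff (the widest band) are assumptions; customers farther than that from every store would then drop out entirely instead of landing in the NULL group.
SELECT CASE
          WHEN s.dist <   8046.0 THEN 1
          WHEN s.dist <  16093.0 THEN 2
          WHEN s.dist <  40233.0 THEN 3
          WHEN s.dist <  80467.0 THEN 4
          WHEN s.dist < 160934.0 THEN 5
       END AS grp
     , count(*)
FROM   full_data_for_testing_deid_2 c
     , LATERAL (
   SELECT ST_Distance_Sphere(s.the_geom, c.the_geom) AS dist
   FROM   demo_locations_table s
   WHERE  ST_DWithin(s.the_geom::geography, c.the_geom::geography, 160934)  -- pre-filter; can use a GiST index on a geography column
   ORDER  BY dist
   LIMIT  1
   ) s
GROUP  BY 1;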
See also: How to alter this PostGIS ST_Distance_Sphere query to give the answer for all points in the table, not just one?


Refine SQL Query given list of ids

I am trying to improve this query given that it takes a while to run. The difficulty is that the data is coming from one large table and I need to aggregate a few things. First I need to define the ids that I want to get data for. Then I need to aggregate total sales. Then I need to find metrics for some individual sales. This is what the final table should look like:
| ID | Product Type | % of Call Sales | % of In Person Sales | Avg Price | Avg Cost | Avg Discount |
|----|--------------|-----------------|----------------------|-----------|----------|--------------|
| A  | prod 1       | 50              | 25                   | 10        | 7        | 1            |
| A  | prod 2       | 50              | 75                   | 11        | 4        | 2            |
So % of Call Sales for each product and ID adds up to 100; the column sums to 100, not the row. Likewise for % of In Person Sales. I need to define the IDs separately because I need it to be region-independent. Someone could make sales in Region A or Region B, but it does not matter; we want to aggregate across regions. By aggregating the subqueries and using a where clause to get the right ids, it should cut down on the memory required.
IDs Query
select distinct ids from tableA as t where year>=2021 and team = 'Sales'
This should be a unique list of ids
Aggregate Call Sales and Person Sales
select ids
,sum(case when sale = 'call' then 1 else 0 end) as call_sales
,sum(case when sale = 'person' then 1 else 0 end) as person_sales
from tableA
where
ids in t.ids
group by ids
This will be as follows, with the unique ids, but the total sales come from everything in that table, essentially ignoring the where clause from the first query.
| ids | call_sales | person_sales |
|-----|------------|--------------|
| A   | 100        | 50           |
| B   | 60         | 80           |
| C   | 100        | 200          |
Main Table as shown above
select ids
,prod_type
,cast(sum(case when sale = 'call' then 1 else 0 end)/CAST(call_sales AS DECIMAL(10, 2)) * 100 as DECIMAL(10,2)) as call_sales_percentage
,cast(sum(case when sale = 'person' then 1 else 0 end)/CAST(person_sales AS DECIMAL(10, 2)) * 100 as DECIMAL(10,2)) as person_sales_percentage
,avg(price) as price
,avg(cost) as cost
,avg(discount) as discount
from tableA as A
where
...conditions...
group by
...conditions...
You can combine the first two queries as:
select ids,
       sum(sale = 'call') as call_sales,
       sum(sale = 'person') as person_sales
from tableA
group by ids
having sum(year >= 2021 and team = 'Sales') > 0;
I'm not exactly sure what the third query is doing, but you can use the above as a CTE and just plug it in.
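As a sketch of that suggestion (assuming MySQL 8+ for the CTE; table and column names follow the question, and the join key and the final grouping are assumptions):
with totals as (
    select ids,
           sum(sale = 'call')   as call_sales,
           sum(sale = 'person') as person_sales
    from tableA
    group by ids
    having sum(year >= 2021 and team = 'Sales') > 0
)
select a.ids,
       a.prod_type,
       cast(sum(a.sale = 'call')   / cast(t.call_sales   as decimal(10,2)) * 100 as decimal(10,2)) as call_sales_percentage,
       cast(sum(a.sale = 'person') / cast(t.person_sales as decimal(10,2)) * 100 as decimal(10,2)) as person_sales_percentage,
       avg(a.price)    as price,
       avg(a.cost)     as cost,
       avg(a.discount) as discount
from tableA a
join totals t on t.ids = a.ids          -- only ids that pass the HAVING filter survive the join
group by a.ids, a.prod_type, t.call_sales, t.person_sales;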

Need validation that interpretation for a Grouping Query is correct

I am running the following query. At first, it appears to give the subtotals for customers and to show, by date, each customer's payment amounts, but only if the total of all their payments is greater than $90,000.
SELECT
Customername,
Date(paymentDate),
CONCAT('$', Round(SUM(amount),2)) AS 'High $ Paying Customers'
FROM Payments
JOIN Customers
On payments.customernumber = customers.customernumber
Group by customername, Date(paymentDate) WITH ROLLUP
having sum(amount)> 90000;
But upon looking at the records for Dragon Souveniers, Ltd. and Euro+ Shopping Channel, it is actually showing the payment dates whose amounts are individually over $90,000, as well as the subtotal for that customer as a rollup row. For all other customers, their individual payment dates are not reported in the result set; only their sum appears, and only if it is over $90,000. For example, Anna's Decorations has 4 payment records and none of them is over $90,000, but her sum is reported as the value for the total payments in the query with the rollup. Is this the correct interpretation?
The HAVING clause works correctly: it filters out every group whose total is not above 90000, and it does this for the rollup totals as well.
When using GROUP BY ... WITH ROLLUP, you can detect the generated rollup rows by using the GROUPING() function.
You should add a condition so that the rows you want are not filtered out.
Simple example:
select a, sum(a), grouping(a<3)
from (select 1 as a
union
select 2
union select 3) x
group by a<3 with rollup;
output:
+---+--------+---------------+
| a | sum(a) | grouping(a<3) |
+---+--------+---------------+
| 3 | 3 | 0 |
| 1 | 3 | 0 |
| 1 | 6 | 1 |
+---+--------+---------------+
This shows that the last line (the one with grouping(a<3) = 1) is a row containing the totals for a<3.
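Applied to the query in the question (a sketch only; it requires MySQL 8.0+, where GROUPING() is available), you could, for example, keep just the per-customer rollup rows and drop both the per-date rows and the grand total:
SELECT Customername,
       Date(paymentDate),
       CONCAT('$', Round(SUM(amount), 2)) AS 'High $ Paying Customers'
FROM Payments
JOIN Customers ON payments.customernumber = customers.customernumber
GROUP BY customername, Date(paymentDate) WITH ROLLUP
HAVING SUM(amount) > 90000
   AND GROUPING(Date(paymentDate)) = 1   -- keep only rollup rows
   AND GROUPING(customername) = 0;       -- but drop the grand-total row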

SQL - return latest of multiple records from large data set

Background
I have a stock_price table that stores historical intra-day stock prices for roughly 1000 stocks. Although old data is purged regularly, the table typically has 5M+ records. Structure is loosely:
| id | stock_id | value | change | created_at |
|--------|----------|-------|--------|---------------------|
| 12345 | 1 | 50 | 2.12 | 2020-05-05 17:39:00 |
| 12346 | 2 | 25 | 1.23 | 2020-05-05 17:39:00 |
I regularly need to fetch the latest stock prices for ~20 stocks at a time for an API endpoint. An original implementation of this executed a single query per stock:
select * from stock_prices where stock_id = 1 order by created_at desc limit 1
Part 1: An inefficient query
Somewhat inefficient with 20+ queries, but it worked. The code (Laravel 6) was updated to use the correct relationships (stock hasMany stock_prices), which in turn generated a query like this:
select
*
from
`stock_prices`
where
`stock_prices`.`stock_id` in (1, 2, 3, 4, 5)
order by
`id` desc
While this saves on queries, it takes 1-2 seconds to run. Running explain shows it's still having to query 50k+ rows at any given time, even with the foreign key index. My next thought was that I'd add a limit to the query to only return the number of rows equal to the number of stocks I'm asking for. Query is now:
select
*
from
`stock_prices`
where
`stock_prices`.`stock_id` in (1, 2, 3, 4, 5)
order by
`id` desc
limit
5
Part 2: Query sometimes misses records
Performance is amazing - millisecond-level processing with this. However, it suffers from potentially not returning a price for one or more of the stocks. Since the limit has been added, if any stock has more than one price (row) before the next stock, it will "consume" one of the row counts.
This is a very real scenario as some stocks pull data each minute, others every 15 minutes, etc. So there are cases where that above query, due to the limit will pull multiple rows for one stock and subsequently not return data for others:
| id | stock_id | value | change | created_at |
|------|----------|-------|--------|----------------|
| 5000 | 1 | 50 | 0.5 | 5/5/2020 17:00 |
| 5001 | 1 | 51 | 1 | 5/5/2020 17:01 |
| 6001 | 2 | 25 | 2.2 | 5/5/2020 17:00 |
| 6002 | 3 | 35 | 3.2 | 5/5/2020 17:00 |
| 6003 | 4 | 10 | 1.3 | 5/5/2020 17:00 |
In this scenario, you can see that stock_id 1 has more frequent intervals of data, so when the query was run, it returned two records for that ID, then continued down the list. After it hit 5 records, it stopped, meaning that stock id 5 did not have any data returned, although it does exist. As you can imagine, that breaks things down the line in the app when no data is returned.
Part 3: Attempts to solve
The most obvious answer seems to be to add a GROUP BY stock_id as a way to require that I get the same number of results as I expect, one per stock. Unfortunately, this leads me back to Part 1, wherein that query, while it works, takes 1-2 seconds because it ends up having to traverse the same 50k+ rows as it did without the limit previously. This leaves me no better off.
The next thought was to arbitrarily make the LIMIT larger than it needs to be so it can capture all the rows. This is not a predictable solution, since the query could be any combination of thousands of stocks that each have different intervals of data available. The most extreme example is stocks that pull daily versus each minute, which means one could have somewhere near 350+ rows before the second stock appears. Multiply that by the number of stocks in one query - say 50 - and this still will require querying 15k+ rows. Feasible, but not ideal, and potentially not scalable.
Part 4: Suggestions?
Is it such a bad practice to have one API call initiate potentially 50+ DB queries just to get stock price data? Is there some threshold of LIMIT I should use that minimizes the chances of failure enough to be comfortable? Are there other methods with SQL that would allow me to return the required rows without having to query a large chunk of the table?
Any help appreciated.
The fastest method is union all:
(select * from stock_prices where stock_id = 1 order by created_at desc limit 1)
union all
(select * from stock_prices where stock_id = 2 order by created_at desc limit 1)
union all
(select * from stock_prices where stock_id = 3 order by created_at desc limit 1)
union all
(select * from stock_prices where stock_id = 4 order by created_at desc limit 1)
union all
(select * from stock_prices where stock_id = 5 order by created_at desc limit 1)
This can use an index on stock_prices(stock_id, created_at [desc]). Unfortunately, when you use in, the index cannot be used as effectively.
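A minimal sketch of creating that index (the index name is illustrative; DESC in an index definition needs MySQL 8.0+, and a plain ascending index also works since it can be scanned in reverse):
CREATE INDEX idx_stock_prices_stock_created
    ON stock_prices (stock_id, created_at DESC);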
Groupwise-max
SELECT b.*
FROM ( SELECT stock_id, MAX(created_at) AS created_at
       FROM stock_prices
GROUP BY stock_id
) AS a
JOIN stock_prices AS b USING(stock_id, created_at)
Needed:
INDEX(stock_id, created_at)
If you can have two rows for the same stock in the same second, this will give 2 rows. See the link below for alternatives.
If that pair is unique, then make it the PRIMARY KEY and get rid of id; this will help performance, too.
More discussion: http://mysql.rjweb.org/doc.php/groupwise_max#using_an_uncorrelated_subquery
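Not from either answer, but on MySQL 8+ a window function is another option; it returns exactly one latest row per stock even when two rows share the same created_at (a sketch using the question's columns):
SELECT id, stock_id, value, `change`, created_at
FROM (
    SELECT sp.*,
           ROW_NUMBER() OVER (PARTITION BY stock_id
                              ORDER BY created_at DESC, id DESC) AS rn  -- rank rows per stock, newest first
    FROM stock_prices sp
    WHERE stock_id IN (1, 2, 3, 4, 5)
) ranked
WHERE rn = 1;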

SQL consecutive occurrences for availability based query

I am a bit stuck trying to create a pretty complex query in SQL, and more specifically MySQL.
The database deals with car rentals, and the main table of what is a snowflake pattern looks a bit like:
id | rent_start | rent_duration | rent_end | customerID | carId
-----------------------------------------------------------------------------------
203 | 2016-10-03 | 5 | 2016-11-07 | 16545 | 4543
125 | 2016-10-20 | 9 | 2016-10-28 | 54452 | 5465
405 | 2016-11-01 | 2 | 2016-01-02 | 43565 | 346
My goal is to create a query that, given
1) A period range, for example: from 2016-10-03 to 2016-11-03
2) A number of days, for example: 10
retrieves the cars that are actually available for at least 10 CONSECUTIVE days within that period.
A list of IDs for those cars is more than enough... I just don't really know how to setup a query like that.
If it can help: I do have a list of all the car IDs in another table.
Either way, thanks!
I think it is much simpler to work with availability, rather than rentals, for this purpose.
So:
select r.car_id, r.rent_end as avail_start,
       (select min(r2.rent_start)
        from rentals r2
        where r2.car_id = r.car_id and r2.rent_start > r.rent_start
       ) as avail_end
from rentals r;
Then, for your query, you need at least 10 days. You can use a having clause or subquery for that purpose:
select r.*
from (select r.car_id, r.rent_end as avail_start,
             (select min(r2.rent_start)
              from rentals r2
              where r2.car_id = r.car_id and r2.rent_start > r.rent_start
             ) as avail_end
      from rentals r
     ) r
where datediff(avail_end, avail_start) >= $days;
And finally, you need for that period to be during the dates you specify:
select r.*
from (select r.car_id, r.rent_end as avail_start,
             (select min(r2.rent_start)
              from rentals r2
              where r2.car_id = r.car_id and r2.rent_start > r.rent_start
             ) as avail_end
      from rentals r
     ) r
where datediff(avail_end, avail_start) >= $days and
      ( (avail_end > $end and avail_start < $start) or
        (avail_start <= $start and avail_end >= $start + interval 10 day) or
        (avail_start > $start and avail_start + interval 10 day <= $end)
      );
This handles the various conditions where the free period covers the entire range or starts/ends during the range.
There are no doubt off-by-one errors in this logic (is a car available on the same date it is returned?). But this should give you a solid approach for solving the problem.
By the way, you should also include cars that have never been rented. But that is not possible with the tables you describe in the question.
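If the separate table of car IDs mentioned in the question is available (its name and column here are hypothetical), never-rented cars could be added with an anti-join, for example:
select c.car_id
from cars c                                -- hypothetical table listing every car
left join rentals r on r.car_id = c.car_id
where r.car_id is null;                    -- no rentals at all, so available for any period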

mysql get last N records with MAX(date)

So I have the following data in a product_rate_history table.
I want to select the last N records (e.g. 7 records) showing the rate change history of a given product. If a product's rate is changed more than once in a day, then the query should select only the most recent rate change for that day.
So from the above table I want output like the following for product id 16:
+------------+---------------+---------------------+
| product_id | previous_rate | date                |
+------------+---------------+---------------------+
| 16         | 2400          | 2016-04-30 23:05:35 |
| 16         | 4500          | 2016-04-29 11:02:42 |
+------------+---------------+---------------------+
I have tried the following query, but it returns only one row, containing only the last updated rate:
SELECT * FROM `product_rate_history` prh
INNER JOIN (SELECT max(created_on) as max FROM `product_rate_history` GROUP BY Date(created_on)) prh2
ON prh.created_on = prh2.max
WHERE prh.product_id = 16
GROUP BY DATE(prh.created_on)
ORDER BY prh.created_on DESC;
First, you do not need an aggregation in the outer query.
Second, you need to repeat the WHERE clause in the subquery (for the method you are using):
SELECT prh.*
FROM product_rate_history prh INNER JOIN
(SELECT max(created_on) as maxco
FROM product_rate_history
      WHERE product_id = 16
GROUP BY Date(created_on)
) prh2
ON prh.created_on = prh2.maxco
WHERE prh.product_id = 16
ORDER BY prh.created_on DESC;
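To cap the output at the last N daily changes (e.g. 7, as asked in the question), a LIMIT can simply be appended; a sketch:
SELECT prh.*
FROM product_rate_history prh INNER JOIN
     (SELECT max(created_on) as maxco
      FROM product_rate_history
      WHERE product_id = 16
      GROUP BY Date(created_on)
     ) prh2
     ON prh.created_on = prh2.maxco
WHERE prh.product_id = 16
ORDER BY prh.created_on DESC
LIMIT 7;   -- the 7 most recent days that had a rate change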