SUM before JOIN in MySQL?

I'm having a probably basic problem with an SQL query (I'm learning).
I'm tracking the vaccination status in the different Spanish regions.
Simplifying, I have two tables: one with the regions (ca) and their population, and the other with the region (ca), the date of each record (time), and the doses administered.
In order to get the percentage of the overall Spanish population vaccinated each day, I need to SUM the populations of all regions, and then divide the running SUM of doses administered across all regions by that total.
However, when I do the JOIN, each population is added to every row, so the SUM is very high (it is counted once per time the region appears).
I think I need to SUM all the population before JOIN, but then, what column do I use to JOIN?
It is something like this:
SELECT
time AS "time",
SUM(SUM(v1.dose)) OVER (ORDER BY time)/'SP_population'
FROM vaccines v1
INNER JOIN
(SELECT SUM(population) AS 'SP_population' FROM ca_population) v2 ON ?????
GROUP BY time
ORDER BY time
What should the ??? be?

If I understand correctly, you want a CROSS JOIN. You no longer care about regions so you want one population value for all rows:
SELECT v.time, p.sp_population, v.daily_dose,
       SUM(v.daily_dose) OVER (ORDER BY time) / p.sp_population
FROM (SELECT v.time, SUM(v.dose) AS daily_dose
      FROM vaccines v
      GROUP BY v.time
     ) v CROSS JOIN
     (SELECT SUM(population) AS sp_population
      FROM ca_population
     ) p
ORDER BY time;
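Here is a minimal, runnable sketch of this CROSS JOIN pattern, using SQLite through Python's sqlite3 with invented data (the table and column names follow the question; the numbers are made up). Note the windowed SUM needs SQLite 3.25+ here, or MySQL 8.0+ for the original query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ca_population (ca TEXT, population INTEGER);
INSERT INTO ca_population VALUES ('Madrid', 100), ('Catalonia', 100);

CREATE TABLE vaccines (ca TEXT, time TEXT, dose INTEGER);
INSERT INTO vaccines VALUES
  ('Madrid',    '2021-01-01', 10), ('Catalonia', '2021-01-01', 10),
  ('Madrid',    '2021-01-02', 20), ('Catalonia', '2021-01-02', 20);
""")

rows = conn.execute("""
SELECT v.time,
       -- running total of daily doses, divided by the single total population
       SUM(v.daily_dose) OVER (ORDER BY v.time) * 1.0 / p.sp_population AS pct
FROM (SELECT time, SUM(dose) AS daily_dose
      FROM vaccines
      GROUP BY time) AS v
CROSS JOIN (SELECT SUM(population) AS sp_population
            FROM ca_population) AS p
ORDER BY v.time
""").fetchall()
print(rows)
```

With two regions of 100 people each, 20 doses on day one and 40 on day two, the running fractions come out as 0.1 and then 0.3; because the totals subquery yields exactly one row, the CROSS JOIN never inflates the population.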

Related

How do I join two grouped-up SELECT queries in SQL?

There are just two relations important for this:
geoLake with Name and Country
geoMountain with Name and Country
Both relations have a couple hundred entries.
The task is to display just the names of the countries which have more lakes than mountains.
SELECT m.Country, Count(m.Country)
FROM geoMountain m
GROUP BY m.Country
This shows a list with all country names and the number of mountains related to each country.
SELECT l.Country, Count(l.Country)
FROM geoLake l
GROUP BY l.Country
This gives the same kind of output for how many lakes are in each country.
I have tried just about everything to bring these two grouped relations together, but without any success; I got stuck after about two hours because I ran out of ideas.
How do I bring these together?
My specific questions:
Is it possible to get a Relation like:
+--------+-------------------+----------------+
|Country |COUNT(m.Country) |COUNT(l.Country)|
+--------+-------------------+----------------+
|Country1|How many Mountains |How many Lakes |
|Country2|How many Mountains |How many Lakes |
[...]
And how do I add a SELECT query on top of this with a
SELECT Country FROM (what is built up there) WHERE COUNT(m.Country) > COUNT(l.Country)
mechanic?
PS. I hope my question is understandable; English isn't my native language.
WITH
-- count the number of mountains per country
cte1 AS (
    SELECT m.Country, COUNT(m.Country) cnt
    FROM geoMountain m
    GROUP BY m.Country
),
-- count the number of lakes per country
cte2 AS (
    SELECT l.Country, COUNT(l.Country) cnt
    FROM geoLake l
    GROUP BY l.Country
),
-- gather the countries which are mentioned in at least one table
cte3 AS (
    SELECT Country FROM cte1
    UNION
    SELECT Country FROM cte2
)
-- get the needed data
SELECT Country,
       COALESCE(cte1.cnt, 0) AS MountainsAmount,
       COALESCE(cte2.cnt, 0) AS LakesAmount
-- join the data to the complete countries list
FROM cte3
LEFT JOIN cte2 USING (Country)
LEFT JOIN cte1 USING (Country)
-- HAVING MountainsAmount > LakesAmount
;
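A small runnable sketch of the same CTE-plus-UNION pattern, via SQLite in Python with invented rows. Since the final filter (shown commented out above) uses HAVING without a GROUP BY, which is not portable, this sketch applies it as a WHERE over the COALESCEd counts instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geoMountain (Name TEXT, Country TEXT);
CREATE TABLE geoLake (Name TEXT, Country TEXT);
-- country 'A': 2 mountains, 1 lake; country 'B': 1 mountain, 2 lakes
INSERT INTO geoMountain VALUES ('M1','A'), ('M2','A'), ('M3','B');
INSERT INTO geoLake VALUES ('L1','A'), ('L2','B'), ('L3','B');
""")

rows = conn.execute("""
WITH cte1 AS (SELECT Country, COUNT(*) AS cnt FROM geoMountain GROUP BY Country),
     cte2 AS (SELECT Country, COUNT(*) AS cnt FROM geoLake GROUP BY Country),
     cte3 AS (SELECT Country FROM cte1 UNION SELECT Country FROM cte2)
SELECT Country,
       COALESCE(cte1.cnt, 0) AS MountainsAmount,
       COALESCE(cte2.cnt, 0) AS LakesAmount
FROM cte3
LEFT JOIN cte1 USING (Country)
LEFT JOIN cte2 USING (Country)
WHERE COALESCE(cte1.cnt, 0) > COALESCE(cte2.cnt, 0)
""").fetchall()
print(rows)  # only country 'A' has more mountains than lakes
```

The question actually asks for more lakes than mountains; flipping the comparison in the WHERE clause gives that variant.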

Query including the sum of one row in all rows

I'm trying to add a column containing the sum of one of my columns and then show it on every row. Specifically, I have a table where each row is a country, a second column identifies its region, and a third column is a measure. What I'm trying to do is sum the measures from the same region, and then show that sum for every country.
This is my query so far:
SELECT country, region, SUM(share) AS value_sum FROM data_xlsx_Hoja2 GROUP BY region
I know that the GROUP BY makes the result appear once per region, but those summed values are the ones I want to place next to each country. Right now what I get is a table containing the first country from every region and then the sum.
Any ideas?
I suppose your country information is more specific than the region information.
So you can write a main query that returns the two pieces of information, country and region, plus a subquery that returns the sum over all countries in the same region.
Try this:
SELECT main.country, main.region,
(SELECT SUM(sec.share)
FROM data_xlsx_Hoja2 sec WHERE sec.region = main.region) as total
FROM data_xlsx_Hoja2 main
You appear to want a correlated subquery:
SELECT country, region,
(SELECT SUM(d1.share)
FROM data_xlsx_Hoja2 d1
WHERE d1.region = d.region
) AS value_sum
FROM data_xlsx_Hoja2 d;
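A minimal demonstration of this correlated subquery, run through SQLite in Python with made-up country data (the table name follows the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_xlsx_Hoja2 (country TEXT, region TEXT, share REAL);
INSERT INTO data_xlsx_Hoja2 VALUES
  ('Spain','EU',10), ('France','EU',20), ('Japan','Asia',30);
""")

rows = conn.execute("""
SELECT country, region,
       -- for each row, re-sum the shares of every row in the same region
       (SELECT SUM(d1.share)
        FROM data_xlsx_Hoja2 d1
        WHERE d1.region = d.region) AS value_sum
FROM data_xlsx_Hoja2 d
ORDER BY country
""").fetchall()
print(rows)  # every country carries its region's total
```

Every row keeps its country and region but shows the regional total (30 for EU, 30 for Asia here). On MySQL 8.0+ a window function, `SUM(share) OVER (PARTITION BY region)`, gives the same result in one pass.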

Compare individual to group

I am relatively new to MySQL. I have a huge database of salespeople and each sale they make, and for how much. I know that I can put AVG(SaleAmount) in a SELECT statement to get the average number of sales of the people at large, but I was wondering how I can get a list of each individuals' average sales, and then compare it to the group average to get a group of salesmen who are above that average.
I eventually want a list of those salespeople whose averages are two standard deviations above the group average.
I'm sorry if this is relatively simple, but I would really appreciate the help.
Join with a subquery that gets the global statistics.
SELECT s.id, s.name, AVG(s.SaleAmount) AS personAvg, av.avgSale, av.stdSale
FROM salespeople AS s
CROSS JOIN (SELECT AVG(SaleAmount) AS avgSale, STDDEV(SaleAmount) AS stdSale
            FROM salespeople) AS av
GROUP BY s.id, s.name, av.avgSale, av.stdSale
HAVING personAvg > av.avgSale + 2 * av.stdSale
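A runnable sketch of the idea with invented sales figures, via SQLite in Python. SQLite has no STDDEV(), so this sketch computes the group statistics in Python instead; `statistics.pstdev` is the population standard deviation, matching MySQL's STDDEV:

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE salespeople (id INTEGER, name TEXT, SaleAmount REAL);
-- five ordinary sellers and one far outlier (hypothetical data)
INSERT INTO salespeople VALUES
  (1,'Ann',10),(2,'Bob',10),(3,'Cat',10),
  (4,'Dan',10),(5,'Eve',10),(6,'Fay',100);
""")

# global statistics, done in Python because SQLite lacks STDDEV()
amounts = [r[0] for r in conn.execute("SELECT SaleAmount FROM salespeople")]
threshold = statistics.mean(amounts) + 2 * statistics.pstdev(amounts)

rows = conn.execute("""
SELECT id, name, AVG(SaleAmount) AS personAvg
FROM salespeople
GROUP BY id, name
HAVING personAvg > ?
""", (threshold,)).fetchall()
print(rows)  # only the seller well above the group average survives
```

With this data the threshold is mean 25 plus two standard deviations (about 67), so only the seller averaging 100 is returned. On MySQL itself, the single query above with the STDDEV subquery does the same job.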

MySQL Query Never Returns

EDIT: This has been solved, requiring a subquery into the appearances table. Here is the working solution.
SELECT concat(m.nameFirst, ' ', m.nameLast) as Name,
m.playerID as playerID,
sum(b.HR) as HR
FROM Master AS m
INNER JOIN Batting AS b
ON m.playerID=b.playerID
WHERE ((m.weight/(m.height*m.height))*703) >= 27.99
AND m.playerID in (SELECT playerID FROM appearances GROUP BY playerID HAVING SUM(G_1b+G_dh)/SUM(G_All) >= .667)
GROUP BY playerID, Name
HAVING HR >= 100
ORDER BY HR desc;
I'm working with the Lahman baseball stat database, if anyone's familiar.
I'm trying to retrieve a list of all large, slugging first basemen, and the data I need is spread across three different tables. The way I'm doing this is finding players of a minimum BMI, who have spent at least 2/3 of their time at first/designated hitter, and have a minimum number of home runs.
'Master' houses player names, height, weight (for BMIs).
'Batting' houses HR.
'Appearances' houses games played at first, games played at DH, and total games.
All three tables are connected by the same 'playerID' column.
Here is my current query:
SELECT concat(m.nameFirst, ' ', m.nameLast) as Name,
m.playerID as playerID,
sum(b.HR) as HR
FROM Master AS m
INNER JOIN Batting AS b
ON m.playerID=b.playerID
INNER JOIN Appearances AS a
ON m.playerID=a.playerID
GROUP BY Name, playerID
HAVING ((m.weight/(m.height*m.height))*703) >= 27.99
AND ((SUM(IFNULL(a.G_1b,0)+IFNULL(a.G_dh,0)))/SUM(IFNULL(a.G_All,0))) >= .667
AND HR >= 200
ORDER BY HR desc;
This appears correct to me, but when entered it never returns (runs forever) - for some reason I think it has something to do with the inner join of the appearances table. I also feel like there's a problem with combining m.weight/m.height in a "HAVING" clause, but with aggregates involved I can't use "WHERE." What should I do?
Thanks for any help!
EDIT: After removing all conditionals, I'm still getting the same (endless) result. This is my simpler query:
SELECT concat(m.nameFirst, ' ', m.nameLast) as Name,
m.playerID as playerID,
sum(b.HR) as HR
FROM Master AS m
INNER JOIN Batting AS b
ON m.playerID=b.playerID
INNER JOIN Appearances AS a
ON m.playerID=a.playerID
GROUP BY playerID, Name
ORDER BY HR desc;
My guess is that the problem with your query is that each player has appeared many times (appearances) and at bat many times. Say a player has been at bat 1000 times in 100 games. Then the join -- as you have written it -- will have 100,000 rows just for that player.
This is just a guess because you have provided no sample data to verify if this is the problem.
The solution is to pre-aggregate the appearances and games tables as subqueries (at the playerId level) and then join them back.
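A small sketch of that pre-aggregation approach, via SQLite in Python with fabricated players (the table and column names follow the question): each per-player subquery collapses to one row per playerID before the join, so the row counts can no longer multiply.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Master (playerID TEXT, nameFirst TEXT, nameLast TEXT);
CREATE TABLE Batting (playerID TEXT, HR INTEGER);
CREATE TABLE Appearances (playerID TEXT, G_1b INTEGER, G_dh INTEGER, G_all INTEGER);
INSERT INTO Master VALUES ('p1','Big','Slugger'), ('p2','Fast','Fielder');
INSERT INTO Batting VALUES ('p1',30),('p1',40),('p2',5);
INSERT INTO Appearances VALUES ('p1',100,20,150),('p2',0,0,150);
""")

rows = conn.execute("""
SELECT m.playerID, m.nameFirst || ' ' || m.nameLast AS Name, b.HR
FROM Master m
-- pre-aggregate batting to one row per player before joining
JOIN (SELECT playerID, SUM(HR) AS HR
      FROM Batting GROUP BY playerID) b
  ON b.playerID = m.playerID
-- pre-filter appearances to first-base/DH regulars, one row per player
JOIN (SELECT playerID
      FROM Appearances
      GROUP BY playerID
      HAVING SUM(G_1b + G_dh) * 1.0 / SUM(G_all) >= 0.667) a
  ON a.playerID = m.playerID
ORDER BY b.HR DESC
""").fetchall()
print(rows)
```

Player p1 spends 120 of 150 games at first/DH (0.8 ≥ 0.667) with 70 home runs and is kept; p2 is filtered out inside the subquery, never multiplying rows in the outer join.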

MySQL huge tables JOIN makes database collapse

Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale = -1 in every row. A customer can be in neither, one, or both tables. Both tables have around 300,000 records and many more fields than indicated here (around 50). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
from sales x where dates = (select max(dates)
from sales y
where x.customer = y.customer
and y.sale > 0
)
)b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I have to get the results, which I need for certain calculations, separated by dates: information on year 2012, information on year 2013, but also information from all the years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information from all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I EXPLAIN the query, almost all of the time is spent creating a temporary table.
I thought about splitting the data gathering into quarters: get the results for every three months, then join and sort them. But I guess this final join and sort would be too much for the database again.
So, what would you experts recommend to optimize these queries, given that I cannot change the table structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
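A runnable sketch of this greatest-n-per-group anti-join with the id tie-breaker, via SQLite in Python on a tiny invented sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, dates TEXT, customer INTEGER, sale INTEGER);
INSERT INTO sales (dates, customer, sale) VALUES
  ('2013-01-01', 1, 10),
  ('2013-02-01', 1, 99),
  ('2013-02-01', 1, 42),  -- same date as the row above; id breaks the tie
  ('2013-01-15', 2, 7);
""")

rows = conn.execute("""
SELECT a.customer, a.sale AS max_sale
FROM sales a
LEFT JOIN sales b
  ON a.customer = b.customer
 AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
WHERE b.customer IS NULL   -- no row b beats row a, so a is the latest
ORDER BY a.customer
""").fetchall()
print(rows)  # exactly one row per customer: the sale with greatest (dates, id)
```

For customer 1, the two rows on 2013-02-01 tie on date, and the higher id (the later insert) wins, so exactly one row per customer survives.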
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
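The NULL behavior is easy to verify; this snippet uses SQLite in Python, but SUM() and COUNT() skip NULLs the same way in MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer INTEGER, sale INTEGER)")
# NULL stands in for 'no sale' instead of a -1 sentinel
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, 10), (1, None), (1, 5), (2, None)])

rows = conn.execute("""
SELECT customer, SUM(sale) AS total, COUNT(sale) AS n_sales
FROM t
GROUP BY customer
ORDER BY customer
""").fetchall()
print(rows)  # [(1, 15, 2), (2, None, 0)]
```

No WHERE clause is needed: the NULL rows simply drop out of both aggregates, whereas a -1 sentinel would have corrupted the SUM.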
Re: your comment
What I have now is a table with fields year, quarter, total_sale (regarding to the pair (year,quarter)) and sale. What I want to gather is information regarding certain period: this quarter, quarters, year 2011... Info has to be splitted in top customers, ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, date). The subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, max (or min), substring, cast. You may need to adjust the date convert function (this one works in MS-SQL), but this idea will work anywhere:
SELECT customer, count(sale), max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
group by customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
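The same string-packing trick translated to SQLite in Python (the MS-SQL convert/str calls above become string concatenation and printf here; data is invented, and the zero-padding assumes non-negative sale values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (customer INTEGER, dates TEXT, sale INTEGER);
INSERT INTO sales VALUES
  (1, '2013-01-01', 10),
  (1, '2013-02-01', 99),
  (2, '2013-01-15', 7);
""")

rows = conn.execute("""
SELECT customer,
       COUNT(sale) AS n_sales,
       -- pack 'YYYY-MM-DD' (10 chars) + zero-padded sale, take MAX, unpack
       CAST(SUBSTR(MAX(dates || printf('%012d', sale)), 11) AS INTEGER) AS max_sale
FROM sales
GROUP BY customer
ORDER BY customer
""").fetchall()
print(rows)  # each customer's count plus the sale on their latest date
```

Because the date prefix has a fixed length and sorts chronologically as text, MAX() over the packed string picks the row with the latest date, and SUBSTR recovers the sale that rode along with it, with no join or subquery at all.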