I have been intrigued by a problem on SQLZoo. It is a "greatest-n-per-group" problem. I would like to understand how the engine is operating.
A table called bbc contains the name, region of the world and population of each country:
bbc( name, region, population)
The given task is to select the most populous country of each region, showing its name, the region and population.
The solution provided is:
SELECT region, name, population FROM bbc x
WHERE population >= ALL
(SELECT population FROM bbc y
WHERE y.region=x.region
AND population>0)
1. Main Question. I am finding this a bit of a mind twister. I would like to understand how the engine processes this, because at first blush it seems there is some kind of co-dependence (x depending on y, and y depending on x). Does the engine follow some kind of recursion to produce the final selection? Or am I missing something, such that either x or y is actually fixed?
2. Secondary Question. Oddly, when I pull the "AND population>0" out of the parentheses and leave it on its own at the bottom, one of the regions (Europe / Russia) goes missing from the 8 results. Why? I don't understand that.
And indeed, when I try the query on the world database (available from the MySQL website on the same page as Sakila), the behavior is different:
With population > 0 out of the parentheses, I get 6 regions. Six is the right number in this database, because "SELECT continent FROM country GROUP BY continent" reveals seven continents, of which one is Antarctica, which includes 5 "countries", all with a 0 population.
So that seems right.
SELECT continent, `name`, population FROM country X
WHERE population >= ALL
(SELECT population FROM country Y
WHERE Y.`Continent` = X.`Continent`)
AND population>0
On the other hand, when I pull "population > 0" back into the parentheses as on SQLZoo, I also get 5 countries with a zero (the countries "belonging to Antarctica"). It doesn't matter if I specify x.population or y.population, I get zeroes.
continent      name                                          population
-------------  --------------------------------------------  ------------
Antarctica     Antarctica                                    0
Antarctica     French Southern territories                   0
Oceania        Australia                                     18886000
South America  Brazil                                        170115000
Antarctica     Bouvet Island                                 0
Asia           China                                         1277558000
Antarctica     Heard Island and McDonald Islands             0
Africa         Nigeria                                       111506000
Europe         Russian Federation                            146934000
Antarctica     South Georgia and the South Sandwich Islands  0
North America  United States                                 278357000
Very much looking for insights on these questions!
Wishing you all a beautiful week.
:)
Notes:
For reference, the problem is number 3a on this page:
http://old.sqlzoo.net/1a.htm?answer=1
A thread mentioning the "greatest-n-per-group" problem for the same query:
MySQL world database Trying to avoid subquery
The world database is available here: http://dev.mysql.com/doc/index-other.html
Main Question. I am finding this a bit of a mind twister. I would like to understand how the engine processes this, because at first
blush it seems there is some kind of co-dependence (x depending on y,
and y depending on x). Does the engine follow some kind of recursion
to produce the final selection? Or am I missing something, such that
either x or y is actually fixed?
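This isn't recursion. The subquery is correlated: for each candidate row of x, x.region is a fixed value, and the subquery over y is evaluated against that value. The MySQL documentation covers this pattern under the rows holding the group-wise maximum of a certain column. Their solution to the problem is equivalent to this: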
SELECT region, name, population FROM bbc x
WHERE population =
(SELECT max(population) FROM bbc y
WHERE y.region=x.region
)
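As a rough check of the MAX form above, here is a minimal sketch that runs it through Python's sqlite3 module. The table contents (Alpha, Beta, Gamma, Delta and their figures) are invented for illustration, and SQLite stands in for MySQL here.

```python
import sqlite3

# In-memory toy table; names and figures are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bbc (name TEXT, region TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO bbc VALUES (?, ?, ?)",
    [
        ("Alpha", "North", 100),
        ("Beta", "North", 250),
        ("Gamma", "South", 40),
        ("Delta", "South", None),  # NULL population
    ],
)

# For each outer row x, the subquery computes the maximum for x's region.
rows = conn.execute(
    """
    SELECT region, name, population FROM bbc x
    WHERE population =
        (SELECT MAX(population) FROM bbc y
         WHERE y.region = x.region)
    ORDER BY region
    """
).fetchall()

print(rows)  # [('North', 'Beta', 250), ('South', 'Gamma', 40)]
```

MAX skips NULLs, so the row with a NULL population neither wins nor blocks the other rows of its region.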
Secondary Question. Oddly, when I pull the "AND population>0" out of the parenthesis and leave it on its own at the bottom, one of the
regions (Europe / Russia) goes missing from the 8 results. Why? I
don't understand that.
Slight changes (as suggested by ypercube above) work
SELECT region, name, population FROM bbc x
WHERE population >= ALL
(SELECT population FROM bbc y
WHERE y.region=x.region
AND population IS NOT NULL)
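This query
SELECT region, name, population FROM bbc x
WHERE population IS NULL
returns a row, so at least one country has a NULL population. A comparison against NULL evaluates to unknown rather than true, so while that row is part of the subquery's result, no country in the same region can satisfy >= ALL. With population>0 inside the parentheses the NULL row is filtered out of the subquery, which is why moving the condition outside makes a region disappear. I'm not sure why population should be nullable, but I didn't take a good look at the rest of the data. Without the NULL, the query should work fine without the >0.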
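To make the effect of a NULL on >= ALL concrete, here is a small Python emulation of SQL's three-valued logic. It is only a sketch: None stands in for NULL, the helper names (ge, ge_all, winners) are hypothetical, and the population figures other than Russia's are invented.

```python
def ge(a, b):
    """SQL-style a >= b; None plays the role of NULL and yields unknown."""
    if a is None or b is None:
        return None
    return a >= b


def ge_all(value, candidates):
    """SQL-style value >= ALL (candidates) under three-valued logic."""
    results = [ge(value, c) for c in candidates]
    if any(r is False for r in results):
        return False  # one definite failure decides it
    if any(r is None for r in results):
        return None   # no failure, but a NULL leaves the answer unknown
    return True       # every comparison held, or the set was empty


def winners(populations, drop_nulls):
    """Emulate the outer WHERE; drop_nulls mimics filtering inside the subquery."""
    if drop_nulls:
        subquery = [p for p in populations if p is not None]
    else:
        subquery = populations
    # WHERE keeps a row only when the predicate is true, not when it is unknown.
    return [p for p in populations if ge_all(p, subquery) is True]


# One region with a NULL population among its rows.
region = [146934000, 82000000, None]

no_filter = winners(region, drop_nulls=False)
with_filter = winners(region, drop_nulls=True)
print(no_filter)    # []
print(with_filter)  # [146934000]
```

The same function returns True for an empty candidate list, which matches the world database result in the question: with population>0 inside the parentheses the subquery for Antarctica is empty, so every Antarctica row passes.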
Also, this is different from the greatest-n-per-group. In that problem you seek to find the top N items instead of just the top one.
Related
Is there a way to find all the orders shipped to London using an SQL query? Simply searching for London in the columns doesn't work as some customers have put the district name rather than "London".
So I thought the best way to go was via the postcode. Would this be the best way to go about finding the rows? And continue with using OR statements for each postcode?
select * from tt_order_data
where ship_postcode like "e1%"
According to wiki, this is the postcode range:
The E, EC, N, NW, SE, SW, W and WC postcode areas (the eight London
postal districts) comprised the inner area of the London postal region
and correspond to the London post town.
The BR, CR, DA, EN, HA, IG, SL, TN, KT, RM, SM, TW, UB, WD and CM (the
14 outer London postcode areas) comprised the outer area of the London
postal region.[20]
The inner and outer areas together comprised the London postal
region.[13]
One way to do this would be to leverage REGEXP and define a pattern that matches only ship_postcodes that begin with one of the aforementioned London postcode character sequences:
SELECT *
FROM tt_order_data
WHERE UPPER(TRIM(ship_postcode)) REGEXP '^(E|EC|N|NW|SE|SW|W|WC|BR|CR|DA|EN|HA|IG|SL|TN|KT|RM|SM|TW|UB|WD|CM)'
Keep in mind that you will still need to cleanse the data if the inputs weren't properly controlled, because this filter checks only the leading letters: E1 7AA is valid and matches, but so does a string like ERGO. For the same reason, the single-letter areas (E, N, W) also match postcodes from other areas that begin with the same letter, such as EH or NG.
As an aside, I'm not exactly sure how this will perform with your specific dataset at scale, but if this is for a one-off exercise then it should fit your needs just fine.
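To see how the prefix filter behaves on a few inputs, here is a sketch using Python's re module with the same alternation. The second pattern, which additionally requires a digit straight after the area letters, is a suggested tightening and not part of the query above; the function names are hypothetical.

```python
import re

# Same area list as in the REGEXP above.
AREAS = "E|EC|N|NW|SE|SW|W|WC|BR|CR|DA|EN|HA|IG|SL|TN|KT|RM|SM|TW|UB|WD|CM"

# Loose: prefix match only, as in the SQL query.
loose = re.compile(rf"^({AREAS})")
# Stricter (an assumption): the area letters must be followed by a digit.
strict = re.compile(rf"^({AREAS})[0-9]")


def normalise(postcode):
    """Mirror UPPER(TRIM(ship_postcode))."""
    return postcode.strip().upper()


def is_london(postcode, pattern):
    return pattern.match(normalise(postcode)) is not None


for sample in ["e1 7aa", " ec1a 1bb ", "ERGO", "NG1 1AA", "M1 1AE"]:
    print(repr(sample), is_london(sample, loose), is_london(sample, strict))
```

In MySQL the stricter variant would amount to appending [0-9] after the closing parenthesis of the pattern. It still does not validate the rest of the postcode.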
I just started an exercise-style SQL tutorial, but I still haven't grasped the concept of correlated queries.
name, area and continent are fields in a table.
The task is to find the largest country (by area) in each continent and show the continent, the name and the area.
The draft work so far:
SELECT continent, name, population FROM world x
WHERE area >= ALL
(SELECT area FROM world y
WHERE y.continent=x.continent
AND population>0)
I tried reading up on it on a few other blogs.
I need to understand the logic behind correlated queries.
I assume the query you posted works and you just need clarification of what it does.
SELECT continent, name, population
FROM world x
WHERE area >= ALL (
SELECT area FROM world y
WHERE y.continent=x.continent
AND population>0
)
The query translates to
"Get the continent, name, and population of each country whose area is greater than or equal to the area of every country in the same continent (counting only countries with a population above 0)".
The WHERE clause in the inner query links the two queries (in this case, to countries in the same continent). Without that WHERE clause, the query would return the country with the largest area in the world.
You can think of a correlated subquery as a looping mechanism. This is not necessarily how it is implemented, but it describes what it does.
Consider data such as:
row continent area population
1 a 100 19
2 a 200 10
3 a 300 20
4 b 15 2000
The outer query loops through each row. Then it looks at all matching rows. So, it takes record 1:
row continent area population
1 a 100 19
It then runs the subquery:
(SELECT w2.area
FROM world w2
WHERE w2.continent = w1.continent AND
w2.population > 0
)
And substitutes in the values from the outer table:
(SELECT w2.area
FROM world w2
WHERE w2.continent = 'a' AND
w2.population > 0
)
This returns the set (100, 200, 300).
Then it applies the condition:
where w1.area >= all (100, 200, 300)
(This isn't really valid SQL but it conveys the idea.)
Well, we know that w1.area = 100, so this condition is false.
The process is then repeated for each of the rows. For the "a" continent, the only row that meets the condition is the third one -- the one with the largest area.
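The looping description above can be sketched in plain Python on the same four sample rows. This only mirrors the logic; it says nothing about how a database engine actually executes the query, and the function name is hypothetical.

```python
# The four sample rows from the walkthrough above.
world = [
    {"row": 1, "continent": "a", "area": 100, "population": 19},
    {"row": 2, "continent": "a", "area": 200, "population": 10},
    {"row": 3, "continent": "a", "area": 300, "population": 20},
    {"row": 4, "continent": "b", "area": 15, "population": 2000},
]


def largest_per_continent(rows):
    kept = []
    for outer in rows:  # the outer query visits every row
        # The "subquery": areas of rows in the same continent with population > 0.
        areas = [
            inner["area"]
            for inner in rows
            if inner["continent"] == outer["continent"] and inner["population"] > 0
        ]
        # The ">= ALL" test: the outer area must not be beaten by any of them.
        if all(outer["area"] >= area for area in areas):
            kept.append(outer["row"])
    return kept


result = largest_per_continent(world)
print(result)  # [3, 4]
```

Row 4 is kept as well because it is the only row in continent "b", so its area is compared only with itself.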
I have a requirement to remove "duplicate" entries from a dataset, which is being displayed on the front-end of our application.
A duplicate is defined by the client as a speed test result which is in the same exchange.
Here is my current query,
SELECT id, isp, exchange_name, exchange_postcode_area, download_kbps, upload_kbps
FROM speedtest_results
WHERE postcode IS NOT NULL
AND exchange_name IS NOT NULL
ORDER BY download_kbps DESC, upload_kbps ASC
This query would return some data like this,
id     isp                        exchange_name  exchange_postcode_area  download_kbps  upload_kbps
12062  The University of Bristol  Bristol North  BS6                     821235         212132
12982  HighSpeed Office Limited   Totton         SO40                    672835         298702
18418  University of Birmingham   Victoria       B9                      553187         336889
14050  Sohonet Limited            Lee Green      SE13                    537686         104439
19981  The JNT Association        Holborn        WC1V                    335833         74459
19983  The JNT Association        Holborn        WC1V                    333661         84397
5652   University of Southampton  Woolston       SO19                    330320         64200
As you can see, there are two tests in the WC1V postcode area, which I'd like to aggregate into a single result, ideally using max rather than avg.
How can I modify my query to ensure that I am selecting the fastest speed test result for the exchange whilst still being able to return a list of all the max speeds?
Seems that I was far too hasty to create a question! I have since solved my own issue.
SELECT id, isp, exchange_name, exchange_postcode_area, MAX(download_kbps) as download_kbps, upload_kbps
FROM speedtest_results
WHERE exchange_name IS NOT NULL
AND postcode IS NOT NULL
GROUP BY exchange_name
ORDER BY MAX(download_kbps) DESC
LIMIT 20
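One caveat with grouping this way: id, isp and upload_kbps are neither aggregated nor part of the GROUP BY, so MySQL is free to return them from any row of the group, and it rejects the query outright when ONLY_FULL_GROUP_BY is enabled. A hedged alternative is to keep the whole row whose download_kbps equals the maximum for its exchange. The sketch below runs such a query through Python's sqlite3 module on the sample rows from the question. The split of those rows into columns is an assumed reading of the output, and the postcode filter is left out because the sample does not show that column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE speedtest_results (
           id INTEGER, isp TEXT, exchange_name TEXT,
           exchange_postcode_area TEXT,
           download_kbps INTEGER, upload_kbps INTEGER)"""
)
# Sample rows from the question (column boundaries assumed).
conn.executemany(
    "INSERT INTO speedtest_results VALUES (?, ?, ?, ?, ?, ?)",
    [
        (12062, "The University of Bristol", "Bristol North", "BS6", 821235, 212132),
        (12982, "HighSpeed Office Limited", "Totton", "SO40", 672835, 298702),
        (18418, "University of Birmingham", "Victoria", "B9", 553187, 336889),
        (14050, "Sohonet Limited", "Lee Green", "SE13", 537686, 104439),
        (19981, "The JNT Association", "Holborn", "WC1V", 335833, 74459),
        (19983, "The JNT Association", "Holborn", "WC1V", 333661, 84397),
        (5652, "University of Southampton", "Woolston", "SO19", 330320, 64200),
    ],
)

# Keep each row whose download speed is the maximum for its exchange.
rows = conn.execute(
    """
    SELECT s.id, s.isp, s.exchange_name, s.exchange_postcode_area,
           s.download_kbps, s.upload_kbps
    FROM speedtest_results s
    WHERE s.exchange_name IS NOT NULL
      AND s.download_kbps =
          (SELECT MAX(s2.download_kbps)
           FROM speedtest_results s2
           WHERE s2.exchange_name = s.exchange_name)
    ORDER BY s.download_kbps DESC
    """
).fetchall()

ids = [r[0] for r in rows]
print(ids)  # [12062, 12982, 18418, 14050, 19981, 5652]
```

If two tests for one exchange tie on the maximum download speed, this form returns both of them.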
I am trying to craft code that "Shows the countries that are big by area or big by population but not both". It should show the name, population and area.
The table the code references...
This is my code so far...
SELECT name, population, area FROM world
WHERE area > 3000000 OR population > 250000000 OR name != LIKE '%United States%'
world contains name, area, and population.
Anyone have any advice?
You can use XOR for this. It is true when exactly one of its operands is true.
SELECT name, population, area
FROM world
WHERE (area > 3000000 XOR population > 250000000)
AND name NOT LIKE '%United States%'
I also changed the way the United States test is combined. I assume you're trying to exclude United States from the results, so it needs to be AND.
Use the XOR (exclusive OR) operator:
SELECT name, population, area FROM world
WHERE area > 3000000 XOR population > 250000000
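XOR is not available in every engine. As a sketch for engines without it, comparing the two conditions with <> gives the same rows as long as neither side is NULL. The example below runs that form through Python's sqlite3 module (SQLite has no XOR operator). The country names and figures are invented so that each of the four combinations appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, population INTEGER, area INTEGER)")
# Invented rows covering the four combinations of the two conditions.
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Wideland", 30000000, 9000000),    # big by area only
        ("Crowdia", 300000000, 500000),     # big by population only
        ("Megastan", 1300000000, 9500000),  # big by both
        ("Tinyton", 1000000, 20000),        # big by neither
    ],
)

# Each comparison evaluates to 0 or 1; <> is true when the two results differ.
rows = conn.execute(
    """
    SELECT name, population, area FROM world
    WHERE (area > 3000000) <> (population > 250000000)
    ORDER BY name
    """
).fetchall()

names = [r[0] for r in rows]
print(names)  # ['Crowdia', 'Wideland']
```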
I don't understand the problem with returning multiple rows.
Here is my table BBC:
name         region       area     population  gdp
Afghanistan  South Asia   652225   26000000
Albania      Europe       28728    3200000     6656000000
Algeria      Middle East  2400000  32900000    75012000000
Andorra      Europe       468      64000
Angola       Africa       1250000  14500000    14935000000
etc.
question:
List the name and region of countries
in the regions containing 'India',
'Iran'.
this is my statement:
select name from bbc where region = (select region from bbc where name='India' or name='Iran')
it returns:
sql: error Subquery returns more than 1 row
What's wrong with my statement? The answer should be in the form of a select statement within a select statement.
Thank you!
This is because you are trying to compare region to a table of values. Instead, try using in:
select name
from bbc
where region in
(select region from bbc where name='India' or name='Iran')
With slightly different syntax it will work:
SELECT name
FROM bbc
WHERE region IN
(
SELECT region FROM bbc WHERE name='India' OR name='Iran'
)
The only difference being that instead of equals (=), we use IN.
Your previous statement failed because equals compares one value with exactly one other value, and you were accidentally comparing one value with multiple values (hence "Subquery returns more than 1 row"). The change here says: where region is among the results returned by the subquery.
select name,region from bbc where region IN (select region from bbc where name IN('India','Iran'))
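As a quick check of the IN form, here is a sketch that runs it through Python's sqlite3 module on the sample rows from the question. The India and Iran rows are not in the sample; their regions (South Asia and Middle East) are assumptions added so the subquery has something to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bbc (name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO bbc VALUES (?, ?)",
    [
        ("Afghanistan", "South Asia"),
        ("Albania", "Europe"),
        ("Algeria", "Middle East"),
        ("Andorra", "Europe"),
        ("Angola", "Africa"),
        ("India", "South Asia"),  # assumed region
        ("Iran", "Middle East"),  # assumed region
    ],
)

# The subquery returns two regions; IN accepts a set of values where = does not.
rows = conn.execute(
    """
    SELECT name, region FROM bbc
    WHERE region IN
        (SELECT region FROM bbc WHERE name IN ('India', 'Iran'))
    ORDER BY name
    """
).fetchall()

print(rows)
```

Note that SQLite does not raise the "more than 1 row" error for the = form, so this sketch only shows the corrected query, not the failure.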