For one of the questions in my computing coursework, I was asked to explain the following SQL script in detail:
SELECT exam_board, COUNT(*)
FROM subjects
GROUP BY exam_board;
Below is what I have written in response to that question. I was just wondering if I forgot to include something, or if I stated something incorrectly. Any feedback at all would be greatly appreciated!
The script begins with a SELECT statement. A SELECT statement retrieves records from one or more tables or databases (the data that is returned is then stored inside a result table, which is called a result-set). ‘COUNT(*)’ is a function which returns the number of rows which match a specified criteria (all rows, as there is an asterisk), and it gives a total number of records fetched in a query. Therefore ‘SELECT exam_board, COUNT(*) FROM subjects’ means that the script will return all exam boards from the ‘exam_board’ column in the ‘subjects’ table with their count (of how many subjects are of that exam board). Finally, the last line is ‘GROUP BY exam_board;’. The ‘GROUP BY’ clause is often used in SELECT statements to collect data from a number of records. Its purpose is to group the results in one or more columns. In this case it was grouped by ‘exam_board’, meaning that the result of the query will be grouped into a column of the exam boards.
You forgot that the effect of GROUP BY is to reduce the result set to one row per distinct value in the grouping column (exam_board in this query).
So there might be 10,000 rows in the subjects table, but only four distinct values for exam_board. Using GROUP BY means you will only have four rows in the result set, exactly one row for each exam_board.
Then the COUNT(*) will be the count of rows that were "collapsed" for each respective group.
I request that you do not copy & paste my answer, but write your own answer in your own words. My writing style is pretty different from yours, so if you copy & paste, it'll be obvious to your teacher that you lifted this.
Actually, this is not the best answer.
SELECT can return not only data from the tables, but any result of any function, for example SELECT VERSION() returns a version of server software.
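The asterisk in COUNT(*) simply means "count every row". You can put any expression that is never NULL there instead, even COUNT(VERSION()), and the result will be the same. The one exception is an expression that can be NULL: COUNT(column) skips the rows where that column is NULL.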
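One caveat about the COUNT argument is easy to check: COUNT of an expression skips NULLs, while COUNT(*) does not. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (both follow the standard rule here); the `name` column and the three rows are made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; COUNT(expr) ignores NULLs in both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subjects (name TEXT, exam_board TEXT)")
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    # Hypothetical rows: one subject has no exam board recorded.
    [("Maths", "AQA"), ("Physics", "AQA"), ("Art", None)],
)

count_star, count_column, count_constant = conn.execute(
    "SELECT COUNT(*), COUNT(exam_board), COUNT(1) FROM subjects"
).fetchone()

print(count_star, count_column, count_constant)  # 3 2 3
```

So a constant or other non-NULL expression behaves like the asterisk, but a nullable column does not.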
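‘SELECT exam_board, COUNT(*) FROM subjects’ (without GROUP BY) will return a single row with two columns: the total number of rows in table 'subjects' and the value of the 'exam_board' column from one row of the table (in practice often the first, but the choice is not guaranteed). This only runs when the ONLY_FULL_GROUP_BY SQL mode is disabled; with it enabled, which is the default since MySQL 5.7.5, the query is rejected with an error.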
Content of the table:
mysql> select exam_board from subjects;
+------------+
| exam_board |
+------------+
| 2 |
| 2 |
| 3 |
| 3 |
| 3 |
+------------+
5 rows in set (0.00 sec)
Mixing column values with an aggregate function that returns a single value, such as SUM(), MIN() or MAX(), without a GROUP BY clause:
mysql> select exam_board, count(*) from subjects;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 5 |
+------------+----------+
1 row in set (0.00 sec)
Only with the GROUP BY clause do we get the desired result: the count of records for each value of the exam_board field.
mysql> select exam_board, count(*) from subjects group by exam_board;
+------------+----------+
| exam_board | count(*) |
+------------+----------+
| 2 | 2 |
| 3 | 3 |
+------------+----------+
2 rows in set (0.00 sec)
Related
I'm trying to write a query that returns a fixed number of results in a group concat. I don't think it's possible with a group concat, but I'm having trouble figuring out what sort of subquery to add.
Here's what I would like to do:
Query
select id,
group_concat(concat(user,'-',time) order by time limit 5)
from table
where id in(1,2,3,4)
group by 1
When I remove the "limit 5" from the group concat, the query works but spits out way too much information.
I'm open to structuring the query differently. Specific ID numbers will be supplied by the user of the query, and for each ID specified, I would like to list a fixed number of results. Let me know if there is a better way to achieve this.
I'm not sure of the exact result set you want, but check out this SO post:
How to hack MySQL GROUP_CONCAT to fetch a limited number of rows?
As another example, I tried out the query/solution provided in the link and came up with this:
SELECT user_id, SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT date_of_entry),',',5) AS logged_dates FROM log GROUP BY user_id;
Which returns:
user_id | logged_dates
1 | "2014-09-29,2014-10-18,2014-10-05,2014-10-12,2014-10-19"
2 | "2014-09-12,2014-09-03,2014-09-23,2014-09-22,2014-10-13"
3 | "2014-09-10"
6 | "2014-09-29,2014-09-27,2014-09-26,2014-09-25"
8 | "2014-09-26,2014-09-30,2014-09-27"
9 | "2014-09-28"
13 | "2014-09-29"
22 | "2014-10-12"
The above query will return every user id that has logged something, and up to 5 dates that the user has logged. If you want more or fewer results from the group concat, just change the number 5 in my query.
Following up, and merging my query with yours, I get:
SELECT user_id, SUBSTRING_INDEX(GROUP_CONCAT(date_of_entry ORDER BY date_of_entry ASC),',',3) AS logged_dates FROM log WHERE user_id IN(1,2,3,4) GROUP BY user_id
Which would return (notice that I changed the number of results returned from the group_concat):
user_id | logged_dates
1 | "2014-09-16,2014-09-17,2014-09-18"
2 | "2014-09-02,2014-09-03,2014-09-04"
3 | "2014-09-10"
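If the SUBSTRING_INDEX trick feels too fragile (it cuts a string that GROUP_CONCAT has already built), another option is to fetch the ordered rows and keep the first N per group in application code. The following is only a sketch of that alternative, not the approach from the linked post; it uses Python's sqlite3 as a stand-in for MySQL, the helper name `first_n_per_group` is hypothetical, and the sample rows are invented.

```python
import sqlite3
from itertools import groupby, islice


def first_n_per_group(rows, n):
    """Keep at most n values per key.

    rows must be (key, value) pairs already sorted by key, then by value.
    """
    return {
        key: [value for _, value in islice(group, n)]
        for key, group in groupby(rows, key=lambda row: row[0])
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (user_id INTEGER, date_of_entry TEXT)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [  # invented sample data
        (1, "2014-09-18"), (1, "2014-09-16"), (1, "2014-09-17"), (1, "2014-09-29"),
        (2, "2014-09-03"), (2, "2014-09-02"),
        (3, "2014-09-10"),
    ],
)

# The database does the filtering and ordering; Python only truncates each group.
rows = conn.execute(
    "SELECT user_id, date_of_entry FROM log "
    "WHERE user_id IN (1, 2, 3, 4) "
    "ORDER BY user_id, date_of_entry"
).fetchall()

limited = first_n_per_group(rows, 3)
print(limited)
```

The trade-off is that every matching row is still transferred to the client, so this suits small per-user histories better than very large ones.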
Is it possible, in MySQL, to always return an expected number of elements by filling the remainder of a result set with rows that were already returned, keeping the rows in order?
Edit
In a web store, I always want to show n elements. I update the shown elements every w seconds and I want to loop indefinitely.
For example, using table myTable:
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
Something like
SELECT id FROM myTable WHERE id > 3 ORDER BY id ALWAYS_RETURN_THIS_NUMBER_OF_ELEMENTS 5
would actually return (where ALWAYS_RETURN_THIS_NUMBER_OF_ELEMENTS doesn't exist)
+----+
| id |
+----+
| 4 |
| 5 |
| 4 |
| 5 |
| 4 |
+----+
This is a very strange need. Here is a method:
select id
from (SELECT id
FROM myTable
WHERE id > 3
ORDER BY id
LIMIT 5
) t cross join
(select 1 as n union all select 2 union all select 3 union all select 4 union all select 5
) n
order by n.n, id
limit 5;
You may need to extend the list of numbers in n to be sure you have enough rows for the final limit.
No, that's not what LIMIT does. The LIMIT clause is applied as the last step in the statement execution, after aggregation, after the HAVING clause, and after ordering.
I can't fathom a use case that would require the type of functionality you describe.
FOLLOWUP
The query that Gordon Linoff provided will return the specified result, as long as there is at least one row in myTable that satisfies the predicate. Otherwise, it will return zero rows.
Here's the EXPLAIN output for Gordon's query:
id select_type table type key rows Extra
-- ------------ ---------------- ----- ------- ---- -------------------------------
1 PRIMARY <derived2> ALL 5 Using temporary; Using filesort
1 PRIMARY <derived3> ALL 5 Using join buffer
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
UNION RESULT <union3,4,5,6,7> ALL
2 DERIVED myTable range PRIMARY 10 Using where; Using index
Here's the EXPLAIN output for the original query:
id select_type table type key rows Extra
-- ----------- ----------------- ----- ------- ---- -------------------------------
1 SIMPLE myTable range PRIMARY 10 Using where; Using index
It just seems like it would be a whole lot more efficient to reprocess the resultset from the original query, if that resultset contains fewer than five (and more than zero) rows. (When that number of rows goes from 5 to 1,000 or 150,000, it would be even stranger.)
The code to get multiple copies of rows from a resultset is quite simple: fetch the rows, and if the end of the result set is reached before you've fetched five (or N) rows, then just reset the row pointer back to the first row, so the next fetch will return the first row again. In PHP using mysqli, for example, you could use:
$result->data_seek(0);
Or, for those still using the deprecated mysql_ interface:
mysql_data_seek($result,0);
But if you're returning only five rows, it's likely you aren't even looping through the result at all, and you already stuffed all the rows into an array. Just loop back through the beginning of the array.
For MySQL interfaces that don't support a scrollable cursor, we'd just store the whole resultset and process it multiple times. With Perl DBI, we'd use fetchall_arrayref; with JDBC (which is going to store the whole result set in memory anyway without special settings on the connection), we'd store the resultset as an object.
Bottom line, squeezing this requirement (to produce a resultset of exactly five rows) back to the database server, and pulling back duplicate copies of a row and/or storing duplicate copies of a row in memory just seems like the wrong way to satisfy the use case. (If there's rationale for storing duplicate copies of a row in memory, then that can be achieved without pulling duplicate copies of rows back from the database.)
It's just very odd that you say you're using/implementing a "circular buffer", but that you choose not to "circle" back around to the beginning of a resultset which contains fewer than five rows, and instead need to have MySQL return you duplicate rows. Just very, very strange.
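To make the "circle back to the beginning" idea concrete, here is a minimal client-side sketch. It assumes the rows have already been fetched into a list; the helper name `pad_by_wrapping` is made up for this example.

```python
from itertools import cycle, islice


def pad_by_wrapping(rows, n):
    """Return exactly n items by cycling through rows.

    An empty input stays empty, since there is nothing to repeat.
    """
    if not rows:
        return []
    return list(islice(cycle(rows), n))


# e.g. the ids returned by: SELECT id FROM myTable WHERE id > 3 ORDER BY id
fetched = [4, 5]
padded = pad_by_wrapping(fetched, 5)
print(padded)  # [4, 5, 4, 5, 4]
```

The database returns each row once, and the repetition happens where the display logic lives.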
If I use following query:
SELECT DISTINCT comment FROM table;
And I have for example following data: (IDs are just there to SHOW the order...)
ID | comment
-------------
1 | comment1
2 | comment1
3 | comment2
4 | comment1
What I could get back are following three results:
Result 1:
1 | comment1
3 | comment2
Result 2:
3 | comment2
4 | comment1
Result 3:
order is unpredictable
Question 1:
Is the result independent of the platform? Can I make sure that I always get a predictable result?
Question 2:
I want to select all distinct comments and get only the NEWEST of each, meaning I always want to get result 2. Is it possible to achieve that? Maybe ordering by the key would affect the result?
Your query doesn't request the ID column, only the comment column:
SELECT DISTINCT comment FROM table;
In the result, the ID is not included, so the row each value comes from is irrelevant.
comment1
comment2
As for how it will sort them, I think it depends on index order. I'll do a test to confirm:
mysql> create table t (id int primary key, comment varchar(100));
mysql> insert into t values
-> (1, 'comment2'),
-> (2, 'comment1'),
-> (3, 'comment2'),
-> (4, 'comment1');
The default order is that of the primary key:
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment2 |
| comment1 |
+----------+
Whereas if we have an index on the requested column, it returns the values in index order:
mysql> create index i on t(comment);
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment1 |
| comment2 |
+----------+
I'm assuming the InnoDB storage engine, because everyone should be using InnoDB. ;-)
Your last question indicates that you really want a query that doesn't involve DISTINCT at all, but it's a greatest-n-per-group question. This type of question is very common, and it has been asked and answered hundreds of times on StackOverflow. Follow the link and read the many solutions.
You can experiment and see which of the unique rows is returned, and you can experiment and see which order they're returned in, but that will only show you how things turn out with your experimental table, today, under the current database engine version. Bottom line:
If you SELECT DISTINCT comment, the id is immaterial because it's not in your SELECT.
If you don't ORDER BY, the database will determine the order.
If you want the most recent distinct comment with its ID, this will work every time (full disclosure: this replaces an earlier answer that works but was over-thinking the problem):
SELECT comment, MAX(id)
FROM myTable
GROUP BY comment
ORDER BY 2 DESC;
Note that the ORDER BY 2 DESC assumes that the higher the ID, the more recent the comment.
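The query can be tried against the sample data from the question. This sketch runs it through Python's sqlite3 module rather than MySQL, on the assumption that the two behave the same for this simple GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (id INTEGER PRIMARY KEY, comment TEXT)")
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?)",
    [(1, "comment1"), (2, "comment1"), (3, "comment2"), (4, "comment1")],
)

# One row per distinct comment, paired with its highest (newest) id.
newest = conn.execute(
    "SELECT comment, MAX(id) FROM myTable GROUP BY comment ORDER BY 2 DESC"
).fetchall()

print(newest)  # [('comment1', 4), ('comment2', 3)]
```

These are the rows of "Result 2" from the question (ids 4 and 3), listed newest first.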
If you select a single distinct column, the other will not be returned.
select distinct column from table
is the same result as
select column from table group by column
In both these cases, the sort order of column is unpredictable; it depends on the execution plan, which may vary with larger amounts of data, different table structures, or different database versions.
To mimic your result, one would have to do:
select id, column from table group by column
which is an illegal grouping. If your SQL mode permits it to run, ID will be random.
If you mean select distinct * from table, then all distinct rows will be returned, which in your case is the whole table.
I want to get the maximum value of a column for the first 1000 non-null results for some condition, then the same for the next 1000, and so on. I do this for different conditions, but I found something strange when I use dayofweek. The first command I show you works:
mysql> select max(id),max(d20) from (select id, d20 from data where d20 is not null and id<1000000 and dayofweek(day)=1 limit 1000) x;
+---------+----------+
| max(id) | max(d20) |
+---------+----------+
| 100281 | 13785 |
+---------+----------+
1 row in set (0.44 sec)
but actually I want this second command, which doesn't work as expected.
mysql> select max(id),max(d20) from (select id, d20 from data where d20 is not null and id>100000 and dayofweek(day)=1 limit 1000) x;
+---------+----------+
| max(id) | max(d20) |
+---------+----------+
| 303765 | 0 |
+---------+----------+
1 row in set (0.02 sec)
Any clue?
Take the extreme case of the limit being 1.
That means, the subquery returns any row (there's no order by to make the row deterministic) that has id<1000000, which makes MAX(id) and MAX(d20) return the values from that row only. Hardly representative of the total collection.
Raising the limit to 1000 will just make the sample bigger, but will still give a nondeterministic result depending on which 1000 rows are sampled (assuming there are more than 1000 rows that match). You may very well get a different result every time you execute the query, so expecting a particular result won't work.
If you need a deterministic result, add an ORDER BY to your subquery before limiting the results.
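As a rough illustration of that advice, the sketch below orders the subquery by id before limiting, and uses the previous batch's MAX(id) as the lower bound for the next batch. It runs on Python's sqlite3 instead of MySQL, leaves out the dayofweek(day) filter for brevity, and uses a tiny made-up table with a batch size of 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, d20 INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1, 10), (2, None), (3, 70), (4, 20), (5, 90), (6, 30)],  # invented rows
)

# ORDER BY inside the subquery pins down which rows LIMIT keeps.
query = """
SELECT MAX(id), MAX(d20)
FROM (SELECT id, d20
      FROM data
      WHERE d20 IS NOT NULL AND id > ?
      ORDER BY id
      LIMIT ?) AS x
"""

first_batch = conn.execute(query, (0, 3)).fetchone()
# Start the next batch just after the highest id seen so far.
second_batch = conn.execute(query, (first_batch[0], 3)).fetchone()

print(first_batch, second_batch)  # (4, 70) (6, 90)
```

Because each batch is defined by id order, rerunning the queries gives the same batches every time.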
I have a table with multiple rows which have the same data. I used SELECT DISTINCT to get a unique row and it works fine. But when I use ORDER BY with SELECT DISTINCT it gives me unsorted data.
Can anyone tell me how distinct works?
Based on what criteria it selects the row?
From your comment earlier, the query you are trying to run is
Select distinct id from table where id2 =12312 order by time desc.
As I expected, here is your problem. Your select column and order by column are different. Your output rows are ordered by time, but that order doesn't necessarily need to be preserved in the id column. Here is an example.
id | id2 | time
-------------------
1 | 12312 | 34
2 | 12312 | 12
3 | 12312 | 48
If you run
SELECT * FROM table WHERE id2=12312 ORDER BY time DESC
you will get the following result
id | id2 | time
-------------------
3 | 12312 | 48
1 | 12312 | 34
2 | 12312 | 12
Now if you select only the id column from this, you will get
id
--
3
1
2
This is why your results are not sorted.
When you specify SELECT DISTINCT it will give you all the rows, eliminating duplicates from the result set. By "duplicates" I mean rows where all fields have the same values. For example, say you have a table that looks like:
id | num
--------------
1 | 1
2 | 3
3 | 3
SELECT DISTINCT * would return all rows above, whereas SELECT DISTINCT num would return two rows:
num
-----
1
3
Note that which actual row it selects (e.g. whether it's row 2 or row 3) is irrelevant, as the result would be indistinguishable.
Finally, DISTINCT should not affect how ORDER BY works.
Reference: MySQL SELECT statement
The behaviour you describe happens when you ORDER BY an expression that is not present in the SELECT clause. The SQL standard does not allow such a query but MySQL is less strict and allows it.
Let's try an example:
SELECT DISTINCT column1, column2
FROM table1
WHERE ...
ORDER BY column3
Let's say the content of the table table1 is:
id | column1 | column2 | column3
----+---------+---------+---------
1 | A | B | 1
2 | A | B | 5
3 | X | Y | 3
Without the ORDER BY clause, the above query returns following two records (without ORDER BY the order is not guaranteed):
column1 | column2
---------+---------
A | B
X | Y
But with ORDER BY column3 the order is also not guaranteed.
The DISTINCT clause operates on the values of the expressions present in the SELECT clause. If row #1 is processed first then (A, B) is placed in the result set and it is associated with row #1. Then, when row #2 is processed, the values of the SELECT expressions produce the record (A, B) that is already in the result set. Because of DISTINCT it is dropped. Row #3 produces (X, Y) that is also put in the result set. Then, the ORDER BY column3 clause makes the records be sorted in the result set as (A, B), (X, Y).
But if row #2 is processed before row #1 then, following the same logic explained in the previous paragraph, the records in the result set are sorted as (X, Y), (A, B).
There is no rule imposed on the database engine about the order in which it processes the rows when it runs a query. The database is free to process the rows in whatever order it considers best for performance.
Your query is invalid SQL and the fact that it can return different results using the same input data proves it.
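One way to make such a query well defined, offered here as a suggestion rather than something stated above, is to replace DISTINCT with GROUP BY and say explicitly which column3 value of each group should drive the ordering, for example MIN or MAX. A small sketch using Python's sqlite3 as a stand-in for MySQL, with the three sample rows from the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT, column3 INTEGER)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [(1, "A", "B", 1), (2, "A", "B", 5), (3, "X", "Y", 3)],
)

# Order each (column1, column2) group by its smallest column3 value.
by_min = conn.execute(
    "SELECT column1, column2 FROM table1 "
    "GROUP BY column1, column2 ORDER BY MIN(column3)"
).fetchall()

# Order each group by its largest column3 value instead.
by_max = conn.execute(
    "SELECT column1, column2 FROM table1 "
    "GROUP BY column1, column2 ORDER BY MAX(column3)"
).fetchall()

print(by_min)  # [('A', 'B'), ('X', 'Y')]
print(by_max)  # [('X', 'Y'), ('A', 'B')]
```

The two orderings match the two outcomes described above, but now the query itself decides which one you get instead of the row processing order.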