Max occurrences of a given value in a table - MySQL

I have a table (pretty big one) with lots of columns, two of them being "post" and "user".
For a given "post", I want to know which "user" posted the most.
I was first thinking of fetching all the entries WHERE (post='wanted_post') and then using a PHP hack to find which "user" value occurs most often, but given the large size of my table and my poor knowledge of MySQL's subtler features, I am looking for a pure-MySQL way to get this value (basically, the id of the "user" that posted the most on a given "post").
Is it possible? Or should I fall back on the hybrid SQL-PHP solution?
Thanks,
Cystack

It sounds like this is what you want... am I missing something?
SELECT user
FROM myTable
WHERE post='wanted_post'
GROUP BY user
ORDER BY COUNT(*) DESC
LIMIT 1;
EDIT: Explanation of what this query does:
Hopefully the first three lines make sense to anyone familiar with SQL. It's the last three lines that do the fun stuff.
GROUP BY user -- This collapses rows with identical values in the user column. If this was the last line in the query, we might expect output something like this:
+-------+
| user  |
+-------+
| bob   |
| alice |
| joe   |
+-------+
ORDER BY COUNT(*) DESC -- COUNT(*) is an aggregate function that works along with the previous GROUP BY clause. It tallies all of the rows that are "collapsed" by the GROUP BY for each user. It might be easier to understand what it's doing with a slightly modified statement and its potential output:
SELECT user, COUNT(*)
FROM myTable
WHERE post='wanted_post'
GROUP BY user;
+-------+-------+
| user  | count |
+-------+-------+
| bob   |     3 |
| alice |     1 |
| joe   |     8 |
+-------+-------+
This is showing the number of posts per user.
However, it's not strictly necessary to actually output the value of an aggregate function in this case--we can just use it for the ordering, and never actually output the data. (Of course if you want to know how many posts your top-poster posted, maybe you do want to include it in your output, as well.)
The DESC keyword tells the database to sort in descending order, rather than the default of ascending order.
Naturally, the sorted output would look something like this (assuming we leave the COUNT(*) in the SELECT list):
+-------+-------+
| user  | count |
+-------+-------+
| joe   |     8 |
| bob   |     3 |
| alice |     1 |
+-------+-------+
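Putting it together, a sketch of the full query with the count kept in the output (same myTable and wanted_post as before):
SELECT user, COUNT(*) AS post_count
FROM myTable
WHERE post = 'wanted_post'
GROUP BY user
ORDER BY post_count DESC
LIMIT 1;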
LIMIT 1 -- This is probably the easiest to understand, as it just limits how many rows are returned. Since we're sorting the list from most-posts to fewest-posts, and we only want the top poster, we just need the first result. If you wanted the top 3 posters, you might instead use LIMIT 3.

Related

How do you batch SELECT statements when you can't rely on the IDs to be in literal order?

What I mean by literal order is that, although the IDs are auto-increment, through business logic it might end up that 8 comes after 4 when 5 should have been there. That is to say, if an ID is deleted, there is no re-indexing.
This is how my rows look (table name is wp_posts):
+-----+-------------+----+
| ID  | post_author | .. |
+-----+-------------+----+
|   4 | ..          |    |
+-----+-------------+----+
|   8 | ..          |    |
+-----+-------------+----+
| 124 | ..          |    |
+-----+-------------+----+
| 672 | ..          |    |
+-----+-------------+----+
| 673 | ..          |    |
+-----+-------------+----+
| 674 | ..          |    |
+-----+-------------+----+
ID is an int with the auto-increment characteristic, but when a post is deleted there is no re-assignment of IDs. The row simply gets deleted, and because the column is auto-increment you can still assume that, reading down the table, the IDs that come after the one you're looking at are always bigger than the ones before it.
I'm querying for ID: SELECT ID FROM wp_posts to get a list of all the IDs I need. Now, it just so happens that I need to batch all of this, using AJAX requests because once I retrieve the IDs, I need to operate on them.
Thing is, I don't really understand how to pass my data back to AJAX. What LIMIT does is, if I provide 2 arguments, such as: SELECT ID FROM wp_posts LIMIT 1,3, it'll return back 4,8,124 because it looks at row number. But what do I do on the next call? Yes, the first call always starts with 1, but once I need to launch the second AJAX request to perform yet another SELECT, how do I know where I should start? In my case, I'd want to start again at 4, so, my second query would be SELECT ID FROM wp_posts LIMIT 4, 7 and so on.
Do I really need to send that counter (even if I can automate it, since, you see, it's an increment of 3) back?
Is there no way for SQL to handle this automatically?
You have many confusions in your question. Let me try to clear up some basic ones.
First, the auto-incremented key is the primary key for the table. You do not need to worry about gaps. In fact, the key should basically be meaningless. It fulfills the following:
It is guaranteed to be unique.
It is guaranteed to be in insertion order.
Gaps are allowed and of no concern. There is no re-indexing, and re-indexing would be a bad idea because:
Primary keys uniquely identify each row and this mapping should be consistent across time.
Primary keys are used in other tables to refer to values, so re-indexing would either invalidate those relationships or require massive changes to many tables.
Re-indexing presupposes that the value means something, when it doesn't.
Second, a query such as:
SELECT ID
FROM wp_posts
LIMIT 1, 3;
can return any three rows. Why? Because you have not specified an ORDER BY, and SQL result sets without an ORDER BY are unordered. There are no guarantees, so you should always be in the habit of using an ORDER BY.
Third, if you want to essentially "page" through results, then use the OFFSET feature in LIMIT (as you have above):
SELECT ID
FROM wp_posts
ORDER BY ID
LIMIT ?, 3;  -- bind ? to the offset computed by your application
This will allow you to adjust the offset value and go to whichever rows you want.
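For instance, with the sample IDs from the question (4, 8, 124, 672, 673, 674), successive batches of three would be fetched with literal offsets like this:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 0, 3;  -- first batch: 4, 8, 124
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3, 3;  -- second batch: 672, 673, 674
SELECT ID FROM wp_posts ORDER BY ID LIMIT 6, 3;  -- third batch: empty, so stop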
First query:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3
This returns 4,8,124 as you said. In your client, save the largest ID value in a variable.
Subsequent queries:
SELECT ID FROM wp_posts WHERE ID > ? ORDER BY ID LIMIT 3
Send a parameter into this query using the greatest ID value from the previous result. It's still in a variable.
This also helps make the query faster, because it doesn't have to skip all those initial rows every time. Paging through a large dataset using LIMIT/OFFSET is pretty inefficient. SQL has to actually read all those rows even though it's not going to return them.
But if you use WHERE ID > ? then SQL can efficiently start the scan in the right place, on the first row that would be included in the result.
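A sketch of how the keyset approach plays out against the sample IDs from the question; the largest ID of each batch is fed into the next query:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3;                 -- returns 4, 8, 124; remember 124
SELECT ID FROM wp_posts WHERE ID > 124 ORDER BY ID LIMIT 3;  -- returns 672, 673, 674; remember 674
SELECT ID FROM wp_posts WHERE ID > 674 ORDER BY ID LIMIT 3;  -- empty result: no more batches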
It seems you want to return the first three rows of your query ordered by the currently existing ID values (whatever they are after all the DML statements applied to the table wp_posts).
Then consider using an auxiliary iteration variable @i to provide an ordered set of integers starting from 1 and increasing as 2, 3, ... without any gaps:
select t.*
from
(
  select @i := @i + 1 as rownum, t1.*
  from wp_posts t1
  join (select @i := 0) t2
) t
order by rownum
limit 0, 3;

same query with order clause, same dataset, but different result

I was testing my application (an ERP system) with an automated tester that runs a fixed scenario of about 30 steps, like this:
// pseudo code (PHP)
public function runScenario1Test() {
    V2_1Tester::resetDatabase();
    V2_1Tester::insert60Companies();
    V2_1Tester::insert2000Items();
    V2_1Tester::insert100Purchases();
    V2_1Tester::insert100Sales();
    // do some other stuff
    V2_1Tester::checkResults();
}
Although I ran the test with the same code, the same data, and all inputs identical, I was sometimes getting different results!
My head was about to blow up, and after 4 days of investigation, tears, and even bad database dreams at night, it turned out that the bug was in a query that sometimes returns different results. It is something like this:
+-----+--------------+-----------+------+--------+
| ID | date | direction | col3 | col4 |
+-----+--------------+-----------+------+--------+
| 1 | 2018-03-03 | in | 6 | 100.50 |
| 2 | 2018-03-03 | in | 6 | 350.75 |
+-----+--------------+-----------+------+--------+
-- more ~ 3000 rows
$query = "SELECT * FROM table ORDER BY date, direction, col3";
This query sometimes returns 1 then 2, and other times 2 then 1.
I fixed the query by adding an additional ordering level on ID:
$query = "SELECT * FROM table ORDER BY date, direction, col3, ID";
But I don't understand why MySQL behaves like this. In other words, what rules does MySQL follow for rows that are identical in all the ORDER BY columns, and why does the order change?
In SQL, ORDER BY is not stable. That means that when the sort keys are equal, the tied rows can come back in any order.
This is actually obvious. SQL tables represent unordered sets. There is no ordering unless a column specifies the ordering.
So, you have done the right thing by including the id as the final key. This ensures that the order will be well-defined because there are no duplicate keys.

How to properly GROUP BY in MySQL?

I have the following (intentionally denormalized for demonstration purposes) sample CARS table:
| CAR_ID | OWNER_ID | OWNER_NAME | COLOR |
|--------|----------|------------|-------|
| 1 | 1 | John | White |
| 2 | 1 | John | Black |
| 3 | 2 | Mike | White |
| 4 | 2 | Mike | Black |
| 5 | 2 | Mike | Brown |
| 6 | 3 | Tony | White |
If I wanted to count the number of cars per owner and return this:
| OWNER_ID | OWNER_NAME | TOTAL |
|----------|------------|-------|
| 1 | John | 2 |
| 2 | Mike | 3 |
| 3 | Tony | 1 |
I know I can write the following query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id, owner_name
However, removing owner_name from the GROUP BY clause gives me the same results.
What is the difference between those 2 queries?
Under what circumstances should I group by all non-aggregated fields in the SELECT statement, and in which ones shouldn't I?
Can you give an example in which this grouping would return different results when removing a non-aggregated field and explain why?
The first thing to make clear is that SQL is not MySQL.
In standard SQL it is not allowed to group by a subset of the non-aggregated fields. The reason is very simple. Suppose I'm running this query:
SELECT color, owner_name, COUNT(*) FROM cars
GROUP BY color
That query would not make any sense. Even trying to explain it would be impossible. Sure, it is selecting colors and counting the number of cars per color. However, it is also adding the owner_name field, and there can be many owners for a given color, as is the case with the White color. So if there can be many owner_name values for a single color, which happens to be the only field in the GROUP BY clause... then which owner_name will be returned?
If an owner_name needs to be returned, then some criterion should be added to select only one of them, e.g., the first one alphabetically, which in this case would be John. That criterion amounts to adding the aggregate function MIN(owner_name), and then the query makes sense again, as it is grouping by, at least, all the non-aggregated fields in the select statement.
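As a sketch, the repaired query from the example above would then be:
SELECT color, MIN(owner_name) AS first_owner, COUNT(*) AS total
FROM cars
GROUP BY color;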
As you can see, there is a clear and practical reason for standard SQL to be inflexible about grouping. If it weren't, you could face awkward situations in which the value of a column would be unpredictable, and that is not a nice word, particularly if the query being run is showing you your bank account transactions.
Having said that, why would MySQL allow queries that might not make sense? And even worse, the error in the query above could be detected syntactically! The short answer is: performance. The long answer is that there are certain situations in which, based on data relations, getting an unpredictable value from the group will still yield a predictable value.
If you haven't figured it out yet, the only way you can predict the value you'll get from taking an unpredictable element of a group is if all the elements in the group are the same. A clear example of this situation is in the sample query in your very own question. Look at how owner_id and owner_name relate in the table. It is clear that given any owner_id, e.g. 2, you can only have one distinct owner_name. Even with many rows, by choosing any one of them you will get Mike as the result. In formal database jargon this is expressed as: owner_id functionally determines owner_name.
Let's take a closer look at that fully working MySQL query:
SELECT owner_id, owner_name, COUNT(*) total FROM cars
GROUP BY owner_id
Given any owner_id, this would return the same owner_name, so adding it to the GROUP BY clause will not result in more rows being returned. Even adding an aggregate function MAX(owner_name) will not result in fewer rows being returned. The resulting data will be exactly the same. In both cases, the query immediately becomes a legal standard SQL query, as at least all the non-aggregated fields are grouped by. So there are 3 approaches to get the same results.
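For illustration, the three equivalent approaches side by side:
-- 1. MySQL extension: group only by the determining column
SELECT owner_id, owner_name, COUNT(*) AS total FROM cars GROUP BY owner_id;
-- 2. Standard SQL: group by all non-aggregated columns
SELECT owner_id, owner_name, COUNT(*) AS total FROM cars GROUP BY owner_id, owner_name;
-- 3. Standard SQL: aggregate the functionally dependent column
SELECT owner_id, MAX(owner_name) AS owner_name, COUNT(*) AS total FROM cars GROUP BY owner_id;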
However, as I mentioned before, this non-standard grouping has a performance advantage. You can check the MySQL manual's section on GROUP BY handling, where this is explained in more detail, but I'm going to quote the most important part:
You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. [...] The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
One thing that is worth mentioning is that the results are not necessarily wrong but rather indeterminate. In other words, getting the expected results does not mean you have written the right query. Writing the right query will always give you the expected results.
As you can see, it might be worth applying this MySQL extension to the GROUP BY clause. Anyway, if this is not 100% clear yet, there is a rule of thumb that will make sure your grouping is always correct: always group by, at least, all the non-aggregated fields in the select clause. You might waste a few CPU cycles in certain situations, but that is better than returning indeterminate results. If you're still terrified about not grouping correctly, then enabling the ONLY_FULL_GROUP_BY SQL mode could be a last resort :)
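For reference, a sketch of toggling that mode for the current session (syntax from recent MySQL versions; treat it as a starting point rather than a recipe):
-- Turn the strict standard-SQL grouping check on for this session
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');
-- ...or relax it again, keeping the other modes intact
SET SESSION sql_mode = REPLACE(@@sql_mode, 'ONLY_FULL_GROUP_BY', '');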
May your grouping be correct and performant... or at least correct.

Select a row with specific content from a GROUP BY group

I have two tables which allow a user to request songs. Of course a song can be requested by multiple users:
Songs:
+----+-----------+
| Id | Song_Name |
+====+===========+
| 1  | song1     |
| 2  | song2     |
| 3  | song3     |

Requests:
+--------------+---------+
| Requested_Id | By_IP   |
+==============+=========+
| 1            | 1.1.1.1 |
| 1            | 2.2.2.2 |
| 1            | 3.3.3.3 |
| 2            | 2.2.2.2 |
In order to prevent one user from requesting a song multiple times (abuse), I need to check whether a certain song has already been requested by the user who is just trying to request it again. So I'm doing a LEFT JOIN between the first and the second table and a GROUP BY on the song's Id, which returns one row for each song.
PROBLEM: GROUP BY returns unpredictable values for fields which are not grouped. That is known. But how can I make sure that SELECT returns the row containing a specific IP, in case this IP exists in the group? If the IP does not exist, any other row of the group may be returned.
Thanks a lot!
UPDATE: I need to show the song in a list, independent of how many users (or even none at all) have requested it. So SELECT definitely needs to return one row for every song. But in case that for example the user with IP 3.3.3.3 is trying to request song1, (which was already requested by him) I expect the query to return this:
| Id | Song_Name | By_IP |
+====+===========+=========+
| 1 | song1 | 3.3.3.3 | (3.3.3.3 in case it exists, otherwise anything else)
| 2 | song2 | 2.2.2.2 |
I also need the grouping with the other requests (IPs), because I need the total number of requests per song as well. Therefore I use COUNT().
WORKAROUND: Since it seems to be pretty complicated to do what I need (if possible at all), I'm now working with a workaround. I'm using the GROUP_CONCAT() aggregate function. This delivers all the IPs of a group separated by ",", so I can search whether the one I'm looking for already exists there. The only drawback is that the (default) maximum length of the returned string is 1024 characters. That means I can't handle a large number of users, but for now it should be fine.
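For what it's worth, a sketch of that workaround, with hypothetical table names songs and requests (the question does not name them) and the requesting IP in @ip:
SET @ip = '3.3.3.3';
SELECT s.Id, s.Song_Name,
       COUNT(r.By_IP) AS request_count,
       -- 1 if @ip appears in the comma-separated list of IPs, 0 otherwise
       COALESCE(FIND_IN_SET(@ip, GROUP_CONCAT(r.By_IP)) > 0, 0) AS already_requested
FROM songs s
LEFT JOIN requests r ON r.Requested_Id = s.Id
GROUP BY s.Id, s.Song_Name;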
It is still unclear what you want. There is no request date present in the table; without a date, how do you know when a particular song was requested?
SELECT Songs.id, Songs.Song_name, requested_songs.By_IP
FROM Songs
INNER JOIN requested_songs
  ON Songs.id = requested_songs.Requested_id
GROUP BY requested_songs.Requested_id
ORDER BY requested_songs.Requested_id ASC;
Are you sure you're not overthinking your solution a bit? If all you want to do is eliminate duplicates, just put a UNIQUE index on your second table on both columns.
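A sketch of that, again assuming the second table is named requests:
-- Each (song, IP) pair can now exist only once
ALTER TABLE requests ADD UNIQUE KEY uniq_request (Requested_Id, By_IP);
-- A duplicate request now fails; INSERT IGNORE skips it silently
INSERT IGNORE INTO requests (Requested_Id, By_IP) VALUES (1, '3.3.3.3');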
If you're trying to do something more complicated with that GROUP BY, please provide a sample resultset, as Quassnoi requested.
Just group by Song_Name and By_IP, like this (assuming the request table is named requests):
SELECT * FROM songs JOIN requests ON requests.Requested_Id = songs.Id GROUP BY songs.Song_Name, requests.By_IP;

SQL query, return randomly ordered rows with limits, possible?

I want to return the rows of an SQL query in random order, with a limit.
The thing is, I need all the rows back, basically divided into groups (chunks of rows). I hope I am clear.
For example, from table A:
ID | NAME | PROFESSION
++++++++++++++++++++++++++++++++
1 | Jack | Carpenter
2 | Rob | Manager
3 | Phil | Driver
4 | Mary | Cook
5 | Tim | Postman
6 | Bob | Programmer
The query would return something like this:
With a limit of 0,2:
6 | Bob | Programmer
4 | Mary | Cook
With a limit of 2,2:
1 | Jack | Carpenter
5 | Tim | Postman
With a limit of 4,2:
3 | Phil | Driver
2 | Rob | Manager
Note: all the table rows were returned. On my page I need << and >> buttons that will show the user the needed "groups" of data.
How do I go about writing such a query ?
A better name for the problem you describe would be randomly shuffled records. It is true that the order is random, but since the order needs to be remembered, you have no choice but to save it in a column. You can do this by saving a randomly populated value in each row and ordering your records based on it. This way you have ordered your records in no specific order, while the order is remembered for future SELECT queries. And whenever you get tired of the order, you can update that column with new randomly generated values to shuffle the records again. This is the technique music players use to shuffle a playlist without playing a song twice.
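A minimal sketch of that approach, assuming a hypothetical shuffle_order column added to table A:
ALTER TABLE A ADD COLUMN shuffle_order DOUBLE;
UPDATE A SET shuffle_order = RAND();  -- shuffle once; run again to reshuffle
SELECT ID, NAME, PROFESSION FROM A ORDER BY shuffle_order LIMIT 0, 2;  -- first chunk
SELECT ID, NAME, PROFESSION FROM A ORDER BY shuffle_order LIMIT 2, 2;  -- second chunk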
[EDIT]
While the first solution stands as the general answer, there's a hack you can use in MySQL to get a repeatable random order: seed RAND() yourself. This way, all you need to store to remember an order is its seed.
SELECT * FROM tbl ORDER BY RAND(s);
For instance, if you want each user to see the records in a different random order, you can use their user_id as the seed. The order each user sees the records in will then stay the same across queries, while being random and different from other users'.
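A sketch, with a made-up user_id of 42 as the seed; the same seed yields the same order on every call, so the chunks never overlap:
SELECT ID, NAME, PROFESSION FROM A ORDER BY RAND(42) LIMIT 0, 2;  -- first chunk for user 42
SELECT ID, NAME, PROFESSION FROM A ORDER BY RAND(42) LIMIT 2, 2;  -- second chunk, same order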
I can think of two things here:
If the data in the table is huge, add a column that tells which group a row belongs to. When the user clicks the >> or << buttons, fetch the rows for that particular group.
If you are dealing with a small amount of data, you could do this in the application code itself.
If you use ORDER BY RAND() then you will have to flag the already-selected records somewhere, which is not advisable.
You can instead use a deterministic ordering that combines total_pages and ID, e.g.:
SELECT *
FROM my_table
ORDER BY MOD(ID, total_pages);
Add a column to the table, called something like random_col.
Then, each time you need to randomise the table, you run:
UPDATE table SET random_col = RAND()
And now each time you want to retrieve results you run a normal select
SELECT * FROM table ORDER BY random_col ASC LIMIT x,y
And the results will appear in the same order until you randomise them again by running the 'UPDATE'