same query with order clause, same dataset, but different result - mysql

I was testing my application(an ERP system) by programming a tester that will do a fixed scenario of 30 steps of things like this
//pseudo code(PHP)
public function runScenario1Test(){
V2_1Tester::resetDatabase();
V2_1Tester::insert60Companies();
V2_1Tester::insert2000Items();
V2_1Tester::insert100Purchases();
V2_1Tester::insert100Sales();
//do some other stuff
V2_1Tester:checkResults();
}
Although every time I run the test of the same code, same data, all inputs were the same, I was getting different results sometimes!!!
My head was going to blow up, and after 4 days of investigations, tears, and even bad database dreams by night, it turns out that the bug was in a query that returns different results sometimes. It is something like this
+-----+--------------+-----------+------+--------+
| ID | date | direction | col3 | col4 |
+-----+--------------+-----------+------+--------+
| 1 | 2018-03-03 | in | 6 | 100.50 |
| 2 | 2018-03-03 | in | 6 | 350.75 |
+-----+--------------+-----------+------+--------+
-- more ~ 3000 rows
$query = "SELECT * FROM table ORDER BY date, direction, col3";
this query sometimes returns 1 then 2 and some other times 2 then 1.
I fixed the query by adding additional level for ordering ID
$query = "SELECT * FROM table ORDER BY date, direction, col3, ID";
But I don't understand why MySQL behaves like this ?, In other words what are the rules that MySQL will follow for the rows that are the same for all order-by columns ? and why it is changing ?

In SQL, order by is not stable. That means that when the keys are the same, the ordering can be in any order.
This is actually obvious. SQL tables represent unordered sets. There is no ordering unless a column specifies the ordering.
So, you have done the right thing by including the id as the final key. This ensures that the order will be well-defined because there are no duplicate keys.

Related

How do you batch SELECT statements when you can't rely on the IDs to be in literal order?

What I mean by literal order is that, altough the IDs are auto-increment, through business logic, it might end up that 8 comes after 4 when 5 should've been there. That is to say, if a deletion if ID happens, there's no re-indexing
This is how my rows look (table name is wp_posts):
+-----+-------------+----+--+--+--+
| ID | post_author | .. | | | |
+-----+-------------+----+--+--+--+
| 4 | .. | | | | |
+-----+-------------+----+--+--+--+
| 8 | .. | | | | |
+-----+-------------+----+--+--+--+
| 124 | .. | | | | |
+-----+-------------+----+--+--+--+
| 672 | .. | | | | |
+-----+-------------+----+--+--+--+
| 673 | .. | | | | |
+-----+-------------+----+--+--+--+
| 674 | .. | | | | |
+-----+-------------+----+--+--+--+
ID is an int that has the auto-increment characteristic, but when a post is deleted, there is no re-assignment of IDs. It will just simply get deleted and because it's auto-increment, you can still assume that, vertically, the items that come after the one you're looking at are always bigger than the ones before.
I'm querying for ID: SELECT ID FROM wp_posts to get a list of all the IDs I need. Now, it just so happens that I need to batch all of this, using AJAX requests because once I retrieve the IDs, I need to operate on them.
Thing is, I don't really understand how to pass my data back to AJAX. What LIMIT does is, if I provide 2 arguments, such as: SELECT ID FROM wp_posts LIMIT 1,3, it'll return back 4,8,124 because it looks at row number. But what do I do on the next call? Yes, the first call always starts with 1, but once I need to launch the second AJAX request to perform yet another SELECT, how do I know where I should start? In my case, I'd want to start again at 4, so, my second query would be SELECT ID FROM wp_posts LIMIT 4, 7 and so on.
Do I really need to send that counter (even if I can automate it, since, you see, it's an increment of 3) back?
Is there no way for SQL to handle this automatically?
You have many confusions in your question. Let me try to clear up some basic ones.
First, the auto-incremented key is the primary key for the table. You do not need to worry about gaps. In fact, the key should basically be meaningless. It fulfills the following:
It is guaranteed to be unique.
It is guaranteed to be in insertion order.
Gaps are allowed and of no concern. There is no re-indexing. It is a bad idea because:
Primary keys uniquely identify each row and this mapping should be consistent across time.
Primary keys are used in other tables to refer to values, so re-indexing would either invalidate those relationships or require massive changes to many tables.
Re-indexes pre-supposes that the value means something, when it doesn't.
Second, a query such as:
SELECT ID
FROM wp_posts
LIMIT 1, 3;
Can return any three rows. Why? Because you have no specified an ORDER BY and SQL result sets without ORDER BY are unordered. There are no guarantees. So you should always be in the habit of using an ORDER BY.
Third, if you want to essentially "page" through results, then use the OFFSET feature in LIMIT (as you have above):
SELECT ID
FROM wp_posts
ORDER BY ID
LIMIT #offset, 3;
This will allow you to reset the #offset value and go to which rows you want.
First query:
SELECT ID FROM wp_posts ORDER BY ID LIMIT 3
This returns 4,8,124 as you said. In your client, save the largest ID value in a variable.
Subsequent queries:
SELECT ID FROM wp_posts WHERE ID > ? ORDER BY ID LIMIT 3
Send a parameter into this query using the greated ID value from the previous result. It's still in a variable.
This also helps make the query faster, because it doesn't have to skip all those initial rows every time. Paging through a large dataset using LIMIT/OFFSET is pretty inefficient. SQL has to actually read all those rows even though it's not going to return them.
But if you use WHERE ID > ? then SQL can efficiently start the scan in the right place, on the first row that would be included in the result.
Seems, you want to return the first three rows of your query ordered by currently existing ID values(whatever they're after all DML statement's applied on the table wp_posts).
Then, Consider using an auxiliary iteration variable #i to provide an ordered integer value set starting from 1 and increasing as 2,3,... without any gaps :
select t.*
from
(
select #i := #i + 1 as rownum, t1.*
from tab t1
join (select #i:=0) t2
) t
order by rownum
limit 0,3;
Demo

Database design and query optimization/general efficiency when joining 6 tables in mySQL

I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table and the current version is 1 as such the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this as a user could have an item for which we dont have data, likewise there could be an item for which we have data but no user owns..
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first query is is this a good/well optimized database setup?
My next query and my main question is joining this complex data structure.
What i had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require - this question seeks to clarify what are the benefits of using this instead?
I have had to use a subquery (I believe?) to get the ips as previously it was multiplying results (I believe based on the complex joins). How this subquery works I suppose is my main confusion.
Summary of questions.
-Is my database setup well setup for my usage? Any improvements would be appreciated. And any useful resources to help me expand my knowledge would be great.
-How does the subquery in my sql actually work - what is the query doing?
-Am i correct to keep using left joins - I want to return the user item, and null values if applicable to the right.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when i update a version of an item in my data table i know have to update the version in the user_items table, I now have a few more update queries to do. Is the tradeoff off of this setup in practice worthwhile?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout, and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem - that will change quickly as the amount of data grows. However when you ave a large amount of data there are very few circumstances where you should ever see all your data in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly - there is no GROUP BY
Is the tradeoff off of this setup in practice worthwhile?
No - it implies that your schema is too normalized.

How to get the right "version" of a database entry?

Update: Question refined, I still need help!
I have the following table structure:
table reports:
ID | time | title | (extra columns)
1 | 1364762762 | xxx | ...
Multiple object tables that have the following structure
ID | objectID | time | title | (extra columns)
1 | 1 | 1222222222 | ... | ...
2 | 2 | 1333333333 | ... | ...
3 | 3 | 1444444444 | ... | ...
4 | 1 | 1555555555 | ... | ...
In the object tables, on an object update a new version with the same objectID is inserted, so that the old versions are still available. For example see the entries with objectID = 1
In the reports table, a report is inserted but never updated/edited.
What I want to be able to do is the following:
For each entry in my reports table, I want to be able to query the state of all objects, like they were, when the report was created.
For example lets look at the sample report above with ID 1. At the time it was created (see the time column), the current version of objectID 1 was the entry with ID 1 (entry ID 4 did not exist at that point).
ObjectID 2 also existed with it's current version with entry ID 2.
I am not sure how to achieve this.
I could use a query that selects the object versions by the time column:
SELECT *
FROM (
SELECT *
FROM objects
WHERE time < [reportTime]
ORDER BY time DESC
)
GROUP BY objectID
Lets not talk about the performance of this query, it is just to make clear what I want to do. My problem is the comparison of the time columns. I think this is no good way to make sure that I got the right object versions, because the system time may change "for any reason" and the time column would then have wrong data in it, which would lead to wrong results.
What would be another way to do so?
I thought about not using a time column for this, but instead a GLOBAL incremental value that I know the insertion order across the database tables.
If you are interting new versions of the object, and your problem is the time column(I assume you are using this column to sort which one is newer); I suggest you to use an auto-incremental ID column for the versions. Eventually, even if the time value is not reliable for you, the ID will be.Since it is always increasing. So higher ID, newer version.

SQL query, return randomly ordered rows with limits, possible?

I am thinking of returning a randomly ordered SQL response where the results are mixed up randomly, with a limit.
The thing is I need All the rows back, basically divided into groups (chunks of rows). I hope I am clear.
For example, from table A:
ID | NAME | PROFESSION
++++++++++++++++++++++++++++++++
1 | Jack | Carpenter
2 | Rob | Manager
3 | Phil | Driver
4 | Mary | Cook
5 | Tim | Postman
6 | Bob | Programmer
The query would return something like this:
With a limit of 0,2:
6 | Bob | Programmer
4 | Mary | Cook
With a limit of 2,2:
1 | Jack | Carpenter
5 | Tim | Postman
With a limit of 4,2:
3 | Phil | Driver
2 | Rob | Manager
Note: all the table rows were returned. In my page I need to have a << >> buttons that will show the user the needed "group"s of data.
How do I go about writing such a query ?
A better name for your explained problem would be randomly shuffled records. That is true that the order is random but since the order needs to be remembered, you have no choice but to save it in a column. You can do this by saving a randomly populated field and ordering your records based on that. This way you have ordered your records in no specific order while the order is remembered for future select queries. And whenever you got tired of the order, you can update the mentioned field with new randomly generated values to shuffle them again. This is the technique used by players to shuffle a playlist without replaying a song twice.
[EDIT]
While the first given solution stands as the general answer, there's a hack you can use in MySQL to randomly order records. In this way, all you need to store for remembering an order is its seed.
SELECT * FROM tbl ORDER BY RAND(s);
For instance, if you want each user see the records in some different randomly ordered, you can use their user_id as the seed. This way the order each user will ever see the records in, will remain the same while it is random and different from other users.
I can think of two things here:
If the data in the table is huge, add a column that tells the group to which a row belongs. When the user clicks on >> or << buttons, get the rows for that particular group.
If you are dealing with small amount of data, you could do this in the code itself.
If you use ORDER BY RAND() then you will have to flag selected records somewhere which is no advisable.
You can use some intelligent algorithm with combination of total_pages and ID e.g.
SELECT *
FROM my_table
ORDER BY MOD(ID, total_pages);
Add a column to the table called something like random_col
Then each time you need to randomise the table you run
UPDATE table SET random_col = RAND()
And now each time you want to retrieve results you run a normal select
SELECT * FROM table ORDER BY random_col ASC LIMIT x,y
And the results will appear in the same order until you randomise them again by running the 'UPDATE'

Max occurences of a given value in a table

I have a table (pretty big one) with lots of columns, two of them being "post" and "user".
For a given "post", I want to know which "user" posted the most.
I was first thinking about getting all the entries WHERE (post='wanted_post') and then throw a PHP hack to find which "user" value I get the most, but given the large size of my table, and my poor knowledge of MySQL subtle calls, I am looking for a pure-MySQL way to get this value (the "user" id that posted the most on a given "post", basically).
Is it possible ? Or should I fall back on the hybrid SQL-PHP solution ?
Thanks,
Cystack
It sounds like this is what you want... am I missing something?
SELECT user
FROM myTable
WHERE post='wanted_post'
GROUP BY user
ORDER BY COUNT(*) DESC
LIMIT 1;
EDIT: Explanation of what this query does:
Hopefully the first three lines make sense to anyone familiar with SQL. It's the last three lines that do the fun stuff.
GROUP BY user -- This collapses rows with identical values in the user column. If this was the last line in the query, we might expect output something like this:
+-------+
| user |
+-------+
| bob |
| alice |
| joe |
ORDER BY COUNT(*) DESC -- COUNT(*) is an aggregate function, that works along with the previous GROUP BY clause. It tallies all of the rows that are "collapsed" by the GROUP BY for each user. It might be easier to understand what it's doing with a slightly modified statement, and it's potential output:
SELECT user,COUNT(*)
FROM myTable
WHERE post='wanted_post'
GROUP BY user;
+-------+-------+
| user | count |
+-------+-------+
| bob | 3 |
| alice | 1 |
| joe | 8 |
This is showing the number of posts per user.
However, it's not strictly necessary to actually output the value of an aggregate function in this case--we can just use it for the ordering, and never actually output the data. (Of course if you want to know how many posts your top-poster posted, maybe you do want to include it in your output, as well.)
The DESC keyword tells the database to sort in descending order, rather than the default of ascending order.
Naturally, the sorted output would look something like this (assuming we leave the COUNT(*) in the SELECT list):
+-------+-------+
| user | count |
+-------+-------+
| joe | 8 |
| bob | 3 |
| alice | 1 |
LIMIT 1 -- This is probably the easiest to understand, as it just limits how many rows are returned. Since we're sorting the list from most-posts to fewest-posts, and we only want the top poster, we just need the first result. If you wanted the top 3 posters, you might instead use LIMIT 3.