Possible Bug in the Select InterfaceDist Command - MySQL

I am new to using SQL. I was wondering whether there could be a bug in this program.
/* Insert to interface table all atoms that have diffASA>0*/
insert into NinterfaceAtoms(PDB,Chain,Residue,ResId,Symbol,Atom,diffASA)
select PDB,Chain,Residue,ResId,Symbol,Atom,max(ASA)-min(ASA) from perAtomASA
group by PDB,Chain,Residue,ResId,Symbol,Atom
having stddev(ASA)>0;
/* Insert to interface table all atoms that have enough distance */
insert ignore into NinterfaceAtoms (PDB,Chain,Residue,ResId,Symbol,Atom)
select asa.PDB,asa.Chain,asa.Residue,asa.ResId,asa.Symbol,dist.Atom from interfaceDist dist
inner join
perAtomASA asa
on
dist.PDB=asa.PDB and
dist.Chain=asa.Chain and
dist.ResId=asa.ResId and
dist.Symbol=asa.Symbol and
Seperated=0;
I am just unsure why the programmer before me put asa.PDB instead of dist.PDB in the second query (the one with the inner join).
I was thinking its select line needed to be changed from:
select asa.PDB,asa.Chain,asa.Residue,asa.ResId,asa.Symbol,dist.Atom from interfaceDist dist
to:
select dist.PDB,dist.Chain,dist.Residue,dist.ResId,dist.Symbol,dist.Atom from interfaceDist dist
Is that correct? Thanks.

You are joining asa and dist, so it is perfectly logical that their values are compared to make sure only matching pairs end up in the result. So, unless you have a very good reason to think that in the line
dist.PDB=asa.PDB
you need dist.PDB instead of asa.PDB, the command looks correct. And if the condition were instead
dist.PDB=dist.PDB
then it would be trivially true, and it would be pointless to check it at all. To identify bugs, you either need to observe behavioral problems in the software, or you need to understand the code you are looking at.
EDIT
In the select clause one can use either asa.PDB or dist.PDB, because the on condition has already ensured their equality: if the two values differ, the pair will not be in the result at all. So, in terms of values, it makes no difference. But if you find dist.PDB more intuitive in the select, you might as well change it; there is no harm, since the value is exactly the same, the code becomes more readable, and if the code is later changed so that the two no longer have to be equal, you will not get new bugs out of the blue.
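For illustration, here is a reduced sketch of that join (PDB column only): every row that survives the join carries the same value in both columns, so either may appear in the select list.
select asa.PDB, dist.PDB   -- identical in every returned row
from interfaceDist dist
inner join perAtomASA asa
on dist.PDB = asa.PDB;     -- rows with differing values are filtered out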

Related

Snowflake ST_POLYGON(TO_GEOGRAPHY(...)) Is Inefficient

I have a few queries that use geospatial conditions. These queries are running surprisingly slowly. Initially I thought it was the geospatial calculation itself, but after stripping everything down to just ST_POLYGON(TO_GEOGRAPHY(...)), it is still very slow. This would make sense if each row had its own polygon, but the condition uses a static polygon in the query:
SELECT
ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'))
FROM TABLE(GENERATOR(ROWCOUNT=>1000000))
Snowflake should be able to figure out that it only needs to calculate this polygon once for the entire query. Yet the more rows that are added, the slower it gets. On an x-small warehouse this query takes over a minute, whereas this query:
SELECT
'LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
FROM TABLE(GENERATOR(ROWCOUNT=>3000000))
(with 2 million more rows added to match the byte count) can complete in 2 seconds.
I tried "precomputing" the polygon myself with a WITH statement but SF figures out the WITH is redundant and drops it. I also tried setting a session variable, but you can't set a complex value like this one as a variable.
I believe this is a bug.
Geospatial functions are in preview for now, and the team is working hard on all kinds of optimizations.
For this case I want to note that making the polygon a single-row table would help, but I would still expect better performance as the team gets this feature out of beta.
Let me create a table with one row, the polygon:
create or replace temp table poly1
as
select ST_POLYGON(TO_GEOGRAPHY('LINESTRING(-95.75122850074004 28.793166796020444,-95.68622920563344 30.207416499279063,-94.5162418937178 32.56537633083211,-90.94128066286225 34.24734047810797,-88.17881062083825 36.812423897251634,-86.13133282498448 38.15341651409619,-85.28634198860107 38.66275098353796,-84.37635185711038 38.789523129087826,-82.84886842210855 38.4848923369382,-82.32887406125734 37.820427257446994,-82.26387476615074 36.96838022284757,-82.03637723327772 36.00158943485101,-80.99638851157454 35.34155096040939,-78.52641529752944 34.62260477275565,-77.51892622337955 34.005211031324734,-78.26641811710381 31.1020568651834,-80.24889661785029 29.926151366059756,-83.59636031583283 28.793166796020444,-95.75122850074004 28.793166796020444)'
)) polygon
;
To see if this would help, I tried a one million rows cross join:
select *
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
It takes 14 seconds, and in the query profiler you can see most time was spent on an internal TO_OBJECT(GET_PATH(POLY1.POLYGON, '_shape')).
What's interesting to note is that the previous operation is mostly concerned with the ASCII representation of the polygon. Running operations over this polygon is much quicker:
select st_area(polygon)
from poly1, TABLE(GENERATOR(ROWCOUNT=>1000000));
This query should have taken longer (finding the area of a polygon sounds more complicated than just selecting it), but it turns out it took only 7 seconds (about half).
Thanks for the report, and the team will continue to optimize cases like this.
For anyone curious about the particular polygon in the question - it's a nice heart.

MySQL: Can one rely on the implicit ORDER BY done by MySQL when using an IN statement?

I just noticed that when I execute the following query:
SELECT * FROM tbl WHERE some_key = 1 AND some_foreign_key IN (2,5,23,8,9);
the results come back in the same order they were given in the IN list: the row with some_foreign_key = 2 is the first row returned, the one with some_foreign_key = 9 is the last, and so on.
This is exactly the opposite behaviour of what this guy describes:
MySQL WHERE IN - Ordering
Can one rely on this behaviour or modify it via some MySQL server setting?
I know the common wisdom is "no ORDER BY clause" == "the RDBMS can sort however it pleases", but for the task at hand this behaviour is quite helpful (a really large import) and it would be great if I could rely on it.
EDIT: I know about the ORDER BY FIELD trick already; I just wanted to know if I can safely avoid the ORDER BY clause by setting some config somewhere.
ORDER BY FIELD(some_foreign_key, 2, 5, 23, 8, 9)
isn't really that tough to implement - unless you're really simplifying this example. And as you already know, it's the only way to be 100% sure of the output ordering.
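Applied to the query from the question, the full statement would read:
SELECT * FROM tbl
WHERE some_key = 1 AND some_foreign_key IN (2,5,23,8,9)
ORDER BY FIELD(some_foreign_key, 2, 5, 23, 8, 9);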

Why is Rails adding `OR 1=0` to queries using the where clause hash syntax with a range?

The project that I'm working on is using MySQL on RDS (mysql2 gem specifically).
When I use a hash of conditions including a range in a where statement I'm getting a bit of an odd addition to my query.
User.where(id: [1..5])
and
User.where(id: [1...5])
Result in the following queries respectively:
SELECT `users`.* FROM `users` WHERE ((`users`.`id` BETWEEN 1 AND 5 OR 1=0))
SELECT `users`.* FROM `users` WHERE ((`users`.`id` >= 1 AND `users`.`id` < 5 OR 1=0))
The queries work perfectly fine since OR FALSE is effectively a no-op. I'm just wondering why Rails or ARel is adding this snippet into the query.
EDIT
It looks like the line that could explain this is line 26 in ActiveRecord::PredicateBuilder. Still no idea how the hash could be `empty?` at that point, but maybe someone else does.
EDIT 2
This is interesting. I was looking into Filip's comment to see why he made it, since it seemed like just a clarification, but he is correct that 1..5 != [1..5]. The former is an inclusive range from 1 to 5, whereas the latter is an array whose first element is the former. I tried putting these into an ARel where call to see the SQL produced, and the OR 1=0 is not there!
User.where(id: 1..5) #=> SELECT "users".* FROM "users" WHERE ("users"."id" BETWEEN 1 AND 5)
User.where(id: 1...5) #=> SELECT "users".* FROM "users" WHERE ("users"."id" >= 1 AND "users"."id" < 5)
I still do not know why ARel is adding the OR 1=0, which will always be false and seems unnecessary; it may be due to how Arrays and Ranges are handled differently.
Building on the fact, which you've discovered, that [1..5] is not the correct way to specify the range... I have discovered why [1..5] behaves as it does. To get there, I first found that an empty array in a hash condition produces the 1=0 SQL condition:
User.where(id: []).to_sql
# => "SELECT \"users\".* FROM \"users\" WHERE 1=0"
And, if you check the ActiveRecord::PredicateBuilder::ArrayHandler code, you'll see that array values are always partitioned into ranges and other values.
ranges, values = values.partition { |v| v.is_a?(Range) }
This explains why you don't see the 1=0 when using non-range values. That is, the only way to get 1=0 from an array without including a range is to supply an empty array, which yields the 1=0 condition, as shown above. And when all the array contains is a range, you're going to get the range conditions (ranges) and, separately, an empty array condition (values) executed. My guess is that there isn't a good reason for this... it is simply easier to let it be than to avoid it (since the result set is equivalent either way). If the partition code were a bit smarter, it wouldn't have to tack on the additional, empty values array and could skip the 1=0 condition.
As for where the 1=0 comes from in the first place... I think that comes from the database adapter, but I couldn't find exactly where. However, I would call it an attempt to fail to find a record. In other words, WHERE 1=0 isn't ever going to return any users, which makes sense over alternative SQL like WHERE id=null which will find any users whose id is null (realizing that this isn't really correct SQL syntax). And this is what I'd expect when attempting to find all Users whose id is in the empty set (i.e. we're not asking for nil ids or null ids or whatever). So, in my mind, leaving the bit about exactly where 1=0 comes from as a black box is OK. At least we now can reason about why the range inside of the array is causing it to show up!
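To make the empty-set case concrete, here is a minimal sketch (against the users table from the question) of why 1=0 is a natural stand-in:
-- SELECT * FROM users WHERE id IN ();  -- an empty IN list is a syntax error in MySQL
SELECT * FROM users WHERE 1=0;          -- valid, and matches no rows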
UPDATE
I've also found that, even when using ARel directly, you can still get 1=0:
User.arel_table[:id].in([]).to_sql
# => "1=0"
Strictly speaking, this is a guess, but I did something similar in a project of my own (although I used AND 1).
For whatever reason, when generating a query, it is easier to always have a WHERE clause containing a no-op than it is to conditionally generate the WHERE clause at all. That is, even if you don't include any where sections, it will still end up generating something valid.
On the other hand, I'm not sure why it's taking this form: when I did it I used 1 [<AND (generated code)>...], which allowed arbitrary chaining, but I don't see how what you're seeing would allow that. Nonetheless, I still think it is likely the result of an algorithmic code-generation scheme.
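As a sketch of that scheme (with hypothetical conditions): starting from a trivially true predicate lets a generator append each condition uniformly with AND, with no special case for the first condition or for having none at all.
SELECT * FROM users WHERE 1;                                   -- no conditions supplied
SELECT * FROM users WHERE 1 AND id > 10;                       -- one condition appended
SELECT * FROM users WHERE 1 AND id > 10 AND name IS NOT NULL;  -- further chaining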
Check to see if you are using active_record-acts_as. That was the problem for me.
Add the line below to your Gemfile:
gem 'active_record-acts_as', :git => 'https://github.com/hzamani/active_record-acts_as.git'
This will just pull the latest version of the gem, which will hopefully be fixed. It worked for me.
Personally, I think you're seeing side effects of Ruby.
I think the better way to do what you're doing would be with
2.0.0-p481#meri :008 > [*1..5]
=> [1, 2, 3, 4, 5]
User.where(id: [*1..5]).to_sql
"SELECT `users`.* FROM `users` WHERE `users`.`id` IN (1, 2, 3, 4, 5)"
This creates an Array of integers, as opposed to an Array whose first element is of class Range.
OR
use an explicit Range to trigger the BETWEEN in ARel.
# with end element, i.e. exclude_end=false
2.0.0-p481#meri :013 > User.where(id: Range.new(1,5)).to_sql
=> "SELECT `users`.* FROM `users` WHERE (`users`.`id` BETWEEN 1 AND 5)"
# without end element, i.e. exclude_end=true
2.0.0-p481#meri :022 > User.where(id: Range.new(1, 5, true)).to_sql
=> "SELECT `users`.* FROM `users` WHERE (`users`.`id` >= 1 AND `users`.`id` < 5)"
If you care about having control of the queries you generate, and about the full power of the SQL language and database features, then I would suggest moving from ActiveRecord/ARel to Sequel.
I can honestly say there are a lot more quirks and infuriating times ahead for you with ActiveRecord, especially when you move beyond simple CRUD-like queries and start trying to query your data in anger, perhaps needing to join a few join tables here and there, and realize you really do need join conditions or UNION ALL type queries.
It is also significantly faster and more reliable in its query generation and result handling, and it is much easier to compose the queries you want. It also has real documentation you can actually read, unlike ARel.
I just wish I had discovered it much earlier rather than persisting with the rails default data access layer.

How do you add comments to MySQL queries, so they show in logs?

I want to be able to add a little note at the beginning of each query, so when I see it in the processlist or in "mytop", I can tell where it's running.
Is something like this possible?
I am not sure this would work, but it is worth trying.
Simply add "/* some comment or tag */ " before whatever SQL query is sent normally.
It is possible that the MySQL server will remove this comment as part of its query analysis/preparation, but it may also leave it in place, so that it shows up in logs and other monitoring tools.
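For example (with a hypothetical tag and table name):
/* import-job-42 */ SELECT COUNT(*) FROM orders;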
In case the comments get stripped away, and assuming SELECT queries, a slight variation on the above would be to add a calculated column as the first thing after SELECT, something like
SELECT IF('some comment/tag' = '', 1, 0) AS BogusMarker, here-start-the-original-select-list
-- or
SELECT 'some [short] comment/tag' AS QueryID, here-start-the-original-select-list
This approach has the drawback of introducing an extra column value with each result row. The latter form actually uses the comment/tag text as this value, which may be helpful for debugging purposes.
I discovered this today:
MySQL supports the strange syntax:
/*!<min-version> code here */
to embed code that will be interpreted only by MySQL, and only by MySQL at or above a specified minimum version.
This is documented at 9.7 Comments (used by mysqldump, for example).
Unlike normal comments, such comments will not be parsed out, and they will show up in the processlist even if the code inside is not actually executed.
So you can do something like
/*!999999 comment goes here */ select foo from bla;
to have a comment that will show up in the processlist, but not alter the code. (Until MySQL developers decide to bloat their release numbering and release a version numbered over 99.99.99, in which case you will just have to add another digit.)

RegEx to insert a string before each table in a MySQL query

I need to take a MySQL query and insert a string before each table name. The solution doesn't need to be one line but obviously it's a regex problem. It will be implemented in PHP so having programming logic is also fine.
Rationale and Background:
I'm revamping my code base to allow for table prefixes (e.g. 'nx_users' instead of 'users') and I'd like to have a function that will automate that for me so I don't need to find every query and modify it manually.
Example:
SELECT * FROM users, teams WHERE users.team_id = teams.team_id ORDER BY users.last_name
Using the prefix 'nx_', it should change to
SELECT * FROM nx_users, nx_teams WHERE nx_users.team_id = nx_teams.team_id ORDER BY nx_users.last_name
Obviously it should handle other cases such as table aliases, joins, and other common MySQL commands.
Has anybody done this?
How big of a code base are we talking about here? A regular expression for something like this is seriously flirting with disaster, and I think you're probably better off looking for every mysql_query or whatever in your code and making the changes yourself. It shouldn't take more than the hour you'd spend implementing your regex and fixing all the edge cases it will undoubtedly miss.
Using a regex to rewrite code is going to be problematic.
If you need to dynamically change this string, then you need to separate out your sql logic into one place, and have a $table_prefix variable that is appropriately placed in every sql query. The variable can then be set by the calling code.
$query = "SELECT foo from " . $table_prefix . "bar WHERE 1";
If you are encapsulating this in a class, all the better.
This example does not take into consideration any escaping or security concerns.
First off, regular expressions alone are not up to the task. Consider things like:
select sender from email where subject like "from users group by email"
To really do this you need something that will parse the SQL, produce a parse tree which you can modify, and then emit the modified SQL from the modified parse tree. With that, it's doable, but not advisable (for the reasons Paolo gave).
A better approach would be to grep through your source looking for the table names, the function you use to send SQL, the word FROM, or something like that, and script something to throw you into an editor at those points.