How to do a safe (constrained) implicit join in SQLAlchemy?

I consider a JOIN safe when it is constrained by its foreign key, and implicit when it only mentions its target.
In SQLAlchemy, session.query(User).join(Address) might or might not be constrained, depending on whether SQLAlchemy knows about the relation between User and Address.
Sometimes I construct a complicated query and I want to make sure all joins are constrained. How can I check that? I would like it best if SQLAlchemy raised an exception whenever an implicit .join(table) can't find its foreign key...
Example of a "complicated query":
session.query(addr_alias1).join(User).join(addr_alias2)
For the last join, an overly explicit way would be .join(addr_alias2, User.addresses), but as I said, I want the implicit syntax, one that barks if it fails.

SQLAlchemy already does this: an implicit .join() that cannot find a foreign key to constrain the join raises an exception.
Unfortunately, the tutorial suggests otherwise and the API documentation doesn’t clear that up.
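Here is a minimal sketch demonstrating that behaviour (SQLAlchemy 1.4+ declarative style; the model names mirror the question, but the schema itself is made up for the demo):

import sqlalchemy as sa
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = sa.Column(sa.Integer, primary_key=True)

class Address(Base):
    __tablename__ = "addresses"
    id = sa.Column(sa.Integer, primary_key=True)
    user_id = sa.Column(sa.Integer, sa.ForeignKey("users.id"))

class Unrelated(Base):
    __tablename__ = "unrelated"
    id = sa.Column(sa.Integer, primary_key=True)

session = sessionmaker(bind=sa.create_engine("sqlite://"))()

# The foreign key gives SQLAlchemy its ON clause, so this implicit
# join is constrained:
print(session.query(User).join(Address))
# ... FROM users JOIN addresses ON users.id = addresses.user_id

# With no foreign key path at all, SQLAlchemy raises instead of
# silently emitting a cross join:
try:
    print(session.query(User).join(Unrelated))
except sa.exc.SQLAlchemyError as exc:
    print(type(exc).__name__)  # e.g. NoForeignKeysError / InvalidRequestError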

Related

What's the purpose of an IMPLICIT JOIN in SQL?

So, I don't really understand the purpose of using an implicit join in SQL. In my opinion, it makes a join more difficult to spot in the code, and I'm wondering this:
Is there a greater purpose for actually wanting to do this besides the simplicity of it?
Fundamentally there is no difference between the implicit join and the explicit JOIN .. ON ... Execution plans are the same.
I prefer the explicit notation as it makes it easier to read and debug.
Moreover, in the explicit notation you define the relationship between the tables in the ON clause and the search condition in the WHERE clause.
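You can check the "same plan" claim yourself. Here is a small sketch using SQLite via Python's sqlite3 (the schema is invented for the demo; other engines have their own EXPLAIN syntax):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
""")

implicit = ("SELECT o.id, c.name FROM orders o, customers c "
            "WHERE o.customer_id = c.id")
explicit = ("SELECT o.id, c.name FROM orders o "
            "JOIN customers c ON o.customer_id = c.id")

# Both forms compile to the same plan steps:
for sql in (implicit, explicit):
    print([row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)])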
Explicit vs implicit SQL joins
When you join several tables, no matter how the join condition is written, the optimizer will choose the execution plan it considers best. As for me:
1) Implicit join syntax is more concise.
2) It is easier to generate automatically, or to produce from another SQL script.
So I use it sometimes.
Others have answered the question from the perspective of what most people understand by "implicit JOIN": an INNER JOIN that arises from table lists with join predicates in the WHERE clause. However, I think it's worth also mentioning the concept of an "implicit JOIN" as some ORM query languages understand it, such as Hibernate's HQL, jOOQ or Doctrine, and probably others. In those cases, the join is expressed as a path expression anywhere in the query, for example:
SELECT
b.author.first_name,
b.author.last_name,
b.title,
b.language.cd AS language
FROM book b;
Here the path b.author implicitly joins the AUTHOR table to the BOOK table using the foreign key between the two tables. Your question still holds for this type of "implicit join" as well, and the answer is the same: some users may find this syntax more convenient than the explicit one. There is no other advantage to it.
Disclaimer: I work for the company behind jOOQ.
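To make the path-expression idea concrete, here is a sketch of the explicit SQL such a query typically desugars to, runnable against SQLite via Python's sqlite3 (the schema and foreign-key column names are assumptions for the demo):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author   (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE language (id INTEGER PRIMARY KEY, cd TEXT);
    CREATE TABLE book     (id INTEGER PRIMARY KEY, title TEXT,
                           author_id   INTEGER REFERENCES author(id),
                           language_id INTEGER REFERENCES language(id));
    INSERT INTO author   VALUES (1, 'George', 'Orwell');
    INSERT INTO language VALUES (1, 'en');
    INSERT INTO book     VALUES (1, '1984', 1, 1);
""")

# Each path segment (b.author, b.language) becomes one explicit join
# on the corresponding foreign key:
rows = con.execute("""
    SELECT a.first_name, a.last_name, b.title, l.cd AS language
    FROM book b
    JOIN author   a ON a.id = b.author_id
    JOIN language l ON l.id = b.language_id
""").fetchall()
print(rows)  # [('George', 'Orwell', '1984', 'en')]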

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I visited an interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one), which would contain only those scalars. My answer was accepted, and there was a note from the reviewer that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The construction seemed too trivial and too widely used for modern RDBMSs not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position, but I didn't care much about test purity, and that Postgres instance was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think that is a performance match. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed occur in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN() treatment. Unfortunately, I didn't find any proof of the inverse either. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at the conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer, or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR conditions (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where a list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
I think that in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, a database may be unable to prepare an IN(?) query, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried, but I think that even in such cases parsing and planning time is small compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem: if you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
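As a toy illustration of the mechanism the quoted documentation describes (plain Python, not MySQL internals): sort the constants once, then answer each membership test with an O(log n) binary search rather than an O(n) scan over ORs.

from bisect import bisect_left

in_list = sorted([7, 42, 3, 99, 15])  # constants sorted once, up front

def in_sorted_list(value, values):
    # Binary-search membership test over the sorted constants.
    i = bisect_left(values, value)
    return i < len(values) and values[i] == value

print(in_sorted_list(42, in_list))  # True
print(in_sorted_list(8, in_list))   # False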
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic, and in that case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Here that results in dynamically formatting the prepared statement (as the number of placeholders is dynamic too), and it also results in excessive hard parsing (as many unique queries as there are distinct IN-list lengths: IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless loading is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is larger, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT x FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.
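If you want to test the thread's central question yourself, here is a hedged micro-benchmark sketch using SQLite via Python's sqlite3 (timings will vary by engine, version, hardware and data; the row and value counts are arbitrary):

import random
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(1_000_000)))
values = random.sample(range(1_000_000), 100_000)

# Variant 1: one long literal IN list.
start = time.perf_counter()
in_sql = "SELECT count(*) FROM t WHERE x IN (%s)" % ",".join(map(str, values))
print("IN list:  ", con.execute(in_sql).fetchone(), time.perf_counter() - start)

# Variant 2: load the values into a temp table and join against it.
start = time.perf_counter()
con.execute("CREATE TEMP TABLE vals (x INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO vals VALUES (?)", ((v,) for v in values))
print("temp join:", con.execute("SELECT count(*) FROM t JOIN vals USING (x)").fetchone(),
      time.perf_counter() - start)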

Erratic behaviour of a mysql query

I have the following query :
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario, days.local, admin.eMail, admin.passcode, days.id,
admin.username
FROM days,admin
WHERE days.id='9' AND days.nombre=admin.username
The problem is that the query sometimes works but sometimes doesn't; sometimes it works only with certain IDs. Is there any other way to formulate the query?
You are currently using implicit joins. Explicit joins are easier for you to read and understand, and tend to make for much more consistent queries.
You could rewrite your query using JOINs. So, instead of:
SELECT days.from, days.to, days.nombre, days.totalDays, days.bloque,
days.comentario , days.local, admin.eMail, admin.passcode,
days.id, admin.username
FROM days,admin
WHERE days.id='9'
AND days.nombre=admin.username
You can use:
SELECT days.from,days.to,days.nombre,days.totalDays,days.bloque,
days.comentario,days.local,admin.eMail,admin.passcode,
days.id,admin.username
FROM days
INNER JOIN admin ON days.nombre=admin.username
WHERE days.id='9'
You may be able to note already how much easier it is to understand what is happening here. While this shouldn't in and of itself fix your query, it is far easier to read and thus to debug.
If you find that certain cases are not working, the best way to figure out why is to remove some restrictions and see if it then works. In this instance, make sure that the usernames that are not showing up have the column days.id equal to 9. Other potential issues when using a natural key are things like extra whitespace; check for this in the cases that do not work, as the join condition days.nombre = admin.username may be failing (see the diagnostic sketch below).
Your other option, if whitespace is in fact causing you issues, is to do away with your natural keys and implement surrogate keys. A surrogate key means using a standard, unique key such as an int that increments over time. Rather than having days.nombre as your foreign key, you would have days.admin_id as your foreign key.
As a rule, while there are many pros to natural keys and it is a debate which rages on, it is generally accepted that natural keys only work if the keys are consistent and unique.
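To chase down the non-matching rows concretely, here is a small diagnostic sketch (SQLite standing in for MySQL; the table names follow the question, the data is invented). A LEFT JOIN exposes exactly the days rows whose nombre finds no admin.username:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE admin (username TEXT);
    CREATE TABLE days  (id INTEGER, nombre TEXT);
    INSERT INTO admin VALUES ('ana');
    INSERT INTO days  VALUES (9, 'ana'), (9, ' ana');  -- note the leading space
""")

orphans = con.execute("""
    SELECT days.id, '[' || days.nombre || ']' AS nombre
    FROM days
    LEFT JOIN admin ON days.nombre = admin.username
    WHERE admin.username IS NULL
""").fetchall()
print(orphans)  # [(9, '[ ana]')] -- the stray space breaks the match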
Just guessing, but here's something that caused a problem for me recently: check your table and column definitions to make sure the character sets are consistent. It looks like you have a mixture of English and Spanish, so perhaps some non-ASCII characters like ñ are not matching as expected.

Complex Queries in Crate DB possible?

I just want to convert all my MySQL tables into Crate tables; this is actually a mobile app backend. Is it really possible in Crate to do the exact query operations MySQL supports?
I didn't see any JOIN, INTERSECT, UNION, etc. I can't even use a subquery (the IN operator) in Crate.
I also didn't see primary key to foreign key relations set on tables.
Please help me do all of the above in Crate DB.
I love Crate; it seems really fast, but it lacks the complex queries that normal MySQL can execute.
Crate currently doesn't support joins or subselects, although support will be added in the future (see https://news.ycombinator.com/item?id=7611399).
There are also no relations between tables, which is why there are no foreign keys.
Many of the things that are accomplished using joins can instead be done by de-normalizing the model and making use of the object and array types, as sketched below.
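For example (plain Python, field names invented): instead of joining users to addresses at query time, you embed the addresses as an array of objects inside each user document, which maps onto Crate's ARRAY and OBJECT column types.

users = [{"id": 1, "name": "ana"}]
addresses = [
    {"user_id": 1, "city": "Berlin"},
    {"user_id": 1, "city": "Dornbirn"},
]

# One denormalized document per user, with the would-be join target
# embedded as an array of objects:
docs = [
    dict(u, addresses=[a for a in addresses if a["user_id"] == u["id"]])
    for u in users
]
print(docs)
# [{'id': 1, 'name': 'ana', 'addresses': [{'user_id': 1, 'city': 'Berlin'},
#                                         {'user_id': 1, 'city': 'Dornbirn'}]}]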
Update: With 0.54.X there is initial (limited) support for joins.
Limited in that some forms (outer joins, for example) are missing, and there is still a lot of room for performance improvements.

Foreign keys when cascades aren't needed

If I don't need to use cascade/restrict and similar constraints in a field which would logically be a foreign key, do I have any reason to explicitly declare it as a foreign key, other than aesthetics?
Wouldn't it actually decrease performance, since it has to test for integrity?
edit: to clarify, I don't need it since:
I won't edit or delete those values anyway, so I don't need cascade and similar checks
Before calling INSERT, I'll check anyway if the target key exists, so I don't need restrict checks either
I understand that this kind of constraint will ensure that that relation will be still valid if the database becomes somehow corrupted, and that is a good thing. However, I'm wondering if there is any other reason to use this function in my case. Am I missing something?
The answers to this question might actually also apply to your question.
If you have columns in tables which reference rows in other tables, you should always be using foreign keys: even if you think you 'do not need' the features offered by those checks, they will still help guarantee data integrity in case you forgot a check in your own code.
The performance impact of foreign key checks is negligible in most cases (see the link above), since relational databases use very optimized algorithms to perform them (after all, they are a key feature, since they are what actually defines relations between entities).
Another major advantage of FKs is that they will also help others understand the layout of your database.
Edit:
Since the question linked above is referring to SQL-Server, here's one with replies of a very similar kind for MySQL: Does introducing foreign keys to MySQL reduce performance
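To make the integrity guarantee concrete, here is a minimal sketch using SQLite via Python's sqlite3 (the schema is invented; note that SQLite only enforces FKs once the pragma is switched on):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    INSERT INTO author VALUES (1);
""")

con.execute("INSERT INTO book VALUES (1, 1)")       # fine: author 1 exists
try:
    con.execute("INSERT INTO book VALUES (2, 99)")  # no author 99
except sqlite3.IntegrityError as exc:
    print(exc)  # FOREIGN KEY constraint failed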
You should do it. If it affects write performance at all, that's a "pixel-sized" problem.
The main performance concerns are on the read side, where FKs can help the query optimizer select the best plan, and so on. Even if your DBMS (or DBMSs, if you ship a cross-DBMS solution) doesn't gain from them now, it may later.
So the answer is: yes, it's not only aesthetics.