SQLAlchemy: Does with_entities preload the selected objects and columns?

I have a SQLAlchemy query in a dataloader:
query = (
    db.session.query(Thing1)
    .join(Thing2)
    .outerjoin(Thing2.possible_related_object_1)
    .outerjoin(Thing2.possible_related_object_2)
    .with_entities(
        Thing1.id,
        Thing2,
        PossibleRelatedObject1.name,
        PossibleRelatedObject2.name,
    )
)
My question: I only need the name fields from the possible related objects, if they exist. Does the fact that I am using with_entities here ensure that those columns on PossibleRelatedObject1 and PossibleRelatedObject2 get preloaded?
Because this is in a dataloader, I am trying to avoid any lazy loading. Normally, the strategy in the repo I'm working in has been to use joinedload to ensure eager loading, but in this case I don't see how to use joinedload because the possible related objects don't have a direct relationship to the query root. And my current strategy above seems pretty fast.
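My understanding so far, as a minimal sketch (an illustration against the models above, not the repo's code):

for thing1_id, thing2, name1, name2 in query.all():
    # thing1_id, name1 and name2 arrive as plain Python values in each Row,
    # fetched together with the row itself, so accessing them cannot emit SQL.
    # thing2 is a full Thing2 entity: its columns are loaded, but any of its
    # relationships that were not eagerly loaded could still lazy-load on access.
    print(thing1_id, name1, name2)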

Related

TypeORM: how to load relations with CreateQueryBuilder, without using JOINs?

I'm developing an API using NestJS & TypeORM to fetch data from a MySQL DB. Currently I'm trying to get all the instances of an entity (HearingTonalTestPage) and all the related entities (e.g. Frequency). I can get it using createQueryBuilder:
const queryBuilder = await this.hearingTonalTestPageRepo
    .createQueryBuilder('hearing_tonal_test_page')
    .innerJoinAndSelect('hearing_tonal_test_page.hearingTest', 'hearingTest')
    .innerJoinAndSelect('hearingTest.page', 'page')
    .innerJoinAndSelect('hearing_tonal_test_page.frequencies', 'frequencies')
    .innerJoinAndSelect('frequencies.frequency', 'frequency')
    .where(whereConditions)
    .orderBy(`page.${orderBy}`, StringToSortType(pageFilterDto.ascending));
The problem here is that this will produce a SQL query (screenshot below) which will output a row for each related entity (Frequency), when I want a row for each HearingTonalTestPage (in the screenshot example, 3 rows instead of 12) without losing its relations data. Reading the docs, apparently this can be easily achieved using the relations option with .find(). With QueryBuilder I see some relation methods, but from what I've read, under the hood they will produce JOINs, which of course I want to avoid.
So the million-dollar question here is: is it possible with createQueryBuilder to load the relations after querying the main entities (something similar to .find({ relations: { } }))? If yes, how can I achieve it?
I am not an expert, but I had a similar case and using:
const qb = this.createQueryBuilder("product");

// apply relations
FindOptionsUtils.applyRelationsRecursively(qb, ["createdBy", "updatedBy"], qb.alias, this.metadata, "");

return qb
    .orderBy("product.id", "DESC")
    .limit(1)
    .getOne();
It worked for me; all relations are correctly loaded.
ref: https://github.com/typeorm/typeorm/blob/master/src/find-options/FindOptionsUtils.ts
You say that you want to avoid JOINs and are seeking an analogue of find({ relations: {} }), but, as the documentation says, find({ relations: {} }) itself uses LEFT JOINs under the hood. So when we talk about a query with relations, it can't be done without JOINs.
Now about the problem:
The problem here is that this will produce a SQL query (screenshot
below) which will output a line per each related entity (Frequency),
when I want to output a line per each HearingTonalTestPage
Your query looks fine, and the result of the query is also OK. I think you expected the query to return something like a JSON structure (where the relation field contains all of its information nested inside itself instead of creating new rows and spreading its values across several rows), but that is how SQL works. By the way, the getMany() method should return 3 HearingTonalTestPage objects, not 12, so what the SQL query returns should not worry you.
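To make that last point concrete, here is a sketch (builder and property names are taken from your code; the counts are from the screenshot example):

// getMany() hydrates the 12 joined SQL rows back into distinct root entities,
// nesting the related objects instead of repeating rows
const pages = await queryBuilder.getMany();
console.log(pages.length);                // 3 HearingTonalTestPage objects
console.log(pages[0].frequencies.length); // the related Frequency rows, nested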
The main question:
is it possible with CreateQueryBuilder to load the relations after
querying the main entities
I didn't get what you mean by "after querying the main entities". Can you provide more context?

Django: Is there a way to efficiently bulk get_or_create()

I need to import a database (given in JSON format) of papers and authors.
The database is very large (194 million entries), so I am forced to use Django's bulk_create() method.
To load the authors for the first time I use the following script:
from typing import Any, Dict, List

def load_authors(paper_json_entries: List[Dict[str, Any]]):
    authors: List[Author] = []
    for paper_json in paper_json_entries:
        for author_json in paper_json['authors']:
            # len != 0 is needed as a few authors don't have an id
            if len(author_json['ids']) and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
                authors.append(Author(author_id=author_json['ids'][0], name=author_json['name']))
    Author.objects.bulk_create(set(authors))
However, this is much too slow.
The bottleneck is this query:
and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
Unfortunately I have to make this query, because of course one author can write multiple papers and otherwise there will be a key conflict.
Is there a way to implement something like the normal get_or_create() efficiently with bulk_create?
To avoid creating entries with existing unique keys, you can enable the ignore_conflicts parameter:
def load_authors(paper_json_entries: List[Dict[str, Any]]):
    Author.objects.bulk_create(
        (
            Author(author_id=author_json['ids'][0], name=author_json['name'])
            for paper_json in paper_json_entries
            for author_json in paper_json['authors']
            if author_json['ids']  # skip the few authors without an id
        ),
        ignore_conflicts=True
    )
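At 194 million entries you will probably also want to chunk the inserts; batch_size is a standard bulk_create parameter (the chunk size below is an arbitrary illustration, and authors stands for the same iterable of Author objects as above):

Author.objects.bulk_create(
    authors,                # the generator/list of Author objects from above
    batch_size=10_000,      # send the INSERT in chunks instead of one huge statement
    ignore_conflicts=True,  # rows with already-existing author_ids are skipped
)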

Multiple, unknown number of fields passed into a query

Is it possible to create a generic query that would work for different types of documents? For example, I have "cases" and "factories"; they have different sets of fields, e.g.:
{
    id: 'case_o1',
    name: 'Case numero uno',
    amount: 40
}

{
    id: 'factory_002',
    location: 'Venezuela',
    workers: 200,
    operating: true
}
Is it possible to create a generic query where I would pass the type of an entity (case or factory) and additional parameters and it would filter results based on those?
I could of course use a JavaScript view, but it doesn't allow me to filter by multiple fields. Let's say I want to fetch all factories located in Venezuela with a number of workers between 20 and 55.
I started with this, but then I got stuck:
select * from `mybucket` as entity
where position(meta(entity).id, $entity_type) == 0
How do I pass multiple predicates and have the query to recognize them?
I can of course list fields like this:
where position(meta(entity).id, $entity_type) == 0
and entity.location == 'Venezuela'
and entity.workers > $workers_min
and entity.workers < $workers_max
but then I'm going to have to create a separate query for each entity. And even then it won't solve my problem: I have no idea how to ignore predicates. What if next time $workers_min and $workers_max are not passed? Does that mean I have to create a query for every single predicate (column)?
For security reasons I cannot generate free-form queries and pass them to the Couchbase server; all the queries are already stored in the database, and our API just picks them out of a document and executes them.
I think it's possible to create a query that would "short-circuit" for args that are undefined (e.g. WHERE $location IS MISSING OR entity.location == $location, or something like that).
Is it possible at all to create a query that can effectively filter and order a dataset based on arbitrary parameters? Or is there no way?
@Agzam: sorry, I was writing my comment when you said it. But anyway: what you are asking for is possible using COALESCE in not-too-complex expressions, but it is a REALLY bad idea, because it will defeat most internal database optimizations, including the use of any existing index. So, unless you are dealing with a relatively small database (and you are sure it will remain approximately the same size), I suggest you try a different approach… This is, in fact, the reason I implemented sqlapi.
If you need to have all queries stored in the database beforehand, it would probably be much better to sort the given arguments by name and precalculate and store a query for each possible combination.
You can do it by assigning a default value to the variable when it is not used. For instance, if $location is not used, you can set it to -1 as the default value.
Then the WHERE condition would be:
WHERE ($location = -1 OR entity.location = $location)
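Combining this with the parameters from the question, the single stored query might look like this sketch (illustrative; note the caveat from the other answer that OR-ed predicates like these can defeat existing indexes):

SELECT entity.*
FROM `mybucket` AS entity
WHERE POSITION(META(entity).id, $entity_type) == 0
  AND ($location = -1 OR entity.location = $location)
  AND ($workers_min = -1 OR entity.workers > $workers_min)
  AND ($workers_max = -1 OR entity.workers < $workers_max)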

Is it faster to follow relations in a query parameter or using model attribute lookup in Django Models?

Say I have three Django models: User; Staff, which is one-to-one with User; and Thing, which is many-to-one with Staff on the 'owner' field.
Using a MySQL database, which of these performs better?
Thing.objects.filter(owner=user.staff) # A
Thing.objects.filter(owner__user=user) # B
What about if I am checking that the Thing I want is owned by a User:
try:
    Thing.objects.get(id=some_id, owner=user.staff)   # D
    Thing.objects.get(id=some_id, owner__user=user)   # E
except Thing.DoesNotExist:
    return None
else:
    pass  # do stuff

# Or F:
thing = Thing.objects.get(id=some_id)
if thing.owner.user != user:
    return None
pass  # do stuff
It very much depends on how you got the original objects and what you've done with them since. If you've already accessed user.staff, or you originally queried User with select_related, then the first query is better as it is a simple SELECT on one table, whereas the second will do a JOIN to get to the User table.
However, if you have not already accessed user.staff and did not originally get it via select_related, the first expression will cause user.staff to be evaluated, which triggers a separate query, before even doing the Thing lookup. So in this case the second query will be preferable, since a single query with a JOIN is better than two simple queries.
Note however that this is almost certainly a micro-optimization and will have very little impact on your overall run time.
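A sketch of the two situations (select_related is standard Django; user_id is a hypothetical value, and the models are those from the question):

# staff already cached on the user instance: A is one simple SELECT on Thing
user = User.objects.select_related('staff').get(pk=user_id)
things = Thing.objects.filter(owner=user.staff)   # no extra query for staff

# staff not cached: accessing user.staff would fire a second query,
# so the JOIN form is preferable as a single query
user = User.objects.get(pk=user_id)
things = Thing.objects.filter(owner__user=user)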
Both of those queries might end up as the same SQL, depending on your settings, models, indexes and database driver. You can verify that with the .query attribute. If they differ, the only real test will be empirical. I can recommend django-devserver and IPython as profiling tools.
Thing.objects.filter(owner=user.staff) # A
Thing.objects.filter(owner__user=user) # B
I think the second one is "better". Assuming you've got the user record from the request:
B will generate one SQL query, on Thing only.
I think A will generate a query for user.staff and then one on Thing (it might also take more memory, for the staff instance).
To be sure, try this and inspect the timing and the generated queries with the debug toolbar:
for i in range(0, 100):
    things = Thing.objects.filter(owner=user.staff)  # A
    # things = Thing.objects.filter(owner__user=user)  # B

    # this will execute the queries
    for thing in things.all():
        print(thing.name)
Then replace with B...
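If you don't have the toolbar handy, a quick alternative is Django's connection.queries, which records executed SQL when DEBUG is True (a sketch):

from django.db import connection, reset_queries

reset_queries()
things = list(Thing.objects.filter(owner__user=user))  # B
print(len(connection.queries))  # how many SQL statements actually ran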

Changing FROM in all queries for an ActiveRecord model

I'm working on a rails project that is connected to a third-party MySQL database that I cannot change the schema for. So far, I've been able to shoe-horn everything into rails and make it play nice, but I've come across an interesting problem.
I have a table, we'll call it foos. I have an ActiveRecord model called Foo that uses this table. The problem is that this table represents two similar but distinct types of record. We'll call them Foo type A and Foo type B. To get around this, I've created two classes, FooTypeA and FooTypeB that inherit from Foo and have default scopes so that they only contain records of their respective types.
My code looks something like this:
class Foo < ActiveRecord::Base
  # methods common to both types
end

class FooTypeA < Foo
  default_scope -> { where(is_type_a: true) }
  # methods for type A
end

class FooTypeB < Foo
  default_scope -> { where(is_type_a: false) }
  # methods for type B
end
For the most part, this works pretty well, except for the fact that sometimes an association chain joins over both of these models. Since they come from the same table, this causes ambiguity problems, and generates exploding SQL queries. I've been writing custom join queries to get around this, but it's quickly becoming cumbersome.
I know I can change the default table name for a model with the self.table_name value, but is there a way I can tell Rails to change the FROM portion of the SQL query for a model, so that all queries from FooTypeA read as: SELECT foo_as.* FROM foos AS foo_as ...
I'm open to other suggestions, but this seems like the easiest solution if it's possible.
Wouldn't the ActiveRecord .from method solve your problem?
You could also create two views (depending on your MySQL version) and use those as the table sources, but unless you only read from the tables, you can run into writable-view issues, which I would try to avoid.
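A sketch of the view idea, wired up with the self.table_name hook from the question (view names and SQL are illustrative and untested against the third-party schema):

-- one MySQL view per type, filtering at the database level
CREATE VIEW foos_type_a AS SELECT * FROM foos WHERE is_type_a = 1;
CREATE VIEW foos_type_b AS SELECT * FROM foos WHERE is_type_a = 0;

class FooTypeA < Foo
  self.table_name = 'foos_type_a'  # the view already filters, so default_scope can go
end

class FooTypeB < Foo
  self.table_name = 'foos_type_b'
end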