I need to import a database (given in JSON format) of papers and authors.
The database is very large (194 million entries) so I am forced to use django's bulk_create() method.
To load the authors for the first time I use the following script:
def load_authors(paper_json_entries: List[Dict[str, any]]):
authors: List[Author] = []
for paper_json in paper_json_entries:
for author_json in paper_json['authors']:
# len != 0 is needed as a few authors dont have a id
if len(author_json['ids']) and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
authors.append(Author(author_id=author_json['ids'][0], name=author_json['name']))
Author.objects.bulk_create(set(authors))
However, this is much too slow.
The bottleneck is this query:
and not Author.objects.filter(author_id=author_json['ids'][0]).exists():
Unfortunately I have to make this query, because of course one author can write multiple papers and otherwise there will be a key conflict.
Is there a way to implement something like the normal get_or_create() efficiently with bulk_create?
To avoid creating entries with existing unique keys, you can enable the ignore_conflicts parameter:
def load_authors(paper_json_entries: List[Dict[str, any]]):
Author.objects.bulk_create(
(
Author(author_id=author_json['ids'][0], name=author_json['name'])
for paper_json in paper_json_entries
for author_json in paper_json['authors']
),
ignore_conflicts=True
)
Related
I have an SQL alchemy query in a dataloader.
query = (
db.session.query(Thing1)
.join(Thing2)
.outerjoin(Thing2.possible_related_object_1)
.outerjoin(Thing2.possible_related_object_1)
.with_entities(
Thing1.id,
Thing2,
PossibleRelatedObject1.name,
PossibleRelatedObject2.name,
)
)
My question: I only need the name fields from the possible related objects, if they exist. Does the fact that I am using with_entities here ensure that those columns on PossibleRelatedObject1 and PossibleRelatedObject2 get preloaded?
Because this is in a dataloader, I am trying to avoid any lazy loading. Normally, the strategy in this repo I'm working in has been to use joinedload to ensure eager loading, but in this case I don't see how to use joinedload because the possible related objects don't have a direct relationship to the query root. And my current strategy above seems pretty fast.
The SQLite JSON1 extension has some really neat capabilities. However, I have not been able to figure out how I can update or insert individual JSON attribute values.
Here is an example
CREATE TABLE keywords
(
id INTEGER PRIMARY KEY,
lang INTEGER NOT NULL,
kwd TEXT NOT NULL,
locs TEXT NOT NULL DEFAULT '{}'
);
CREATE INDEX kwd ON keywords(lang,kwd);
I am using this table to store keyword searches and recording the locations from which the search was ininitated in the object locs. A sample entry in this database table would be like the one shown below
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1}'
The location object attributes here are indices to the actual locations stored elsewhere.
Now imagine the following scenarios
A search for stackoverflow is initiated from location index "2". In this case I simply want to increment the value at that index so that after the operation the corresponding row reads
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":2,"5":1}'
A search for stackoverflow is initiated from a previously unknown location index "7" in which case the corresponding row after the update would have to read
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1,"7":1}'
It is not clear to me that this can in fact be done. I tried something along the lines of
UPDATE keywords json_set(locs,'$.2','2') WHERE kwd = 'stackoverflow';
which gave the error message error near json_set. I'd be most obliged to anyone who might be able to tell me how/whether this should/can be done.
It is not necessary to create such complicated SQL with subqueries to do this.
The SQL below would solve your needs.
UPDATE keywords
SET locs = json_set(locs,'$.7', IFNULL(json_extract(locs, '$.7'), 0) + 1)
WHERE kwd = 'stackoverflow';
I know this is old, but it's like the first link when searching, it deserves a better solution.
I could have just deleted this question but given that the SQLite JSON1 extension appears to be relatively poorly understood I felt it would be more useful to provide an answer here for the benefit of others. What I have set out to do here is possible but the SQL syntax is rather more convoluted.
UPDATE keywords set locs =
(select json_set(json(keywords.locs),'$.**N**',
ifnull(
(select json_extract(keywords.locs,'$.**N**') from keywords where id = '1'),
0)
+ 1)
from keywords where id = '1')
where id = '1';
will accomplish both of the updates I have described in my original question above. Given how complicated this looks a few explanations are in order
The UPDATE keywords part does the actual updating, but it needs to know what to updatte
The SELECT json_set part is where we establish the value to be updated
If the relevant value does not exsit in the first place we do not want to do a + 1 on a null value so we do an IFNULL TEST
The WHERE id = bits ensure that we target the right row
Having now worked with JSON1 in SQLite for a while I have a tip to share with others going down the same road. It is easy to waste your time writing extremely convoluted and hard to maintain SQL in an effort to perform in-place JSON manipulation. Consider using SQLite in memory tables - CREATE TEMP TABLE... to store intermediate results and write a sequence of SQL statements instead. This makes the code a whole lot eaiser to understand and to maintain.
I am using the code below to extract table names on a database at a GET call in a Flask app.:
session = db.session()
qry = session.query(models.BaseTableModel)
results = session.execute(qry)
table_names = []
for row in results:
for column, value in row.items():
#this seems like a bit of a hack
if column == "tables_table_name":
table_names.append(value)
print('{0}: '.format(table_names))
Given that tables in the database may added/deleted regularly, is the code above an efficient and reliable way to get the names of tables in a database?
One obvious optimization is to use row["tables_table_name"] instead of second loop.
Assuming that BaseTableModel is a table, which contains names of all other tables, than you're using the fastest approach to get this data.
I am doing something like this:
data = Model.where('something="something"')
random_data = data.rand(100..200)
returns:
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Once I get this random data, I need to iterate through that data, like this:
random_data.each do |rd|
...
I know there's a way to fetch random data in MySQL, but I need to pick the random data like 400 times, so I think to load data once from database and 400 times to pick random number is more efficient than to run the query 400 times on MySQL.
But - how to get rid of that error?
NoMethodError (private method `rand' called for #<User::ActiveRecord_Relation:0x007fbab27d7ea8>):
Thank you in advance
I would add the following scope to the model (depends on the database you are using):
# to model/model.rb
# 'RANDOM' works with postgresql and sqlite, whereas mysql uses 'RAND'
scope :random, -> { order('RAND()') }
Then the following query would load a random number (in the range of 200-400) of objects in one query:
Model.random.limit(rand(200...400))
If you really want to do that in Rails and not in the database, then load all records and use sample:
Model.all.sample(rand(200..400))
But that to be slower (depending on the number of entries in the database), because Rails would load all records from the database and instantiate them what might take loads of memory.
It really depends how much effort you want to put into optimizing this, because there's more than one solution. Here's 2 options..
Something simple is to use ORDER BY RAND() LIMIT 400 to randomly select 400 items.
Alternatively, just select everything under the moon and then use Ruby to randomly pick 400 out of the total result set, ex:
data = Model.where(something: 'something').all # all is necessary to exec query
400.times do
data.sample # returns a random model
end
I wouldn't recommend the second method, but it should work.
Another way, which is not DB specific is :
def self.random_record
self.where('something = ? and id = ?', "something", rand(self.count))
end
The only things here is - 2 queries are being performed. self.count is doing one query - SELECT COUNT(*) FROM models and the other is your actual query to get a random record.
Well, now suppose you want n random records. Then write it like :
def self.random_records n
records = self.count
rand_ids = Array.new(n) { rand(records) }
self.where('something = ? and id IN (?)',
"something", rand_ids )
end
Use data.sample(rand(100..200))
for more info why rand is not working, read here https://rails.lighthouseapp.com/projects/8994-ruby-on-rails/tickets/4555
I've got to add like 25000 records to database at once in Rails.
I have to validate them, too.
Here is what i have for now:
# controller create action
def create
emails = params[:emails][:list].split("\r\n")
#created_count = 0
#rejected_count = 0
inserts = []
emails.each do |email|
#email = Email.new(:email => email)
if #email.valid?
#created_count += 1
inserts.push "('#{email}', '#{Date.today}', '#{Date.today}')"
else
#rejected_count += 1
end
end
return if emails.empty?
sql = "INSERT INTO `emails` (`email`, `updated_at`, `created_at`) VALUES #{inserts.join(", ")}"
Email.connection.execute(sql) unless inserts.empty?
redirect_to new_email_path, :notice => "Successfuly created #{#created_count} emails, rejected #{#rejected_count}"
end
It's VERY slow now, no way to add such number of records 'cause of timeout.
Any ideas? I'm using mysql.
Three things come into mind:
You can help yourself with proper tools like:
zdennis/activerecord-import or jsuchal/activerecord-fast-import. The problem is with, your example, that you will also create 25000 objects. If you tell activerecord-import to not use validations, it will not create new objects (activerecord-import/wiki/Benchmarks)
Importing tens thousands of rows into relational database will never be super fast, it should be done asynchronously via background process. And there are also tools for that, like DelayedJob and more: https://www.ruby-toolbox.com/
Move the code that belongs to model out of controller(TM)
And after that, you need to rethink the flow of this part of application. If you're using background processing inside a controller action like create, you can not just simply return HTTP 201, or HTTP 200. What you need to do is to return "quick" HTTP 202 Accepted, and provide a link to another representation where user could check the status of their request (do we already have success response? how many emails failed?), as it is in now beeing processed in the background.
It can sound a bit complicated, and it is, which is a sign, that you maybe shouldn't do it like that. Why do you have to add like 25000 records in one request? What's the backgorund?
Why don't you create a rake task for the work? The following link explains it pretty well.
http://www.ultrasaurus.com/sarahblog/2009/12/creating-a-custom-rake-task/
In a nutshell, once you write your rake task, you can kick off the work by:
rake member:load_emails
If speed is your concern, I'd attack the problem from a different angle.
Create a table that copies the structure of your emails table; let it be emails_copy. Don't copy indexes and constraints.
Import the 25k records into it using your database's fast import tools. Consult your DB docs or see e.g. this answer for MySQL. You will have to prepare the input file, but it's way faster to do — I suppose you already have the data in some text or tabular form.
Create indexes and constraints for emails_copy to mimic emails table. Constraint violations, if any, will surface; fix them.
Validate the data inside the table. It may take a few raw SQL statements to check for severe errors. You don't have to validate emails for anything but very simple format anyway. Maybe all your validation could be done against the text you'll use for import.
insert into emails select * from emails_copy to put the emails into the production table. Well, you might play a bit with it to get autoincrement IDs right.
Once you're positive that the process succeeded, drop table emails_copy.