Doing a bulk SQL insert in django - mysql

Suppose I have a CSV file with 1M email addresses. I need to iterate through the file and add each entry, for example:
with open(file) as csv:
    for item in csv:
        Email.objects.create(email=item)
Going through the Django ORM like this to create 1M objects and insert them into the DB seems like it would be very slow. Is there a better way, or should I step away from Django for this task and do it directly against the database?

You can also try using the new bulk_create (added in Django 1.4):
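A minimal sketch of how that could look for the CSV case (assuming the same Email model as in the question; batch_size is available in newer Django versions and simply caps the size of each INSERT):

from myapp.models import Email  # hypothetical app path

def load_emails(path):
    # Build the instances in memory without touching the database,
    # then insert them in batched INSERT statements.
    with open(path) as f:
        emails = [Email(email=line.strip()) for line in f if line.strip()]
    Email.objects.bulk_create(emails, batch_size=1000)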

Besides bulk_create, you could put all inserts into one transaction as long as your DB backend supports it:
from django.db.transaction import commit_on_success
# In Python 2.7+ the two context managers can be combined:
#   with commit_on_success(), open(file) as csv:
with commit_on_success():
    for item in csv:
        Email.objects.create(email=item)
Also note that bulk_create treats items with the same values as identical, thus
Email.objects.bulk_create([Email(email=item), Email(email=item)])
actually creates one row instead of two.
Because of the extra SQL round-trips, the transaction solution is still slower than the bulk_create one, but you don't have to hold all one million Email() instances in memory (a generator doesn't seem to work here).
Furthermore, you could do it directly at the SQL level.
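For example, since the source is already a CSV file, MySQL's LOAD DATA INFILE can ingest it in a single statement. A rough sketch, assuming one address per line, a hypothetical myapp_email table with an email column, and that the server and client both allow LOCAL infile:

LOAD DATA LOCAL INFILE '/path/to/emails.csv'
INTO TABLE myapp_email
LINES TERMINATED BY '\n'
(email);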

This is something you should drop down to the DB-API to accomplish, since it bypasses creating all the model objects.
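A minimal sketch of what that could look like from inside Django, using the DB-API cursor directly (the table name below follows Django's default <app>_<model> naming and is an assumption):

from django.db import connection

def bulk_insert_emails(path):
    with open(path) as f:
        rows = [(line.strip(),) for line in f if line.strip()]
    cursor = connection.cursor()
    # One parameterized statement executed for the whole batch.
    cursor.executemany("INSERT INTO myapp_email (email) VALUES (%s)", rows)
    # Depending on your Django version / transaction settings you may need an explicit commit here.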

IMHO, I don't see a very big problem with speed if it's only a one-time insert (1M records won't take you hours). If you'll be using the Django API to access those objects in the future, then you should probably avoid resorting to SQL-level inserts and do it through Django's methods, as suggested by livar (if you're using Django 1.4).

You might want to look into the Django DSE package, which is apparently an efficient bulk insert/update library.

Related

Validation of migrated data for MySQL

I'm migrating a large (approx. 10GB) MySQL database (InnoDB engine).
I've figured out the migration part. Export -> mysqldump, Import -> mysql.
However, I'm trying to figure out the optimum way to validate if the migrated data is correct. I thought of the following approaches but they don't completely work for me.
One approach could have been using CHECKSUM TABLE. However, I can't use it since the target database would have data continuously written to it (from other sources) even during migration.
Another approach could have been using the combination of MD5(), GROUP_CONCAT, and CONCAT. However, that also won't work for me as some of the columns contain large JSON data.
So, what would be the best way to validate that the migrated data is correct?
Thanks.
How about this?
Do SELECT ... INTO OUTFILE from each old and new table, writing them into .csv files. Then run diff(1) between the files, eyeball the results, and convince yourself that the new tables' rows are an appropriate superset of the old tables'.
These flat files are modest in size compared to a whole database and diff is fast enough to be practical.
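A rough sketch of the OUTFILE step, with hypothetical table and path names (the MySQL user needs the FILE privilege, and the path must be writable by the server); fixing a deterministic ORDER BY keeps the two dumps lined up for diff:

-- run against the old server, then again with a different path against the new one
SELECT id, customer_id, total INTO OUTFILE '/tmp/orders_old.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n'
  FROM orders
  ORDER BY id;

-- then, on the shell: diff /tmp/orders_old.csv /tmp/orders_new.csv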

Solr: continuous migration from MySQL

This may sound like an opinion question, but it's actually a technical one: Is there a standard process for maintaining a simple data set?
What I mean is this: let's say all I have is a list of something (we'll say books). The primary storage engine is MySQL. I see that Solr has a data import handler. I understand that I can use this to pull in book records on a first run - is it possible to use this for continuous migration? If so, would it work as well for updating books that have already been pulled into Solr as it would for pulling in new book records?
Otherwise, if the data import handler isn't the standard way to do it, what other ways are there? Thoughts?
Thank you very much for the help!
If you want to update documents from within Solr, I believe you'll need to use the UpdateRequestHandler as opposed to the DataImportHandler. I've never had need to do this where I work, so I don't know all that much about it. You may find this link of interest: Uploading Data With Index Handlers.
If you want to update Solr with records that have newly been added to your MySQL database, you would use the DataImportHandler for a delta-import. Basically, how it works is you have some kind of field in MySQL that shows the new record is, well, new. If the record is new, Solr will import it. For example, where I work, we have an "updated" field that Solr uses to determine whether or not it should import that record. Here's a good link to visit: DataImportHandler
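If it helps, kicking off the delta-import is just an HTTP call to the DIH endpoint, so it's easy to schedule from cron or similar. A minimal sketch (the core name "books", host, and port are assumptions):

from urllib.request import urlopen

# Ask Solr's DataImportHandler to pull only the rows changed since the last import.
url = ("http://localhost:8983/solr/books/dataimport"
       "?command=delta-import&clean=false&commit=true")
print(urlopen(url).read().decode("utf-8"))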
Your question looks similar to something we are doing, but not with SQL; it's with HBase (a Hadoop-stack DB). There we have the HBase Indexer which, after mapping the DB to Solr, listens for events (new rows) in HBase and then executes code to fetch those values from the DB and add them to Solr. I'm not sure whether an equivalent exists for SQL, but the concept looks similar. In SQL I know about triggers, which can listen for inserts and updates; on those events you can trigger something to execute the steps of adding the rows to Solr in a continuous manner.

Sybase to MySQL automatic exportation

I have two databases: Sybase and MySQL. I need to export records to MySQL when they are inserted in Sybase, or export them on some scheduled event.
I've tried the output statement, but it cannot be used in triggers or procedures.
Any suggestion to solve this problem?
(Disclaimer: I've done similar things previously, but by no means would I consider the answer below the state of the art - it's just one possible approach. Google around for something like 'cross-database replication' or 'cross-RDBMS replication' to see who's done this before.)
I would first of all see if you can't get an ETL tool to do the job without too much work. There are free open-source ones, and even things like Microsoft SSIS might work on non-MS databases.
If not, I would split this into different steps.
1. Find an appropriate Sybase output command that exports a subset of rows from one or more tables. By subset I mean you need to be able to add a WHERE clause, not just do a full table dump.
2. Use an appropriate MySQL import script/command to load the data you got out of step #1. You may need to cycle back and forth between the two until you have something that works manually.
3. Write a Sybase trigger to insert lookup keys into a to-export table. You want to store at least the table name and the source Sybase table's keys for each inserted row. Use generic column names like key1_char, key2_char rather than the actual column names; that makes it easier to extend to other source tables as needed. Keep trigger processing as light as possible. (What about updates, by the way?)
4. Write a scheduled batch on the Sybase side to run step #1 for the rows flagged in #3.
5. Write a scheduled batch on MySQL to import, via #2, the results of #4. Or kick it off from #4.
Another approach is to do the #3 flagging bit as needed, but use it to drive one scheduled batch that SELECTs the data from Sybase and INSERTs it into MySQL directly.
You'll have to pick up the data from Sybase's SELECT and bind it manually to MySQL's INSERT, but you probably get finer control over what's going on and you don't have to juggle two batches; that's what I think a clever ETL tool would already be doing on your behalf. Any half-clever scripting language like PHP, Python, or Ruby ought to handle it easily (a rough sketch follows below). The finer control is especially important if you have things like surrogate/auto-generated keys.
Keep in mind that in both cases you'll have to either delete the to-export rows that you've successfully inserted or flag them as done.
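A rough sketch of that direct SELECT-then-INSERT batch in Python, assuming pyodbc with an ODBC DSN for Sybase and mysql-connector-python on the MySQL side; the DSN, credentials, table, and column names are all hypothetical:

import pyodbc
import mysql.connector

# Pull the rows flagged in the to-export table, copy them over, then mark them done.
syb = pyodbc.connect("DSN=sybase_prod;UID=exporter;PWD=secret")
my = mysql.connector.connect(host="mysql-host", user="loader",
                             password="secret", database="target_db")
syb_cur, my_cur = syb.cursor(), my.cursor()

syb_cur.execute(
    "SELECT c.id, c.name, c.email "
    "FROM customers c JOIN to_export x ON x.key1_char = c.id "
    "WHERE x.done = 0 AND x.tablename = 'customers'"
)
rows = [tuple(r) for r in syb_cur.fetchall()]

my_cur.executemany(
    "INSERT INTO customers (id, name, email) VALUES (%s, %s, %s)", rows
)
my.commit()

# Flag the exported rows as done so the next run skips them.
syb_cur.execute(
    "UPDATE to_export SET done = 1 WHERE done = 0 AND tablename = 'customers'"
)
syb.commit()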

Inserting Symfony NestedSet data via MySQL queries

On a Symfony exploration project, I have a model using the Doctrine NestedSet behaviour. Since the data is prepared in a flat file, I wrote a conversion utility to generate the corresponding YAML. It turns out that processing NestedSet YAML of around 100 records (max_depth=4) consumes over 40MB of PHP memory, which is not available to me.
Is there a work-around to this problem?
I'm thinking of 2 possible solutions:
1. Write an equivalent PHP script to populate objects and save them
2. Insert data via SQL statements, the challenge being to compute the left and right nodes
What do Symfonians suggest?
I suggest you insert the data in several passes, one per level, starting at level 0, for instance.
Option 2 was better.
I wrote a simple macro in Excel to compute the lft and rgt values required for the pre-order tree (a Python sketch of the same computation follows below), with the logic as described here: http://www.sitepoint.com/hierarchical-data-database-2/
The same Excel utility would then convert the values into a SQL query that could be dumped via a file.
Going through the fixtures/object route exceeded the allowed memory limit.
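For reference, here is a rough, framework-agnostic Python sketch of the same pre-order numbering, assuming the flat file boils down to (id, parent_id) pairs; the computed lft/rgt pairs can then be pasted into the INSERT statements:

from collections import defaultdict

def number_tree(rows):
    """Assign nested-set (lft, rgt) pairs to a flat (id, parent_id) list
    via a depth-first pre-order walk; parent_id None marks a root."""
    children = defaultdict(list)
    for node_id, parent_id in rows:
        children[parent_id].append(node_id)

    bounds, counter = {}, 1

    def visit(node_id):
        nonlocal counter
        lft = counter
        counter += 1
        for child in children[node_id]:
            visit(child)
        bounds[node_id] = (lft, counter)
        counter += 1

    for root in children[None]:
        visit(root)
    return bounds

# Purely illustrative data: (id, parent_id)
print(number_tree([(1, None), (2, 1), (3, 1), (4, 2)]))
# {4: (3, 4), 2: (2, 5), 3: (6, 7), 1: (1, 8)}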

SQL Alchemy - INSERT results of query

I'm looking for a way in SQLAlchemy to do a bulk INSERT whose rows are the result of a query. I know the session has the add function, which can be used to add an individual object, but I can't seem to find how it works with a subquery.
I know I could iterate over the results of the subquery and add them individually, but this would seem to be somewhat inefficient. In my case I am dealing with a potentially very large set of data that needs insertion.
I see the following options:
1. Using the SA Model: create the underlying objects with data loaded from the database, add them to the session, and commit.
Pros: if you have any SA Model-level validation, you are able to use it; you can also insert into multiple tables if your model objects are mapped to multiple tables (Joined-Table Inheritance); it is RDBMS independent.
Cons: most expensive.
2. Using Insert statements: load data from the database into Python and execute using Insert Expressions (a sketch follows below).
Pros: somewhat faster when compared to 1.
Cons: still expensive, as Python structures are created; cannot directly handle Joined-Table Inheritance.
3. Creating data using solely the RDBMS: bulk insert using the RDBMS only, bypassing SA and Python altogether.
Pros: fastest.
Cons: no business-object validation performed; a potentially RDBMS-specific implementation is required.
I would suggest either option 1) or 3).
In fact, if you do not have any object validation and you use only one RDBMS, I would stick to option 3).
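For option 2, a minimal sketch using a Core Insert fed a list of parameter dictionaries, which SQLAlchemy sends to the DBAPI as a single executemany batch rather than one INSERT per row (the engine URL, table, and column names are made up for the example; the style matches the 1.x API used elsewhere in this thread):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine
from sqlalchemy.sql import select

engine = create_engine("sqlite://")          # stand-in engine for the sketch
metadata = MetaData()
src_table = Table("src_table", metadata,
                  Column("col1", Integer), Column("col2", String(50)))
dest_table = Table("dest_table", metadata,
                   Column("col1", Integer), Column("col2", String(50)))
metadata.create_all(engine)

# Load the source rows into Python...
rows = engine.execute(select([src_table.c.col1, src_table.c.col2])).fetchall()

# ...then hand them to a single Core Insert as a list of dicts.
engine.execute(
    dest_table.insert(),
    [{"col1": r.col1, "col2": r.col2} for r in rows],
)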
I believe the only way to do this in SQLAlchemy is to issue a raw SQL statement using Session.execute.
Since this is the top result on Google for this common question, and there is actually a much better solution available, here's the updated answer. You may use the Insert.from_select() method. It is, although otherwise hard to find, documented here.
A quick primer
When working with Table objects you could use something like:
>>> from sqlalchemy.sql import select
>>> stmt = TargetTable.insert().from_select(
...     [TargetTable.c.user_id, TargetTable.c.user_name],
...     select([SrcTable.c.user_id, SrcTable.c.user_name]))
>>> print(stmt)
INSERT INTO "TargetTable" (user_id, user_name) SELECT "SrcTable".user_id, "SrcTable".user_name
FROM "SrcTable"
Finally execute with engine.execute(stmt) or the like.
The final output statement is compiled by SQLAlchemy depending on the dialect used in the engine. Here I used the SQLite dialect.
This successfully avoids loading any data into Python objects, and lets the database engine handle everything efficiently. Hurray!
As opposed to using textual SQL statements with text(), this method is also RDBMS independent, because it still uses the SQLAlchemy Expression Language as described here. This language makes sure to compile to the right dialect when executed.
Using ORM Tables
The original question points to a use case where the ORM is used for interaction with the database. You probably defined your tables using the ORM base as well. The metadata stored in these objects works just a little differently now, so we'll modify the example a bit:
>>> from sqlalchemy.sql import select, insert
>>> stmt = insert(TargetTable).from_select(
...     [TargetTable.user_id, TargetTable.user_name],
...     select([SrcTable.user_id, SrcTable.user_name]))
>>> engine.execute(stmt)
INFO sqlalchemy.engine.base.Engine INSERT INTO "TargetTable" (user_id, user_name) SELECT "SrcTable".user_id, "SrcTable".user_name
FROM "SrcTable"
Well look at that. It actually even made it a little simpler.
And it will be much faster.
P.S. Here's another secret from the docs: want to use SQL WITH statements in the same dynamic way? You can do it with CTEs.
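A quick sketch of that, reusing the hypothetical ORM tables from above and assuming a reasonably recent SQLAlchemy (CTE support in DML statements landed in 1.1): wrap the source query in .cte() and select from it inside from_select().

from sqlalchemy.sql import insert, select

# Wrap the source query in a named CTE (rendered as a WITH clause)...
recent = (
    select([SrcTable.user_id, SrcTable.user_name])
    .where(SrcTable.user_id > 1000)      # arbitrary filter for the example
    .cte("recent_users")
)

# ...then INSERT from a SELECT over the CTE, still without pulling rows into Python.
stmt = insert(TargetTable).from_select(
    [TargetTable.user_id, TargetTable.user_name],
    select([recent.c.user_id, recent.c.user_name]),
)
engine.execute(stmt)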
Given that the rows are the result of a query, you can try INSERT INTO ... SELECT; this way the rows are never transferred to the client. And don't forget autocommit=True:
from sqlalchemy.sql import text

query_text = text(
    "INSERT INTO dest_table (col1, col2) SELECT col3, col4 FROM src_table"
)
with engine.connect().execution_options(autocommit=True) as conn:
    rs = conn.execute(query_text)