I'm looking for a way in SQLAlchemy to do a bulk INSERT whose rows are the result of a query. I know the session has the add function, which can be used to add an individual object, but I can't seem to find how it works with a subquery.
I know I could iterate over the results of the subquery and add them individually, but this would seem to be somewhat inefficient. In my case I am dealing with a potentially very large set of data that needs insertion.
I see the following options:
using the SA Model: create model objects from data loaded from the database, add them to the session and commit.
Pros: if you have any SA Model-level validation, you are able to use it; you can also insert into multiple tables if your model objects are mapped to multiple tables (joined-table inheritance); it is RDBMS independent
Cons: most expensive
using Insert statements: load data from the database into Python and execute it using Insert Expressions (see the sketch after this list)
Pros: somewhat faster when compared to 1.
Cons: still expensive, as Python structures are created; cannot directly handle joined-table inheritance
create data using the RDBMS alone: bulk insert using the RDBMS only, bypassing SA and Python altogether.
Pros: fastest
Cons: no business object validation performed; potentially RDBMS-specific implementation required
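To make the middle option concrete, here is a rough sketch of option 2) with SQLAlchemy Core; engine, src_table, dest_table and the column names are made-up placeholders, not part of the original question:

from sqlalchemy.sql import select

# Option 2: pull the rows into Python, then re-insert them with an
# executemany-style Core insert (src_table/dest_table are assumed Table objects).
with engine.begin() as conn:
    rows = conn.execute(select([src_table.c.col1, src_table.c.col2])).fetchall()
    conn.execute(
        dest_table.insert(),
        [{"col1": r.col1, "col2": r.col2} for r in rows],
    )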
I would suggest either option 1) or 3).
In fact, if you do not have any object validation and you use only one RDBMS, I would stick to option 3).
I believe the only way to do this in SQLAlchemy is to issue a raw SQL statement using Session.execute.
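For instance, something along these lines (the table and column names are purely illustrative):

from sqlalchemy import text

# Hypothetical names; the whole INSERT ... SELECT runs inside the database,
# so no rows are transferred to Python.
session.execute(text(
    "INSERT INTO dest_table (col1, col2) SELECT col3, col4 FROM src_table"
))
session.commit()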
Since this is the top result on Google for this common question, and there is actually a much better solution available, here's the updated answer. You may use the Insert.from_select() method. It is documented here, although it is otherwise rather hard to find.
A quick primer
When working with Table objects you could use something like:
>>> from sqlalchemy.sql import select
>>> stmt = TargetTable.insert().from_select(
...     [TargetTable.c.user_id, TargetTable.c.user_name],
...     select([SrcTable.c.user_id, SrcTable.c.user_name]))
>>> print(stmt)
INSERT INTO "TargetTable" (user_id, user_name) SELECT "SrcTable".user_id, "SrcTable".user_name
FROM "SrcTable"
Finally execute with engine.execute(stmt) or the like.
The final output statement is compiled by SQLAlchemy depending on the dialect used in the engine. Here I used the SQLite dialect.
This successfully avoids loading any data into Python objects, and lets the database engine efficiently handle everything. Hurray!
As opposed to using textual SQL statements with text(), this method is also RDBMS independent, because it still uses the SQLAlchemy Expression Language as described here. This language makes sure to compile to the right dialect when executed.
Using ORM Tables
The original question points to a use case where the ORM is used for interaction with the database. You probably defined your tables using the ORM base as well. The metadata stored in these objects works just a little differently here, so we'll modify the example a bit:
>>> from sqlalchemy.sql import select, insert
>>> stmt = insert(TargetTable).from_select(
...     [TargetTable.user_id, TargetTable.user_name],
...     select([SrcTable.user_id, SrcTable.user_name]))
>>> engine.execute(stmt)
INFO sqlalchemy.engine.base.Engine INSERT INTO "TargetTable" (user_id, user_name) SELECT "SrcTable".user_id, "SrcTable".user_name
FROM "SrcTable"
Well, look at that. It actually even made it a little simpler.
And it will be much faster.
P.S. Here's another secret from the docs. Want to use SQL WITH statements in the same dynamic way? You can do it with CTEs.
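As a rough sketch of that idea, reusing the same assumed ORM classes as above (the filter and the CTE name are invented for illustration):

from sqlalchemy.sql import select, insert

# Build a CTE over the source rows...
recent_users = (
    select([SrcTable.user_id, SrcTable.user_name])
    .where(SrcTable.user_id > 100)   # arbitrary example filter
    .cte("recent_users")
)

# ...and feed it into the INSERT ... FROM SELECT, which should compile to a
# statement carrying a WITH clause.
stmt = insert(TargetTable).from_select(
    [TargetTable.user_id, TargetTable.user_name],
    select([recent_users.c.user_id, recent_users.c.user_name]),
)
engine.execute(stmt)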
Given that the rows are the result of a query, you can try INSERT INTO ... SELECT; this way the rows are never transferred to the client. And don't forget the autocommit=True:
from sqlalchemy.sql import text
query_text = text(
    "INSERT INTO dest_table (col1, col2) SELECT col3, col4 FROM src_table"
)
with engine.connect().execution_options(autocommit=True) as conn:
    rs = conn.execute(query_text)
Related
I use SQLDelight's MySQL dialect on my server. Recently I planned to migrate a table to combine many fields into a JSON field so the server code no longer needs to know the complex data structure. As part of the migration, I need to do something like this at runtime: when the server sees a client with the new version, it knows the client won't access the old table anymore, so it's safe to migrate the record to the new table.
INSERT OR IGNORE INTO new_table SELECT id, a, b, JSON_OBJECT('c', c, 'd', JSON_OBJECT(…)) FROM old_table WHERE id = ?;
The only problem is - Unlike the SQLite dialect, the MySQL dialect doesn't recognize JSON_OBJECT or other JSON expressions, even though in this case it doesn't have to - no matter how complex the query is, the result is not passed back to Kotlin.
I wish I could add the feature by myself, but I'm pretty new to Kotlin. So my question is: is there a way to evade the rigid syntax check? I could also retrieve from old table, convert the format in Kotlin, then write to the new table, but that would take hundreds of lines of complex code, instead of just one INSERT.
I assume from your links that you're on the alpha releases already. In alpha03 you can add currently unsupported behaviour by creating a local SQLDelight module (see this example) and adding JSON_OBJECT to the functionType override. Also, new function types are one of the easiest things to contribute upstream to SQLDelight, so if you want it in the next release, contributing it is an option.
For the record I ended up using CONCAT with COALESCE as a quick and dirty hack to scrape the fields together as JSON.
Suppose I have a CSV file with 1M email addresses. I need to iterate through the file and add each entry, for example:
with open(file) as csv:
    for item in csv:
        Email.objects.create(email=item)
This seems like it would be very slow, going through the Django ORM like this to create 1M objects and insert them into the db. Is there a better way than this, or should I skip Django for this task and do it directly with the db?
You can also try using the new bulk_create.
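A minimal sketch, assuming an Email model with an email field and Django 1.4 or newer; the path variable and the batch size are placeholders:

from itertools import islice

BATCH = 1000  # arbitrary chunk size to keep memory bounded

with open(path) as f:
    while True:
        chunk = list(islice(f, BATCH))
        if not chunk:
            break
        Email.objects.bulk_create(
            [Email(email=line.strip()) for line in chunk]
        )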
Besides bulk_create, you could put all inserts into one transaction as long as your DB backend supports it:
from django.db.transaction import commit_on_success

# with commit_on_success(), open(file) as csv:  # in Python 2.7
with commit_on_success():
    for item in csv:
        Email.objects.create(email=item)
Also note that bulk_create treats items with the same values as being the same; thus
Email.objects.bulk_create([Email(email=item), Email(email=item)])
actually creates one row instead of two
Because of the extra SQL round-trips, the transaction solution is still slower than the bulk_create one, but you don't have to create all one million Email() instances in memory (a generator doesn't seem to work here)
Furthermore, you could do it at the SQL level directly
This is something you should drop down to the DB-API to accomplish, since it bypasses creating all the model objects.
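For example, a rough sketch with Django's raw cursor; the table name is whatever Django generated for the Email model (hypothetical here), and the %s placeholders assume a backend like MySQL or PostgreSQL:

from django.db import connection

cursor = connection.cursor()
with open(path) as f:
    # executemany sends the rows without instantiating any model objects;
    # depending on the Django version you may still need to commit explicitly.
    cursor.executemany(
        "INSERT INTO myapp_email (email) VALUES (%s)",
        [(line.strip(),) for line in f],
    )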
IMHO, I don't see a very big problem with speed if it's only a one-time insert (1M records won't take you hours). If you'll be using the Django API to access those objects in the future, then you should probably avoid resorting to a SQL-level insert and do it through Django's methods, as suggested by livar (if using Django 1.4).
You might want to look into the Django DSE package, which is apparently an efficient bulk insert/update library.
When using Liquibase, is there any way to use existing data to generate some of the data that is to be inserted?
For example, say I'd want to update a row with id 5, but I don't know up front that the id will be 5, as this is linked to another table where I will actually be getting the id from. Is there any way for me to tell Liquibase to get the id from a SELECT query?
I'm guessing this isn't really possible as I get the feeling Liquibase is really designed for a very structured non-dynamic approach, but it doesn't hurt to ask.
Thanks.
You cannot use the built-in changes to insert data based on existing data, but you can use the <sql> tag with insert statements containing nested selects.
For example:
<changeSet>
    <sql>insert into person (name, manager_id) values ('Fred', (select id from person where name='Ted'))</sql>
</changeSet>
Note: the SQL (and support for insert+select) depends on database vendor.
It is possible to write your own custom refactoring class to generate SQL. The functionality is designed to support the generation of static SQL based on the changeset's parameters.
So it's feasible to obtain a connection to the database, but the health warning attached to this approach is that the generated SQL is dynamic (your data could change) and tied tightly to your database instance.
An example of problems this will cause is an inability to generate a SQL upgrade script for a DBA to run against a production database.
I've been thinking about this use-case for some time. I still don't know if liquibase is the best solution for this data management problem or whether it needs to be combined with an additional tool like dbunit.
I'm writing a test framework in which I need to capture a MySQL database state (table structure, contents etc.).
I need this to implement a check that the state was not changed after certain operations. (Autoincrement values may be allowed to change, but I think I'll be able to handle this.)
The dump should preferably be in a human-readable format (preferably an SQL code, like mysqldump does).
I wish to limit my test framework to use a MySQL connection only. To capture the state it should not call mysqldump or access the filesystem (like copying *.frm files or doing SELECT INTO a file; pipes are fine though).
As this would be test-only code, I'm not concerned by the performance. I do need reliable behavior though.
What is the best way to implement the functionality I need?
I guess I should base my code on some of the existing open-source backup tools... Which is the best one to look at?
Update: I'm not specifying the language I write this in (no, that's not PHP), as I don't think I would be able to reuse code as is; my case is rather special (for practical purposes, let's assume the MySQL C API). The code would run on Linux.
Given your requirements, I think you are left with something like this (pseudo-code + SQL):
tables = mysql_fetch "SHOW TABLES"
foreach table in tables
    create = mysql_fetch "SHOW CREATE TABLE table"
    print create
    rows = mysql_fetch "SELECT * FROM table"
    foreach row in rows
        // or could use VALUES (v1, v2, ...), (v1, v2, ...), ... syntax (maybe preferable for smaller tables)
        insert = "INSERT INTO table (field1, field2, field3, ...) VALUES (value1, value2, value3, ...)"
        print insert
Basically, fetch the list of all tables, then walk each table and generate INSERT statements for each row by hand (most APIs have a simple way to fetch the list of column names; otherwise you can fall back to calling DESC table_name).
SHOW CREATE TABLE is done for you, but I'm fairly certain there's nothing analogous along the lines of a SHOW INSERT ROWS.
And of course, instead of printing the dump you could do whatever you want with it.
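The question isn't tied to Python, but as a purely illustrative sketch under a PEP 249 connection (e.g. PyMySQL or mysqlclient), the pseudo-code above might look roughly like this; the value quoting is deliberately naive:

def dump_database(conn):
    """Return a crude SQL dump of every table reachable via `conn`."""
    cur = conn.cursor()
    cur.execute("SHOW TABLES")
    tables = [row[0] for row in cur.fetchall()]

    dump = []
    for table in tables:
        cur.execute("SHOW CREATE TABLE `%s`" % table)
        dump.append(cur.fetchone()[1] + ";")          # the CREATE TABLE statement

        cur.execute("SELECT * FROM `%s`" % table)
        columns = ", ".join(desc[0] for desc in cur.description)
        for row in cur.fetchall():
            # NOTE: repr() is only a stand-in; real code needs proper SQL quoting/escaping.
            values = ", ".join(repr(v) for v in row)
            dump.append("INSERT INTO `%s` (%s) VALUES (%s);" % (table, columns, values))
    return "\n".join(dump)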
If you don't want to use command-line tools (in other words, you want to do it completely within, say, PHP or whatever language you are using), then why not iterate over the tables using SQL itself? For example, to check the table structure, one simple technique would be to capture a snapshot of it with SHOW CREATE TABLE table_name, store the result, and later make the call again and compare the results.
Have you looked at the source code for mysqldump? I am sure most of what you want would be contained within that.
Unless you build the export yourself, I don't think there is a simple solution to export and verify the data. If you do it table by table, LOAD DATA INFILE and SELECT ... INTO OUTFILE may be helpful.
I find it easier to rebuild the database for every test. At least, I can know the exact state of the data. Of course, it takes more time to run those tests, but it's a good incentive to abstract away the operations and write fewer tests that depend on the database.
Another alternative, which I use on some projects where the design does not allow such a clean division, is to rely on InnoDB or some other transactional database engine. As long as you keep track of your transactions, or disable them during the test, you can simply start a transaction in setUp() and roll back in tearDown().
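A minimal sketch of that pattern with a plain PEP 249 connection (get_connection() is a placeholder for your own connection factory); note that MySQL DDL statements commit implicitly, so the trick only covers row changes:

import unittest

class DatabaseStateTest(unittest.TestCase):
    def setUp(self):
        self.conn = get_connection()          # placeholder connection factory
        self.conn.autocommit(False)           # avoid implicit commits per statement
        self.conn.cursor().execute("START TRANSACTION")

    def tearDown(self):
        self.conn.rollback()                  # discard everything the test changed
        self.conn.close()

    def test_operation_leaves_data_unchanged(self):
        ...                                   # exercise the code under test with self.conn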
I'm trying to manually manage some geometry (spatial) columns in a rails model.
When updating the geometry column I do this in rails:
self.geom="POINTFROMTEXT('POINT(#{lat},#{lng})')"
This is the value I want in the SQL updates, so that it gets evaluated by the database. However, by the time it has been through the ActiveRecord magic, it comes out as:
INSERT INTO `places` (..., `geom`) VALUES(...,'POINTFROMTEXT(\'POINT(52.2531519,20.9778386)\')')
In other words, the quotes are escaped. This is fine for the other columns, as it prevents SQL injection, but not for this one. The values are guaranteed to be floats, and I want the update to look like:
INSERT INTO `places` (..., `geom`) VALUES(...,'POINTFROMTEXT('POINT(52.2531519,20.9778386)')')
So is there a way to turn escaping off for a particular column? Or a better way to do this?
(I've tried using GeoRuby + Spatial Adapter, but Spatial Adapter seems too buggy to me, plus I don't need all the functionality; hence trying to do it directly.)
The Rails Spatial Adapter should implement exactly what you need. Although, before I found GeoRuby & Spatial Adapter, I was doing this:
Have two fields on the model: one text field and a real geometry field
In an after_save hook, I ran something like this:
connection.execute "update mytable set geom_column=#{text_column} where id=#{id}"
But the solution above was just a hack, and it has additional issues: I can't create a spatial index if the column allows NULL values, MySQL doesn't let me set a default value on a geometry column, and the save method fails if the geometry column doesn't have a value set.
So I would try GeoRuby & Spatial Adapter instead, or reuse some of its code (on my case, I am considering extracting only the GIS-aware MysqlAdapter#quote method from the Spatial Adapter code).
You can use an after_save callback and write the values with a direct SQL UPDATE call. Annoying, but it should work.
You should be able to create a trigger in your DB migration using the 'execute' method... but I've never tried it.
Dig into ActiveRecord's calculate functionality: max/min/avg, etc. Not sure whether this saves you much over the direct SQL call in after_save. See calculations.rb.
You could patch the function that quotes the attributes (looking for POINTFROMTEXT and then skipping the quoting). This is pretty easy to find, as all the methods start with quote. Start with ActiveRecord::Base#quote_value.