I am using SQLite with SQLAlchemy Core, i.e. not using the ORM Session.
There is a single table.
Can multiple threads access and insert into that table using SQLAlchemy Core?
I see that there is a SingletonThreadPool, and it seems that engine.connect() returns a thread-local connection?
If I call engine.connect() each time I perform an insert, is that OK?
Consider this answer, which explains that you should use a scoped_session when multi-threading is required; it takes care of maintaining a single session per thread.
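On the Core side (no Session involved), the per-thread-connection pattern the question describes does work. Here is a minimal sketch, assuming a file-based SQLite database and a made-up items table; SQLite itself serializes writers, so concurrent inserts may briefly block each other but are safe:

import threading
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("sqlite:///app.db")

metadata = MetaData()
items = Table(
    "items", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)
metadata.create_all(engine)

def worker(n):
    # engine.begin() checks a connection out of the pool for this block
    # only and commits on exit; nothing is shared between threads.
    with engine.begin() as conn:
        conn.execute(items.insert(), {"name": "item-%d" % n})

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()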
Related
I am working on a high-scale application on the order of 35,000 QPS, using Hibernate and MySQL.
A large table has an auto-increment primary key, and the generation strategy defined in Hibernate is IDENTITY. Hibernate's show_sql is true as well.
Whenever an insert happens I see only one query being fired in the DB, which is an INSERT statement.
A few questions follow:
1) I was wondering: how does Hibernate get the auto-increment value after the insert?
2) If the answer is "SELECT LAST_INSERT_ID()", why does it not show up in VividCortex or in the show_sql logs?
3) How does "SELECT LAST_INSERT_ID()" account for multiple auto-increments in different tables?
4) If MySQL returns a value on insert, why aren't the MySQL clients built so that we can see what is being returned?
Thanks in advance for all the help.
You should call SELECT LAST_INSERT_ID().
Practically speaking, you can't do the same thing the MySQL JDBC driver does using another MySQL client. You'd have to write your own client that reads and writes the MySQL protocol.
The MySQL JDBC driver gets the last insert id by parsing packets of the MySQL protocol. The last insert id is returned in this protocol by a MySQL result set.
This is why SELECT LAST_INSERT_ID() doesn't show up in query metrics: the driver isn't calling that SQL statement, it's picking the integer out of the result set at the protocol level.
You asked how it's done internally. A relevant line of code is https://github.com/mysql/mysql-connector-j/blob/release/8.0/src/main/protocol-impl/java/com/mysql/cj/protocol/a/result/OkPacket.java#L55
Basically, it parses an integer from a known position in a packet as it receives a result set.
I'm not going to go into any more detail about parsing the protocol. I don't have experience coding a MySQL protocol client, and it's not something I wish to do.
I think it would not be a good use of your time to implement your own MySQL client.
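To see the same mechanism from a different driver, here is a minimal Python sketch using mysql-connector-python (the connection details and person table are made up). The driver exposes the insert id it already read off the wire, and no extra SQL statement is sent:

import mysql.connector  # assumes mysql-connector-python is installed

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="test"
)
cur = conn.cursor()
cur.execute("INSERT INTO person (name) VALUES (%s)", ("Fred",))

# lastrowid is populated from the insert id the server returned in the
# protocol's OK packet -- no SELECT LAST_INSERT_ID() is issued, which is
# why nothing extra shows up in query logs or metrics.
print(cur.lastrowid)
conn.commit()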
It probably uses the standard JDBC mechanism to get generated values.
It's not. You execute it immediately after inserting into one table, and you thus get the values that were generated by that insert. But that's not what is being used, so it's irrelevant.
Not sure what you mean by that: the MySQL JDBC driver allows doing that, using the standard JDBC API.
(Too long for a comment.)
SELECT LAST_INSERT_ID() uses the value already available in the connection. (This may explain its absence from any log.)
Each table has its own auto_inc value.
(I don't know any details about Hibernate.)
35K QPS is possible, but it won't be easy.
Please give us more details on the queries -- SELECTs? Writes? 35K INSERTs?
Are you batching the inserts in any way? You will need to (see the sketch after this list).
What do you then use the auto_inc value for?
Do you use BEGIN..COMMIT? What value of autocommit?
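On the batching point, here is a rough sketch of what a batched, explicitly committed insert looks like from a Python client with mysql-connector-python (Hibernate has its own JDBC batching settings; the table and data here are made up):

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="test"
)
conn.autocommit = False  # one COMMIT per batch instead of one per row
cur = conn.cursor()

rows = [("a@example.com",), ("b@example.com",), ("c@example.com",)]

# executemany() folds simple INSERTs into a single multi-row statement
# where it can, which cuts round trips dramatically at high QPS.
cur.executemany("INSERT INTO email (address) VALUES (%s)", rows)
conn.commit()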
Suppose I have a CSV file with 1M email addresses. I need to iterate through the file and add each entry, for example:
with open(file) as csv:
    for item in csv:
        Email.objects.create(email=item)
This seems like it would be very slow, going through the Django ORM like this to create 1M objects and insert them into the DB. Is there a better way, or should I step away from Django for this task and do it directly with the DB?
You can also try using the new bulk_create:
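A minimal sketch of that, assuming the Email model from the question (the app path and chunk size are arbitrary); chunking avoids building all one million instances in memory at once:

from itertools import islice

from myapp.models import Email  # hypothetical app path

def bulk_load(path, chunk=5000):
    with open(path) as f:
        while True:
            batch = [Email(email=line.strip()) for line in islice(f, chunk)]
            if not batch:
                break
            # One multi-row INSERT per batch instead of one INSERT per row.
            Email.objects.bulk_create(batch)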
Besides bulk_create, you could put all inserts into one transaction as long as your DB backend supports it:
from django.db.transaction import commit_on_success

# with commit_on_success(), open(file) as csv:  # in Python 2.7
with commit_on_success():
    for item in csv:
        Email.objects.create(email=item)
Also note that bulk_create treats items with the same values as the same, thus
Email.objects.bulk_create([Email(email=item), Email(email=item)])
actually creates one row instead of two.
Because of the extra SQL round trips, the transaction solution is still slower than the bulk_create one, but you don't have to create all one million Email() instances in memory (a generator doesn't seem to work here).
Furthermore, you could do it directly at the SQL level.
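For example, a rough sketch using Django's raw cursor (the app_email table name is a guess; match it to your model's actual table, and note commit_unless_managed is the Django 1.4-era API):

from django.db import connection, transaction

def raw_bulk_insert(path):
    with open(path) as f:
        rows = [(line.strip(),) for line in f]
    cursor = connection.cursor()
    # Bypasses model instantiation entirely; one parametrized statement
    # is reused for every row.
    cursor.executemany("INSERT INTO app_email (email) VALUES (%s)", rows)
    transaction.commit_unless_managed()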
This is something you should drop down to the DB-API to accomplish, since you bypass creating all the model objects.
IMHO, I don't see a very big problem with speed if it's only a one-time insert (1M records won't take hours). If you'll be using the Django API to access those objects in the future, then you should probably avoid resorting to SQL-level inserts and do it through Django's methods, as suggested by livar (if you're using Django 1.4).
You might want to look into the Django DSE package, which is apparently an efficient bulk insert/update library.
In MySQL I have a single database with one schema. For Microsoft SQL Server it is recommended to use a "Tenant View Filter", so in SQL Server this gives me exactly what I need:
CREATE VIEW TenantEmployees AS
SELECT * FROM Employees WHERE TenantID = SUSER_SID()
What is the best way to accomplish the same thing in MySQL? An equivalent to the "Tenant View Filter" will work if it performs well.
Thanks!!
The query you suggest (which I could find in MSDN) is followed by text that explains exactly what its assumptions are. In particular, it mentions that it assumes the "owner" of a row in the Employees table is specified in the TenantID field, which is populated according to the SID of the user(s) you are partitioning for.
What that means is that you can replicate the same idea however you decide to implement your data, as long as you have clearly defined partitions of the data and know exactly how to associate them with the table you are creating a view for.
In particular, if you configure your system so that each partition accesses the DB with its own credentials, you could use the CURRENT_USER or USER constructs of MySQL as the IDs defining your partitions, and the query to create the view would be basically the same as the one suggested in MSDN, replacing SUSER_SID with CURRENT_USER.
But if you use the same user to access the DB from all the partitions, then the suggested method is irrelevant on either database server.
Since you need to use your tenant ID value to perform filtering, a table-valued user-defined function would be ideal, as a view normally does not accept parameters. Unfortunately, unlike many other database products, MySQL doesn't support table-valued functions. However, there are MySQL hacks that claim to emulate parametrized views; these could be useful for you.
It's a little tricky in MySQL, but it can be done:
CREATE OR REPLACE VIEW {viewName}
AS
SELECT {fieldListWithoutTenantID}
FROM {tableName}
WHERE (id_tenant = SUBSTRING_INDEX(USER(), '@', 1))
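To illustrate how the view resolves the tenant, assuming each tenant's application connects with its own MySQL account (all names made up): the application never passes a tenant id; USER() inside the view supplies it.

import mysql.connector

# Connecting as the account "tenant_42" means USER() returns
# "tenant_42@<host>", so the view's WHERE clause filters to that tenant.
conn = mysql.connector.connect(
    host="localhost", user="tenant_42", password="secret", database="app"
)
cur = conn.cursor()
cur.execute("SELECT * FROM tenant_employees")  # a view defined as above
for row in cur.fetchall():
    print(row)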
I wrote up a full blog post on how I converted a single-tenant MySQL application to multi-tenant in one weekend with minimal changes. https://opensource.io/it/mysql-multi-tenant/
When using Liquibase, is there any way to use existing data to generate some of the data that is to be inserted?
For example, say I want to update a row with id 5, but I don't know up front that the id will be 5, as it is linked to another table from which I will actually be getting the id. Is there any way for me to tell Liquibase to get the id from a SELECT query?
I'm guessing this isn't really possible as I get the feeling Liquibase is really designed for a very structured non-dynamic approach, but it doesn't hurt to ask.
Thanks.
You cannot use the built-in changes to insert data based on existing data, but you can use the <sql> tag with insert statements containing nested selects.
For example:
<changeSet id="1" author="example">
    <sql>insert into person (name, manager_id) values ('Fred', (select id from person where name='Ted'))</sql>
</changeSet>
Note: the SQL (and support for insert+select) depends on database vendor.
It is possible to write your own custom refactoring class to generate SQL. The functionality is designed to support the generation of static SQL based on the changeset's parameters.
So it's feasible to obtain a connection to the database, but the health warning attached to this approach is that the generated SQL is dynamic (your data could change) and tightly tied to your database instance.
An example of problems this will cause is an inability to generate a SQL upgrade script for a DBA to run against a production database.
I've been thinking about this use-case for some time. I still don't know if liquibase is the best solution for this data management problem or whether it needs to be combined with an additional tool like dbunit.
I want to pull data from a source into a destination table. How can I insert rows that are not already in the table and update rows that already exist?
You could use a Lookup on the target for the existing records: on a match, update; otherwise, insert into the target.
Another approach is to use the MERGE statement (see the sketch below).
Thanks,
prav
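For reference, the MERGE approach comes down to a single set-based statement; here is a sketch run from Python via pyodbc (the DSN, table, and column names are all made up):

import pyodbc  # assumes pyodbc plus a SQL Server ODBC driver

MERGE_SQL = """
MERGE INTO dbo.Customer AS target
USING dbo.Customer_Staging AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.Name = source.Name, target.Email = source.Email
WHEN NOT MATCHED THEN
    INSERT (CustomerID, Name, Email)
    VALUES (source.CustomerID, source.Name, source.Email);
"""

conn = pyodbc.connect("DSN=warehouse")  # hypothetical DSN
conn.cursor().execute(MERGE_SQL)
conn.commit()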
Use a Slowly Changing Dimension transform; see http://msdn.microsoft.com/en-us/library/ms141715.aspx
I would recommend CozyRoc's TableDifference component. I have used the predecessor from SQLBI.EU and it's very good.
I also recommend that instead of using a Command component to run individual updates on the stream of detected updates, you stream the updates to a table and then use a single UPDATE statement in a SQL task to perform the update, as sketched below.
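A sketch of that staging-table pattern, reusing the made-up names from the MERGE example above: the data flow writes detected changes to dbo.Customer_Staging, then one joined UPDATE applies them all at once.

import pyodbc

UPDATE_SQL = """
UPDATE t
SET t.Name = s.Name, t.Email = s.Email
FROM dbo.Customer AS t
JOIN dbo.Customer_Staging AS s ON s.CustomerID = t.CustomerID;
"""

conn = pyodbc.connect("DSN=warehouse")  # hypothetical DSN, as above
conn.cursor().execute(UPDATE_SQL)  # single set-based statement
conn.commit()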
I found this webcast very helpful in learning some different methods of doing "upserts" with SSIS. You can download the samples referenced in the webcast and see working examples of exactly what you need. MSDN Architecture Webcast: Using SQL Server 2005 Integration Services to Populate a Kimball Method Data Warehouse (Level 200)