Is it possible to run functions from asynchronous libraries like asyncpg in Beam ParDo? - sqlalchemy

We are using pg8000 as the driver for inserting records into our Cloud SQL Postgres instance. Our Beam code is in Python and uses the sqlalchemy library to interact with the database. We find pg8000 too slow, so we tested asyncpg locally (without any Beam code) with promising results. We used this as a reference for asyncpg:
https://github.com/GoogleCloudPlatform/cloud-sql-python-connector/blob/main/README.md
Can we use asynchronous libraries like asyncpg to establish the DB connection within a Beam ParDo? How do we structure the DoFn's lifecycle methods: setup (establish an async connection), start_bundle (start the transaction), process (perform the SQL operation), finish_bundle (commit/roll back the transaction), and teardown (close the connection)?

Yes, it is possible to use asynchronous functions, as long as you are careful about two things:
Beam objects passed to your DoFn are generally not intended for concurrent use, so do not pass Beam objects to the concurrent functions. Instead, use the async functions only for generic processing and keep all the Beam-specific logic on the main thread.
You need to wait for all the asynchronous tasks to finish in finish_bundle, so that all processing is done before the bundle is committed as "done".
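A minimal sketch of how such a DoFn could be structured, assuming one private asyncio event loop and one asyncpg connection per DoFn instance; the DSN, table, and column names are placeholders, and error handling (e.g. rolling back on failure) is omitted:

```python
import asyncio

import apache_beam as beam
import asyncpg  # async Postgres driver; connection details below are placeholders


class AsyncPostgresWriteFn(beam.DoFn):
    """Drives asyncpg from a private event loop owned by this DoFn instance."""

    def __init__(self, dsn):
        self._dsn = dsn  # e.g. "postgresql://user:pass@host/db" (placeholder)

    def setup(self):
        # One event loop and one connection per DoFn instance.
        self._loop = asyncio.new_event_loop()
        self._conn = self._loop.run_until_complete(asyncpg.connect(self._dsn))

    def start_bundle(self):
        # Open a transaction that spans the whole bundle.
        self._tx = self._conn.transaction()
        self._loop.run_until_complete(self._tx.start())

    def process(self, element):
        # Await the INSERT before process() returns; no Beam objects are
        # handed to the coroutine, only the plain element value.
        self._loop.run_until_complete(
            self._conn.execute(
                "INSERT INTO my_table (payload) VALUES ($1)",  # placeholder SQL
                element,
            )
        )
        yield element

    def finish_bundle(self):
        # Commit only after all async work for this bundle has completed.
        self._loop.run_until_complete(self._tx.commit())

    def teardown(self):
        self._loop.run_until_complete(self._conn.close())
        self._loop.close()
```

Because every coroutine is awaited with run_until_complete before the Beam method returns, no Beam objects ever cross into the async code, and the commit in finish_bundle only happens once all writes for the bundle have finished.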

Related

Asynchronous vs Synchronous Callbacks in Plotly Dash for Python

I am designing an app that will have multiple callbacks, each querying different tables of a SQL db.
As I am designing, it will be helpful to understand whether the callbacks will update (query the db) synchronously or asynchronously.
Per the docs, for advanced callbacks:
Whether or not these requests are executed in a synchronous or asynchronous manner depends on the specific setup of the Dash back-end server. If it is running in a multi-threaded environment, then all of the callbacks can be executed simultaneously, and they will return values based on their speed of execution. In a single-threaded environment however, callbacks will be executed one at a time in the order they are received by the server.
Given that this is in Python, would that mean that by default I am running a “single-threaded environment”?
If synchronous execution is the only option, is there a way to use a package like ‘asyncio’ to at least speed up the total time it takes for all callbacks to load?
thank you!!
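For reference, a minimal sketch of the asyncio idea the question mentions: overlapping several blocking queries inside a single callback so the total time is roughly that of the slowest query rather than the sum. The query helpers are hypothetical placeholders, and asyncio.to_thread requires Python 3.9+:

```python
import asyncio


def query_table_a():
    # Hypothetical blocking query against one table; replace with real DB code.
    return [("a", 1)]


def query_table_b():
    # Hypothetical blocking query against another table.
    return [("b", 2)]


async def fetch_all():
    # Run both blocking queries in worker threads and wait for them together,
    # so the elapsed time is close to the slower of the two, not their sum.
    return await asyncio.gather(
        asyncio.to_thread(query_table_a),
        asyncio.to_thread(query_table_b),
    )


# Inside a single (synchronous) Dash callback you could then do:
# rows_a, rows_b = asyncio.run(fetch_all())
```

Whether this helps across separate callbacks still depends on the threading setup of the back-end server described in the quoted docs.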

Azure - Trigger which copies multiple blobs

I am currently working on a ServiceBus trigger (using C#) which copies and moves related blobs to another blob storage and to Azure Data Lake. After copying, the function has to emit a notification to trigger further processing tasks, so I need to know when the copy/move task has finished.
My first approach was to use an Azure Function which copies all these files. However, Azure Functions have a processing time limit of 10 minutes (when set manually), so that does not seem to be the right solution. I was considering calling azCopy or StartCopyAsync() to perform an asynchronous copy, but as far as I understand, the function will then run for as long as azCopy takes. To get around the time limit I could use WebJobs instead, but there are also other technologies like Logic Apps, Durable Functions, Batch jobs, etc., which leaves me confused about choosing the right one for this problem. The function won't be called every second but might copy large amounts of data. Does anybody have an idea?
I just found out that Azure Functions only have a time limit on the Consumption plan. If there is no better solution for copying blobs, I'll go with Azure Functions.
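For reference, a rough sketch of the server-side copy idea, written with the Python azure-storage-blob package purely for illustration (the .NET StartCopyAsync() call works along the same lines); account names, connection strings, and blob names are placeholders:

```python
import time

from azure.storage.blob import BlobClient

# Placeholder source blob URL (must be readable, e.g. via a SAS token).
source_url = "https://sourceaccount.blob.core.windows.net/container/file.bin?<sas-token>"

# Placeholder destination blob.
dest = BlobClient.from_connection_string(
    conn_str="<destination-connection-string>",
    container_name="dest-container",
    blob_name="file.bin",
)

# start_copy_from_url kicks off a server-side copy and returns immediately;
# the storage service moves the bytes, not the calling function.
dest.start_copy_from_url(source_url)

# Poll the copy status until it is no longer pending, then emit the notification.
props = dest.get_blob_properties()
while props.copy.status == "pending":
    time.sleep(5)
    props = dest.get_blob_properties()

if props.copy.status == "success":
    print("copy finished, emit notification here")
else:
    print("copy ended with status:", props.copy.status)
```

Because the storage service performs the copy itself, the function only has to start it and check the copy status before emitting the notification, rather than streaming the data for the whole duration.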

Pushing data to client whenever a database field changes

I'm using socket.io to send data from my database to the client. But my code sends data to the client every second, even when the data is the same. How can I send data only when a field in the DB has changed, rather than every second?
Here is my code: http://pastebin.com/kiTNHgnu
With MySQL there is no easy/simple way to get notified of changes. A couple of options include:
If you have access to the server the database is running on, you could stream MySQL's binlog through some sort of parser and then check for events that modify the desired table/column. There are already such binlog parsers on npm if you want to go this route.
Use a MySQL trigger to call out to a UDF (user-defined function). This is a little tricky because there aren't a whole lot of these for your particular need. There are some that could be useful however, such as mysql2redis which pushes to a Redis queue which would work if you already have Redis installed somewhere. There is a STOMP UDF for various queue implementations that support that wire format. There are also other UDFs such as log_error which writes to a file and sys_exec which executes arbitrary commands on the server (obviously dangerous so it should be an absolute last resort). If none of these work for you, you may have to write your own UDF which does take quite some time (speaking from experience) if you're not already familiar with the UDF C interface.
I should also note that UDFs could introduce delays in triggered queries depending on how long the UDF takes to execute.
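If you go the binlog route, here is a rough sketch of what a listener process could look like, using the python-mysql-replication package purely for illustration (the answer mentions equivalent npm parsers for a Node.js setup); the connection settings, server_id, and table/column names are placeholders, and the server must use row-based binlogging:

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import UpdateRowsEvent, WriteRowsEvent

# Placeholder connection settings; the MySQL user needs replication privileges
# and the server must have binlog_format=ROW.
MYSQL_SETTINGS = {"host": "127.0.0.1", "port": 3306, "user": "repl", "passwd": "secret"}

stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,                           # any id unique among replicas
    only_events=[WriteRowsEvent, UpdateRowsEvent],
    only_tables=["my_table"],                # placeholder table name
    blocking=True,                           # keep waiting for new binlog events
)

try:
    for event in stream:
        for row in event.rows:
            if isinstance(event, UpdateRowsEvent):
                # Compare before/after to see whether the watched column changed.
                before = row["before_values"].get("my_column")
                after = row["after_values"].get("my_column")
                if before != after:
                    print("my_column changed:", after)
            else:
                # Inserts carry the new row values directly.
                print("new row:", row["values"])
finally:
    stream.close()
```

From such a watcher you would then push the changed value out over socket.io (or any other channel) instead of polling the table on a timer.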

NodeJS cache MySQL data with clustering enabled

I want to cache data that I got from my MySQL DB and for this I am currently storing the data in an object.
Before querying the database, I check whether the needed data already exists in the mentioned object. If not, I query the database and insert the result.
This works quite well, and my web server now fetches the data just once and reuses it.
My concern is: do I have to think about concurrent reads/writes on such data structures held in the object when using Node.js's clustering feature?
Every single line of JavaScript that you write in your Node.js program is thread-safe, so to speak: at any given time, only a single statement is ever executed. The fact that you can do async operations is handled at a low level that is completely transparent to the programmer. To be precise, code only runs in a "truly parallel" way when you do input/output operations, i.e. reading a file, doing TCP/UDP communication, or spawning a child process. And even then, the only code that is executed in parallel to your application is Node's native C/C++ code.
Since you use a JavaScript object as a cache store, you are guaranteed no one will ever read or write from/to it at the same time.
As for cluster, every worker is created as its own process and thus has its own copy of every JavaScript variable or object that exists in your code.

Sqlalchemy sessions and autobahn

I'm using the autobahn server in twisted to provide an RPC API. Some calls require queries to the database and multiple clients may be connected via websocket to the server.
I am using the SqlAlchemy ORM to access the database.
What are the pros and cons of the following two approaches for dealing with SqlAlchemy sessions?
1. Create and destroy a session for every RPC call
2. Create a single session when the server starts and use it in every RPC call
Which would you recommend and why? (I'm leaning towards 2)
The recommended way of doing SQL-based database access from Twisted (and Autobahn) with databases like PostgreSQL, Oracle or SQLite would be twisted.enterprise.adbapi.
twisted.enterprise.adbapi will run queries on a background thread pool, which is required, since most database drivers are blocking.
Sidenote: for PostgreSQL, there is also a native asynchronous, non-blocking driver: txpostgres.
Now, if you put an ORM like SQLAlchemy on top of the native SQL driver, I'm not sure how this will work together (if at all) with twisted.enterprise.adbapi.
So from the options you mention:
1. This is a no go, since most drivers are blocking (and Autobahn's RPCs run on the main thread = Twisted reactor thread - and you MUST not block that).
2. With this, you need to put the database session(s) in background threads (again, to not block).
Also see here.
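A minimal sketch of the adbapi approach, assuming psycopg2 as the underlying DB-API driver; the credentials, table, and query are placeholders:

```python
from twisted.enterprise import adbapi
from twisted.internet import reactor

# Connection pool; the driver module name and credentials are placeholders.
dbpool = adbapi.ConnectionPool(
    "psycopg2",              # any blocking DB-API 2.0 driver module
    dbname="mydb",
    user="myuser",
    password="mypassword",
    host="localhost",
)


def on_result(rows):
    print("got %d rows" % len(rows))
    reactor.stop()


def on_error(failure):
    failure.printTraceback()
    reactor.stop()


# runQuery executes the query on a thread from the pool and returns a Deferred,
# so the reactor (and Autobahn's RPC handling) is never blocked.
d = dbpool.runQuery("SELECT id, name FROM users WHERE id = %s", (1,))
d.addCallbacks(on_result, on_error)

reactor.run()
```

In an actual Autobahn RPC endpoint you would typically return the Deferred from the registered procedure (or yield it from an inlineCallbacks-decorated function) rather than driving the reactor yourself as in this standalone example.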
If you're using SQLAlchemy and Twisted together, consider using Alchimia rather than the built-in adbapi.