SnappyData multiple jobs to achieve parallelism

I am using Snappydata and SQL to run some analysis, however the job is slow and involves join operations on very large input data.
I am considering partitioning the input data first, then running the job on different partitions at the same time to speed things up.
In the embedded mode I am using, my code receives the SnappySession that is passed in, and I can use bin/snappy-sql to query the tables. So I assume all SnappyData jobs share the same SnappySession (or the same table namespace, like sharing one database in PostgreSQL, as I understand it).
I therefore assume that if I submit my job using the same jar with different input arguments, the table namespace would be the same for the different jobs, which would cause errors.
So my question is: is it possible to have multiple SnappySessions (or multiple namespaces, like separate database names) that each run a series of operations independently, preferably within one SnappyData job so I can avoid managing many jobs at the same time?

I am not sure I follow the question. Maybe this will help:
When queries are submitted using snappy-sql, the shell uses JDBC to connect and run the query. Internally, SnappyData will start a job and run concurrent tasks on each partition, depending on the query. And yes, this SQL session is internally associated with a unique SnappySession (Spark session).
Or maybe you are trying to partition the data across many tables and start processing these tables independently but in parallel?

Related

Update large amount of data in SQL database via Airflow

I have large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table needs to remain readable while the job is running
The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run DataFrame.to_sql
Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data in memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits
My recent airflow related ETL project could be a reference for you.
Input DB: large DB (billion-row-level Oracle)
Interim DB: medium DB (tens-of-millions-level HDF5 file)
Output DB: medium DB (tens-of-millions-level MySQL)
In my experience, writing to the DB is the main bottleneck in such an ETL process. So, as you can see:
For the interim stage, I use HDF5 as the interim store for data transformation. The pandas to_hdf function gives seconds-level performance on large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes.
The pandas IO performance benchmarks are here; HDF5 is among the top three fastest and most popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
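For illustration, here is a minimal sketch of that interim HDF5 step (the DataFrame contents and file name are made up, and to_hdf needs the PyTables package installed):

import pandas as pd

# A stand-in for the transformed batch of data (hypothetical example frame).
df = pd.DataFrame({"id": range(100_000), "value": "x"})

# Write the interim result to HDF5; format="table" allows appending and querying,
# while the default fixed format is faster but write-once.
df.to_hdf("interim_store.h5", key="etl_stage", mode="w", format="table")

# Later stages can read it back quickly.
staged = pd.read_hdf("interim_store.h5", key="etl_stage")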
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns. Without that manual mapping, to_sql falls back to BLOB or an oversized varchar, and the default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the DB via to_sql (in chunksize mode) took about 20 minutes.
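A hedged sketch of that to_sql step with an explicit dtype map (the connection URL, table name, and column types are placeholders, not the author's actual schema):

import pandas as pd
import sqlalchemy as sa

# Placeholder connection string and table name -- replace with your own.
engine = sa.create_engine("mysql+pymysql://user:pass@host:3306/etl_db")

df = pd.read_hdf("interim_store.h5", key="etl_stage")

# Map every column explicitly, especially strings, so to_sql does not fall back
# to TEXT/BLOB or an oversized VARCHAR.
dtype_map = {
    "id": sa.types.BigInteger(),
    "value": sa.types.String(length=32),
}

df.to_sql(
    "output_table",
    engine,
    if_exists="append",
    index=False,
    chunksize=10_000,  # insert in batches instead of one huge statement
    dtype=dtype_map,
)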
if you like the answer, pls vote it up
Here is another clue for your reference, based on PostgreSQL partitioned tables, though it needs some DDL to define the partitioned table.
Currently, your main constraints are:
The table needs to remain readable while the job is running.
This means no long-held locks are allowed.
The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time.
It should cope with multiple writers at the same time.
I would add one thing you may want to consider as well:
Reasonable read performance while writing.
Performance and user experience are key.
A partitioned table can meet all of these requirements, and it is transparent to the client application.
At present you are doing ETL, and you will soon face performance issues as the table grows quickly; in my view a partitioned table is the only real solution.
The main steps are:
Create the partitioned table with a partition list.
Normal reading and writing to the table continue as usual.
ETL process (can run in parallel; see the sketch after these steps):
- ETL the data and upload it to a new table (very slow, minutes to hours, but with no impact on the main table).
- Add the new table to the main table's partition list (super fast, microsecond-level, to make it visible through the main table).
Normal reading and writing of the main table continue as usual, now including the new data.
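As referenced in the steps above, here is a rough sketch of the load-then-attach flow (PostgreSQL 10+ declarative partitioning via psycopg2; the DSN, table, and column names are illustrative assumptions):

import psycopg2

conn = psycopg2.connect("dbname=etl_db user=etl")  # placeholder DSN
cur = conn.cursor()

# One-time setup: the main table is partitioned by a batch/list key.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        batch_id int NOT NULL,
        payload  text
    ) PARTITION BY LIST (batch_id);
""")

# ETL step: load the new batch into a standalone staging table
# (slow, but it does not touch the main table at all).
cur.execute("CREATE TABLE events_batch_42 (LIKE events INCLUDING DEFAULTS);")
cur.execute("INSERT INTO events_batch_42 (batch_id, payload) VALUES (42, 'example row');")

# A CHECK constraint matching the partition bound lets ATTACH skip the full
# validation scan, keeping the attach step near-instant.
cur.execute("ALTER TABLE events_batch_42 ADD CONSTRAINT batch_42_check CHECK (batch_id = 42);")

# Publish step: attach the loaded table as a partition of the main table.
cur.execute("ALTER TABLE events ATTACH PARTITION events_batch_42 FOR VALUES IN (42);")

conn.commit()
cur.close()
conn.close()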
If you like the answer, pls vote it up.
Best Regards,
WY
A crucial step to consider while setting up your workflow is to always use good connection-management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on both the server and the connecting application.
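For example, a minimal pooled-engine sketch with SQLAlchemy (the connection URL and pool numbers are illustrative placeholders to tune against your Cloud SQL connection limit):

import sqlalchemy as sa

# Placeholder Cloud SQL connection URL -- adjust driver, host, and credentials.
engine = sa.create_engine(
    "mysql+pymysql://user:pass@10.0.0.5:3306/appdb",
    pool_size=5,        # connections kept open per worker
    max_overflow=2,     # short-lived extra connections under burst load
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1800,  # recycle connections before the server drops them
)

# Each task borrows a connection from the pool and returns it promptly.
with engine.connect() as conn:
    row_count = conn.execute(sa.text("SELECT COUNT(*) FROM my_table")).scalar()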
Cloud Composer has no limitations when it comes to your ability to interface with CloudSQL. Therefore, either of the first 2 options is good.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
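For reference, a hedged sketch of the LOAD DATA LOCAL INFILE route from Python (mysql-connector-python; the file path, table, and credentials are placeholders, and the server must permit local_infile):

import mysql.connector

# allow_local_infile must be enabled on the client side for LOCAL INFILE to work.
conn = mysql.connector.connect(
    host="cloudsql-host", user="app", password="secret",
    database="appdb", allow_local_infile=True,
)
cur = conn.cursor()

cur.execute(
    "LOAD DATA LOCAL INFILE '/tmp/hourly_update.csv' "
    "INTO TABLE staging_table "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
    "IGNORE 1 LINES"
)
conn.commit()

cur.close()
conn.close()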

Equivalent of oracle's Parallel option in Mysql

In Oracle we can create a table, insert data, and select from it with the parallel option.
Is there any similar option in MySQL? I am migrating from Oracle to MySQL, and my system does more selects than data changes, so any option to select in parallel is what I am looking for.
For example, say my table has 1 million rows. If I use the parallel(5) option, then five threads run the same query with limits, each fetching approximately 200K rows, and as the final result I get 1 million records in roughly 1/5th of the usual time.
In short, the answer is no.
The MySQL server is designed to execute concurrent user sessions in parallel, but not to execute one given user session in several parts in parallel.
This is a personal opinion, but I would refrain from wanting to apply optimizations up front, making assumptions about how the RDBMS works. Better measure the query first, and see if the response time is a real concern or not, and only then investigate possible optimizations.
"Premature optimization is the root of all evil." (Donald Knuth)
Queries within MySQL from different connections already run concurrently. If you want to run different queries simultaneously from your program, however, you need to open separate connections through workers that your program has async access to.
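A rough sketch of that multi-connection pattern in Python (the table, row counts, and credentials are made up; each worker opens its own connection):

from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # assumes mysql-connector-python is installed

PARALLELISM = 5
TOTAL_ROWS = 1_000_000
CHUNK = TOTAL_ROWS // PARALLELISM

def fetch_slice(offset):
    # Each worker uses its own connection, so the slices run concurrently.
    conn = mysql.connector.connect(
        host="127.0.0.1", user="app", password="secret", database="appdb"
    )
    cur = conn.cursor()
    # Note: large OFFSETs get slow; keyset ranges on an indexed column scale better.
    cur.execute(
        "SELECT id, payload FROM big_table ORDER BY id LIMIT %s OFFSET %s",
        (CHUNK, offset),
    )
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows

with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    slices = pool.map(fetch_slice, range(0, TOTAL_ROWS, CHUNK))
    results = [row for chunk in slices for row in chunk]

This still sends five separate queries; MySQL simply runs them on five connections at once rather than splitting one query internally.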
You could also run tasks by creating events or using delayed inserts; however, I don't think that applies very well here. Something else to consider:
Generally, some operations are guarded between individual query sessions (called transactions). These are supported by InnoDB backends, but not MyISAM tables (which instead support a concept called atomic operations). There are various levels of isolation which differ in which operations are guarded from each other (and thus how operations in one parallel transaction affect another) and in their performance impact. - Holger Just
He also mentions the MySQL transactions page, which briefly goes over the different engine types available to MySQL (MyISAM being faster, but not as reliable):
MySQL Transactions

Copying data from PostgreSQL to MySQL

I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.
In my MySQL database, I have a table which contains an identical schema to the output of the query described above.
What I would like to develop is an hourly cron job which will run the query against the PostgreSQL database, then insert the results into the MySQL database. During the hour period, I don't expect to ever see more than 10,000 new rows (and that's a stretch) which would need to be transferred.
Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.
The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.
To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.
The closest guess I have for an approach is to use some scripting language to make the query, convert results into an internal data structure, then split it back out to MySQL again.
In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or - any documentation that I should look at which might be useful for doing this kind of conversion? I've found plenty of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (hourly run) seems less common and/or less documented.
Open to any suggestions.
Use the same database system on both ends and use replication
If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.
If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.
Sync using an external script
There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below), you will probably still use an external script (or psql) from a cron job to initiate replication, unless you're going to use PgAgent to do it.
Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.
If the rows to be copied are too big to comfortably fit in memory, you can use a cursor and read the rows with FETCH.
I'd do the work in this order (a minimal Python sketch follows the list):
Connect to PostgreSQL
Connect to MySQL
Begin a PostgreSQL transaction
Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
Insert them into MySQL
DELETE any rows from the queue table in PostgreSQL if you haven't already.
COMMIT the MySQL transaction.
If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
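A minimal sketch of that order in Python, assuming psycopg2 and mysql-connector-python, a queue_table, and matching column lists (all names and DSNs here are placeholders):

import mysql.connector
import psycopg2

pg = psycopg2.connect("dbname=collector user=etl")  # placeholder DSN
my = mysql.connector.connect(
    host="rds-endpoint", user="app", password="secret", database="appdb"
)

try:
    pg_cur = pg.cursor()
    my_cur = my.cursor()

    # Pull and remove the queued rows in one statement; nothing is lost because
    # the DELETE only becomes permanent if the MySQL side commits first.
    pg_cur.execute(
        "DELETE FROM queue_table RETURNING name, created_at, a, b, c, d, e, f"
    )
    rows = pg_cur.fetchall()

    if rows:
        my_cur.executemany(
            "INSERT INTO target_table (name, created_at, a, b, c, d, e, f) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
            rows,
        )

    my.commit()  # commit MySQL first...
    pg.commit()  # ...then PostgreSQL, so a MySQL failure leaves the queue intact
except Exception:
    my.rollback()
    pg.rollback()
    raise
finally:
    my.close()
    pg.close()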
The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:
PREPARE TRANSACTION in PostgreSQL
COMMIT in MySQL
then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL depending on the outcome of the MySQL commit.
This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.
Use DBI-Link in PostgreSQL
Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that used DBI-link via PL/Perlu to do the job.
When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and insert the data in the queue table.
Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.
Use a trigger to queue changes and DBI-link to sync
If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
Add MySQL to PostgreSQL with a foreign data wrapper
Alternately, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data.
WITH moved_rows AS (
    DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
In short you have two scenarios:
1) Make destination pull the data from source into its own structure
2) Make source push out the data from its structure to destination
I'd rather try the second one: look around and find a way to create a PostgreSQL trigger, some special "virtual" table, or maybe a PL/pgSQL function. Then, instead of an external script, you'll be able to execute the procedure by running a query from cron, or possibly from inside Postgres, where there are some options for scheduling operations.
I'd choose the second scenario because Postgres is much more flexible, and when manipulating data in special, DIY ways you will simply have more possibilities.
An external script probably isn't a good solution, for example because you will need to treat binary data with special care, or convert dates and times from DATE to VARCHAR and then back to DATE again. Inside an external script, various text-stored data will probably just be strings, and you will need to quote them too.

Async Bulk(batch) insert to MySQL(or MongoDB?) via Node.js

Straight to the question:
The problem: to do async bulk inserts (not necessarily bulk, if MySQL can handle it otherwise) using Node.js (coming from a .NET and PHP background).
Example:
Assume I have 40 (adjustable) functions doing some work (async), each adding a record to the table after its single iteration. It is very probable that more than one function makes an insertion call at the same time. Can MySQL handle that directly, considering there is going to be an auto-update field?
In C# (.NET) I would have used a DataTable to contain all the rows from each function, bulk-inserted the DataTable into the database table at the end, and launched many threads for each function.
What approach would you suggest in this case?
Should the approach change if I need to handle 10,000 or 4 million rows per table?
Also, the DB schema is not going to change; would MongoDB be a better choice for this?
I am new to Node, NoSql and in the noob learning phase at the moment. So if you can provide some explanation to your answer, it would be awesome.
Thanks.
EDIT :
Answer: Neither MySQL nor MongoDB supports any special sort of bulk insert; under the hood it is just a foreach loop.
Both of them are capable of handling a large number of connections simultaneously; the performance will largely depend on your requirements and production environment.
1) In MySQL, queries are executed sequentially per connection. If you are using one connection, your ~40 functions will result in 40 queries being enqueued (via an explicit queue in the mysql library, your code, or a system queue based on synchronisation primitives), not necessarily in the same order you started the 40 functions. MySQL won't have any race-condition problems with auto-update fields in that case.
2) If you really want to execute 40 queries in parallel, you need to open 40 connections to MySQL (which is not a good idea from a performance point of view, but again, MySQL is designed to handle auto-increments correctly for multiple clients).
3) There is no special bulk-insert command in the MySQL protocol at the wire level; any library exposing a bulk-insert API is in fact just building one long 'INSERT ... VALUES' query (see the sketch below).
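To illustrate point 3 (in Python rather than Node, since the point is about the SQL that goes over the wire), here is a hedged sketch of how a "bulk insert" typically collapses into one long multi-row statement; the table and columns are made up:

# Rows gathered from the 40-odd async workers (placeholder data).
rows = [("worker-%d" % i, i * 10) for i in range(40)]

# What a "bulk insert" API builds under the hood: one long INSERT ... VALUES.
placeholders = ", ".join(["(%s, %s)"] * len(rows))
sql = "INSERT INTO results (worker, value) VALUES " + placeholders
params = [value for row in rows for value in row]

# With a MySQL driver this would be sent as a single statement, e.g.
#   cursor.execute(sql, params)
# and many drivers' executemany() rewrites plain INSERTs into this same
# multi-row form rather than issuing 40 separate statements.
print(sql[:80] + " ...")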

Mysql - Find unused tables and columns

Let's say we have "system A" comprising a MySQL database, with several tables.
After a while you want to optimize system A by removing any unused tables and/or columns, system A being quite large and difficult to overlook. Is there a tool or method that lets you run the system for a while, and then do an analysis which prints out general use of tables, columns etc - which would make it possible to find unused tables and columns.
I'm thinking of maybe hooking into the system, logging all SQL commands going to the server, but would in that case rather be doing that on the database side, rather than the application side.
The General Query Log is probably what you're looking for.
According to MySQL, with the General Query Log on:
The server writes information to this log when clients connect or disconnect, and it logs each SQL statement received from clients.
You need to start MySQL with the --general_log[=file_name] option (older versions used --log[=file_name] or -l [file_name]) in order to use it.
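On reasonably recent MySQL servers you can also toggle the general log at runtime instead of restarting; a small sketch (assuming a privileged account and mysql-connector-python, with a placeholder log path):

import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root", password="secret")
cur = conn.cursor()

# Send the log to a file; alternatively SET GLOBAL log_output = 'TABLE'
# writes to mysql.general_log, which is easier to query afterwards.
cur.execute("SET GLOBAL general_log_file = '/var/log/mysql/general.log'")
cur.execute("SET GLOBAL general_log = 'ON'")

# ... run the system for a while, then switch it back off:
cur.execute("SET GLOBAL general_log = 'OFF'")

cur.close()
conn.close()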
Assuming you are already linking the database with an application of some sort, it shouldn't then be hard to find the columns that are being used over the columns that are not.
Further, a program such as a profiler (I'm not sure of the MySQL equivalent) can be used to display all the SQL calls. From this you will know which columns are being used.