Machine monitoring application (MySQL)

I am building a monitoring application for a machine; its position must be read and stored every second for a period of a month. I wrote a procedure to pre-fill a table with an initial 0 value for every second.
CREATE DEFINER=`root`@`localhost` PROCEDURE `new_procedure`(start_date DATETIME, end_date DATETIME)
BEGIN
    DECLARE interval_var INT DEFAULT 1;
    WHILE end_date >= start_date DO
        INSERT INTO table1 (`datetime`, `value`) VALUES (start_date, 0);
        SET start_date = DATE_ADD(start_date, INTERVAL interval_var SECOND);
    END WHILE;
END
This process is very slow, and most of the time the connection to the SQL database is lost. For example, when I tried to fill the table from "2016-01-14 07:00:00" to "2016-01-15 07:00:00", the procedure reached 2016-01-14 07:16:39 and then crashed due to a lost connection to the database.
Is there a more efficient way to create a template table for a month with one-second increments and 0 values? My monitoring application is built in VB.NET, and I tried writing VB code to create this template table, but it was slower and more likely to crash than calling the procedure directly from MySQL Workbench.
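For reference, a set-based alternative (a sketch only; it assumes MySQL, the table1(`datetime`, `value`) columns from the procedure above, and a one-off helper table of digits) generates all the rows in a single INSERT ... SELECT instead of looping one row at a time:
-- One-off helper table with the digits 0-9
CREATE TABLE digits (d INT PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- A 7-way cross join yields the numbers 0 .. 9,999,999, which covers
-- the ~2.6 million seconds in a month with room to spare.
INSERT INTO table1 (`datetime`, `value`)
SELECT DATE_ADD('2016-01-14 07:00:00', INTERVAL seq.n SECOND), 0
FROM (
    SELECT d1.d + d2.d*10 + d3.d*100 + d4.d*1000
         + d5.d*10000 + d6.d*100000 + d7.d*1000000 AS n
    FROM digits d1, digits d2, digits d3, digits d4,
         digits d5, digits d6, digits d7
) AS seq
WHERE seq.n <= TIMESTAMPDIFF(SECOND, '2016-01-14 07:00:00',
                                      '2016-02-14 07:00:00');
Building the rows in one statement avoids millions of per-row round trips and the long-lived session that was dropping the connection.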

I would recommend looking at the application architecture first. Expecting the whole system to run without failing for even one second for an entire month is asking an awful lot. Also, think about whether you really need this much data.
1 record/sec * 3600 sec/hr * 24 hr/day * 30 day/mo is about 2.6 million records. Trying to process that much information will cause a lot of software to choke.
If the measurement is discrete, you may be able to cut this dramatically by only recording changes in the database. If it's analog, you may have no choice.
I would recommend creating two separate applications: a local monitor that stores data locally and then reports to the MySQL server application every hour or so. That way, if the database is unavailable, the monitor keeps right on collecting data, and when the database is available again it can transfer everything that has been recorded since the last connection. The MySQL application can then store the data in the database in one shot, as sketched below. If that fails, it can retry, keeping its own copy of the data until it has been stored in the database.
Ex:
machine => monitoring station app => mysql app => mysql database
It's a little more work, but each application will be pretty small, and they may even be reusable. And it will make troubleshooting much easier, and dramatically increase the fault tolerance in the system.
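A minimal sketch of the kind of batched transfer the relay could send, reusing the table1(`datetime`, `value`) layout from the question (the values are made up):
INSERT INTO table1 (`datetime`, `value`) VALUES
    ('2016-01-14 07:00:00', 17),
    ('2016-01-14 07:00:01', 18),
    ('2016-01-14 07:00:02', 18);
-- A few hundred to a few thousand rows per statement keeps round trips
-- and per-row overhead low.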

Related

Update large amount of data in SQL database via Airflow

I have large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?
The constraints are:
The table needs to remain readable while the job is running.
The table needs to remain writable in case one job runs over time and two jobs end up running at the same time.
Some of the ideas I have:
Load the data that needs updating into a pandas DataFrame and run pd.to_sql
Load the data into a CSV file in Cloud Storage and execute LOAD DATA LOCAL INFILE
Load the data into memory, break it into chunks, and run a multi-threaded process in which each thread updates the table chunk by chunk, using a shared connection pool to avoid exhausting connection limits
My recent Airflow-related ETL project could be a reference for you.
Input DB: a large Oracle database (billion-row level)
Interim store: a medium-sized HDF5 file (tens of millions of rows)
Output DB: a medium-sized MySQL database (tens of millions of rows)
In my experience, writing to the database is the main bottleneck in this kind of ETL process, so:
For the interim stage, I use HDF5 as the intermediate store for data transformation. The pandas to_hdf function gives second-level performance on large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes.
See the pandas IO performance benchmarks, where HDF5 is among the fastest and most popular formats: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf
For the output stage, I use to_sql with the chunksize parameter. To speed up to_sql, you have to manually map the column types to database column types and lengths, especially for string/varchar columns. Without that manual mapping, to_sql falls back to BLOB-like types or VARCHAR(1000), and the default mode is about 10 times slower than the manually mapped mode.
In total, writing 20 million rows to the database via to_sql (with chunksize) took about 20 minutes.
Another clue for your reference: an approach based on a PostgreSQL partitioned table, although it needs some DDL to define the partitioning.
Currently, your main constraints are:
The table needs to remain readable while the job is running.
That means no blocking lock is allowed.
The table needs to remain writable in case one job runs over time and two jobs end up running at the same time.
That means it must cope with multiple writers at the same time.
I would add one more thing you may want to consider:
Reasonable read performance while writing.
Performance and user experience are key.
A partitioned table can meet all of these requirements, and it is transparent to the client application.
At present you are doing ETL; you will soon face performance issues as the table grows quickly, and a partitioned table is the only practical solution.
The main steps are:
Create the partitioned table with a partition list.
Normal reads and writes to the table carry on as usual.
The ETL process (which can run in parallel):
- ETL the data and load it into a new table (very slow, minutes to hours, but with no impact on the main table).
- Attach the new table to the main table's partition list (super fast, microsecond level, making the data available through the main table).
Normal reads and writes against the main table then continue as usual, now including the new data.
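A rough sketch of that flow with PostgreSQL declarative partitioning (assuming PostgreSQL 10+; the metrics table, its columns and the batch_id list key are invented for illustration):
-- Main table, partitioned by a list key
CREATE TABLE metrics (
    batch_id   int NOT NULL,
    payload    text,
    created_at timestamp
) PARTITION BY LIST (batch_id);

-- Partition currently serving reads and writes
CREATE TABLE metrics_batch_1 PARTITION OF metrics FOR VALUES IN (1);

-- The ETL job loads into a standalone table; slow, but it never touches metrics
CREATE TABLE metrics_batch_2 (
    batch_id   int NOT NULL,
    payload    text,
    created_at timestamp,
    CHECK (batch_id = 2)    -- lets ATTACH PARTITION skip the validation scan
);
-- ... bulk-load metrics_batch_2 here ...

-- Attaching is a quick catalog change; the new rows become visible immediately
ALTER TABLE metrics ATTACH PARTITION metrics_batch_2 FOR VALUES IN (2);
(The question targets Cloud SQL for MySQL; MySQL offers a comparable flow with its own partitioning and ALTER TABLE ... EXCHANGE PARTITION, but the answer above is written in PostgreSQL terms.)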
A crucial step to consider while setting up your workflow is to always use good connection management practices, to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources both on the server and in the connecting application.
Cloud Composer has no limitations when it comes to interfacing with Cloud SQL, so either of the first two options is fine.
A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.
LOAD DATA LOCAL INFILE requires you to use --local-infile for the mysql client. To import data into Cloud SQL, make sure to follow the best practices.
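For reference, a typical invocation looks like this (the file, table and column names are placeholders only):
LOAD DATA LOCAL INFILE '/tmp/hourly_update.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES              -- skip the CSV header row
(id, col_a, col_b, updated_at);
Note that local_infile must also be enabled on the server side.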

Logging end of a run in SQL

I have a database, let's say in MySQL, that logs runs of client programs that connect to the database. When doing a run, the client program will connect to the database, insert a "Run" record with the start timestamp into the "Runs" table, enter its data into other tables for that run, and then update the same record in the "Runs" table with the end timestamp of the run. The end timestamp is NULL until the end of the run.
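For reference, the pattern described above boils down to something like this (table and column names are illustrative):
CREATE TABLE runs (
    run_id   INT AUTO_INCREMENT PRIMARY KEY,
    started  DATETIME NOT NULL,
    finished DATETIME NULL              -- stays NULL until the run ends
);

-- When the client starts a run:
INSERT INTO runs (started) VALUES (NOW());
SET @run_id = LAST_INSERT_ID();

-- ... the client writes its data to other tables, tagged with @run_id ...

-- When the run finishes cleanly:
UPDATE runs SET finished = NOW() WHERE run_id = @run_id;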
The problem is that the client program can be interrupted -- someone can hit Ctrl^C, the system can crash, etc. This would leave the end timestamp as NULL; i.e. I couldn't tell the difference between a run that's still ongoing and one that terminated ungracefully at some point.
I wouldn't want to wrap the entire run in a transaction because the runs can take a long time and upload a lot of data, and all of the data from a partial run would be desired. (There will be lots of smaller transactions during the run, however.) I also need to be able to view the data in real-time in another SQL connection as it's being uploaded by a client, so a mega-transaction for the entire run would not be good for that purpose.
During a run, the client will have a continuous session with the SQL server, so it would be nice if there could be a "trigger" or similar functionality on the connection closing that would update the Run record with the ending timestamp. It would also be nice if such a "trigger" could add a status like "completed successfully" vs. "terminated ungracefully" to boot.
Is there a solution for this in MySQL? How about PostgreSQL or any other popular relational database system?

Consecutive MySQL queries from Go become MUCH slower after some point

I'm writing a job in Go that goes through some MySQL tables, selects some of the rows based on some criteria, extracts email addresses from them and sends an email to each.
The filtering process looks at a table (let's call it storage) which is pretty big (~6gb dumped) and looks like this:
Columns:
id varchar(64) PK
path varchar(64) PK
game varchar(64)
guid varchar(64)
value varchar(512)
timestamp timestamp
There are two indices: (id, path) (the PK as seen above) and guid.
The job first retrieves a long list of guids from one table, then batches them and performs consecutive queries like this on the storage table:
SELECT guid, timestamp FROM storage
WHERE game = 'somegame'
AND path = 'path' AND value = 'value' AND timestamp >= '2015-04-22 00:00:00.0' AND timestamp <= '2015-04-29T14:53:07+02:00'
AND guid IN ( ... )
Where the IN clause contains a list of guids.
I need to retrieve timestamps to be able to filter further.
When running against my local MySQL, everything works as expected, the query takes about 180ms with batches of 1000 guids.
When running against the same DB on Amazon RDS, the queries begin quick, but after some point, they suddenly start taking around 30 seconds, and continue to do so until the job ends.
I have tried many many things to fix this, but can't figure out the reason. Some notes:
The job uses only one sql.DB object. Also, I prepare the above statement once and reuse it quite heavily.
At first, I thought it was because the RDS DB was running MySQL 5.5, and I was running 5.6. I made a replica of the RDS DB, upgraded to 5.6, ran the job again. The problem happened again.
The volume of data in the two databases is the same: I dumped the production database and imported it into my local database and ran the job. Same behaviour (it still ran quickly locally).
The AWS monitoring of the RDS nodes doesn't show any significant spikes. The CPU usage jumps from 1% to up to 10%, and the job seems to open just a few connections (~4).
I had a colleague run the job on their PC, pointing to my MySQL DB, just to make sure the great performance didn't stem from the fact that the connection was local. It ran just as quickly as on my PC (admittedly, over LAN).
I ran the job against RDS both from my local PC and from an Amazon EC2 node, which is considerably closer to RDS. From EC2, it performed better, but the problem still appeared.
The job is very concurrent, each step has input and output channels (with buffer sizes of 1000), and the work is performed by goroutines. Between the steps, I have other goroutines that batch the output of the previous goroutine.
The slowdown is sudden, one query takes milliseconds, and the next one takes tens of seconds.
I haven't the faintest idea why this happens. Any suggestions would be appreciated.
So, after lots and lots of experimentation, I found the solution.
I am using Magnetic Storage on the RDS instances involved, which guarantees approximately 100 IOPS. This limited the speed with which we could query the data.
I tested using 2000 Provisioned IOPS, and the job ran quickly all the way.

Copying data from PostgreSQL to MySQL

I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.
In my MySQL database, I have a table which contains an identical schema to the output of the query described above.
What I would like to develop is an hourly cron job which will run the query against the PostgreSQL database, then insert the results into the MySQL database. During the hour period, I don't expect to ever see more than 10,000 new rows (and that's a stretch) which would need to be transferred.
Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.
The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.
To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.
The closest guess I have for an approach is to use some scripting language to make the query, convert results into an internal data structure, then split it back out to MySQL again.
In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or - any documentation that I should look at which might be useful for doing this kind of conversion? I've found plenty of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (hourly run) seems less common and/or less documented.
Open to any suggestions.
Use the same database system on both ends and use replication
If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.
If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.
Sync using an external script
There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below), you will probably have to use an external script (or psql) from a cron job to initiate replication, unless you're going to use PgAgent to do it.
Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.
If the rows to be copied are too big to comfortably fit in memory, you can use a cursor and read them with FETCH.
I'd do the work in this order:
Connect to PostgreSQL
Connect to MySQL
Begin a PostgreSQL transaction
Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
Insert them into MySQL
DELETE any rows from the queue table in PostgreSQL if you haven't already.
COMMIT the MySQL transaction.
If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:
PREPARE TRANSACTION in PostgreSQL
COMMIT in MySQL
then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL depending on the outcome of the MySQL commit.
This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
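In PostgreSQL terms, the two-phase variant looks roughly like this (the transaction identifier is arbitrary, and max_prepared_transactions must be set above 0 on the server):
-- On the PostgreSQL side, after reading and deleting the queued rows:
PREPARE TRANSACTION 'sync_batch_42';

-- Now commit the corresponding MySQL transaction. Then, depending on
-- whether that MySQL commit succeeded:
COMMIT PREPARED 'sync_batch_42';      -- if the MySQL commit succeeded
-- or
ROLLBACK PREPARED 'sync_batch_42';    -- if the MySQL commit failed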
BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.
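The conversion itself is a single statement per table (table name illustrative), though it rewrites the table and locks it while running:
ALTER TABLE my_table ENGINE=InnoDB;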
Use DBI-Link in PostgreSQL
Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that used DBI-link via PL/Perlu to do the job.
When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and insert the data in the queue table.
Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.
Use a trigger to queue changes and DBI-link to sync
If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
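A minimal sketch of that trigger-plus-queue approach (the main_table and main_table_queue names and the sync steps are illustrative, not taken from the answer):
-- Queue table with the same definition as the main table
CREATE TABLE main_table_queue (LIKE main_table);

CREATE OR REPLACE FUNCTION queue_new_row() RETURNS trigger AS $$
BEGIN
    INSERT INTO main_table_queue SELECT NEW.*;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER main_table_queue_trg
AFTER INSERT ON main_table
FOR EACH ROW EXECUTE PROCEDURE queue_new_row();

-- Sync procedure, all in one transaction:
BEGIN;
LOCK TABLE main_table_queue IN EXCLUSIVE MODE;
-- ... copy the queued rows to MySQL here ...
DELETE FROM main_table_queue;
COMMIT;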
Add MySQL to PostgreSQL with a foreign data wrapper
Alternately, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data.
WITH moved_rows AS (
    DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
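For context, a sketch of the foreign-table setup such a writable CTE relies on, here using mysql_fdw (the server name, credentials and column list, which mirrors the question's 1 VARCHAR, 1 DATETIME and 6 BIGINT columns, are placeholders):
CREATE EXTENSION mysql_fdw;

CREATE SERVER mysql_svr
    FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host 'rds-instance.example.com', port '3306');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER mysql_svr
    OPTIONS (username 'sync_user', password 'secret');

CREATE FOREIGN TABLE mysql_table (
    label      varchar(255),
    created_at timestamp,
    value1     bigint,
    value2     bigint,
    value3     bigint,
    value4     bigint,
    value5     bigint,
    value6     bigint
) SERVER mysql_svr
  OPTIONS (dbname 'appdb', table_name 'summary');
An ODBC or JDBC FDW would need different server and table options, but the writable-CTE part stays the same.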
In short you have two scenarios:
1) Make the destination pull the data from the source into its own structure.
2) Make the source push the data out from its structure to the destination.
I'd rather try the second one: look around and find a way to create a PostgreSQL trigger, some special "virtual" table, or maybe a PL/pgSQL function. Then, instead of an external script, you'll be able to run the procedure by executing a query from cron, or possibly from inside Postgres, since there are some options for scheduling operations there.
I'd choose the second scenario because Postgres is much more flexible, and when manipulating data in special, DIY ways you simply have more possibilities.
An external script probably isn't a good solution, e.g. because you will need to treat binary data with special care, or convert dates and times from DATE to VARCHAR and then back to DATE again. Inside an external script, various text-stored data will probably just be strings, and you will need to quote them too.

Extremely slow insert from Delphi to Remote MySQL Database

Having a major hair-pulling issue with extremely slow inserts from Delphi 2010 to a remote MySQL 5.09 server.
So far, I have tried:
ADO using MySQL ODBC Driver
Zeoslib v7 Alpha
MyDAC
I have used batching and direct insert with ADO (using table access), and with Zeos I have used SQL insertion with a Query, then used Table direct mode and also cached updates Table mode using applyupdates and commit. With MyDAC I used table access mode, then direct SQL insert and then batched SQL insert
All technologies I have tried, I set compression on and off with no discernable difference.
So far I have seen pretty much the same result across the board: 7.5 records per second!!!
Now, from this point I would assume that the remote server is just slow, but MySQL Workbench is amazingly fast, and the Migration Toolkit managed the initial migration very quickly (to be honest, I don't recall how quickly, which kind of means that it was quick).
Edit 1
It is quicker for me to write the SQL to a file, upload the file to the server via FTP and then import it directly on the remote server. I wonder if they are perhaps throttling incoming MySQL traffic, but that doesn't explain why MySQL Workbench was so quick!
Edit 2
At the most basic level, the code has been:
while not qMSSQL.EOF do
begin
  qMySQL.SQL.Clear;
  qMySQL.SQL.Add('INSERT INTO tablename (fieldname1) VALUES (:fieldname1)');
  qMySQL.ParamByName('fieldname1').AsString := qMSSQL.FieldByName('fieldname1').AsString;
  qMySQL.ExecSQL;
  qMSSQL.Next;
end;
I then tried
qMySQL.CachedUpdates := True;
i := 0;
while not qMSSQL.EOF do
begin
  qMySQL.SQL.Clear;
  qMySQL.SQL.Add('INSERT INTO tablename (fieldname1) VALUES (:fieldname1)');
  qMySQL.ParamByName('fieldname1').AsString := qMSSQL.FieldByName('fieldname1').AsString;
  qMySQL.ExecSQL;
  Inc(i);
  if i > 100 then
  begin
    qMySQL.ApplyUpdates;
    i := 0;
  end;
  qMSSQL.Next;
end;
qMySQL.ApplyUpdates;
Now, in this code with CachedUpdates:=False (which obviously never actually wrote back to the database) the speed was blisteringly fast!!
To be perfectly honest, I think it's the connection - I feel it's the connection... Just waiting for them to get back to me!
Thanks for all your help!
You can try AnyDAC and its Array DML feature. It may speed up a standard SQL INSERT several times over.
Sorry that this reply comes long after you asked the question.
I had a similar problem. BDS2006 to MySQL via ODBC across the network took 25 minutes to run, around 25 inserts per second. I was using a TDatabase connection with the TTable and TQuery attached to it, and prepared the SQL statements.
The major improvement came when I started using transactions within the loop. A simple example: Memberships have Members and a Period. Start a transaction before the insert of the Membership and Members, and commit after. The number of memberships was 1,585; before transactions it took 279.90 seconds to process all the Membership records, but afterwards it took 6.71 seconds.
Almost too good to believe, and I am still working through fixing the code for the other slow bits.
Maybe you have solved your problem by now, Mark, but it may help someone else.
Are you using query parameters? The fastest way to insert should be using plain queries and parameters (i.e. INSERT INTO table (field) VALUES (:field) ), preparing the query and then assigning parameters and executing as many times as required within a single transaction - committing at the end (don't use any flavour of autocommit)
That in most databases avoids hard parses each time the query is executed, which requires time. Parameters allow the query to be parsed only once, and then re-executed many times as needed.
Use the server facilities to check what's going on; many servers offer a way to inspect what running statements are doing.
I'm not sure about ZeosLib, but using ADO with an ODBC driver you will not get the fastest way to insert records. Here are a few steps that may make your insertion faster:
Use MyDAC for direct access; it works without the slow ODBC > ADO > OLEDB > MySqlLib chain to connect to MySQL.
Open the connection once, before the insertion.
If you have a large insertion, such as 1000 rows or more, try using a transaction and committing after every 100 records or so, depending on the number of records.
Point 3 may make your insertion faster even with ZeosLib or ADO.
You've got two separate things going on here. First, your Delphi program is creating INSERT statements and sending them to the DB server, and then the server is handling them. You need to examine both ends to find the bottleneck. I'm not too familiar with MySQL tools, but I bet you could find a SQL profiler for it easily enough. Use it to profile your inserts from the Delphi app, and compare that to running inserts from the Workbench tool to see if there's a significant difference.
If not, then the slowdown is in your app. Try hooking it up to Sampling Profiler or some other profiling tool that understands Delphi, and it'l show you where you're spending lots of time on. Once you know that, then you can work on attacking the problem, or maybe come back here to ask a more specific question. But until you know where the problem is coming from, any answers you get here are just gonna be educated guesses at best.