I'm pulling match data from an online game API and saving details to a locally hosted MySQL database. Each call to the API returns about 100 matches, and I'm inserting 15 matches at a time. For each match I'm inserting anywhere from 150-250 rows across 5 tables.
I've used the optimizations described here: Fastest Way of Inserting in Entity Framework
I've been able to insert about 9 matches/sec, but now that I've saved 204,000 matches the insertion rate has slowed to 2.5 matches/sec. I hope to save all matches since inception, which is probably around 300+ million matches.
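For concreteness, this kind of batching boils down, at the SQL level, to multi-row INSERTs wrapped in a single transaction. A rough sketch, with made-up table and column names:

    -- Illustrative only: one transaction per batch, one multi-row INSERT per table.
    START TRANSACTION;

    INSERT INTO matches (match_id, game_mode, duration_sec)
    VALUES
      (1001, 'ranked', 2350),
      (1002, 'casual', 1820),
      (1003, 'ranked', 2710);   -- ... up to 15 matches per statement

    INSERT INTO match_players (match_id, player_id, kills, deaths)
    VALUES
      (1001, 501, 12, 3),
      (1001, 502,  4, 9),
      (1002, 503,  7, 7);       -- ... 150-250 detail rows per match

    COMMIT;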
I can't use SqlBulkCopy because this is a MySQL database.
Are there any further optimizations I can do? I'd like to parallelize, but I suppose I'll still be blocked on the DB.
Thanks.
Related
When INSERTing, does the order of columns in the SQL statement make any difference? Particularly with regard to BLOB types?
I'm assuming the answer is NO, but the answers I've found so far are focused on SELECT query performance. Example 1, Example 2
DETAILS:
There has been a refactor of an application using this DB and I'm seeing insert performance degrade as the table grows in rows. I did not see any degradation in insert performance previously.
The MySQL db is literally the same between the old app and the new app: same config settings, same table structure. Therefore while DB config optimizations may be applicable here, at the moment I'm trying to understand why there is a difference in INSERT query performance.
On the app side it's the same DB driver between the old and new app.
There is a difference in the persistence code (JDBC Template vs Hibernate), and one difference I see is that the SQL INSERT statements were previously generated with the BLOB column (longblob) as the last column, whereas now the SQL INSERT is generated with the BLOB as one of the columns in the middle of the statement.
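To make the difference concrete, the two generated statements differ only in where the BLOB parameter sits; the table and column names here are hypothetical:

    -- Old app (JDBC Template): BLOB bound as the last parameter
    INSERT INTO payload (id, name, created_at, data)
    VALUES (?, ?, ?, ?);            -- `data` is the longblob column

    -- New app (Hibernate): BLOB bound in the middle of the column list
    INSERT INTO payload (id, data, name, created_at)
    VALUES (?, ?, ?, ?);            -- same columns, different order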
The data being inserted is for the most part the same between the old and new app.
Commits are being issued in the same manner and frequency.
Yes, I understand there are other factors at play in this scenario, but I figured I'd just confirm that INSERT column ordering does not matter when inserting BLOB types.
EDIT 1: By performance degradation I mean the following:
The time it takes to insert 100k rows (committing every 1k rows) increases from about 25 seconds for the first 100k to 27, then 30, then 35 seconds for each new 100k batch. Since this is a test DB, I am able to clear the data as needed.
I did not see this behavior before and still do not see it on HSQL.
I'm measuring time to insert 100k on the app side.
EDIT 2: I'm using version 5.6.20, so EXPLAIN is not available on INSERT. However, I did use SHOW PROFILE as described here. Oddly, MySQL reports the duration of the new query as much faster than the old query (which I am struggling to understand).
New query duration: 0.006389
Old query duration: 0.028531
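For anyone wanting to reproduce this kind of measurement, SHOW PROFILE on 5.6 is typically used along these lines; the INSERT below is only a placeholder:

    SET profiling = 1;                       -- deprecated in 5.6, but still available

    INSERT INTO payload (id, data, name, created_at)
    VALUES (1, x'DEADBEEF', 'test', NOW());  -- placeholder statement to profile

    SHOW PROFILES;                           -- lists recent statements with durations
    SHOW PROFILE FOR QUERY 1;                -- per-stage breakdown for statement #1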
I'm working on a migration from MySQL to Postgres on a large Rails app; most operations are performing at a normal rate. However, we have a particular operation that generates job records every 30 minutes or so. There are usually about 200 records generated and inserted, after which we have separate workers that pick up the jobs and work on them from another server.
Under MySQL it takes about 15 seconds to generate the records, and then another 3 minutes for the worker to perform and write back the results, one at a time (so 200 more updates to the original job records).
Under Postgres it takes around 30 seconds, and then another 7 minutes for the worker to perform and write back the results.
The table being written to has roughly 2 million rows and a single sequence-backed ID column.
I have tried tweaking checkpoint timeouts and sizes with no luck.
The table is heavily indexed and really shouldn't be any different than it was before.
I can't post code samples, as it's a huge codebase and it wouldn't make sense without posting pages and pages of code.
My question is, can anyone think of why this would possibly be happening? There is nothing in the Postgres log, and the process of creating these objects has not really changed. Is there some sort of blocking synchronous write behavior I'm not aware of with Postgres?
I've added all sorts of logging in my code to spot errors or transaction failures, but I'm coming up with nothing; it just takes twice as long to run, which doesn't seem correct to me.
The Postgres instance is hosted on AWS RDS on a M3.Medium instance type.
We also use New Relic, and it's showing nothing of interest here, which is surprising.
Why does your job queue contain 2 million rows? Are they all live, or have you not moved them to an archive table to keep your reporting simpler?
Have you used EXPLAIN on your SQL from a psql prompt or your preferred SQL IDE/tool?
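For example, from psql you could run something like the following against the UPDATE the worker issues; the table and values here are hypothetical, and the BEGIN/ROLLBACK is there because EXPLAIN ANALYZE actually executes the statement:

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE jobs
    SET    status = 'done'
    WHERE  id = 12345;
    ROLLBACK;                -- discard the test update after reading the plan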
Postgres is a completely different RDBMS than MySQL. It allocates space differently and manipulates space differently, so it may need to be indexed differently.
Additionally, there's a tool called pgtune that will suggest configuration changes.
edit: 2014-08-13
Also, Rails comes with a profiler that might add some insight. Here's a StackOverflow thread about Rails profiling.
You also want to watch your DB server at the disk I/O level. Does your job fulfillment lead to a large number of updates? Postgres creates a new row version when you update an existing row and leaves the old version behind for vacuum to reclaim, instead of overwriting the row in place. So you may be seeing a lot more I/O as a result of your RDBMS switch.
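One way to check whether that update churn is piling up is to look at the table's tuple statistics and autovacuum activity; a minimal sketch, assuming the job table is named jobs:

    -- Hypothetical check: update volume, dead row versions, and last autovacuum run.
    SELECT relname,
           n_tup_upd,        -- total row updates
           n_tup_hot_upd,    -- updates that avoided index maintenance (HOT)
           n_dead_tup,       -- dead row versions not yet reclaimed
           last_autovacuum
    FROM   pg_stat_user_tables
    WHERE  relname = 'jobs';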
I'm trying to find the best candidate for storing the follower/following user data.
Initially I was thinking of storing it in Redis as a set (user -> set of user IDs), but then I thought about the scenario where a user has over 1 million or even 10 million followers: how would Redis handle such a huge set? Also, there is no way to paginate over a plain set in Redis; I'd have to retrieve the whole set, which won't work if a user wants to browse who is following him.
If I store it in MySQL I can definitely do pagination, but it might take a long time to fetch 10 million records from the database whenever I have to build the user feed. I could do this in the old batch fashion, but it still sounds pretty painful: whenever a user who has many followers posts something, processing those 10 million records just to fetch his followers would take forever.
Would it be worthwhile to store it in MySQL for pagination (mainly frontend) and in Redis for the event-driven messaging that builds the activity feed?
It's a personal decision whether to use Redis or MySQL for this task. Both will have no problem with those 10 million records.
MySQL has the LIMIT x,y clause for getting a subset of the followers from the database.
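A minimal sketch of that, assuming a followers table keyed by the followed user; the table, columns, and page size are made up:

    -- Hypothetical table: followers(user_id, follower_id, created_at)
    -- Third page of a user's followers, 50 per page (offset = 2 * 50).
    SELECT follower_id
    FROM   followers
    WHERE  user_id = 42
    ORDER BY created_at DESC
    LIMIT  100, 50;          -- MySQL's LIMIT offset, row_count form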
For Redis you can use sorted sets, using the user ID of the follower or the time the user started following as the score. And like MySQL, Redis supports getting a subset of a large sorted set.
I am trying to create a web application whose primary objective is to insert request data into a database.
Here is my problem: one request itself contains 10,000 to 100,000 data sets of information
(Each data set needs to be inserted separately as a row in the database)
I may get multiple requests on this application concurrently, so it's necessary for me to make the inserts fast.
I am using a MySQL database. Which approach is better for me: LOAD DATA or batch INSERT, or is there a better way than these two?
How will your application retrieve this information?
- There will be another background-thread-based Java application that will select records from this table, process them one by one, and delete them.
Can you queue your requests (batches) so your system will handle them one batch at a time?
- For now we are thinking of inserting it into the database straightaway, but yes, if this approach is not feasible enough we may think of queuing the data.
Do retrievals of information need to be concurrent with insertion of new data?
- Yes, we are keeping it concurrent.
Here are some answers to your questions, Ollie Jones.
Thank you!
Ken White's comment mentioned a couple of useful SO questions and answers for handling bulk insertion. For the record volume you are handling, you will enjoy the best success by using MyISAM tables and LOAD DATA INFILE data loading, from source files in the same file system that's used by your MySQL server.
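A rough sketch of what that looks like; the file path, table name, and column layout are assumptions and would need to match your actual request data:

    -- Hypothetical: each incoming request is written out as a CSV on the
    -- database server's file system, then loaded in a single pass.
    LOAD DATA INFILE '/var/lib/mysql-files/request_batch_0001.csv'
    INTO TABLE request_rows
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES                          -- skip a header row, if present
    (request_id, payload, received_at);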
What you're doing here is a kind of queuing operation. You receive these batches (you call them "requests") of records (you call them "data sets"). You put them into a big bucket (your MySQL table). Then you take them out of the bucket one at a time.
You haven't described your problem completely, so it's possible my advice is wrong.
Is each record ("data set") independent of all the others?
Does the order in which the records are processed matter? Or would you obtain the same results if you processed them in a random order? In other words, do you have to maintain an order on the individual records?
What happens if you receive two million-row batches ("requests") at approximately the same time? Assuming you can load ten thousand records a second (that's fast!) into your MySQL table, this means it will take 200 seconds to load both batches completely. Will you try to load one batch completely before beginning to load the second?
Is it OK to start processing and deleting the rows in these batches before the batches are completely loaded?
Is it OK for a record to sit in your system for 200 or more seconds before it is processed? How long can a record sit? (this is called "latency").
Given the volume of data you're mentioning here, if you're going into production with living data you may want to consider using a queuing system like ActiveMQ rather than a DBMS.
It may also make sense simply to build a multi-threaded Java app to load your batches of records, deposit them into a Queue object in RAM (a ConcurrentLinkedQueue instance may be suitable) and process them one by one. This approach will give you much more control over the performance of your system than you will have by using a MySQL table as a queue.
Straight to the question:
The problem: to do async bulk inserts (not necessarily bulk, if MySQL can handle it) using Node.js (coming from a .NET and PHP background).
Example:
Assume I have 40 (adjustable) functions doing some work (async), each adding a record to the table after its single iteration. Now it is very probable that more than one function makes an insertion call at the same time. Can MySQL handle it that way directly, considering there is going to be an auto-update field?
In C# (.NET) I would have used a DataTable to hold all the rows from each function, launched many threads for the functions, and at the end bulk-inserted the DataTable into the database table.
What approach would you suggest in this case?
Should the approach change in case I need to handle 10,000 or 4 million rows per table?
Also, the DB schema is not going to change; would MongoDB be a better choice for this?
I am new to Node and NoSQL, and in the noob learning phase at the moment. So if you can provide some explanation with your answer, it would be awesome.
Thanks.
EDIT:
Answer: Neither MySQL nor MongoDB supports any sort of bulk insert; under the hood it is just a foreach loop.
Both of them are capable of handling a large number of connections simultaneously; the performance will largely depend on your requirements and production environment.
1) In MySQL, queries are executed sequentially per connection. If you are using one connection, your ~40 functions will result in 40 queries being enqueued (via an explicit queue in the mysql library, in your code, or in a system queue based on synchronisation primitives), not necessarily in the same order you started the 40 functions. MySQL won't have any race-condition problems with auto-update fields in that case.
2) If you really want to execute 40 queries in parallel, you need to open 40 connections to MySQL (which is not a good idea from a performance point of view, but again, MySQL is designed to handle auto-increments correctly for multiple clients).
3) There is no special bulk-insert command in the MySQL protocol at the wire level; any library exposing a bulk-insert API is in fact just issuing a long 'INSERT ... VALUES' query.
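In other words, when a client library's bulk-insert helper runs, what actually goes over the wire is a single long statement along these lines (table and column names made up):

    -- One ordinary INSERT with many value tuples, generated by the client library;
    -- the whole statement must fit within max_allowed_packet.
    INSERT INTO work_log (worker_id, payload, created_at)
    VALUES
      (1, 'result-a', NOW()),
      (2, 'result-b', NOW()),
      (3, 'result-c', NOW());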