MySQL LOAD DATA INFILE performance, Windows vs Unix

I have created a benchmark of multiple LOAD DATA operations inserting millions of records into different tables.
I was surprised to discover that on Windows my benchmark finished in 1.15 hours, versus 1.45 hours on Unix.
Then I created a different benchmark for read operations, which involves full table scans through a single index on a time column (my measurements carry a time value, which is non-unique).
For the second benchmark I got results five times better on Unix than on Windows.
What could be the reason for this? (I have only MyISAM tables.)
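For reference, the operation being benchmarked has roughly this shape (a minimal sketch; the table, column, and file names are hypothetical, not the ones from the actual benchmark):

-- Hypothetical example of one of the benchmarked loads: bulk-loading a
-- CSV file into a MyISAM table.
LOAD DATA INFILE '/data/measurements.csv'
INTO TABLE measurements
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(id, measured_at, value);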

Related

How to do a one-time load for 4 billion records from MySQL to SQL Server

We have a need to do the initial data copy of a table that has 4+ billion records from a source MySQL (5.5) to a target SQL Server (2014). The table in question is pretty wide with 55 columns, though none of them are LOBs. I'm looking for options for copying this data in the most efficient way possible.
We've tried loading via Attunity Replicate (which has worked wonderfully for tables that aren't this large), but if the initial data copy with Attunity Replicate fails, it starts over from scratch ... losing whatever time was spent copying the data. With patching and the possibility of this table taking 3+ months to load, Attunity wasn't the solution.
We've also tried smaller batch loads with a linked server. This is working but doesn't seem efficient at all.
Once the data is copied we will be using Attunity Replicate to handle CDC.
For something like this, I think SSIS would be the simplest option. It's designed for large inserts, as big as 1 TB. In fact, I'd recommend the MSDN article "We loaded 1TB in 30 Minutes and so can you".
Doing simple things like dropping indexes and applying other optimizations like partitioning will make your load faster (see the sketch after this answer). While 30 minutes isn't a feasible time to shoot for, it would be a very straightforward task to have an SSIS package run outside of business hours.
My business doesn't have a load on the scale you do, but we do refresh databases of more than 100M rows nightly, and that doesn't take more than 45 minutes, even though the process is poorly optimized.
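As a rough illustration of the "drop indexes before the load" idea on the SQL Server side (a hedged sketch; the table, index, and column names are hypothetical):

-- Hypothetical sketch: drop the secondary index on the target table,
-- run the bulk load (SSIS fast-load / BULK INSERT), then rebuild it once.
DROP INDEX ix_created_at ON dbo.target_table;

-- ... bulk load of the 4+ billion rows runs here ...

CREATE INDEX ix_created_at ON dbo.target_table (created_at);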
One of the most efficient ways to load huge volumes of data is to read them in chunks.
I have answered many similar questions for SQLite, Oracle, DB2 and MySQL. You can refer to one of them to get more information on how to do that using SSIS:
Reading Huge volume of data from Sqlite to SQL Server fails at pre-execute (SQLite)
SSIS failing to save packages and reboots Visual Studio (Oracle)
Optimizing SSIS package for millions of rows with Order by / sort in SQL command and Merge Join (MySQL)
Getting top n to n rows from db2 (DB2)
On the other hand, there are many other suggestions, such as dropping indexes on the destination table and recreating them after the insert, creating the needed indexes on the source table, and using the fast-load option to insert the data.
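As a hedged sketch of the "read in chunks" approach on the MySQL source (table name, column names, and chunk size are hypothetical; keyset pagination avoids the cost of large OFFSETs):

-- Hypothetical sketch: page through the source table on its primary key
-- instead of issuing one enormous SELECT. Repeat, feeding the last id of
-- each chunk back in, until no rows are returned.
SELECT id, col1, col2
FROM   source_table
WHERE  id > 1000000        -- last id seen in the previous chunk
ORDER  BY id
LIMIT  100000;             -- chunk size to tune for the environment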

Big InnoDB tables in MySQL

There are 10 InnoDB partitioned tables. MySQL is configured with innodb-file-per-table=1 (one InnoDB file per table/partition, for various reasons). Each table is about 40 GB in size. They contain statistics data.
During normal operation the system can handle the load. The accumulated data is processed every N minutes. However, if for some reason there is no processing for more than 30 minutes (e.g. maintenance of the system; it is rare, but about once a year changes have to be made), lock timeouts begin to appear.
I won't go into how we arrived at such an architecture, but it is the best solution; it was a long road.
Each time, making changes requires more and more time. Today, for example, a simple ALTER TABLE took 2 hours 45 minutes. This is unacceptable.
So, as I said, processing the accumulated data requires a lot of resources, and SELECT statements begin to return lock timeout errors. Of course, the big tables are not themselves involved in those queries; the work goes against the results of queries on them. The total size of these 10 tables is about 400 GB, plus a few dozen small tables whose combined size is comparable to (or perhaps a bit less than) the size of one big table. There are no problems with the small tables.
My question is: how can I solve the problem with the lock timeout errors? The server is not bad: an 8-core Xeon with 64 GB of RAM, and it is dedicated to the database. Of course, the entire system is not located on a single machine.
The only place I get these errors is in the process that transforms data from the big tables into the small ones.
Any ideas?
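One place to start (a hedged sketch; these INFORMATION_SCHEMA tables are available with InnoDB on MySQL/Percona 5.5 and 5.6, while later versions expose the same data through performance_schema) is to look at the configured timeout and at which transactions are actually waiting on which:

-- Sketch: show the configured lock wait timeout and the current lock waits.
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

SELECT r.trx_id    AS waiting_trx,
       r.trx_query AS waiting_query,
       b.trx_id    AS blocking_trx,
       b.trx_query AS blocking_query
FROM   information_schema.INNODB_LOCK_WAITS w
JOIN   information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN   information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;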

Large number of INSERT statements locking inserts on other tables in MySQL 5.5

I'm trying to understand an issue I am having with a MySQL 5.5 server.
This server hosts a number of databases. Each day at a certain time a process runs a series of inserts into TWO tables within one of these databases. The process lasts from 5 to 15 minutes depending on the number of rows being inserted.
This process runs perfectly, but it has a very unexpected side effect. All other inserts and updates running on tables unrelated to the two being inserted into just sit and wait until the process has finished. Reads and writes outside of this database work just fine, and SELECT statements are fine too.
So how is it possible for a single table to block the rest of a database but not the entire server (due to loading)?
A bit of background:
The tables being inserted into are MyISAM with 10-20 million rows.
MySQL is Percona 5.5 serving one slave, both running on Debian.
No explicit locking is requested by the process inserting the records.
None of the INSERT statements select data from any other table. They are all INSERT IGNORE statements.
ADDITIONAL INFO:
While this is happening there are no locked-table entries in the PROCESS LIST, and the process inserting the records that causes this problem does NOT issue any table locks.
I've already investigated the usual causes of table locking and I think I've ruled them out. This behaviour is either something to do with how MySQL works, a quirk of having large database files, or possibly even something to do with the OS/filesystem.
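As a hedged sketch, the kind of check described above (confirming whether any table-level lock waits appear while the inserts run) can be done with something like:

-- Look for sessions stuck in "Waiting for table level lock".
SHOW FULL PROCESSLIST;

-- List tables that are currently locked or in use.
SHOW OPEN TABLES WHERE In_use > 0;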
After a few weeks of trying things I eventually found this: Yoshinori Matsunobu Blog - MyISAM and Disk IO Scheduler
Yoshinori demonstrates that changing the scheduler queue to 100000 (from the default 128) dramatically improves the throughput of MyISAM on most schedulers.
After making this change to my system, there were no longer any dramatic instances of the database hanging on MyISAM tables while this process was running. There was a slight slowdown, as is to be expected with that volume of data, but the system remained stable.
Anyone experiencing performance issues with MyISAM should read Yoshinori's blog entry and consider this fix.

Performance of MySQL LOAD DATA INFILE over small file

I'm loading a small data file consisting of around 1K rows into a MyISAM table with this schema:
{
id INT(8),
text TEXT(or VARCHAR(1000))
}
The cost is around 2 seconds for LOAD DATA INFILE. I've seen MySQL load more than 10K rows per second on average when loading large files, and I roughly know there are costs such as opening and closing tables. Can someone help me understand what exactly happens in these 2 seconds, and whether it is possible to get this under a second, as my program runs in a time-critical environment? Thanks.
Somebody asked a similar question here
http://forums.mysql.com/read.php?144,558753,558753.
Looks like it has not been well answered yet.
Scenario Description
The whole MySQL setup is for some academic projects and holds around 300 GB of databases for various projects. Most of these databases, if not all, use the MyISAM engine. They contain imported dumps and intermediate tables produced during experiments. There have been delete and update operations on these tables, but they are all idle now. I have a project that generates result tuples which are inserted into a table in one of the databases. The table starts out empty. The schema is very simple and contains only the two columns I pasted above. Now, if I set ENGINE=MyISAM it always takes 2 s to insert the ~1K rows, but if I switch to ENGINE=InnoDB it takes 0.01 s. I installed a fresh MySQL on another machine, created the table with ENGINE=MyISAM, and inserted the same number of rows; it took only 0.01 s.
At 1k rows, you may find that multi-inserts are faster. Some benchmarking should help. This should be helpful as well:
http://dev.mysql.com/doc/refman/5.5/en/optimizing-innodb.html
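As a hedged sketch of the multi-insert suggestion (the table name is hypothetical; the columns match the schema in the question), batching many rows into one statement avoids most of the per-statement overhead:

-- Hypothetical sketch: one multi-row INSERT instead of ~1K single-row ones.
INSERT INTO results (id, text) VALUES
  (1, 'first row'),
  (2, 'second row'),
  (3, 'third row');
-- ... continue batching, e.g. a few hundred rows per statement.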

DB with best inserts/sec performance? [closed]

We deploy an (AJAX - based) Instant messenger which is serviced by a Comet server. We have a requirement to store the sent messages in a DB for long-term archival purposes in order to meet legal retention requirements.
Which DB engine provides the best performance in this write-once, read never (with rare exceptions) requirement?
We need at least 5000 inserts/sec. I am assuming neither MySQL nor PostgreSQL can meet these requirements.
Any proposals for a higher performance solution? HamsterDB, SQLite, MongoDB ...?
Please ignore the earlier benchmark; it had a bug in it.
We inserted 1M records with the following columns: id (int), status (int), message (140 chars, random).
All tests were done with the C++ driver on a desktop PC (i5, 500 GB SATA disk).
Benchmark with MongoDB:
1M Records Insert without Index
time: 23s, insert/s: 43478
1M Records Insert with Index on Id
time: 50s, insert/s: 20000
next we added another 1M records to the same collection, on top of the index and the existing 1M records
time: 78s, insert/s: 12820
All of that resulted in close to 4 GB of files on the filesystem.
Benchmark with MySQL:
1M Records Insert without Index
time: 49s, insert/s: 20408
1M Records Insert with Index
time: 56s, insert/s: 17857
next we added another 1M records to the same table, on top of the index and the existing 1M records
time: 56s, insert/s: 17857
exactly the same performance; no loss on MySQL as the table grows
We saw Mongo eat around 384 MB of RAM during this test and load 3 CPU cores; MySQL was happy with 14 MB and loaded only 1 core.
Edorian was on the right track with his proposal. I will do some more benchmarks, and I'm sure we can reach 50K inserts/sec on a 2x quad-core server.
I think MySQL will be the right way to go.
If you are never going to query the data, then I wouldn't store it in a database at all; you will never beat the performance of simply writing it to a flat file.
What you might want to consider are the scaling issues: what happens when it becomes too slow to write the data to a flat file? Will you invest in faster disks, or something else?
Another thing to consider is how to scale the service so that you can add more servers without having to coordinate the logs of each server and consolidate them manually.
Edit: You wrote that you want to have it in a database. In that case I would also consider the security issues of having the data online: what happens when your service gets compromised? Do you want your attackers to be able to alter the history of what has been said?
It might be smarter to store it temporarily in a file and then dump it to an off-site location that isn't accessible if your Internet-facing servers get hacked.
If you don't need to do queries, then a database is not what you need. Use a log file.
It's only stored for legal reasons.
And what about the detailed requirements? You mention NoSQL solutions, but these can't promise the data is really stored on disk. In PostgreSQL everything is transaction-safe, so you're 100% sure the data is on disk and available (just don't turn off fsync).
Speed has a lot to do with your hardware, your configuration and your application. PostgreSQL can insert thousands of records per second on good hardware with a correct configuration; it can be painfully slow on the same hardware with a plain stupid configuration and/or the wrong approach in your application. A single INSERT is slow, many INSERTs in a single transaction are much faster, prepared statements are faster still, and COPY does magic when you need speed. It's up to you.
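As a rough sketch of that last point (the table name and file path are hypothetical), the same rows can be inserted in PostgreSQL in three ways, from slowest to fastest:

-- 1) Individual INSERTs: one transaction (and one fsync) per row.
INSERT INTO messages (id, status, message) VALUES (1, 0, 'hello');

-- 2) Many INSERTs batched in one transaction: far fewer fsyncs.
BEGIN;
INSERT INTO messages (id, status, message) VALUES (2, 0, 'hello');
INSERT INTO messages (id, status, message) VALUES (3, 0, 'world');
-- ... thousands more rows ...
COMMIT;

-- 3) COPY: the fastest bulk path.
COPY messages (id, status, message) FROM '/tmp/messages.csv' WITH (FORMAT csv);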
I don't know why you would rule out MySQL. It can handle high insert rates. If you really want high insert rates, use the BLACKHOLE table type with replication. It's essentially writing to a log file that eventually gets replicated to a regular database table. You could even query the slave without affecting insert speeds.
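A hedged sketch of that setup (the column definitions mirror the benchmark above; the replication configuration itself is omitted): the master table discards the rows locally and only writes the binary log, and a slave with a real storage engine materializes them.

-- On the master: rows are not stored, only logged for replication.
CREATE TABLE messages (
  id      INT NOT NULL,
  status  INT NOT NULL,
  message VARCHAR(140)
) ENGINE = BLACKHOLE;

-- On the slave: the same table with a regular engine receives the rows.
CREATE TABLE messages (
  id      INT NOT NULL,
  status  INT NOT NULL,
  message VARCHAR(140)
) ENGINE = MyISAM;   -- or InnoDB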
Firebird can easily handle 5000 inserts/sec if the table doesn't have indices.
Depending on your system setup, MySQL can easily handle over 50,000 inserts per second.
In tests on a system I am currently working on, we got to over 200k inserts per second with 100 concurrent connections on 10 tables (just inserting a few values).
I'm not saying this is the best choice, since other systems like Couch could make replication/backups/scaling easier, but dismissing MySQL solely on the assumption that it can't handle such modest volumes of data is a little too harsh.
I guess there are better (read: cheaper, easier to administer) solutions out there.
Use Event Store (https://eventstore.org); you can read (https://eventstore.org/docs/getting-started/which-api-sdk/index.html) that when using the TCP client you can achieve 15,000-20,000 writes per second. If you ever need to do anything with the data, you can use projections or do transformations based on streams to populate any other datastore you wish.
You can even create a cluster.
If money plays no role, you can use TimesTen.
http://www.oracle.com/timesten/index.html
A complete in-memory database, with amazing speed.
I would use the log file for this, but if you must use a database, I highly recommend Firebird. I just tested the speed: it inserts about 10k records per second on quite average hardware (a 3-year-old desktop computer). The table has one compound index, so I guess it would work even faster without it:
milanb@kiklop:~$ fbexport -i -d test -f test.fbx -v table1 -p **
Connecting to: 'LOCALHOST'...Connected.
Creating and starting transaction...Done.
Create statement...Done.
Doing verbatim import of table: TABLE1
Importing data...
SQL: INSERT INTO TABLE1 (AKCIJA,DATUM,KORISNIK,PK,TABELA) VALUES (?,?,?,?,?)
Prepare statement...Done.
Checkpoint at: 1000 lines.
Checkpoint at: 2000 lines.
Checkpoint at: 3000 lines.
...etc.
Checkpoint at: 20000 lines.
Checkpoint at: 21000 lines.
Checkpoint at: 22000 lines.
Start : Thu Aug 19 10:43:12 2010
End : Thu Aug 19 10:43:14 2010
Elapsed : 2 seconds.
22264 rows imported from test.fbx.
Firebird is open source, and completely free even for commercial projects.
I believe the answer will also depend on your hard disk type (SSD or not) and on the size of the data you insert. I was inserting single-field documents into MongoDB on a dual-core Ubuntu machine and was hitting over 100 records per second. I introduced some quite large data into a field and it dropped down to about 9 records per second, with the CPU running at about 175%! The box doesn't have an SSD, so I wonder if I'd have done better with one.
I also ran MySQL, and it was taking 50 seconds just to insert 50 records into a table with 20M rows (and about 4 decent indexes), so with MySQL too it will depend on how many indexes you have in place.