MySQL | Massive Data Insertion

We are using MySQL 5.5 with the InnoDB engine to manage our database. One of the tables, which sees roughly equal numbers of SELECT and INSERT operations, will receive 100-150 million inserts per day.
I have already read about MySQL partitioning and was planning to implement it, but before I do, I'd love to hear some thoughts. What is the best way to deal with this kind of challenge without compromising user response time?

First of all, make sure the primary key is auto-increment, because the primary key is the clustered index for InnoDB tables. With an auto-increment key, each insert appends to the end of the index; otherwise inserts land at random positions, and random writes are a major performance killer. Keep the PK small and drop unnecessary indexes. If possible, batch inserts, since updating the indexes is a large part of the insert operation.
Make sure the other I/O settings make sense, such as how often data is actually flushed to disk; you can also put the binary log on an SSD so it is written as fast as possible.
After all of this, it's common to separate reads from writes with master-slave replication, so that spikes in insert queries do not affect reads (assuming it's OK to read potentially stale data).
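The batching advice above could look something like this on the client side. This is only a minimal sketch: the table and column names are made up, the `%s` placeholder style is the one used by the common Python MySQL drivers, and the table and column names are interpolated directly, so they must come from trusted code. The auto-increment id column is deliberately left out of the column list so the server assigns it.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_batch_insert(
    table: str, columns: Sequence[str], batch: Sequence[Sequence]
) -> Tuple[str, list]:
    """Return one multi-row INSERT statement plus its flattened parameters."""
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table,
        ", ".join(columns),
        ", ".join([row_placeholder] * len(batch)),
    )
    # Flatten row by row so values line up with the placeholders.
    params = [value for row in batch for value in row]
    return sql, params


if __name__ == "__main__":
    # Hypothetical table "events" with columns user_id and payload.
    rows = [(1, "a"), (2, "b"), (3, "c")]
    for batch in chunked(rows, 2):
        print(build_batch_insert("events", ["user_id", "payload"], batch))
```

Each (sql, params) pair would then be passed to the driver's cursor.execute() inside a single transaction, so the index maintenance and the log flush are paid once per batch rather than once per row.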

Related

MYSQL DB Fast insert AND select

I am new to MySQL and I want to make a table that is very fast with concurrent insertion and selection.
For example, I want to store 1 million rows in less than about 1 second and also be able to read these rows as soon as they are stored.
Any suggestions about the storage engine (MyISAM or InnoDB), how to insert all these rows quickly, and how to read them?
Thanks
The storage engine MyISAM is primarily for read-mostly workloads, because of locking at table level. If you really need concurrent insertion and selection, you'd rather choose the storage engine InnoDB, because it uses row locking. Be aware that InnoDB is a little bit slower because of the overhead.
In any case, make sure you're using batch inserts. Try to keep the number of indexes on the table as low as possible to avoid index maintenance overhead. You should also configure your MySQL server for good performance. For example, I would use innodb_flush_log_at_trx_commit=0 in your MySQL server configuration, if you don't mind losing about one second of data when your server crashes. There are a few books on optimizing MySQL; look for "High Performance MySQL".
Besides software, also the hardware plays an important role. You're likely to be disk bound. Thus having a fast disk is essential (for example SSD or RAID).

Which engine to be used for more than 100 insert query per second

I read about the differences and the pros and cons of MyISAM and InnoDB.
But I am still confused about which engine I should use for 100+ insert queries per second into a table (basically for tracking purposes).
I referred to "What's the difference between MyISAM and InnoDB?"
Based on my understanding, MyISAM will lock the table for each insert, and hence InnoDB should be used for its row locking.
But on the other hand, the performance of MyISAM is said to be 100 times better. So what is the optimal and correct choice, and why?
Simple code that does one-row INSERTs without any tuning maxes out at about 100 rows per second in any engine, especially InnoDB.
But, it is possible to get 1000 rows per second or even more.
The quick fix for InnoDB is to set innodb_flush_log_at_trx_commit = 2; that will uncork the main thing stopping InnoDB at 100 inserts/second using a commodity spinning disk. Setting innodb_buffer_pool_size to about 70% of available RAM is also important.
If a user is inserting multiple rows into the same table at the same time, then LOAD DATA or a batch Insert (INSERT ... VALUES (...), (...), ...) of 100 rows or more will insert ten times as fast. This applies to any Engine.
MyISAM is not 100 times as fast; it is not even 10 times as fast as InnoDB. Today (5.6 or newer), you would be hard pressed to find a well tuned application that is more than a little faster in MyISAM. You are, or will be, I/O-limited.
As for corruption -- No engine suffers from corruption except during a crash. A power failure may mangle MyISAM indexes, usually recoverably. Moreover, a batch insert could be half done. InnoDB will be clean -- the entire batch is done or none of it is done; no corruption.
ARCHIVE saves disk space, but costs CPU.
MEMORY is often faster because it has no I/O. But you have too much data for that Engine, correct?
MariaDB with TokuDB can probably run faster than anything I describe here; but you have not indicated the need for it.
100 rows inserted per second = about 8.6M/day = about 3 billion/year. Will you be purging the data eventually? Will you be querying the data? Purging: Let's talk about PARTITION. Querying: Let's talk about Summary Tables.
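For anyone who wants to redo that arithmetic for their own insert rate, here is a tiny sketch (the function names are just for illustration, and a 365-day year is assumed):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def rows_per_day(rows_per_second: int) -> int:
    """Rows accumulated in one day at a steady insert rate."""
    return rows_per_second * SECONDS_PER_DAY


def rows_per_year(rows_per_second: int, days: int = 365) -> int:
    """Rows accumulated in one year at a steady insert rate."""
    return rows_per_day(rows_per_second) * days


if __name__ == "__main__":
    print(rows_per_day(100))   # 8640000
    print(rows_per_year(100))  # 3153600000
```

So a steady 100 rows/second works out to 8.64 million rows per day and a bit over 3.15 billion per year, which is why the purging and summary-table questions matter.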
Indexing: Minimize the number of indexes. If you have a 'random' index, such as a UUID, and you have a billion rows, you will be stuck with 100 rows/second, regardless of which Engine and regardless of any tuning. Do I need to explain further?
If this is a queuing system, I say "Don't queue it, just do it."
Bottom line: Use InnoDB. Tune it. Use batch inserts. Avoid random indexes. etc.
You are correct that MyISAM is a faster choice if your operational use case is lots of insertions. But that answer can change drastically based on the kind of use you make of the data. If this is an archival application you might consider the ARCHIVE storage engine. It is best for write-once, read-rarely applications.
You should investigate INSERT DELAYED, as it will allow your client programs to fire-and-forget these inserts rather than waiting for completion. This burns RAM in your mysqld process, though. If that style of operation meets your needs, this is a compelling reason to go with MyISAM, since INSERT DELAYED is not supported by InnoDB. Note that INSERT DELAYED is deprecated as of MySQL 5.6.
Beware indexes in the target table of your inserts. Maintaining indexes is a big part of the server's insert workload.
Don't forget to look into MariaDB. It's a compatible fork of MySQL with some more advanced storage engines and features.
I have experience with a similar application. In our case, the application scaled up beyond the original insert rate, and the server could not keep up. (It's always good when an application workload grows!) We ended up doing two things, one after the other.
Using a message queuing system, and running just a couple of processes to actually do the inserts. The original clients wrote their logging records to the message queue rather than directly to the database. (Amazon AWS's SQS is an example of such a queuing system.)
Reworking the insert process to use LOAD DATA INFILE to load great gobs of log rows at once.
(You probably have figured out that this kind of workload isn't feasible on a cheap shared hosting service or an AWS micro instance.)
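To give a rough idea of the queue-plus-writer arrangement described above, here is a sketch using Python's in-process queue.Queue as a stand-in for a real message queue such as SQS. The names drain_batch, writer_loop and STOP are hypothetical, and `flush` is whatever actually writes a batch to the database (a multi-row INSERT or a LOAD DATA INFILE of a temporary file).

```python
import queue
from typing import Callable, List

STOP = object()  # sentinel a producer enqueues to shut the writer down


def drain_batch(q: "queue.Queue", max_rows: int, timeout: float) -> List:
    """Wait up to `timeout` seconds for one row, then grab whatever else is ready."""
    batch = []
    try:
        batch.append(q.get(timeout=timeout))
        while len(batch) < max_rows:
            batch.append(q.get_nowait())
    except queue.Empty:
        pass  # nothing more queued right now; return what we have
    return batch


def writer_loop(
    q: "queue.Queue",
    flush: Callable[[List], None],
    max_rows: int = 500,
    timeout: float = 0.5,
) -> None:
    """Consume rows until STOP is seen, handing each batch to `flush`."""
    while True:
        batch = drain_batch(q, max_rows, timeout)
        done = any(item is STOP for item in batch)
        rows = [item for item in batch if item is not STOP]
        if rows:
            flush(rows)
        if done:
            return


if __name__ == "__main__":
    q = queue.Queue()
    for i in range(5):
        q.put((i, "log line"))
    q.put(STOP)
    writer_loop(q, lambda rows: print(len(rows), "rows flushed"), max_rows=2)
```

The clients only ever pay for an enqueue, and the database sees a small number of connections doing large batches instead of many connections doing single-row inserts.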

Storage engine for large amounts of constantly inserted data which should be available instantly

Our server (several Java applications on Debian) handles incoming data (GNSS observations) that should be:
immediately (delay <200ms) delivered to other applications,
stored for further use.
Sometimes (several times a day, maybe) about a million archived records will be fetched from the database. Each record is about 12 double-precision fields plus a timestamp and some ids. There are no UPDATEs; DELETEs are very rare but massive. The incoming flow is up to a hundred records per second. So I had to choose a storage engine for this data.
I tried using MySQL (InnoDB). One application inserts; the others constantly check the last record id and, if it has changed, fetch the new records. This part works fine. But I've run into the following issues:
Records are quite large (about 200-240 bytes per record).
Fetching a million archived records is unacceptably slow (tens of minutes or more).
File-based storage would work just fine (since there are no inserts in the middle of the DB and selections are mostly like 'WHERE ID=1 AND TIME BETWEEN 2000 AND 3000'), but there are other problems:
Looking for new data might be not so easy.
Other data like logs and configs are stored in same database and I prefer to have one database for everything.
Can you advise a suitable database engine (SQL preferred, but not necessary)? Maybe it is possible to fine-tune MySQL to reduce the record size and the fetch time for continuous strips of data?
MongoDB is not acceptable since DB size is limited on 32-bit machines. Any engine that does not provide quick access to recently inserted data is not acceptable either.
I'd recommend using the TokuDB storage engine for MySQL. It's free for up to 50GB of user data, and its pricing model isn't terrible, making it a great choice for storing large amounts of data.
It has higher insert speed compared to InnoDB and MyISAM and scales much better as the dataset grows (InnoDB tends to deteriorate once the working dataset doesn't fit in RAM, making its performance dependent on the I/O of the HDD subsystem).
It's also ACID compliant and supports multiple clustered indexes (which would be a great choice for massive DELETEs you're planning to do). Also, hot schema changes are supported (ALTER TABLE doesn't lock the tables, and changes are quick on huge tables - I'm talking gigabyte-sized tables being altered in mere seconds).
From my personal use, I experienced about 5 - 10 times less disk usage due to TokuDB's compression, and it's much, much faster than MyISAM or InnoDB.
Even though it sounds like I'm trying to advertise this product - I'm not, it's just simply amazing since you can use monolithic data-store without expensive scaling plans like partitioning across nodes to scale the writes.
There really is no getting around how long it takes to load millions of records from disk. Your 32-bit requirement means you are limited in how much RAM you can use for memory based data structures. But, if you want to use MySQL, you may be able to get good performance using multiple table types.
If you need really fast non-blocking inserts, you can use the BLACKHOLE table type and replication. The server where the inserts occur has a BLACKHOLE table that replicates to another server where the table is InnoDB or MyISAM.
Since you don't do UPDATEs, I think MyISAM would be better than InnoDB in this scenario. You can use the MERGE table type for MyISAM (not available for InnoDB). Not sure what your data set is like, but you could have 1 table per day (hour, week?); your MERGE table would then be a superset of those tables. Assuming you want to delete old data by day, just redeclare the MERGE table to not include the old tables. This action is instantaneous. Dropping old tables is also extremely fast.
To check for new data, you can look at today's table directly rather than going through the MERGE table.
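As a rough sketch of the "redeclare the MERGE table" step, the helper below just generates the ALTER TABLE ... UNION=(...) statement for a sliding window of per-day tables. The table names (obs_all, obs_YYYYMMDD) and the function names are made up for illustration, and it assumes the per-day tables already exist as identically defined MyISAM tables.

```python
from datetime import date, timedelta


def daily_table(prefix: str, day: date) -> str:
    """Name of the per-day table, e.g. obs_20240103 (naming scheme is an assumption)."""
    return "{}_{:%Y%m%d}".format(prefix, day)


def merge_union_sql(merge_table: str, prefix: str, newest: date, days_to_keep: int) -> str:
    """Build the statement that points the MERGE table at the last `days_to_keep` days."""
    days = [newest - timedelta(days=n) for n in range(days_to_keep - 1, -1, -1)]
    tables = [daily_table(prefix, d) for d in days]
    return "ALTER TABLE {} UNION=({})".format(merge_table, ", ".join(tables))


if __name__ == "__main__":
    # Keep three days of data visible through the MERGE table.
    print(merge_union_sql("obs_all", "obs", date(2024, 1, 3), 3))
```

Tables that fall out of the window can then be dropped separately, and readers polling for new rows would query daily_table(prefix, today) directly.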

Cassandra write performance vs Relational Databases

I am trying to grasp some performance differences between Cassandra and relational databases.
From what I have read, Cassandra's write performance remains constant regardless of data volume. By write performance, I am assuming this implies both new rows being added as well as existing rows being replaced on a key match (like an update in the relational world). Is that assumption correct?
Also, from what I understand about relational databases, updates get slower when tables/partitions become larger. This is because either a full table scan or an index lookup must be performed to locate the row, and both of these take longer as the table or partition grows. So do updates take perpetually longer as the data volume of the table/partition increases?
When new data is inserted into a relational database, I know any indexes need to be updated with the new data, but there is no lookup involved, correct? So will inserts also become perpetually slower as data volume increases, or stay constant with relational databases?
Thanks for any tips
They will become slower if the table has indexes. Not only the data must be written, but the index must be updated too. Inserting in a table that has no indexes and no constraints is lightning fast, because no checks need to be done. The record can just be written at the end of the table space.
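To put a rough number on how index cost grows with table size: for a B-tree index, finding the right leaf (for a lookup, or for placing a new entry) touches one page per level, and the number of levels grows only logarithmically with the row count. The sketch below is a back-of-the-envelope estimate, not a measurement; the fanout of 100 keys per page is an assumed figure, and real fanout depends on key size and page size.

```python
def btree_levels(row_count: int, fanout: int) -> int:
    """Smallest number of levels whose capacity (fanout ** levels) covers row_count."""
    levels = 1
    capacity = fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


if __name__ == "__main__":
    for rows in (1_000_000, 20_000_000, 1_000_000_000):
        print(rows, "rows ->", btree_levels(rows, fanout=100), "levels")
```

Going from a million rows to a billion adds only a couple of levels under this assumption. In practice the slowdown people observe tends to come from the index no longer fitting in memory, so that each of those page visits becomes a disk read.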
On the relational DB side, I've been doing load testing on our RDBMS where I can see that the performance drops exponentially as data is added to the DB.
I'm still working on a Cassandra setup to be able to realize a comparable test. In the meantime, this Cassandra presentation gives some info on Cassandra compared to MySQL:
http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql

Will a MySQL table with 20,000,000 records be fast with concurrent access?

I ran a lookup test against an indexed MySQL table containing 20,000,000 records, and according to my results, it takes 0.004 seconds to retrieve a record given an id--even when joining against another table containing 4,000 records. This was on a 3GHz dual-core machine, with only one user (me) accessing the database. Writes were also fast, as this table took under ten minutes to create all 20,000,000 records.
Assuming my test was accurate, can I expect performance to be as snappy on a production server with, say, 200 users concurrently reading from and writing to this table?
I assume InnoDB would be best?
That depends on the storage engine you're going to use and what's the read/write ratio.
InnoDB will be better if there are a lot of writes. If it's mostly reads with only occasional writes, MyISAM might be faster. MyISAM uses table-level locking, so it locks the whole table whenever you need to update. InnoDB uses row-level locking, so you can have concurrent updates on different rows.
InnoDB is definitely safer, so I'd stick with it anyhow.
BTW. remember that right now RAM is very cheap, so buy a lot.
Depends on any number of factors:
Server hardware (Especially RAM)
Server configuration
Data size
Number of indexes and index size
Storage engine
Writer/reader ratio
I wouldn't expect it to scale that well. More importantly, this kind of thing is too important to speculate about. Benchmark it and see for yourself.
Regarding storage engine, I wouldn't dare to use anything but InnoDB for a table of that size that is both read and written to. If you run any write query that isn't a primitive insert or single row update you'll end up locking the table using MyISAM, which yields terrible performance as a result.
There's no reason that MySQL couldn't handle that kind of load without any significant issues. There are a number of other variables involved though (otherwise, it's a 'how long is a piece of string' question). Personally, I've had a number of tables in various databases that are well beyond that range.
How large is each record (on average)?
How much RAM does the database server have, and how much is allocated to the various MySQL/InnoDB settings?
A default configuration may only allow for a default 8MB buffer between disk and client (which might work fine for a single user) - but trying to fit a 6GB+ database through that is doomed to failure. That problem was real btw - and was causing several crashes a day of a database/website till I was brought in to trouble-shoot it.
If you are likely to do a great deal more with that database, I'd recommend getting someone with a little more experience, or at least doing what you can to give it some optimisations. Reading 'High Performance MySQL, 2nd Edition' is a good start, as is looking at some tools like Maatkit.
As long as your schema design and DAL are constructed well enough, you understand query optimization inside out, can adjust all the server configuration settings at a professional level, and have "enough" hardware properly configured, yes (except for sufficiently pathological cases).
Same answer both engines.
You should probably perform a load test to verify, but as long as the index was created properly (meaning indexes are optimized to your query statements), the SELECT queries should perform at an acceptable speed (the INSERTS and/or UPDATES may be more of a speed issue though depending on how many indexes you have, and how large the indexes get).