MySQL Memory Engine vs InnoDB on RAMdisk

I'm writing a bit of software that needs to flatten data from a hierarchical type of format into tabular format. Instead of doing it all in a programming language every time and serving it up, I want to cache the results for a few seconds, and use SQL to sort and filter. When in use, we're talking 400,000 writes and 1 or 2 reads over the course of those few seconds.
Each table will contain 3 to 15 columns. Each row will contain from 100 bytes to 2,000 bytes of data, although it's possible that in some cases, some rows may get up to 15,000 bytes. I can clip data if necessary to keep things sane.
The main options I'm considering are:
MySQL's Memory engine
A good option, almost specifically written for my use case! But: "MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. MEMORY tables cannot contain BLOB or TEXT columns." Unfortunately, I do have text fields with lengths up to maybe 10,000 characters, and even that is a number that is not specifically limited. I could adjust the VARCHAR length based on the maximum length of the text columns as I loop through doing my flattening, but that's not totally elegant. Also, for my occasional 15,000-character row, does that mean I need to allocate 15,000 characters for every row in the table? If there were 100,000 rows, that's roughly 1.5 GB not including overhead!
InnoDB on RAMDisk
This is meant to run on the cloud, and I could easily spin up a server with 16gb of ram, configure MySQL to write to tmpfs and use full featured MySQL. My concern for this is space. While I'm sure engineers have written the memory engine to prevent consuming all temp storage and crashing the server, I doubt this solution would know when to stop. How much actual space will my 2,000 bytes of data consume when in database format? How can I monitor it?
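I suppose I could poll information_schema to watch how big the tables actually get (a sketch; 'mydb' is a placeholder schema name), and check the tmpfs mount itself with df:
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb,
       ROUND(data_free / 1024 / 1024, 1) AS free_mb
FROM information_schema.tables
WHERE table_schema = 'mydb';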
Bonus Questions
Indexes
I will in fact know in advance which columns need to be filtered and sorted on. I could set up an index before I do the inserts, but what kind of performance gain could I honestly expect on top of a RAM disk? How much extra overhead do indexes add?
Inserts
I'm assuming inserting multiple rows with one query is faster. But that one query, or series of large queries, is itself held in memory, and we're writing to memory, so if I did that I'd momentarily need double the memory. So then we're talking about doing one or two or a hundred rows at a time, and having to wait for each batch to complete before processing more. InnoDB doesn't lock the table, but I worry about sending two queries too close to each other and confusing MySQL. Is this a valid concern? With the MEMORY engine I'd definitely have to wait for completion, due to table locks.
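For reference, the kind of batched insert I have in mind (just a sketch; the table and column names are made up):
INSERT INTO flat_cache (col_a, col_b, col_c) VALUES
    ('r1a', 'r1b', 'r1c'),
    ('r2a', 'r2b', 'r2c'),
    ('r3a', 'r3b', 'r3c');
-- ...repeated in chunks of a few hundred rows per statement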
Temporary
Are there any benefits to temporary tables other than the fact that they're deleted when the db connection closes?

I suggest you use MyISAM. Create your table with appropriate indexes for your query. Then disable keys, load the table, and enable keys.
I suggest you develop a discipline like this for your system. I've used a similar discipline very effectively.
Keep two copies of the table. Call one table_active and the second one table_loading.
When it's time to load a new copy of your data, use commands like this.
ALTER TABLE table_loading DISABLE KEYS;
/* do your insertions here, to table_loading */
/* consider using LOAD DATA INFILE if it makes sense. */
ALTER TABLE table_loading ENABLE KEYS; /* this will take a while */
/* at this point, suspend your software that's reading table_active */
RENAME TABLE table_active TO table_old;
RENAME TABLE table_loading TO table_active;
/* now you can resume running your software */
TRUNCATE TABLE table_old;
RENAME TABLE table_old TO table_loading;
Alternatively, you can DROP TABLE table_old; and create a new table for table_loading instead of the last rename.
This two-table (double-buffered) strategy should work pretty well. It will create some latency because your software that's reading the table will work on an old copy. But you'll avoid reading from an incompletely loaded table.
I suggest MyISAM because you won't run out of RAM and blow up and you won't have the fixed-row-length overhead or the transaction overhead. But you might also consider MariaDB and the Aria storage engine, which does a good job of exploiting RAM buffers.
If you do use the MEMORY storage engine, be sure to tweak the max_heap_table_size system variable. If your read queries will use index range scans (sequential index access), be sure to specify BTREE-style indexes. See here: http://dev.mysql.com/doc/refman/5.1/en/memory-storage-engine.html
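For example (a sketch only; the table name, columns, and size are placeholders, not a recommendation):
SET GLOBAL max_heap_table_size = 268435456;   -- 256 MB; applies to MEMORY tables created by new sessions
CREATE TABLE cache_tbl (
    id   INT NOT NULL,
    name VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    KEY idx_name (name) USING BTREE            -- BTREE instead of the MEMORY default (HASH), for range scans
) ENGINE=MEMORY;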

Related

mysql partitioning by key internal hashing function

We have a table partitioned by key (binary(16))
Is there any way to calculate, outside of MySQL, which partition a record will go to?
What is the hash function (not the linear one)?
The reason is to sort the CSV files outside MySQL, insert them in parallel into the right partitions with LOAD DATA INFILE, and then index in parallel too.
I can't find the function in the MySQL docs.
What's wrong with LINEAR? Are you trying to LOAD in parallel?
How many indexes do you have? If only that hash, sort the table, then load into a non-partitioned InnoDB with the PK already in place. Meanwhile, make sure every column uses the smallest possible datatype. How much RAM do you have?
If you are using MyISAM, consider MERGE. With that, you can load each partition-like table in a separate thread. When finished, construct the "merge" table that combines them.
What types of queries will you be using? Single row lookups by the BINARY(16)? Anything else might have big performance issues.
How much RAM? We need to tune either key_buffer_size or innodb_buffer_pool_size.
Be aware of the limitations. MyISAM defaults to a 7-byte data pointer and a 6-byte index pointer. 15TB would need only a 6-byte data pointer if the rows are DYNAMIC (a byte pointer), or 5 bytes if they are FIXED (a row number). So that could save 1 or 2 bytes per pointer. If anything is variable length, go with DYNAMIC; it would waste too much space (and probably not improve speed) to go FIXED. I don't know if the index pointer can be shrunk in your case.
Are you on 5.7? MySQL 8.0 removes partitioned MyISAM (partitioning there is limited to InnoDB and NDB). Meanwhile, MariaDB still handles MyISAM.
Will you first split the data by "partition"? Or send off INSERTs to different "partitions" one by one. (This choice adds some more wrinkles and possibly optimizations.)
Maybe...
Sort the incoming data by the binary version of MD5().
Split into chunks based on the first 4 bits (or do the split without sorting first). Be sure to run the LOAD DATA for any one 4-bit value in only one thread.
Have PARTITION BY RANGE with 16 partitions:
VALUES LESS THAN 0x1000000000000000
VALUES LESS THAN 0x2000000000000000
...
VALUES LESS THAN 0xF000000000000000
VALUES LESS THAN MAXVALUE
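A rough, untested sketch of that 16-way layout (the table and column names are invented; the boundary literals split on the top 4 bits of the key, and each pre-sorted chunk gets its own LOAD DATA run from its own thread):
CREATE TABLE t (
    hash BINARY(16) NOT NULL,
    val  INT NOT NULL,
    PRIMARY KEY (hash)
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (hash) (
    PARTITION p00 VALUES LESS THAN (0x10000000000000000000000000000000),
    PARTITION p01 VALUES LESS THAN (0x20000000000000000000000000000000),
    -- ... p02 through p14 continue the same pattern ...
    PARTITION p15 VALUES LESS THAN (MAXVALUE)
);
LOAD DATA INFILE '/tmp/chunk_00.csv' INTO TABLE t FIELDS TERMINATED BY ',';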
I don't know of a limit on the number of rows in a LOAD DATA, but I would worry about transactional locking becoming a problem if you go over, say, 10K rows at a time.
This technique may even work for a non-partitioned table.

Storage engine for large amounts of constantly inserted data which should be available instantly

Our server (several Java applications on Debian) handles incoming data (GNSS observations) that should be:
immediately (delay <200ms) delivered to other applications,
stored for further use.
Sometimes (several times a day, maybe) about a million archived records will be fetched from the database. Each record is about 12 double-precision fields plus a timestamp and some IDs. There are no UPDATEs; DELETEs are very rare but massive. The incoming flow is up to a hundred records per second. So I had to choose a storage engine for this data.
I tried using MySQL (InnoDB). One application inserts, others constantly check last record id and if it is updated, fetch new records. This part works fine. But I've met following issues:
Records are quite large (about 200-240 bytes per record).
Fetching a million archived records is unacceptably slow (tens of minutes or more).
File-based storage would work just fine (since there are no inserts in the middle of the data, and selections are mostly like 'WHERE ID=1 AND TIME BETWEEN 2000 AND 3000'), but there are other problems:
Looking for new data might be not so easy.
Other data like logs and configs are stored in same database and I prefer to have one database for everything.
Can you advise a suitable database engine (SQL preferred, but not necessary)? Maybe it is possible to fine-tune MySQL to reduce record size and fetch time for continuous strips of data?
MongoDB is not acceptable since DB size is limited on 32-bit machines. Any engine that does not provide quick access for recently inserted data is not acceptable too.
I'd recommend using the TokuDB storage engine for MySQL. It's free for up to 50GB of user data, and its pricing model isn't terrible, making it a great choice for storing large amounts of data.
It has higher insert speed than InnoDB and MyISAM and scales much better as the dataset grows (InnoDB tends to deteriorate once the working dataset no longer fits in RAM, making its performance dependent on the I/O of the HDD subsystem).
It's also ACID compliant and supports multiple clustered indexes (which would be a great choice for massive DELETEs you're planning to do). Also, hot schema changes are supported (ALTER TABLE doesn't lock the tables, and changes are quick on huge tables - I'm talking gigabyte-sized tables being altered in mere seconds).
From my personal use, I experienced about 5 - 10 times less disk usage due to TokuDB's compression, and it's much, much faster than MyISAM or InnoDB.
Even though it sounds like I'm trying to advertise this product - I'm not, it's just simply amazing since you can use monolithic data-store without expensive scaling plans like partitioning across nodes to scale the writes.
There really is no getting around how long it takes to load millions of records from disk. Your 32-bit requirement means you are limited in how much RAM you can use for memory based data structures. But, if you want to use MySQL, you may be able to get good performance using multiple table types.
If you need really fast non-blocking inserts, you can use the BLACKHOLE table type and replication: the server where the inserts occur has a BLACKHOLE table that replicates to another server where the same table is InnoDB or MyISAM.
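For example (a sketch; the table and column names are invented):
-- On the ingest server: BLACKHOLE stores nothing locally, so inserts return
-- almost immediately, but they are still written to the binary log.
CREATE TABLE gnss_obs (
    station_id INT NOT NULL,
    ts         DATETIME NOT NULL,
    x          DOUBLE NOT NULL,
    y          DOUBLE NOT NULL,
    z          DOUBLE NOT NULL,
    PRIMARY KEY (station_id, ts)
) ENGINE=BLACKHOLE;
-- On the replica, define the same table with ENGINE=MyISAM (or InnoDB); the
-- replicated inserts materialize the rows there, and reads go to the replica.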
Since you don't do UPDATEs, I think MyISAM would be better than InnoDB in this scenario. You can use the MERGE table type for MyISAM (not available for InnoDB). Not sure what your data set is like, but you could have one table per day (hour, week?); your MERGE table would then be a superset of those tables. Assuming you want to delete old data by day, just redeclare the MERGE table to not include the old tables. This action is instantaneous. Dropping old tables is also extremely fast.
To check for new data, you can look at "today's" table directly rather than going through the MERGE table.
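For example (a sketch; the table and column names are invented):
-- One MyISAM table per day, all with identical structure:
CREATE TABLE obs_20130601 (
    station_id INT NOT NULL,
    ts         DATETIME NOT NULL,
    val        DOUBLE NOT NULL,
    KEY (station_id, ts)
) ENGINE=MyISAM;
CREATE TABLE obs_20130602 LIKE obs_20130601;

-- The MERGE table presents them as one logical table:
CREATE TABLE obs_all (
    station_id INT NOT NULL,
    ts         DATETIME NOT NULL,
    val        DOUBLE NOT NULL,
    KEY (station_id, ts)
) ENGINE=MERGE UNION=(obs_20130601, obs_20130602) INSERT_METHOD=LAST;

-- "Deleting" the oldest day is just a redefinition plus a DROP, both nearly instant:
ALTER TABLE obs_all UNION=(obs_20130602);
DROP TABLE obs_20130601;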

Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?

So, one of my tables in MySQL which uses the InnoDB storage engine will contain multi-billion rows(with potentially no limit to how many will be inserted).
Can you tell me what sort of optimizations I can do to help speed things up?
Because with a few million rows already, it will start getting slow.
Of course, you could suggest using something else. The only options I have are PostgreSQL and SQLite3, but I've been told that SQLite3 is not a good choice for that.
As for PostgreSQL, I have absolutely no idea how it performs, as I've never used it.
I expect, though, at least about 1,000-1,500 inserts per second into that table.
A simple answer to your question would be: yes, InnoDB would be the perfect choice for a multi-billion-row data set.
There is a host of optimizations that are possible.
The most obvious optimization is setting a large buffer pool; the buffer pool is the single most important thing for InnoDB, because InnoDB caches both data and indexes in it. If you have a dedicated MySQL server with only InnoDB tables, then you should let InnoDB use up to 80% of the available RAM.
Another very important optimization is having proper indexes on the table (keeping in mind the data access/update pattern), both primary and secondary. (Remember that the primary key columns are automatically appended to every secondary index.)
With InnoDB there are some extra goodies, such as protection from data corruption, auto-recovery etc.
As for increasing write performance, you should set up your transaction (redo) log files to total up to 4G.
One other thing that you can do is partition the table.
You can eke out more performance by setting binlog_format to ROW, and setting innodb_autoinc_lock_mode to 2 (which ensures that InnoDB does not hold table-level locks when inserting into auto-increment columns).
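Pulled together, those settings might look roughly like this in my.cnf (values are illustrative only, assuming a dedicated 32 GB InnoDB-only server; they are not tuned for your workload):
[mysqld]
innodb_buffer_pool_size  = 24G   # roughly 80% of RAM on an InnoDB-only box
innodb_log_file_size     = 2G    # two log files by default, so 4G of redo log in total
binlog_format            = ROW
innodb_autoinc_lock_mode = 2     # interleaved mode: no table-level auto-increment locks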
If you need any specific advice you can contact me, I would be more than willing to help.
Optimizations:
Take care not to have too many indexes. They are expensive when inserting.
Make your datatypes fit your data, as tightly as you can (so don't go saving IP addresses in a TEXT or a BLOB, if you know what I mean). Look into VARCHAR vs CHAR. Don't forget that because VARCHAR is more flexible, you are trading away some things; if you know a lot about your data, it might help to use CHAR, or it might be clearly better to use VARCHAR, etc.
Do you read at all from this table? If so, you might want to do all the reading from a replicated slave, although your connection should be good enough for that amount of data.
If you have big inserts (aside from the number of inserts), make sure your IO is actually quick enough to handle the load.
I don't think there is any reason MySQL wouldn't support this. Things that can slow you down as you go from "thousands" to "millions" to "billions" are things like the aforementioned indexes. There is, as far as I know, no "MySQL is full" problem.
Look into Partial indexes. From wikipedia (quickest source I could find, didn't check the references, but I'm sure you can manage:)
MySQL as of version 5.4 does not support partial indexes.[3] In MySQL, the term "partial index" is sometimes used to refer to prefix indexes, where only a truncated prefix of each value is stored in the index. This is another technique for reducing index size.[4]
No idea on the MySQL/InnoDB part (I'd assume it'll cope). But if you end up looking at alternatives, PostgreSQL can manage a DB of unlimited size on paper. (At least one 32TB database exists according to the FAQ.)
Can you tell me what sort of optimizations I can do to help speed things up?
Your mileage will vary depending on your application. But with billions of rows, you're at least looking at partitioning your data in order to work on smaller tables.
In the case of PostgreSQL, you'd also look into creating partial indexes where appropriate.
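For example (a sketch; the table and column names are invented):
-- PostgreSQL: a true partial index, covering only the rows matching the predicate
CREATE INDEX idx_events_unprocessed ON events (created_at) WHERE processed = false;
-- MySQL's closest equivalent is a prefix index, which stores only the first N characters
CREATE INDEX idx_events_note ON events (note(20));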
You may want to have a look at:
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
http://forums.whirlpool.net.au/archive/954126
If you have a very large table (billions of records) and need to data-mine the table (queries that read lots of data), MySQL can slow to a crawl.
Large databases (200+GB) are fine, but they are bound by I/O, temp tables spilling to disk, and multiple other issues when attempting to read large groups of rows that don't fit in memory.

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1,000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that, when your current method is taking days), but you may be able to turn off or remove the uniqueness constraints, use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was much faster overall.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
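For example (a sketch; the path, table, and column names are made up):
LOAD DATA INFILE '/tmp/mined_rows.tsv'
    INTO TABLE target_table
    FIELDS TERMINATED BY '\t'
    (hash_key, col1, col2, col3);
-- Note: LOAD DATA offers REPLACE or IGNORE for duplicate keys, but not the additive
-- ON DUPLICATE KEY UPDATE logic, so a staging table may still be needed for that part.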
Sorry to keep throwing comments at you (last one, probably).
I just found this article, which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the ID field into a column in the staging table. Then I update where that ID column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row by row?
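Translated into MySQL terms, that staging flow might look roughly like this (an untested sketch; all names are invented, and MySQL's INSERT ... ON DUPLICATE KEY UPDATE stands in for the separate update/insert passes described above):
-- 1. Bulk-load the raw rows into a constraint-free staging table
CREATE TABLE staging (
    hash_key BINARY(16) NOT NULL,
    val      INT NOT NULL
) ENGINE=MyISAM;
LOAD DATA INFILE '/tmp/raw.tsv' INTO TABLE staging;

-- 2. Collapse duplicates inside staging first, away from the production table
CREATE TABLE staging_dedup ENGINE=MyISAM AS
    SELECT hash_key, SUM(val) AS val
    FROM staging
    GROUP BY hash_key;

-- 3. Upsert the deduplicated rows into production in one pass
INSERT INTO prod_table (hash_key, val)
    SELECT hash_key, val FROM staging_dedup
    ON DUPLICATE KEY UPDATE prod_table.val = prod_table.val + VALUES(val);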

Slow MySQL inserts

I am using and working on software which uses MySQL as a backend engine (it can use others such as PostgreSQL or Oracle or SQLite, but this is the main application we are using). The software was designed in such a way that the binary data we want to access is kept as BLOBs in individual columns (each table has one BLOB column; other columns have integers/floats to characterize the BLOB, and one string column holds the BLOB's MD5 hash). The tables typically have 2, 3 or 4 indexes, one of which is always the MD5 column, which is made UNIQUE. Some tables already have millions of entries and have grown to multiple gigabytes in size. We keep separate per-year MySQL databases on the same server (so far). The hardware is quite reasonable (I think) for general applications (a Dell PowerEdge 2U-form server).
MySQL SELECT queries are relatively fast. There's little complaint there, since these are (most of the time) run in batch mode. However, INSERT queries take a long time, which increases with table size (number of rows). Admittedly, this is because the MD5 column is UNIQUE, and so each INSERT has to figure out whether the new row has a corresponding, already-inserted MD5 string. And it's not too strange (I think) if the performance gets worse when there are other (non-unique) indexes. But I still can't put my mind to rest about this software architecture choice (I suspect keeping BLOBs in the table row instead of on disk has a significant negative impact). Insertions are not critical, but it is an annoying feeling to have.
Does anyone have experience with similar situations? With MySQL, or even other (preferably Linux-based) RDBMSes? Any insights you would care to provide, maybe some performance figures?
BTW, the working language is C++ (which wraps C calls to MySQL's API).
It could be time for vertical partitioning: moving the BLOB field into a separate table. In this article, in the section 'A Quick Side Note on Vertical Partitioning', the author removes a larger VARCHAR field from a table and it speeds up a query by about an order of magnitude.
The reason is that physically traversing the data on disk becomes significantly faster if there is less space to cover, so moving bigger fields elsewhere increases performance.
Also (and you probably do it already) it is beneficial to decrease the size of your index column to its absolute minimum (CHAR(32) in ASCII encoding for MD5), because the size of the key is directly proportional to the speed of its use.
If you do multiple inserts at a time with InnoDB tables, you can significantly increase the speed of inserts by wrapping them in a transaction and doing multiple inserts in one query:
START TRANSACTION;
INSERT INTO x (id, md5, field1, field2) VALUES (1, '123dab...', 'data1','data2'),(2,'ab2...','data3','data4'),.....;
COMMIT;
See Speed of INSERT Statements. Do you have frequent MD5 collisions? I believe these should not happen too many times, so maybe you can use something like INSERT ... ON DUPLICATE KEY UPDATE to handle the collisions. If you have specific insert periods, you can disable keys for the duration of the insert and restore them later. Another option is to use replication, with a master machine for the inserts and a slave for the selects.
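For example, something like this (a sketch; the table and column names are invented, and note that ALTER TABLE ... DISABLE KEYS only affects non-unique MyISAM indexes, so the UNIQUE MD5 index stays active):
INSERT INTO blobs (md5, size_bytes, blob_data)
VALUES ('9e107d9d372bb6826bd81d3542a419d6', 42, x'DEADBEEF')
ON DUPLICATE KEY UPDATE size_bytes = VALUES(size_bytes);

-- For a known bulk-insert window, suspend non-unique index maintenance temporarily:
ALTER TABLE blobs DISABLE KEYS;
-- ... bulk inserts here ...
ALTER TABLE blobs ENABLE KEYS;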
Are you using MyISAM?
AFAIK MyISAM has very good read performance, but bad write performance.
InnoDB should be balanced in speed.
Does your data fit in RAM? If not, get more RAM until that becomes uneconomic (16G is usually about the point for most people).
Then, do your indexes fit in the MyISAM key buffer?
If you're running a 32-bit OS, don't. Once you're on a 64-bit OS, set the key buffer to approximately 1/3 of the RAM. The rest of the RAM is used by the OS's cache to cache data files (which does little for inserts but is beneficial for selects).
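For example, on a dedicated 12 GB, 64-bit box this might look like (illustrative value only):
[mysqld]
key_buffer_size = 4G   # roughly 1/3 of RAM for MyISAM index blocks; the OS page cache handles the data files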
Having multi-gigabyte tables in MyISAM can be a pain, because in the event of an unclean shutdown very lengthy repair operations are required. But don't switch MySQL engines without significant validation of your application: it will change the behaviour in many ways (not just performance), and it will affect disc space usage.
I asked a somewhat-related question today as well.
One of the answers provided was to consider INSERT DELAYED, so that the insert goes into the insert queue and is handled when the DB is not as busy.