Cassandra write performance vs Relational Databases - relational-database

I am trying to grasp some performance differences between Cassandra and relational databases.
From what I have read, Cassandra's write performance remains constant regardless of data volume. By write performance, I am assuming this implies both new rows being added as well as existing rows being replaced on a key match (like an update in the relational world). Is that assumption correct?
Also, from what I understand about relational databases, updates get slower as tables/partitions become larger. This is because either a full table scan must be performed to locate the row, or an index lookup must be performed, and both of these take longer as the table or partition grows. So updates take progressively longer as the data volume of the table/partition increases?
When new data is inserted into a relational database, I know any indexes need to have the new data added, but there is no lookup involved, correct? So will inserts also become progressively slower as data volume increases, or will they stay constant with relational databases?
Thanks for any tips

They will become slower if the table has indexes. Not only must the data be written, but every index must be updated too. Inserting into a table that has no indexes and no constraints is lightning fast, because no checks need to be done: the record can just be written at the end of the table space.
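For illustration, here is a minimal sketch with two hypothetical tables; inserts into the second one carry the extra index-maintenance cost described above:
CREATE TABLE events_plain (
    id      INT NOT NULL,
    payload VARCHAR(255)
) ENGINE=InnoDB;
-- no indexes, no constraints: each INSERT simply appends a row

CREATE TABLE events_indexed (
    id      INT NOT NULL PRIMARY KEY,
    payload VARCHAR(255),
    KEY idx_payload (payload)
) ENGINE=InnoDB;
-- each INSERT must also maintain the primary key and the secondary index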

On the relational DB side, I've been doing load testing on our RDBMS where I can see that the performance drops exponentially as data is added to the DB.
I'm still working on a Cassandra setup so I can run a comparable test. In the meantime, this Cassandra presentation gives some info on Cassandra compared to MySQL:
http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql

Related

Does a lot of writing/inserting affect database indexes?

Does a database have to rebuild its indexes every time a new row is inserted?
And by that token, wouldn't that mean that if I was inserting a lot, the index would be rebuilt constantly and therefore be less effective, or even useless, for querying?
I'm trying to understand some of this database theory for better database design.
Updates definitely don't require rebuilding the entire index every time you update it (likewise for inserts and deletes).
There's a little bit of overhead to updating entries in an index, but it's reasonably low cost. Most indexes are stored internally as a B+Tree data structure. This data structure was chosen because it allows easy modification.
MySQL also has a further optimization called the Change Buffer. This buffer helps reduce the performance cost of updating indexes by caching changes. That is, you do an INSERT/UPDATE/DELETE that affects an index, and the type of change is recorded in the Change Buffer. The next time you read that index with a query, MySQL reads the Change Buffer as a kind of supplement to the full index.
A good analogy for this might be a published document that periodically publishes "errata" so you need to read both the document and the errata together to understand the current state of the document.
Eventually, the entries in the Change Buffer are gradually merged into the index. This is analogous to the errata being edited into the document for the next time the document is reprinted.
The Change Buffer is used only for secondary indexes. It doesn't do anything for primary key or unique key indexes. Updates to unique indexes can't be deferred, but they still use the B+Tree so they're not so costly.
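As a rough illustration (the variable names below are real MySQL settings; the value shown is the usual default), you can inspect and tune the change buffer like this:
SHOW VARIABLES LIKE 'innodb_change_buffer%';
-- buffer changes for inserts, delete-marking, and purges on secondary indexes
SET GLOBAL innodb_change_buffering = 'all';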
If you do OPTIMIZE TABLE or some types of ALTER TABLE changes that can't be done in-place, MySQL does rebuild the indexes from scratch. This can be useful to defragment an index after you delete a lot of the table, for example.
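For example, a rebuild of a hypothetical table could be triggered like this (for InnoDB, OPTIMIZE TABLE is internally mapped to a recreate plus analyze):
OPTIMIZE TABLE orders;
-- an explicit way to force a rebuild of an InnoDB table and its indexes
ALTER TABLE orders ENGINE=InnoDB;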
Yes, inserting affects them, but it's not as bad as you seem to think. Like most structures in relational databases, indexes are usually created and maintained with some extra space to accommodate growth, and are usually set up to increase that extra amount automatically when index space is nearly exhausted.
Rebuilding the index starts from scratch, and is different from adding entries to the index. Inserting a new row does not result in the rebuild of an index. The new entry gets added in the extra space mentioned above, except for clustered indexes which operate a little differently.
Most DB administrators also do a task called "updating statistics," which updates an internal set of statistics used by the query planner to come up with good query strategies. That task, performed as part of maintenance, also helps keep the query optimizer "in tune" with the current state of indexes.
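In MySQL, for instance, that maintenance step is just a statement (the table name here is hypothetical):
ANALYZE TABLE orders;   -- refreshes the index statistics used by the query planner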
There are enormous numbers of high-quality references on how databases work, both independent sites and those of the publishers of major databases. You literally can make a career out of becoming a database expert. But don't worry too much about your inserts causing troubles. ;) If in doubt, speak to your DBA if you have one.
Does that help address your concerns?

How to increase the performance of database schema creation?

For our testing environment, I need to set up and tear down a database multiple times (each test should run independently of any other).
The process is the following:
Create database schema and insert necessary data
Run test 1
Remove all tables in database
Create database schema and insert necessary data
Run test 2
Remove all tables in database
...
The schema and data are the same for each test in the test case.
Basically, this works. The big problem is that creating and clearing the database takes a lot of time. Is there a way to improve MySQL's performance for creating tables and inserting data? Or can you think of a different process for the tests?
Thanks for your help!
Optimize the logical design
The logical level is about the structure of the queries and tables themselves. Optimize this first. The goal is to access as little data as possible at the logical level.
Have the most efficient SQL queries
Design a logical schema that supports the application's needs (e.g. the types of the columns, etc.)
Make design trade-offs to support some use cases better than others
Relational constraints
Normalization
Optimize the physical design
The physical level deals with non-logical considerations, such as the types of indexes, the parameters of the tables, etc. The goal is to optimize the I/O, which is almost always the bottleneck. Tune each table to fit its needs: a small table can be kept permanently in the DBMS cache, a table with a low write rate can have different settings than a table with a high update rate so that it takes less disk space, and so on. Depending on the queries, different indexes can be used, data can be denormalized transparently with materialized views, etc.
Table parameters (allocation size, etc.)
Indexes (combined, types, etc.)
System-wide parameters (cache size, etc.)
Partitioning
Denormalization
Try first to improve the logical design, then the physical design. (The boundary between the two is admittedly vague, so one could argue about my categorization.)
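As a small illustration of one physical-design lever listed above (partitioning), a hypothetical range-partitioned MySQL table might look like this; the table and column names are made up:
CREATE TABLE measurements (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    recorded_at DATETIME NOT NULL,
    value       DOUBLE,
    PRIMARY KEY (id, recorded_at)          -- the partitioning column must be part of the PK
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(recorded_at)) (
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);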
Optimize the maintenance
The database must be operated correctly to stay as efficient as possible. This includes a few maintenance tasks that can have an impact on performance, e.g.
Keep statistics up to date
Re-sequence critical tables periodically
Disk maintenance
All the system stuff to have a server that rocks
Source: How to increase the performance of a Database?
I suggest you write all the needed operations into a script using shell, Perl, or Python (init_db).
For the first run, you can create, insert, and delete manually, then dump both the schema and data.
You can use bulk inserts, and DROP TABLE to delete data, to improve the overall performance.
Hope this can help you.
Instead of DROP TABLE + CREATE TABLE, just do TRUNCATE TABLE. This may, or may not, be faster; give it a try.
If you are INSERTing multiple rows each time, then either batch them (all rows in one INSERT), or use LOAD DATA. Either of these is much faster than row-by-row INSERTs.
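For example, a hedged sketch (the table name and file path are hypothetical):
TRUNCATE TABLE fixtures;                      -- empties the table without dropping and recreating it
INSERT INTO fixtures (id, name) VALUES        -- one multi-row INSERT instead of many single-row ones
    (1, 'alpha'),
    (2, 'beta'),
    (3, 'gamma');
LOAD DATA LOCAL INFILE '/path/to/fixtures.csv'
INTO TABLE fixtures
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';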
Also fast... If you have the initial data in another table (which you could keep permanently), then do
CREATE TABLE test SELECT * FROM perm_table;
... (run tests using `test`)
DROP TABLE test;

MySQL | Massive Data Insertion

We are using the MySQL 5.5 InnoDB engine for our database. One of the tables, which gets roughly equal numbers of SELECT and INSERT operations, will be receiving 100-150 million insert operations on a daily basis.
I have already read about MySQL partitioning and was planning to implement it, but before I do, I'd love to hear your thoughts. What is the best way to deal with this kind of challenge without compromising users' response time?
First of all, make sure the primary key is auto-increment, since it is the clustering index for InnoDB tables. If it is auto-increment, insertion is an append-only operation; if not, it's a random write, which is a major performance killer. Make sure the PK is small and you don't have unnecessary indexes. If possible, batch inserts, as updating the indexes is a large part of the cost of an insert.
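A minimal sketch of what that could look like (table and column names are hypothetical):
CREATE TABLE events (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- append-only clustered index
    device_id  INT UNSIGNED NOT NULL,
    payload    VARBINARY(255),
    created_at DATETIME NOT NULL,
    KEY idx_device (device_id)               -- keep secondary indexes to a minimum
) ENGINE=InnoDB;
-- batch rows into one statement to amortize per-statement overhead
INSERT INTO events (device_id, payload, created_at) VALUES
    (1, 0x01, NOW()), (2, 0x02, NOW()), (3, 0x03, NOW());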
Make sure other I/O settings make sense, like how often the data is actually flushed to the disk; you can put the binary log file on an SSD to ensure it's written as fast as possible.
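Two real MySQL/InnoDB settings that control this (the values shown are only illustrative and trade a little durability for write throughput):
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- flush the redo log roughly once per second instead of on every commit
SET GLOBAL sync_binlog = 0;                     -- let the OS decide when to sync the binary log to disk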
After all of this, it's common to separate reads from writes with master-slave replication, so that spikes in insert traffic do not affect reads (assuming it's OK to read potentially stale data).

Storage engine for large amounts of constantly inserted data which should be available instantly

Our server (several Java applications on Debian) handles incoming data (GNSS observations) that should be:
immediately (delay <200ms) delivered to other applications,
stored for further use.
Sometimes (maybe several times a day) about a million archived records will be fetched from the database. Each record is about 12 double-precision fields plus a timestamp and some IDs. There are no UPDATEs; DELETEs are very rare but massive. The incoming flow is up to a hundred records per second. So I have to choose a storage engine for this data.
I tried using MySQL (InnoDB). One application inserts, others constantly check the last record ID and, if it has advanced, fetch the new records. This part works fine. But I've run into the following issues:
Records are quite large (about 200-240 bytes per record).
Fetching a million archived records is unacceptably slow (tens of minutes or more).
File-based storage would work just fine (since there are no inserts into the middle of the DB and selections are mostly like 'WHERE ID=1 AND TIME BETWEEN 2000 AND 3000'), but there are other problems:
Looking for new data might not be so easy.
Other data like logs and configs are stored in the same database, and I prefer to have one database for everything.
Can you advise a suitable database engine (SQL preferred, but not required)? Or maybe it is possible to fine-tune MySQL to reduce record size and fetch time for continuous strips of data?
MongoDB is not acceptable since DB size is limited on 32-bit machines. Any engine that does not provide quick access to recently inserted data is not acceptable either.
I'd recommend using the TokuDB storage engine for MySQL. It's free for up to 50GB of user data, and its pricing model isn't terrible, making it a great choice for storing large amounts of data.
It has a higher insert speed than InnoDB and MyISAM and scales much better as the dataset grows (InnoDB tends to deteriorate once the working dataset no longer fits in RAM, making its performance dependent on the I/O of the disk subsystem).
It's also ACID compliant and supports multiple clustered indexes (which would be a great fit for the massive DELETEs you're planning to do). Also, hot schema changes are supported (ALTER TABLE doesn't lock the tables, and changes are quick on huge tables - I'm talking gigabyte-sized tables being altered in mere seconds).
From my personal use, I experienced about 5 - 10 times less disk usage due to TokuDB's compression, and it's much, much faster than MyISAM or InnoDB.
Even though it sounds like I'm trying to advertise this product - I'm not, it's simply amazing, since you can use a monolithic data store without expensive scaling schemes like partitioning across nodes to scale the writes.
There really is no getting around how long it takes to load millions of records from disk. Your 32-bit requirement means you are limited in how much RAM you can use for memory-based data structures. But if you want to use MySQL, you may be able to get good performance using multiple table types.
If you need really fast, non-blocking inserts, you can use the BLACKHOLE table type together with replication. The server where the inserts occur has a BLACKHOLE table that replicates to another server where the same table is InnoDB or MyISAM.
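A hedged sketch of that setup (table and column names are hypothetical); the ingest server discards rows locally but still writes them to the binary log, so they reach the replica:
CREATE TABLE observations (
    ts  DATETIME NOT NULL,
    dev INT NOT NULL,
    val DOUBLE
) ENGINE=BLACKHOLE;
-- on the replica, the same table is declared with a real engine, e.g.:
-- CREATE TABLE observations (ts DATETIME NOT NULL, dev INT NOT NULL, val DOUBLE) ENGINE=MyISAM;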
Since you don't do UPDATEs, I think MyISAM would be better than InnoDB in this scenario. You can use the MERGE table type for MyISAM (not available for InnoDB). Not sure what your data set is like, but you could have one table per day (hour? week?); your MERGE table would then be a superset of those tables. Assuming you want to delete old data by day, just redeclare the MERGE table not to include the old tables. This action is instantaneous. Dropping old tables is also extremely fast.
To check for new data, you can look at "today's" table directly rather than going through the MERGE table.
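A rough sketch of the idea, with hypothetical per-day tables:
CREATE TABLE obs_2014_06_01 (
    id  BIGINT NOT NULL,
    ts  DATETIME NOT NULL,
    val DOUBLE,
    KEY idx_id_ts (id, ts)
) ENGINE=MyISAM;
CREATE TABLE obs_2014_06_02 LIKE obs_2014_06_01;
CREATE TABLE obs_all (
    id  BIGINT NOT NULL,
    ts  DATETIME NOT NULL,
    val DOUBLE,
    KEY idx_id_ts (id, ts)
) ENGINE=MERGE UNION=(obs_2014_06_01, obs_2014_06_02) INSERT_METHOD=LAST;
-- "deleting" an old day is a quick redeclaration followed by a fast DROP TABLE
ALTER TABLE obs_all UNION=(obs_2014_06_02);
DROP TABLE obs_2014_06_01;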

Is InnoDB (MySQL 5.5.8) the right choice for multi-billion rows?

So, one of my tables in MySQL, which uses the InnoDB storage engine, will contain multiple billions of rows (with potentially no limit to how many will be inserted).
Can you tell me what sort of optimizations I can do to help speed things up?
Because with just a few million rows, it will already start getting slow.
Of course, you could suggest using something else; the only options I have are PostgreSQL and SQLite3. But I've been told that SQLite3 is not a good choice for that.
As for PostgreSQL, I have absolutely no idea how it performs, as I've never used it.
I do expect, though, at least about 1000-1500 inserts per second into that table.
A simple answer to your question would be: yes, InnoDB would be the perfect choice for a multi-billion-row data set.
There is a host of optimizations that are possible.
The most obvious optimization would be setting a large buffer pool, as the buffer pool is the single most important thing when it comes to InnoDB, because InnoDB buffers the data as well as the indexes in the buffer pool. If you have a dedicated MySQL server with only InnoDB tables, then you should let InnoDB use up to 80% of the available RAM.
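For example (innodb_buffer_pool_size is a real InnoDB setting; the value below is only illustrative):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';   -- current size, in bytes
-- In MySQL 5.5 this is not dynamic; set it in my.cnf and restart, e.g.:
--   innodb_buffer_pool_size = 12G    -- roughly 80% of RAM on a dedicated InnoDB server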
Another very important optimization is having proper indexes on the table (keeping the data access/update pattern in mind), both primary and secondary. (Remember that the primary key columns are automatically appended to every secondary index.)
With InnoDB there are some extra goodies, such as protection from data corruption, auto-recovery etc.
As for increasing write performance, you should set up your transaction log files to be up to a total of 4G.
One other thing that you can do is partition the table.
You can eke out more performance by setting binlog_format to "row" and setting innodb_autoinc_lock_mode to 2 (which ensures that InnoDB does not hold table-level locks when inserting into auto-increment columns).
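These are real MySQL settings; the values shown are only illustrative, and the InnoDB ones require a server restart in 5.5:
SET GLOBAL binlog_format = 'ROW';
-- in my.cnf (restart required):
--   innodb_log_file_size     = 2000M   -- two log files of ~2G approach the 4G combined limit
--   innodb_autoinc_lock_mode = 2       -- "interleaved" mode: no table-level AUTO_INCREMENT locks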
If you need any specific advice you can contact me, I would be more than willing to help.
optimizations
Take care not to have too many indexes. They are expensive when inserting.
Make your datatypes fit your data, as tightly as you can (so don't go saving IP addresses in a TEXT or a BLOB, if you know what I mean). Look into VARCHAR vs CHAR. Don't forget that because VARCHAR is more flexible, you are trading away some things. If you know a lot about your data, it might help to use CHARs, or it might be clearly better to use VARCHARs, etc.
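A small sketch of the "tight datatypes" idea (the table name is hypothetical; INET_ATON/INET_NTOA are real MySQL functions): store IPv4 addresses as 4-byte integers instead of strings:
CREATE TABLE visits (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip         INT UNSIGNED NOT NULL,    -- 4 bytes instead of VARCHAR(15)
    visited_at DATETIME NOT NULL
) ENGINE=InnoDB;
INSERT INTO visits (ip, visited_at) VALUES (INET_ATON('192.168.0.1'), NOW());
SELECT INET_NTOA(ip) FROM visits WHERE id = 1;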
Do you read at all from this table? If so, you might want to do all the reading from a replicated slave, although your connection should be good enough for that amount of data.
If you have big inserts (aside from the number of inserts), make sure your IO is actually quick enough to handle the load.
I don't think there is any reason MySQL wouldn't support this. Things that can slow you down as you go from "thousands" to "millions" to "billions" are things like the aforementioned indexes. There is, as far as I know, no "MySQL is full" problem.
Look into partial indexes. From Wikipedia (the quickest source I could find; I didn't check the references, but I'm sure you can manage):
MySQL as of version 5.4 does not support partial indexes.[3] In MySQL, the term "partial index" is sometimes used to refer to prefix indexes, where only a truncated prefix of each value is stored in the index. This is another technique for reducing index size.[4]
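A prefix index of the kind mentioned in the quote looks like this (table and column names are hypothetical):
CREATE INDEX idx_email_prefix ON users (email(10));   -- only the first 10 characters of each value are stored in the index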
No idea on the MySQL/InnoDB part (I'd assume it'll cope). But if you end up looking at alternatives, PostgreSQL can manage a DB of unlimited size on paper. (At least one 32TB database exists according to the FAQ.)
Can you tell me what sort of optimizations i can do to help speed up things?
Your mileage will vary depending on your application. But with billions of rows, you're at least looking at partitioning your data, in order to work on smaller tables.
In the case of PostgreSQL, you'd also look into creating partial indexes where appropriate.
You may want to have a look at:
http://www.mysqlperformanceblog.com/2006/06/09/why-mysql-could-be-slow-with-large-tables/
http://forums.whirlpool.net.au/archive/954126
If you have a very large table (billions of records) and need to data mine the table (queries that read lots of data), MySQL can slow to a crawl.
Large databases (200+ GB) are fine, but they are bound by I/O, temp tables spilling to disk, and multiple other issues when attempting to read large groups of rows that don't fit in memory.