I know this question has been asked multiple times on Stack Overflow. I am posting it to find out what would be the best choice for my design.
I have the following schema for my job details:
_unique_key varchar(256) NULL
_job_handle varchar(256) NULL
_data varchar(1024) NULL
_user_id int(11) NULL
_server_ip varchar(39) NULL
_app_version varchar(256) NULL
_state int(11) NULL
_is_set_stopped bool
What operations we are doing on this table:
For each job we will have one update and 10 select queries on this table, so we need high read and write throughput.
There are many applications manipulating this table by filtering on:
_unique_key
_state
_is_set_stopped
_user_id
The _data field size varies from 5 KB to 1 MB depending on the type of application and user.
Applications can update selected attributes.
Solutions we considered:
MySQL InnoDB
I think MySQL will not scale well enough given the high read and write requirements.
MySQL In Memory Table
Problems with this solution:
It doesn't support dynamic field size. MEMORY tables use a fixed-length row-storage format. Variable-length types such as VARCHAR are stored using a fixed length. Source http://dev.mysql.com/doc/refman/5.0/en/memory-storage-engine.html
SELECT ... FOR UPDATE will lock the entire table. I don't know whether that will be a problem.
Redis
Redis looks like a good choice, but I think my table is not a good fit for a key-value cache server.
It supports only a very limited set of data types. I can only store strings in a list; I would need to store my fields as JSON or in some other format.
If clients want to update a particular attribute, they need to download the full value, parse the object, and push it back to the server.
Maybe I am wrong; is there a way to do that?
Filtering based on value will not be possible.
Maybe I am wrong; is there a way to do that?
MySQL InnoDB on TMPFS file system
This looks promising, but I don't know whether it will scale as well as Redis or a MySQL in-memory table.
In this question, you are confusing raw performance (i.e. efficiency) with scalability. They are different concepts.
Between the InnoDB and MEMORY engines, InnoDB is likely to be the more scalable. InnoDB supports multi-version concurrency control and has plenty of optimizations to deal with contention, so it will handle concurrent accesses much better than the MEMORY engine, even if it may be slower in some I/O-bound situations.
Redis is a single-threaded server. All the operations are serialized. It has zero scalability. It does not mean it is inefficient. On the contrary, it will likely support more connections than MySQL (due to its epoll-based event loop) and more traffic (due to its very efficient lock-free implementation and in-memory data structures).
To answer your question, I would give MySQL with InnoDB a try. If it is properly configured (no synchronous commit, a large enough buffer pool, etc.), it can sustain a good throughput. And instead of running it on top of tmpfs, I would consider SSD hardware.
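For instance, "properly configured" would mean something along these lines in my.cnf (the values are only illustrative and depend on your hardware):
# relax synchronous commit: write the redo log on commit, fsync it about once per second
innodb_flush_log_at_trx_commit = 2
# size the buffer pool so the working set fits in memory
innodb_buffer_pool_size = 8G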
Now, if you prefer to use Redis (which is not a relational store btw), you can certainly do it. There is no need to systematically serialize/deserialize your data. And filtering is indeed possible, provided you can anticipate all access paths and find an adapted data structure.
For instance:
one hash object per job. The key is _unique_key. The fields of the hash should correspond to the columns of your relational table.
one set per state value
2 sets for is_set_stopped
one set per userid value
For each job insertion, you need to pipeline the following commands:
HMSET job:AAA job_handle BBB data CCC user_id DDD server_ip EEE app_version FFF state GGG is_set_stopped HHH
SADD state:GGG AAA
SADD is_set_stopped:HHH AAA
SADD user_id:DDD AAA
You can easily update any field individually provided you maintain the corresponding sets.
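For instance, updating the state of job AAA from GGG to GGG2 could be pipelined as a small transaction (same placeholder values as above):
MULTI
HSET job:AAA state GGG2
SREM state:GGG AAA
SADD state:GGG2 AAA
EXEC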
You can perform filtering queries by intersecting the sets. For instance:
SINTER is_set_stopped:HHH state:GGG
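You can intersect more sets to filter on several attributes at once, still with the placeholder values above:
SINTER is_set_stopped:HHH state:GGG user_id:DDD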
With Redis, the bottleneck will likely be the network, especially if the data field is large. I hope you will have more 5 KB jobs than 1 MB jobs. For instance, 1000 writes/s of 1 MB objects represents 8 Gbit/s, probably more than your network can sustain. This is true for both Redis and MySQL.
I suggest PostgreSQL; it's more capable than MySQL (more features, better support for complex queries and data types) and has a lot of tuning options.
If you give PostgreSQL enough memory and tune the parameters right, it will cache everything in memory.
Alternatively, you could run it on tmpfs if that's your preference and use streaming replication to an on-disk database for a hard copy.
Streaming replication has three operating modes: asynchronous, on receive, and on fsync. If you use the first one (async), you don't have to wait for a sync to disk on the replication server, so any updates will be very fast with tmpfs.
Since you also seem to have a lot of text fields, another feature might help: PostgreSQL can store a text-search vector (tsvector) on a row, you can add an index on it, and you can keep it updated via a trigger with the concatenated content of all the columns you are searching on. That will give you an incredible performance boost when doing text search on multiple columns, compared to any way you could write that in MySQL.
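A minimal sketch of that approach, assuming a jobs table with text columns named data and app_version (the names are illustrative):
ALTER TABLE jobs ADD COLUMN tsv tsvector;
CREATE INDEX jobs_tsv_idx ON jobs USING gin(tsv);
CREATE TRIGGER jobs_tsv_update BEFORE INSERT OR UPDATE ON jobs
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', data, app_version);
-- example query against the indexed vector
SELECT * FROM jobs WHERE tsv @@ to_tsquery('english', 'some & words');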
Regardless of the database you use:
You state that _data is varchar(1024), yet you say it contains 5 KB to 1 MB of data. Is this actually a blob? Even if the length was a mistake, MySQL doesn't support VARCHAR fields longer than 65,535 bytes! Since that column is presumably not updated as often as the others, it might be wise to split this into two tables, one with the static data and one with the dynamic data, to minimize disk access.
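For example, a possible split, reusing the column names from the question (the exact types are a judgment call):
CREATE TABLE job_static (
  _unique_key  VARCHAR(256) PRIMARY KEY,
  _job_handle  VARCHAR(256),
  _user_id     INT,
  _server_ip   VARCHAR(39),
  _app_version VARCHAR(256),
  _data        MEDIUMTEXT  -- must hold up to 1 MB; TEXT in PostgreSQL
);
CREATE TABLE job_state (
  _unique_key     VARCHAR(256) PRIMARY KEY,
  _state          INT,
  _is_set_stopped BOOL,
  FOREIGN KEY (_unique_key) REFERENCES job_static (_unique_key)
);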
I am experiencing an interesting situation, and although it is not an actual problem, I can't understand why it is happening.
We had a Mongo database consisting mainly of some bulk data stored in an array. Since over 90% of the team is familiar with MySQL while only a few of us are familiar with Mongo, and since it is not a critical DB and all queries are done over two of the fields (client or product), we decided to move the data into MySQL, into a table like this:
[idProduct (bigint unsigned), idClient (bigint unsigned), data (json)]
where data is a huge JSON document containing hundreds of attributes and their values.
We also partitioned it into 100 partitions by a hash over idClient:
PARTITION BY HASH(idClient)
PARTITIONS 100;
All is working fine but I noticed an interesting fact:
The original Mongo DB was about 70 GB, give or take. The MySQL version (actually containing less data, because we removed some duplicates that we were using as indexes in Mongo) is over 400 GB.
Why does it take so much more space? In theory BSON should actually be slightly larger than JSON (at least in most cases). Even if indexes are larger in MySQL, the difference is huge (over 5x).
I did a presentation, How to Use JSON in MySQL Wrong (video), in which I imported the Stack Overflow data dump into JSON columns in MySQL. I found that the data I tested with took 2x to 3x as much space as importing the same data into normal tables and columns, using conventional data types for each column.
JSON uses more space for the same data, for example because it stores integers and dates as strings, and also because it stores key names on every row, instead of just once in the table header.
That's comparing JSON in MySQL vs. normal columns in MySQL. I'm not sure how MongoDB stores data and why it's so much smaller. I have read that MongoDB's WiredTiger engine supports options for compression, and snappy compression is enabled by default since MongoDB 3.0. Maybe you should enable compressed format in MySQL and see if that gives you better storage efficiency.
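For example, enabling the compressed row format looks roughly like this (the table name is a placeholder; older MySQL versions also need innodb_file_per_table=ON and innodb_file_format=Barracuda):
ALTER TABLE product_data ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;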
JSON in MySQL is stored like TEXT/BLOB data, in that it gets mapped into a set of 16KB pages. Pages are allocated one at a time for the first 32 pages (that is, up to 512KB). If the content is longer than that, further allocation is done in increments of 64 pages (1MB). So it's possible if a single TEXT/BLOB/JSON content is say, 513KB, it would allocate 1.5MB.
I think the main reason could be that internally Mongo stores JSON as BSON (http://bsonspec.org/), and the spec stresses that this representation is lightweight.
The WiredTiger Storage Engine in MongoDB uses compression by default. I don't know the default behavior of MySQL.
Unlike MySQL, MongoDB is designed to store JSON/BSON; in fact, it does not store anything else. So this kind of "competition" might be a bit unfair for MySQL, which stores JSON like TEXT/BLOB data.
If you had relational data, i.e. column-based values, then most likely MySQL would be smaller, as stated by @Bill Karwin. However, with smart bucketing in MongoDB you can reduce the data size significantly.
MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL was set up accordingly). One can create indexes on any fields.
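For example, I mean something like this (the table and column names are made up):
CREATE TEMPORARY TABLE volatile_cache (
  k VARCHAR(64) NOT NULL PRIMARY KEY,
  v VARCHAR(255),
  KEY idx_v (v)
) ENGINE=MEMORY;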
Redis stores data in memory, indexed by one key at a time, and in my understanding MySQL can do this job too.
Is there anything that makes Redis better for storing a large amount (100-200k rows) of volatile data? The only explanation I can find for Redis's popularity is that not every project has MySQL in it, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to also set up Redis?
Redis is like working with indexes directly. There's no ACID layer, SQL parser, or many other things between you and the data.
It provides some basic data structures that are specifically optimized to be held in memory, along with specific operations to read and modify them.
On the other hand, Redis isn't designed for querying data (although you can implement very powerful and high-performance filters with SORT, SCAN, set intersections, and other operations) but for storing the data in the shape in which it will be consumed later. If you want to get, for example, customers sorted by 3 different criteria, you'll need to fill 3 different sorted sets, as sketched below. There are a lot of use cases for the other data structures, but I would end up writing a book in an answer...
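A hypothetical sketch (the key names and scores are made up):
ZADD customers:by_revenue 1520.50 customer:42
ZADD customers:by_signup_date 20150321 customer:42
ZADD customers:by_last_login 1426948000 customer:42
ZREVRANGE customers:by_revenue 0 9
The last command would return the ten customers with the highest revenue.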
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you need to use Redis instead of temporary tables in MySQL (and other engines that have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and gives you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.
I need to store telemetry data that is being generated every few minutes from over 10000 nodes (which may increase), each supplying the data over the internet to a server for logging. I'll also need to query this data from a web application.
I'm having a bit of trouble deciding what the best storage solution would be.
Each node has a unique ID, and there will be a timestamp for each packet of variables (which will probably need to be generated by the server).
The telemetry data has all of the variables in the same packet, so conceptually it could easily be stored in a single database table with a column per variable. The serial number + timestamp would suffice as a key.
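Conceptually, something like this (the variable columns are just placeholders):
CREATE TABLE telemetry (
  device_id BIGINT UNSIGNED NOT NULL,
  ts        DATETIME NOT NULL,
  var1      FLOAT,
  var2      FLOAT,
  var3      FLOAT,
  PRIMARY KEY (device_id, ts)
);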
The size of each telemetry packet is 64 bytes, including the device ID and timestamp, so that works out to around 100 GB+ per year.
I'd want to be able to query the data to get variables across time ranges and also store aggregate reports of this data so that I can draw graphs.
Now, how best to handle this? I'm pretty familiar with MySQL, so I'm leaning towards it. If I were to go with MySQL, would it make sense to have a separate table for each device ID? Would this make queries much faster, or would having tens of thousands of tables be a problem?
I don't think querying the variables from all devices in one go is going to be needed, but it might be. Or should I just stick it all in a single table and use MySQL Cluster if it gets really big?
Or is there a better solution? I've been looking around at some non relational databases but can't see anything that perfectly fits the bill or looks very mature. MongoDB for example would have quite a lot of size overhead per row and I don't know how efficient it would be at querying the value of a single variable across a large time range compared to MySQL. Also MySQL has been around for a while and is robust.
I'd also like it to be easy to replicate the data and back it up.
Any ideas? If anyone has done anything similar, your input would be greatly appreciated!
Have you looked at time-series databases? They're designed for the use case you're describing and may actually end up being more efficient in terms of space requirements due to built-in data folding and compression.
I would recommend looking into implementations using HBase or Cassandra for raw storage as it gives you proven asynchronous replication capabilities and throughput.
HBase time-series databases:
OpenTSDB
KairosDB
Axibase Time-Series Database - my affiliation
If you want to go with MySQL, keep in mind that although it will easily keep up with something like 100 GB per year on modern hardware, you cannot execute schema changes afterwards (on a live system). This means you'll have to have a good, complete database schema to begin with.
I don't know whether this telemetry data might grow more fields, but if it does, you don't want to have to lock your database for hours just to add a column or an index.
However, some tools such as pt-online-schema-change (http://www.percona.com/doc/percona-toolkit/pt-online-schema-change.html) are available nowadays which make such changes somewhat easier. No performance problems are to be expected here, as long as you stay with InnoDB.
Another option might be to go with PostgreSQL, which allows you to change schemas online and is sometimes somewhat smarter about the use of indexes. (For example, index condition pushdown, http://kb.askmonty.org/en/index-condition-pushdown, is a relatively new trick for MySQL/MariaDB that lets more of the WHERE clause be evaluated inside the index scan; PostgreSQL has been doing that kind of thing for a long time.)
Regarding overhead: you will be storing your 64 bytes of telemetry data in an unpacked form, probably, so your records will take more than 64 bytes on disk. Any kind of structured storage will suffer from this.
If you go with an SQL solution, backups are easy: just dump the data and you can restore it afterwards.
I ran a lookup test against an indexed MySQL table containing 20,000,000 records, and according to my results, it takes 0.004 seconds to retrieve a record given an id, even when joining against another table containing 4,000 records. This was on a 3 GHz dual-core machine, with only one user (me) accessing the database. Writes were also fast, as it took under ten minutes to create all 20,000,000 records in this table.
Assuming my test was accurate, can I expect performance to be as snappy on a production server, with, say, 200 users concurrently reading from and writing to this table?
I assume InnoDB would be best?
That depends on the storage engine you're going to use and what the read/write ratio is.
InnoDB will be better if there are a lot of writes. If it's mostly reads with the very occasional write, MyISAM might be faster. MyISAM uses table-level locking, so it locks up the whole table whenever you need to update; InnoDB uses row-level locking, so you can have concurrent updates on different rows.
InnoDB is definitely safer, so I'd stick with it anyhow.
BTW, remember that RAM is very cheap right now, so buy a lot of it.
Depends on any number of factors:
Server hardware (Especially RAM)
Server configuration
Data size
Number of indexes and index size
Storage engine
Writer/reader ratio
I wouldn't expect it to scale that well. More importantly, this kind of thing is too important to speculate about. Benchmark it and see for yourself.
Regarding the storage engine, I wouldn't dare to use anything but InnoDB for a table of that size that is both read from and written to. If you run any write query that isn't a primitive insert or a single-row update, with MyISAM you'll end up locking the table, which yields terrible performance as a result.
There's no reason MySQL couldn't handle that kind of load without significant issues. There are a number of other variables involved, though (otherwise, it's a 'how long is a piece of string' question). Personally, I've had a number of tables in various databases that are well beyond that range.
How large is each record (on average)?
How much RAM does the database server have, and how much of it is allocated to the various MySQL/InnoDB buffers?
A default configuration may only allow for an 8 MB buffer between disk and client (which might work fine for a single user), but trying to fit a 6 GB+ database through that is doomed to failure. That problem was real, by the way, and was causing several crashes a day of a database/website until I was brought in to troubleshoot it.
If you are likely to do a great deal more with that database, I'd recommend getting someone with a little more experience, or at least doing what you can to give it some optimisations. Reading 'High Performance MySQL, 2nd Edition' is a good start, as is looking at tools like Maatkit.
As long as your schema design and DAL are constructed well enough, you understand query optimization inside out, can adjust all the server configuration settings at a professional level, and have "enough" hardware properly configured, yes (except for sufficiently pathological cases).
Same answer for both engines.
You should probably perform a load test to verify, but as long as the index was created properly (meaning indexes are optimized to your query statements), the SELECT queries should perform at an acceptable speed (the INSERTS and/or UPDATES may be more of a speed issue though depending on how many indexes you have, and how large the indexes get).
For reasons that are irrelevant to this question, I'll need to run several SQLite databases instead of the more common MySQL for some of my projects. I would like to know how SQLite compares to MySQL in terms of speed and performance regarding disk I/O (the database will be hosted on a USB 2.0 pen drive).
I've read the Database Speed Comparison page at http://www.sqlite.org/speed.html and I must say I was surprised by the performance of SQLite, but since those benchmarks are a bit old I was looking for a more up-to-date benchmark (SQLite 3 vs MySQL 5); again, my main concern is disk performance, not CPU/RAM.
Also, since I don't have much experience with SQLite, I'm curious whether it has anything similar to the TRIGGER (on update, delete) events of the InnoDB MySQL engine. I also couldn't find any way to declare a field as UNIQUE like MySQL has, only PRIMARY KEY; is there anything I'm missing?
As a final question I would like to know if a good (preferably free or open source) SQLite database manager exists.
A few questions in there:
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference. There might be a few small things, but I think it's mostly just whether the database can read/write data as fast as your application wants it to. Since you'd be using the same amount of data with either MySQL or SQLite, I'd think it won't change much.
SQLite does support triggers: CREATE TRIGGER Syntax
SQLite does support UNIQUE constraints: column constraint definition syntax.
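A minimal SQLite sketch showing both (the table and column names are made up):
CREATE TABLE users (
  id         INTEGER PRIMARY KEY,
  email      TEXT UNIQUE,   -- UNIQUE column constraint
  updated_at TEXT
);
CREATE TRIGGER users_touch AFTER UPDATE ON users
BEGIN
  UPDATE users SET updated_at = datetime('now') WHERE id = NEW.id;
END;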
To manage my SQLite databases, I use the Firefox Add-on SQLite Manager. It's quite good, does everything I want it to.
"In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference."
In MySQL/MyISAM the data is stored unordered, so range reads on the primary key will theoretically need to issue several HDD seek operations.
In MySQL/InnoDB the data is sorted by primary key, so range reads on the primary key will be done using one disk seek operation (in theory).
To sum that up:
MyISAM - data is written to the HDD unordered. Primary-key range reads are slow if the primary key is not an AUTO_INCREMENT field.
InnoDB - data is ordered; bad for flash drives (data needs to be re-ordered after inserts, which means additional writes). Very fast for primary-key range reads, slower for writes.
InnoDB is not well suited to flash memory: seeks are already very fast (so you won't get much benefit from ordering the data), and the additional writes needed to maintain the order wear out flash memory.
MyISAM vs. InnoDB makes a huge difference for conventional and flash drives (I don't know about SQLite), but I'd rather use MySQL/MyISAM here.
I actually prefer using SQLiteSpy http://www.portablefreeware.com/?id=1165 as my SQLite interface.
It supports things like REGEXP which can come in handy.