When is it time to switch to NoSQL? - mysql

I am dealing with a large database that is collecting historical pricing data. The schema is relatively simple and does not change.
Something like:
SKU (char), type(enum), price(double), datetime(datetime)
The issue is that this table now has over 500,000,000 rows and is around 20 GB and growing. It is already getting a bit difficult to run queries. One common query is to get all SKUs for a specific date range, which might return around 500,000 records. Add any complexity like a GROUP BY, and you can forget it.
This db is mostly writes, but we obviously need to crunch the data and run queries occasionally. I understand that better index planning can help speed up the queries, but I am wondering if this is the type of data that would benefit from a NoSQL solution like MongoDB. Can I expect MySQL (probably moving to MariaDB) to continue to work for us, even after it grows beyond 100-200 GB in size? Or should I explore alternatives before things get unwieldy?

NoSQL is not a solution to a "large database" problem. NoSQL, and document databases in particular, is designed for scenarios where the nature of the data you're storing varies, so you don't want to define rigid schemas and relationships up front.
What you have is simple, well-defined data. This is ideally suited to a relational database, but for something at that scale I would recommend looking at something commercial (e.g. SQL Server or Oracle, depending on your platform). The databases I work with in SQL Server are around four terabytes in size, with several tables holding hundreds of millions of records like yours. A relational database can easily accommodate the simple data you've outlined.

You actually have an ideal use case for SQL, and a rather bad fit for NoSQL. MySQL devs report people using databases of 5,000,000,000 records. Some other SQL servers are even more scalable than that. However, without proper index support, even a fraction of that will be impossible to manage.
BTW, what is your table schema, including indices?
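For illustration, here is a minimal sketch of a table plus a covering index that would serve the date-range query described in the question. The column names, CHAR length, and ENUM values are invented for the example, since the question does not give them; only the shape matches the described schema.

-- Hypothetical layout of the pricing table; the composite index leads on the
-- datetime column so a date-range scan can be answered from the index alone.
CREATE TABLE price_history (
    sku        CHAR(20)          NOT NULL,
    price_type ENUM('bid','ask') NOT NULL,   -- placeholder enum values
    price      DOUBLE            NOT NULL,
    recorded   DATETIME          NOT NULL,
    KEY idx_recorded_sku (recorded, sku, price)
) ENGINE=InnoDB;

-- The common query ("all SKUs from a specific date range") then becomes an
-- index range scan rather than a full-table scan:
SELECT sku, price
FROM price_history
WHERE recorded >= '2013-01-01' AND recorded < '2013-02-01';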

You could switch to MariaDB and then use the Spider engine. The Spider engine makes it possible to split your data across multiple MariaDB instances without losing the ability to run queries against your existing instance.
So you can define your own rules for partitioning and then create one instance per partition. In the end you have multiple instances of MariaDB, but all your records are virtually combined into one table by the Spider engine.
The performance gain comes from splitting your data across multiple instances, which reduces the number of records per table and per instance, and of course from using more hardware resources.
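As a rough sketch of that idea (the backend hosts, credentials, and names below are made up, and the exact Spider COMMENT keys should be treated as an assumption to verify against the Spider documentation):

-- Register the backend MariaDB instances that will each hold one partition.
CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
    OPTIONS (HOST '10.0.0.1', DATABASE 'prices', USER 'spider_user', PASSWORD 'secret', PORT 3306);
CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
    OPTIONS (HOST '10.0.0.2', DATABASE 'prices', USER 'spider_user', PASSWORD 'secret', PORT 3306);

-- A Spider table on the front-end instance: each partition's rows live on a
-- different backend, but queries see one logical table.
CREATE TABLE price_history (
    sku      CHAR(20) NOT NULL,
    price    DOUBLE   NOT NULL,
    recorded DATETIME NOT NULL
) ENGINE=SPIDER
  COMMENT='wrapper "mysql", table "price_history"'
  PARTITION BY RANGE (YEAR(recorded)) (
      PARTITION p2012 VALUES LESS THAN (2013) COMMENT = 'srv "backend1"',
      PARTITION p2013 VALUES LESS THAN (2014) COMMENT = 'srv "backend2"'
  );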

Related

How to handle a table with billions of rows and lots of read and write operations

Please guide me through my problem.
I receive data every second at my server from different sources. The data is structured; I parse it and store the parsed data in a single table, at around 5 lakh (500,000) records a day. I also do a lot of read operations on this table daily. After some time this table will have billions of records.
How should I solve this problem? Should I go with an RDBMS, HBase, or some other option?
My question is what sort of storage you want the database to use: RAM? Flash? Disk?
RAM responds in nanoseconds.
Flash in microseconds.
Disk in milliseconds.
And, of course, you might want to create a hybrid of all three, especially if some keys were "hotter" than others -- more likely to be read over and over.
If you want to do a lot of fast processing, and scale it "wide" (many CPUs in a cluster for faster read performance), you are a likely candidate for a NoSQL database. I'd need to know more about your data model to know whether it would work as a key-value store, and how it might require more internal structure such as JSON/BSON.
Caveat: I am biased towards Aerospike, my employer. Yet you should do some kicking-of-the-tires with us or any other key-value stores you're considering to see if it would work with your data before betting the farm. Obviously, each NoSQL vendor would claim itself to be "the best," but much depends on your use case. A vendor's "solution" will only work well for certain data models. We tend to be best for fast in-memory RAM/Flash or hybrid implementations.
If your table is going to reach billions of records, an RDBMS definitely won't scale.
Regarding HBase, whether it would be a good solution depends on your requirements.
If you are looking for real-time reads, HBase only helps if you are looking up a specific key. If you want to do random reads on different columns, HBase won't be an ideal solution. HBase does scale really well for updates.
I would suggest designing your HBase schema carefully and storing your data in a way that suits your querying.
However, if you are interested in running aggregation queries, you can also map your HBase table to an external table in Hive and run SQL-style queries on your data.
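For example, such a mapping looks roughly like this in Hive (the HBase table name and column family below are made up for the sketch):

-- Assumes an existing HBase table 'events' with a column family 'd'.
CREATE EXTERNAL TABLE hbase_events (
    rowkey  STRING,
    source  STRING,
    payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:source,d:payload")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Aggregations then run through Hive instead of against HBase directly:
SELECT source, COUNT(*) FROM hbase_events GROUP BY source;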
You can use HBase as a NoSQL database in this case. To make search more customizable and faster, use Elasticsearch along with HBase.
If your writes are at 1/second, most of the available databases should be able to support this. Since you are looking for a longer-term/persistent store, you should consider a database that scales horizontally, so that you can add more nodes as and when you want to increase capacity. Databases with auto-sharding abilities would be a great fit (Cassandra, Aerospike, ...). Make sure you choose an auto-sharding database that doesn't require the client/application to manage which data is stored where. Purely in-memory databases would not fit the bill in this case.
When your storage reaches a few terabytes, you may have to worry about database scale and throughput so that your infrastructure cost doesn't bog you down.
Your query patterns are very important in choosing the right solution. You may not want to index everything; fine-tune what you index so that you can query on the keys and/or only the data elements you need from within your records, keeping index storage overhead, and hence cost, under control. You should also look for time-range query support in the databases you evaluate, since that seems to be part of your typical query pattern.
Last but not least, you will want your queries processed as fast as possible. You should try out Cassandra (good for horizontal scaling, less so on throughput) and Aerospike (good for horizontal scaling, pretty good on throughput).

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores them in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too; I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision toward either MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the results of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on the data. B-tree based stores supporting sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so even though it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
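To make that concrete, here is a minimal sketch (names are illustrative, and it assumes a MySQL version where server-level partitioning can still be combined with the ARCHIVE engine, as was the case before 8.0):

-- ARCHIVE stores compressed, insert/scan-only data with no indexes, so
-- partition pruning on the date expression is what narrows the reads.
CREATE TABLE quotes_archive (
    symbol    CHAR(10)      NOT NULL,
    price     DECIMAL(12,4) NOT NULL,
    quoted_at DATETIME      NOT NULL
) ENGINE=ARCHIVE
  PARTITION BY RANGE (TO_DAYS(quoted_at)) (
      PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
      PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
  );

-- A query restricted to January only scans the p2013_01 partition:
SELECT symbol, AVG(price)
FROM quotes_archive
WHERE quoted_at >= '2013-01-01' AND quoted_at < '2013-02-01'
GROUP BY symbol;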
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra is that, with a big cluster, you sometimes don't have consistency for a few seconds, so you either force it (slowing things down) or accept that the very latest print will sometimes be a few seconds old. On a single machine there will be no consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere, per the CAP theorem); however, it's much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
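As a concrete illustration of the rolling-window pattern those posts describe (table and partition names are hypothetical, and the table is assumed to be RANGE-partitioned on a date column with no MAXVALUE catch-all partition):

-- Age out a month of old data as a cheap metadata operation:
ALTER TABLE quotes DROP PARTITION p2012_01;

-- Make room for the next month of incoming data:
ALTER TABLE quotes ADD PARTITION (
    PARTITION p2013_07 VALUES LESS THAN (TO_DAYS('2013-08-01'))
);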

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites, each averaging tens of thousands to hundreds of thousands of page views a day, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
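A minimal sketch of that idea, with hypothetical names:

-- Hash-partitioning on the website id keeps each site's rows in smaller
-- physical chunks while the application still sees a single table.
CREATE TABLE page_views (
    website_id INT UNSIGNED NOT NULL,
    url        VARCHAR(255) NOT NULL,
    viewed_at  DATETIME     NOT NULL,
    KEY idx_site_time (website_id, viewed_at)
) ENGINE=InnoDB
  PARTITION BY HASH (website_id) PARTITIONS 32;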
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be of the same type, with the same columns, so from a database normalization standpoint it makes sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
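For instance, a sketch of that pre-calculation approach, assuming a raw page_views table with website_id and viewed_at columns (all names here are hypothetical) and a nightly job or event to run the INSERT:

CREATE TABLE daily_site_stats (
    website_id INT UNSIGNED    NOT NULL,
    stat_date  DATE            NOT NULL,
    views      BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (website_id, stat_date)
);

-- Roll up yesterday's traffic once, so reports read a handful of rows
-- instead of scanning millions of raw page views.
INSERT INTO daily_site_stats (website_id, stat_date, views)
SELECT website_id, DATE(viewed_at), COUNT(*)
FROM page_views
WHERE viewed_at >= CURDATE() - INTERVAL 1 DAY
  AND viewed_at <  CURDATE()
GROUP BY website_id, DATE(viewed_at);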
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain, although with enough scripting you can make multiple tables work.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the Windows hard limit on the number of open files - in Linux you may have to use ulimit to raise the limit; there's no way to do it in Windows).
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT. I did this so that I could perform offline creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE ... ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described here: http://blog.coldlogic.com/categories/coldstore/).
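For reference, a MERGE setup looks roughly like this (names are illustrative; every underlying table must be an identically defined MyISAM table):

CREATE TABLE visits_2013_01 (site_id INT, hit_at DATETIME, KEY (site_id)) ENGINE=MyISAM;
CREATE TABLE visits_2013_02 (site_id INT, hit_at DATETIME, KEY (site_id)) ENGINE=MyISAM;

-- The MERGE table is a thin wrapper: SELECTs fan out over the UNION list,
-- and INSERT_METHOD=LAST sends new rows to the last table in the list.
CREATE TABLE visits_all (site_id INT, hit_at DATETIME, KEY (site_id))
    ENGINE=MERGE UNION=(visits_2013_01, visits_2013_02) INSERT_METHOD=LAST;

SELECT site_id, COUNT(*) FROM visits_all GROUP BY site_id;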
Use a single table. Once you start encountering performance problems there are many solutions: you can partition your table by website id (also known as horizontal partitioning), or you can use replication. It all depends on the ratio of reads to writes.
But to start, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not. You can take advantage of MySQL's different storage engines, such as MyISAM or NDB (in-memory clustering), to boost performance. Caching also plays a big role in offloading load from the database: data that is mostly read-only and can be computed easily is usually put in the cache, and the cache serves those requests so that only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should run performance tests yourself to understand whether one big table is sufficient.

Maximum number of workable tables in SQL Server And MySQL

I know that in SQL Server, the maximum number of "objects" in a database is a little over 2 billion. Objects include tables, views, stored procedures, and indexes, among other things. I'm not at all worried about going beyond 2 billion objects. However, what I would like to know is: does SQL Server suffer a performance hit from having a large number of tables? Does each table you add have a performance cost, or is there basically no difference (assuming a constant amount of data)? Does anybody have any experience working with databases with thousands of tables? I'm also wondering the same about MySQL.
No difference, assuming constant amount of data.
Probably a gain in practical terms because of things like reduced maintenance windows (smaller index rebuilds), ability to have read-only file groups etc.
Performance is determined by queries and indexes (at the most basic level): not number of objects
In terms of the max number of tables, I have had a database with 2 million tables. No performance hit at all. My tables were small, around 15 MB each.
I doubt SQL Server will have a performance problem working with thousands of tables, but I sure would.
I've worked on databases with hundreds of tables in SQL Server with no problems, though.
SQL Server can suffer a larger performance hit from using tables with many, many columns instead of breaking out a related table (even one with a one-to-one relationship). A wide table can also have problems when the data you want to insert exceeds the number of bytes that can be stored in a row: you can create a table that has the potential to store, for example, 10,000 bytes, but you will still only be able to store 8,060 bytes per row.
In my experience I don't think the number of tables will hurt performance. But you should be able to justify why you have so many tables in the database, because having so many tables on the database side also affects the work of the developers on the server side.
IMO, if you divide the tables on the basis of functionality, you not only make the developer's life easier but can also gain performance in your application, because you have fixed tables from which to get the required data.
Say you need to store sales, purchase, receipt, and payment details. Even if they all have the same table structure, instead of storing them in a single table you could store them in separate tables. With this you can get all the details for sales from one table, for purchases from another, and so on. It can help improve the response time of the database tier of the application, which is one of the slowest components of a web stack. Of course we improve database performance with the SQL queries themselves, but such structuring can also indirectly help.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating them into a couple of tables. What I'm asking is: are there any patterns or design techniques for handling this kind of problem (a huge number of records)? Are MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example; in reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that some of these queries take quite a while (minutes) to execute because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
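A sketch of the indexing part (the table and column names are made up for illustration); the buffer sizing itself is a server setting rather than SQL:

-- An index on the filtered/counted column lets the count be answered from
-- the index instead of a full table scan.
ALTER TABLE events ADD INDEX idx_status (status);
SELECT COUNT(*) FROM events WHERE status = 'done';

-- Buffer sizing goes in my.cnf rather than in a query:
--   innodb_buffer_pool_size for InnoDB data and index pages,
--   key_buffer_size for MyISAM index blocks.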
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction ?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL InnoDB, and it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
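A minimal sketch of the trigger-maintained variant (all names are hypothetical):

CREATE TABLE row_counts (
    table_name VARCHAR(64) NOT NULL PRIMARY KEY,
    row_count  BIGINT      NOT NULL
);

-- Seed it once with the real count, then let triggers keep it current.
INSERT INTO row_counts VALUES ('big_table', (SELECT COUNT(*) FROM big_table));

CREATE TRIGGER big_table_count_ins AFTER INSERT ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';

CREATE TRIGGER big_table_count_del AFTER DELETE ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

-- A cheap, accurate-enough count is now a single-row read:
SELECT row_count FROM row_counts WHERE table_name = 'big_table';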
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow COUNT(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL, and COUNT(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total row count of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that COUNT(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of an answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, MySQL does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high price and quality; Postgres (aka PostgreSQL) is available open source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second @Mark Baker, and say that you need to build indices on your tables.
For queries other than the one you mentioned, you should also be aware that using constructs such as IN() is faster than a series of OR clauses in the query. There are lots of little steps you can take to speed up individual queries.
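For instance, these two queries are equivalent, but the IN() form is shorter and typically cheaper for the optimiser (the table and values are made up):

SELECT * FROM orders WHERE status = 'new' OR status = 'paid' OR status = 'shipped';
SELECT * FROM orders WHERE status IN ('new', 'paid', 'shipped');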
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this is less true in Oracle than SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716