Good database for large table with simple key access - mysql

I have a few large databases, greater than 100 million records. They consist of the following:
A unique key.
An integer value, not unique, but used for sorting the query.
A VARCHAR(200).
I have them in a mysql isam table now. My thought was, hey, I'll just set up a covering index on the data, and it should pull out reasonably fast. Queries are of the form...
select valstr,account
from datatable
where account in (12349809, 987987223,...[etc])
order by orderPriority;
This seemed OK in some tests, but on our newer installation, it's terribly slow. It seems faster to have no index at all, which seems odd.
In any case, I'm thinking, maybe a different database? We use a data warehousing DB for other parts of the system, but it's not well suited for anything in text. Any free, or fairly cheap, DBs are an option, as long as they have reasonably useful API access. SQL optional.
Thanks in advance.
-Kevin

CouchDB and MongoDB and Riak are all going to be good at finding the key (account) relatively quickly.
The problems you're going to have (with any solution) are tied to the "order by" and "account in" clauses.
Problem #1: account in
120M records likely means gigabytes of data. You probably have an index over a gig. The reason this is a problem is that your "in" clause can easily span the whole index. If you search for accounts "0000001" and "9999581" you probably need to load a lot of index.
So just to find the records your DB first has to load potentially a gig of memory. Then to actually load the data you have to go back to the disk again. If your "accounts" on the in clause are not "close together" then you're going back multiple times to fetch various blocks. At some point it may be quicker to just do a table scan than to load the index and the table.
Then you get to problem #2...
Problem #2: order by
If you have a lot of data coming back from the "in" clause, then order by is just another layer of slowness. With an "order by" the server can't stream you the data. Instead it has to load all of the records in memory and then sort them and then stream them.
Solutions:
Have lots of RAM. If the RAM can't fit the entire index, then the loads will be slow.
Try limiting the number of "in" items. Even 20 or 30 items in this clause can make the query really slow.
Try a Key-Value database?
I'm a big fan of K/V databases, but you have to look at point #1. If you don't have a lot of RAM and you have lots of data, then the system is going to run slowly no matter what DB you use. That RAM / DB size ratio is really important if you want good performance in these scenarios (small look-ups in big datasets).
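As a rough way to check whether a covering index is actually being used for the query in the question, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so the planner output differs from MySQL's EXPLAIN, and the index name idx_account_cover is made up for the example. It only shows the shape of the technique on a tiny table; it says nothing about how a 100-million-row table behaves when the index does not fit in RAM.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datatable ("
    " account INTEGER NOT NULL,"        # the unique key from the question
    " orderPriority INTEGER NOT NULL,"  # non-unique sort value
    " valstr VARCHAR(200) NOT NULL)"
)
# Hypothetical index name. It holds every column the query touches,
# so the query can be answered from the index alone.
conn.execute(
    "CREATE INDEX idx_account_cover "
    "ON datatable (account, orderPriority, valstr)"
)
conn.executemany(
    "INSERT INTO datatable (account, orderPriority, valstr) VALUES (?, ?, ?)",
    [(n, n % 7, f"value-{n}") for n in range(10_000)],
)

query = (
    "SELECT valstr, account FROM datatable "
    "WHERE account IN (12, 345, 678) ORDER BY orderPriority"
)

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
plan_details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for detail in plan_details:
    print(detail)

rows = conn.execute(query).fetchall()
print(rows)
```

Note that the planner still has to sort the matching rows for the "order by", which is the second problem described above; the covering index only saves the trip back to the table.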

Here's a reasonably sized example of a MySQL database using the innodb engine which takes advantage of clustered indexes on a table with approx. 125 million rows and with a query runtime of 0.021 seconds which seems fairly reasonable.
Rewriting mysql select to reduce time and writing tmp to disk
http://pastie.org/1105206
Other useful links:
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html
Hope it proves of interest.

CouchDB will give you storage by key, and you can create views to do the querying and sorting. A second option could be Cassandra, but there's a pretty big learning curve.

Related

How to improve the performance of table scans with innodb

Brief: Is there any way to improve the performance of table scans on InnoDB tables?
Please, do not suggest adding indexes to avoid table scans. (see below)
innodb_buffer_pool_size sits at 75% of server memory (48 GB/64GB)
I'm using the latest version of Percona (5.7.19) if that changes anything
Longer: We have 600 GB of recent time series data (we aggregate and delete older data) spread over 50-60 tables. So most of it is "active" data that is regularly queried. These tables are somewhat large (400+ numeric columns) and many queries run against a number of those columns (alarming), which is why it is impractical to add indexes (as we would have to add a few dozen). The largest tables are partitioned per day.
I am fully aware that this is an application/table design problem and not a "server tuning" problem. We are currently working to significantly change the way these tables are designed and queried, but have to maintain the existing system until this happens so I'm looking for a way to improve things a bit to buy us a little time.
We recently split this system and have moved a part of it to a new server. It previously used MyISAM, and we tried moving to TokuDB which seemed appropriate but ran into some weird problems. We switched to InnoDB but performance is really bad. I get the impression that MyISAM is better with table scans which is why, barring any better option, we'll go back to it until the new system is in place.
Update
All tables have pretty much the same structure:
-timestamp
-primary key (varchar(20) field)
-about 15 fields of various types representing other secondary attributes that can be filtered upon (along with an appropriately indexed criteria first)
-And then about a few hundred measures (floats), between 200-400.
I already trimmed the row length as much as I could without changing the structure itself. The primary key used to be a varchar(100), all measures used to be doubles, many of the secondary attributes had their data types changed.
Upgrading hardware is not really an option.
Creating small tables with just the set of columns I need would help some processes perform faster. But at the cost of creating that table with a table scan first and duplicating data. Maybe if I created it as a memory table. By my estimate, it would take a couple of GB away from the buffer pool. Also there are aggregation processes that read about as much data from the main tables on a regular basis, and they need all columns.
There is unfortunately a lot of duplication of effort in those queries, which I plan to address in the next version. The alarming and aggregation processes basically reprocess the entire day's worth of data every time some rows are inserted (every half hour) instead of just dealing with new/changed data.
Like I said, the bigger tables are partitioned, so it's usually a scan over a daily partition rather than the entire table, which is a small consolation.
Implementing a system to hold this in memory outside of the DB could work, but that would entail a lot of changes on the legacy system and development work. Might as well spend that time on the better design.
The fact that InnoDB tables are so much bigger than MyISAM tables for the same data (2-3x as big in my case) really hinders the performance.
MyISAM is a little bit better at table-scans, because it stores data more compactly than InnoDB. If your queries are I/O-bound, scanning through less data on disk is faster. But this is a pretty weak solution.
You might try using InnoDB compression to reduce the size of data. That might get you closer to MyISAM size, but you're still I/O-bound so it's going to suck.
Ultimately, it sounds like you need a database that is designed for an OLAP workload, like a data warehouse. InnoDB and TokuDB are both designed for OLTP workload.
It smells like a Data Warehouse with "Reports". By judiciously picking what to aggregate (a selected subset of your floats) over what time period (hour or day is typical), you can build and maintain Summary Tables that work much more efficiently for the Reports. This has the effect of scanning the data only once (to build the Summaries), not repeatedly. The Summary tables are much smaller, so the reports are much faster -- 10x is perhaps typical.
It may also be possible to augment the Summary tables as the raw data is being Inserted. (See INSERT .. ON DUPLICATE KEY UPDATE ..)
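To make the "augment the Summary table as the raw data is inserted" idea concrete, here is a minimal sketch. It uses Python's sqlite3 module, whose ON CONFLICT ... DO UPDATE clause (SQLite 3.24 or later) plays the role of MySQL's INSERT .. ON DUPLICATE KEY UPDATE. The table hourly_summary, its columns, and the function record_measurement are all invented for the example and are not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical summary table: one row per (sensor, hour) instead of
# one row per raw measurement.
conn.execute(
    "CREATE TABLE hourly_summary ("
    " sensor_id TEXT NOT NULL,"
    " hour TEXT NOT NULL,"
    " sample_count INTEGER NOT NULL,"
    " value_sum REAL NOT NULL,"
    " PRIMARY KEY (sensor_id, hour))"
)


def record_measurement(sensor_id, timestamp, value):
    """Fold one raw measurement into its hourly summary row."""
    hour = timestamp[:13]  # 'YYYY-MM-DD HH'
    conn.execute(
        "INSERT INTO hourly_summary (sensor_id, hour, sample_count, value_sum) "
        "VALUES (?, ?, 1, ?) "
        "ON CONFLICT (sensor_id, hour) DO UPDATE SET "
        " sample_count = sample_count + 1, "
        " value_sum = value_sum + excluded.value_sum",
        (sensor_id, hour, value),
    )


for ts, v in [
    ("2017-10-01 09:00:00", 1.5),
    ("2017-10-01 09:30:00", 2.5),
    ("2017-10-01 10:00:00", 4.0),
]:
    record_measurement("pump-1", ts, v)

# Reports read the small summary table instead of scanning raw rows.
summary = conn.execute(
    "SELECT hour, sample_count, value_sum / sample_count "
    "FROM hourly_summary ORDER BY hour"
).fetchall()
print(summary)
```

Storing the count and the sum rather than the average keeps the summary row updatable in place; the average is derived at report time.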
And use Partitioning by date to allow for efficient DROP PARTITION instead of DELETE. Don't have more than about 50 partitions.
Summary Tables
Time series Partitioning
If you would like to discuss in more detail, let's start with one of the queries that is scanning so much now.
In the various projects I have worked on, there were between 2 and 7 Summary tables.
With 600GB of data, you may be pushing the limits on 'ingestion'. If so, we can discuss that, too.

Distributed database use cases

At the moment I have a MySQL database, and I am collecting about 5 terabytes of data a year. I will keep my data indefinitely; I don't think I want to delete anything early.
I ask myself whether I should use a distributed database, because my data will grow every year. After 5 years I will have 25 terabytes, not counting indexes. (I just calculated the raw data I save every day.)
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly connected.
I know it depends on the queries and on the table design, and that I could also have a distributed MySQL database.
I just want to know when I should start thinking about a distributed database.
Would this be a use case for one, or could MySQL handle a dataset this large?
EDIT:
On average I will have 1500 clients writing data per second, and they affect all tables.
I only need the old data for analytics, such as machine learning and pattern matching.
A client should also be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
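The difference between sequential and random index keys can be illustrated with a toy model. The sketch below is not a real BTree: it simply assumes that neighbouring keys live in the same index block and that a fixed number of blocks fit in an LRU cache. The function name count_block_misses and the sizes (100 keys per block, 50 cached blocks) are arbitrary choices for the illustration.

```python
import random
from collections import OrderedDict


def count_block_misses(keys, keys_per_block=100, cached_blocks=50):
    """Count inserts that touch an index block not currently in the cache."""
    cache = OrderedDict()  # block number -> True, ordered by recency
    misses = 0
    for key in keys:
        block = key // keys_per_block  # neighbouring keys share a block
        if block in cache:
            cache.move_to_end(block)   # mark as most recently used
        else:
            misses += 1                # would need a disk read
            cache[block] = True
            if len(cache) > cached_blocks:
                cache.popitem(last=False)  # evict least recently used
    return misses


total_keys = 100_000
sequential_keys = list(range(total_keys))  # AUTO_INCREMENT-like
random_keys = sequential_keys[:]
random.Random(0).shuffle(random_keys)      # UUID/md5-like insert order

sequential_misses = count_block_misses(sequential_keys)
random_misses = count_block_misses(random_keys)
print("sequential misses:", sequential_misses)
print("random misses:", random_misses)
```

With sequential keys each block is missed once, when it is first created; with shuffled keys the cache holds only a small fraction of the blocks, so most inserts miss.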
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
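The arithmetic behind that 100TB figure, written out as a small sketch. The multipliers are the rules of thumb given above (2 to 3x for InnoDB, 2x for copying a table during a schema change); the function name estimate_storage_tb is just for illustration.

```python
def estimate_storage_tb(raw_tb_per_year, years, engine_overhead, alter_copy_factor=2):
    """Rough disk budget in TB from raw data volume and rule-of-thumb multipliers."""
    raw_total = raw_tb_per_year * years      # 5 TB/year * 5 years = 25 TB
    on_disk = raw_total * engine_overhead    # InnoDB footprint, 2x to 3x raw
    return on_disk * alter_copy_factor       # room to copy a table on ALTER


low = estimate_storage_tb(5, 5, engine_overhead=2)
high = estimate_storage_tb(5, 5, engine_overhead=3)
print(f"plan for roughly {low} to {high} TB")
```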
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering, which is a sort-of, kind-of out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Huge MySQL Database -- Do's and Don'ts?

I'm interested to build a huge database (100s millions of records) using MySQL, to contain stock data in 1-min intervals. The database will contain data for 5000 stocks say for 10 years.
Two concerns:
(1) In the past, I had a problem of "slow insertions" -- meaning, at the beginning the rate of insertions was good, but as the table was filling up with millions of records, the insertion became slow (too slow!). At that time I used Windows and now I use Linux -- should it make a difference?
(2) I am aware of indexing techniques that will help queries (data retrievals) be faster. The thing is, is there a way to speed up insertions? I know one can turn off indexing while inserting, but then 'building' the indexes post insertion (for 10s of millions of records!) also takes tons of time. Any advice on that?
Any other Do's / Don'ts? Thanks in advance for any help.
It depends on what type of index you need and how you generate data. If you are satisfied with a single index on time, just stick to that, and when you generate data, keep inserting in ascending order (with respect to the insert time on which you have the index). That way, the reordering required during insertion is minimal. Also, consider partitioning to optimize your queries. It can give you dramatic improvements in performance. Using an auto-increment column can help with fast indexing, but then you won't have the index on time if the auto-increment column is the only index. Make sure you use the InnoDB storage engine for good performance. If you properly tune your database engine on Linux and keep the design simple, it will scale smoothly without many issues. I think the huge data requirements you talk about are not as difficult to handle as they might first seem. However, if you are planning to run aggregate queries (with joins of tables), then that is more challenging.
You could always keep your data in a table with no indexes and then use Lucene (or similar) to index the data. This will keep inserts fast and allow you to query Lucene for fast data retrieval.
Consider using an SSD drive (or array) to store your data, especially if you can't afford to create a box with gigs of memory. Everything about it should be faster.

Which granularity to choose for database table partitioning?

I have a 20-million record table in a MySQL database. SELECTs work really fast because I have set up good indexes, but INSERT and UPDATE operations are getting really slow. The database is the back-end of a web application under heavy load. INSERTs and UPDATEs are really slow because there are some 5 indexes on this table and the index size is about 1GB now - I guess it takes too much time to compute.
To solve this problem, I decided to partition a table. I run MySQL 4, and cannot upgrade (no direct control over server), so I'll do manual partitioning - create a separate table for each section.
The data-set is composed of about 18000 different logical slices, which could be queried completely separately. Therefore, I could create 18000 tables named (maindata1, maindata2, etc.). However, I'm not sure that this is the optimal way to do it. Besides the obvious fact that I'll have to browse through 18000 items in the administration tool whenever I want to do something manually, I'm concerned about file-system performance. The file-system is ext3. I'm not sure how fast it is at locating files in a directory with 36000 files (there's a data file and an index file for each table).
If this is a problem, I could join some slices of data together into a same table. For example: maindata10, maindata20, etc. where maindata10 would contain slices 1, 2, 3...10. If I would go for "groups" of 10, I would only have 1800 tables. If I would group 20, I would get 900 tables.
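The grouping scheme just described (maindata10 holds slices 1 to 10, maindata20 holds slices 11 to 20, and so on) could be computed in the application like this. It is a minimal sketch; the function names table_for_slice and table_count are made up, and only the maindata prefix comes from the description above.

```python
def table_for_slice(slice_id, group_size=10, prefix="maindata"):
    """Return the name of the grouped table that holds a given slice.

    Tables are named after the last slice they contain, so with
    group_size=10 slices 1..10 map to 'maindata10'.
    """
    if slice_id < 1:
        raise ValueError("slice ids start at 1")
    # Ceiling division without floats, then scale back up to a slice id.
    last_slice_in_group = -(-slice_id // group_size) * group_size
    return f"{prefix}{last_slice_in_group}"


def table_count(total_slices, group_size):
    """Number of grouped tables needed to hold all slices."""
    return -(-total_slices // group_size)


print(table_for_slice(1), table_for_slice(10), table_for_slice(11))
print(table_count(18000, 10), table_count(18000, 20))
```

Keeping the mapping in one function means the group size can be changed later without touching the queries that use it, although existing data would still have to be moved.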
I wonder what would be the optimal size of this grouping, i.e. number of files in a directory vs table size?
Edit: I also wonder if it would be a good idea to use multiple separate databases to group files together. So, even if I would have 18000 tables, I could group them in, say, 30 databases of 600 tables each. It seems like this would be much easier to manage. I don't know if having multiple databases would increase or decrease performance or memory footprint (it would complicate backup and restore though)
There are a few tactics you could follow to boost performance. By "partitions" I assume you mean "versions of tables with the same column layout but different data contents."
Get a server that will run mySQL 5 if you possibly can. It's faster and better at this stuff, enough so that you may not have a problem after you upgrade.
Are you using InnoDB? If so, can you switch to myISAM? (If you need rigid transactional integrity you might not be able to switch).
For partitioning, you might try to figure out what kind of data-slice combination will give you roughly equal-size partitions (by row count). If I were you I'd go for no more than about 20 partitions unless you can prove to yourself that you need to.
If only a few of your data slices are being actively updated (for example, if they are "this month's data" and "last month's data"), I might consider splitting those into smaller slices. For example, you might have "this week's data", "last week's," and "the week before" in their own partitions. Then, when your partitions cool off, copy their data and combine them into bigger groups like "the quarter before last." This has the disadvantage that it will require routine Sunday-evening style maintenance jobs to run. But it has the advantage that most or all updates only happen on a small fraction of your table.
You should look into the MERGE engine if you are using MyISAM. This way you can get pretty much the same functionality as partitioning in MySQL 5, and you will be able to run the same SELECTs as you are running now.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up partitioning it, separating the data into a couple of tables. What I'm asking is, are there any patterns or design techniques to handle this kind of problem (a huge number of records)? Is MSSQL or Oracle better at handling lots of records?
P.S
The COUNT(*) problem stated above is just an example case. In reality the app does CRUD operations and some aggregate queries (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some of these queries because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction ?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL innodb, it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs)
Having the whole table in ram will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
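A minimal sketch of the trigger-maintained summary table mentioned above, using Python's sqlite3 module as a stand-in. MySQL trigger syntax differs in detail, and the table names records and row_counts and the trigger names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT);

-- Hypothetical summary table holding one counter per tracked table.
CREATE TABLE row_counts (
    table_name TEXT PRIMARY KEY,
    row_count  INTEGER NOT NULL
);
INSERT INTO row_counts VALUES ('records', 0);

-- Keep the counter in step with every insert and delete.
CREATE TRIGGER records_after_insert AFTER INSERT ON records
BEGIN
    UPDATE row_counts SET row_count = row_count + 1
    WHERE table_name = 'records';
END;

CREATE TRIGGER records_after_delete AFTER DELETE ON records
BEGIN
    UPDATE row_counts SET row_count = row_count - 1
    WHERE table_name = 'records';
END;
""")

conn.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    [(f"row-{n}",) for n in range(1000)],
)
conn.execute("DELETE FROM records WHERE id <= 250")

# Cheap single-row lookup versus a scan of the whole table.
cached_count = conn.execute(
    "SELECT row_count FROM row_counts WHERE table_name = 'records'"
).fetchone()[0]
exact_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(cached_count, exact_count)
```

The trade-off is that every insert and delete now also updates the counter row, which can become a point of contention under heavy concurrent writes; the scheduled-task variant avoids that at the cost of accuracy.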
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow count(*) on the huge table, i would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total size of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, mysql does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high-price and quality; Postgres (aka PostgreSql) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
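To illustrate the third point (functions in where clauses), here is a small sketch that compares query plans for the same filter written two ways. It uses Python's sqlite3 module, so the plan text is SQLite's rather than MySQL's, and the events table and its index name are made up for the example. The general effect, that wrapping an indexed column in a function usually prevents the index from being used for the lookup, applies to most engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " created_at TEXT NOT NULL,"
    " payload TEXT)"
)
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")


def plan_for(sql):
    """Return the human-readable part of SQLite's query plan as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Function applied to the indexed column: the engine must evaluate
# date(created_at) for every row, so it scans.
wrapped_plan = plan_for(
    "SELECT payload FROM events WHERE date(created_at) = '2009-01-15'"
)

# Same filter expressed as a range on the bare column: the index can be used.
range_plan = plan_for(
    "SELECT payload FROM events "
    "WHERE created_at >= '2009-01-15' AND created_at < '2009-01-16'"
)

print("wrapped:", wrapped_plan)
print("range:  ", range_plan)
```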
I'm going to second @Mark Baker, and say that you need to build indices on your tables.
For other queries than the one you selected, you should also be aware that using constructs such as IN() is faster than a series of OR statements in the query. There are lots of little steps you can take to speed-up individual queries.
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common where clause fields), and avoid cursors (although I think this is less true in Oracle than in SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for mySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716