Separate single database into active and archive - mysql

I have a single database, most of the tables are connected in some way.
It consists of over 500000 records.
I need to implement live search, but the number of records bothers me.
The database will grow, and live search across millions of records will surely cause problems. So I need to move old records (let's assume a date field is present) to another database and keep only fresh ones available for search.
Old records won't be used anymore, that's for sure, but I still need to keep them.
Any ideas how that could be implemented in MySQL?

500,000 records really is not very many records.
Before you start taking drastic actions (such as limiting the ability of users to seamlessly see all the data at once), you should consider basics for improving performance:
Indexes to improve standard query performance.
Partitioning to limit the portions of tables that need to be accessed.
Full text indexing to improve match() queries.
Optimization of SQL queries.
In general, these are sufficient for databases that are orders of magnitude larger than the volume you are dealing with.
These may not apply to your particular situation; but you should exhaust the lower-hanging fruit for performance optimization before changing your physical data model for a problem that might never occur.
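As a rough illustration of the first point (indexes), here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table name `records`, the column `created` and the index name are assumptions, not taken from the question. In MySQL you would inspect the plan with EXPLAIN instead.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, created TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO records (created, title) VALUES (?, ?)",
    [(f"2020-01-{d:02d}", f"record {d}") for d in range(1, 29)],
)
# Index the date column that the search filters on.
conn.execute("CREATE INDEX idx_records_created ON records (created)")


def plan(sql, params=()):
    """Return the planner's description of how it would run the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


recent_plan = plan("SELECT title FROM records WHERE created >= ?", ("2020-01-20",))
print(recent_plan)  # the plan should mention idx_records_created
```

With the index in place, a "fresh records only" search reads just the matching slice of the table, so old rows can stay where they are.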

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting amounts to 5 terabytes a year. I will keep all of my data; I don't think I want to delete anything very early.
I ask myself whether I should use a distributed database, because my data will grow every year. After 5 years I will have 25 terabytes without indexes (calculated from just the raw data I save every day).
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly connected.
I know it depends on the queries and on the database table design, and I could also have a distributed MySQL database.
I just want to know when I should think about a distributed database.
Would this be a use case? Or could MySQL handle this large dataset?
EDIT:
On average I will have 1500 clients writing data per second, and they affect all tables.
I only need the old dataset for analytics, like machine learning and pattern matching.
A client should also be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
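To make the sequential-versus-random point concrete, here is a toy model (not a real BTree, and not how InnoDB lays out pages). It only counts how many distinct leaf blocks a batch of inserts would touch. The block capacity, index size and batch size are made-up assumptions.

```python
import random

KEYS_PER_PAGE = 100      # assumed number of index entries per leaf block
TOTAL_KEYS = 1_000_000   # assumed size of the existing index
BATCH = 1_000            # number of new rows inserted


def leaf_pages_touched(keys):
    """Count distinct leaf blocks hit, modelling the leaves as consecutive
    runs of KEYS_PER_PAGE positions in key order (a simplification)."""
    return len({key // KEYS_PER_PAGE for key in keys})


# AUTO_INCREMENT-style keys: every insert lands at the 'end' of the tree.
sequential_keys = range(TOTAL_KEYS - BATCH, TOTAL_KEYS)

# UUID/md5-style keys: every insert lands at an unpredictable position.
rng = random.Random(42)
random_keys = [rng.randrange(TOTAL_KEYS) for _ in range(BATCH)]

sequential_pages = leaf_pages_touched(sequential_keys)
random_pages = leaf_pages_touched(random_keys)
print(sequential_pages, random_pages)
```

In this model the sequential batch touches 10 blocks, which easily stay cached, while the random batch touches close to one block per inserted row.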
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
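A minimal sketch of a Summary Table, using Python's sqlite3 as a stand-in for MySQL so it is runnable as-is. The tables `fact_readings` and `summary_daily` and their columns are hypothetical; the real schema was never posted.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
CREATE TABLE fact_readings (client_id INTEGER, taken_at TEXT, value REAL);
CREATE TABLE summary_daily (
    client_id INTEGER, day TEXT, n INTEGER, total REAL,
    PRIMARY KEY (client_id, day)
);
""")
conn.executemany(
    "INSERT INTO fact_readings VALUES (?, ?, ?)",
    [(1, "2024-03-01 08:00:00", 2.0),
     (1, "2024-03-01 09:00:00", 4.0),
     (1, "2024-03-02 08:00:00", 6.0),
     (2, "2024-03-01 08:00:00", 10.0)],
)


def summarize_day(day):
    """Rebuild one day's summary rows from the fact table."""
    next_day = (date.fromisoformat(day) + timedelta(days=1)).isoformat()
    with conn:  # one transaction: delete the old rows, insert the new ones
        conn.execute("DELETE FROM summary_daily WHERE day = ?", (day,))
        conn.execute(
            """INSERT INTO summary_daily (client_id, day, n, total)
               SELECT client_id, ?, COUNT(*), SUM(value)
               FROM fact_readings
               WHERE taken_at >= ? AND taken_at < ?
               GROUP BY client_id""",
            (day, day, next_day),
        )


summarize_day("2024-03-01")
n, total = conn.execute(
    "SELECT n, total FROM summary_daily WHERE client_id = 1 AND day = '2024-03-01'"
).fetchone()
print(n, total, total / n)  # count, sum and average for client 1 on that day
```

Storing the count and the sum (rather than the average) lets daily rows be rolled up into weeks or months later without touching the Fact table again.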
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
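The arithmetic behind that estimate, as a small sketch. The function name is made up; the 2x to 3x overhead factor and the doubling for a table copy are the rules of thumb from this answer, not measured values.

```python
def estimated_disk_tb(raw_tb, innodb_overhead, alter_headroom=True):
    """Rough disk footprint: raw data size times the InnoDB overhead factor,
    doubled if room is kept for a full table copy during an ALTER TABLE."""
    footprint = raw_tb * innodb_overhead
    return footprint * 2 if alter_headroom else footprint


raw_tb_after_5_years = 5 * 5  # 5 TB per year for 5 years
low = estimated_disk_tb(raw_tb_after_5_years, 2)
high = estimated_disk_tb(raw_tb_after_5_years, 3)
print(f"{low}-{high} TB")  # prints "100-150 TB"
```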
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
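The "move the data across, and archive it" step can be sketched roughly as below. This is a bare copy-then-delete inside one transaction, not a full ETL job with a transform step. Python's sqlite3 stands in for MySQL, and the table names `orders` and `orders_archive` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT, amount REAL);
CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created TEXT, amount REAL);
INSERT INTO orders (created, amount) VALUES
    ('2019-05-01', 10.0), ('2020-02-11', 20.0), ('2024-06-30', 30.0);
""")


def archive_older_than(cutoff):
    """Copy rows older than cutoff into the archive table, then delete them,
    in one transaction so that a failure leaves both tables unchanged."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM orders WHERE created < ?", (cutoff,)
        ).rowcount
    return deleted


moved = archive_older_than("2021-01-01")
print(moved)  # number of rows moved to the archive
```

On a large table you would normally run this in small batches from a scheduled job, so that each transaction stays short.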
If you want to keep everything logically consistent, you might use sharding and clustering, which is more or less an out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Is having millions of tables and millions of rows within them a common practice in MySQL database design?

I am doing database design for an upcoming web app, and I would like to hear from anybody using MySQL heavily in their current web apps whether this sort of design is efficient for a web app with, let's say, 80,000 users.
1 DB
in DB, millions of tables for features for each user, and within each table, potentially millions of rows.
While this design is very dynamic and scales nicely, I was wondering several things.
Is this a common design in web applications today?
How would this perform, time-wise, when querying millions of rows?
How does a DB perform if it contains MILLIONS of tables? (Again, time-wise, and is this even possible?)
If it performs well under the above conditions, how would it perform under strenuous load, if all 80,000 users accessed the DB 20-30 times each in 10-15 minute sessions every day?
How much server space would this require, very generally speaking? (To reiterate: millions of tables, each containing potentially millions of rows with 10-15 columns filled with text.)
Any help is appreciated.
1 - Definitely not. Almost anyone you ask will tell you millions of tables is a terrible idea.
2 - Millions of ROWS is common, so just fine.
3 - Probably terribly, especially if the queries are written by someone who thinks it's OK to have millions of tables. That tells me this is someone who doesn't understand databases very well.
4 - See #3
5 - Impossible to tell. You will have a lot of extra overhead from the extra tables as they all need extra metadata. Space needed will depend on indexes and how wide the tables are, along with a lot of other factors.
In short, this is a very very very seriously bad idea and you should not do it.
Millions of rows is perfectly normal usage, and can respond quickly if properly optimized and indexed.
Millions of tables is an indication that you've made a major goof in how you've architected your application. Millions of rows times millions of tables times 80,000 users means what, 80 quadrillion records? I strongly doubt you have that much data.
Having millions of rows in a table is perfectly normal and MySQL can handle this easily, as long as you use appropriate indexes.
Having millions of tables on the other hand seems like a bad design.
In addition to what others have said, don't forget that finding the right table based on the given table name also takes time. How much time? Well, this is internal to DBMS and likely not documented, but probably more than you think.
So, a query searching for a row can either take:
(1) The time to find the table plus the time to find the row in a (relatively) small table.
(2) Or just the time to find a row in one large table.
Option (2) is likely to be faster.
Also, frequently using different table names in your queries makes query preparation less effective.
If you are thinking of having millions of tables, I can't imagine that you are actually designing millions of logically distinct tables. Rather, I would strongly suspect that you are creating tables dynamically based on data. That is, rather than creating a FIELD for, say, the user id, and storing one or more records for each user, you are contemplating creating a new TABLE for each user id. And then you'll have thousands and thousands of tables that all have exactly the same fields in them. If that's what you're up to: Don't. Stop.
A table should represent a logical TYPE of thing that you want to store data for. You might make a city table, and then have one record for each city. One of the fields in the city table might indicate what country that city is in. DO NOT create a separate table for each country holding all the cities for each country. France and Germany are both examples of "country" and should go in the same table. They are not different kinds of thing, a France-thing and a Germany-thing.
Here's the key question to ask: What data do I want to keep in each record? If you have 1,000 tables that all have exactly the same columns, then almost surely this should be one table with a field that has 1,000 possible values. If you really seriously keep totally different information about France than you keep about Germany, like for France you want a list of provinces with capital city and the population but for Germany you want a list of companies with industry and chairman of the board of directors, then okay, those should be two different tables. But at that point the difference is likely NOT France versus Germany but something else.
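A minimal sketch of the city/country example, with Python's sqlite3 standing in for MySQL. The table and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
CREATE TABLE city (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    country TEXT NOT NULL      -- a column, not one table per country
);
CREATE INDEX idx_city_country ON city (country);
INSERT INTO city (name, country) VALUES
    ('Paris', 'France'), ('Lyon', 'France'), ('Berlin', 'Germany');
""")


def cities_in(country):
    """The country is a query parameter; the table name never changes."""
    rows = conn.execute(
        "SELECT name FROM city WHERE country = ? ORDER BY name", (country,)
    )
    return [name for (name,) in rows]


french_cities = cities_in("France")
print(french_cities)  # ['Lyon', 'Paris']
```

The same shape applies to per-user data: one table with a user id column and an index on it, instead of one table per user.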
1] Look up dimensions and facts tables in database design. You can start with http://en.wikipedia.org/wiki/Database_model#Dimensional_model.
2] Be careful about indexing too much: on tables with heavy writes/updates, every extra index makes each write more expensive (think of the average or worst case of rebalancing a B-tree). For read-heavy tables, index only the fields you search by. For example, in
select * from mutable where A ='' and B='';
you may want to index A and B.
3] It may not be necessary to start thinking about replication yet, but since you are talking about 10^6 entries and tables, maybe you should.
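For point 2], one way to "index A and B" is a single composite index over both columns. A minimal sketch, with Python's sqlite3 standing in for MySQL; the table name `items` and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, A TEXT, B TEXT, C TEXT)")
# One composite index covers the "A = ... AND B = ..." lookup.
conn.execute("CREATE INDEX idx_items_a_b ON items (A, B)")
conn.executemany(
    "INSERT INTO items (A, B, C) VALUES (?, ?, ?)",
    [(f"a{i % 10}", f"b{i % 7}", f"c{i}") for i in range(1000)],
)

query = "SELECT * FROM items WHERE A = ? AND B = ?"
params = ("a3", "b3")
plan_text = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()[0][-1]
matches = len(conn.execute(query, params).fetchall())
print(plan_text)
print(matches)
```

A composite index on (A, B) also serves queries that filter on A alone, but not queries that filter only on B.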
So, instead of me telling you a flat no for the millions of tables question (and yes my answer is NO), I think a little research will serve you better. As for millions of records, that hints that you need to start thinking about "scaling out" -- as opposed to "scaling up."
SQL Server has many ways you can support large tables. You may find some help by splitting your indexes across multiple partitions (filegroups), placing large tables on their own filegroup, and indexes for the large table on another set of filegroups.
A filegroup is a named set of data files that can be placed on its own drive. Each drive has its own dedicated read and write heads. The more drives, the more heads are searching the indexes at a time, and thus the faster you find your records.
Here is a page that talks in detail about filegroups.
http://cm-bloggers.blogspot.com/2009/04/table-and-index-partitioning-in-sql.html

How many rows are 'too many' for a MySQL table? [duplicate]

I am building the database schema for an application that will have users, and each user will have many rows in relation tables such as 'favorites'.
Each user could have thousands of favorites, and there could be thousands of registered users (over time).
Users are never deleted, because that would either leave other entities orphaned or delete them too (which isn't desired), so these tables will keep growing forever. I was wondering whether the resulting tables could become too big (e.g. 1kk rows, i.e. one million), and whether I should worry about this and do something like mark old, inactive users as deleted and remove the relations that only affect them (such as favorites and other preferences).
Is this the way to go? Or can MySQL easily handle 1kk rows in a table? Is there a known limit? Or is it fully hardware-dependent?
I agree with klennepette and Brian - with a couple of caveats.
If your data is inherently relational, and subject to queries that work well with SQL, you should be able to scale to hundreds of millions of records without exotic hardware requirements.
You will need to invest in indexing, query tuning, and making the occasional sacrifice to the relational model in the interests of speed. You should at least nod to performance when designing tables – preferring integers to strings for keys, for instance.
If, however, you have document-centric requirements, need free text search, or have lots of hierarchical relationships, you may need to look again.
If you need ACID transactions, you may run into scalability issues earlier than if you don't care about transactions (though this is still unlikely to affect you in practice); if you have long-running or complex transactions, your scalability decreases quite rapidly.
I'd recommend building the project from the ground up with scalability requirements in mind. What I've done in the past is set up a test environment populated with millions of records (I used DBMonster, but I'm not sure if that's still around), and regularly test work-in-progress code against this database using load testing tools like JMeter.
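A very small sketch of such a test rig: generate synthetic rows, load them, then time a representative query. It does not use DBMonster or JMeter; Python's sqlite3 stands in for MySQL, and the `favorites` table, the row counts and the seed are all assumptions to be scaled up for a real test.

```python
import random
import sqlite3
import time


def build_test_db(n_users, favorites_per_user, seed=0):
    """Populate an in-memory database with synthetic 'favorites' rows."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")  # stand-in for a MySQL test server
    conn.execute("CREATE TABLE favorites (user_id INTEGER, item_id INTEGER)")
    rows = (
        (user_id, rng.randrange(1_000_000))
        for user_id in range(n_users)
        for _ in range(favorites_per_user)
    )
    with conn:  # load everything in one transaction
        conn.executemany("INSERT INTO favorites VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_favorites_user ON favorites (user_id)")
    return conn


conn = build_test_db(n_users=2_000, favorites_per_user=50)

start = time.perf_counter()
count = conn.execute(
    "SELECT COUNT(*) FROM favorites WHERE user_id = ?", (1234,)
).fetchone()[0]
elapsed = time.perf_counter() - start
print(f"{count} favorites for user 1234 in {elapsed * 1000:.3f} ms")
```

Raising `n_users` until the timings stop being acceptable gives a measured answer to "how many rows are too many" for your own hardware and queries.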
Millions of rows is fine, tens of millions of rows is fine - provided you've got an even remotely decent server, i.e. a few GB of RAM and plenty of disk space. You will need to learn about indexes for fast retrieval, but in terms of MySQL being able to handle it, no problem.
Here's an example that demonstrates what can be achieved using a well designed/normalised InnoDB schema which takes advantage of InnoDB's clustered primary key indexes (not available with MyISAM). The example is based on a forum with threads and has 500 million rows and query runtimes of 0.02 seconds while under load.
MySQL and NoSQL: Help me to choose the right one
It is mostly hardware dependent, but that said, MySQL scales pretty well.
I wouldn't worry too much about table size; if it does become an issue later on, you can always use partitioning to ease the stress.

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites, each averaging tens to hundreds of thousands of page views a day, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with hundreds of websites, each with its own separate table, seems like a big problem. On the other hand, performance and scalability are probably going to be a problem with such a large amount of data. Any suggestions, comments or advice are most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with same columns, so from a database normalization standpoint they make sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
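A minimal sketch of a query that is satisfied entirely from one index (a "covering" index). Python's sqlite3 stands in for MySQL, where EXPLAIN reports the equivalent case as "Using index". The `page_views` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.execute(
    """CREATE TABLE page_views (
           id INTEGER PRIMARY KEY, website_id INTEGER, viewed_at TEXT, url TEXT)"""
)
# Both columns used by the query below live in this one index.
conn.execute("CREATE INDEX idx_views_site_time ON page_views (website_id, viewed_at)")
conn.executemany(
    "INSERT INTO page_views (website_id, viewed_at, url) VALUES (?, ?, ?)",
    [(site, f"2024-01-{day:02d}", "/home")
     for site in (1, 2, 3) for day in range(1, 11)],
)

query = "SELECT COUNT(*) FROM page_views WHERE website_id = ? AND viewed_at >= ?"
params = (2, "2024-01-06")
views = conn.execute(query, params).fetchone()[0]
plan_text = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()[0][-1]
print(views)
print(plan_text)  # should report a covering index, so no table rows are read
```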
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain. However, if you write enough scripting you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables, but don't expect good performance, and watch out for the limit on the number of open files: on Linux you may have to use ulimit to raise it, while on Windows it is a hard limit that cannot be raised.
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT. I did this so that I could perform off-line creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE...ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described at http://blog.coldlogic.com/categories/coldstore/).
Use the single data structure. Once you start encountering performance problems there are many solutions: you can partition your tables by website id (also known as horizontal partitioning), or you can use replication. This all depends on the ratio of reads to writes.
But to start, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not. You can also take advantage of the different MySQL storage engines, like MyISAM or NDB (in-memory clustering), to boost performance. Caching also plays a very good role in offloading the database: data that is mostly read-only and can be computed easily is usually put in the cache, which serves the request instead of the database, so only the necessary queries reach the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should run performance tests yourself to find out whether one big table is sufficient.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating them into a couple of tables. What I'm asking is: are there any patterns or design techniques to handle this kind of problem (a huge number of records)? Is MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example case; in reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some of these queries because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction ?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL InnoDB, it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
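A minimal sketch of the trigger-maintained variant. Python's sqlite3 stands in for MySQL here; MySQL trigger syntax differs slightly, for example it requires FOR EACH ROW. The tables `items` and `row_counts` and the trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, n INTEGER NOT NULL);
INSERT INTO row_counts VALUES ('items', 0);

-- Keep the stored count in step with every inserted or deleted row.
CREATE TRIGGER items_count_ins AFTER INSERT ON items
BEGIN
    UPDATE row_counts SET n = n + 1 WHERE table_name = 'items';
END;

CREATE TRIGGER items_count_del AFTER DELETE ON items
BEGIN
    UPDATE row_counts SET n = n - 1 WHERE table_name = 'items';
END;
""")

conn.executemany(
    "INSERT INTO items (name) VALUES (?)", [(f"item {i}",) for i in range(100)]
)
conn.execute("DELETE FROM items WHERE id <= 10")

# Reading the stored count touches one row instead of scanning the table.
cached = conn.execute(
    "SELECT n FROM row_counts WHERE table_name = 'items'"
).fetchone()[0]
exact = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(cached, exact)  # both are 90
```

The trade-off is that every insert and delete now also updates the counter row, which can become a point of contention under heavy concurrent writes. That is why the scheduled-task variant may suit less strict requirements better.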
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow count(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total number of rows in the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of answer, and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indexes on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that, while it has improved greatly in the recent past, MySQL does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high price and quality; Postgres (aka PostgreSQL) is available open-source and is truly fantastic, and it has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
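To illustrate the third point (no functions in WHERE clauses), here is a small sketch comparing the query plans for a predicate that wraps the indexed column in a function and for an equivalent range predicate. Python's sqlite3 stands in for whichever engine you choose, and the `events` table is hypothetical; most engines behave the same way unless you create a function-based index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_events_created ON events (created)")


def plan(sql):
    """Return the first line of the planner's description for the statement."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]


# Function applied to the indexed column: the index on created cannot be used.
wrapped = plan(
    "SELECT payload FROM events WHERE substr(created, 1, 4) = '2024'"
)

# Same rows expressed as a range on the bare column: the index can be used.
sargable = plan(
    "SELECT payload FROM events "
    "WHERE created >= '2024-01-01' AND created < '2025-01-01'"
)

print(wrapped)   # a full scan of events
print(sargable)  # a search using idx_events_created
```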
I'm going to second @Mark Baker, and say that you need to build indices on your tables.
For other queries than the one you selected, you should also be aware that using constructs such as IN() is faster than a series of OR statements in the query. There are lots of little steps you can take to speed-up individual queries.
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this matters less in Oracle than in SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716