Max number of rows in MySQL

I'm planning to generate a huge amount of data, which I'd like to store in a MySQL database. My current estimate points to four thousand million (4 billion) rows in the main table (only two columns, one of them indexed).
Two questions here:
1) is this possible?
and more specifically:
2) Will such a table be efficiently usable?
Thanks,
Jaime

Sure, it's possible. Whether or not it's usable will depend on how you use it and how much hardware/memory you have. With a table that large, partitioning would probably also be worth considering, if it suits the kind of data you are storing.
ETA:
Based on the fact that you only have two columns, with one of them indexed, I'm going to take a wild guess here that this is some kind of key-value store. If that is the case, you might want to look into a specialized key-value database as well.

It may be possible; MySQL has several table storage engines with differing capabilities. I think the MyISAM storage engine, for instance, has a theoretical data size limit of 256 TB, but it's further constrained by the maximum size of a file on your operating system. I doubt it would be usable, and I'm almost certain it wouldn't be optimal.
I would definitely look at partitioning this data across multiple tables (probably even multiple DBs on multiple machines) in a way that makes sense for your keys, then federating any search results/totals/etc. you need. Amongst other things, this allows you to run searches where each partition is searched in parallel (in the multiple-servers approach).
I'd also look for a solution that's already done the heavy lifting of partitioning and federating queries. I wonder if Google's App Engine datastore (BigTable) or Amazon SimpleDB would be useful. They'd both limit what you could do with the data (they are not RDBMSs), but then, the sheer size is going to do that anyway.

You should consider partitioning your data... for example, if one of the two columns is a name, separate the rows into 26 tables based on the first letter.

I created a mysql database with one table that contained well over 2 million rows (imported U.S. census county line data for overlay on a Google map). Another table had slightly under 1 million rows (USGS Tiger location data). This was about 5 years ago.
I didn't really have an issue (once I remembered to create indexes! :) )

Four billion rows is actually not that big; it is pretty average for any database engine to handle today. Even partitioning could be overkill. It should simply work.
Your performance will depend on your hardware, though.

Related

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting amounts to 5 terabytes a year. I will keep all of my data; I don't expect to delete anything early.
I am asking myself whether I should use a distributed database, because my data will keep growing every year. After 5 years I will have 25 terabytes without indexes (just calculating the raw data I save every day).
I have 5 tables, and most queries are joins over multiple tables.
I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly interconnected.
I know it depends on the queries and on the database table design, and I could also run a distributed MySQL database. I just want to know when I should start thinking about a distributed database.
Would this be a use case, or could MySQL handle this large a dataset?
EDIT:
On average I will have 1,500 clients writing data per second, and they affect all tables.
I only need the old data for analytics, like machine learning and pattern matching.
Also, a client should be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
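To make the contrast concrete, here is a minimal sketch of the two insert patterns; the table and column names are made up, and only the choice of primary key matters:

-- Sequential key: inserts always land at the "end" of the PRIMARY KEY BTree,
-- so the hot blocks stay in cache and updates are cheap.
CREATE TABLE events_seq (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    payload VARBINARY(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Random key: each insert lands in an essentially random BTree block,
-- so at multi-TB scale nearly every insert costs real I/O.
CREATE TABLE events_rand (
    id      CHAR(36) NOT NULL,   -- e.g. a UUID
    payload VARBINARY(255),
    PRIMARY KEY (id)
) ENGINE=InnoDB;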
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
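As a sketch of what such a summary table might look like, assuming a hypothetical fact table orders(order_date, amount):

-- One row per month instead of millions of fact rows.
CREATE TABLE orders_monthly_summary (
    month_start  DATE NOT NULL,
    order_count  BIGINT UNSIGNED NOT NULL,
    total_amount DECIMAL(18,2) NOT NULL,
    PRIMARY KEY (month_start)
) ENGINE=InnoDB;

-- Refresh last month's row from the fact table (run by a scheduled job).
REPLACE INTO orders_monthly_summary
SELECT DATE_FORMAT(order_date, '%Y-%m-01') AS month_start,
       COUNT(*),
       SUM(amount)
FROM   orders
WHERE  order_date >= DATE_FORMAT(CURRENT_DATE - INTERVAL 1 MONTH, '%Y-%m-01')
  AND  order_date <  DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')
GROUP  BY month_start;

Monthly reports then read a handful of summary rows instead of scanning the fact table, and some indexes on the fact table may become unnecessary.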
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc.? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such an action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize"; you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work and a data warehouse for reporting, using a regular Extract/Transform/Load (ETL) job to move the data across and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sort-of out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Database sharding vs partitioning

I have been reading about scalable architectures recently. In that context, two words that keep on showing up with regards to databases are sharding and partitioning. I looked up descriptions but still ended up confused.
Could the experts at stackoverflow help me get the basics right?
What is the difference between sharding and partitioning?
Is it true that 'all sharded databases are essentially partitioned (over different nodes), but all partitioned databases are not necessarily sharded' ?
Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.
Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to know which instance or server to look for the data. An identifier of this kind is often called a "Shard Key".
A common, key-less logic is to use the alphabet to divide the data: A-D goes to instance 1, E-G to instance 2, etc. Customer data is well suited for this, but the data will be somewhat unevenly sized across instances if the partitioning does not take into account that some letters are more common than others.
Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.
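To make the routing logic concrete, here is a toy sketch of both rules; the customers table, the column names and the shard count are all hypothetical, and in practice this decision is usually made in the application or a routing proxy rather than in SQL:

-- Alphabet rule from above: A-D -> shard 1, E-G -> shard 2, the rest -> shard 3.
-- Hash rule: spread numeric keys evenly over 4 shards.
SELECT customer_id,
       CASE
           WHEN last_name < 'E' THEN 1
           WHEN last_name < 'H' THEN 2
           ELSE 3
       END                  AS shard_by_name,
       MOD(customer_id, 4)  AS shard_by_id
FROM   customers;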
A well known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out partitioned on very few servers, using Postgres to divide the data from the get-go. I believe it was several thousand logical shards on those few physical shards. Read their awesome writeup from 2012 here: Instagram Engineering - Sharding & IDs
See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition
I've been diving into this as well, and although I'm by no means the reference on the matter, there are a few key facts and points that I've gathered and would like to share:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, as for load balancing.
https://en.wikipedia.org/wiki/Partition_(database)
Sharding is a type of partitioning, such as Horizontal Partitioning (HP)
There is also Vertical Partitioning (VP) whereby you split a table into smaller distinct parts. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized.
https://en.wikipedia.org/wiki/Shard_(database_architecture)
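A minimal sketch of vertical partitioning, with hypothetical table and column names; the wide, rarely-read columns are moved to a side table that shares the same primary key:

-- Hot, narrow columns stay in the main table.
CREATE TABLE users (
    user_id INT UNSIGNED NOT NULL,
    email   VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id)
) ENGINE=InnoDB;

-- Wide, rarely-read columns live in a vertically partitioned side table.
CREATE TABLE users_profile (
    user_id INT UNSIGNED NOT NULL,
    bio     TEXT,
    avatar  MEDIUMBLOB,
    PRIMARY KEY (user_id)
) ENGINE=InnoDB;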
I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...
"Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.
"Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).
https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
When to Partition a Table
Here are some suggestions for when to partition a table:
Tables greater than 2 GB should always be considered as candidates for partitioning.
Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
When the contents of a table need to be distributed across different types of storage devices.
Partition Pruning
Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
Partitioning Strategies
Range
Hash
List
You can read their text and visualize their images which explain everything pretty well.
And lastly, it is important to understand that databases are extremely resource intensive:
CPU
Disk
I/O
Memory
Many DBAs will partition on the same machine, where the partitions share all the resources but provide an improvement in disk and I/O by splitting up the data and/or indexes.
Other strategies employ a "shared nothing" architecture, where the shards reside on separate and distinct computing units (nodes), each having 100% of the CPU, disk, I/O and memory to itself. This provides its own set of advantages and complexities.
https://en.wikipedia.org/wiki/Shared_nothing_architecture
Looks like this answers both your questions:
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Source: Wiki-Shard.
Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Source: MongoDB.
Consider a table in a database with 1 million rows and 100 columns.
In partitioning, you can divide the table into 2 or more tables with properties like:
0.4 million rows (table1), 0.6 million rows (table2)
1 million rows & 60 columns (table1) and 1 million rows & 40 columns (table2)
There could be multiple cases like that. This is partitioning in general.
But sharding refers to the first case only, where we divide the data on the basis of rows. Since we are dividing the table into multiple tables, we need to maintain multiple similar copies of the schema, as we now have multiple tables.
When talking about partitioning, please do not use the terms replicate or replication; replication is a different concept and out of scope for this page.
When we talk about partitioning, the better word is divide; when we talk about sharding, the better word is distribute.
In partitioning (normally and in common understanding, though not always), the rows of a large table are divided into two or more disjoint (not sharing any row) groups. You can call each group a partition. These groups, i.e. all the partitions, remain under the control of one RDBMS instance, and this is all logical. The basis of each group can be a hash, a range, etc. If you have ten years of data in a table, you can store each year's data in a separate partition, which can be achieved by setting partition boundaries on a non-null CREATE_DATE column. Once you query the database, if you specify a create date between 01-01-1999 and 31-12-2000, only two partitions will be hit, and the scan will be sequential. I did something similar on a database with a billion-plus records, and query time came down from 30 seconds to 50 milliseconds, using indexes and the rest.
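In MySQL syntax, the per-year range partitioning described above might look like the following sketch; the table and column names are mine, not from the original setup:

-- One partition per year, bounded on the non-null CREATE_DATE column.
CREATE TABLE history (
    id          BIGINT UNSIGNED NOT NULL,
    create_date DATE NOT NULL,
    data        VARCHAR(255),
    PRIMARY KEY (id, create_date)   -- the partition column must be in every unique key
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(create_date)) (
    PARTITION p1999 VALUES LESS THAN (2000),
    PARTITION p2000 VALUES LESS THAN (2001),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A date-bounded query only touches the 1999 and 2000 partitions (partition pruning).
SELECT COUNT(*) FROM history
WHERE  create_date BETWEEN '1999-01-01' AND '2000-12-31';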
Sharding means hosting each partition on a different node/machine. Searching inside the partitions/shards can then happen in parallel.
Sharding is a special case of horizontal partitioning, where partitions span multiple database instances. If a database is sharded, it is by definition partitioned.
A horizontal partition, when moved to another database instance*, becomes a database shard.
* The database instance can be on the same machine or on another machine.

MySQL Database Structure

I will have a table with a few million entries, and I have been wondering whether it would be smarter to create more than just this one table, even though they would all have the same structure. Would it save resources, and would it be more efficient in the end?
This is my particular concern because I plan on creating a small search engine which indexes about 3,000,000 sites, and each site will have approximately 30 words indexed. This is my structure right now:
site
--id
--url
word
--id
--word
appearances
--site_id
--word_id
--score
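Expressed as MySQL DDL, that structure is roughly the following (column types, lengths and index choices are assumptions):

CREATE TABLE site (
    id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
    url VARCHAR(2048) NOT NULL,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE word (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    word VARCHAR(64) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uq_word (word)
) ENGINE=InnoDB;

CREATE TABLE appearances (
    site_id INT UNSIGNED NOT NULL,
    word_id INT UNSIGNED NOT NULL,
    score   INT NOT NULL,
    PRIMARY KEY (word_id, site_id),  -- word-first, since searches look up by word
    KEY idx_site (site_id)
) ENGINE=InnoDB;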
Should I keep this structure? Or should I create tables for A words, B words, C words, etc.? The same question applies to the appearances table.
SELECT queries are faster on smaller tables. You want the indexes you sort on to fit into your system's memory for better performance.
More importantly, tables should not be defined in order to hold a certain type of data, but rather a collection of associated data. So if the data you are storing has logical differences, it may be better split into separate tables.
(Incomplete)
Pros:
Faster data access
Easier to copy or back up
Cons:
Cannot easily compare data from different tables.
Union and join queries are needed to compare across tables
If you aren't concerned with some latency on your database, it should be able to handle this on the order of a few million records without too much trouble.
Here's some questions to ask yourself:
Are the records all inter-related? Is there any way of cleanly dividing them into different, non-overlapping groups? Are these groups well defined, or subject to change?
Is maintaining optimal write speed more of a concern than simplicity of access to data?
Is there any way of partitioning the records into different categories?
Is replication a concern? Redundancy?
Are you concerned about transaction safety?
Is it possible to re-structure the data later if you get the initial schema wrong?
There are a lot of ways of tackling this problem, but until you know the parameters you're working with, it's very hard to say.
Usually step one is to collect either a large corpus of genuine data, or at least simulate enough data that's reasonably similar to the genuine data to be structurally the same. Then you use your test data to try out different methods of storing and retrieving it.
Without any test data, you're just stabbing in the dark.

Use one giant MySQL table?

Say I have a table with 25,000 or so rows:
item_id, item_name, item_value, etc...
My application will allow users to generate dynamic lists of anywhere from 2-300 items each.
Should I store all of these relationships in a giant table with columns dynamic_list_id, item_id? Each dynamic list would end up having 2-300 rows in this table, and the size of the table would likely balloon to the millions, or even billions.
This table would also be queried quite frequently, retrieving several of these dynamic lists each second. Is a giant table the best way to go? Would it make sense to split it up into dynamic tables, perhaps named by user?
I'm really at a loss when it comes to preparing databases for giant amounts of data like this, so any insight would be much appreciated.
It's a relational database, it's designed for that kind of thing - just go for it. A mere few million rows doesn't even count as "giant". Think very carefully about indexing though - you have to balance insert/update performance, storage space and query performance.
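For example, the proposed mapping table might be sketched like this, with the indexes chosen for the "fetch one list" access pattern from the question (the items table name is assumed):

CREATE TABLE dynamic_list_items (
    dynamic_list_id BIGINT UNSIGNED NOT NULL,
    item_id         INT UNSIGNED NOT NULL,
    PRIMARY KEY (dynamic_list_id, item_id),  -- fetch an entire list with one range scan
    KEY idx_item (item_id)                   -- find all lists containing a given item
) ENGINE=InnoDB;

-- Retrieving one list:
SELECT i.*
FROM   dynamic_list_items dli
JOIN   items i ON i.item_id = dli.item_id
WHERE  dli.dynamic_list_id = 42;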
Yes, I recommend going with your proposed design: "a giant table with columns dynamic_list_id, item_id."
Performance can easily be addressed as required, through index selection, and by increasing the number of spindles and read/write arms, and SSD caching.
And in the grand scheme of things, this database does not look to be particularly big. These days it takes dozens or hundreds of TB to be a BIG database.
With such a large table, make sure to set your engine to InnoDB for row-level locks.
Make sure you're using indexes wisely. If your queries start to drag, increase innodb_buffer_pool_size to compensate.
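For instance (a sketch; the 8 GB figure is only an example, and resizing the buffer pool at runtime requires MySQL 5.7.5+, otherwise set it in my.cnf and restart):

-- Convert an existing table to InnoDB for row-level locking.
ALTER TABLE dynamic_list_items ENGINE=InnoDB;

-- Check the current buffer pool size, then raise it to fit the working set.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;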

mysql tables structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites averaging tens of thousands to hundreds of thousands of page views a day each, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service covering hundreds of websites, each with its own table, seems like a big problem. On the other hand, performance and scalability will probably be a problem with this much data. Any suggestions, comments or advice are most welcome.
How about one table partitioned by website FK?
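For example, a single page-view table hash-partitioned on the website id might be sketched like this; the table, columns and partition count are assumptions:

CREATE TABLE page_views (
    view_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    website_id INT UNSIGNED NOT NULL,
    viewed_at  DATETIME NOT NULL,
    url        VARCHAR(2048) NOT NULL,
    PRIMARY KEY (view_id, website_id),        -- the partition key must be in the primary key
    KEY idx_site_time (website_id, viewed_at)
) ENGINE=InnoDB
PARTITION BY HASH (website_id)
PARTITIONS 16;

-- Per-site queries only touch the partition holding that website.
SELECT COUNT(*) FROM page_views
WHERE  website_id = 123
  AND  viewed_at >= '2024-01-01';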
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with the same columns, so from a database normalization standpoint it makes sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
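As a sketch of how such pre-calculated statistics could be stored (assuming a raw page-view table like the page_views table sketched in the previous answer; all names are hypothetical), a daily job could maintain per-site counts:

CREATE TABLE daily_page_views (
    website_id INT UNSIGNED NOT NULL,
    day        DATE NOT NULL,
    views      BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (website_id, day)
) ENGINE=InnoDB;

-- Run once per day (cron or a MySQL EVENT) to roll up yesterday's traffic.
INSERT INTO daily_page_views (website_id, day, views)
SELECT website_id, DATE(viewed_at), COUNT(*)
FROM   page_views
WHERE  viewed_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND  viewed_at <  CURRENT_DATE
GROUP  BY website_id, DATE(viewed_at)
ON DUPLICATE KEY UPDATE views = VALUES(views);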
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain. However, if you write enough scripting, you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the Windows hard limit on the number of open files - in Linux you may have to use ulimit to raise the limit; there's no way to do it in Windows).
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT across them. I did this so that I could perform offline creation and optimization of each of the small tables (e.g. OPTIMIZE TABLE or ALTER TABLE ... ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described here: http://blog.coldlogic.com/categories/coldstore/).
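For reference, a MERGE setup looks roughly like this (the table names are made up; the underlying tables must be identical MyISAM tables):

CREATE TABLE log_2023 (id INT NOT NULL, msg VARCHAR(255), KEY (id)) ENGINE=MyISAM;
CREATE TABLE log_2024 (id INT NOT NULL, msg VARCHAR(255), KEY (id)) ENGINE=MyISAM;

-- The MERGE table fans SELECTs out to every underlying table.
CREATE TABLE log_all (id INT NOT NULL, msg VARCHAR(255), KEY (id))
    ENGINE=MERGE UNION=(log_2023, log_2024) INSERT_METHOD=LAST;

SELECT COUNT(*) FROM log_all WHERE id = 42;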
Use a single data structure. Once you start encountering performance problems, there are many solutions: you can partition your tables by website id (also known as horizontal partitioning), or you can use replication. It all depends upon the ratio of reads to writes.
But to start, keep things simple and use one table with proper indexing. You should also determine whether you need transactions. You can take advantage of the different MySQL storage engines, like MyISAM or NDB (in-memory clustering), to boost performance. Caching also plays a big role in offloading work from the database: data that is mostly read-only and easily computed is usually put in a cache, and the cache serves those requests so that only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should run performance tests yourself to understand whether having one big table is sufficient.