Database sharding vs partitioning - mysql

I have been reading about scalable architectures recently. In that context, two terms that keep showing up in relation to databases are sharding and partitioning. I looked up definitions but still ended up confused.
Could the experts at stackoverflow help me get the basics right?
What is the difference between sharding and partitioning?
Is it true that 'all sharded databases are essentially partitioned (over different nodes), but all partitioned databases are not necessarily sharded' ?

Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.
Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to decide which instance or server holds the data. An identifier of this kind is often called a "Shard Key".
A common, key-less logic is to use the alphabet to divide the data: A-D is instance 1, E-G is instance 2, etc. Customer data is well suited for this, but the instances will end up somewhat uneven in size if the partitioning does not take into account that some letters are more common than others.
Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.
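As a minimal sketch (assuming four shards and a numeric customer_id as the shard key, both invented here), hash-based routing sidesteps the uneven-letters problem; the application runs the same computation to decide which instance to query:

    -- Hypothetical: pick one of 4 shard instances from the shard key.
    -- The same schema exists on every shard; only the rows differ.
    SELECT customer_id,
           CRC32(customer_id) % 4 AS shard_no   -- 0..3 maps to instances 1..4
    FROM   new_customers;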
A well known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out partitioned on very few servers, using Postgres to divide the data from the get-go. I believe it was several thousand logical shards on those few physical shards. Read their awesome writeup from 2012 here: Instagram Engineering - Sharding & IDs
See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition

I've been diving into this as well, and although I'm far from the authority on the matter, there are a few key facts that I've gathered and points that I'd like to share:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, or for load balancing.
https://en.wikipedia.org/wiki/Partition_(database)
Sharding is a type of partitioning, such as Horizontal Partitioning (HP)
There is also Vertical Partitioning (VP) whereby you split a table into smaller distinct parts. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized.
https://en.wikipedia.org/wiki/Shard_(database_architecture)
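A rough sketch of vertical partitioning as just described (all names are invented): rarely used wide columns move to a second table that shares the primary key:

    CREATE TABLE users_core (
      user_id INT UNSIGNED NOT NULL PRIMARY KEY,
      email   VARCHAR(255) NOT NULL,
      name    VARCHAR(100) NOT NULL
    );

    CREATE TABLE users_extra (
      user_id  INT UNSIGNED NOT NULL PRIMARY KEY,  -- same key as users_core
      bio      TEXT,
      settings JSON
    );

    -- Reassemble when needed:
    -- SELECT * FROM users_core JOIN users_extra USING (user_id);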
I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...
"Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.
"Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).
https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
When to Partition a Table
Here are some suggestions for when to partition a table:
Tables greater than 2 GB should always be considered as candidates for partitioning.
Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
When the contents of a table need to be distributed across different types of storage devices.
Partition Pruning
Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
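As a hedged, MySQL-flavored sketch of that Orders example (the excerpt above is Oracle; the names and monthly rather than weekly ranges here are illustrative):

    CREATE TABLE orders (
      order_id   BIGINT UNSIGNED NOT NULL,
      order_date DATE NOT NULL,
      amount     DECIMAL(10,2) NOT NULL,
      PRIMARY KEY (order_id, order_date)  -- partition column must be in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(order_date)) (
      PARTITION p2022_12 VALUES LESS THAN (TO_DAYS('2023-01-01')),
      PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
      PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Pruning: only the partition(s) covering this range are read.
    EXPLAIN SELECT SUM(amount) FROM orders
    WHERE order_date BETWEEN '2023-01-09' AND '2023-01-15';

The partitions column of the EXPLAIN output should list only p2023_01.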
Partitioning Strategies
Range
Hash
List
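Range is sketched above; hedged MySQL equivalents of the other two strategies (table and column names invented):

    -- HASH: spread rows evenly across a fixed number of partitions.
    CREATE TABLE events_hash (
      id        BIGINT NOT NULL,
      region_id INT NOT NULL,
      PRIMARY KEY (id, region_id)
    ) PARTITION BY HASH (region_id) PARTITIONS 8;

    -- LIST: map discrete values to named partitions explicitly.
    CREATE TABLE events_list (
      id        BIGINT NOT NULL,
      region_id INT NOT NULL,
      PRIMARY KEY (id, region_id)
    ) PARTITION BY LIST (region_id) (
      PARTITION p_east VALUES IN (1, 2, 3),
      PARTITION p_west VALUES IN (4, 5, 6)
    );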
You can read their text and look at their images, which explain everything pretty well.
And lastly, it is important to understand that databases are extremely resource intensive:
CPU
Disk
I/O
Memory
Many DBAs will partition on the same machine, where the partitions share all the resources but provide an improvement in disk and I/O by splitting up the data and/or indexes.
Other strategies employ a "shared nothing" architecture, where the shards reside on separate and distinct computing units (nodes), each having 100% of the CPU, disk, I/O and memory to itself. This brings its own set of advantages and complexities.
https://en.wikipedia.org/wiki/Shared_nothing_architecture

Looks like this answers both your questions:
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Source: Wiki-Shard.
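A hedged sketch of that classic example (columns abbreviated; the routing rule is invented):

    CREATE TABLE CustomersEast (
      customer_id INT NOT NULL PRIMARY KEY,
      zip         CHAR(5) NOT NULL,
      name        VARCHAR(100) NOT NULL
    );
    CREATE TABLE CustomersWest LIKE CustomersEast;
    -- Invented rule: zip < '50000' lives in CustomersEast, the rest in
    -- CustomersWest, so a lookup never has to search the other table.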
Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Source: MongoDB.

Consider a table in a database with 1 million rows and 100 columns.
In partitioning, you can divide the table into 2 or more tables with properties like:
0.4 million rows (table1), 0.6 million rows (table2)
1 million rows & 60 columns (table1) and 1 million rows & 40 columns (table2)
There could be multiple cases like that.
This is general partitioning.
But sharding refers to the 1st case only, where we divide the data on the basis of rows. Since we are dividing the table into multiple tables, we need to maintain multiple similar copies of the schema, as we now have multiple tables.
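A minimal sketch of the 1st case, assuming an integer id and the 0.4/0.6 million split above (big_table is an invented name):

    CREATE TABLE table1 LIKE big_table;  -- identical schema on both
    CREATE TABLE table2 LIKE big_table;
    INSERT INTO table1 SELECT * FROM big_table WHERE id <= 400000;
    INSERT INTO table2 SELECT * FROM big_table WHERE id >  400000;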

When talking about partitioning, please do not use the terms replicate or replication. Replication is a different concept and out of scope for this page.
When we talk about partitioning, the better word is divide; when we talk about sharding, the better word is distribute.
In partitioning (normally and in common understanding, though not always), the rows of a large table are divided into two or more disjoint (not sharing any row) groups. You can call each group a partition. These groups, and thus all the partitions, remain under the control of one RDBMS instance, and this is all logical. The basis of each group can be a hash, a range, etc. If you have ten years of data in a table, then you can store each year's data in a separate partition, and this can be achieved by setting partition boundaries on the basis of a non-null column CREATE_DATE. Once you query the DB, if you specify a create date between 01-01-1999 and 31-12-2000, then only two partitions will be hit, and the access will be sequential. I did something similar on a DB with a billion-plus records, and SQL time came down from 30 seconds to 50 milliseconds (with indexes etc. in place).
Sharding means you host each partition on a different node/machine. Searching inside the partitions/shards can then happen in parallel.

Sharding is a special case of horizontal partitioning, where the partitions span multiple database instances. If a database is sharded, it is partitioned by definition.

A horizontal partition, when moved to another database instance*, becomes a database shard.
* The database instance can be on the same machine or on another machine.

Related

How to improve the performance of table scans with innodb

Brief: Is there any way to improve the performance of table scans on InnoDB tables?
Please, do not suggest adding indexes to avoid table scans. (see below)
innodb_buffer_pool_size sits at 75% of server memory (48 GB/64GB)
I'm using the latest version of Percona (5.7.19) if that changes anything
Longer: We have 600 GB of recent time-series data (we aggregate and delete older data) spread over 50-60 tables, so most of it is "active" data that is regularly queried. These tables are somewhat large (400+ numeric columns), and many queries run against a number of those columns (for alarming), which is why it is impractical to add indexes (we would have to add a few dozen). The largest tables are partitioned per day.
I am fully aware that this is an application/table design problem and not a "server tuning" problem. We are currently working to significantly change the way these tables are designed and queried, but have to maintain the existing system until this happens so I'm looking for a way to improve things a bit to buy us a little time.
We recently split this system and have moved a part of it to a new server. It previously used MyISAM, and we tried moving to TokuDB which seemed appropriate but ran into some weird problems. We switched to InnoDB but performance is really bad. I get the impression that MyISAM is better with table scans which is why, barring any better option, we'll go back to it until the new system is in place.
Update
All tables have pretty much the same structure:
- timestamp
- primary key (varchar(20) field)
- about 15 fields of various types representing other secondary attributes that can be filtered upon (along with an appropriately indexed criterion first)
- and then a few hundred measures (floats), between 200 and 400
I already trimmed the row length as much as I could without changing the structure itself. The primary key used to be a varchar(100), all measures used to be doubles, many of the secondary attributes had their data types changed.
Upgrading hardware is not really an option.
Creating small tables with just the set of columns I need would help some processes perform faster. But at the cost of creating that table with a table scan first and duplicating data. Maybe if I created it as a memory table. By my estimate, it would take a couple of GB away from the buffer pool. Also there are aggregation processes that read about as much data from the main tables on a regular basis, and they need all columns.
There is unfortunately a lot of duplication of effort in those queries, which I plan to address in the next version. The alarming and aggregation processes basically reprocess the entire day's worth of data every time some rows are inserted (every half hour), instead of just dealing with new/changed data.
Like I said, the bigger tables are partitioned, so it's usually a scan over a daily partition rather than the entire table, which is a small consolation.
Implementing a system to hold this in memory outside of the DB could work, but that would entail a lot of changes on the legacy system and development work. Might as well spend that time on the better design.
The fact that InnoDB tables are so much bigger than MyISAM for the same data (2-3x as big in my case) really hinders the performance.
MyISAM is a little bit better at table-scans, because it stores data more compactly than InnoDB. If your queries are I/O-bound, scanning through less data on disk is faster. But this is a pretty weak solution.
You might try using InnoDB compression to reduce the size of data. That might get you closer to MyISAM size, but you're still I/O-bound so it's going to suck.
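For reference, enabling InnoDB compression is a one-line ALTER (the table name here is illustrative; KEY_BLOCK_SIZE=8 is just a common starting point, and this requires innodb_file_per_table, the default in 5.7):

    ALTER TABLE metrics ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;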
Ultimately, it sounds like you need a database that is designed for an OLAP workload, like a data warehouse. InnoDB and TokuDB are both designed for OLTP workload.
It smells like a Data Warehouse with "Reports". By judicious picking of what to aggregate (a selection of your Floats) over what time period (hour or day is typical), you can build and maintain Summary Tables that work much more efficiently for the Reports. This has the effect of scanning the data only once (to build the Summaries), not repeatedly. The Summary tables are much smaller, so the reports are much faster -- 10x is perhaps typical.
It may also be possible to augment the Summary tables as the raw data is being Inserted. (See INSERT .. ON DUPLICATE KEY UPDATE ..)
And use Partitioning by date to allow for efficient DROP PARTITION instead of DELETE (see the sketch below). Don't have more than about 50 partitions.
Summary Tables
Time series Partitioning
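A hedged sketch of both ideas, with invented names (daily_summary is assumed to have a UNIQUE key on (metric_day, sensor_id)):

    -- Augment the summary as raw rows arrive (INSERT .. ON DUPLICATE KEY UPDATE):
    INSERT INTO daily_summary (metric_day, sensor_id, cnt, total)
    SELECT DATE(ts), sensor_id, COUNT(*), SUM(value)
    FROM   raw_batch
    GROUP  BY DATE(ts), sensor_id
    ON DUPLICATE KEY UPDATE
      cnt   = cnt   + VALUES(cnt),
      total = total + VALUES(total);

    -- Rotate old data instantly instead of running a huge DELETE:
    ALTER TABLE raw_data DROP PARTITION p20230109;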
If you would like to discuss in more detail, let's start with one of the queries that is scanning so much now.
In the various projects I have worked on, there were between 2 and 7 Summary tables.
With 600GB of data, you may be pushing the limits on 'ingestion'. If so, we can discuss that, too.

Distributed database use cases

At the moment I have a MySQL database, and the data I am collecting is 5 terabytes a year. I will keep my data indefinitely; I don't think I want to delete anything early.
I ask myself if I should use a distributed database, because my data will grow every year. And after 5 years I will have 25 terabytes without indexes (just calculated from the raw data I save every day).
I have 5 tables, and most queries are joins over multiple tables.
And I mostly need to access 1-2 columns over many rows at a specific timestamp.
Would a distributed database be preferable to a single MySQL database?
Partitioning will be difficult, because all my tables are highly interconnected.
I know it depends on the queries and on the database table design, and I could also have a distributed MySQL database.
I just want to know when I should start thinking about a distributed database.
Would this be a use case, or could MySQL handle this large dataset?
EDIT:
On average I will have 1,500 clients writing data per second; they affect all tables.
I just need the old dataset for analytics, like machine learning and pattern matching.
Also, a client should be able to see the historical data.
Your question is about "distributed", but I see more serious questions that need answering first.
"Highly indexed 5TB" will slow to a crawl. An index is a BTree. To add a new row to an index means locating the block in that tree where the item belongs, then read-modify-write that block. But...
If the index is AUTO_INCREMENT or TIMESTAMP (or similar things), then the blocks being modified are 'always' at the 'end' of the BTree. So virtually all of the reads and writes are cacheable. That is, updating such an index is very low overhead.
If the index is 'random', such as UUID, GUID, md5, etc, then the block to update is rarely found in cache. That is, updating this one index for this one row is likely to cost a pair of IOPs. Even with SSDs, you are likely to not keep up. (Assuming you don't have several TB of RAM.)
If the index is somewhere between sequential and random (say, some kind of "name"), then there might be thousands of "hot spots" in the BTree, and these might be cacheable.
Bottom line: If you cannot avoid random indexes, your project is doomed.
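Not from the answer above, but one known mitigation if you end up on MySQL 8.0+ with UUIDs anyway: UUID_TO_BIN() with its swap flag stores UUIDs compactly and roughly time-ordered, so inserts land near the 'end' of the BTree instead of in random blocks (table and column names invented):

    CREATE TABLE events (
      id      BINARY(16) NOT NULL PRIMARY KEY,
      payload JSON
    );
    -- The second argument (1) moves the timestamp bits to the front.
    INSERT INTO events (id, payload)
    VALUES (UUID_TO_BIN(UUID(), 1), '{"x": 1}');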
Next issue... The queries. If you need to scan 5TB for a SELECT, that will take time. If this is a Data Warehouse type of application and you need to, say, summarize last month's data, then building and maintaining Summary Tables will be very important. Furthermore, this can obviate the need for some of the indexes on the 'Fact' table, thereby possibly eliminating my concern about indexes.
"See the historical data" -- See individual rows? Or just see summary info? (Again, if it is like DW, one rarely needs to see old datapoints.) If summarization will suffice, then most of the 25TB can be avoided.
Do you have a machine with 25TB online? If not, that may force you to have multiple machines. But then you will have the complexity of running queries across them.
5TB is estimated from INT = 4 bytes, etc.? If using InnoDB, you need to multiply by 2 to 3 to get the actual footprint. Furthermore, if you need to modify a table in the future, such an action probably needs to copy the table over, so that doubles the disk space needed. Your 25TB becomes more like 100TB of storage.
PARTITIONing has very few valid use cases, so I don't want to discuss that until knowing more.
"Sharding" (splitting across machines) is possibly what you mean by "distributed". With multiple tables, you need to think hard about how to split up the data so that JOINs will continue to work.
The 5TB is huge -- Do everything you can to shrink it -- Use smaller datatypes, normalize, etc. But don't "over-normalize", you could end up with terrible performance. (We need to see the queries!)
There are many directions to take a multi-TB db. We really need more info about your tables and queries before we can be more specific.
It's really impossible to provide a specific answer to such a wide question.
In general, I recommend only worrying about performance once you can prove that you have a problem; if you're worried, it's much better to set up a test rig, populate it with representative data, and see what happens.
"Can MySQL handle 5 - 25 TB of data?" Yes. No. Depends. If - as you say - you have no indexes, your queries may slow down a long time before you get to 5TB. If it's 5TB / year of highly indexable data it might be fine.
The most common solution to this question is to keep a "transactional" database for all the "regular" work, and a datawarehouse for reporting, using a regular Extract/Transform/Load job to move the data across, and archive it. The data warehouse typically has a schema optimized for querying, usually entirely unlike the original schema.
If you want to keep everything logically consistent, you might use sharding and clustering - a sorta-kinda out-of-the-box feature of MySQL.
I would not, however, roll my own "distributed database" solution. It's much harder than you might think.

Store large amounts of sensor data in SQL, optimize for query performance

I need to store sensor data from various locations (different factories with different rooms with each different sensors). Data is being downloaded in regular intervals from a device on site in the factories that collects the data transmitted from all sensors.
The sensor data looks like this:
collecting_device_id, sensor_id, type, value, unit, timestamp
Type could be temperature, unit could be degrees_celsius. collecting_device_id will identify the factory.
There are quite a lot of different things (==types) being measured.
I will collect around 500 million to 750 million rows and then perform analyses on them.
Here's the question for storing the data in a SQL database (let's say MySQL InnoDB on AWS RDS, large machine if necessary):
When considering query performance for future queries, is it better to store this data in one huge table just like it comes from the sensors? Or to distribute it across tables (tables for factories, temperatures, humidities, …, everything normalized)? Or to have a wide table with different fields for the data points?
Yes, I know, it's hard to say "better" without knowing the queries. Here's more info and a few things I have thought about:
There's no constant data stream as data is uploaded in chunks every 2 days (a lot of writes when uploading, the rest of the time no writes at all), so I would guess that index maintenance won't be a huge issue.
I will try to reduce the amount of data being inserted upfront (data that can easily be replicated later on, data that does not add additional information, …)
Queries that should be performed are not defined yet (I know, designing the query makes a big difference in terms of performance). It's exploratory work (so we don't know ahead what will be asked and cannot easily pre-compute values), so one time you want to compare data points of one type in a time range to data points of another type, the other time you might want to compare rooms in factories, calculate correlations, find duplicates, etc.
If I had multiple tables and normalized everything, the queries would need a lot of joins (which would probably make everything quite slow).
Queries mostly need to be performed on the whole ~500-million-row database, rarely on separately downloaded subsets.
There will be very few users (<10), most of them will execute these "complex" queries.
Is a SQL database a good choice at all? Would there be a big difference in terms of performance for this use case to use a NoSQL system?
In this setup with this amount of data, will I have queries that never "come back"? (considering the query is not too stupid :-))
Don't pre-optimize. If you don't know the queries, then you don't know the queries. It is too easy to make choices now that will slow down some subset of queries. When you know how the data will be queried, you can optimize then -- it is easy to normalize after the fact (pull temperature data out into a related table, for example). For now, I suggest you put it all in one table.
You might consider partitioning the data by date or if you have another way that might be useful (recording device maybe?). Often data of this size is partitioned if you have the resources.
After you think about the queries, you will possibly realize that you don't really need all the datapoints. Instead, max/min/avg/etc for, say, 10-minute intervals may be sufficient. And you may want to "alarm" on "over-temp" values. This should not involve the database, but should involve the program receiving the sensor data.
So, I recommend not storing all the data; instead only store summarized data. This will greatly shrink the disk requirements. (You could store the 'raw' data to a plain file in case you are worried about losing it. It will be adequately easy to reprocess the raw file if you need to.)
If you do decide to store all the data in table(s), then I recommend these tips:
High speed ingestion (includes tips on Normalization)
Summary Tables
Data Warehousing
Time series partitioning (if you plan to delete 'old' data) (partitioning is painful to add later)
750M rows -- per day? per decade? Per month - not too much challenge.
By receiving a batch every other day, it becomes quite easy to load the batch into a temp table, do normalization, summarization, etc; then store the results in the Summary table(s) and finally copy to the 'Fact' table (if you choose to keep the raw data in a table).
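A hedged outline of that flow (staging, sensors, and fact are invented names; sensors is assumed to have a UNIQUE key on sensor_code):

    -- 1. Load the every-other-day batch into a staging table.
    LOAD DATA LOCAL INFILE '/tmp/batch.csv'
      INTO TABLE staging
      FIELDS TERMINATED BY ',' IGNORE 1 LINES;

    -- 2. Normalize: register any sensors not seen before.
    INSERT IGNORE INTO sensors (sensor_code)
      SELECT DISTINCT sensor_code FROM staging;

    -- 3. Summarize (see the next note), copy raw rows to the Fact table,
    --    then clear staging for the next batch.
    INSERT INTO fact (ts, sensor_id, value)
      SELECT s.ts, d.sensor_id, s.value
      FROM staging AS s JOIN sensors AS d USING (sensor_code);
    TRUNCATE TABLE staging;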
In reading my tips, you will notice that avg is not summarized; instead sum and count are. If you need standard deviation, also, keep sum-of-squares.
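The reason is that sums and counts combine across chunks while averages do not. A sketch, assuming an invented summary table with columns cnt (COUNT(*)), total (SUM(value)), and total_sq (SUM(value*value)), using Var = E[X^2] - (E[X])^2:

    SELECT SUM(total) / SUM(cnt) AS avg_value,
           SQRT(SUM(total_sq) / SUM(cnt)
                - POW(SUM(total) / SUM(cnt), 2)) AS std_dev_pop
    FROM summary_10min
    WHERE slot >= '2023-01-01' AND slot < '2023-02-01';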
If you fail to include all the Summary Tables you ultimately need, it is not too difficult to re-process the Fact table (or Fact files) to populate the new Summary Table. This is a one-time task. After that, the summarization of each chunk should keep the table up to date.
The Fact table should be Normalized (for space); the Summary tables should be somewhat denormalized (for performance). Exactly how much denormalization depends on size, speed, etc., and cannot be predicted at this level of discussion.
"Queries on 500M rows" -- Design the Summary tables so that all queries can be done against them, instead. A starting rule-of-thumb: Any Summary table should have one-tenth the number of rows as the Fact table.
Indexes... The Fact table should have only a primary key. (The first 100M rows will work nicely; the last 100M will run so slowly. This is a lesson you don't want to have to learn 11 months into the project; so do pre-optimize.) The Summary tables should have whatever indexes make sense. This also makes querying a Summary table faster than the Fact table. (Note: Having a secondary index on a 500M-rows table is, itself, a non-trivial performance issue.)
NoSQL either forces you to re-invent SQL, or depends on brute-force full-table-scans. Summary tables are the real solution. In one (albeit extreme) case, I sped up a 1-hour query to 2 seconds by using a Summary table. So, I vote for SQL, not NoSQL.
As for whether to "pre-optimize" -- I say it is a lot easier than rebuilding a 500M-row table. That brings up another issue: start with the minimal datasize for each field: look at MEDIUMINT (3 bytes), UNSIGNED (an extra bit), CHARACTER SET ascii (use utf8 or utf8mb4 only for columns that need it), NOT NULL (NULL costs a bit), etc.
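A hedged illustration of those datatype tips (column names invented; every byte saved is multiplied by 500M rows):

    CREATE TABLE reading (
      sensor_id MEDIUMINT UNSIGNED NOT NULL,          -- 3 bytes instead of 4 (INT)
      ts        TIMESTAMP NOT NULL,                   -- 4 bytes
      code      CHAR(4) CHARACTER SET ascii NOT NULL, -- 4 bytes, not 16 (utf8mb4)
      value     FLOAT NOT NULL,                       -- 4 bytes instead of 8 (DOUBLE)
      PRIMARY KEY (sensor_id, ts)
    );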
Sure, it is possible to have 'queries that never come back'. This one never comes back, even with only 100 rows in table a: SELECT * FROM a AS a1 JOIN a AS a2 JOIN a AS a3 JOIN a AS a4 JOIN a AS a5. The result set has 10 billion rows.

MySQL: About BigData

So here's the deal. I've designed a schema which stores daily stock quote data. I have two tables (among others), "todayData" and "historicalData", with the same structure. Both tables use InnoDB as their storage engine. There is no FK between the two tables; they are independent.
If I need to see data for today, I query the today table, and if I need to generate reports or trending analyses etc., I rely on the historical table. At midnight, today's data is moved to the historical table.
The question is that the historical table will be mammoth in a few weeks (> 10 GB and counting), and needless to say, serving this data from a single table is mindless.
What should I do to make sure the reports generated off the historical table will be fast and responsive?
People have suggested partitioning etc., but I would like to know: are there any other ways to do this?
Thank you
Bo
There is no silver bullet for Big Data. Everything depends on the data and data usage (access patterns, etc.). First, make sure the table is properly indexed, your queries are optimal, and you have enough memory. And if you still have too much data to fit on a single server, shard/partition (but mind the access patterns when you choose the shard key - if you have to query multiple partitions for a single report, that's bad. But if you really have to, make sure you can query them in parallel - something not currently possible with the built-in partitioning, so you would need app-level sharding logic).

Max number of rows in MySQL

I'm planning to generate a huge amount of data, which I'd like to store in a MySQL database. My current estimations point to four thousand million (4 billion) rows in the main table (only two columns, one of them indexed).
Two questions here:
1) is this possible?
and more specifically:
2) Will such a table be efficiently usable?
Thanks!
Jaime
Sure, it's possible. Whether or not it's usable will depend on how you use it and how much hardware/memory you have. With a table that large, it would probably make sense to use partitioning as well if that makes sense for the kind of data you are storing.
ETA:
Based on the fact that you only have two columns with one of them being indexed, I'm going to take a wild guess here that this is some kind of key-value store. If that is the case, you might want to look into a specialized key-value store database as well.
It may be possible, MySQL has several table storage engines with differing capabilities. I think the MyISAM storage engine, for instance, has a theoretical data size limit of 256TB, but it's further constrained by the maximum size of a file on your operating system. I doubt it would be usable. I'm almost certain it wouldn't be optimal.
I would definitely look at partitioning this data across multiple tables (probably even multiple DBs on multiple machines) in a way that makes sense for your keys, then federating any search results/totals/etc. you need. Amongst other things, this allows you to do searches where each partition is searched in parallel (in the multiple-servers approach).
I'd also look for a solution that's already done the heavy lifting of partitioning and federating queries. I wonder if Google's AppEngine data store (BigTable) or the Amazon SimpleDB would be useful. They'd both limit what you could do with the data (they are not RDBMS's), but then, the sheer size is going to do that anyway.
You should consider partitioning your data... for example, if one of the two columns is a name, separate the rows into 26 tables based on the first letter.
I created a mysql database with one table that contained well over 2 million rows (imported U.S. census county line data for overlay on a Google map). Another table had slightly under 1 million rows (USGS Tiger location data). This was about 5 years ago.
I didn't really have an issue (once I remembered to create indexes! :) )
4 gigarows is not that big, actually; it is pretty average for any database engine to handle today. Even partitioning could be overkill. It should simply work.
Your performance will depend on your HW though.