MySQL partitioning by customer

We have a product that uses different MySQL schemas for different customers, and a single Java application that uses a different persistence unit for each customer. This makes it difficult to add a customer without redeploying the application.
We are planning to use a single MySQL database schema that holds all the customers, with each table having a key field identifying the customer, so that adding a new customer is a matter of a few SQL updates/inserts.
What is the best approach to handle this kind of data in MySQL? Does MySQL provide any way of partitioning tables by key, or something like that? And what could be the performance issues of that approach?

There are a few questions here:
Schema Design Question
Partitioning question
Can MySQL handle a hash-map-style O(1) query?
Schema Design Question:
Yes, this is much better than launching a new app per customer.
Can MySQL handle a hash-map-style O(1) query?
Yes. If the data stays in memory and the server has enough CPU cycles, MySQL can easily do 300K selects a second. If the workload is I/O bound and the disk subsystem is not saturated, MySQL can still easily do 20-30K selects per second, depending on the traffic pattern, the concurrency, and how many IOPS the database's disk subsystem can deliver.
Partitioning
Partitioning means different things in the context of MySQL. It can refer to MySQL's partitioning layer, which sits on top of a storage engine, allocates rows to individual partitions, and exposes the group of partitions as a single table to the calling application. Partitioning could also mean having certain databases hold a subset of all tables. In your context, I think you are asking what the performance impact would be if you federated by customer, i.e. whether you could, if necessary, allocate a database per customer with the same schema. That concept is more along the lines of sharding: taking the data as a whole and allocating resources per unit of data, e.g. per customer.
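For the single-shared-schema approach you describe, MySQL's built-in partitioning can split each table by the customer key while the application still sees one table. A minimal sketch, with made-up table and column names:

    CREATE TABLE orders (
        order_id    BIGINT NOT NULL,
        customer_id INT    NOT NULL,          -- the per-customer key added to every table
        order_data  VARCHAR(255),
        PRIMARY KEY (order_id, customer_id)   -- the partitioning column must be part of every unique key
    )
    PARTITION BY KEY (customer_id)            -- or PARTITION BY HASH(customer_id)
    PARTITIONS 16;

Queries that filter on customer_id are then pruned down to the single partition holding that customer's rows.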
My suggestion to you
Make the schema the same for every customer. Benchmark all the queries a customer would run, i.e. the query patterns. Verify with EXPLAIN that no query produces a filesort or temporary table, or scans 100K rows at a time, and you should be able to scale without problems. Once you run into issues with a single box or set of boxes getting close to your IOPS ceiling, think about splitting the data.
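For example, a quick check of a typical per-customer query could look like this (query and values are illustrative):

    -- Watch the Extra column: "Using filesort" or "Using temporary" are warning signs,
    -- and the rows estimate should stay well below ~100K per query.
    EXPLAIN
    SELECT order_id, order_data
    FROM   orders
    WHERE  customer_id = 42
    ORDER  BY order_id DESC
    LIMIT  50;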

Related

Can Cassandra be used for creating tables on the fly? If yes, how much time will it take on average?

Our customers are allowed to create custom modules (tables) and properties (columns). Currently we are using an RDBMS (multi-tenant) for handling this use case, and have created a table to store the schema and another table with a predefined set of data types in its columns (10 columns for each type) to store the customers' data.
To improve the performance, I thought about using RDBMS for regular usage and a separate database for storing custom data.
I have settled on Cassandra for its scalability aspects, though I'm worried about creating tables on the fly for each customer and automating table tuning to drive better performance.
It really depends on the number of customers, etc. You can of course create new tables by using the driver for your particular language. But every table has some fixed memory overhead, so it's recommended to keep the total number of tables in the low hundreds, something like 200 tables per cluster on average, and not more than 500 tables. Besides the fixed overhead, you also need to remember that every table has an associated memtable that keeps the data.

When is it time to switch to NoSQL?

I am dealing with a large database that is collecting historical pricing data. The schema is relatively simple and does not change.
Something like:
SKU (char), type(enum), price(double), datetime(datetime)
The issue is that this table now has over 500,000,000 rows, is around 20 GB in size, and growing. It is already getting a bit difficult to run queries. One common query is to get all SKUs for a specific date range, which might return 500,000 records. Add any complexity like GROUP BY and you can forget it.
This DB is mostly writes, but we obviously need to crunch the data and run queries occasionally. I understand that better index planning can help speed up the queries, but I am wondering if this is the type of data that would benefit from a NoSQL solution like MongoDB. Can I expect MySQL (probably moving to MariaDB) to continue to work for us even after it grows beyond 100-200 GB in size, or should I explore alternatives before things get unwieldy?
NoSQL is not a solution to a "large database" problem; NoSQL, specifically document databases, is designed for scenarios where the nature of the data you're storing varies, so you don't want to define rigid schemas and relationships up front.
What you have is simple, well-defined data. This is ideally suited to a relational database, but for something of that scale I would recommend looking at something commercial (e.g. SQL Server or Oracle, depending on your platform). The databases I work with in SQL Server are around four terabytes in size, with several tables in the hundreds of millions of records like yours. A relational database can easily accommodate the simple data you've outlined.
You actually have an ideal use case for SQL, and a rather bad fit for NoSQL. MySQL devs report people using databases of 5,000,000,000 records. Some other SQL servers will be even more scalable than that. However, without proper index support it will be nearly impossible to manage even a fraction of that.
BTW, what is your table schema, including indices?
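For illustration only, since the actual schema isn't shown: assuming a table roughly matching the columns in the question, a composite index that covers the common date-range lookup might look like this (all names are hypothetical):

    CREATE TABLE price_history (
        sku      CHAR(20)          NOT NULL,
        type     ENUM('bid','ask') NOT NULL,
        price    DOUBLE            NOT NULL,
        recorded DATETIME          NOT NULL
    ) ENGINE=InnoDB;

    -- Lets "all SKUs in a date range" run as an index range scan on recorded
    -- instead of a full table scan.
    CREATE INDEX idx_recorded_sku ON price_history (recorded, sku);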
You could switch to MariaDB and then use the Spider engine. The Spider engine makes it possible to split your data across multiple MariaDB instances without losing the ability to run queries against your existing instance.
So you can define your own rules for partitioning and then create one instance per partition. In the end you have multiple instances of MariaDB, but all your records are virtually summed up in one table by the Spider engine.
Your performance gain would come from splitting your data across multiple instances, which reduces the number of records per table and per instance, and of course from using more hardware resources.
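Very roughly, a Spider setup looks like the sketch below (server names, hosts, credentials, and the table definition are all made up for illustration; check the MariaDB Spider documentation for the exact options):

    -- On the "head" MariaDB instance, register the backend data nodes ...
    CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.1', DATABASE 'prices', USER 'spider', PASSWORD 'secret', PORT 3306);
    CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
        OPTIONS (HOST '10.0.0.2', DATABASE 'prices', USER 'spider', PASSWORD 'secret', PORT 3306);

    -- ... then expose one virtual table whose partitions live on those nodes.
    CREATE TABLE price_history (
        sku      CHAR(20) NOT NULL,
        price    DOUBLE   NOT NULL,
        recorded DATETIME NOT NULL
    ) ENGINE=SPIDER COMMENT='wrapper "mysql", table "price_history"'
    PARTITION BY KEY (sku) (
        PARTITION p1 COMMENT = 'srv "backend1"',
        PARTITION p2 COMMENT = 'srv "backend2"'
    );

Queries against price_history on the head node are then pushed down to the backends that own the relevant partitions.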

Database sharding vs partitioning

I have been reading about scalable architectures recently. In that context, two words that keep on showing up with regards to databases are sharding and partitioning. I looked up descriptions but still ended up confused.
Could the experts at stackoverflow help me get the basics right?
What is the difference between sharding and partitioning?
Is it true that 'all sharded databases are essentially partitioned (over different nodes), but all partitioned databases are not necessarily sharded' ?
Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.
Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to know which instance or server to look for the data. An identifier of this kind is often called a "Shard Key".
A common, key-less logic is to use the alphabet to divide the data: A-D goes to instance 1, E-G to instance 2, etc. Customer data is well suited for this, but the instances will end up somewhat uneven in size if the partitioning does not take into account that some letters are more common than others.
Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.
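As a toy illustration of shard-key logic (the hash function and shard count are arbitrary choices here), the application can map a key to an instance with something like:

    -- Map a customer identifier to one of 8 shards; the application then
    -- connects to the instance that owns that shard number.
    SELECT CRC32('customer-42') % 8 AS shard_no;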
A well known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out partitioned on very few servers, using Postgres to divide the data from the get-go. I believe it was several thousand logical shards on those few physical shards. Read their awesome writeup from 2012 here: Instagram Engineering - Sharding & IDs
See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition
I've been diving into this as well, and although I'm by no means the authority on the matter, there are a few key facts that I've gathered and points that I'd like to share:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, or for load balancing.
https://en.wikipedia.org/wiki/Partition_(database)
Sharding is a type of partitioning, such as Horizontal Partitioning (HP)
There is also Vertical Partitioning (VP), whereby you split a table into smaller, distinct parts. Normalization also involves splitting columns across tables, but vertical partitioning goes beyond that and partitions columns even when they are already normalized.
https://en.wikipedia.org/wiki/Shard_(database_architecture)
I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...
"Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.
"Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).
https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
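To make the two concrete (table and column names invented for the example): horizontal partitioning keeps the full column set and splits the rows, while vertical partitioning splits the columns and keeps all rows:

    -- Horizontal partitioning / sharding: same columns, rows divided by a shard key.
    CREATE TABLE customers_east LIKE customers;   -- holds the rows WHERE region = 'east'
    CREATE TABLE customers_west LIKE customers;   -- holds the rows WHERE region = 'west'

    -- Vertical partitioning: columns divided, both tables keep one row per customer.
    CREATE TABLE customer_core    (customer_id INT PRIMARY KEY, name VARCHAR(100), region CHAR(4));
    CREATE TABLE customer_profile (customer_id INT PRIMARY KEY, bio TEXT, avatar BLOB);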
Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
When to Partition a Table
Here are some suggestions for when to partition a table:
Tables greater than 2 GB should always be considered as candidates for partitioning.
Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
When the contents of a table need to be distributed across different types of storage devices.
Partition Pruning
Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
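A MySQL-flavoured sketch of the same idea, with invented table names and partitioned by month rather than week:

    CREATE TABLE orders (
        order_id   BIGINT NOT NULL,
        order_date DATE   NOT NULL,
        amount     DECIMAL(10,2),
        PRIMARY KEY (order_id, order_date)    -- partitioning column must be in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(order_date)) (
        PARTITION p2023_12 VALUES LESS THAN (TO_DAYS('2024-01-01')),
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Only partition p2024_01 is read; the rest are pruned away.
    SELECT SUM(amount)
    FROM   orders
    WHERE  order_date >= '2024-01-10' AND order_date < '2024-01-17';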
Partitioning Strategies
Range
Hash
List
You can read their text and visualize their images which explain everything pretty well.
And lastly, it is important to understand that databases are extremely resource intensive:
CPU
Disk
I/O
Memory
Many DBAs will partition on the same machine, where the partitions share all the resources but improve disk and I/O performance by splitting up the data and/or indexes.
Other strategies employ a "shared nothing" architecture, where the shards reside on separate and distinct computing units (nodes), each having 100% of the CPU, disk, I/O and memory to itself. This provides its own set of advantages and complexities.
https://en.wikipedia.org/wiki/Shared_nothing_architecture
Looks like this answers both your questions:
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Source: Wikipedia - Shard (database architecture).
Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Source: MongoDB.
Consider a table in a database with 1 million rows and 100 columns.
In partitioning, you can divide the table into two or more tables with properties like:
0.4 million rows (table1), 0.6 million rows (table2)
1 million rows & 60 columns (table1) and 1 million rows & 40 columns (table2)
There could be multiple cases like that.
This is partitioning in general.
Sharding refers to the first case only, where we divide the data on the basis of rows. And since we are dividing the table into multiple tables, we need to maintain multiple similar copies of the schema, as we now have multiple tables.
When talking about partitioning, please do not use the terms replicate or replication. Replication is a different concept and out of the scope of this page.
When we talk about partitioning, the better word is divide; when we talk about sharding, the better word is distribute.
In partitioning (normally, and in common understanding, though not always), the rows of a large table are divided into two or more disjoint groups (groups sharing no rows). You can call each group a partition. These groups, i.e. all the partitions, remain under the control of one RDBMS instance, and this is all logical. The basis of each group can be a hash, a range, etc. If you have ten years of data in a table, you can store each year's data in a separate partition, which can be achieved by setting partition boundaries on the basis of a non-null CREATE_DATE column. Once you query the database, if you specify a create date between 01-01-1999 and 31-12-2000, then only two partitions will be hit, and the access will be sequential. I did something similar on a DB with over a billion records, and with partitioning plus indexes and other tuning, query time came down from 30 seconds to 50 milliseconds.
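As a sketch of that setup (table, column and partition names invented), year-based boundaries and a quick pruning check could look like:

    -- Year-based range partitions on the non-null CREATE_DATE column.
    -- Note: the partitioning column must be part of any primary/unique key on the table.
    ALTER TABLE billing_records
        PARTITION BY RANGE (YEAR(CREATE_DATE)) (
            PARTITION p1999 VALUES LESS THAN (2000),
            PARTITION p2000 VALUES LESS THAN (2001),
            PARTITION pmax  VALUES LESS THAN MAXVALUE
        );

    -- EXPLAIN (EXPLAIN PARTITIONS on older MySQL versions) shows which partitions
    -- will actually be read; here only p1999 and p2000 should appear.
    EXPLAIN SELECT * FROM billing_records
    WHERE CREATE_DATE BETWEEN '1999-01-01' AND '2000-12-31';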
Sharding means that you host each partition on a different node/machine. Searching inside the partitions/shards can then happen in parallel.
Sharding is a special case of horizontal partitioning, where the partitions span multiple database instances. If a database is sharded, it means that it is partitioned by definition.
A horizontal partition, when moved to another database instance, becomes a database shard.
The database instance can be on the same machine or on another machine.

Can I have several 'similar' database tables to reduce retrieval time

It is best to explain my question in terms of a concrete example.
Consider an order management application that restaurants use to receive orders from their customers. I have a table called orders which stores all of them.
Now every day the table keeps growing in size, but the amount of data accessed is constant. Generally the restaurants are only interested in orders received in the last day or so. After 100 days, for example, the 'interesting' data is only about 1/100 of the table size; after 1 year it's 1/365, and so on.
Of course, I want to keep all the old orders, but performance for applications that are only interested in current orders keeps degrading. So what is the best way to keep old data from interfering with the data that is 'interesting'?
From my limited database knowledge, one solution that occurred to me was to have two identical tables, order_present and order_past, within the same database. New orders would come into order_present, and a cron job would transfer all processed orders older than two days to order_past, keeping the size of order_present constant.
Is this considered an acceptable solution to this problem? What other solutions exist?
Database servers are pretty good at handling volume, but performance can be limited by the physical hardware. If it is I/O latency that is bothering you, there are several solutions available. You really need to evaluate which one fits your use case best.
For example:
you can partition the table to distribute it onto multiple physical disks (see the sketch after this list)
you can shard to put data onto different physical servers
you can evaluate using another storage engine that best fits your data and application; MyISAM delivers better read performance than InnoDB at the cost of being less ACID compliant
you can use read replicas to delegate all (or most) SELECT queries to replicas (slaves) of the main database server (master)
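For the orders table specifically, a hedged sketch of the partitioning option (all names invented): partition by month on the order timestamp, so that "recent orders" queries touch only the newest partitions and whole months can later be archived or dropped cheaply:

    CREATE TABLE orders (
        order_id      BIGINT   NOT NULL,
        restaurant_id INT      NOT NULL,
        placed_at     DATETIME NOT NULL,
        status        VARCHAR(20),
        PRIMARY KEY (order_id, placed_at)     -- partitioning column must be in every unique key
    )
    PARTITION BY RANGE (TO_DAYS(placed_at)) (
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Ageing out an old month is a cheap metadata operation compared to a big DELETE:
    -- ALTER TABLE orders DROP PARTITION p2024_01;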
Finally, MySQL Performance Blog is a great resource on this topic.

Handling/compressing large datasets in multiple tables

In an application at our company we collect statistical data from our servers (load, disk usage and so on). Since there is a huge amount of data, and we don't need all of it at all times, we have had a "compression" routine that takes the raw data, calculates min, max and average for a number of data points, stores these new values in the same table, and removes the old ones after some weeks.
Now I'm tasked with rewriting this compression routine, and the new routine must keep all uncompressed data we have for one year in one table and the "compressed" data in another table. My main concerns are how to handle the data that is continuously written to the database, and whether or not to use a "transaction table" (my own term, since I can't come up with a better one; I'm not talking about the commit/rollback transaction functionality).
As of now our data collectors insert all information into a table named ovak_result, and the compressed data will end up in ovak_resultcompressed. But are there any specific benefits or drawbacks to creating a table called ovak_resultuncompressed and just using ovak_result as "temporary storage"? ovak_result would be kept minimal, which would be good for the compression routine, but I would need to shuffle all data from one table into another continually, and there would be constant reading, writing and deleting in ovak_result.
Are there any mechanisms in MySQL to handle these kind of things?
(Please note: We are talking about quite large datasets here (about 100 M rows in the uncompressed table and about 1-10 M rows in the compressed table). Also, I can do pretty much what I want with both software and hardware configurations so if you have any hints or ideas involving MySQL configurations or hardware set-up, just bring them on.)
Try reading about the ARCHIVE storage engine.
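As a rough sketch of that idea (the column list is invented, the table names are borrowed from the question): the ARCHIVE engine stores rows compressed and supports only INSERT and SELECT, which fits write-once historical data:

    -- Compressed, insert/select-only storage for old raw data.
    CREATE TABLE ovak_result_archive (
        server_id    INT         NOT NULL,
        metric       VARCHAR(64) NOT NULL,
        value        DOUBLE      NOT NULL,
        collected_at DATETIME    NOT NULL
    ) ENGINE=ARCHIVE;

    -- Periodically move rows older than one year out of the hot table.
    INSERT INTO ovak_result_archive
        SELECT server_id, metric, value, collected_at
        FROM   ovak_result
        WHERE  collected_at < NOW() - INTERVAL 1 YEAR;

    DELETE FROM ovak_result
    WHERE  collected_at < NOW() - INTERVAL 1 YEAR;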
Re your clarification. Okay, I didn't get what you meant from your description. Reading more carefully, I see you did mention min, max, and average.
So what you want is a materialized view that updates aggregate calculations for a large dataset. Some RDBMS brands such as Oracle have this feature, but MySQL doesn't.
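A common hand-rolled substitute (this is not FlexViews; the column names here are hypothetical) is a summary table refreshed periodically, e.g. from cron, with an aggregating upsert:

    CREATE TABLE ovak_resultcompressed (
        server_id INT         NOT NULL,
        metric    VARCHAR(64) NOT NULL,
        bucket    DATE        NOT NULL,   -- one row per server/metric/day
        min_value DOUBLE,
        max_value DOUBLE,
        avg_value DOUBLE,
        PRIMARY KEY (server_id, metric, bucket)
    );

    -- Recompute yesterday's aggregates; safe to rerun because of the upsert.
    INSERT INTO ovak_resultcompressed (server_id, metric, bucket, min_value, max_value, avg_value)
    SELECT server_id, metric, DATE(collected_at), MIN(value), MAX(value), AVG(value)
    FROM   ovak_result
    WHERE  collected_at >= CURDATE() - INTERVAL 1 DAY
      AND  collected_at <  CURDATE()
    GROUP BY server_id, metric, DATE(collected_at)
    ON DUPLICATE KEY UPDATE
        min_value = VALUES(min_value),
        max_value = VALUES(max_value),
        avg_value = VALUES(avg_value);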
One experimental product that tries to solve this is called FlexViews (http://code.google.com/p/flexviews/). This is an open-source companion tool for MySQL. You define a query as a view against your raw dataset, and FlexViews continually monitors the MySQL binary logs, and when it sees relevant changes, it updates just the rows in the view that need to be updated.
It's pretty effective, but it has a few limitations in the types of queries you can use as your view, and it's also implemented in PHP code, so it's not fast enough to keep up if you have really high traffic updating your base table.