Sharding and indexes - MySQL

I want to shard my database, but I'm not an expert in this topic, so here are my considerations. Though the sharding key is an excellent index for directing requests to the right nodes, what about the rest of the indexes that will be defined on my tables? I want requests referencing those indexes to be delivered to the right nodes as well, so that only one node processes the request. As far as I understand, for this purpose some centralized index nodes must exist. So my question is whether this functionality is already present in an RDBMS like MySQL, or whether I should use other, specialized products.

Disclaimer: I work for ScaleBase, I live and breathe sharding every day...
I would point out that if you shard according to columnA, for example, a WHERE with columnA=xx will go to a single shard, but a WHERE with columnB=xx will have to go to all shards, because rows with columnB=xx might exist in any of them. Unless columnA and columnB are related, that is, and then you really need to save that relation in a mapping table.
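As a rough sketch of that mapping-table idea (the table and column names here are purely hypothetical, and in practice the lookup usually lives wherever your routing layer can reach it):

-- hypothetical lookup table: maps a columnB value to the shard that holds it
CREATE TABLE shard_map (
    columnB_value VARCHAR(64) NOT NULL PRIMARY KEY,
    shard_id      SMALLINT UNSIGNED NOT NULL
);

-- the router resolves the shard first, then sends the real query only there
SELECT shard_id FROM shard_map WHERE columnB_value = 'xx';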
I can say that running on all DBs can be super fast; you need to run in parallel and merge the results. At ScaleBase we support merging ORDER BY, GROUP BY, etc. It's not easy...
See more info on my blog: http://database-scalability.blogspot.com

Andrey, what you are describing is exactly how the Clustrix database works, where data and indexes are automatically distributed and queries are distributed across nodes. Clustrix "brings the query to the data" and has a shared-nothing architecture (so no centralized index is needed). MySQL does not have any built-in functionality for distributed computing, and while there are various bolt-on options, they will ultimately encounter scaling limits when centralized resource limits are hit.

Related

Best practices for creating a huge SQL table

I want to store "users" data for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Create one table called "users" that will be 100GB large OR
Create 50 separate tables called "users_{state}", each of which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiples states at the same time (i.e. won't need to frequently join tables if I go with Option 2). Here is a common use case: The front-end passes a state id to the back-end, and based on that id, I need to query data from the db regarding the specified state, and return data back to front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, the split-by-state approach is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
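A minimal sketch of that normalized shape (the table and column names are placeholders, not a prescription):

CREATE TABLE State (
    state_id TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    name     VARCHAR(50) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE User (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

-- only needed if a User can be associated with more than one State
CREATE TABLE UserState (
    user_id    INT UNSIGNED NOT NULL,
    state_id   TINYINT UNSIGNED NOT NULL,
    start_date DATE NOT NULL,
    end_date   DATE NULL,
    PRIMARY KEY (user_id, state_id, start_date),
    FOREIGN KEY (user_id)  REFERENCES User (user_id),
    FOREIGN KEY (state_id) REFERENCES State (state_id)
) ENGINE=InnoDB;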
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
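For reference, a hedged sketch of what the MySQL partitioning mentioned above could look like (the column names are assumptions; MySQL requires the partitioning column to appear in every unique key, which is why it is part of the primary key here):

CREATE TABLE users (
    user_id  INT UNSIGNED NOT NULL,
    state_id TINYINT UNSIGNED NOT NULL,
    name     VARCHAR(100) NOT NULL,
    PRIMARY KEY (user_id, state_id)
) ENGINE=InnoDB
PARTITION BY HASH (state_id)
PARTITIONS 50;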

Creating table for each user?

I'm programming a web application in which each user stores their tasks (like a to-do application). Tasks will be stored in a table (for example: userstasks).
Which one is better?
1. The userstasks table has a column named user_id that identifies who created the task, or
2. A new table (e.g. usernametasks) is created for each registered user and stores all their tasks?
P.S.: There are lots of users!
You always start with the simplest thing that works and stick with it until it's proven to be a performance problem. What you're talking about with #2 is termed "premature optimization". You only go down that road when #1 is having severe performance problems.
When you split data across different users, your ability to query across all users is severely diminished. For all intents and purposes, users will be living in different worlds. Reporting is nearly impossible.
For most applications that have a lot of reads, millions of records is not an issue. It's write-heavy applications that need special attention like this, or those with massive scale, like Reddit or Twitter. Since you're not making one of those, stick with the properly normalized structure first.
"Lots of users" probably means tens or hundreds of thousands. On a properly tuned MySQL instance that's not a big deal. If you need more scale, spin up some read-only secondary servers to spread out the load or look at using MySQL cluster.
I would go with option 1 (a table called tasks with a user_id foreign key) in the short run, assuming that a task can't have more than one user. If it can, then you'll need a join table. Check into setting up an actual foreign key as well; this promotes referential integrity in the data itself.
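A rough sketch of option 1, assuming one user per task and an existing users table keyed by user_id (all names here are illustrative):

CREATE TABLE tasks (
    task_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    title   VARCHAR(255) NOT NULL,
    done    BOOLEAN NOT NULL DEFAULT FALSE,
    KEY idx_tasks_user (user_id),
    -- assumes users(user_id) exists; the FK enforces referential integrity
    CONSTRAINT fk_tasks_user FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;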

How can I optimize my database?

I am creating a platform for some clients. Each client needs to have contacts and manage them in groups, categories (which depend on the group), and subcategories (which depend on the category).
The database is going to be very big, and I'm worried about performance. I want to optimize the database; right now, I have these options:
Manage only one database with multiple tables (as we manage now)
Create a database for each client (each database will have the same multiple tables as the option 1)
Manage multiple XML files (like option 2, each client will have a directory with an XML file for contacts, another for groups, another for categories, and so on)
Which is the best option for performance and management of the data (CRUD: create, read, update, delete)?
Thanks!!
I think one database with multiple tables is the way to go, because duplicating the database and schema for each new client doesn't scale well. XML files sound cool, but so far I haven't seen an XML read/write engine that is as fast as most RDBMSes, so bin that one.
To make this work (lots of tables in one database) you should pay attention to indexing and optimizing the one database; indexes in particular will help you maintain speed as you scale up.
Use clustered indexing on the clientId in whichever table it might exist as a foreign key. This will give you the best client-centric performance, because you would (usually) be pulling a particular client's info in a page fetch.
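In InnoDB the clustered index is the primary key, so "clustering on clientId" in practice means leading the primary key with it. A hypothetical contacts table, for example:

-- rows are physically clustered by client_id first, so one client's
-- contacts sit together on disk and a per-client page fetch stays cheap
CREATE TABLE contacts (
    client_id  INT UNSIGNED NOT NULL,
    contact_id INT UNSIGNED NOT NULL,
    name       VARCHAR(100) NOT NULL,
    PRIMARY KEY (client_id, contact_id)
) ENGINE=InnoDB;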
For #2, I would suggest making that a premium service to your clients. If they want "priority hosting" on a separate server of "their own" then they pay extra. That will make the maintenance headache worthwhile.
Have you tried actually implementing 1 (which is the easiest)?
Did you profile the code?
What is the performance now?
Did you use EXPLAIN to see how the queries are performing?
Do you use indexes (often the correct indexes alone are enough to give excellent performance improvements)?
Optimize when you hit a bottleneck (or when you set certain benchmarks for performance), not during design phase...
UPDATE: You mentioned "millions of entries". That's nothing for MySQL (provided you use correct indexes on your tables). I have a table with about 40 million rows, and although it's not lightning fast, it gives me results in a couple of seconds. So there you go...
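To illustrate the EXPLAIN suggestion above (the table and column names are made up):

-- shows which indexes the optimizer picks and how many rows it expects to examine
EXPLAIN
SELECT c.name
FROM contacts c
JOIN categories cat ON cat.category_id = c.category_id
WHERE c.client_id = 42;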
Option 3 is not advisable. Searching is not something XML files handle efficiently.
Option 2 is a maintenance problem.
Option 1 should be doable. What does "very big" mean? I have a database with a table that currently holds 1.5 billion entries - that is "big", not "very big". What do you define as very big?
As far as ongoing maintenance and support goes I think only option 1 makes sense for you.
Index all columns you need to but nothing more. Look at your code and see how tables are being JOINed and index the columns which will otherwise require a table scan.
Indexes will speed up read operations but slow down your write operations, since you need to update the indexes as well as the columns. They also need more space in the DB.
As suggested above use EXPLAIN to see how your queries are executing and what can be optimized there.
Finally, performance tuning only works well after you baseline your existing performance, make a change, then baseline performance again to see if it helped. If not, roll back and try something else. But always start with a known level of performance, otherwise you might end up making multiple changes which in total slow things down. Good luck!

MySQL table structure - one very large table or separate tables?

I'm working on a project which is similar in nature to website visitor analysis.
It will be used by hundreds of websites, each averaging tens of thousands to hundreds of thousands of page views a day, so the amount of data will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with the same columns, so from a database normalization standpoint it makes sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
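One common shape for that pre-calculation, sketched with assumed names: a small summary table that the write path bumps as hits arrive, so reports never have to scan the raw rows:

CREATE TABLE daily_pageviews (
    website_id INT UNSIGNED NOT NULL,
    day        DATE NOT NULL,
    views      BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (website_id, day)
) ENGINE=InnoDB;

-- bump today's bucket on every hit (or in batches)
INSERT INTO daily_pageviews (website_id, day, views)
VALUES (42, CURRENT_DATE, 1)
ON DUPLICATE KEY UPDATE views = views + 1;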
If the data is uniform, go with one table. If you ever need to SELECT across all websites, having multiple tables is a pain, though with enough scripting you can make multiple tables work.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the hard limit on the number of open files on Windows; on Linux you may have to use ulimit to raise the limit, but there's no way to do that on Windows).
I have broken a huge table into many (hundreds of) tables and used MERGE to SELECT. I did this so that I could perform offline creation and optimization of each of the small tables (e.g. OPTIMIZE or ALTER TABLE ... ORDER BY). However, the performance of SELECT with MERGE caused me to write my own custom storage engine (described here: http://blog.coldlogic.com/categories/coldstore/).
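For reference, the basic MERGE setup looks roughly like this (the underlying tables must be identically defined MyISAM tables; the names are illustrative):

CREATE TABLE visits_site1 (visit_id INT NOT NULL, url VARCHAR(255)) ENGINE=MyISAM;
CREATE TABLE visits_site2 (visit_id INT NOT NULL, url VARCHAR(255)) ENGINE=MyISAM;

-- the MERGE table presents the underlying tables as one for SELECTs
CREATE TABLE visits_all (visit_id INT NOT NULL, url VARCHAR(255))
    ENGINE=MERGE UNION=(visits_site1, visits_site2) INSERT_METHOD=LAST;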
Use a single table structure. Once you start encountering performance problems there are many solutions: you can partition your tables by website id (also known as horizontal partitioning), or you can use replication. It all depends on the ratio of reads vs. writes.
But to start with, keep things simple and use one table with proper indexing. You can also determine whether you need transactions or not. You can also take advantage of the various MySQL storage engines like MyISAM or NDB (in-memory clustering) to boost performance. Caching also plays an important role in offloading load from the database. Data that is mostly read-only and can be computed easily is usually put in the cache, and the cache serves those requests instead of the database, so only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here can answer performance questions for you; you should just run performance tests yourself to understand whether having one big table is sufficient.

MySQL: Many tables or many databases?

For a project we have a bunch of data that always has the same structure and is not linked together.
There are two approaches to save the data:
Creating a new database for every pool (about 15-25 tables)
Creating all the tables in one database and distinguishing the pools by table name.
Which one is easier and faster to handle for MySQL?
EDIT: I am not interested in issues of database design, I am just interested in which of the two possibilities is faster.
EDIT 2: I will try to make it clearer. As said, we will have data where data from different pools rarely belongs together. Putting all the data of one type in one table and linking it with a pool id is not a good idea:
It is hard to back up/delete a specific pool (and we expect that we would run out of primary keys after a while, even when using BIGINT)
So the idea is to make a database for every pool or create a lot of tables in one database. 50% of the queries against the database will be simple inserts. 49% will be some simple selects on a primary key.
The question is, what is faster to handle for MySQL? Many tables or many databases?
There should be no significant performance difference between multiple tables in a single database versus multiple tables in separate databases.
In MySQL, databases (standard SQL uses the term "schema" for this) serve chiefly as a namespace for tables. A database has only a few attributes, e.g. the default character set and collation. And GRANT makes it convenient to control access privileges per database, but that has nothing to do with performance.
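For example, per-database privileges are just the usual wildcard GRANT (the database and account names here are hypothetical):

GRANT SELECT, INSERT, UPDATE, DELETE ON pool17.* TO 'app_user'@'%';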
You can access tables in any database from a single connection (provided they are managed by the same instance of MySQL Server). You just have to qualify the table name:
SELECT * FROM database17.accounts_table;
This is purely a syntactical difference. It should have no effect on performance.
Regarding storage, you can't organize tables into a file-per-database as @Chris speculates. With the MyISAM storage engine, you always have a file per table. With the InnoDB storage engine, you either have a single set of storage files that amalgamate all tables, or else you have a file per table (this is configured for the whole MySQL server, not per database). In either case, there's no performance advantage or disadvantage to creating the tables in a single database versus many databases.
There aren't many MySQL configuration parameters that work per database. Most parameters that affect server performance are server-wide in scope.
Regarding backups, you can specify a subset of tables as arguments to the mysqldump command. It may be more convenient to back up logical sets of tables per database, without having to name all the tables on the command-line. But it should make no difference to performance, only convenience for you as you enter the backup command.
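As a quick illustration (database and table names are placeholders, and connection options are assumed to come from your config): listing tables after the database name limits the dump to just those tables, while naming only the database dumps everything in it.

mysqldump mydb accounts_table orders_table > subset_backup.sql
mysqldump pool17 > pool17_backup.sql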
Why not create a single table to keep track of your pools (with a PoolID and PoolName as your columns, and whatever else you want to track), and then on your 15-25 tables add a column which would be a foreign key back to your pool table, so you know which pool a particular record belongs to?
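A minimal sketch of that layout (the names are illustrative, not from the question):

CREATE TABLE pools (
    pool_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    pool_name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

-- each of the 15-25 data tables carries the pool it belongs to
CREATE TABLE measurements (
    measurement_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    pool_id        INT UNSIGNED NOT NULL,
    value          DOUBLE NOT NULL,
    KEY idx_measurements_pool (pool_id),
    CONSTRAINT fk_measurements_pool FOREIGN KEY (pool_id) REFERENCES pools (pool_id)
) ENGINE=InnoDB;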
If you don't want to mix the data like that, I would suggest making multiple databases. Creating multiple tables all for the same functionality makes my spider sense tingle.
If you don't want one set of tables with poolID poolname as TheTXI suggested, use separate databases rather than multiple tables that all do the same thing.
That way, you restrict the variation between the accessing of different pools to the initial "use database" statement, you won't have to recode your SELECTs each time, or have dynamic sql.
The other advantages of this approach are:
Easy backup/restore
Easy start/stop of a database instance.
Disadvantages are:
a little bit more admin work, but not much.
I don't know what your application is, but really really think carefully before creating all of the tables in one database. That way madness lies.
Edit: If performance is the only thing that concerns you, you need to measure it. Take a representative set of queries and measure their performance.
Edit 2: The difference in performance for a single query between the many-tables and many-databases models will be negligible. If you have one database, you can tune the hell out of it. If you have many databases, you can tune the hell out of all of them.
My (our? - can't speak for anyone else) point is that, for well tuned database(s), there will be practically no difference in performance between the three options (poolid in table, multiple tables, multiple databases), so you can pick the option which is easiest for you, in the short AND long term.
For me, the best option is still one database with poolId, as TheTXI suggested, then multiple databases, depending upon your (mostly administration) needs. If you need to know exactly what the difference in performance is between two options, we can't give you that answer. You need to set it up and test it.
With multiple databases, it becomes easy to throw hardware at it to improve performance.
In the situation you describe, experience has led me to believe that you'll find the separate databases to be faster when you have a large number of pools.
There's a really important general principle to observe here, though: Don't think about how fast it'll be, profile it.
I'm not too sure I completely understand your scenario. Do you want to have all the pools using the same tables, but just differing by a distinguishing key? Or do you want separate pools of tables within the one database, with a suffix on each table to distinguish the pools?
Either way though, you should have multiple databases, for two major reasons. The first is that if you have to change the schema for one pool, it won't affect the others.
The second, if your load goes up (or for any other reason), you may want to move the pools onto separate physical machines with new database servers.
Also, security access to a database server can be more tightly locked down.
All of these things can still be accomplished without requiring separate databases - but the separation will make all of this easier and reduce the complexity of having to mentally track which tables you want to operate on.
Distinguishing the pools by table name or putting them in separate databases is about the same thing. However, if you have lots of tables in one database, MySQL has to load the table information and do a security check on all those tables when logging in/connecting.
As others mentioned, separate databases will allow you to shift things around and create optimizations specific to a certain pool (i.e. compressed tables). It is extra admin overhead, but there is considerably more flexibility.
Additionally, you can always "pool" the tables that are in separate databases by using federated or merge tables to simplify querying if needed.
As for running out of primary keys, you could always use a compound primary key if you are using MyISAM tables. For example, you could have a field called groupCode (any type) and another called sequenceId (auto increment), and create your primary key as (groupCode, sequenceId). The sequenceId will then increment based on the next unique ID within that group code set.
For example:
AAA 1
AAA 2
BBB 1
AAA 3
CCC 1
AAA 4
BBB 2
...
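That behaviour comes from MyISAM's support for AUTO_INCREMENT on a secondary column of a multiple-column key; a minimal sketch (names taken from the example above):

-- sequenceId restarts at 1 for each new groupCode value (MyISAM only)
CREATE TABLE pool_rows (
    groupCode  CHAR(3) NOT NULL,
    sequenceId INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (groupCode, sequenceId)
) ENGINE=MyISAM;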
Although with large tables you have to be careful about caching and make sure the file system you are using handles large files.
I don't know mysql very well, but I think I'll have to give the standard performance answer -- "It depends".
Some thoughts (dealing only with performance/maintenance, not database design):
Creating a new database means a separate file (or files) in the file system. These files could then be put on different filesystems if performance of one needs to be separate from the others, etc.
A new database will probably handle caching differently; e.g. all tables in one DB will mean a shared cache for the DB, whereas splitting the tables into separate databases means each database can have a separate cache [obviously all databases will share the same physical memory for cache, but there may be a limit per database, etc.].
Related to the separate files, this means that if one of your datasets becomes more important than the others, it can easily be pulled off to a new server.
Separating the databases has an added benefit of allowing you to deploy updates one-at-a-time more easily than with the single database.
However, to contrast, having multiple databases means the server will probably be using more memory (since it has multiple caches). I'm sure there are more "cons" for the multi-database approach, but I am drawing a blank now.
So I suppose I would recommend the multi-database approach. Obviously this is only with the understanding that there may very well be a better "database-designy" way of handling whatever you are actually doing.
Given the restrictions you've placed on it, I'd rather spin up more tables in the existing database than have to connect to multiple databases. Managing connection strings TENDS to be harder, in addition to managing the different database optimizations you may have.
FTR, in normal circumstances I'd take the approach described by TheTXI.
In answer to your specific question though, I have found it to be dependent on usage. (Cop out, I know, but hear me out.)
A single database is probably easier. You'll have to worry about just one connection and would still have to specify tables. Multiple databases could, under certain conditions, be faster though.
If I were you I'd try both. There's no way we'll be able to give you a useful answer.