I have an application that stores isolated (and sharded) user data in separate MySQL schemas (aka databases) with identical table structure (3 InnoDB tables), one for each account. We're using Amazon RDS to store the data and there are currently about 30k schemas per RDS instance. My question is about how to optimize the MySQL settings (or in terms of RDS, the Parameter Group settings) for a large number of schemas.
I know there is a reflex to condemn the practice of having thousands of identical schemas, suggesting instead that there be a single schema with something like an indexed "account_id" column. The argument generally given is that the penalty for shard-walking or schema changes outweighs the benefits. In our case, we need to keep the data isolated and there is never any interaction between different accounts.
What are some ways to maximize performance in a situation like this?
Warning in advance: I have no clue about Amazon RDS, so this answer may be absolute nonsense. I am answering from a generic MySQL perspective.
Well, your setup is not entirely optimal, but there are some ways to tune it. You want to avoid opening and closing tables too often, so you want a lot of tables open at the same time.
To do this:
Make sure your OS allows MySQL to have a lot of open files
Set the table_cache properly
You can find some more references in the MySQL manual.
How high you actually want this limit depends on the constraints of your resources. Each open table takes memory - I'm not sure how much. A script like the tuning primer or mysqltuner.pl can help you to prevent overcommitting your memory.
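As a rough sketch of the knobs involved (the values are illustrative, and on RDS you would change them through the DB parameter group rather than directly):

    -- How many file descriptors MySQL may use (bounded by the OS limit):
    SHOW GLOBAL VARIABLES LIKE 'open_files_limit';

    -- How many table handlers may stay open at once; table_cache was
    -- renamed table_open_cache in later MySQL versions:
    SHOW GLOBAL VARIABLES LIKE 'table_open_cache';

    -- If Opened_tables keeps climbing while the server runs, the cache is too small:
    SHOW GLOBAL STATUS LIKE 'Opened_tables';

    -- Raising the cache (a parameter group change on RDS; the value is illustrative):
    SET GLOBAL table_open_cache = 16384;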
Right now I'm trying to choose the most appropriate approach to implement an audit trail for my entities with an AWS RDS MySQL database.
I have to log all entity changes, including the user who initiated them. One of the main criteria is performance.
Hibernate Envers looks like the easiest and most complete solution and can be integrated very quickly. Right now I'm worried about the possible performance slowdown after introducing Envers. I have seen a few posts where developers prefer a trigger-based approach to the audit trail.
The main issue with triggers is how to capture the user who initiated the changes.
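The trigger-based variant I have seen suggested passes the initiating user in through a session variable set by the application; a rough sketch with made-up table and column names:

    -- Sketch only: "orders" and "orders_audit" are placeholder tables.
    -- The application sets a session variable on each connection/request,
    -- e.g. right after taking a connection from the pool:
    --   SET @audit_user = 'alice';

    CREATE TABLE orders_audit (
      audit_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
      order_id   BIGINT NOT NULL,
      old_status VARCHAR(32),
      new_status VARCHAR(32),
      changed_by VARCHAR(64),
      changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );

    DELIMITER //
    CREATE TRIGGER orders_after_update
    AFTER UPDATE ON orders
    FOR EACH ROW
    BEGIN
      -- Fall back to the connected MySQL user if the app did not set the variable:
      INSERT INTO orders_audit (order_id, old_status, new_status, changed_by)
      VALUES (OLD.id, OLD.status, NEW.status,
              COALESCE(@audit_user, USER()));
    END//
    DELIMITER ;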
Based on your experience, could you please suggest an approach for Java/Spring/Hibernate/MySQL (AWS) to implement an audit trail of historical changes?
Also, is there any solution for audit trails within the AWS RDS MySQL infrastructure itself?
Understand that speculation about performance without concrete evidence to support one's theory is analogous to premature optimization of code. It's almost always a waste of time.
From a simple database point of view, as a table grows past a certain size, yes, its performance will degrade, but typically this mainly impacts queries and has less effect on insertion/update, provided the table is properly indexed and the queries are properly formed.
But many databases support partitioning as a means to address performance concerns, particularly on larger tables. This typically involves separating a table's data across a set of boundaries defined by a partition scheme you create. You define which data is most relevant and store that partition on your fastest drives/storage, while the less relevant (typically older) data is stored on your slower drives/storage.
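As a rough illustration of what a range partition scheme can look like (MySQL-flavoured syntax used purely as an example; the table and column names are placeholders):

    -- Illustration only: an audit table partitioned by year of change.
    CREATE TABLE order_aud (
      id         BIGINT      NOT NULL,
      rev        INT         NOT NULL,
      status     VARCHAR(32),
      changed_at DATETIME    NOT NULL,
      PRIMARY KEY (id, rev, changed_at)
    )
    PARTITION BY RANGE (YEAR(changed_at)) (
      -- In engines that support it, recent partitions can be placed on
      -- faster storage and older ones on slower storage:
      PARTITION p2021 VALUES LESS THAN (2022),
      PARTITION p2022 VALUES LESS THAN (2023),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );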
You can also elect to store the audit tables in a different schema/tablespace by specifying the Envers property org.hibernate.envers.default_schema. If your database supports putting schemas in different database files on the file system, you can help increase performance by ensuring that reads/writes on your entity tables do not impact reads/writes on your audit tables.
I can't speak to MySQL's support for any of these things, but I do know that MSSQL and Oracle support partitioning very easily, and Oracle certainly allows the separation of schemas across different database files.
In our software, we share information across installations.
Currently we use staging tables within the same database to facilitate this. We use stored procedures to pull certain data from the live tables into the staging tables, and then dump them. This dump then gets loaded into the staging tables of the target database, and stored procedures merge in the data.
This works fine, but for a few reasons*, I'm considering moving this from staging tables to a separate staging database. I'm just concerned about whether or not this will have any performance implications.
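Schematically, the pull into staging looks something like the sketch below (table, schema, and variable names are made up); the separate-database variant only changes the qualified names:

    -- Staging table in the same database:
    INSERT INTO staging_orders (id, customer_id, total, updated_at)
    SELECT id, customer_id, total, updated_at
    FROM   orders
    WHERE  updated_at >= @last_sync;

    -- Staging table in a separate database on the same server:
    INSERT INTO staging_db.staging_orders (id, customer_id, total, updated_at)
    SELECT id, customer_id, total, updated_at
    FROM   live_db.orders
    WHERE  updated_at >= @last_sync;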
Having very quickly tried this (just as a thought exercise) on a couple of differently configured systems, I've come up with differing results. One (with not much data, and running MySQL 5.6) showed no real difference, possibly even slightly faster. The other (with much more data, running MySQL 5.5) showed it to be about 1.5 times slower.
I'm well aware that there will likely be configuration options that may affect this; I'm no DBA, so any pointers would be much appreciated.
TL;DR
What performance implications might there be when inserting data into tables in a different database (on the same server), compared to inserting within the same database? Does it depend on the MySQL version or on configuration settings?
* If you're interested in 'reasons', I can let you know in the comments
Let me start by saying that you cannot conduct tests the way you did.
Databases, and MySQL among them, rely on hardware. There are a number of MySQL variables available for tuning which can turn it from a snail into a Formula 1 car.
You tested on separate systems, each of which either runs on different hardware or contains different data, so the results aren't comparable. Technically, what MySQL does by default is use the InnoDB storage engine, and InnoDB, unless innodb_file_per_table is enabled, stores all the data in a single shared tablespace file. So from a "down to the core" perspective, whether you use another table or another database, MySQL won't really care, because it stores it in one and the same file. From there we get to questions of whether the file is fragmented (on a mechanical disk) and many other interesting details that can't be covered in a single answer.
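If you want to see how your own server lays its files out, a quick check looks like this (on RDS you can't inspect the file system directly, but the variables are still visible):

    -- Whether each InnoDB table gets its own .ibd file or everything
    -- goes into the shared system tablespace (ibdata1):
    SHOW GLOBAL VARIABLES LIKE 'innodb_file_per_table';

    -- Where the data directory and the system tablespace live:
    SHOW GLOBAL VARIABLES LIKE 'datadir';
    SHOW GLOBAL VARIABLES LIKE 'innodb_data_file_path';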
That brings us to next issue - databases exist to store structured data.
Databases are not glorified text files. They exist so we can ask them to cross-examine data and give us meaningful results to questions.
That means we design databases and tables in ways that correspond to certain logic. If it makes sense to store those staging tables in database A, then store them there. If it makes sense to do some other thing, do the other thing.
From a performance point of view, it hugely depends on how your servers are configured, from the OS to the MySQL variables to the hardware (especially the HDDs) you have. Without knowing what's going on there down to the last detail, no one can tell you "yes, it is faster", "no, it is slower" or "it is the same". If they do, they're lying. We basically lack all the information needed to tell you with certainty which approach is better.
I am helping a customer migrate a PHP/MySQL application to AWS.
One issue we have encountered is that they have architected this app to use a huge number of databases. They create a new DB (with identical schema) for each user. They expect to have tens of thousands of users.
I don't know MySQL very well, but this setup does not seem at all good to me. My only guess is that the developers did this so they could avoid having tables with huge amounts of data. However I can only think of drawbacks (maintaining this system will be a nightmare, very difficult to extend, difficult to scale, etc..).
Anyhow, is this type of pattern commonly used within the MySQL community? What are the benefits, if any?
I am trying to convince them that they should re-architect the DB schema.
* [EDIT] *
In the meantime we know another drawback of this approach. We had originally intended to use Amazon RDS for data storage. However, RDS currently supports up to 30 databases per instance, so unfortunately RDS is now ruled out. The fact that RDS has this limit in place is already very telling; my interpretation is that having such a huge number of databases is not a common practice with MySQL.
Thanks!
This is one of the most horrible ideas I've ever read (and I've read many). For one, large numbers of databases do not scale as well as tables within a database, and for another, it would be impossible to connect users to each other, or at least to share common attributes and options. It essentially defeats the purpose of the database itself.
My advice here is rather outside the original scope: your intuition knows more than you think; listen to it more!
This idea seems quite strange to me as well! Databases are designed to handle large data sets, after all. If there is genuine concern about the volume of data, it is usually better practice to separate tables onto different databases, hosted on different physical servers, as this allows you to spread the database-level processes across hardware to boost performance.
Also I don't know how they plan to host this application but many hosting providers are going to charge you per database instance!
Another problem this will give you is that it will make reporting more difficult - I wouldn't like to try including tables from 10,000 databases in a query!!
Given a large MySQL production database, optimized for fast inserts, how would I go about setting up a "slave" database that would be optimized for fast searches? In my head, the slave would basically be a replica of the main, but with significantly more indices across the whole database to speed up read access. Is this sort of customized master-to-slave replication possible?
Setting up master/slave replication?
This How to Set Up Replication in MySQL guide has both what you need to do on the master and the slave.
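In outline, the slave side of that guide boils down to something like this (host, credentials, and binlog coordinates are placeholders):

    -- Run on the slave after restoring a consistent snapshot of the master:
    CHANGE MASTER TO
      MASTER_HOST     = 'master.example.com',
      MASTER_USER     = 'repl',
      MASTER_PASSWORD = 'secret',
      MASTER_LOG_FILE = 'mysql-bin.000001',
      MASTER_LOG_POS  = 4;

    START SLAVE;

    -- Check that both the IO and SQL replication threads are running:
    SHOW SLAVE STATUS\G

    -- Extra indexes for the read workload can then be added on the slave only,
    -- for example (table/column names are placeholders):
    -- ALTER TABLE orders ADD INDEX idx_customer (customer_id);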
Found here on this MySQL Replication question:
I'd argue that you may want to consider an alternative database for your reporting side. You mentioned the need for heavy indexes -- why? Since I assume you're doing bulk loads of data and hardly modifying the data at all (ie: transactions), then you should consider a columnar database. If you're used to mysql syntax, then there are two column-oriented databases which use mysql as a front end. Thus, your queries will stay the same, but how the data is stored underneath will be completely different.
It really throws your question out and adds a new one in: what's the best way to move my data to allow it to be optimized for many reads and few writes.
Full Disclosure: I work as the open-source guy for Infobright (one of the column oriented techs). Your question is one I see a lot of, so I thought I'd give you my thoughts. However, there are several different columnar vendors; it all depends on volume, money, and time.
I am working with large datasets (tens of millions of records, at times hundreds of millions), and want to use a database program that links well with R. I am trying to decide between mysql and sqlite. The data is static, but there are a lot of queries that I need to do.
In this link to sqlite help, it states that:
"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (241 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests. I'm wondering if mysql is a better choice for me than sqlite due to the size of my dataset. The description above seems to suggest that this might be the case, but my data is nowhere near 2TB.
I'd appreciate any insights into understanding this constraint of maximum file size from the filesystem and how this could affect speed for indexing tables and running queries. This could really help me in my decision of which database to use for my analysis.
The SQLite database engine stores the entire database in a single file. This may not be very efficient for incredibly large files (SQLite's limit is 2TB, as you've found in the help). In addition, SQLite is limited to one writer at a time. If your application is web-based or might end up being multi-threaded (like an AsyncTask on Android), MySQL is probably the way to go.
Personally, since you've done tests and mysql is faster, I'd just go with mysql. It will be more scalable going into the future and will allow you to do more.
"I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests."
The short short version is:
If your app needs to fit on a phone or some other embedded system, use SQLite. That's what it was designed for.
If your app might ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.
It seems that (in R, at least) SQLite is awesome for ad hoc analysis. With the RSQLite or sqldf packages it is really easy to load data and get started. But for data that you'll use over and over again, it seems to me that MySQL (or SQL Server) is the way to go, because it offers a lot more features for modifying your database (e.g., adding or changing keys).
MySQL, if you are mainly using this as a web service.
SQLite, if you want it to be able to function offline.
SQLite is generally much, much faster, as the majority (or all) of the data/indexes will be cached in memory. In my experience so far, if the data is split up across multiple tables, or even multiple SQLite database files, then even for millions of records (I have yet to reach hundreds of millions) SQLite is far more effective than MySQL, even after accounting for network latency and so on. However, that only holds when the records are split across different tables and the queries are specific to those tables (you don't query all the tables).
An example would be an item database used in a simple game. While this may not sound like much, a UID was issued even for variations, so the generator quickly worked its way up to more than a million sets of 'stats' with variations. This worked mainly because each 1,000 sets of records were split among different tables (as we mainly pull records via their UID). Though the performance of the splitting was never properly measured, we were getting queries that were easily 10 times faster than MySQL (mainly due to network latency).
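To give a concrete flavour of that split, the lookup pattern was roughly the following (names are made up; the split is by UID range):

    -- Simplified sketch: each shard table holds 1,000 UIDs. If the shards
    -- live in separate SQLite files, attach them first, e.g.:
    --   ATTACH DATABASE 'items_12.db' AS shard12;

    -- UID 12345 falls in the 12000-12999 range, so only items_12 is read:
    SELECT stats
    FROM   items_12
    WHERE  uid = 12345;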
Amusingly though, we ended up reducing the database to a few thousand entries, with item [prefix] / [suffix] determining the variations (like Diablo, only hidden), which proved to be much faster at the end of the day.
On a side note, my case was mainly one where queries were lined up one after another (each waiting for the one before it). If, however, you are able to issue multiple connections/queries to the server at the same time, the performance drop with MySQL is more than compensated for on the client side, assuming the queries do not branch or interact with one another (e.g. if you got a result, query this, else that).