Copying database with or without indexing on? - mysql

This should really be a community wiki page, but I have to ask this question and see what I might be missing. I'm a moderator on a site and they are going through a new site transition.
They started the data migration yesterday around lunch. It's still going on, and they say it's going to take 30 more hours. It's a rather large site (700 million records going from SQL Server to MySQL), but I can't fathom why it's taking so long.
I just found out that they're indexing on the fly. Are there benefits to this? Would it not be quicker and probably safer to copy and then index? If anyone has links, I'll most likely choose that as the answer. Thanks.

The typical procedure I know is to copy all the tables with constraints disabled and no indexes, recreate the indexes from scratch afterwards, and then re-enable the constraints. Rebuilding an index from scratch is much cheaper than maintaining it on the fly while the data is being loaded.
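A rough sketch of that order of operations on the MySQL side (table and file names here are made up; note that DISABLE KEYS only defers non-unique MyISAM indexes, so for InnoDB targets you would drop the secondary indexes before the load and re-create them afterwards):

    SET foreign_key_checks = 0;
    ALTER TABLE customers DISABLE KEYS;                          -- skip index maintenance during the load
    LOAD DATA INFILE '/tmp/customers.csv' INTO TABLE customers;  -- bulk-load the migrated rows
    ALTER TABLE customers ENABLE KEYS;                           -- rebuild the non-unique indexes in one pass
    SET foreign_key_checks = 1;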
A minute of googling brought this up for you, straight from the horse's mouth :) :
http://www.mysql.com/why-mysql/white-papers/mysql_microsoftsql2mysql_paper.pdf
see e.g. page 5:
Also you'll want to take the permissions and index statements from the end of each of these files [the generated MySQL DDL], and put them in new files. If these statements are left when migrating, migrating the data will be significantly slower.
I didn't find a benchmark, but you could produce a very representative one yourself: Just migrate, say 1 million of your own records, using both strategies. The results should speak for themselves.
Here is a related question.

Related

How to clean up tables index_5* in tikiwiki

My mysql instance has 1700+ tables named "index_*". At 15MB each, this adds up to 25+ gigs.
How can I clean these up? Is it as easy as dropping these tables? or is there some configuration in tikiwiki that cleans up the database with regards to index tables?
Wow, 1700+ of them, never seen that before! Which version are you running? You probably need to upgrade as that sounds like a bug.
However, having said that, the good news is that you can safely (but carefully) delete (drop) them and then rebuild the search index from the search admin panel or from the command line using console.php, and Tiki will make a new one (or two).
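If you do go the manual route, one way to generate the DROP statements in bulk is via information_schema (a sketch; 'tikidb' is a placeholder for your actual Tiki database name, and you should review the generated list before running any of it):

    SELECT CONCAT('DROP TABLE `', table_name, '`;') AS drop_stmt
    FROM information_schema.tables
    WHERE table_schema = 'tikidb'
      AND table_name LIKE 'index\_%';   -- escaped underscore matches the literal index_* naming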
I guess that 25 GB is too much to back up, but I'd suggest taking a backup of all the other tables if you can, just in case.
The index_* tables are the unified search MySQL engine's storage, and it's usual to have a couple of them, maybe half a dozen or more, so something sounds like it's going badly wrong. Maybe you have a cron job running a regular rebuild? (But that would have to be every hour or something; usually once a day is plenty.)
Good luck!

Solutions for transition from small scale to mid-scale MySQL database

I'm studying up on the future of the database I maintain. Right now we have one database server running MySQL using InnoDB and MyISAM tables. I'm watching the metrics closely and I can see that this will not be sustainable forever. Where does one go next? I have reviewed solutions like Cassandra, but I want to stick to an SQL approach, so I'm not sure about that. I have also reviewed NDB Cluster and federated database solutions, but I've noticed no one has anything good to say about those. Basically, I'm looking for advice on intermediate solutions. We do not yet need a vast multi-node array operating on tens of DB servers, but one server is about to reach its limit. I don't want to just throw another server on the pile without making sure that the DB architecture at hand benefits well from the extra power. What do you guys suggest for when it is time to move beyond a single server, and how should I manage this transition? Thank you to anyone who can help.
Edit to better explain: At present, we have about a hundred tables. We run many join operations to gather the data the end user needs to see, such that most of our queries join at least two tables to complete any operation. The data set is not too big yet, only a few hundred megs, but the data is accessed in such a way that each table has a few writes every day, the heaviest of which has about a thousand writes a day. We probably have a few hundred thousand reads a day too, so reads do outnumber writes about 9 to 1.
First Solutions:
Indices go a LONG way
Use profiling software to find your slow queries and optimize them (a rough sketch of that loop appears after these lists)
Depending on your hosting company, you can usually upgrade the RAM/CPU of the server
Second Solutions:
Split your reads and your writes into two databases. (I don't know if you're using PHP or not but PHP has a plugin that will automatically split them for you without having to change any of your code http://php.net/manual/en/mysqlnd-ms.rwsplit.php)
Use software like memcache to store database information that is frequently queried but not frequently updated
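A rough sketch of the "find slow queries, then index" loop mentioned above (table, column and index names are hypothetical):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;      -- log anything slower than one second

    EXPLAIN SELECT o.id, c.name
    FROM orders o JOIN customers c ON c.id = o.customer_id
    WHERE o.created_at > '2013-01-01';   -- check whether the WHERE/JOIN columns hit an index

    ALTER TABLE orders ADD INDEX idx_orders_created_at (created_at);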

How easy (or otherwise) is it to tune a database AFTER 'going LIVE'?

It is looking increasingly like I'll have to go live with a website before I have had the time to tweak all the queries/tables etc. (already 6 months behind schedule, so although this is not the ideal scenario, that's how things are).
It's now a case of having to bite the bullet; it's just a case of trying to work out how big that bullet will be when we come to 'biting it'. Once the database goes live, obviously we can't change the data on a whim, because it's live data. I am fairly confident about most of the DB schema - e.g. the tables are mostly in 3rd and 4th normal form, and constraints are used to ensure data integrity. I have also put indexes on some columns that (I think) will be used a lot in queries, though this was done quite hurriedly and not tested - this is the bit I am worried about.
To clarify, I am not talking about wholesale structure change. The tables themselves are unlikely to change (if ever), however it is almost guaranteed that I will have to tune the tables at some stage (either personally or by hiring someone).
I want to know how much of a task this is. Specifically, assuming a database of a few gigabytes (so far roughly 300 tables)
Assuming 50% of the tables need tuning in the next few months:
How long will it take to perform the tuning (I know this is a "how long is a piece of string" type question) - but what are the main determinants of the effort required, so I can work out how long it is likely to take?
Is it possible to either lock sections of the database (or specific tables) whilst the indexes are being reworked, or does the entire database need to go offline? (I am using MySQL 5.x as the DB)
Is what I describe (going live before ALL tables are perfectly tuned) outrageously risky/inadvisable? (Does it justify the months of sleepless nights this has caused me so far?)
In general it is much harder to fix a poor database design that is causing performance issues after going live, because you have to deal with the existing records. Even worse, the poor design may not become apparent until months after going live, when there are many records instead of a few. This is why databases should be designed with performance in mind (no, this is not premature optimization; there are known techniques which generally perform better than other techniques and they should be considered in the design) and databases should be tested against a test set of records that is close to or larger than the expected number of records you would have after a couple of years.
As to how long it will take to completely fix a badly designed database: months or years. Often the worst part is something that is central to the design (like, say, an EAV table) and which will require almost every query/SP/view/UDF to be adjusted to move to a better structure. You then have to ensure all records are moved to the new, better structure. The sooner you can fix a mistake like this the better. Far better to move a couple of thousand records to a new structure than 100,000,000.
If your structure is OK but your queries are bad, you are better off, as you can take the top ten worst performers (choose based not just on the time for a single run but on time per run × number of times run), fix them, then rinse and repeat.
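A sketch of that "top ten" ranking, assuming MySQL 5.6+ with performance_schema enabled (on older versions the slow query log plus mysqldumpslow gives a rougher version of the same picture):

    SELECT digest_text,
           count_star            AS times_run,
           sum_timer_wait / 1e12 AS total_seconds   -- timer waits are in picoseconds
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY sum_timer_wait DESC                    -- total time = time per run x number of runs
    LIMIT 10;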
If you are in the midst of fixing a poor database, this book might come in handy:
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1268158669&sr=8-1
I would try at least to quantify the limits of the database before going live, so that at least you would know when the activity generated from your application is getting near to that threshold.
You may want to simulate (automatically as much as possible) the typical usage of the database from your application, and check how many concurrent users/sessions/transactions, etc it can handle before it breaks. This, at least, should allow you to solve the "sleepless nights" issue.
As for the original "How easy is it...?" question, the answer obviously depends on many factors. However, the above analysis would undoubtedly help, as at the very least you will be in a position to say whether your database requires tweaking or not.
To answer the title question, I'd say it's fairly easy to tune your DB after deploying into Production.
It's a great idea to be improving performance after deploying to any environment. Being Production adds a bit of pressure, along with the schedule. I'd suggest deploying to Prod, and let it perform as it will. Then start measuring:
how long it takes to run Report X at different times (peak vs. after-hours, if there is such a concept in your app).
what's the user's experience when using the app for those critical use-cases?
Then take a backup of your Prod environment and create yourself a pre-Prod environment. There you'll be able to run your upgrade scripts and measure the 'how long' type questions you have: index creation, upgrade down-times, etc. When tuning queries, you'll have a good idea of how they perform with production data and volumes. Granted, you won't have the benefit of those users performing their inserts at the same time.
Keep that backup for multiple iterations, failed upgrades, new/unprepared-for issues, etc.
Keep making backups after each deployment, so that you can test the next round of improvements to your DB.
It depends on what you're tuning. Let's say you're adding an index to a couple of tables, or changing a table type from MyISAM to InnoDB or something; even on a reasonably large table, those things could be done in 5 to 10 minutes depending on your hardware. It won't take hours. That said, it's still best to do any live-DB tuning in the middle of the night.
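For concreteness, the two kinds of change mentioned above look like this (table, column and index names are made up; in MySQL 5.x both statements rebuild the table and block writes while they run):

    ALTER TABLE orders ADD INDEX idx_orders_customer_id (customer_id);
    ALTER TABLE orders ENGINE = InnoDB;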
You can grab a read lock by calling FLUSH TABLES WITH READ LOCK, but it's probably better to put up a "we're doing maintenance" message in your app for the 15-30 minutes you're doing it, just to be safe.
The risk is inherent to the situation and what happens if there are serious problems. I usually take a more cowboy approach and take stuff live, especially if they aren't under high load so I can easily find pain points and fix them. If this is a mission critical system, then no, load test or whatever you can first to be sure you're as ready as you can be. Also, keep in mind that you cannot foresee all the issues you'll have. If your indexes are good, then you're probably okay to take it live and see what needs to be worked on.

What's the fastest way to import a large mysql database backup?

What's the fastest way to export/import a mysql database using innodb tables?
I have a production database which I periodically need to download to my development machine to debug customer issues. The way we currently do this is to download our regular database backups, which are generated using "mysqldump -B dbname" and then gzipped. We then import them using "gunzip -c backup.gz | mysql -u root".
From what I can tell from reading "mysqldump --help", mysqldump runs with --opt by default, which looks like it turns on a bunch of the things that I can think of that would make imports faster, such as turning off indexes and importing tables as one massive insert statement.
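For reference, the statements that --opt (via --disable-keys and --extended-insert) wraps around each table in the dump look roughly like this (simplified, with a made-up table; note that InnoDB ignores DISABLE KEYS, which is part of why large InnoDB imports still grind through secondary-index maintenance):

    /*!40000 ALTER TABLE `orders` DISABLE KEYS */;
    INSERT INTO `orders` VALUES (1,'...'),(2,'...'),(3,'...');   -- one big multi-row INSERT per batch
    /*!40000 ALTER TABLE `orders` ENABLE KEYS */;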
Are there better ways to do this, or further optimizations we should be doing?
Note: I mostly want to optimize the time it takes to load the database onto my development machine (a relatively recent macbook pro, with lots of ram). Backup time and network transfer time currently aren't big issues.
Update:
To answer some questions posed in the answers:
The production database schema changes up to a couple times a week. We're running rails, so it's relatively easy to run the migrate scripts on stale production data.
We need to put production data into a development environment potentially on a daily or hourly basis. This entirely depends on what a developer is working on. We often have specific customer issues that are the result of some data spread across a number of tables in the db, which needs to be debugged in a development environment.
I honestly don't know how long mysqldump takes. Less than 2 hours, since we currently run it every 2 hours. However, that's not what we're trying to optimize, we want to optimize the import onto the developer workstation.
We don't need the full production database, but it's not totally trivial to separate what we do and don't need (there are a lot of tables with foreign key relationships). This is probably where we'll have to go eventually, but we'd like to avoid it for a bit longer if we can.
It depends on how you define "fastest".
As Joel says, developer time is expensive. Mysqldump works and handles a lot of cases you'd otherwise have to handle yourself or spend time evaluating other products to see if they handle them.
The pertinent questions are:
How often does your production database schema change?
Note: I'm referring to adding, removing or renaming tables, columns, views and the like, i.e. things that will break actual code.
How often do you need to put production data into a development environment?
In my experience, not very often at all. I've generally found that once a month is more than sufficient.
How long does mysqldump take?
If it's less than 8 hours it can be done overnight as a cron job. Problem solved.
Do you need all the data?
Another way to optimize this is to simply get a relevant subset of data. Of course this requires a custom script to be written to get a subset of entities and all relevant related entities but will yield the quickest end result. The script will also need to be maintained through schema changes so this is a time-consuming approach that should be used as an absolute last resort. Production samples should be large enough to include a sufficiently broad sample of data and identify any potential performance problems.
Conclusion
Basically, just use mysqldump until you absolutely can't. Spending time on another solution is time not spent developing.
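If the plain mysqldump import still feels slow on the workstation, a few session-level settings on the importing side are usually worth trying before reaching for another tool (a sketch to run in the mysql client; backup.sql is a placeholder, and sql_log_bin only matters if binary logging happens to be enabled on the dev instance):

    SET foreign_key_checks = 0;   -- skip FK validation while the rows pour in
    SET unique_checks = 0;        -- skip uniqueness checks on secondary unique indexes
    SET sql_log_bin = 0;          -- don't binlog the import (needs SUPER; omit if not applicable)
    source backup.sql
    SET unique_checks = 1;
    SET foreign_key_checks = 1;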
Consider using replication. That would allow you to update your copy in real time, and MySQL replication allows for catching up even if you have to shut down the slave. You could also use a parallel MySQL instance on your normal server that replicates the data to MyISAM tables, which support online backup. MySQL allows for this as long as the tables have the same definition.
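A minimal sketch of pointing a dev-side slave at the production master (host, credentials and binlog coordinates are all placeholders; you would take the coordinates from SHOW MASTER STATUS and an initial data snapshot):

    CHANGE MASTER TO
      MASTER_HOST = 'prod-db.example.com',
      MASTER_USER = 'repl',
      MASTER_PASSWORD = 'secret',
      MASTER_LOG_FILE = 'mysql-bin.000123',
      MASTER_LOG_POS = 4;
    START SLAVE;
    SHOW SLAVE STATUS\G   -- watch Seconds_Behind_Master catch up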
Another option that might be worth looking into is XtraBackup from renowned MySQL performance specialists Percona. It's an online backup solution for InnoDB. I haven't looked at it myself, though, so I won't vouch for its stability or that it's even a workable solution for your problem.

Do any databases support automatic Index Creation?

Why don't databases automatically index tables based on query frequency? Do any tools exist to analyze a database and the queries it is receiving, and automatically create, or at least suggest which indexes to create?
I'm specifically interested in MySQL, but I'd be curious for other databases as well.
That is the best question I have seen on Stack Overflow. Unfortunately I don't have an answer. Google's BigTable does automatically index the right columns, but BigTable doesn't allow arbitrary joins, so the problem space is much smaller.
The only answer I can give is this:
One day someone asked, "Why can't the computer just analyze my code and compile & statically type the pieces of code that run most often?"
People are solving this problem today (e.g. Tamarin in FF3.1), and I think "auto-indexing" relational databases is the same class of problem, but it isn't as much a priority. A decade from now, manually adding indexes to a database will be considered a waste of time. For now, we are stuck with monitoring slow queries and running optimizers.
There are database optimizers that can be enabled or attached to databases to suggest (and in some cases perform) indexes that might help things out.
However, it's not actually a trivial problem, and when these aids first came out users sometimes found it actually slowed their databases down due to inferior optimizations.
Lastly, there's a LOT of money in the industry for database architects, and they prefer the status quo.
Still, databases are becoming more intelligent. If you use SQL Server Profiler with Microsoft SQL Server you'll find ways to speed your server up. Other databases have similar profilers, and there are third-party utilities to do this work.
But if you're the one writing the queries, hopefully you know enough about what you're doing to index the right fields. If not then having the right indexes is likely the least of your problems...
-Adam
MS SQL 2005 also maintains an internal reference of suggested indexes to create based on usage data. It's not as complete or accurate as the Tuning Advisor, but it is automatic. Research dm_db_missing_index_groups for more information.
There is a script on, I think, an MS SQL blog for suggesting indexes in SQL 2005, but I can't find the exact script right now! It's just the thing, going by the description, as I recall. Here's a link to some more info: http://blogs.msdn.com/bartd/archive/2007/07/19/are-you-using-sql-s-missing-index-dmvs.aspx
PS: just for SQL Server 2005+
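A minimal version of that kind of missing-index query against the DMVs looks roughly like this (the ORDER BY is just one common ranking heuristic, not an official formula):

    SELECT d.statement AS table_name,
           d.equality_columns, d.inequality_columns, d.included_columns,
           s.user_seeks, s.avg_user_impact
    FROM sys.dm_db_missing_index_details d
    JOIN sys.dm_db_missing_index_groups g
      ON g.index_handle = d.index_handle
    JOIN sys.dm_db_missing_index_group_stats s
      ON s.group_handle = g.index_group_handle
    ORDER BY s.user_seeks * s.avg_user_impact DESC;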
There are tools out there for this.
For MS SQL, use the SQL Profiler (to record activity against the database), and the Database Engine Tuning Advisor (SQL 2005) or the Index Tuning Wizard (SQL 2000) to analyze the activities and recommend indexes or other improvements.
Yes, some engines DO support automatic indexing. One such example for MySQL is Infobright: their engine does not support "conventional" indexes and instead implicitly indexes everything - it is a column-based storage engine.
The behaviour of such engines tends to be very different from what developers expect (and yes, you need to be a DEVELOPER to even be thinking about using Infobright; it is not a plug-in replacement for a standard engine).
I agree with what Adam Davis says in his comment. I'll add that if such a mechanism existed to create indexes automatically, the most common reaction to this feature would be, "That's nice... How do I turn it off?"
Part of the reason may be that indexes don't just give a small speedup. If you don't have a suitable index on a large table, queries can run so slowly that the application is entirely unusable, and, if it is interacting with other software, possibly it simply won't work. So you really need the indexes to be right before you start trying to use the application.
Also, rather than building an index in the background, and slowing things down further while it's being built, it is better to have the index defined before you start adding significant amounts of data.
I'm sure we'll get more tools that take sample queries and work out what indexes are necessary; also probably we will eventually get databases that do as you suggest and monitor performance and add indexes they think are necessary, but I don't think they will be a replacement for starting off with the right indexes.
It seems that MySQL doesn't have a user-friendly profiler. Maybe you want to try something like this, a PHP class based on the MySQL profiler.
Amazon's SimpleDB has automatic indexing on all columns based on your usage:
http://aws.amazon.com/simpledb/
It has other limitations though:
It's a key-value store, not an RDB. Obviously that means slow joins (and no built-in join support).
It has a 10gb limit on table size. There are libraries that will handle partitioning big data for you although this locks you into that library's way of doing things, which can have its own problems.
It stores all values as strings, even numbers, which makes sorting a column with 1, 9, and 10 come out like 1, 10, 9 unless you use a library that hacks around this by zero-padding. This also impacts negative numbers.
The 10gb limit is bigger than many might assume, so you could proceed with this for a simple site that you plan on rewriting if it ever hits big.
It's unfortunate this kind of automatic indexing didn't make it into DynamoDB, which appears to have replaced it - they don't even mention SimpleDB in their product list anymore; you have to find it through old links to it.
Google App Engine does that (see the index.yaml file).