The problem is this:
I have a dump of a staging database that I am using in development. The database is around 2 GB, which makes many ActiveRecord commands (mostly 'where' queries) run for at least 5 minutes.
What could be done to speed this up in development?
Some options would be to create a partial copy of the database for development (I haven't investigated how), caching (which for some reason didn't work), or something else I haven't thought of. I would even consider hardcoding some of the ActiveRecord calls, just to achieve this in development mode.
There are a few ways to achieve this based on the info you've provided.
As mentioned in the comments, you could create a seed file and build a few records to be used in development. This is common practice for most development databases (especially with more than one developer). See the Rails guides about this.
Another idea would be to write a rake task that isolates a few relevant rows within the most dependent table in your staging database (say users) and build dummy data from that record. This might help you build "real-ish" data without having to do it all from scratch. If there's a large tangle of associations, this might be more work than it's worth.
The seed_dump gem could come in handy for that purpose.
A word of caution: if that staging DB has any PII (personally identifiable information), you will likely want to obfuscate it so you aren't storing user information locally.
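To make the obfuscation step concrete, here is a minimal sketch. It is written in Python with the standard sqlite3 module standing in for the real development database, and the `users` table, its columns, and the salt are all assumptions made up for illustration. The idea is just to overwrite names and emails with stable, meaningless tokens after loading the dump. Note that a salted hash like this is pseudonymization rather than strong anonymization, so treat it as a starting point.

```python
import hashlib
import sqlite3


def pseudonym(value: str, salt: str = "dev-only-salt") -> str:
    """Map a real value to a stable, meaningless token (salt is hypothetical)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def obfuscate_users(conn: sqlite3.Connection) -> None:
    """Overwrite name and email in a hypothetical `users` table."""
    rows = conn.execute("SELECT id, email FROM users").fetchall()
    for user_id, email in rows:
        token = pseudonym(email)
        conn.execute(
            "UPDATE users SET name = ?, email = ? WHERE id = ?",
            (f"user_{token}", f"{token}@example.com", user_id),
        )
    conn.commit()


def demo() -> list:
    """Build a one-row table, obfuscate it, and return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.execute(
        "INSERT INTO users (name, email) "
        "VALUES ('Ada Lovelace', 'ada@real-customer.com')"
    )
    obfuscate_users(conn)
    return conn.execute("SELECT name, email FROM users").fetchall()
```

Because the token is derived from the original value, the same person gets the same fake email on every reload, which keeps joins and lookups in development behaving consistently.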
Related
Scenario:
Building a commercial app consisting of a RESTful backend in Symfony2 and a frontend in AngularJS
This app will never be used by many customers (selling 100 copies would be fantastic; hopefully many more, but in any case it won't be massive)
I want to have a multi tenant structure for the database with one schema per customer (they store sensitive information for their customers)
I'm aware of problem when updating schemas but I will have to live with it.
Today I have a MySQL demo database that I will clone each time a new customer purchases the app.
There is no relationship between my customers, so I don't need to communicate with multiple shards for any query
A single customer may be using the app from several devices at the same time, but there won't be massive write operations in the db
My question
Trying to set some functional tests for the backend API I read about having a dedicated sqlite database for loading testing data, which seems to be good idea.
However I wonder if it's also a good idea to switch from MySQL to SQLite3 database as my main database support for the application, and if it's a common practice to have one dedicated SQLite3 database PER CLIENT. I've never used SQLite and I have no idea if the process of updating a schema and replicate the changes in all the databases is done in the same way as for other RDBMS
Is this a correct scenario for SQLite?
Any suggestion (aka tutorial) in how to achieve this?
[I wonder] if it's a common practice to have one dedicated SQLite3 database PER CLIENT
Only if the database is deployed along with the application, like on a phone. Otherwise I've never heard of such a thing.
I've never used SQLite and I have no idea if the process of updating a schema and replicate the changes in all the databases is done in the same way as for other RDBMS
SQLite is a SQL database and responds to ALTER TABLE and the like. As for updating all the schemas, you'll have to re-run the update for all schemas.
Schema syncing is usually handled by an outside utility; your ORM will usually have something. Some are server agnostic, some only support specific servers. There are also dedicated database change management tools such as Sqitch.
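To show what "re-run the update for all schemas" can look like with one SQLite file per client, here is a rough sketch in Python using only the standard library. The migration list, the `.sqlite3` file naming, and the directory layout are assumptions for illustration; a real project would more likely lean on its ORM's migration tool. It uses SQLite's `PRAGMA user_version` to remember how far each file has been upgraded.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical ordered list of schema changes; position + 1 is the version.
MIGRATIONS = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE customers ADD COLUMN email TEXT",
]


def upgrade(db_path: Path) -> int:
    """Apply migrations this database has not seen yet; return its version."""
    conn = sqlite3.connect(str(db_path))
    try:
        version = conn.execute("PRAGMA user_version").fetchone()[0]
        for number, statement in enumerate(MIGRATIONS, start=1):
            if number > version:
                conn.execute(statement)
                conn.execute(f"PRAGMA user_version = {number}")
        conn.commit()
        return conn.execute("PRAGMA user_version").fetchone()[0]
    finally:
        conn.close()


def upgrade_all(tenant_dir: Path) -> dict:
    """Re-run the update for every tenant database file in a directory."""
    return {p.name: upgrade(p) for p in sorted(tenant_dir.glob("*.sqlite3"))}


def demo() -> dict:
    """Create two empty tenant files and bring both up to date."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("tenant_a.sqlite3", "tenant_b.sqlite3"):
            (Path(tmp) / name).touch()  # an empty file is a valid empty database
        return upgrade_all(Path(tmp))
```

Running `upgrade_all` a second time is a no-op, since each file already records version 2.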
However I wonder if it's also a good idea to switch from MySQL to SQLite3 database as my main database support for the application, and
SQLite's main advantage is not requiring you to install and run a server. That makes sense for quick projects or where you have to deploy the database, like a phone app. For a server-based application there's no problem having a database server, and SQLite's very restricted set of SQL features becomes a disadvantage. It will also likely run slower than a server database for anything but the simplest queries.
Trying to set some functional tests for the backend API I read about having a dedicated sqlite database for loading testing data, which seems to be good idea.
Under no circumstances should you test with a different database than the production database. Databases do not all implement SQL the same, MySQL is particularly bad about this, and your tests will not reflect reality. Running a MySQL instance for testing is not much work.
This separate schema thing claims three advantages...
Extensibility (you can add fields whenever you like)
Security (a query cannot accidentally show data for the wrong tenant)
Parallel Scaling (you can potentially split each schema onto a different server)
What they're proposing is equivalent to having a separate, customized copy of the code for every tenant. You wouldn't do that, it's obviously a maintenance nightmare. Code at least has the advantage of version control systems with branching and merging. I know only of one database management tool that supports branching, Sqitch.
Let's imagine you've made a custom change to tenant 5's schema. Now you have a general schema change you'd like to apply to all of them. What if the change to 5 conflicts with this? What if the change to 5 requires special data migration different from everybody else? Now let's imagine you've made custom changes to ten schemas. A hundred. A thousand? Nightmare.
Different schemas will require different queries. The application will have to know which schema each tenant is using, there will have to be some sort of schema version map you'll need to maintain. And every different possible query for every different possible schema will have to be maintained in the application code. Nightmare.
Yes, putting each tenant in a separate schema is more secure, but that only protects against writing bad queries or including a query builder (which is a bad idea anyway). There are better ways to mitigate the problem, such as the view filter suggested in the docs. There are many other ways an attacker can access tenant data that this doesn't address: gain a database connection, gain access to the filesystem, sniff network traffic. I don't see the small security gain being worth the maintenance nightmare.
As for scaling, the article is ten years out of date. There are far, far better ways to achieve parallel scaling than to coarsely put schemas on different servers. There are entire databases dedicated to this idea. Fortunately, you don't need any of this! Scaling won't be a problem for you until you have tens of thousands to millions of tenants. The idea of front loading your design with a schema maintenance nightmare for a hypothetical big parallel scaling problem is putting the cart so far before the horse, it's already at the pub having a pint.
If you want to use a relational database I would recommend PostgreSQL. It has a very rich SQL implementation, it's fast and scales well, and it has something that renders this whole idea of separate schemas moot: a built-in JSON type. This can be used to implement the "extensibility" mentioned in the article. Each table can have a meta column using the JSON type that you can throw any extra data you like into. The application does not need special queries; the meta column is always there. PostgreSQL's JSON operators make working with the meta data very easy and efficient.
You could also look into a NoSQL database. There are plenty to choose from and many support custom schemas and parallel scaling. However, it's likely you will have to change your choice of framework to use one that supports NoSQL.
I'm working on a group project where we all have a mysql database working on a local machine. The table mainly has filenames and stats used for image processing. We all will run some processing, which updates the database locally with results.
I want to know what the best way is to update everyone else's database, once someone has changed theirs.
My idea is to perform a mysqldump after each processing run, and let that file be tracked by git (which we use religiously). I've written a bunch of python utils for the database, and it would be simple enough to read this dump into the database when we detect that the db is behind. I don't really want to do this though, lest it clog up our git repo with unnecessary 10-50 MB files with every commit.
Does anyone know a better way to do this?
*I'll also note that we are Aerospace students. I have some DB experience, but it only comes out of need. We're busy and I'm not looking to become an IT networking guru. Just want to keep it hands off for them since they are DB noobs and get the glazed over look of fear whenever I tell them to do anything with the database. I made it hands off for them thus far.
You might want to consider following the Rails-style database migration concept, whereby as you are developing you provide roll-forward and roll-back SQL statements that work as patches, allowing you to roll your database to any particular revision state that is required.
Of course, this is typically meant for dealing with schema changes only (i.e. you don't worry about revisioning data that might be dynamically populated into tables.). For configuration tables or similar tables that are basically static in content, you can certainly add migrations as well.
A Google search for "rails migrations for python" turned up a number of results, including the following tool:
http://pypi.python.org/pypi/simple-db-migrate
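To give a feel for the roll-forward / roll-back idea without committing to any particular tool, here is a small hand-rolled sketch in Python. It is not how simple-db-migrate works internally; the revision list, the table names, and the `schema_revision` bookkeeping table are all made up for illustration, and sqlite3 is used only so the example runs without a MySQL server.

```python
import sqlite3

# Each revision is a hypothetical (roll-forward SQL, roll-back SQL) pair.
REVISIONS = [
    ("CREATE TABLE images (id INTEGER PRIMARY KEY, filename TEXT)",
     "DROP TABLE images"),
    ("CREATE TABLE stats (image_id INTEGER, mean REAL)",
     "DROP TABLE stats"),
]


def current_revision(conn: sqlite3.Connection) -> int:
    """Read the stored revision number, creating the bookkeeping row if needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_revision (n INTEGER NOT NULL)")
    row = conn.execute("SELECT n FROM schema_revision").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_revision (n) VALUES (0)")
        return 0
    return row[0]


def migrate_to(conn: sqlite3.Connection, target: int) -> int:
    """Patch the database forward or backward until it is at `target`."""
    if not 0 <= target <= len(REVISIONS):
        raise ValueError(f"unknown revision {target}")
    rev = current_revision(conn)
    while rev < target:  # roll forward
        conn.execute(REVISIONS[rev][0])
        rev += 1
    while rev > target:  # roll back
        rev -= 1
        conn.execute(REVISIONS[rev][1])
    conn.execute("UPDATE schema_revision SET n = ?", (rev,))
    conn.commit()
    return rev


def table_names(conn: sqlite3.Connection) -> list:
    """List the tables currently present, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [r[0] for r in rows]
```

The revision list lives in version control, so a teammate who pulls simply calls `migrate_to(conn, len(REVISIONS))` to catch up. As noted above, this covers schema and static tables, not the processing results themselves.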
I would suggest creating a DEV MySQL server on any shared hosting. (No DB experience is required.)
Allow remote access to this server. (again, no experience is required, everything could be done through Control Panel)
And you and your group of developers will have access to the database at any time from any place and from any device. (As long as you have internet connection)
What processes do you put in place when collaborating in a small team on websites with databases?
We have no problems working on site files as they are under revision control, so any number of our developers can work from any location on this aspect of a website.
But, when database changes need to be made (either directly as part of the development or implicitly by making content changes in a CMS), obviously it is difficult for the different developers to then merge these database changes.
Our approaches thus far have been limited to the following:
Putting a content freeze on the production website and having all developers work on the same copy of the production database
Delegating tasks that will involve database changes to one developer and then asking other developers to import a copy of that database once changes have been made; in the meantime other developers work only on site files under revision control
Allowing developers to make changes to their own copy of the database for the sake of their own development, but then manually making these changes on all other copies of the database (e.g. providing other developers with an SQL import script pertaining to the database changes they have made)
I'd be interested to know if you have any better suggestions.
We work mainly with MySQL databases and at present do not keep track of revisions to these databases. The problems discussed above pertain mainly to Drupal and Wordpress sites where a good deal of the 'development' is carried out in conjunction with changes made to the database in the CMS.
You put all your database changes in SQL scripts. Put some kind of sequence number into the filename of each script so you know the order they must be run in. Then check in those scripts into your source control system. Now you have reproducible steps that you can apply to test and production databases.
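As a rough illustration of the numbered-scripts approach, here is a sketch of a tiny runner in Python. The file names, the `applied_scripts` bookkeeping table, and the use of sqlite3 instead of MySQL are assumptions made so the example is self-contained. It sorts scripts by their numeric prefix (so `10_` runs after `2_`) and skips anything already applied.

```python
import re
import sqlite3
import tempfile
from pathlib import Path


def sequence_number(path: Path) -> int:
    """Extract the leading number from names like '10_add_index.sql'."""
    match = re.match(r"(\d+)", path.name)
    if match is None:
        raise ValueError(f"no sequence number in {path.name}")
    return int(match.group(1))


def apply_scripts(conn: sqlite3.Connection, script_dir: Path) -> list:
    """Run not-yet-applied scripts in numeric order; return the names that ran."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS applied_scripts (name TEXT PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT name FROM applied_scripts")}
    ran = []
    for path in sorted(script_dir.glob("*.sql"), key=sequence_number):
        if path.name in done:
            continue
        conn.executescript(path.read_text())
        conn.execute("INSERT INTO applied_scripts (name) VALUES (?)", (path.name,))
        conn.commit()
        ran.append(path.name)
    return ran


def demo() -> list:
    """Write three hypothetical scripts, then run the runner twice."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "10_add_index.sql").write_text(
            "CREATE INDEX idx_pages_title ON pages (title);"
        )
        (d / "1_create_pages.sql").write_text(
            "CREATE TABLE pages (id INTEGER PRIMARY KEY);"
        )
        (d / "2_add_title.sql").write_text(
            "ALTER TABLE pages ADD COLUMN title TEXT;"
        )
        conn = sqlite3.connect(":memory:")
        first = apply_scripts(conn, d)
        second = apply_scripts(conn, d)
        return [first, second]
```

The second run returns an empty list, which is what makes the same command safe to run against dev, test, and production copies.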
While you could put all your DDL into the VC, this can get very messy very quickly if you try to manage lots and lots of ALTER statements.
Forcing all developers to use the same source database is not a very efficient approach either.
The solution I used was to maintain a file for each database entity specifying how to create the entity (primarily so the changes could be viewed using a diff utility), then manually creating ALTER statements by comparing the release version with the current version - yes, it is rather labour intensive but the only way I've found to solve the problem.
I had a plan to automate the generation of the ALTER statements - it should be relatively straightforward - indeed a quick google found this article and this one. Never got round to implementing one myself since the effort of doing so was much greater than the frequency of schema changes on the projects I was working on.
Where i work, every developer (actually, every development virtual machine) has its own database (or rather, its own schema on a shared Oracle instance). Our working process is based around complete rebuilds. We don't have any ability to modify an existing database - we only ever have the nuclear option of blowing away the whole schema and building from scratch.
We have a little 'drop everything' script, which uses queries on system tables to identify every object in the schema, constructs a pile of SQL to drop them, and runs it. Then we have a stack of DDL files full of CREATE TABLE statements, then we have a stack of XML files containing the initial data for the system, which are loaded by a loading tool. All of this is checked into source control. When a developer does an update from source control, if they see incoming database changes (DDL or data), they run the master build script, which runs them in order to create a fresh database from scratch.
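For a sense of how small the 'drop everything, then rebuild' loop can be, here is a sketch in Python. It uses sqlite3 and its `sqlite_master` catalog as a stand-in for the Oracle system tables, and the DDL and seed rows are invented for the example, so take it as the shape of the idea rather than our actual script.

```python
import sqlite3

# Hypothetical schema and initial data, as they would live in source control.
DDL = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "CREATE VIEW customer_names AS SELECT name FROM customers",
]
SEED_ROWS = [("Acme",), ("Globex",)]


def drop_everything(conn: sqlite3.Connection) -> None:
    """Ask the catalog for every view and table, then drop them (views first)."""
    objects = conn.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('view', 'table') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY type DESC"
    ).fetchall()
    for kind, name in objects:
        conn.execute(f'DROP {kind.upper()} IF EXISTS "{name}"')


def rebuild(conn: sqlite3.Connection) -> None:
    """The 'master build script': wipe, recreate the schema, load initial data."""
    drop_everything(conn)
    for statement in DDL:
        conn.execute(statement)
    conn.executemany("INSERT INTO customers (name) VALUES (?)", SEED_ROWS)
    conn.commit()
```

Anything created by hand in the database (stray tables, scratch rows) disappears on the next `rebuild`, which is exactly the point.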
The good thing is that this makes life simple. We never need to worry about diffs, deltas, ALTER TABLE, reversibility, etc, just straightforward DDL and data. We never have to worry about preserving the state of the database, or keeping it clean - you can get back to a clean state at the push of a button. Another important feature of this is that it makes it trivial to set up a new platform - and that means that when we add more development machines, or need to build an acceptance system or whatever, it's easy. I've seen projects fail because they couldn't build new instances from their muddled databases.
The main bad thing is that it takes some time - in our case, due to the particular depressing details of our system, a painfully long time, but i think a team that was really on top of its tools could do a complete rebuild like this in 10 minutes. Half an hour if you have a lot of data. Short enough to be able to do a few times during a working day without killing yourself.
The problem is what you do about data. There are two sides to this: data generated during development, and live data.
Data generated during development is actually pretty easy. People who don't work our way are presumably in the habit of creating that data directly in the database, and so see a problem in that it will be lost when rebuilding. The solution is simple: you don't create the data in the database, you create it in the loader scripts (XML in our case, but you could use SQL DML, or CSV with your database's import tool, or whatever). Think of the loader scripts as being source code, and the database as object code: the scripts are the definitive form, and are what you edit by hand; the database is what's made from them.
Live data is tougher. My company hasn't developed a single process which works in all cases - i don't know if we just haven't found the magic bullet yet, or if there isn't one. One of our projects is taking the approach that live is different to development, and that there are no complete rebuilds; rather, they have developed a set of practices for identifying the deltas when making a new release and applying them manually. They release every few weeks, so it's only a couple of days' work for a couple of people that often. Not a lot.
The project i'm on hasn't gone live yet, but it is replacing an existing live system, so we have a similar problem. Our approach is based on migration: rather than trying to use the existing database, we are migrating all the data from it into our system. We have written a rather sprawling tool to do this, which runs queries against the existing database (a copy of it, not the live version!), then writes the data out as loader scripts. These then feed into the build process just like any others. The migration is scripted, and runs every night as part of our daily build. In this case, the effort needed to write this tool was necessary anyway, because our database is very different in structure to the old one; the ability to do repeatable migrations at the push of a button came for free.
When we go live, one of our options will be to adapt this process to migrate from old versions of our database to new ones. We'll have to write completely new queries, but they should be very easy, because the source database is our own, and the mapping from it to the loader scripts is, as you would imagine, straightforward, even as the new version of the system drifts away from the live version. This would let us keep working in the complete rebuild paradigm - we still wouldn't have to worry about ALTER TABLE or keeping our databases clean, even when we're doing maintenance. I have no idea what the operations team will think of this idea, though!
You can use the replication module of the database engine, if it has one.
One server will be the master, changes are to be made on it.
Developers' copies will be slaves.
Any changes on the master will be duplicated on the slaves.
It's a one way replication.
Can be a bit tricky to put into place as any changes on the slaves will be erased.
Also it means that each developer should have two copies of the database.
One will be the slave and another the "development" database.
There are also tools for cross database replications.
So any copies can be the master.
Both solutions can lead to disasters (replication errors).
The only solution I see fit is to have only one database for all developers and save it several times a day on a rotating history.
It won't save you from conflicts, but you will be able to restore the previous version if it happens (and it always does...).
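The "rotating history" part could look something like the following sketch in Python. It assumes the database has already been dumped to a file by whatever tool you use; the file names, the `history` directory, and the `keep` count are hypothetical. It copies the dump under a timestamped name and deletes the oldest copies beyond the limit.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def rotate_backup(source: Path, backup_dir: Path, keep: int, now: datetime) -> list:
    """Copy `source` into `backup_dir` with a timestamp, keep the newest `keep`."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%d-%H%M%S")  # sorts chronologically as text
    shutil.copy2(source, backup_dir / f"{source.stem}-{stamp}{source.suffix}")
    backups = sorted(backup_dir.glob(f"{source.stem}-*{source.suffix}"))
    for old in backups[:-keep]:
        old.unlink()
    return [p.name for p in backups[-keep:]]


def demo() -> list:
    """Save a fake dump four times in one day while keeping three copies."""
    with tempfile.TemporaryDirectory() as tmp:
        dump = Path(tmp) / "site.sql"
        dump.write_text("-- dump contents")
        names = []
        for hour in (9, 12, 15, 18):
            names = rotate_backup(
                dump, Path(tmp) / "history", keep=3, now=datetime(2024, 1, 1, hour)
            )
        return names
```

In practice this would be called from a scheduled job with `now=datetime.now()`; the explicit `now` parameter is only there to keep the example deterministic.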
Where I work we are using Dotnetnuke and this poses the same problems. i.e. once released the production site has data going into the database as well as files being added to the file system by some modules and in the DNN file system.
We are versioning the site file system with svn which for the most part works ok. However, the database is a different matter. The best method we have come across so far is to use RedGate tools to synchronise the staging database with the production database. RedGate tools are very good and well worth the money.
Basically we all develop locally with a local copy of the database and site. If the changes are major we branch. Then we commit locally and do a RedGate merge to put our DB changes onto the shared dev server.
We use a shared dev server so others can do the testing. Once complete we then update the site on staging with svn and then merge the database changes from the development server to the staging server.
Then to go live we do the same from staging to prod.
This method works but is prone to error and is very time consuming when small changes need to be made. The prod DB is always backed up so we can roll back easily if a delivery goes wrong.
One major headache we have is that Dotnetnuke uses identity cols in many tables, and if you have data going into tables on both development and production (such as tabs, permissions and module instances) you have a nightmare syncing them. Ideally you want to find or build a CMS that uses GUIDs or something similar as keys in the database, so you can easily sync tables that are in use concurrently.
We'd love to find a better method, as we have a lot of trouble with branching and merging when projects run concurrently.
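To illustrate why GUID keys make this kind of syncing easier, here is a small sketch in Python. The `tabs` table and the merge function are invented for the example and have nothing to do with Dotnetnuke's real schema, and sqlite3 stands in for SQL Server. With identity columns, a row created on dev and a row created on prod would both get id 1 and collide; with GUIDs they can simply be copied across.

```python
import sqlite3
import uuid


def make_db() -> sqlite3.Connection:
    """Create a hypothetical `tabs` table keyed by a GUID instead of an identity."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tabs (id TEXT PRIMARY KEY, title TEXT)")
    return conn


def new_tab(conn: sqlite3.Connection, title: str) -> str:
    """Insert a row with a freshly generated GUID and return the key."""
    key = str(uuid.uuid4())
    conn.execute("INSERT INTO tabs (id, title) VALUES (?, ?)", (key, title))
    return key


def merge(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    """Copy rows across; rows the target already has are left untouched."""
    rows = source.execute("SELECT id, title FROM tabs").fetchall()
    target.executemany(
        "INSERT OR IGNORE INTO tabs (id, title) VALUES (?, ?)", rows
    )
    target.commit()
```

This only removes the key-collision problem; conflicting edits to the same row still need a policy of their own.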
Gus
I need to work with a fairly large amount of data, and am considering both MySQL and SQLite. So I'm trying to get a good, high-level overview of both packages:
How well does each handle large databases?
Is SQLite as much of a handful to work with as MySQL?
Are there any good (web-based) resources comparing these two?
SQLite is a database library and runs only in the program which uses it. It cannot be written to at the same time from other programs, although other processes can read from it. You cannot connect to it remotely, and it saves its data on a locally accessible filesystem (possibly mounted from a file server). Forget what I said: these statements were based on outdated assumptions, and I need to read up on sqlite3 because it can now do things I was not aware of.
MySQL is a database server, i.e. can run on another machine and multiple computers and programs can connect to it at the same time.
Although SQLite can also handle quite large datasets, in most circumstances people will choose MySQL for large datasets, because they want remote access (without exposing the database files to well intentioned, inadvertent "cleanup" actions) to the data while the program is running for administrative purposes or to run reports.
If your application is an embedded database which only ever will be used by a single application SQLite will be just fine.
And no, SQLite is not as much of a handful as MySQL. MySQL is not really difficult, but it has a number of strange quirks which hit people when they try to get it installed. Once it is running it is pretty painless.
You might look at PostgreSQL, as I find it a bit easier to manage and maintain as I feel some aspects are more 'logical' than MySQL. That being said, in practice there is not a huge difference.
What's the fastest way to export/import a mysql database using innodb tables?
I have a production database which I periodically need to download to my development machine to debug customer issues. The way we currently do this is to download our regular database backups, which are generated using "mysqldump -B dbname" and then gzipped. We then import them using "gunzip -c backup.gz | mysql -u root".
From what I can tell from reading "mysqldump --help", mysqldump runs with --opt by default, which looks like it turns on a bunch of the things that I can think of that would make imports faster, such as turning off indexes and importing tables as one massive import statement.
Are there better ways to do this, or further optimizations we should be doing?
Note: I mostly want to optimize the time it takes to load the database onto my development machine (a relatively recent macbook pro, with lots of ram). Backup time and network transfer time currently aren't big issues.
Update:
To answer some questions posed in the answers:
The production database schema changes up to a couple times a week. We're running rails, so it's relatively easy to run the migrate scripts on stale production data.
We need to put production data into a development environment potentially on a daily or hourly basis. This entirely depends on what a developer is working on. We often have specific customer issues that are the result of some data spread across a number of tables in the db, which needs to be debugged in a development environment.
I honestly don't know how long mysqldump takes. Less than 2 hours, since we currently run it every 2 hours. However, that's not what we're trying to optimize, we want to optimize the import onto the developer workstation.
We don't need the full production database, but it's not totally trivial to separate what we do and don't need (there are a lot of tables with foreign key relationships). This is probably where we'll have to go eventually, but we'd like to avoid it for a bit longer if we can.
It depends on how you define "fastest".
As Joel says, developer time is expensive. Mysqldump works and handles a lot of cases you'd otherwise have to handle yourself or spend time evaluating other products to see if they handle them.
The pertinent questions are:
How often does your production database schema change?
Note: I'm referring to adding, removing or renaming tables, columns, views and the like, i.e. things that will break actual code.
How often do you need to put production data into a development environment?
In my experience, not very often at all. I've generally found that once a month is more than sufficient.
How long does mysqldump take?
If it's less than 8 hours it can be done overnight as a cron job. Problem solved.
Do you need all the data?
Another way to optimize this is to simply get a relevant subset of data. Of course this requires a custom script to be written to get a subset of entities and all relevant related entities but will yield the quickest end result. The script will also need to be maintained through schema changes so this is a time-consuming approach that should be used as an absolute last resort. Production samples should be large enough to include a sufficiently broad sample of data and identify any potential performance problems.
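To show the kind of custom script this implies, here is a deliberately tiny sketch in Python. The `customers` and `orders` tables and their single foreign key are hypothetical, and sqlite3 stands in for MySQL so the example is self-contained. A real schema with many related tables would need one such step per relationship, which is where the maintenance cost comes from.

```python
import sqlite3

# Hypothetical two-table schema with one foreign key relationship.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
"""


def make_db() -> sqlite3.Connection:
    """Create an empty in-memory database with the example schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def copy_subset(src: sqlite3.Connection, dst: sqlite3.Connection,
                customer_ids: list) -> tuple:
    """Copy the chosen customers plus every order that references them."""
    marks = ",".join("?" for _ in customer_ids)
    customers = src.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", customer_ids
    ).fetchall()
    orders = src.execute(
        f"SELECT id, customer_id, total FROM orders "
        f"WHERE customer_id IN ({marks})",
        customer_ids,
    ).fetchall()
    # Parents first, so the foreign key targets exist before the children.
    dst.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()
    return len(customers), len(orders)
```

Picking the customer whose issue is being debugged and pulling only their rows is usually far faster to import than the full dump, at the cost of keeping this script in step with the schema.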
Conclusion
Basically, just use mysqldump until you absolutely can't. Spending time on another solution is time not spent developing.
Consider using replication. That would allow you to update your copy in real time, and MySQL replication allows for catching up even if you have to shut down the slave. You could also use a parallel MySQL instance on your normal server that replicates the data to a MyISAM table, which supports online backup. MySQL allows for this as long as the tables have the same definition.
Another option that might be worth looking into is XtraBackup from renowned MySQL performance specialists Percona. It's an online backup solution for InnoDB. I haven't looked at it myself, though, so I won't vouch for its stability or that it's even a workable solution for your problem.