Migrating and comparing a SQL Server database - sql-server-2008

Today we downloaded Red Gate's SQL Toolbelt, in order to automate some database tasks that take a long time at our company.
The first task involves a 15 GB database we have, with a lot of indexes, constraints and also several triggers. We want this database migrated exactly (schema, all the data, triggers, etc.) to a new DB, with the idea of reducing its size and also getting better performance by hiding all the mistakes committed in the past. Unfortunately this was the first customer-release DB of one of our products, and we used it to test lots of things that didn't always work out well. We are sure that if we do something like this, we will get more than 50% of the size back on our disk.
Can one of the Toolbelt tools, or several of them combined, be useful for this? If not, is there another tool available that is useful for this task?

One common way this can happen is if you are not selecting all your tables to be included in the compare. For example, you may have selected a child table and not the parent table. This could lead to a FK error like you describe.
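Separately from the compare step: whichever tool ends up doing the migration, it can help to first see where the 15 GB actually goes, so you know which tables and indexes are worth the effort. A minimal T-SQL sketch using standard SQL Server catalog views, run against the source database (no Red Gate tooling assumed):

    -- Rough per-table space usage (data + indexes), largest first.
    -- reserved_page_count is in 8 KB pages.
    SELECT o.name AS table_name,
           SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
    FROM sys.dm_db_partition_stats AS ps
    JOIN sys.objects AS o ON o.object_id = ps.object_id
    WHERE o.is_ms_shipped = 0
    GROUP BY o.name
    ORDER BY reserved_mb DESC;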

Data pipeline proposal

Our product has been growing steadily over the last few years, and we are now at a turning point as far as the data size of some of our tables is concerned: we expect those tables to double or triple in the next few months, and to grow even more in the next few years. We are talking in the range of 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) we assume around 10M by the end of the year (M meaning million, not thousand).
The table we are talking about is sort of a logging table. The application receives data files (csv/xls) on a daily basis and the data is transferred into said table. It is then used in the application for a specific amount of time - a couple of weeks/months - after which it becomes rather redundant. That is, if all goes well: if there is some problem down the road, the data in those rows can be useful to inspect for problem solving.
What we would like to do is periodically clean up the table, removing rows based on certain requirements, but instead of actually deleting the rows, move them 'somewhere else'.
We currently use MySQL as our database, and the 'somewhere else' could be MySQL as well, but it can be anything. For other projects we have a master/slave setup where the whole database is involved, but that's not what we want or need here. It's just some tables, where the master table would need to become shorter and the slave only bigger; not a one-to-one sync.
The main requirement for the secondary store is that the data should be easy to inspect/query when needed, either with SQL or another DSL, or just with visual tooling. So we are not interested in backing the data up to one or more CSV files or another plain-text format, since that is not as easy to inspect: the logs would then sit somewhere on S3 and we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it and maintain it. Every new piece of technology adds to our stack; the lighter it stays, the better we like it ;).
Thanks!
PS: we are not just being lazy here; we have done some research, but we thought it would be a good idea to get some more insight into the problem.
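For reference, the simplest version of 'move them somewhere else' that stays inside plain MySQL is an archive table (in the same or a separate schema on the same server) filled by a scheduled INSERT ... SELECT followed by a DELETE. A minimal sketch, with hypothetical table and column names (log_entries, log_archive, created_at):

    -- One-time: an archive table with the same structure as the live table.
    CREATE TABLE log_archive LIKE log_entries;

    -- Periodic job (cron or a MySQL EVENT): move rows older than 3 months.
    SET @cutoff = NOW() - INTERVAL 3 MONTH;
    START TRANSACTION;
    INSERT INTO log_archive SELECT * FROM log_entries WHERE created_at < @cutoff;
    DELETE FROM log_entries WHERE created_at < @cutoff;
    COMMIT;

The archive stays queryable with ordinary SQL, which covers the 'easy to inspect' requirement, and nothing new has to be installed or learned.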

MySQL database activity log: fields vs table

So basically I am in the process of creating a personal finance tracking system. It occurred to me that keeping tabs on when each instance and transaction was last edited or updated might be relevant information some day.
Now as far as I can see there are two approaches to implement something like this:
Add "updated" fields to all the tables I want to keep track of and then let MySQL update those fields for me (ON UPDATE clause; sketched just after this list)
Create a completely separate table for holding the log data and then update it with triggers and transactions
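For reference, option 1 boils down to one extra column per tracked table that MySQL maintains by itself. A minimal sketch, using a hypothetical transactions table:

    -- Option 1: MySQL keeps the timestamp current on every UPDATE.
    ALTER TABLE transactions
        ADD COLUMN updated_at TIMESTAMP NOT NULL
            DEFAULT CURRENT_TIMESTAMP
            ON UPDATE CURRENT_TIMESTAMP;

Note that older MySQL versions only allow one TIMESTAMP column per table to use CURRENT_TIMESTAMP defaults like this.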
Now, the first approach would have the benefit of keeping things simple and easy to maintain. However, how will it impact performance if I suddenly decide to pull every log entry out of the database for review? It also goes somewhat against normalization (not by much, though), with the same data stored in multiple tables.
The second approach would allow more flexibility in the logging system and might actually shorten the SQL queries needed to retrieve certain data. However, it would make the schema more complex, as two additional tables would have to be created (the actual log table and a many-to-many relation table for holding the keys) and maintained. On the other hand, if I ever want to implement an activity history, this approach would probably be the only one capable of it.
So I would like to know more pros and cons of each method. Since the second option allows more flexibility I am considering implementing it, but I am not sure about the performance implications. In the end it comes down to two questions:
Are there any real-life examples where both approaches are implemented?
And:
Are there any studies, comparisons or other resources that might shed some light on which is considered the more performance-friendly and "best practices" approach?
It depends on what kind of reporting you need and your current architecture.
If you just want to know the last update date, then having two fields (creation date and last update) should be enough. A separate table won't give you any performance boost there, but it will make your code harder to maintain.
It's another story if you want something more elaborate, like reporting differences (what was changed) and/or having a full change log for each transaction (there might be a few updates to one transaction, right?). In that case you really must have a separate table, because otherwise it will bloat your main table and reduce performance.
Based on my experience, I'd go with the separate table. It will be easier to maintain - your logging logic will be practically separated from everything else - and I think one day you'll want that additional info on your transactions and a full transaction history.
As far as performance goes, you won't notice any formidable difference unless your system is under serious load. But since your system is personal, either choice will suffice; just don't forget about proper indexing.
Note that I'm making a lot of assumptions here, so if you want something more specific, please provide your actual architecture and reporting needs. I could suggest some books on high availability/performance, but they cover general availability/performance rather than your specific needs.
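If you do go with the separate table, a minimal sketch of option 2 looks roughly like this - a generic log table plus an AFTER UPDATE trigger on a hypothetical transactions table (all names are placeholders, and you'd add similar triggers for INSERT/DELETE and for the other tables you want to track):

    -- One row per change: which table, which row, what happened, when.
    CREATE TABLE activity_log (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        table_name  VARCHAR(64)  NOT NULL,
        row_id      INT UNSIGNED NOT NULL,
        action      VARCHAR(10)  NOT NULL,
        changed_at  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        KEY idx_table_row (table_name, row_id)
    );

    DELIMITER //
    CREATE TRIGGER transactions_after_update
    AFTER UPDATE ON transactions
    FOR EACH ROW
    BEGIN
        INSERT INTO activity_log (table_name, row_id, action)
        VALUES ('transactions', NEW.id, 'update');
    END//
    DELIMITER ;

If you later want a full change history, old/new values can be added as extra columns (or a TEXT blob) without touching the tracked tables themselves.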

How can I optimize my database?

I am creating a platform for some clients. Each client needs to have contacts and manage them in groups, categories (which depend on the group) and subcategories (which depend on the category).
The database is going to be very big, and I'm worried about performance. I want to optimize the database; right now, I have these options:
Manage only one database with multiple tables (as we manage now)
Create a database for each client (each database would have the same tables as in option 1)
Manage multiple XML files (like option 2, each client would have a directory with an XML file for contacts, another for groups, another for categories, and so on)
Which is the best option for performance and for managing the data (CRUD: create, read, update, delete)?
Thanks!!
I think one database with multiple tables is the way to go, because duplicating the database and schema for each new client doesn't scale well. XML files sound cool, but so far I haven't seen an XML read/write engine that is as fast as most RDBMSes, so bin that one.
To make this work (lots of tables in one database) you should pay attention to indexing and optimizing the one database; indexes in particular will help you maintain speed as you scale up.
Use a clustered index on the clientId in whichever table it exists in as a foreign key. This will give you the best client-centric performance, because you would (usually) be pulling a particular client's info in a page fetch.
For #2, I would suggest making that a premium service to your clients. If they want "priority hosting" on a separate server of "their own" then they pay extra. That will make the maintenance headache worthwhile.
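A minimal sketch of that idea for a shared, single-database layout in MySQL/InnoDB, where the primary key is the clustered index, so putting the client id first keeps each client's rows physically together (table and column names are hypothetical):

    CREATE TABLE contacts (
        client_id   INT UNSIGNED NOT NULL,
        contact_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
        group_id    INT UNSIGNED NOT NULL,
        name        VARCHAR(255) NOT NULL,
        PRIMARY KEY (client_id, contact_id),  -- InnoDB clusters rows on the PK,
        KEY idx_auto (contact_id),            -- so one client's rows sit together
        KEY idx_group (client_id, group_id)
    ) ENGINE=InnoDB;

Every query then filters on client_id first, which is also what keeps clients from ever touching each other's data.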
Have you tried actually implementing 1 (which is the easiest)?
Did you profile the code?
What is the performance now?
Have you used EXPLAIN to see how the queries are performing? (There's a small example of this after this answer.)
Do you use indexes? (Often the correct indexes alone are enough to give an excellent performance boost.)
Optimize when you hit a bottleneck (or when you set certain benchmarks for performance), not during design phase...
UPDATE: You mentioned "millions of entries". That's nothing for MySQL (provided you use the correct indexes on your tables). I have a table with about 40 million rows, and although it's not lightning fast, it gives me results in a couple of seconds. So there you go...
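A small illustration of that EXPLAIN workflow (table, column and index names are hypothetical):

    -- Look at the type and key columns: type=ALL with key=NULL means
    -- a full table scan on every execution.
    EXPLAIN SELECT * FROM contacts WHERE category_id = 42;

    -- Add an index on the filtered column and re-check: type should
    -- change to ref and key should name the new index.
    ALTER TABLE contacts ADD INDEX idx_category (category_id);
    EXPLAIN SELECT * FROM contacts WHERE category_id = 42;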
3 is not advisable. Searching and the like is not something XML files do efficiently.
2 is a maintenance problem.
1 should be doable. "Very big" means what? I have a database with a table that currently holds 1.5 billion entries - that is "big", not "very big". What do you define as very big?
As far as ongoing maintenance and support goes I think only option 1 makes sense for you.
Index all the columns you need to, but nothing more. Look at your code, see how tables are being JOINed, and index the columns that would otherwise require a table scan.
Indexes will speed up your read operations but slow down your write operations, since the indexes need to be updated as well as the columns. They also need more space in the DB.
As suggested above, use EXPLAIN to see how your queries are executing and what can be optimized there.
Finally, performance tuning only works well after you baseline your existing performance, make a change, then baseline performance again to see if it helped. If not, roll back and try something else. But always start from a known level of performance; otherwise you might end up making multiple changes which, in total, slow things down. Good luck!

Database separation - MySQL

I have a main MySQL DB set up, and a class to handle the queries to it. It runs really nicely. I am building a custom advertising system on my site and I'm wondering if there is any benefit to creating a separate database altogether to handle that system?
Are there any pitfalls to doing it either way?
Option #1 - one DB for main website function, one DB for advertising system
Option #2 - one DB for both main website function and advertising system
Well, you need a new connection for every database you use, and also a new instance of your DB class - both cost some (minimal) memory. I personally see no reason why you would need/want to do this. If you just want to separate the two things, maybe you could use a prefix like "adv_" for the advertisement tables.
Edit: another problem could come up if you ever want to combine (e.g. join) data from the two databases - you will have a much easier time if you do not use multiple databases.
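To illustrate the trade-off: two databases on the same MySQL server can still be joined by qualifying the table names, but every query (and your DB class) then has to carry both database names around, whereas the prefix approach keeps it to one connection and plain table names (all names here are hypothetical):

    -- Two databases on one server: joins work, but are more verbose.
    SELECT u.name, s.clicks
    FROM mainsite.users AS u
    JOIN adsystem.ad_stats AS s ON s.user_id = u.id;

    -- One database with a prefix for the advertising tables.
    SELECT u.name, s.clicks
    FROM users AS u
    JOIN adv_ad_stats AS s ON s.user_id = u.id;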
Johnnietheblack, there is no easy answer here, and not even one right answer: different tables need different approaches, and sometimes you have to throw away an academic/more "secure" database model to improve performance & scalability.
It's always a matter of trade-offs. Based on my personal experience, I have some thoughts to share with you:
When you separate tables into different databases, you have more work to do in your data abstraction layers to keep referential integrity (you have to do the DB chores...) and to link information. On the other hand, it's easier to manage the databases themselves (indexes, data files, query tuning, etc.).
Tables with a high insert rate and low maintenance (update/delete), and where referential integrity is not that important - like log tables - are good candidates to be put in a separate database: although the I/O from inserts is heavy, the records don't change over time, they are rarely retrieved, and their indexes tend to be pretty simple (date/time and some other attribute). I had one case where the log table was so big (millions of records) that at one point a single insert was taking almost 1 second. Since it gained 500 thousand new records each day, it was a snowball: we couldn't stop the system to tune the damn thing because that would take too long, and the system kept shutting down because this log table was used everywhere and was impacting the business (75% of the procedures used this table).
Databases can eat THOUSANDS of records for breakfast, so for small tables (less than 1,000 records) you generally don't need to worry; just the big ones (more than 5,000). I have a DBA friend who simply does not create indexes for performance on most of his tables: he ran some tests and discovered that their SQL Server was changing the query plan to TABLE SCANS for most of them. But be careful here: this is strong medicine!
Try to think about SaaS when it comes to deciding whether a new set of tables should be put inside the same database: does your advertising system need to be tightly integrated with your website, or can it be a separate component, reusable by other components? If it is the latter, you should think about using separate databases, to minimize the impact when you update the schema, do maintenance on the new tables, etc.
There are so many other cases, but alas, we have so little time... The important thing here is to keep an open mind and try to forget a little bit about third-normal-form, academically perfect database models. Hope it helps!

How do I keep 2 scratch Databases in sync

My question is a lot like this one. However I'm on MySQL and I'm looking for the "lowest tech" solution that I can find.
The situation is that I have two databases that should have the same data in them, but they are updated primarily when they are not able to contact each other. I suspect that there is some sort of clustering or master/slave setup that would be able to sync them just fine. However, in my case that is major overkill, as this is just a scratch DB for my own use.
What is a good way to do this?
My current approach is to have a FEDERATED table on one of them and, every so often, stuff the data over the wire to the other with an INSERT/SELECT. It gets a bit convoluted trying to deal with primary keys and whatnot (INSERT IGNORE seems not to work correctly).
p.s. I can easily build a query that selects the rows to transfer.
MySQL's inbuilt replication is very easy to set up and works well even when the DBs are disconnected most of the time. I'd say configuring this would be much simpler than any custom solution out there.
See http://www.howtoforge.com/mysql_database_replication for instructions, you should be up and running in 10-15 mins and you won't have to think about it again.
The only downside I can see is that it is asynchronous - i.e. you must have one designated master that gets all the changes.
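Roughly, the moving parts look like this - all values are placeholders, and the linked guide covers the full procedure, including seeding the slave with an initial dump of the master:

    # On the master (my.cnf), then restart mysqld:
    #   server-id = 1
    #   log_bin   = mysql-bin

    -- On the master: an account the slave replicates as.
    CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
    SHOW MASTER STATUS;  -- note the File and Position values

    -- On the slave (server-id = 2 in its my.cnf):
    CHANGE MASTER TO
        MASTER_HOST = 'master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',  -- from SHOW MASTER STATUS
        MASTER_LOG_POS  = 4;
    START SLAVE;
    SHOW SLAVE STATUS\G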
My current solution is
set up a federated table on the source box that grabs the table on the target box
set up a view on the source box that selects the rows to be updated (as a join of the federated table)
set up another federated table on the target box that grabs the view on the source box
issue an INSERT ... SELECT ... ON DUPLICATE KEY UPDATE on the target box to run the pull.
I guess I could just grab the source table and do it all in one shot, but based on the query logs I've been seeing, I'm guessing that I'd end up with about 20K queries being run, or about 100-300 MB of data transfer, depending on how things happen. The above setup should result in about 4 queries and little more data transferred than actually needs to be.
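For anyone wiring up the same thing, the pieces look roughly like this - the connection string, table, column and view names are all placeholders:

    -- On the target box: a FEDERATED table pointing at the view on the source box.
    CREATE TABLE src_changed_rows (
        id    INT UNSIGNED NOT NULL,
        name  VARCHAR(255),
        qty   INT,
        PRIMARY KEY (id)
    ) ENGINE=FEDERATED
      CONNECTION='mysql://user:pass@source-box:3306/mydb/changed_rows_view';

    -- The pull: insert new rows, update rows whose primary key already exists.
    INSERT INTO local_table (id, name, qty)
    SELECT id, name, qty FROM src_changed_rows
    ON DUPLICATE KEY UPDATE
        name = VALUES(name),
        qty  = VALUES(qty);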