SQL Server: Mitigating schema changes/upgrades - sql-server-2008

I haven't spent a ton of time researching this yet, mostly looking for best practices on upgrading/changing DB schemas.
We're actively developing a new product and as such we often have additions or changes to our DB schema. We also have many copies of the DB -- one for the test environment, one for the prod environment, dev environments, you name it. We don't really want to have to blow away test data every time we want to make a change to the DB.
Are there good ways of automating this or handling this? None of us have really ever had to deal with this so...

Normalize, Normalize, Normalize
Then do it again.
This means that you can just slip new tables / views and other tasty goodness in without disrupting other tables.
I have seen databases that claim to be normalized, but are not. Try and look ahead when thinking about separating things out.
You may pay a bit with joins, but if you query views rather than tables and adopt a good caching strategy, you will be good to go. Some NoSQL databases offer better flexibility, but are a bit like the schizophrenic nephew at the moment in terms of maturity.
What we have is an SQL-independent table description which gets translated into SQL and updates, and an ORM/ActiveRecord/Mapper that uses nothing but data from the SQL database schema itself to work out what is going on ... this means your app adjusts to changes too.
We also use stored procedures heavily for inserts and mainly read from views.
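As a rough illustration of the "table description translated into SQL and updates" idea, here is a minimal sketch using Python's sqlite3 module rather than SQL Server. The `DESIRED_SCHEMA` dictionary and the `sync_schema` helper are hypothetical names invented for this example, and the sketch only ever adds tables and columns, so existing test data is left alone.

```python
import sqlite3

# Hypothetical, database-independent description of the desired schema.
DESIRED_SCHEMA = {
    "member": {"id": "INTEGER PRIMARY KEY", "name": "TEXT", "email": "TEXT"},
}


def sync_schema(conn, desired):
    """Create missing tables and add missing columns; never drop anything."""
    for table, columns in desired.items():
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if not existing:
            cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
            conn.execute(f"CREATE TABLE {table} ({cols})")
            continue
        for name, sql_type in columns.items():
            if name not in existing:
                conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")


conn = sqlite3.connect(":memory:")
# An "old" database that predates the email column and already holds test data.
conn.execute("CREATE TABLE member (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO member (name) VALUES ('alice')")

sync_schema(conn, DESIRED_SCHEMA)
print(conn.execute("SELECT id, name, email FROM member").fetchall())
# prints [(1, 'alice', None)]
```

Running the same function against every copy of the database (dev, test, prod) brings each one up to the described schema without wiping it. Renames, type changes and column removals need explicit migration steps that this sketch does not attempt.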

Related

Why do NoSql databases scale better than relational databases? How should I choose between them?

By nosql databases I mean something like mongodb or dynamodb
I've been trying to find out why NoSQL DBs are usually better at horizontal scaling than relational DBs, and how to choose between them.
I have looked into many videos and posts that cover "SQL vs NoSQL". Most of them end up talking about "Normalization vs Denormalization".
Here are some questions I am still confused about.
1.
Many people say that relational DBs have to follow ACID, so they are bad at horizontal scaling. But ACID is about transactions; we can always choose not to use any transactions, right? I know not many people do this, but if we denormalized tables enough, would it be like NoSQL DBs, where we use almost no transactions? And many NoSQL DBs now have transactions too.
2.
I know denormalization is probably good for horizontal scaling, because if data is spread across many nodes (machines), it'll be hard to do table joins (or transactions).
But, as with transactions, we can choose not to use any table joins.
The only thing I can think of is that NoSQL DBs are schema-free, so it is easier to add new fields (columns) than in an RDB.
What I am trying to ask is:
Why is a "denormalized NoSQL DB" better than a "denormalized relational DB"?
Why is a "normalized NoSQL DB" worse than a "normalized relational DB"?
What's the real thing that prevents a relational database from being denormalized?
I've read this post
https://softwareengineering.stackexchange.com/questions/194340/why-are-nosql-databases-more-scalable-than-sql
It says
"The SQL API lacks a mechanism to describe queries where ACID's requirements are relaxed. This is why the BASE databases are all NoSQL."
Could anyone give me an example of this?
Sorry for not being specific
By NoSQL databases I mean something like mongodb
A blog like https://neo4j.com/blog/acid-vs-base-consistency-models-explained/ explains BASE this way:
Basic Availability: The database appears to work most of the time.
Soft-state: Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
Eventual consistency: Stores exhibit consistency at some later point (e.g., lazily at read time).
This level of equivocation doesn't sound very reliable, does it? They trade off availability and consistency to gain performance and scalability.
This is fine if you're running a service that is tolerant of mismatched data or stale data, or which is okay with some minor amount of data loss once in a while. If those issues are an uncommon occurrence, but you get superior performance nearly all the time, it's very attractive. And more importantly, it demos well.
But if you have to run a service with strict requirements for data integrity, it's no good. If losing even one record of data gets you in trouble with auditors, or if you can't reliably read data you just committed a moment before because that commit takes time to propagate to all nodes of your cluster, it could be a deal-breaker.
So which data store to choose depends on the requirements of your app. Only you can judge if the relaxed availability and consistency of a BASE data store is sufficient for the needs of your app.
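The "can't reliably read data you just committed" case can be shown with a toy model. This is only a sketch in plain Python, not how any particular product works: the `Replica` and `EventuallyConsistentStore` classes are hypothetical, and replication is simulated by an explicit `propagate()` call.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes hit the primary first and reach the secondary later."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # writes accepted but not yet replicated

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read_from_secondary(self, key):
        return self.secondary.data.get(key)

    def propagate(self):
        """Simulate replication catching up."""
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read_from_secondary("balance")  # None: the write has not arrived yet
store.propagate()
fresh = store.read_from_secondary("balance")  # 100: replicas have converged
print(stale, fresh)  # prints: None 100
```

Whether the window between `write` and `propagate` is acceptable is exactly the requirements question described above.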
NoSQL is a term that covers lots of types of storage/query engines e.g. document stores, Graph Databases, etc. - basically anything that looks something like a database but doesn’t use the standard tables/rows/columns structure that a SQL database does.
NoSQL databases were developed to support use cases that relational databases don’t handle well - so while you might be able to use either a SQL or a NoSQL database in any given scenario, the choice between the 2 is normally a no-brainer; they would very rarely both be viable options.
Just to clarify, your questions about types of DB being better or worse are meaningless without context. Without knowing precisely what your requirements are, it’s impossible to say whether a NoSQL DB is better or worse than a SQL one - and that’s before you start looking at specific products in each category.
Also, the post you reference is about 8 years old and much of the information is out of date, as one of the contributors acknowledges in an update made in 2019.

how much work should we do in the database?

OK, I'm really confused as to exactly how much "work" should be done IN the database, and how much work has to be done at the application level instead.
I'm not talking about obvious stuff, like converting strings into SHA2 hashes at the application level instead of the database level.
Rather, I mean stuff that is more blurry, including, but not limited to: "should we retrieve the data for 4 columns and do the uppercasing/concatenation at the application level, or should we do that at the database level and send the calculated result to the application level?"
And if you could list any more other examples it would be great.
It really depends on what you need.
I like to do my business logic in the database; other people are religiously against that.
You can use triggers and stored procedures/functions in SQL.
Links for MySQL:
http://dev.mysql.com/doc/refman/5.5/en/triggers.html
http://www.mysqltutorial.org/introduction-to-sql-stored-procedures.aspx
http://dev.mysql.com/doc/refman/5.5/en/stored-routines.html
My reasons for doing business logic in triggers and stored procedures
Note that I'm not talking about bending the database structure towards the business logic, I'm talking about putting the business logic in triggers and stored procedures.
It centralizes your logic. The database is a central place and everything has to go through it. If you have multiple insert/update/delete points in your app (or you have multiple apps), you'll need to do the checks multiple times; if you do it in the database, you only have to do the checks in one place.
It simplifies the application. For example, you can just add a member, and the database will figure out whether the member is already known and take the appropriate action.
It hides the internals of your database from the application. If you do all your logic in the application, you will need intricate knowledge of your database in the application. If you use database code (triggers/procs) to hide that, you don't need to know every database detail in your app.
It makes it easier to restructure your database. If you have the logic in your database, you can just change a table layout, replace the old table with a blackhole table, put a trigger on that, and let the trigger do the updates to the new table. Your app does not even need to know the database has changed; this allows legacy apps to keep working unchanged, whilst new apps can use the improved database layout.
Some things are easier in SQL
Some things work faster in SQL
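The restructuring point in the list above (keep the old table name alive and let a trigger forward writes to the new layout) can be sketched as follows. This is an illustration only: it uses Python's sqlite3 module, and because SQLite has no BLACKHOLE engine it swaps in a view with an INSTEAD OF trigger, which plays the same role. All table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New layout: the name is split into two columns.
    CREATE TABLE member_v2 (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );

    -- The old table name lives on as a view exposing the old shape.
    CREATE VIEW member AS
        SELECT id, first_name || ' ' || last_name AS full_name
        FROM member_v2;

    -- Writes against the old name are redirected to the new table.
    -- Assumes full_name contains exactly one space.
    CREATE TRIGGER member_insert INSTEAD OF INSERT ON member
    BEGIN
        INSERT INTO member_v2 (first_name, last_name)
        VALUES (
            substr(NEW.full_name, 1, instr(NEW.full_name, ' ') - 1),
            substr(NEW.full_name, instr(NEW.full_name, ' ') + 1)
        );
    END;
""")

# A legacy application still writes and reads the old shape.
conn.execute("INSERT INTO member (full_name) VALUES ('Ada Lovelace')")
legacy_rows = conn.execute("SELECT id, full_name FROM member").fetchall()

# A new application uses the improved layout directly.
new_rows = conn.execute("SELECT id, first_name, last_name FROM member_v2").fetchall()

print(legacy_rows)  # prints [(1, 'Ada Lovelace')]
print(new_rows)     # prints [(1, 'Ada', 'Lovelace')]
```

The legacy code path never learns that `member` is no longer a real table, which is the property the answer is after.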
I don't like to use (lots of and/or complicated) SQL code in my application, I like to put SQL code in a stored procedure/function and try to only put simple queries in my application code, that way I can just write code that explains what I mean in my application and let the database layer do the heavy lifting.
Some people disagree strongly with this, but this approach works well for me and has simplified debugging and maintenance of my applications a lot.
Generally, it's good practice to expect only "data" from the database. It's up to the application(s) to apply business/domain logic and make sense of the data retrieved. It's highly recommended to do the following things in the application layer:
1) Formatting Date
2) Applying Math functions, such as interpolation/extrapolation, etc
3) Dynamic sorting (based on columns)
However, situations sometimes warrant doing a few things at the database level.
In my opinion, the application should use data and the database should provide it, and that should be a clear separation of concerns. So the database returns records sorted and filtered according to the requested conditions, but it is up to the application to apply business logic to those records and "convert" them into something meaningful to the user.
For example, at my previous company we worked on a big application for work-time calculations. One of the obvious features of this kind of application is tracking employees' vacation days: how many days an employee has per year, how many they have used, how many are left, and so on. We could have written some triggers and procedures to update those columns automatically, so that when an employee had vacation days approved, the number of days applied for would be taken from the "vacation pool" and added to "vacation days used". Pretty easy stuff, but we decided to make it explicit at the application level and, boy, very soon we were happy we did it that way. The application had to be labor-law compliant, and it quickly turned out that vacation days are not calculated equally for all employees, and that sometimes a vacation day is not really a vacation day at all, but that is beside the point. Had we put this "easy" operation in the database, we would have had to version our database with every little change to the vacation-days logic. That would have led us straight to hell in customer support, because as things stood it was possible to update only the application without updating the database (except at clear "breakthrough" moments when the database structure changed, of course).
In my experience, many applications start with a straightforward set of tables and then a handful of stored procedures to provide basic functionality. This works very well: it usually yields high performance, is simple to understand, and mitigates any need for a complex middle tier.
However, applications grow. It's not unusual to see large data-driven applications with thousands of stored procedures. Throw triggers into the mix and you have an application which, for anybody other than the original developers (if they're still working on it), is very difficult to maintain.
I will put a word in for applications which place most logic in the database - they can work well when you have some good database developers and/or you have a legacy schema which cannot be changed. The reason I say this is that ORMs take much of the pain out of this part of application development when you let them control the schema (if not, you often need to do a lot of fiddling to get it working).
If I was designing a new application then I would usually opt for a schema which is dictated by my application domain (the design of which will be in code). I would normally let an ORM handle the mapping between the objects and the database. I would treat stored procedures as exceptions to the rule when it came to data access (reporting can be much easier in sprocs than trying to coax an ORM into producing a complex output efficiently).
The most important thing to remember though, is that there are no "best practices" when it comes to design. It is up to you the developer to weigh up the pros and cons of each option in the context of your design.

MongoDB, MySQL or something third for my project?

Ok guys.
I've begun developing a little sparetime project that might become big someday. Before I really get started, I want to be certain that I'm starting with the right setup. So I come to you.
I'm making a service, which will work mostly as a todolist/project planner.
In this system there will be an amount of users and an amount of tasks. Each task can be assigned to multiple users, and each user can have multiple tasks (many to many relation).
Until now I was planning to use MySQL, but a friend of mine, who is part of the project, suggested MongoDB instead. He tells me that it would increase performance and be more scalable.
On the other hand, I'm thinking that in order to get either all tasks assigned to a specific user, or all users assigned to a specific task, one would need to use joins, which MongoDB doesn't have (or has in a cumbersome way, as far as I have understood).
Now my question to you is "Which DB system would you suggest. MySQL or MongoDB or a third option? And why?"
Thank you for your time and your assistance.
Morten
We use MySQL at IGN to store person relationships (many-to-many, like your use case), and have about 5M records in the relationship table. We have 4 MySQL servers in a cluster and the reads are distributed across 3 MySQL slaves. BTW, you can always denormalize to optimize reads at the cost of penalizing writes, depending on how read-heavy or write-heavy your system is.
We use the DAO pattern with Spring, so it's fairly easy for us to swap DB providers through configuration (and by writing a Mongo/MySQL DAO implementation as applicable). We moved activities (as in social media) to Mongo almost a year ago, but the person relationships are living happily in MySQL.
The comment to your post by Jonas says it all,
If need be, you can always scale later.
This.
I am very much of the mindset that if you don't have scaling problems, you shouldn't worry too much (if at all) about scaling problems. Why not use what is easiest, smartest and cleanest to deliver the features clients pay for (in my case at least!)? This approach saves a lot of time and energy and is the proper one for 9 projects out of 10.
Learning a technology because it scales is great. Being tied to an unfamiliar technology in an upcoming project just because it scales is not as great. There are many factors other than scalability when using third-party stuff.
MySQL would seem to be a good choice, being more mature and having loads of client libraries, ORMs and other time-saving technologies. MySQL can handle millions (billions if you have the RAM) of rows. I have yet to encounter a project it could not handle, and I have seen some pretty impressive datasets!
Of course, when you do need performance, you may find yourself ripping out ORM and SQL-generating code to replace it with your own hand-tweaked queries, but that day is way down the line and chances are it will never even come.
MongoDB, although it is really cool, may well, I am sorry to say, bring you issues that have nothing to do with scaling.
My 2 cents, happy coding!
MySQL
Either would likely work for your purposes, but your database seems relatively rigid in its structure, something which SQL deals well with. As such, I would recommend MySQL. A many-to-many relationship is relatively easy to implement and access, as well.
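For reference, the many-to-many relationship from the question (users and tasks) is usually modeled with a junction table. The sketch below uses Python's sqlite3 module instead of MySQL so it runs anywhere; the table names, column names and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Junction table: one row per (user, task) assignment.
    CREATE TABLE task_assignments (
        user_id INTEGER NOT NULL REFERENCES users(id),
        task_id INTEGER NOT NULL REFERENCES tasks(id),
        PRIMARY KEY (user_id, task_id)
    );

    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tasks VALUES (1, 'write spec'), (2, 'build prototype');
    INSERT INTO task_assignments VALUES (1, 1), (1, 2), (2, 2);
""")


def tasks_for_user(user_id):
    """All task titles assigned to one user."""
    rows = conn.execute(
        """SELECT t.title
           FROM tasks t
           JOIN task_assignments a ON a.task_id = t.id
           WHERE a.user_id = ?
           ORDER BY t.title""",
        (user_id,),
    )
    return [title for (title,) in rows]


def users_for_task(task_id):
    """All user names assigned to one task."""
    rows = conn.execute(
        """SELECT u.name
           FROM users u
           JOIN task_assignments a ON a.user_id = u.id
           WHERE a.task_id = ?
           ORDER BY u.name""",
        (task_id,),
    )
    return [name for (name,) in rows]


print(tasks_for_user(1))  # prints ['build prototype', 'write spec']
print(users_for_task(2))  # prints ['alice', 'bob']
```

Both directions of the relationship are a single join, which is why a relational database is a comfortable fit for this data model.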
You may take a tiny bit of a performance hit, but in my experience this is generally not very noticeable with smaller-scale applications (i.e. databases with fewer than millions of entries). I do agree with @Jonas Elfström's comment, however: you should have an abstraction layer between your application and the database, so that should scaling become an issue, you can address it without too many problems.
Stick with a relational database. It can handle many-to-many relationships and is fully featured for backup and recovery and high availability, and, importantly, you will find that every developer you need is familiar with it. There are plenty of documented methods for scaling a relational database.
Pick an open source database, either MySQL or Postgres, depending on which one your team is most familiar with and how it integrates into the rest of your infrastructure stack.
Make sure you design your data model correctly, most importantly the relationships between the entities.
Good luck!

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and is often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once our database reaches a significant size. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews on MongoDB and it looks interesting. I'm happy to spend the time to learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput needed to cripple a properly structured, well-resourced MySQL DB is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of DB is how efficiently data can be cached. An effective cache can dramatically reduce DB load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could better explain the implications. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business-friendly.
Since you prefer a non relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which one is your best database choice and you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... but will you handle so much data that it does not fit on one server, or have access rights that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amount of data, with proper indexes and sane database schema.
Why don't you test how MySQL behaves with a larger amount of data in your setup? Write some scripts that generate realistic data in a MySQL test database, generate some load on the system, and see if it is fast enough.
Only when it is not fast enough should you start considering optimizing the database, and only then changing to a different database engine.
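A data-generation and timing script of the kind suggested here could look roughly like this. It is a sketch, not a benchmark: it uses Python's sqlite3 in memory rather than MySQL, the table layout and row counts are made up, and real measurements should be taken against your actual server, schema and hardware.

```python
import random
import sqlite3
import time


def build_test_db(num_users=1_000, submissions_per_user=20, seed=0):
    """Fill a throwaway database with synthetic submissions."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
    )
    rows = (
        (user_id, "x" * rng.randint(20, 200))
        for user_id in range(num_users)
        for _ in range(submissions_per_user)
    )
    conn.executemany("INSERT INTO submissions (user_id, body) VALUES (?, ?)", rows)
    return conn


def time_query(conn, sql, params=(), repeats=100):
    """Average wall-clock seconds per execution of one query."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / repeats


conn = build_test_db()
sql = "SELECT COUNT(*) FROM submissions WHERE user_id = ?"

without_index = time_query(conn, sql, (500,))
conn.execute("CREATE INDEX idx_submissions_user ON submissions (user_id)")
with_index = time_query(conn, sql, (500,))

print(f"no index: {without_index:.6f}s  with index: {with_index:.6f}s")
```

Scaling `num_users` up until the numbers stop being acceptable gives a rough idea of where optimization (indexes first, engine changes much later) starts to matter.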
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code with, but performs badly with large amounts of data. For example, whether to use lazy or eager fetching with associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
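The "n + 1 selects" problem is easiest to see by counting queries. The sketch below does not use NHibernate; it imitates lazy and eager loading by hand with Python's sqlite3, and the tables, sample rows and helper names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO submissions VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")

queries = 0


def run(sql, params=()):
    """Execute a query and count it."""
    global queries
    queries += 1
    return conn.execute(sql, params).fetchall()


def load_lazy():
    """What naive lazy loading does: 1 query for users + 1 per user."""
    result = {}
    for user_id, name in run("SELECT id, name FROM users ORDER BY id"):
        rows = run(
            "SELECT body FROM submissions WHERE user_id = ? ORDER BY id", (user_id,)
        )
        result[name] = [body for (body,) in rows]
    return result


def load_joined():
    """What eager (join) fetching does: everything in one query."""
    result = {}
    rows = run(
        """SELECT u.name, s.body
           FROM users u
           LEFT JOIN submissions s ON s.user_id = u.id
           ORDER BY u.id, s.id"""
    )
    for name, body in rows:
        result.setdefault(name, [])
        if body is not None:
            result[name].append(body)
    return result


queries = 0
lazy = load_lazy()
lazy_queries = queries

queries = 0
joined = load_joined()
joined_queries = queries

print(lazy_queries, joined_queries)  # prints: 4 1
print(lazy == joined)                # prints: True
```

With 3 users the difference is 4 queries versus 1; with 10,000 users it is 10,001 versus 1, which is why the fetching strategy deserves attention before blaming the database.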
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables are, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

What are the essential dba skills a developer should learn?

Creation of objects like tables and indexes are fairly essential, even if the code has to be authorized or created by the dba. What other areas normally carried out by dbas should the accomplished developer be aware of?
A developer is responsible for doing everything that makes his code a) correct and b) fast.
This of course includes creating indexes and tables.
Making a DBA responsible for indexes is a bad idea. What if the code runs slowly? Who is to be blamed: a developer with bad code or a DBA with a bad index?
A DBA should carry out database support operations like making backups and building the infrastructure, and report any lack of resources.
He or she should not be the sole person making the decisions that affect the performance of the whole database system.
Relational databases are, as of now, not yet in a state that would allow splitting responsibility so that developers make the queries right and the DBA makes them fast. That's a myth.
If there is a lack of resources (say, an index makes some query fast at the expense of some DML operation being slow), this should be reported by a DBA, not fixed.
Now, it is a decision making time. What do we need more, fast query or a fast insert?
This decision should be made by the program manager (and not the DBA or developer).
And when the decision is made, the developer should be given the new task: "make the SELECT query as fast as possible, taking into account that you don't have this index". Or "make the INSERT query as fast as possible, taking into account that you will have this index".
A developer should know everything about how a database works when it works normally.
A DBA should know everything about how to make a database work normally.
The latter includes the ability to make a backup, the ability to restore from a backup, and the ability to detect and report resource contention.
The ins and outs of database storage and optimization are huge. Knowing how to index and partition tables well is invaluable knowledge.
Also, how to read a query execution plan. SQL is such a cool language in that it will tell you exactly how it's going to run your code, so long as you ask nicely. This is absolutely essential in optimizing your code and finding bottlenecks.
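As a small illustration of asking the database how it will run a query, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module. Other engines have their own equivalents with different syntax and output, and the `orders` table and index name here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)


def plan(sql):
    """Return the human-readable detail column of each plan step."""
    # Each EXPLAIN QUERY PLAN row has four columns; the last one is the description.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT total FROM orders WHERE customer_id = 42"

before = plan(query)  # a full table scan; exact wording varies by SQLite version
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # a search that names idx_orders_customer

print(before)
print(after)
```

Comparing the two plans shows the scan turning into an index search, which is the kind of before/after check that makes bottlenecks visible.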
Database maintenance (backups, shrinking files, etc) is always important to keep your server running smoothly. It's something that's often overlooked, too.
Developers should know all about triggers and stored procedures--getting the database to work for you. These things can help automate so many tasks, and often developers overlook them and try to handle it all app side, when they should really be handled by something that thinks in sets.
Which brings me to the most important point: database developers need to think in sets. Too often I hear, "For each row, I want to..." and this generally sets off an alarm in my head. You should be thinking about how the set interacts and the actions you want to take on entire columns.
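The difference between "for each row" and thinking in sets can be sketched like this, using Python's sqlite3 module. The `products` table, the prices (stored as integer cents) and the 10% discount are all made up for the example.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products (category, price) VALUES (?, ?)",
        [("book", 1000), ("book", 2500), ("toy", 400)],
    )
    return conn


def discount_row_by_row(conn):
    """'For each row, I want to...': one UPDATE statement per matching row."""
    rows = conn.execute(
        "SELECT id, price FROM products WHERE category = 'book'"
    ).fetchall()
    for product_id, price in rows:
        conn.execute(
            "UPDATE products SET price = ? WHERE id = ?",
            (price * 90 // 100, product_id),
        )


def discount_set_based(conn):
    """Thinking in sets: one statement describes the whole change."""
    conn.execute("UPDATE products SET price = price * 90 / 100 WHERE category = 'book'")


def prices(conn):
    return [p for (p,) in conn.execute("SELECT price FROM products ORDER BY id")]


loop_db, set_db = make_db(), make_db()
discount_row_by_row(loop_db)
discount_set_based(set_db)

print(prices(loop_db))  # prints [900, 2250, 400]
print(prices(set_db))   # prints [900, 2250, 400]
```

Both versions produce the same result, but the set-based one is a single statement the engine can optimize as a whole, while the loop issues one statement per row.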
Optimization. Your code should always use as few resources as possible.
I would recommend developing an understanding of the security architecture for the relevant DBMS.
Doing so could facilitate your development of secure code.
With SQL Server specifically in mind for example:
Understand why your “managed code” (such as .NET CLR) should not be granted elevated privileges. What would be the implications of doing so?
What is Cross-Database ownership chaining? How does it work?
Understand execution context.
How does native SQL Server encryption work?
How can you sign a stored procedure? Why would you even want to do this?
Etc.
As a general rule, the more you understand about the engine you are working with, the more performance you can squeeze from it.
One thing that currently springs to mind is how to navigate and understand the information that the database "system" tables/views give you, e.g. in SQL Server, the views under the master database. These views hold information such as current logins, lists of tables and partitions, etc., which is all useful when trying to track down things such as hung logins or whether users are currently connected.
Relationships of your tables. You should always have a recent printout and soft copy of your database. You need to know the primary keys, foreign keys, required and auto filled columns, without that I think you can't write efficient queries or make sure your database is carrying only what it needs.
I think everyone else covered it.
Having a good understanding of the architecture of your database system will definitely be helpful. Can you draw a diagram by heart to show components of your DBMS and their interactions?