We're in the process of adding a bank-like sub-system to our own shop.
We already have customers, so each will be given a sort of account and transactions of some kind will be possible (adding to the account or subtracting from it).
So at the very least we need an account entity and a transaction entity, and operations will then have to recalculate overall balances.
How would you structure your database to handle this?
Is there any standard design that bank systems have to use that I could mimic?
By the way, we're on MySQL but will also look at some NoSQL solution for a performance boost.
I don't imagine you would need NoSQL for any speed boost, since you're unlikely to need much (or any) parallelism, and it's not clear how schema-free you might need to be. The exception is when you start getting into complex business requirements for analysis across many millions of customers and hundreds of millions of transactions, like profitability - and even then that's really a data-warehousing-style problem, which you probably wouldn't run on your transactional schema in the first place if it had gotten that large.
In relational designs, I would tend to avoid any design which requires balance recalculation, because then you end up with balance-repair programs and the like. With proper indexing and a simple enough design, you can do a simple SUM over the transactions (positive and negative) to get a balance. With a good, consistent sign convention on the transactions (no ambiguity about whether to add or subtract - always add the values) and appropriate constraints (with a limited number of transaction types, you can specify via constraints that all deposits are positive and all withdrawals are negative), you can let the database ensure there are no anomalies like negative deposits.
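To make the sign convention concrete, here's a minimal sketch in MySQL terms (table and column names are my own invention, not any standard):

```sql
-- Sign convention: every amount is simply added; constraints keep
-- deposits positive and withdrawals negative.
CREATE TABLE account_txn (
    txn_id     BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    account_id BIGINT UNSIGNED NOT NULL,
    txn_type   ENUM('DEPOSIT', 'WITHDRAWAL') NOT NULL,
    amount     DECIMAL(19,4) NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_account (account_id, created_at),
    -- CHECK is enforced from MySQL 8.0.16; earlier versions parse but ignore it.
    CONSTRAINT chk_sign CHECK (
        (txn_type = 'DEPOSIT' AND amount > 0) OR
        (txn_type = 'WITHDRAWAL' AND amount < 0)
    )
);

-- Balance is always a simple SUM; a point-in-time balance just adds a date filter.
SELECT COALESCE(SUM(amount), 0) AS balance
FROM account_txn
WHERE account_id = 42
  AND created_at <= '2024-01-31 23:59:59';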
Even if you want to cache the balance in some way, you could still rely on such a simple mechanism augmented with a trigger on the transaction table to update an account summary table.
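A sketch of that, building on the hypothetical table above - the summary table is pure cache, and the transaction table remains the source of truth:

```sql
CREATE TABLE account_summary (
    account_id BIGINT UNSIGNED PRIMARY KEY,
    balance    DECIMAL(19,4) NOT NULL DEFAULT 0
);

DELIMITER //
-- Keep the cached balance current on every new transaction.
CREATE TRIGGER trg_txn_after_insert
AFTER INSERT ON account_txn
FOR EACH ROW
BEGIN
    INSERT INTO account_summary (account_id, balance)
    VALUES (NEW.account_id, NEW.amount)
    ON DUPLICATE KEY UPDATE balance = balance + NEW.amount;
END//
DELIMITER ;
```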
I'm not a big fan of putting any of this in a middle layer outside of the database. Your basic accounting should be simple enough that it can be handled within the database engine at speed, so that anyone (or any part of the application) executing a query is going to get the same answer without any client-side logic getting involved. The database then ensures integrity at a level slightly above referential integrity (accounts with a non-zero balance might not be allowed to be closed, balances might not be allowed to go negative, etc.) using a combination of constraints, triggers and stored procedures, in increasing order of complexity as required. I'm not talking about all your business logic, just prohibiting low-level situations you feel the database should never get into due to bad client programming, a failure to do things in the right order, or calls with the wrong parameters.
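As one example of that kind of low-level guard, a sketch (again against the hypothetical tables above) of a trigger that refuses any transaction that would push a balance negative:

```sql
DELIMITER //
-- SIGNAL is available from MySQL 5.5; SELECTing from the subject table
-- inside a trigger is allowed (modifying it is not).
CREATE TRIGGER trg_txn_before_insert
BEFORE INSERT ON account_txn
FOR EACH ROW
BEGIN
    DECLARE cur_balance DECIMAL(19,4);

    SELECT COALESCE(SUM(amount), 0) INTO cur_balance
    FROM account_txn
    WHERE account_id = NEW.account_id;

    IF cur_balance + NEW.amount < 0 THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Transaction would make balance negative';
    END IF;
END//
DELIMITER ;
```

A serious implementation would also need to handle concurrent inserts on the same account (e.g. by locking the summary row), but that's beyond a sketch.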
In real banking (i.e. COBOL apps), the database schema is usually non-relational and non-normalized - a lot of these things predate SQL - and you see a lot of things like 12 monthly buckets of past balances which are updated and shifted when the account rolls over. Some of the databases these systems use are outright hierarchical. And this is where the code is really important, because everything gets done in code. Again, it's kind of old-fashioned and subject to all kinds of problems (probably a lot like what NatWest is going through), and NoSQL is a trend back towards this code-is-king way of looking at things. After a long time working with these things, I just tend to think: I don't like systems with cached balances, and I don't like systems without point-in-time accountability - i.e. the ability to ignore transactions after a certain date and see EXACTLY what things looked like at that date/time.
I'm sure someone has "standard" patterns of bank-like database design, but I'm not aware of them despite having built several accounting-like systems over the years - accounts and transactions are just not that complex and once you get beyond that concept, everything gets highly customized.
For instance, in some cases you might recognize earnings on contracts that are paid over time on some kind of schedule, according to GAAP. In banking you have a lot of interest-related things, with different interest rates for cost of funds, etc. Everything just gets unique once you start mixing business needs in with the basic accounting of money in and money out.
You don't say whether or not you have a middle tier in your app, between the UI and the database. If you do, you have a choice as to where you'll mark transactions and recalculate balances. If this database is wholly owned by the one application, you can move the calculations to the middle tier and just use the database for persistence.
Spring is a framework that has a nice annotation-based way to declare transactions. It's based on POJOs, as an alternative to EJBs. It's a three-legged stool of dependency injection, aspect-oriented programming, and great libraries. Perhaps it can help you with both structuring and implementing your app.
If you do have a middle tier, and it's written in an object-oriented language, I'd recommend having a look at Martin Fowler's "Analysis Patterns". It's been around for a long time, but the chapter on financial systems is as good today as it was when it was first written.
I'm trying to create a payment system for my website. The website is a marketplace for 3D printing blueprints. Users buy credits on my website. When a user purchases a 3D printing blueprint uploaded by another user, it creates a new tuple (row) in the 'purchased' table while deducting credits in the user credit table. Here's the important part: my gut tells me to use the event scheduler to mark rows of 'purchased' as paid every month and wire each seller the sum of money they earned. My worry is that the table will grow infinitely as the months pass.
Is this the right implementation?
Or can I somehow create a new table each month that holds transactions for only this month?
Is there a Nosql equivalent to this?
Stripe.com or Braintree.com might be good options for you.
It is not advisable to create or roll your own payments implementation. These established services not only handle the PCI compliance aspect of payments, but they also have direct support for the use case you're asking about.
In an effort to answer your question further - it's probably not going to be an issue from the standpoint of performing inserts into this MySQL table, or in terms of iterating across it for batch processing. Querying, on the other hand, will become more onerous as the data set gets very large.
You can use partitioning in MySQL and partition based on date, but I doubt this is something you should spend your time on at this point. Wait until your site blows up and is super popular, then come back and update your schema and configuration to meet your actual usage demands.
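For when that day comes, a hedged sketch of what date-based partitioning might look like - table and column names are made up for illustration:

```sql
-- MySQL requires the partitioning column to be part of every unique key,
-- hence the composite primary key.
CREATE TABLE purchased (
    purchase_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    buyer_id    BIGINT UNSIGNED NOT NULL,
    seller_id   BIGINT UNSIGNED NOT NULL,
    credits     INT NOT NULL,
    paid        TINYINT(1) NOT NULL DEFAULT 0,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (purchase_id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
```

This also covers the "new table each month" idea from the question: partitions give you the same effect without the application ever needing to know which physical chunk a row lives in.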
It's worth noting that you'll also want to make sure to take regular backups of something as important as payments information. Typically you'd also have at least one replica for something this critical.
Again I don't think you should try and solve this yourself. Just pay for a service that does this for you and focus on building the best 3d blueprint marketplace.
I've been asked to develop an application that will be rolled out to a number of business units. The application will be basically the same for each unit, but will have minor procedural differences, which won't change the structure of the underlying database. Should I use one database per business unit, or one big database for all the units? The business units are totally separate.
My preference is for one database per client. The advantages:
if a client gets too big, they're easy to move - backup, restore, change the connection string, boom. Try doing that when their data is mixed in with others in a massive database. Even if you use schemas and filegroups to segregate, moving them is not a cakewalk.
ditto for deleting a client's data when they move on.
by definition you're keeping each client's data separate. This is often going to be a want, and sometimes a need. Sometimes it will even be legally binding.
all of your code within a database is simpler - it doesn't have to include the client's schema (which can't be parameterized) and your tables don't have to be littered with an extra column indicating the client.
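To illustrate that last point, a small contrast with hypothetical tables:

```sql
-- Shared database: every query carries the client column.
SELECT order_id, total
FROM orders
WHERE client_id = 42   -- forget this predicate once and you leak another client's data
  AND status = 'OPEN';

-- One database per client: the connection string picks the database,
-- and the query itself stays clean.
SELECT order_id, total
FROM orders
WHERE status = 'OPEN';
```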
A lot of people will claim that managing 200 or 500 databases is a lot harder than managing 10 databases. It's not really any different, in my experience. You build scripts that automate things, you stagger index maintenance and backup jobs, etc.
The potential disadvantages are when you get up into the realm of 4-digit and higher databases per instance, where you want to start thinking about having multiple servers (the threshold really depends on the workload and the hardware, so I'm just picking a number). If you build the system right, adding a second server and putting new databases there should be quite simple. Again, the app should be aware of each client's connection string, and all you're doing by using different servers is changing the instance the connection string points to.
Some questions over on dba.SE you should look at. They're not all about SQL Server, but many of the concepts and challenges are universal:
https://dba.stackexchange.com/questions/16745/handling-growing-number-of-tenants-in-multi-tenant-database-architecture
https://dba.stackexchange.com/questions/5071/what-are-the-performance-implications-of-running-multiple-smaller-dbs-instead-of
https://dba.stackexchange.com/questions/7924/one-big-database-vs-several-smaller-ones
Your question is a design question. In order to answer it, you need to understand the requirements of the system that you want to build. From a technical perspective, SQL Server -- or really any database -- can handle either scenario.
Here are some things to think about.
The first question is how separate your clients need the data to be. Mixing data together from different business units may not be legal in some cases (say, the investment side of a bank and the market analysis side). In such situations, separate databases are the solution.
The next question is security. In some situations, clients might be very uncomfortable knowing that their data is intermixed with other clients' data. A small slip-up, and confidential information is inadvertently shared. This is probably not an issue for different business units in the same company.
Do you have to deal with different uptime requirements, upload requirements, customizations, and perhaps interaction with other tools? If one business unit will need customizations ASAP that other business units are not interested in, then that suggests different databases.
Another consideration is performance. Does this application use a lot of expensive resources? If so, being able to partition the application on different databases -- and potentially different servers -- may be highly desirable.
On the other hand, if much of the data is shared, and the repository is really a central repository with the same underlying functionality, then one database is a good choice.
So I want to know whether transaction-based web applications can use NoSQL databases instead of MySQL. Or is there a rule of thumb saying to use MySQL, taking standards compliance into account?
The first sentence of the OP (...transaction based web applications...) in some sense answers the question. You would need a NoSQL implementation that does support transactions so that you can guarantee atomic updates across multiple data entries. A response to a comment seems to indicate that MongoDB is the DB under consideration. That does not seem to be a good choice for transactions.
Without some kind of add-on providing ACID transaction support, some operations would be difficult. The obvious and overused example is debiting one account and crediting another. If that couldn't be done in a single transaction, then you could quite possibly be losing money (or creating money, if you did the credit first ;).
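For contrast, here's what that example looks like where ACID transactions are available, sketched in MySQL (illustrative schema; requires a transactional engine such as InnoDB):

```sql
-- Both updates commit together or not at all, so money is never
-- lost or created in between.
START TRANSACTION;

UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;

COMMIT;  -- any failure before this point lets you ROLLBACK both updates
```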
I don't know about "rules of thumb" for this question, but my suspicion is that you would find an easier time of it using a "traditional" database for a commerce-based system.
How much work should we do in the database?
OK, I'm really confused as to exactly how much "work" should be done IN the database, and how much work has to be done instead at the application level.
I mean, I'm not talking about obvious stuff like converting strings into SHA2 hashes at the application level instead of the database level.
But rather stuff that's more of a gray area, including, but not limited to: "should we retrieve the data for 4 columns and do an uppercase/concatenation at the application level, or should we do that at the database level and send the calculated result to the application?"
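For illustration, the two versions of that example (column names made up):

```sql
-- Database level: the server does the string work and ships one column.
SELECT CONCAT(UPPER(first_name), ' ', UPPER(last_name)) AS display_name
FROM customers;

-- Application level: ship the raw columns and do the same work in app code.
SELECT first_name, last_name
FROM customers;
```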
And if you could list any more other examples it would be great.
It really depends on what you need.
I like to do my business logic in the database; other people are religiously against that.
You can use triggers and stored procedures/functions in SQL.
Links for MySQL:
http://dev.mysql.com/doc/refman/5.5/en/triggers.html
http://www.mysqltutorial.org/introduction-to-sql-stored-procedures.aspx
http://dev.mysql.com/doc/refman/5.5/en/stored-routines.html
My reasons for doing business logic in triggers and stored procedures:
Note that I'm not talking about bending the database structure towards the business logic, I'm talking about putting the business logic in triggers and stored procedures.
It centralizes your logic: the database is a central place; everything has to go through it. If you have multiple insert/update/delete points in your app (or you have multiple apps) you'll need to do the checks multiple times; if you do it in the database you only have to do the checks in one place.
It simplifies the application: e.g., you can just add a member, and the database will figure out if the member is already known and take the appropriate action (see the sketch after this list).
It hides the internals of your database from the application, if you do all your logic in the application you will need intricate knowledge of your database in the application. If you use database code (triggers/procs) to hide that, you don't need to know every database detail in your app.
It makes it easier to restructure your database. If you have the logic in your database, you can just change a table layout, replace the old table with a BLACKHOLE table, put a trigger on that and let the trigger do the updates to the new table. Your app does not even need to know the database has changed; this allows legacy apps to keep working unchanged, whilst new apps can use the improved database layout.
Some things are easier in SQL
Some things work faster in SQL
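To illustrate the "just add a member" point above, a sketch of such a procedure (illustrative names; assumes a members table with a unique key on email):

```sql
DELIMITER //
-- The application calls one procedure; the database decides whether
-- that means an insert or an update.
CREATE PROCEDURE add_member(IN p_email VARCHAR(255), IN p_name VARCHAR(100))
BEGIN
    INSERT INTO members (email, name)
    VALUES (p_email, p_name)
    ON DUPLICATE KEY UPDATE name = p_name;
END//
DELIMITER ;

-- Application code stays trivial:
CALL add_member('jane@example.com', 'Jane');
```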
I don't like to use (lots of and/or complicated) SQL code in my application, I like to put SQL code in a stored procedure/function and try to only put simple queries in my application code, that way I can just write code that explains what I mean in my application and let the database layer do the heavy lifting.
Some people disagree strongly with this, but this approach works well for me and has simplified debugging and maintenance of my applications a lot.
Generally, it's good practice to expect only "data" from the database. It's up to the application(s) to apply business/domain logic and make sense of the data retrieved. It's highly recommended to do the following things in the application layer:
1) Formatting dates
2) Applying math functions, such as interpolation/extrapolation, etc.
3) Dynamic sorting (based on columns)
However, situations sometimes warrant a few things being done at the database level.
In my opinion the application should use data and the database should provide it, and that should be a clear separation of concerns. So the database gives records sorted, ordered and filtered according to the requested conditions, but it is up to the application to apply business logic to those records and "convert" them into something meaningful to the user.
For example, at my previous company we worked on a big application for work-time calculations. One of the obvious functionalities in this kind of application is tracking employees' vacation days - how many days an employee has per year, how many he has used, how many are left, etc. Basically we could have written some triggers and procedures to update those columns automatically, so that when an employee had his vacation days approved, the number of days he applied for would be taken from his "vacation pool" and added to "vacation days used". Pretty easy stuff, but we decided to make it explicit at the application level, and boy, very soon we were happy we did it that way. The application had to be labor-law compliant, and it quickly turned out that vacation days are not calculated equally for all employees, and sometimes a vacation day is not such a vacation day at all - but that is beside the point. Had we put this "easy" operation in the database, we would have had to version our database with every little change to the vacation-day logic, and that would have led us straight to hell in customer support, given that it was possible to update only the application without needing to update the database (except at clear "breaking" moments where the database structure changed, of course).
In my experience I've found that many applications start with a straightforward set of tables and then a handful of stored procedures to provide basic functionality. This works very well; it usually yields high performance and is simple to understand, and it also avoids any need for a complex middle tier.
However, applications grow. It's not unusual to see large data-driven applications with thousands of stored procedures. Throw triggers into the mix and you have an application which, for anybody other than the original developers (if they're still working on it), is very difficult to maintain.
I will put a word in for applications which place most logic in the database - they can work well when you have some good database developers and/or a legacy schema which cannot be changed. The reason I say this is that ORMs take much of the pain out of this part of application development only when you let them control the schema; if you can't (as with a legacy schema), you often need to do a lot of fiddling to get the mapping working, and pushing logic into the database becomes more attractive.
If I was designing a new application then I would usually opt for a schema which is dictated by my application domain (the design of which will be in code). I would normally let an ORM handle the mapping between the objects and the database. I would treat stored procedures as exceptions to the rule when it came to data access (reporting can be much easier in sprocs than trying to coax an ORM into producing a complex output efficiently).
The most important thing to remember though, is that there are no "best practices" when it comes to design. It is up to you the developer to weigh up the pros and cons of each option in the context of your design.
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionality is easy to implement. However, when it comes to this activity feed, things can easily turn into a mess for purely performance reasons.
I have come to the following dilemma here:
I can easily design the activity feed as a normalized part of the database, which will save me write cycles but will enormously increase the complexity of selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following; or, for each person I follow, select their photos).
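For illustration, the normalized read would be something like this (schema names made up):

```sql
-- One join, but it runs on every feed view and gets expensive at scale.
SELECT p.photo_id, p.url, p.uploaded_at
FROM photos p
JOIN follows f ON f.followee_id = p.uploader_id
WHERE f.follower_id = 42
  AND p.uploaded_at > NOW() - INTERVAL 7 DAY
ORDER BY p.uploaded_at DESC
LIMIT 20;
```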
An optimization option could be the introduction of a series of threshold constraints which, for instance, would allow me to order the people I follow by the date of their last upload - even excluding some - to save cycles, and for each user select only the last 5 (for example) uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people following me - i.e., lots of write cycles. With such a table, though, I could easily apply optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (a queue).
Yet a third approach that comes to mind is a less database-centric schema, where the server-side application takes some of the complexity off the DB. I've seen that some social apps, such as FriendFeed, rely heavily on storing serialized objects such as JSON in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure there are many things I've missed or have still to learn. I would highly appreciate it if someone could at least point me in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
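To make the denormalized option concrete, a minimal sketch of the "drop bucket" table (names are illustrative):

```sql
-- One row per (follower, photo), written at upload time,
-- read back with a single indexed scan.
CREATE TABLE feed_entries (
    follower_id BIGINT UNSIGNED NOT NULL,
    photo_id    BIGINT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (follower_id, created_at, photo_id)
);

-- Reading a feed is now trivial and fast:
SELECT photo_id
FROM feed_entries
WHERE follower_id = 42
ORDER BY created_at DESC
LIMIT 20;

-- Pruning old entries keeps the table bounded:
DELETE FROM feed_entries WHERE created_at < NOW() - INTERVAL 30 DAY;
```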
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard, and takes time better spent elsewhere. You should spend your time worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point where you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take:
Add more hardware - memory, CPU. Enter cloud hosting.
How's 24 GB of memory sound? Most of your frequently accessed DB information can fit entirely in memory.
Choose a host with expandable SSDs.
Use an events-based system in your application to write the "history" of all users. It would look like: id, user_id, event_name, date, event_parameters - an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>. Most important of all, this table will be in memory, so there's no longer any need to worry about write performance. After records go past, say, 3 days, they can be purged into another (non-memory) table and included in the query results if the user chooses to go back that far. By having all this in one table you avoid having to do multiple queries and SELECTs to build up this information.
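A hedged sketch of that setup (names illustrative; note that MySQL's MEMORY engine does not survive a server restart, so treat the hot table strictly as a cache in front of a durable one):

```sql
-- Hot table lives in RAM.
CREATE TABLE user_events_hot (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id    BIGINT UNSIGNED NOT NULL,
    event_name VARCHAR(64) NOT NULL,
    event_date DATETIME NOT NULL,
    event_parameters VARCHAR(255),
    INDEX idx_user (user_id, event_date)
) ENGINE=MEMORY;

-- Durable archive with the same shape.
CREATE TABLE user_events_archive LIKE user_events_hot;
ALTER TABLE user_events_archive ENGINE=InnoDB;

-- Periodic purge job: move rows older than three days to the archive.
INSERT INTO user_events_archive
SELECT * FROM user_events_hot
WHERE event_date < NOW() - INTERVAL 3 DAY;

DELETE FROM user_events_hot
WHERE event_date < NOW() - INTERVAL 3 DAY;
```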
Consider using InnoDB for the history/feeds table.
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no-locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they come back, as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you toward transactional reads.
These kinds of problems are why NoSQL solutions are used these days. What I did in my previous projects is really simple: I keep user->wall and user->history lists, containing purely feed ids, in memory stores (my favorite is Redis). So on every insert I do 1 insert operation on the database, and n insert operations in the memory store (that's the read optimization). I design the memory store to optimize my reads; if I want to filter a user's history (or wall) for videos, I push the feed id onto a list like user::{userid}::wall::videos.
Well, of course you could build the system purely on memory stores as well, but it's nice to have two systems, each doing what it does best.
Edit:
Check out these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions, and people keep suggesting them, yet no one ever mentions the drawbacks of such a choice.
The most obvious one for me is the lack of transactions - imagine losing a few records every now and then (there are reports of this happening often).
But what surprises me is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in a similar manner - by sharding data across a network (naturally there are more choices, but this is the most obvious one). Since NoSQL does less work (there's no SQL layer, so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit a ceiling too.
As Elad already pointed out, building an app that's scalable from the get-go is a painful process. It's better to spend your time focusing on making it popular, and then scale it out.