scalable persistence for startup [closed] - mysql

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
What database would you suggest for a startup that might possibly grow very fast?
To be more specific:
We are using JSON to exchange data with mobile clients, so ideally the data should be stored in this format
The data model is relatively simple: users, categories, history of actions...
The users interact in "real time" (5 second propagation delay is still OK)
The queries are known beforehand (can cache results or use mapreduce)
The system would have up to 10000 concurrent users (just guessing...)
Transactions are a plus, but I think we can live without them
Spatially enabled is a plus
The data replication between nodes should be easy to administer
Open source
Hosting services available (we'd like to outsource the sysadmin part)
We now have a functional private prototype built on standard relational PostgreSQL/PostGIS. But scalability questions apart, I have to convert relational data to JSON and vice versa, which seems like overhead under high load.
I did a little research but I lack experience with all the new NoSQL stuff.
So far, I think of these solutions:
Couchbase: master-master replication, a native JSON document store, a spatial extension, CouchApps; and although I don't know the IrisCouch hosting, they seem like good technologies.
The downsides I see so far are JavaScript debugging and disk usage.
MongoDB: has only one master, but safe failover. Uses binary JSON (BSON).
MySQL Cluster: the evergreen of the web (one master, I think).
PostgreSQL & Slony: because I just love Postgres :-)
But there are plenty of others: Cassandra, Membase...
Do you guys have some real life experience? The bad one counts too!
Thanks in advance,
Karel

Unless you are already having problems with scaling, you can't really have a good idea of what you will actually need in the future. You should be basing your design decisions on what you need now, not on your best estimate of future customers. Remember, you have to impress your first few customers with how well your product solves their problems before you can worry about impressing your 10,000th.
That said, I've found that it's almost always necessary to have basically everything:
A smart/powerful database for the important data and queries that are part of the current application. For this I see no better choice than PostgreSQL/PostGIS.
A document database (sometimes called NoSQL) to record, forever, everything that has passed through your system. Something that was an invalid or useless request a year ago may be exactly what your application can use now that the vendor has finally given you the API spec you need to parse it, and I hope you've kept it around in a form you can work with. At my current organization we are using CouchDB for this and it's proven to be a great choice so far.
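As a purely illustrative sketch of "record everything", assuming a local CouchDB reachable over its HTTP API and the Python requests library (database name and payload fields are made up):

    import json
    import requests  # assumes the requests library is installed

    COUCH = "http://localhost:5984"   # hypothetical local CouchDB instance
    DB = "raw_requests"               # illustrative archive database name

    def archive_request(payload: dict) -> str:
        """Store an incoming request verbatim; CouchDB assigns the _id."""
        # Create the database if it doesn't exist yet (412 means it already does).
        requests.put(f"{COUCH}/{DB}")
        resp = requests.post(
            f"{COUCH}/{DB}",
            data=json.dumps(payload),
            headers={"Content-Type": "application/json"},
        )
        resp.raise_for_status()
        return resp.json()["id"]

    # Example: archive whatever a mobile client sent, even if we can't parse it yet.
    doc_id = archive_request({"received_at": "2011-07-01T12:00:00Z",
                              "raw_body": '{"vendor_field_x": 42}'})
    print("archived as", doc_id)

The point is that the archive is schema-free: you don't need to know today what the data will be good for tomorrow.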
I have to convert relational data to JSON and vice versa which seems like an overhead in high load.
Not really; the expensive stuff is I/O and poorly written queries. The marshalling/unmarshalling is pure CPU, which is about the cheapest thing in the world to scale up. Don't worry about it.
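A quick way to convince yourself: time the marshalling step in isolation. The minimal sketch below fabricates rows in memory, so the only work measured is the CPU-bound JSON encoding (numbers will vary by machine):

    import json
    import time

    # Fake "rows" as they might come back from a relational query.
    rows = [{"id": i, "name": f"user{i}", "lat": 50.1, "lon": 14.4} for i in range(100_000)]

    start = time.perf_counter()
    payload = json.dumps(rows)          # marshalling: pure CPU, no I/O
    elapsed = time.perf_counter() - start

    print(f"serialized {len(rows)} rows ({len(payload)} bytes) in {elapsed:.3f}s")
    # On commodity hardware this is typically a fraction of a second,
    # which is cheap next to disk or network round-trips.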

Related

MySql vs NoSql - Social network comments and notifications data structure and implementation

I am really finding it tough to figure out how a social networking site (Facebook being a reference) manages comments and notifications for its users.
How would they actually store the comments data? Also, how would a notification be stored and sent to all the users concerned? An example scenario would be that a friend comments on my status and everyone who has liked my status, including me, gets a notification for that. Also, each user has their own read/unread functionality, so I guess there is a notification reference stored per user. But then there would be a lot of redundancy of notification information. If we use a separate table/collection to store these with a reference to the actual notification, that would create real-time scalability issues. So how would you decide which way to trade off? My brain crashes when I think about all this. Too much to figure out, with not a lot of help available on the web.
Now, how would each notification be sent to all the users who are supposed to receive it, and what would the data structure look like?
I have read about a lot of implementations that suggest using MySQL. My understanding was that, given the kind (and size) of data involved, it would be better to use NoSQL for scalability purposes.
So how does MySQL work well for such use cases, and why is a NoSQL store like Mongo not suggested anywhere for such an implementation, when these are meant to be heavily scalable?
Well, I know: a lot of questions in one. But I am not looking for a complete answer here; insights on particular things would also be a great help for me in building my own application.
The question is extremely broad, but I'll try to answer it to the best of my ability.
How would they actually store the comments data? Also, how would a notification be stored and sent to all the users concerned?
I generally don't like answering questions like this because it appears as if you did very little research before coming to SO. It also seems like you're confusing application and database roles. I'll at least start you off with some material/ideas and let you decide on your own.
There is no "silver bullet" for a backend design, especially when it comes to databases. SQL databases are generally very good at most database functionality, and rightfully so; it's a technology that is very mature and has stood the test of time for a reason. Most NoSQL solutions are specialized for particular purposes. For instance: if you were logging a lot of information, you might want to look at Cassandra. If you were dealing with a lot of relational data, you would want to use something like Neo4j (or PostgreSQL/MySQL for an RDBMS). If you were dealing with a lot of real-time data, you might want to look at Redis.
It's dumb to ask NoSQL vs. SQL for a few reasons:
NoSQL is a bad term in general, and it doesn't mean "No SQL"; it means "Not Only SQL". Unfortunately the term has come to cover even the most polar opposites among databases.
Only you know your application's full functionality. Even if I knew the basics of what you wanted to achieve, I still couldn't give you a definitive answer. Nor can anyone else. It's highly subjective, and again, only YOU know EXACTLY what your application should do.
The biggest reason: It's 2014. Why one database? Ten years ago "DatabaseX vs DatabaseY" would have been a practical question. Now, you can configure many application frameworks to reliably use multiple databases in a matter of minutes. Moral of the story: Use each database for its specialized purpose. More on polyglot persistence here.
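To make the polyglot idea concrete, here is a minimal sketch, assuming Python with the built-in sqlite3 module as the durable relational store and a local Redis instance (via the redis-py client) for the hot, real-time counters; all names are illustrative, not a prescription:

    import sqlite3
    import redis  # assumes a local Redis server and the redis-py client

    sql = sqlite3.connect("app.db")   # durable, relational: users, comments
    sql.execute("CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, author TEXT, body TEXT)")

    hot = redis.Redis()               # volatile, fast: unread counters, feeds

    def post_comment(author: str, body: str, notify: list[str]) -> None:
        # The system of record goes to the relational store...
        with sql:
            sql.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))
        # ...while the real-time fan-out lives in the key-value store.
        for user in notify:
            hot.incr(f"unread:{user}")

    post_comment("alice", "nice photo!", notify=["bob", "carol"])
    print(hot.get("unread:bob"))  # b'1'

Each store does the one job it is good at; neither is asked to be everything.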
As far as Facebook goes: a five minute Google search reveals what backend technologies they've used in the past, and it's not that difficult to research some of their current backend solutions. You're not Facebook. You don't need to prepare for a billion users right now. Start with simple, proven technologies. This will let you naturally scale your application. When those technologies start to become a bottleneck, then be worried about scalability.
I hope this helped you with starting your coding journey, but please use Stack Overflow as a last resort if you're having trouble with code. Not an immediate go-to.

NewSQL versus traditional optimization/sharding [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
We're a small startup with a write-heavy SAAS app and are (finally!) getting to the point where our usage is presenting scaling issues. We have a small team, so we really appreciate being able to offload sysadmin to Heroku and RDS.
While Heroku is (mostly) fine, we have a couple problems with RDS:
Scaling. This is the biggest concern. We currently run an XL RDS instance. We'll be able to get by for a while longer with straightforward optimizations, but unless we make some major structural changes to our app, we'll hit a bottleneck at some point.
Also, the downtime for changing instance size sucks.
Availability. We run a multi-AZ instance, so we should survive a single AZ outage. But RDS is built on EBS, which makes me pretty worried given EBS's history and design.
Price. Our RDS bill is 4x what we pay Heroku. I don't mind paying Amazon to save me from hiring a sysadmin, but I would love to find something less expensive.
In my view, we have two options moving forward: the traditional approach (sharding, running a nightly job to move parts of our database to read-only, etc.); or a NewSQL solution (Xeround, VoltDB, NimbusDB, etc).
Traditional pros: It's been done many times before and there are pretty standard ways to do it.
Traditional cons: It will take a lot of work and introduce significant complexity into the app (a rough sketch of what this looks like at the application layer follows this list). It also won't solve the secondary problems with RDS (availability and price).
NewSQL pros: Supposedly, these solutions will horizontally scale our database without changing application code (subject to a few restrictions on SQL functionality like not using pessimistic locking). This would save us a huge amount of work. It would also improve reliability (no single point of failure) and reduce costs (not having to run an XL instance during off-hours just to provide for peak usage).
NewSQL cons: These solutions are relatively young, and I haven't been able to find any good reviews or write-ups of people's experience with them in production apps. I've only found one available as a hosted solution (Xeround), so unless we went with that one, we'd have to invest resources in sysadmin.
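For a sense of what the "traditional" route means in application code, here is a minimal shard-routing sketch (the hostnames are made up; a real setup also has to deal with resharding, cross-shard queries, and data migration):

    import hashlib

    # Hypothetical connection targets; in practice these would be separate
    # RDS instances or self-hosted database servers.
    SHARDS = ["db-shard-0.example.com", "db-shard-1.example.com", "db-shard-2.example.com"]

    def shard_for(user_id: int) -> str:
        """Route all of a user's rows to one shard so their queries stay single-node."""
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for(42))   # every call for user 42 lands on the same shard

Even this toy version shows the cost: every query path now has to know about routing, and adding a shard means moving data around.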
I'm wondering what opinions are as to what my best option would be.
Xeround is awfully tempting (hosted NewSQL), but I haven't been able to find any good information about its use in production. The few tweets I've seen have been from people complaining about it being a bit slow. I'm pretty nervous about moving to something that seems so untested.
The conservative side of me says to stick with RDS and use a traditional approach. But it will be really expensive in terms of developer time.
And then part of me wonders if there's another way, maybe a more battle-tested hosted NewSQL solution I haven't heard of. Or maybe a NewSQL solution we'd have to host ourselves but that has a really solid history.
Thanks in advance for your thoughts.
Not sure if you have heard about NuoDB yet, but it is a brand new SQL solution that offers the scale-out capabilities of NoSQL and the SQL/ACID compliance of traditional OLTP. You should take a look at it.
At Jingit (www.jingit.com) we have battle-tested VoltDB. It is fantastic at scaling write-heavy apps in the AWS cloud. There is no hosted option, so our devs own it, and they spend < 1 hr a week administering our VoltDB cluster. We actually use both RDS and VoltDB: RDS for our traditional relational workload, and VoltDB for our HIGH VOLUME transaction processing. If you are developing in Java, VoltDB is a great fit, as you write all the procedures in Java.
I hear, too, that NuoDB is interesting. One thing I hear is that Rackspace is coming out with a cloud DBaaS sometime soon as well. I don't know what flavor they'll use, but you could see how Nuo works as a scalable solution with them. I think it'll run in conjunction with the OpenStack platform, which, when they open it up, could be more cost- and computationally efficient. Just something I've been eyeballing myself.

Empirically estimating a project duration [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 5 years ago.
What are common empirical formulas that can produce a rough estimate of project duration for the waterfall methodology (up to 20% fluctuation is acceptable)? If it helps in narrowing down the answer, you can assume that the following is more or less known:
Number of devs is known and fixed, most devs are above average in terms of know-how, however some learning about domain-specific issues might be required.
Known and fixed max. number of app users.
Technology stack to be used is reasonably diverse (up to 4 different languages and up to 6 various platforms).
Interfacing to up to three legacy systems is expected.
Please feel free to provide estimate methods which cover a broader scope than the above points, they are just provided for basic guidance.
Do yourself a favor and pick up Steve McConnell's Software Estimation: Demystifying the Black Art. If you have access to past estimates and actuals this can greatly aid in producing a useful estimate. Otherwise I recommend this book and identifying a strategy from it most applicable to your situation.
Only expect to utilize 70% of your developers' time. The other 30% will be spent in meetings, answering email, taking the elevator, etc. For example, if they work 8 hours a day, they will only be able to code for 5.6 to 6.5 hours a day. Reduce this number if they work in a noisy environment where people are using the telephone.
Add 20% to any estimate a developer gives the project manager.
Lines of code is useless as a metric in estimating a project.
Success or failure depends on concise requirements from the customer. If the requirements aren't complete, count on the customer not being happy with the finished product.
Count on the fact that not all of the requirements will be dictated by the customer. There will be revisions to the requirements throughout the project.
Step 1. Create a schedule that is as granular as is reasonably possible.
Step 2. Ask the people involved how long their features will take.
Step 3. Create an Excel spreadsheet which maps predictions to actual times.
Step 4. Repeat steps 1-3 for all new projects. Make use of an aggregated mapping from previous instances of step 3 to translate developer estimates to actual estimates.
Note that there are tools which can do this for you.
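A minimal sketch of what step 4's "aggregated mapping" can look like, assuming you have kept estimate/actual pairs from step 3 (the numbers below are made up):

    # Historical (estimate_hours, actual_hours) pairs gathered in step 3.
    history = [(8, 12), (16, 20), (4, 9), (40, 55), (2, 3)]

    # Aggregate correction factor: how much longer things actually take, on average.
    factor = sum(actual for _, actual in history) / sum(est for est, _ in history)

    def corrected(estimate_hours: float) -> float:
        """Translate a raw developer estimate into a calibrated one (step 4)."""
        return estimate_hours * factor

    print(f"correction factor: {factor:.2f}")
    print(f"a '10 hour' feature is more realistically {corrected(10):.1f} hours")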
See also
Evidence-based-scheduling.
This project is not going to be cheap...
Number of devs is known and fixed, most devs are above average in terms of know-how, however some learning about domain-specific issues might be required.
This is a good thing. You don't want to flood the project with developers. Though if you go above around 10 people, count every 2 as only 1, as the rest will be lost to overhead. Unless you can split the task into something that can be handled by two totally separate teams; then you could have a chance of getting some traction.
Known and fixed max. number of app users.
This means that you can with more certainty land your architecture early on, as you can estimate how much effort you must put into scaling your solution. This is a good thing. Make sure that you work within these limits and never ever fool yourself into thinking "it's fast enough". It almost never is if you doubt that it could be too slow...
Technology stack to be used is reasonably diverse (up to 4 different languages and up to 6 various platforms).
This isn't as important as whether your people know this stack/set of languages. If there is any learning involved, raise the estimate 2x or 3x unless you do a proof of concept up front to learn the technology. Or even better, take the pain and get some training. If the implementation language or technology to be used is unknown, then it is quite likely that you will misuse the technology and do things that will screw stuff up.
Make sure that the technology is proven or you'll end up getting bitten by it.
Are the source available for the tools/technology?
Do you get support?
Do you understand the product and or used it before?
Have the customer used it before?
If too many of these questions get a no, add some (or a lot of) additional time to the sum.
Interfacing to up to three legacy systems is expected.
This is really a kicker. For legacy integration ask yourself:
Has anyone else integrated with them?
Do you have access to people with knowledge of these systems?
Do they intend to share this knowledge with you?
Do you have to wait for changes being created in these systems?
Are there test systems available for you to use?
Are there development systems available for you to use?
Again, if too many of these questions have a "no" on them, then be afraid. You should also know that actual integration takes about 3-5 times longer than you think it will.
This isn't a project that I would have given an up-front, fixed estimate for. Do yourself and your customer a favor and do this by the hour. If not, as time goes by you will start cutting corners to cover up your lack of progress/underestimation... and both you and your customer will suffer.
There are many cost estimation software tools that can greatly ease the pain of cost estimation; we use ProjectCodeMeter. I know these tools are not perfect, but they do save time getting started by pointing you in the right direction.
Try this list of estimation tools on Wikipedia.

Re-write access adp / sqlserver to C# .net?

I am a non-technical person, and a small company has been supporting my company's software for a number of years. The solution works well, and permutations of it have been with the current IT service provider for over 15 years. I recently got a more established IT firm to do a general audit of the software. The current solution uses Access as a front end with SQL Server 2005 as the database.
The company who did the audit presented a list of faults, among them: the technology is outdated, the solution is not scalable, bad design, non-user-friendly interfaces, tables not normalised, tables with no referential integrity, no use of proper coding standards and naming conventions, no application security (only database security), etc. The firm who did the audit proposed that the solution must be re-written and offered to do so. The current service provider acknowledges some of the findings but assures me that they pose very little or no risk to my business.
To re-write the application will cost a lot of money. I am in a difficult situation and would appreciate some technical advice. I basically need to know if my business is at risk running on the current technology. I have a maximum of 70 concurrent users working on the system at a given time.
Well, if you value Joel's word, I would say that you are indeed risking a lot here.
Rewriting stuff never was, and never will be, a safe thing for a company to do.
To boil it down into simple terms, ask yourself these questions:
Are you having problems with the software currently? Are users complaining about the user interface, or is it particularly hard for new users to pick up the software when using it? Is data being lost or corrupted at any stage, or are you having problems retrieving reports from the database?
Do you currently need modifications, or are you likely to need them in the future? If your software is badly written, modifications will be more costly, and more likely to break the application and cause downtime in general.
If the answer to both questions is no, then you likely don't need to rewrite the software. You have to remember that good software developers see badly written software and want to re-write it properly - as well as this, there is money for them in developing the software, so their view isn't totally unbiased.
Like others have said, re-writing a system has its own share of risks - old bugs that were fixed a long time ago can rear their heads again, new bugs can be introduced, the developers of the new system can totally miss the point and make the system less usable than the previous system.
If there are problems with the current system though it may be worthwhile to consider having the system re-written by competent developers - if you opt to go this route however, make sure to get feedback from your current users, especially the 'expert' or 'power' users, to ensure that the system will fulfill all of their requirements.
Before you view your problem from the technical perspective, you must assess how critical the application is to your business. It sounds as though you have a functioning application. If it delivers consistent behavior AND you have no need for upgrades / new development, you may want to leave it alone. We software developers love to complain about everyone else's code and to re-write others' work with "elegant" solutions. That means money.
However, you have an investment that may need maintenance, and when the underlying code and database are in disarray, you will incur more cost because the application does not lend itself to being modified. You'll want to get a feel for how much change you need to support. Given that it has been in production for 15 years you've had a good run, so you don't have much risk there.
A re-write will cost you, because you need to recreate what the app does, and since the supporting database and program seem to be "de-normalized" and unstructured, it's going to be a big effort. There are advantages to having a clean database model, because it will be easier to do reports, export to Excel, etc., AND should you want to modify it, the developers will have an easier time figuring out what to do.
To spend money to get what you already have requires that you challenge the firm to detail what additional benefits you'll receive. Are these benefits beyond what you're getting today, and will this firm deliver on their promises? Will your company be better off if the database is "normalized" but you receive no other benefit than what the current app gives you? Keep these in mind before you make the jump to a new platform.
Fix the problems in the existing app. It will be much cheaper, can be done incrementally, and if done properly, will result in a more maintainable app.
The suggestion to replace the ADP front end sounds like pure prejudice/ignorance to me -- they don't sell Access development so they want to build you an entirely new app.
On the other hand, the advice about the back end sounds like something that you shouldn't wait to fix (though it could require a lot of work, since existing data likely won't fit proper RI).
The front end and back end problems are two separate issues, and can be handled independently (though the app may need to be updated to reflect changes in RI -- impossible to say without a case-by-case evaluation).
I would hire a competent Access developer with ADP experience to handle the project. This will be much cheaper than the complete rewrite and you won't sacrifice any functionality, nor will you re-introduce bugs that have already been addressed in your existing app. It can also likely be done incrementally (depending on what the problems are and how they need to be solved).
The suggestions offered by the "more established IT firm" are pretty common for Access/SQL Server projects. The suggestion is almost always to re-write them as web applications.
I just did this myself last year -- took an MS Access front-end/SQL Server back end application, and rewrote the access part as a C#/ASP.Net website. We enjoyed better performance and more flexibility as a result of the switch, but the old front end had been around long enough that we never did get back all of the functionality that we used to have before the rewrite.
If you're actually seeing 70 concurrent users, and none of them are experiencing performance issues, and your corporate network is secure enough, then you may lose more by rewriting the application, at least in terms of functionality. On the other hand, this may be a good chance to evaluate "what works" and "what could work better"--and enhance workflows.
Excellent use of coding standards doesn't necessarily translate to an excellent application.
What prompted the audit? Does their solution address this issue?
Let's do the math:
People: 70
Avg. Hrs Using software/Day: 2 (Conservative)
Salary/Hour: $8.00 (Really Conservative)
Business Days/Year: 250 (Took out weekends & vacation/sick)
Cost of labor using application: 70 * 2 * 8 * 250 = $280,000 / Year (Could go over 500K)
How much improvement can you get? 5%, 10%, 25%
How much will the new application cost? 50K, 100K, 200K
If you are able to save this time, will your users be freed up to do revenue-generating activities, or will they just have more time to surf the web? You may want to apply some worker-efficiency factor: 90%, 75%...
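Put into a small script, the same back-of-the-envelope math (with a hypothetical improvement, rewrite cost, and efficiency factor filled in) gives a payback estimate:

    users = 70
    hours_per_day = 2          # conservative
    hourly_cost = 8.00         # really conservative
    work_days = 250

    yearly_labor = users * hours_per_day * hourly_cost * work_days
    print(f"labor spent in the app per year: ${yearly_labor:,.0f}")   # $280,000

    # Hypothetical inputs: how much the rewrite improves efficiency, what it costs,
    # and how much of the freed-up time turns into productive work.
    improvement = 0.10         # 10% less time needed in the app
    rewrite_cost = 100_000
    worker_efficiency = 0.75   # share of saved time that becomes useful work

    yearly_saving = yearly_labor * improvement * worker_efficiency
    print(f"effective yearly saving: ${yearly_saving:,.0f}")
    print(f"payback period: {rewrite_cost / yearly_saving:.1f} years")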
Simple answer... Most of the "risks" of using Access are surmounted by using SQL Server as the backend. You already said your current solution works.
So it boils down to your future plans. If your existing application isn't missing any functionality that Access can't provide, I would just stick with what you have.
If you need new features I would consider a few things.
Are they something Access can't provide or do well (ex: Internet-facing Solutions)?
What is the potential benefit reaped by having the new features?
What is the potential cost incurred by not having the new features?
Can you put a dollar figure on 1 & 2?
How much to develop the solution in Access?
How much to develop the solution in C#
In other words, always do the CBA (cost-benefit analysis) :) Better yet, do your own CBA, then ask both companies to provide you with one, and compare for fun. In the worst case you might get your existing company to come down on their price to retain you as a client.

When evaluating a design, how do you evaluate complexity? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
We all know to keep it simple, right?
I've seen complexity being measured as the number of interactions between systems, and I guess that's a very good place to start. Aside from gut feel though, what other (preferably more objective) methods can be used to determine the level of complexity of a particular design or piece of software?
What are YOUR favorite rules or heuristics?
Here are mine:
1) How hard is it to explain to someone who understands the problem but hasn't thought about the solution? If I explain the problem to someone in the hall (who probably already understands the problem if they're in the hall) and can explain the solution, then it's not too complicated. If it takes over an hour, chances are good the solution's overengineered.
2) How deep into the nesting do you have to go? If I have an object which requires a property held by an object held by another object, then chances are good that what I'm trying to do is too far removed from the object itself (a small sketch after this list illustrates the idea). Those situations become problematic when trying to make objects thread-safe, because there'd be many objects at varying depths from your current position to lock.
3) Are you trying to solve problems that have already been solved before? Not every problem is new (and some would argue that none really are). Is there an existing pattern or group of patterns you can use? If you can't, why not? It's all good to make your own new solutions, and I'm all for it, but sometimes people have already answered the problem. I'm not going to rewrite STL (though I tried, at one point), because the solution already exists and it's a good one.
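A made-up illustration of that second heuristic, showing the caller reaching through a chain of objects versus asking the nearest object directly (class names are invented):

    class Engine:
        def __init__(self):
            self.temperature = 90

    class Car:
        def __init__(self):
            self.engine = Engine()

        def engine_temperature(self) -> int:
            # The object closest to the data answers the question itself.
            return self.engine.temperature

    class Garage:
        def __init__(self):
            self.car = Car()

    garage = Garage()

    # Too far removed: the caller must know the whole chain, and making this
    # thread-safe means locking objects at every depth along the way.
    print(garage.car.engine.temperature)

    # Flatter: one hop fewer, one place to guard.
    print(garage.car.engine_temperature())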
Complexity can be estimated from the coupling between your objects and how cohesive they are. If something has too much coupling or is not cohesive enough, then the design will start to become more complex.
When I attended the Complex Systems Modeling workshop at the New England Complex Systems Institute (http://necsi.org/), one of the measures that they used was the number of system states.
For example if you have two nodes, which interact, A and B, and each of these can be 0 or 1, your possible states are:
A B
0 0
1 0
0 1
1 1
Thus a system of only 1 interaction between binary components can actually result in 4 different states. The point being that the complexity of the system does not necessarily increase linearly as the number of interactions increases.
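A tiny sketch of that idea: enumerating the joint states of n binary components shows the exponential (2^n) growth the table above hints at.

    from itertools import product

    def states(n_components: int):
        """All joint states of n binary components: 2**n of them."""
        return list(product((0, 1), repeat=n_components))

    for n in (2, 3, 4):
        print(n, "components ->", len(states(n)), "states")
    # 2 components -> 4 states   (the A/B table above)
    # 3 components -> 8 states
    # 4 components -> 16 states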
Good measures can also be the number of files, the number of places where configuration is stored, and the order of compilation in some languages.
Examples:
- properties files, database configuration, XML files holding related information
- tens of thousands of classes with interfaces and database mappings
- an extremely long and complicated build file (build.xml, Makefile, others...)
If your app is built, you can measure it in terms of time (how long a particular task would take to execute) or computations (how much code is executed each time the task is run).
If you just have designs, then you can look at how many components of your design are needed to run a given task, or to run an average task. For example, if you use MVC as your design pattern, then you have at least 3 components touched for the majority of tasks, but depending on your implementation of the design, you may end up with dozens of components (a cache in addition to the 3 layers, for example).
Finally something LOC can actually help measure? :)
I think complexity is best seen as the number of things that need to interact.
A complex design would have n tiers whereas a simple design would have only two.
Complexity is needed to work around issues that simplicity cannot overcome, so it is not always going to be a problem.
There is a problem in defining complexity in general as complexity usually has a task associated with it.
Something may be complex to understand, but simple to look at (very terse code for example)
The number of interactions getting this web page from the server to your computer is very complex, but the abstraction of the http protocol is very simple.
So having a task in mind (e.g. maintenance) before selecting a measure may make it more useful. (i.e. adding a config file and logging to an app increases its objective complexity [yeah, only a little bit sure], but simplifies maintenance).
There are formal metrics. Read up on Cyclomatic Complexity, for example.
Edit.
Also, look at Function Points. They give you a non-gut-feel quantitative measurement of system complexity.
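For a rough feel of what cyclomatic complexity counts, here is a crude approximation using Python's ast module; the real metric is defined on the control-flow graph, and dedicated tools measure it properly.

    import ast

    def rough_cyclomatic_complexity(source: str) -> int:
        """Approximate McCabe complexity: 1 + one point per branching construct."""
        branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                        ast.BoolOp, ast.IfExp)
        tree = ast.parse(source)
        return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

    sample = """
    def classify(x):
        if x < 0:
            return "negative"
        for i in range(x):
            if i % 2 == 0 and i > 2:
                print(i)
        return "done"
    """
    print(rough_cyclomatic_complexity(sample))   # 5: two ifs, one for, one boolean op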