Backend technology for high volume data for web application

I am developing an application that provides daily dynamic information such as prices and availability for around 50,000 objects. I need to store data for about the next 200 days, which means a total of about 10 million rows. The prices will be batch updated and new data will be added once daily: say about 10,000 existing rows get updated and 50,000 rows are inserted each day. What is the best backend technology to use?
Can MySQL scale with limited hardware, or is a NoSQL database the way to go? If so, which NoSQL database is best suited for fast fetching and updating of the data?
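For a sense of scale, the figures in the question can be checked with a few lines of arithmetic. This is only a rough sketch: the one-hour batch window is an assumption made for illustration, not something stated in the question.

```python
# Rough sizing from the figures in the question; all inputs are the asker's estimates.
OBJECTS = 50_000
DAYS_RETAINED = 200
DAILY_INSERTS = 50_000
DAILY_UPDATES = 10_000

# Assumed for illustration: the daily batch is spread over a one-hour window.
BATCH_WINDOW_SECONDS = 3600

total_rows = OBJECTS * DAYS_RETAINED
daily_writes = DAILY_INSERTS + DAILY_UPDATES
writes_per_second = daily_writes / BATCH_WINDOW_SECONDS

print(f"total rows: {total_rows:,}")
print(f"daily writes: {daily_writes:,}")
print(f"average writes/sec during the batch window: {writes_per_second:.1f}")
```

Even if the batch window were much shorter, the average write rate stays modest, which is worth keeping in mind when reading the answers below.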

I would recommend Cassandra, as you need to write more than you read, and Cassandra is optimized for high write throughput.
It provides scalability, no single point of failure, and high throughput. You can update records as well.
Cassandra also supports batch operations for DML (data manipulation language), i.e. insert, update and delete, and its batch operations provide atomicity as well.

This type of volume is well within the capabilities of a traditional RDBMS. I would say that if you are familiar with MySQL you will be safe to stick with it. A lot also depends on what kind of queries you want to run. With a properly structured, denormalized setup, you can run ad hoc queries in an RDBMS, whereas with document stores, you need to think quite carefully about structure up front -- embedding versus referencing (see: MongoDB relationships: embed or reference?). MongoDB has added a very nice aggregation framework, which goes a long way towards being able to query data as you would in an RDBMS, but in many other NoSQL systems, queries are essentially map-reduce jobs and joins are either painful or impossible.
It sounds like your data is structured around dates/days. One thing you can do that will yield dramatic speed improvements on queries is partitioning by date ranges. I have worked on dbs over 100m rows in MySQL where historical data had to be kept for auditing purposes but where most of the read/write was on current data, and partitioning led to truly dramatic read query improvements.
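To make the partitioning idea concrete, here is a minimal sketch that generates a MySQL `PARTITION BY RANGE` statement with one partition per month. The table name `daily_prices`, the column `price_date` and the helper function are hypothetical, and the sketch only builds the SQL text; it does not connect to a database. Note that MySQL requires the partitioning column to be part of every unique key on the table, including the primary key, so the schema has to be designed with that in mind.

```python
from datetime import date

def monthly_partition_ddl(table: str, column: str, start: date, months: int) -> str:
    """Build an ALTER TABLE statement that range-partitions `table` by month on `column`."""
    parts = []
    year, month = start.year, start.month
    for _ in range(months):
        # The upper bound of each partition is the first day of the following month.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        upper = date(next_year, next_month, 1)
        parts.append(
            f"  PARTITION p{year:04d}{month:02d} "
            f"VALUES LESS THAN (TO_DAYS('{upper.isoformat()}'))"
        )
        year, month = next_year, next_month
    # Catch-all partition for rows dated after the last explicit range.
    parts.append("  PARTITION pmax VALUES LESS THAN MAXVALUE")
    body = ",\n".join(parts)
    return f"ALTER TABLE {table}\nPARTITION BY RANGE (TO_DAYS({column})) (\n{body}\n);"

# Hypothetical table and column names, for illustration only.
ddl = monthly_partition_ddl("daily_prices", "price_date", date(2024, 1, 1), 3)
print(ddl)
```

With a layout like this, queries that filter on the date column only need to touch the partitions covering the requested range, and old data can be removed by dropping a whole partition rather than deleting rows.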
You might be interested in this link, which shows what some very high volume sites are using: What databases do the World Wide Web's biggest sites run on? Anecdotally, I know that Facebook had trillions of rows in MySQL across various clusters before they started hitting real bottlenecks, but it is no surprise that Cassandra ultimately came out of Facebook engineering, given the truly colossal data volumes they now handle.
Cassandra, Riak, CouchDB, MongoDB, etc. all arose to solve very real problems, but these come with tradeoffs, both in terms of the CAP theorem, and in terms of ad hoc queries being more difficult than in an RDBMS. Having said that, MongoDB and Cassandra (which I have most experience with) are easy to set up and fun to work with, so if you want to give them a go, I'm sure you will have no problems, but I would say your usage requirements are well within the capabilities of MySQL. Just my 2c.

Related

MongoDB vs Mysql Storage space compare

I am building a data warehouse in the range of 15+ TB. While storage is cheap, our limited budget means we have to squeeze as much data as possible into that space while maintaining performance and flexibility, since the data format changes quite frequently.
I tried Infobright (community edition) as a SQL solution and it works wonderfully in terms of storage and performance, but the limitations on data/table alteration make it almost a no-go, and Infobright's pricing for the enterprise version is quite steep.
After checking out MongoDB, it seems promising except for one thing. I was in a chat with a 10gen guy, and he stated that they don't really give much thought to storage space, since they flatten out the data to achieve performance and flexibility, and in their opinion storage is too cheap nowadays to be bothered with.
So can any experienced Mongo user out there comment on its storage space vs MySQL (as it is the standard we are comparing against right now)? If it's larger or smaller, can you give a rough ratio? I know it depends heavily on what sort of data you put in SQL and how you define the fields, indexing and such, but I am just trying to get a general idea.
Thanks for the help in advance!

MongoDB is not optimized for small disk space - as you've said, "disk is cheap".
From what I've seen and read, it's pretty difficult to estimate the required disk space due to:
Padding of documents to allow in-place updates
Attribute names are stored in each document, so you might save quite a bit by using abbreviations
No built in compression (at the moment)
...
IMHO the general approach is to build a prototype, insert data and see how much disk space your specific use case requires. The more realistic you can model your queries (inserts and updates) the better your result will be.
For more details see http://www.mongodb.org/display/DOCS/Excessive+Disk+Space as well.
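The point about attribute names can be illustrated with a rough measurement. The sketch below uses compact JSON as a stand-in for MongoDB's on-disk format (it is not BSON and ignores padding and indexes), and the field names are made up for the example, so treat the numbers as an indication of the effect rather than a prediction.

```python
import json

# Hypothetical record; the field names are invented for illustration.
long_doc = {"object_identifier": 12345, "price_date": "2024-01-01",
            "price_amount": 199.99, "availability_count": 7}
# The same values stored under abbreviated field names.
short_doc = {"oid": 12345, "d": "2024-01-01", "p": 199.99, "a": 7}

def encoded_size(doc: dict) -> int:
    """Size in bytes of the compact JSON encoding, used as a rough proxy for stored size."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))

long_size = encoded_size(long_doc)
short_size = encoded_size(short_doc)
saved_per_doc = long_size - short_size

print(f"long names: {long_size} bytes, short names: {short_size} bytes")
print(f"saved over 10 million documents: {saved_per_doc * 10_000_000 / 1e6:.0f} MB")
```

Because the names are repeated in every document, the saving grows linearly with the number of documents, which is why a prototype with realistic data is the most reliable way to estimate disk usage.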

Pros and Cons of MongoDB
For the most part, users seem to like MongoDB. Reviews on TrustRadius give the document-oriented database 8.3 out of 10 stars.
Some of the things that authenticated MongoDB users say they like about the database include its:
Scalability.
Readable queries.
NoSQL.
Change streams and graph queries.
A flexible schema for altering data elements.
Quick query times.
Schema-less data models.
Easy installation.
Users also have negative things to say about MongoDB. Some cons reported by authenticated users include:
User interface, which has a fairly steep learning curve.
Lack of joins, which can make some data retrieval projects difficult.
Occasional slowness in the cloud environment.
High memory consumption.
Poorly structured documentation.
Lack of built-in analytics.
Pros and Cons of MySQL
MySQL gets a slightly higher rating (8.6 out of 10 stars) on TrustRadius than MongoDB. Despite the higher rating, authenticated users still mention plenty of pros and cons of choosing MySQL.
Some of the positive features that users mention frequently include MySQL’s:
Portability that lets it connect to secondary databases easily.
Ability to store relational data.
Fast speed.
Excellent reliability.
Exceptional data security standards.
User-friendly interface that helps beginners complete projects.
Easy configuration and management.
Quick processing.
Of course, even people who enjoy using MySQL find features that they don’t like. Some of their complaints include:
Reliance on SQL, which creates a steeper learning curve for users who do not know the language.
Lack of support for full-text searches in InnoDB tables.
Occasional stability issues.
Dependence on add-on features.
Limitations on fine-tuning and common table expressions.
Difficulties with some complex data types.
MongoDB vs MySQL Performance
When comparing the performance of MongoDB and MySQL, you must consider how each database will affect your projects on a case-by-case basis. While some performance features may appear to be objectively promising, your team members may never use the features that drew you to a database in the first place.
MongoDB Performance
Many people claim that MongoDB outperforms MySQL because it allows them to create queries in multiple ways. To put it another way, MongoDB can be used without knowing SQL. While the flexibility improves MongoDB's performance for some organizations, SQL queries will suffice for others.
MongoDB is also praised for its ability to handle large amounts of unstructured data. Depending on the types of data you collect, this feature could be extremely useful.
MongoDB does not bind you to a single vendor, giving you the freedom to improve its performance. If a vendor fails to provide you with excellent customer service, look for another vendor.
MySQL Performance
MySQL performs extremely well for teams that want an open-source relational database that can store information in multiple tables. The performance that you get, however, depends on how well you configure the MySQL database. Configurations should differ depending on the intended use. An e-commerce site, for example, might need a different MySQL configuration than a team of research scientists.
No matter how you plan to use MySQL, the database’s performance gets a boost from full-text indexes, a high-speed transactional system, and memory caches that prevent you from losing crucial information or work.
If you don’t get the performance that you expect from MySQL data warehouses and databases, you can improve performance by integrating them with an excellent ETL tool that makes data storage and manipulation easier than ever.
MySQL vs MongoDB Speed
In most speed comparisons between MySQL and MongoDB, MongoDB is the clear winner. MongoDB is much faster than MySQL at accepting large amounts of unstructured data. When dealing with large projects, it's difficult to say how much faster MongoDB is than MySQL. The speed you get depends on a number of factors, including the bandwidth of your internet connection, the distance between your location and the database server, and how well you organise your data.
If all else is equal, MongoDB should be able to handle large data projects much faster than MySQL.
Choosing Between MySQL and MongoDB
Whether you choose MySQL or MongoDB probably depends on how you plan to use your database.
Choosing MySQL
For projects that require a strong relational database management system, such as storing data in a table format, MySQL is likely to be the better choice. MySQL is also a great choice for cases requiring data security and fault tolerance. MySQL is a good choice if you have high-quality data that you've been collecting for a long time.
Keep in mind that to use MySQL, your team members will need to know SQL. You'll need to provide training to get them up to speed if they don't already know the language.
Choosing MongoDB
When you want to use data clusters and search languages other than SQL, MongoDB may be a better option. Anyone who knows how to code in a modern language will be able to get started with MongoDB. MongoDB is also good at scaling quickly, allowing multiple teams to collaborate, and storing data in a variety of formats.
Because MongoDB does not use data tables to make browsing easy, some people may struggle to understand the information stored there. Users can grow accustomed to MongoDB's document-oriented storage system over time.

Is there a high performance difference in a Key-Value db on a single server with MySQL vs. NoSQL

In my PHP application I have a 470M-row table weighing 200GB in a partitioned MySQL MyISAM table on one server. Usage is 70% writes / 30% reads.
I'm trying to improve performance. The main problem currently is read/write contention due to table-level locks. I'm trying to decide between two options:
Changing the MySQL engine to InnoDB. Pros: avoids the table-level locks. Cons: much more disk space, so I would need bigger HDs which might not be as fast as the current ones (currently using RAID10 6*300GB SAS 15k).
Moving the data to a NoSQL DB. Main con: the learning curve. I have never used NoSQL before.
The question is: while still trying to avoid sharding the data, and considering that I'm using the RDBMS MySQL as a simple key-value store, is there a big performance difference between the two approaches, or does the main advantage of NoSQL only come when moving to a distributed system?

I can only answer your question partially but hopefully more than a comment.
MongoDB is not typically a key-value store and has been known to have certain performance hits when used as one.
MongoDB also has a locking problem here that could come back to haunt you. It has a DB-level lock at the moment, which means it could (this would need testing) cause write lock saturation.
It is also heavily designed for an 80% read app (which is said to be the most common setup for websites nowadays), so the more writes you do, the more you will notice a performance drop over time. That being said, you can tweak MongoDB to be more write friendly, and its distributed nature does help to reduce write lock saturation a little.
That being said, in my personal opinion the learning curve of MongoDB coming from SQL:
Was next to null
More natural and simpler to implement into my app than SQL
Query language is simple making it dead easy to get to grips with
Query language has a lot of similarities to SQL
The drivers are standardised so that the syntax you see in the Docs for the JS driver in the console is consistent across the board.
My personal opinion on the general matter is the distributed notion of it. If you get a NoSQL solution designed for key-value stores then it could be really good. A quick search on Google pulled out a small list of NoSQL key-value stores on Wikipedia: http://en.wikipedia.org/wiki/NoSQL#Key-value_stores_on_solid_state_or_rotating_disk

Is mongoDB or Cassandra better than MySQL for large datasets?

In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.
Concurrently, customer service operates on a replicated slave of the same database.
We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).
We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.
Given all this, what is our best option for a database?

I think you're going about the problem the wrong way.
The idea that dropping in NoSQL will give you better performance is not really true. At the lowest level, you're writing and retrieving a fair chunk of data. That implies your bottleneck is (most likely) HDD I/O (which is the common bottleneck).
Sticking with the hardware you currently have and using monolithic data storage isn't scalable and, as you noticed, has implications when you want to do something in real time.
What are your options? You need to scale your server and software setup (which is what you'd have to do with any NoSQL anyway: stick in faster hard drives at some point).
You also might want to look into alternative storage engines (other than MyISAM and InnoDB; for example, one of the better engines, which seemingly turns random I/O into sequential I/O, is TokuDB).
Implementing a faster HDD subsystem would also help (FusionIO if you have the resources to get it).
Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.

Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...
... so ...
... 120 million records is not that much, and MySQL should easily be able to handle that. My guess is an IO bottleneck, or you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing in a new solution.
If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.

As much as I'm not a fan of MySQL once your data gets large, I have to say that you're nowhere near needing to move to a NoSQL solution. 120M rows is not a big deal: the database I'm currently working with has ~600M in one table alone and we query it efficiently. Managing that much data from an ops perspective is the problem; querying it isn't.
It's all about proper indexes and the correct use of them when joining, and secondarily memory settings. Find your slow queries (MySQL slow query log FTW!), and learn to use the EXPLAIN keyword to understand why they are slow. Then tweak your indexes so your queries are efficient. Further, make sure you understand MySQL's memory settings. There are great pages in the docs explaining how they work, and they aren't that hard to understand.
If you've done both of those things and you're still having problems, make sure disk I/O isn't an issue. If it is, then you should look into another solution for querying your data.
NoSQL solutions like Cassandra have a lot of benefits. Cassandra is fantastic at writing data. Scaling your writes is very easy: just add more nodes! But the tradeoff is that it's harder to get the data back out. From a cost perspective, if you have expertise in MySQL, it's probably better to leverage that and scale your current solution until it hits a limit before completely switching your underlying architecture.
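The "find the slow query, look at its plan, add an index" loop described above can be tried out in miniature. The sketch below uses SQLite from Python's standard library as a stand-in, since it runs without a server; MySQL's `EXPLAIN` output looks different, but the workflow is the same. The `orders` table and index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

def query_plan(sql: str) -> str:
    """Return SQLite's plan description for a statement, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)

query = "SELECT * FROM orders WHERE customer_id = 42"

plan_before = query_plan(query)  # no index on customer_id yet, so the table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(query)   # the planner can now search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

In MySQL the equivalent check is to prefix the slow statement with `EXPLAIN` and look for full table scans on large tables, then add or adjust indexes until they disappear.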

how to do fast read data and write data in mysql?

Hi friends,
I am using a MySQL DB for one of my products. About 250 schools are signed up for it now, which means about 1,500,000 insertions per hour and about 12,000,000 insertions per day. I think my current setup, just a single server, may crash within hours, and the read load is about the same as the write load. How can I make the DB server crash-free? The main problem I am facing now is that both writing and reading data are slow. How can I overcome that? It is very difficult for me to find a solution. Please help: which model is a good one for this?

It is difficult to get both fast reads and writes simultaneously. To get fast reads you need to add indexes. To get fast writes you need to have few indexes. And to get both to be fast they must not lock each other.
Depending on your needs, one solution is to have two databases. Write new data to your live database and every so often when it is quiet you can synchronize the data to another database where you can perform queries. The disadvantage of this approach is that data you read will be a little old. This may or may not be a problem depending on what it is you need to do.

~500 inserts per second is nothing to sneeze at indeed.
For a flexible solution, you may want to implement some sort of sharding. Probably the easiest solution is to separate schools into groups upfront and store data for different groups of schools on different servers. E.g., data for schools 1-10 is stored on server A, schools 11-20 on server B, etc. This is almost infinitely scalable, assuming that there are few relationships between data from different schools.
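The grouping scheme from the example (schools 1-10 on server A, 11-20 on server B, and so on) can be sketched as a small routing function. The group size comes from the example above; the connection strings and host names are hypothetical placeholders.

```python
from string import ascii_uppercase

SCHOOLS_PER_SHARD = 10  # group size taken from the example in the answer

def shard_for_school(school_id: int) -> str:
    """Map a 1-based school id to a shard label: 1-10 -> 'A', 11-20 -> 'B', and so on."""
    if school_id < 1:
        raise ValueError("school ids are assumed to start at 1")
    index = (school_id - 1) // SCHOOLS_PER_SHARD
    if index >= len(ascii_uppercase):
        raise ValueError("more shards than labels in this toy scheme")
    return ascii_uppercase[index]

# Hypothetical connection strings, one per shard label.
SHARD_DSNS = {
    label: f"mysql://db-{label.lower()}.example.internal/schools"
    for label in ascii_uppercase
}

def dsn_for_school(school_id: int) -> str:
    """Pick the database server that holds all data for the given school."""
    return SHARD_DSNS[shard_for_school(school_id)]

print(shard_for_school(7), shard_for_school(11), shard_for_school(250))
```

A fixed range mapping like this is easy to reason about, but moving a school to another server later means changing the mapping, so a lookup table from school to shard may be more flexible in practice.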
Also you could just try throwing more horsepower at the problem and invest into a RAID of SSD drives and, assuming that you have enough processing power, you should be OK. Of course, if it's a huge database, the capacity of SSD drives may not be enough.
Finally, see if you can cut down on the number of insertions, for example by denormalizing the database. Say, instead of storing attendance for each student in a separate row put attendance of the entire class as a vector in a single row. Of course, such changes will heavily limit your querying capabilities.
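As a rough illustration of the attendance example, one class's attendance for a day can be packed into a single string value, one character per student, instead of one row per student. The encoding below is just one possible choice and the function names are hypothetical.

```python
def pack_attendance(present_flags: list[bool]) -> str:
    """Encode a class's attendance for one day as '1'/'0' characters, one per student."""
    return "".join("1" if present else "0" for present in present_flags)

def unpack_attendance(vector: str) -> list[bool]:
    """Decode the stored string back into a list of booleans."""
    return [ch == "1" for ch in vector]

def is_present(vector: str, seat: int) -> bool:
    """Look up one student by 0-based position in the class roster."""
    return vector[seat] == "1"

class_day = [True, True, False, True, False]
row_value = pack_attendance(class_day)  # this single value replaces five separate rows
print(row_value)
```

The cost is exactly the one mentioned above: a question like "which days was student X absent" can no longer be answered with a simple indexed query and has to be computed by decoding the vectors.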

My laid back advice is:
Build your application lightweight. Don't use a high-level database abstraction layer like Active Record. They suck at scaling.
Learn a lot about MySQL performance.
Learn about MySQL replication.
Learn about load balancing.
Learn about in-memory caches (memcached).
Hire an administrator (with decent MySQL knowledge) or a web app performance guru/consultant.
The concrete strategy depends on your application and how it is used. MySQL replication may or may not be appropriate (the same applies to the sharding strategy mentioned above). But it's a rather simple way to achieve some scaling, because it doesn't impact your application design too much. In-memory caches can keep some load away from your databases, but they take some work to apply and come with trade-offs. In the end you need a good overall understanding of how to handle a database-driven application under heavy load. If you have a tight deadline, add external manpower, because you won't get this right within 6 weeks without experience.
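To show what "keeping load away from the database" looks like in code, here is a minimal cache-aside sketch. It uses an in-process dictionary with expiry times as a stand-in for an external cache such as memcached; the class, the key name and the loader function are all hypothetical.

```python
import time
from typing import Any, Callable

class TTLCache:
    """Tiny in-process stand-in for an external cache such as memcached."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or call `loader` and cache its result."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # fresh entry: the database is not touched
        value = loader()         # miss or expired entry: go to the database
        self._store[key] = (now + self._ttl, value)
        return value

db_calls = []  # records each simulated database query

def load_report() -> str:
    """Hypothetical expensive database query."""
    db_calls.append("query")
    return "report data"

cache = TTLCache(ttl_seconds=60)
cache.get_or_load("report:today", load_report)
cache.get_or_load("report:today", load_report)
print("database calls:", len(db_calls))
```

The trade-off mentioned above shows up in the `ttl_seconds` value: a longer expiry saves more database work but serves staler data.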

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in python and java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real time data comes in streamed from various networked sensors, and ideally, I'd like to establish a long-poll approach rather than the current poll every 15 minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisfy this]
Group by
Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one to many]
Sphinx on this gives me a nice full text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware on it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NOSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something to the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well). Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
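For readers wondering what "a relational engine as a key value store" looks like in practice, here is a minimal sketch of the general idea: one table with a key column and an opaque serialized value. It uses SQLite from the standard library as a stand-in for MySQL, and the table, key format and function names are hypothetical; it is not a reproduction of the setup in the linked post. SQLite's `INSERT OR REPLACE` plays the role that `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE` would play in MySQL.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One opaque value per key; the database never looks inside `body`.
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, body TEXT NOT NULL)")

def put(key: str, value: dict) -> None:
    """Insert or overwrite the value stored under `key`."""
    conn.execute(
        "INSERT OR REPLACE INTO kv (k, body) VALUES (?, ?)",
        (key, json.dumps(value)),
    )

def get(key: str):
    """Return the stored dict, or None if the key is absent."""
    row = conn.execute("SELECT body FROM kv WHERE k = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None

# Hypothetical sensor reading, written twice under the same key.
put("sensor:17:latest", {"temp_c": 21.5, "ts": "2010-03-01T12:00:00"})
put("sensor:17:latest", {"temp_c": 22.0, "ts": "2010-03-01T12:05:00"})
print(get("sensor:17:latest"))
```

Every read and write is a primary-key lookup, which avoids the joins that are hurting performance, at the price of doing any filtering or aggregation on the stored values in application code.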

Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.

If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf

Django-cassandra is in early beta. Also, Django wasn't made for NoSQL databases. The Django ORM is fundamentally based on SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you either take the risk of using a NoSQL ORM (which is significantly slower than a traditional SQL ORM or direct use of the NoSQL store) or completely rewrite the Django ORM. But in that case I can't see why you would need Django. Maybe you can use something else, like Tornado?
