Large Data Sets - NoSQL, NewSQL, SQL..? Brain Fried

I'm in need of some advice. I'm working on a new start-up in the data mining field. This is basically the spin-off of a research project.
Anyway, we have a large amount of data that is unstructured, and we are doing various NLP, classification and clustering analysis on this data.
We have millions of messages ranging from Twitter messages, blog posts, forum posts, newspaper articles, reports etc etc... All text. All up we are talking about 300GB+ of text data, growing every day (about 10GB per day growth)!
So we need somewhere to store all of this information in a format that we can actually process and query, and get near real-time results.
Anyway, we need somewhere to store all of this data...
As this is a new start-up we really can't/don't want to pay for a licensed product, e.g. the Enterprise edition of VoltDB, Oracle, etc. is out of reach.
I was thinking this may be the perfect application for a non-relational "NoSQL" database such as Apache Cassandra or Hadoop/HBase (column family), MongoDB (document), VoltDB (community edition) or MySQL.
Currently all the data is in TSV text files and is processed as it's written to file. Needless to say it's painful, and it means the whole thing is stuck in the one process and we can't query it. It works, but it's way too limited for the richness of what we could be doing with this data set.
Anyway, I was hoping someone could share their experience using any of the above tools, or any recommendations for this use case (a large set of unstructured text data) for natural language processing, classification, clustering, frequency gathering, real-time analysis etc..?
My biggest fear is that MySQL won't be able to handle the sheer volume of data going forward. This thing will be in the terabyte range come the end of the year, so we are in part trying to get ahead of the curve and the growth by implementing a scalable solution that will allow us to easily query the data...
I'm thinking a non-relational/NoSQL column-family database like HBase is best; since we are adding new data sources all the time (crawlers, streaming APIs etc.) it will be much easier if we have an unstructured model.
Any help would be greatly appreciated! Hell there might even be a job in it :)
Cheers!

You need to think carefully about what types of queries you will need to run over these docs. Cassandra etc. may well be a good fit if your queries are basic, but richer SQL-like queries are not possible. The largest Cassandra deployments are of the order of 150TB, so your data volumes should not be a problem; but Cassandra may be overkill performance-wise, and you will sacrifice query richness.
If you just want text indexing, then also consider Lucene, as I think for batch indexing Lucene can now handle over 100 GB/hour, so overnight indexing of 1TB would be possible - and Lucene now claims comparable speeds for incremental indexing too...
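If you want to play with Lucene-style text indexing from Python before committing to infrastructure, here is a minimal sketch using Whoosh, a pure-Python, Lucene-like library, as a stand-in for Lucene itself; the index directory, field names and document contents are just placeholders.

```python
# Minimal full-text indexing sketch with Whoosh (a Lucene-style library).
import os
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser

# Define a tiny schema: a stored document id plus an indexed text body.
schema = Schema(doc_id=ID(stored=True), body=TEXT)
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

# Batch-index a couple of example documents.
writer = ix.writer()
writer.add_document(doc_id="tweet-1", body="cassandra vs mysql for large text corpora")
writer.add_document(doc_id="blog-42", body="clustering millions of unstructured documents")
writer.commit()

# Run a full-text query against the index.
with ix.searcher() as searcher:
    query = QueryParser("body", ix.schema).parse("unstructured documents")
    for hit in searcher.search(query, limit=10):
        print(hit["doc_id"])
```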

Check out RavenDB. It is a document DB supporting Map/Reduce, which is based on Lucene and can therefore also provide full-text search capabilities natively from the querying API.
Sharding and replication capabilities are built-in, and very advanced. Using Esent as storage, each node can store up to 16TB of data.

The database choice mainly depends on your use cases. I would suggest you go with Cassandra or HBase.
For real-time analysis on top of Cassandra you can use Apache Spark and Spark Streaming; they all work well together.
Also try Elasticsearch or Solr for text searching. All of these are open source and well worth trying.
For real-time analysis you can also have a look at Facebook's open-source Presto, but I didn't find much information beyond the Presto website, and most people suggest going with Cassandra plus Apache Spark.
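As a rough illustration of the Cassandra-plus-Spark suggestion, here is a minimal PySpark sketch. It assumes the spark-cassandra-connector package is on the Spark classpath and that a keyspace/table named analytics.messages with a "source" column already exists (all of those names are hypothetical).

```python
# Minimal sketch: read a Cassandra table into Spark and run a simple aggregation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-analytics")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Load the hypothetical analytics.messages table via the spark-cassandra-connector.
messages = (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace="analytics", table="messages")
            .load())

# Count messages per source (twitter, blog, forum, ...).
messages.groupBy("source").count().show()
```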

Related

Backend technology for high volume data for web application

I am developing an application to provide daily dynamic information like prices, availability, etc. for around 50,000 objects. I need to store data for about the next 200 days. That would mean a total of 10 million rows. The prices will be batch updated and new data will be added once daily. Say about 10,000 existing rows get updated and 50,000 rows are inserted daily. What is the best backend framework that I can use?
Can MySQL be scalable with limited hardware capability, or is a NoSQL database the way to go? If so, which NoSQL database would be best suited for fast fetching and updating of the data?
I would recommend you use Cassandra, as you need to write more than read, and Cassandra is optimized for high write throughput.
It provides scalability, no single point of failure and high throughput. And you can update records as well.
Cassandra also supports batch operations for DML (data manipulation language), i.e. write, update and delete. And Cassandra's batch operations provide atomicity as well.
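To illustrate the batch point, here is a minimal sketch with the DataStax Python driver; the keyspace, table and column names (and their assumed types: text, date, double, boolean) are hypothetical. A logged batch is what gives you the atomicity mentioned above.

```python
# Minimal sketch: atomic batch insert of a daily price feed into Cassandra.
from datetime import date
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("prices")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO daily_prices (object_id, day, price, available) VALUES (?, ?, ?, ?)")

# A handful of rows standing in for the real daily batch of ~50,000 inserts.
rows_to_write = [
    ("obj-1", date(2024, 5, 2), 19.99, True),
    ("obj-2", date(2024, 5, 2), 7.50, False),
]

# A logged batch is atomic: either all statements apply or none do.
batch = BatchStatement()
for object_id, day, price, available in rows_to_write:
    batch.add(insert, (object_id, day, price, available))
session.execute(batch)
```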
This type of volume is well within the capabilities/capacities of traditional RDBMS. I would say that if you are familiar with MySQL you will be safe to stick with it. A lot depends also, on what kind of queries you want to run. With a properly structured, denormalized setup, you can run ad hoc queries in an RDBMS, whereas with document stores, you need to think quite carefully about structure up front -- embedding versus referencing, see: MongoDB relationships: embed or reference?. MongoDB has added a very nice aggregation framework, which goes a long way towards being able to query data as you would in an RDBMS, but in many other NoSQL systems, queries are essentially map-reduce jobs and joins are either painful or impossible.
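As a rough illustration of the aggregation framework mentioned above, here is a minimal pymongo sketch that stands in for a SQL GROUP BY; the database, collection and field names are hypothetical.

```python
# Minimal sketch: a GROUP BY-style query via MongoDB's aggregation framework.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
prices = client.inventory.daily_prices  # hypothetical database/collection

# Roughly: SELECT object_id, AVG(price) FROM daily_prices
#          WHERE day >= '2024-01-01' GROUP BY object_id ORDER BY avg_price DESC
pipeline = [
    {"$match": {"day": {"$gte": "2024-01-01"}}},
    {"$group": {"_id": "$object_id", "avg_price": {"$avg": "$price"}}},
    {"$sort": {"avg_price": -1}},
]
for row in prices.aggregate(pipeline):
    print(row["_id"], row["avg_price"])
```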
It sounds like your data is structured around dates/days. One thing you can do that will yield dramatic speed improvements on queries is partitioning by date ranges. I have worked on dbs over 100m rows in MySQL where historical data had to be kept for auditing purposes but where most of the read/write was on current data, and partitioning led to truly dramatic read query improvements.
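Here is a minimal sketch of what date-range partitioning might look like for the daily-prices use case, using mysql-connector-python; the table, column and partition names are hypothetical.

```python
# Minimal sketch: a MySQL table RANGE-partitioned by date.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="prices")
cur = conn.cursor()

# The partitioning column (day) must be part of the primary key.
cur.execute("""
    CREATE TABLE daily_prices (
        object_id INT NOT NULL,
        day DATE NOT NULL,
        price DECIMAL(10,2),
        PRIMARY KEY (object_id, day)
    )
    PARTITION BY RANGE (TO_DAYS(day)) (
        PARTITION p2024q1 VALUES LESS THAN (TO_DAYS('2024-04-01')),
        PARTITION p2024q2 VALUES LESS THAN (TO_DAYS('2024-07-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    )
""")

# Queries constrained by day only touch the relevant partition (partition pruning).
cur.execute("SELECT object_id, price FROM daily_prices WHERE day = %s", ("2024-05-02",))
rows = cur.fetchall()
```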
You might be interested in this link, which shows what some very high volume sites are using: What databases do the World Wide Web's biggest sites run on? Anecdotally, I know that Facebook had trillions of rows in MySQL across various clusters before they started hitting real bottlenecks, but it is no surprise that Cassandra ultimately came out of Facebook engineering, given the truly colossal data volumes they now handle.
Cassandra, Riak, CouchDB, MongoDB, etc all arose to solve very real problems, but these come with tradeoffs, both in terms of the CAP theorem, and in terms of ad hoc queries being more difficult than in RDBMS. Having said that, MongoDB and Cassandra (which I have most experience with) are easy to set up and fun to work with, so if you want to give them a go, I'm sure you will have no problems, but I would say your usage requirements are well within the capabilities of MySQL. Just my 2c.

MongoDB vs MySQL storage space comparison

I am building a data warehouse that is in the range of 15+ TB. While storage is cheap, due to a limited budget we have to squeeze as much data as possible into that space while maintaining performance and flexibility, since the data format changes quite frequently.
I tried Infobright (community edition) as a SQL solution and it works wonderfully in terms of storage and performance, but the limitations on data/table alteration make it almost a no-go, and Infobright's pricing for the enterprise version is quite steep.
After checking out MongoDB, it seems promising except for one thing. I was in a chat with a 10gen guy, and he stated that they don't really give much thought to storage space, since they flatten out the data to achieve performance and flexibility, and in their opinion storage is too cheap nowadays to be bothered with.
So can any experienced Mongo user out there comment on its storage space vs MySQL (as that is the standard we are comparing against right now)? If it's larger or smaller, can you give a rough ratio? I know it's very dependent on what sort of data you put in SQL and how you define the fields, indexing and such... but I am just trying to get a general idea.
Thanks for the help in advance!
MongoDB is not optimized for small disk space - as you've said, "disk is cheap".
From what I've seen and read, it's pretty difficult to estimate the required disk space due to:
Padding of documents to allow in-place updates
Attribute names are stored in each document, so you might save quite a bit by using abbreviations
No built-in compression (at the moment)
...
IMHO the general approach is to build a prototype, insert data and see how much disk space your specific use case requires. The more realistic you can model your queries (inserts and updates) the better your result will be.
For more details see http://www.mongodb.org/display/DOCS/Excessive+Disk+Space as well.
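A minimal pymongo sketch of that prototype-and-measure approach, assuming a local MongoDB instance; the database, collection and (deliberately abbreviated) field names are hypothetical.

```python
# Minimal sketch: load a representative sample and read the collection's storage stats.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.warehouse
sample = db.sample_docs

# Insert a realistic slice of your data. Short field names save space,
# since attribute names are stored in every document.
sample.insert_many([
    {"t": "2024-05-02T10:00:00Z", "src": "twitter", "txt": "example message"},
    {"t": "2024-05-02T10:00:01Z", "src": "blog", "txt": "another example"},
])

stats = db.command("collstats", "sample_docs")
print("document data size:", stats["size"])
print("allocated storage:", stats["storageSize"])
print("total index size:", stats["totalIndexSize"])
```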
Pros and Cons of MongoDB
For the most part, users seem to like MongoDB. Reviews on TrustRadius give the document-oriented database 8.3 out of 10 stars.
Some of the things that authenticated MongoDB users say they like about the database include its:
Scalability.
Readable queries.
NoSQL.
Change streams and graph queries.
A flexible schema for altering data elements.
Quick query times.
Schema-less data models.
Easy installation.
Users also have negative things to say about MongoDB. Some cons reported by authenticated users include:
User interface, which has a fairly steep learning curve.
Lack of joins, which can make some data retrieval projects difficult.
Occasional slowness in the cloud environment.
High memory consumption.
Poorly structured documentation.
Lack of built-in analytics.
Pros and Cons of MySQL
MySQL gets a slightly higher rating (8.6 out of 10 stars) on TrustRadius than MongoDB. Despite the higher rating, authenticated users still mention plenty of pros and cons of choosing MySQL.
Some of the positive features that users mention frequently include MySQL’s:
Portability that lets it connect to secondary databases easily.
Ability to store relational data.
Fast speed.
Excellent reliability.
Exceptional data security standards.
User-friendly interface that helps beginners complete projects.
Easy configuration and management.
Quick processing.
Of course, even people who enjoy using MySQL find features that they don’t like. Some of their complaints include:
Reliance on SQL, which creates a steeper learning curve for users who do not know the language.
Lack of support for full-text searches in InnoDB tables.
Occasional stability issues.
Dependence on add-on features.
Limitations on fine-tuning and common table expressions.
Difficulties with some complex data types.
MongoDB vs MySQL Performance
When comparing the performance of MongoDB and MySQL, you must consider how each database will affect your projects on a case-by-case basis. While some performance features may appear to be objectively promising, your team members may never use the features that drew you to a database in the first place.
MongoDB Performance
Many people claim that MongoDB outperforms MySQL because it allows them to create queries in multiple ways. To put it another way, MongoDB can be used without knowing SQL. While the flexibility improves MongoDB's performance for some organizations, SQL queries will suffice for others.
MongoDB is also praised for its ability to handle large amounts of unstructured data. Depending on the types of data you collect, this feature could be extremely useful.
MongoDB does not bind you to a single vendor, giving you the freedom to improve its performance. If a vendor fails to provide you with excellent customer service, look for another vendor.
MySQL Performance
MySQL performs extremely well for teams that want an open-source relational database that can store information in multiple tables. The performance that you get, however, depends on how well you configure the MySQL database. Configurations should differ depending on the intended use. An e-commerce site, for example, might need a different MySQL configuration than a team of research scientists.
No matter how you plan to use MySQL, the database’s performance gets a boost from full-text indexes, a high-speed transactional system, and memory caches that prevent you from losing crucial information or work.
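As a rough illustration of the full-text indexing mentioned above, here is a minimal sketch using mysql-connector-python, assuming MySQL 5.6+ (where InnoDB supports FULLTEXT indexes); the table and column names are hypothetical.

```python
# Minimal sketch: a FULLTEXT index and a MATCH ... AGAINST query in MySQL.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="docs")
cur = conn.cursor()

# Create a table with a full-text index on the body column.
cur.execute("""
    CREATE TABLE articles (
        id INT AUTO_INCREMENT PRIMARY KEY,
        body TEXT,
        FULLTEXT KEY ft_body (body)
    ) ENGINE=InnoDB
""")

# Natural-language full-text search.
cur.execute(
    "SELECT id FROM articles WHERE MATCH(body) AGAINST (%s IN NATURAL LANGUAGE MODE)",
    ("replication tuning",))
print(cur.fetchall())
```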
If you don’t get the performance that you expect from MySQL data warehouses and databases, you can improve performance by integrating them with an excellent ETL tool that makes data storage and manipulation easier than ever.
MySQL vs MongoDB Speed
In most speed comparisons between MySQL and MongoDB, MongoDB is the clear winner. MongoDB is much faster than MySQL at accepting large amounts of unstructured data. When dealing with large projects, it's difficult to say how much faster MongoDB is than MySQL. The speed you get depends on a number of factors, including the bandwidth of your internet connection, the distance between your location and the database server, and how well you organise your data.
If all else is equal, MongoDB should be able to handle large data projects much faster than MySQL.
Choosing Between MySQL and MongoDB
Whether you choose MySQL or MongoDB probably depends on how you plan to use your database.
Choosing MySQL
For projects that require a strong relational database management system, such as storing data in a table format, MySQL is likely to be the better choice. MySQL is also a great choice for cases requiring data security and fault tolerance. MySQL is a good choice if you have high-quality data that you've been collecting for a long time.
Keep in mind that to use MySQL, your team members will need to know SQL. You'll need to provide training to get them up to speed if they don't already know the language.
Choosing MongoDB
When you want to use data clusters and search languages other than SQL, MongoDB may be a better option. Anyone who knows how to code in a modern language will be able to get started with MongoDB. MongoDB is also good at scaling quickly, allowing multiple teams to collaborate, and storing data in a variety of formats.
Because MongoDB does not use data tables to make browsing easy, some people may struggle to understand the information stored there. Users can grow accustomed to MongoDB's document-oriented storage system over time.

Storing and Analyzing Logs Database Selection

I am building an internal tool, which will be open-sourced, to take logs and put them into a database - to put it simply. From there, the tool will also analyze the logs and help alert the sys-admins and developers of issues going on, all in real-time. Processing this takes a lot of CPU, but that is beyond the scope of this question.
What I would like to know is what Database to choose that will allow and perform quickly a number of key tasks:
Store a large number of events categorized by event types
Perform a large number of reads to develop charts to analyze the events that are being logged
Read in real-time to send and trigger automated alerts to the system.
And any other help would be greatly appreciated, too. Code On.
In my observation, MongoDB performs an order of magnitude better than an RDBMS for the task you describe - massive storage of logs. Capped collections in particular are good performers. The major performance lag I've seen with an RDBMS was insert times. A huge disadvantage of an RDBMS is the schema, which is a major pain to upgrade if needed. Because of these reasons we have started to move towards MongoDB - check out logFaces. If you are building your own tool for the open source community, try to make sure it will work with ANY database, not just a particular brand. But then it becomes a not-so-trivial task :)
(for disclosure - I am the original author of logFaces, so the opinion could be biased)
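As a rough illustration of the capped-collection suggestion above, here is a minimal pymongo sketch; the database name, collection name and 1 GB size cap are hypothetical.

```python
# Minimal sketch: a capped collection for log events.
# Inserts stay fast and the oldest documents are discarded once the cap is reached.
from pymongo import MongoClient, DESCENDING

db = MongoClient()["logtool"]

if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=1024 * 1024 * 1024)  # ~1 GB cap

db.events.insert_one({"ts": "2024-05-02T10:00:00Z",
                      "type": "error",
                      "msg": "disk full on node-3"})

# Read the most recent events of one type, e.g. for real-time alerting.
latest = db.events.find({"type": "error"}).sort("$natural", DESCENDING).limit(20)
for event in latest:
    print(event["ts"], event["msg"])
```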
Storing just events sounds like a simple model, so you might want to take a look at NoSQL databases. I think key-value stores/bigtables will be better than document-based databases for really large amounts of data in this case.
A large number of reads and analysis, on the other hand, sounds like you might want to build a data warehouse system. This is the good old SQL approach, relaxing normalization for optimised reading. Though it can take some time to design and implement.

Cassandra or MySQL/PostgreSQL?

I have a huge database (kind of like WordNet) and want to know if it's easier to use Cassandra instead of MySQL or PostgreSQL.
All my life I have been using MySQL and PostgreSQL, and I can easily think in terms of relational algebra, but several weeks ago I learned about Cassandra and that it's used by Facebook and Twitter.
Is it more convenient?
What DBMSes are usually used nowadays to store social network data, relationships between objects, or WordNet-style data?
There is no such thing as a silver-bullet solution; everything is built to solve a specific problem and has its own pros and cons. It is up to you to decide what problem statement you have and what the best-fitting solution is. Whether you use Cassandra (NoSQL) or MySQL (RDBMS) should be driven entirely by your system's requirements. Below are some inputs that will help you make a better decision when choosing a database.
Why use NoSQL
In the case of RDBMS databases, making a choice is quite easy, because almost all the databases in this category, like MySQL, Oracle, MS SQL and PostgreSQL, offer almost the same kind of solution, oriented towards the ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers a different solution, and you have to understand which one is best suited for your app/system requirements. For example, MongoDB fits use cases where your system demands a schema-less document store. HBase might fit search engines, analysing log data, or any place where scanning huge, two-dimensional, join-less tables is a requirement. Redis is built to provide in-memory search over a variety of data structures like trees, queues, linked lists etc. and can be a good fit for real-time leaderboards or pub-sub kinds of systems. Similarly there are other databases in this category (including Cassandra) which fit different problems. Now let's move to the original questions and answer them one by one.
When to use Cassandra
Being part of the NoSQL family, Cassandra offers a solution for problems where your requirement is a very write-heavy system and you want a quite responsive reporting system on top of that stored data. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in real time. You can refer to this blog post (http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/) to understand more about the use cases where Cassandra fits in.
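As a rough illustration of the hits-by-hour example above, here is a minimal sketch using Cassandra counter columns via the DataStax Python driver; the keyspace, table and column names are hypothetical.

```python
# Minimal sketch: real-time hit counting with Cassandra counter columns.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")  # hypothetical keyspace

# A counter table: all non-key columns must be counters.
session.execute("""
    CREATE TABLE IF NOT EXISTS hits_by_hour (
        hour text,
        browser text,
        hits counter,
        PRIMARY KEY (hour, browser)
    )
""")

# Each incoming request increments a counter; writes stay cheap.
session.execute(
    "UPDATE hits_by_hour SET hits = hits + 1 WHERE hour = %s AND browser = %s",
    ("2024-05-02 10:00", "Firefox"))

# Reporting query for the real-time dashboard.
for row in session.execute("SELECT browser, hits FROM hits_by_hour WHERE hour = %s",
                           ("2024-05-02 10:00",)):
    print(row.browser, row.hits)
```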
When to use an RDBMS instead of Cassandra/NoSQL
Cassandra is a NoSQL database and does not provide ACID guarantees or relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a fit. Obviously you can make it work, but you will end up writing lots of application code to handle ACID properties and will lose badly on time to market. Also, managing that kind of system with Cassandra would be complex and tedious for you.
There are many different flavours of "NoSQL" databases. If your application is really like Wordnet perhaps you should look at a graph database such as Neo4j.
I would suggest analysing your requirements.
If you are going to run on many clusters/machines, take NoSQL.
If your data model is complicated and requires efficient structures, take NoSQL (no limits on column types).
If you fit on a few machines without scaling out, you don't need top performance for many concurrent requests (as, for example, in a social network where lots of users send HTTP requests), and you don't think you'll need that scalability, take an RDBMS (Postgres has some good functions and structures you can use, like the array column type; see the sketch after this list).
Cassandra should work better at large scales of data, and it is multi-purpose.
Neo4j would be better for special structures, i.e. graphs.
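As a rough illustration of the Postgres array column type mentioned in the list above, here is a minimal psycopg2 sketch (psycopg2 maps Python lists to SQL arrays); the table and column names are hypothetical.

```python
# Minimal sketch: a text[] array column in PostgreSQL for WordNet-style synonyms.
import psycopg2

conn = psycopg2.connect("dbname=wordnet user=app")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS synsets (
        id serial PRIMARY KEY,
        lemma text,
        synonyms text[]
    )
""")

# Python lists are adapted to SQL arrays automatically.
cur.execute("INSERT INTO synsets (lemma, synonyms) VALUES (%s, %s)",
            ("large", ["big", "huge", "vast"]))

# Array containment query: which synsets mention "huge"?
cur.execute("SELECT lemma FROM synsets WHERE %s = ANY(synonyms)", ("huge",))
print(cur.fetchall())
conn.commit()
```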
Cassandra and other NoSQL stores are being used for social sites because of their need for massive, write-heavy workloads. Not that MySQL and Postgres can't achieve this, but NoSQL requires far less time and money, generally speaking.
Sounds like you may want to look at Neo4J though, just in terms of your object model needs.
These are all different products, and they all have their pros and cons. What kind of problem do you have to solve?
Huge, as in TBs?

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, and is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data comes in, streamed from various networked sensors, and ideally I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisfy this]
Group by
Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
Sphinx on this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware on it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something of the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well). The tables are around 4-5 million rows each, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could utilize the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
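To make the MySQL-as-key-value-store idea concrete, here is a minimal sketch along the lines of the FriendFeed approach (one opaque blob per entity, with schema handled in application code); the table, column and helper function names are hypothetical.

```python
# Minimal sketch: MySQL used as a simple key/value store for sensor documents.
import json
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="sensors")
cur = conn.cursor()

# One row per entity: an id, an update timestamp, and an opaque serialized body.
cur.execute("""
    CREATE TABLE IF NOT EXISTS entities (
        id VARCHAR(64) PRIMARY KEY,
        updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
        body BLOB
    )
""")

def put(entity_id, doc):
    """Upsert a document as a JSON blob keyed by entity_id."""
    cur.execute("REPLACE INTO entities (id, body) VALUES (%s, %s)",
                (entity_id, json.dumps(doc).encode()))

def get(entity_id):
    """Fetch and deserialize a document, or return None if missing."""
    cur.execute("SELECT body FROM entities WHERE id = %s", (entity_id,))
    row = cur.fetchone()
    return json.loads(row[0]) if row else None

put("sensor-17:2024-05-02T10:00", {"temp": 21.4, "humidity": 0.43})
print(get("sensor-17:2024-05-02T10:00"))
conn.commit()
```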
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or other NoSQL databases) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.
If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf
Django-Cassandra support is in an early beta state. Also, Django wasn't made for NoSQL databases. The Django ORM is based on SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you'll have to take the risk of using a NoSQL ORM (it's significantly slower than a traditional SQL ORM or direct use of the NoSQL storage). Or you'll need to completely rewrite the Django ORM. But in that case I can't see why you need Django. Maybe you can use something else, like Tornado?