NoSQL/Document Store Searching - mysql

I have been researching NoSQL for a while, but I am still struggling to wrap my head around searching and filtering results/documents.
In a NoSQL world, how would I find all data between two timestamps, for example, if everything is stored on a key/value basis? Or find all documents within a radius of a latitude and longitude point?
Thanks

Different data structures support different use cases. For instance, a distributed hash-table would be a good choice if you can make do with the limited API of the dictionary/map interface.
If you need to ask more complex queries of your database, then pick a database that supports this use case efficiently. The database landscape is very varied, and there is probably a database out there for you.
For range queries, the BigTable clones (and upwards in expressive querying power) will probably be worth considering.
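To see why a sorted-key store makes range queries cheap, here is a minimal sketch in plain Python. It is not any particular database's API: `SortedKeyStore` and `row_key` are hypothetical names, and the toy store only imitates the idea of keeping rows ordered by key so that a time range becomes one contiguous scan.

```python
import bisect
from datetime import datetime, timezone


def row_key(ts: datetime, event_id: str) -> str:
    # Fixed-width UTC timestamp first, so lexicographic order equals time order.
    return ts.strftime("%Y%m%dT%H%M%S") + "#" + event_id


class SortedKeyStore:
    """Toy stand-in for a store that keeps rows sorted by key (hypothetical)."""

    def __init__(self):
        self._keys = []    # row keys, kept sorted
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        # Binary-search both ends, then read the contiguous slice [start, stop).
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedKeyStore()
for hour, name in [(9, "a"), (11, "b"), (13, "c"), (15, "d")]:
    ts = datetime(2024, 1, 1, hour, tzinfo=timezone.utc)
    store.put(row_key(ts, name), {"event": name})

# "Everything between 10:00 and 14:00" is a single scan over adjacent keys.
start = row_key(datetime(2024, 1, 1, 10, tzinfo=timezone.utc), "")
stop = row_key(datetime(2024, 1, 1, 14, tzinfo=timezone.utc), "")
between = [value["event"] for _, value in store.scan(start, stop)]
print(between)
```

The design choice that matters is the key layout: putting the timestamp at the front of the key is what turns "between two timestamps" into a range scan instead of a lookup per key.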
Even if the database with the right data structure isn't fast enough or can't scale enough, you can still pull tricks like sharding, replication and clever use of caching or search indexes.
How you weigh the constraints of consistency, availability, resilience, throughput and latency depends on the specific problem, and you may find that you need more than one type of database to implement the optimal solution.
Don't artificially make things harder for yourself; premature optimization, you know.
I work at a MySQL shop and it works fast enough for our hundreds of thousands of transactions a day (millions of queries), and reliably enough for when our software is used in live TV events.
Sorry that I can't give you a more concrete answer than this.

For example, MongoDB supports range queries, and you can create geo indexes on latitude/longitude data. There are large differences between the NoSQL databases.
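As a rough sketch of what those two MongoDB queries could look like, the helpers below only build the filter documents, so they run without a server. The field names (`created_at`, `location`) and the helper names are assumptions for illustration; the operators (`$gte`, `$lt`, `$nearSphere`, `$geometry`, `$maxDistance`) are MongoDB's own.

```python
from datetime import datetime, timezone


def time_range_filter(field, start, end):
    # Half-open range [start, end) using MongoDB's $gte / $lt operators.
    return {field: {"$gte": start, "$lt": end}}


def within_radius_filter(field, lng, lat, radius_m):
    # $nearSphere with a GeoJSON point; GeoJSON order is [longitude, latitude].
    # The collection needs a 2dsphere index on `field` for this to work.
    return {
        field: {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lng, lat]},
                "$maxDistance": radius_m,  # metres
            }
        }
    }


# Hypothetical field names; adjust to your own documents.
by_time = time_range_filter(
    "created_at",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
nearby = within_radius_filter("location", -0.1276, 51.5072, 5000)

# With pymongo and a running server, these would be used roughly as:
#   collection.create_index([("location", "2dsphere")])
#   collection.find(by_time)
#   collection.find(nearby)
print(by_time)
print(nearby)
```

An ordinary index on the timestamp field would let the range filter avoid scanning the whole collection.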

Related

How do You Organize Big Data in your Database?

I have a database with a large amount of data in it, and I am now thinking about how to organize it to be more scalable.
My main considerations are:
Security
Performance
Cost
General answers are welcome, because I have not yet anticipated every problem or risk that might come up; any suggestions would help.
To give a full answer to your question we will need more information on how big the data is, how complex it is, and what your use cases are (i.e., do you do many joins across multiple tables, or do queries mostly hit a single table?). In any case, here are some pointers to help you get on your way.
If you are expecting your data to grow rapidly, I would recommend that you look at a cloud-based database solution rather than invest in physical hardware that would need replacing every so often. Cloud-based solutions give you more freedom to scale your database both vertically and horizontally. There are specialized cloud database technologies, such as Amazon Redshift and the recently introduced Aurora, which can be reconfigured easily as your requirements grow.
For performance improvements within the database, you can always look at indexes and structural changes. Use the EXPLAIN statement in MySQL to analyze your queries and see whether they use temporary tables or full table scans, both of which slow things down. Adding indexes to the columns that you use for filtering or joining can improve performance drastically.
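The effect of an index on a query plan can be tried out locally. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3` module rather than MySQL's `EXPLAIN`, purely so it runs without a server. The output format differs from MySQL's, but the idea of comparing the plan before and after adding an index is the same. The table and index names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

query = "SELECT total FROM orders WHERE customer_id = ?"


def plan(sql, params):
    # Each plan row is (id, parent, notused, detail); keep the readable detail text.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


before = plan(query, (42,))  # no index yet: expect a full scan of the table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query, (42,))   # the planner can now search via the new index

print("before:", before)
print("after: ", after)
```

In MySQL the equivalent check is to prefix the query with `EXPLAIN` and look at whether a key is used and how many rows are examined.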
In data warehouses, you can also denormalize and pre-join tables to improve performance. Although this will drastically increase your storage use, working with a single table improves performance because the cost of doing the same join over and over again is taken out of the equation.
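A minimal sketch of pre-joining, again using SQLite via Python's `sqlite3` so it is self-contained. The `customers`, `orders` and `orders_wide` tables are hypothetical; a real warehouse would rebuild or incrementally refresh the wide table on a schedule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'north'), (2, 'south');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);

-- Pre-join once; region is now copied onto every order row (extra storage).
CREATE TABLE orders_wide AS
SELECT o.id AS order_id, o.total, c.id AS customer_id, c.region
FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Reporting queries now read one table and never repeat the join.
totals = dict(
    conn.execute(
        "SELECT region, SUM(total) FROM orders_wide GROUP BY region"
    ).fetchall()
)
print(totals)
```

The trade-off is the one described above: `region` is stored once per order instead of once per customer, and the wide table has to be refreshed when the source tables change.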
If you are looking at massive datasets that will grow in structure and complexity, there are non-relational technologies such as Hadoop, Cassandra, etc. Moving to these environments may require you to rewrite most of your application, but it is something you should consider before the data has grown so big that you are forced into it.
EDIT
Privacy and data security also matter, as pointed out by @Saïd Tahali in the comments. If you can't host your data externally for legal or security reasons, you will need to invest in your own hardware and address all of the above in-house.

Limits to move from Sql to NoSql Database

We are facing performance issues in our current MySQL DB. Our application is pretty heavy on a few tables (~20). We run a lot of aggregation queries on these tables, as well as writes. Most of our team are developers, and we don't have access to a DBA who might help retune our current DB and make things faster.
Moving to NoSQL is an option, but we are seriously wondering what the upper limits are in terms of:
Volume (currently ~50 GB per day)
Structured or raw data? (structured data)
IO stats on the DB (current rate is 60 KB/sec)
Record writes (currently 3000 rows/sec)
Questions that arise:
Is 50 GB high enough to consider NoSQL? Some documentation recommends more than a TB.
The data should be raw data, which can then be processed further into structured form for use in the application.
MySQL tops out at 3000 rows/sec; we are not sure whether MySQL can be tuned further.
HBase seems promising for analytic applications.
We would like some guidelines on the RDBMS limits at which one should think of moving to NoSQL.
This is such a broad topic that I don't believe there are any "right" answers, but maybe a few general recommendations would help:
I think you should think of this challenge in terms of picking the right tool for the problem. All databases have their pros and cons and in some challenges the best approach is to use an entire toolbox to get the job done.
Note that moving your data, or even just parts of it, to different datastores is rarely a trivial effort. Use this chance to rethink your data model before implementing it.
Getting this job done should also take further requirements into account, such as your growth plans. It looks like you're at this crossroads because your original assumptions, and the choices that followed from them, are no longer on par with reality. If you want to delay the next time you end up in the same place, use this opportunity to do so.
Lastly, keep in mind that the job is really done only after you do something with all that captured data, or else I'd recommend you use the infinitely-scalable write-to-/dev/null design pattern ;) Put differently, unless your data is write-only, you'd want to make sure that whatever SQL/NoSQL/NewSQL/other datastore you choose can also get you the data/information/knowledge within your use case's acceptable time frames.
It will probably be worth it given your current infrastructure, but keep in mind that it's going to be a huge task, since you're going to need to redesign the whole process. HBase can help you, as it has some neat features, like realtime counters (which in some cases eliminate the need for periodic rollups) and per-client buffering (which can allow you to scale to >100k writes per second). But be warned: it cannot be queried the way you query a relational database, so you're going to need to plan carefully to make it work for you.
It seems that your main issue is with the raw data writes. You can definitely rely on HBase for that, and then do the rollups every X minutes to store the data in your RDBMS so it can be queried as usual. But given that you're doing them every minute, which is a very short gap, why don't you keep the data in memory and flush it to the rolled-up tables every minute? Sure, you could lose data, but I don't know how critical losing one minute of data is for you, and that alone could help you a lot.
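The "keep it in memory and flush every minute" idea can be sketched roughly as below. This is only an illustration under assumptions: `MinuteRollup` and `flush_fn` are hypothetical names, `flush_fn` stands in for whatever batch insert writes to your rolled-up tables, and events are assumed to arrive roughly in time order.

```python
from collections import Counter


class MinuteRollup:
    """Accumulates per-key counts in memory and flushes one batch per minute."""

    def __init__(self, flush_fn):
        self._flush_fn = flush_fn      # called with a list of (minute_start, key, count)
        self._counts = Counter()
        self._current_minute = None

    def record(self, epoch_seconds, key, amount=1):
        minute = int(epoch_seconds) // 60
        # A new minute has started: write out the finished bucket first.
        if self._current_minute is not None and minute != self._current_minute:
            self.flush()
        self._current_minute = minute
        self._counts[key] += amount

    def flush(self):
        if self._counts:
            minute_start = self._current_minute * 60
            rows = [(minute_start, k, n) for k, n in sorted(self._counts.items())]
            self._flush_fn(rows)
        self._counts = Counter()


# Stand-in for the database write; a real flush_fn would do one batched INSERT.
written = []
rollup = MinuteRollup(written.extend)
for t, key in [(0, "page_a"), (10, "page_a"), (59, "page_b"), (61, "page_a")]:
    rollup.record(t, key)
rollup.flush()  # flush the last partial minute on shutdown
print(written)
```

Four raw events become three rolled-up rows here. Anything still in `_counts` when the process dies is lost, which is exactly the one-minute risk mentioned above.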
Anyway, the best advice I can think of: read a book, understand how HBase works first, dig into the pros and cons, and think about how it can suit your specific needs. This is crucial, because a good implementation is what will determine whether it's a success or a total failure.
Some resources:
HBase: The Definitive Guide
HBase Administration Cookbook
HBase Reference guide (free)

Is mongo appropriate to use alongside MySQL?

I can't discuss things in great detail due to an NDA, but I'm hoping an overview of the system being built will be enough for you to help me make a decision about our databases.
I'm building an app that will help vendors compete to gain clientele by making strategic offers based on records of inventory/purchase from the storefronts.
One side of the app is for the store owners to see presented offers, network, etc. I've got that going with a standard PHP/MySQL setup.
My question concerns the inventory records. We are talking millions of records here almost immediately. The sample data I'm using is a roll-up of four of their managers (they have dozens) over the course of a year or two, and it has over 500k rows with about 30 or more columns. When we get scores of stores with all of their managers it will be massive, at least compared to anything I've worked with so far.
The vendors will have a side of the product in which they can search through these records and make competitive offers based on them.
Is the sheer size a good reason to use something like mongo? Or is it more a matter of how the data is laid out / what it consists of? Or some other element that I'm not considering?
And, if not mongo/NoSQL, is there some other methodology or technology that such large data stores would benefit from (sharding, an Amazon cloud database, etc.)?
Thanks
Answers ...
Q: Is the sheer size a good reason to use something like mongo?
A: I think so. Mongo was built from the ground up to scale in a massive way. You have replica sets and sharding that can help you scale. It also has features to make sure your data gets stored in the appropriate geographically distributed data centers.
Q: Or is it more a matter of how the data is laid out / what it consists of?
A: Mongo is a document database and you're right, the data models will be different. You have to think of data in a denormalized way instead of normalized. Just like any technology, there are pros and cons to storing things as documents.
Some pros: Schema management is a breeze. Data fits the objects in your application more naturally. You don't have to pay the price of complicated/slow joins.
Some cons: Schemas can be inconsistent, so you have to manage that yourself. Data is repeated, and if the copies are not managed they can become inconsistent.
In general I think Mongo would be a good choice for dealing with that scale. Mongo has a new aggregation framework that brings a lot of SQL concepts to queries on documents, which makes complex queries easier to write. Mongo also has map/reduce for running more arbitrary queries.
After using Mongo daily for about a year, I've really enjoyed the support around it as a product and the general ease of setting it up and working with it.
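To give a feel for how the aggregation framework mirrors SQL, here is a small sketch that only builds a pipeline as plain Python data. The field names (`store_id`, `item`, `quantity`) and the helper name are assumptions about what an inventory document might contain; the stage operators (`$match`, `$group`, `$sum`, `$sort`, `$limit`) are MongoDB's own.

```python
def top_items_pipeline(store_id, limit=10):
    """Build a pipeline roughly equivalent to:
    SELECT item, SUM(quantity) AS units FROM inventory
    WHERE store_id = ? GROUP BY item ORDER BY units DESC LIMIT ?
    """
    return [
        {"$match": {"store_id": store_id}},                            # WHERE
        {"$group": {"_id": "$item", "units": {"$sum": "$quantity"}}},  # GROUP BY + SUM
        {"$sort": {"units": -1}},                                      # ORDER BY ... DESC
        {"$limit": limit},                                             # LIMIT
    ]


pipeline = top_items_pipeline("store-17", limit=5)
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)

# With pymongo and a running server this would be passed as:
#   collection.aggregate(pipeline)
```

Putting `$match` first matters in practice, since it lets the server filter documents (ideally through an index) before grouping them.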

Cassandra or MySQL/PostgreSQL?

I have a huge database (kind of like WordNet) and want to know whether it would be easier to use Cassandra instead of MySQL/PostgreSQL.
All my life I have used MySQL and PostgreSQL, and I can easily think in terms of relational algebra, but several weeks ago I learned about Cassandra and that it's used at Facebook and Twitter.
Is it more convenient?
Which DBMSs are usually used nowadays to store social network data, relationships between objects, or WordNet-like data?
There is no silver bullet; every system is built to solve a specific problem and has its own pros and cons. It is up to you to decide what your problem statement is and which solution fits it best. Whether you use Cassandra (NoSQL) or MySQL (RDBMS) is driven entirely by your system's requirements. Below are some inputs that will help you make a better decision when choosing a database.
Why Use NoSQL
With RDBMS databases, the choice is quite easy, because almost all the databases in this category (MySQL, Oracle, MS SQL, PostgreSQL) offer much the same kind of solution, oriented around ACID properties. With NoSQL, the decision becomes difficult, because every NoSQL database offers a different solution and you have to understand which one is best suited to your app/system requirements. For example, MongoDB fits use cases where your system demands a schema-less document store. HBase might fit search engines, log data analysis, or any place where scanning huge, two-dimensional, join-less tables is a requirement. Redis is an in-memory store offering a variety of data structures (lists, sets, sorted sets, hashes, etc.) and can be a good fit for real-time leaderboards or pub-sub systems. Similarly, there are other databases in this category (including Cassandra) that fit different problems. Now let's move to the original questions and answer them one by one.
When to use Cassandra
As part of the NoSQL family, Cassandra suits problems where you need a very write-heavy system and want a responsive reporting system on top of the stored data. Consider web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in real time. You can refer to this blog post (http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/) to understand more about the use cases where Cassandra fits.
When to Use an RDBMS instead of Cassandra/NoSQL
Cassandra is a NoSQL database and does not provide ACID transactions or relational data properties. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a fit. You can make it work, but you will end up writing lots of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.
There are many different flavours of "NoSQL" databases. If your application is really like Wordnet perhaps you should look at a graph database such as Neo4j.
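To illustrate why WordNet-like data is a natural fit for a graph model, here is a small plain-Python sketch. Most questions about such data are traversals over relationships. The `HYPERNYMS` edges are made-up sample data and `ancestors` is a hypothetical helper; a graph database such as Neo4j would store the edges natively and express this walk in its own query language.

```python
from collections import deque

# Hypothetical "is a kind of" edges, WordNet-style.
HYPERNYMS = {
    "poodle": ["dog"],
    "dog": ["canine", "pet"],
    "canine": ["mammal"],
    "pet": ["animal"],
    "mammal": ["animal"],
    "animal": [],
}


def ancestors(word, graph):
    """Breadth-first walk: every ancestor of `word` with its shortest distance."""
    seen = {word: 0}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen[parent] = seen[current] + 1
                queue.append(parent)
    del seen[word]  # the word is not its own ancestor
    return seen


result = ancestors("poodle", HYPERNYMS)
print(result)
```

In a relational schema the same question needs a recursive query or one self-join per level, which is the main reason graph stores get suggested for this kind of data.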
I would suggest analysing your requirements.
If you are going to run on many clusters or machines, take NoSQL.
If your data model is complicated and requires efficient structures, take NoSQL (there are no limits on column types).
If you fit on a few machines without scaling out, you don't need very high performance for many concurrent requests (as in a social network, for example, where lots of users send HTTP requests), and you don't expect scalability to become a concern, take an RDBMS (Postgres has some good functions and structures you can use, such as the array column type).
Cassandra should work better with large-scale, multi-purpose data.
Neo4j would be better for special structures such as graphs.
Cassandra and other NoSQL stores are used for social sites because of their need for massive write-heavy workloads. Not that MySQL and Postgres can't achieve this, but NoSQL generally requires far less time and money.
It sounds like you may want to look at Neo4j though, just in terms of your object model needs.
These are all different products, and they all have their pros and cons. What kind of problem do you have to solve?
Huge, as in TBs?

What database works well with 200+GB of data?

I've been using MySQL (with InnoDB, on Amazon RDS) because it's the sort of universal default, but it's been ridiculously under-performing, and tweaking it only delays the inevitable.
The data is mostly relatively short (<1 kB each) blobs of information about 100Ms of URLs. There is (or should be; MySQL cannot seem to handle it) a very high volume of inserts / updates / retrievals but few complex queries. It's not that complex queries wouldn't be useful, but MySQL is so slow that it's far faster to get the data out, process it locally, and cache the results somewhere.
I can keep tweaking mysql and throwing more hardware at it, but it seems increasingly futile.
So what are the options? SQL/relational model/etc. optional - anything will do as long as it's fast, networked, and language-independent.
Have you done any sort of end-to-end profiling of your application and MySQL database? To provide better advice it would also be good to understand what improvements you have tried to implement, and your database structure. You haven't given a lot of information on how your MySQL database is configured either. It provides a lot of options for tuning.
You should pick up a copy of High Performance MySQL if you haven't already to learn more about the product.
There is no point in doing anything until you know what your problem is. NoSQL solutions can offer performance benefits but you have provided little evidence that MySQL is incapable of servicing your needs.
Well, "fast, networked and language-independent" + "few complex queries" brings to mind the various NoSQL solutions. To name a few:
MongoDB
CouchDB
Cassandra
And if that's not fast enough, there is always the wicked-fast Redis, which is my personal favorite at the moment. :) It is not a database per se, but it's good enough for most scenarios.
I am sure other people can list more NoSQL databases...
and there is always http://nosql-database.org/ .
Generally speaking, databases in this category are better and faster in your scenario because they have relaxed constraints, which makes frequent inserts/updates/retrievals easier and faster. But that requires you to think harder about your data model, and it is generally not possible to run SQL-style complex queries directly. Instead, you write more pre-computed data or use a more denormalized design to make up for the lack of complex queries.
But since complex queries are a minor concern in your case, I think NoSQL solutions are ideal for you.
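As a rough illustration of "writing more pre-computed data", the sketch below keeps a per-host counter up to date at write time. The answer to "how many URLs do we have for this host?" then becomes a single key lookup rather than a query over all records. `UrlStore` is a hypothetical in-memory stand-in for a key/value store, not any real product's API.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class UrlStore:
    """Toy key/value store that maintains a denormalized per-host count."""

    def __init__(self):
        self.by_url = {}                        # primary data: url -> blob
        self.count_by_host = defaultdict(int)   # pre-computed view, updated on write

    def put(self, url, blob):
        # Only new URLs change the count; updates just overwrite the blob.
        if url not in self.by_url:
            self.count_by_host[urlsplit(url).hostname] += 1
        self.by_url[url] = blob

    def get(self, url):
        return self.by_url.get(url)

    def urls_on_host(self, host):
        # O(1) lookup instead of a GROUP BY over every record.
        return self.count_by_host.get(host, 0)


store = UrlStore()
store.put("http://example.com/a", b"first")
store.put("http://example.com/b", b"second")
store.put("http://example.org/x", b"third")
store.put("http://example.com/a", b"updated")  # update, not a new URL
print(store.urls_on_host("example.com"))
```

The cost is that every write does a little extra work and the application, not the database, is responsible for keeping the pre-computed view consistent.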
With the data you've given about your application's data and workload, it is almost impossible to determine whether the problem really is MySQL itself or something else. You seem to assume that you can throw any workload at a relational engine and it should handle it. The suggestions made by other commenters about analyzing the performance more carefully are therefore valid, in my opinion. Without more data (transactions / second etc.), any further analysis of other suitable engines is also futile.
I'm not sure I agree with the advice to jump ship on traditional databases. A relational database might not be the most efficient tool, but it is the one that is FAR more widely understood and used, and I strongly doubt you have a problem that can't be handled by an efficiently set up relational database.
Obvious answers are Oracle, SQL Server, etc., but it might just be that your database structure isn't right. I don't know much about MySQL, but I do know it's used in some pretty big projects (eBay being noteworthy).