Is it true that Mongo keeps the full database in RAM?

How much RAM does Mongo need in comparison with MySQL?

MongoDB does its best to keep as much useful information in RAM as possible. MySQL generally does the same thing.
Both databases will use all of the RAM they have available.
Comparing the two is not easy, because it really depends on a lot of things: your table structure, your data size, and your indexes.
If you give MongoDB and MySQL the same amount of RAM, you will typically find the following:
MongoDB will be very good at finding individual records (like looking up a user or updating an entry).
MySQL will be very good at loading and using sets of related data.
The performance will really be dictated by your usage of the database.
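As a rough sketch of that difference, here is what the two access patterns look like from application code. The users/orders schema, connection details, and the pymongo / mysql-connector-python clients are all assumptions for illustration, not something from the answer above.

    # Sketch only: hypothetical users/orders data, assuming pymongo and mysql-connector-python.
    from pymongo import MongoClient
    import mysql.connector

    # MongoDB is strongest at single-document lookups: one indexed read returns the whole record.
    mongo = MongoClient("mongodb://localhost:27017")
    user_doc = mongo.appdb.users.find_one({"_id": 42})  # embedded data comes back with the user

    # MySQL is strongest at assembling related sets: one query joins normalized tables.
    conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="appdb")
    cur = conn.cursor(dictionary=True)
    cur.execute("""
        SELECT u.name, o.total
        FROM users u
        JOIN orders o ON o.user_id = u.id
        WHERE u.id = %s
    """, (42,))
    related_rows = cur.fetchall()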

The short answer is: the same.
Another way to ask is: if I am using MySQL + memcached, how much RAM do I need to use Mongo instead of that combination? The answer would be on the order of the same total amount of memory for both clusters (the MongoDB cluster probably being sharded in this scenario).
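For context, the MySQL + memcached combination usually means a cache-aside read path like the sketch below. The key naming, schema, and the pymemcache / mysql-connector-python clients are assumptions for illustration only.

    # Cache-aside sketch: serve reads from memcached, fall back to MySQL on a miss.
    # Hypothetical key naming and schema; assumes pymemcache and mysql-connector-python.
    import json
    import mysql.connector
    from pymemcache.client.base import Client

    cache = Client(("localhost", 11211))
    db = mysql.connector.connect(host="localhost", user="app", password="secret", database="appdb")

    def get_user(user_id):
        key = f"user:{user_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)  # cache hit: no database round trip
        cur = db.cursor(dictionary=True)
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
        if row is not None:
            cache.set(key, json.dumps(row), expire=300)  # warm the cache for later reads
        return row

Replacing this pair with MongoDB mostly means giving MongoDB the RAM that memcached was using, which is why the totals come out about the same.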

Related

Which database takes up more disk space, MongoDB or MySQL?

I have a large MongoDB database, about 25 GB in volume. Because of complex aggregation queries, MongoDB copes worse than MySQL, but I'm afraid that MySQL will take up much more space on disk. Is there any way to estimate the approximate size this database would have in MySQL? Perhaps someone has already compared these databases in terms of size?
The answer depends on a lot of choices specific to your database, such as:
MongoDB storage engine
MySQL storage engine
Number of indexes
Data types of indexed and non-indexed columns
Compression options used in either brand of database
Probably other factors
The best way to get an accurate comparison is to try it yourself using your data and your data model.
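For the MySQL side, once you have loaded a sample of your data you can read the approximate on-disk footprint from information_schema. A minimal sketch, assuming mysql-connector-python and a hypothetical schema named mydb; the numbers are estimates (especially for InnoDB):

    # Report approximate data and index size per table from information_schema.
    # Assumes mysql-connector-python; 'mydb' is a placeholder schema name.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app", password="secret")
    cur = conn.cursor()
    cur.execute("""
        SELECT table_name,
               ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
               ROUND(index_length / 1024 / 1024, 1) AS index_mb
        FROM information_schema.tables
        WHERE table_schema = 'mydb'
        ORDER BY data_length + index_length DESC
    """)
    for table_name, data_mb, index_mb in cur.fetchall():
        print(f"{table_name}: data={data_mb} MB, index={index_mb} MB")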
See also my answer to https://stackoverflow.com/a/66873904/20860

Why do we have Redis when we have MySQL temporary tables?

MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL was set up accordingly). One can create indexes on any fields.
Redis stores data in memory, indexed by one key at a time, and in my understanding MySQL can do this job too.
Is there anything that makes Redis better for storing a large amount (100-200k rows) of volatile data? The only explanation I can find for Redis is that not every project has MySQL inside, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to also set up Redis?
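(For reference, the kind of indexed, in-memory temporary table the question has in mind looks roughly like the sketch below. The table and column names are hypothetical, and mysql-connector-python is assumed as the client.)

    # Sketch of an indexed MEMORY temporary table that lives only for the current connection.
    # Hypothetical table/columns; assumes mysql-connector-python.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app", password="secret", database="appdb")
    cur = conn.cursor()
    cur.execute("""
        CREATE TEMPORARY TABLE session_scores (
            user_id INT NOT NULL,
            score   INT NOT NULL,
            INDEX idx_score (score) USING BTREE
        ) ENGINE=MEMORY
    """)
    cur.execute("INSERT INTO session_scores (user_id, score) VALUES (%s, %s)", (1, 250))
    cur.execute("SELECT user_id FROM session_scores WHERE score > %s", (100,))
    print(cur.fetchall())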
Redis is like working with indexes directly. There's no ACID, no SQL parser, and not many other things between you and the data.
It provides some basic data structures that are specifically optimized to be held in memory, along with specific operations to read and modify them.
On the other hand, Redis isn't designed to query data (though you can implement very powerful and high-performance filters with SORT, SCAN, intersections and other operations) but to store the data in the shape in which it's going to be consumed later. If you want to get, for example, customers sorted by 3 different criteria, you'll need to do the work of filling 3 different sorted sets (as sketched below). There are a lot of use cases with other data structures, but I would end up writing a book in an answer...
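A minimal sketch of that pattern with the redis-py client; the key names and customer fields are hypothetical:

    # Maintain one sorted set per sort criterion, then read each back in score order.
    # Hypothetical key names and fields; assumes the redis-py client.
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def save_customer(cust):
        # Every write keeps all three "indexes" in sync by hand.
        r.zadd("customers:by_revenue",    {cust["id"]: cust["revenue"]})
        r.zadd("customers:by_signup",     {cust["id"]: cust["signup_ts"]})
        r.zadd("customers:by_last_order", {cust["id"]: cust["last_order_ts"]})

    save_customer({"id": "cust:42", "revenue": 1800.0,
                   "signup_ts": 1262304000, "last_order_ts": 1288000000})

    # Top 10 customers by revenue, highest first.
    top_by_revenue = r.zrevrange("customers:by_revenue", 0, 9, withscores=True)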
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you need to use Redis instead of temporary tables in MySQL (and other engines which have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and gives you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.

Hadoop (+HBase/HDFS) vs MySQL (or Postgres) - Loads of independent, structured data to be processed and queried

Hi there at SO,
I would like some ideas/comments on the following from you, honorable and venerable bunch.
I have a 100M records which I need to process. I have 5 nodes (in a rocks cluster) to do this. The data is very structured and falls nicely in the relational data model. I want to do things in parallel since my processing takes some time.
As I see it I have two main options:
Install MySQL on each node and put 20M records on each. Use the head node to delegate queries to the nodes and aggregate the results. Query capabilities++, but I might risk some headaches when I come to choose partitioning strategies etc. (Q. Is this what they call a MySQL/Postgres cluster?). The really bad part is that the processing of the records is now left up to me to take care of (how to distribute it across machines etc.)...
Alternatively, install Hadoop, Hive and HBase (note that this might not be the most efficient way to store my data, since HBase is column oriented) and just define the nodes. We write everything in the MapReduce paradigm and, bang, we live happily ever after. The problem here is that we lose the "real time" query capabilities (I know you can use Hive, but that is not suggested for real-time queries - which I need) - since I also have some normal SQL queries to execute at times: "select * from wine where colour = 'brown'".
Note that in theory - if I had 100M machines I could do the whole thing instantly, since for each record the processing is independent of the others. Also - my data is read-only. I do not envisage any updates happening. I do not need/want 100M records on one node. I do not want there to be redundant data (since there is lots of it), so keeping it in BOTH MySQL/Postgres and Hadoop/HBase/HDFS is not a real option.
Many Thanks
Can you prove that MySQL is the bottleneck? 100M records is not that many, and it looks like you're not performing complex queries. Without knowing exactly what kind of processing you need, here is what I would do, in this order:
Keep the 100M in MySQL. Take a look at Cloudera's Sqoop utility to import records from the database and process them in Hadoop.
If MySQL is the bottleneck in (1), consider setting up slave replication, which will let you parallelize reads without the complexity of a sharded database (a rough sketch follows after this list). Since you've already stated that you don't need to write back to the database, this should be a viable solution. You can replicate your data to as many servers as needed.
If you are running complex select queries from the database, and (2) is still not viable, then consider using Sqoop to import your records and do whatever query transformations you require in Hadoop.
In your situation, I would resist the temptation to jump off of MySQL, unless it is absolutely necessary.
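As a rough sketch of point (2), each worker process can read its own ID range from a different replica. The replica hostnames, table layout, and per-record processing are assumptions for illustration, and mysql-connector-python is assumed as the client:

    # Split the ID space into ranges and let each worker read its range from a
    # different MySQL replica, so reads are parallelized without sharding.
    # Hostnames, table and processing logic are hypothetical; assumes mysql-connector-python.
    from multiprocessing import Pool
    import mysql.connector

    REPLICAS = ["replica1", "replica2", "replica3", "replica4"]
    CHUNK = 1_000_000

    def process_range(task):
        start_id, replica = task
        conn = mysql.connector.connect(host=replica, user="app", password="secret",
                                       database="records_db")
        cur = conn.cursor()
        cur.execute("SELECT id, payload FROM records WHERE id >= %s AND id < %s",
                    (start_id, start_id + CHUNK))
        for record_id, payload in cur:
            pass  # independent per-record processing goes here
        conn.close()
        return start_id

    if __name__ == "__main__":
        tasks = [(start, REPLICAS[i % len(REPLICAS)])
                 for i, start in enumerate(range(0, 100_000_000, CHUNK))]
        with Pool(processes=8) as pool:
            pool.map(process_range, tasks)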
There are a few questions to ask before suggesting anything.
Can you formulate your queries to access data by primary key only? In other words, can you avoid all joins and table scans? If so, HBase is an option if you need a very high rate of read/write accesses (see the sketch after this answer).
I do not think that Hive is a good option considering the low data volume. If you expect it to grow significantly, you can consider it. In any case, Hive is good for analytical workloads, not for OLTP-type processing.
If you do need a relational model with joins and scans, I think a good solution might be one master node and 4 slaves, with replication between them. You would direct all writes to the master and balance reads across the whole cluster. It is especially good if you have many more reads than writes.
In this scheme you will have all 100M records (which is not that much) on each node. Within each node you can employ partitioning if appropriate.
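If the access pattern really is key-only, the HBase side of that trade-off looks roughly like the sketch below. The table name, column family, and Thrift host are hypothetical, and the happybase client (talking to an HBase Thrift server) is assumed:

    # Row-key put/get against HBase: no joins, no scans, just direct key access.
    # Hypothetical table/column family/host; assumes the happybase client.
    import happybase

    conn = happybase.Connection(host="hbase-thrift-host")
    table = conn.table("records")

    # Write one record under its primary key.
    table.put(b"record-0000042", {b"d:payload": b"...serialized record..."})

    # Read it back by key; this is the access pattern HBase is built for.
    row = table.row(b"record-0000042")
    print(row[b"d:payload"])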
You may also want to consider using Cassandra. I recently discovered this article on HBase vs. Cassandra which I was reminded of when I read your post.
The gist of it is that Cassandra is a highly scalable NoSQL solution with fast querying, which sort of sounds like the solution you're looking for.
So, it all depends on whether you need to maintain your relational model or not.
Hi,
I had a situation where I had many tables which I created in parallel, using SQLAlchemy and the Python multiprocessing library. I had multiple files, one per table, and loaded them using parallel COPY processes. If each process corresponds to a separate table, that works well. With one table, using COPY would be difficult. You could use table partitioning in PostgreSQL, I guess. If you are interested I can give more details; a rough sketch of the approach is below.
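A rough sketch of that approach: one worker process per (file, table) pair, each running its own COPY. The file and table names are made up, and psycopg2 is assumed here (the original setup used SQLAlchemy, but the COPY-per-process idea is the same):

    # One worker process per (file, table) pair, each running its own COPY.
    # Hypothetical file/table names; assumes psycopg2.
    from multiprocessing import Pool
    import psycopg2

    JOBS = [
        ("/data/measurements_a.csv", "measurements_a"),
        ("/data/measurements_b.csv", "measurements_b"),
        ("/data/measurements_c.csv", "measurements_c"),
    ]

    def load_one(job):
        path, table = job
        conn = psycopg2.connect(host="localhost", dbname="science", user="app", password="secret")
        with conn, conn.cursor() as cur, open(path) as f:
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", f)
        conn.close()
        return table

    if __name__ == "__main__":
        with Pool(processes=len(JOBS)) as pool:
            pool.map(load_one, JOBS)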
Regards.

Can I expect a significant performance boost by moving a large key value store from MySQL to a NoSQL DB?

I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.
I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.
Queries are typically for around 100 consecutive values,
e.g. SELECT Value FROM <table> WHERE ID BETWEEN 7000000 AND 7000100;
I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.
Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.
Thanks
Please also consider OrientDB. It uses indexes based on an RB+Tree algorithm. In my tests with a 100GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends on how the keys/values are distributed inside the index.
Running your own test with it should take less than an hour.
One piece of bad news is that OrientDB doesn't support a clustered configuration yet (planned for September 2010).
I use MongoDB in production for a write-intensive operation where I do well over the rates you are referring to, for both WRITE and READ operations. The database is around 90GB and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15ms on a database with 150M entries, with query times reaching 30-50ms under heavy load.
At any rate, 200ms is way too much for a key/value store.
If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn (see the sketch below).
If you are looking for a distributed solution, you can try any Dynamo clone:
Cassandra (Facebook) or Project Voldemort (LinkedIn) being the most popular.
Keep in mind that requiring strong consistency slows these systems down quite a bit.
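For comparison, the range query from the question maps to a single indexed range scan in MongoDB. A minimal pymongo sketch, assuming the numeric ID is stored as the document _id and a hypothetical collection name:

    # Fetch ~100 consecutive values by ID range from MongoDB.
    # Assumes pymongo; collection name is hypothetical and the numeric ID is stored as _id.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    values = client.sciencedb.values

    # _id is always indexed, so this is a single index range scan.
    docs = values.find({"_id": {"$gte": 7000000, "$lte": 7000100}}).sort("_id", 1)
    result = [d["value"] for d in docs]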
I would expect Cassandra to do better than a b-tree based system like Tokyo Cabinet, MySQL, or MongoDB where the dataset does not fit in memory. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.

Can MySQL handle a dataset of 50GB?

Can MySQL handle a dataset of 50GB (only text) efficiently? If not, what database technologies should I use?
thanks
Technically, I would say yes. MySQL can handle 50GB of data efficiently.
If you are looking for a few examples, Facebook moved to Cassandra only after it was storing over 7 Terabytes of inbox data.
Source: Lakshman, Malik: Cassandra - A Decentralized Structured Storage System.
Wikipedia also handles hundreds of Gigabytes of text data in MySQL.
Any backend that uses b-trees (like all the popular ones for MySQL) gets dramatically slower when the index doesn't fit in RAM anymore. Depending on your query needs, Cassandra might be a good fit, or Lucandra (Lucene + Cassandra) -- http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
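One quick sanity check before switching engines is to compare the total index size against the memory MySQL can use for caching indexes. A sketch assuming mysql-connector-python and InnoDB (for MyISAM, the analogous setting is key_buffer_size); 'mydb' is a placeholder schema name:

    # Compare total index size with the configured InnoDB buffer pool size.
    # If the indexes are much larger than the buffer pool, b-tree reads start hitting disk.
    # Assumes mysql-connector-python; 'mydb' is a placeholder schema name.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app", password="secret")
    cur = conn.cursor()

    cur.execute("SELECT SUM(index_length) FROM information_schema.tables WHERE table_schema = 'mydb'")
    index_bytes = cur.fetchone()[0] or 0

    cur.execute("SHOW VARIABLES LIKE 'innodb_buffer_pool_size'")
    buffer_pool_bytes = int(cur.fetchone()[1])

    print(f"indexes: {index_bytes / 2**30:.1f} GiB, buffer pool: {buffer_pool_bytes / 2**30:.1f} GiB")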