Can MySQL handle a dataset of 50 GB (only text) efficiently? If not, what database technologies should I use?
Thanks
Technically, I would say yes: MySQL can handle 50 GB of data efficiently.
If you are looking for a few examples, Facebook moved to Cassandra only after it was storing over 7 terabytes of inbox data.
Source: Lakshman, Malik: Cassandra - A Decentralized Structured Storage System.
Wikipedia also handles hundreds of gigabytes of text data in MySQL.
Any backend that uses B-trees (like all the popular MySQL storage engines) gets dramatically slower once the index no longer fits in RAM. Depending on your query needs, Cassandra might be a good fit, or Lucandra (Lucene + Cassandra) -- http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/
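As a rough check of whether your indexes still fit in RAM, you can compare their total size against the configured buffer pool (a sketch, assuming InnoDB; 'your_schema' is a placeholder):

    -- Total data and index size per schema, in MB.
    SELECT table_schema,
           ROUND(SUM(data_length)  / 1024 / 1024, 1) AS data_mb,
           ROUND(SUM(index_length) / 1024 / 1024, 1) AS index_mb
    FROM information_schema.tables
    WHERE table_schema = 'your_schema'   -- placeholder name
    GROUP BY table_schema;

    -- Memory available to InnoDB for caching data and index pages, in bytes.
    SELECT @@innodb_buffer_pool_size;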
I have a large database in MongoDB, about 25 GB in volume. Because of complex aggregation queries, MongoDB copes worse than MySQL, but I'm afraid MySQL will take much more space on disk. Is there any way to estimate the approximate size the same database would have in MySQL? Perhaps someone has already compared these databases in terms of size?
The answer depends on a lot of choices specific to your database, such as:
MongoDB storage engine
MySQL storage engine
Number of indexes
Data types of indexed and non-indexed columns
Compression options used in either brand of database
Probably other factors
The best way to get an accurate comparison is to try it yourself using your data and your data model.
See also my answer to https://stackoverflow.com/a/66873904/20860
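For example, after importing a representative sample of your data into MySQL, the data dictionary reports the resulting sizes per table (the schema name below is a placeholder); on the MongoDB side, db.stats() in the mongo shell gives comparable dataSize/storageSize/indexSize figures:

    -- Per-table data and index size for the imported sample, in MB.
    SELECT table_name,
           ROUND(data_length  / 1024 / 1024, 1) AS data_mb,
           ROUND(index_length / 1024 / 1024, 1) AS index_mb
    FROM information_schema.tables
    WHERE table_schema = 'your_schema'   -- placeholder name
    ORDER BY data_length DESC;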
MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL is configured accordingly), and one can create indexes on any fields.
Redis stores data in memory indexed by one key at a time, and as far as I understand MySQL can do this job too.
Are there any things that make Redis better for storing a large amount (100-200k rows) of volatile data? I can only explain the appeal of Redis by the fact that not every project has MySQL in it, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to bring in Redis as well?
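For concreteness, the MySQL approach I have in mind is roughly this (table and column names are made up):

    -- In-memory table, visible only to the current connection.
    CREATE TEMPORARY TABLE hot_items (
        item_id INT NOT NULL,
        score   INT NOT NULL,
        INDEX idx_score (score)
    ) ENGINE = MEMORY;

    INSERT INTO hot_items (item_id, score) VALUES (1, 42), (2, 17);

    -- Served entirely from RAM while the connection is open.
    SELECT item_id FROM hot_items ORDER BY score DESC LIMIT 10;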
Redis is like working with indexes directly. There's no ACID, SQL parser and many other things between you and the data.
It provides some basic data structures and they're specifically optimized to be held in memory, and they also have specific operations to read and modify them.
On the other hand, Redis isn't designed to query data (although you can implement very powerful and high-performance filters with SORT, SCAN, intersections and other operations) but to store the data the way it's going to be consumed later. If you want to get, for example, customers sorted by 3 different criteria, you'll need to fill 3 different sorted sets. There are a lot of use cases with the other data structures, but I would end up writing a book in an answer...
Also, one of the most powerful features found in Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether you need Redis instead of temporary tables in MySQL (and other engines that have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and gives you a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on it. Also, when Redis is running on a different server, scaling can be performed independently on each tier.
I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.
I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.
Queries are typically for around 100 consecutive values,
e.g. SELECT Value FROM main_data WHERE ID BETWEEN 7000000 AND 7000100; (main_data being the key/value table sketched below)
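For reference, the table is essentially the following (a sketch; the table name and the Value type are illustrative):

    -- Key/value table: a unique ID plus an opaque value, read in ID ranges.
    CREATE TABLE main_data (
        ID    BIGINT UNSIGNED NOT NULL,
        Value MEDIUMTEXT NOT NULL,
        PRIMARY KEY (ID)
    ) ENGINE = MyISAM;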
I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.
Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.
Thanks
Please consider also OrientDB. It uses indexes with an RB+Tree algorithm. In my tests with a 100 GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends on how the keys/values are distributed inside the index.
Making your own test with it should take less than an hour.
One piece of bad news is that OrientDB does not support a clustered configuration yet (planned for September 2010).
I use MongoDB in production for a write-intensive operation where I do well over the rates you are referring to for both WRITE and READ operations. The database is around 90 GB, and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15 ms on a database with 150M entries, with query times reaching 30-50 ms under heavy load.
At any rate, 200 ms is way too much for a key/value store.
If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn.
If you are looking for a distributed solution, you can try any Dynamo clone:
Cassandra (Facebook) or Project Voldemort (LinkedIn) being the most popular.
Keep in mind that looking for strong consistency slows these systems down quite a bit.
I would expect Cassandra to do better where the dataset does not fit in memory than a b-tree based system like TC, MySQL, or MongoDB. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.
How much RAM does MongoDB need in comparison with MySQL?
MongoDB does its best to keep as much useful information in RAM as possible. MySQL generally does the same thing.
Both databases will use all of the RAM they have available.
Comparing the two is not easy, because it really depends on a lot of things. Things like your table structure and your data size and your indexes.
If you give MongoDB and MySQL the same amount of RAM, you will typically find the following:
MongoDB will be very good at finding individual records. (like looking up a user or updating an entry)
MySQL will be very good at loading and using sets of related data.
The performance will really be dictated by your usage of the database.
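On the MySQL side you can at least see how much memory the main cache is allowed to use, and how often reads are served from it rather than from disk (a sketch, assuming InnoDB):

    -- Memory InnoDB may use for caching data and index pages.
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- Logical read requests vs. reads that actually went to disk.
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';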
The short answer is: the same.
Another way to ask is: if I am using MySQL + memcached, how much RAM do I need to use Mongo instead of that combination? The answer would be on the order of the same total amount of memory for both clusters (the MongoDB cluster probably being sharded in this scenario).
For the same data set, with mostly text data, how do the data (table + index) size of Postgresql compared to that of MySQL?
PostgreSQL uses MVCC, which would suggest its data size would be bigger.
In this presentation, the largest blog site in Japan talked about their migration from PostgreSQL to MySQL. One of their reasons for moving away from PostgreSQL was that the data size in PostgreSQL was too large (p. 41):
Migrating from PostgreSQL to MySQL at Cocolog, Japan's Largest Blog Community
PostgreSQL has data compression, so that should make the data size smaller. But MySQL (via the InnoDB plugin) also has compression.
Does anyone have any actual experience about how the data sizes of Postgresql & MySQL compare to each other?
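(On the MySQL side I mean InnoDB's compressed row format, roughly like this; the table is only an illustration:)

    -- Older MySQL versions also need innodb_file_per_table=ON
    -- and innodb_file_format=Barracuda for this to take effect.
    CREATE TABLE posts (
        id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        body MEDIUMTEXT NOT NULL
    ) ENGINE = InnoDB ROW_FORMAT = COMPRESSED KEY_BLOCK_SIZE = 8;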
MySQL uses MVCC as well, just check InnoDB. But in PostgreSQL you can change the FILLFACTOR to make space for future updates. With this, you can create a database that has space for current data but also for some future updates and deletes. When autovacuum and HOT do their thing right, the size of your database can be stable.
The blog post is about old versions; a lot of things have changed, and PostgreSQL does a much better job at compression than it did in the old days. Compression depends on the datatype, configuration and speed as well. You have to test to see how it works for your situation.
I did a couple of conversions from MySQL to PostgreSQL, and in all these cases PostgreSQL was about 10% smaller (MySQL 5.0 => PostgreSQL 8.3 and 8.4). This 10% was used to change the fillfactor on the most frequently updated tables; these were set to a fillfactor of 60 to 70. Speed was much better (no more problems with over 20 concurrent users) and data size was stable as well, no MVCC going out of control or vacuum falling too far behind.
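Setting a fillfactor looks roughly like this in PostgreSQL (the table and index names are made up):

    -- Leave ~30% of each table page free so updates can stay on the same page (HOT).
    ALTER TABLE blog_entries SET (fillfactor = 70);

    -- Indexes accept a fillfactor of their own.
    ALTER INDEX blog_entries_author_idx SET (fillfactor = 70);

    -- Rewrite existing pages so already-stored rows respect the new setting.
    VACUUM FULL blog_entries;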
MySQL and PostgreSQL are two different beasts: PostgreSQL is all about reliability, whereas MySQL is popular.
Both have their storage requirements in their respective documentation:
MySQL: http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
Postgres: http://www.postgresql.org/docs/current/interactive/datatype.html
A quick comparison of the two doesn't show any flagrant "zomg Postgres requires 2 megabytes to store a bit field" type differences. I suppose Postgres could have higher metadata overhead than MySQL, or has to extend its data files in larger chunks, but I can't find anything obvious suggesting that Postgres "wastes" space for which migrating to MySQL is the cure.
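If you'd rather compare on your own data than from the documentation, both databases can report table-plus-index sizes directly (a sketch; table and schema names are placeholders):

    -- PostgreSQL: table size including indexes and TOAST data.
    SELECT pg_size_pretty(pg_total_relation_size('your_table'));

    -- MySQL: data plus index size from the data dictionary, in MB.
    SELECT ROUND((data_length + index_length) / 1024 / 1024, 1) AS total_mb
    FROM information_schema.tables
    WHERE table_schema = 'your_schema' AND table_name = 'your_table';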
I'd like to add that for large column values, PostgreSQL also takes advantage of compressing them using a "fairly simple and very fast member of the LZ family of compression techniques"
To read more about this, check out http://www.postgresql.org/docs/9.0/static/storage-toast.html
It's rather low-level and probably not necessary to know, but since you're using a blog, you may benefit from it.
About indexes:
MySQL stores the row data within the index, which makes indexes huge. Postgres doesn't. This means that the storage size of a B-tree index in Postgres doesn't depend on the number of columns it spans or which data type the columns have.
Postgres also supports partial indexes (e.g. WHERE status=0), which is a very powerful feature for avoiding an index over millions of rows when only a few hundred are needed.
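A partial index looks like this (a sketch; the table and column names are made up):

    -- Index only the rows the hot queries actually touch.
    CREATE INDEX orders_pending_idx ON orders (created_at)
    WHERE status = 0;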
Since you're going to put a lot of data into Postgres, you will probably find it practical to be able to create indexes without locking the table.
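That is CREATE INDEX CONCURRENTLY (again a sketch; the names are made up):

    -- Builds the index without blocking writes; cannot run inside a transaction block.
    CREATE INDEX CONCURRENTLY posts_created_idx ON posts (created_at);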