Most suitable data store for billions of indexes - mysql

So we're looking to store two kinds of indexes.
The first kind will number in the billions, each with between 1 and 1000 values, each value being one or two 64-bit integers.
The second kind will number in the millions, each with about 200 values, each value between 1KB and 1MB in size.
And our usage pattern will be something like this:
Both kinds of index will have values added to the top up to thousands of times per second.
Indexes will be read infrequently, but when they are read, the entire index is read.
Indexes should be pruned, either when values are written to the index or in some kind of batch job.
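To make the usage pattern above concrete, here is a minimal in-memory sketch in Python. It is not a proposal for the actual data store: the `IndexStore` class, its method names, and the cap of 1000 values are assumptions made purely for illustration (the cap is borrowed from the "1 to 1000 values" figure).

```python
from collections import defaultdict, deque

MAX_VALUES = 1000  # assumed cap, borrowed from the "1 to 1000 values" figure


class IndexStore:
    """In-memory stand-in for the store: newest values first, capped length."""

    def __init__(self, max_values=MAX_VALUES):
        self.max_values = max_values
        self._indexes = defaultdict(lambda: deque(maxlen=max_values))

    def add(self, key, value):
        # appendleft puts the value at the top of the index; a deque with
        # maxlen drops the oldest value from the other end (prune on write).
        self._indexes[key].appendleft(value)

    def read(self, key):
        # A read always returns the entire index, newest value first.
        return list(self._indexes.get(key, ()))
```

Whatever database is chosen, the same three operations (prepend, prune, read everything for one key) are the ones worth benchmarking.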
Now we've considered quite a few databases; our favourites at the moment are Cassandra and PostgreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra. And a major requirement is that it can't require too much manpower to maintain. I get the feeling that Cassandra's going to throw up unexpected scaling issues, whereas PostgreSQL's just going to be a pain to shard, but at least for us it's a known quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.
So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!
Thanks,
-Alec

You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.
You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.

Billions is not a big number by today's standards, so why not write a benchmark instead of relying on guesswork? That will give you a better decision tool and it's really easy to do. Just install your target OS and each database engine, then run queries with, let's say, Perl (because I like it).
It won't take you more than one day to do all this; I've done something like this before.
A nice way to benchmark is to write a script that executes queries randomly, or following something like a Gaussian bell curve, "simulating" real usage. Then plot the data, or do it like a boss and just read the logs.
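As a rough sketch of such a script (in Python rather than Perl), the following picks keys from a Gaussian distribution and times each call. The function names and the `execute_query` callback are hypothetical; in a real benchmark the callback would run a query against the database under test.

```python
import math
import random
import time


def run_benchmark(execute_query, num_keys, num_queries, seed=0):
    """Call execute_query(key) with Gaussian-distributed keys; return latencies."""
    rng = random.Random(seed)
    mean, stddev = num_keys / 2, num_keys / 6
    latencies = []
    for _ in range(num_queries):
        # Clamp the Gaussian sample into the valid key range.
        key = min(num_keys - 1, max(0, int(rng.gauss(mean, stddev))))
        start = time.perf_counter()
        execute_query(key)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Running the same workload against each candidate and comparing, say, the 50th and 99th percentile latencies gives a far better basis for a decision than intuition.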

Related

How to handle a table with billions of rows with lots of read and write operations

Please guide me through my problem.
My server receives data every second from different sources. The data is structured; I parse it and have to store the parsed data in a single table, around 5 lakh (500,000) records a day. I also do lots of read operations on this table daily. After some time this table will have billions of records.
How should I solve this problem? I want to know whether I should go with an RDBMS, HBase, or some other option.
My question is regarding what sort of database repository you wish to use: RAM? Flash? Disk?
RAM responds in nanoseconds.
Flash in microseconds.
Disk in milliseconds.
And, of course, you might want to create a hybrid of all three, especially if some keys were "hotter" than others -- more likely to be read over and over.
If you want to do a lot of fast processing, and scale it "wide" (many CPUs in a cluster for faster read performance), you are a likely candidate for a NoSQL database. I'd need to know more about your data model to know whether it would work as a key-value store, and how it might require more internal structure such as JSON/BSON.
Caveat: I am biased towards Aerospike, my employer. Yet you should do some kicking-of-the-tires with us or any other key-value stores you're considering to see if it would work with your data before betting the farm. Obviously, each NoSQL vendor would claim itself to be "the best," but much depends on your use case. A vendor's "solution" will only work well for certain data models. We tend to be best for fast in-memory RAM/Flash or hybrid implementations.
If your table reaches billions of records, an RDBMS definitely won't scale.
Regarding HBase, whether it would be a good solution depends on your requirements.
If you are looking for real-time reads, HBase would only help if you are looking up a specific key. If you want to do random reads on different columns, HBase won't be an ideal solution here. HBase would scale really well for updates.
I would suggest designing your HBase schema efficiently and storing your data in a way that suits your querying.
However, if you are interested in running aggregation queries, you can also map your HBase table to an external table in Hive and run SQL-type queries on your data.
You can use HBase as a NoSQL database in this case. To make search more customized and faster, use ElasticSearch along with HBase.
If your writes are at 1/second, most of the available databases should be able to support this. Since you are looking for a longer-term, persistent store, you should consider a database that scales horizontally, so that you can add more nodes whenever you want to increase capacity. Databases with auto-sharding abilities would be a great fit for you (Cassandra, Aerospike ...). Make sure you choose an auto-sharding database that doesn't require the client/application to manage which data is stored where. In-memory databases would not fit the bill in this case.
When your storage reaches a few terabytes, you may have to worry about database scale and throughput so that your infrastructure cost doesn't bog you down.
Your query patterns will be crucial in choosing the right solution. You may not want to index everything; instead, fine-tune what you index so that you can query on the keys and/or only those data elements within your records that you need. That way the index storage overhead doesn't grow too large, and you keep the cost under control. You should also look for time-range query support in the database solutions, since that seems to be part of your typical query pattern.
Last but not least, you will want your queries processed in the fastest possible time. You should try out Cassandra (good for horizontal scaling, less so on throughput) and Aerospike (good for horizontal scaling, pretty good on throughput).

Follower System, better in MySQL or Redis?

I'm just wondering what solution to choose to implement a follower system?
In MySQL I would have a table
userID INT PRIMARY,
followID INT PRIMARY
And in Redis I would just use a SET and add to the UserID all the followIDs.
What would be faster for, let's say, someone having 2000 followers when you want to list all the followers? (In a table that has about 1M entries.)
What would be faster to find out if two Users follow each other?
Thank you very much!
By modern standards, 1M items are nothing. Any database or NoSQL system will work fine with such volume, so you just have to pick the one you are the most comfortable with.
In terms of absolute performance, Redis will be faster than MySQL on this use case, because:
the whole dataset will be in memory
hash tables are faster than btrees
there is no SQL query to parse or execute
However, please note a relational database is far more flexible than a key/value store like Redis. If you can anticipate all the access paths to your data, then Redis is a good solution. Otherwise you will be better served by a more traditional database.
In my opinion, go with MySQL.
The two biggest points you will think about when making the decision are:
1) Have you thought about your use-cases?
You said you want to implement a follower system. If you're only going to be displaying a list of followers which each user has, then the Redis SET will be enough.
But what if you want to get "a list of users which you are currently following"? You can't dig that up easily from your Redis SET, right? Or what if you wanted to know whether User-X is following User-A? If User-A had 10,000 followers, this wouldn't be easy either, would it?
MySQL is much more flexible when querying different types of results in different scenarios.
2) Do you really need the performance difference?
As you know, Redis IS faster than MySQL in these kinds of cases.
It is a simple Key-Value system, so it will exceed the performance of MySQL.
Check out performance results like these:
http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/
http://ruturaj.net/redis-memcached-tokyo-tyrant-and-mysql-comparision/
But the performance difference between Redis and MySQL really starts to kick in
only after about 5,000 requests/sec.
Otherwise you wouldn't be seeing a difference of more than 50ms.
Performance difference will not be an issue until you have VERY large traffic.
So, after thinking about these two points, MySQL would be a better answer.
Redis will be good only if:
1) The purpose of the set/list is specific, and there is no need for flexibility in the future
2) You feel that the performance difference will actually have an effect on your architecture.
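The queries discussed under point 1 above (followers, following, and "do these two follow each other") can be sketched against the two-column table from the question. This is only an illustration: it uses Python's built-in SQLite as a stand-in for MySQL, the table name `follows` and the helper names are made up, and it assumes a row (userID, followID) means "userID follows followID".

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL
conn.execute(
    "CREATE TABLE follows ("
    " userID INTEGER NOT NULL,"
    " followID INTEGER NOT NULL,"
    " PRIMARY KEY (userID, followID))"
)
# Second index so lookups by followID do not scan the whole table.
conn.execute("CREATE INDEX follows_by_followID ON follows (followID, userID)")
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [(1, 2), (2, 1), (3, 1), (1, 4)])


def following(user):
    rows = conn.execute(
        "SELECT followID FROM follows WHERE userID = ? ORDER BY followID",
        (user,))
    return [r[0] for r in rows]


def followers(user):
    rows = conn.execute(
        "SELECT userID FROM follows WHERE followID = ? ORDER BY userID",
        (user,))
    return [r[0] for r in rows]


def is_following(a, b):
    row = conn.execute(
        "SELECT 1 FROM follows WHERE userID = ? AND followID = ?",
        (a, b)).fetchone()
    return row is not None


def follow_each_other(a, b):
    return is_following(a, b) and is_following(b, a)
```

Each helper is a single indexed lookup, which is why 1M rows is a comfortable size for this kind of table.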
It depends on what you want to do with the data. You gave some examples, but it does not sound as though you are giving a full definition of what the product needs to do. If all you really want to do is show users whether they follow each other, then either is fine, as you are just talking about 2 simple queries. However, what if you want to show two users the intersection of users they share, or make suggestions based on profile data for the users? Then it becomes more interesting, as Redis has functionality to give you the intersection of sets very, very quickly. We're talking orders-of-magnitude differences in speed, not just milliseconds, and the difference grows as there are more users/relationships to parse, because the SQL joins required to get the data can become prohibitive if you want to serve the data in real time.
sadd friends:alex george paul bart
sadd friends:alice mary sarah bart
sinterstore friends:alex_alice friends:alex friends:alice
Note that the above can be done with MySQL as well, but your performance will suffer and it would be something that you are more likely to run as a batch job and then store the results for future use. On the other hand, keep in mind that the largest "friends" network in the world, Facebook, started with MySQL to store relationships. The graphs of those relationships were batched and heavily denormalized for storage in thousands of memcached servers to get decent performance.
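For readers who don't have Redis at hand, the same intersection can be sketched with plain Python sets, together with the "batch job that stores the results" idea mentioned above. The function names are made up for illustration, and this says nothing about how Redis implements SINTERSTORE internally.

```python
from itertools import combinations

friends = {
    "alex": {"george", "paul", "bart"},
    "alice": {"mary", "sarah", "bart"},
}


def shared_friends(a, b):
    # Python's set intersection plays the role of the Redis set intersection.
    return friends[a] & friends[b]


def precompute_shared(friend_sets):
    """Batch-job style: compute and keep the intersection for every user pair."""
    return {
        (a, b): friend_sets[a] & friend_sets[b]
        for a, b in combinations(sorted(friend_sets), 2)
    }
```

Note that the precomputed table grows with the number of user pairs, which is one reason such results tend to be batched and cached rather than computed for everyone.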
Then, if you are looking for more options beyond MySQL or Redis, you might want to read what Michael Stonebraker (he helped create Postgres and Ingres) has to say about using an RDBMS for graph data such as friend relationships: http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/. Of course, he's trying to sell his new VoltDB, but it is interesting food for thought.
So I think you really need to map out the requirements for the app (as I assume it will do more than just show you who your friends are) in terms of both expected load (did you just throw out 2000 or is that really what you expect to handle) and features and budget. Then really examine many of the different options on the market.

Using Hive for real time queries

First of all I wanted to clarify that I am learning about Hive and Hadoop (and big data in general), so excuse the lack of proper vocabulary.
I am embarking on a huge (at least for me) project which requires dealing with enormous quantities of data, something I am not used to, as I have mostly worked with MySQL.
For this project a series of sensors will produce approximately 125.000.000 data points 5 times an hour (15.000.000.000 a day), which is several times more than everything I have ever inserted into every MySQL table combined.
I understand that one approach would be using Hadoop MapReduce and Hive to query and analyze the data.
The problem I am facing is that, from what I could learn, Hive runs mostly like "cron jobs" rather than serving real-time queries: queries may take many hours and require a different infrastructure.
I thought of creating MySQL tables based on the results of Hive queries, as at most the data that would need to be queried in real time would be approximately 1.000.000.000 rows, but I was wondering if this is the right way to go or whether I should look into some other technology.
Is there any technology I should study which is specifically created for real time queries on big data?
Any tip will be much appreciated!
This is a complicated question. Let's start by addressing the technologies that you mention in your question, and go from there:
MySQL: It should be obvious to anyone who has used MySQL (or any other relational DB) that a traditional out-of-the-box installation of MySQL will never support the volumes that you are talking about. The back-of-the-envelope calculations are enough to tell us that: assuming that your sensor inserts are only 100 bytes, you are talking about 15 billion x 100 bytes = 1.5 trillion bytes, or about 1.5 terabytes (roughly 1.36 TiB) per day. That's truly big data, especially if you are planning on storing it for more than a day or two.
Hive: Hive can certainly handle that kind of data volume (I and many others have done it), but as you point out, you don't get real-time queries. Every query will be in batch, and if you need fast queries you'll need to pre-aggregate data.
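The back-of-the-envelope estimate in the MySQL paragraph above can be reproduced in a few lines of Python. The 100 bytes per data point is the answer's own assumption, not a measured value; the constant names are just for illustration.

```python
POINTS_PER_BATCH = 125_000_000  # from the question
BATCHES_PER_HOUR = 5            # from the question
BYTES_PER_POINT = 100           # assumption made in the answer

points_per_day = POINTS_PER_BATCH * BATCHES_PER_HOUR * 24
bytes_per_day = points_per_day * BYTES_PER_POINT

terabytes_per_day = bytes_per_day / 10**12  # decimal terabytes
tebibytes_per_day = bytes_per_day / 2**40   # binary tebibytes

print(points_per_day)               # 15000000000
print(terabytes_per_day)            # 1.5
print(round(tebibytes_per_day, 3))  # 1.364
```

At that rate, a month of raw data is on the order of 45 TB before any indexes or replication.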
Now that brings us to the real question- what kind of queries do you need to run? If you need to run arbitrary, real-time queries and can never predict what those queries might be, then you probably need to look towards comparatively expensive, proprietary data stores like Vertica, Greenplum, Microsoft PDW, etc. These will cost a lot of money, but they and others can handle the load you are talking about.
If on the other hand you can predict with some degree of accuracy the type of queries that will be run, then something like Hive might make sense. Store the raw data there, and use the batch query capabilities to do the heavy lifting and periodically create aggregated data tables in MySQL or another relational database to support your needs for low-latency queries.
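To illustrate what "periodically create aggregated data tables" might mean, here is a small Python sketch that rolls raw readings up into hourly summaries. The record layout (sensor id, timestamp, value) and the function name are assumptions; in practice this step would be a Hive batch query, with the result rows loaded into MySQL.

```python
from datetime import datetime


def hourly_rollup(readings):
    """Aggregate (sensor_id, timestamp, value) tuples into per-hour summaries.

    Returns {(sensor_id, hour_start): {"count": n, "sum": s, "max": m}}.
    """
    summary = {}
    for sensor_id, timestamp, value in readings:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        bucket = summary.setdefault(
            (sensor_id, hour), {"count": 0, "sum": 0.0, "max": value}
        )
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
    return summary
```

The summary has one row per sensor per hour, regardless of how many raw points arrived, which is what keeps the low-latency table small enough for a relational database.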
One more alternative is something like HBase. HBase gives you low-latency access to distributed data, but you lose two critical items that you are probably accustomed to: a query language (HBase doesn't have SQL) and the ability to aggregate data. To do aggregations in HBase, you need to run a MapReduce job, though that job can then go and store its results back into HBase for low-latency access again.

Storing and analysis of historical data - What kind of Database?

I'm currently designing a system that watches the ranks / views of YouTube videos, LOTS of YouTube videos (> 500.000 and growing), on a daily basis.
I'm currently considering storing this in a MySQL database, but what disturbs me, is that the table would grow into billions and trillions of rows, which I don't think would perform well.
I need to analyse this data, for example:
Which videos grew a lot in the time between X and Y
Plot the clicks per day
Plot the clicks per week ...
some more things I don't know yet about
So, what came into my web 2.0 mind was: is there a way a NoSQL database could handle this better? I haven't really learned these (almost) new databases and don't know what they are capable of.
What would your advice be, what type of database to use?
Relational or not? If not, which NoSQL database?
PS: first priority is the fast evaluation and insertion of the results, second is high availability (or just replication)
It is very difficult to recommend a database system, because it always depends. However, considering that Facebook is built on MySQL, performance is probably not a limit on MySQL for you.
What is helpful, and you'll probably have done it already, is sketching out what your table structure should look like. Then also think of the queries you would like to run against the tables.
If you have the right indexes (the main and crucial factor that query speed relies on), you will not have to worry about performance in MySQL. What you should keep in mind (as I've had to learn from experience) is that MySQL has some interesting quirks in how it deals with indexes. Let me give a few examples I had to figure out over time:
if you want to use an index for a range scan, the index cannot be used for ORDER BY anymore
a range column has to be the last in a concatenated index for the full index to be used; the same goes for ORDER BY
For more information, a useful link on mysqlperformanceblog.com: http://www.mysqlperformanceblog.com/2009/09/12/3-ways-mysql-uses-indexes/
In general, if the structure of the database is well thought out and the indexing is good, in my experience it does not actually matter whether you have 10.000 rows or 10 billion; the query time would be about the same.
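As a sketch of "think of the table structure and the queries first", here is one possible layout for the video statistics, using Python's built-in SQLite as a stand-in for MySQL. The table name, column names, and the assumption that each row stores a cumulative view count per video per day are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for MySQL
conn.execute(
    "CREATE TABLE video_stats ("
    " video_id TEXT NOT NULL,"
    " day TEXT NOT NULL,"       # ISO date such as '2012-01-31'
    " views INTEGER NOT NULL,"  # cumulative view count on that day
    " PRIMARY KEY (video_id, day))"
)
conn.executemany("INSERT INTO video_stats VALUES (?, ?, ?)", [
    ("a", "2012-01-01", 100), ("a", "2012-01-02", 150), ("a", "2012-01-03", 400),
    ("b", "2012-01-01", 1000), ("b", "2012-01-02", 1010), ("b", "2012-01-03", 1020),
])


def top_growth(start, end, limit=10):
    """Videos that grew the most between two days (inclusive)."""
    return conn.execute(
        "SELECT video_id, MAX(views) - MIN(views) AS growth"
        " FROM video_stats WHERE day BETWEEN ? AND ?"
        " GROUP BY video_id ORDER BY growth DESC, video_id LIMIT ?",
        (start, end, limit)).fetchall()


def views_per_day(video_id):
    """Daily clicks derived from consecutive cumulative counts."""
    rows = conn.execute(
        "SELECT day, views FROM video_stats WHERE video_id = ? ORDER BY day",
        (video_id,)).fetchall()
    return [(day, views - prev) for (_, prev), (day, views) in zip(rows, rows[1:])]
```

The composite primary key (video_id, day) serves the per-video query directly; the growth query across all videos would likely want an additional index starting with `day`, which is the kind of decision that falls out of writing the queries down first.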

How fast is MySQL compared to a C/C++ program running in the server?

Ok, I have a need to perform some intensive text manipulation operations.
Like concatenating huge texts (say 100 pages of standard text), searching in them, etc. So I am wondering if MySQL would give me better performance for these specific operations, compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides out of the box the tools and functionality you need, then why not give it a try? That should take much less time than coding your own tool. If the performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In Oracle world one has Text Mining and Oracle Text.
There are several good responses that I voted up, but here are some more considerations, in my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
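To show why an index matters, here is a minimal inverted index in Python. It is deliberately simpler than the b-tree or trie structures mentioned above (it is just a dictionary from word to document ids), and the function names are made up for illustration; real engines add stemming, ranking, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split into alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """documents: {doc_id: text} -> {word: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A search now touches only the entries for the query words instead of scanning every document, at the cost of the extra space the index takes.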
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a "blob" column. (Afaik, the InnoDB engine still doesn't handle full-text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
For a bit more difficulty of implementation though you'll see the best performance integrating indexing apps like Xapian, Hyper Estraier or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
Relational databases are normally not good at handling large text data. The performance strength of relational DBs lies in indexing and automatically generated query plans. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and manipulating that data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours, so it must be slower at simple tasks like string manipulation :-)
Of course you must use the correct algorithm to get a good result. There is a useful e-book about string search algorithms with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that since my application required that the text be stored somewhere in the first place already, then the entire process of extracting the text from DB, passing it to the C program, and writing back the result into the DB would overall be less efficient than processing it within the DB??
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indices, which will be hundreds of times faster than directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or fewer records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed, and the same amount transferred over the network, then there probably won't be a big difference either way. If performance is critical then try both and benchmark them (if you don't have time to benchmark both then you probably want to go for whichever is easiest to implement anyway).
MySQL is written in C, so it is not correct to compare it to a C program. It is itself a C program.