Follower System, better in MySQL or Redis?

I'm just wondering which solution to choose to implement a follower system.
In MySQL I would have a table with
userID INT,
followID INT,
PRIMARY KEY (userID, followID)
And in Redis I would just use a SET per user and add all the followIDs to it.
What would be faster for, let's say, someone having 2000 followers, when you want to list all the followers? (In a table that has about 1M entries.)
What would be faster to find out if two Users follow each other?
Thank you very much!

By modern standards, 1M items are nothing. Any database or NoSQL system will work fine with such volume, so you just have to pick the one you are the most comfortable with.
In terms of absolute performance, Redis will be faster than MySQL for this use case, because:
the whole dataset will be in memory
hash tables are faster than B-trees
there is no SQL query to parse or execute
However, please note a relational database is far more flexible than a key/value store like Redis. If you can anticipate all the access paths to your data, then Redis is a good solution. Otherwise you will be better served by a more traditional database.
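For concreteness, here is a minimal sketch of the two access paths from the question in SQL, assuming the composite-key table above and assuming a row (userID, followID) means "followID follows userID" (the names are illustrative):

-- list all followers of user 42 (served by the (userID, followID) primary key)
SELECT followID FROM followers WHERE userID = 42;

-- do users 42 and 99 follow each other?
SELECT EXISTS(SELECT 1 FROM followers WHERE userID = 42 AND followID = 99)
   AND EXISTS(SELECT 1 FROM followers WHERE userID = 99 AND followID = 42);

On the Redis side the equivalents would be SMEMBERS followers:42 for the first, and a pair of SISMEMBER calls for the second.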

In my opinion, go with MySQL.
The two biggest points you will think about when making the decision are:
1) Have you thought about your use-cases?
You said you want to implement a follower system. If you're only going to display the list of followers each user has, then the Redis SET will be enough.
But what if you want "a list of users which you are currently following"? You can't dig that up easily from a single Redis SET of followers, right? You would need to maintain a second set per user and keep both in sync yourself on every follow and unfollow. Any access path you didn't plan for up front becomes awkward like this.
MySQL is much more flexible when querying for different kinds of results in different scenarios.
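As an illustration, here is a sketch against the same hypothetical followers table as above (a row (userID, followID) meaning "followID follows userID"): the reverse lookup is just the same table queried from the other side, and one extra index makes it cheap.

-- everyone user 42 is following
SELECT userID FROM followers WHERE followID = 42;

-- secondary index to support that reverse lookup
CREATE INDEX idx_follow_reverse ON followers (followID, userID);

With the Redis SET you would instead have to maintain and populate a second set per user yourself.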
2) Do you really need the performance difference?
As you know, Redis IS faster than MySQL in these kinds of cases.
It is a simple Key-Value system, so it will exceed the performance of MySQL.
Check out performance results like these:
http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/
http://ruturaj.net/redis-memcached-tokyo-tyrant-and-mysql-comparision/
But the performance difference between Redis and MySQL only really starts to kick in
at around 5,000 requests/sec.
Below that, you wouldn't be seeing a difference of more than about 50ms.
The performance difference will not be an issue until you have VERY large traffic.
So, after thinking about these two points, MySQL would be a better answer.
Redis will be good only if:
1) The purpose of the set/list is specific, and there is no need for flexibility in the future
2) You feel that the performance difference will actually have an effect on your architecture.

It depends on what you want to do with the data. You gave some examples, but it does not sound as though you really have a full definition of what the product needs to do yet. If all you really want to do is show users whether they follow each other, then either is fine, as you are just talking about 2 simple queries.
However, what if you want to show two users the intersection of the users they share, or make suggestions from the relationship data combined with the users' profile data? Then it becomes more interesting, because Redis has functionality to give you the intersection of sets very, very quickly. We're talking magnitude differences in terms of speed, not just milliseconds, and the difference gets larger as there are more users and relationships to parse, since the SQL joins required to get the data can become prohibitive if you want to serve it in real time.
sadd friends:alex george paul bart
sadd friends:alice mary sarah bart
sinterstore friends:alex_alice friends:alex friends:alice
Note that the above can be done with MySQL as well, but your performance will suffer, and it would be something you are more likely to run as a batch job, storing the results for future use. On the other hand, keep in mind that the largest "friends" network in the world, Facebook, started with MySQL to store relationships. The graphs of those relationships were batched and heavily denormalized for storage in thousands of memcached servers to get decent performance.
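For reference, here is roughly what that intersection looks like in SQL (a sketch with hypothetical table and column names mirroring the sets above); the self-join below is exactly the part that gets expensive as the data grows:

CREATE TABLE friends (
    username VARCHAR(64) NOT NULL,
    friend   VARCHAR(64) NOT NULL,
    PRIMARY KEY (username, friend)
);

-- friends that alex and alice have in common (the SINTERSTORE above)
SELECT a.friend
FROM friends a
JOIN friends b ON a.friend = b.friend
WHERE a.username = 'alex'
  AND b.username = 'alice';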
Then, if you are looking for more options beyond MySQL or Redis, you might want to read what Michael Stonebraker has to say (he helped create Postgres and Ingres) about using an RDBMS for graph data such as friend relationships: http://gigaom.com/2011/07/07/facebook-trapped-in-mysql-fate-worse-than-death/. Of course, he's trying to sell his new VoltDB, but it is interesting food for thought.
So I think you really need to map out the requirements for the app (as I assume it will do more than just show you who your friends are) in terms of both expected load (did you just throw out 2000, or is that really what you expect to handle?) and features and budget. Then really examine the different options on the market.

Related

Bus scheduling - relational database or NoSQL

I'm trying to store bus schedules in a database and I'm wondering which database model is suitable for my case.
I have bus operators; each operator has several routes, each route has several turns, each turn has stops, etc. Turns are generated from something called a "turn master", where the scheduling (frequency, stops, etc.) is defined for the next N days.
I need searching to be very fast when a user looks for a bus from city to city on a given date.
I'm using MySQL; the number of stops is around 100,000 records and searching is fast so far, but I'm not sure it will stay fast when the data gets really big (thousands of operators, each with hundreds of turns, each turn with around 10 stops, and turns generated for roughly the next 30 days).
Basically, performing a search means looking through the stops (city/town/place, time) and checking whether they match the user's search criteria.
So, my question is: is a relational database best in this case? Or would some kind of NoSQL work better when the data gets really big?
Thanks in advance,
NoSQL databases are designed to work with unstructured data or data which is structured in various or unpredictable ways. Your data is structured in a very well understood and predictable way.
What makes you think that a relational database isn't the right answer for your application? Having a lot of rows doesn't mean your relational queries are going to be slow. The performance of your application will depend on having proper indexing, but even more importantly, it will depend on your application logic. What heuristic are you using for solving the travelling salesman problem? How you do your routing could potentially have a bigger impact on system performance than your data storage choices.
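As a minimal sketch of what "proper indexing" could look like for this kind of search (the table and column names are hypothetical, purely to illustrate the idea):

-- one row per scheduled stop of a turn
CREATE TABLE stops (
    turn_id   INT NOT NULL,
    place     VARCHAR(100) NOT NULL,  -- city/town/place
    stop_time DATETIME NOT NULL,
    PRIMARY KEY (turn_id, stop_time)
);

-- supports "departures from a given place on a given date" directly
CREATE INDEX idx_place_time ON stops (place, stop_time);

SELECT turn_id, stop_time
FROM stops
WHERE place = 'Berlin'
  AND stop_time BETWEEN '2012-06-01 00:00:00' AND '2012-06-01 23:59:59';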

Storing and analyzing historical data - What kind of database?

I'm currently designing a system that watches the ranks/views of LOTS of YouTube videos (>500,000 and growing) on a daily basis.
I'm currently considering storing this in a MySQL database, but what disturbs me is that the table would grow into billions (if not trillions) of rows, which I don't think would perform well.
I need to analyse this data, for example:
Which videos grew a lot in the time between X and Y
Plot the clicks per day
Plot the clicks per week ...
some more things I don't know yet about
So, what came to my Web 2.0 mind was: is there a way a NoSQL database could handle this better? I haven't really studied these (fairly) new databases and don't know what they are capable of.
What would your advice be, what type of database to use?
Relational or not? If not, which NoSQL database?
PS: first priority is the fast evaluation and insertion of the results, second is high availability (or just replication)
It is very difficult to give advice on a database system, because it always depends. However, considering that Facebook is built on MySQL, performance is probably not going to be a limit for you on MySQL.
What is helpful, and you have probably already done this, is to create a sketch of what your table structure should look like. Then also think of the queries you would like to run against those tables.
If you have the right indexes (which are the main and crucial factor query speed relies on), you will not have to worry about performance in MySQL. What you should consider (and what I've had to learn from experience) is that there are many interesting details in how MySQL deals with indexes. Let me give a few examples I had to figure out along the way:
if you want to use an index for a range scan, the index cannot be used for ORDER BY anymore
a range column has to be the last one in a concatenated index for the full index to be used, and the same goes for ORDER BY again (see the sketch below)
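To make the second point concrete, here is a minimal sketch (the table and column names are hypothetical):

-- daily view counts per video
CREATE TABLE video_stats (
    video_id  INT NOT NULL,
    stat_date DATE NOT NULL,
    views     INT NOT NULL,
    PRIMARY KEY (video_id, stat_date)  -- range column last
);

-- equality on video_id plus a range on stat_date can use the whole index,
-- and the index order also satisfies the ORDER BY:
SELECT stat_date, views
FROM video_stats
WHERE video_id = 42
  AND stat_date BETWEEN '2012-01-01' AND '2012-01-31'
ORDER BY stat_date;

-- had the key been (stat_date, video_id) instead, the range on stat_date
-- would prevent MySQL from using the video_id part of the index.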
For more information, a useful link on mysqlperformanceblog.com: http://www.mysqlperformanceblog.com/2009/09/12/3-ways-mysql-uses-indexes/
In general, if the structure of the database is well thought out and the indexing is good, in my experience it does not actually matter whether you have 10,000 rows or 10 billion; the query time would be about the same.

Optimizing a database with multiple JOINs

First, some details about the website and the database structure -
On my website you can learn English words, and you can attach to each word a sentence, an association, an image; in addition, each word has a category, sub-category, group...
My database includes about 20 tables. Any user who registers on my website 'adds' to the users table something like 4,000 rows - the number of words on the site. I have a serious problem when a user filters words (something like a word 'search', but by character(s), category(s), group(s), etc.): I have 9 JOINs in my SQL query, and it takes about one minute to display the results.
The purpose of the JOINs: inside the users table (where each user has 4,000 rows, each row being a word) there are joins in this style:
$this->db->join('users', 'sentences.id = users.sentence_id', 'left');
The same goes for associations, groups, images, bindings between words, etc.
The users table includes the ids of sentences, associations, groups..., and the JOINs make the connection.
I don't know what to do; it takes too much time. Maybe the problem is the structure of the database? The multiple joins? Maybe I should use indexing? But how and where? Sometimes it's necessary to retrieve all the words, so indexing alone might not help.
I'm using MySQL.
First of all, if you're using that many joins, indexes will not save you (as they will not be used in the joins most of the time).
There are a few things you can do.
Schema Design
You probably would want to reconsider your schema design/query if you need 9 joins to achieve what you are doing!
From the looks of it, it seems your tables are very normalized, perhaps in 3rd normal form. In that case, consider denormalizing your tables into a larger one to avoid joins (many small joins can be more expensive than a scan of one denormalized table!). There is a lot of documentation on this online; however, there are always costs, as it increases development complexity and data redundancy. On the plus side, by denormalizing your tables you avoid joins and can make better use of indexes.
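As a rough illustration of the idea (purely hypothetical names; the real schema obviously has more going on):

-- instead of joining words -> sentences -> categories -> groups per request,
-- keep a flattened per-user copy of the fields the filter screen needs:
CREATE TABLE user_words_flat (
    user_id     INT NOT NULL,
    word_id     INT NOT NULL,
    word        VARCHAR(100) NOT NULL,
    sentence    TEXT,
    category    VARCHAR(50),
    subcategory VARCHAR(50),
    group_name  VARCHAR(50),
    PRIMARY KEY (user_id, word_id),
    KEY idx_filter (user_id, category, subcategory, group_name)
);

-- the 9-join filter query becomes a single-table lookup:
SELECT word, sentence
FROM user_words_flat
WHERE user_id = 7 AND category = 'verbs' AND group_name = 'basic';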
Also, I believe MyISAM is the only storage engine in MySQL that supports FULL TEXT indexes. However, it has no transactions, uses table-level locking, and has no MVCC, so it depends on what you need.
Resources
I suggest you have a read of the book High Performance MySQL - a truly awesome book on tuning MySQL databases.
I also suggest reading the official documentation on your chosen storage engine. This is significant, as each storage engine is VERY DIFFERENT! InnoDB is completely different from MyISAM, which is in turn completely different from PBXT. Each engine has its benefits, and you will have to consider which one fits your situation.
I would draw out the relational schema and work out the number of operations for the queries you are running, and go from there. Most DBMSs attempt to optimise queries implicitly, but not always optimally. You should look into re-ordering the joins so that the most restrictive ones are carried out first. Indexes could help, and again, would require some analysis to find which attributes you are searching on.
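In MySQL specifically, one way to experiment with join order is the STRAIGHT_JOIN modifier, which forces the tables to be joined in the order written (a sketch with hypothetical names; whether it helps depends entirely on your data):

-- filter the most restrictive table first, then join outwards
SELECT w.word
FROM categories c
STRAIGHT_JOIN words w ON w.category_id = c.id
WHERE c.name = 'verbs';

Compare the EXPLAIN output with and without the modifier before keeping it.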
Building databases to deal with natural language is a very challenging area, and there is a lot of research on the subject. Have you looked into Markov chains? Have you taken a step back and thought about the computational complexity of what you are trying to do? If you arrive at the same conclusion of nine joins, then it may be fair to say that the problem does not scale well enough for a real-time application.
As an aside, I believe Google App Engine's data store attempts to index attributes for you, with implicit scalability. If you're running your database on a small web server, then you may see better results deploying it with a more comprehensive DBMS. I would only look into this as a last resort, however.

NoSQL vs MySQL - Going schemaless with Cassandra

Here are the facts:
We have a lot (L O T) of data coming in everyday.
Each file we receive is in CSV format, and while there are a couple of headers that recur more often than others, there is not really a standard.
Normalizing each file to be uploaded into a MySQL database is highly time-consuming and often pushes us to change the schema (a new field appears in one file that did not exist before...).
While the primary key is unique, anything else can be duplicated
These are customers records (i.e.: email,firstname,lastname,city,state,address...etc)
We could have multiple emails for the same individual...
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
Speed is what we are looking for. MySQL is too slow answering queries once tables are over 50 million records; even well optimized, we have too many speed issues. Breaking down the tables has become an organizational concern. Schemaless NoSQL seemed attractive. What would you recommend, and what did you implement? (Please do not answer "optimize MySQL" - pointless and off topic.)
--
Let's go over the points:
We have a lot (L O T) of data coming in everyday.
NoSQL solutions are basically all created to scale to large numbers (Riak, MongoDB, Cassandra, etc.)
... headers that recur more often than others, there is not really a standard... Normalizing each file to be uploaded into a MySQL database is highly time-consuming and often pushes us to change the schema
NoSQL definitely fits this model; many of these systems are "schema-less", so it's easy to store those extra fields. This will, however, cost you extra space, as the field names are typically stored with each document.
While the primary key is unique, anything else can be duplicated
"Document-oriented" and "Key-Value" databases are a good fit for this as long as the key is provided. If you have to run duplicate checks, then most key-value database are ill-equipped. The "document-oriented" database might be slightly better equipped, but not by much.
We could have multiple emails for the same individual
Most of these databases have some notion of "arrays as a basic type". CouchDB and MongoDB both store objects as JSON, so it's easy to see how a customer could have an array of e-mails without the need for a "join table". MongoDB also provides "atomic update" features like $addToSet that play nicely with arrays.
We read 70% of the time and we write 30% of the time
Scalability could be a concern but it is not right now, though availability is key
The major NoSQL DBs are all designed to scale. (both reads and writes)
The only way to get availability is through hardware and locational redundancy (no different than MySQL or other databases). Despite their low version numbers, many of these databases are being used in production environments by very big companies, so many of the simple cases are covered. It's still virgin territory, but we're also past the "randomly crashes when nothing has changed" phase.
Speed is what we are looking for... Schemaless NoSQL seemed attractive. What would you recommend, what did you implement?
We have hundreds of millions of flexible user records in MongoDB. Performance on individual seeks is really awesome.
However, you have to be wary about the type of queries you're running.
If you need to run queries that bring back several users at once, you're going to have speed issues with basically any of these key-value or document-oriented databases. You may want to look at a graph database or some other fancier solution. However, if your use cases all center around one user at a time, then take a look at MongoDB.
MongoDB also supports native map-reduce so you'll be able to scale "non-real time" queries.

How can I optimize my database?

I am creating a platform for some clients. Each client needs to have contacts and manage them in groups, categories (which depend on the group) and subcategories (which depend on the category).
The database is going to be very big, and I'm afraid about the performance. I want to optimize the database; right now, I have these options:
Manage only one database with multiple tables (as we manage now)
Create a database for each client (each database will have the same multiple tables as the option 1)
Manage multiple XML files (like option 2, each client will have a directory with an XML for contacts, another XML file for groups, another for categories, and so on)
Which is the best option for performance and for managing the data (CRUD: create, read, update, delete)?
Thanks!!
I think one database with multiple tables is the way to go, because duplicating the database and schema for each new client doesn't scale well. XML files sound cool, but so far I haven't seen an XML read/write engine that is as fast as most RDBMSes, so bin that one.
To make this work (lots of tables in one database) you should pay attention to indexing and optimizing the one database; indexes in particular will help you maintain speed as you scale up.
Use a clustered index on the clientId in whichever table it exists in as a foreign key. This will give you the best client-centric performance, because you would (usually) be pulling one particular client's info in a page fetch.
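In MySQL/InnoDB the primary key is the clustered index, so one way to get that layout (a sketch with hypothetical names) is to lead the primary key with clientId:

CREATE TABLE contacts (
    clientId  INT NOT NULL,
    contactId INT NOT NULL,
    name      VARCHAR(100),
    groupId   INT,
    PRIMARY KEY (clientId, contactId)  -- rows stored clustered by client
) ENGINE=InnoDB;

-- a page fetch for one client then reads a contiguous range of rows:
SELECT contactId, name FROM contacts WHERE clientId = 7;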
For #2, I would suggest making that a premium service to your clients. If they want "priority hosting" on a separate server of "their own" then they pay extra. That will make the maintenance headache worthwhile.
Have you tried actually implementing option 1 (which is the easiest)?
Did you profile the code?
What is the performance now?
Did you use EXPLAIN to see how the queries are performing? (See the sketch after this list.)
Do you use indexes (often the correct indexes alone are enough to give excellent performance gains)?
Optimize when you hit a bottleneck (or when you miss certain performance benchmarks you have set), not during the design phase...
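A minimal example of that EXPLAIN workflow, reusing the hypothetical contacts table from above:

EXPLAIN SELECT * FROM contacts WHERE clientId = 7 AND groupId = 3;

-- if the plan shows a full table scan (type: ALL) or examines too many rows,
-- add an index covering the filter columns and re-run EXPLAIN to confirm:
CREATE INDEX idx_client_group ON contacts (clientId, groupId);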
UPDATE: You mentioned "millions of entries". That's nothing for MySQL (provided you use the correct indexes on your tables). I have a table with about 40 million rows, and although it's not lightning fast, it gives me results in a couple of seconds. So there you go...
Option 3 is not advisable. Search etc. is not what XML files do efficiently.
Option 2 is a maintenance problem.
Option 1 should be doable. "Very big" means what? I have a database with a table currently at 1.5 billion entries - that is "big", not "very big". What do you define as very big?
As far as ongoing maintenance and support goes I think only option 1 makes sense for you.
Index all the columns you need to, but nothing more. Look at your code, see how tables are being JOINed, and index the columns which would otherwise require a table scan.
Indices will speed up the read operations but slow down your write operations, as you need to update the indices as well as the columns. They also need more space in the DB.
As suggested above, use EXPLAIN to see how your queries are executing and what can be optimized there.
Finally, performance tuning only works well after you baseline your existing performance, make a change, then baseline performance again to see if it helped. If not, roll back and try something else. But always start from a known level of performance; otherwise you might end up making multiple changes which in total slow things down. Good luck!