We have a medium size e-commerce site. We sell books. On said site we have promotions, user recommendations, regular book pages, related books, etcetera. Quite similar to amazon.com except ofcourse the volume of the site.
We have a traditional LAMP setup, where the M still stands for MariaDB.
TPTB want to log and analyze user behaviour in order to optimize conversion.
Bottom line, each click has to be logged, I think. (I fear)
This will add up to a few million clicks every month. The system has to be able to go back in time at least 3 years.
Questions that might be asked the system are: Given a page (eg: homepage), and clicks on a promotional banner, which color of said banner gives the best conversion. Now split that question into new and returning customers. (Multi-dimensional or A/B-testing) Or, given a view of book A and B, which books do users buy next. The range of queries is going to be very wide. Aggregating the data will be pointless.
I have serious doubts about MySQL's ability to provide a good platform for storing, analyzing and querying this data. We could store the rows, feeding them to MySQL via RabbitMQ as to avoid delays, but query and analyze this data efficiently might not be optimal in MySQL, given 50M rows.
There have been a number of articles about using MongoDB to store analytical data. But all the posts seem to increment a counter in a document (pre-aggregating the data), which is not good enough for us.
The big question is: Is there any database (or other system) that is particularly well-suited to store and analyze data like this? Might MySQL still do the trick? Am I correct in my assessment that MongoDB probably might not be of any added value here?
If I understand correctly, then you only want to have reports with aggregated data done say once a day (As opposed to "live")? If that's the case, I would suggest to use Hadoop, as it allows you to run massive Map/Reduce jobs running this aggregations for you, and then present you with a report. At this amount of data, any "live" solution will just not work.
If you don't want to mess with the complexity of Hadoop and Map/Reduce, then perhaps MongoDB might work. It has quite a powerful aggregation framework that can be tasked to do many aggregations in a sort-of-live environment. It's not really meant for running at every pageview, but it's also not a "let's do this once a day" kinda thing. It depends a little bit on your data aggregation requirements whether the Aggregation Framework can help you, but if it doesn't, then MongoDB also supports Map/Reduce for some more complex tasks (at a slower pace). MongoDB is a quite a good fit, as you can have large write performance - if one node doesn't work, you can always shard to have higher write performance.
If your primary convern is to offer recommendations based on past user choices, you may also consider a graph database like Neo4j or FlockDB.
Those database would allow you to build relationship between buyers and the items they bought (which should be a lot less data to store, since you will have a lot less user data redundancies) which you can use to do some Triadic closure processes- In other words finding out what similar users bought that user 'A' did not buy yet.
I can not say I have done it yet, but I am also seriously looking into this.
Otherwise MongoDB in addition to the Map Reduce paradigm, has now (v 2.4.6) an Aggregation Pipeline Framework that I have found very powerful.
Related
So let's say I have a site with appx. 40000 articles.
What Im hoping to do is record the number of page visits per each article overtime.
Basically the end goal is to be able to visualize via graph the number of lookups for any article between any period of time.
Here's an example: https://books.google.com/ngrams
I've began thinking about mysql data structure -> but my brain tells me it's probably not the right task for mysql. Almost seems like I'd need to use some specific nosql analytics solution.
Could anyone advice what DB is the right fit for this job?
SQL is fine. It supports UPDATE statements that guarantee your count is correct rather than just eventual consistency.
Although most people will just use a log file, and process this on-demand. Unless you are Google scale, that will be fast enough.
There exist many tools for this, often including some very efficient specialized data structures such as RDDs that you won't find in any database. Why don't you just use them?
I am developing a Discussion Forum for my University. For this to manipulate the data i m using CouchDB as database.
I m finding difficulty in designing the structure of my db, in order to maximize the performance of my db.
I want to discuss what is the good practice of designing a document database.
Either we should make only one database as SQL and make 'n' no. of documents in the database.
Or we can make more no of database in order to flatten my db structure.This also reduce the more no. of documents to be developed.
The questions you need to ask are simply this: "How do you want to get data out of your database?"
Database design hinges around the queries to be made, not what is available to be stored.
This is especially important for Document DBs like Couch, since, while it does have a flexible schema, it does not have flexible indexing. By that I mean that because of the granularity of the data, it's quite like that later on, when you need to ask a question that it was not designed to answer, answering that question may well be very expensive. It's much, much cheaper to design your views and other constructs early, when there is little data in the data base rather than later after you have thousands or millions of rows.
RDBMS's, since they tend to have a finer granularity of data, tend to be more nimble to new queries and such later in life. Document DBs, not so much.
So think through your use cases up front, and design around those, and design those early on, it's much less painless now than later.
It's hard to tell the right way to approach modeling your data since you don't give much information. Generally though you want to keep as much data as possible in one database as this allows you to index it together (indexes cannot span more than one database).
Also, since there is no schema enforcement in the database, you can create different types of records in each database. For example, there is nothing wrong with have both user information and forum entries in the same database.
Last, you will most likely want to keep messages and their replies in different records. This is an old but still relevant discussion on this topic: http://www.cmlenz.net/archives/2007/10/couchdb-joins
Cheers.
The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
i can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number, whose uploaders I am following / for each person I follow, select his photos )
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n meaning the number of people I follow, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet, a third approach that comes to mind, is even a less denormalized schema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your times worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take
Add more hardware, Memory, CPU -- Enter cloud hosting
Hows 24GB of memory sound? Most of your importantly accessed DB information can fit just in memory.
Choose a host with expandable SSDs.
Use an events based system in your application to write the "history" of all users. So it will be like so: id, user_id, event_name, date, event_parameters' -- an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture> and most important of all, this table will be in memory. No longer need to worry about write performance. After the records go past i.e. 3 days they can be purged into another table (in non-memory) and included into the query results, if the user chooses to go back that far. By having all this in one table you remove having to do multiple queries and SELECTs to build up this information.
Consider using INNODB for the history/feeds table.
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with using a normalized schema so that you can write quickly and compactly. Then use non transactional (no locking) reads to pull the information back out making sure to use a cursor so that you can process the results as they're coming back as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particular critical implications you don't really need to worry about a lock of the concerns that would normally push you away from transactional reads.
These kind of problems are why currently NOSql solutions used these days. What I did in my previos projecs is really simple. I don't keep user->wall user->history which contains purely feed'ids in memory stores(my favorite is redis). so in every insert I do 1 insert operation on database and (n*read optimization) insert operation in memory store. I design memory store to optimize my reads. if I want to filter user history (or wall) for videos I put a push feedid to a list like user::{userid}::wall::videos.
Well ofcourse you can purely build the system in memstores aswell but its nice to have 2 systems doing what they are doing the best.
edit :
checkout these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice.
Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in similar manner - by sharding data across network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.
I've spent a few days researching the pros and cons of mysql against nosql solutions (specifically mongodb) for my project.
The project needs to be able to eventually scale to handle tens of thousands of simultaneous users - millions of users in total. The site is heavily user focussed and will interact with the database as much if not more than a site like facebook - it is very relational, all functionality is dependant on the relation to the user and their relationship with other users. It's also data heavy - lots of files, images, audio, messaging, personal news feed etc.
I like the look on mongodb a lot, I like the way it works, and I like how it scales - but can't get my head around how this would work for a site such as I describe. Would all interactions for a specific user have to be stored in a single document?
I am however very comfortable using mysql and like the relational aspect of it. I am just worried without a lot of work there will be scalability issues with this project - although perhaps with memcached and sharding this won't be an issue?
I'd like to know from those with experience with the two databases on large projects, out of mysql and mongodb which is the right tool for this particular job?
If the data is highly relational, use a relational database. If it's not, don't. NoSQL is great, don't get me wrong, but it's not suited to all tasks. It may be suited to your task, but the only way to find out is for you to build some tests for your specific usecase. Add a bunch of dummy data (millions if not hundreds of millions of rows). And then load test it.
As far as scaling, that's more of a component of how you build your application than the backend you choose. Do you have a solid schema? Do you have a strong cache layer with write-through caching? Do you access the backend as efficiently as possible (queries and such)? Can you shard based upon your application?
Those are the kind of questions which are appropriate here. Not "which will scale for me better". And not "which is the right tool". Both can do the job fine. Which is best is up to you...
Obviously, there's no silver bullet here. However, I would like to challenge this one assumption you've made:
... it is very relational, all functionality is dependant on the relation to the user and their relationship with other users...
OK, I'd like you to picture having 100M users in a relational database and start building this model. Let's try something simple, grab the names of a user's friends.
How do you get a user's friends? Well you go to the users_friends table. If each user has even just 10 friends, that table contains a billion rows. If users have a more reasonable 100 friends, you now have 10B rows.
So now you have a user and a list of their friends IDs. How do we get their friend's names? Well you go through the list of 100 IDs and pull down each of the friends. Perfect.
So now, if you want to show one user the names of all of their friends, all you have to do is join the 100M record table to the 10B record table. This is not a simple task. Scaling joins becomes exponentially harder and more expensive as the dataset grows.
So, to make this easier, you're probably going to run a for loop and manually collect the records for each friend. You have to do this because the friends are scattered across multiple servers, so each "lookup" has to be done individually.
Already you've broken your "relational model".
What about the friends list? Is keeping a table of 10B records really practical? Why not just keep a list of friend IDs with each user? Why do an extra query.
If you notice the pattern here, we've basically broken down the "very relational" model into something that's effectively key-value lookups. Of course, the key-value model will scale much better. And so, MongoDB seems like a good fit here.
Don't get me wrong, there are lots of good uses for relational databases. But when you're talking about handling millions of individual key-value style requests, you probably want to look at a NoSQL database.
There is no law that you have to build an application with exactly one database. It is often common practice having dedicated backends for particular tasks. E.g. in the context of a Facebook-like application it may make sense to work with a graph-database for storing relations between users - every database has its pros and cons and only would fools implement large backends with only a RDBMS or only a NoSQL db just because they don't know better.
I'm wondering if some other non-relational database would be a good fit for activity streams - sort of like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now, I'm using MySQL but it's pretty taxing (I have tens of millions of activity records) and since they are basically read-only once written and always viewed chronologically, I was thinking that an alternative DB might work well.
The activities are things like:
6 PM: John favorited Bacon
5:30 PM: Jane commented on Snow Crash
5:15 PM: Jane added a photo of Bacon to her album
The catch is that unlike Twitter and some other systems, I can't just simply append activities to lists for each user who is interested in the activity - if I could it looks like Redis would be a good fit (with its list operations).
I need to be able to do the following:
Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
Pull activities for a thing (like "Bacon") in reverse date order
Filter by activity type ("favorite", "comment")
Store at least 30 million activities
Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.
I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and the it is indexed appropriately. It works, but it just feels like the wrong tool for this job.
Is anybody doing anything like this outside of a traditional RDBMS?
Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...
Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change
I'd really, really, suggest stay with MySQL (or a RDBMS) until you fully understand the situation.
I have no idea how much performance or much data you plan on using, but 30M rows is not very many.
If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing a (implicitly clustered) primary key judiciously, and/or denormalising where necessary.
But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.
EDIT:Some other points:
key/value database such as Cassandra, Voldermort etc, do not generally support secondary indexes
Therefore, you cannot do a CREATE INDEX
Most of them also don't do range scans (even on the main index) because they're using hashing to implement partitioning (which they mostly do).
Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAYS)
Your application must do ALL of this itself or manage without it; secondary indexes are really the killer
ALTER TABLE ... ADD INDEX takes quite a long time in e.g. MySQL with a large table, but at least you don't have to write much code to do it. In a "nosql" database, it will also take a long time BUT also you have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.
I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think all can be done with CouchDB views, and the list api.
It seems to me that what you want to do -- Query a large set of data in several different ways and order the results -- is exactly and precisely what RDBMeS were designed for.
I doubt you would find any other datastore that would do this as well as a modern commercial DBMS (Oracle, SQLServer, DB2 etc.) or any opn source tool that would accomplish
this any better than MySql.
You could have a look at Googles BigTable, which is really a relational database but
it can present an 'object'y personality to your program. Its exceptionaly good for free format text
searches, and complex predicates. As the whole thing (at least the version you can download) is implemented in Python I doubt it would beat MySql in a query marathon.
For a project I once needed a simple database that was fast at doing lookups and which would do lots of lookups and just an occasional write. I just ended up writing my own file format.
While you could do this too, it is pretty complex, especially if you need to support it from a web server. With a web server, you would at least need to protect every write to the file and make sure it can be read from multiple threads. The design of this file format is something you should work out as good as possible with plenty of testing and experiments. One minor bug could prove fatal for a web project in this style, but if you get it working, it can work real well and extremely fast.
But for 99.999% of all situations, you don't want such a custom solution. It's easier to just upgrade the hardware, move to Oracle, SQL Server or InterBase, use a dedicated database server, use faster hard disks, install more memory, upgrade to a 64-bit system. Those are the more generic tricks to improve performance with the least effort.
I'd recommend learning about message queue technology. There are several open-source options available, and also robust commercial products that would serve up the volume you describe as a tiny snack.
CouchDB is schema-free, and it's fairly simple to retrieve a huge amount of data quickly, because you are working only with indexes. You are not "querying" the database each time, you are retrieving only matching keys (which are pre-sorted making it even faster).
"Views" are re-indexed everytime new data is entered into the database, but this takes place transparently to the user, so while there might be potential delay in generating an updated view, there will virtually never be any delay in retrieving results.
I've just started to explore building an "activity stream" solution using CouchDB, and because the paradigm is different, my thinking about the process had to change from the SQL thinking.
Rather than figure out how to query the data I want and then process it on the page, I instead generate a view that keys all documents by date, so I can easily create multiple groups of data, just by using the appropriate date key, essentially running several queries simultaneously, but with no degradation in performance.
This is ideal for activity streams, and I can isolate everything by date, or along with date isolation I can further filter results of a particular subtype, etc - by creating a view as needed, and because the view itself is just using javascript and all data in CouchDB is JSON, virtually everything can be done client-side to render your page.