Pulling wall/dashboard data like Facebook, Twitter, Tumblr, etc. - MySQL

I feel this must have been asked elsewhere, but I couldn't figure out the right search words to find an answer. If this is a duplicate, please point me to the correct answer elsewhere.
Services like Facebook, Twitter, Tumblr, and I'm sure a whole host of others allow you to follow other users, whose posts then appear on your wall or dashboard. I'm wondering how, with such large data sets, these services can pull posts so quickly. I assume they are not using a SQL server and doing something like:
SELECT * FROM `posts` WHERE `poster_id` IN ( super long list of users being followed ) ORDER BY `date` DESC LIMIT 10;
The above could have a very large list of user IDs in the IN clause, and it likewise wouldn't work very well with sharding, which all of these large services use.
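(For illustration, here is one approach I could imagine - precomputing each user's feed at write time, so-called fan-out-on-write, into a per-user feeds table - though I have no idea if that's what they actually do. The table and column names below are made up:)

<?php
// Hypothetical fan-out-on-write sketch (PHP/PDO). When a user posts, the post id
// is copied into a per-follower "feeds" table, so reading a wall never needs a
// huge IN list. The tables posts, followers and feeds are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

function publishPost(PDO $pdo, $posterId, $body) {
    $pdo->beginTransaction();
    $pdo->prepare('INSERT INTO posts (poster_id, body, `date`) VALUES (?, ?, NOW())')
        ->execute(array($posterId, $body));
    $postId = (int)$pdo->lastInsertId();
    // Fan out: one feed row per follower. Each shard only ever needs the
    // follower's own rows, so this plays well with sharding by user id.
    $pdo->prepare('INSERT INTO feeds (user_id, post_id, `date`)
                   SELECT follower_id, ?, NOW() FROM followers WHERE followee_id = ?')
        ->execute(array($postId, $posterId));
    $pdo->commit();
}

// Reading the wall is then a cheap single-user query:
$stmt = $pdo->prepare('SELECT p.* FROM feeds f JOIN posts p ON p.id = f.post_id
                       WHERE f.user_id = ? ORDER BY f.`date` DESC LIMIT 10');
$stmt->execute(array(42));
$wall = $stmt->fetchAll(PDO::FETCH_ASSOC);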
So, can anyone describe what kind of queries, algorithms, or databases these services use to display the followed posts?
Edit: Thanks for everyone's responses. It seems like the most likely way of doing this is via a graph database such as GraphDB, Neo4j, or FlockDB, the last of which is Twitter's graph database. With Neo4j, it is done something like what is documented at http://docs.neo4j.org/chunked/milestone/cypher-cookbook-newsfeed.html.
Of course, Google, Facebook, etc., all have their own, internally built or internally modified databases for their unique use cases.

I could name a few techniques for making data processing/fetching faster, but I'm not sure these are the same techniques implemented by Facebook, Twitter, etc., as each of them is built on a different platform and architecture.
Fetching data from cached memory - users fetch data without touching the DB, getting it straight from memory instead (see the memcached sketch after this list).
Splitting the work across different servers - the load is handled by multiple servers to prevent bottlenecks.
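To make the first point concrete, here is a minimal read-through cache sketch using memcached from PHP. The cache key, the 60-second TTL, and the table/query are assumptions for illustration, not anything Facebook or Twitter actually does:

<?php
// Read-through cache: serve the wall from memcached, only touching MySQL on a miss.
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$mc  = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function getWall(Memcached $mc, PDO $pdo, $userId) {
    $key  = 'wall:' . $userId;
    $wall = $mc->get($key);
    if ($wall === false && $mc->getResultCode() == Memcached::RES_NOTFOUND) {
        // Cache miss: hit the DB once, then serve from memory for 60 seconds.
        $stmt = $pdo->prepare('SELECT * FROM posts WHERE poster_id = ? ORDER BY `date` DESC LIMIT 10');
        $stmt->execute(array($userId));
        $wall = $stmt->fetchAll(PDO::FETCH_ASSOC);
        $mc->set($key, $wall, 60);
    }
    return $wall;
}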
If you want to know specifically which stack Facebook uses, you could read this link:
http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/

Check out Open Graph - Twitter and Facebook both use this architecture to retrieve "stories" posted by users. It's a version of the semantic web idea. https://developers.facebook.com/docs/opengraph/ The days of SQL calls are over (thank god). FQL, the Facebook Query Language, still works, but it has largely been deprecated. It's not SQL but a query language that runs against the graph rather than against databases.

Essentially all the really big sites have moved away from SQL servers and towards NoSQL in some form or other (several of the biggest have written their own!). NoSQL databases relax ACID constraints, but as a result they are much better able to scale and handle potentially enormous numbers of requests.
If you google NoSQL you will find lots of information about it.
http://blog.3pillarglobal.com/exploring-different-types-nosql-databases
http://www.mongodb.com/learn/nosql
SQL still has its place, but for a lot of things NoSQL is the way forward.


Related

Database for counting page accesses

So let's say I have a site with approx. 40,000 articles.
What I'm hoping to do is record the number of page visits per article over time.
Basically, the end goal is to be able to visualize, via a graph, the number of lookups for any article over any period of time.
Here's an example: https://books.google.com/ngrams
I've begun thinking about a MySQL data structure, but my gut tells me MySQL is probably not the right tool for this task. It almost seems like I'd need some specific NoSQL analytics solution.
Could anyone advise which DB is the right fit for this job?
SQL is fine. It supports UPDATE statements that guarantee your count is correct, rather than merely eventually consistent.
That said, most people will just use a log file and process it on demand. Unless you are at Google scale, that will be fast enough.
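As a hedged sketch of the SQL approach (the page_views table and its schema are assumptions for illustration): one atomic statement per hit avoids any read-modify-write race, and graphing between any two dates becomes a simple range query.

<?php
// Hypothetical table: page_views(article_id INT, day DATE, views INT,
//                                PRIMARY KEY (article_id, day))
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$articleId = 123; // the article being viewed

// Count a hit: insert the first view of the day, otherwise increment atomically.
$pdo->prepare('INSERT INTO page_views (article_id, day, views)
               VALUES (?, CURDATE(), 1)
               ON DUPLICATE KEY UPDATE views = views + 1')
    ->execute(array($articleId));

// Visualizing lookups between any two dates is then a cheap range scan:
$stmt = $pdo->prepare('SELECT day, views FROM page_views
                       WHERE article_id = ? AND day BETWEEN ? AND ?
                       ORDER BY day');
$stmt->execute(array($articleId, '2014-01-01', '2014-06-30'));
$series = $stmt->fetchAll(PDO::FETCH_ASSOC);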
There are many tools for this job, often built on very efficient specialized data structures, such as the round-robin databases (RRDs) used by RRDtool, that you won't find in a general-purpose database. Why don't you just use one of them?

Combining MySQL and MongoDB

I am building a web application that needs to be scalable. In a nutshell:
We have users, and users have friends, so each user has a friend list. Users can create messages, and messages from your friends are displayed on the homepage. Each message is linked to a location, and messages can be filtered by date: for example, show me all the messages from my friends that were posted yesterday, or all messages from location X.
I am building the application fully in MongoDB, but at the moment I am running into trouble. For example:
On the main page we have the message list of the user's friends; no problem, we use:
$db->messages->find(array('users._id' => array('$in' => $userFriendListGoesHere)));
So then we have our messages. However, each message has a location, so I have to loop through all the messages and fetch each location from another collection; also, multiple users can be bound to a single message, so we have to fetch the user data from yet another collection. In MySQL this is simply a join query; in MongoDB it's two loops. So this is my first question: is this a problem? Does the looping require a lot of resources?
So my idea is to split things between MySQL and MongoDB: use MongoDB to store all the locations (since there are over 350,000 of them and we use lat/long calculations) and MySQL for the messages, users, and friends. So, second question: can you help me with my decision? Should I keep using MongoDB with the loops, or use a combination?
Thanks for reading and your time.
.. in MySQL simply a join query, in MongoDB two loops, and this is my first question: is this a problem?
This is par for the course with MongoDB; in fact, it's a core MongoDB trade-off.
MongoDB is based on the precept that joins do not scale. So it has no joins and leaves you to "roll your own". Some libraries like Morphia (for Java) provide built-in logic for loading references.
PHP has the Doctrine project, which should help with some of this.
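To show what "rolling your own" looks like in practice, here is a sketch in the style of the question's snippet (legacy PHP Mongo driver; the collection and field names are assumptions). The trick is to batch each lookup with $in instead of querying once per message:

<?php
$db = (new MongoClient())->selectDB('app');
$friendIds = array(/* ids of the user's friends */);

$messages = iterator_to_array(
    $db->messages->find(array('users._id' => array('$in' => $friendIds)))
);

// Loop 1: collect the location ids, then fetch all locations in ONE query.
$locationIds = array();
foreach ($messages as $m) {
    $locationIds[] = $m['location_id'];
}
$locations = array();
foreach ($db->locations->find(array('_id' => array('$in' => $locationIds))) as $loc) {
    $locations[(string)$loc['_id']] = $loc;
}

// Loop 2: stitch each location document onto its message in memory.
foreach ($messages as &$m) {
    $m['location'] = $locations[(string)$m['location_id']];
}
unset($m);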
Does the looping require a lot of resources?
Kind of? This will really depend on implementation.
It's obviously going to involve a bunch of back and forth with the DB, but it may be less network traffic than the SQL version. You will need memory space for all of the data coming back. But again, that's not terribly different from SQL.
Really, it's up to you to make all of the trade-offs about how this is implemented and who is keeping what in memory.
Should I keep using MongoDB with the loops?
MongoDB is a great idea when your data is not inherently relational.
In the example you provided, it kinda seems like your data is relational. MySQL and other relational DBs (such as Postgres) are better data stores than MongoDB for relational data. This blog post covers this topic in more detail.
In summary, I'd recommend the following:
Please spend some time analyzing whether your data is inherently relational or not.
If it is not, then MongoDB can give you benefits over using MySQL.
If it is relational, then MySQL is the better solution.
Using both is, of course, possible, but it will create additional work and complexity for you (see the sketch below for what the split might look like). In the long term, is that worth the effort? Only you will know the answer.
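If you do go hybrid, here is a hedged sketch of the split described in the question (table, collection, and field names are all assumptions): MySQL owns users, friends, and messages, while MongoDB owns the 350,000+ locations with a geospatial index.

<?php
$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$mongo = (new MongoClient())->selectDB('app');
$userId = 42;

// Relational side: friends' messages from yesterday, via a plain join.
$stmt = $pdo->prepare('SELECT m.* FROM messages m
                       JOIN friends f ON f.friend_id = m.user_id
                       WHERE f.user_id = ? AND m.created_at >= CURDATE() - INTERVAL 1 DAY');
$stmt->execute(array($userId));
$messages = $stmt->fetchAll(PDO::FETCH_ASSOC);

// Geo side: locations near a lat/long point (assumes a 2d index on "loc").
$near = $mongo->locations->find(
    array('loc' => array('$near' => array(4.89, 52.37)))
)->limit(20);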
Best of luck with your web app!

Database, technologies, engine, etc to use for my website based on these requirements

My website: a search-intensive, location-based social network. At a high level it has components like those we see on Facebook: profiles, feeds, etc. At a low level I am drilling down to things like reservations at hotels and restaurants across the world. So: lots of data, lots of searches, lots of analytics, lots of reads/writes.
Current platform: 1 MySQL database, PHP CodeIgniter, 1 dedicated hosting server. The site is location-aware and worldwide, supports multiple languages and localization, and must be real-time. I plan to add a CDN once launched. This will change once I finalize the exact technologies to use.
Here is the list of items I'm concerned about:
1. Website searching: photos/videos (name, description, people tagged), user-defined tags, comments (like wall comments), posts, blogs, groups, and searching for people by name/email.
2. Mail searches: searching subject, email content, sender, attachments.
3. Storing basic user/system values: user details, system details, schema, etc.
4. Storing & implementing live feeds: real-time feeds based on user activities.
5. Storing & implementing analytics: in-house web analytics for system reporting, plus user analytics for business pages. This includes a mixture of reports/graphs/metrics, so it will be a custom data warehouse.
6. Storing & implementing relationships: find, maintain, and show users' degrees of relationship, and common items between various degrees.
7. Handling API calls so businesses like hotel/restaurant owners can send/receive data.
QUESTION - Can anyone suggest: which database to use - the type and the exact product (relational, document, key-value, graph, etc.); which storage engine to use if an RDBMS (InnoDB may not work in all cases); and which add-on servers/file systems/caches (memcached, etc.)? Should I go normalized or denormalized if I use an RDBMS? Or NoSQL all the way?
MySQL works for some parts, memcached for some, Lucene for some; some parts like the inbox may require a document database, and relationships may require a graph database, but I am not sure which one works for which of the 7 items above, or whether I can use the same platforms/technologies for most of them. My only requirement is open source, so it is free to use and works with PHP. I don't want to implement a separate database or set of technologies for each of the 7 requirements above. Of course, being a social network, performance and scalability are important too.
If you have the bucks, then Oracle will support most of your requirements, which really come down to a standard RDBMS, plus CLOBs, plus full-text search.
MS SQL Server will also support these features, but you are limited to a Windows host.
If you are doing this with open source, I would seriously look at Postgres, as MySQL's future looks uncertain now that it's owned by the world's largest commercial database vendor.
Well, FourSquare is doing most of this with MongoDB, so it must have something going for it.
I don't want to start any holy wars here (though I guess it may not sound like it), but don't use MySQL; just... don't. Besides, these days it's looking more likely that Oracle is trying to kill it. Oracle itself would be a giant waste of money for something like this.
If you want to stay with a relational model, take a look at VoltDB, which has been making some noise as a SQL DB that's actually horizontally scalable.
Personally, I would start with a combination of Mongo, Lucene, and Hadoop/HBase for the data crunching (analytics, relationships, etc.). But really, that would just be an excuse to play with shiny new toys; I don't claim to have much experience with them.
I would seriously rethink PHP as well, but here I go again with the holy wars.
First, if you think the site will grow to anything like other successful sites, you want to scale horizontally, so you'll need a distributed solution - which means some sort of NoSQL solution. But you don't have to choose a single NoSQL product; more and more you're seeing what's termed a polyglot approach: multiple DBs handling specific aspects. Does that seem too complicated? Probably not compared to trying to scale an ill-fitting technology within your architecture.
So store the objects in Cassandra or MongoDB, which provide excellent scale and performance. Then feed the relationship data into a distributed graph database to handle the network links. You'll have a nice blend of technologies that will be more scalable than a single SQL database. But you'll need to review the technical requirements of the various technologies on your own; there are too many decisions involved for me to make a product recommendation.

SQLite or MySQL for a read-mostly website?

Is it practical to use SQLite as the database backend for a website with, say, 300,000 unique visitors a month?
Writes to the database will be pretty limited: users signing up or logging in, adding comments, etc. The vast majority of the use will just be queries fetching content based on a primary key in the URL. I'd like to know whether SQLite can cope as a website backend without ending up dramatically slower than MySQL.
I've seen this SO question and others, but they're not really the same, and they seem like they could be out of date now too. http://www.sqlite.org/whentouse.html suggests it'd be fine, but they might be a bit biased!
SQLite is a very cool product, and with HTML5 on the horizon it's a good idea for any web developer to get acquainted with it. However, you should bear in mind that SQLite does not scale well: if you ever need to share data across multiple web servers, it's going to be very difficult using SQLite as the data substrate.
To ease development, you could look at PDO / dbx_ in PHP, which provides an abstraction layer (i.e., the same code talks to all sorts of databases). However, there are some subtle variations between the way different systems implement things like transactions, as well as variations in SQL, so if you do go down this route I'd recommend maintaining your own abstraction layer between the PHP PDO/dbx calls and your application - think stored procedures implemented in PHP.
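A minimal sketch of such a layer (the class and method names are made up): the application calls named operations, and only this class knows which engine and SQL dialect sit underneath.

<?php
class Db {
    private $pdo;

    public function __construct($dsn, $user = null, $pass = null) {
        $this->pdo = new PDO($dsn, $user, $pass);
        $this->pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }

    // A "stored procedure implemented in PHP": one named operation per query.
    public function articleByKey($id) {
        $stmt = $this->pdo->prepare('SELECT * FROM articles WHERE id = ?');
        $stmt->execute(array($id));
        return $stmt->fetch(PDO::FETCH_ASSOC);
    }
}

// Swapping engines is one line; application code never sees the difference.
$db = new Db('sqlite:/var/www/data/site.db');
// $db = new Db('mysql:host=localhost;dbname=app', 'user', 'pass');
$article = $db->articleByKey(123);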
300,000 unique visitors a month?
Aaaarrrgghhhh! Pet hate. While you need to think about how much money your site will be making in order to plan a budget, unique visitors per month is not a useful metric for capacity planning. Really you want to look at expected hit/page-view rates.
I think you would be fine. SQLite supports multi-threaded access just fine, and as you are mostly reading from it, there shouldn't be a problem. If you are writing to it, it fully supports transactions as well. You have to remember, though, that it's still just one file with no server process, so if you are going to cluster you will be out of luck. Maybe you should check exactly what problems you have with MySQL and solve them.
SQLite is very fast, but it becomes difficult to use once you need to cluster. Then again, pretty much all databases are difficult once you need to cluster. If you are read-oriented, it shouldn't matter too much which one you use. Just make sure you are using memcached.
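For reference, this is roughly what the read-mostly pattern looks like from PHP with PDO's SQLite driver (the file path and schema are assumptions):

<?php
$pdo = new PDO('sqlite:/var/www/data/site.db');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The common case: fetch content by the primary key taken from the URL.
$pageId = (int)$_GET['id'];
$stmt = $pdo->prepare('SELECT title, body FROM pages WHERE id = ?');
$stmt->execute(array($pageId));
$page = $stmt->fetch(PDO::FETCH_ASSOC);

// The rare writes (signups, comments) get a transaction, which SQLite
// fully supports; note that SQLite serializes writers on the single file.
$pdo->beginTransaction();
$pdo->prepare('INSERT INTO comments (page_id, body) VALUES (?, ?)')
    ->execute(array($pageId, 'Nice article!'));
$pdo->commit();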

What are the reasons to store documents in a DBMS when using Alfresco CMS?

I have an interview for an internship with a company that wants to implement a document management system. They are considering open source solutions first, with Alfresco as the top choice, but the decision is not yet final; part of my work there would be to investigate whether Alfresco is the best solution.
From what I have seen in the project description, they would implement Alfresco with a MySQL database, and they don't want to use the DBMS just for document metadata and indexing - they actually want to store the documents inside it. Given the company's profile, the documents would be mostly PDF and .doc files, not images.
I have researched a bit and read all the topics here related to storing files in the database, so as not to duplicate a question. From what I understand, storing BLOBs is generally not recommended, and given the company's profile and its legal archiving obligations, I can see that they will have to store a large number of documents.
I would like to be as prepared as I can for the interview, which is why I would like your opinion on these questions:
What would be your reasons for deciding to store documents in the DBMS, especially bearing in mind that you are installing Alfresco, which stores files in the file system?
Do you have any experience with storing documents in a MySQL database specifically?
All help is very much appreciated. I am really excited about the interview and really want this internship, so this is one of the things I want to understand beforehand.
Thank you!
From my experience with Alfresco, this is going to take a lot of customization of the Alfresco repository, and I wouldn't go there myself. But if I had to, I would answer your questions like this:
Reasons for storing documents in the DBMS instead of the file system could be:
use of DBMS backup/security tools to copy/save/back up the documents,
and this one is probably a good one:
access to these documents could be easier from other applications. I mean, if you're rewriting the storage service anyway, then you can rewrite it so that you store some of the metadata in the new database structure too. This would create some redundancy, but it would make the documents accessible from other systems without having to depend on Alfresco (see the sketch below).
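As an illustration of that second reason, storing and fetching a document as a BLOB is plain SQL from any language. A sketch with an assumed table documents(id, name, content LONGBLOB):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=dms', 'user', 'pass');

// Store a PDF next to its metadata.
$stmt = $pdo->prepare('INSERT INTO documents (name, content) VALUES (?, ?)');
$stmt->execute(array('invoice-001.pdf', file_get_contents('/tmp/invoice-001.pdf')));

// Any other system with a DB connection can now retrieve it - no Alfresco needed.
$stmt = $pdo->prepare('SELECT name, content FROM documents WHERE id = ?');
$stmt->execute(array(1));
$doc = $stmt->fetch(PDO::FETCH_ASSOC);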
I have some experience with applications that use a DBMS as storage: one application stored incoming invoices so that they could be approved, disputed, or sent for payment.
It had decent LAN performance, but then again the company had really good bandwidth. At remote locations, though, it lagged a bit as the documents were transferred back and forth.