I have been trying to learn database design techniques on my own. Since Facebook is an example of a huge data-processing system, I was wondering how they process that huge amount of data. I came to know that they use MySQL as the core database engine and Memcached to cache data and reduce database access.
I just want to know how they store text data such as statuses or comments. Do they simply store it in some MySQL table, or do they use some other technique?
Additionally, it would be a bonus if anyone could provide information about how they store media such as images or videos.
(If asking for that kind of information about an organization is illegal or unethical, then I am sorry for asking.)
Here are two relevant notes from the Facebook engineering team:
TAO: The power of the graph
Needle in a haystack: efficient storage of billions of photos
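The MySQL-plus-Memcached combination mentioned in the question is typically used in a "cache-aside" fashion: read from the cache first, fall back to the database on a miss, then repopulate the cache. Here is a minimal sketch of that pattern in Python; the pymemcache client and the fetch_status_from_db helper are stand-ins for illustration, not Facebook's actual code:

    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))
    CACHE_TTL = 300  # seconds before a cached entry expires

    def fetch_status_from_db(status_id):
        # Stand-in for a real MySQL lookup, e.g.
        #   SELECT body FROM statuses WHERE id = %s
        return "status text for %d" % status_id

    def get_status(status_id):
        key = "status:%d" % status_id
        cached = cache.get(key)                      # 1. try the cache
        if cached is not None:
            return cached.decode("utf-8")
        body = fetch_status_from_db(status_id)       # 2. miss: hit the database
        cache.set(key, body.encode("utf-8"), expire=CACHE_TTL)  # 3. repopulate
        return body

The point of the pattern is simply that the vast majority of reads never reach MySQL at all.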
Related
I have a video surveillance project running on a cloud infrastructure and using MySQL database.
We are now integrating some artificial intelligence into our project, including face recognition, plate recognition, tag search, etc., which implies a huge amount of data every day.
All the photos, and the images derived from those photos by image-processing algorithms, are stored in cloud storage, but their references and tags are stored in the database.
I have been thinking about the best way to integrate this: should I stick with MySQL or use another system? The options I have considered are:
1- Use another database, MongoDB, to store the photo references and tags. This would cost me another database server, as well as the work of integrating a new database system alongside the existing MySQL server.
2- Use Elasticsearch to retrieve data and perform tag searching. This still raises the question of whether MySQL can store this amount of data.
3- Stick purely with MySQL, but will the user experience be impacted?
Could you guide me to the best option, or suggest another approach?
EDIT:
For more information:
The physical pictures are stored in cloud storage; only the URLs are stored in the database.
In the database, we will store the picture's metadata: id, client id, URL, tags, creation date, etc.
Operations will generally be SELECTs based on different criteria, plus searches by tags.
How big is the data?
Imagine a camera placed outdoors in the street; each time it detects a face, it sends an image.
Now imagine thousands of cameras doing the same. We are talking about millions of images per client.
MySQL can handle billions of rows. You have not provided enough other information to comment on the rest of your questions.
Large blobs (images, videos, etc.) are probably best handled by some large, cheap storage. Then, as you say, a URL to the blob is stored in the database.
How many rows? How frequently will you insert? What are some of the desired SELECT statements? Is it mostly just writing to the database, or will you have large, complex queries?
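To make the "URL in the database" part concrete, a metadata table like the one described in the question (id, client id, URL, tags, creation date), plus a separate tag table to make tag search indexable, could look roughly like this. The names are mine, and sqlite3 only stands in for MySQL so the sketch is self-contained:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL server
    conn.executescript("""
    CREATE TABLE photos (
        id         INTEGER PRIMARY KEY,
        client_id  INTEGER NOT NULL,
        url        TEXT    NOT NULL,      -- the image bytes stay in cloud storage
        created_at TEXT    NOT NULL
    );
    CREATE TABLE photo_tags (
        photo_id   INTEGER NOT NULL REFERENCES photos(id),
        tag        TEXT    NOT NULL,
        PRIMARY KEY (photo_id, tag)
    );
    CREATE INDEX idx_tags_tag      ON photo_tags(tag);               -- tag search
    CREATE INDEX idx_photos_client ON photos(client_id, created_at); -- per-client listing
    """)

    # Typical query: latest photos for one client that carry a given tag.
    rows = conn.execute("""
        SELECT p.id, p.url, p.created_at
        FROM photos p
        JOIN photo_tags t ON t.photo_id = p.id
        WHERE p.client_id = ? AND t.tag = ?
        ORDER BY p.created_at DESC
        LIMIT 50
    """, (42, "face")).fetchall()

With indexes like these, millions of rows per client is well within what MySQL handles comfortably.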
I feel this must have been asked elsewhere, but I couldn't figure out the right search terms to find an answer. If this is a duplicate, please point me to the correct answer.
Services like Facebook, Twitter, Tumblr, and I'm sure a whole host of others allow you to follow other users. Their posts then appear on a wall or dashboard. I'm wondering how, with such large data sets, these services can pull posts so quickly. I assume they are not using a SQL server and they are not doing something like:
SELECT * FROM `posts` WHERE `poster_id` IN ( super long list of users being followed ) ORDER BY `date` LIMIT 10;
The above could have a very large list of user ids in it, and it likewise wouldn't work very well with sharding, which all of these large services use.
So, can anyone describe what kind of queries, algorithms, or databases these services use to display the followed posts?
Edit: Thanks for everyone's responses. It seems like the most likely way of doing this is via a graph database such as GraphDB, Neo4j or FlockDb, the latter of which is Twitter's graph database. With Neo4j, it is done something like what is documented at http://docs.neo4j.org/chunked/milestone/cypher-cookbook-newsfeed.html.
Of course, Google, Facebook, etc., all have their own, internally built or internally modified databases for their unique use cases.
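Roughly, the shape of such a newsfeed query through Neo4j's official Python driver is as follows; the labels, relationship types, and connection details here are illustrative rather than copied from the cookbook:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    FEED_QUERY = """
    MATCH (me:User {name: $name})-[:FOLLOWS]->(friend:User)-[:POSTED]->(post:Post)
    RETURN friend.name AS author, post.text AS text, post.date AS date
    ORDER BY post.date DESC
    LIMIT 10
    """

    def newsfeed(name):
        # One graph traversal replaces the huge IN (...) list from the SQL version.
        with driver.session() as session:
            return [dict(record) for record in session.run(FEED_QUERY, name=name)]

The appeal is that the list of followed users never has to be materialised in the query at all; the traversal starts from the one user node and walks outward.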
I could name a few techniques for processing and fetching data faster, but I'm not sure these are the same techniques implemented by Facebook, Twitter, etc., as each of them is built on a different platform and architecture.
Fetching the data from a cache - users fetch data from memory without touching the DB at all.
Splitting the work across different servers - the load is handled by multiple servers to prevent bottlenecks (see the sketch below).
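A common way to do that split is to pick the server deterministically from the user id, so every reader and writer agrees on where a given user's data lives. A toy illustration; the shard list and the hashing scheme are invented for the example, and real systems often use something more elaborate such as consistent hashing:

    import hashlib

    # Hypothetical shard map; in reality this comes from configuration.
    SHARDS = ["db1.example.com", "db2.example.com", "db3.example.com", "db4.example.com"]

    def shard_for_user(user_id):
        # A deterministic hash means every app server routes the same user
        # to the same database shard.
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # All of user 12345's posts are written to and read from one shard:
    print(shard_for_user(12345))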
If you want to know specifically which stack Facebook uses, you can read this link:
http://royal.pingdom.com/2010/06/18/the-software-behind-facebook/
Check out Open Graph - Twitter and Facebook both use this architecture to retrieve "stories" posted by users. It's a version of the semantic web idea: https://developers.facebook.com/docs/opengraph/ The days of raw SQL calls are over (thank god). FQL, the Facebook Query Language, still works but is largely being deprecated. It's not SQL, but a query language that runs against the graph rather than against relational databases.
Essentially all the really big sites have moved away from SQL servers and towards NoSQL in some form or other (several of the biggest have written their own!). NoSQL databases relax ACID constraints, but as a result they scale much better and can handle potentially enormous numbers of requests.
If you google NoSQL you will find lots of information about it.
http://blog.3pillarglobal.com/exploring-different-types-nosql-databases
http://www.mongodb.com/learn/nosql
SQL still has its place, but for a lot of things NoSQL is the way forward.
I'm writing an app that needs to handle more than 15,000 photos, and I want to store their EXIF and IPTC attributes in the database.
My initial approach is to use MySQL and create a table to store all the attributes, as suggested here.
However, most of the photos have up to 250 attributes. Since I have 15k photos, that means almost 4 million rows, and this is only the beginning (I expect more photos in the future).
I wonder whether MySQL would be OK in this scenario, or whether I should move to a NoSQL approach like MongoDB.
Please also note that I need to make the database searchable.
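For reference, the one-row-per-attribute table I have in mind looks roughly like this (the column names are just placeholders, and sqlite3 merely stands in for MySQL to keep the sketch self-contained):

    import sqlite3

    conn = sqlite3.connect(":memory:")  # the real app would point at MySQL
    conn.executescript("""
    CREATE TABLE photo_attributes (
        photo_id INTEGER NOT NULL,      -- FK to the photos table
        name     TEXT    NOT NULL,      -- e.g. 'EXIF:FocalLength', 'IPTC:Keywords'
        value    TEXT,
        PRIMARY KEY (photo_id, name)
    );
    CREATE INDEX idx_attr_name_value ON photo_attributes(name, value);  -- for searching
    """)

    # 15,000 photos x ~250 attributes each is roughly 3.75 million rows here.
    matches = conn.execute("""
        SELECT DISTINCT photo_id
        FROM photo_attributes
        WHERE name = ? AND value = ?
    """, ("EXIF:Model", "Canon EOS 5D")).fetchall()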
Thanks in advance.
If you're a .NET developer, RavenDB is ideal for your scenario. It can easily handle that volume on very modest hardware, and it has outstanding search capabilities provided by its internal use of the Lucene search engine.
The photos themselves would be stored as attachments, while the attributes would be part of the document.
Even if you're not a .NET developer, RavenDB can be used over HTTP/REST from any language. It's just much easier with the native .NET client.
I have an interview for an internship with a company that wants to implement a document management system. They are primarily considering open-source solutions, with Alfresco as their top choice, but the decision is not yet final; part of my work there would be to investigate whether Alfresco is the best solution.
From the project description, they would implement Alfresco with a MySQL database, and they don't want to use the DBMS just for document metadata and indexing: they actually want to store the documents inside it. Given the company's profile, the documents would be mostly PDF and .doc files, not images.
I have researched a bit and read all the topics here related to storing files in the database, so as not to duplicate a question. From what I understand, storing BLOBs is generally not recommended, and given the company's profile and its legal archiving obligations, I can see they will have to store a large number of documents.
I would like to be as prepared as I can for the interview, which is why I would like your opinion on these questions:
What would your reasons be for deciding to store documents in the DBMS (especially bearing in mind that you are installing Alfresco, which stores files in the filesystem)?
Do you have any experience with storing documents in MySQL specifically?
Any help is very much appreciated. I am really excited about the interview and really want this internship, so this is one of the things I want to understand beforehand!
Thank you!
From my experience with Alfresco, this is going to take a lot of customization of the Alfresco repository. I wouldn't go there myself. But if I had to, I would answer your questions like this:
Reasons for storing documents in the DBMS instead of the filesystem could be:
use of DBMS backup/security tools to copy, save, and back up the documents,
and this one is probably a good one:
access to these documents could be easier from other applications. I mean, if you're rewriting the storage service anyway, then you can rewrite it so that you store some of the metadata in the new database structure too. This would create some redundancy, but it would make the documents accessible from other systems without having to depend on Alfresco.
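If the documents did end up in MySQL, the storage table itself would be simple; the hard parts are the ones above (backups, streaming large files, and rewriting Alfresco's content store). A bare-bones illustration, with made-up column names and sqlite3 standing in for MySQL (where the content column would be a LONGBLOB):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        filename  TEXT NOT NULL,
        mime_type TEXT NOT NULL,
        content   BLOB NOT NULL        -- the PDF/.doc bytes themselves
    )""")

    def store_document(path, mime_type):
        # Read the whole file into memory and insert it as a BLOB.
        with open(path, "rb") as f:
            data = f.read()
        cur = conn.execute(
            "INSERT INTO documents (filename, mime_type, content) VALUES (?, ?, ?)",
            (path, mime_type, sqlite3.Binary(data)),
        )
        conn.commit()
        return cur.lastrowid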
I have some experience with applications that use the DBMS as storage: the application was used to store incoming invoices so that they could be approved, disputed, or sent for payment.
It had decent LAN performance, but the company had really good bandwidth. At remote locations, though, it lagged a bit as the documents were transferred back and forth.
My question is similar to one another user posted here. We are trying to develop an application, based on a land registry in Paraguay, that may have to support terabytes of information, including images as well as ordinary data.
The problem is that we want to reduce the cost of operation as much as possible, because this is effectively a competition between companies, and for that reason we want to use a free database. I have read a lot about this but I am still confused. We also have to bear in mind that the people who are going to use it are government staff, so the DB has to be easy to manage at the same time.
What would you recommend?
Thank you very much.
MySQL and even SQLite already have spatial indexes, so no problem there.
To store the data files you could use a BLOB field, but it's usually much better (and easier to optimise) to store them as files. To keep the files related to the DB records, you can either put the full path (or URL) in a varchar field, or store the image at a path calculated from the record's ID.
To scale easily into the multi-terabyte range, plan from the start on using several servers. If the data is read-mostly, an easy way is to store the images on different hosts, each running a static HTTP server, with the database recording where each image lives. Then put a web-app frontend on the database, where the URL for each image points directly to the appropriate storage server. That way you can keep adding storage without creating a bottleneck on the 'central' server.
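As a concrete illustration of "a path calculated from the record's ID" and of spreading images across several static hosts, something along these lines works; the host names and directory layout are invented for the example:

    # Hypothetical list of static HTTP image servers.
    IMAGE_HOSTS = ["img1.example.com", "img2.example.com", "img3.example.com"]

    def image_path(record_id):
        # Split the ID into two directory levels so no single directory
        # ends up holding millions of files, e.g. id 1234567 -> "00/12/1234567.jpg"
        padded = "%09d" % record_id
        return "%s/%s/%d.jpg" % (padded[:2], padded[2:4], record_id)

    def image_url(record_id):
        # Pick the host deterministically from the ID, so the database only
        # needs the ID (or the final URL) and never the image bytes.
        host = IMAGE_HOSTS[record_id % len(IMAGE_HOSTS)]
        return "http://%s/%s" % (host, image_path(record_id))

    print(image_url(1234567))   # http://img2.example.com/00/12/1234567.jpg

Adding capacity then just means adding hosts to the list (and rebalancing existing images, since the modulo mapping changes).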
PostgreSQL, SQL Server 2008, and any recent version of Oracle all have spatial indexing, table partitioning, and BLOBs, and all are capable of acting as the back end of a large geographic database. You might also want to check out two open-source GIS applications, GRASS and QGIS, which might support doing what you want with less modification work than writing a bespoke application. Both can use PostgreSQL and other database back ends.
As for support, any commercial or open-source database is going to need the attentions of a competent DBA if you want to get it to work well on terabyte-size databases. I don't think you will get away with a model of pure end-user support - attempts to do this are unlikely to work.
It sounds like the image files will account for a considerable amount of your storage. Don't store them in the database; just store the file location details there.
(If you want access via the internet, try Amazon storage. It isn't free, but it is very cheap, and they handle the scalability for you.)
Another cautionary note on using BLOBs/CLOBs: I've been bitten by runaway database growth from storing them internally within the DB.
What about storing the GIS maps on a separate server and only keeping the lat/long "shape" of each area within the DB? The GIS data can then be updated separately, without the cost of storing the images in the main database.
The database stays smaller to administer and cheaper to back up.
Whilst it does not meet your criterion of being free, I would strongly recommend you consider SQL Server 2008, because of two features in this version that could help:
FILESTREAM - allows you to store your binary images within the filesystem, rather than within the database itself. This will make your database much more manageable whilst still allowing you to query the data in the usual way.
GEOGRAPHIC DATA TYPES - support for geospatial (lat/long) datatypes is likely to be very valuable to your solution.
Good luck!
Use ESRI's Image Server. You won't need a database to serve the images. It's very easy to use, works off files, is fast, and handles many image formats. It also does image processing on the fly and supports many clients: AutoCAD, MicroStation, ArcMap, ArcIMS, ArcServer, etc.
Image Server