Serving images when using Couchbase as app server? - couchbase

I'm working on a little Express app which currently allows users to log in (via Passport) and see information about their friends, e.g. purchase history, likes, etc. Ideally I want each user to have an accompanying profile photo, and for items in their purchase history to have accompanying product photos. A simplified user model is shown below:
{
    "name": "Homer Simpson",
    "purchases": "Duff"
}
If I want a profile photo to go with this, is there a straightforward way to do it in Couchbase, or should I store the image in something like S3 and then have something like:
{
    "name": "Homer Simpson",
    "profile_pic": "http://s3.something.com/profilepics/homer.jpg",
    "purchases": "Duff"
}

I can give you two arguments for why I would not store them in the database (Couchbase or otherwise).
1) Use each tool for what it is best at. Couchbase can serve up that example document you give right from its managed cache in RAM, with sub-millisecond response times. S3 is excellent at serving up static content like images. Can Couchbase serve up that image very fast from RAM as well? Sure, but you are going to use more resources to do it, and that leads me to my second argument.
2) By storing the image in a DB, Couchbase or otherwise, you would be using your most expensive and performant resources for something that is static and changes infrequently. Just think of the cost of storage on an EC2 instance compared to S3. If you were to store images in a database, you have to store them, replicate them, back them up, etc. alongside your critical data. S3 offers high durability at an exceedingly reasonable price. Images in a database sound like a great idea, but over time they become chains around your neck. Unless you start now with a user account expiration policy, two years from now you could be storing, backing up and replicating images from a user who quit using your service 1.5 years ago, and paying for every KB multiple times. Again, this is not exclusive to Couchbase by any means.
My opinion: go with the document that has a pointer to the image in S3. You get the best of both worlds: performance and cost effectiveness for not much extra work.
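To make that concrete, here is a minimal sketch of the write path, assuming the AWS SDK v3 and the Couchbase Node.js SDK; the bucket name, connection details and key scheme are placeholders, not recommendations.

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import * as couchbase from "couchbase";

// Upload the static image to S3, then store only its URL in the Couchbase document.
async function saveProfilePic(userId: string, imageBytes: Buffer) {
    const s3 = new S3Client({ region: "us-east-1" });
    const key = `profilepics/${userId}.jpg`;

    await s3.send(new PutObjectCommand({
        Bucket: "my-app-images",   // hypothetical bucket
        Key: key,
        Body: imageBytes,
        ContentType: "image/jpeg",
    }));

    const cluster = await couchbase.connect("couchbase://localhost", {
        username: "app",
        password: "secret",
    });
    const collection = cluster.bucket("users").defaultCollection();

    // The document itself stays tiny and cache-friendly; the image bytes live in S3.
    await collection.upsert(`user::${userId}`, {
        name: "Homer Simpson",
        profile_pic: `https://my-app-images.s3.amazonaws.com/${key}`,
        purchases: "Duff",
    });
}

The document Couchbase keeps in RAM is still just a few hundred bytes, so reads stay sub-millisecond, while the image itself is served from S3 (or a CDN in front of it).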

Related

How do modern web applications implement caching and data persistence with large amounts of rapidly changing data?

For example, consider something like Facebook or Twitter. All the user tweets / posts are retained indefinitely (so they must ultimately be persisted to a database). At the same time, they can change rapidly (e.g. with replies, likes, etc.), so some sort of caching layer is necessary (e.g. you obviously can't be writing directly to the database every time a user "likes" a post).
In a case like this, how are the database / caching layers designed and implemented? How are they tied together?
For example, is it typical to begin by implementing the database in its entirety, and then add the caching layer afterward?
What about the other way around? In other words, begin by implementing the majority of functionality into the cache layer, and then write another layer which periodically flushes the cache to the database (at some point when its activity has gone down)? In this scenario, for current / rapidly changing data, the entire application would essentially be stored in cache.
Or perhaps implement some sort of cache-ranking algorithm based on access / update frequency?
How should it be handled when a user accesses less frequently used data (which isn't currently in cache)? Should the cache be bypassed completely and the database queried directly, or should all data be cached before it's sent to users?
In cases like this, does it make sense to design the database schema with the caching layer in mind, or should it be designed independently?
I'm not necessarily asking for direct answers to all these questions, but they're just to give an idea of where I'm coming from.
I've found quite a bit of information / books on implementing the database, and implementing the caching layer independent of one another, but not a whole lot of information on using them in conjunction / tying them together.
Any information, suggestions, general patterns, articles, or books would be much appreciated. It's just difficult to find some direction here.
Thanks
Probably not the best solution, but I worked on a personal project using OpenResty, where I used its shared memory zones as a cache (to avoid the overhead of connecting to something like Redis on every request) and used Redis as the backend DB.
When a user loads a resource, it checks the shared dict; on a miss it loads the resource from Redis and writes it to the cache on the way back.
If a resource is created or updated, it's written to the cache and also pushed onto a shared-dict queue.
A background worker ticks away waiting for new items in the queue, writing them to Redis and then sending an event to the other servers so they either invalidate the resource in their cache if they have it, or even pre-cache it if needed.
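The original was OpenResty/Lua, but the pattern itself is language-agnostic; here is a rough TypeScript sketch of the same read-through cache with a write-behind queue, assuming ioredis and using an in-process Map as a stand-in for the shared memory zone.

import Redis from "ioredis";

const redis = new Redis();                                      // backend store
const localCache = new Map<string, string>();                   // per-process cache (stand-in for the shared dict)
const writeQueue: Array<{ key: string; value: string }> = [];   // write-behind queue

// Read path: local cache first, fall back to Redis and repopulate on the way back.
async function getResource(key: string): Promise<string | null> {
    const cached = localCache.get(key);
    if (cached !== undefined) return cached;
    const value = await redis.get(key);
    if (value !== null) localCache.set(key, value);
    return value;
}

// Write path: update the cache immediately, persist asynchronously.
function putResource(key: string, value: string): void {
    localCache.set(key, value);
    writeQueue.push({ key, value });
}

// Background worker: drain the queue into Redis and tell other nodes to invalidate.
setInterval(async () => {
    while (writeQueue.length > 0) {
        const { key, value } = writeQueue.shift()!;
        await redis.set(key, value);
        await redis.publish("invalidate", key);   // other servers subscribe and drop their copy
    }
}, 1000);

The trade-off is the usual one for write-behind: reads are fast and writes don't block on the backend, but a crash between the cache write and the queue flush can lose the most recent updates.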

How important is MySQL location geographically?

I read that StackExchange uses two data centers to house all of their servers, both data centers are in the US. I'm in Ireland so I'm sure US servers are fine for me, but how can StackExchange load quickly for users in Australia if all the database servers are in the US?
I'd just like to ask: does this mean that for services like MySQL, being geographically close to the server isn't as big of a deal for keeping page load times fast?
I know they use a CDN to speed up their page load time, and they probably cache certain pages to speed things up, but even if I go to some really old, unpopular question I don't notice any slow-down.
The location of the database server relative to the viewer is not the significant performance factor. As a site visitor, you aren't talking to the database -- you're talking to a web application server, which is talking to the database.
Far more important, usually, is the location of the database server relative to the application server, because many applications require multiple queries and thus multiple round trips to the database in order to render a single page, and these round trips increase the time it takes for a page to be rendered. When the database is physically proximate to the application tier, that time becomes negligible.
Speaking in general web terms, in a well-managed site like SE, with all the supporting assets in a CDN, the only delay that is relevant to you is the transit time required for that one big HTTP request/response necessary to render the page content. The transit time is not negligible, because the speed of light is still finite, so round trip times to far-flung locales even on the best routes can easily be in the 200-300ms range... but if you only need to traverse it once, you still have a respectable response time.
A site that uses a lot of ajax to fetch additional data would not fare so well with the web server so far away. If such a design were needed, you'd need geographically distributed web servers, with adjacent database replicas, and geo-routing in DNS to send read-only ajax requests to the nearest web server, which could query its local replica, get a quick response, and return a quick answer.
I once moved a MySQL server -- relative to the app server -- from being ~0.5 ms away to being ~25ms away. The page load time on the site (which was already not optimal) increased from 2 sec to 10 sec. The reason? The app had been through many iterations over the years and made a lot of unnecessary requests to the database... if I remember right, even the simplest page required 13 different queries, most of which were fetching data that wasn't actually used (like fetching your score even for pages that didn't actually display your score). This inefficiency went undetected as long as the app and the db were very, very close. But, again, this was about the distance between the web server and the database, not the database and the browser.
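The back-of-the-envelope arithmetic makes the effect obvious; the numbers below are taken from the anecdote above and assume the queries run one after another:

// Sequential queries multiply the app-to-DB round-trip time.
const queriesPerPage = 13;   // from the example above
const nearRttMs = 0.5;       // app and DB in the same facility
const farRttMs = 25;         // app and DB ~25 ms apart

console.log(queriesPerPage * nearRttMs);  // ~6.5 ms of round-trip overhead per page
console.log(queriesPerPage * farRttMs);   // ~325 ms of round-trip overhead per page

The observed jump (2 s to 10 s) was larger than the raw multiplication suggests, which hints at additional round trips per query (connection handling, prepared statements) and other serialized waits; either way, the multiplier is the distance between the app tier and the database, not the distance to the browser.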
Stack Exchange has two data centers but at last check one of them is only a hot standby/failover site. The main site does all the work under normal operations. And, SE uses MSSQL, but that, too, is immaterial, because the fundamental phenomenon at work here is a law of physics.
Perhaps StackExchange uses several read-only copies of its databases (DB slaves), geographically distributed across different regions of the world. That would explain fast responses even for unpopular queries.
Also, direct communication between Australia and the West Coast of the United States over an undersea cable is possible, which keeps round-trip times relatively low.

Can large sets of binary data be stored in a database? [duplicate]

Possible Duplicate:
database for huge files like audio and video
I'm looking for the best (or at least a good enough) way of storing large sets of binary data (images, videos, documents, etc.). The solution has to be scalable and can't get stuck after X amount of data.
I would like to have one place, for example a MySQL database, where all the data is kept. When one of the web front ends needs something (on request), it can fetch it from the DB and cache it permanently for later.
From what I can see at http://dev.mysql.com/doc/refman/5.0/en/table-size-limit.html, a MySQL table can't store more than 4TB. Is there something more appropriate, perhaps a NoSQL database, or is it better to store everything as files on one server and propagate them to all web front ends?
You typically don't want to store large files in a relational database -- it's not what they're designed for. I would also advise against using a NoSQL solution, since they're also typically not designed for this, although there are a few exceptions (see below).
Your last idea, storing the files on the filesystem (which is, after all, what filesystems are designed for ;)), is most likely the right approach. This can be somewhat difficult depending on what your scalability requirements are, but you will likely want to go with one of the following:
SAN. SANs provide redundant, highly-available storage solutions within a network. Multiple servers can be attached to storage provided by a SAN and share files between each other. Note that this solution is typically enterprise-oriented and fairly expensive to implement reliably (you'll need physical hardware for it as well as RAID controllers and a lot of disks, at minimum).
CDN. A content delivery network is a remote, globally distributed system for serving files to end users over the Internet. You typically put a file in a location on your server that is then replicated to the CDN for actual distribution. The way a CDN works is that if it doesn't have the file a user is requesting, it'll automatically try to fetch it from your server; once it has fetched a copy of the file, it caches it for some period of time. It can be really helpful if you're normally constrained by bandwidth costs or processing overhead from serving up a huge number of files concurrently.
Cloud offering (Amazon S3, Rackspace Cloud Files). These are similar to a CDN, but work well with your existing cloud infrastructure, if that's something you're using. You issue a request to the cloud API to store your file, and it subsequently becomes available over the Internet, just like with a CDN. The major difference is that you have to issue any storage requests (create, delete, or update) manually.
If the number of files you're serving is small, you can also go with an in-house solution. Store files on two or three servers (perhaps have a larger set of servers and use a hash calculation for sharding if space becomes an issue). Build a small API for your frontend servers to request files from your storage servers, falling back to alternate servers if one is unavailable.
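To illustrate the hash-based sharding idea from the in-house option above, here is a minimal sketch assuming a fixed list of storage servers; the server names are made up, and consistent hashing would be the next step if servers are added or removed often.

import { createHash } from "crypto";

// Hypothetical internal storage servers; with a fixed list, a simple modulo works.
const storageServers = [
    "http://files-01.internal",
    "http://files-02.internal",
    "http://files-03.internal",
];

// Pick a storage server for a file by hashing its key.
function serverForFile(fileKey: string): string {
    const digest = createHash("md5").update(fileKey).digest();
    const index = digest.readUInt32BE(0) % storageServers.length;
    return storageServers[index];
}

// A frontend would request `${serverForFile("videos/cat.mp4")}/videos/cat.mp4`,
// falling back to a replica if that server is unavailable.

Note that adding a server to a plain modulo scheme remaps most keys, which is exactly the problem consistent hashing (or something like Riak, mentioned below) solves.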
One solution that I almost forgot (although I haven't ever used it beyond research purposes) is Riak's Luwak project. Luwak is an extension of Riak, an efficient distributed key/value store, that provides large-file support by breaking large files into consistently sized segments and then storing those segments in a tree structure for quick access. It might be something to look into, because it gives you the redundancy, sharding, and API that I mentioned in the last paragraph for free.
I work as a (volunteer) developer on a fairly large website - we have some 2GB of images across 14,000 files [clearly nowhere near a "world record"], and a 150MB database. Image files are stored as separate files instead of as database objects, partly because we resize images for different usages - thumbnails, medium and large images are created programmatically from the stored image (which may be larger than the "large" size we use for the site).
Whilst it's possible to store "blobs" (Binary Large Objects) in SQL databases, I don't believe it's the best solution. A better solution is to store a reference in the database, so that you can build a path/filename combination for the actual stored file [possibly hiding the actual image behind some sort of script - PHP, JSP, Ruby or whatever you prefer].
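As a rough sketch of that approach, assuming the sharp image library for resizing and a generic db.query helper standing in for whatever database client you use (the paths, table and column names are made up):

import sharp from "sharp";
import * as path from "path";

const SIZES = { thumb: 100, medium: 600, large: 1200 };
const IMAGE_ROOT = "/var/www/images";   // hypothetical storage location

async function storeImage(
    db: { query: (sql: string, params: unknown[]) => Promise<unknown> },
    imageId: string,
    original: Buffer,
) {
    // Derived sizes live on disk (or in S3 / behind a CDN), never in the database itself.
    for (const [name, width] of Object.entries(SIZES)) {
        await sharp(original)
            .resize({ width })
            .toFile(path.join(IMAGE_ROOT, `${imageId}-${name}.jpg`));
    }
    // The database stores only the reference used to build path/filename combinations.
    await db.query("INSERT INTO images (id, base_path) VALUES (?, ?)",
                   [imageId, `/images/${imageId}`]);
}

A request for /images/1234-thumb.jpg can then be served directly by the web server, or routed through a small script if you need access control.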

Best way to store multiple types of data

I'm about to design a database for a project I'm working on. I need to store multiple types of data, like videos, photos, text, and audio. I have to store them and query them frequently through PHP. The project is a social network, and I need to connect users through notifications and messages.
Here is the question: is it more helpful to use NoSQL databases (like MongoDB and Redis) to store the data and drive the notification system, or can MySQL handle this kind of system as well?
Sorry for my English; technical topics are hard to explain for a beginner in English like me. Thank you.
The problem with SQL technologies such as MySQL is that you normally have to place that binary data inside a BLOB, at which point you are already doing it wrong.
Another thing to consider is that file system access will generally be faster than database access, whether it is MongoDB or SQL; however, database storage does have some advantages. Eventually (if your site gets even slightly popular) you will find you need a CDN. These sorts of distribution networks can be costly, but with something like MongoDB you can just spin up replicas of the data in other regions and have the binary data replicate as it is needed (maybe even TTL'd, just like a CDN).
So this is one area to consider: most of the time, the file system alone is not the right answer for a high-load site like a social network. However, even Facebook is not immune to having to serve directly from a file system, as they state ( https://blog.facebook.com/blog.php?post=2406207130 ; the post is 5 years old, but I doubt much has changed on this front):
We have also developed our own specialized web servers that are tuned to serve files with as few disk reads as possible. Even with thousands of hard drive spindles, I/O (input/output) is still a concern because our traffic is so high. Our squid caches help reduce the load, but squid isn't nearly fast or efficient enough for our purposes, so we're writing our own web accelerator too.
However, they have an extremely large infrastructure; more realistically you should be weighing database storage against a CDN.
I would personally say you should probably do some research into content distribution networks and how other sites serve their images. You can find information all over Google. You can search specifically for Facebook, who, until recently, were using Akamai for their CDN.
You can go either way, but storing binary data in a DB is usually not the most efficient path. You are better off storing it in the filesystem and putting the paths in the DB.

What type of database would be used for keeping data between a Web Site and Game Server?

I'm making an online game where I will host a game server. Players will log in to my game server. They will then be taken to a lobby where they can choose a game to join. I will be keeping track of wins and losses and a few other statistics.
My requirements are as follows:
- At any time in game, a player should be able to click on another player and get their latest up-to-date statistics.
- A player should also be able to go to my Web Site and get the same statistics. (Ideally up to date immediately, but this is less important than in game.)
- I will also have a leaderboard that will be generated from data on the Web Site.
My question is: What type of solution would typically be used for this type of situation?
It is vital that I never lose data. One thing that worries me about using a Web Site database is data loss.
I'm also unsure how the interactions between the Web Site database and the game server would work. Does MySQL have the capability to do this sort of thing? My other concern with using a Web Site database is how much bandwidth I would consume monthly. I generously estimate that I will have 1000 people online at any given time. A game lasts around 20 minutes.
How are these types of situations typically solved? I've looked all over but I've yet to find a clear answer to my concerns.
Thanks
I would recommend a few things based on your requirements. Your question is very open-ended, so the answers given are quite general:
Databases are fine for storing data, as they write to a hard drive and are transactional (meaning they are designed to survive web server crashes).
Databases can be backed up using any one of numerous backup tools, such as: https://www.google.com/search?q=sql+backup&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
For up-to-date statistics you should probably be pulling active players' information from a cache; otherwise you might find you are pounding the database even though most of your data isn't going to change (most gamers could be offline, yet their unchanged stats might still be viewed). A sketch of this read-through approach follows this list.
Investigate what kind of database you want: NoSQL or SQL. There is no obvious choice here without evaluating the benefits of each.
Investigate N-Tier or MultiTier design. http://en.wikipedia.org/wiki/Multitier_architecture
Consider some sort of cloud infrastructure such as AppFabric or Azure (there are Linux-based options too). Many cloud services can provide high scalability, and this could be a shortcut for the previous points.
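A minimal sketch of the read-through caching idea from the statistics point above, assuming ioredis in front of whatever database you choose; the table, column and key names are illustrative only.

import Redis from "ioredis";

const redis = new Redis();
const STATS_TTL_SECONDS = 30;   // slightly stale is fine for a "click on a player" view

// Check the cache first, fall back to the database, then cache the result briefly.
async function getPlayerStats(
    db: { query: (sql: string, params: unknown[]) => Promise<any[]> },   // stand-in DB client
    playerId: string,
) {
    const cacheKey = `stats:${playerId}`;
    const cached = await redis.get(cacheKey);
    if (cached !== null) return JSON.parse(cached);

    const rows = await db.query("SELECT wins, losses FROM player_stats WHERE player_id = ?", [playerId]);
    const stats = rows[0] ?? { wins: 0, losses: 0 };
    await redis.set(cacheKey, JSON.stringify(stats), "EX", STATS_TTL_SECONDS);
    return stats;
}

Writes (recording a win or loss) would go straight to the database, with the cache entry either invalidated or simply left to expire after its short TTL.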