How important is MySQL location geographically?

I read that StackExchange uses two data centers to house all of their servers, both of them in the US. I'm in Ireland, so I'm sure US servers are fine for me, but how can StackExchange load quickly for users in Australia if all the database servers are in the US?
I'd just like to ask, does this mean for services like MySQL, being geographically close to the server isn't as big of a deal for keeping page load times fast?
I know they use a CDN to speed up their page load time and they probably cache certain pages to speed things up, but even if I go to some really old, unpopular question I can't notice any slow-down.

The location of the database server relative to the viewer is not the significant performance factor. As a site visitor, you aren't talking to the database -- you're talking to a web application server, which is talking to the database.
Far more important, usually, is the location of the database server relative to the application server, because many applications require multiple queries and thus multiple round trips to the database in order to render a single page, and these round trips increase the time it takes for a page to be rendered. When the database is physically proximate to the application tier, that time becomes negligible.
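To make that multiplication concrete, here is a back-of-envelope sketch in Python; the query count and round-trip times are made-up illustrative numbers, not measurements from any real site:

    # Back-of-envelope: sequential queries multiply the app-to-DB round trip.
    # All numbers below are hypothetical, for illustration only.
    queries_per_page = 20      # sequential DB queries to render one page
    base_render_ms = 50        # time spent in application code

    for rtt_ms in (0.5, 25.0):  # DB in the same rack vs. at a remote site
        total = base_render_ms + queries_per_page * rtt_ms
        print(f"RTT {rtt_ms:>4} ms -> page render ~{total:.0f} ms")
    # RTT  0.5 ms -> page render ~60 ms
    # RTT 25.0 ms -> page render ~550 ms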
Speaking in general web terms, in a well-managed site like SE, with all the supporting assets in a CDN, the only delay that is relevant to you is the transit time required for that one big HTTP request/response necessary to render the page content. The transit time is not negligible, because the speed of light is still finite, so round trip times to far-flung locales even on the best routes can easily be in the 200-300ms range... but if you only need to traverse it once, you still have a respectable response time.
A site that uses a lot of ajax to fetch additional data would not fare so well with the web server so far away. If such a design were needed, you'd need geographically distributed web servers, with adjacent database replicas, and geo-routing in DNS to send read-only ajax requests to the nearest web server, which could query its local replica and return a quick answer.
I once moved a MySQL server -- relative to the app server -- from being ~0.5 ms away to being ~25 ms away. The page load time on the site (which was already not optimal) increased from 2 sec to 10 sec. The reason? The app had been through many iterations over the years and made a lot of unnecessary requests to the database... if I remember right, even the simplest page required 13 different queries, most of which were fetching data that wasn't actually used (like fetching your score even for pages that didn't actually display your score). This inefficiency went undetected as long as the app and the db were very, very close. But, again, this was about the distance between the web server and the database, not the database and the browser.
Stack Exchange has two data centers but at last check one of them is only a hot standby/failover site. The main site does all the work under normal operations. And, SE uses MSSQL, but that, too, is immaterial, because the fundamental phenomenon at work here is a law of physics.

Perhaps StackExchange uses several copies of the database (replica slaves) geographically distributed across different regions of the world. That would explain the fast responses even for unpopular SQL queries.
Also, a direct undersea cable runs between Australia and the west coast of the United States, which keeps latency between them relatively low.

Related

Implementing dynamically updating upvote/downvote

How do I implement a dynamically updating vote count similar to Quora's: whenever a user upvotes an answer, it's reflected automatically for everyone viewing that page.
I am looking for an answer that addresses the following:
Do we have to keep polling for upvote counts for every answer? If yes, how do we manage the server load arising from so many users polling for upvotes?
Or should we use websockets/push notifications, and how scalable are those?
How should the upvote/downvote counts be stored in the database/in memory to support this, and how is the number of reads/writes controlled? My backend database is MySQL.
The answer I am looking for may not be exactly how Quora does it, but rather how this can be done using available open-source technologies.
It's not the back-end system details that you need to worry about but the front end. Keeping a connection open all the time is impractical at any real scale. Instead you want the opposite: to serve and close connections from the back end as fast as you can.
Websockets are a sexy technology, but again, in the real world there are issues with proxies, and if you are developing something that should work on a variety of screens (desktop, tablet, mobile) that might become a concern for you. Even good old long polling might not work through firewalls and proxies.
Here is the good news: I think
"keep polling for upvote counts for every answer"
is a totally good solution in this case. Consider the following:
your use-case does not need true real-time updates. There is little harm in seeing the counter updated a bit later
for very popular topics you would like to squash multiple up-votes/down-votes into one anyway
most of the topics will see no up-vote/down-vote traffic at all for days/weeks, so keeping a connection open, waiting for an event that never comes is a waste
most users who just came to read a topic will never up-vote/down-vote, so your read/write ratio for topic stats will be greatly skewed toward reads
network latencies vary hugely across clients; you will see horrible transfer rates for a 100-byte http response, and while a sluggish client fetches its response byte by byte, your precious server connection and, more importantly, a thread on the backend server stay busy
Here is what I'd start with:
have browsers periodically poll for a new topic stat, after the main page loads
keep your MySQL, and keep the counters there. Every time there is an up/down vote, update the DB
put Memcached in front of the DB as a write-through cache, i.e. every time there is an up/down vote, update the cache, then update the DB. Set an explicit expiry time for each counter there, say 10-15 minutes; every time a counter is updated, its expiry time is extended automatically (see the sketch at the end of this answer)
design these polling http calls to be cacheable by http proxies; set the Expires/Cache-Control headers to 60 sec
put a reverse proxy (Varnish, nginx) in front of your front-end servers and have this proxy cache the said polling calls. This takes care of the second-level cache and helps free up backend server threads quicker; see the network latency concern above
set up your reverse proxy to talk to the memcached servers directly, without making a call to the backend server; yes, you can do it with both Varnish and nginx
there is no fancy schema for storing such data; it's a simple incr()/decr() operation in memcached, which is safe from a race-condition point of view. The equivalent UPDATE table SET field = field + 1 WHERE [...] is also a safe, atomic operation in MySQL
Aggressive multi-level caching covers your read path: in Memcached and in all the http caches along the way; note that these http poll requests will be cached on the edges as well.
To take care of the long tail of unpopular topics, make the http ttl for such responses inversely proportional to popularity.
A read request will only infrequently get to the front-end server, when the http cache has expired and memcached does not have the counter either. If that is still a problem, add memcached servers and increase the expiry time in memcached across the board.
After you are done with that, you have all the reads taken care of. The only problem you might still have, depending on the scale, is a high rate of writes, i.e. the flow of up/down votes. This is where your single MySQL instance might start showing some lag. Fear not: proceed along the old beaten path of sharding your instances, or add a NoSQL store just for the counters.
Do not use any messaging system unless absolutely necessary or you want an excuse to play with it.
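Putting the memcached and MySQL pieces above together, here is a minimal write-through counter sketch in Python. It assumes the pymemcache and PyMySQL client libraries and a hypothetical answers table with a votes column; treat it as an illustration of the approach, not a production implementation:

    # Write-through counters: update the cache first, then MySQL.
    # pymemcache and PyMySQL are assumed; the answers table is hypothetical.
    import pymysql
    from pymemcache.client.base import Client

    CACHE_TTL = 15 * 60  # the 10-15 minute expiry suggested above

    cache = Client(("127.0.0.1", 11211))
    db = pymysql.connect(host="127.0.0.1", user="app", password="...",
                         database="qa", autocommit=True)

    def upvote(answer_id: int) -> None:
        key = f"votes:{answer_id}"
        if cache.incr(key, 1) is None:
            # incr() returns None when the key is absent: seed it from the DB
            cache.set(key, str(votes_from_db(answer_id) + 1), expire=CACHE_TTL)
        else:
            cache.touch(key, expire=CACHE_TTL)  # extend expiry on every update
        with db.cursor() as cur:
            # atomic increment in MySQL, safe under concurrency
            cur.execute("UPDATE answers SET votes = votes + 1 WHERE id = %s",
                        (answer_id,))

    def read_votes(answer_id: int) -> int:
        key = f"votes:{answer_id}"
        cached = cache.get(key)
        if cached is not None:
            return int(cached)  # served from memcached, no DB touched
        count = votes_from_db(answer_id)
        cache.set(key, str(count), expire=CACHE_TTL)
        return count

    def votes_from_db(answer_id: int) -> int:
        with db.cursor() as cur:
            cur.execute("SELECT votes FROM answers WHERE id = %s", (answer_id,))
            row = cur.fetchone()
            return row[0] if row else 0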
Websockets, Server Sent Events (I think that's what you meant by 'push notifications') and AJAX long polling have the same drawback: they keep the underlying TCP connection open for a long time.
So the question is how many open TCP connections can a server handle.
Basically, it depends on the OS, the number of file descriptors (a config parameter) and the available memory (each open connection reserves read/write buffers).
Here's more on that.
We once tested the possibility of keeping 1 million websocket connections open on a single server (Windows 7 x64 with 16 GB of RAM, JVM 1.7 with 8 GB of heap, using an Undertow beta to serve web requests).
Surprisingly, the hardest part was generating the load on the server.
It managed to hold 1M. But again, the server didn't do anything useful; it just received requests, went through the protocol upgrade and kept those connections open.
There was also some number of lost connections, for whatever reason. We didn't investigate. But in production you would also have to ping the server and handle reconnection.
Apart from that, websockets seem like overkill here, and SSE still isn't widely adopted.
So I would go with good old AJAX polling, but optimize it as much as possible.
It works everywhere, is simple to implement and tweak, has no reliance on an external system (I have had bad experiences with that several times), and leaves room for optimization.
For instance, you could group the updates for all open articles into a single browser request, or adjust the update interval according to how popular the article is (see the sketch below).
After all it doesn't seem like you need real-time notifications here.
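As a rough sketch of that grouping, one batched poll endpoint can return the counts for every answer open in the browser and set a cache TTL based on popularity. Flask is assumed here, and load_counts() is a hypothetical stand-in for the cache/DB lookup:

    # One batched, cacheable poll request per page (Flask assumed).
    from flask import Flask, jsonify, make_response, request

    app = Flask(__name__)

    def load_counts(ids):
        # Hypothetical: fetch current vote counts from memcached/MySQL.
        return {i: 0 for i in ids}

    @app.route("/poll")
    def poll():
        # The browser sends one request for all visible answers,
        # e.g. GET /poll?ids=1,2,3 -- grouping updates in a single call.
        ids = [int(i) for i in request.args.get("ids", "").split(",") if i]
        counts = load_counts(ids)
        resp = make_response(jsonify(counts))
        # Popular articles get a short TTL, the long tail a longer one,
        # so proxies absorb most of the polling traffic.
        ttl = 60 if max(counts.values(), default=0) > 100 else 300
        resp.headers["Cache-Control"] = f"public, max-age={ttl}"
        return resp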
It sounds like you might be able to use a messaging system like Kafka, RabbitMQ, or ActiveMQ. Your front end would send votes to a message channel and receive them with a listener, and you could have a server-side piece persist the votes to the db periodically.
You could also accomplish your task by polling your database and incrementing/decrementing a number related to a post via a stored proc... there are a bunch of options here, and it depends on how much concurrency you may be facing.

What type of database would be used for keeping data between a Web Site and Game Server?

I'm making an online game where I will host a game server. Players will log in to my game server. They will then be taken to a lobby where they can choose a game to join. I will be keeping track of wins and losses and a few other statistics.
My requirements are as follows:
At any time in game, a player should be able to click on another player and get their latest up-to-date statistics.
A player should also be able to go to my Web Site and get the same statistics. (Ideally, up to date immediately, but less important than in game.)
I will also have a leader-board that will be generated from data on the Web Site.
My question is: What type of solution would typically be used for this type of situation?
It is vital that I never lose data. One thing that worries me about using a Web Site database is data loss.
I'm also unsure how the interactions between the Web Site database and the game server would work. Is there a capability in MySQL to do this sort of thing? My other concern with using a Web Site database is how much bandwidth I would consume monthly. I generously estimate that I will have 1000 people online at any given time. A game lasts around 20 minutes.
How are these types of situations typically solved? I've looked all over but I've yet to find a clear answer to my concerns.
Thanks
I would recommend a few things based on your requirements. Your question is very open ended so the answers given are quite general:
Databases are fine for storing the data, as they write to a hard drive and are transactional (meaning they can survive web server crashes); a minimal sketch follows this list.
Databases can be backed up using any one of numerous back up tools, such as: https://www.google.com/search?q=sql+backup&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a
For up-to-date statistics you should probably be pulling active game players' information from a cache; otherwise you might find you are pounding the database when most of your data isn't going to change (e.g. most gamers could be offline and their data will remain static, but it might still need to be viewed).
Investigate what kind of database you want. NOSQL, or SQL. There is no obvious choice here without evaluating the benefits of each.
Investigate N-Tier or MultiTier design. http://en.wikipedia.org/wiki/Multitier_architecture
Consider some sort of cloud-like infrastructure such as AppFabric, Azure, etc. (there are Linux-based ones too). There are many cloud services which can provide high scalability, and it could be a shortcut for the previous points.
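As a rough illustration of the database points above, here is how the game server might record a match result durably in MySQL: one atomic upsert per result. PyMySQL is assumed, and the player_stats schema is hypothetical; the web site and the in-game lookup would both read from this same table (ideally through a cache for active players):

    # One durable, atomic upsert per match result (PyMySQL assumed;
    # the player_stats schema below is hypothetical).
    import pymysql

    db = pymysql.connect(host="db.internal", user="game", password="...",
                         database="stats")

    # CREATE TABLE IF NOT EXISTS player_stats (
    #     player_id INT PRIMARY KEY,
    #     wins      INT NOT NULL DEFAULT 0,
    #     losses    INT NOT NULL DEFAULT 0
    # ) ENGINE=InnoDB;

    def record_result(player_id: int, won: bool) -> None:
        win, loss = int(won), int(not won)
        with db.cursor() as cur:
            cur.execute(
                "INSERT INTO player_stats (player_id, wins, losses) "
                "VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE "
                "wins = wins + VALUES(wins), losses = losses + VALUES(losses)",
                (player_id, win, loss),
            )
        db.commit()  # committed to disk, addressing the "never lose data" worry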

Cloud Database Service Latency/Performance

I am running a heavy-traffic site and our server is beginning to reach its limits; at the moment the entire LAMP stack is on one box (not ideal).
I would like to move the database onto its own box or onto a cloud service, but from my previous experience, moving the database off the same box as the webserver increases the latency of reads quite dramatically, slowing down the site.
Is using a cloud service for this going to overcome that problem? As far as I can tell it's essentially the same situation (as moving it onto a separate box in my control), in which case why is there so much popularity around cloud-based database services at the moment?
Are cloud-based database services so quick that the latency of reads is so low that it's almost like having it on the same box in the same datacentre?
Using a cloud service just for your database won't help your situation.
If you only move the database, you're physically placing it in a remote location - which will always increase latencies, no matter how powerful the hardware serving the content.
I would suggest that you will see a benefit in hosting your database on a separate machine from your web server, so long as they are physically next to each other sharing a dedicated network (as already suggested).
If you wanted to explore the benefits of cloud services, I would suggest only doing so if you can move both database and web server together. Furthermore, it's really only of benefit if you explore load balancing across multiple web-servers and/or replicated databases. (The ability to scale dynamically is a major benefit of cloud based platforms).
Clouds are about paying someone else to manage the infrastructure so you don't have to. They also come with some nice benefits around being able to rapidly acquire infrastructure: since you don't have to wait for physical machines to land, you can simply tap into the "cloud's" unused capacity. Sure, people build features on top of this infrastructure to make it easier to scale (this usually means programming against a certain model).
If you are thinking about a cloud, when are you planning on moving to 10 servers... or 100? Do you deal with traffic that comes in large bursts, where the peaks are very high?
Since you are talking about moving to a second box I don't think you need to have the cloud discussion yet. Just add a database server and use caching like e4c5 recommended.
There will be increased latency going across the network, but it shouldn't be that noticeable; gigabit ethernet is pretty fast. When you tried splitting the boxes, how did you access the other box? You should be using a local, internal IP address (e.g. 192.168.#.#). If you are not, then your requests may get routed over the internet, even though the boxes are physically next to each other.
Moving to a cloud won't solve your problems if the servers aren't networked properly.

What would be my best MySQL Synchronization method?

We're moving a social media service onto separate data centers, as our hosting provider's entire data center went down. Twice.
This means that both websites need to be synchronized in some sense -- I'm less worried about the code of the pages, that's easy enough to sync, but they need to have the same database data.
From my research on SO, it seems MySQL Replication is a good option, but the MySQL manual, for scaling out, says that it's best when there are far more reads than there are writes/updates:
http://dev.mysql.com/doc/refman/5.0/en/replication-solutions-scaleout.html
In our case, it's about equal. We're getting around 200-300 thousand requests a day right now, and we can grow rapidly. Every request is both a read and write request.
What would be the best method or tool to handle this?
Replication isn't instantaneous, and all writes have to be sent over the wire to the remote servers, so it takes bandwidth too. As long as this works for you and you understand the consequences, then don't worry about the read/write ratio.
However, are you sure that you need global replication? We handle millions of requests and have one location, with multiple web servers connected to two databases. One database is the live database, and the other is a replicated read-only database.
We do have global failover locations, and some people connect to these on any given day, even if our main node is up, because they have Internet issues. The data just trickles in, though.
If the main node went down, then everybody would be using the global failover locations, in order. So, if our main node died, all customers would connect to Denver. If Denver went down, they'd all connect to Columbus.
Also, our main node is on two different Internet providers, so one ISP going down doesn't take us down.
Is the connection speed between the two datacenters good enough? You could copy the files to a new server and move the database there, and then set up the old server so that it connects to the new server's MySQL database in the other DC. This will be slower, of course, but depending on the nature of your queries it can be acceptable. As soon as the DNS switch (or whatever you use) finishes, you just power off the old server once there are no more requests for it.
To help you assess your options, you need to consider what your requirements are in a disaster recovery scenario (i.e. total loss of the system in one data-centre).
In particular for this scenario, how much data can you afford to lose (recovery point objective - RPO), and how quickly do you need to have the standby data-centre version of the site up and running (recovery time objective - RTO).
For example if your RPO is no transactions lost and recovery in 5 minutes, then the solution would be different than if you can afford to lose 5 mins of transactions and an hour to recover.
Another question I'd ask is if you're using SAN storage at all? This gives you options for replication at the storage level (SAN array to SAN array), rather than at the database level (e.g. MySQL replication).
Also consider the distance between the data-centres (e.g. timewise, can you afford to perform a synchronous write to both databases, or would an asynchronous replication approach be more appropriate?).

Scaling up from 1 Web Server + 1 DB Server

We are a Web 2.0 company that built a hosted Content Management solution from the ground up using LAMP. In short, people log into our backend to manage their website content and then use our API to extract that content. This API gets plugged into templates that can be hosted anywhere on the interwebs.
Scaling for us has progressed as follows:
Shared hosting (1and1)
Dedicated single server hosting (Rackspace)
1 Web Server, 1 DB Server (Rackspace)
1 Backend Web Server, 1 API Web Server, 1 DB Server
Memcache, caching, caching, caching.
The question is, what's next for us? Every time one of our sites is dugg or mentioned on a popular website, our API server gets crushed with too many connections. And every time our DB server gets overrun with queries, our Web server requests back up.
This is obviously the 'next problem' for any company like ours and I was wondering if you could point me in some directions.
I am currently attracted to the virtualization solutions (like EC2) but need some pointers on what to consider.
What/where/how to scale is dependent on what your issues are. Since you've been hit a few times, and you know it's the API server, you need to identify what's actually causing the issue.
Is it DB lookup times?
A volume of requests that the web server just can't handle, even though they're short-lived?
API requests that take too long to process (independent of DB lookups, e.g., does the code take a while to run)?
Once you identify WHAT the problem is, you should have a pretty clear picture of what you need to do. If it's just the volume of requests, and it's the API server, you just need more web servers (and code changes to allow horizontal scaling) or a beefier web server. If it's API requests taking too long, you're looking at code optimizations. There's never a one-shot fix when it comes to scalability.
The most common scaling issues have to do with slow (2-3 seconds) execution of the actual code for each request, which in turn leads to more web servers, which leads to more database interactions (for cross-server sessions, etc.), which leads to database performance issues. The answer is high-performance, server-independent code with memcache. (I actually prefer a wrapper around memcache so the application doesn't know/care where it gets the data from, just that it gets it; the translation layer handles DB/memcache lookups as well as populating memcache. A minimal sketch of such a wrapper follows.)
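Here is a minimal sketch of that wrapper, assuming pymemcache; the load_from_db callable is a hypothetical stand-in for whatever MySQL lookup the application uses:

    # The application calls store.get() and never knows whether the value
    # came from memcached or the database (pymemcache assumed; values should
    # be bytes/str for the default client).
    from pymemcache.client.base import Client

    class CachedStore:
        def __init__(self, cache: Client, load_from_db, ttl: int = 300):
            self.cache = cache                # e.g. Client(("127.0.0.1", 11211))
            self.load_from_db = load_from_db  # hypothetical MySQL lookup
            self.ttl = ttl

        def get(self, key: str):
            value = self.cache.get(key)
            if value is not None:
                return value                  # cache hit: no DB interaction
            value = self.load_from_db(key)    # cache miss: fall through to the DB
            if value is not None:
                # populate the cache so the next lookup skips the DB
                self.cache.set(key, value, expire=self.ttl)
            return value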
Depends really if your bottleneck is reads or writes. Scaling writes is much harder than reads.
It also depends on how much data you have in the database.
If your database is small but cannot cope with the read load, you can add enough RAM that it fits in memory. If it still cannot cope, you can add read replicas, possibly on the same boxes as your web servers; this will give you good read scalability. The number of slaves one MySQL master can feed is quite high and will depend chiefly on the write workload.
If you need to scale writes, that's a totally different game. To do that you'll need to split your data out, either horizontally (partitioning/sharding) or vertically (functional partitioning, etc.), so that you can spread the workload over several write servers which do not need to do each other's work (see the sketch below).
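A sketch of the horizontal split, with hypothetical shard hosts: hash a stable key (the user id here) to pick the one write server that owns that user's rows:

    # Hypothetical shard map: each user's rows live on exactly one master.
    import zlib

    SHARDS = ["db1.internal", "db2.internal", "db3.internal"]

    def shard_for(user_id: int) -> str:
        # crc32 gives a mapping that is stable across processes and restarts
        # (unlike Python's salted hash() for strings).
        return SHARDS[zlib.crc32(str(user_id).encode()) % len(SHARDS)]

    # e.g. connect to shard_for(42) before running that user's queries.
    # Caveat: naive modulo reshuffles keys when a shard is added; consistent
    # hashing or a lookup table avoids that.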
I'm not sure what EC2 can do for you; it essentially offers slow, high-latency machines with non-persistent disks and low IO performance, on the end of a more-or-less nonexistent SLA. I guess it might be useful in your case, as you can provision machines relatively quickly, provided you're just using them as read replicas and you don't have too much data (remember, they have non-persistent disks and sucky IO).
What is the level of scaling you are looking for? Is it a stop-gap solution, e.g. scaling vertically? If it is a more strategic scaling project, does your current architecture support scaling horizontally?