Best Approach to Storing Mean Uptime Data - MySQL

We have 500+ remote locations. Each location has a Linux router which checks in to our management system (homemade, using RoR3) every 15 minutes.
We need to log and calculate the mean uptime of each box's Internet connectivity.
Each router posts a request every 15 minutes to a script on the server. (Currently this just records the last check-in time and the uptime.)
If we want to plot the historical uptime of each box, what is the most efficient way to do this without clogging up our DB?
500 boxes checking in every 15 minutes would (according to my calculations) result in 17,520,000 inserts a year. That's quite a hefty amount of data, and I don't think we need all of it.
Could anyone help solve this riddle for us?

Why not take a look at RRDTool (Wiki entry)? It's just the tool for this kind of situation.
It works as a sort of round-robin, self-averaging database, and it's used in many logging applications for purposes very similar to yours.
As an example, take a look at Cacti, which is a data-logging / network-monitoring and graphing front-end app built around RRDTool (implemented in PHP).
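To give a rough idea of what that would look like here, each router gets its own small, fixed-size RRD file that is updated once per check-in. This is only a sketch; the file name, data-source name, heartbeat and retention periods are assumptions you would tune to your own needs:

# One RRD per router: a 15-minute step, raw samples kept for 30 days,
# daily averages kept for a year. The file never grows.
rrdtool create router-0001.rrd --step 900 \
  DS:uptime:GAUGE:1800:0:U \
  RRA:AVERAGE:0.5:1:2880 \
  RRA:AVERAGE:0.5:96:365

# On each 15-minute check-in, record the reported uptime (in seconds):
rrdtool update router-0001.rrd N:1234567

A missed check-in (no update within the 30-minute heartbeat) simply shows up as an unknown sample, which is exactly what you want for plotting connectivity gaps.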

Related

MySQL Scaling on GCP

I created an 8-core MySQL instance on GCP, with a simple database in it. When I run a load of 40,000+ concurrent users (1,500 req/sec), the response times come out very high (10+ seconds), yet I can see the hardware CPU utilization at only about 15%.
What can I do to get the response time in msec?
Cheers!
Deepak
Imagine cramming 40000 shoppers in a grocery store. How many hours would it take for a shopper to buy just one carton of milk?
Seriously, there is a limit to how many connections can be made to any device. Database computers will top out at a few hundred. After that, latency will suffer severely as all the connections are waiting for their turn at various shared resources.
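As a quick sanity check on that ceiling, MySQL will report both its configured connection limit and the high-water mark it has actually reached (standard status and system variables; nothing here is specific to your setup):

-- Configured connection limit (defaults to a couple of hundred):
SHOW GLOBAL VARIABLES LIKE 'max_connections';

-- Highest number of simultaneous connections seen since the server started:
SHOW GLOBAL STATUS LIKE 'Max_used_connections';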
Another approach
Let's say these are achievable:
10ms to connect, fetch info for a page, and disconnect.
1500 pages built per second. (By the way, make sure the web server can achieve this.)
15 concurrent connections, each serving one 10 ms request after another, is 100 pages per second per connection, or 1500 pages per second in total.
1500 pages per second = 90000 pages per minute.
So, let's specify "40000 pages delivered to different (or same) users in one minute". I suggest that will be easy. And it won't require much more than 15 concurrent users. (Traffic is never smooth [except in a benchmark], so 50 concurrent connections may happen.)
[thousands of database servers is] where i would like to go eventually...however i need to solve a basic problem of mine which I have posted above!
Right, you have to expand the number of database servers now if you are serving 40,000 concurrent queries. Not eventually.
But let's be clear about what counts as concurrent users. Here's an example:
mysql> show global status like 'threads%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Threads_connected | 1266  |
| Threads_running   | 9     |
+-------------------+-------+
I've analyzed high-scale production sites for dozens of internet companies. It's typical to see hundreds or thousands of concurrent connections, but few of these are executing an SQL query at any given moment. When a given thread is between queries, you can see it in SHOW PROCESSLIST, but it is only in the "Sleep" state.
This is fine, and it's normal.
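For instance, this is a quick way to separate the connected-but-idle threads from the ones actually doing work at this instant (a minimal sketch; it assumes a MySQL version new enough to have information_schema.processlist, i.e. 5.1+):

-- Connected vs. actually-running threads:
SHOW GLOBAL STATUS LIKE 'Threads%';

-- Only the sessions executing something right now:
SELECT id, user, db, command, time, state, info
FROM information_schema.processlist
WHERE command <> 'Sleep';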
I give the analogy to an ssh session: you may be connected to a shell on a linux server, but if you're doing nothing, just sitting at a shell prompt, you aren't taxing the server resources much. You could have hundreds of users connected with ssh to the same server at once. But if they all begin running applications at the same time, you're in trouble. The server might not handle that load well. At least, all of the users will experience slow performance.
It's the same with a MySQL database. If you need a server that can support 40,000 Threads_running, then you need to spread that load over many MySQL servers. There isn't any single server that exists today that can handle that.
But you might mean something different when you say 40,000 concurrent users. It might be that you have 40,000 users looking at some page on your website at the same time. But that does not result in continuous SQL queries in 40,000 database sessions all at once. Each person spends some time reading the web page they just loaded, scrolling up and down, and perhaps typing into a form. While they are doing that, the website is waiting for their next request, and the web server and database server are not doing any work for that user; they can do work for other users.
In this way, a single database server can support 40,000 (or more) users who are by some definition using the site, even though only a handful are invoking any code to run SQL queries at any given moment.
This is normal and most websites can handle that traffic with no problems.
If that's the reality of your application, and you still have problems scaling it, then you might have inefficient application code or unoptimized SQL queries. That is, the website could serve the requests easily if you wrote the code to be more efficient.
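A first step toward finding those unoptimized queries is simply asking MySQL how it executes your slowest ones. This is only an illustration; the table and column names below are made up:

-- A full table scan (type = ALL, key = NULL) on a big table usually
-- points to a missing index:
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- If the plan shows no usable key, adding one often fixes it:
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);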
Inefficient code cannot be fixed by changing your server. The cost of inefficient code scales up faster than you can hope to handle it by upgrading the server. So you must solve performance problems by writing better code.
This is the point I made in an old tweet of mine.
The subject of scalable internet architecture is very complex. You need to do a lot of study and a lot of testing to grow a website and make it scalable.
You can start by reading. My favorite is Theo Schlossnagle's book Scalable Internet Architectures. Here is a video of Theo speaking about the same subject: https://www.youtube.com/watch?v=2WuT2rdLK5A
The book is from quite a few years ago. Perhaps the scale websites need to support is greater than it was back then, but the methods of achieving scalability are the same today:
Test
Identify bottlenecks
Rearchitect your web app code to relieve those bottlenecks
Test again

Kingswaysoft SSIS create Contacts Slow

I'm currently doing some testing for an upcoming data migration project and came across Kingswaysoft, which seemed like it would be ideal for this purpose.
However, I'm currently testing an import of 225,000 contact records into a new sandbox Dynamics 365 instance, and it is on course to take somewhere between 10 and 13 hours.
Is this typical of the speeds I should expect, or am I doing something silly?
I am setting only some out-of-the-box fields such as first name, last name, DOB and address data.
I have a staging contact SQL database holding the 225k records to be uploaded.
I have the CRM Destination Component set up to use multi-threading with a batch size of 250 and up to 16 threads.
I have tested using both Create and Upsert, and both are very slow.
Am I doing something wrong? I would have expected it to be much quicker.
When it comes to data loads into Dynamics 365 Online, the most important factor affecting performance is network latency. You should try to put the data migration solution as close as possible to the Dynamics 365 Online server. With the configuration right, you should be able to achieve something like 1 to 2 million records per hour; the speed you are getting is far too slow, so something else must be going on. There are many other things that can affect data load performance, but start with network latency first. We have some other tips shared at https://www.kingswaysoft.com/products/ssis-integration-toolkit-for-microsoft-dynamics-365/help-manual/crm/advanced-topics#MaximizedPerformance, which you should check out.

Google Cloud SQL Timeseries Statistics

I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries; so long, in fact, that our daemon often steps in and kills them.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this: is there some better way of storing and querying time series data using Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to from my PHP application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it easier to understand the query language that way).
https://www.elastic.co
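As a rough illustration of what those direct API calls look like for time series work, here is a sketch of an hourly rollup query. The index name "events" and the field names are hypothetical, and the aggregation syntax varies a bit between Elasticsearch versions:

curl -s -XPOST 'http://localhost:9200/events/_search' \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "per_hour": {
      "date_histogram": { "field": "timestamp", "interval": "hour" },
      "aggs": {
        "avg_duration": { "avg": { "field": "duration_ms" } }
      }
    }
  }
}'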
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a website. I even wrote a statsd backend that stores stats in Elasticsearch.
After Elasticsearch moved Kibana from version 3 to 4, I found the interface very poor for looking at stats. You can only chart one metric per query, so if you want to chart time, average time, and the 90th-percentile average time, you must run three queries instead of one that returns all three values. (The same limitation existed in version 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana, a time series charting front end. OpenTSDB stores its data in HBase (on Hadoop), so it can scale out massively. Most of the others store events as row-like records.
For capturing statistics, you can use either statsd or Riemann (or Riemann feeding into statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a database.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
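To give a sense of how lightweight the statsd side of this is, a metric can be pushed with nothing more than a UDP datagram. A minimal sketch, assuming a statsd daemon listening on its default port 8125 and a made-up metric name:

# Record one timing sample of 320 ms for a page-load metric:
echo "web.page_load:320|ms" | nc -u -w1 127.0.0.1 8125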

How to find out what is causing a slow down of the application?

This is not the typical question, but I'm out of ideas and don't know where else to go. If there are better places to ask this, just point me there in the comments. Thanks.
Situation
We have this web application that uses Zend Framework, so it runs in PHP on an Apache web server. We use MySQL for data storage and memcached for object caching.
The application has a very unusual usage and load pattern. It is a mobile web application where, every full hour, a cronjob looks through the database for users that have some information waiting or an action to take and sends this information to an (external) notification server, which pushes the notifications to them. After the users get these notifications, they go to the app and use it, mostly for a very short time. An hour later, the same thing happens.
Problem
In the last few weeks, usage of the application really started to grow. In the last few days we have encountered very high load and a doubling of application response times during and after the sending of these notifications (so basically every hour). The server doesn't crash or stop responding to requests; it just gets slower and slower and often takes 20 minutes to recover, until the same thing starts again on the hour.
We have extensive monitoring in place (New Relic, collectd), but I can't figure out what's wrong; I can't find the bottleneck. That's where you come in:
Can you help me figure out what's wrong and maybe how to fix it?
Additional information
The server is a 16-core Intel Xeon (8 cores with hyperthreading, I think) with 12 GB RAM running Ubuntu 10.04 (Linux 3.2.4-20120307 x86_64). Apache is 2.2.x and PHP is version 5.3.2-1ubuntu4.11.
If any configuration information would help analyze the problem, just comment and I will add it.
Graphs
(GIF screenshots were attached for: info: phpinfo(), APC status, memcache status; collectd: processes, CPU, Apache, load, MySQL, vmem, disk; New Relic: application performance, server overview, processes, network, disks. Sorry the graphs are GIFs and not all from the same time period, but I think the most important info is in there.)
The problem is almost certainly MySQL-based. If you look at the final graph (mysql/mysql_threads), you can see the number of threads hits 200 (which I assume is your setting for max_connections) at 20:00. Once max_connections has been hit, things do tend to take a while to recover.
Using mtop to monitor MySQL just before the hour will really help you figure out what is going on, but if you cannot install it you could just use SHOW PROCESSLIST;. You will need to establish your connection to MySQL before the problem hits. You will probably see lots of processes queued up with only one process currently executing. That will be the most likely culprit.
Having identified the query causing the problems, you can attack your code. Without understanding how your application actually works, my best guess is that wrapping the problem query (or queries) in an explicit transaction will probably solve the problem.
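For example, if the hourly cronjob currently fires its statements one at a time with autocommit on, batching them looks roughly like this. A sketch only; the table and column names are invented for illustration:

START TRANSACTION;
UPDATE notifications SET sent_at = NOW() WHERE user_id = 123;
UPDATE users SET last_notified_at = NOW() WHERE id = 123;
-- ...the rest of the batch for this run...
COMMIT;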
Good luck!

How To Interpret Siege and/or Apache Bench Results

We have a MySQL driven site that will occasionally get 100K users in the space of 48 hours, all logging into the site and making purchases.
We are attempting to simulate this kind of load using tools like Apache Bench and Siege.
While the key metric seems to me to be the number of concurrent users, and we've got our report results, we still feel like we're in the dark.
What I want to ask is: What kinds of things should we be testing to anticipate this kind of traffic?
50 concurrent users 1,000 times? 500 concurrent users 10 times?
We're looking at DB errors, apache timeouts, and response times. What else should we be looking at?
This is a vague question and I know there is no "right" answer, we're just looking for some general thoughts on how to determine what our infrastructure can realistically handle.
Thanks in advance!
Simultaneous users is certainly one of the key factors, especially as it applies to DB connection pools, etc. But you will also want to verify that the page rate (pages/sec) of your tests is in the range you expect. If the think time in your test cases is off by much, you can accidentally simulate a much higher (or lower) page rate than your real-world traffic. Think time is the amount of time the user spends between page requests: reading the page, filling out a form, etc.
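With the two tools mentioned in the question, concurrency, request volume and think time map onto command-line options roughly as follows. A sketch only; the numbers and URL should be replaced by your own traffic model:

# Apache Bench: 10,000 requests total, 50 concurrent, no think time:
ab -n 10000 -c 50 http://www.example.com/

# Siege: 50 simulated users for 10 minutes, each pausing up to 5 seconds
# between requests (the think time discussed above):
siege -c 50 -t 10M -d 5 http://www.example.com/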
Depending on what other information you have on hand, this might help you calculate the number of simultaneous users to simulate:
Virtual User Calculators
The complete page load time seen by the end-user is usually the most important metric to evaluate system performance. You'll also want to look for failure rates on all transactions. You should also be on the lookout for transactions that never complete. Some testing tools do not report these very well, allowing simulated users to hang indefinitely when the server doesn't respond...and not reporting this condition. Look for tools that report the number of users waiting on a given page or transaction and the average amount of time those users are waiting.
As for the server-side metrics to look for, what other technologies is your app built on? You'll want to look at different things for a .NET app vs. a PHP app.
Lastly, we have found it very valuable to look at how the system responds to increasing load, rather than looking at just a single level of load. This article goes into more detail.
Ideally you are going to want to model your usage to the user, but creating simulated concurrent sessions for 100k users is usually not easily accomplished.
The best source would be to check your logs for the busiest hour and try to figure out a way to model that load level.
The database is usually a critical piece of infrastructure, so I would look at recording the number and length of lock waits as well as the number and duration of db statements.
Another key item to look at is disk queue lengths.
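On the MySQL side, the lock waits and slow statements mentioned above can be sampled with plain status counters. A minimal sketch, assuming InnoDB tables and the slow query log enabled for per-statement detail:

-- Cumulative InnoDB row-lock waits and how long they have taken:
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';

-- Count of statements that exceeded long_query_time:
SHOW GLOBAL STATUS LIKE 'Slow_queries';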
Mostly the process is to look for slow responses, either across the whole site or for specific pages, and then home in on the cause.
The biggest problem with load testing is that it is quite hard to test your network: if you have (as most public sites do) limited bandwidth through your ISP, that may create a performance issue that is not reflected in the load tests.