mysql: average time for the connected visitors - optimization - mysql

The system I work is a little more complex to explain here but I can reduce it to something more simple.
Let's say I have a simple chat server and I count the seconds every client stays connected and save it in a table(I'm using mysql).
So every time a client connects I save the time he stays connected in seconds (int)
If he disconnects and connects again I save this info in another row because this is how I want. The number of times a client connects to the server in a day is between 50k-500k or even more(I know, I know but this is related to my complex system but irrelevant to my question here).
My problem is that I want to show to every client some stats about his visits similar to google analytics(by days), to be more specific I'm interested in showing the average time he spent on a certain day.
I'm looking at an optimized way to do this. So far I've thought about the following solutions:
use select avg(time) from table where date=.... but speed problems might occur
save the avg time in a separate table for every day and user. This solutions is ok, but raises another question: how do I save the average time? here are the situations I was thinking:
a) use mysql trigger to update the stats every time a client is connecting (using INSERT AFTER ...) this solution is not bad, however like I said the client can connect 500k times/day which means 500k times mysql needs to calculate the average time
b) make a separate application similar to a cron job or a timer task that updates the stats every X hours ,this way I know the mysql server will be used only once a few hours depending on the number of clients I have.
So far I'm thinking of implementing the 2.b solution, but I said to ask you first before proceeding. If you have better ideas please share.
Thanks

You can use solution a but don't recalculate the average over and over again. You can do it by storing the current average and the amount of items that where used to calculate the average. Your formula would be like:
(current_average*number_of_old_items+new_value)/(number_of_old_items+1)

In my opinion, this:
speed problems might occur
is not enough reason to avoid what is certainly the simplest and least error-prone solution, especially when it is so easy to change if and when speed problems do occur.
That being said — in the event of speed problems, I agree with your assessment: better to use a scheduled job that computes the average than to add a trigger that will impose a penalty on every insert.

Related

Running a cron to update 1 million records in every hour fails

We have an E-commerce system with more than 1 million users with a total or 4 to 5 million records in order table. We use codeigniter framework as back end and Mysql as database.
Due to this excessive number of users and purchases, we use cron jobs to update the order details and referral bonus points in every hour to make the things work.
Now we have a situation that these data updates exceeds one hour and the next batch of updates reach before finishing the previous one, there by leading into a deadlock and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get rid of this situation. We are using only the monolithic architecture to run this application.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERT. (A one INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but can be done with IODKU.
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect you way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or that's how many you have in total for all time? Obviously just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once. I'm pretty sure there's a proper term for that but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?

Best and most efficient way for ELO-score calculation for users in database

I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example. For every user in a large set of users, a complex formule, based on variable amounts of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the amount of posts posted, connections accepted, messages sent, amount of sessions in a time period of one month, .. other things done etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If let's say, 500 users are active, all of them are performing actions, the database would be having a hard time I think. There would them also run a script to re-calculate the score for inactive users (to lower their score)
Once a week: If we have for example 5.000 users (for our first phase), than that would result into running the calculation formula 5.000 times and could take a long time and will increase in time when more users join.
The calculation-queries for a single variable in a the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections" which takes a few joins.
I started with "logging" every action into a table for this purpose, just the counter values and increase/decrease them with every action and running the formula with these values (a record per week). This works but can't be applied for every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that real-time on every swipe, every setting change, .. or if they have like a script that runs continiously for a small batch of users each time.
Where it all comes down to. What would be the most efficient/non-database-table-locking way to do this, with keeping the idea in mind that there will be a moment that we're having 50.000 users for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.

update specific column of a row in mysql after every 5 minutes unsing node.js

I am working on a project with node.js (not express) server and a mysql database. When a user clicks a button on the page, it uploads 2 values (say SpecificName, Yes/No). Now these values get inserted into the mysql database through the node server. Later, mysql runs a check for the specificName column (if it finds none, it then creates a column with the same Name) and updates the second value in it.
Now I would like to keep every update of the second value that the user makes through website (i.e yes) for 5 minutes in the mysql database after which it automatically updates the the specific location with another value (say cancel). I'm auspicious in solving every thing except this 5 minutes paradox. Also I'm keeping 15-20 so called specificName columns in which the value (say yes/no) is being updated and at the same time there are more than 1000 rows that are working simultaneously so a lots of 5 minute timers going on for the values. Is there a way to store value temporarily in mysql after which it is destroyed automatically?.
I came across :
node-crons (too complex, don't even know if its a right choice)
mysql events (I'm not sure how to use it with node)
timestamp (can't create more than one timestamp (guess I need one for each column))
datetime (haven't tested it yet) and other things like
(DELETE FROM table WHERE timestamp > DATE_SUB(NOW(), INTERVAL 5 MINUTE)).
Now I have no idea what to use or how to resolve this dilemma.
Any help would be appreciated.
Per my conversation with Sammy on kik, I'm pretty sure you don't want to do this. This doesn't sound like a use case that fits MySQL. I also worry that your MySQL knowledge is super limited, in which case, you should take the time to do more research on MySQL. Without a better understanding of the larger goal(s) your application is trying to accomplish, I can't suggest better alternatives. If you can think of a way to explain the application behavior without compromising the product idea, that would be very helpful in helping us solve your problem.
General things I want to make clear before giving you potential answers:
You should not be altering columns from your application. This is one of my issues with the Node/Mongo world. Relational databases don't like frequently changing table definitions. It's a quick way to a painful day. Doing so is fine in non-relational systems like Mongo or Cassandra, but traditional relational databases do not like this. The application should only be inserting, updating, and deleting rows. Ye hath been warned.
I'm not sure you want to put data into MySQL that has a short expiration date. You probably want some sort of caching solution like memcache or redis. Now, you can make MySQL behave like a cache, but this is not its intended use. There are better solutions. If you're persistent on using MySQL for this, I recommend investigating the MEMORY storage engine for faster reads/writes at the cost of losing data if the system suddenly shuts down.
Here are some potential solutions:
MySQL Events - Have a timestamp column and an event scheduled to run... say every minute or so. If the event finds that a row has lived more than 5 minutes, delete it.
NodeJS setTimeout - From the application, after inserting the record(s), set a timeout for 5 minutes to go and delete said records. You'll probably want to ensure you have some sort of id or timestamp column for supahfast reference of the values.
Those are the two best solutions that come to mind for me. Again, if you're comfortable revealing how your application behaves that requires an unusual solution like this, we can likely help you arrive at a better solution.
ok so guess i figured it out myself. I'm posting this answer for all those who still deal with this query.I used DATETIME stamp in mysql that i created for each column of specificName.so with every specificName column,there exist another specificName_TIME column that stores the time at which the value (yes/no) is updated.the reason i didn't use timestamp is because its not possible to create end number of timestamp in mysql versions lower than 5.6.Now i updated the current time by adding 5 minutes before storing it in database.Then i ran 2 chain functions. First one checks if the datetime in the database is smaller than the current time (SELECT specificName FROM TABLE WHERE specificName_TIME < NOW()).If it turns out to be true it shows me the value else it reflects null.Then i ran the second function to update the value if its true to continue the whole process again and if not to continue it anyways updating the last value with null.
HOPE THIS HELPS.

Reliably select from a database table at fixed time intervals

I have a fairly 'active' CDR table I want to select records from it every say 5 minutes for those last 5 minutes. The problem is it has a SHA IDs generated on a few of the other columns so all I have to lean on is a timestamp field by which I filter by date to select the time window of records I want.
The next problem is that obviously I cannot guarantee my script will run on the second precisely every time, or that the wall clocks of the server will be correct (which doesn't matter) and most importantly there almost certainly will be more than one record per second say 3 rows '2013-08-08 14:57:05' and before the second expired one more might be inserted.
By the time for '2013-08-08 14:57:05' and get records BETWEEN '2013-08-08 14:57:05' AND '2013-08-08 15:02:05' there will be more records for '2013-08-08 14:57:05' which I would have missed.
Essentially:
imprecise wall clock time
no sequential IDs
multiple records per second
query execution time
unreliable frequency of running the query
Are all preventing me from getting a valid set of rows in a specified rolling time window. Any suggestions for how I can go around these?
If you are using the same clock then i see no reason why things would be wrong. a resolution you would want to consider is a datetime table. So that way, every time you updated the start and stop times based on the server time.... then as things are added it would be guarenteed to be within that timeframe.
I mean, you COULD do it by hardcoding, but my way would sort of forcibly store a start and stop point in the database to use.
I would use Cron to handle the intervals and timing somewhat. Not use the time from that, but just to not lock up the database by checking all the time.
I probably not got all the details but to answer to your question title "Reliably select from a database table at fixed time intervals"...
I don't think you could even hope for a query to be run at "second precise" time.
One key problem with that approach is that you will have to deal with concurrent access and lock. You might be able to send the query at fixed time maybe, but your query might be waiting on the DB server for several seconds (or being executed seeing fairly outdated snapshot of the db). Especially in your case since the table is apparently "busy".
As a suggestion, if I were you, I would spend some time to think about queue messaging systems (like http://www.rabbitmq.com/ just to cite one, not presaging it is somehow "your" solution). Anyway those kind of tools are probably more suited to your needs.

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

The aim is: getting the highest number of rows and not getting more rows than rows loaded, after 5 seconds. The aim is not creating a timeout.
after months, I thought maybe this would work and it didn't:
declare #d1 datetime2(7); set #d1=getdate();
select c1,c2 from t1 where (datediff(ss,#d1,getdate())<5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in php, this would be something like mysql_unbuffered_query())... you could then store these into an array while the query is running. You could then set the mysql query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response it can then terminate the loop and you'll have an array with all of the records returned in 5 minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, getting 99 queries at once should proove faster than getting them one-by-one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more queries at once is more efficient, leaving you with 100rows at a time of 7.5s. You calculate that by average you have so far gotten 100rows per 7.5s, calculate a new amount of possible queries for the rest of the time and query again, and so on. You would, however, need to set a threshold limit for this loop, let's say something like: Don't get any new queries any more after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about second one. I'm not that familiar with php, but I think it does not matter. We use fetching to get first records quickly and show them to the user, then fetch records as needed. In ADO.NET you could use IDataReader to get records one by one, in php I think you could use similar methods, for example - mysqli_fetch_row in mysqli extension or mysql_fetch_row in mysql extension. In this case you could stop reading data at any moment.