I have a fairly 'active' CDR table, and I want to select records from it every 5 minutes or so, covering the last 5 minutes. The problem is that its IDs are SHA hashes generated from a few of the other columns, so all I have to lean on is a timestamp field, which I filter by date to select the time window of records I want.
The next problem is that I obviously cannot guarantee my script will run precisely on the second every time, or that the server's wall clock will be correct (which doesn't matter), and most importantly there will almost certainly be more than one record per second: say 3 rows at '2013-08-08 14:57:05', and before that second expires one more might be inserted.
By the time I query for records BETWEEN '2013-08-08 14:57:05' AND '2013-08-08 15:02:05', more records for '2013-08-08 14:57:05' will have been inserted after the previous run already covered that second, and those I would have missed.
Essentially:
imprecise wall clock time
no sequential IDs
multiple records per second
query execution time
unreliable frequency of running the query
All of these prevent me from reliably getting a valid set of rows in a specified rolling time window. Any suggestions for how I can work around them?
If you are using the same clock, then I see no reason why things would go wrong. One solution you might want to consider is a datetime table: every run, you update the start and stop times based on the server time, so that as rows are added they are guaranteed to fall within that timeframe.
I mean, you COULD do it by hardcoding the window, but my way explicitly stores a start and stop point in the database for you to use.
I would use cron to handle the intervals and timing, somewhat - not to take the time from it, but just so you don't lock up the database by checking all the time.
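A minimal sketch of that idea, assuming the CDR table is called cdr with a ts timestamp column, plus a single-row bookkeeping table ingest_window (all of these names are assumptions, not from the original post):

-- Bookkeeping table holding the last fully processed point in time.
CREATE TABLE ingest_window (
    id        TINYINT  PRIMARY KEY,   -- always 1; single row
    last_stop DATETIME NOT NULL
);

-- Each run: pick a stop point slightly in the past (so the current,
-- possibly still-filling second is never included), select the half-open
-- window [last_stop, new_stop), then advance the marker.
SET @new_stop = NOW() - INTERVAL 1 SECOND;

SELECT c.*
FROM cdr AS c
JOIN ingest_window AS w ON w.id = 1
WHERE c.ts >= w.last_stop
  AND c.ts <  @new_stop;

UPDATE ingest_window SET last_stop = @new_stop WHERE id = 1;

Using a half-open window driven by a stored marker avoids both duplicate rows at the boundaries and the still-filling final second, regardless of exactly when the script fires.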
I probably haven't got all the details, but to answer your question title, "Reliably select from a database table at fixed time intervals"...
I don't think you can even hope for a query to run at a "second-precise" time.
One key problem with that approach is that you will have to deal with concurrent access and locking. You might be able to send the query at a fixed time, but it might then sit waiting on the DB server for several seconds (or execute against a fairly outdated snapshot of the database), especially in your case, since the table is apparently "busy".
As a suggestion, if I were you I would spend some time looking into message queue systems (like http://www.rabbitmq.com/, just to cite one, without presuming it is somehow "your" solution). Those kinds of tools are probably better suited to your needs.
Related
We have an e-commerce system with more than 1 million users and a total of 4 to 5 million records in the order table. We use the CodeIgniter framework as the back end and MySQL as the database.
Due to this large number of users and purchases, we use cron jobs to update the order details and referral bonus points every hour to keep things working.
Now we have a situation where these data updates take more than one hour, and the next batch of updates arrives before the previous one finishes, thereby leading to deadlocks and failure of the system.
I'd like to know about the different possible architectural and database scaling options and suggestions to get out of this situation. We are running this application as a single monolith.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would have only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERTs. (One INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is trickier, but can be done with IODKU (INSERT ... ON DUPLICATE KEY UPDATE); a sketch follows this list.
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
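As an illustration of a batched INSERT combined with IODKU, here is a minimal sketch against a hypothetical bonus_points table; the table and column names are assumptions, not your schema:

-- Hypothetical table accumulating referral bonus points per user.
CREATE TABLE bonus_points (
    user_id INT UNSIGNED PRIMARY KEY,
    points  INT NOT NULL
);

-- One statement upserts a whole batch instead of issuing one UPDATE per user.
INSERT INTO bonus_points (user_id, points)
VALUES (101, 5), (102, 12), (103, 7)   -- ... batch of 100-1000 rows
ON DUPLICATE KEY UPDATE
    -- MySQL 8.0.20+ also offers a row-alias syntax instead of VALUES()
    points = points + VALUES(points);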
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect your way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records: are they all active orders, or is that how many you have in total for all time? Obviously, process only the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can do updates over multiple records at once (a sketch follows these questions). I'm pretty sure there's a proper term for that, but I can't recall it. Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?
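As an illustration of a set-based, multi-record update, here is a minimal sketch; the orders and referral_rules tables and their columns are assumptions, not your schema:

-- Instead of looping over orders in application code and updating one row
-- at a time, let the database update every qualifying row in one statement.
UPDATE orders AS o
JOIN referral_rules AS r ON r.plan_id = o.plan_id
SET    o.bonus_points  = o.bonus_points + r.points_per_order,
       o.bonus_applied = 1
WHERE  o.status = 'completed'
  AND  o.bonus_applied = 0;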
Last year I was working on a project for university where one feature necessitated the expiry of records in the database with almost to-the-second precision (i.e. exactly x minutes/hours after creation). I say 'almost' because a few seconds probably wouldn't have meant the end of the world for me, although I can imagine that in something like an auction site, this probably would be important (I'm sure these types of sites use different measures, but just as an example).
I did some research on MySQL events and ended up using them, although now that I think back on it I'm wondering if there is a better way to do what I did (which wasn't all that precise or efficient). There are three methods I can think of that use events to achieve this; I want to know whether these methods would be effective and efficient, or whether there is some better way:
1. Schedule an event to run every second and update expired records. I imagine this would cause issues as the number of records increases and the update takes longer than a second to execute, and it might even interfere with normal database operations. Correct me if I'm wrong. (A minimal sketch of this method follows the list.)
2. Schedule an event that runs every half-hour or so (could be any time interval, really), updating expired records. At the same time, impose selection criteria when querying the database to only return records whose expiration date has not yet passed, so that any records that expired since the last event execution are not retrieved. While this would be accurate at the time of retrieval, it defeats the purpose of having the event in the first place, and I'd assume the extra selection criteria would slow down the select query. In my project last year, I used this method, and the event updating the records was really only for backend logging purposes.
3. At insert, have a trigger that creates a dynamic event specific to the record that will expire it precisely when it should expire. After the expiry, delete the event. I feel like this would be a great method of doing it, but I'm not too sure whether having so many events running at once would impact the performance of the database (imagine a database that has even 60 inserts an hour: that's 60 events all running simultaneously for just one hour. Over time, depending on how long the expiration is, this would add up).
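A minimal sketch of method 1, assuming a table named records with an expires_at DATETIME column and an is_expired flag (these names are assumptions):

-- The event scheduler must be enabled, e.g. SET GLOBAL event_scheduler = ON;
CREATE EVENT expire_records
ON SCHEDULE EVERY 1 SECOND
DO
  UPDATE records
  SET    is_expired = 1
  WHERE  is_expired = 0
    AND  expires_at <= NOW();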
I'm sure there are more ways you could do this - maybe using a separate script that runs externally to the RDBMS is an option - but these are the ones I was thinking about. If anyone has any insight as to how you might expire a record with precision, please let me know.
Also, despite the fact that I actually did use it in the past, I don't really like method 2, because while it works for the expiration of records, it doesn't really help me if, instead of expiring a record at a precise time, I wanted to make it active at a certain time (i.e. a scheduled post on a blog site). So for this reason, if you have a method that would work to update a record at a precise time, regardless of what that update does (expire or post), I'd be happy to hear it.
Option 3:
At insert, have a trigger that creates a dynamic event specific to the record that will expire it precisely when it should expire. After the expiry, delete the event. I feel like this would be a great method of doing it, but I'm not too sure if having so many events running at once would impact on the performance of the database (imagine a database that has even 60 inserts an hour - that's 60 events all running simultaneously for just one hour. Over time, depending on how long the expiration is, this would add up).
If you know the expiry time at insert, just put it in the table:
library_record - id, ..., create_at, expire_at
And query live records with the condition:
expire_at > NOW()
Same with publishing:
library_record - id, ..., create_at, publish_at, expire_at
Where:
publish_at <= NOW() AND expire_at > NOW()
You can set publish_at = create_at for immediate publication or just drop create_at if you don't need it.
Each of these, with the correct indexing, will have performance comparable to an is_live = 1 flag in the table and will save you a lot of event-related headaches.
You will also be able to see easily exactly why a record isn't live and when it expired or will be published. You can also query for things such as records that expire soon and send reminders with ease.
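A minimal sketch of that schema with supporting indexes (the column types are assumptions):

CREATE TABLE library_record (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    -- ... your other columns ...
    create_at  DATETIME NOT NULL,
    publish_at DATETIME NOT NULL,
    expire_at  DATETIME NOT NULL,
    KEY idx_publish (publish_at),
    KEY idx_expire  (expire_at)
);

-- Live records: already published and not yet expired.
SELECT *
FROM   library_record
WHERE  publish_at <= NOW()
  AND  expire_at  >  NOW();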
I am working on a project with a node.js server (not express) and a MySQL database. When a user clicks a button on the page, it uploads 2 values (say SpecificName and Yes/No). These values get inserted into the MySQL database through the node server. MySQL then checks for the specificName column (if it finds none, it creates a column with that name) and updates the second value in it.
Now I would like to keep every update of the second value that the user makes through the website (i.e. yes) in the MySQL database for 5 minutes, after which it automatically updates that specific location with another value (say cancel). I've managed to solve everything except this 5-minute problem. Also, I'm keeping 15-20 so-called specificName columns in which the value (say yes/no) is updated, and at the same time there are more than 1000 rows being worked on simultaneously, so there are lots of 5-minute timers going for the values. Is there a way to store a value temporarily in MySQL, after which it is destroyed automatically?
I came across:
node-cron (too complex, and I don't even know if it's the right choice)
MySQL events (I'm not sure how to use them with node)
TIMESTAMP (can't create more than one auto-updating TIMESTAMP column in versions before 5.6, and I guess I'd need one for each column)
DATETIME (haven't tested it yet), and other things like
(DELETE FROM table WHERE timestamp < DATE_SUB(NOW(), INTERVAL 5 MINUTE)).
Now I have no idea what to use or how to resolve this dilemma.
Any help would be appreciated.
Per my conversation with Sammy on kik, I'm pretty sure you don't want to do this. This doesn't sound like a use case that fits MySQL. I also worry that your MySQL knowledge is super limited, in which case you should take the time to do more research on MySQL. Without a better understanding of the larger goal(s) your application is trying to accomplish, I can't suggest better alternatives. If you can think of a way to explain the application's behavior without compromising the product idea, that would go a long way toward helping us solve your problem.
General things I want to make clear before giving you potential answers:
You should not be altering columns from your application. This is one of my issues with the Node/Mongo world. Relational databases don't like frequently changing table definitions. It's a quick way to a painful day. Doing so is fine in non-relational systems like Mongo or Cassandra, but traditional relational databases do not like this. The application should only be inserting, updating, and deleting rows. Ye hath been warned.
I'm not sure you want to put data into MySQL that has a short expiration date. You probably want some sort of caching solution like memcache or Redis. Now, you can make MySQL behave like a cache, but this is not its intended use; there are better solutions. If you're set on using MySQL for this, I recommend investigating the MEMORY storage engine for faster reads/writes, at the cost of losing the data if the system suddenly shuts down.
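For example, a minimal sketch of a MEMORY table for short-lived values (the table and column names are assumptions):

-- Rows in a MEMORY table live only in RAM and are lost on restart,
-- which is often acceptable for cache-like, short-lived values.
CREATE TABLE pending_answers (
    row_id     INT UNSIGNED NOT NULL,
    answer     VARCHAR(16)  NOT NULL,
    created_at DATETIME     NOT NULL,
    PRIMARY KEY (row_id)
) ENGINE = MEMORY;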
Here are some potential solutions:
MySQL Events - Have a timestamp column and an event scheduled to run... say every minute or so. If the event finds that a row has lived more than 5 minutes, delete it. (A minimal sketch follows below.)
NodeJS setTimeout - From the application, after inserting the record(s), set a timeout for 5 minutes to go and delete said records. You'll probably want to ensure you have some sort of id or timestamp column for supahfast reference of the values.
Those are the two best solutions that come to mind for me. Again, if you're comfortable revealing how your application behaves that requires an unusual solution like this, we can likely help you arrive at a better solution.
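A minimal sketch of the MySQL Events option, assuming a table named answers with a created_at column (both names are assumptions):

-- The event scheduler must be enabled: SET GLOBAL event_scheduler = ON;
CREATE EVENT purge_stale_answers
ON SCHEDULE EVERY 1 MINUTE
DO
  DELETE FROM answers
  WHERE created_at < NOW() - INTERVAL 5 MINUTE;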
OK, so I guess I figured it out myself. I'm posting this answer for all those who still deal with this problem. I used a DATETIME column in MySQL that I created for each specificName column, so alongside every specificName column there is a specificName_TIME column that stores the time at which the value (yes/no) was updated. The reason I didn't use TIMESTAMP is that it's not possible to create an arbitrary number of auto-updating TIMESTAMP columns in MySQL versions lower than 5.6. I add 5 minutes to the current time before storing it in the database. Then I run two chained functions. The first checks whether the datetime in the database is earlier than the current time (SELECT specificName FROM table WHERE specificName_TIME < NOW()); if it is, it shows me the value, otherwise it returns null. The second function then updates the value and, if the check was true, continues the whole process again; if not, it continues anyway after overwriting the last value with null.
Hope this helps.
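A minimal sketch of what that looks like for one of those columns (my_table, id, and the value 42 are assumptions):

-- Store the moment the value becomes stale (now + 5 minutes) next to it.
UPDATE my_table
SET    specificName      = 'yes',
       specificName_TIME = NOW() + INTERVAL 5 MINUTE
WHERE  id = 42;

-- The periodic check: rows whose stored time has passed are due for reset.
SELECT id, specificName
FROM   my_table
WHERE  specificName_TIME < NOW();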
The system I work on is a little too complex to explain here, but I can reduce it to something simpler.
Let's say I have a simple chat server, and I count the seconds every client stays connected and save that in a table (I'm using MySQL).
So every time a client connects, I save the time he stays connected, in seconds (int).
If he disconnects and connects again, I save this info in another row, because that is how I want it. The number of times a client connects to the server in a day is between 50k-500k or even more (I know, I know, but this is related to my complex system and irrelevant to my question here).
My problem is that I want to show every client some stats about his visits, similar to Google Analytics (by day); to be more specific, I'm interested in showing the average time he spent connected on a given day.
I'm looking for an optimized way to do this. So far I've thought about the following solutions:
use SELECT AVG(time) FROM table WHERE date = ..., but speed problems might occur
save the average time in a separate table for every day and user. This solution is OK, but it raises another question: how do I save the average time? Here are the options I was considering:
a) use a MySQL trigger to update the stats every time a client connects (using AFTER INSERT ...). This solution is not bad; however, like I said, a client can connect 500k times/day, which means MySQL needs to recalculate the average time 500k times
b) make a separate application, similar to a cron job or a timer task, that updates the stats every X hours; this way I know the MySQL server will only be hit once every few hours, depending on the number of clients I have
So far I'm leaning toward implementing solution 2b, but I thought I'd ask you first before proceeding. If you have better ideas, please share.
Thanks
You can use solution 2a, but don't recalculate the average over and over again. You can avoid that by storing the current average and the number of items that were used to calculate it. Your formula would be something like:
(current_average*number_of_old_items+new_value)/(number_of_old_items+1)
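A minimal sketch of that running average kept per user and day via INSERT ... ON DUPLICATE KEY UPDATE (the table and column names are assumptions):

-- One row per (user, day); n counts the sessions folded into avg_time.
CREATE TABLE daily_stats (
    user_id  INT UNSIGNED NOT NULL,
    stat_day DATE         NOT NULL,
    avg_time DOUBLE       NOT NULL,
    n        INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, stat_day)
);

-- Run once per recorded session; 123 seconds is just an example value.
-- avg_time is assigned before n, so the formula still sees the old n.
INSERT INTO daily_stats (user_id, stat_day, avg_time, n)
VALUES (42, CURRENT_DATE, 123, 1)
ON DUPLICATE KEY UPDATE
    avg_time = (avg_time * n + VALUES(avg_time)) / (n + 1),
    n        = n + 1;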
In my opinion, this:
speed problems might occur
is not enough reason to avoid what is certainly the simplest and least error-prone solution, especially when it is so easy to change if and when speed problems do occur.
That being said — in the event of speed problems, I agree with your assessment: better to use a scheduled job that computes the average than to add a trigger that will impose a penalty on every insert.
The aim is to get the highest number of rows that can be loaded within 5 seconds, and not to request more rows than can actually be loaded in that time. The aim is not to create a timeout.
After months, I thought maybe this would work, but it didn't:
declare @d1 datetime2(7); set @d1 = getdate();
select c1, c2 from t1 where (datediff(ss, @d1, getdate()) < 5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
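A rough sketch of that bookkeeping kept in a small table, with the 1% adjustments applied after each run (the table, column names, thresholds, and the :elapsed_secs bind parameter are assumptions, and the periodic reset of the moving average is omitted for brevity):

-- One row per tuned query: the current LIMIT and a running average of
-- recent execution times in seconds.
CREATE TABLE query_tuning (
    query_name VARCHAR(64)  PRIMARY KEY,
    row_limit  INT          NOT NULL,
    avg_secs   DECIMAL(6,3) NOT NULL,
    runs       INT          NOT NULL
);

-- After each execution the application reports the elapsed time and
-- nudges the limit up or down by 1% around the 4.5s / 4.75s thresholds.
UPDATE query_tuning
SET    avg_secs  = (avg_secs * runs + :elapsed_secs) / (runs + 1),
       runs      = runs + 1,
       row_limit = CASE
                       WHEN avg_secs < 4.50 THEN ROUND(row_limit * 1.01, 0)
                       WHEN avg_secs > 4.75 THEN ROUND(row_limit * 0.99, 0)
                       ELSE row_limit
                   END
WHERE  query_name = 'big_report';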
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report
I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age... and all that user needed was to see the 10 most aged. My law is this: the number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query, you could try an unbuffered query (in PHP, this would be something like mysql_unbuffered_query()) and store the rows into an array while the query is running. You could then set the MySQL query timeout to five seconds. When the query is killed, if you've set your while() loop to check for a timeout response, it can terminate the loop and you'll have an array with all of the records returned within the 5 seconds. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, issuing the 99 queries at once should prove faster than running them one by one (which is what your initial calculation assumed). So you get the 99 results and check the time again.
Let's say the operation performed 1.5 times as fast as the single queries would have, because getting more rows at once is more efficient, leaving you with 100 rows at the 7.5 s mark. You calculate that on average you have so far gotten 100 rows per 7.5 s, compute a new number of possible queries for the remaining time, and query again, and so on. You would, however, need to set a threshold for this loop, something like: don't issue any new queries after 9.9 s.
This solution is obviously neither the smoothest nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fetch the results (fill your client-side table or read the rows)
You are asking about the second one. I'm not that familiar with PHP, but I think it does not matter. We use fetching to get the first records quickly and show them to the user, then fetch more records as needed. In ADO.NET you could use IDataReader to get records one by one; in PHP I think you can use similar methods, for example mysqli_fetch_row in the mysqli extension or mysql_fetch_row in the mysql extension. That way you can stop reading data at any moment.