I have developed a script which works with a large MySQL database. The script works on IIS, MySQL, ASP Classic. The script selects 10,000 or 100,000 records and works with the records one by one and updates the database. Everything works fine, but with very slow performance. The reason for the slowness is not because of select or update statements or slow server, but because of working with those records one by one and doing some changes, then updates.
For example
SELECT * from mytable
WHERE isempty(title)
ORDER BY length(title) DESC LIMIT 100000;
Then working with those 100,000 records one by one takes, e.g. 100,000 minutes. So, I want to run the same script with 2 or 3 browsers, let say IE, Chrome, FireFox..
I was thinking to do it like this, but I am not sure if it is possible or not.
On IIS when runs the script on browser 1, it selects 100,000 records and starts work on them and starts making some changes. On browser 2 it selects 100,000 but on database less records with same condition, it might selects 90,000 and start work. Since the browser started little early, it might do some changes, so while both threads work, each other has to see those changes and work with those changes. For example the title finished on current record, then pass that record and choose another one. Is that possible? I am not sure and never used the cursor location and cursor type or whatever..
Let say 101,000 records are on the database, script 1 started first and selects 100,000 rows. After 100 minutes browser 2 starts. But when browser 2 selects 100,000 rows, that time the browser 1 has already finished 10,000, so the browser 2 will get only 91,000 records. But since those two browsers work on the same record, how can they see each others changes?
Is there any solution for my current situation? I am not MySQL expert, thats why I don't know what to do.
I am sorry for my English, but I hope you understand my question.
UPDATE;
this is not because of any script problems, or slow server problem or any other problem. this is slow because between "DO WHILE RS.EOF" AND "LOOP" I do lots of things AND aswell it doesn't really takes one minute per record, just saying an example. but I was thinking simultaneously 2 or 3 instances running the script.
ASP-Classic does not support the type of multi-threading you are looking for, however you could write a COM component or something similar that would and call it from your page.
Unless there is some sort of input required from the user, you could also write a server-side task in VBScript/PowerShell/Python/etc. to occasionally run through the data and perform what ever task it is you are trying to accomplish. It's hard to be specific when the question isn't very specific.
Having said all that, it really does sound like there are more problems with your code than you realize. It's hard to point out due to the lack of a concrete example in front of us. If you haven't already, I'd double-check to make sure the bottlenecks are where you think they are.
I've used a crude ASP Profiler in the past to look for where the specific bottlenecks are in the ASP/VBScript sites I still maintain, and on a few occasions I've found that the problem was in the least likely spots.
The bottom line is that your question is missing a fair amount of information for providing useful answers, and seems to make some assumptions that might not necessarily be true. Show us some code, provide us with some data, and you'll probably get better answers.
Related
I'm having a hard time wrapping my head around the issue of an ELO-score-like calculation for a large amount of users on our platform.
For example. For every user in a large set of users, a complex formule, based on variable amounts of "things done", will result in a score for each user for a match-making-like principle.
For our situation, it's based on the amount of posts posted, connections accepted, messages sent, amount of sessions in a time period of one month, .. other things done etc.
I had two ideas to go about doing this:
Real-time: On every post, message, .. run the formula for that user
Once a week: Run the script to calculate everything for all users.
The concerns about these two I have:
Real-time: Would be an overkill of queries and calculations for each action a user performs. If let's say, 500 users are active, all of them are performing actions, the database would be having a hard time I think. There would them also run a script to re-calculate the score for inactive users (to lower their score)
Once a week: If we have for example 5.000 users (for our first phase), than that would result into running the calculation formula 5.000 times and could take a long time and will increase in time when more users join.
The calculation-queries for a single variable in a the entire formula of about 12 variables are mostly a simple 'COUNT FROM table', but a few are like counting "all connections of my connections" which takes a few joins.
I started with "logging" every action into a table for this purpose, just the counter values and increase/decrease them with every action and running the formula with these values (a record per week). This works but can't be applied for every variable (like the connections of connections).
Note: Our server-side is based on PHP with MySQL.
We're also running Redis, but I'm not sure if this could improve those bits and pieces.
We have the option to export/push data to other servers/databases if needed.
My main example is the app 'Tinder' which uses a sort-like algorithm for match making (maybe with less complex data variables because they're not using groups and communities that you can join)
I'm wondering if they run that real-time on every swipe, every setting change, .. or if they have like a script that runs continiously for a small batch of users each time.
Where it all comes down to. What would be the most efficient/non-database-table-locking way to do this, with keeping the idea in mind that there will be a moment that we're having 50.000 users for example?
The way I would handle this:
Implement the realtime algorithm.
Measure. Is it actually slow? Try optimizing
Still slow? Move the algorithm to a separate asynchronous process. Have the process run whenever there's an update. Really this is the same thing as 1, but it doesn't slow down PHP requests and if it gets busy, it can take more time to catch up.
Still slow? Now you might be able to optimize by batching several changes.
If you have 5000 users right now, make sure it runs well with 5000 users. You're not going to grow to 50.000 overnight, so adjust and invest in this as your problem changes. You might be surprised where your performance problems are.
Measuring is key though. If you really want to support 50K users right now, simulate and measure.
I suspect you should use the database as the "source of truth" aka "persistent storage".
Then fetch whatever is needed from the dataset when you update the ratings. Even lots of games by 5000 players should not take more than a few seconds to fetch and compute on.
Bottom line: Implement "realtime"; come back with table schema and SELECTs if you find that the table fetching is a significant fraction of the total time. Do the "math" in a programming language, not SQL.
I'm facing a dilemma regarding how to store data a user generates, because they do so very frequently.
You see, the general idea of the program is that the user answers MPC-questions, and that their progress gets stored and shown - i.e., the more questions are answered correctly, the closer to a 100% the user gets, which is updated in the DB with every answer, as well as on screen. Note that the longer a user is doing this, the faster their answers become, up to the point of answering every 1,5 second. This will NOT be uncommon.
You can see how this adds to the number of queries made - and that's without taking additional users into consideration.
Now I've thought of regular-interval updates (every X questions, or every Y minutes, etc.), but that comes with the risk of the client closing the session and effectively destroying data I would very much love to have had. I've also thought of cookies, which hold data that's read at every login, but this will always entail that the user's last session will not be stored. Also: the restriction on cookie size and numbers make this worthless, as my program offers TONS of different topics of which MPC-Questions are created.
So the question is: how can I be as efficient as possible in the amount of queries made, without losing data? Is there a common approach to this type of problem - I mean, how do statistical or just DB-heavy websites pull this stuff off?!
(Also: 80% of the program is JS (Jquery). Just sayin'.)
I'm curious what solutions or ideas will come up!
I have a fairly 'active' CDR table I want to select records from it every say 5 minutes for those last 5 minutes. The problem is it has a SHA IDs generated on a few of the other columns so all I have to lean on is a timestamp field by which I filter by date to select the time window of records I want.
The next problem is that obviously I cannot guarantee my script will run on the second precisely every time, or that the wall clocks of the server will be correct (which doesn't matter) and most importantly there almost certainly will be more than one record per second say 3 rows '2013-08-08 14:57:05' and before the second expired one more might be inserted.
By the time for '2013-08-08 14:57:05' and get records BETWEEN '2013-08-08 14:57:05' AND '2013-08-08 15:02:05' there will be more records for '2013-08-08 14:57:05' which I would have missed.
Essentially:
imprecise wall clock time
no sequential IDs
multiple records per second
query execution time
unreliable frequency of running the query
Are all preventing me from getting a valid set of rows in a specified rolling time window. Any suggestions for how I can go around these?
If you are using the same clock then i see no reason why things would be wrong. a resolution you would want to consider is a datetime table. So that way, every time you updated the start and stop times based on the server time.... then as things are added it would be guarenteed to be within that timeframe.
I mean, you COULD do it by hardcoding, but my way would sort of forcibly store a start and stop point in the database to use.
I would use Cron to handle the intervals and timing somewhat. Not use the time from that, but just to not lock up the database by checking all the time.
I probably not got all the details but to answer to your question title "Reliably select from a database table at fixed time intervals"...
I don't think you could even hope for a query to be run at "second precise" time.
One key problem with that approach is that you will have to deal with concurrent access and lock. You might be able to send the query at fixed time maybe, but your query might be waiting on the DB server for several seconds (or being executed seeing fairly outdated snapshot of the db). Especially in your case since the table is apparently "busy".
As a suggestion, if I were you, I would spend some time to think about queue messaging systems (like http://www.rabbitmq.com/ just to cite one, not presaging it is somehow "your" solution). Anyway those kind of tools are probably more suited to your needs.
The aim is: getting the highest number of rows and not getting more rows than rows loaded, after 5 seconds. The aim is not creating a timeout.
after months, I thought maybe this would work and it didn't:
declare #d1 datetime2(7); set #d1=getdate();
select c1,c2 from t1 where (datediff(ss,#d1,getdate())<5)
Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.
The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.
To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.
So...how to make this happen?
In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)
OK, So what can YOU do in your situation?
If you can't do it the "right" way, solve it with a hack.
My suggestion: keep your own "estimated cost" statistics.
Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.
So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.
And if your moving average ever exceeds 4.75s, subtract 1%.
Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)
Just one -- scratch that, two -- more things...
As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.
Jonathan VM's Law Of The Greenbar Report I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.
Your while loop idea won't solve the problem entirely. It is possible that the very first iteration through the loop could take longer than 5 seconds. Plus, it will likely result in retrieving far fewer rows in the allotted time than if you tried to do it with just a single query.
Personally, I wouldn't try to solve this exact problem. Instead, I would do some testing, and through trial and error identify a number of records that I am confident will load in under five seconds. Then, I would just place a LIMIT on the loading query.
Next, depending on the requirements I would either set a timeout on the DB call of five seconds or just live with the chance that some calls will exceed the time restriction.
Lastly, consider that on most modern hardware for most queries, you can return a very large number of records within five seconds. It's hard to imagine returning all of that data to the UI and still have it be usable, if that is your intention.
-Jason
I've never tried this, but if a script is running this query you could try running an unbuffered query (in php, this would be something like mysql_unbuffered_query())... you could then store these into an array while the query is running. You could then set the mysql query timeout to five minutes. When the query is killed, if you've set your while() loop to check for a timeout response it can then terminate the loop and you'll have an array with all of the records returned in 5 minutes. Again, I'm not sure this would work, but I'd be interested to see if it would accomplish what you're looking to do.
You could approach this problem like this, but I doubt that this logic is really what I'd recommend for real world use.
You have a 10s interval, you try one query, it gets you the row in 0.1s. That would imply you could get at least 99 similar queries still in the remaining 9.9s.
However, getting 99 queries at once should proove faster than getting them one-by-one (which your initial calculation would suggest). So you get the 99 queries and check the time again.
Let's say the operation performed 1.5 times as fast as the single query, because getting more queries at once is more efficient, leaving you with 100rows at a time of 7.5s. You calculate that by average you have so far gotten 100rows per 7.5s, calculate a new amount of possible queries for the rest of the time and query again, and so on. You would, however, need to set a threshold limit for this loop, let's say something like: Don't get any new queries any more after 9.9s.
This solution obviously is neither the most smooth nor something I'd really use, but maybe it serves to solve the OP's problem.
Also, jmacinnes already pointed out: "It is possible that the very first iteration through the loop could take longer than 10[5] seconds."
I'd certainly be interested myself, if someone can come up with a proper solution to this problem.
To get data from the table you should do two things:
execute a query (SELECT something FROM table)
fill the table or read data
You are asking about second one. I'm not that familiar with php, but I think it does not matter. We use fetching to get first records quickly and show them to the user, then fetch records as needed. In ADO.NET you could use IDataReader to get records one by one, in php I think you could use similar methods, for example - mysqli_fetch_row in mysqli extension or mysql_fetch_row in mysql extension. In this case you could stop reading data at any moment.
My question is a lot like this one. However I'm on MySQL and I'm looking for the "lowest tech" solution that I can find.
The situation is that I have 2 databases that should have the same data in them but they are updated primarily when they are not able to contact each other. I suspect that there is some sort of clustering or master/slave thing that would be able to sync them just fine. However in my cases that is major overkill as this is just a scratch DB for my own use.
What is a good way to do this?
My current approach is to have a Federated table on one of them and, every so often, stuff the data over the wire to the other with an insert/select. It get a bit convoluted trying to deal with primary keys and what not. (insert ignore seems to not work correctly)
p.s. I can easily build a query that selects the rows to transfer.
MySQL's inbuilt replication is very easy to set up and works well even when the DBs are disconnected most of the time. I'd say configuring this would be much simpler than any custom solution out there.
See http://www.howtoforge.com/mysql_database_replication for instructions, you should be up and running in 10-15 mins and you won't have to think about it again.
The only downside I can see is that it is asynchronous - ie. you must have one designated master that gets all the changes.
My current solution is
set up a federated table on the source box that grabs the table on the target box
set up a view on the source box that selects the rows to be updated (as a join of the federated table)
set up another federated table on the target box that grabs the view on the source box
issue an INSERT...SELECT...ON DUPLICATE UPDATE on the target box to run the pull.
I guess I could just grab the source table and do it all in one shot, but based on the query logs I've been seeing, I'm guessing that I'd end up with about 20K queries being run or about 100-300MB of data transfer depending on how things happen. The above setup sold result in about 4 queries and little more data transfered than actually needed to be.