I'm facing a dilemma regarding how to store data a user generates, because they do so very frequently.
You see, the general idea of the program is that the user answers MPC (multiple-choice) questions, and that their progress gets stored and shown - i.e., the more questions are answered correctly, the closer to 100% the user gets, which is updated in the DB with every answer, as well as on screen. Note that the longer a user keeps at this, the faster their answers become, up to the point of answering every 1.5 seconds. This will NOT be uncommon.
You can see how this adds to the number of queries made - and that's without taking additional users into consideration.
Now I've thought of regular-interval updates (every X questions, or every Y minutes, etc.), but that comes with the risk of the client closing the session and effectively destroying data I would very much like to have kept. I've also thought of cookies that hold data which is read at every login, but that always means the user's last session isn't stored. Also, the restrictions on cookie size and count make this worthless, as my program offers TONS of different topics for which MPC questions are created.
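To make the interval idea concrete, here's roughly the kind of batched write I had in mind - answers buffered client-side and flushed in one round trip. All table and column names here are just placeholders, not my real schema.

-- flush a buffer of, say, three answers in a single trip
INSERT INTO answer_log (user_id, question_id, is_correct, answered_at)
VALUES
  (42, 1001, 1, NOW()),
  (42, 1002, 0, NOW()),
  (42, 1003, 1, NOW());

-- and bump the progress counters once, instead of once per answer
UPDATE user_progress
SET correct_count = correct_count + 2,
    total_count   = total_count + 3
WHERE user_id = 42
  AND topic_id = 7;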
So the question is: how can I be as efficient as possible with the number of queries made, without losing data? Is there a common approach to this type of problem - I mean, how do statistics-heavy or just DB-heavy websites pull this stuff off?!
(Also: 80% of the program is JS (jQuery). Just sayin'.)
I'm curious what solutions or ideas will come up!
Our product has been growing steadily over the last few years, and we are now at a turning point as far as data size goes for some of our tables: we expect said tables to double or triple in size in the next few months, and even more so in the next few years. We are talking in the range of 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) around 10M by the end of the year. (M being million, not mega/1000.)
The table we are talking about is sort of a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into said table. Then it is used in the application for a specific amount of time - a couple of weeks or months - after which it becomes rather redundant. That is, if all goes well. If there is some problem down the road, the data in those rows can be useful to inspect when troubleshooting.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows, moving them 'somewhere else'.
We currently use MySQL as the database, and the 'somewhere else' could be MySQL as well, but it can be anything. For other projects we have a master/slave setup where the whole database is involved, but that's not what we want or need here. It's just a few tables, where the master table would need to become shorter and the slave only bigger - not a one-to-one sync.
The main requirement for the secondary store is that the data should be easy to inspect/query when needed, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain-text format, since that is not as easy to inspect. The logs would then be somewhere on S3, so we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
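Roughly the kind of move we have in mind, sketched against a hypothetical archive schema - the table names and the 90-day cutoff are made up for illustration:

-- copy old rows into the archive, then remove them from the live table
-- (in practice this would run in small batches so it doesn't lock the table for long)
INSERT INTO archive.events
SELECT * FROM app.events
WHERE created_at < NOW() - INTERVAL 90 DAY;

DELETE FROM app.events
WHERE created_at < NOW() - INTERVAL 90 DAY;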
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it, and maintain it. Every new piece of technology adds to our stack; the lighter it stays, the more we like it ;).
Thanks!
PS: we are not just being lazy here; we have done some research, but we thought it'd be a good idea to get some more insight into the problem.
I have developed a script that works with a large MySQL database. It runs on IIS with ASP Classic and MySQL. The script selects 10,000 or 100,000 records and works through them one by one, updating the database as it goes. Everything works fine, but performance is very slow. The slowness is not caused by the SELECT or UPDATE statements or by a slow server, but by working through those records one by one, making some changes, then updating.
For example
SELECT * FROM mytable
WHERE title = '' OR title IS NULL   -- "isempty(title)" in the original; MySQL has no built-in isempty()
ORDER BY LENGTH(title) DESC
LIMIT 100000;
Then working through those 100,000 records one by one takes, say, 100,000 minutes. So I want to run the same script in 2 or 3 browsers, let's say IE, Chrome, and Firefox.
I was thinking of doing it like this, but I am not sure whether it is possible.
When browser 1 runs the script on IIS, it selects 100,000 records, starts working on them, and starts making changes. When browser 2 runs it, the same condition matches fewer records in the database, so it might select only 90,000 and start working on those. Since browser 1 started a little earlier, it will already have made some changes, so while both instances work, each one has to see the other's changes and take them into account. For example, if the title of the current record has already been filled in, skip that record and pick another one. Is that possible? I am not sure; I have never used cursor locations, cursor types, or anything like that.
Let's say there are 101,000 records in the database; script 1 starts first and selects 100,000 rows. After 100 minutes browser 2 starts, but by the time browser 2 selects its 100,000 rows, browser 1 has already finished 10,000, so browser 2 only gets 91,000 records. But since those two browsers work on the same set of records, how can they see each other's changes?
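One thing I wondered about is whether something like a 'claim' column would work, so each browser only touches its own batch. A rough sketch - the worker_id column doesn't exist yet, I would have to add it:

-- each instance first claims a batch of unclaimed rows...
UPDATE mytable
SET worker_id = 1                        -- browser/instance number: 1, 2, 3...
WHERE (title = '' OR title IS NULL)
  AND worker_id IS NULL
LIMIT 10000;

-- ...and then only works on the rows it claimed
SELECT * FROM mytable
WHERE worker_id = 1
  AND (title = '' OR title IS NULL);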
Is there any solution for my current situation? I am not a MySQL expert; that's why I don't know what to do.
I am sorry for my English, but I hope you understand my question.
UPDATE:
This is not because of any script problem, a slow server, or any other problem. It is slow because between "DO WHILE RS.EOF" and "LOOP" I do lots of things. Also, it doesn't really take one minute per record; that was just an example. But I was thinking of running 2 or 3 instances of the script simultaneously.
ASP Classic does not support the type of multi-threading you are looking for; however, you could write a COM component or something similar that does, and call it from your page.
Unless there is some sort of input required from the user, you could also write a server-side task in VBScript/PowerShell/Python/etc. to occasionally run through the data and perform whatever task it is you are trying to accomplish. It's hard to be specific when the question isn't very specific.
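If the per-record change can be expressed in SQL at all, one lightweight variant of that idea is a scheduled event inside MySQL itself. A rough sketch - the SET expression and the id column are placeholders, not taken from your question:

-- the event scheduler has to be enabled once
SET GLOBAL event_scheduler = ON;

CREATE EVENT fix_empty_titles
ON SCHEDULE EVERY 1 HOUR
DO
  UPDATE mytable
  SET title = CONCAT('untitled-', id)   -- stand-in for whatever the real per-row change is
  WHERE title = '' OR title IS NULL;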
Having said all that, it really does sound like there are more problems with your code than you realize. It's hard to point out due to the lack of a concrete example in front of us. If you haven't already, I'd double-check to make sure the bottlenecks are where you think they are.
I've used a crude ASP Profiler in the past to look for where the specific bottlenecks are in the ASP/VBScript sites I still maintain, and on a few occasions I've found that the problem was in the least likely spots.
The bottom line is that your question is missing a fair amount of information for providing useful answers, and seems to make some assumptions that might not necessarily be true. Show us some code, provide us with some data, and you'll probably get better answers.
I have results that are calculated from multiple rows in many different tables. These results are then displayed in a profile. To show the most current results on request, should I store these results in a separate table and update them on change, or should I calculate them on the fly?
As usual with performance questions, the answer is "it depends".
I'd probably start by calculating the results on the fly and go with precomputing them when it starts to be a problem.
If you have precomputed/summarized copies of your main data, you'll have to set up an updating process to make sure your summaries are correct. This can be quite tricky and can add a lot of complexity to your application so I wouldn't do it unless I had to. You'll also want to have a set of sanity check tools to make sure your summaries are, in fact, correct. And a set of "kill it all and rebuild the generated summaries" tools will also come in handy.
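For example, the "rebuild everything" tool can often be a single set-based pass. A sketch with made-up table names (profile_summary, scores):

-- throw away the generated summaries and regenerate them from the source data
TRUNCATE TABLE profile_summary;

INSERT INTO profile_summary (user_id, total_score, result_count)
SELECT user_id, SUM(score), COUNT(*)
FROM scores
GROUP BY user_id;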
If these calculations are a problem (and by that I mean that you have measured the performance, know that the queries in question are a bottleneck, and have the numbers to prove it), then it might be worth the extra coding and maintenance effort.
Situation:
I am currently designing a feed system for a social website whereby each user has a feed of their friends' activities. I have two possible methods how to generate the feeds and I would like to ask which is best in terms of ability to scale.
Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.
Standard method:
When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to timeout after 5 minutes. Scaling is achieved by varying this timeout.
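In rough SQL, that on-request query looks something like this (column names are illustrative; only event_log and friends are from my actual schema):

SELECT e.*
FROM event_log e
JOIN friends f ON f.friend_id = e.user_id
WHERE f.user_id = ?            -- the user requesting their feed
ORDER BY e.created_at DESC
LIMIT 50;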
Hypothesised method:
A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
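Roughly like this - the processed flag and the column names are just how I'm picturing it, not a finished design:

-- one event row fans out to one user_feed row per friend of the event's author
INSERT INTO user_feed (user_id, event_id)
SELECT f.friend_id, e.id
FROM event_log e
JOIN friends f ON f.user_id = e.user_id
WHERE e.processed = 0;
-- ...then mark those events as processed (or track a high-water-mark id instead)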
The problems with the standard method are well known - what if a lot of people's caches expire at the same time? The solution also does not scale well - the brief is for feeds to update as close to real time as possible.
The hypothesised solution in my eyes seems much better; all processing is done offline so no user waits for a page to generate and there are no joins so database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, then that results in inserting 2,000,000 rows into the database.
Question:
The question boils down to two points:
Is this worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance and are there any issues with this mass inserting of data for each event?
Is there anything else I have missed?
I think your hypothesised system generates too much data. Firstly, on a global scale, the storage and indexing requirements on user_feed seem to escalate exponentially as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider if in the course of a minute 1,000 users each posted a new message and each had 100 friends - then your background thread has 100,000 inserts to do and might quickly fall behind.
I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update containing a single row for each user and a timestamp for the last time that user's feed was changed.
Then, although the full join and query would still be required to refresh the feed, a quick query against last_user_feed_update will tell you whether a refresh is needed at all. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
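Something like this is what I'm picturing (names and types are illustrative only):

-- one row per user; the background thread touches it whenever it sees new friend activity
CREATE TABLE last_user_feed_update (
  user_id    INT UNSIGNED NOT NULL PRIMARY KEY,
  updated_at TIMESTAMP    NOT NULL
);

-- cheap check before deciding whether the cached feed needs a rebuild
SELECT updated_at FROM last_user_feed_update WHERE user_id = ?;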
The hypothesised method works better when you limit the maximum number of friends; a lot of sites set a safe upper bound, including Facebook, IIRC. It limits 'hiccups' when your user with 100K friends generates activity.
Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden that these inactive users will cost you.
I've thought about this problem many times - it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and to read a feed you fetch and aggregate all your friends' keys)... but I haven't tested this. I'm not sure of all the pros/cons yet.
I'm creating a forum app in php and have a question regarding database design:
I can get all the posts for a specific topic. All the posts have an auto_increment identity column as well as a timestamp.
Assuming I want to know who the topic starter was, which is the best solution?
Get all the posts for the topic and order by timestamp. But what happens if someone immediately replies to the topic? Then I have the first two posts with the same timestamp (unlikely but possible), and I can't know which one was first. This is also normalized, but becomes expensive as the table grows.
Get all the posts for the topic and order by post_id, which is an auto_increment column. Can I be guaranteed that the database will assign ids in insertion order? Will a post inserted later always have a higher id than earlier rows? And what if I delete a post - would the database reuse that post_id later? This is MySQL I'm using.
The easiest way, of course, is to simply add a field to the Topics table with the topic_starter_id and be done with it. But it is not normalized. I believe this is also the most efficient method once the topic and post tables grow to millions of rows.
What is your opinion?
Zed's comment is pretty much spot on.
You generally want to achieve normalization, but denormalization can save potentially expensive queries.
In my experience writing forum software (five years commercially, five years as a hobby), this particular case calls for denormalization to save the single query. It's perfectly sane and acceptable to store both the first user's display name and id, as well as the last user's display name and id, just so long as the code that adds posts to topics always updates the record. You want one and only one code path here.
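As a sketch of what that denormalisation might look like (the column names are examples only; the point is the single write path):

ALTER TABLE topics
  ADD COLUMN starter_user_id   INT UNSIGNED NOT NULL DEFAULT 0,
  ADD COLUMN starter_name      VARCHAR(64)  NOT NULL DEFAULT '',
  ADD COLUMN last_post_user_id INT UNSIGNED NULL,
  ADD COLUMN last_post_name    VARCHAR(64)  NULL;

-- the one and only code path that adds a post also does this:
UPDATE topics
SET last_post_user_id = ?, last_post_name = ?
WHERE topic_id = ?;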
I must somewhat disagree with Charles that the only way to save on performance is to denormalize to avoid an extra query.
To be more specific, there's an optimization that works without denormalization (and the attendant headaches of data maintenance/integrity), but ONLY if the user base is sufficiently small (let's say <1,000 users, for the sake of argument - it depends on your scale; our apps use this approach with 10k+ mappings).
Namely, you have your application layer (code running on the web server) retrieve the list of users into a proper cache (e.g., one with data-expiration facilities). Then, when you need to print the first/last user's name, look it up in that server-side cache.
This avoids an extra query for every page view, as you only need to retrieve the full user list ONCE per N page views - when the cache expires, or when user data is updated, which should also cause cache expiration.
It adds a wee bit of CPU time and memory usage on the web server, but in Yet Another Holy War (i.e. spend more resources on the DB side or the app server side) I'm firmly in the "don't waste DB resources" camp, seeing how scaling up a DB is vastly harder than scaling up a web or app server.
And yes, if that (or an equally tricky) optimization is not feasible, I agree with Charles and Zed that you have a trade-off between normalization (fewer headaches related to data integrity) and performance gain (one less table to join in some queries). Since I'm an agnostic in that particular Holy War, I just go with whatever gives better marginal benefits (e.g. how much performance loss vs. how much cost/risk from de-normalization).