Update statistic counter or just count(*) - Performance - MySQL

What is the faster/better way to keep track of statistical data in a message board?
-> number of posts/topics
Update a column like 'number_of_posts' for each incoming post or after a post gets deleted.
Or just count(*) on the posts matching a topicId?

Just use count(*) - it's built into the database. It's well tested, and already written.
Having a special column to do this for you means you need to write the code to manage it, keep it in sync with the actual value (on adds and deletes). Why make more work for yourself?
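As a rough illustration of the count(*) approach, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The posts table, its columns, and the post_count helper are assumptions made for the example, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id       INTEGER PRIMARY KEY,
        topic_id INTEGER NOT NULL,
        body     TEXT NOT NULL
    );
    -- An index on topic_id keeps the per-topic count from scanning the table.
    CREATE INDEX idx_posts_topic ON posts (topic_id);
""")
conn.executemany(
    "INSERT INTO posts (topic_id, body) VALUES (?, ?)",
    [(1, "first"), (1, "second"), (2, "other topic"), (1, "third")],
)

def post_count(conn, topic_id):
    """Return the number of posts in a topic, computed on demand."""
    row = conn.execute(
        "SELECT COUNT(*) FROM posts WHERE topic_id = ?", (topic_id,)
    ).fetchone()
    return row[0]

# Deleting a post needs no extra bookkeeping; the next count is simply lower.
conn.execute("DELETE FROM posts WHERE body = 'second'")
count_topic_1 = post_count(conn, 1)
```

There is no counter column to keep in sync here: inserts and deletes touch only the posts table, and the count is derived whenever it is needed.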

Related

I came up with this SQL structure to allow rolling back and auditing user information, will this be adequate?

So, I came up with an idea to store my user information and the updates users make to their own profiles in a way that always makes it possible to roll back (as an option to give to the user, for auditing and support purposes, etc.) while at the same time improving (?) security and preventing malicious activity.
My idea is to store the user's info in rows but never allow the API backend to delete or update those rows, only to insert new ones that should be marked as the "current" data row. I created a graphical explanation:
Schema image
The potential issue I see with this model is that users may update their information too frequently, bloating the database (1 million users with an average of 5 updates per user means 5 million entries). To address this, I came up with the idea of separating the rows with "false" in the "current" column through partitioning, where they should not harm performance and would wait to be cleaned up periodically.
Am I right to choose this model? Is there any other way to do such a thing?
I'd also use a second table user_settings_history.
When a setting is created, INSERT it in the user_settings_history table, along with a timestamp of when it was created. Then also UPDATE the same settings in the user_settings table. There will be one row per user in user_settings, and it will always be the current settings.
So the user_settings would always have the current settings, and the history table would have all prior sets of settings, associated with the date they were created.
This simplifies your queries against the user_settings table. You don't have to modify your queries to filter for the current flag column you described. You just know that the way your app works, the values in user_settings are defined as current.
If you're concerned about the user_settings_history table getting too large, the timestamp column makes it fairly easy to periodically DELETE rows over 180 days old, or whatever number of days seems appropriate to you.
By the way, 5 million rows isn't so large for a MySQL database. You'd want your queries to use an index where appropriate, but the size alone isn't a disadvantage.
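The two-table idea could be sketched roughly as follows, using Python's sqlite3 module as a stand-in for MySQL. The email column and the save_settings and purge_history helpers are assumptions for illustration; only the table names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_settings (
        user_id INTEGER PRIMARY KEY,
        email   TEXT NOT NULL
    );
    CREATE TABLE user_settings_history (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        email      TEXT NOT NULL,
        created_at TEXT NOT NULL   -- ISO-8601 timestamp
    );
""")

def save_settings(conn, user_id, email, now):
    """Record the new settings in the history and make them current."""
    with conn:  # one transaction: both writes succeed or neither does
        conn.execute(
            "INSERT INTO user_settings_history (user_id, email, created_at) "
            "VALUES (?, ?, ?)",
            (user_id, email, now),
        )
        # INSERT OR REPLACE keeps exactly one current row per user.
        conn.execute(
            "INSERT OR REPLACE INTO user_settings (user_id, email) VALUES (?, ?)",
            (user_id, email),
        )

def purge_history(conn, cutoff):
    """Delete history rows created before the cutoff timestamp."""
    with conn:
        conn.execute(
            "DELETE FROM user_settings_history WHERE created_at < ?", (cutoff,)
        )

save_settings(conn, 1, "old@example.com", "2023-01-01T00:00:00")
save_settings(conn, 1, "new@example.com", "2024-01-01T00:00:00")
purge_history(conn, "2023-07-01T00:00:00")

current = conn.execute(
    "SELECT email FROM user_settings WHERE user_id = 1"
).fetchone()[0]
history = conn.execute(
    "SELECT email FROM user_settings_history ORDER BY created_at"
).fetchall()
```

Queries for the current settings never need a "current" flag, and the cleanup is a single DELETE on the timestamp column.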

Leasing jobs (atomic update and get) from a MySQL database

I have a MySQL table that manages jobs that worker-clients can lease for processing. Apart from the columns that describe the job, the table has a unique primary key column id, a time-stamp-column lease, a boolean-column complete, and an int-column priority.
I'm trying to write a (set of) SQL statement(s) that will manage the leasing-process. My current plan is to find the first incomplete job that has a lease-date that is at least 8 hours in the past (no job should take more than one hour, so an incomplete lease that is that old probably means that the client died and the job needs to be restarted), set its lease-date to the current time-stamp, and return its info. All of this, of course, needs to happen atomically.
I found a neat trick here on SO and a variation of it in the discussion of the MySQL documentation (see post on 7-29-04 here) that uses user-defined variables to return the leased job from an UPDATE statement.
And, indeed, this works fine:
UPDATE jobs SET lease=NOW() WHERE TIMESTAMPDIFF(HOUR,lease,NOW())>=8 AND NOT complete AND @id:=id LIMIT 1;
SELECT * FROM jobs WHERE id=@id;
The problem comes in when I try to add priorities to the jobs and add ORDER BY priority into the UPDATE statement right before LIMIT. The UPDATE still works as expected, but the SELECT always returns the same row (either the first or the last, but not the one that was actually updated). I'm a little confused by this, since LIMIT 1 should make sure that the first update that actually happens terminates the UPDATE process, leaving @id set to the id of the updated row, no? For some reason it seems to keep evaluating the condition @id:=id for all rows anyway, even after it's done with its update (or maybe it evaluates it for all rows before even figuring out which one to update, I don't know...).
To fix this, I tried rewriting the statement to make sure the variable really only gets set for the matching row:
UPDATE jobs SET lease=NOW(),@id:=id WHERE TIMESTAMPDIFF(HOUR,lease,NOW())>=8 AND NOT complete ORDER BY priority LIMIT 1;
But for some reason, this gives me the following error:
Error Code : 1064
You have an error in your SQL syntax; check the manual that corresponds
to your MySQL server version for the right syntax to use near
'@id:=id WHERE TIMESTAMPDIFF(HOUR,lease,NOW())>=8 AND NOT complete ORDER BY prior'
at line 1
So, it seems that I can't assign the variable in the SET-part of the UPDATE (although this was the way it was suggested in the SO-answer linked above).
Can this approach be salvaged somehow or is there a better one altogether?
PS: I'm using MySQL server v5.5.44-0+deb8u1
My solution, with a little trick:
First, you must use a nested subselect so that UPDATE does not notice that it is reading from the same table it is updating.
Second, you must initialize @id with "(SELECT @id:=0)"; otherwise, if no row is found, the variable keeps the last value it was set to. Here you can also choose whether 0 or '' is returned when no result is found.
UPDATE jobs SET lease=NOW() WHERE id =
( SELECT * FROM
( SELECT @id:=id FROM jobs,(SELECT @id:=0) AS tmp_id
WHERE TIMESTAMPDIFF(HOUR,lease,NOW())>=8
AND NOT complete ORDER BY priority LIMIT 1
) AS tmp
);
It is OK that you found a solution.
If this must be quite stable, I would go for a different solution. I would not rely on atomicity, but on a "commit"-like workflow. Identify each worker-client with a unique key, either in its own table or with a secure hash key. Add two fields to your jobs table: worker and state. So if you look for a job for worker W345, you assign that worker to the job.
First part would be
update jobs set worker='W345', state='planning', lease=now()
where TIMESTAMPDIFF(HOUR,lease,NOW())>=8
AND NOT complete
ORDER BY priority LIMIT 1;
Next part (could be even from different part of application)
select * from jobs where worker='W345' and state='planning';
get id and data, update:
update jobs set state='sending', lease=now() where id=...;
Maybe you can even commit the sending of the job; otherwise you assume that it started after sending.
update jobs set state='working', lease=now() where id = ...;
You can find all jobs that died before being sent to a worker by their state and a lease that is a few minutes old. You can find out where the process got into trouble, which workers run into trouble most often, and so on.
Maybe the real details differ, but as long as you have some status column you should be quite flexible and find your solution.
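The worker/state workflow above can be sketched roughly like this, using Python's sqlite3 module as a stand-in for MySQL. Two things differ from the MySQL statements and are assumptions of the sketch: the lease is stored as a Unix timestamp, and the ORDER BY priority LIMIT 1 selection sits in a subquery because stock SQLite does not accept it directly on UPDATE. The lease_job helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (
        id       INTEGER PRIMARY KEY,
        priority INTEGER NOT NULL,
        complete INTEGER NOT NULL DEFAULT 0,
        lease    INTEGER NOT NULL DEFAULT 0,   -- Unix time of the last lease
        worker   TEXT,
        state    TEXT
    );
""")
conn.executemany(
    "INSERT INTO jobs (id, priority, complete, lease) VALUES (?, ?, ?, ?)",
    [(1, 5, 0, 0), (2, 1, 0, 0), (3, 0, 1, 0)],
)

LEASE_SECONDS = 8 * 3600

def lease_job(conn, worker, now):
    """Claim the best available job for `worker`; return its id or None."""
    with conn:
        # Step 1: tag one stale, incomplete job with this worker's key.
        conn.execute(
            """
            UPDATE jobs SET worker = ?, state = 'planning', lease = ?
            WHERE id = (
                SELECT id FROM jobs
                WHERE NOT complete AND ? - lease >= ?
                ORDER BY priority LIMIT 1
            )
            """,
            (worker, now, now, LEASE_SECONDS),
        )
        # Step 2: read back whatever this worker was assigned.
        row = conn.execute(
            "SELECT id FROM jobs WHERE worker = ? AND state = 'planning'",
            (worker,),
        ).fetchone()
        if row is None:
            return None
        # Step 3: record that the job is being handed to the worker.
        conn.execute(
            "UPDATE jobs SET state = 'sending', lease = ? WHERE id = ?",
            (now, row[0]),
        )
        return row[0]

now = 1_000_000
first = lease_job(conn, "W345", now)
second = lease_job(conn, "W346", now)
third = lease_job(conn, "W347", now)
```

Because the job is looked up by worker key rather than through a session variable, the read in step 2 does not depend on which connection ran step 1.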
I was able to fix things with the following hack:
UPDATE jobs SET lease=IF(@id:=id,NOW(),0) WHERE TIMESTAMPDIFF(HOUR,lease,NOW())>=8 AND NOT complete ORDER BY priority LIMIT 1;
It seems that it is simply not allowed to assign a user variable directly within the SET section of UPDATE.
Note: Since the id column is an auto-increment primary key, it is never 0 or NULL. Thus, the assignment @id:=id inside the IF-statement should always evaluate to TRUE and therefore lease should be set correctly (correct me if I'm wrong on this, please!).
One thing to keep in mind: The variable @id is by default scoped to the MySQL connection (not to any Java Statement-object, for example, or similar), so if one connection is to be used for multiple job-leases, one needs to ensure that the different UPDATE/SELECT-pairs never get interleaved. Or one could add an increasing number to the variable-name (@id1, @id2, @id3, ...) to guarantee correct results, but I don't know what performance (or memory-use) impact this will have on the MySQL-server. Or, the whole thing could be packaged up into a stored procedure and the variable declared as local.

Is it better to store list of each user's Blocked users for query exclusion in $_SESSION var, or to exclude in "real-time" with sub-query?

On one of my PHP/MySQL sites, every user can block every other user on the site. These blocks are stored in a Blocked table with each row representing who did the blocking and who is the target of the block. The columns are indexed for faster retrieval of a user's entire "block list".
For each user, we must exclude from any search results any user that appears in their block list.
In order to do that, is it better to:
1) Generate the "block list" whenever the user logs in by querying the Blocked table once at login and saving it to the $_SESSION (and re-querying any time they make a change to their "block list" and re-saving it to the $_SESSION), and then querying as such:
NOT IN ($commaSeparatedListFromSession)
or
2) Exclude the blocked users in "real-time" directly in the query by using a sub-query for each user's search query as such:
NOT IN (SELECT userid FROM Blocked WHERE Blocked.from = $currentUserID) ?
If the website is PHP and the block list has fewer than, say, 100 entries per user, I would store it in a table and load it into $_SESSION at login and whenever it changes. You could just as easily load it from SQL on each page load into a local variable, however.
What I would store in $_SESSION is a flag 'has_blocklist_contents' that would decide whether or not you should load or check the blocklist on page load.
Instead of adding a NOT IN with the list to all of your queries, I think it might be smarter to filter the blocked users out using PHP.
I have two reasons for wanting to implement this way:
Your database can re-use the SQL for all users on the system resulting in a performance boost for retrieving comments and such.
Your block list will most of the time be empty, so you're not adding any processing time for the majority of users.
I think there is a third solution, and in my opinion it is the better way to go.
If you can write this
NOT IN (SELECT userid FROM Blocked WHERE Blocked.from = $currentUserID)
Then you can surely write this.
....
SomeTable st
LEFT JOIN
Blocked b
ON( st.userid = b.userid AND b.from = $currentUserID)
WHERE b.primaryKey IS NULL;
I hope you understand what I mean by the above query.
This way you get the best of both worlds i.e. You don't have to run 2 queries, and you don't have to save data in $_SESSION
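The LEFT JOIN ... IS NULL exclusion above can be tried out with a small sketch like the following, using Python's sqlite3 module as a stand-in for MySQL. The users table, the blocker_id column name (used instead of from, which is a reserved word), and the visible_users helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE blocked (
        blocker_id INTEGER NOT NULL,   -- who did the blocking
        userid     INTEGER NOT NULL,   -- who is blocked
        PRIMARY KEY (blocker_id, userid)
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob'), (3, 'carol'), (4, 'dave');
    INSERT INTO blocked VALUES (1, 3), (2, 4);
""")

def visible_users(conn, current_user_id):
    """List users that current_user_id has not blocked."""
    rows = conn.execute(
        """
        SELECT u.name
        FROM users u
        LEFT JOIN blocked b
               ON b.userid = u.userid AND b.blocker_id = ?
        WHERE b.userid IS NULL      -- keep only rows with no matching block
        ORDER BY u.userid
        """,
        (current_user_id,),
    ).fetchall()
    return [name for (name,) in rows]

alice_sees = visible_users(conn, 1)
```

The current user's id is passed as a bound parameter rather than interpolated into the SQL string, which also avoids injection problems.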
Don't use the $_SESSION as a substitute for a proper caching system. The more junk you pile into $_SESSION, the more you'll have to load for each and every request.
Using a sub-select for exclusions can be brutally slow if you're not careful to keep your database tuned. Make sure your indexes are covering all your WHERE conditions.

MySQL Update entire table with unknown # of rows and clear the rest

I'm pretty sure this particular quirk isn't a duplicate so here goes.
I have a table of services. In this table, I have about 40 rows of the following columns:
Services:
id_Services -- primary key
Name -- name of the service
Cost_a -- for variant a of service
Cost_b -- for variant b of service
Order -- order service is displayed in
The user can go into an admin tool and update any of this information - including deleting multiple rows, adding a row, editing info, and changing the order they are displayed in.
My question is this, since I will never know how many rows will be incoming from a submission (there could be 1 more or 100% less), I was wondering how to address this in my query.
Upon submission, every value is resubmitted. I'd hate to do it this way but the easiest way I can think of is to truncate the table and reinsert everything... but that seems a little... uhhh... bad! What is the best way to accomplish this?
RE-EDIT: For example: I start with 40 rows, update with 36. I still have to do something to the values in rows 37-40. How can I do this? Are there any mysql tricks or functions that will do this for me?
Thank you very much for your help!
You're slightly limited by the use case; you're doing insertion/update/truncation that's presented to the user as a batch operation, but in the back-end you'll have to do these in separate statements.
Watch out for concurrency: use transactions if you can.
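One possible shape for this, sketched with Python's sqlite3 module as a stand-in for MySQL: write every submitted row, then delete the rows whose keys were not submitted, all inside one transaction. It assumes each submitted row carries its primary key; the replace_services helper and the lower-cased column names (sort_order instead of Order) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE services (
        id_services INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        cost_a      REAL NOT NULL,
        cost_b      REAL NOT NULL,
        sort_order  INTEGER NOT NULL
    );
    INSERT INTO services VALUES
        (1, 'Wash', 10, 15, 1), (2, 'Wax', 20, 25, 2), (3, 'Polish', 30, 35, 3);
""")

def replace_services(conn, submitted):
    """Make the table match `submitted`, a list of 5-tuples, atomically."""
    ids = [row[0] for row in submitted]
    with conn:  # commit on success, roll back if any statement fails
        # Insert new rows and overwrite existing ones with the same key.
        conn.executemany(
            "INSERT OR REPLACE INTO services VALUES (?, ?, ?, ?, ?)", submitted
        )
        if ids:
            # Remove every row that was not part of the submission.
            marks = ",".join("?" * len(ids))
            conn.execute(
                f"DELETE FROM services WHERE id_services NOT IN ({marks})", ids
            )
        else:
            conn.execute("DELETE FROM services")

replace_services(conn, [(1, "Wash", 12, 15, 2), (4, "Vacuum", 5, 8, 1)])
remaining = conn.execute(
    "SELECT id_services, name, cost_a FROM services ORDER BY sort_order"
).fetchall()
```

Unlike truncating and reinserting, a failure halfway through leaves the previous contents of the table in place.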

Versioned and indexed data store

I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database this would be free and I would be able to ask 'what is the state of all people as of yesterday at 2pm living in Dublin and aged 30'. Unfortunately there don't seem to be any mature open source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron here's the query we currently use (in mysql). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions)
select name from person p where p.id = (select max(id) from person where person_id = p.person_id and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal db but given that there aren't any open source ones out there the next best method is to store a new row per update. The only problem is duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
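The first approach (object key plus version number, with a created timestamp) might look roughly like this, sketched with Python's sqlite3 module as a stand-in for MySQL. The city column and the save_version and state_as_of helpers are assumptions for illustration. The as-of query picks each person's highest version at or before the timestamp through a grouped subquery; whether that beats the correlated subquery from the question would need to be measured on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER NOT NULL,
        version   INTEGER NOT NULL,
        name      TEXT NOT NULL,
        city      TEXT NOT NULL,
        created   TEXT NOT NULL,          -- when this version was written
        PRIMARY KEY (person_id, version)
    );
""")

def save_version(conn, person_id, name, city, created):
    """Insert the next version of a person; existing rows are never changed."""
    with conn:
        conn.execute(
            """
            INSERT INTO person (person_id, version, name, city, created)
            SELECT ?, COALESCE(MAX(version), 0) + 1, ?, ?, ?
            FROM person WHERE person_id = ?
            """,
            (person_id, name, city, created, person_id),
        )

def state_as_of(conn, timestamp):
    """Return (person_id, name, city) for each person as of `timestamp`."""
    return conn.execute(
        """
        SELECT p.person_id, p.name, p.city
        FROM person p
        JOIN (
            SELECT person_id, MAX(version) AS version
            FROM person WHERE created <= ?
            GROUP BY person_id
        ) latest ON latest.person_id = p.person_id
                AND latest.version = p.version
        ORDER BY p.person_id
        """,
        (timestamp,),
    ).fetchall()

save_version(conn, 1, "Ann", "Cork", "2024-01-01")
save_version(conn, 2, "Ben", "Dublin", "2024-02-01")
save_version(conn, 1, "Ann", "Dublin", "2024-03-01")
mid_february = state_as_of(conn, "2024-02-15")
```

The composite primary key doubles as the index the as-of lookup needs, so no separate "current" flag has to be maintained.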
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!