I was wondering what would be the best solution to dynamically archive rows. For instance when a user marks a task as completed, that task needs to be archived yet still accessible.
What would be the best practices for achieving this? Should I just leave it all in the same table and filter out completed tasks from the queries? I'm afraid that over time the table will become huge (1,000,000 rows in a year or less). Or should I create another table, i.e. task_archive, and query that table whenever data is needed from it?
I know similar questions have been asked before, but most of them were about archiving thousands of rows simultaneously. I just need to know what would be the best method (and why) to archive 1 row at a time once it's been marked completed.
For speed and ease of use, I would generally leave the row in the same table (and flag it as completed) and then later move it to an archive table. This way the user doesn't incur the delay of making that move on the spot; the move can happen as a batch process during non-busy periods.
When that move should happen depends on your application. For example, if they have a dashboard widget that shows "Recently Completed Tasks" that shows all of the tasks completed in the past week (and lets them drill in to see details), it might make sense to move the rows to the archive a week after they've been completed. Or if they frequently need to look at tasks from the current semester (for an academic app) but rarely for previous semesters, make the batch move happen at the end of the semester.
If the table is indexed, 1,000,000 rows shouldn't be that big a deal, honestly.
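In MySQL, a minimal sketch of this flag-then-batch-move approach might look like the following (the completed and completed_at columns are assumptions here, and task_archive is assumed to mirror the structure of the task table):

-- Mark completion immediately; this is the only step the user waits for.
UPDATE task
SET completed = 1, completed_at = NOW()
WHERE id = 123;

-- Nightly batch job: move rows completed more than a week ago to the archive.
INSERT INTO task_archive
SELECT * FROM task
WHERE completed = 1 AND completed_at < NOW() - INTERVAL 7 DAY;

DELETE FROM task
WHERE completed = 1 AND completed_at < NOW() - INTERVAL 7 DAY;

The batch part can run from cron during a quiet period.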
You could use a trigger to capture that the task was marked completed, remove it from the current table, and insert it into the archive table.
Or, you could create a stored procedure that performs the archive. For example:
DELIMITER //
CREATE PROCEDURE sp_markcompleted(IN taskid INT)
BEGIN
    START TRANSACTION;
    INSERT INTO newtable SELECT * FROM oldtable WHERE id = taskid;
    DELETE FROM oldtable WHERE id = taskid;
    COMMIT;
END//
DELIMITER ;
So, I came up with an idea to store my user information, and the updates users make to their own profiles, in a way that makes it always possible to roll back (as an option to give to the user, for auditing and support purposes, etc.), while at the same time improving (?) security and preventing malicious activity.
My idea is to store the user's info in rows but never allow the API backend to delete or update those rows, only to insert new ones that should be marked as the "current" data row. I created a graphical explanation:
Schema image
The potential issue I see with this model is that users may update their information too frequently, bloating the database (1 million users and an average of 5 updates per user is 5 million entries). However, for this I came up with the idea of separating out the rows with "false" in the "current" column through partitioning, where they should not hurt performance and will await being cleaned up every so often.
Am I right to choose this model? Is there any other way to do such a thing?
I'd use a second table, user_settings_history, alongside the main user_settings table.
When a setting is created, INSERT it in the user_settings_history table, along with a timestamp of when it was created. Then also UPDATE the same settings in the user_settings table. There will be one row per user in user_settings, and it will always be the current settings.
So the user_settings would always have the current settings, and the history table would have all prior sets of settings, associated with the date they were created.
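A minimal sketch of that dual write in MySQL, assuming user_id is the primary key of user_settings and using made-up setting columns (theme, language):

-- Record the new values in the history table with a timestamp...
INSERT INTO user_settings_history (user_id, theme, language, created_at)
VALUES (42, 'dark', 'en', NOW());

-- ...then overwrite the single current row for that user.
INSERT INTO user_settings (user_id, theme, language)
VALUES (42, 'dark', 'en')
ON DUPLICATE KEY UPDATE theme = VALUES(theme), language = VALUES(language);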
This simplifies your queries against the user_settings table. You don't have to modify your queries to filter for the current flag column you described. You just know that the way your app works, the values in user_settings are defined as current.
If you're concerned about the user_settings_history table getting too large, the timestamp column makes it fairly easy to periodically DELETE rows over 180 days old, or whatever number of days seems appropriate to you.
By the way, 5 million rows isn't so large for a MySQL database. You'd want your queries to use an index where appropriate, but the size alone isn't a disadvantage.
I have an app where you can select a person from a list to view more details. Now each time a person is selected to view more details, I want to record that in a recent history table. I'm not sure the best way to approach this to have say the latest 10 person selections.
I know I need to create the history table, but I'm wondering if I should just do an insert for each person click and select only the 10 results with the most recent dates, or if I should worry about updating only 10 records to keep the row count low?
Any input would be appreciated.
I would update, otherwise you keep adding more and more data which you are not going to use. Maybe you won't run into problems with this specific case (because people won't select tens of thousands of persons a day), but in general you should be careful with just adding data without cleaning it up.
That said, the first step is the same anyway:
Adding or updating the person
So, if someone selects 'Bob', I would try to update 'Bob' in the history and set his lastselecteddate to now. If that fails, then insert 'Bob', again with the current timestamp.
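In MySQL both steps can be done in one statement with INSERT ... ON DUPLICATE KEY UPDATE. A sketch, assuming a hypothetical recent_history table with a unique key on (user_id, person_id):

-- Insert 'Bob' (person 42) for user 1, or just bump his timestamp if he is already there.
INSERT INTO recent_history (user_id, person_id, last_selected_date)
VALUES (1, 42, NOW())
ON DUPLICATE KEY UPDATE last_selected_date = NOW();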
Cleaning up the older history
After that you can decide whether or not to clean up old history. You could delete all rows but the newest ten, or you can keep a hundred, or not clean them at all. Also, instead of cleaning them up right away, you could make a job that does this once every day.
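One way to express "delete all rows but the newest ten" in MySQL, reusing the hypothetical recent_history table from above (the extra derived table works around MySQL's restriction on selecting from the table you are deleting from):

DELETE FROM recent_history
WHERE user_id = 1
  AND last_selected_date < (
      SELECT cutoff FROM (
          -- the 10th most recent selection for this user
          SELECT last_selected_date AS cutoff
          FROM recent_history
          WHERE user_id = 1
          ORDER BY last_selected_date DESC
          LIMIT 1 OFFSET 9
      ) AS t
  );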
If you clean up right away, you can decide to skip that when you only updated a person. After all, if you didn't insert a row, you shouldn't have to clean one up.
I have a table, to which rows are only appended (not updated or deleted) with transactions (I'll explain why this is important), and I need to fetch the new, previously unfetched, rows of this table, every minute with a cron.
How am I going to do this? In any programming language (I use Perl but that's irrelevant.)
I list the ways I thought of to solve this problem, and ask you to show me the correct one (there HAS to be one...).
The first way that popped into my head was to save (in a file) the largest auto_incrementing id of the rows fetched, so in the next minute I can fetch with: WHERE id > $last_id. But that can miss rows. Because new rows are inserted in transactions, it's possible that the transaction that saves the row with id = 5 commits before the transaction that saves the row with id = 4. It's therefore possible that the cron script retrieves row 5 but not row 4, and when row 4 gets committed a split second later, it will never get fetched (because 4 is not > 5, which is the $last_id).
Then I thought I could make the cron job fetch all rows that have a date field within the last TWO minutes, check which of these rows had already been retrieved in the previous run of the cron job (to do this I would need to save somewhere which row ids were retrieved), compare, and process only the new ones. Unfortunately this is complicated, and it also doesn't solve the problem that will occur if a certain inserting transaction takes TWO AND A HALF minutes to commit for some weird database reason, which would cause the date to be too old for the next iteration of the cron job to fetch.
Then I thought of installing a message queue (MQ) like RabbitMQ or any other. The same process that does the inserting transaction, would notify RabbitMQ of the new row, and RabbitMQ would then notify an always-running process that processes new rows. So instead of getting a batch of rows inserted in the last minute, that process would get the new rows one-by-one as they are written. This sounds good, but has too many points of failure - RabbitMQ might be down for a second (in a restart for example) and in that case the insert transaction will have committed without the receiving process having ever received the new row. So the new row will be missed. Not good.
I just thought of one more solution: the receiving processes (there are 30 of them, doing the exact same job on exactly the same data, so the same rows get processed 30 times, once by each receiving process) could write to another table that they have processed row X when they process it, then when the time comes they can ask for all rows in the main table that don't exist in the "have_processed" table with an OUTER JOIN query. But I believe (correct me if I'm wrong) that such a query will consume a lot of CPU and disk on the DB server, since it will have to compare the entire lists of ids of the two tables to find new entries (and the table is huge and getting bigger each minute). It would have been fast if there were only one receiving process - then I would have been able to add an indexed field named "have_read" to the main table that would make looking for new rows extremely fast and easy on the DB server.
What is the right way to do it? What do you suggest? The question is simple, but a solution seems hard (for me) to find.
Thank you.
I believe the 'best' way to do this would be to use one process that checks for new rows and delegates them to the thirty consumer processes. Then your problem becomes simpler to manage from a database perspective and a delegating process is not that difficult to write.
If you are stuck with communicating to the thirty consumer processes through the database, the best option I could come up with is to create a trigger on the table, which copies each row to a secondary table. Copy each row to the secondary table thirty times (once for each consumer process). Add a column to this secondary table indicating the 'target' consumer process (for example a number from 1 to 30). Each consumer process checks for new rows with its unique number and then deletes those. If you are worried that some rows are deleted before they are processed (because the consumer crashes in the middle of processing), you can fetch, process and delete them one by one.
Since the secondary table is kept small by continuously deleting processed rows, INSERTs, SELECTs and DELETEs would be very fast. All operations on this secondary table would also be served by the primary key index (if you place the consumer ID as the first field of the primary key).
In MySQL statements, this would look like this:
CREATE TABLE `consumer`(
`id` INTEGER NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `consumer`(`id`) VALUES
(1),
(2),
(3)
-- all the way to 30
;
CREATE TABLE `secondaryTable` LIKE `primaryTable`;
ALTER TABLE `secondaryTable` ADD COLUMN `targetConsumerId` INTEGER NOT NULL FIRST;
-- alter the secondary table further to allow several rows with the same primary key (by adding targetConsumerId to the primary key)
DELIMITER //
CREATE TRIGGER `mark_to_process` AFTER INSERT ON `primaryTable`
FOR EACH ROW
BEGIN
-- By doing a cross join with the consumer table, this automatically inserts the correct number of rows,
-- and adding or deleting consumers is just a matter of adding or deleting rows in the consumer table.
-- (secondaryTable was created LIKE primaryTable, so it has the same column names, here assumed to be id, field1 and field2.)
INSERT INTO `secondaryTable`(`targetConsumerId`, `id`, `field1`, `field2`)
SELECT `consumer`.`id`, `primaryTable`.`id`, `primaryTable`.`field1`, `primaryTable`.`field2`
FROM `consumer`, `primaryTable`
WHERE `primaryTable`.`id` = NEW.`id`;
END//
DELIMITER ;
-- loop over the following statements in each consumer until the SELECT doesn't return any more rows
START TRANSACTION;
SELECT * FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID LIMIT 1;
-- here, do the processing (so before the COMMIT so that crashes won't let you miss rows)
DELETE FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID AND id = ID_OF_ROW_JUST_SELECTED;
COMMIT;
I've been thinking about this for a while. So, let me see if I got it right. You have a HUGE table to which N processes (a number that may vary over time) write; let's call them producers. Then there are M other processes (a number that may also vary over time) that each need to process every one of those added records at least once; let's call them consumers.
The main issues detected are:
Making sure the solution will work with dynamic N and M
Keeping track of the unprocessed records for each consumer
Making sure the solution scales as well as possible given the huge number of records
In order to tackle those issues I thought of this. Create this table (the primary key is the combination of both columns):
PENDING_RECORDS(ConsumerID, HugeTableID)
Modify the producers so that each time they add a record to the HUGE_TABLE they also add M records to the PENDING_RECORDS table: the new HugeTableID paired with each of the ConsumerIDs that exist at that time. Each time a consumer runs, it will query the PENDING_RECORDS table and will find a small number of matches for itself. It will then join against the HUGE_TABLE (note it will be an inner join, not a left join) and fetch the actual data it needs to process. Once the data is processed, the consumer will delete the records it fetched from the PENDING_RECORDS table, keeping it decently small.
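A sketch of how that could look in MySQL, assuming a hypothetical CONSUMERS table with a ConsumerID column listing the currently active consumers, and a payload column standing in for the real columns of HUGE_TABLE:

-- Producer side, in the same transaction as the insert into HUGE_TABLE:
INSERT INTO HUGE_TABLE (payload) VALUES ('new row');
INSERT INTO PENDING_RECORDS (ConsumerID, HugeTableID)
SELECT ConsumerID, LAST_INSERT_ID() FROM CONSUMERS;

-- Consumer side (consumer 7, for example): fetch its pending work...
SELECT h.*
FROM PENDING_RECORDS p
INNER JOIN HUGE_TABLE h ON h.id = p.HugeTableID
WHERE p.ConsumerID = 7;

-- ...and delete each pending record once it has been processed:
DELETE FROM PENDING_RECORDS WHERE ConsumerID = 7 AND HugeTableID = 12345;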
Interesting, I must say :)
1) First of all - is it possible to add a field to the table that only has rows added (let's call it 'transactional_table')? I mean, is it a design paradigm and you have a reason not to do any sort of updates on this table, or is it "structurally" blocked (i.e. the user connecting to the db has no privileges to perform updates on this table)?
Because then the simplest way to do it is to add a "have_read" column to this table with a default of 0, and update that column to 1 on the fetched rows (even if 30 processes do this simultaneously, you should be fine as it would be very fast and it won't corrupt your data). Even if 30 processes mark the same 1000 rows as fetched - nothing is corrupted. Although if you are not on InnoDB, this might not be the best way as far as performance is concerned (MyISAM locks whole tables on updates, InnoDB only the rows that are updated).
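A sketch of that idea, where payload stands in for your real columns and the ids in the UPDATE are the ones the SELECT just returned:

ALTER TABLE transactional_table
  ADD COLUMN have_read TINYINT NOT NULL DEFAULT 0,
  ADD INDEX idx_have_read (have_read);

-- Each fetch cycle: read the unread rows, then mark the rows you actually fetched.
SELECT id, payload FROM transactional_table WHERE have_read = 0;
UPDATE transactional_table SET have_read = 1 WHERE id IN (1001, 1002, 1003);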
2) If this is not something you can use - I would surely check out the solution you gave as your last one, with a little modification. Create a table (let's say: fetched_ids), and save fetched rows' ids in that table. Then you could use something like:
SELECT tt.* FROM transactional_table tt
LEFT JOIN fetched_ids fi ON tt.id = fi.row_id
WHERE fi.row_id IS NULL
This will return the rows from your transactional table that have not been saved as already fetched. As long as both (tt.id) and (fi.row_id) have (ideally unique) indexes, this should work just fine even on large sets of data. MySQL handles JOINs on indexed fields pretty well. Do not fear trying it out - create a new table, copy ids to it, delete some of them and run your query. You'll see the results and you'll know if they are satisfactory :)
P.S. Of course, adding rows to this 'fetched_ids' table should be done carefully so as not to create unnecessary duplicates (30 simultaneous processes could write 30 times the data you need - and if you need performance, you should watch out for this case).
How about a second table with a structure like this:
source_fk - this would hold the ID of the data row you want to read.
process_id - This would be a unique id for one of the 30 processes.
Then do a LEFT JOIN and exclude items from your source that have entries matching the specified process_id.
Once you get your results, just go back and add the source_fk and process_id for each result you get.
One plus about this is you can add more processes later on with no problem.
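A sketch of that layout, with hypothetical names (source_table for the table being read, process_log for the second table):

CREATE TABLE process_log (
    source_fk  INTEGER NOT NULL,  -- id of the row that was read
    process_id INTEGER NOT NULL,  -- which of the 30 processes read it
    PRIMARY KEY (source_fk, process_id)
);

-- Rows that process 7 has not handled yet:
SELECT s.*
FROM source_table s
LEFT JOIN process_log pl ON pl.source_fk = s.id AND pl.process_id = 7
WHERE pl.source_fk IS NULL;

-- After handling a row, record it:
INSERT INTO process_log (source_fk, process_id) VALUES (12345, 7);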
I would try adding a timestamp column and use it as a reference when retrieving new rows.
I'm using the database to name files when they are uploaded, and my application uses the database to identify the file (for video streaming), so they need to be the same. But if an upload fails, or someone cancels the upload, I'm left with a wasted ID.
Over time I have more wasted ID numbers than ones being used. I created a script that labels them unused 5 days after the row is inserted, and I was thinking of making another script to re-use those "unused" IDs and, only if none exist, insert a new row. But I can see problems when I have multiple servers doing that task.
Can I delete a row and have the ID from that row be re-used in future auto increments?
I'm open to other ways of accomplishing this task, but I'd prefer to re-use deleted IDs in future auto increments if possible.
If you delete the row you can run
ALTER TABLE yourtable AUTO_INCREMENT = 1
to reset the auto-increment counter, which MySQL will bump back up to one more than the highest ID still in the table, but I don't see the point in doing this. If you're doing this for the sake of the database, don't. The database doesn't care.
I'd +1 the "DON'T" if it's for the sake of the database, because really, it doesn't care at all. With that said, I do anyway because I'm a bit of a neat freak and I'll give you an example of why I decided to do it.
I created a service for musicians that uses an auto_increment as an index reference for songs, purchases and a few other things, and as with most apps that have registration forms and mail functions, it was bombarded by bots, spammers and the like who created tons and tons of free accounts. Those get deleted often (a cron job automatically cleans house about every 3 days), and what I end up with are very inconsistent counts: 1 to 41, then 129 to 240, then 491 to 800 and so on.
It was more of a personal preference to 'close the gaps' because, as I said, I'm a bit of a neat freak and I like consistency. I also have to print out reports for managers and so on who want to track data and like consistency, so it helps a lot.
And as for what happens when the auto increment hits a number that already exists? It skips to the next available one and continues. That's why you can reset it to 1 and everything's cool. No harm done.
I have a model Post which has an expiry_date. I want to know what would be the best way to manage scalability in this case. Two options:
1) Whenever I want to SELECT from the table, I need to include WHERE expiry_date > NOW(). If the Post table grows like a monster, I will be in trouble. Imagine after 3 years or more. Indexes will be huge too.
2) Have a trigger, cron job, or a plugin (if it exists) that would go through the table and move expired items to a new table Post_Archive. That way, I maintain only current Posts in my main table, which implies that after 3 years I won't be as badly off as with option 1.
If you need to archive data on a continuous basis (your #2), then a good option is Maatkit.
http://www.maatkit.org/
It can "nibble" away at data in chunks rather than running mass queries which consume lots of resources (and avoiding polluting your key cache).
So yes, you would run a Maatkit job from cron.
In the meantime, if you want to do #1 at the same time, you could maybe implement a view which conveniently wraps up the "WHERE expiry_date > NOW()" condition so you don't have to include it in all of your code.
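A minimal sketch of such a view against the Post table from the question (the view name, and the author_id column in the example query, are made up):

CREATE VIEW active_post AS
SELECT * FROM Post WHERE expiry_date > NOW();

-- Application code can then query active_post instead of Post:
SELECT * FROM active_post WHERE author_id = 42;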
A cron job sounds good to me, and it can be done by feeding a simple script directly to the mysql command, e.g., roughly:
CREATE TEMPORARY TABLE Moving
SELECT * FROM Post WHERE expiry_date <= NOW();
INSERT INTO Post_Archive
SELECT * FROM Moving;
DELETE FROM Post
WHERE id IN (SELECT id FROM Moving);
DROP TEMPORARY TABLE Moving;