How to create SQL table with limited entries? - mysql

I have a table that is used to store the latest actions the user did (like a ctrl+z for the program), but I want to limit this table to about 200 entries, and after that, every new entry would delete the oldest in the table.
Is there any option to make the table behave this way in SQL, or do I need to add some code to the program to do it?

I've seen this kind of idea before, but I've rarely seen a case where it was a good idea.
Your table would need these columns in addition to columns for the normal data.
A column of type integer to hold the row number.
A column of type timestamp (standard SQL timestamp) to hold the time of the last update.
The normal approach to limit this table to 200 rows would be to add a check constraint to the column of row numbers. For example, CHECK (row_num between 1 and 200). MySQL (prior to 8.0.16) doesn't enforce check constraints, so instead you'd need a foreign key reference to a table of row numbers (1 to 200).
All insert statements will need to determine whether the table is full, examine the time of the last update, and either a) insert a new row with a new row number, or b) delete the oldest row or overwrite it.
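If the requirement stays, here is a minimal sketch of just the "delete the oldest row" branch done in plain application-side SQL, without the row-number machinery. The table and column names are hypothetical, and the inner derived table works around MySQL's restriction on LIMIT inside an IN subquery:

-- Hypothetical undo-history table: insert the new action,
-- then delete everything beyond the newest 200 rows.
INSERT INTO user_actions (action, created_at)
VALUES ('typed text', NOW());

DELETE FROM user_actions
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id
        FROM user_actions
        ORDER BY created_at DESC, id DESC
        LIMIT 200
    ) AS newest
);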
My advice? Renegotiate this requirement.

Assuming that "200" is not a hard limit, in other words if the number of entries occasionally went over that by a small amount it would be OK...
Don't do the pruning inline; do it as an offline process, run as often as needed to keep the total per user from getting "too high".
For example, one such solution would be to run the SQL that does the pruning every hour via crontab.
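For illustration, a hedged sketch of a pruning statement such an hourly job could run (this assumes MySQL 8.0+ for window functions, and all table and column names are made up):

-- Keep only the newest 200 actions per user; anything that slipped over
-- the limit between runs gets removed here.
DELETE a
FROM user_actions AS a
JOIN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY created_at DESC, id DESC) AS rn
        FROM user_actions
    ) AS ranked
    WHERE rn > 200
) AS stale ON stale.id = a.id;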

Related

Making unique records in MS Access

I have 1.7 million records in an Access table, sorted A to Z. The records are not unique; there are repeated records. I want to make them unique based on their frequency: if a record is repeated 4 times, I want the first one to get "-1" appended to the record value, the second to get "-2", and so on. In this way similar records become unique. All similar records are beside each other because of the sorting. In Excel I do this task with an IF formula (if this cell value <> the cell value above, then 1, else the repeat number above plus 1), but in Access I don't know what to do (I'm a beginner).
Finally, I want to add a column to the original table which is (original record value - repeat number).
I appreciate your help.
Note about sort order:
Sort order in a relational database is not concrete like in a spreadsheet. There is no concept of rows being "next to each other", unless in context of an index. An index is largely a tool for the database to handle the data more efficiently (and to aid in defining uniqueness). The order itself is still largely dynamic because the order of a particular query can be specified differently from the index (or from storage order) and this does not change how the data is actually stored. Being "next to each other" is essentially a useless concept in SQL queries, unless you mean "next to each other numerically", for instance with an AutoNumber field or with the "repeat numbers" you want to add. Unlike in a spreadsheet, you cannot refer to the row "just above this row" or the "row offset by 2 from the 'current' row".
Solution
Regardless of whether or not you will use the AutoNumber column later, add a Long Integer AutoNumber column anyway. This column is named [ID] in the example code. Why? Because until you add something to allow the database to differentiate between the rows, there is technically no way using standard SQL to reliably reference individual duplicates since there is no way to distinguish individual rows. Even though you say that there are other differentiating columns, your own description rules out using them as a reliable key in referring to specific rows. (Even without such a differentiating column, Access can technically distinguish between rows. Iterating through a DAO.Recordset object in VBA would work, but perhaps not very elegant / efficient.)
Also add a new integer column for counting repeats, which below is named [DupeIndex]. A separate field is preferred (necessary?) because it allows continued reference to the original, unaltered duplicate values. If the reference number were directly updated, it would no longer match other fields and so would not be easily detected as a duplicate anymore. The following solution relies on grouping of ALL duplicate values, even those already "marked" with a [DupeIndex] number.
You should also realize that, in comparing different data sets, having separate fields allows more flexibility in matching the data. Having the values appended to the reference number complicates comparison, since you will likely want to compare not only rows with the same duplication index but all possible combinations. For example, comparing record 123-1 in one set to 123-4 in another: how do you select such rows in an automated fashion? You don't want to have to manually code all combinations, but that's what you'll end up doing if you don't keep them separate, like {123,1} and {123,4}.
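For reference, a minimal sketch of adding the two helper columns in Access/Jet DDL (this assumes the table is called [Data], as in the queries below; COUNTER is the DDL name for an AutoNumber column and LONG for a Long Integer):

ALTER TABLE Data ADD COLUMN ID COUNTER;
ALTER TABLE Data ADD COLUMN DupeIndex LONG;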
Create and save this as a named query [Duplicates]. This query is referenced by later queries. It could instead be embedded as a subquery, but my preference is to use saved queries for easier visualization and debugging in Access:
SELECT Data.RefNo, Count(Data.ID) AS Dupes, Max(Data.DupeIndex) AS IndexMax
FROM Data
GROUP BY Data.RefNo
HAVING Count(Data.ID) > 1
Execute the following to create a temporary table with new duplicate index values:
SELECT D1.ID, D1.RefNo,
IIf([Duplicates].[IndexMax] Is Null,0,[Duplicates].[IndexMax])
+ 1
+ (SELECT Count(D2.ID) FROM Data As D2
WHERE D2.[RefNo]=[D1].[RefNo]
And [D2].[DupeIndex] Is Null
And [D2].[ID]<[D1].[ID]) AS NewIndex
INTO TempIndices
FROM Data AS D1 INNER JOIN Duplicates ON D1.RefNo = Duplicates.RefNo
WHERE (D1.DupeIndex Is Null);
Execute the update query to set the new duplicate index values:
UPDATE Data
INNER JOIN TempIndices ON Data.ID = TempIndices.ID
SET Data.DupeIndex = [NewIndex]
Optionally, remove the AutoNumber field and assign the combination of [RefNo] and the new [DupeIndex] as the primary key. The temporary table can also be deleted.
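If the combined "value-repeat" text from the original question is also wanted as its own column, a hedged sketch (this assumes [RefNo] is text and that a new text column [RefNoUnique] has been added for the result):

UPDATE Data
SET Data.RefNoUnique = [RefNo] & IIf([DupeIndex] Is Null, "", "-" & [DupeIndex]);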
Comments about the queries:
The solution assumes that [DupeIndex] is Null for unprocessed duplicates.
The solution correctly handles existing duplicate index numbers, only updating duplicate rows without a unique index.
Access has rather strict conditions for UPDATE queries, namely that updates may not be based on circular references and that joins may not produce multiple updates for the same row, etc. The temporary table is necessary in this case, since the query determining new index values refers multiple times in subqueries to the very column that is being updated. (If the update is attempted using joins on the subqueries, for example, Access complains that the "Operation must use an updatable query".)

Generic trigger on update / delete to back up current row

My situation is this:
I have a table, call it x.
Every time a row is updated or deleted, a copy of the old row should be inserted into x_history.
Additionally, x_history will have its own auto-incrementing id column; call that histid.
It is very important to have its own id column, as this will give us the flexibility to build version-restore functionality.
I have 100+ tables to apply this to so I'm looking for a generic trigger that can be used for any table to backup one row into a history table. Only the 2 table names should vary from trigger to trigger. Specifying all column names is really not what I'm looking for.
I need to do this in MySQL but have added MSSQL too - I know both so can convert between one and the other easy enough.
Usually, triggers are not the optimal solution for such purposes.
If possible, you might want to consider changing your database design.
Normally, a better way to handle such things is to keep the whole history in the source table, and have a status column that tells you, for each row, whether it's deleted, updated, or current.
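A hedged sketch of that "history stays in the source table" design, with made-up names (MySQL syntax):

CREATE TABLE x (
    histid      INT AUTO_INCREMENT PRIMARY KEY,   -- every version gets its own id
    business_id INT NOT NULL,                     -- logical identity of the row
    row_status  ENUM('current','updated','deleted') NOT NULL DEFAULT 'current',
    payload     VARCHAR(255)
);

-- "Updating" row 7 means retiring the current version and inserting a new one:
UPDATE x SET row_status = 'updated'
WHERE business_id = 7 AND row_status = 'current';

INSERT INTO x (business_id, row_status, payload)
VALUES (7, 'current', 'new value');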
I have little to no experience with MySql, but I have been working with Sql server for the past 7 or 8 years, so what I'm about to say is true for sql server, but may be different for MySql.
If you choose to go with the triggers approach, keep in mind that AFTER UPDATE triggers execute even if the update does not change the row data (e.g. UPDATE tableName SET col1 = 1 WHERE idCol = 4 fires the update trigger even if col1 was already 1 before the update, so no data was changed).
For SqlServer, you might want to consider a common history table, that has only 6 columns:
1. Identity column
2. Table name column
3. Row Id column (original id from the original table)
4. Row Status column (e.g updated, deleted)
5. Action date (the date the row was copied to the history table)
6. Row content column (this should be an XML datatype; not sure if MySQL has such a datatype)
and then all you have to do is use "SELECT * FROM deleted/inserted FOR XML AUTO" to create the content for the 6th column.
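To make that concrete, here is a hedged T-SQL sketch for one table x; the history table and column names are assumptions, an id column on x is assumed, and the trigger is what you would repeat per table, changing only the table name:

-- Shared history table, roughly the six columns described above
CREATE TABLE dbo.RowHistory (
    HistId     INT IDENTITY(1,1) PRIMARY KEY,
    TableName  SYSNAME      NOT NULL,
    RowId      INT          NOT NULL,        -- id from the original table
    RowStatus  VARCHAR(10)  NOT NULL,        -- 'updated' or 'deleted'
    ActionDate DATETIME     NOT NULL DEFAULT GETDATE(),
    RowContent XML          NOT NULL         -- the old row serialized as XML
);
GO
CREATE TRIGGER dbo.trg_x_history ON dbo.x
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.RowHistory (TableName, RowId, RowStatus, RowContent)
    SELECT 'x',
           d.id,
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 'updated' ELSE 'deleted' END,
           -- serialize the old version of this particular row as XML
           (SELECT d2.* FROM deleted AS d2 WHERE d2.id = d.id FOR XML AUTO, TYPE)
    FROM deleted AS d;
END
GO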

How to retrieve the new rows of a table every minute

I have a table, to which rows are only appended (not updated or deleted) with transactions (I'll explain why this is important), and I need to fetch the new, previously unfetched, rows of this table, every minute with a cron.
How am I going to do this? In any programming language (I use Perl but that's irrelevant.)
I list the ways I thought of how to solve this problem, and ask you to show me the correct one (there HAS to be one...)
The first way that popped into my head was to save (in a file) the largest auto_incrementing id of the rows fetched, so in the next minute I can fetch with: WHERE id > $last_id. But that can miss rows. Because new rows are inserted in transactions, it's possible that the transaction that saves the row with id = 5 commits before the transaction that saves the row with id = 4. It's therefore possible that the cron script retrieves row 5 but not row 4, and when row 4 gets committed one split second later, it will never get fetched (because 4 is not > 5, which is the $last_id).
Then I thought I could make the cron job fetch all rows that have a date field in the last TWO minutes, check which of these rows have been retrieved again in the previous run of the cron job (to do this I would need to save somewhere which row ids were retrieved), compare, and process only the new ones. Unfortunately this is complicated, and also doesn't solve the problem that will occur if a certain inserting transaction takes TWO AND A HALF minutes to commit for some weird database reason, which will cause the date to be too old for the next iteration of the cron job to fetch.
Then I thought of installing a message queue (MQ) like RabbitMQ or any other. The same process that does the inserting transaction, would notify RabbitMQ of the new row, and RabbitMQ would then notify an always-running process that processes new rows. So instead of getting a batch of rows inserted in the last minute, that process would get the new rows one-by-one as they are written. This sounds good, but has too many points of failure - RabbitMQ might be down for a second (in a restart for example) and in that case the insert transaction will have committed without the receiving process having ever received the new row. So the new row will be missed. Not good.
I just thought of one more solution: the receiving processes (there are 30 of them, doing the exact same job on exactly the same data, so the same rows get processed 30 times, once by each receiving process) could write in another table that they have processed row X when they process it; then when the time comes they can ask for all rows in the main table that don't exist in the "have_processed" table with an OUTER JOIN query. But I believe (correct me if I'm wrong) that such a query will consume a lot of CPU and disk on the DB server, since it will have to compare the entire list of ids of the two tables to find new entries (and the table is huge and getting bigger each minute). It would have been fast if there were only one receiving process - then I would have been able to add an indexed field named "have_read" to the main table, which would make looking for new rows extremely fast and easy on the DB server.
What is the right way to do it? What do you suggest? The question is simple, but a solution seems hard (for me) to find.
Thank you.
I believe the 'best' way to do this would be to use one process that checks for new rows and delegates them to the thirty consumer processes. Then your problem becomes simpler to manage from a database perspective and a delegating process is not that difficult to write.
If you are stuck with communicating to the thirty consumer processes through the database, the best option I could come up with is to create a trigger on the table, which copies each row to a secondary table. Copy each row to the secondary table thirty times (once for each consumer process). Add a column to this secondary table indicating the 'target' consumer process (for example a number from 1 to 30). Each consumer process checks for new rows with its unique number and then deletes those. If you are worried that some rows are deleted before they are processed (because the consumer crashes in the middle of processing), you can fetch, process and delete them one by one.
Since the secondary table is kept small by continuously deleting processed rows, INSERTs, SELECTs and DELETEs would be very fast. All operations on this secondary table would also be indexed by the primary key (if you place the consumer ID as first field of the primary key).
In MySQL statements, this would look like this:
CREATE TABLE `consumer`(
`id` INTEGER NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `consumer`(`id`) VALUES
(1),
(2),
(3)
-- all the way to 30
;
CREATE TABLE `secondaryTable` LIKE `primaryTable`;
ALTER TABLE `secondaryTable` ADD COLUMN `targetConsumerId` INTEGER NOT NULL FIRST;
-- alter the secondary table further to allow several rows with the same primary key (by adding targetConsumerId to the primary key)
DELIMITER //
CREATE TRIGGER `mark_to_process` AFTER INSERT ON `primaryTable`
FOR EACH ROW
BEGIN
  -- The cross join with the consumer table automatically inserts the right
  -- number of rows, so adding or deleting consumers is just a matter of
  -- adding or deleting rows in the consumer table.
  INSERT INTO `secondaryTable` (`targetConsumerId`, `primaryTableId`, `primaryTableField1`, `primaryTableField2`)
  SELECT `consumer`.`id`, `primaryTable`.`id`, `primaryTable`.`field1`, `primaryTable`.`field2`
  FROM `consumer`, `primaryTable`
  WHERE `primaryTable`.`id` = NEW.`id`;
END//
DELIMITER ;
-- loop over the following statements in each consumer until the SELECT doesn't return any more rows
START TRANSACTION;
SELECT * FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID LIMIT 1;
-- here, do the processing (so before the COMMIT so that crashes won't let you miss rows)
DELETE FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID AND primaryTableId = PRIMARY_TABLE_ID_OF_ROW_JUST_SELECTED;
COMMIT;
I've been thinking about this for a while. So, let me see if I got it right. You have a HUGE table to which N processes write (the amount may vary over time; let's call them producers). Now, there are M other processes (the amount may also vary over time) that need to process each of those added records at least once (let's call them consumers).
The main issues detected are:
Making sure the solution will work with dynamic N and M
It is needed to keep track of the unprocessed records for each consumer
The solution has to scale as much as possible due to the huge amount of records
In order to tackle those issues I thought of this. Create this table (the primary key is the combination of both columns):
PENDING_RECORDS(ConsumerID, HugeTableID)
Modify the producers so that each time they add a record to the HUGE_TABLE they also add M records to the PENDING_RECORDS table: one for the new HugeTableID paired with each ConsumerID that exists at that time. Each time a consumer runs, it will query the PENDING_RECORDS table and find a small number of matches for itself. It will then join against the HUGE_TABLE (note it will be an inner join, not a left join) and fetch the actual data it needs to process. Once the data is processed, the consumer deletes the fetched records from the PENDING_RECORDS table, keeping it decently small.
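A hedged sketch of that layout in MySQL (the consumer id 7 and the HUGE_TABLE column names are made up):

CREATE TABLE PENDING_RECORDS (
    ConsumerID  INT    NOT NULL,
    HugeTableID BIGINT NOT NULL,
    PRIMARY KEY (ConsumerID, HugeTableID)
);

-- Each consumer fetches only its own pending rows...
SELECT h.*
FROM PENDING_RECORDS AS p
INNER JOIN HUGE_TABLE AS h ON h.id = p.HugeTableID
WHERE p.ConsumerID = 7;

-- ...and removes its claims once they are processed.
DELETE FROM PENDING_RECORDS
WHERE ConsumerID = 7
  AND HugeTableID IN (1001, 1002, 1003);   -- the ids just processed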
Interesting, I must say :)
1) First of all - is it possible to add a field to the table that has rows only added (let's call it 'transactional_table')? I mean, is it a design paradigm and you have a reason not to do any sort of updates on this table, or is it "structurally" blocked (i.e. the user connecting to the db has no privileges to perform updates on this table)?
Because then the simplest way to do it is to add a "have_read" column to this table with default 0, and update this column to 1 on fetched rows (even if 30 processes do this simultaneously, you should be fine, as it would be very fast and it won't corrupt your data). Even if 30 processes mark the same 1000 rows as fetched - nothing is corrupted. Although if you do not use InnoDB, this might not be the best way as far as performance is concerned (MyISAM locks whole tables on updates, InnoDB only the rows that are updated).
2) If this is not what you could use - I would surely check out the solution you gave as your last one, with a little modification. Create a table (let's say: fetched_ids), and save fetched rows' ids in that table. Then you could use something like :
SELECT tt.* from transactional_table tt
LEFT JOIN fetched_ids fi ON tt.id = fi.row_id
WHERE fi.row_id IS NULL
This will return the rows from your transactional table that have not been marked as already fetched. As long as both (tt.id) and (fi.row_id) have (ideally unique) indexes, this should work just fine even on large sets of data. MySQL handles JOINs on indexed fields pretty well. Do not fear trying it out - create a new table, copy ids to it, delete some of them and run your query. You'll see the results and you'll know if they are satisfactory :)
P.S. Of course, adding rows to this 'fetched_ids' table should be done carefully so as not to create unnecessary duplicates (30 simultaneous processes could write 30 times the data you need - and if you need performance, you should watch out for this case).
How about a second table with a structure like this:
source_fk - this would hold an ID of the data rows you want to read.
process_id - This would be a unique id for one of the 30 processes.
Then do a LEFT JOIN and exclude items from your source that have entries matching the specified process_id.
Once you get your results, just go back and add the source_fk and process_id for each result you get.
One plus of this approach is that you can add more processes later on with no problem.
I would try adding a timestamp column and use it as a reference when retrieving new rows.

How to best check if a SQL table's contents have not changed?

Assuming I have the following table named "contacts":
id|name|age
1|John|5
2|Amy|2
3|Eric|6
Is there some easy way to check whether or not this table changes much like how a sha/md5 hash works when getting the checksum for a file on your computer?
So for example, if a new row was added to this table, or if a value was changed within the table, the "hash" or some generated value shows that the table has changed.
If there is no direct mechanism, what is the best way to do this (it could be some arbitrary hash mechanism, as long as the method puts emphasis on performance and minimizing latency)? Could it be applied to multiple tables?
There is no direct mechanism to get that information through SQL.
You could consider adding an additional LastModified column to each row. To know the last time the table was modified, select the maximum value for that column.
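A minimal sketch of that column, with the name as an assumption (note that deleted rows will not show up this way, which is where the trigger idea below comes in):

ALTER TABLE contacts
    ADD COLUMN LastModified TIMESTAMP NOT NULL
    DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- "When was this table last touched?"
SELECT MAX(LastModified) FROM contacts;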
You could achieve a similar outcome by using a trigger on the table for INSERT, UPDATE and DELETE, which updates a separate table with the last modified timestamp.
If you want to know whether something has changed, you need something to compare, for example a date. You can add a table with two columns, the table name and a timestamp, and program a trigger for the events on the table you want to monitor, so that the trigger updates the timestamp column of this control table.
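A hedged sketch of that control table and one of the triggers (all names are made up; the UPDATE and DELETE triggers would look the same):

CREATE TABLE table_versions (
    table_name    VARCHAR(64) NOT NULL PRIMARY KEY,
    last_modified TIMESTAMP   NOT NULL
);

DELIMITER //
CREATE TRIGGER contacts_touch_insert AFTER INSERT ON contacts
FOR EACH ROW
BEGIN
    -- record the time of the latest change to the contacts table
    INSERT INTO table_versions (table_name, last_modified)
    VALUES ('contacts', NOW())
    ON DUPLICATE KEY UPDATE last_modified = NOW();
END//
DELIMITER ;

-- Checking for changes is then a single indexed lookup:
-- SELECT last_modified FROM table_versions WHERE table_name = 'contacts';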
If the table isn't too big, you could take a copy of the entire table. When you want to check for changes, you can then query the old vs. new data.
DROP TABLE IF EXISTS backup_table_name;
CREATE TABLE backup_table_name LIKE table_name;
INSERT INTO backup_table_name SELECT * FROM `table_name`;

MySQL Partitioning, Delete old data from multiple related tables

I am new to MySQL partitioning, therefore any example will be appreciated.
I am trying to create a sort of ageing mechanism for data that is distributed between several MyISAM tables.
My question will actually include several sub-questions.
The relevant tables are:
The first table contains raw data arriving at a high input frequency (each record has an auto-incremented id).
The second table contains processed results; there is one result record per raw data record (each result record stores the auto-incremented id of its source raw data record).
Questions:
I need to be able to partition the raw data table and the result data table similarly, so that both of them include only 10 weeks of data in a single partition (each raw data record contains a unixtimestamp field). How do I do it? Can someone write a small example case for two such tables?
I want to be able to change the 10 weeks constraint on the fly.
I want that whenever the current partition is filled or a new partition is created, the previous (10 weeks before) partition will be deleted automatically.
I don't want the auto-increment id integer to overflow. As far as I understand, the ids are unique for the partition only, so if I am not wrong the auto-increment id will start from zero for the next partition? But what if the previous partition still exists - will I have 2 duplicated ids, and how do I know to reference only the last id when I present a result record?
I want to load raw data using LOAD DATA INFILE ... instead of multiple inserts. Is MySQL's partitioning functionality affected?
And the last question: would you suggest some other approach to implement the ageing mechanism? (I am writing a Java product that processes around 1 GB of raw data per day and stores the results in MySQL.)
It's hard to give a real answer on this question since it depends on your data. But let me give you some things to think about.
I assume we're talking about some kind of logs with recent data (so not spanning multiple years). You can partition by range. You could add one field to your table with the year/week number (e.g. 201201, 201202, etc.). If this question is related to your question about importing into multiple tables, you can easily do this in that import script.
On the fly, as in repartition your data on the fly (70GB?)? I would not recommend it. But you could do it if you had the week number in there. If you later want to change it to 12 days, you could add a column for the date and partition by that.
Well, it won't be deleted automatically, but a cron job can handle that, right? Just check how many partitions there are, and if there are 3(?), delete the first one.
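To make these first points a bit more concrete, a hedged sketch (table, column and partition names are assumptions; the result table would be partitioned the same way on the same year/week value):

CREATE TABLE raw_data (
    id       BIGINT NOT NULL AUTO_INCREMENT,
    recorded INT    NOT NULL,      -- unix timestamp of the sample
    yearweek INT    NOT NULL,      -- e.g. 201201, filled in by the importer
    payload  VARCHAR(255),
    PRIMARY KEY (id, yearweek)     -- the partitioning column must be part of every unique key
) ENGINE=MyISAM
PARTITION BY RANGE (yearweek) (
    PARTITION p201152 VALUES LESS THAN (201201),
    PARTITION p201210 VALUES LESS THAN (201211),
    PARTITION p201220 VALUES LESS THAN (201221),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- The cron job that ages data out simply drops the oldest partition:
ALTER TABLE raw_data DROP PARTITION p201152;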
The partitioning column needs to be part of the primary key (if you want to use auto-increment). Therefore you can never fully rely on the auto-increment id alone. I don't see a way around this.
I'm not sure what you mean.
If your data is just some logs in chronological order, then you might just use separate tables for each period. Then, before you start the new period (at 00:00), check the last id of the last table, create a new table and set the auto-increment to that value +1. Then your import decides when a new period begins, so it can be easily changed. Your import script can use a small table in which it stores the next period.
LOAD DATA is really quite fast. I would just have two steps (in no particular order) - LOAD DATA and then 'delete .. where date < 10 weeks'. Auto-increment will go on for as long as the datatype you're using allows. If you wanted to be super careful you could push it back to zero periodically.
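A short sketch of those two steps (the file path, table and column names are hypothetical, and this assumes a plain, non-partitioned raw table with a unix-timestamp column):

-- Bulk-load the latest batch of raw data from a CSV file
LOAD DATA INFILE '/tmp/raw_batch.csv'
INTO TABLE raw_log
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(recorded, payload);

-- Age out anything older than 10 weeks
DELETE FROM raw_log
WHERE recorded < UNIX_TIMESTAMP(NOW() - INTERVAL 10 WEEK);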
Once the data is in the 'raw' table, run your routine to create the 'processed' table. We use a very similar process where I work. We keep a separate table that has 'write' and 'parse' pointers to all of our 'raw' tables. As new data comes in and gets parsed, the appropriate row pointers get set. If the 'raw' table gets truncated you can reset the 'write' pointer but leave the 'parse' pointer. (We store the offset in another table when this happens - just to be sure.)
And if I may recommend: creating an index on each of the related columns can also enhance the performance of deleting old data from multiple related tables, since the comparison is then done on index numbers rather than strings.
I wonder if your tables are being sorted or not.