I have 1.7 million records in an Access table sorted A to Z. The records are not unique; some are repeated. I want to make them unique based on their frequency: if a record is repeated 4 times, I want the first one to get "-1" appended to the record value, the second to get "-2", and so on, so that identical records become unique. All identical records are beside each other because of the sorting. In Excel I do this with an IF function (if this cell's value <> the value of the cell above, then 1, else the repeat number above plus 1), but in Access I don't know what to do (I'm a beginner).
Finally, I want to add a column to the original table containing (original record value - repeat number).
I appreciate your help
Note about sort order:
Sort order in a relational database is not concrete like in a spreadsheet. There is no concept of rows being "next to each other", unless in context of an index. An index is largely a tool for the database to handle the data more efficiently (and to aid in defining uniqueness). The order itself is still largely dynamic because the order of a particular query can be specified differently from the index (or from storage order) and this does not change how the data is actually stored. Being "next to each other" is essentially a useless concept in SQL queries, unless you mean "next to each other numerically", for instance with an AutoNumber field or with the "repeat numbers" you want to add. Unlike in a spreadsheet, you cannot refer to the row "just above this row" or the "row offset by 2 from the 'current' row".
Solution
Regardless of whether or not you will use the AutoNumber column later, add a Long Integer AutoNumber column anyway. This column is named [ID] in the example code. Why? Because until you add something that lets the database differentiate between the rows, there is technically no way in standard SQL to reliably reference individual duplicates, since there is no way to distinguish individual rows. Even though you say that there are other differentiating columns, your own description rules out using them as a reliable key for referring to specific rows. (Even without such a differentiating column, Access can technically distinguish between rows. Iterating through a DAO.Recordset object in VBA would work, but it would perhaps not be very elegant or efficient.)
Also add a new integer column for counting repeats, which below is named [DupeIndex]. A separate field is preferred (necessary?) because it allows continued reference to the original, unaltered duplicate values. If the reference number were directly updated, it would no longer match other fields and so would not be easily detected as a duplicate anymore. The following solution relies on grouping of ALL duplicate values, even those already "marked" with a [DupeIndex] number.
You should also realize that when comparing different data sets, having separate fields allows more flexibility in matching the data. Having the values appended to the reference number complicates comparison, since you will likely want to compare not only rows with the same duplication index but all possible combinations. For example, comparing records 123-1 in one set to 123-4 in another... how do you select such rows in an automated fashion? You don't want to have to manually code all combinations, but that's what you'll end up doing if you don't keep them separate like {123,1} and {123,4}.
Create and save this as a named query [Duplicates]. This query is referenced by later queries. It could instead be embedded as a subquery, but my preference is to use saved queries for easier visualization and debugging in Access:
SELECT Data.RefNo, Count(Data.ID) AS Dupes, Max(Data.DupeIndex) AS IndexMax
FROM Data
GROUP BY Data.RefNo
HAVING Count(Data.ID) > 1
Execute the following to create a temporary table with new duplicate index values:
SELECT D1.ID, D1.RefNo,
IIf([Duplicates].[IndexMax] Is Null,0,[Duplicates].[IndexMax])
+ 1
+ (SELECT Count(D2.ID) FROM Data As D2
WHERE D2.[RefNo]=[D1].[RefNo]
And [D2].[DupeIndex] Is Null
And [D2].[ID]<[D1].[ID]) AS NewIndex
INTO TempIndices
FROM Data AS D1 INNER JOIN Duplicates ON D1.RefNo = Duplicates.RefNo
WHERE (D1.DupeIndex Is Null);
Execute the update query to set the new duplicate index values:
UPDATE Data
INNER JOIN TempIndices ON Data.ID = TempIndices.ID
SET Data.DupeIndex = [NewIndex]
Optionally remove the AutoNumber field and now assign the combined [RefNo] and new [DupeIndex] as primary key. The temporary table can also be deleted.
Comments about the queries:
The solution assumes that [DupeIndex] is Null for unprocessed duplicates.
The solution correctly handles existing duplicate index numbers, only updating duplicate rows without a unique index.
Access has rather strict conditions for UPDATE queries, namely that updates are not based on circular references and that joins will not produce multiple updates for the same row, etc. The temporary table is necessary in this case, since the query determining new index values refers multiple times in subqueries to the very column that is being updated. (If the update is attempted using joins on the subqueries, for example, Access complains that "Operation must use an updatable query".)
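For readers who want to try the numbering logic outside Access, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here: the table and column names follow the answer, but IIf is replaced by COALESCE, the saved [Duplicates] query is inlined as subqueries, and the sample rows are made up. It is not a drop-in replacement for the Access queries above.

```python
import sqlite3

def assign_dupe_indices(conn):
    """Give every unprocessed duplicate row the next free index in its RefNo group."""
    conn.execute("DROP TABLE IF EXISTS TempIndices")
    # New index = highest existing index in the group (or 0)
    #             + 1
    #             + number of unprocessed rows in the group with a lower ID.
    conn.execute(
        """
        CREATE TEMP TABLE TempIndices AS
        SELECT D1.ID,
               COALESCE((SELECT MAX(D3.DupeIndex) FROM Data AS D3
                         WHERE D3.RefNo = D1.RefNo), 0)
               + 1
               + (SELECT COUNT(*) FROM Data AS D2
                  WHERE D2.RefNo = D1.RefNo
                    AND D2.DupeIndex IS NULL
                    AND D2.ID < D1.ID) AS NewIndex
        FROM Data AS D1
        WHERE D1.DupeIndex IS NULL
          AND (SELECT COUNT(*) FROM Data AS D4 WHERE D4.RefNo = D1.RefNo) > 1
        """
    )
    # Copy the computed values back from the temporary table.
    conn.execute(
        """
        UPDATE Data
        SET DupeIndex = (SELECT NewIndex FROM TempIndices
                         WHERE TempIndices.ID = Data.ID)
        WHERE ID IN (SELECT ID FROM TempIndices)
        """
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (ID INTEGER PRIMARY KEY AUTOINCREMENT, RefNo TEXT, DupeIndex INTEGER)"
)
conn.executemany(
    "INSERT INTO Data (RefNo) VALUES (?)",
    [("A",), ("A",), ("A",), ("B",), ("C",), ("C",)],  # made-up sample data
)
assign_dupe_indices(conn)
first_pass = conn.execute("SELECT RefNo, DupeIndex FROM Data ORDER BY ID").fetchall()

# A duplicate that arrives later continues the numbering of its group.
conn.execute("INSERT INTO Data (RefNo) VALUES ('A')")
assign_dupe_indices(conn)
second_pass = conn.execute("SELECT RefNo, DupeIndex FROM Data ORDER BY ID").fetchall()
```

As in the Access version, a value that occurs only once (B in the sample) keeps a Null index, and rows that were already numbered are left alone on a second run.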
I have an old MySQL database where I need to insert new columns into tables (to support new parts of the front-end). But some of the old parts use SQL commands that depend on column count and order instead of their names, e.g.:
INSERT INTO `data` VALUES (null /*auto-id*/, "name", "description", ...)
When I add new columns into this table, I get the error:
1136 - Column count doesn't match value count at row 1
Right now I know about the INSERT which needs to be changed to:
INSERT INTO `data` (`name`, `desc`, ...) VALUES ("name", "description", ...)
The question is: are there any other commands with similar syntax that rely on the order or count of the columns instead of their names? I need to update all the old SQL commands before updating the DB, and using trial and error would take really long.
SELECTs are not a problem, because the front-end uses associative mapping and correctly uses their names everywhere, so new columns will just be ignored. Also, I'm sure there are no commands that modify the DB structure (e.g. ALTER TABLE).
You ruled out data structure modifying queries, so this leaves us with insert, update, delete, and select.
Insert you are already aware of.
Update requires each updated field to be specified, so mostly that's OK. However, subqueries may be used in the where clause, and MySQL allows multi-table updates, so my points around select do apply.
Delete applies to a whole record, so there is nothing that an extra field would influence. However, subqueries may be used in the where clause, so my points around select do apply.
You tried to rule out select, but you should not. It is not only the final resultset that can be influenced by a new field:
A subquery may use select *, and an extra field may then cause an error in the outer query. For example, the newly introduced field may have the same name as another field in the outer query, leading to an ambiguous field name error.
If select * is used in union, then column counts may not match after adding a new field.
Natural joins may also be affected by an introduction of a new field.
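A quick way to see two of these failure modes is to reproduce them on a throwaway schema. The sketch below uses Python's sqlite3 module rather than MySQL, so the error messages differ from MySQL's 1136, and the `archive` table and `created_at` column are invented for the demonstration. It only shows that the positional INSERT and the `select *` union stop working once a column is added, while the INSERT with named columns keeps working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data    (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    CREATE TABLE archive (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
    """
)

statements = {
    # Depends on the number and order of columns.
    "positional_insert": "INSERT INTO data VALUES (NULL, 'name', 'description')",
    # Depends only on column names.
    "named_insert": "INSERT INTO data (name, description) VALUES ('name', 'description')",
    # Depends on both tables having the same number of columns.
    "star_union": "SELECT * FROM data UNION SELECT * FROM archive",
}

def fails(sql):
    """Return True if the statement is rejected by the database."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.OperationalError:
        return True

before = {name: fails(sql) for name, sql in statements.items()}

# Simulate the schema change from the question (hypothetical new column).
conn.execute("ALTER TABLE data ADD COLUMN created_at TEXT")

after = {name: fails(sql) for name, sql in statements.items()}
```

Running the old statements against a copy of the schema with the new columns already added, as above, is one way to find the offenders without touching the production database.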
I have a table that is used to store the latest actions the user did (like a ctrl+z for the program), but I want to limit this table to about 200 entries, and after that, every new entry would delete the oldest in the table.
Is there any option to make the table behave this way on SQL or do I need to add some code to the program to do it?
I've seen this kind of idea before, but I've rarely seen a case where it was a good idea.
Your table would need these columns in addition to columns for the normal data.
A column of type integer to hold the row number.
A column of type timestamp (standard SQL timestamp) to hold the time of the last update.
The normal approach to limit this table to 200 rows would be to add a check constraint to the column of row numbers. For example, CHECK (row_num between 1 and 200). MySQL versions before 8.0.16 parse check constraints but don't enforce them, so there you'll instead need to use a foreign key reference to a table of row numbers (1 to 200).
All insert statements will need to determine whether the table is full, examine the time of the last update, and either a) insert a new row with a new row number, or b) delete the oldest row or overwrite it.
My advice? Renegotiate this requirement.
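To make the amount of work per insert concrete, here is a rough sketch of the scheme described above, written with Python's sqlite3 module. SQLite (which does enforce CHECK constraints) stands in for MySQL; the table name `actions`, the helper `record_action`, and the limit of 3 instead of 200 are assumptions made to keep the demo short. The timestamp is passed in explicitly so the behaviour is deterministic.

```python
import sqlite3

LIMIT = 3  # 200 in the question; kept small so the demo is easy to follow

conn = sqlite3.connect(":memory:")
conn.execute(
    f"""
    CREATE TABLE actions (
        row_num    INTEGER PRIMARY KEY CHECK (row_num BETWEEN 1 AND {LIMIT}),
        action     TEXT NOT NULL,
        updated_at REAL NOT NULL
    )
    """
)

def record_action(conn, action, now):
    """Insert while there is room, otherwise overwrite the oldest row."""
    (count,) = conn.execute("SELECT COUNT(*) FROM actions").fetchone()
    if count < LIMIT:
        conn.execute(
            "INSERT INTO actions (row_num, action, updated_at) VALUES (?, ?, ?)",
            (count + 1, action, now),
        )
    else:
        conn.execute(
            """
            UPDATE actions
            SET action = ?, updated_at = ?
            WHERE row_num = (SELECT row_num FROM actions
                             ORDER BY updated_at LIMIT 1)
            """,
            (action, now),
        )

for tick, name in enumerate(["a1", "a2", "a3", "a4", "a5"], start=1):
    record_action(conn, name, float(tick))

rows = conn.execute("SELECT row_num, action FROM actions ORDER BY row_num").fetchall()
```

Every writer has to go through this count-then-insert-or-overwrite logic, and it would need to run inside a transaction if several clients write at once, which is part of why the requirement is worth renegotiating.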
Assuming that "200" is not a hard limit, in other words if the number of entries occasionally went over that by a small amount it would be OK...
Don't do the pruning online; do it as an offline process, run as often as needed to keep the totals per user from getting "too high".
For example, one such solution would be to run the pruning SQL every hour from crontab.
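The pruning statement itself could look roughly like the sketch below, again using Python's sqlite3 module as a stand-in. The `actions` table, its `user_id` column and the `prune` helper are assumptions, and the limit is 3 rather than 200 for readability. In MySQL the same DELETE would probably need reworking, since MySQL rejects a subquery that reads from the table being deleted from (error 1093) unless it is wrapped in a derived table.

```python
import sqlite3

KEEP = 3  # stands in for the 200 in the question

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE actions (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL,
        action  TEXT NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO actions (user_id, action) VALUES (?, ?)",
    [(1, f"u1-{n}") for n in range(5)] + [(2, f"u2-{n}") for n in range(2)],
)

def prune(conn, keep=KEEP):
    """Delete every row that has at least `keep` newer rows for the same user."""
    cur = conn.execute(
        """
        DELETE FROM actions
        WHERE id IN (
            SELECT a.id
            FROM actions AS a
            WHERE (SELECT COUNT(*) FROM actions AS b
                   WHERE b.user_id = a.user_id AND b.id > a.id) >= ?
        )
        """,
        (keep,),
    )
    return cur.rowcount

deleted = prune(conn)
remaining = conn.execute("SELECT user_id, action FROM actions ORDER BY id").fetchall()
```

Between runs a user can temporarily exceed the limit, which is exactly the trade-off this approach accepts.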
In BO XI 3.1, is it possible to create a condition object that filters on multiple tables, without adding all of those tables to the query if they weren't already present?
For example, if I have several tables which all contain both current and historical data, and each table has a flag to indicate if the record is current or historical - can I create a single "Current Data" condition that filters all of such tables to pull only current data? The catch would be that the query might not be selecting from all of these tables, and I don't want the inclusion of the condition to add joins to tables I'm not selecting from.
In other words, can a condition check which tables are being used by the query and apply filters only on those tables?
You could add a self-restricting join to each of those tables, and use an @Prompt function to ask whether to return current data or historical data. If you use the same text and same datatype for all of the prompts in each self-restricting join, the prompt will only be shown once, and will only be applied to the tables that are actually used in the generated query.
The self-restricting join could look something like:
<table>.<history_flag>
= @Prompt('Select current or historical data','A',{'C','H'}, Mono, constrained, , {'C'})
In the above example, we assume that the flag is an alphanumeric column (A) with values C or H ({'C','H'}). The user is only allowed to pick from these two values (constrained) and only one value can be chosen (Mono). The default choice is set to current data ({'C'}).
Have a look at the Universe Designer guide for the @Prompt syntax. Self-restricting joins are explained in the same manual.
I am new to MySQL partitioning, therefore any example will be appreciated.
I am trying to create a sort of ageing mechanism for data that is distributed between several MyISAM tables.
My question will actually include several sub-questions.
The relevant tables are:
First table contains raw data with high input frequency (next to each record there is an auto incremented id).
Second table contains processed results, there is a result record per every raw data record (result record contains the source id record of the auto incremented field of raw data record)
Questions:
I need to be able to partition the raw data table and the result data table similarly, so that each of them holds only 10 weeks of data in a single partition (each raw data record contains a unixtimestamp field). How do I do it? Can someone write a small example case for two such tables?
I want to be able to change the 10 weeks constraint on the fly.
Whenever the current partition is filled or a new partition is created, I want the previous (10 weeks before) partition to be deleted automatically.
I don't want the auto-increment id integer to overflow. As far as I understand, the ids are unique within a partition only, so if I am not wrong the auto-increment id will start from zero for the next partition? But what if the previous partition still exists: will I have two duplicated ids, and how do I know to reference only the latest id when I present a result record?
I want to load raw data using LOAD DATA INTO... instead of multiple inserts; is MySQL partitioning functionality affected?
And the last question: would you suggest some other approach to implement an ageing mechanism? (I am writing a Java product that processes around 1 GB of raw data per day and stores the results in MySQL.)
It's hard to give a real answer on this question since it depends on your data. But let me give you some things to think about.
I assume we're talking about some kind of logs with recent data (so not spanning multiple years). You can partition by range. You could add one field to your table with the year/week number (i.e. 201201, 201202, etc.). If this question is related to your question about importing into multiple tables, you can easily do this in that import script.
On the fly as in, repartition your data on the fly (70GB?). I would not recommend it. But you could do it if you had the weeknumber in there. If you later want to change it to 12 days, you could add a column for the date and partition by that.
Well it won't be deleted automatically but a cron job can handle that right? Just check how many partitions there are, and if there are 3(?) delete the first one.
The partition needs to have a primary index on the field that you partition by (if you want to use auto increment). Therefore you can never fully rely on the auto increment id alone. I don't see a way around this.
I'm not sure what you mean.
If your data is just some logs in chronological order then you might just use separate tables for each period. Then before you start the new period (at 00:00) check the last id of the last table, create a new table and set the auto increment to that value +1. Then your import will decide when a new period will begin, so it can be easily changed. Your import script can use a small table in which it stores the next period.
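The year/week field suggested in the first point can be computed in the import script from the unixtimestamp column. The following is a small sketch in Python (the function names `year_week` and `cutoff_week` are made up, and it uses ISO week numbering in UTC, which is an assumption; a different week convention would give different keys around New Year).

```python
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 3600

def year_week(unix_ts):
    """Return a key like 201201 (ISO year * 100 + ISO week) for a Unix timestamp."""
    iso_year, iso_week, _weekday = datetime.fromtimestamp(
        unix_ts, tz=timezone.utc
    ).isocalendar()
    # The ISO year can differ from the calendar year in the first days of January.
    return iso_year * 100 + iso_week

def cutoff_week(now_ts, weeks=10):
    """Key of the week that lies `weeks` weeks before `now_ts`."""
    return year_week(now_ts - weeks * SECONDS_PER_WEEK)

monday = int(datetime(2012, 1, 2, tzinfo=timezone.utc).timestamp())
sunday_before = int(datetime(2012, 1, 1, tzinfo=timezone.utc).timestamp())
key_monday = year_week(monday)
key_sunday = year_week(sunday_before)
```

A cron job could compare the key of each existing partition or period table with `cutoff_week(now)` and drop the ones that are older; changing the retention window then only means changing the `weeks` argument.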
LOAD DATA is really quite fast. I would just have two steps (in no particular order): LOAD DATA and then 'delete .. where date < 10 weeks'. Autoincrement will go on for as long as the datatype you're using allows. If you wanted to be super careful you could push it back to zero periodically.
Once the data is in the 'raw' table, run your routine to create the 'processed' table. We use a very similar process where I work. We keep a separate table that has 'write' and 'parse' pointers to all of our 'raw' tables. As new data comes in and gets parsed, the appropriate row pointers get set. If the 'raw' table gets truncated you can reset the 'write' pointer but leave the 'parse' pointer. (We store the offset in another table when this happens, just to be sure.)
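The delete step for the two related tables might be sketched as follows, using Python's sqlite3 module in place of MySQL. The table layouts (`raw` with a `ts` column, `result` with a `raw_id` column) are guesses based on the question, and `purge_old` is a hypothetical helper. The retention window is a plain parameter, so it can be changed between runs.

```python
import sqlite3

TEN_WEEKS = 10 * 7 * 24 * 3600  # retention window in seconds

def purge_old(conn, now, max_age=TEN_WEEKS):
    """Remove raw rows older than max_age, together with their result rows."""
    cutoff = now - max_age
    # Delete dependent results first, while the old raw rows still exist.
    conn.execute(
        "DELETE FROM result WHERE raw_id IN (SELECT id FROM raw WHERE ts < ?)",
        (cutoff,),
    )
    conn.execute("DELETE FROM raw WHERE ts < ?", (cutoff,))

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw    (id INTEGER PRIMARY KEY, ts INTEGER NOT NULL, payload TEXT);
    CREATE TABLE result (id INTEGER PRIMARY KEY, raw_id INTEGER NOT NULL, value TEXT);
    """
)
now = 1_400_000_000  # arbitrary "current" Unix time for the demo
conn.executemany(
    "INSERT INTO raw (id, ts, payload) VALUES (?, ?, ?)",
    [(1, now - TEN_WEEKS - 1, "old"), (2, now - 60, "recent")],
)
conn.executemany(
    "INSERT INTO result (raw_id, value) VALUES (?, ?)",
    [(1, "old result"), (2, "recent result")],
)

purge_old(conn, now)
raw_left = conn.execute("SELECT id, payload FROM raw").fetchall()
result_left = conn.execute("SELECT raw_id, value FROM result").fetchall()
```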
I would also recommend creating an index column for each of the related columns. This can improve the performance of deleting old data from multiple related tables, since index numbers are compared rather than strings.
I wonder if your tables are being sorted or not.
I have a case where we are maintaining a table containing resources. This table has a varchar column that contains role ids as comma separated values (I know normalizing SHOULD have been the way to go, but can't change a long running working system). E.g. role_ids column contains '1,4,6,9,10' and another row contains '5,10,15'.
Then, for a user in system, I have the associated role ids as a list, e.g. 4,15. Now I need to find 'any in many', i.e. any resource that may have any of the role ids present in resource.role_ids column.
This question is similar to this one, but the solution I'm after is not a Grails one.
I'm looking for a MySQL solution, either a query or a stored procedure. Finding a set of resources could be achieved using 'FIND_IN_SET()', but I don't want to make multiple calls to the DB, one for each role id in the user's list.
Use a function like this one, to turn your lists into individual records, then join everything up normally.
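Since the linked function is not reproduced here, the idea can be illustrated with a sketch of my own: split each comma-separated list into one row per role id, then filter those rows against the user's roles in a single query. The example uses Python's sqlite3 module and a recursive common table expression; it is not the linked function, the table name `resource` and the helper `resources_for_roles` are assumptions, and the third sample row is made up. A MySQL version would need its own splitting function or procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id INTEGER PRIMARY KEY, role_ids TEXT)")
conn.executemany(
    "INSERT INTO resource (id, role_ids) VALUES (?, ?)",
    [(1, "1,4,6,9,10"), (2, "5,10,15"), (3, "2,3")],
)

def resources_for_roles(conn, user_roles):
    """Return ids of resources whose role_ids list shares any id with user_roles."""
    placeholders = ",".join("?" for _ in user_roles)
    sql = f"""
        WITH RECURSIVE split(id, role, rest) AS (
            -- Start with the whole list, terminated by a comma.
            SELECT id, '', role_ids || ',' FROM resource
            UNION ALL
            -- Peel off the text before the first comma on every step.
            SELECT id,
                   substr(rest, 1, instr(rest, ',') - 1),
                   substr(rest, instr(rest, ',') + 1)
            FROM split
            WHERE rest <> ''
        )
        SELECT DISTINCT id
        FROM split
        WHERE role <> '' AND CAST(role AS INTEGER) IN ({placeholders})
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql, list(user_roles))]

matches = resources_for_roles(conn, [4, 15])
```

Because the roles are compared as whole values after splitting, role 1 does not accidentally match 10 or 15, and the whole lookup is a single round trip to the database.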