Best way to check for duplicated records - mysql

I have two tables A and B with a relationship of One-to-many from A to B.
A has 5 columns:
a1, a2, a3, a4, a5
and B has 5 columns
b1, b2, b3, b4, a1.
Note a1 is foreign key in table B.
I have a requirement to check for duplicate records in the table, i.e. no two records should have exactly the same values for all of the attributes.
The most efficient way I can think of for determining uniqueness is to create a checksum-like value and keep it in every row of table A. But this requires extra space, plus I will have to make sure that the checksum is really unique.
Is this the best way to go ahead or is there some other way I am unaware of?
For example: let's say table A is a Rules table and table B is a Triggers table. The Rules table holds rules created by different users (which means there is also a mapping to a Users table in the Rules table). What I actually want is that a user should not be able to create identical rules. So when a user saves a rule, I run a query to check whether a record with an identical checksum already exists for this particular user; if yes, I return the appropriate error, otherwise I let the user create the record. I hope this clarifies why I can't put a unique constraint on all records.

Do a SELECT with a GROUP BY clause. For example:
SELECT a1, a2, a3, a4, a5, COUNT(*) FROM A GROUP BY a1, a2, a3, a4, a5 HAVING COUNT(*) > 1;
This will return a result set with a1, a2, a3, a4, a5 and a count of how many times that combination of values appears.
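For the per-user rule scenario described in the question, a minimal sketch of the same idea (assuming the Rules table is called rules, has a user_id column, and a rule is defined by columns r1, r2, r3; these names are illustrative, not from the question):
-- Duplicates per user:
SELECT user_id, r1, r2, r3, COUNT(*)
FROM rules
GROUP BY user_id, r1, r2, r3
HAVING COUNT(*) > 1;
-- Or, before saving, check whether this user already has an identical rule:
SELECT COUNT(*) FROM rules
WHERE user_id = ? AND r1 = ? AND r2 = ? AND r3 = ?;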

Having a UNIQUE constraint on those columns seems like the way to go.
However, just for the sake of answering your other remarks: I've worked with extra columns to check for changes in the past before. Back then I did something similar to this:
CONVERT([NVARCHAR](42), HASHBYTES('SHA1', CONCAT(Column1, '||', Column2, ...)), 1)
I found it to be a rather nice way to concat many columns into a single hash, unique depending on its contents and without it blowing out of proportion. (I used this in a data warehousing environment, to check large tables for record-level changes based on a business key. I stored this as a PERSISTED computed column so an index could be built on it too.)
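Since the question is about MySQL, a rough equivalent of that idea there would be a generated column holding a hash of the concatenated values, plus a unique index over it per user. This is only a sketch assuming MySQL 5.7+; the user_id column and the types are assumptions, not from the question:
-- Sketch only; column names and types are assumed.
CREATE TABLE rules (
  a1       INT PRIMARY KEY,
  user_id  INT NOT NULL,
  a2       VARCHAR(100),
  a3       VARCHAR(100),
  a4       VARCHAR(100),
  a5       VARCHAR(100),
  -- SHA1 over the concatenated values, with '||' as a separator
  row_hash CHAR(40) AS (SHA1(CONCAT_WS('||', a2, a3, a4, a5))) STORED,
  UNIQUE KEY uq_user_rule (user_id, row_hash)
);
With the UNIQUE KEY in place, the database rejects an identical rule for the same user, so the application only has to handle the duplicate-key error.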

Related

SSIS Lookup Transformation No Match Output Only Populates Null

I am trying to use the Lookup Transformation but cannot seem to get the functionality out of it that I need. I have two tables that have the exact same structure:
Temp Table (input): Smaller table but may have entries that do not exist in other table
Reference Lookup Table: Larger table that may not have identical entries to Temp Table.
I am trying to compare the entries of the Temp Table to the entries of the Reference Lookup Table. Anything that exists in the Temp Table, but not the Lookup should be output to a separate table (No match output).
It is a very simple Data Flow, but it does not seem to accomplish the lookup properly. It will find "No Match" rows, but the "no match" table is populated with null values for every column. I am trying to figure out why the data is losing its values?
How the Lookup is setup:
The data in temp table is what drives your data flow. 151 rows flowed out of it.
Your lookup is going to match based on whatever criteria you specify, and you've specified that if there is no match, you want to push the no-match data into a table.
Since the lookup task cannot add columns to the no-match output path, this would imply your source (temp table) started NULL across the board.
Drop a data viewer/data tap onto the data flow between the lookup and the destination and then compare that data to your source. I suspect you're going to discover that the process that populated Temp table is at fault.
In the Lookup Transformation, in the columns tab you have identified that you want to use the value from the reference table to replace the value from the source.
Which works great until you get a no-match. In that case, the component does the non-intuitive (even to me, with 15+ years of working with it) thing of updating that column whether it matches or not.
Source query
SELECT 21 AS tipID, NULL AS tipYear
UNION ALL SELECT 22, 2020
UNION ALL SELECT 64263810, 2020
This adds three rows to my data flow, the first with no tipYear and the next two rows with a year of 2020. Stamp of 1 in the below image
Lookup query
SELECT *
FROM
(
    VALUES (20, 1111), (21, 2021), (22, 2022)
) D (tipID, tipYear)
This reference data will supply a year for all the matches (21 and 22). In the matched path, we'll see 21 supplied with a value and 22 will have its year updated. Stamp 2 in the image
For id 64263810 however, no match will be found and we'll see the initial value of 2020 replaced with the matching row aka NULL. Stamp 3
Lesson learned: if you need to use the data from the reference table but also have a no-match output path, do not replace the column in the Lookup Transformation (unless your intention is to wipe out data).

MySql Indexing Strategy With Multiple Shared Columns

We have a database table which stores browser data for visitors, broken down by multiple different subtypes. For simplicity, let's use the table schema below. The querying will basically be on any single id column, the metric column, the timestamp column (stored as seconds since epoch), and one of the device, browser, or os columns.
We are going to performance test the star vs snowflake schema (where all of the ids go into a single column, but then an additional column id_type is added to determine which type of identifier it is) for this table, but as long as the star schema (which is how it is now) is within 80% of the snowflake performance, we are going to keep it since it will make our load process much easier. Before I do that however, I want to make sure the indexes are optimized on the star schema.
create table browser_data (
id_1 int,
id_2 int,
id_3 int,
id_4 int,
metric varchar(20),
browser varchar(20),
device varchar(20),
os varchar(20),
timestamp bigint
)
Would it be better to create individual indexes on just the id columns, or to include the metric and timestamp columns in those indexes as well?
Do not normalize "continuous" values, such as DATETIME, FLOAT, INT. Do leave the values in the main table.
When you move the value to other table(s), especially "snowflake", it makes querying based on the values somewhere between a little slower and a lot slower. This especially happens when you need to filter on more than one metric that is not in the main table. Either of these perform very poorly because of "snowflake" or "over-normalization":
WHERE a.x = 123 AND b.y = 345
ORDER BY a.x, b.y
As for what indexes to create -- that depends entirely on the queries you need to perform. So, I strongly recommend you sketch out the likely SELECTs based on your tentative CREATE TABLEs.
INT is 4 bytes. TIMESTAMP is 5, FLOAT is 4, etc. That is, normalizing such things is also inefficient on space.
More
When doing JOINs, the Optimizer will almost always start with one table, then move on to another table, etc. (See "Nested Loop Join".)
For example (building on the above 'code'), when 2 columns are normalized, and you are testing on the values, you do not have two ids in hand, you only have the two values. This makes the query execution very inefficient. For
SELECT ...
FROM main
JOIN a USING(a_id)
JOIN b USING(b_id)
WHERE a.x = 123 AND b.y = 345
The following is very likely to be the 'execution plan':
Reach into a to find the row(s) with x=123; get the id(s) for those rows. This may include many rows that are yet to be filtered by b.y. a needs INDEX(x)
Go back to the main table, looking up rows with those id(s). main needs INDEX(a_id). Again, more rows than necessary may be hauled around.
Only now, do you get to b (using b_id) to check for y=345; toss the unnecessary rows you have been hauling around. b needs INDEX(b_id)
Note my comment about "haul around". Blindly using * (in SELECT *) adds to the problem -- all the columns are being hauled around while performing the steps.
On the other hand... If x and y were in the main table, then the code works like:
WHERE main.x = 123
AND main.y = 345
only needs INDEX(x,y) (in either order). And it quickly locates exactly the rows desired.
In the case of ORDER BY a.x, b.y, it cannot use any index on any table. So the query must create a tmp table, sort it, then deliver the rows in the desired order.
But if x and y are in the same table, then INDEX(x,y) (in that order) may be useful for ORDER BY x,y and avoid the tmp table and the sort.
With a single table, the Optimizer might use an index for WHERE, or it might use an index for ORDER BY, depending on the phase of the moon. In some cases, one index can be used for both -- this is optimal.
Another note: If you also have LIMIT 10,... If the sort is avoided, then only 10 rows need to be looked at, not the entire set from the WHERE.
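To make that concrete for the browser_data table above, here is a sketch of possible indexes, assuming the typical query filters on a single id column plus metric and a timestamp range (that query shape is an assumption, since the actual SELECTs aren't shown in the question):
-- Assumed query shape:
--   SELECT ... FROM browser_data
--   WHERE id_1 = ? AND metric = ? AND timestamp BETWEEN ? AND ?
-- One composite index per id column; equality columns first, the range column (timestamp) last:
ALTER TABLE browser_data
  ADD INDEX idx_id1_metric_ts (id_1, metric, timestamp),
  ADD INDEX idx_id2_metric_ts (id_2, metric, timestamp),
  ADD INDEX idx_id3_metric_ts (id_3, metric, timestamp),
  ADD INDEX idx_id4_metric_ts (id_4, metric, timestamp);
If some queries don't filter on metric, drop it from those indexes; the general rule of thumb stands: equality columns first, the range column last.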

Following which database design have less cost for SQL query?

Suppose I have a website, and in its database there is one table,
table_1, with attributes (a1 (primary key), a2, a3, a4, a5, a6, a7). For most of the transactions on my website I only use attributes (a1, a2, a3); a4, a5, a6 and a7 are rarely used. I want to know which of the following is the better design approach for accessing the data:
A) Keep the table as it is and use the query: select a1, a2, a3 from table_1;
B) Create 2 separate tables, table_1(a1, a2, a3) and table_2(a1, a4, a5, a6, a7).
Which approach has the lower cost or load on the database?
For read queries over (a1, a2, a3), obviously "B" is (not noticeably) cheaper.
But everything else is worse, except if (a4, a5, a6, a7) are null in most cases and you use a (1 -> 0,1) cardinality between both tables (that is: for each a1 in table_1 there are 0 or 1 tuples with the same value of a1 in table_2 and, of course, every value of a1 in table_2 exists in table_1).
Anyway, as I said, any possible advantage will be minimal compared to the complexity, maintainability issues, and also efficiency reduction (for inserts and when you need data from both tables).
So, if I were you, I would choose layout "A" without any doubt.
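For reference, a sketch of what option B (the vertical split with the 1 -> 0,1 relationship mentioned above) would look like; column types are assumptions:
-- Sketch only; types are assumed.
CREATE TABLE table_1 (
  a1 INT PRIMARY KEY,
  a2 VARCHAR(100),
  a3 VARCHAR(100)
);
CREATE TABLE table_2 (
  a1 INT PRIMARY KEY,   -- same key, so at most one row per table_1 row
  a4 VARCHAR(100),
  a5 VARCHAR(100),
  a6 VARCHAR(100),
  a7 VARCHAR(100),
  FOREIGN KEY (a1) REFERENCES table_1 (a1)
);
-- The common case stays cheap:
SELECT a1, a2, a3 FROM table_1;
-- The rare case needs a join:
SELECT t1.a1, t1.a2, t1.a3, t2.a4, t2.a5, t2.a6, t2.a7
FROM table_1 t1
LEFT JOIN table_2 t2 ON t2.a1 = t1.a1;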
B provides a lower cost than A, because with option A you waste space on a4, a5, a6, a7. But with option B you must create a foreign key (a1) to connect to table_1, and then your SQL query becomes cheaper.

Index counter shared by multiple tables in mysql

I have two tables, each one has a primary ID column as key. I want the two tables to share one increasing key counter.
For example, when the two tables are empty, and counter = 1. When record A is about to be inserted to table 1, its ID will be 1 and the counter will be increased to 2. When record B is about to be inserted to table 2, its ID will be 2 and the counter will be increased to 3. When record C is about to be inserted to table 1 again, its ID will be 3 and so on.
I am using PHP as the outside language. Now I have two options:
Keep the counter in the database as a single-row-single-column table. But every time I add things to table A or B, I need to update this counter table.
I can keep the counter as a global variable in PHP. But then I need to initialize the counter from the maximum key of the two tables at the start of apache, which I have no idea how to do.
Any suggestion for this?
The background is, I want to display a mix of records from the two tables in either ASC or DESC order of the creation time of the records. Furthermore, the records will be displayed in page-style, say, 50 records per page. Records are only added to the database rather than being removed. Following my above implementation, I can just perform a "select ... where key between 1 and 50" from two tables and merge the select datasets together, sort the 50 records according to IDs and display them.
Is there any other idea of implementing this requirement?
Thank you very much
Well, you will gain next to nothing with this setup; if you just keep the datetime of the insert you can easily do
SELECT * FROM
(
    SELECT columnA, columnB, inserttime
    FROM table1
    UNION ALL
    SELECT columnA, columnB, inserttime
    FROM table2
) AS combined
ORDER BY inserttime
LIMIT 0, 50
And it will perform decently.
Alternatively (if chasing the last drop of performance): the fact that you are merging the results can be an indicator that the tables themselves should be merged (why have two tables anyway if you are always merging the results?).
Or model it as a supertype/subtype ("subclassing" in SQL): one parent table maintains the IDs and other common attributes, and the other two tables reference that common ID sequence as a foreign key.
If you need the creation time, won't it be easier to add a timestamp field to your db and sort according to that field?
I believe using IDs as a reference for creation order is bad practice.
If you really must do this, there is a way. Create a one-row, one-column table to hold the last-used row number, and set it to zero. On each of your two data tables, create a BEFORE INSERT trigger that reads that table, increments it, and sets the newly inserted row's ID to that value. I can't remember the exact syntax because I haven't created a trigger for years; see http://dev.mysql.com/doc/refman/5.0/en/triggers.html
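A rough sketch of that idea in MySQL (the table and column names are illustrative; the LAST_INSERT_ID(expr) trick keeps the new value per connection, but you'd still want to think about contention on the counter row under heavy concurrent inserts):
-- One-row counter table holding the last-used ID.
CREATE TABLE id_counter (last_id INT NOT NULL);
INSERT INTO id_counter VALUES (0);

DELIMITER //
CREATE TRIGGER table1_before_insert
BEFORE INSERT ON table1
FOR EACH ROW
BEGIN
  -- Increment the shared counter and remember the new value for this connection.
  UPDATE id_counter SET last_id = LAST_INSERT_ID(last_id + 1);
  SET NEW.id = LAST_INSERT_ID();
END//
DELIMITER ;
-- Create the same trigger on table2 so both tables draw from the one counter.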

Deleting Duplicates in Access 2003

I have an Access 2003 table with ~4000 records which was made from 17 different tables. Roughly half of these records are duplicates. There is no unique identifying column (id, name etc). There is an id column which was auto filled when the tables were combined meaning that the duplicates aren't completely identical (though this column could be removed if it makes things easier).
I have used the Access Find Duplicates Query Wizard, which gives me a list of the duplicated records but won't let me delete them (seriously, what use is this query if I can't delete them?). I've tried converting the generated query to a delete query, but that changes the number of rows it finds. I'd alter the SQL by hand, but it's a bit beyond me and it's 7 lines long.
Does anyone know a good way of getting rid of the duplicates?
The reason the find duplicates query won't let you delete the records is because it is basically just an aggregate query; it counts the number of duplicates it finds and returns the cases where the count is greater than 1.
Consider that if you did make a delete query based on the find duplicates, it would delete all rows that have duplicate values, which is maybe not what you want. You want to delete all but one of the duplicates.
You should try to delete all duplicates of a record apart from one, excluding the ID column from the comparison. I suggest the simplest way to do this is a make-table query over all the unique values (SELECT DISTINCT Field1, Field2, ... FROM MyTable) for every field except the ID field, using the results to create a new table of around 2000 records (if half are duplicates).
Then, create an ID column on your new table, use an update query to update this ID to the first matching ID in the original table (you could do this using DLookup, which will return the first EXPRESSION value where CRITERIA is true in DOMAIN).
The DLookup() function returns one value from a single field even if more than one record satisfies the criteria. If no record satisfies the criteria, or if the domain contains no records, DLookup() returns a Null.
Since you are identifying the first matching ID based on all the other fields, which form unique combinations, any IDs left unmatched belong to duplicates. You are effectively reversing the PK relation: identifying the first matching key for a given set of unique field values. After that, set the ID to be the PK. Of course this assumes the ID has no inherent meaning, and that you don't care about keeping one particular ID for a given duplicated row over the IDs belonging to the other duplicated rows. It also assumes you care about the data in the ID column and want to preserve it for the remaining rows; otherwise just skip the DLookup step and do a SELECT DISTINCT on all columns apart from the ID.
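As a sketch of those two steps in Access SQL (the table and field names are illustrative, not from the question, and the criteria string is shown for text fields):
-- Step 1: make-table query over every field except the ID.
SELECT DISTINCT Field1, Field2, Field3 INTO NewTable
FROM OriginalTable;

-- Step 2: after adding an ID column to NewTable, pull the first matching
-- ID from the original table with DLookup.
UPDATE NewTable
SET ID = DLookup("ID", "OriginalTable",
    "Field1='" & [Field1] & "' AND Field2='" & [Field2] & "' AND Field3='" & [Field3] & "'");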
Use a select with all columns except the ID column:
SELECT DISTINCTROW Column1, Column2, Column3
INTO MYNEWTABLE
FROM TABLE
You can simply swap the names.
This solution will give you a new table without duplicates.
The following will preserve original IDs and do it in one step:
DELETE FROM table_with_duplicates
WHERE table_with_duplicates.id NOT IN
(SELECT max(id)
FROM table_with_duplicates
GROUP BY duplicated_field_1, duplicated_field_2, ...
)
Now you have the original table with no duplicates and preserved ids.
And always remember to back up your data before trying large DELETEs.
DELETE * FROM table_with_duplicates
WHERE table_with_duplicates.ID In
(SELECT max(ID)
FROM table_with_duplicates
GROUP BY [duplicated_field_1]
HAVING Count(*)>1
)
Actually, I found a very simple solution (it took me 5 hours to come up with, so I don't know why no one has mentioned it). If all of your fields are the same, i.e. the records are complete duplicates, just make one query with every field and GROUP BY all of them. The duplicates will combine, and you can append that result to a new table and rename it the same as the existing table. If you have a primary key field, you can just leave it out of the query and the data will still combine (assuming you don't care about the data in the primary key field). :)