Best way to handle duplicated rows - MySQL

I have an insurance companies "dictionary" table in my database, say:
+----+-------------------+----------+
| ID | Name              | Data     |
+----+-------------------+----------+
|  1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
But I'm fetching data from another system, and as a result I get duplicates of insurance companies, but without my data:
+----+-------------------+----------+
| ID | Name              | Data     |
+----+-------------------+----------+
|  1 | InsuranceCompany1 | SomeData |
|  2 | InsuranceCompany1 |          |
+----+-------------------+----------+
Both records are referenced by a variety of models, but they refer to the same data. What I want is to pair these records without changing queries or data in other tables, so that no one knows there are two records; both should resolve to the one instance, which is:
+----+-------------------+----------+
|  1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
My question is: is there a proper way to handle situations like this?
I've come up with a solution: add a parent_id column, manually set parent_id on the duplicated rows, and then override Eloquent methods like find in the model to return the parent whenever parent_id is set.
Copying the SomeData column is not an option, because other logic can contain conditions like insurance_company_id == id.
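Sketched in SQL, the idea would be something like this (insurance_companies is a placeholder for the actual table name):
-- sketch only; insurance_companies stands in for the real dictionary table
ALTER TABLE insurance_companies ADD COLUMN parent_id INT NULL;

-- mark row 2 as a duplicate pointing at row 1
UPDATE insurance_companies SET parent_id = 1 WHERE id = 2;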

You can try creating a view of your dict table, something like this:
CREATE VIEW unique_dict AS
SELECT MIN(ID) AS ID,
       Name,
       GROUP_CONCAT(Data) AS Data
  FROM dict
 GROUP BY Name;
That will give you one row per name.
Then, in your queries requiring one row per name, SELECT from the unique_dict view rather than the dict table.
GROUP_CONCAT() yields a list of values from Data, which helps if more than one duplicated row contains a value: you get them all.
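For example, wherever a query needs one row per company, point it at the view instead of the table; a minimal sketch:
-- read from the view instead of dict
SELECT ID, Name, Data
  FROM unique_dict
 WHERE Name = 'InsuranceCompany1';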
Longer term you might be smart to consider these duplicates to be "dirty data", and clean them up as you INSERT new rows. How to do that?
Create a unique index on Name.
CREATE UNIQUE INDEX unique_name ON dict(Name);
Then, when loading new data into dict, use Eloquent's updateOrCreate() method. Here's something to read about that: Laravel 5.1 Create or Update on Duplicate.
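If you ever load data outside Eloquent, the SQL-level equivalent is INSERT ... ON DUPLICATE KEY UPDATE, which leans on the unique index above; a minimal sketch with placeholder values:
-- sketch only; placeholder values, relies on the unique_name index
INSERT INTO dict (Name, Data)
VALUES ('InsuranceCompany1', 'SomeData')
ON DUPLICATE KEY UPDATE Data = VALUES(Data);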

Related

Delete partial duplicate rows

I have a Dataverse table that has a few columns. One of those columns is an Order Number column. There should only be one row per order number. If there is more than 1, only the first one should be kept. How can I do this in Power Automate?
What I have tried so far: first, I created an array of all the order numbers. From there, I feel stuck. I started to add an Apply to Each action to loop through the table and count how many of each order number there are, but then I confused myself and didn't think that was the right way to go.
Or...is there a way to keep the "duplicate" rows from getting added to the Dataverse table in the first place? The data is getting loaded into the table via a JSON load. Is there a way to delete the "duplicate" items from the JSON?
Here's an example of the situation:
+-------------+-----------+--------------+
| OrderNumber | OrderDate | CustomerName |
+-------------+-----------+--------------+
| 450123      | 2-24-22   | Business A   |
| 450123      | 2-25-22   | Business A   |
| 383238      | 2-24-22   | Business B   |
+-------------+-----------+--------------+

How do I turn a list of interconnected pairs of ids into a cluster of ids?

I have a table with pairs (and sometimes triples) of ids, which act as links in a chain:
+------+-----+
| from | to  |
+------+-----+
| id1  | id2 |
| id2  | id3 |
| id4  | id5 |
+------+-----+
I want to create a new table where all the links are clustered into chains/families:
+-----+----------+
| id  | familyid |
+-----+----------+
| id1 | 1        |
| id2 | 1        |
| id3 | 1        |
| id4 | 2        |
| id5 | 2        |
+-----+----------+
i.e. collect all the links in a chain into a single family, and give it an id.
In the example above, the first two rows of the first table create one family, and the last row creates another family.
Solution
I will use node.js to query big batches of rows (a few thousand per batch), process them, and insert them into my own table with a family id.
The issue
The problem is that I have a few tens of thousands of id pairs, I will need to add new ids over time after the initial creation of the families table, and those new ids will need to join existing families.
Are there good algorithms for clustering pairs of data into families/clusters, keeping my issue in mind?
Not sure if this is an answer so much as some ideas...
I created two tables similar to the ones you have; the first one I populated with the same data as yours.
Table base: fromID, toID
Table chain: fromID, chainID (numeric, null allowed)
I then inserted all unique values from base into chain with a null chainID. The idea is that these are the as-yet-unprocessed rows.
It was then a case of repeatedly running a couple of statements...
update chain c
   set chainID = n
 where chainID is null
   and exists (select 1 from base b where b.fromID = c.fromID)
 order by fromID
 limit 1;
This would allocate the next chain ID to the first row without one (n needs to be generated from somewhere and incremented each time you run this)
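As a sketch, n could come from a MySQL user variable seeded from the current maximum before each pass (an assumption; any external counter works just as well):
-- sketch: seed @n from the highest chainID assigned so far,
-- then use @n in place of n in the update above
set @n := (select coalesce(max(chainID), 0) + 1 from chain);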
Then the one that relates all of the records...
update chain c
  join base b   on b.toID   = c.fromID
  join chain c1 on b.fromID = c1.fromID
   set c.chainID = c1.chainID
 where c.chainID is null
   and c1.chainID is not null;
This is run repeatedly until it affects 0 rows (i.e. there is nothing more to do).
Then run the first update to start the next chain, and so on. Once the first update also affects 0 rows, every row has been assigned to a chain.
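If you'd rather drive that loop inside MySQL than from client code, here is a hedged sketch as a stored procedure, using ROW_COUNT() to stop once nothing changes (it assumes the base and chain tables above):
delimiter //
create procedure propagate_chains()
begin
  declare affected int default 1;
  while affected > 0 do
    -- the relating update from above
    update chain c
      join base b   on b.toID   = c.fromID
      join chain c1 on b.fromID = c1.fromID
       set c.chainID = c1.chainID
     where c.chainID is null
       and c1.chainID is not null;
    set affected = row_count();
  end while;
end //
delimiter ;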
I'd be interested if you try this, to see whether it stands up to more complex scenarios.
This looks a lot like clustering over a graph dataset, where 'familyid' is the cluster number.
Here is a question I think is relevant.
Here is the algorithm description. You will need to implement it under the conditions you described.

How to compare ids of different formats across databases

Problem: I need a way to compare ids of type varchar to ids of type int.
Background: I have a list of ids that almost map to the ids in my table. I have ~10k ids, but I suspect there are only 3-5 variations to clean up.
The tables I'm working with could be simplified as follows: a big table of articles with good ids, and a temp table I've dumped all the dirty ids into. I'd like to update my temp table with the correct id whenever I'm able to make a match.
licensing.articles
+----------+
| id (int) |
+----------+
| 1000     |
| 1001     |
| 1002     |
+----------+
tempDB.ids
+-------------------+----------------+
| id_dirty (string) | id_clean (int) |
+-------------------+----------------+
| 1000Z             |                |
| R1001             |                |
| 1002              |                |
+-------------------+----------------+
So my first query is the simple version: for records that share the same id between the licensing.articles table and the tempDB.ids table, I want to populate tempDB.ids.id_clean with the good id. (In my example, there is one shared id (1002), but in reality there's probably ~3k of them.)
When I try something like this:
UPDATE tempDB.ids AS dirty
  JOIN licensing.articles clean
    ON clean.id = CAST(dirty.id_dirty AS UNSIGNED)
   SET dirty.id_clean = clean.id
 WHERE ISNULL(dirty.id_clean);
I get the error Truncated incorrect INTEGER value: '1000Z'. That makes sense; presumably it is failing to convert '1000Z' to an integer.
So how can I say
FOR ONLY tempDB.ids.id_dirty values that can be successfully cast to an int
SELECT the matching record from licensing.articles
AND copy licensing.articles.id to tempDB.ids.id_clean
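One hedged way to express those three steps in MySQL is to gate the cast behind a regular expression, so only purely numeric strings are compared:
-- sketch: only attempt the match for purely numeric id_dirty values
UPDATE tempDB.ids AS dirty
  JOIN licensing.articles clean
    ON clean.id = CAST(dirty.id_dirty AS UNSIGNED)
   SET dirty.id_clean = clean.id
 WHERE dirty.id_clean IS NULL
   AND dirty.id_dirty REGEXP '^[0-9]+$';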

"You cannot add or change a record because a related record is required", but related record exists?

I have two related tables, results and userID.
results looks like this:
+----+--------+--------+
| ID | userID | result |
+----+--------+--------+
| 1  | abc    | 124    |
| 2  | abc    | 792    |
| 3  | def    | 534    |
+----+--------+--------+
userID looks like this:
+----+--------+---------+
| id | userID | name    |
+----+--------+---------+
| 1  | abc    | Angela  |
| 2  | def    | Gerard  |
| 3  | zxy    | Enrico  |
+----+--------+---------+
In results, the userID field is a lookup field; it stores userID.id but the combo box has userID.userID as its choices.
When I try to enter data into results by setting the userID combo box and entering a value for result, I get this error message:
You cannot add or change a record because a related record
is required in table `userID`.
This is strange, because I'm specifically selecting a value that's provided in the userID combo box.
Oddly, there are about 100 rows of data already in results with the same value for userID.
I thought this might be a database corruption issue, so I created a blank database and imported all the tables into it. But I still got the same error. What's going on here?
Both tables include a text field named LanID. You are using that field in the relationship between the two tables, and the relationship enforces referential integrity.
The problem you're facing is due to the Lookup field properties. This is the Row Source:
SELECT [LanID].ID, [LanID].LanID FROM LanID ORDER BY [LanID];
But the value which gets stored (the Bound Column property) is the first column from that SELECT statement, which is the Long Integer [LanID].ID. So that number will not satisfy the relationship, which requires results.LanID = [LanID].LanID.
You must change the relationship or change the Lookup properties so both reference the same field value.
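For example, one option (a sketch, not the only fix) is to reorder the Row Source so the bound first column is the text LanID the relationship expects:
SELECT [LanID].LanID, [LanID].ID FROM LanID ORDER BY [LanID].LanID;
With Bound Column left at 1, the stored value would then be the text LanID rather than the numeric ID.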
But if it were me, I would just eliminate the Lookup on the grounds that simple operations (such as this) become unnecessarily confusing when Lookup fields are involved. Make results.LanID a plain numeric or text field. If you want some kind of user-friendly drop-down for data entry, build a form with a combo or list box.
For additional arguments against Lookup fields, see The Evils of Lookup Fields in Tables.
If you are using a parameter query, make sure the parameters are in the same order as the table you are modifying and the query you have created; one parameter may be inserting the conflicting data. Parameters are applied in the order they are created, not by parameter name. I had the same problem, and all I had to do was switch the order they were in so they matched the query. This is an old thread, so I hope this helps someone who is just now having this problem.

Keeping page changes history. A bit like SO does for revisions

I have a CMS system that stores data across tables like this:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
Entries META Table
+----+----------+-------+-------+
| id | entry_id | value | param |
+----+----------+-------+-------+
Files Table
+----+----------+----------+
| id | entry_id | filename |
+----+----------+----------+
Entries-to-Tags Table
+----+----------+--------+
| id | entry_id | tag_id |
+----+----------+--------+
Tags Table
+----+-----+
| id | tag |
+----+-----+
I am trying to implement a revision system, a bit like SO has. If I were just doing it for the Entries table, I was planning to simply keep a copy of all changes to that table in a separate table. As I have to do it for at least 4 tables (the Tags table doesn't need revisions), this doesn't seem at all like an elegant solution.
How would you guys do it?
Please note that the meta table is modeled as EAV (entity-attribute-value).
Thank you in advance.
I am currently working on a solution to a similar problem. I am solving it by splitting my tables in two: a control table and a data table. The control table contains a primary key and a reference into the data table; the data table contains an auto-increment revision key and the control table's primary key as a foreign key.
Taking your entries table as an example:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
becomes
entries              entries_data
+----+----------+    +----------+----+-------+------+--------+--------+
| id | revision |    | revision | id | title | text | index1 | index2 |
+----+----------+    +----------+----+-------+------+--------+--------+
To query:
select * from entries join entries_data on entries.revision = entries_data.revision;
Instead of updating the entries_data table, you use an insert statement, and then update the entries table's revision to point at the newly inserted revision.
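As a hedged sketch of that write path (assuming revision is an AUTO_INCREMENT column on entries_data and an existing entry with id 42):
-- sketch: insert a new revision row, then repoint the control row
insert into entries_data (id, title, text, index1, index2)
values (42, 'new title', 'new text', 0, 0);

update entries
   set revision = last_insert_id()
 where id = 42;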
The advantage of this system is that you can move to different revisions simply by changing the revision property within the entries table. The disadvantage is that you need to update your queries. I am currently integrating this into an ORM layer so the developers don't have to worry about writing SQL anyway. Another idea I am toying with is to have a centralised revision table which all the data tables use. This would allow you to describe the state of the whole database with a single revision number, similar to how Subversion revision numbers work.
Have a look at this question: How to version control a record in a database
Why not have a separate history_table for each table (as per the accepted answer on the linked question)? It simply has a compound primary key of the original table's PK and the revision number. You will still need to store the data somewhere, after all.
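For the entries table, such a history_table might look like this (a sketch; the column types are assumptions):
-- sketch: per-table history keyed by (original PK, revision)
CREATE TABLE entries_history (
  id       INT NOT NULL,
  revision INT NOT NULL,
  title    VARCHAR(255),
  text     TEXT,
  index1   INT,
  index2   INT,
  PRIMARY KEY (id, revision)
);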
For one of our projects we went the following way:
Entries Table
+----+-----------+---------+
| id | date_from | date_to |
+----+-----------+---------+
EntryProperties Table
+----------+-----------+-------+------+--------+--------+
| entry_id | date_from | title | text | index1 | index2 |
+----------+-----------+-------+------+--------+--------+
It's fairly complicated, but it still allows you to keep track of the object's full lifecycle. So to query active entities we went for:
SELECT entry_id, title, text, index1, index2
  FROM Entries
 INNER JOIN EntryProperties
    ON Entries.id = EntryProperties.entry_id
   AND Entries.date_to IS NULL
   AND EntryProperties.date_to IS NULL;
The only concern was the situation where an entity is removed (so we put a date_to there) and then restored by an admin. With the given scheme there's no way to track that kind of trick.
The overall downside of any attempt like this is obvious: you have to write tons of SQL where non-versioned DBs would go for something like select A join B.