How to compare ids of different formats across databases - mysql

Problem: I need a way to compare ids of type varchar to ids of type int.
Background: I have a list of ids that almost map to the ids in my table. I have ~10k ids, but I suspect there are only 3-5 variations to clean up.
The tables I'm working with can be simplified as follows: a big table of articles with good ids, and a temp table I've dumped all the dirty ids into. I'd like to update my temp table with the correct id whenever I'm able to make a match.
licensing.articles
+----------+
| id (int) |
+----------+
| 1000 |
| 1001 |
| 1002 |
+----------+
tempDB.ids
+-------------------+----------------+
| id_dirty (string) | id_clean (int) |
+-------------------+----------------+
| 1000Z | |
| R1001 | |
| 1002 | |
+-------------------+----------------+
So my first query is the simple version: for records that share the same id between the licensing.articles table and the tempDB.ids table, I want to populate tempDB.ids.id_clean with the good id. (In my example, there is one shared id (1002), but in reality there's probably ~3k of them.)
When I try something like this:
UPDATE tempDB.ids AS dirty
JOIN licensing.articles clean ON clean.id = CAST(dirty.id_dirty as unsigned)
SET dirty.id_clean = clean.id
WHERE isnull(dirty.id_clean);
I get an error message Error : Truncated incorrect INTEGER value: '1000Z'. That makes sense; presumably it is failing to convert '1000Z' to an integer.
So how can I say
FOR ONLY tempDB.ids.id_dirty values that can be successfully cast to an int
SELECT the matching record from licensing.articles
AND copy licensing.articles.id to tempDB.ids.id_clean
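One way to sidestep the failing cast, sketched here under assumptions rather than taken from the thread: compare as strings by casting the clean integer id to CHAR, so the dirty values are never converted to numbers. The second statement additionally assumes MySQL 8.0+ (where REGEXP_REPLACE is available) and that the remaining variations only add non-digit characters around the real id; that assumption should be checked against the data before running it.

```sql
-- Pass 1: rows whose dirty id is already exactly the clean id.
-- Casting the integer side to CHAR avoids converting '1000Z' to a number.
UPDATE tempDB.ids AS dirty
JOIN licensing.articles AS clean
  ON dirty.id_dirty = CAST(clean.id AS CHAR)
SET dirty.id_clean = clean.id
WHERE dirty.id_clean IS NULL;

-- Pass 2 (assumption: variations are stray non-digit characters,
-- e.g. '1000Z' or 'R1001'). Strip everything except digits, then
-- compare as strings again.
UPDATE tempDB.ids AS dirty
JOIN licensing.articles AS clean
  ON REGEXP_REPLACE(dirty.id_dirty, '[^0-9]', '') = CAST(clean.id AS CHAR)
SET dirty.id_clean = clean.id
WHERE dirty.id_clean IS NULL;
```

Neither join can use an index on the id columns, which is probably acceptable for ~10k rows. Values with leading zeros (for example '01002') would not match in either pass and would need their own rule.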

Related

Best way to handle duplicated rows

I have insurance companies "dictionary" in my database, let's say:
+----+-------------------+----------+
| ID | Name | Data |
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
But I'm fetching data from another system, and as a result I get duplicates of insurance companies, but without my data:
+----+-------------------+----------+
| ID | Name | Data |
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
| 2 | InsuranceCompany1 | |
+----+-------------------+----------+
Both records are referenced from a variety of models, but they represent the same data. What I want is to pair these records without changing queries or data in other tables, so that no one knows there are two records and both resolve to one instance, which is
+----+-------------------+----------+
| 1 | InsuranceCompany1 | SomeData |
+----+-------------------+----------+
My question is: Is there some proper way to handle situations like this?
I've come up with a solution: add a parent_id column, manually set parent_id in the duplicated rows, and then override Eloquent methods like find in the model to return the parent if parent_id is set.
Copying the SomeData column is not an option, because there can be conditions such as insurance_company_id == id.
You can try creating a view of your dict table something like this:
CREATE VIEW unique_dict AS
SELECT MIN(ID) ID,
Name,
GROUP_CONCAT(Data) Data
FROM dict
GROUP BY Name
That will give you one row per name.
Then, in your queries requiring one row per name, SELECT from the unique_dict view rather than the dict table.
GROUP_CONCAT() yields a list of values from Data, which helps if more than one duplicated row contains a value: you get them all.
Longer term you might be smart to consider these duplicates to be "dirty data", and clean them up as you INSERT new rows. How to do that?
Create a unique index on Name.
CREATE UNIQUE INDEX unique_name ON dict(Name);
Then, when loading new data into dict use Eloquent's updateOrCreate() function. Here's something to read about that. Laravel 5.1 Create or Update on Duplicate
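At the SQL level, a roughly equivalent upsert can be sketched with INSERT ... ON DUPLICATE KEY UPDATE. This is an illustration, not part of the original answer: it assumes the unique index on dict(Name) already exists, that ID is auto-generated, and that a row arriving without data carries NULL in Data.

```sql
-- Relies on the unique index on dict(Name); without it, this would
-- simply insert another duplicate row.
INSERT INTO dict (Name, Data)
VALUES ('InsuranceCompany1', NULL)
ON DUPLICATE KEY UPDATE
  -- Keep the existing Data when the incoming row has none.
  Data = COALESCE(VALUES(Data), Data);
```

Note that creating the unique index fails while duplicates are still present, so the existing rows have to be merged first. The VALUES() function in this position is deprecated in recent MySQL 8.0 releases in favour of row aliases, but it is still accepted.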

How to update a column with specific data for each row? [duplicate]

I'm trying to update one MySQL table based on information from another.
My original table looks like:
id | value
------------
1 | hello
2 | fortune
3 | my
4 | old
5 | friend
And the tobeupdated table looks like:
uniqueid | id | value
---------------------
1 | | something
2 | | anything
3 | | old
4 | | friend
5 | | fortune
I want to update id in tobeupdated with the id from original based on value (strings stored in VARCHAR(32) field).
The updated table will hopefully look like:
uniqueid | id | value
---------------------
1 | | something
2 | | anything
3 | 4 | old
4 | 5 | friend
5 | 2 | fortune
I have a query that works, but it's very slow:
UPDATE tobeupdated, original
SET tobeupdated.id = original.id
WHERE tobeupdated.value = original.value
This maxes out my CPU and eventually leads to a timeout with only a fraction of the updates performed (there are several thousand values to match). I know matching by value will be slow, but this is the only data I have to match them together.
Is there a better way to update values like this? I could create a third table for the merged results, if that would be faster?
I tried MySQL - How can I update a table with values from another table?, but it didn't really help. Any ideas?
UPDATE tobeupdated
INNER JOIN original ON (tobeupdated.value = original.value)
SET tobeupdated.id = original.id
That should do it, and really it's doing exactly what yours is. However, I prefer JOIN syntax for joins rather than multiple WHERE conditions; I think it's easier to read.
As for running slow, how large are the tables? You should have indexes on tobeupdated.value and original.value
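For reference, the indexes could be created along these lines. The index names are made up for illustration; EXPLAIN on an UPDATE is available from MySQL 5.6 onward and is one way to check whether the join actually uses them.

```sql
-- Index names are hypothetical; only the indexed columns matter.
CREATE INDEX idx_original_value ON original (value);
CREATE INDEX idx_tobeupdated_value ON tobeupdated (value);

-- Inspect the plan of the update without running it.
EXPLAIN
UPDATE tobeupdated
INNER JOIN original ON tobeupdated.value = original.value
SET tobeupdated.id = original.id;
```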
EDIT:
We can also simplify the query:
UPDATE tobeupdated
INNER JOIN original USING (value)
SET tobeupdated.id = original.id
USING is shorthand for an equi-join when both tables of the join have an identically named column (here, value): http://en.wikipedia.org/wiki/Join_(SQL)#Equi-join
It depends on how those tables are used, but you might consider putting a trigger on the original table for insert and update. Whenever a row is inserted or updated, the trigger updates the second table based on just that one row from the original table, which will be quicker.
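A minimal sketch of that trigger idea, assuming the table and column names from the question; the trigger names are invented for the example. Each trigger body is a single statement, so no DELIMITER change is needed.

```sql
-- Keep tobeupdated.id in sync when a row is added to original.
CREATE TRIGGER original_after_insert
AFTER INSERT ON original
FOR EACH ROW
  UPDATE tobeupdated
  SET id = NEW.id
  WHERE value = NEW.value;

-- Same logic when an existing row in original changes.
CREATE TRIGGER original_after_update
AFTER UPDATE ON original
FOR EACH ROW
  UPDATE tobeupdated
  SET id = NEW.id
  WHERE value = NEW.value;
```

The triggers only cover future changes; rows that already exist still need the one-off UPDATE ... JOIN.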

SQL statement to return elements from a column only if no elements from a different column match

Sorry for the confusing question, I will try to clarify.
I have an SQL database ( that I did not create ) that I would like to write a query for. I know very little about SQL, so it is hard for me to even know what to search for to see if this question has already been asked, so sorry if it has. It should be an easy solution for those in the know.
The query I need is for a search I would like to perform on an existing data management system. I want to return all the documents that a given user has NOT signed off on, as indicated by rows in a signoffs_table. The data is stored roughly as follows (this is a simplification of the actual schema and hides several LEFT JOINs and columns):
signoffs_table:
| id | user_id | document_id | signers_list |
The naive solution I had was to do something like the following:
SELECT document_id from signoffs_table WHERE (user_id <> $BobsID) AND signers_list LIKE "%Bob%";
This works if ONLY Bob signs the document. The problem is that if Bob and Mary have signed the document then the table looks like this:
signoffs_table:
-----------------------------------------------
| id | user_id | document_id | signers_list |
-----------------------------------------------
| 1 | 10 | 100 | "Bob,Mary,Jim" |
| 2 | 20 | 100 | "Bob,Mary,Jim" |
-----------------------------------------------
(Assume Bob's ID = 10 and Mary's ID = 20.)
When I run the query, I get back document_id 100 (from row #2), because that row looks like one that Bob should have signed but did not.
Is what I am trying to do possible with the given database structure? I can provide more details if needed; I am not sure how much detail is needed.
I guess this query is what you mean:
SELECT document_id FROM signoffs_table AS t1
WHERE signers_list LIKE "%Bob%"
AND NOT EXISTS (
SELECT 1 FROM signoffs_table AS t2
WHERE (t2.user_id = $BobsID) AND t2.document_id = t1.document_id )
I believe your design is incorrect. You have a many-to-many relationship between documents and signers. You should have a junction table, something like:
ID DocumentID SignerID
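To make the junction-table suggestion concrete, here is a sketch. The table name, the snake_case column names, and the signed_at column are assumptions added for illustration; the answer itself only names ID, DocumentID and SignerID.

```sql
-- One row per (document, required signer) pair.
CREATE TABLE document_signers (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  document_id INT NOT NULL,
  signer_id   INT NOT NULL,
  signed_at   DATETIME NULL,  -- assumed column: NULL until the signer signs
  UNIQUE KEY uq_document_signer (document_id, signer_id)
);

-- Documents that user 10 (Bob in the example) still has to sign.
SELECT document_id
FROM document_signers
WHERE signer_id = 10
  AND signed_at IS NULL;
```

With this layout the LIKE "%Bob%" match on a comma-separated list is no longer needed, and the lookup can use the unique key.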

Keeping page changes history. A bit like SO does for revisions

I have a CMS system that stores data across tables like this:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
Entries META Table
+----+----------+-------+-------+
| id | entry_id | value | param |
+----+----------+-------+-------+
Files Table
+----+----------+----------+
| id | entry_id | filename |
+----+----------+----------+
Entries-to-Tags Table
+----+----------+--------+
| id | entry_id | tag_id |
+----+----------+--------+
Tags Table
+----+-----+
| id | tag |
+----+-----+
I am trying to implement a revision system, a bit like SO has. If I were doing it just for the Entries Table, I would keep a copy of all changes to that table in a separate table. As I have to do it for at least 4 tables (the Tags table doesn't need revisions), this doesn't seem like an elegant solution at all.
How would you guys do it?
Please notice that the Meta Tables are modeled in EAV (entity-attribute-value).
Thank you in advance.
Hi, I am currently working on a solution to a similar problem. I am solving it by splitting each table into two: a control table and a data table. The control table contains a primary key and a reference into the data table; the data table contains an auto-increment revision key and the control table's primary key as a foreign key.
Taking your entries table as an example:
Entries Table
+----+-------+------+--------+--------+
| id | title | text | index1 | index2 |
+----+-------+------+--------+--------+
becomes
entries entries_data
+----+----------+ +----------+----+--------+------+--------+--------+
| id | revision | | revision | id | title | text | index1 | index2 |
+----+----------+ +----------+----+--------+------+--------+--------+
to query
select * from entries join entries_data on entries.revision = entries_data.revision;
Instead of updating the entries_data table, you insert a new row into it and then update the entries table's revision with the new revision from the entries_data table.
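A sketch of that write path, assuming entries_data.revision is an AUTO_INCREMENT column as described above; the literal values are placeholders. LAST_INSERT_ID() returns the revision generated by the preceding insert on the same connection.

```sql
START TRANSACTION;

-- Store the new version of entry 1 as a fresh row.
INSERT INTO entries_data (id, title, `text`, index1, index2)
VALUES (1, 'New title', 'New text', NULL, NULL);

-- Point the control row at the revision that was just created.
UPDATE entries
SET revision = LAST_INSERT_ID()
WHERE id = 1;

COMMIT;
```

Rolling back to an older version is then a single UPDATE of entries.revision.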
The advantage of this system is that you can move to a different revision simply by changing the revision column in the entries table. The disadvantage is that you need to update your queries. I am currently integrating this into an ORM layer so the developers don't have to worry about writing SQL anyway. Another idea I am toying with is a centralised revision table that all the data tables use. This would allow you to describe the state of the database with a single revision number, similar to how Subversion revision numbers work.
Have a look at this question: How to version control a record in a database
Why not have a separate history_table for each table (as per the accepted answer on the linked question)? That simply has a compound primary key of the original tables' PK and the revision number. You will still need to store the data somewhere after all.
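As an illustration of the history-table approach for the Entries table: the column types, the changed_at column, and the table name entries_history are assumptions, since the question does not give them.

```sql
CREATE TABLE entries_history (
  id         INT NOT NULL,   -- primary key of the row in entries
  revision   INT NOT NULL,   -- revision number within that entry
  title      VARCHAR(255),
  `text`     TEXT,
  index1     INT,
  index2     INT,
  changed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, revision)
);

-- Before changing entry 1, copy its current state into the history.
-- The revision number (1 here) is supplied by the application.
INSERT INTO entries_history (id, revision, title, `text`, index1, index2)
SELECT e.id, 1, e.title, e.`text`, e.index1, e.index2
FROM entries AS e
WHERE e.id = 1;
```

The same pattern would be repeated for the meta, files, and entries-to-tags tables.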
For one of our projects we went the following way:
Entries Table
+----+-----------+---------+
| id | date_from | date_to |
+----+-----------+---------+
EntryProperties Table
+----------+-----------+---------+-------+------+--------+--------+
| entry_id | date_from | date_to | title | text | index1 | index2 |
+----------+-----------+---------+-------+------+--------+--------+
It is fairly complicated, but it lets us keep track of an object's full lifecycle. To query the active entries we used:
SELECT
entry_id, title, text, index1, index2
FROM
Entries INNER JOIN EntryProperties
ON Entries.id = EntryProperties.entry_id
AND Entries.date_to IS NULL
AND EntryProperties.date_to IS NULL
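Writing a new version under this scheme could look roughly like the following. It is a sketch that uses the table names from the headings above and assumes the properties table has the date_to column that the query filters on; the literal values are placeholders.

```sql
START TRANSACTION;

-- Close the currently active version of entry 1.
UPDATE EntryProperties
SET date_to = NOW()
WHERE entry_id = 1
  AND date_to IS NULL;

-- Open the new version; date_to stays NULL while it is active.
INSERT INTO EntryProperties
  (entry_id, date_from, date_to, title, `text`, index1, index2)
VALUES
  (1, NOW(), NULL, 'New title', 'New text', NULL, NULL);

COMMIT;
```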
The only concern was the case where an entity is removed (so we set a date_to on it) and later restored by an admin. With the given scheme there is no way to track that kind of change.
The overall downside of any approach like this is obvious: you have to write tons of TSQL where non-versioned DBs get by with something like select A join B.