Can MySQL automatically and transparently de-duplicate strings?

In C, a compiler typically stores identical string literals only once and refers to them by pointer: for char *a="Hello", *b="Hello";, usually only one copy of "Hello" is kept in memory. This is totally automatic and transparent to the user.
My question is whether MySQL can do the same, i.e., de-duplicate strings automatically and transparently to the user.
Ideally, I would expect it to be an internal storage mechanism of the database, so that (as in case of C) for the user the database would look and behave completely as if it contained actual strings, while in implementation it would only contain pointers.
In my database there are many repeating strings, like this:
`unit`, `building`, `office`, `firstName`, `lastName`
Chicago main production unit | headquarters | accounting | Jane | Smith
Chicago main production unit | office | sales | Jane | Dow
Miami administrative department | headquarters | sales | Mary | Smith
Miami administrative department | office | accounting | Mary | Dow
etc. where strings like 'Miami administrative department' or 'accounting' or 'Smith' are repeated many times in different records.
This increases the size of the database, so that I hit hosting limitations.
An obvious solution is data normalization: to maintain a separate table for names
`id`, `string`
1 | Chicago main production unit
2 | Miami administrative department
3 | headquarters
4 | accounting
5 | Jane
6 | Smith
7 | office
8 | sales
9 | Dow
and then have my table as
`unit_id`, `building_id`, `office_id`, `firstName_id`, `lastName_id`
1 | 3 | 4 | 5 | 6
1 | 7 | 8 | 5 | 9
and translate all strings on input and output. But of course this is very cumbersome.
My question is whether MySQL can do it automatically and transparently for the user: whenever I INSERT a row, it would automatically update the table of strings and only store the ids instead of strings in the table, and same for DELETE, WHERE, etc., so that to the user the table would look exactly the same as if it had strings, but occupy less space.

My question is whether MySQL can do the same.
Although you can certainly achieve the desired result (it is called data normalization) MySQL does not do it implicitly.
Can MySQL do it automatically and transparently for the user?
No, MySQL cannot do it automatically for you - you have to do it yourself. You need to be explicit about it in your queries and DDL statements.
Here is a short demo to show how you can create a lookup table, and then use it in your inserts and selects:
create table lookup(id int, name varchar(10));
create table data(id int, id_lookup int);
insert into lookup(id,name) values (1,'quick');
insert into lookup(id,name) values (2,'brown');
insert into lookup(id,name) values (3,'fox');
insert into data (id, id_lookup)
values (110, (select id from lookup where name = 'quick'));
insert into data (id, id_lookup)
values (120, (select id from lookup where name = 'brown'));
insert into data (id, id_lookup)
values (130, (select id from lookup where name = 'quick'));
insert into data (id, id_lookup)
values (140, (select id from lookup where name = 'fox'));
Now data has these rows:
110 1
120 2
130 1
140 3
To select the name, you need to join to your lookup table:
select d.id, t.name
from data d
join lookup t on t.id=d.id_lookup
Demo on sqlfiddle.
Note: it is uncommon to create a single lookup table for all your strings. Commonly you would create a separate lookup table for each kind of string (i.e. unit_lookup, building_lookup, and so on) or partition your lookup table with a special lookup code column:
id code name
-- ---- ----
1 unit Chicago
2 unit Miami
3 bldg Headquarters
4 bldg Office
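MySQL will not hide the lookup for you, but a view can make the read side look as if the table held strings. The following is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained, and the names unit_lookup, staff and staff_view are hypothetical. MySQL supports CREATE VIEW as well, but inserts still have to resolve the string to an id explicitly.

```python
import sqlite3

# In-memory SQLite database, used only for illustration.
# unit_lookup, staff and staff_view are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE unit_lookup (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE staff (
        id        INTEGER PRIMARY KEY,
        unit_id   INTEGER NOT NULL REFERENCES unit_lookup(id),
        last_name TEXT NOT NULL
    );
    -- The view joins the ids back to their strings for readers.
    CREATE VIEW staff_view AS
        SELECT s.id, u.name AS unit, s.last_name
        FROM staff s
        JOIN unit_lookup u ON u.id = s.unit_id;
""")

conn.executemany(
    "INSERT INTO unit_lookup (id, name) VALUES (?, ?)",
    [(1, "Chicago main production unit"),
     (2, "Miami administrative department")],
)

# Each insert resolves the string to its id with a subquery,
# in the same way as the demo above.
conn.executemany(
    "INSERT INTO staff (unit_id, last_name) "
    "VALUES ((SELECT id FROM unit_lookup WHERE name = ?), ?)",
    [("Chicago main production unit", "Smith"),
     ("Chicago main production unit", "Dow"),
     ("Miami administrative department", "Smith")],
)

# Readers query the view and see strings, not ids.
rows = conn.execute(
    "SELECT unit, last_name FROM staff_view ORDER BY id"
).fetchall()
```

Each long unit name is stored once in unit_lookup, while staff holds only the integer id.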

Related

How do I turn a list of interconnected pairs of ids into a cluster of ids?

I have a table with pairs (and sometimes triples) of ids, which act as sort of links in a chain
+------+-----+
| from | to |
+------+-----+
| id1 | id2 |
| id2 | id3 |
| id4 | id5 |
+------+-----+
I want to create a new table where all the links are clustered into chains/families:
+-----+----------+
| id | familyid |
+-----+----------+
| id1 | 1 |
| id2 | 1 |
| id3 | 1 |
| id4 | 2 |
| id5 | 2 |
+-----+----------+
i.e. combine all the links in a chain into a single family, and give it an id.
In the example above, the first 2 rows of the first table create one family, and the last row creates another family.
Solution
I will use node.js to query big batches of rows (a few thousands every batch), process them, and insert them into my own table with a family id.
The issue
The problem is that I have a few tens of thousands of id pairs, and after the initial creation of the families table I will also need to add new ids over time, including adding ids to existing families.
Are there good algorithms for clustering pairs of data into families/clusters, keeping my issue in mind?

Not sure if it's an answer as more some ideas...
I created two tables similar to the ones you have, the first one I populated with the same data as you have.
Table Base, fromID, toID
Table chain, fromID, chainID (numeric, null allowed)
I then inserted all unique values from Base into chain with a null value for chainID. The idea being these are the rows as yet unprocessed.
It was then a case of repeatedly running a couple of statements...
update chain c
set chainID = n
where chainid is null and exists ( select 1 from base b where b.fromID = c.fromID )
order by fromID
limit 1
This would allocate the next chain ID to the first row without one (n needs to be generated from somewhere and incremented each time you run this)
Then the one that relates all of the records...
update chain c
join base b on b.toID = c.fromID
join chain c1 on b.fromID = c1.fromID
set c.chainID = c1.chainID
where c.chainID is null and c1.chainID is not null
This is run repeatedly until it affects 0 rows (i.e. there is nothing more to do).
Then run the first update to create the next chain etc. Again if you run the first update till it affects 0 rows, this shows that they are all linked.
Would be interested if you want to try this and see if it stands up with more complex scenarios.

This looks a lot like clustering over graph dataset where 'familyid' is the cluster center number.
Here is a question I think is relevant.
Here is the algorithm description. You will need to implement under the conditions you described.
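One common way to group linked ids is to treat the pairs as edges of a graph and compute its connected components with a union-find (disjoint-set) structure. This is a minimal sketch, not necessarily the algorithm linked above; the function name build_families is hypothetical, and it assumes every link is given as a (from, to) pair (a triple could be fed in as two pairs).

```python
def build_families(pairs):
    """Group ids connected by (from, to) links into numbered families."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root.
        parent.setdefault(x, x)
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    # Number the families in order of first appearance.
    family_of_root = {}
    families = {}
    for node in list(parent):
        root = find(node)
        if root not in family_of_root:
            family_of_root[root] = len(family_of_root) + 1
        families[node] = family_of_root[root]
    return families


families = build_families([("id1", "id2"), ("id2", "id3"), ("id4", "id5")])
```

For a few tens of thousands of pairs this runs in memory in well under a second. When new pairs arrive later, a new link can join two existing families, so the simplest option is to rebuild from all pairs; otherwise the merged family's rows need their familyid updated.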

How to design my database to accommodate this data

I am developing a database for a payroll application, and one of the features I'll need is a table that stores the list of employees that work at each store, each day of the week.
Each employee has an ID, so my table looks like this:
| Mon | Tue | Wed | Thu | Fri | Sat | Sun
Store 1 | 3,4,5 | 3,4,5 | 3,4,5 | 4,5,7 | 4,5,7 | 4,5,6,7 | 4,5,6,7
Store 2 | 1,8,9 | 1,8,9 | 1,8,9 | 1,8,9 | 1,8,9 | 1,8,9 | 1,8,9
Store 3 | 10,12 | 10,12 | 10,12 | 10,12 | 10,12 | 10,12 | 10,12
Store 4 | 15 | 15 | 15 | 16 | 16 | 16 | 16
Store 5 | 6,11,13 | 6,11,13 | 6,11,13 | 14,18,19| 14,18,19| 14,18,19| 14,18,19
My question is, how do I represent that on my database? I came up with the following ideas:
Idea 1: Pretty much replicate the design above, creating a table with the following columns: [Store_id | Mon | Tue ... | Sat | Sun] and then store the list of employee IDs of each day as a string, with IDs separated by commas. I know that comma-separated lists are not good database design, but sometimes they do look tempting, as in this case.
Store_id | Mon | Tue | Wed | Thu | Fri | Sat
---------+---------+---------+---------+---------+---------+---------
1 | '3,4,5' | '3,4,5' | '3,4,5' | '4,5,7' | '4,5,7' | '4,5,6,7'
2 | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9'
Idea 2: Create a table with the following columns: [Store_id | Day | Employee_id]. That way each employee working at a specific store at a specific day would be an entry in this table. The problem I see is that this table would grow quite fast, and it would be harder to visualize the data at the database level.
Store_id | Day | Employee_id
---------+-----+-------------
1 | mon | 3
1 | mon | 4
1 | mon | 5
1 | tue | 3
1 | tue | 4
Any of these ideas sound viable? Any better way of storing the data?

if I were you I would store the employee data and stores data in separate tables... but still keep the design of your main table. so do something like this
CREATE TABLE stores (
id INT, -- make it the primary key auto increment.. etc
store_name VARCHAR(255)
-- any other data for your store here.
);
CREATE TABLE schedule (
id INT, -- make it the primary key auto increment.. etc
store_id INT, -- FK to the stores table id
day VARCHAR(20),
emp_id INT -- FK to the employees table id
);
CREATE TABLE employees (
id INT, -- make it the primary key auto increment.. etc
employee_name VARCHAR(255)
-- whatever other employee data you need to store.
);
I would have a table for stores and for employees as that way you can have specific data for each store or employee
BONUS:
if you wanted a query to show the store name with the employees name and their schedule and everything then all you have to do is join the three tables
SELECT s.store_name, sh.day, e.employee_name
FROM schedule sh
JOIN stores s ON s.id = sh.store_id
JOIN employees e ON e.id = sh.emp_id
this query has a limitation though: with the day stored as text you cannot sort the rows in weekday order, so the days may come back in any order. so in reality you also need a days table with specific data for each day, that way you can order the data from the beginning or end of the week.
if you did want to make a days table it would just be the same thing again
CREATE TABLE days(
id INT,
day_name VARCHAR(20),
day_type VARCHAR(55)
-- any more data you want here
)
where day name would be Mon Tue... and day_type would be Weekday or Weekend
and then all you would have to do for your query is
SELECT s.store_name, d.day_name, e.employee_name
FROM schedule sh
JOIN stores s ON s.id = sh.store_id
JOIN employees e ON e.id = sh.emp_id
JOIN days d ON d.id = sh.day_id
ORDER BY d.id
notice that the day column in the schedule table would be replaced with a day_id column linked to the days table.
hope that's helpful!
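The stores / employees / days / schedule layout described above can be tried end to end. This is a minimal sketch under stated assumptions: it runs on Python's built-in sqlite3 module rather than MySQL, the schedule table carries a day_id column, and the employee names are made-up placeholders. The schedule rows follow Store 4 in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores    (id INTEGER PRIMARY KEY, store_name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, employee_name TEXT);
    CREATE TABLE days      (id INTEGER PRIMARY KEY, day_name TEXT, day_type TEXT);
    CREATE TABLE schedule (
        id       INTEGER PRIMARY KEY,
        store_id INTEGER REFERENCES stores(id),
        day_id   INTEGER REFERENCES days(id),
        emp_id   INTEGER REFERENCES employees(id)
    );

    INSERT INTO stores VALUES (4, 'Store 4');
    -- Placeholder names; the question only gives employee ids.
    INSERT INTO employees VALUES (15, 'Employee 15'), (16, 'Employee 16');
    INSERT INTO days VALUES
        (1, 'Mon', 'Weekday'), (4, 'Thu', 'Weekday'), (6, 'Sat', 'Weekend');
    -- Rows are inserted out of weekday order on purpose.
    INSERT INTO schedule (store_id, day_id, emp_id) VALUES
        (4, 6, 16), (4, 1, 15), (4, 4, 16);
""")

# Joining through days lets the result be sorted in weekday order.
rows = conn.execute("""
    SELECT s.store_name, d.day_name, e.employee_name
    FROM schedule sh
    JOIN stores s    ON s.id = sh.store_id
    JOIN employees e ON e.id = sh.emp_id
    JOIN days d      ON d.id = sh.day_id
    ORDER BY d.id
""").fetchall()
```

Because the sort key is the numeric days.id, the rows come back Mon, Thu, Sat regardless of insertion order.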

The second design is correct for a relational database. One employee_id per row, even if it results in multiple rows per store per day.
The number of rows is not likely to get larger than the RDBMS can handle, if your example is accurate. You have no more than 4 employees per store per day, and 5 stores, and up to 366 days per year. So no more than 4 x 5 x 366 = 7320 rows per year, and perhaps fewer.
I regularly see databases in MySQL that have hundreds of millions or even billions of rows in a given table. So you can continue to run those stores for many years before running into scalability problems.

I upvoted John Ruddell's answer, which is basically your option #2 with the addition of tables to hold data about the store and the employee. I won't repeat what he said, but let me just add a couple of thoughts that are too long for a comment:
Never ever ever put comma-separated values in a database record. This makes the data way harder to work with.
Sure, either #1 or #2 makes it easy to query to find which employees are working at store 1 on Friday:
Method 1:
select Friday_employees from schedule where store_id='store 1'
Method 2:
select employee_id from schedule where store_id=1 and day='fri'
But suppose you want to know what days employee #7 is working.
With method 2, it's easy:
select day from schedule where employee_id=7
But how would you do that with method 1? You'd have to break the field up into its individual pieces and check each piece. At best that's a pain, and I've seen people screw it up regularly, like writing
where Friday_employees like '%7%'
Umm, except what if there's an employee number 17 or 27? You'll get them too. You could say
where Friday_employees like '%,7,%'
But then if the 7 is the first or the last on the list, it doesn't work.
What if you want the user to be able to select a day and then give them the list of employees working on that day?
With method 2, easy:
select employee_id from schedule where day=#day
Then you use a parameterized query to fill in the value.
With method 1 ...
select case when #day='mon' then Monday_employees when #day='tue' then Tuesday_employees when #day='wed' then Wednesday_employees when #day='thu' then Thursday_employees when #day='fri' then Friday_employees when #day='sat' then Saturday_employees when #day='sun' then Sunday_employees end as day_employees from schedule
That's a beast, and if you do it a lot, sooner or later you're going to make a mistake and leave a day out or accidentally type "when day='thu' then Friday_employees" or some such. I've seen that happen often enough.
Even if you write those long complex queries, performance will suck. If you have a field for employee_id, you can index on it, so access by employee will be fast. If you have a comma-separated list of employees, then a query of the "like '%,7,%'" variety requires a sequential scan of every record in the table.
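The LIKE pitfalls described above are easy to reproduce. This is a minimal sketch using Python's built-in sqlite3 module; the table names schedule_csv and schedule are hypothetical, and employee 17 at store 2 is invented data added to trigger the false positive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Method 1: one comma-separated string per store (Friday only, for brevity).
    CREATE TABLE schedule_csv (store_id INTEGER, friday_employees TEXT);
    INSERT INTO schedule_csv VALUES (1, '4,5,7'), (2, '1,8,17');

    -- Method 2: one row per store, day and employee.
    CREATE TABLE schedule (store_id INTEGER, day TEXT, employee_id INTEGER);
    INSERT INTO schedule VALUES
        (1, 'fri', 4), (1, 'fri', 5), (1, 'fri', 7),
        (2, 'fri', 1), (2, 'fri', 8), (2, 'fri', 17);
    CREATE INDEX schedule_employee ON schedule (employee_id);
""")


def store_ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# '%7%' also matches employee 17 at store 2 (false positive).
loose = store_ids(
    "SELECT store_id FROM schedule_csv "
    "WHERE friday_employees LIKE '%7%' ORDER BY store_id")

# '%,7,%' misses store 1, where 7 is the last id in the list.
strict = store_ids(
    "SELECT store_id FROM schedule_csv "
    "WHERE friday_employees LIKE '%,7,%' ORDER BY store_id")

# The normalized table gives the right answer with a plain equality test.
exact = store_ids(
    "SELECT store_id FROM schedule "
    "WHERE employee_id = 7 AND day = 'fri' ORDER BY store_id")
```

Only the last query can use the index on employee_id; both LIKE patterns start with a wildcard, so every row has to be scanned.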

Select rows from a table that contain any word from a long list of words in another table

I have one table with every Fortune 1000 company name:
FortuneList:
------------------------------------------------
|fid | coname |
------------------------------------------------
| 1 | 3m |
| 2 | Amazon |
| 3 | Bank of America |
| 999 | Xerox |
------------------------------------------------
I have a 2nd table with every user on my newsletter:
MyUsers:
------------------------------------------------
|uid | name | companyname |
------------------------------------------------
| 1350 | John Smith | my own Co |
| 2731 | Greg Jones | Amazon.com, Inc |
| 3899 | Mike Mars | Bank of America, Inc |
| 6493 | Alex Smith | Handyman America |
------------------------------------------------
How do I pull out every one of my newsletter subscribers that works for a Fortune 1000 company? (By scanning my entire MyUsers table for every record that has any of the coname's from the FortuneList table)
I would want output to pull:
------------------------------------------------
|uid | name | companyname |
------------------------------------------------
| 2731 | Greg Jones | Amazon.com, Inc |
| 3899 | Mike Mars | Bank of America, Inc |
------------------------------------------------
(See how it finds "Amazon" in the middle of "Amazon.com, Inc")

Try using this, which uses an INNER JOIN, the LIKE operator, and CONCAT:
SELECT *
FROM MyUsers
INNER JOIN FortuneList
ON MyUsers.companyname LIKE CONCAT('%', FortuneList.coname, '%')
(This wouldn't use your Full Text index, I'm trying to figure out how you could use a MATCH...AGAINST in a JOIN.)
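The substring join can be checked against the sample data from the question. This is a minimal sketch on Python's built-in sqlite3 module, so the pattern is built with SQLite's || operator where MySQL would use CONCAT. The user's company name goes on the left of LIKE and the Fortune name sits between the wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE FortuneList (fid INTEGER, coname TEXT);
    CREATE TABLE MyUsers (uid INTEGER, name TEXT, companyname TEXT);

    INSERT INTO FortuneList VALUES
        (1, '3m'), (2, 'Amazon'), (3, 'Bank of America'), (999, 'Xerox');
    INSERT INTO MyUsers VALUES
        (1350, 'John Smith', 'my own Co'),
        (2731, 'Greg Jones', 'Amazon.com, Inc'),
        (3899, 'Mike Mars',  'Bank of America, Inc'),
        (6493, 'Alex Smith', 'Handyman America');
""")

# A user matches when any Fortune name occurs inside the company name.
matches = conn.execute("""
    SELECT u.uid, u.name, u.companyname
    FROM MyUsers u
    JOIN FortuneList f
      ON u.companyname LIKE '%' || f.coname || '%'
    ORDER BY u.uid
""").fetchall()
```

A user whose company name contains several Fortune names would appear once per match, so SELECT DISTINCT may be wanted on real data.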

If you were doing this in Oracle, this would yield your desired result (with the example data):
with fortunelist as(
select 1 as fid, '3m' as coname from dual union all
select 2, 'Amazon' from dual union all
select 3, 'Bank of America' from dual union all
select 999, 'Xerox' from dual
)
, myusers as(
select 1350 as usrid, 'John Smith' as name, 'my own Co' as companyname from dual union all
select 2731, 'Greg Jones', 'Amazon.com, Inc.' from dual union all
select 3899, 'Mike Mars', 'Bank of America, Inc' from dual union all
select 6493, 'Alex Smith', 'Handyman America' from dual
)
select utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as sim
, myusers.companyname
, fortunelist.coname
from fortunelist
, myusers
where utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80
The reason being, the Jaro Winkler result for the 2 you're after are 87 and 95 (Amazon and BOA, respectively). You can bump the 80 in the query up or down to make the matching threshold higher or lower. The higher you go, the fewer matches you'll have, but the more likely they will be. The lower you go, the more matches you'll have, but you risk getting matches back that aren't really matches. For instance, "Handyman America" vs. "Bank of America" = 73/100. So if you lowered it to 70, you would get a false positive, using your example data. Jaro Winkler is generally meant for people's names, not company names, however because company names are typically also very short strings, it may still be useful for you.
I know you tagged this as MySQL, and while this function does not exist there, from what I've read people have already done the work of creating a custom function for it:
http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/
http://dannykopping.com/blog/fuzzy-text-search-mysql-jaro-winkler
You could also try string replacements, ex. eliminating common reasons for a match not being found (such as there being an "Inc." on one table but not the other).
Edit 2/10/14:
You can do this in MySQL (via phpmyadmin) following these steps:
Go into phpmyadmin then your database and paste the code from this URL link (below) into a SQL window and hit Go. This will create the custom function that you'll need to use in Step 2. I'm not going to paste the code for the function here because it's long, also it's not my work. It basically allows you to use the jaro winkler algorithm in MySQL, the same way you would with utl_match if you were using Oracle.
http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/
After that function is created, run the following SQL:
select jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as similarity
, myusers.uid
, myusers.name
, myusers.companyname as user_co
, fortunelist.coname as matching_co
from fortunelist
, myusers
where jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80
This should yield the exact result you're looking for, but like I said you'll want to play around with the 80 in that SQL and go up or down so that you have a good balance between avoiding false positives but also finding the matches that you want to find.
I don't have a MySQL database with which to test so if you run into an issue please let me know, but this should work.
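The string-replacement idea mentioned above can be sketched outside the database. This is an illustration only, not a Jaro-Winkler implementation: the normalize function and the SUFFIXES list are hypothetical, and a real list of suffixes would need tuning against the actual data.

```python
import re

# Hypothetical list of legal suffixes to drop; extend it for real data.
SUFFIXES = ("inc", "corp", "co", "llc", "ltd")


def normalize(name):
    """Lowercase a company name and strip '.com', punctuation and suffixes."""
    text = name.lower().replace(".com", " ")
    # Replace anything that is not a letter, digit or space.
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    words = [word for word in text.split() if word not in SUFFIXES]
    return " ".join(words)


fortune = ["3m", "Amazon", "Bank of America", "Xerox"]
users = [
    (1350, "my own Co"),
    (2731, "Amazon.com, Inc"),
    (3899, "Bank of America, Inc"),
    (6493, "Handyman America"),
]

# Compare normalized names exactly instead of fuzzily.
fortune_keys = {normalize(company) for company in fortune}
matched = [uid for uid, company in users if normalize(company) in fortune_keys]
```

Exact comparison after normalization avoids false positives such as "Handyman America", at the cost of missing names that differ in ways the suffix list does not cover.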

Using LOCATE (no index thus):
select uid, name, companyname
from MyUsers JOIN FortuneList
WHERE LOCATE(coname, companyname) > 0

How to join/subquery a second table

I have two tables. One table has some information in each row along with a comma-separated list of ids that another table contains. Right now I am grabbing the data from table A (with the comma-separated ids), and I want to also grab all of the data from Table B (the table containing additional information). I would like to do this in the most efficient SQL method possible.
I was thinking about joining Table B to Table A based on the ids IN the field, but I was not sure if this is possible. It is also important to note that I am grabbing data from Table A based on another IN statement, so my ultimate goal is to attach all of the rows in Table B to Table A's rows depending on which ids are in the field in Table A's rows (row by row basis)
If someone could follow all of that and knows what I am trying to do I would appreciate a sample query :D
If you need any further clarification I would be happy to provide it.
Thanks
The way Table A is setup now:
`table_a_id` VARCHAR ( 6 ) NOT NULL,
`table_b_ids` TEXT NOT NULL, -- This is a comma-separated list at the moment
-- More data here that is irrelevant to this question but i am grabbing
Table B is setup like this:
`table_b_id` VARCHAR ( 6 ) NOT NULL,
`name` VARCHAR ( 128 ) NOT NULL,
-- More data that is not relevant to the question
Also I want to eventually switch to a NOSQL system like Cassandra, from what I have briefly read I understand there are no such things as joins in NOSQL? A bonus help would be to help me to setup these tables so I can convert over with less conversions and difficulty.

You need to add another table.
Person -- your Table A
------
PersonID
Thing -- your Table B
------
ThingID
ThingName
PersonThing -- new intersection table
-------
PersonID
ThingID
Then your query becomes
SELECT * from Person
INNER JOIN PersonThing ON Person.PersonID = PersonThing.PersonID
INNER JOIN Thing ON PersonThing.ThingID = Thing.ThingID
So where now you have
001 | Sam Spade | 12,23,14
You would have
Person
001 | Sam Spade
Thing
12 | box
23 | chair
14 | wheel
PersonThing
001 | 12
001 | 23
001 | 14
This is what the other answers mean by "normalizing".
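Moving from the comma-separated column to an intersection table is a one-off migration. This is a minimal sketch on Python's built-in sqlite3 module, using the column names from the question; the intersection table name table_a_b is hypothetical, and the sample row follows the Sam Spade example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (table_a_id TEXT PRIMARY KEY, table_b_ids TEXT NOT NULL);
    CREATE TABLE table_b (table_b_id TEXT PRIMARY KEY, name TEXT NOT NULL);
    -- New intersection table: one row per (a, b) pair.
    CREATE TABLE table_a_b (
        table_a_id TEXT NOT NULL REFERENCES table_a(table_a_id),
        table_b_id TEXT NOT NULL REFERENCES table_b(table_b_id),
        PRIMARY KEY (table_a_id, table_b_id)
    );

    INSERT INTO table_a VALUES ('001', '12,23,14');
    INSERT INTO table_b VALUES ('12', 'box'), ('23', 'chair'), ('14', 'wheel');
""")

# One-off migration: split each comma-separated list into junction rows.
existing = conn.execute(
    "SELECT table_a_id, table_b_ids FROM table_a").fetchall()
for a_id, b_ids in existing:
    conn.executemany(
        "INSERT INTO table_a_b (table_a_id, table_b_id) VALUES (?, ?)",
        [(a_id, b_id.strip()) for b_id in b_ids.split(",") if b_id.strip()],
    )

# After the migration, a plain join attaches every table_b row to its
# table_a row, and the IN filter on table_a still works.
rows = conn.execute("""
    SELECT a.table_a_id, b.table_b_id, b.name
    FROM table_a a
    JOIN table_a_b ab ON ab.table_a_id = a.table_a_id
    JOIN table_b b    ON b.table_b_id = ab.table_b_id
    WHERE a.table_a_id IN ('001')
    ORDER BY b.name
""").fetchall()
```

Once the junction rows exist, the table_b_ids column can be dropped.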
Edited to add
From what I understand of NoSQL, you would get around the joins like this:
Person -- your Table A
------
PersonID
OtherPersonStuff
Thing -- your Table B
------
ThingID
ThingName
OtherThingStuff
PersonThing -- denormalized table, one record for each Thing held by each Person
-------
PersonID
ThingID
ThingName
OtherThingStuff
In exchange for taking up extra space (by duplicating the Thing information many times) and potential data management headaches (keeping the duplicates in sync), you get simpler, faster queries.
So your last table would look like this:
PersonThing
001 | 12 | box | $2.00
001 | 23 | chair | $3.00
001 | 14 | wheel | $1.00
002 | 12 | box | $2.00
003 | 14 | wheel | $1.00
In this case OtherThingStuff is the value of the Thing.

You should consider normalizing your database schema so that you can use a join. A comma-separated list stored in one column is a single string, so you cannot use it with SQL's IN operator or join on its individual ids.
The best way to do it is to store a row for each unique ID, then you can JOIN on TableA.id = TableB.id

Hashed string in MySQL

Is there some kind of hashed string type in MySQL?
Let's say we have a table
user | action | target
-----------------------
1 | likes | 14
2 | follows | 190
I don't want to store "action" as text, because it takes much space and is slow to index. Actions are likely to be limited (up to 50 actions, I guess) but can be added/removed in the future. So I would like to avoid storing all actions by numbers in PHP. I would like to have a table that handles this transparently.
For example, table above would be stored as (1,1,14), (2,2,190) internally, and keys would be stored in another table (1 = likes, 2 = follows).
INSERT INTO table (41, "likes", 153)
Here "likes" is resolved to 1.
INSERT INTO table (23, "dislikes", 1245)
Here we have no key for "dislikes" so it is added and stored internally as 3.
Possible?

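If you have a fixed (or reasonably fixed) set of values, then you can use an enum field. MySQL stores each ENUM value internally as a small integer index (1 byte for up to 255 values, 2 bytes beyond that), so it takes very little disk space. Adding or removing a value later requires an ALTER TABLE. Here is an example definition: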
CREATE TABLE enum_test (
myEnum enum('enabled', 'disabled', 'unknown')
);

Yes it is, with a subquery like this:
INSERT INTO `table` VALUES (23, (SELECT id FROM actions WHERE action = 'dislikes'), 1245)
This way the PHP side does not need to know the ID, only the action name, and the row is still stored in the database with the ID.
This assumes you have an 'actions' table:
id | action
-----------
1 | likes
2 | dislikes

You want a table called "actions", and a foreign key called "action_id". That is how database normalization works:
user_actions:
user | action_id | target
-----------------------
1 | 1 | 14
2 | 2 | 190
actions:
id | name
--------------
1 | likes
2 | follows
As far as making insert into user_actions (1, 'likes', 47) work: You shouldn't care. Trying to make your SQL pretty is a pointless pursuit; you should never actually have to write any in your application code. The database interactions should be handled by a layer of models/business objects, and their internal implementation shouldn't matter to you.
As far as making insert into user_actions (1, 'dislikes', 47) automatically create new records in the actions table: That again isn't the database's job. Your models should be handling this.
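What "your models should be handling this" might look like is sketched below. This is a minimal illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than PHP and MySQL, the helper names action_id and record_action are hypothetical, and the user column is called user_id here. INSERT OR IGNORE is SQLite syntax; MySQL has INSERT IGNORE for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actions (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE user_actions (
        user_id   INTEGER NOT NULL,
        action_id INTEGER NOT NULL REFERENCES actions(id),
        target    INTEGER NOT NULL
    );
    INSERT INTO actions (id, name) VALUES (1, 'likes'), (2, 'follows');
""")


def action_id(conn, name):
    """Return the id for an action name, creating the row if it is new."""
    # The UNIQUE constraint makes this a no-op for known names.
    conn.execute("INSERT OR IGNORE INTO actions (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM actions WHERE name = ?", (name,)).fetchone()
    return row[0]


def record_action(conn, user_id, name, target):
    """Store one user action; callers only ever pass the action name."""
    conn.execute(
        "INSERT INTO user_actions (user_id, action_id, target) VALUES (?, ?, ?)",
        (user_id, action_id(conn, name), target),
    )


record_action(conn, 41, "likes", 153)      # resolves to the existing id 1
record_action(conn, 23, "dislikes", 1245)  # creates a new actions row

stored = conn.execute(
    "SELECT user_id, action_id, target FROM user_actions ORDER BY rowid"
).fetchall()
```

The application code works with action names throughout, while user_actions stores only integer ids.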
