Best way to store and retrieve synonyms in a MySQL database - mysql

I am making a list of synonyms that I will store in a database and retrieve before doing a full-text search.
When a user enters a word such as: word1
I need to look this word up in my synonyms table. If the word is found, I would SELECT all of its synonyms and use them in the full-text search in the next query, which I construct like this:
MATCH (columnname) AGAINST ('word1a word1b word1c' IN BOOLEAN MODE)
So how do I store the synonyms in a table? I found 2 choices:
Using key and word columns, like:
val keyword
-------------
1 word1a
1 word1b
1 word1c
2 word2a
2 word2b
3 word3a
etc.
So then I can find an exact match of the entered word in one query and get its ID. In the next SELECT I get all the words with that ID and concatenate them by looping over the result set in a server-side language. I can then construct the real search on the main table where I need to look for the words.
Using only a word column, like:
word1a|word1b|word1c
word2a|word2b|word2c
word3a
Now I do the SELECT to check whether my word is inside any record. If it is, I extract the whole record, explode it at |, and I have my words again, ready to use.
This second approach looks easier to maintain for whoever builds this database of synonyms, but I see 2 problems:
a) How do I find in MySQL whether a word is inside the string? I cannot simply use LIKE 'word1a', because synonyms can look very similar: word1a could be strawberry, strawberries could be birds, and word2a could be berry. Obviously I need an exact match, so how could a LIKE statement match a whole word exactly inside a string?
b) I see a speed problem: I guess using LIKE would take more MySQL time than the "=" of the first approach, where I match a word exactly. On the other hand, the first option needs 2 statements: one to get the ID of the word and a second to get all the words with this ID.
How would you solve this problem, or rather this dilemma of which approach to take? Is there a third way I don't see that is easy for an admin to add/edit synonyms and at the same time fast and optimal? OK, I know there is usually no best way ;-)
UPDATE: The solution of using two tables, one for the master word and a second for the synonym words, will not work in my case, because I don't have a MASTER word that the user types in the search field. He can type any of the synonyms, so I am still wondering how to set up these tables when there are no master words whose IDs would sit in one table, with the synonyms carrying the master's ID in a second table. There is no master word.

Don't use a (one) string to store different entries.
In other words: Build a word table (word_ID,word) and a synonym table (word_ID,synonym_ID) then add the word to the word table and one entry per synonym to the synonyms table.
UPDATE (added 3rd synonym)
Your word table must contain every word (ALL of them); your synonym table only holds pointers to synonyms (not a single word!).
If you had three words: A, B and C, that are synonyms, your DB would be
WORD_TABLE      SYNONYM_TABLE
ID | WORD       W_ID | S_ID
---+-----       -----+-----
1  | A          1    | 2
2  | B          2    | 1
3  | C          1    | 3
                3    | 1
                2    | 3
                3    | 2
Don't be afraid of the many entries in the SYNONYM_TABLE, they will be managed by the computer and are needed to reflect the existing relations between the words.
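A minimal sketch of the lookup against these two tables. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, made only so the example runs anywhere; the SQL itself is plain enough to carry over). The table and column names follow the layout above, and the helper name synonyms_for is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word_table (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    );
    CREATE TABLE synonym_table (
        w_id INTEGER NOT NULL REFERENCES word_table(id),
        s_id INTEGER NOT NULL REFERENCES word_table(id),
        PRIMARY KEY (w_id, s_id)
    );
    INSERT INTO word_table (id, word) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO synonym_table (w_id, s_id) VALUES
        (1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2);
""")


def synonyms_for(conn, typed_word):
    """Return the typed word followed by its synonyms, using a single query."""
    rows = conn.execute(
        """
        SELECT s.word
        FROM word_table AS w
        JOIN synonym_table AS st ON st.w_id = w.id
        JOIN word_table AS s ON s.id = st.s_id
        WHERE w.word = ?
        ORDER BY s.word
        """,
        (typed_word,),
    ).fetchall()
    return [typed_word] + [row[0] for row in rows]


# Any of the synonyms can be typed; there is no master word.
search_terms = " ".join(synonyms_for(conn, "B"))
print(search_terms)  # B A C
```

The exact match on word can use the unique index, and the joined string could then be supplied as the search expression of MATCH (columnname) AGAINST (... IN BOOLEAN MODE) in MySQL.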
2nd approach
You might also be tempted (I don't think you should!) to go with one table that has separate fields for word and a list of synonyms (or IDs) (word_id,word,synonym_list). Beware that that is contrary to the way a relational DB works (one field, one fact).

I think 3 columns and only one table is better
WORD_TABLE
ID | WORD | GroupID
---+----------------
1 | A | 1
2 | B | 1
3 | C | 1
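As a rough sketch of how this single-table layout could be queried, the self-join below finds every word that shares a group with the typed word. It runs on Python's built-in sqlite3 module instead of MySQL (an assumption for portability). The column name group_id, the index, the helper group_words and the extra row 'D' are all hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word_table (
        id       INTEGER PRIMARY KEY,
        word     TEXT NOT NULL UNIQUE,
        group_id INTEGER NOT NULL
    );
    CREATE INDEX idx_word_group ON word_table (group_id);
    -- 'D' is a made-up word in a second group, added to show the filtering.
    INSERT INTO word_table (id, word, group_id) VALUES
        (1, 'A', 1), (2, 'B', 1), (3, 'C', 1), (4, 'D', 2);
""")


def group_words(conn, typed_word):
    """Return all words in the same group as typed_word (including itself)."""
    rows = conn.execute(
        """
        SELECT other.word
        FROM word_table AS typed
        JOIN word_table AS other ON other.group_id = typed.group_id
        WHERE typed.word = ?
        ORDER BY other.id
        """,
        (typed_word,),
    ).fetchall()
    return [row[0] for row in rows]


print(group_words(conn, "C"))  # ['A', 'B', 'C']
```

One query is enough here, and no word plays the role of a master word.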

Another approach is to store meanings. This does not use master words; a meaning table does the grouping instead.
The words themselves are stored in a word table without any synonym lists, with only their text, like this:
Many words, one meaning
meaning_table
meaning_id
---
1
2
3
And store the words in another table, for example if A, B and C were all synonyms of 1 meaning
word_table
word_id | meaning_id | word
--------+------------+------
1 | 1 | A
2 | 1 | B
3 | 1 | C
Even though it looks a lot like what Hasan Amin Sarand suggests, it has the key difference that you don't select from the WORD_TABLE but from the MEANING_TABLE instead. This is much better, and I learned that the hard way.
This way you store the meaning in one table and as many words for that meaning as you like in another.
It does assume, however, that each word has only 1 meaning.
Many words, many meanings
If you want to store words with multiple meanings, then you need another table for the many-to-many relationship, and the whole thing becomes:
meaning_table
-------------
meaning_id
-------------
1
2
3
word_meaning_table
--------------------
word_id | meaning_id
--------+-----------
1 | 1
2 | 1
3 | 1
word_table
--------------
word_id | word
--------+-----
1 | A
2 | B
3 | C
Now you can have as many words with as many meanings as you want, where any word can mean anything you want and any meaning can have many words.
If you want to select a word and its synonyms, then you do
SELECT
meaning_id,word_id,word
FROM meaning_table
INNER JOIN word_meaning_table USING (meaning_id)
INNER JOIN word_table USING (word_id)
WHERE meaning_id=1
You can also then store meaning that does not have a word yet or that you don't know the word of.
If you don't know what meaning it belongs to then you can just insert a new meaning for every new word and fix the meaning_id in the word_table later.
You can then even store and select words that are spelled the same but mean different things
SELECT
meaning_id,word_id,word
FROM meaning_table
INNER JOIN word_meaning_table USING (meaning_id)
INNER JOIN word_table USING (word_id)
WHERE word_id=1
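The many-to-many layout can be tried out end to end with the sketch below. It uses Python's built-in sqlite3 module in place of MySQL (an assumption for portability) and joins word_table on word_id, since that is the column it shares with word_meaning_table. The helper names and the extra link giving word A a second meaning are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meaning_table (meaning_id INTEGER PRIMARY KEY);
    CREATE TABLE word_table (
        word_id INTEGER PRIMARY KEY,
        word    TEXT NOT NULL
    );
    CREATE TABLE word_meaning_table (
        word_id    INTEGER NOT NULL REFERENCES word_table(word_id),
        meaning_id INTEGER NOT NULL REFERENCES meaning_table(meaning_id),
        PRIMARY KEY (word_id, meaning_id)
    );
    INSERT INTO meaning_table (meaning_id) VALUES (1), (2), (3);
    INSERT INTO word_table (word_id, word) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    -- The last pair is made up: it gives word A a second meaning.
    INSERT INTO word_meaning_table (word_id, meaning_id) VALUES
        (1, 1), (2, 1), (3, 1), (1, 2);
""")

BASE_QUERY = """
    SELECT meaning_id, word_id, word
    FROM meaning_table
    INNER JOIN word_meaning_table USING (meaning_id)
    INNER JOIN word_table USING (word_id)
"""


def words_for_meaning(conn, meaning_id):
    """All words (synonyms) attached to one meaning."""
    sql = BASE_QUERY + " WHERE meaning_id = ? ORDER BY word_id"
    return conn.execute(sql, (meaning_id,)).fetchall()


def meanings_for_word(conn, word):
    """All meanings attached to one spelling."""
    sql = BASE_QUERY + " WHERE word = ? ORDER BY meaning_id"
    return conn.execute(sql, (word,)).fetchall()


print(words_for_meaning(conn, 1))
print(meanings_for_word(conn, "A"))
```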

Related

Relational databases: Integrate one tree structure into another

I'm currently designing a relational database table in MySQL for handling multiple categories, representing them later in a tree structure on the client side and filtering on them. Here is a picture of what the structure looks like:
So we have a root element which is set by default. After that we can add children to it (level one). So far, a table structure in the simplest case could be defined like this:
id | name           | parent_id
---+----------------+----------
1  | All Categories | NULL
2  | History        | 1
However, I have a requirement that I need to include another tree structure type (Products) in the table (a corresponding API is available). The records from the other table have their own id types (UUID). Basically I need to ingest them in my table. A possible structure will look like so:
id | UUID         | name                     | parent_id
---+--------------+--------------------------+----------
1  | NULL         | All Categories           | NULL
2  | NULL         | History                  | 1
3  | NULL         | Products                 | 1
4  | CN1001231232 | Catalog electricity      | 3
5  | CN1001231242 | Catalog basic components | 4
6  | NULL         | Shipping                 | 1
I am new to relational databases, but all of these NULL values in the UUID column suggest (at least to me) a badly designed table. Is there a way of avoiding this, or an even better way to do this "ingestion"?
If you had a table for users, with columns first_name, middle_name, last_name but then a user signed up and said they have no middle name, you could just store NULL for that user's middle_name column. What's bad design about that?
NULL is used when an attribute is unknown or inapplicable on a given row. It seems appropriate for the case you describe, i.e. when records that did not come from the external source have no UUID, and need no UUID.
That said, some computer science theorists insist that NULL is never appropriate. There's a decades-old controversy about whether SQL should even have a NULL.
The alternative would be to create a second table, in which you store only the UUID and the reference to the entity in your first table. Then just don't store rows for the other ids.
id | UUID
---+-------------
4  | CN1001231232
5  | CN1001231242
And don't store the UUID column in your first table. This eliminates the NULLs, but it means you need to do a JOIN of the two tables whenever you want to query the entities with their UUIDs.
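A small sketch of that two-table split, run on Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for portability). The table names categories and category_uuids are hypothetical; the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES categories(id)
    );
    -- Only entries that came from the external source get a row here.
    CREATE TABLE category_uuids (
        id   INTEGER PRIMARY KEY REFERENCES categories(id),
        uuid TEXT NOT NULL UNIQUE
    );
    INSERT INTO categories (id, name, parent_id) VALUES
        (1, 'All Categories', NULL),
        (2, 'History', 1),
        (3, 'Products', 1),
        (4, 'Catalog electricity', 3),
        (5, 'Catalog basic components', 4),
        (6, 'Shipping', 1);
    INSERT INTO category_uuids (id, uuid) VALUES
        (4, 'CN1001231232'),
        (5, 'CN1001231242');
""")

# INNER JOIN: only the entities that have a UUID.
rows_with_uuid = conn.execute("""
    SELECT c.id, c.name, u.uuid
    FROM categories AS c
    JOIN category_uuids AS u ON u.id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT JOIN: every entity; the UUID comes back as NULL (None) where absent.
all_rows = conn.execute("""
    SELECT c.id, c.name, u.uuid
    FROM categories AS c
    LEFT JOIN category_uuids AS u ON u.id = c.id
    ORDER BY c.id
""").fetchall()

print(rows_with_uuid)
print(all_rows)
```

Note that the LEFT JOIN brings the NULLs back at query time; the split removes them from storage only.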
First make sure you actually have to combine these in the same table. Are the products categories? If they are categories and are used like categories then it makes sense to have them in the same table, but if they have categories then they should be kept separate and given a category/parent id.
If you're sure it's appropriate to store them in the same table then the way you have it is good with one adjustment. For the UUID you can use a separate naming scheme that makes it interchangeable with id for those entries and avoids collisions with the other uuids. For example:
| id | UUID | name | parent_id |
----------------------------------------------------------
1 CAT000000001 All Categories NULL
2 CAT000000002 History 1
3 CAT000000003 Products 1
4 CN1001231232 Catalog electricity 3
5 CN1001231242 Catalog basic components 4
6 CAT000000006 Shipping 1
Your requirements combine the two things relational databases are not great with out of the box: modelling hierarchies, and inheritance (in the object-oriented sense).
Your design uses the "single table inheritance" model (one of 3 competing options). It's the simplest option in terms of design.
In practical terms, you may want to add a column to explicitly state which type of record you're dealing with ("regular category" or "product category") so your queries are more obvious to others.

Count the number of occurrences of a comma separated string

I need a way to count comma-separated values like this. Any suggestions, please?
Table:
id (int) | site (varchar)
1 | 1,2,3
2 | 2,3
3 | 1,3
Desired output:
site | # of occurrences
1 | 2
2 | 2
3 | 3
Without getting into exactly what you're doing, I'll assume you have a sites table. If so, it's technically achievable with something like:
SELECT sites.site_id AS site, COUNT(1) AS `# of occurrences`
FROM sites
INNER JOIN `table` ON FIND_IN_SET(sites.site_id, `table`.site)
GROUP BY sites.site_id
Performance of that will be appalling, as there is no way to use an index, and the data can become inconsistent very easily.
What the comments on your question are alluding to is a relational table of some description where, instead of storing a comma-separated list, you store a row for each 'occurrence'.
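A sketch of that relational layout, assuming a hypothetical link table named item_sites with one row per (id, site) pair. It runs on Python's built-in sqlite3 module instead of MySQL (an assumption for portability) and splits the comma-separated strings from the question once, at load time.

```python
import sqlite3

# The comma-separated rows from the question, keyed by id.
original_rows = {1: "1,2,3", 2: "2,3", 3: "1,3"}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item_sites (
        item_id INTEGER NOT NULL,
        site_id INTEGER NOT NULL,
        PRIMARY KEY (item_id, site_id)
    )
""")

# One row per occurrence instead of one string per id.
pairs = [
    (item_id, int(site))
    for item_id, sites in original_rows.items()
    for site in sites.split(",")
]
conn.executemany(
    "INSERT INTO item_sites (item_id, site_id) VALUES (?, ?)", pairs
)

# Counting is now a plain GROUP BY, with no string parsing in the query.
site_counts = conn.execute("""
    SELECT site_id, COUNT(*) AS occurrences
    FROM item_sites
    GROUP BY site_id
    ORDER BY site_id
""").fetchall()
print(site_counts)  # [(1, 2), (2, 2), (3, 3)]
```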

Most efficient, scalable mysql database design

I have an interesting question about database design.
I came up with the following design:
first table:
**Survivors:**
Survivor_Id | Name | Strength | Energy
second table:
**Skills:**
Skill_Id | Name
third table:
**Survivor_skills:**
Survivor_Id | Skill_Id | Level
In the first table, Survivors, there will be many records, and it will grow over time.
In the second table there will be just a few skills that survivors can learn (for example: recon (higher view range), sniper (better accuracy), ...). These skills aren't like strength or energy, which all survivors have.
The third table is the most interesting: there survivors and skills join together. Everything will work just fine, but I am worried about data duplication.
For example: survivor with id 1 will have 5 skills, so the third table would look like this:
// survivor_id | skill_id | level
1 | 1 | 2
1 | 2 | 3
1 | 3 | 1
1 | 4 | 5
1 | 5 | 1
First record: survivor with id 1 has skill with id 1 on level 2
Second record ...
Is this the proper approach, or should I use something different?
Looks good to me. If you are worried about data duplication:
1) your server-side code should be geared to not letting this happen
2) you could check before inserting whether the row already exists
3) you could use MySQL's REPLACE INTO: this will replace duplicate rows if configured properly, or insert new ones (http://dev.mysql.com/doc/refman/5.0/en/replace.html)
4) set a unique index on the columns where you want only unique rows, e.g. Survivor_Id, Skill_Id
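Points 3 and 4 can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for portability; SQLite also accepts REPLACE INTO). The sketch assumes the uniqueness that matters is one row per survivor and skill, which is a reading of the question and not something the answer states outright.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survivor_skills (
        survivor_id INTEGER NOT NULL,
        skill_id    INTEGER NOT NULL,
        level       INTEGER NOT NULL,
        -- The composite key acts as the unique index.
        PRIMARY KEY (survivor_id, skill_id)
    )
""")

conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 2)")

# A second plain INSERT for the same survivor and skill is rejected.
try:
    conn.execute("INSERT INTO survivor_skills VALUES (1, 1, 4)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# REPLACE INTO overwrites the existing row instead of failing.
conn.execute("REPLACE INTO survivor_skills VALUES (1, 1, 3)")

rows = conn.execute(
    "SELECT survivor_id, skill_id, level FROM survivor_skills"
).fetchall()
print(duplicate_rejected, rows)  # True [(1, 1, 3)]
```

With the key in place, the database itself refuses duplicates, so the application-side checks in points 1 and 2 become a second line of defence.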
I concur with the others - this is the proper approach.
However, there is one aspect which hasn't been discussed: the order of columns in the composite key {Survivor_Id, Skill_Id}, which will be governed by the kinds of queries you need to run...
If you need to find skills of the given survivor, the order needs to be: {Survivor_Id, Skill_Id}.
If you need to find survivors with the given skill, the order needs to be: {Skill_Id, Survivor_Id}.
If you need both, you'll need both the key (and the implied index) on {Survivor_Id, Skill_Id} and an index on {Skill_Id, Survivor_Id} [1]. Since InnoDB tables are clustered, accessing Level through that secondary index requires a double lookup; to avoid that, consider using a covering index {Skill_Id, Survivor_Id, Level} instead.
[1] Or vice versa.
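A sketch of the "both directions" case. It runs on Python's built-in sqlite3 module, not InnoDB, so it only approximates the behaviour described above: a WITHOUT ROWID table is used because it stores rows ordered by the primary key, which is the closest SQLite analogue to a clustered table. The index name, the helper names and the row for survivor 2 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE survivor_skills (
        survivor_id INTEGER NOT NULL,
        skill_id    INTEGER NOT NULL,
        level       INTEGER NOT NULL,
        PRIMARY KEY (survivor_id, skill_id)
    ) WITHOUT ROWID;
    -- Reverse order plus level, so skill lookups never touch the table rows.
    CREATE INDEX idx_skill_survivor_level
        ON survivor_skills (skill_id, survivor_id, level);
    INSERT INTO survivor_skills (survivor_id, skill_id, level) VALUES
        (1, 1, 2), (1, 2, 3), (1, 3, 1), (1, 4, 5), (1, 5, 1),
        (2, 1, 4);
""")


def skills_of_survivor(conn, survivor_id):
    """Served by the primary key {survivor_id, skill_id}."""
    return conn.execute(
        "SELECT skill_id, level FROM survivor_skills "
        "WHERE survivor_id = ? ORDER BY skill_id",
        (survivor_id,),
    ).fetchall()


def survivors_with_skill(conn, skill_id):
    """Served by the index {skill_id, survivor_id, level}."""
    return conn.execute(
        "SELECT survivor_id, level FROM survivor_skills "
        "WHERE skill_id = ? ORDER BY survivor_id",
        (skill_id,),
    ).fetchall()


print(skills_of_survivor(conn, 1))
print(survivors_with_skill(conn, 1))

# Print the planner's choice for the skill lookup (wording varies by version).
for plan_row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT survivor_id, level FROM survivor_skills WHERE skill_id = 1"
):
    print(plan_row[-1])
```

In MySQL the equivalent check would be EXPLAIN on the same SELECT.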

MySQL query get column value similar to given

Sorry if my question seems unclear, I'll try to explain.
I have a column in a row holding, for example, /1/3/5/8/42/239/. Let's say I would like to find a similar row, one with as many matching "ids" as possible.
Example:
| My Column |
#1 | /1/3/7/2/4/ |
#2 | /1/5/7/2/4/ |
#3 | /1/3/6/8/4/ |
Now, by running the query on #1 I would like to get row #2, as it's the most similar. Is there any way to do it, or is it just my fantasy? Thanks for your time.
EDIT:
As suggested, I'm expanding my question. This column represents the favourite artists of a user from a music site. I'm searching them like this: MyColumn LIKE '%/ID/%', and I remove one by replacing /ID/ with /
Since you did not provide much info about your data, I have to fill the gaps with my guesses.
So you have a users table
users table
-----------
id
name
other_stuff
And you would like to store which artists are favorites of a user. So you must have an artists table
artists table
-------------
id
name
other_stuff
And to relate you can add another table called favorites
favorites table
---------------
user_id
artist_id
In that table you add a record for every artist that a user likes.
Example data
users
id | name
1 | tom
2 | john
artists
id | name
1 | michael jackson
2 | madonna
3 | deep purple
favorites
user_id | artist_id
1 | 1
1 | 3
2 | 2
To select the favorites of user tom for instance you can do
select a.name
from artists a
join favorites f on f.artist_id = a.id
join users u on f.user_id = u.id
where u.name = 'tom'
And if you add proper indexing to your table then this is really fast!
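With a favorites table in place, the original "most similar row" question becomes a self-join that counts shared artists. The sketch below is one possible way to do it, not something the answer spells out. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption for portability) and loads the three example rows from the question as users 1, 2 and 3; the helper name most_similar_users is hypothetical.

```python
import sqlite3

# The three example rows from the question, as user_id -> favourite artist ids.
example_rows = {
    1: [1, 3, 7, 2, 4],
    2: [1, 5, 7, 2, 4],
    3: [1, 3, 6, 8, 4],
}

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE favorites (
        user_id   INTEGER NOT NULL,
        artist_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, artist_id)
    )
""")
conn.executemany(
    "INSERT INTO favorites (user_id, artist_id) VALUES (?, ?)",
    [(user, artist) for user, artists in example_rows.items() for artist in artists],
)


def most_similar_users(conn, user_id):
    """Other users ranked by how many favourite artists they share."""
    return conn.execute(
        """
        SELECT other.user_id, COUNT(*) AS shared
        FROM favorites AS mine
        JOIN favorites AS other
          ON other.artist_id = mine.artist_id
         AND other.user_id <> mine.user_id
        WHERE mine.user_id = ?
        GROUP BY other.user_id
        ORDER BY shared DESC, other.user_id
        """,
        (user_id,),
    ).fetchall()


print(most_similar_users(conn, 1))  # [(2, 4), (3, 3)]
```

Row #2 comes out on top for row #1, as the question expected: they share artists 1, 7, 2 and 4.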
The problem is that you're storing this in a really, really awkward way.
I'm guessing you have to deal with an arbitrary number of values. You have two options:
1) Store the multiple IDs in a blob object in JSON format. MySQL versions before 5.7 don't have JSON functions built in, but there are user-defined functions that will extract values for you, etc.
See: http://blog.ulf-wendel.de/2013/mysql-5-7-sql-functions-for-json-udf/
Alternatively, switch to PostgreSQL.
2) Add as many columns to your table as the maximum number of IDs you expect to have. So if /1/3/7/2/4/8/ is the longest entry, have 6 columns in your table. Reason this is bad: you'll have sparse columns that'll unnecessarily slow your tables.
I'm sure you could write some horrific regex to accomplish the task, but I'd caution against using complex regexes on enormous tables.

Store multiple values in a single cell instead of in different rows

Is there a way I can store multiple values in a single cell instead of different rows, and search for them?
Can I do:
pId | available
1 | US,UK,CA,SE
2 | US,SE
Instead of:
pId | available
1 | US
1 | UK
1 | CA
1 | SE
Then do:
select pId from table where available = 'US'
You can do that, but it makes the query inefficient. You can look for a substring in the field, but that means that the query can't make use of any index, which is a big performance issue when you have many rows in your table.
This is how you would use it in your special case with two-character codes:
select pId from table where find_in_set('US', available)
Keeping the values in separate records makes every operation where you use the values, like filtering and joining, more efficient.
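For comparison, here is a sketch of the one-row-per-value layout from the question, where the lookup is a plain equality test that an index can serve. It uses Python's built-in sqlite3 module in place of MySQL (an assumption for portability); the table name product_availability, the index name and the helper products_available_in are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_availability (
        p_id      INTEGER NOT NULL,
        available TEXT NOT NULL,
        PRIMARY KEY (p_id, available)
    );
    -- Index on the country code so the equality filter does not scan the table.
    CREATE INDEX idx_available ON product_availability (available);
    INSERT INTO product_availability (p_id, available) VALUES
        (1, 'US'), (1, 'UK'), (1, 'CA'), (1, 'SE'),
        (2, 'US'), (2, 'SE');
""")


def products_available_in(conn, country):
    """Return the ids of products available in the given country."""
    rows = conn.execute(
        "SELECT p_id FROM product_availability "
        "WHERE available = ? ORDER BY p_id",
        (country,),
    ).fetchall()
    return [row[0] for row in rows]


print(products_available_in(conn, "US"))  # [1, 2]
print(products_available_in(conn, "CA"))  # [1]
```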
You can use the LIKE operator to get the result:
select pId from table where available like '%US%'