How can I select MySQL fulltext index entries for automplete functionality? - mysql

Is there a way to select entries from fulltext index in MySQL?

No, not that I know of. It would be a great feature though.
I built a search interface with autocomplete on top of MySQL. I run a daily job that scans all columns in all tables that I want to search in, extract words with regular expressions, then store the words in a separate table. I also have a many-to-many table with one column to hold the id of the object, and one column to hold the id of the word so as to record the fact that "word is part of text belonging to object".
The autocomplete works by taking the words typed into the box, and then generating a query that goes like:
SELECT obj.title
FROM obj_word
INNER JOIN obj
ON obj_word.obj_id = obj.id
INNER JOIN word
ON obj_word.word_id = word.id
WHERE word.word IN ('word1', 'word2', 'word3') -- generated dynamically, word1 etc are typed by the user
GROUP BY obj.id
HAVING COUNT(DISTINCT word.id) = 3 -- the 3 is generated, because user typed 3 words.
This works fairly well for me, but I don't have huge amounts of data to work with.
(the actual implementation is slightly fancier, beause the last word is matched with LIKE to allow partial matches)
EDIT:
I just learned that the myisam_ft_dump utility may be used to extract a list of words from the index file. The command line goes something like this:
myisam_ftdump -d film_text 1 > D:\tmp\out.txt
Here, -d means dump (get a list of all entries), film_text is the name of a MyISAM table with a full text index, 1 is one, and ordinal identifying which index you want to dump.
I must say, the utility works, but I am not surer it is fast enough to use this for pulling a live list for autocompletion. You could of course have a periodical job that runs the command and dumps it to file. Unfortunately this dumps index entries not individual, unique words.
My hunch is you could use this utility as a means to extract the words, but it will need processing to turn it into a proper autocomplete list.

Related

How to save people with multiple names in database and keep search fast?

I need to parse a table of names, create a mysql table for them and insert them. Some people have 2 or maybe even more first names/last names. The file only has full names as long strings and no information where the first name stops and last name begins. Initially I wanted to create only 2 VARCHAR fields but now I am not sure anymore. Should I go with 1 field and FULLTEXT index on it instead? I want to be able to search fast by individual first or last names. I use 10.1.38 MariaDB.

Count the number of times keywords (in a table) appear in a field in another table

I will simplify my problem in order to explain it.
I have a table which contains text messages posted by users and another table which contains keywords.
I want to display, for each user, the number of times keywords are found in text messages.
I don't want the result to display a keyword if it's not found in text messages.
I wan't it to be case INSENSITIVE. All keywords are lowered but in messages, you can find lower & upper chars.
Because I'm not sure that my explanation is clear enough, here comes the SQLFiddle.
http://sqlfiddle.com/#!2/c402a
Hope anyone can help me.
I found what I was looking for. It wasn't easy for me but here is my query :
SELECT t_msg.msg_usr,
t_list.list_word,
count(t_list.list_word),
t_msg.msg_text
FROM t_msg
INNER JOIN t_list
ON LOWER(t_msg.msg_text) LIKE CONCAT("%", t_list.list_word, "%")
GROUP BY t_msg.msg_usr, t_list.list_word;
The SQLFiddle is there : http://sqlfiddle.com/#!2/ba052/8
The recommendation would be to not try solving this with a query. It's possible to write a query that will do it, such query will scan the messages table for each keyword separately, and produce a count (or a row that you can group by), but this won't scale, or be reliable in sense of language search.
Here is what you might want to do:
Create a table to map (user_id, keyword_id) to a count of this keyword in messages of this user. Let's call it t_keyword_count.
Each time you receive a message, before you save the message into the database, search it for all the keywords you care about (using whatever good text search libraries that account for misspellings, etc.). You should know the (user_id) for this message.
You will, at that point, be ready to add the message to the database, and will have an array of (keyword_id) with keywords that this message will have.
In a transaction, insert the message into the t_msg table, and run update/insert for (user_id,keyword_id) to have value=value+1 (or +n, if you need to count the same keyword more than once in the same message) for the t_keyword_count table.
If you are trying to solve the problem of having to do the above on existing data, you can do this manually, just to build up that t_keyword_count table first (depends on how many keywords you have in total, but even if there are a lot, this can be scripted). But you should change (or mirror) the t_msg.msg_text field to be a field suitable for text search, and use SQL text search functionality to find the keywords.

mysql - extract specific words from text field using full text search

My question is a little simillar to Extract specific words from text field in mysql, but now the same.
I have a text field with words inside. In my language word can have many different endings. I need to find this endings.
I use fulltext search of mysql, but I would need to have access to the index database where all the field is "cut" to words and words are counted. I could then search for "test*" and I could quickly find "test", "tested", "testing". I need the list of all endigns that exist in my database, that is my primary goal.
As it is I can get the records with specific "test*" words in it, but I need not only to locate the occurence in the field, but to group somehow so I get the list of all the words that for example start with "test". I don't need location in which record they are, just a list, grouped so that "testing" is not written 10 times but only once (maybe a counter of how many times it is found but not necessary).
Is there a way to extract this info from fulltextsearch field or should I explode all this fields to words and make a index table full of words and just do a "like "word%" and group by the different results? I am not sure how to do that either in practice, but just to point me to the right direction please.
So to summarize: I have a text fied and I need to find out which words are inside that start with "test", like "tested", "test", "testing" etc... It doesn't make sense in English but in my language it does as we have same word on different endigns and there are so many of them, somethimes 20, I need to find out which ones are there so I can make a synonims table ;-)
UPDATE:
Database has columns ID (int), ingredients (text) and recipe (text).
Data in ingredients are cooking ingredients with different endings like:
1 egg
2 eggs
etc.
You can dump all words that are present in an index. And that would also show frequency of each word. E.g. test is used 200 times and testing is used 300 times.
Manual for that: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html

Generate number id from text/url for fast "SELECT"

I have the following problem:
I have a feed capturer that captures news from different sources every half an hour.
I only insert entries that don't have their URLs already in the database (URL is used to see if the record is already in database).
Even with that, I get some repeated entries, because some sites report the same news (that usually are from a news source like Reuters). I could look for these repeated entries during insertion, but i think this would slow the insertion time even more.
So, I can later find these repeated entries by the title. But I think this search is slow. Then, my idea is to generate a numeric field from the title and then search by this number for repeated titles.
What kind of encoding could I use (I thought in something reverse to base64) to encode the titles?
I'm suposing that searching for repeated numbers is a lot faster than searching for repeated words. Is that true or not?
Do you suggest a better solution for this problem?
Well, I don't care to have the repeated entries in the database, I just don't want to show then to the user. Like google, that filters the repeated results, but shows then if you want.
I hope I explained It well. Thanks in advance.
Fill the MD5 hash of the URL and title and build a UNIQUE index on it:
CREATE UNIQUE INDEX ux_mytable_title_url ON (title_hash, url_hash)
INSERT
INTO mytable (url, title, url_hash, title_hash)
VALUES ('url', 'title', MD5('url'), MD5('title'))
To select like Google (one result per title), use this query:
SELECT *
FROM (
SELECT DISTINCT title_hash
FROM mytable
) md
JOIN mytable mo
ON mo.url_title = md.title_hash
AND mo.url_hash =
(
SELECT url_hash
FROM mytable mi
WHERE mi.title_hash = md.title_hash
ORDER BY
mi.title_hash, mi.url_hash
LIMIT 1
)
so you can use a new table containing only the encoded keys based on title and url, you have then to add a key on it to accelerate search. But i don't think that you can use an effecient algorytm to transform strings to numbers ..
for the encryption use
SELECT MD5(CONCAT('title', 'url'));
and before every insertion you test if the encoded concatenation of title and url exists on this table.
#Quassnoi can explain better than I, but I think there is no visible difference in performance if you use a VARCHAR/CHAR or INT in a index to use it later for GROUPing or other method to find the duplicates. That way you could use the solution proposed by him but use a normal INDEX instead of a UNIQUE index and keep the duplicates in the database, filtering out only when showing to users.

Save a list of user ids to a mysql table

I need to save a list of user ids who viewed a page, streamed a song and / or downloaded it. What I do with the list is add to it and show it. I don't really need to save more info than that, and I came up with two solutions. Which one is better, or is there an even better solution I missed:
The KISS solution - 1 table with the primary key the song id and a text field for each of the three interactions above (view, download, stream) in which there will be a comma separated list of user ids. Adding to it will be just a concatenation operation.
The "best practice" solution - Have 3 tables with the primary key the song id and a field of user id that did the interaction. Each row has one user id and I could add stuff like date and other stuff.
One thing that makes me lean towards options 2 is that it may be easier to check whether the user has already voted on a song?
tl;dr version - Is it better to use a text field to save arrays as comma separated values, or have each item in the array in a separate table row.
Definitely the 2nd:
You'll be able to scale your application as it grows
It will be less programming language dependent
You'll be able to make queries faster and cleaner
It will be less painful for any other programmer coding / debugging your application later
Additionally, I'd add a new table called "operations" with their ID, so you can add different operations if you need later, storing the operation ID instead of a string on each row ("view", "download", "stream").
It's definitely better to have each item in a separate row. Manipulating text fields has performance disadvantages by itself. But if ever you want to find out which songs user 1234 has viewed/listened to/etc., you'd have to do something like
SELECT * FROM songactions WHERE userlist LIKE '%,1234,%' OR userlist LIKE '1234,%' OR userlist LIKE '%,1234' OR userlist='1234';
It'd be just horribly, horribly painful.