Create words based on input like "otb" into "bot" - mysql

I want to show a list of all words that can be created given a certain input. For this, I use all words that contain 12 characters.
I thought of SQL be able to see if letters are within a word. So jecobvizedti can become objectivized but if I would give jecobvizeti (without the d) it can also become objectivizes. This would also be the desired result.
Is there a way to let SQL determine which words match as they contain the characters of the input string. At first, I tried 12 separate fields and store 1 character into each field for the 12 character words. But that didn't make sense at all.
This is case-insensitive as I make sure the input is in lower case.
It should only support letters. No digits, spaces, or punctuation.
As for the performance, a max of 10 sec would be nice. I think SQL can manage that.
Edit: Solution of Ruud Helderman and the question linked doesn't work 100%. If I have the following letters: "beergtoltrot" it comes with a word that contains the letter "c" or with a word that contains the letter "v" which are not in the input string.

If you are doing this for a lot of words and you are not, for example saying that 'otb' should match 'brought'. That is you want exactly the same set of letters.
And how about getting the answer in a few milliseconds?
CREATE TABLE x (
sorted VARCHAR(12) NOT NULL,
real_word VARCHAR(12) NOT NULL,
PRIMARY KEY(real_word),
INDEX(sorted, real_word),
) ENGINE=InnoDB;
sorted contains the letters of real_word but rearranged in sorted order.
`sorted` => `real_word`
'bot' => 'bot' -- this happens to be in order to start with.
'bcdeeiijotvz' => 'objectivized'
'aest' => 'seat'
'aest' => 'teas'
'aest' => 'eats'
'aest' => 'east'
But... You must do this sorting in two places:
As you enter the data.
As you look up the data.
And it is best to do the sorting in your client code. The lookup in the single table a hundred thousand words will be very fast.

You can do this as:
where word like '%o%' and
word like '%b%' and
word like '%t%'
There are tricks for splitting an input string into individual characters. But if you can do this in the application, this shoudl work.

Related

How do I create a SELECT conditional in MySQL where the conditional is the character length of the LIKE match?

I am working on a search function, where the matches are weighted based on certain conditions. One of the conditions I want to add weight to is matches where the character length of the query string in a LIKE match is longer than 4.
This is what I want to the query to look like, roughly. %s is meant to represent the actual match found by LIKE, but I don't think it does. I'm wondering if there is a special variable in MySQL that does represent the precise character match found by LIKE.
SELECT help.*,
IF(CHAR_LENGTH(%s) > 4, 2, 0) w
FROM help
WHERE (
(title LIKE '%this%' OR title LIKE '%testy%' OR title LIKE '%test%') OR
(content LIKE '%this%' OR content LIKE '%testy%' OR content LIKE '%test%')
) LIMIT 1000
edit: I could in the PHP split the search string array into two arrays based on the character length of the elements, with two separate queries that return different values for 'w', then combine the results, but I'd rather not do that, as it seems to me that would be awkward, messy, and slow.
Check out FULLTEXT as another way to discover rows. It will be faster, but won't address your question.
This probably has the effect you want.
SELECT ....
IF ( (title LIKE '%testy%' OR
content LIKE '%testy%'), 2, 0)
....
Note that the "match" in your LIKEs includes the %, so it is the entire length of the string. I don't think that is what you wanted.
REGEXP "(this|testy|that)" will match either 4 or 5 characters (in this example). It may be possible to do something with REGEXP_REPLACE to replace that with the empty string, then see how much it shrank.
I think the answer to my question is that what I wanted to do isn't possible. There is no special variable in MySQL representing the core character match in a WHERE condtional where LIKE is the operator. The match is the contents of the returned data row.
What I did to reach my objective was took the original dynamic list of search tokens, iterated through that list, and performed a search on each token, with the SQL tailored to the conditions that matched each token.
As I did this I built an array of the search results, using the id for the database row as the index for the array. This allowed me to perform calculations with the array elements, while avoiding duplicates.
I'm not posting the PHP code because the original question was about the SQL.

MySql Specific Search - Replace String

I need to search words that contain multiple number prefixes.
Example:
0119
0129
0139
0149
But there is other prefixes, 0155859, 0128889
Etc.
If i search 0%9 it'll come up with all the results i don't want, it'll include the 0155859, 0128889 ones
I need to search and list ONLY the ones that have 0119, etc
How do i do it ?
0XX9 ( Where XX is any strings that matches, so 0119, 0129, etc. % Lists all other characters till a 9 appears, i don't want that. )
I'm trying on my english, correct me if i did'nt expressed myself right !
In a LIKE pattern, the _ character matches any single character. So you can do:
WHERE word LIKE '0__9%'
This matches a word that begins with 0, then any two characters, then 9, then anything after that.
My gut feeling at seeing your question was to consider using REGEXP, which is MySQL's regex matching operator. Try the following query:
SELECT *
FROM yourTable
WHERE word REGEXP '0[0-9][0-9]9'
The pattern used would match any word containing a zero, followed by any two numbers, followed by a 9.

How to find words by partially strings

I have been trying to solve this problem for hours, but I dont know how to approach it, so I would need a push to a right direction.
I want to create a page where users can find the appropriate word, by providing word length and characters.
For example, user wants to find all the 5 letter words, where the second letter is R and fourth V, like this:
_R_V_
I have a table with column WORDS with words "letter", "moon", "drive", "mrive" and the query should return: "drive" and "mrive".
Is it possible to do it in MySQL?
While I was looking for the direction I found that I should create a trie structure. I dont know how to do that, but I will learn it if there is no easier way.
Yes, you can use LIKE :
SELECT * FROM YourTable t
WHERE t.word_col LIKE '_R_V_'
_ Wildcard stands for any single character. This will also force the string to be 5 characters in length, since % wildcard is not used.
You can find a great explanation about LIKE wildcards in the link above.

Delete all characters before and after quotation marks

I have a CSV file, which has two columns and 4500 rows. In one column, I have several phrases that are surrounded in quotation marks. I need to delete all the text that comes before and after the quotations marks.
For example:
How would you say "Hello, my Friend" when speaking outside?
should become "Hello, my Friend"
I also have several rows that have the word NULL in the second column. I need these rows deleted in full.
What's the best way of doing something like this? I have been looking at regular expressions, but I'm not sure if they are flexible enough to do what I want to do, or how you would use them on a CSV file (I need the table structure to remain).
EDIT:
1) At the moment I am just using Apple Numbers, but I know that wont don't it, so I am happy to any suggestions. It must support Kanji characters.
2) I have removed all the NULL rows, so that is no longer needed (I simply added a column of numbers, sorted the table so all the NULLs were together, deleted them and the sorted back by the column of numbers).
Find a text editor that supports regular expression search and replace.
Something like this would match ,NULL in the second column: ^.*,NULL.*$. Replace it with "DELETEMEDELETEME" to mark the line, or as an empty string or find a way to have it match on `\n' or '\r' to catch the line break and remove the entire line completely.
Stripping out parts of the quoted string might work like this:
^(.*,){n}(.*)(\".\")(.*)(,.*)$ replaced with \1\3\5 where n is the number of columns preceding the one you want to edit. Repeat (.*,) if that's not available. It will depend on the regex flavor of your tool.

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually work.
Does it simply start from first character of the string and try matching pattern, one character moving to the right? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the right most character and starts matching, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters in the front will lead your dbms to use a more efficient searching/seeking algorithm. But even at the simplest form, the dbms has to test characters. If it is able to find it doesn't match early on, it can discard it and move onto the next test.
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey' optionally followed by any number of characters. The % is not case sensitive and not accent sensitive and you can use multiple %, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' (or 'Mouse' or 'mousé') in it.
The % is inclusive, meaning that
WHERE name like '%A%'
will return all that starts with an 'A', contain 'A' or end with 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
it will find any value beginning with letter between c and f, inclusive. Meaning it will return any value that start with c, d, e or f. This [] is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with the 3rd position to be a '-' (dash) use two underscores:
WHERE NAME '__-%'
It will return for example: 'Lo-Res'
Finding the values with names ends in 'xyz' use:
WHERE name LIKE '%xyz'
returns anything that ends with 'xyz'
Finding a % sign in a name use brackets:
WHERE name LIKE '%[%]%'
will return for example: 'Top%Movies'
Searching for [ use brackets around it:
WHERE name LIKE '%[[]%'
gives results as: 'New York [NY]'
The database collation's sort order determines both case sensitivety and the sort order for the range of characters. You can optionally use COLLATE to specify collation sort order used by the LIKE operator.
Usually the main performance bottleneck is IO. The efficiency of the LIKE operator can be only important if your whole table fits in the memory otherwise IO will take most of the time.
AFAIK oracle can use indexes for prefix matching. (like 'abc%'), but these index cannot be used for more complex expressions.
Anyway if you have only this kind of queries you should consider using a simple index on the related column. (Probably this is true for other RDBMS's as well.)
Otherwise LIKE operator is generally slow, but most of the RDBMS have some kind of full text searching solution. I think the main reason of the slowness is that LIKE is too general. Usually full text indexes has lots of different options which can tell the database what you really want to search for, and with these additional information the DB can do its task in a more efficient way.
As a rule of thumb I think if you want to search in a text field and you think performance can be an issue, you should consider your RDBMS's full text searching solution, or the real goal is not text searching, but this is some kind of "design side effect", for example xml/json/statuses stored in a field as text, then probably you should consider choosing a more efficient data storing option. (if there is any...)