Selecting X words from a text field in MySQL - mysql

I'm building a basic search functionality, using LIKE (I'd be using fulltext but can't at the moment) and I'm wondering if MySQL can, on searching for a keyword (e.g. WHERE field LIKE '%word%') return 20 words either side of the keyword, as well?

You can do it all in the query using SUBSTRING_INDEX
CONCAT_WS(
' ',
-- 20 words before
TRIM(
SUBSTRING_INDEX(
SUBSTRING(field, 1, INSTR(field, 'word') - 1 ),
' ',
-20
)
),
-- your word
'word',
-- 20 words after
TRIM(
SUBSTRING_INDEX(
SUBSTRING(field, INSTR(field, 'word') + LENGTH('word') ),
' ',
20
)
)
)

Use the INSTR() function to find the position of the word in the string, and then use SUBSTRING() function to select a portion of characters before and after the position.
You'd have to look out that your SUBSTRING instruction don't use negative values or you'll get weird results.
Try that, and report back.

I don't think its possible to limit the number of words returned, however to limit the number of chars returned you could do something like
SELECT SUBSTRING(field_name, LOCATE('keyword', field_name) - chars_before, total_chars) FROM table_name WHERE field_name LIKE "%keyword%"
chars_before - is the number of
chars you wish to select before the
keyword(s)
total_chars - is the
total number of chars you wish to
select
i.e. the following example would return 30 chars of data staring from 15 chars before the keyword
SUBSTRING(field_name, LOCATE('keyword', field_name) - 15, 30)
Note: as aryeh pointed out, any negative values in SUBSTRING() buggers things up considerably - for example if the keyword is found within the first [chars_before] chars of the field, then the last [chars_before] chars of data in the field are returned.

I think your best bet is to get the result via SQL query and apply a regular expression programatically that will allow you to retrieve a group of words before and after the searched word.
I can't test it now, but the regular expression should be something like:
.*(\w+)\s*WORD\s*(\w+).*
where you replace WORD for the searched word and use regex group 1 as before-words, and 2 as after-words
I will test it later when I can ask my RegexBuddy if it will work :) and I will post it here

Related

How to find variable pattern in MySql with Regex?

I am trying to pull a product code from a long set of string formatted like a URL address. The pattern is always 3 letters followed by 3 or 4 numbers (ex. ???### or ???####). I have tried using REGEXP and LIKE syntax, but my results are off for both/I am not sure which operators to use.
The first select statement is close to trimming the URL to show just the code, but oftentimes will show a random string of numbers it may find in the URL string.
The second select statement is more rudimentary, but I am unsure which operators to use.
Which would be the quickest solution?
SELECT columnName, SUBSTR(columnName, LOCATE(columnName REGEXP "[^=\-][a-zA-Z]{3}[\d]{3,4}", columnName), LENGTH(columnName) - LOCATE(columnName REGEXP "[^=\-][a-zA-Z]{3}[\d]{3,4}", REVERSE(columnName))) AS extractedData FROM tableName
SELECT columnName FROM tableName WHERE columnName LIKE '%___###%' OR columnName LIKE '%___####%'
-- Will take a substring of this result as well
Example Data:
randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc
In this case, the desired string is "xyz123" and the location of said pattern is variable based on each entry.
EDIT
SELECT column, LOCATE(column REGEXP "([a-zA-Z]{3}[0-9]{3,4}$)", column), SUBSTR(column, LOCATE(column REGEXP "([a-zA-Z]{3}[0-9]{3,4}$)", column), LENGTH(column) - LOCATE(column REGEXP "^.*[a-zA-Z]{3}[0-9]{3,4}", REVERSE(column))) AS extractData From mainTable
This expression is still not grabbing the right data, but I feel like it may get me closer.
I suggest using
REGEXP_SUBSTR(column, '(?<=[&?]random_code=[^&#]{0,256}-)[a-zA-Z]{3}[0-9]{3,4}(?![^&#])')
Details:
(?<=[&?]random_code=[^&#]{0,256}-) - immediately on the left, there must be & or &, random_code=, and then zero to 256 chars other than & and # followed with a - char
[a-zA-Z]{3} - three ASCII letters
[0-9]{3,4} - three to four ASCII digits
(?![^&#]) - that are followed either with &, # or end of string.
See the online demo:
WITH cte AS ( SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz123&hello_world=us&etc_etc' val
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz4567&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz89&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-xyz00000&hello_world=us&etc_etc'
UNION ALL
SELECT 'randomwebsite.com/3982356923abcd1ab?random_code=12480712_ABC_DEF_ANOTHER_CODE-aaaaa11111&hello_world=us&etc_etc')
SELECT REGEXP_SUBSTR(val,'(?<=[&?]random_code=[^&#]{0,256}-)[a-zA-Z]{3}[0-9]{3,4}(?![^&#])') output
FROM cte
Output:
I'd make use of capture groups:
(?<=[=\-\\])([a-zA-Z]{3}[\d]{3,4})(?=[&])
I assume with [^=\-] you wanted to capture string with "-","\" or "=" in front but not include those chars in the result. To do that use "positive lookbehind" (?<=.
I also added a lookahead (?= for "&".
If you'd like to fidget more with regex I recommend RegExr

MS SQL Query to remove prefixes

I have below like column values and would like to exclude the characters as well as the hyphen and only return digits. The replace function is not entirely helpful as sometimes the character length is 3 and sometimes its 4, see below as the digit length changes as well.
abc-1234567
sdfr-9876540
try-12345678
case-098765
If you want the part after the last hyphen, you can use substring_index():
select substring_index(col, '-', -1)
You can also extract the digits at the end using regexp_substr():
select regexp_substr(col, '[0-9]+$')

MySql Substring Index, Find and replace characters

I need to find the first and second "_" and extract whatever is between.
example data
doc_856_abc_123
doc_876_xyz_999
So far I have the following substring query. But I need help
select SUBSTRING_INDEX( column, '_', 2 )
It is outputting
doc_856
doc_867
How do I combine the above query to maybe another substring go get the desired results. Which would be.
856
867
Just apply SUBSTRING_INDEX again on the resulted string
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(column, '_', 2 ), '_', -1)

SQL count number of words in field

I'd like to make an SQL query where the condition is that column1 contains three or more words. Is there something to do that?
maybe try counting spaces ?
SELECT *
FROM table
WHERE (LENGTH(column1) - LENGTH(replace(column1, ' ', ''))) > 1
and assume words is number of spaces + 1
If you want a condition that a column contains three or more words and you want it to work in a bunch of databases and we assume that words are separated by single spaces, then you can use like:
where column1 like '% % %'
I think David nailed it above. However, as a more complete answer:
LENGTH(RTRIM(LTRIM(REPLACE(column1,' ', ' ')))) - LENGTH(REPLACE(RTRIM(LTRIM(REPLACE(column1, ' ', ' '))), ' ', '')) + 1 AS number_of_words
This will remove double spaces, as well as leading and trailing spaces in your string.
Of course, you may go further by adding replacements for more than 2 spaces in a row...
In Postgres you can use regexp_split_to_array() for this:
select *
from the_table
where array_length(regexp_split_to_array(the_column, '\s+'), 1) >= 3;
This will split the contents of the column the_column into array elements. One ore more whitespace are used as the delimiter. It won't respect "quoted" spaces though. The value 'one "two three" four' will be counted as four words.
The best way to do this, is to NOT do this.
Instead, you should use the application layer to count the words during INSERT and save the word count into its own column.
While I like, and upvoted, some of the answers here, all of them will be very slow and not 100% accurate.
I know people want a simple answer to SELECT the word count, but it just is NOT POSSIBLE with accuracy and speed.
If you want it to be 100% accurate, and very fast, then use this solution.
Steps to solve:
Add a column to your table and index it: ALTER TABLE tablename ADD COLUMN wordcount INT UNSIGNED NULL, ADD INDEX idxtablename_count (wordcount ASC);.
Before doing your INSERT, count the number of words using your application. For example in PHP: $count = str_word_count($somevalue);
During the INSERT, include the value of $count for the column wordcount like insert into tablename (col1, col2, col3, wordcount) values (val1, val2, val3, $count);
Then your select statement becomes super easy, clean, uber-fast, and 100% accurate.
select * from tablename where wordcount >= 3;
Also remember when you are updating any rows that you will need to recount the words for that column.
For "n" or more words
select *
from table
where (length(column)- length(replace(column, " ", "")) + 1) >= n
PS: This would not work if words have multiple spaces between them.
With ClickHouse DB You can use splitByWhitespace() function.
Refer : https://clickhouse.com/docs/en/sql-reference/functions/splitting-merging-functions#splitbywhitespaces
None of the other answers seem to take multiple spaces into account. For example, a lot of people use two spaces between sentences; these space-counters would count an extra word per sentence. "Also, scenarios such as spaces around a hyphen - like that. "
For my purposes, this was far more accurate:
SELECT
LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}',' ')) -
LENGTH(REGEXP_REPLACE(myText, '[ \n\t\|\-]{1,}', '')) wordCount FROM myTable;
It counts any sets of 1 or more consecutive characters from any of: [space, linefeed, tab, pipe, or hyphen] and counts it as one word.
This can work:
SUM(LENGTH(a) - LENGTH(REPLACE(a, ' ', '')) + 1)
Where a is the string column. It will count the number of spaces, which is 1 less than the number of words.
To handle multiple spaces too, use the method shown here
Declare #s varchar(100)
set #s=' See how many words this has '
set #s=ltrim(rtrim(#s))
while charindex(' ',#s)>0
Begin
set #s=replace(#s,' ',' ')
end
select len(#s)-len(replace(#s,' ',''))+1 as word_count
https://exploresql.com/2018/07/31/how-to-count-number-of-words-in-a-sentence/

Sort by the number of words in a MySQL field

I would like to sort a result from a MySQL query by the number of words in a specified column.
Something like this:
SELECT bar FROM foo ORDER BY WordCountFunction(bar)
Is it possible?
As far as I know, there is no word count function in MySQL, but you can count the number of spaces and add one if your data is formatted properly (space separator for words, no spaces at beginning/end of entry).
Here is the query listing longest words first:
SELECT bar FROM foo ORDER BY (LENGTH(bar) - LENGTH(REPLACE(bar, ' ', ''))+1) DESC
Using this method to count the number of words in a column, your query would look like this:
SELECT bar FROM foo ORDER BY (LENGTH(bar) - LENGTH(REPLACE(bar, ' ', ''))+1) DESC
Yes, sort of. It won't be 100% accurate though:
SELECT SUM(LENGTH(bar) - LENGTH(REPLACE(bar, ' ', ''))+1) as count
FROM table
ORDER BY count DESC
But this assumes the words are separated by a space ' ' and doesn't account for punctuation. You can always replace it with another char and it doesn't account for double spaces or other chars either.
For complete accuracy, you could always pull the result out and word-count in your language of choice - where accurate word-count function do exist!
Hope this helps.