I am optimizing my query because full-text search returns irrelevant results when the text contains numbers and repetitive keywords.
What I want to do is extract numbers from the text and add X points to the relevance score when sorting the results.
Everything works smoothly except for one thing:
When I want to extract and prioritize results with number Z, it also counts other numbers that contain Z anywhere in them.
For example:
Sample Data
###############
Text 55.A
Text 55_B
Text #55ABC
Text 551234.
Text 55677#
Text 556
Query
###############
... CASE WHEN (myTable.title like "% 55%") THEN ...
Expected output
###############
Text 55.A
Text 55_B
Actual output
###############
Text 55.A
Text 55_B
Text #55ABC
Text 551234.
Text 55677#
Text 556
How could I use REGEXP with LIKE? There can be symbols and characters after the number I have given.
Thanks in advance
You may use
REGEXP '([[:<:]]|_)55([[:>:]]|_)'
If you are using MySQL 8.x or newer, which uses the ICU regex library, use
REGEXP '(\\b|_)55(\\b|_)'
See the regex demo
The (\\b|_) matches a word boundary or a _, the ([[:<:]]|_) matches a starting word boundary or _ and ([[:>:]]|_) matches a trailing word boundary or _.
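As a quick sanity check outside the database, the ICU-style pattern can be run against the sample rows from the question. This is only a sketch in Python: its re module has no [[:<:]] or [[:>:]], and I am assuming its \b behaves like ICU's for this ASCII data. The names PATTERN, titles and matches are made up for the illustration.

```python
import re

# Same idea as REGEXP '(\\b|_)55(\\b|_)': the number must be delimited
# on each side by a word boundary or an underscore.
PATTERN = re.compile(r"(\b|_)55(\b|_)")

# Sample rows copied from the question.
titles = [
    "Text 55.A",
    "Text 55_B",
    "Text #55ABC",
    "Text 551234.",
    "Text 55677#",
    "Text 556",
]

# Keep only the rows where 55 appears as a standalone number.
matches = [t for t in titles if PATTERN.search(t)]
print(matches)
```

Only the two rows listed under "Expected output" should survive the filter.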
Related
I'm building a word unscrambler using MySQL. Think of it like Scrabble: there is a string representing the letter tiles, and the query should return all words that can be constructed from those letters. I was able to achieve that using this query:
SELECT * FROM words
WHERE word REGEXP '^[hello]{2,}$'
AND NOT word REGEXP 'h(.*?h){1}|e(.*?e){1}|l(.*?l){2}|l(.*?l){2}|o(.*?o){1}'
The first part of the query makes sure that the output words are constructed from the letter tiles; the second part takes care of how often each letter may occur. So the above query will return words like hello, hell, hole, etc.
My issue is when there is a blank tile (a wildcard). For example, if the string were "he?lo", the "?" can be replaced with any letter, so the output would also include helio and helot.
Can someone suggest a modification to the query that supports the wildcards and still takes care of the occurrences? (There can be up to 2 blank tiles.)
I've got something that comes close. With a single blank tile, use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
with 2 blank tiles use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
The . in the first regexp allows a character that isn't one of the tiles with a letter on it.
The only problem with this is that the second regexp prevents duplicates of the lettered tiles, but a blank should be allowed to duplicate one of the letters. I'm not sure how to fix this. You could add 1 to the counts in {}, but then it would allow you to duplicate multiple letters even though you only have one blank tile.
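If the candidate words can be checked in application code rather than in SQL, the duplicate problem goes away by counting letters instead of using regexes. The sketch below is not the query above; it is a different technique (multiset counting), and the function name can_build and the "?" blank marker are assumptions taken from the question's "he?lo" example.

```python
from collections import Counter


def can_build(word: str, tiles: str, blank: str = "?") -> bool:
    """Return True if `word` can be spelled from `tiles`, using blanks as wildcards."""
    rack = Counter(tiles)
    blanks = rack.pop(blank, 0)  # number of wildcard tiles on the rack
    needed = Counter(word)
    # Counter subtraction keeps only positive counts, i.e. the letters
    # the word needs beyond what the lettered tiles provide.
    shortfall = sum((needed - rack).values())
    # The original query requires at least 2 letters ({2,}).
    return len(word) >= 2 and shortfall <= blanks


print(can_build("helio", "he?lo"))   # blank stands in for "i"
print(can_build("hello", "he?lo"))   # blank duplicates the "l"
print(can_build("helios", "he?lo"))  # needs two blanks, only one available
```

Because the shortfall is summed over all letters, one blank can duplicate a lettered tile without allowing several letters to be duplicated at once.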
A possible starting point:
Sort the letters in the words; sort the letters in the tiles (e.g., "ehllo", "acer", "aerr").
That will avoid some of the ORing, but still has other complexities.
If this is really Scrabble, what about the need to attach to an existing letter or letters? And do you primarily want to find a way to use all 7 letters?
I'm trying to write a query to identify which rows have special characters in them, but I want it to ignore spaces.
So far I've got
SELECT word FROM `games_hangman_words` WHERE word REGEXP '[^[:alnum:]]'
Currently this matches rows containing any special character; what I want is to ignore the special character if it is a space.
So if I have these rows
Alice
4 Kings
Another Story
Ene-tan
Go-Busters Logo
Lea's Request
I want it to match
Ene-tan, Go-Busters Logo and Lea's Request
Simply extend your class.
... WHERE word REGEXP '[^[:alnum:] ]' ...
for only a "regular" space (ASCII 32) or
... WHERE word REGEXP '[^[:alnum:][:space:]]' ...
for all kinds of whitespace characters.
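To see the effect of adding the space to the negated class, here is a small Python check against the rows from the question. Python's re does not support POSIX classes such as [:alnum:], so the sketch assumes an ASCII equivalent (A-Za-z0-9); the names SPECIAL, rows and flagged are just for the illustration.

```python
import re

# ASCII stand-in for MySQL's '[^[:alnum:] ]':
# any character that is neither a letter, a digit, nor a plain space.
SPECIAL = re.compile(r"[^A-Za-z0-9 ]")

# Rows copied from the question.
rows = [
    "Alice",
    "4 Kings",
    "Another Story",
    "Ene-tan",
    "Go-Busters Logo",
    "Lea's Request",
]

# Rows containing at least one special character other than a space.
flagged = [r for r in rows if SPECIAL.search(r)]
print(flagged)
```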
I am trying to extract the last date from a string with regexp_substr.
How can I do it?
select regexp_substr('08.09.11 some text around
10.10.13 AP ab 16.10.13 some text around
13.08.2014 some text around.
01.09.2014 some text around
07.11.2014 some text. around
10.02.15 some text. around
11.02.15 some text around . (tp)',
'[0-9]+.[0-9]+.[0-9]+') as test
My actual result is the first date (08.09.11).
Thanks a lot!
One approach here is to use REGEXP_REPLACE with a capture group:
SELECT REGEXP_REPLACE('08.09.11 some text around
10.10.13 AP ab 16.10.13 some text around
13.08.2014 some text around.
01.09.2014 some text around
07.11.2014 some text. around
10.02.15 some text. around
11.02.15 some text around . (tp)',
'^[\\s\\S]*\\s(\\d+\\.\\d+\\.\\d+)[\\s\\S]*$',
'\\1') AS test
Demo
Here is an explanation of the regex pattern used:
^ from the start of the input
[\\s\\S]* match all content, across newlines
\\s until reaching the LAST whitespace
(\\d+\\.\\d+\\.\\d+) which is followed by the last date
(and capture this date in \1)
[\\s\\S]* consume remainder of input
$ end of the input
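The greedy-prefix idea can be tried outside the database as well. The following Python sketch applies the same pattern with re.sub and, for comparison, simply takes the last element of re.findall; it assumes the dots are meant as literal dots. The variable names are made up for the example.

```python
import re

# Sample input copied from the question.
text = """08.09.11 some text around
10.10.13 AP ab 16.10.13 some text around
13.08.2014 some text around.
01.09.2014 some text around
07.11.2014 some text. around
10.02.15 some text. around
11.02.15 some text around . (tp)"""

DATE = r"\d+\.\d+\.\d+"  # digits, literal dot, digits, literal dot, digits

# Variant 1: collect every date and keep the last one.
last_by_findall = re.findall(DATE, text)[-1]

# Variant 2: same shape as the REGEXP_REPLACE pattern. The greedy
# [\s\S]* swallows as much as possible, so the group captures the last date.
last_by_sub = re.sub(r"^[\s\S]*\s(" + DATE + r")[\s\S]*$", r"\1", text)

print(last_by_findall, last_by_sub)
```

Note that the second variant relies on whitespace preceding the date, so it would not find a date that is the very first thing in the string.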
I would like to replace the text in a Google Doc. At the moment I have place markers as follows:
Invoice ##invoiceNumber##
I replace the invoice number with
body.replaceText('##invoiceNumber##',invoiceNumber);
That works fine, but I can only run the script once, as obviously ##invoiceNumber## is no longer in the document afterwards. I was thinking I could replace the text after Invoice, as this will stay the same. appendParagraph looks like it might do the trick, but I can't figure it out. I think something like body.appendParagraph("Invoice") would select the area? I'm not sure how to append to this after that.
You could try something like this I think:
body.replaceText('Invoice \\w{1,9} ', 'Invoice ' + invoiceNumber + ' ');
I don't know how big your invoice numbers are, but that will accept from 1 to 9 word characters preceded by a space and followed by a space. That pattern might have to be modified depending upon your textual needs.
Word Characters [A-Za-z0-9_]
If your invoice numbers are unique enough perhaps you could just replace them.
Reference
Regular Expression Syntax
Note: the regex pattern is passed as a string rather than a regular expression
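To illustrate why matching on the fixed word "Invoice" makes the replacement repeatable, here is a plain Python sketch of the pattern logic. It is not Apps Script code and does not use the Document service; the function name set_invoice_number, the sample text, and the extra alternative for the original ##invoiceNumber## marker are assumptions added for the example.

```python
import re

# Match "Invoice " followed by either the original place marker or a
# previously inserted number of 1 to 9 word characters.
INVOICE = re.compile(r"Invoice (?:##invoiceNumber##|\w{1,9}\b)")


def set_invoice_number(text: str, invoice_number: str) -> str:
    """Replace the marker or an earlier invoice number after 'Invoice '."""
    # A function is used as the replacement so that backslashes in the
    # number are never interpreted as group references.
    return INVOICE.sub(lambda m: "Invoice " + invoice_number, text)


doc = "Invoice ##invoiceNumber## Date: today"
first = set_invoice_number(doc, "A1001")
second = set_invoice_number(first, "A1002")
print(first)
print(second)
```

The second call still finds something to replace because the pattern keys on the word "Invoice" rather than on the marker alone.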
When you export response data from Qualtrics as a CSV, the 2nd row of the data contains strings with the question stem (shortened if necessary), followed by a dash, followed by that response column's corresponding choice. As an example, if my question were "Please select all of the fruit you enjoy:", in my response data the second row of a response column to this question might contain something like "Please select all of the fruit you enjoy:-Blueberries".
Qualtrics shortens the question stem if it is longer than 100 characters. If it is more than 100 characters, the stem is cut off after the 99th character, "..." is appended, and then the dash, and then the choice text.
I am trying to retrieve the text that is after this dash. However, that's difficult, because both the choice text and the question text could contain dashes. I have thought of two different approaches I could take in attempting to select just the choice text:
1. I have the question text and can reliably retrieve it programmatically based on the response column name. However, it doesn't always match exactly, because Qualtrics removes any HTML styling from the question text in the response data but not in the Qualtrics survey file (QSF) that I am getting the question text from. For questions without HTML styling, I was thinking about using the question text to match up to and including the dash between the question text and the choice text. I think regex could handle this case fine, but it clearly doesn't work without heavy modification for questions that have HTML components.
2. The alternative, which I think might be more reliable: strip any HTML tags from the question text in the QSF file, and then count how many "-" characters appear in it. Call that n, then match the 2nd-row response entry up to the (n+1)th dash, remove that part, and what remains is my choice text.
I think the 2nd option is much more likely to work consistently, since the first option leaves me with a case where I have to try and strip html from the question text in exactly the same way Qualtrics does, unless I use fuzzy matching (which I know nothing about). However, the second option is also unclear to me.
an example csv response set
For example, the first question's question text looks like this in the QSF:
"<div style=\"text-align: center;\">Click to write the question text
<span style=\"font-size: 10.8333px;\">thsi<sup>tasdf<em>werasfd</em></sup>
<em>sdfad</em></span><br />\n </div>"
I would appreciate both of the following: advice on which option (or a suggestion for another) you think has the most chance for success, and help with the regex in R for matching the text up to the n+1th "-" character.
Here's a solution that counts the dashes in the question, locates the nth dash in the text (if any) and drops the preceding characters, and then keeps the substring that follows the next dash in the text.
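stem_text <- "Please--select your extracurriculars"
s <- "<em>Please</em>--select your extracurriculars-student-athletics"
# count dashes in question stem (gregexpr returns -1 when there is no match)
stem_dashes <- gregexpr("-", stem_text, fixed = TRUE)[[1]]
stem_dash_n <- sum(stem_dashes > 0)
# locate dashes in string
s_dashes <- gregexpr("-", s, fixed = TRUE)[[1]]
# start right after the nth dash, or at the beginning if the stem has no dashes
sub_start <- if (stem_dash_n > 0) s_dashes[stem_dash_n] + 1 else 1
s_sub <- substr(s, sub_start, nchar(s))
# drop everything up to and including the next dash
sub("^[^-]*-(.*)$", "\\1", s_sub, perl = TRUE)
# [1] "student-athletics"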
Assumptions: based on your description, length(s_dashes) >= stem_dash_n, so s_dashes[stem_dash_n] exists; the same number of dashes appear in the known stems and their representations in the text; and there is always a dash separating the stem and response choice.
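For comparison, the same dash-counting idea can be sketched in Python with a bounded split. The function name choice_text is hypothetical, and the truncation handling simply follows the question's description (stems longer than 100 characters are cut after the 99th character and "..." is appended); it has not been checked against real Qualtrics exports.

```python
def choice_text(stem: str, header: str) -> str:
    """Return the choice text that follows the stem in a second-row header."""
    # Per the question, long stems are cut after the 99th character,
    # so only dashes in that prefix appear in the exported header.
    if len(stem) > 100:
        stem = stem[:99]
    n = stem.count("-")
    # Split on at most n + 1 dashes: n inside the stem plus the separator.
    parts = header.split("-", n + 1)
    if len(parts) < n + 2:
        raise ValueError("header contains fewer dashes than expected")
    return parts[n + 1]


stem_text = "Please--select your extracurriculars"
s = "<em>Please</em>--select your extracurriculars-student-athletics"
print(choice_text(stem_text, s))
```

As with the R version, this assumes the stem already has its HTML tags stripped and that the header contains the same number of dashes as the stem before the separator.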