I am writing a custom search engine for my website and am trying to use MySQL's REGEXP feature. I would like to search for a whole word delimited by spaces, so that I don't also match the word with a prefix or suffix attached. For example, if I search for "appreciate", I want appreciate, not appreciated, unappreciate, or unappreciated. Any ideas on how I could do this with MySQL's REGEXP? My idea was to look for spaces, maybe like so:
^appreciate$|^appreciate[:space:]|[:space:]appreciate$|[:space:]appreciate[:space:]
I am sure there is a better way of doing it, and I have no idea if that even works.
I think what you want is something like this:
SELECT 'I appreciate you' REGEXP '[[:<:]]appreciate[[:>:]]'; /* matches */
[[:<:]] and [[:>:]] are word boundaries. From the manual:
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_).
Edit: just to clarify, this also deals with situations where there's a newline character after the word, or a comma, etc
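To illustrate the word-boundary idea outside MySQL, here is a minimal JavaScript sketch. It assumes JavaScript's \b, which plays roughly the same role as [[:<:]] and [[:>:]]; wholeWord is a hypothetical helper name, not part of any library. (MySQL 8.0.4 and later use the ICU regex library, where the boundary marker is also written \b, typed as '\\b' in a SQL string literal, instead of [[:<:]] and [[:>:]].)

```javascript
// Hypothetical helper: builds a whole-word matcher for a literal search term.
function wholeWord(term) {
  // Escape regex metacharacters so the term is matched literally.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('\\b' + escaped + '\\b');
}

const re = wholeWord('appreciate');
console.log(re.test('I appreciate you'));             // true
console.log(re.test('I appreciated it'));             // false (suffix)
console.log(re.test('an unappreciate remark'));       // false (prefix)
console.log(re.test('Much to appreciate,\nreally'));  // true (comma, newline)
```

The last case is the one the edit above mentions: punctuation or a newline after the word still counts as a boundary.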
What about:
^\s*appreciate(\s+.*)*$
Between the start and the word there may be 0+ whitespace parts
then comes the word
then if something comes after that, it has to start with whitespace
You can search for non-alphabetic characters:
[^[:alpha:]]+
... or just word boundaries:
[[:<:]]foo[[:>:]]
Before making a choice, don't forget to run some tests with commas, dots and non-English characters. Also, take into account that MySQL does not fully support regular expressions in multi-byte strings (such as UTF-8).
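As an example of why the non-English test matters, here is a small JavaScript sketch (not MySQL; the behaviour of your MySQL version and collation has to be checked separately). It assumes JavaScript's \b, which only treats ASCII letters, digits and underscore as word characters, and compares it with a Unicode-aware lookaround. The sample strings are made up for illustration.

```javascript
// ASCII-only \b: "é" is not a word character, so a boundary is seen inside "café".
const asciiBoundary = /\bcaf\b/;
console.log(asciiBoundary.test('un caf\u00e9 noir'));   // true (false positive)

// Unicode-aware alternative: no letter, digit, or underscore on either side.
const unicodeBoundary = /(?<![\p{L}\p{N}_])caf(?![\p{L}\p{N}_])/u;
console.log(unicodeBoundary.test('un caf\u00e9 noir')); // false
console.log(unicodeBoundary.test('caf, then'));         // true
```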
Related
I'm building a word unscrambler using MySQL. Think of it like Scrabble: there is a string representing the letter tiles, and the query should return all words that can be constructed from those letters. I was able to achieve that using this query:
SELECT * FROM words
WHERE word REGEXP '^[hello]{2,}$'
AND NOT word REGEXP 'h(.*?h){1}|e(.*?e){1}|l(.*?l){2}|l(.*?l){2}|o(.*?o){1}'
The first part of the query makes sure that the output words are built only from the letter tiles; the second part limits how many times each letter may occur. So the above query will return words like hello, hell, hole, etc.
My issue is when there is a blank tile (a wildcard). For example, if the string was "he?lo", the "?" can be replaced with any letter, so the output would include helio and helot.
Can someone suggest a modification to the query that makes it support the wildcards while still taking care of the occurrences? (There can be up to 2 blank tiles.)
I've got something that comes close. With a single blank tile, use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
With 2 blank tiles, use:
SELECT * FROM words
WHERE word REGEXP '^[acre]*.[acre]*.[acre]*$'
AND word NOT REGEXP 'a(.*?a){1}|r(.*?r){1}|c(.*?c){1}|e(.*?e){1}'
The . in the first regexp allows a character that isn't one of the tiles with a letter on it.
The only problem with this is that the second regexp prevents duplicates of the lettered tiles, but a blank should be allowed to duplicate one of the letters. I'm not sure how to fix this. You could add 1 to the counts in {}, but then it would allow you to duplicate multiple letters even though you only have one blank tile.
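One way around that limitation is to do the occurrence check in application code instead of a regex. This is a different technique from the queries above, shown as a rough JavaScript sketch: count the lettered tiles, and let each blank cover one letter that the lettered tiles cannot. canBuild and the sample word list are hypothetical, and the minimum-length rule ({2,}) from the original query is left out.

```javascript
// Hypothetical helper: can `word` be built from `tiles`, where "?" is a blank?
function canBuild(word, tiles) {
  const counts = new Map();
  let blanks = 0;
  for (const t of tiles) {
    if (t === '?') blanks += 1;
    else counts.set(t, (counts.get(t) || 0) + 1);
  }
  for (const ch of word) {
    const left = counts.get(ch) || 0;
    if (left > 0) counts.set(ch, left - 1);  // use a lettered tile
    else if (blanks > 0) blanks -= 1;        // fall back to a blank
    else return false;                       // nothing left to cover this letter
  }
  return true;
}

const words = ['hello', 'helio', 'helot', 'hell', 'hollow'];
console.log(words.filter((w) => canBuild(w, 'he?lo')));
// [ 'hello', 'helio', 'helot', 'hell' ]
```

In practice the first REGEXP could still be used in SQL to narrow the candidates, with a check like this applied to the rows that come back.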
A possible starting point:
Sort the letters in the words; sort the letters in the tiles (eg, "ehllo", "acer", "aerr").
That will avoid some of the ORing, but still has other complexities.
If this is really Scrabble, what about the need to attach to an existing letter or letters? And do you primarily want to find a way to use all 7 letters?
What is the correct pattern for a text input to only allow uppercase letters, lowercase letters, and commas?
I know that this is correct for the letters:
pattern="[a-zA-Z]"
but I don't know how to allow commas.
Thanks for any help!
Short answer:
pattern="^[a-zA-Z,]*$"
A couple of comments:
* means zero or more characters, which means this pattern will allow empty fields as well. If you want to guarantee that it will contain at least one character, use + instead of *.
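^ means beginning of the string and $ is the end. In a plain regex without them, something like this would be accepted: "!#123asdSDADS,,,21312312(2". The HTML pattern attribute itself has to match the whole value, so there the anchors are redundant but harmless.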
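The effect of the quantifier and the anchors can be tried in plain JavaScript, where nothing is anchored implicitly. A minimal sketch; the variable names are only for illustration:

```javascript
const allowEmpty = /^[a-zA-Z,]*$/;  // zero or more letters/commas
const requireOne = /^[a-zA-Z,]+$/;  // at least one letter/comma
const unanchored = /[a-zA-Z,]+/;    // matches anywhere in the string

console.log(allowEmpty.test(''));         // true
console.log(requireOne.test(''));         // false
console.log(requireOne.test('abc,DEF'));  // true
console.log(requireOne.test('abc 123'));  // false
console.log(unanchored.test('!#123asdSDADS,,,21312312(2')); // true
```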
Often in coding languages, there is an escape character which either
makes the next character interpreted literally or
makes the next character interpreted as code within a string.
Is there such an escape character in HTML, or do I need Javascript to do so?
Searching both the internet and stackoverflow yielded no results.
I assume what you're talking about is the difference between including something like "<" as part of a tag such as <div> and as just a string to symbolize 'less than'. That is, the escape for "<" would be &lt;. If so, you can find a full list of escapes here. No JavaScript is required.
Hope this helped.
As far as I know, all escape sequences (character references) begin with & and end with ;. Named ones look like &lt; and numeric ones begin with &#, for example &#60;. The actual sequence depends on the character you're writing. Here are some references for you:
Further explanation: http://www.w3.org/International/questions/qa-escapes
List of escape characters: http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php?PHPSESSID=8cbbddde9a9c9825467546f1c98fe119
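No JavaScript is needed when you write the markup by hand, but if the text is inserted by a script, the same references can be applied programmatically. A minimal sketch, assuming only the five characters that are significant in HTML text and attribute values need replacing; escapeHtml is a hypothetical helper, not a built-in:

```javascript
// Hypothetical helper: replaces the characters that are significant in HTML
// with their named or numeric character references.
function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => map[ch]);
}

console.log(escapeHtml('<div> means "less than" & more'));
// &lt;div&gt; means &quot;less than&quot; &amp; more
```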
I'm trying to create a regex for an HTML5 input so a user can only insert alpha characters that may be in a name. So characters from a-z, but also including ö, ü, â, æ and so on, whilst also allowing whitespace and hyphens.
I have played around with some patterns but nothing seems to work correctly. This is what I have so far: <input type="text" name="firstname" pattern="[a-zA-Z\x7f-\xff] " title="">
Does anyone have a quick answer for this?
Since the HTML5 pattern attribute uses the same regex syntax as JavaScript, there is no simple way to refer to all alphabetic characters. You would need to write a rather huge expression (and to update it as new alphabetic characters are added to Unicode). You would need to start from the Unicode character database and the definition of General Category of characters there, or rely on someone having done that for you.
However, for your practical purposes, testing for “alpha characters that may be in a name” is even more complex. There are non-alphabetic characters used in names, such as left single quotation mark (‘) in addition to normal quotation mark (’), and who knows what characters there might be? If this is about people’s real names, it is very difficult to impose restrictions that do not discriminate. If this is about user names in a system, for example, you can define the repertoire as you like, but [a-zA-Z\x7f-\xff] does not look adequate (it includes some control characters and some non-alphabetic characters and excludes many Latin letters commonly used in Europe).
There is a very simple method to apply all your RegEx logic (the kind one can apply easily in English) to any language using Unicode.
For matching a range of Unicode Characters like all Alphabets [A-Za-z] we can use
[\u0041-\u005A] where \u0041 is Hex-Code for A and \u005A is Hex Code for Z
'matchCAPS leTTer'.match(/[\u0041-\u005A]+/g)
//output ["CAPS", "TT"]
In the same way we can use other Unicode characters or their equivalent hex codes according to their hexadecimal order (e.g. \u0100–\u017F), as provided by unicode.org.
Try: [À-ž] as an example of Range. Modify your Range according to your requirement.
It will match all characters between À and ž.
Sample regEx would be
/[A-Za-zÀ-ž\-\s]+/
For more Ref: Latin Unicode Character
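For comparison, here is a rough JavaScript sketch of the range approach next to Unicode property escapes (\p{L}, available in engines that support ES2018 with the u flag). The sample names are made up. Current browsers compile the pattern attribute as a Unicode-aware regex, so \p{L} should also be usable there, but that is an assumption worth checking against the browsers you target.

```javascript
// Range-based: ASCII letters, U+00C0 (À) through U+017E (ž), whitespace, hyphen.
const rangeName = /^[A-Za-z\u00C0-\u017E\s\-]+$/;
// Property-based: any Unicode letter, whitespace, hyphen (needs the "u" flag).
const letterName = /^[\p{L}\s\-]+$/u;

const samples = [
  'J\u00f6rg-Peter M\u00fcller',     // Jörg-Peter Müller -> true, true
  '\u041e\u043b\u044c\u0433\u0430',  // Ольга (Cyrillic)  -> false, true
  'a \u00d7 b',                      // multiplication sign -> true, false
];
for (const s of samples) {
  console.log(s, rangeName.test(s), letterName.test(s));
}
```

The last sample shows the trade-off: a code point range is easy to write but also lets through non-letters that happen to sit inside it, such as the multiplication sign.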
I have a regex '^[A0-Z9]+$' that works until it reaches strings with 'special' characters like a period or dash.
List:
UPPER
lower
UPPER lower
lower UPPER
TEST
test
UPPER2.2-1
UPPER2
Gives:
UPPER
TEST
UPPER2
How do I get the regex to ignore non-alphanumeric characters also so it includes UPPER2.2-1 also?
I have a link here to show it 'real-time': http://www.rubular.com/r/ev23M7G1O3
This is for MySQL REGEXP
EDIT: I didn't specify I wanted all non-alphanumeric characters (including spaces), but with the help of others here it led me to this: '^[A-Z-0-9[:punct:][:space:]]+$' is there anything wrong with this?
Try
'^[A-Z0-9.-]+$'
You just need to add the special characters to the group, optionally escaping them.
Additionally if you choose not to escape the -, be aware that it should be placed at the start or the end of the grouping expression to avoid the chance that it may be interpreted as delimiting a range.
To your updated question, if you want all non-whitespace, try using a group such as:
^[^ ]+$
which will match everything except for a space.
If instead what you wanted is all non-whitespace and non-lowercase, you likely will want to use:
^[^ a-z]+$
The 'trick' used here is adding a caret symbol after the opening [ in the group expression. This indicates that we want the negation of the match.
Following the pattern, we can also apply this 'trick' to get everything but lowercase letters like this:
^[^a-z]+$
I'm not really sure which of the 3 above you want, but if nothing else, this ought to serve as a good example of what you can do with character classes.
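To see how the character classes above differ, here is a small JavaScript sketch that runs the list from the question through them. It is not MySQL, so case sensitivity and collation behaviour there still need to be checked separately; the variable names are only for illustration.

```javascript
const list = ['UPPER', 'lower', 'UPPER lower', 'lower UPPER',
              'TEST', 'test', 'UPPER2.2-1', 'UPPER2'];

const withDotDash = /^[A-Z0-9.-]+$/;  // uppercase, digits, "." and "-"
const noSpaceNoLower = /^[^ a-z]+$/;  // anything except space and lowercase
const noLower = /^[^a-z]+$/;          // anything except lowercase

console.log(list.filter((s) => withDotDash.test(s)));
// [ 'UPPER', 'TEST', 'UPPER2.2-1', 'UPPER2' ]

// The negated classes give the same result on this list; they only differ
// on inputs such as these:
console.log(withDotDash.test('UPPER_2'), noSpaceNoLower.test('UPPER_2'));   // false true
console.log(noSpaceNoLower.test('UPPER TEST'), noLower.test('UPPER TEST')); // false true
```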
I believe you are looking for (one?) uppercase-word match, where word is pretty much anything.
^[^a-z\s]+$
...or if you want to allow more words with spaces, then probably just
^[^a-z]+$
You just need to put in the . and -. In theory, you don't need to escape them because they are inside the brackets, but I like to escape anyway to remind myself for the cases where I have to.
'^[A-Z0-9\.\-]+$'
Try the regular expression below:
'^[A-Z0-9\\.\\-]+$'