How to select data with REGEXP without containing Extended ASCII


I just want to display rows whose data:
1. contains a-z
2. contains A-Z
3. contains numbers 0-9
4. contains printable ASCII characters, as listed at the following link: http://www.theasciicode.com.ar/extended-ascii-code/latin-diphthong-ae-uppercase-ascii-code-146.html
I am stuck with a query like this; please help with point 4 above:
SELECT * FROM Delin WHERE alamat REGEXP '^ [A-Za-z0-9]'
The sample raw data is below:
and I want output like this (where these images show only a-Z and printable symbols):

Your items 1–3 (a–z, A–Z, and 0–9) are all subsets of item 4 (printable ASCII characters), so you need only concern yourself with the latter. The following query satisfies that criterion:
SELECT * FROM Delin
WHERE alamat REGEXP '^[ -~]+$';
The character class [ -~] matches the ASCII characters from space to tilde inclusive, which are exactly the printable ASCII characters and no others.
You can see it in an SQL Fiddle here: http://sqlfiddle.com/#!9/6c7b8/1
Terminology note: There is no such thing as "Extended ASCII." The ASCII character set corresponds to the numbers 0–127 inclusive. Any character corresponding to a number greater than 127 is not ASCII. The term "Extended ASCII" is often mistakenly applied to various non-ASCII encodings, none of which is an "extension" of ASCII in any official sense.
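As a quick sanity check of that character class outside the database, here is a minimal Python sketch. It assumes Python's re module rather than MySQL's engine (for a class this simple the two should agree), and the sample strings are invented for illustration, not taken from the asker's table.

```python
import re

# Same character class as in the query: space (0x20) through tilde (0x7E).
# fullmatch() plays the role of the ^...$ anchors in the SQL pattern.
PRINTABLE_ASCII = re.compile(r"[ -~]+")

def is_printable_ascii(text: str) -> bool:
    """Return True if text is non-empty and holds only printable ASCII."""
    return PRINTABLE_ASCII.fullmatch(text) is not None

# Hypothetical sample values, made up for this sketch.
samples = ["Jl. Merdeka No. 12", "Caf\u00e9 Street", "tab\there"]
for s in samples:
    print(repr(s), is_printable_ascii(s))
```

Only the first sample passes: the second contains a non-ASCII letter and the third contains a control character (tab, 0x09), which lies below the space character.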

Related

MySQL REGEXP word boundary detection with german umlauts when using BINARY Operator

I made a strange discovery. If I execute the following SQL-Command:
SELECT 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
it gives me the result - as expected - 0
But if I do the same together with the BINARY operator:
SELECT BINARY 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
the result is 1, so I think there is a MySQL problem with the regexp word boundary detection and German umlauts (like the "ä" here) in conjunction with the BINARY operator. As another example, I can run this query:
SELECT BINARY 'Konzessionsäre' REGEXP '[[:<:]]Konzession[[:>:]]'
So here the result is 0 - as I would expect. So how can I solve this? Is this probably a bug in MySQL?
Thanks

By casting your string as BINARY you have stripped its associated character set property. So it's unclear how the word-boundary pattern should match. I'd guess it matches only ASCII values A-Z, a-z, 0-9, and also _.
When casting the string as BINARY, MySQL knows nothing about any other higher character values that also should be considered alphanumeric, because which characters should be alphanumeric depends on the character set.
I guess you are using BINARY to make this a case-sensitive regular expression search. Apparently, this has the unintended consequence of spoiling the word-boundary pattern-match.
You should not use BINARY in this comparison. You could do a secondary comparison to check for case-sensitive matching, but not with word boundaries.
SELECT (BINARY 'Konzessionäre' REGEXP 'Konzession') AND ('Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]')
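The effect described above can be reproduced outside MySQL. The following Python sketch is an analogy, not MySQL's engine: it relies on Python's re module, where \b in a bytes pattern treats only ASCII letters, digits and _ as word characters, while \b in a str pattern is Unicode-aware. The function names are hypothetical.

```python
import re

WORD = "Konzession"

def match_as_text(subject: str) -> bool:
    """Character-aware search: 'ä' counts as a word character."""
    return re.search(rf"\b{WORD}\b", subject) is not None

def match_as_bytes(subject: str) -> bool:
    """Byte-wise search on the UTF-8 encoding: the two bytes of 'ä' are not
    ASCII word characters, so a boundary is seen right after 'Konzession'."""
    pattern = rb"\b" + WORD.encode("ascii") + rb"\b"
    return re.search(pattern, subject.encode("utf-8")) is not None

for subject in ("Konzessionäre", "Konzessionsäre"):
    print(subject, match_as_text(subject), match_as_bytes(subject))
```

For 'Konzessionäre' the text search fails and the byte search succeeds, mirroring the 0 and 1 from the question; for 'Konzessionsäre' both fail because an ASCII letter follows the word.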

MySQL's REGEXP works with bytes, not characters. So, in CHARACTER SET utf8, ä is 2 bytes. It is unclear what the definition of "word boundary" is in such a situation.
Recent versions of MariaDB have a better regexp engine.

Can somebody help me understand this regex pattern? [duplicate]

I have some knowledge of MySQL regexp syntax but I am not proficient. I was looking for a way to construct a pattern that selects all names from a MySQL table that contain strange characters due to international input, such as Spanish names that have the symbol on top of the n.
I came upon this following pattern which I tried and it worked.
[^a-zA-Z0-9#:. \'\-`,\&]
The query is:
SELECT *
FROM orders_table
WHERE customers_name REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
However I would like to understand how this pattern was constructed and what each part means.

The idea is that anything between [...] is a character class, which matches any single character that's in the set between the [ and ].
Adding the ^ to the start of the list of characters means the class is negated, so it matches any character NOT in the set. Putting a ^ anywhere but the start of the [ ... ] means it's just a regular ^ character to match, and in no case does ^ inside a character class mean a start-of-line anchor.
Ranges work, such as a-z, and if you want a literal dash in the set, you either put it first (possibly after the ^), or quote it with \.
Edit: the other special characters - # : etc. - are not special in this context, they just match as regular characters.

[a-zA-Z0-9#:. \'\-`,\&] - Let us consider all the important parts of this pattern.
[] - denotes a character class which matches any single character within [].
[a-zA-Z0-9] -- One character (lowercase or uppercase) that is in the range of a-z,A-Z OR 0-9.
Every other character listed inside the [] simply matches itself.
The single quote ' is escaped with \ because it would otherwise end the SQL string literal. The backslash before & is not required in MySQL's default SQL mode, but it is harmless.
To match a pattern which has a character other than any of these symbols within [], we use a negation "^" at the beginning.
Thus, as you said, your pattern selects all names from a MySQL table that contain other, strange characters.
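To see the negated class in action, here is a small Python sketch of the same pattern. It is an illustration under assumptions: it uses Python's re module rather than MySQL's engine, it drops the SQL-level escaping of the quote and ampersand because a Python raw string does not need it, and the sample names are invented.

```python
import re

# Negated class: matches any single character that is NOT a letter, a digit,
# '#', ':', '.', space, "'", '-', '`', ',' or '&'.
UNUSUAL_CHAR = re.compile(r"[^a-zA-Z0-9#:. '\-`,&]")

def has_unusual_char(name: str) -> bool:
    """Return True if name contains at least one character outside the set."""
    return UNUSUAL_CHAR.search(name) is not None

# Hypothetical customer names, made up for this sketch.
for name in ["Anne-Marie O'Neil", "Pe\u00f1a", "Smith & Sons, Ltd."]:
    print(name, has_unusual_char(name))
```

Only the name containing ñ is flagged; the hyphen, apostrophe, ampersand, comma and period are all inside the allowed set.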

Regular Expression to match single, whole words from MySQL Database

I require a Regular Expression to match single words of [a-zA-Z] only with 6 or more characters.
The list of words:
a
27
it was good because
it
cat went to
four
eight is bigger
peter
pepper
objects
83
harrison99
adjectives
atom-bomb
The following would be matched
pepper
objects
adjectives
It isn't too bad if the following are matched too as I can strip out the duplicates after
it was good because
eight is bigger
atom-bomb
So far I have this
/[a-zA-z]{6,}/g
This matches words of 6 or more characters according to RegEx 101, but it does not give the results I want when I use it with REGEXP to filter a table in my MySQL database.
Currently it includes words containing punctuation and other non a-zA-Z characters. I don't want it to include results with numbers or punctuation.
In an ideal world I don't want it to include a row that contains more than one word as usually the other words are already duplicates.
Actual MySQL results
If a row contains a number or punctuation, or contains more than one word, I don't want it included.
I want only results that are single, whole words of 6+ characters

Drop the delimiters and add anchors for start/end:
SELECT 'abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 1
SELECT 'a abcdef' REGEXP '^[a-zA-Z]{6,}$'
-> 0
Also I changed [a-zA-z] in the character class to [a-zA-Z].
Instead of [a-zA-Z] you can also use [[:alpha:]] POSIX style bracket extension. So it becomes:
^[[:alpha:]]{6,}$
(As a side note: MySQL uses Henry Spencer's implementation of regular expressions. If word boundaries are needed, use [[:<:]]word[[:>:]]. See the manual for further syntax differences.)
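The anchored pattern can be checked against the word list from the question with a short Python sketch. This assumes Python's re module instead of MySQL; fullmatch() stands in for the ^ and $ anchors.

```python
import re

# Whole value must consist of 6 or more ASCII letters.
SINGLE_WORD = re.compile(r"[a-zA-Z]{6,}")

# Word list from the question (using the spelling "adjectives").
words = [
    "a", "27", "it was good because", "it", "cat went to", "four",
    "eight is bigger", "peter", "pepper", "objects", "83", "harrison99",
    "adjectives", "atom-bomb",
]

matches = [w for w in words if SINGLE_WORD.fullmatch(w)]
print(matches)  # ['pepper', 'objects', 'adjectives']

# Why [a-zA-z] was a bug: the range A-z also spans the ASCII characters
# between 'Z' and 'a', namely [ \ ] ^ _ and `.
print(bool(re.fullmatch(r"[a-zA-z]{6,}", "foo_bar")))  # True
print(bool(re.fullmatch(r"[a-zA-Z]{6,}", "foo_bar")))  # False
```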

How to detect Thai language in SQL query

I have a string column in a table, and some of those strings contain Thai text. An example of a Thai string is:
อักษรไทย
Is there a way to query for strings like this in a column?

You could search for strings that start with a character in the Thai Unicode block (i.e. between U+0E01 and U+0E5B):
WHERE string BETWEEN 'ก' AND '๛'
Of course this won't include strings that start with some other character and go on to include Thai language, such as those that start with a number. For that, you would have to use a much less performant regular expression:
WHERE string RLIKE '[ก-๛]'
Note however the warning in the manual:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
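Because of that byte-wise caveat, it can help to verify the character range in a character-aware engine first. The following Python sketch is one such check; it assumes Python's re module (which works on code points, not bytes) and uses a hypothetical helper name.

```python
import re

# The range used in the answer: U+0E01 (ก) through U+0E5B (๛).
THAI_CHAR = re.compile("[\u0e01-\u0e5b]")

def contains_thai(text: str) -> bool:
    """Return True if text contains at least one character in the Thai range."""
    return THAI_CHAR.search(text) is not None

print(contains_thai("อักษรไทย"))       # True
print(contains_thai("123 อักษรไทย"))   # True, Thai text need not come first
print(contains_thai("plain ASCII"))    # False
```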

You can do some back-and-forth conversion between character sets (the names below are Oracle character set names).
where convert(string, 'AL32UTF8') =
convert(convert(string, 'TH8TISASCII'), 'AL32UTF8', 'TH8TISASCII' )
will be true if string is made only of Thai and ASCII, so if you add
AND convert(string, 'AL32UTF8') != convert(string, 'US7ASCII')
you filter out the strings made only of ASCII and you get the strings that contain Thai.
Unfortunately, this will not work if your strings contain something outside of ASCII and Thai.
Note: Some of the convert calls may be superfluous depending on your database default encoding.
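The same round-trip idea can be sketched in Python. This is an analogy under assumptions: Python's tis_620 codec stands in for the TH8TISASCII character set, str.isascii() stands in for the US7ASCII comparison, and the function names are hypothetical.

```python
def only_thai_and_ascii(text: str) -> bool:
    """True if every character can be represented in TIS-620 (Thai + ASCII)."""
    try:
        text.encode("tis_620")
    except UnicodeEncodeError:
        return False
    return True

def is_thai_text(text: str) -> bool:
    """Thai/ASCII only, and not purely ASCII, so at least one Thai character."""
    return only_thai_and_ascii(text) and not text.isascii()

for sample in ["อักษรไทย", "abc อักษรไทย", "hello", "อักษรไทย \u00e9"]:
    print(repr(sample), is_thai_text(sample))
```

As with the SQL version, a string that mixes Thai with anything outside ASCII and Thai (the last sample) is rejected.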

How can I query for text containing Asian-language characters in MySQL?

I have a MySQL table using the UTF-8 character set with a single column called WORDS of type longtext. Values in this column are typed in by users and are a few thousand characters long.
There are two types of rows in this table:
1. In some rows, the WORDS value has been composed by English speakers and contains only characters used in ordinary English writing. (Not all are necessarily ASCII, e.g. the euro symbol may appear in some cases.)
2. Other rows have WORDS values written by speakers of Asian languages (Korean, Chinese, Japanese, and possibly others), which include a mix of English words and words in the Asian languages using their native logographic characters (and not, for example, Japanese romaji).
How can I write a query that will return all the rows of type 2 and no rows of type 1? Alternatively, if that's hard, is there a way to query most such rows (here it's OK if I miss a few rows of type 2, or include a few false positives of type 1)?
Update: Comments below suggest I might do better to avoid the MySQL query engine altogether, as its regex support for unicode doesn't sound too good. If that's true, I could extract the data into a file (using mysql -B -e "some SQL here" > extract.txt) and then use perl or similar on the file. An answer using this method would be OK (but not as good as a native MySQL one!)

In theory you could do this:
1. Find the Unicode ranges that you want to test for.
2. Manually encode the start and end into UTF-8.
3. Use the first byte of each of the encoded start and end as a range for a REGEXP.
I believe that the CJK range is far enough removed from things like the euro symbol that the false positives and false negatives would be few or none.
Edit: We've now put theory into practice!
Step 1: Choose the character range. I suggest \u3000-\u9fff; easy to test for, and should give us near-perfect results.
Step 2: Encode into bytes. (Wikipedia utf-8 page)
For our chosen range, utf-8 encoded values will always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.
Thus, we want to match bytes in the range 11100011 to 11101001, or 0xe3 to 0xe9.
Step 3: Make our regexp using the very handy (and just now discovered by me) UNHEX function.
SELECT * FROM `mydata`
WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
Just tried it out. Works like a charm. :)
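The byte arithmetic in steps 2 and 3 can be double-checked with a short Python sketch. It assumes Python's re module on bytes as a stand-in for MySQL's byte-wise REGEXP; the helper names are hypothetical.

```python
import re

def lead_byte(code_point: int) -> int:
    """Return the first byte of the UTF-8 encoding of a code point."""
    return chr(code_point).encode("utf-8")[0]

low, high = lead_byte(0x3000), lead_byte(0x9FFF)
print(hex(low), hex(high))  # 0xe3 0xe9

# Byte-level class, built the same way as
# CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']') in the query.
CJK_LEAD = re.compile(b"[" + bytes([low]) + b"-" + bytes([high]) + b"]")

def looks_cjk(text: str) -> bool:
    """True if the UTF-8 bytes of text contain a lead byte in 0xE3..0xE9."""
    return CJK_LEAD.search(text.encode("utf-8")) is not None

print(looks_cjk("hello 世界"))   # True
print(looks_cjk("price: 5 €"))  # False, the euro sign starts with 0xE2
```

One caveat worth checking against your own data: Hangul syllables (U+AC00 to U+D7A3) encode with lead bytes 0xEA to 0xED, so Korean text written only in Hangul falls outside the 0xE3 to 0xE9 range and would need a wider upper bound.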

You can also use the HEX value of the character. SELECT * FROM table WHERE <hex code>
Try it out with SELECT HEX(column) FROM table
This might help as well http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html
