Find non-ASCII spaces in a MySQL table

I have a large database where things like TRIM and the functions I wrote to count words don't always work (some records still have 'spaces', and multi-word fields get a count of 1), which leads me to believe I have non-ASCII spaces.
I tried this to find offending records:
SELECT * FROM TABLE WHERE FIELD NOT REGEXP '[A-Za-z0-9 ;,]'
in other words all letters, digits, characters I used and space.
Returns zero-set.
Is there a better way to do this (i.e. one that works)?

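Your regex matches rows that contain at least one character from the set {A-Z, a-z, 0-9, space, semicolon, comma}. With NOT REGEXP, the query therefore returns only rows that contain none of those characters, which is why you get an empty set. To find rows containing any character outside the set, the negation belongs inside the brackets: REGEXP '[^A-Za-z0-9 ;,]'.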
Better to look specifically for non-printable characters using the POSIX [:cntrl:] character class:
SELECT * FROM TABLE WHERE FIELD REGEXP '[[:cntrl:]]'
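If you can pull the suspect values into application code, it is easy to see exactly which characters are hiding in them. Here is a minimal Python sketch, not part of the original answer; the function name odd_whitespace is made up for illustration. It flags both control characters and whitespace outside ASCII, since a no-break space (U+00A0) is a space separator rather than a control character.

```python
import unicodedata


def odd_whitespace(text):
    """Return (index, code point, Unicode name) for each suspicious character."""
    hits = []
    for i, ch in enumerate(text):
        # Whitespace that is not plain ASCII, e.g. U+00A0 NO-BREAK SPACE.
        is_non_ascii_space = ch.isspace() and ord(ch) > 127
        # Control characters other than the usual tab / newline / carriage return.
        is_control = unicodedata.category(ch) == "Cc" and ch not in "\t\n\r"
        if is_non_ascii_space or is_control:
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((i, f"U+{ord(ch):04X}", name))
    return hits


if __name__ == "__main__":
    print(odd_whitespace("two\u00a0words"))
```

Once you know which code points occur, you can replace them in the table or include them explicitly in your word-count logic.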

Related

How to match any letter in SQL?

I want to return rows where certain fields follow a particular pattern such as whether a particular character in a string is a letter or number. To test it out, I want to return fields where the first letter is any letter. I used this code.
SELECT * FROM sales WHERE `customerfirstname` like '[a-z]%';
It returns nothing. So I would think that the criteria is the first character is a letter and then any following characters do not matter.
The following code works, but limits rows where the first character is an a.
SELECT * FROM sales WHERE `customerfirstname` like 'a%';
Am I not understanding pattern matching? Isn't it [a-z] or [A-Z] or [0-9] for any letter or number?
Also if I wanted to run this test on the second character in a string, wouldn't I use
SELECT * FROM `sales` WHERE `customerfirstname` like '_[a-z]%'
This is for SQL and MySQL. I am doing this in phpmyadmin.
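MySQL's LIKE supports only the % and _ wildcards; bracket expressions such as [a-z] are a SQL Server extension and are treated as literal text here. You want to use regular expressions: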
SELECT s.*
FROM sales s
WHERE s.customerfirstname REGEXP '^[a-zA-Z]';
This can be achieved with a regular expression.
SELECT * FROM sales WHERE REGEXP_LIKE(customerfirstname, '^[[:alpha:]]');
^ denotes the start of the string, while the [:alpha:] character class matches any alphabetic character.
Just in case, here are a few other character classes that you may find useful:
alnum: alphanumeric characters
digit: digit characters
lower: lowercase alphabetic characters
upper: uppercase alphabetic characters
See the mysql regexp docs for many more...
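To see how the anchors behave outside the database, here is a small Python sketch (my own illustration; the sample names are invented). It tests the first character and, for the follow-up question, the second character, where a single . plays the role that _ plays in LIKE. The equivalent MySQL pattern for the second-character test would presumably be '^.[a-zA-Z]'.

```python
import re

# ^ anchors at the start of the string; the bracket expression matches one letter.
FIRST_IS_LETTER = re.compile(r"^[A-Za-z]")
# A single "." skips exactly one character, like "_" does in LIKE.
SECOND_IS_LETTER = re.compile(r"^.[A-Za-z]")


def filter_names(names, pattern):
    """Keep only the names that match the compiled pattern."""
    return [n for n in names if pattern.search(n)]


if __name__ == "__main__":
    names = ["alice", "4tune", "Bob", "x9", "_eve"]
    print(filter_names(names, FIRST_IS_LETTER))   # ['alice', 'Bob', 'x9']
    print(filter_names(names, SECOND_IS_LETTER))  # ['alice', '4tune', 'Bob', '_eve']
```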

MySQL - Select * where a substring is before char

I have a dictionary database in which every entry looks like this
column
word; synonym; synonym | example of usage; example of usage
I want to make a select function that will only get the row if it appears in the first part of the data (words and synonyms) and not in the examples of usage (as there are more words there)
I have been trying to do it with REGEXP
SELECT * FROM dictionary WHERE column REGEXP '[^\|]*word.*\|.*'
But for some reason this matches everything in the tables - even where the word doesn't appear at all.
What am I doing wrong?
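You need to use double backslashes to escape special characters in a MySQL regex, because the string literal parser consumes a single backslash first. Thus, your \| reaches the regex engine as a plain |, the pattern becomes the alternation of [^|]*word.* and .*, and the second branch matches every string, including the empty one.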
I suggest just checking if word appears before | with
'^[^|]*word'
or - if you need a whole word check:
'^[^|]*[[:<:]]word[[:>:]]'
The regex matches...
^ - start of string
[^|]* - 0 or more characters other than |
[[:<:]] - leading word boundary
word - literal sequence of letters
[[:>:]] - trailing word boundary.
Also, this regex is case-insensitive by default. To make it case-sensitive, use the BINARY keyword.
SELECT * FROM dictionary WHERE column REGEXP BINARY '^[^|]*[[:<:]]word[[:>:]]'
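The difference between the patterns is easy to reproduce outside MySQL. Below is a hedged Python sketch with invented sample rows. Python's \b stands in for the [[:<:]] and [[:>:]] markers, and the "broken" pattern is what the engine effectively sees once \| has collapsed to |. As far as I know, MySQL 8.0 switched to the ICU regex library, where \\b is used instead of [[:<:]] and [[:>:]], so check your server version.

```python
import re

ROWS = [
    "word; synonym | an example",
    "other; thing | a word in the example",
    "unrelated; entry | nothing here",
    "sword; blade | sharp",
]

# What the engine sees after the backslash is dropped: "X or anything".
BROKEN = re.compile(r"[^|]*word.*|.*")
# "word" somewhere before the first "|".
BEFORE_PIPE = re.compile(r"^[^|]*word")
# The same, but as a whole word only.
WHOLE_WORD = re.compile(r"^[^|]*\bword\b")


def matching_rows(pattern):
    """Return the indexes of ROWS that the pattern matches."""
    return [i for i, row in enumerate(ROWS) if pattern.search(row)]


if __name__ == "__main__":
    print(matching_rows(BROKEN))       # [0, 1, 2, 3]
    print(matching_rows(BEFORE_PIPE))  # [0, 3]
    print(matching_rows(WHOLE_WORD))   # [0]
```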
What am I doing wrong?
You don't have a database. You have 3 types of information (word, synonym, usage) jammed into one column in one row.
You need a table of word - synonym pairs
You need a table of word - usage pairs
You probably need a word - definition pair table
Do all the parsing before inserting the data into the tables.
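As a rough idea of what that parsing step could look like, here is a Python sketch. It assumes the layout from the question (terms separated by ;, then a |, then usage examples separated by ;) and that the first term is the headword; split_entry and the dictionary keys are names I made up.

```python
def split_entry(raw):
    """Split 'word; synonym; ... | usage; usage' into pairs ready for insertion."""
    head, _, usage = raw.partition("|")
    terms = [t.strip() for t in head.split(";") if t.strip()]
    examples = [e.strip() for e in usage.split(";") if e.strip()]
    if not terms:
        raise ValueError("entry has no headword: %r" % raw)
    word, synonyms = terms[0], terms[1:]
    return {
        # Rows for a word/synonym table.
        "synonym_pairs": [(word, s) for s in synonyms],
        # Rows for a word/usage table.
        "usage_pairs": [(word, e) for e in examples],
    }


if __name__ == "__main__":
    print(split_entry("big; large; huge | a big dog; big deal"))
```

With the data split this way, the original search becomes a plain equality or LIKE lookup on the word and synonym columns, with no regex needed.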

How to detect rows with chinese characters in MySQL?

How can I detect and delete rows with Chinese characters in MySQL?
Here is the table "Chinese_Test", which contains Chinese characters, as shown in my phpMyAdmin
Data:
Structure
Notice that my collation is utf8, so let's take a look at where the Chinese characters sit in the UTF-8 table.
http://www.ansell-uebersetzungen.com/gbuni.html
Notice that the lead byte of a Chinese character runs from E4 to E9, hence we use the code
select number
from Chinese_Test
where HEX(contents) REGEXP '^(..)*(E[4-9])';
and here is the result:
If all the other rows have alphanumeric values try the following:
DELETE FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
Do check the results before deletion, using the following:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
I don't have an answer, but to provide you with a starting point: Chinese characters occupy certain blocks of Unicode, for example CJK Unified Ideographs (U+4E00 to U+9FFF).
You would have to query for rows that contain characters between the first and the last point of that block. I can't think of a way to automate this though (i.e. to query for characters inside a certain range without naming each character explicitly).
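For what it's worth, a range check like that is short to write in application code. The Python sketch below is only an illustration: it covers just the main CJK Unified Ideographs block (extension blocks and CJK punctuation live elsewhere), and the function names are mine. It also shows that the UTF-8 lead bytes of that block run from E4 to E9.

```python
# Main CJK Unified Ideographs block only; extensions are deliberately ignored.
CJK_START, CJK_END = 0x4E00, 0x9FFF


def has_cjk(text):
    """True if any character falls inside the main CJK ideograph block."""
    return any(CJK_START <= ord(ch) <= CJK_END for ch in text)


def utf8_hex(ch):
    """Hex dump of one character's UTF-8 encoding, comparable to MySQL's HEX()."""
    return ch.encode("utf-8").hex().upper()


if __name__ == "__main__":
    print(has_cjk("abc 中文"), has_cjk("abc 123"))
    # First and last code points of the block show the lead-byte range.
    print(utf8_hex(chr(CJK_START)), utf8_hex(chr(CJK_END)))  # E4B880 E9BFBF
```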
Another untested idea that comes to mind is using iconv() to convert the string to a specifically Chinese encoding, using //IGNORE, and seeing whether any data is left. If anything is left, the string may contain Chinese characters, although this would probably be disrupted by any numbers inside the string.
It's an interesting problem.

How to allow fulltext searching with hyphens in the search query

I have keywords like "some-or-other" where the hyphens matter in the search through my mysql database. I'm currently using the fulltext function.
Is there a way to escape the hyphen character?
I know that one option is to comment out #define HYPHEN_IS_DELIM in the myisam/ftdefs.h file, but unfortunately my host does not allow this. Is there another option out there?
Here's the code I have right now:
$search_input = $_GET['search_input'];
$keyword_safe = mysql_real_escape_string($search_input);
$keyword_safe_fix = "*'\"" . $keyword_safe . "\"'*";
$sql = "
SELECT *,
MATCH(coln1, coln2, coln3) AGAINST('$keyword_safe_fix') AS score
FROM table_name
WHERE MATCH(coln1, coln2, coln3) AGAINST('$keyword_safe_fix')
ORDER BY score DESC
";
From here http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
One solution for finding a word that contains dashes or hyphens is to use a full-text search IN BOOLEAN MODE and to enclose the hyphenated word in double quotes.
Or from here http://bugs.mysql.com/bug.php?id=2095
There is another workaround. It was recently added to the manual:
"
Modify a character set file: This requires no recompilation. The true_word_char() macro
uses a “character type” table to distinguish letters and numbers from other
characters. You can edit the contents in one of the character set XML
files to specify that '-' is a “letter.” Then use the given character set for your
FULLTEXT indexes.
"
Have not tried it on my own.
Edit: Here is some additional info from http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
A phrase that is enclosed within double quote (“"”) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Prior to MySQL 5.0.3, the engine then performed a substring search for the phrase in the records that were found, so the match must include nonword characters in the phrase. As of MySQL 5.0.3, nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase" in MySQL 5.0.3, but not before.
If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.
Some people suggest using the following query:
SELECT id
FROM texts
WHERE MATCH(text) AGAINST('well-known' IN BOOLEAN MODE)
HAVING text LIKE '%well-known%';
But with that approach you need many variants, depending on the full-text operators used. Try writing a query like +well-known +(>35-hour <39-hour) working week* that way: too complex!
And do not forget ft_min_word_len, whose default is 4: a search for up-to-date drops up and to and returns only matches for date.
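To make the effect concrete, here is a very simplified Python model of that behaviour. It is an approximation for illustration, not MySQL's actual parser: it treats the hyphen as a delimiter, drops words shorter than the minimum length, and ignores stopwords and everything else the real full-text engine does. The names indexed_words and FT_MIN_WORD_LEN are mine.

```python
import re

FT_MIN_WORD_LEN = 4  # the MyISAM default for ft_min_word_len


def indexed_words(text, min_len=FT_MIN_WORD_LEN):
    """Split on anything that is not a word character, then drop short words."""
    words = re.findall(r"[A-Za-z0-9_']+", text.lower())
    return [w for w in words if len(w) >= min_len]


if __name__ == "__main__":
    print(indexed_words("up-to-date"))  # ['date']
```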
Trick
Because of that I prefer a trick so constructions with HAVING etc aren't needed at all:
Instead of adding only the following text to your database table:
"The Up-to-Date Sorcerer" is a well-known science fiction short story.
also copy the hyphenated words, without their hyphens, to the end of the text inside a comment:
"The Up-to-Date Sorcerer" is a well-known science fiction short story.<!-- UptoDate wellknown -->
If the user searches for up-to-date, remove the hyphens in the SQL query:
MATCH(text) AGAINST('uptodate ' IN BOOLEAN MODE)
That way your user can find up-to-date as one word instead of getting all results that contain only date (because ft_min_word_len kills up and to).
Of course before you echo the texts you should remove the <!-- ... --> comments.
Advantages
the query is simpler
the user is able to use all fulltext operators as usual
the query is faster.
If a user searches for -well-known +science, MySQL treats that as: must not include *well*, may include *known*, and must include *science*. This isn't what the user expected. The trick solves that too, as the SQL query searches for -wellknown +science.
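The three moving parts of the trick (append the joined words when storing, join hyphenated words in the query, strip the comment before output) could be sketched like this in Python. This is my own rough illustration with made-up function names; a real application would do the same thing in whatever language builds the SQL.

```python
import re

# A run of word characters joined by one or more hyphens, e.g. "Up-to-Date".
HYPHENATED = re.compile(r"\b\w+(?:-\w+)+\b")
SHADOW_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def add_shadow_words(text):
    """Append the hyphen-free forms of all hyphenated words inside a comment."""
    joined = [w.replace("-", "") for w in HYPHENATED.findall(text)]
    if not joined:
        return text
    return f"{text}<!-- {' '.join(joined)} -->"


def normalize_query(query):
    """Remove hyphens inside words but keep a leading '-' operator intact."""
    return HYPHENATED.sub(lambda m: m.group(0).replace("-", ""), query)


def strip_shadow_words(stored):
    """Remove the helper comment before the text is shown to the user."""
    return SHADOW_COMMENT.sub("", stored)


if __name__ == "__main__":
    story = '"The Up-to-Date Sorcerer" is a well-known science fiction short story.'
    print(add_shadow_words(story))
    print(normalize_query("-well-known +science"))  # -wellknown +science
```

Note that strip_shadow_words removes every HTML comment in the text, so it only suits data that contains no other comments.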
Maybe it is simpler to use the BINARY operator.
SELECT *
FROM your_table_name
WHERE BINARY your_column = BINARY "Foo-Bar%AFK+LOL"
http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html#operator_binary
The BINARY operator casts the string following it to a binary string. This is an easy way to force a column comparison to be done byte by byte rather than character by character. This causes the comparison to be case sensitive even if the column is not defined as BINARY or BLOB. BINARY also causes trailing spaces to be significant.
My preferred solution to this is to remove the hyphen from the search term and from the data being searched. I keep two columns in my full-text table - search and return. search contains sanitised data with various characters removed, and is what the users' search terms are compared to, after my code has sanitised those as well.
Then I display the return column.
It does mean I have two copies of the data in my database, but for me that trade-off is well worth it. My FT table is only ~500k rows, so it's not a big deal in my use case.

Match beginning of words in Mysql for UTF8 strings

I'm trying to match the beginning of words in a MySQL column that stores strings as varchar. Unfortunately, REGEXP does not seem to work for UTF-8 strings, as mentioned here
So,
select * from names where name REGEXP '[[:<:]]Aandre';
does not work if I have name like Foobar Aándreas
However,
select * from names where name like '%andre%'
matches the row I need but does not guarantee beginning of words matches.
Is it better to do the like and filter it out on the application side ? Any other solutions?
A citation from the page you mentioned:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
select * from names where name like 'andre%'
select * from names where name like 'andre%' is not a solution for, e.g.,
name = 'richard andrew', because the string begins with richa... and not with andre...
For the moment, a temporary solution for finding words (words, not whole strings) that start with a given string is
select * from names where name REGEXP '[[:<:]]andre';
But it does not match accented words, e.g. ándrew.
Is there any other solution using MySQL regular expressions that matches accented words?
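If you end up filtering on the application side, as the question suggests, one option is to strip the accents from both the name and the search term before doing a word-boundary match. The Python sketch below is only an illustration under that assumption; fold and word_starts_with are names I made up, and the candidate rows would still come from a broad LIKE '%andre%' query.

```python
import re
import unicodedata


def fold(text):
    """Lower-case the text and strip combining accents (á becomes a)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def word_starts_with(name, prefix):
    """True if any word in name starts with prefix, ignoring case and accents."""
    pattern = r"\b" + re.escape(fold(prefix))
    return re.search(pattern, fold(name)) is not None


if __name__ == "__main__":
    print(word_starts_with("richard ándrew", "andre"))    # True
    print(word_starts_with("Foobar Aándreas", "Aandre"))  # True
    print(word_starts_with("alexandre", "andre"))         # False
```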