ilike search working irregularly with characters "ÅÄÖ" - MySQL

Alright, so I've got a pretty weird error.
Say I have the string "Ön Åsa Äpple föll ner i sjön"
query | result
------+----------
Ön    | Found
ÖN    | Found
ön    | Found
<same pattern works with Åsa and Äpple as well>
föll  | Found
Föll  | Found
FÖLL  | Not found
This doesn't make any sense to me. Searching with a capital Å, Ä, or Ö clearly works, but for some reason not when the Å/Ä/Ö character is not the first letter of the word?
I have a Rails application and a MySQL database.
This is the corresponding code:
dataset = DB[category_class.table_name.to_sym]
dataset = dataset.where(:headline.ilike("%" + headline + "%")) if headline.present?
I'd be very grateful for any comments or answers that lead me in the right direction to solving this problem.
Regards, Emil

This is probably something to do with the way the utf8_unicode_ci collation compares strings. A collation is a collection of rules that determine when two strings are considered "the same", and when they are not; it also determines the order in which strings are sorted.
To see the collations for the utf8 character set, issue SHOW COLLATION LIKE 'utf8%'. You will see that there are a variety of language-specific collations. These don't limit the values you can store in the column* but rather impose string comparison rules tailored towards text that is expected to be primarily in a specific language. Alternatively, you can use the utf8_general_ci collation, which ignores accents entirely.
* Although if unique keys are involved, they might change whether one value is considered a duplicate of another or not.
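A quick way to experiment is to list the available collations and then force one per comparison with a COLLATE clause, without altering the table. A minimal sketch, assuming the connection character set is utf8 (entries and headline are hypothetical stand-ins for your table and column):

SHOW COLLATION LIKE 'utf8%';

SELECT headline
FROM entries
WHERE headline LIKE '%FÖLL%' COLLATE utf8_general_ci;

Once you find a collation whose matching behaviour suits you, you can make it the column default with ALTER TABLE ... MODIFY ... COLLATE.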

Related

In SQL tables, should I, for example, have "é" or should I have "e´"?

I have tried in vain to look up relevant questions; they are above my pay grade. I am not a professional. To explain this a bit more: in the HTML that I wrote, the em dash would be "& #151;" (that space inserted so it would not show up as an actual em dash). It ended up in the tables (someone else was doing that work) as "—". Those are not showing up correctly when searches are done using PHP; I only get the question-mark replacement character. I do have my SQL account set to Unicode.
Take a philosophical stand: The datastore (database table) should contain data, not some special encoding of the data.
The "data" is é
When you display that in HTML, you might need to convert it to e´. However, modern browsers have no problem with é encoded as UTF-8.
If you choose to use "html entities", then have your application do the conversion after fetching é from the table. PHP has the function htmlentities() specifically for that task.
But I still have not addressed what byte(s) are in the table to represent é. These days, you 'should' use UTF-8 (aka MySQL's utf8mb4). That would be two hex bytes, C3A9, which can be discovered using SELECT HEX(col) .... If you use the old default, latin1, the hex would show E9.
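To see the bytes for yourself, convert a literal to each character set and hex-dump it; a quick check in any MySQL client (t and col below are hypothetical names):

SELECT HEX(CONVERT('é' USING utf8mb4));  -- C3A9
SELECT HEX(CONVERT('é' USING latin1));   -- E9

SELECT col, HEX(col) FROM t;  -- inspect what is actually stored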
A related question is whether you should store html 'tags' or construct the html on the fly after fetching the data. So, let me give you three philosophies; you pick which to apply:
The table contains pure data; formatting, etc, is done after fetching and before delivering to the user's browser.
The table contains an 'opaque' image of what needs to be sent to the browser -- complete with tags, entities, etc. With this approach, you may as well call it a BLOB, not TEXT.
Some compromise between those. Note: The use of CSS can avoid too much hard-coding of formatting before storing into the database.
Also, the first choice is much cleaner for searching; this may lead you to pick it. However, another approach is to have two columns: one aimed at delivering mostly-formatted output, the other for searching (tags removed, no entities, etc.). The search column would be mostly text, but you probably could not generate a web page (with links, paragraphs, etc.) from it.
é -- different strokes for different folks:
é -- latin1 (not advised) -- hex E9, 1 byte
é -- utf8 -- hex C3A9, 2 bytes
\u00E9 -- Unicode codepoint -- 6 bytes
&eacute; -- HTML entity (see PHP's htmlentities()) -- 8 bytes
%C3%A9 -- PHP's urlencode() (for URLs) -- 6 bytes
Responding to Comments
If entries_lists, entries_languages, and authors_entries are many:many mapping tables, please consider the several optimizations mentioned here.
Do not use utf8_encode(). Instead, figure out what caused the data not to be encoded, or displayed, correctly. Start with
echo bin2hex($record['author']);
and
SELECT name, HEX(name) FROM authors WHERE ...
for some author with an accented letter.

Is N prefix allowed in MySQL

I am converting views from MSSQL to MySQL. I came across a statement like
SELECT CODE, DESCRIPTION
FROM PARAMETERMASTER
WHERE (CODETYPE = N'CASTTYPE')
Does MySQL support the N prefix? Is it necessary to keep the "N" prefix in MySQL as well?
The following articles have some good information on the question. The short answer is just that there's a type mismatch between the unicode column and non-unicode string literal you're using. From the KB article, it looks like omitting the N prefix might still work in some cases, but it would depend on the code page and collation settings of the database. That might explain the change in behavior, if you were previously having success with the no-prefix method.
http://databases.aspfaq.com/general/why-do-some-sql-strings-have-an-n-prefix.html
http://support2.microsoft.com/kb/239530/en-us
MySQL does in fact accept the N prefix: N'literal' is shorthand for _utf8'literal', i.e. a string literal in the national character set. In practice you can usually drop it and simply quote the text, provided your connection and column character sets are Unicode.
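A quick check in a MySQL client, no table needed (the exact name reported for the national character set varies by server version):

SELECT CHARSET(N'CASTTYPE');     -- utf8 / utf8mb3: the national character set
SELECT CHARSET('CASTTYPE');      -- whatever the connection character set is
SELECT N'CASTTYPE' = 'CASTTYPE'; -- normally 1: the two values compare equal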

MySQL: multiple matching patterns in substring_index

Can I use something like CASE to give multiple matching patterns in substring_index?
More specifically, in my case, can I match a set of characters according to their ASCII codes?
Some examples:
中文Q100
中文T800
中文中文K999
The strings start with some Chinese characters, followed by some numbers or Latin letters. What I want is to split each string into two parts: one containing the Chinese characters (from the leftmost character to the first Western letter), the other running from the first Western letter to the rightmost character.
Like these:
中文, Q100
中文, T800
中文中文, K999
There are multiple ways to resolve this. I'll give you three of them, starting with the one I consider most correct.
Architecture solution
Using application
Your question is really about replacing by regular expression, and that has weak support in MySQL (to be precise, there is no support for replacing by regex in older versions; MySQL 8.0 later added a built-in REGEXP_REPLACE()). Thus, you may select the whole record and then split it in the application, using an a-zA-Z0-9 mask, for example.
Or maybe change the table structure?
The alternative is: maybe you should just separate this data into two columns? If your intention is to work with the parts of the data separately, that may be a sign to change your DB architecture.
Using MySQL
The second way is to use MySQL itself. To do that, yes, you would use REPLACE() as it is. For instance, to get rid of all alphanumeric symbols, you would do:
SELECT [...REPLACE(REPLACE(str, 'z', ''), 'y', '')...]
That is pseudo-SQL, since posting the whole 26+26+10 nested REPLACE() calls would be madness (and actually using them would be, too). But it would resolve your issue, of course.
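If you are on MySQL 8.0 or later, the built-in regular-expression functions make the split direct. A sketch, assuming a table t with a column str (both hypothetical names):

SELECT str,
       REGEXP_REPLACE(str, '[A-Za-z0-9].*$', '') AS chinese_part,
       REGEXP_SUBSTR(str, '[A-Za-z0-9].*$')      AS latin_part
FROM t;
-- '中文Q100' -> chinese_part = '中文', latin_part = 'Q100'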
Using external REGEXP solution
This is third way and it has two subcases. You may either use UDF or stored routines.
Using UDF
There are third-party libraries which provide regular-expression replacement functionality; then all you need to do is include such a library in your server build. Example: lib_mysqludf_preg. This, however, requires additional steps to build and install the library.
Using stored routines
Well, you can use stored routines to create your own replacement function. Actually, I have already written such a library; it's called mysql-regexp and it provides a REGEXP_REPLACE() function, which allows you to do replacements in strings by regular expressions. It's not well tested, so if you decide to use it, do so at your own risk. A sample:
mysql> SELECT REGEXP_REPLACE('foo bar34 b103az 98feo', '[^a-z]', '');
+--------------------------------------------------------+
| REGEXP_REPLACE('foo bar34 b103az 98feo', '[^a-z]', '') |
+--------------------------------------------------------+
| foobarbazfeo                                           |
+--------------------------------------------------------+
1 row in set (0.00 sec)
Since it's written entirely in stored code, you won't need to rebuild your server or install anything else.

Getting MySQL to properly distinguish Japanese characters in SELECT calls

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.
Unlike other questions on this so far, I don't know that it's an encoding issue per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.
The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:
ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
What I noticed after I put everything in is that wherever two such similar readings occurred, the first one encountered would be logged and would then register as a false positive whenever the other appeared. For example:
キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.
I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view regarding differentiating between characters (e.g. if the characters have two different UTF-8 code points, treat them as different characters). Is there any way to get this behavior?
You can use utf8_bin to get a collation that compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
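A quick demonstration of the difference, assuming the connection character set is utf8 (readings and kana below are hypothetical names):

SELECT 'キョウ' = 'ギョウ' COLLATE utf8_unicode_ci;  -- 1: equal at primary strength
SELECT 'キョウ' = 'ギョウ' COLLATE utf8_bin;         -- 0: distinct code points

ALTER TABLE readings MODIFY kana VARCHAR(64)
  CHARACTER SET utf8 COLLATE utf8_bin;  -- make the strict comparison the default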
When saving to the database, save the text as binary, and convert it back to Japanese when reading it out. The same problem occurred for me with Arabic.

MySQL regexp with Japanese furigana

I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.
Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phonetic kana is placed above the kanji; this is called furigana.
I present these phonetic readings to the user with the <ruby> tag in the following format:
<ruby>
<rb>勉強</rb> <!-- the kanji -->
<rp>(</rp> <!-- define where the phonetic part starts in the string -->
<rt>べんきょう</rt> <!-- the phonetic kana itself -->
<rp>)</rp> <!-- define the end of the phonetic part -->
</ruby>する <!-- the last part is already phonetic so needs no ruby -->
The strings are stored in my database like this:
勉強(べんきょう)する
where anything between the parentheses is the reading for the kanji immediately preceding it. Storing the strings this way allows a fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).
All of this is fine, but the problem comes when a user is searching. If they search for
勉強
Then it will show up. But if they try to search for
勉強する
it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.
The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.
Is there a way to do what I want?
As said in "How to do a regular expression replace in MySQL?", this seems to be impossible without a user-defined function (you can only replace explicit sequences).
Rather dirty solution: you can tolerate anything between two consecutive Japanese characters, LIKE '勉%強%す%る'. I never suggested that.
Or, you can keep an optional field in your table that potentially contains a version with furigana.
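One concrete way to maintain such a searchable version, assuming MySQL 8.0 or later (which added REGEXP_REPLACE()) and a hypothetical vocab table with the column sentence; the character class below covers both ASCII and full-width parentheses, so adjust it to whatever your data actually contains:

ALTER TABLE vocab ADD COLUMN search_text TEXT;

-- strip every parenthesised reading, leaving plain searchable text
UPDATE vocab
SET search_text = REGEXP_REPLACE(sentence, '[（(][^）)]*[）)]', '');

-- '勉強(べんきょう)する' -> '勉強する'
SELECT * FROM vocab WHERE search_text LIKE '%勉強する%';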
I would advise against using LIKE queries because you would have to put a % between every single character (since you don't know when furigana will occur), and that could end up creating false positives (e.g. if a valid character appeared between 勉 and 強).
As @Jill-Jênn Vie briefly mentioned, I'd suggest adding a new column to hold the text with furigana.
I'm working on an application which performs searches on Korean text. The problem is that Korean conjugation changes the characters. For example:
하다 + 아요 = 해요
"하다" is the verb "to do" in dictionary form and "아요" is the standard polite-form conjugation. Presumably you are a Japanese speaker, so you know how common such polite forms can be! Note how the 하 changes to 해. Obviously, if users try to search for "하다" in the string "해요", they won't find it. But if users want to see all instances of "하다" in the corpus, we need to be able to return it.
Our solution was two columns: "form" (conjugated form) and "analytic_string" which would represent "해요" as "하다+아요". You could take a similar approach and make a second column containing your sentence without furigana.
The main disadvantages of this approach are that you effectively double your database size, and you need to pay special attention when inputting data so that the two columns stay in sync (I found a few rows in my database where the form and the analytic string contained different words). The advantage is that you can easily search your data while ignoring furigana.
It's your standard "size vs. performance" trade-off. Which is more important: size of the database or execution time? Any other solution I can think of involves returning too many rows and then individually analyzing them.
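If the duplicated column worries you mainly because the two copies can drift apart, one option, assuming MySQL 8.0 and that your server allows regular-expression functions in generated-column expressions (worth verifying on your version; the names below are hypothetical), is to let the server derive the search column itself:

ALTER TABLE vocab
  ADD COLUMN search_text TEXT
  GENERATED ALWAYS AS (REGEXP_REPLACE(sentence, '[（(][^）)]*[）)]', '')) STORED;

The stored copy still costs space, but it can never disagree with the source column.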