Getting MySQL to properly distinguish Japanese characters in SELECT calls - mysql

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.
Unlike other questions on this so far, I don't know that it's an encoding issue, per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.
The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:
ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
What I noticed after loading everything is that wherever two such similar readings occurred, the first one encountered was logged and then showed up as a false positive when the other appeared. For example:
キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.
I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view regarding differentiating between characters (e.g. if the characters have two different UTF-8 code points, treat them as different characters). Is there any way to get this behavior?

You can use utf8_bin to get a collation that compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
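For example, a minimal sketch of both options (the readings table and reading column are placeholders; this assumes the column is stored as utf8):

-- Make the column itself compare by code point:
ALTER TABLE readings
  MODIFY reading VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_bin;

-- Or override the collation for a single lookup without changing the column:
SELECT id FROM readings WHERE reading COLLATE utf8_bin = 'ギョウ';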

When saving to the database, store the text as binary, and convert it back to Japanese when reading it out.
The same problem occurred for me with the Arabic language.

Related

MSAccess - CSV TransferText Import Spec, non-ascii delimiter?

I receive a CSV file from a 3rd party that I need to import into Access. They claim they are unable to add any sort of text qualifier, and all of my common delimiter options (comma, tab, pipe, $, ~, ^, etc.) seem to appear in the data, so none is reliable for an Import Spec. I cannot edit the data, but we can adjust the delimiter. Record counts are in the 500K range x 50 columns (250MB).
I tried a non-ASCII character as a delimiter (i.e., ÿ). I can add it to an Import Spec and the sample data appears to delimit OK, but I get an error (Subscript out of Range) when attempting the actual import. I also tried a multi-character delimiter, but no go.
Any suggestions so that I can receive these CSV tables? It's a daily task, with many low-skilled users at remote locations, and the import runs behind a button.
Sample Raw Data, truncated for width (June7, not sure if this helps the discussion)
9798ÿ9798ÿ451219417ÿ9033504ÿ9033504ÿPUNCH BIOPSY 4MM UNI-PUNCH SS SEAMLS RAZOR SHARP BLADE...
9798ÿ9798ÿ451219418ÿ1673BXÿ1673BXÿCLEANER INST 1GL KLENZYME LATEXÿSTERIS PLCÿ1673BXÿ1673BX...
9798ÿ9798ÿ451219419ÿA4823PRÿA4823PRÿBAG BIOHAZ THK1.3 MIL 24X23IN RED LDPE PRINT INF WASTE...
9798ÿ9798ÿ451219420ÿCUR9225ÿCUR9225ÿGLOVE EXAM CURAD MEDIUM LATEX FREEÿMEDLINE INDUSTRIES,...
9798ÿ9798ÿ451219421ÿCUR9226ÿCUR9226ÿGLOVE EXAM CURAD LARGE LATEX FREEÿMEDLINE INDUSTRIES, ...
9798ÿ9798ÿ451219422ÿ90176101ÿ90176101ÿDRAPE CONSUMABLE PK EQUIP OEC UROVIEW 2800 STERILE L...
Try another extended-ASCII character (128 - 254). The chosen delimiter ÿ (255) apparently doesn't work, but it's already a suspicious character since it has all bits set and sometimes has special meaning for that reason.
It's also good to consider the code page. If you're in the US using the standard English version of Windows, it's likely that Access is using the default "Western European (Windows)" (Windows-1252) code page. But if you're outside the US or have other languages installed, the default code page may treat certain characters differently. For reference, I'm using Access 2013 on Windows 10. In the Access text import wizard, clicking the [Advanced...] button shows more options, including the selection of the import code page. Since you're having problems with the import, it is worth inspecting that setting.
For the record, I had similar results as you and others using the sample data and delimiter ÿ (255).
Next I tried À (192) which is a standard letter character in various code pages, so it should likely work even if the default were not Windows-1252. Indeed, it worked on my system and resulted in no errors.
To get the import working without errors at first, I would make all fields Short Text or Long Text before specifying integer, date, or other non-text types. If all the text columns work, then try the specific field types. That way you can at least differentiate between delimiter errors and other data errors.
This isn't to discourage other options like fixed-width text, especially since in that case you won't have to worry about the delimiter at all.

In SQL tables, should I, for example, have "é" or should I have "e´"?

I have tried in vain to look up relevant questions. They are beyond my pay grade. I am not a professional. To explain this a bit more: in the HTML that I wrote, the em dash would be "& #151;" (the space inserted so it would not show up as an actual em dash). It ended up in the tables (someone else was doing that work) as "—". Those are not showing up correctly when searches are done using PHP; I only get the image with a question mark. I do have my SQL account set to Unicode.
Take a philosophical stand: The datastore (database table) should contain data, not some special encoding of the data.
The "data" is é
When you display that in HTML, you might need to convert it to the entity &eacute;. However, all modern browsers have no problem with é encoded as UTF-8.
If you choose to use "html entities", then have your application do the conversion after fetching é from the table. PHP has the function htmlentities() specifically for that task.
But, I still have not addressed what byte(s) are in the table to represent é. These days, you 'should' use UTF-8 (aka MySQL's utf8mb4). That would be the two hex bytes C3A9, which can be discovered using SELECT HEX(col) .... If you use the old default, latin1, the hex would show E9.
A related question is whether you should store html 'tags' or construct the html on the fly after fetching the data. So, let me give you three philosophies; you pick which to apply:
The table contains pure data; formatting, etc, is done after fetching and before delivering to the user's browser.
The table contains an 'opaque' image of what needs to be sent to the browser -- complete with tags, entities, etc. With this approach, you may as well call it a BLOB, not TEXT.
Some compromise between those. Note: The use of CSS can avoid too much hard-coding of formatting before storing into the database.
Also, the first choice is much cleaner for searching, which may lead you to pick it. However, another approach is to have two columns -- one aimed at delivering mostly-formatted output, the other for searching (tags removed, no entities, etc.); the latter would be mostly plain text, but you probably could not generate a web page (with links, paragraphs, etc.) from it.
é -- different strokes for different folks
é in latin1 (not advised) hex E9, 1 byte
é in utf8 C3A9 2 bytes
\u00E9 -- Unicode codepoint -- 6 bytes
&eacute; -- HTML entity (see PHP's htmlentities()) -- 8 bytes
%C3%A9 -- PHP's urlencode() (for URLs) -- 6 bytes
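A quick way to check which of those byte forms actually ended up in your table is to compare character count with byte count (the table and column names below are only placeholders):

-- é stored as utf8 shows 1 character but 2 bytes (hex C3A9);
-- latin1 shows 1 character and 1 byte (hex E9); the literal entity shows 8 of each.
SELECT col,
       HEX(col)         AS bytes_hex,
       CHAR_LENGTH(col) AS num_characters,
       LENGTH(col)      AS num_bytes
FROM mytable;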
Responding to Comments
If entries_lists, entries_languages, and authors_entries are many:many mapping tables, please consider the several optimizations mentioned here.
Do not use utf8_encode. Instead, figure out what caused the values not to be encoded correctly and/or not displayed correctly. Start with
echo bin2hex($record['author']);
SELECT name, HEX(name) FROM authors WHERE ...
for some author with an accented letter.

MySQL regexp with Japanese furigana

I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.
Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phonetic kana is placed above the kanji; this is called furigana.
I present these phonetic readings to the user with the <ruby> tag in the following format:
<ruby>
<rb>勉強</rb> <!-- the kanji -->
<rp>(</rp> <!-- define where the phonetic part starts in the string -->
<rt>べんきょう</rt> <!-- the phonetic kana itself -->
<rp>)</rp> <!-- define the end of the phonetic part -->
</ruby>する <!-- the last part is already phonetic so needs no ruby -->
The strings are stored in my database like this:
勉強(べんきょう)する
where anything between the parentheses is the reading for the kanji immediately preceding it. Storing the strings this way allows a fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).
All of this is fine, but the problem comes when a user is searching. If they search for
勉強
Then it will show up. But if they try to search for
勉強する
it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.
The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.
Is there a way to do what I want?
As said in "How to do a regular expression replace in MySQL?", it seems to be impossible without a user-defined function (you can only replace literal sequences).
Rather dirty solution: you can tolerate anything between two consecutive Japanese characters, LIKE '勉%強%す%る'. I never suggested that.
Or, you can keep an optional field in your table that potentially contains a version with furigana.
I would advise against using LIKE queries because you would have to put a % between every single character (since you don't know where furigana will occur), and that could end up creating false positives (e.g. if a valid character appeared between 勉 and 強).
As @Jill-Jênn Vie briefly mentioned, I'd suggest adding a new column to hold the text with furigana.
I'm working on an application which performs searches on Korean text. The problem is that Korean conjugation changes the characters. For example:
하다 + 아요 = 해요
"하다" is the verb "to do" in dictionary form and "아요" is the standard polite-form conjugation. Presumably you are a Japanese speaker, so you know how common such polite forms can be! Note how the 하 changes to 해. Obviously, if users try to search for "하다" in the string "해요", they won't find it. But if users want to see all instances of "하다" in the corpus, we need to be able to return it.
Our solution was two columns: "form" (conjugated form) and "analytic_string" which would represent "해요" as "하다+아요". You could take a similar approach and make a second column containing your sentence without furigana.
The main disadvantages of this approach are that you're effectively doubling your database size and that you need to pay special attention when inputting data so that the two columns hold the same content (I found a few rows in my database where the form and the analytic string contained different words). The advantage is that you can easily search your data while ignoring furigana.
It's your standard "size vs. performance" trade-off. Which is more important: size of the database or execution time? Any other solution I can think of involves returning too many rows and then individually analyzing them.
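A minimal sketch of that two-column search, using hypothetical vocab / japanese / japanese_plain names (japanese_plain would hold the sentence with the parenthesized furigana stripped out):

-- Search the stripped column, but display the original string with furigana
SELECT english, japanese
FROM vocab
WHERE japanese_plain LIKE CONCAT('%', '勉強する', '%');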

Characters "ي" and "ی" and the difference in persian - Mysql

I'm working on a UTF-8 Persian website with an integrated MySQL database. All the content on the website is entered through an admin panel, and it's all Persian.
As you might know, Arabic has largely the same letters as Persian, with a few exceptions.
The problem is that when a person types on a keyboard with an Arabic layout it produces the character "ي", while a keyboard with a Persian layout produces "ی".
So if a person searches for 'بازی', MySQL won't find 'بازي' in the results.
Important Note: 'ی' is not the only character with this property; there are lots of them, and they are very similar.
How can I fix this issue?
One simple, naive solution seems to be to replace all "ي" with "ی" before importing the data into the database, but I'm looking for a more robust solution than this.
Dear EBAG, We have a single Arabic block in Unicode which contains both Arabic & Persian characters.
U+06CC is the Persian ی and U+064A is the Arabic ي.
The default Windows keyboard uses code page 1256 for Arabic script, which makes U+064A the default ي for both Persian and Arabic users, because Arabic users far outnumber Persian ones.
ISIRI defined a standard keyboard layout, ISIRI 9147, which includes both the Arabic and the Persian Yeh, with the Persian ی as the default. Persian users on the standard layout will enter (and search with) the standard Persian ی, while the rest use the Arabic ي.
As you said, the usual approach is to convert the Arabic ي to the Persian ی while saving data to the database, and to query only the Persian form when reading, so everything stays consistent.
The second approach is to use a JavaScript file in the web application to control user input. Most Persian websites use this approach for saving characters to the database. With it, the user doesn't need to install any Persian or Arabic keyboard layout; they just leave the keyboard on English, and the JavaScript maps each keystroke to the equivalent Persian character. There is an ISIRI 9147 JavaScript implementation for web applications, along with a Persian guide on how to use it.
The third approach is an on-screen keyboard, which works just like the previous one but with a user interface; it's usually good for those who are not familiar with the Persian keyboard layout.
The fourth approach is to search for both forms. As you know, when you install MySQL or SQL Server you can choose the collation, including options for diacritic (and case) sensitivity. If you enable an Arabic collation with the appropriate sensitivity settings, you can get results for both forms; this usually works fine in SQL Server, though I haven't tested it in MySQL. This is the best solution yet.
But if I were you, I would implement a simple SQL function that takes an nvarchar and returns an nvarchar, call it whenever data is written, and query for the standard form whenever you read.
Sorry for the long tail.
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,NCHAR(1610),NCHAR(1740))
or
update TABLENAME set COLUMNNAME=REPLACE(COLUMNNAME,'ي',N'ی')
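Building on the idea above of a small SQL function that normalizes text on write, a MySQL sketch might look like the following (the function name and lengths are assumptions; it maps the Arabic Yeh and Kaf to their Persian forms):

DELIMITER //
CREATE FUNCTION normalize_persian(txt VARCHAR(255) CHARACTER SET utf8mb4)
RETURNS VARCHAR(255) CHARACTER SET utf8mb4 DETERMINISTIC
BEGIN
  -- U+064A -> U+06CC (Yeh) and U+0643 -> U+06A9 (Kaf)
  RETURN REPLACE(REPLACE(txt, 'ي', 'ی'), 'ك', 'ک');
END //
DELIMITER ;

-- Call it on every write, e.g.:
-- INSERT INTO posts (title) VALUES (normalize_persian(?));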
This is called a collation. It's what MySQL uses to compare two different characters. I'm afraid I don't know anything about persian or arabic, but the concept is the same. Essentially you've got two characters which map to the same base value. You need to find a collation which maps ي to ی. I'm afraid that's as helpful as I can be without knowing more about the language.
The first letter (ي) is Yāʾ in the arabic alphabet.
The second letter (ی) is ye in the perso-arabic alphabet.
More on the perso-arabic alphabet here:
http://en.wikipedia.org/wiki/Perso-Arabic_alphabet
"Two dots are removed in the final ye (ی). Arabic differentiates the final yāʾ with the two dots and the alif maqsura (except in Egyptian Arabic), which is written like a final yāʾ without two dots.
Because Persian drops the two dots in the final ye, the alif maqsura cannot be differentiated from the normal final ye. For example, the name Musâ (Moses) is written موسی. In the final letter in Musâ, Persian does not differentiate between ye or an alif maqsura."
Seems to be an interesting problem...
I was struggling with a similar situation 5-6 years ago, when Lucene was not an option for MySQL and there was no Sphinx (I never tried Sphinx on this). What I did was find pretty much all of the possible alternations and put them in an array in PHP.
So if the input keyword contained any of those characters, I generated all the possible alternates of it.
So for the input 'بازی' I would generate { 'بازي', 'بازی' } and then query MySQL for both, like the simplest query below:
SELECT title, Description FROM Games WHERE Description LIKE '%بازي%' OR Description LIKE '%بازی%'
The primary list of alternatives is not very long though.
If you have the possibility of switching DB engines, you might want to look into the full-text search functionality of PostgreSQL:
http://www.postgresql.org/docs/9.0/static/textsearch.html
Among other things, you can configure it so that it indexes/searches unaccented characters, and you can define all sorts of additional dictionaries (e.g. stop words, thesaurus, synonyms, etc.).
If not, consider using Sphinx or Lucene instead of LIKE statements for your searches.
I know answering this topic is like digging a corpse out of its grave since it's really old, but I'd like to share my experience. IMHO, the best way is to wrap the request and apply your replacements there; it's more portable than the other approaches. Here is a Java sample:
public class FarsiRequestWrapper extends HttpServletRequestWrapper {
    public FarsiRequestWrapper(HttpServletRequest request) {
        super(request);
    }
    @Override
    public String getParameter(String name) {
        String parameterValue = super.getParameter(name);
        if (parameterValue == null) {
            return null;
        }
        // String.replace returns a new string, so the result must be reassigned
        parameterValue = parameterValue.replace("ی", "ي");       // normalize Yeh
        parameterValue = parameterValue.replaceAll("\\s+", " "); // collapse whitespace
        parameterValue = parameterValue.replace("ک", "ك");       // normalize Kaf (assumed same direction as the Yeh replacement)
        return parameterValue.trim();
    }
}
Then you only need to set up a servlet filter:
public class FarsiFilter implements Filter {
    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain) throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        FarsiRequestWrapper rw = new FarsiRequestWrapper(req);
        chain.doFilter(rw, response);
    }
    // no-op implementations required by the Filter interface on older Servlet APIs
    public void init(FilterConfig filterConfig) {}
    public void destroy() {}
}
Although this approach only works in Java, I found it simpler and better.
You must use N (meaning uNicode) before non-English characters, for example:
REPLACE(COLUMNNAME, N'ي', N'ی')

ilike search working irregularly with characters "ÅÄÖ"

Alright, so I've got a pretty weird error.
Say I have the string "Ön Åsa Äpple föll ner i sjön"
query |result
Ön |Found
ÖN |Found
ön |Found
<same pattern works with Åsa and Äpple as well>
föll |Found
Föll |Found
FÖLL |Not found
This doesn't make any sense to me. Clearly searching with capital ÅÄÖ works, but for some reason not when the ÅÄÖ-character is not the first letter of a word?
I have a Rails application and a MySQL database.
This is the corresponding code:
dataset = DB[category_class.table_name.to_sym]
dataset = dataset.where(:headline.ilike("%" + headline + "%")) if headline.present?
I'd be very grateful for any comments or answers that leads me in the right direction to solving this problem.
Regards, Emil
This is probably something to do with the way the utf8_unicode_ci collation compares strings. A collation is a collection of rules that determine when two strings are considered "the same", and when they are not; it also determines the order in which strings are sorted.
To see the collations for the utf8 character set, issue SHOW COLLATION LIKE 'utf8%'. You will see that there are a variety of language-specific collations. These don't limit the values you can store in the column*, but rather impose string comparison rules tailored to text that is expected to be primarily in a specific language. Alternatively, you can use the utf8_general_ci collation, which largely ignores accents.
* Although if unique keys are involved, they might change whether one value is considered a duplicate of another or not.
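As a concrete illustration (the table and column names below are placeholders; this assumes the column uses the utf8 character set, as in the question), you can force a different collation for a single comparison and see whether the match behaviour changes:

-- Compare the column under an explicit collation instead of its default
SELECT headline
FROM articles
WHERE headline COLLATE utf8_general_ci LIKE '%FÖLL%';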