Mysql multiple maching patterns in substring_index - mysql

Can I use something like case to give multiple matching patterns in substring_index?
More specifically in my case, can I matching a set of chars according to their ascii?
Add some examples:
中文Q100
中文T800
中文中文K999
The strings start with some Chinese characters, then following by some numbers or latin letters, what I want is to split the string into two parts: one contains the Chinese characters(from leftmost to the first western letter), the other is from the first western letter to the rightmost.
Like these:
中文, Q100
中文, T800
中文中文, K999

There are multiple ways to resolve a matter. I'll give you 3 of them, starting from most right.
Architecture solution
Using application
Your question is about - replacing by regular expression. And that has a weak support in MySQL (to say precisely, there's no support for replacing by regex). Thus, you may do: select whole record, then split it in applicaition, using a-zA-Z0-9 mask, for example.
Or may be change table structure?
Well, alternative is: may be you should just separate this data to 2 columns? If your intention is to work with separate parts of data, then may be it's a sign to change your DB architecture?
Using MySQL
Second way is to use MySQL. To do it - yes, you'll use REPLACE() as it is. For instance, to get rid of all alphanumeric symbols, you'll do:
SELECT [...REPLACE(REPLACE(str, 'z', ''), 'y', '')...]
that is a pseudo-SQL, since posting whole 26+26+10 instances of REPLACE would be mad (however, using this is also mad). But that will resolve your issue, of course.
Using external REGEXP solution
This is third way and it has two subcases. You may either use UDF or stored routines.
Using UDF
There are third-party libraries which provide regular expression replacement functionality. Then all you need to do is to include those libraries into your server build. Example: lib_mysqludf_preg This, however, will require additional actions to use those libraries.
Using stored routines
Well, you can use stored routines to create your own replacement function. Actually, I have already written such library, it's called mysql-regexp and it provides REGEXP_REPLACE() function, which allows you to do replacements in strings by regular expressions. It's not well-tested, so if you'll decide to use it - do that on your own risk. Sample would be:
mysql> SELECT REGEXP_REPLACE('foo bar34 b103az 98feo', '[^a-z]', '');
+--------------------------------------------------------+
| REGEXP_REPLACE('foo bar34 b103az 98feo', '[^a-z]', '') |
+--------------------------------------------------------+
| foobarbazfeo |
+--------------------------------------------------------+
1 row in set (0.00 sec)
Since it's completely written with stored code, you won't need to re-build your server or whatever.

Related

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found in SOF, but none supported here : https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions.
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex I can use to so I can just get the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/
Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. Information according this article about numbered backreferences from Regular-Expressions.info
See regex in use here
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute values might either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"])"|'([^']*))
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expresions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grabbing anything between two " symbols. Note that it does not check for escaped backslashes, it will match any non-" character between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"])"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.
For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).

MySQL regexp with Japanese furigana

I have a large database (~2700 entries) of vocabulary. Each row contains an English word, the Japanese equivalent, and other data not relevant to this problem. I have created a facility to search and display the results in a table, but I'm having a small problem with the furigana.
Japanese sentences are written with a mix of Chinese characters (kanji) and the phonetic scripts (kana). Not everyone can read every kanji, and sometimes the same kanji has multiple readings. In those cases, the phoetic kana is placed above the kanji - this is called furigana:
I present these phonetic readings to the user with the <ruby> tag in the following format:
<ruby>
<rb>勉強</rb> <!-- the kanji -->
<rp>(</rp> <!-- define where the phonetic part starts in the string -->
<rt>べんきょう</rt> <!-- the phonetic kana itself -->
<rp>)</rp> <!-- define the end of the phonetic part -->
</ruby>する <!-- the last part is already phonetic so needs no ruby -->
The strings are stored in my database like this:
勉強(べんきょう)する
where anything between the parentheses is the reading for the kanji immediately preceeding it. Storing the strings this way allows fallback for browsers that don't support ruby tags (such as, amazingly, Firefox).
All of this is fine, but the problem comes when a user is searching. If they search for
勉強
Then it will show up. But if they try to search for
勉強する
it won't work, because in the database there is a string defining the phonetic pronunciation in the middle.
The full-width parentheses in the above example are used only to denote this phonetic script. Given this, I am looking for a way to essentially tell the MySQL search to ignore anything it finds between rounded parentheses. I have a basic knowledge of how to do most simple queries in MySQL, but I'm certainly not an expert. I have looked at the docs, but (to me, at least) they are not very user-friendly. Perhaps not very beginner-friendly. I thought it might be possible with some sort of construction involving a regular expression, but I can't figure out how.
Is there a way to do what I want?
As said in How to do a regular expression replace in MySQL?, there seems to be impossible without an user-defined function (you can only replace explicit sequences).
Rather dirty solution: you can tolerate anything between two consecutive Japanese characters, LIKE '勉%強%す%る'. I never suggested that.
Or, you can keep an optional field in your table that potentially contains a version with furigana.
I would advise against using LIKE queries beause you would have to have a % between every single character (since you don't know WHEN furigana will occur) and that could end up creating false positives (like if a valid character appeared between 勉 and 強).
As #Jill-Jênn Vie breifly mentioned, I'd suggest adding a new column to hold the text with furigana.
I'm working on an application which performs searches on Korean text. The problem is that Korean conjugation changes the characters. For example:
하다 + 아요 = 해요
"하다" is the verb "to do" in dictionary form and "아요" is the standard polite-form conjugation. Presumably you are a Japanese speaker, so you know how common such polite forms can be! Note how the 하 changes to 해. Obviously, if users try to search for "하다" in the string "해요", they won't find it. But if users want to see all instances of "하다" in the corpus, we need to be able to return it.
Our solution was two columns: "form" (conjugated form) and "analytic_string" which would represent "해요" as "하다+아요". You could take a similar approach and make a second column containing your sentence without furigana.
The main disadvantages of this approach is that you're effectively doubling your database size and you need to pay special attention when inputting data that the two column have the same data (I found a few rows in my database where the form and the analytic string have different words in them). The advantage is you can easily search your data while ignoring furigana.
It's your standard "size vs. performance" trade-off. Which is more important: size of the database or execution time? Any other solution I can think of involves returning too many rows and then individually analyzing them.

How to compile a complete list of MySQL "Words"

Really getting into MySQL and one thought I've had on mastering one aspect of it is to gather a complete listing of MySQL words. One example of this might be the Reserved Words list, though it appears that's not a complete list; example: CONCAT, CRC32, etc.
Bizarre as it may seem, I was thinking that such a list might exist, or that there might even be a query that would yield it, and/or a way to extract it from the source code of MySQL.
It is a non-scientific method, but what I would do is:
extract all strings from Native_func_registry func_array. Lookup for it sql/item_create.cc , e.g in
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/item_create.cc
Those should cover builtin functions.
extract strings from 'symbols' and 'functions' in lexer :
http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/lex.h
extract symbols from bison input http://bazaar.launchpad.net/~mysql/mysql-server/mysql-trunk/view/head:/sql/sql_yacc.yy from lines
%token SOMETOKEN
except when tokens have _SYM suffix (they are covered by sql/lex.h)
Combine all of those, and the resulting set might come near :)

MySQL integer comparison ignores trailing alpha characters

So lets just say I have a table with just an ID is an int. I have discovered that running:
SELECT *
FROM table
WHERE ID = '32anystring';
Returns the row where id = 32. Clearly 32 != '32anystring'.
This seems very strange. Why does it do this? Can it be turned off? What is the best workaround?
It is common behavior in most programming languages to interpret leading numerals as a number when converting a string to a number.
There are a couple of ways to handle this:
Use prepared statements, and define the placeholder where you are putting the value to be of a numeric type. This will prevent strings from being put in there at all.
Check at a higher layer of the application to validate input and make sure it is numeric.
Use the BINARY keyword in mysql (I'm just guessing that this would work, have never actually tried it as I've always just implemented a proper validation system before running a query) -
SELECT *
FROM table
WHERE BINARY ID = '32anystring';
You need to read this
http://dev.mysql.com/doc/refman/5.1/en/type-conversion.html
When you work on, or compare two different types, one of them will be converted to the other.
MySQL conversion of string->number (you can do '1.23def' * 2 => 2.46) parses as much as possible of the string as long as it is still a valid number. Where the first letter cannot be part of a number, the result becomes 0

Separating arguments to a function, in locales with comma as decimal marker

In locales, e.g. French, with comma as decimal indicator (where "5,2" means five and two-tenths), how do users separate function arguments from each other?
For example, in many programming/scripting languages, I could specify MAX(1.5, X) in an EN-US locale. How do you avoid the ambiguity between the comma as decimal indicator, and as argument separator?
In particular, I'm interested in how software that's perceived as user-friendly in the foreign locale does it. Obviously, it's a no-brainer to say, "though shalt use decimal POINTs", but that's not particularly "friendly".
We struck this exact problem when putting together queries for our product using DB2.
We had a query along the lines of:
select fn (COLUMN_NAME,5.2) from TABLE_NAME
and the 5.2 was actually inserted dynamically by our software, based on a user selection (forget about injection attacks for now, they were prevented via another means).
In locales where "five and two tenths" was written as 5,2 (darn those Europeans), we ended up with:
select fn (COLUMN_NAME,5,2) from TABLE_NAME
which DB2 complained about bitterly.
The IBM-given solution was to ensure that separator commas had spaces on either side and decimal commas did not:
select fn (COLUMN_NAME , 5,2) from TABLE_NAME
But you'll find that many languages solve this problem in exactly the non-friendly way you mention.
None of the ISO C-like standards (well, none that I know of, anyway) allow a constant number to be declared as 5,2, despite the fact that ISO is very much an "internationalised" organisation.
They also don't require that keywords like if, while and for be localised into Arabic, or Mongolian, or Klingon, which would lead to code such as:
chugh (rIn)
nobHa';