When using MySQL full text search in boolean mode there are certain characters like + and - that are used as operators. If I do a search for something like "C++" it interprets the + as an operator. What is the best practice for dealing with these special characters?
The current method I am using converts all + characters in the data to _plus. It also converts the &, # and / characters to textual representations.
There's no way to do this nicely using MySQL's full text search. What you're doing (substituting special characters with a pre-defined string) is the only way to do it.
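The substitution approach described above can be sketched as follows. The mapping and function names here are illustrative, not from the original post; the key point is that the same encoding must be applied both to the data before indexing and to every query string.

```python
# Illustrative substitution table -- only "+" -> "_plus" is from the
# original post; the other tokens are hypothetical examples.
SPECIAL_TOKENS = {
    "+": "_plus",
    "#": "_sharp",
    "&": "_amp",
    "/": "_slash",
}

def encode_for_fulltext(text: str) -> str:
    """Replace operator characters so MySQL boolean mode won't interpret them."""
    for char, token in SPECIAL_TOKENS.items():
        text = text.replace(char, token)
    return text

print(encode_for_fulltext("C++"))   # C_plus_plus
print(encode_for_fulltext("AT&T"))  # AT_ampT
```

Running the same function over search input keeps queries consistent with what was indexed.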
You may wish to consider using Sphinx Search instead. It apparently supports escaping special characters, and by all reports is significantly faster than the default full text search.
MySQL is fairly brutal in what tokens it ignores when building its full text indexes. I'd say that where it encountered the term "C++" it would probably strip out the plus characters, leaving only C, and then ignore that because it's too short. You could probably configure MySQL to include single-letter words, but it's not optimised for that, and I doubt you could get it to treat the plus characters how you want.
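A rough sketch of the indexing behaviour described above (this is NOT MySQL's actual tokenizer, just an illustration of the effect): non-word characters delimit tokens, and tokens shorter than the minimum word length are dropped.

```python
import re

MIN_WORD_LEN = 4  # MySQL's default ft_min_word_len

def index_tokens(text: str, min_len: int = MIN_WORD_LEN) -> list[str]:
    # Split on non-word characters, as a crude stand-in for the tokenizer.
    tokens = re.findall(r"\w+", text)
    # Drop tokens shorter than the minimum indexable word length.
    return [t for t in tokens if len(t) >= min_len]

print(index_tokens("I love C++"))  # ['love'] -- "C++" is reduced to "C" and discarded
```

This is why "C++" effectively disappears from the index: the plus signs are stripped and the remaining "C" is too short to keep.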
If you have a need for a good internal search engine where you can configure things like this, check out Lucene which has been ported to various languages including PHP (in the Zend framework).
Or if you need this more for 'tagging' than text search then something else may be more appropriate.
My immediate need is to do an accent-insensitive comparison in MS Access. I am using Office 365 Access.
This is not strictly speaking a Unicode question as the European accented characters are present in all of Windows-1252 (sometimes misleadingly called "ANSI" in Microsoft products and documentation), "modern" Unicode and UCS-2.
The Access "Data Types" page I found mentioned "two bytes per character", which makes it sound like UCS-2, but with no details. Similarly, the "sort order" drop-downs list a number of values that are also undocumented.
Actual example: compare "Dvorak" to "Dvořák". These are not equal in MS Access.
It is NOT my goal today to find a work-around (I can do that myself) - it is to better understand MS Access capabilities in 2023.
Having gone through the incremental support improvements for SQL Server and .NET strings, my first thought was "surely MS Access can handle collations by now (2023)".
My bottom line questions are: "exactly" what encodings ("sort orders") does Office 365 Access support in its most recent releases, and does VBA use the same character set, or will accented characters in VBA run into conversions or issues when used within MS Access?
You're not giving me a whole lot to go on, so I'll just go over the basics. It's important to note that new features rarely make it to VBA and Access, and breaking changes are extremely rare, in contrast to new versions of SQL Server or C#.
Regarding charsets and encodings (how strings are stored):
Strings in tables, queries and application objects are stored in UTF-16. They may be compressed (unicode compression option for text fields). This is independent of sort orders.
The VBA code itself is stored in the local charset (which may not support certain characters). It's generally recommended to avoid non-ASCII characters in VBA code, as this may cause issues on different computers and different charsets. See this post for some trickery if you need non-ASCII characters in VBA literals.
VBA strings are always a BSTR which uses UTF-16 characters.
Regarding sort orders/collations (how strings are compared):
Access has no full collation support, and no specific case-sensitive/case-insensitive or accent-sensitive/accent-insensitive collations.
It does support different sort orders, which determines how strings should be sorted and which characters are equal. An outdated list can be found here. Using the object browser in Access, you can navigate to LanguageConstants and check the list. In recent builds of Office 365, there are some new options that appear to use codepage 65001 (= UTF-8) but I haven't seen docs or experimented with it.
In VBA, string comparisons and sorts are determined by an Option Compare statement at the top of the module. Most VBA hosts support only two: Option Compare Binary (compare by binary value, so any byte difference makes strings unequal and sorts are case sensitive) and Option Compare Text (use the local language settings to compare strings). Access adds a third, Option Compare Database, which uses the database sort order to compare strings.
Note that not all functions support all unicode characters. Functions with limited support include MsgBox and Debug.Print. This can make it especially hard to debug code when working with characters not in the system code page.
Further notes
VBA does allow (relatively) easy access to the Windows API. Instead of rolling your own string comparison function, you could use CompareStringEx which has options to do case-insensitive diacritic-insensitive comparisons.
Note that for external functions, you need to pass string pointers using StrPtr; passing a String directly will automatically convert it from a BSTR to a pointer to a null-terminated string in the system codepage. See this answer for a basic example of how to call a WinAPI function with a unicode string. You will also have to look up and declare all the constants, e.g. Public Const NORM_IGNORECASE As Long = &H1, etc.
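The WinAPI route above is VBA-specific, but the comparison that CompareStringEx performs with NORM_IGNORECASE and LINGUISTIC_IGNOREDIACRITIC can be illustrated in Python as a sketch (this is a conceptual equivalent, not the same code path): decompose to NFD, strip combining marks, then compare case-insensitively.

```python
import unicodedata

def accent_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring case and diacritics (conceptual sketch)."""
    def fold(s: str) -> str:
        # NFD splits "ř" into "r" + a combining caron, etc.
        decomposed = unicodedata.normalize("NFD", s)
        # Drop the combining marks, keeping only base characters.
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()
    return fold(a) == fold(b)

print(accent_insensitive_equal("Dvorak", "Dvořák"))  # True
```

This matches the example from the question: "Dvorak" and "Dvořák" compare equal once diacritics and case are folded away.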
I am using a MySQL database, and for a field I chose varchar(200). To prevent issues I set maxlength 200 on my HTML page, so ideally there should be no problem. But if I let the user input 200 characters, I get an exception. After trial and error, only at 190 characters can I be sure it also fits in the database. So in future, to prevent issues, I will always make the varchar() size 20% bigger than what the user can input on the HTML page.
Maybe carriage returns are considered 2 characters each when it comes to maxlength. Can you check whether you have any carriage returns?
1\r\n
1\r\n
1\r\n
1
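The suggestion above can be checked with a quick sketch: the four visible "1" lines separated by CRLF line breaks contain 10 stored characters, while a counter that sees only LF (as a browser's maxlength may, depending on how it normalizes line breaks) counts 7.

```python
# The value from the example above: four "1"s separated by CRLF pairs.
value = "1\r\n1\r\n1\r\n1"

print(len(value))                    # 10 characters as stored with CRLF
print(len(value.replace("\r", "")))  # 7 -- what an LF-only count would see
```

If the database receives CRLF pairs but the client-side limit counts line breaks as one character, the stored string is longer than maxlength suggests.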
A varchar(200) should be able to store 200 characters. You shouldn't need to increase the size, and an arbitrary percentage increase won't be guaranteed to solve the problem unless you know what is causing it. The danger of an overflow will remain.
Some possible reasons that spring to mind:
As noted by @VigneshKumarA, it could be carriage returns being stored as two characters.
It could also be multibyte unicode characters -- ie anything other than the basic ASCII character set. If you're entering accented letters or symbols, or non-Latin scripts, they will take up more than one byte per character.
Escaped/encoded characters, if you are sanitising your data. For example, if you're running htmlentities() or similar on the input string, single characters from the input may be converted into entity codes like &amp;. This will obviously make the string longer than it was when input.
What I would recommend is that you use a database tool to examine the stored data and check to see why it is storing more characters than you expected. Understand what the discrepancy is caused by, and then either fix it or adapt your system to handle it so that you can be sure it will never overflow.
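Two of the causes listed above are easy to demonstrate in a sketch: multibyte characters take more than one byte each, and entity-encoding expands the string well beyond what the user typed (Python's html.escape stands in for PHP's htmlentities here).

```python
import html

s = "café"
print(len(s))                  # 4 characters
print(len(s.encode("utf-8")))  # 5 bytes -- "é" needs two bytes in UTF-8

escaped = html.escape("R&D <lab>")
print(escaped)       # R&amp;D &lt;lab&gt;
print(len(escaped))  # 19 -- the 9 input characters more than doubled
```

Whether the column limit is hit depends on whether the database counts characters or bytes, and on whether escaping happens before or after the length check.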
I need to test how Box.net search works in my application. For this I need more information about the search pattern. I see search results are compared with both file title and content.
Search shows different behaviour when I have file names with special characters. Will search work when file names contain special characters?
The following is the query I am using:
boxSearch = client.getSearchManager().search(searchFileName, boxDefaultRequestObject);
Can you share the pattern used during search, which characters are allowed, and in what character combinations results are seen?
Here are some resources on search:
https://support.box.com/hc/en-us/articles/200519888-How-do-I-search-for-files-and-folders-in-Box-
Box's search matches folder/file names and content, and it also accepts boolean operators. Just don't use mixed case (aNd is NOT okay, while AND or and is okay).
Box also accepts special characters in uploads and search. See the description here, as this was a fairly recent product update that came in mid-2013.
Additional special character support – Box will add support for more types of special characters across the Box website, desktop and mobile apps. Once the change is live, Box products will support almost all printable characters (except / \ or empty file names; also will not support leading or trailing spaces on files and folders).
I use regular expressions in MySQL on multibyte-encoded (UTF-8) data, but I need matching to be case-insensitive. As MySQL has a bug (unresolved for many years) that prevents it from properly matching multibyte-encoded strings case-insensitively, I am trying to simulate the insensitivity by lowercasing both the value and the regexp pattern. Is it safe to lowercase a regexp pattern this way? I mean, are there any edge cases I forgot?
Could following cause any problems?
LOWER('šárKA') REGEXP LOWER('^Šárka$')
Update: I edited the question to be more concrete.
MySQL documentation:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
It is their bug, filed in 2007 and still unresolved. However, I can't just change the database to solve this issue. I need MySQL to somehow consider 'Š' equal to 'š', even if it means hacking it with a not-so-elegant solution. Characters other than accented (multi-byte) ones match fine and with no issues.
The i option for the Regex will make sure it matches case insensitively.
Example:
'^(?i)Foo$' // (?i) will turn on case insensitivity for the rest of the regex
'/^Foo$/i' // the i option turns off case sensitivity
Note that these may not work in your particular flavour of regex (which you haven't specified), so make sure you consult your manual for the correct syntax.
Update:
From here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
REGEXP is not case sensitive, except when used with binary strings.
As no one actually answered my original question, I did my own research and realized it is not safe to lowercase or uppercase a regular expression without further processing. To be precise, it is safe to do this with theoretically pure regular expressions, but every sane implementation adds character classes and special directives which are vulnerable to case changes:
Escape sequences like \n, \t, etc.
Character classes like \W (non-alphanumeric) and \w (alphanumeric).
Character classes like [.characters.], [=character_class=], or [:character_class:] (MySQL regular expressions dialect).
Lowercasing or uppercasing \W and \w could completely change a regular expression's meaning. This leads to the following conclusions:
The presented solution, as-is, is a no-go.
The presented solution is possible only if the regular expression is lowercased in a more sophisticated way than just using LOWER or similar: it has to be parsed and the case changed carefully.
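A concrete instance of the \W/\w hazard described above, sketched in Python (the same escape letters exist in MySQL's regex dialect): lowercasing the pattern string also lowercases the escape letter, which inverts its meaning.

```python
import re

pattern = r"^\W$"          # matches exactly one NON-alphanumeric character
lowered = pattern.lower()  # becomes r"^\w$": one alphanumeric character

print(bool(re.fullmatch(pattern, "!")))  # True
print(bool(re.fullmatch(lowered, "!")))  # False -- meaning flipped by lower()
print(bool(re.fullmatch(lowered, "a")))  # True
```

This is why the pattern has to be parsed first: only literal characters may be case-folded, while escape sequences and class names must be left untouched.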
I want to write a C-program that gets some strings from input. I want to save them in a MySQL database.
For security I would like to check, if the input is a (possible) UTF-8 string, count the number of characters and also use some regular expressions to validate the input.
Is there a library that offers me that functionality?
I thought about using wide characters, but as far as I understand, whether they support UTF-8 depends on the implementation and is not defined by a standard.
And I would also be missing the regular expressions.
PCRE supports UTF-8. To validate the string before any processing, the W3C suggests this expression, which I re-implemented in plain C; but PCRE already checks UTF-8 validity automatically, in accordance with RFC 3629.
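The question asks for C, but the validation step itself is easy to sketch in Python for reference: a byte string is valid UTF-8 exactly when a strict decode succeeds (Python's decoder enforces the RFC 3629 rules, rejecting overlong forms and surrogates, much as PCRE's check does), and the character count falls out of the decoded string.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("Šárka".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc3\x28"))              # False -- invalid continuation byte
print(len("Šárka"))                            # 5 -- character count, not byte count
```

In C, the equivalent would be to let PCRE compile the pattern with the UTF option and rely on its built-in validity check, or to walk the byte sequence by hand.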