I'm about to do a search/replace in a MySQL database. I want to replace one word in a field containing HTML text with another word. The problem is that the word I'm looking for is, in some cases, part of a path. E.g.:
<... src="../SEARCHTERM/img.jpg" />
Obviously I do not want to replace this instance. So my question is: What is the best way to do this? How do I replace only if the word is not part of a path?
1. Select all records that contain the word to replace.
2. Process the records using the programming language of your choice.
3. Update the records in the database with your processed records.
The problematic part of this is item 2. Depending on your processing needs, either a regular expression replacement will do (like preg_replace in PHP), or you will need a fully fledged HTML parser.
MySQL can match strings using regular expressions, but before version 8.0 (which added REGEXP_REPLACE) no built-in way exists to do replacement based on regular expressions. On older versions you can, however, use User Defined Functions to do that, e.g. MySQL Regular Expression UDFs. Then again, the question is whether regular expressions are sufficient for your replacement needs. In a lot of cases involving HTML, they are not.
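As a rough illustration of the "process the records" step, here is a minimal Python sketch that replaces a word only in the text between tags, so attribute values such as src paths are left alone. The function name replace_outside_tags is made up for this example, and the sketch assumes tags contain no ">" inside attribute values; for messier HTML a real parser is the safer choice.

```python
import re

# Splitting on a capturing group keeps the tags in the result list,
# so text pieces land on even indices and tags on odd indices.
TAG_SPLIT = re.compile(r"(<[^>]*>)")

def replace_outside_tags(html, old, new):
    """Replace whole-word occurrences of `old` in text nodes only."""
    word = re.compile(r"\b" + re.escape(old) + r"\b")
    parts = TAG_SPLIT.split(html)
    return "".join(
        # A function replacement avoids backslash handling in `new`.
        part if i % 2 == 1 else word.sub(lambda m: new, part)
        for i, part in enumerate(parts)
    )

row = '<p>SEARCHTERM <img src="../SEARCHTERM/img.jpg" /></p>'
print(replace_outside_tags(row, "SEARCHTERM", "NEW"))
```

Note that a path written in plain text (outside any tag) would still be replaced by this sketch, so check your data before running it on everything.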
I have a table column with general text values, e.g.:
"This is Gerald's Sample Text: With some special chars"
I need to convert this text to:
"this-is-geralds-sample-text-with-some-special-chars"
with MySQL InnoDB and save the value in a separate unique column in the same table. Is there a simpler way of achieving this with a query without using procedures?
The short answer is "No". You're looking for something that behaves exactly like a regular expression replace, and MySQL does not support regex replace natively (at least not before version 8.0, which added REGEXP_REPLACE).
The longer answer is "No, but there are workarounds." You have a couple of options, and I don't terribly like either. The first is to create a function like in this question. The second is to come up with a list of bad characters and then use a set of REPLACE calls. It's ugly, but it will work.
On a side note: you might consider creating this value in your application and then just storing it along with the original. That would be cleaner in some ways than using a custom MySQL function.
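If you go the application route, something along these lines might do. It is only a sketch in Python: the function name slugify and the exact rules (drop apostrophes, turn every other run of non-alphanumeric characters into one hyphen) are assumptions inferred from the single example in the question.

```python
import re

def slugify(text):
    """Build a lowercase, hyphen-separated slug from free text."""
    slug = text.lower()
    # Drop apostrophes so "gerald's" becomes "geralds", not "gerald-s".
    slug = slug.replace("'", "")
    # Collapse every run of other non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Trim hyphens left over at either end.
    return slug.strip("-")

print(slugify("This is Gerald's Sample Text: With some special chars"))
# this-is-geralds-sample-text-with-some-special-chars
```

Since the target column is unique, the application would still need to handle two different texts producing the same slug.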
Acronyms are a pain in my database, especially when doing a search. I haven't decided if I should accept periods during search queries. These are the problems I face when searching:
'IRQ' will not find 'I.R.Q.'
'I.R.Q' will not find 'IRQ'
'IRQ.' or 'IR.Q' will not find 'IRQ' or 'I.R.Q.'
etc...
The same problem goes for ellipses (...) or three series of periods.
I just need to know what directions should I take with this issue:
Is it better to remove all periods when inserting the string to the database?
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
My responses for each question:
Is it better to remove all periods when inserting the string to the database?
Yes and no. You want the database to have the original text. If you want, create a separate field that is "cleaned up" to search against. Here, you can remove periods, make everything lowercase, etc.
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
/\.+/
That finds one or more periods in a given spot. But you'll want to integrate it with your search formula.
Note: regex matching in a database isn't known for high performance. Be cautious with this.
Other note: you may want to use FULLTEXT search in MySQL. This also isn't known for high performance with data sets of more than about 1000 entries. If you have big data and need full-text search, use Sphinx (available as a MySQL plug-in and RAM-based indexing system).
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
Yes, by having the 2 fields I described in the first bullet's answer.
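To make the two-field idea concrete, here is a small Python sketch of a cleanup function. You would apply it both to the text when filling the "cleaned up" field and to the user's search input before querying. The name search_key and the choice to turn ellipses into a space (so the words on either side don't run together) are assumptions for this example.

```python
import re

def search_key(text):
    """Normalize text for the cleaned-up search field."""
    # Treat an ellipsis (two or more periods) as a word separator.
    text = re.sub(r"\.{2,}", " ", text)
    # Remove the remaining single periods, e.g. those inside acronyms.
    text = text.replace(".", "")
    # Collapse whitespace and lowercase.
    return " ".join(text.split()).lower()

# All of these produce the same key:
for variant in ("IRQ", "I.R.Q.", "I.R.Q", "IR.Q"):
    print(variant, "->", search_key(variant))
```

Because the same function runs on both sides, 'IRQ' and 'I.R.Q.' end up comparing equal without any regex in the query itself.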
You need to consider the sanctity of your input. If it is not yours to alter then don't alter it. Instead you should have a separate system to allow for text searching, and that can alter the text as it sees fit to be able to handle these types of issues.
Have a read up on Lucene, and specifically Lucene's standard analyzer, to see the types of changes that are commonly carried out to allow successful searching of complex text.
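I think you can use MySQL's REGEXP operator to match an acronym (MySQL patterns take no delimiters, and [.] matches a literal period without needing a backslash inside the string literal):
SELECT col1, col2...coln FROM yourTable WHERE colWithAcronym REGEXP 'I[.]?R[.]?Q[.]?'
If you use PHP you can build your regexp with this simple loop:
$result = "";
// str_split turns the acronym string into an array of single characters
foreach (str_split($yourAcronym) as $char) {
    // each letter may be followed by an optional literal period
    $result .= $char . "[.]?";
}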
The functionality you are searching for is a full-text search. MySQL supports this for MyISAM tables, but not for InnoDB before version 5.6. (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html)
Alternatively you could go for an external framework that provides that functionality. Lucene is a popular open-source one. (lucene.apache.org)
There are two methods:
1. Save the data with the symbols removed from the text and match against that.
2. Use a regex, for example:
select * from table where acronym regexp '^[A-Z]+[.]?[A-Z]+[.]?[A-Z]+[.]?$';
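Please note, however, that with a binary or case-sensitive column this requires the acronym to be stored in uppercase; with a case-insensitive collation, REGEXP already ignores case. If you don't want the case to matter in the case-sensitive situation, just change [A-Z] to [A-Za-z].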
I am having trouble figuring out if it's possible to force MySQL to consider certain specified strings as identical when choosing results in a select query.
For example, I have a column containing the word "trachiotomy", but due to the nature of the language it is very likely that the search query will be "trahiotomy" (notice the missing c).
Is there any way I can force the query to treat one pattern of letters as equivalent to another?
For example to match any instance within words of the "ach" sequence of letters to "ah" also - and vice versa. In essence force it regardless of how it was written.
Another example would be the word Archon - which I would like to match with Arhon as well.
So that if a user input was Archon it would match the database data Arhon and vice versa.
I experimented with soundex a bit and it does match some instances, but it seems that due to the way the algorithm works it can't do it in cases where the desired matched string is at the beginning of the word.
For instance the word "Chorevo" can't match the word "Horevo" unless I can somehow force it to consider that "chor" is equal to "hor" and vice versa in any word.
I am reading up on REGEXP to see if it can be matched that way somehow (something like REGEXP 'arch', 'arh').
At this point I am using a full-text match query, but could change that if that proves to be a problem.
I am not sure I have made this clear but would appreciate any help possible.
This is known as phonetic matching. MySQL implements a relatively primitive version of this in the soundex(str) function and the a SOUNDS LIKE b operator (which is just shorthand for soundex(a) = soundex(b)). By nature such matching is language-specific, and the MySQL implementation is designed for English words and thus may not work in your situation.
Alternatively you could research/write your own transformation that does what you want and apply it to the data before saving in the database (in a separate column or table).
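A home-grown transformation could be as small as the Python sketch below. The rule list is hypothetical and holds only the single "ch" to "h" equivalence taken from the examples in the question; a real list would depend on the language. The idea is to store phonetic_key(word) in the separate column and compare it with phonetic_key of the user's input.

```python
# Hypothetical equivalence rules: (spelling, canonical form).
# Only the "ch" -> "h" rule from the question is included here.
RULES = [("ch", "h")]

def phonetic_key(word):
    """Map spellings that should be treated as identical to one key."""
    key = word.lower()
    for spelling, canonical in RULES:
        key = key.replace(spelling, canonical)
    return key

for a, b in (("Archon", "Arhon"), ("Chorevo", "Horevo"),
             ("trachiotomy", "trahiotomy")):
    print(a, b, phonetic_key(a) == phonetic_key(b))
```

Unlike soundex, this works the same at the start of a word as in the middle, because it is a plain substring rewrite.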
Is it possible to make some characters stored in MySQL invisible to search queries?
Of course, I can do this in the application, but is there maybe some setting in MySQL for this?
I am still not sure I am following what you want. It sounds like a query like
SELECT * FROM `table` WHERE REPLACE(string_field, "#", "") = "user query"
might be what you are looking for.
See REPLACE. For more complicated matching, there's also regular expressions, although that would probably be rather messy for what you are describing.
EDIT: Just saw your comment. It sounds like you want to blacklist certain characters from the user's query as they are special to your system. No, there's no way to do that. Somewhere you are going to want a string replace operation to remove those characters; either in your application or in a stored procedure/function if you want to put it in the database.
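On the application side, that string replace can be very short. A minimal Python sketch, assuming "#" is the special character as in the query above (the function name strip_special and its default argument are made up for this example):

```python
def strip_special(query, special="#"):
    """Remove every character listed in `special` from the query."""
    # With three arguments, str.maketrans maps the characters in the
    # third argument to None, so translate() deletes them.
    return query.translate(str.maketrans("", "", special))

print(strip_special("user# query"))
```

The cleaned value can then be passed to the database as a bound parameter.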
Our customer's data (SQL Server 2005) has HTML entities in it (é -> &eacute;).
We need to search inside those fields, so a search for "équipe" will find "&eacute;quipe".
We can't change the data, because our customer's customers can edit those fields at will (with an HTML editor), so if we remove the entities, on the next edit they might reappear, and the problem will still be there.
We can't use a .NET server-side function, because we need to find the rows before they are returned to the server.
I could use a function that replaces the entities with their UTF-8 counterparts, but it's kind of tiresome, and I think it seriously hurts search performance (something about a full table scan, if I recall correctly).
Any ideas?
Thanks
You would only need to examine and encode the incoming search term.
If you convert "équipe" to "&eacute;quipe" and use that in your WHERE/FTS clause then any index on that field could still be used, if the optimizer deems it appropriate.
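Encoding the search term could look roughly like the sketch below. It is written in Python for illustration; the function name encode_entities is made up, and it assumes the HTML editor stores named entities for non-ASCII characters. If the editor writes numeric entities or raw characters instead, you would need to search for those variants as well.

```python
from html.entities import codepoint2name

def encode_entities(term):
    """Replace non-ASCII characters with named HTML entities."""
    out = []
    for ch in term:
        name = codepoint2name.get(ord(ch))
        # Assumption: ASCII characters are stored as-is, so only
        # non-ASCII characters with a known entity name are encoded.
        if ord(ch) > 127 and name:
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)

print(encode_entities("équipe"))
```

The encoded term is then what you bind into the WHERE or full-text clause, leaving the stored data untouched.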