Search in a field with html entities - html

Our customer's data (SQL Server 2005) has html entities in it (é -> &eacute;).
We need to search inside those fields, so a search for "équipe" will find "&eacute;quipe".
We can't change the data, because our customer's customers can edit those fields at will (with an HTML editor), so if we remove the entities, they might reappear on the next edit, and the problem would still be there.
We can't use a .net server-side function, because we need to find the rows before they are returned to the server.
I could use a function that replaces the entities with their UTF-8 counterparts, but that's tedious, and I think it would seriously hurt search performance (something about a full table scan, if I recall correctly).
Any ideas?
Thanks

You would only need to examine and encode the incoming search term.
If you convert "équipe" to "&eacute;quipe" and use that in your WHERE/FTS clause, then any index on that field can still be used, if the optimizer deems it appropriate.
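A minimal sketch of encoding the incoming search term, written in Python for illustration. It assumes the HTML editor stores named entities (such as &eacute;) rather than numeric ones (such as &#233;), and the function name encode_search_term is hypothetical. The encoded term would then be passed to the query as an ordinary parameter.

```python
from html.entities import codepoint2name


def encode_search_term(term: str) -> str:
    """Replace non-ASCII characters with named HTML entities where one exists."""
    parts = []
    for ch in term:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            parts.append(f"&{name};")
        else:
            parts.append(ch)  # ASCII, or no named entity: leave unchanged
    return "".join(parts)


print(encode_search_term("équipe"))  # &eacute;quipe
```

Only the search term is transformed, so the stored column is left untouched and stays indexable.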

Related

How to Save and search content (made with wysiwyg html editor) in mysql database?

I want to use a wysiwyg html editor (like this) and save it to my mysql database.
What is the best way to store the content (I guess all the html code in a Text type field)?
The content should then be searchable (like on blogs, Taringa, Stack Overflow...).
If you store html code in the database, how can you write the query so that it only searches the text content and not the html tags?
Note: I have a Laravel 4 project (preferably using Eloquent).
So now you're getting into search engine type of searching. You can go for DB simplicity or performance based searching. This answer will assume you have space to spare and you're not trying to condense as much space as possible.
DB Simplicity:
For this method you can simply store the (sanitized) text in the DB, and when you get it back out you can print it with no further sanitation {{{ $txt }}}. As for searching, you just run a search over the entire column for whatever you're looking for, with LIKE "%query%". You'll need to look into raw queries, as that lets you optimize it a bit.
Performance:
On insert, store two versions of the text: one for printing and one for searching. Since you don't care about tags, strip them all out of the search version (a rough regex would delete anything between angle brackets, along with the brackets themselves: replace("<[^>]*>","")).
You might as well also remove punctuation, since it can interfere with searching (your, you're). Once the text is sanitized for searching, keep the printable column for display and run your searches against the search column. It will still be slow, since you're scanning the whole text, but faster than having to deal with tags as well.
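As a rough illustration of building that search version, here is a Python sketch. The function name make_search_text and the idea of storing its result in a dedicated search column are assumptions made for the example; it also does not decode HTML entities, which would need an extra step.

```python
import re


def make_search_text(html_text: str) -> str:
    """Build the plain, lowercase text stored in the (assumed) search column."""
    text = re.sub(r"<[^>]*>", " ", html_text)  # drop tags, keep a word gap
    text = re.sub(r"[^\w\s]", "", text.lower())  # drop punctuation
    return " ".join(text.split())  # collapse runs of whitespace


print(make_search_text("<p>You're <b>here</b>.</p>"))  # youre here
```

The same function should be applied to the user's query, so that both sides of the comparison are normalized the same way.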
Another strategy is to add yet another column holding the unique words; these are usually called tags. In this case you run the search text from the previous step through another filter that drops the common words that carry little meaning (the, or, by, it, is). You are now left with a list of words semi-related to the article, possibly with duplicates; merge the duplicates into counts and store them, ordered from greatest to least, in that other column.
You now have multiple granularities of search, depending on what your goal is. You can also enhance this with fuzzy searching, which does increase your search time, and you'll probably have to create specific index tables to reduce the time spent searching.
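The unique-word step could look roughly like this in Python. The stop-word list is a tiny illustrative sample and keyword_counts is a hypothetical helper name; the input is assumed to be text that has already been stripped of tags and punctuation.

```python
from collections import Counter

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "or", "by", "it", "is", "a", "and"}


def keyword_counts(search_text: str) -> list[tuple[str, int]]:
    """Count the non-stop words and order them from most to least frequent."""
    words = [w for w in search_text.split() if w not in STOP_WORDS]
    return Counter(words).most_common()


print(keyword_counts("the cat is by the hat and the cat"))
# [('cat', 2), ('hat', 1)]
```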

To keep periods in acronyms or not in a database?

Acronyms are a pain in my database, especially when doing a search. I haven't decided if I should accept periods during search queries. These are the problems I face when searching:
'IRQ' will not find 'I.R.Q.'
'I.R.Q' will not find 'IRQ'
'IRQ.' or 'IR.Q' will not find 'IRQ' or 'I.R.Q.'
etc...
The same problem goes for ellipses (...) or three series of periods.
I just need to know which direction I should take with this issue:
Is it better to remove all periods when inserting the string to the database?
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
My responses for each question:
Is it better to remove all periods when inserting the string to the database?
Yes and no. You want the database to have the original text. If you want, create a separate field that is "cleaned up" to search against. Here, you can remove periods, make everything lowercase, etc.
If so what regex can I use to identify periods (instead of ellipses or three series of periods) to identify what needs to be removed?
/\.+/
That finds one or more periods in a given spot. But you'll want to integrate it with your search formula.
Note: regex on a database isn't known to have high performance. Be cautious with this.
Other note: you may want to use FullText search in MySQL. This also isn't known to perform well on data sets with more than 1000 entries. If you have big data and need fulltext search, use Sphinx (available as a MySQL plug-in and RAM-based indexing system).
If it is possible to keep the periods in acronyms, how can it be scripted in a query to find 'I.R.Q' if I input 'IRQ' in the search field, through MySQL using regex or maybe a MySQL function I don't know about?
Yes, by having the 2 fields I described in the first bullet's answer.
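To make the two-field idea concrete, here is a small Python sketch. It assumes the cleaned-up field is produced by removing periods and lowercasing, and search_key is a hypothetical name for that normalization; the same function is applied to the user's query before comparing.

```python
def search_key(text: str) -> str:
    """Lowercase the text and drop periods, for the (assumed) search field."""
    return text.replace(".", "").lower()


stored = ["I.R.Q.", "IRQ", "N.A.S.A."]
query = "IR.Q"
matches = [s for s in stored if search_key(s) == search_key(query)]
print(matches)  # ['I.R.Q.', 'IRQ']
```

In the database, the value of search_key would be stored next to the original text, and the query would compare against that field only.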
You need to consider the sanctity of your input. If it is not yours to alter then don't alter it. Instead you should have a separate system to allow for text searching, and that can alter the text as it sees fit to be able to handle these types of issues.
Have a read up on Lucene, and specifically Lucene's standard analyzer, to see the types of changes that are commonly carried out to allow successful searching of complex text.
I think you can use MySQL's REGEXP operator to match an acronym:
SELECT col1, col2...coln FROM yourTable WHERE colWithAcronym REGEXP 'I[.]?R[.]?Q[.]?'
If you use PHP you can build your regexp with this simple loop:
// MySQL's REGEXP takes no pattern delimiters, so none are added.
$result = "";
// Allow an optional period after every letter of the acronym.
foreach (str_split($yourAcronym) as $char) {
    $result .= $char . "[.]?";
}
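The same pattern can be tried out quickly outside the database. The following Python sketch builds it and checks it against the variants listed in the question; acronym_pattern is a hypothetical helper, and Python's re module is only a stand-in here, since MySQL uses a different regex engine (this simple pattern is valid in both).

```python
import re


def acronym_pattern(acronym: str) -> str:
    """Allow an optional period after every letter of the acronym."""
    letters = acronym.replace(".", "")
    return "".join(re.escape(ch) + "[.]?" for ch in letters)


pattern = acronym_pattern("IRQ")
print(pattern)  # I[.]?R[.]?Q[.]?
for candidate in ["IRQ", "I.R.Q.", "I.R.Q", "IR.Q"]:
    assert re.fullmatch(pattern, candidate)
```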
The functionality you are looking for is a fulltext search. MySQL supports this for MyISAM tables but, as of the 5.0 manual linked here, not for InnoDB, which only gained FULLTEXT indexes in MySQL 5.6. (http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html)
Alternatively you could go for an external framework that provides that functionality. Lucene is a popular open-source one. (lucene.apache.org)
There are two methods:
1. Save the data with the symbols removed from the text, and match against that.
2. Build a regex, for example:
select * from table where acronym regexp '^[A-Z]+[.]?[A-Z]+[.]?[A-Z]+[.]?$';
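Please note, however, that this pattern is written for acronyms stored in uppercase. MySQL's REGEXP is normally case-insensitive for non-binary strings, so this matters mainly for binary or case-sensitive columns; if you don't want the case to matter there, just change [A-Z] to [A-Za-z].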

Search a list of word in SQL server 2008 without full text search

I have a weird request, but we have to deal with this kind of situation sometimes: I have to implement a word search in a SQL Server 2008 database. It would be a good fit for full text search, but here's the catch: I can't use full text search, because the database is on a server I don't own and the feature is not installed (and probably never will be). So, basically, I want to:
Receive a comma separated list of words (easy!)
Check (with LIKE) whether the record contains each keyword (easy too, but I'm open to any suggestion to improve the performance of that operation)
Get a count of these matches so I can order the results appropriately (???)
Thanks for your help
I would use a script in my favorite language to build a separate table with three columns: word, count, recordID (pointing to a record in the first table). Then all future searches could use that table for a faster search.
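A rough sketch of that word table, using Python with an in-memory SQLite database as a stand-in for SQL Server (the table names, sample rows and the search helper are all made up for the example). The count column is called word_count here simply to avoid any confusion with the COUNT() function.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "CREATE TABLE word_index (word TEXT, word_count INTEGER, record_id INTEGER)"
)

# Hypothetical sample data; the index is rebuilt from each record's text.
docs = {1: "red apple red pear", 2: "green apple", 3: "blue sky"}
for record_id, body in docs.items():
    conn.execute("INSERT INTO records VALUES (?, ?)", (record_id, body))
    for word, count in Counter(body.lower().split()).items():
        conn.execute(
            "INSERT INTO word_index VALUES (?, ?, ?)", (word, count, record_id)
        )


def search(words_csv: str) -> list[tuple[int, int]]:
    """Return (record_id, total_matches) pairs, best match first."""
    words = [w.strip().lower() for w in words_csv.split(",") if w.strip()]
    if not words:
        return []
    placeholders = ", ".join("?" * len(words))
    sql = (
        "SELECT record_id, SUM(word_count) AS hits FROM word_index "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY record_id ORDER BY hits DESC, record_id"
    )
    return conn.execute(sql, words).fetchall()


print(search("red, apple"))  # [(1, 3), (2, 1)]
```

Because the lookup is an equality match on word, an ordinary index on that column can serve it, which a LIKE '%word%' scan cannot use.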

Ignoring HTML entity for ampersand in MySQL Full-Text Search

I have a lot of data that is being entered into records with the HTML entity &amp;. A full-text search for the word "amp" will result in records containing &amp; being shown, which is highly undesirable.
Presumably this is because MySQL ignores the '&' and the ';'. So does anyone know of any way within MySQL to force it to treat special characters as part of the word so that my search for "amp" doesn't include all results with &amp; in them - ideally without some form of subquery or extra WHERE clause?
My solution so far (not yet implemented) is to decode the entities on INSERT and re-encode them when displaying on the web. This would be ok, but adds some overhead to everything that I'd like to avoid if possible. Also it works well for new entries, but I would need to backdate it to nearly 7 million records... which I kinda don't want to have to do if I can help it.
--
I updated my my.cnf file with the following:
ft_stopword_file = /etc/mysql/custom-stopwords
Does there need to be any special permissions on this file?
Your "decode HTML entities on INSERT and encode them on output" is your best bet, that'll take care of things like &quot; as well. You'd probably want to strip out HTML tags too along the way to keep MySQL from finding things in attribute values.
If speed and formatting is an issue then you could stuff the text/plain version in a separate column and put your full text index on that and let everything else use the text/html version. Of course, you'd have to maintain both columns at the same time and your storage requirement would go up; OTOH, this approach would let you add tags, author names, and other extra bits of interesting data to the index without messing up your displayed text.
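A minimal Python sketch of producing the text/plain version for the indexed column. The function name to_plain_text is hypothetical, and the tag stripping is a rough regex rather than a real HTML parser.

```python
import html
import re


def to_plain_text(html_text: str) -> str:
    """Strip tags, then decode entities, for the (assumed) indexed column."""
    text = re.sub(r"<[^>]*>", " ", html_text)  # tags and attribute values go
    text = html.unescape(text)  # &amp; -> &, &quot; -> "
    return " ".join(text.split())  # collapse runs of whitespace


print(to_plain_text('<a href="amp.html">AT&amp;T</a> &quot;news&quot;'))
# AT&T "news"
```

Tags are removed before the entities are decoded, so that a decoded &lt; in the text is not mistaken for the start of a tag.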
In the meantime, did you rebuild your full text index after you added the ft_stopword_file to your config file? AFAIK, the stopwords are applied on the way into the index rather than while the index is consulted.
Perhaps you need to specifically ignore these. Try including -&amp; in your fulltext query. Another option (I am unsure whether it requires a MySQL source code change) is to add amp and &amp; to MySQL's stopword list.
You added it to the stopwords file and it's not working? Sounds like either a bug in MySQL or your stopwords list isn't being used. Have you reviewed this? Quote:
False hits or misses may occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server.
Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case insensitive if the collation is latin1_swedish_ci, whereas lookups are case sensitive if the collation is latin1_general_cs or latin1_bin.
Could any of those possibilities be causing your &amp; stopword entry not to be read?

Order By varbinary column that holds docx files

I'm using MS SQL 2008 server, and I have a column that stores a word document ".docx".
Within the word document is a definition (ie: a term). I need to sort the definitions upon returning a dataset.
So basically...
SELECT * FROM DocumentsTable
ORDER BY DefinitionsColumn ASC;
So my problem is: how can this be accomplished, given that the binary column sorts only on the binary value and not on the Word document's content?
I was wondering if fulltext search/index would work. I already have that working, just not sure if I can use it with ORDER BY.
-Thanking all in advance.
I think you'd need to add another column, and populate this with the term from inside the docx. If it's possible at all to get SQL to read the docx (maybe with a custom .net function?) then it's going to be pretty slow.
Better to populate and maintain another column.
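For what it's worth, pulling the text out of a .docx does not require Word: the file is a zip archive whose main text lives in word/document.xml. The Python sketch below shows one way a script could extract it in order to populate the extra column; docx_text is a hypothetical helper, and it ignores headers, footnotes and other document parts.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data: bytes) -> str:
    """Return the body text of a .docx given the bytes from the binary column."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t run elements.
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The extracted term would be written to a plain text column whenever the document is saved, and the ORDER BY would then use that column instead of the binary one.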
You have a couple of options that may or may not be acceptable:
1. Store the string definition contents of the file in a field alongside the binary file column in the record.
2. Only store the string definition in the record, and build the .docx file at runtime for use within your application.