How to realize a context search based on synonyms? - mysql

Let's say an internet user searches for "trouble with gmail".
How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?
I don't want to manually add these links between different keywords, so the links between "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completely unknown. They should be found in an automated process.
Approach to solve the problem
I could provide a synonyms/thesaurus platform like thesaurus.com, synonym.com, etc., or use a synonyms database/API, and use this user-generated input for my queries on a third website.
But this won't cover all synonyms, as the "gmail" example shows.
Which other options do I have? Maybe something based on the given data and logged search phrases of the past?

You have to think about it independently of the language.
When you show a baby the same thing using two words, he understands that those words are synonyms. He might not have understood perfectly, but he will learn when this is repeated.
You type "problem with gmail".
Two choices:
Your search gives results: you click on one item.
The system identifies that this item was already clicked before when someone searched for "google mail bug". That's a match, and we will call it a "relative search".
Your search gives poor results:
We will search our history for a matching search:
We propose: "Did you mean trouble with yahoo mail? yes/no". You click no; that's a "no match". And we might propose other suggestions, like a list of known "relative searches", or a list of possibly related searches found by playing with both full-text search over our history and Levenshtein distance.
When a term scores high enough to be considered a "synonym", you can treat it as one. The algorithm might be wrong, but in the end it depends on what you really expect.
If I search for "sending a message is difficult with google" and for "gmail issue", no words are synonyms, but the searches are relatively the same. This is more important to me than true synonyms.
And if you really want to extract synonyms, I would do it in a second phase, comparing words inside "relative searches", and I would include a manual check.
I think Google's algorithm uses synonyms mainly to highlight search terms in the results page, not to do an actual search with the relative search terms (except in known situations), as the results for "gmail" and "google mail" are not the same.
But if you identify 10 relative searches for "gmail" which all contain "google mail", that will be a good starting point for guessing they are synonyms.
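As a minimal sketch of that "relative search" idea (the click-log structure, sample data and threshold here are assumptions, not part of the question):

from collections import defaultdict

# Toy click log: (search phrase, clicked item) pairs.
click_log = [
    ("problem with gmail", "article-123"),
    ("google mail bug", "article-123"),
    ("trouble with yahoo mail", "article-999"),
]

clicks_by_phrase = defaultdict(set)
for phrase, item in click_log:
    clicks_by_phrase[phrase].add(item)

def relative_searches(phrase, min_shared=1):
    # Phrases are "relative" when they led to clicks on enough of the same items.
    own = clicks_by_phrase[phrase]
    return [other for other, items in clicks_by_phrase.items()
            if other != phrase and len(own & items) >= min_shared]

print(relative_searches("problem with gmail"))  # -> ['google mail bug']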

This is a bit long for a comment.
What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)
The work-around would be to modify queries before sending them to the database. That is, parse the query into words, then look up all the synonyms for those words, and reconstruct the query. This works better for the natural language searches than the boolean searches (which require more careful reconstruction).
Pseudo-code for getting the final word list with synonyms would be something like:
select @finalwords := concat_ws(' ', group_concat(s.synonyms separator ' '))
from synonyms s
where find_in_set(s.baseword, @words) > 0;
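As a rough illustration of that query-rewriting work-around outside SQL (the synonyms mapping below is a made-up stand-in for the real synonyms table):

# Toy synonym map standing in for the synonyms table.
synonyms = {
    "problem": ["issue", "trouble"],
    "gmail": ["googlemail", "google mail"],
}

def expand_query(query):
    # Parse the query into words, append the synonyms of each word, rebuild the query.
    expanded = []
    for word in query.lower().split():
        expanded.append(word)
        expanded.extend(synonyms.get(word, []))
    return " ".join(expanded)

print(expand_query("problem with gmail"))
# -> "problem issue trouble with gmail googlemail google mail"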

Seems to me that you have two problems on your hands:
Lemmatisation, which breaks words down into their lemma, sometimes called the headword or root word. This is more difficult than Stemming, as it doesn't just chop suffixes off of words, but tries to find a true root, e.g. "are" => "be". This is something that is often done programmatically, although it appears to be a complex task. Here is an online example of text being lemmatized: http://lemmatise.ijs.si/Services
Searching for synonymous lemmas. This is a very complex problem. One approach to this that I have heard of is modifying the lemmatisation engine to return more than one lemma for a given set of words, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programmatically.
So, you may consider a strategy whereby you lemmatise the text to be searched for, then pass each lemma out to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
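A rough sketch of that two-step strategy; the lemmatise() and synonyms_for() helpers below are hypothetical stand-ins for a real lemmatisation engine and synonym source:

def lemmatise(word):
    # Hypothetical: a real implementation would call a lemmatisation engine.
    return {"problems": "problem", "issues": "issue", "are": "be"}.get(word, word)

def synonyms_for(lemma):
    # Hypothetical: a real implementation would query a thesaurus or learned synonym list.
    return {"problem": ["issue", "trouble"]}.get(lemma, [])

def search_terms(query):
    # Lemmatise each word, then expand it with synonymous lemmas.
    terms = set()
    for word in query.lower().split():
        lemma = lemmatise(word)
        terms.add(lemma)
        terms.update(synonyms_for(lemma))
    return terms

print(search_terms("problems with gmail"))
# -> {'problem', 'issue', 'trouble', 'with', 'gmail'}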
I think you have bitten off a very large problem for yourself.

If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.

Seeing your revised question, what about using a public API?
http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

Related

What data format is this?

I was checking a share trading site's AJAX response, and below is what showed up in the Firebug Response tab of the XHR section. Can anyone explain what format this is and how it is parsed?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>
I think what we are dealing with here is some proprietary format, likely an Eldritch SGML Horror of some sort.
Banking in general has all sorts of Eldritch horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by <>, with the parts of each statement separated by = or v=. = seems to indicate a parameter to a control statement identified by a two-letter code (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by applications other than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.
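Purely to illustrate that guess (and subject to the same caveat), a toy parse of a few sample lines might look like this:

# Sketch: parse each <...> statement under the guessed rules above.
sample = [
    "<ST=tat>",
    "<SI=0>",
    "<TB=txtSearch>",
    "<560v=Tata Motors Ltdv=TATMOT>",
]

for line in sample:
    body = line.strip("<>")
    fields = body.split("v=")          # data rows split into id / name / code
    if len(fields) == 1:               # control statements use a plain "="
        key, value = fields[0].split("=", 1)
        print("control:", key, "=", value)
    else:
        print("row:", fields)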
It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.

Banned words checking algo

I am building a text chat system. I want to add the ability to check for banned words/phrases.
The only technique I can think of, and I can't believe it could possibly be the best approach, is to do a FOR loop through all the words and search for matches in the text. This seems like it would be unbelievably slow once lots of words are added.
I'm using AS3, but an answer in most any language would probably be useful.
take care,
lee
Use an AS3 Dictionary or a dict in Python and just check whether each word is in the dict. There is no way I can see to avoid going over all the words.
Consider concatenating all the entries in your Dictionary into a single RegExp, with which you have to parse the text only once. I've done some testing, and it's going to be way faster than replacing word for word.
function censorWithDictionary(dict:Dictionary, text:String):String {
    var reg:String = "";
    for (var key:Object in dict)
    {
        reg += reg == "" ? "" : "|";      // add an "or" between multiple search words
        reg += "\\b" + dict[key] + "\\b"; // match whole words only
    }
    var regExp:RegExp = new RegExp(reg, "gi");
    return text.replace(regExp, "----");
}
I had a similar problem - we run a gaming site and wanted to introduce a chat system which was not manually moderated. We went the "banned word" route and it's working really well.
I just counted them, and we now have a list of (just) 79 banned words, which originated from something I found online and to which we have added words over time as chat messages crept through.
The way we check things is that we concatenate an entire chat message by removing all spaces and non-alpha characters and then search for banned words in what's left.
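A minimal sketch of that normalise-and-scan check (in Python rather than AS3, with made-up words standing in for the real list):

import re

BANNED = {"badword", "worseword"}   # made-up examples

def contains_banned(message):
    # Strip everything that isn't a letter, then look for banned words as substrings.
    squashed = re.sub(r"[^a-z]", "", message.lower())
    return any(word in squashed for word in BANNED)

print(contains_banned("b a d w o r d!"))  # -> True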
The key decisions we made are:
Don't tell people why you rejected their messages
Don't let people post chat until you trust them a bit (on our site they have to have played 3 games)
5 "Bad" messages and we automatically block you
We email out a daily report with all the chat that got through, which we scan.
We allow other users to complain about posted messages - if that happens the message is automatically removed so we can check it later.
Points 1, 3 and 5 hardly ever happen now, and it works wonderfully, even though sometimes messages like
"I wish it was hot!"
are rejected (the clue is the "sh" at the end of "wish" followed by "it"), but even that doesn't happen often.
This is more a comment than an answer, but comments are limited in length and there're big issues here.
I believe you are fundamentally asking the wrong question!
Certainly dictionaries and blacklists would highlight words or phrases that you want to ban, but would that list be acceptable to users of your system? Would there be text that users of your system find offensive but you do not? Who decides?
For example, would people living here have trouble, or indeed people living here? What if you supported this football/soccer team? This person probably never visits the UK.
Then you get into the issue of anagrams and slang. FCUK is a high street brand in the UK (and elsewhere I'm sure). And then there's pr0n (no link!) or NAMBLA.
The real question is - How do I stop people using the system from using language that is generally unacceptable? And that's more a design / social engineering problem than a programming problem. I don't think this site has word / phrase filtering and yet there's nothing here that would cause offense to anyone.
Here's an idea - let your users decide what is acceptable! Use a reputation based system. Allow users to vote up users who behave and vote down users that cause offense (with the option of allowing users to give feedback on the vote to give them a chance to mend their ways) and then have an option to filter out users with low / negative reputations.

should I write more descriptive function names or add comments?

This is a language-agnostic question, but I'm wondering what people prefer in terms of readability and maintainability... My hypothetical situation is that I'm writing a function which, given a sequence, will return a copy with all duplicate elements removed and the order reversed.
/*
*This is an extremely well written function to return a sequence containing
*all the unique elements of OriginalSequence with their order reversed
*/
ReturnSequence SequenceFunction(OriginalSequence)
{...}
OR
UniqueAndReversedSequence MakeSequenceUniqueAndReversed(OriginalSequence)
{....}
The above is supposed to be a lucid example of using comments in the first instance or using very verbose function names in the second to describe the actions of the function.
Cheers,
Richard
I prefer the verbose function name as it makes the call-site more readable. Of course, some function names (like your example) can get really long.
Perhaps a better name for your example function would be ReverseAndDedupe. Uh oh, now it is a little more clear that we have a function with two responsibilities*. Perhaps it would be even better to split this out into two functions: Reverse and Dedupe.
Now the call-site becomes even more readable:
Reverse(Dedupe(someSequence))
*Note: My rule of thumb is that any function that contains "and" in the name has too many responsibilities and needs to be split up in to separate functions.
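As a small illustration of that split, a sketch (the function names are just for the example):

def dedupe(sequence):
    """Keep the first occurrence of each element, preserving order."""
    seen = set()
    return [x for x in sequence if not (x in seen or seen.add(x))]

def reverse(sequence):
    """Return a reversed copy of the sequence."""
    return list(reversed(sequence))

print(reverse(dedupe([1, 2, 1, 3, 2])))  # -> [3, 2, 1]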
Personally I prefer the second way - it's easy to see from the function name what it does - and because the code inside the function is well written anyway it'll be easy to work out exactly what happens inside it.
The problem I find with comments is they very quickly go out of date - there's no compile time check to ensure your comment is correct!
Also, you don't get access to the comment in the places where the function is actually called.
Very much a subjective question though!
Ideally you would do a combination of the two. Try to keep your method names concise but descriptive enough to get a good idea of what it's going to do. If there is any possibility of lack of clarity in the method name, you should have comments to assist the reader in the logic.
Even with descriptive names you should still be concise. I think what you have in the example is overkill. I would have written
UniqueSequence Reverse(Sequence)
I comment where an explanation is in order that a descriptive name cannot adequately convey. If there's a peculiarity with a library that forced me to do something that appears non-standard, or value in dropping a comment inline, I'll do that; otherwise I rely upon well-named methods and don't comment things a lot - except while I'm writing the code, and those comments are for myself. They typically get removed when it is done.
Generally speaking, function header comments are just more lines to maintain and require the reader to look at both the comment and the code and then decide which is correct if they aren't in correspondence. Obviously the truth is always in the code. The comment may say X but comments don't compile to machine code (typically) so...
Comment when necessary and make a habit of naming things well. That's what I do.
I'd probably do one of these:
Call it ReverseAndDedupe (or DedupeAndReverse, depending which one it is -- I'd expect Dedupe alone to keep the first occurrence and discard later ones, so the two operations do not commute). All functions make some postcondition true, so Make can certainly go in order to shorten a too-long name. Functions don't generally need to be named for the types they operate on, and if they are then it should be in a consistent format. So Sequence can probably be removed from your proposed name too, or if it can't then I'd probably call it Sequence_ReverseAndDedupe.
Not create this function at all, make sure that callers can either do Reverse(Dedupe(x)) or Dedupe(Reverse(x)), depending which they actually want. It's no more code for them to write, so only an issue of whether there's some cunning optimization that only applies when you do both at once. Avoiding an intermediate copy might qualify there, but the general point is that if you can't name your function concisely, make sure there's a good reason why it's doing so many different things.
Call it ReversedAndDeduped if it returns a copy of the original sequence - this is a trick I picked up from Python, where l.sort() sorts the list l in place, and sorted(l) doesn't modify a list l at all.
Give it a name specific to the domain it's used in, rather than trying to make it so generic. Why am I deduping and reversing this list? There might be some term of art that means a list in that state, or some function which can only be performed on such a list. So I could call it 'Renuberate' (because a reversed, deduped list is known as a list "in Renuberated form", or 'MakeFrobbable' (because Frobbing requires this format).
I'd also comment it (or much better, document it), to explain what type of deduping it guarantees (if any - perhaps the implementation is left free to remove whichever dupes it likes so long as it gets them all).
I wouldn't comment it "extremely well written", although I might comment "highly optimized" to mean "this code is really hard to work with, but goes like the clappers, please don't touch it without running all the performance tests".
I don't think I'd want to go as far as 5-word function names, although I expect I have in the past.

Which is a better long term URL design?

I like Stack Overflow's URLs - specifically the forms:
/questions/{Id}/{Title}
/users/{Id}/{Name}
It's great because as the title of the question changes, the search engines will key in to the new URL but all the old URLs will still work.
Jeff mentioned in one of the podcasts - at the time the Flair feature was being announced - that he regretted some design decisions he had made when it came to these forms. Specifically, he was troubled by his pseudo-verbs, as in:
/users/edit/{Id}
/posts/{Id}/edit
It was a bit unclear which of these verb forms he ended up preferring.
Which pattern do you prefer (1 or 2) and why?
I prefer pattern 2 for the simple reason that the URL reads better. Compare:
"I want to access the USERS EDIT resource, for this ID" versus
"I want to access the POSTS resource, with this ID and EDIT it"
If you forget the last part of each URL, then in the second URL you have a nice recovery plan.
Hi, get /users/edit... what? what do you want to edit? Error!
Hi, get /posts/id... oh you want the post with this ID hmm? Cool.
My 2 pennies!
My guess would be he preferred #2.
If you put the string first it means it always has to be there. Otherwise you get ugly-looking URLs like:
/users//4534905
No matter what, you need the id of the user, so this
/user/4534905/
ends up looking better. If you want fakie verbs you can add them to the end.
/user/4534905/edit
Neither. Putting a non-English numeric ID in the URL is hardly search-engine friendly. You are best to utilize titles with spaces replaced with dashes and all lowercase. So for me the correct form is:
/question/how-do-i-bake-an-apple-pie
/user/frank-krueger
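A minimal slugging helper along those lines might look like this (a sketch; a real site would also need to handle duplicate titles):

import re

def slugify(title):
    """Lowercase the title and replace runs of non-alphanumeric characters with dashes."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

print(slugify("How do I bake an apple pie?"))  # -> "how-do-i-bake-an-apple-pie"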
I prefer the 2nd option as well.
But I still believe that the resulting URLs are ugly because there's no meaning whatsoever in there.
That's why I tend to split the url creation into two parts:
/posts/42
/posts/42-goodbye-and-thanks-for-all-the-fish
Both URLs refer to the same document and, given the latter, only the id is used in the internal query. Thus I can offer somewhat meaningful URLs and still refrain from bloating my queries.
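One sketch of that lookup (the /posts/ route prefix is assumed): take the leading number as the id and ignore the rest of the slug.

import re

def post_id_from_url(path):
    """Extract the numeric id from URLs like /posts/42 or /posts/42-goodbye-and-thanks."""
    match = re.match(r"^/posts/(\d+)", path)
    return int(match.group(1)) if match else None

print(post_id_from_url("/posts/42-goodbye-and-thanks-for-all-the-fish"))  # -> 42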
I like number 2:
also:
/questions/foo == All questions called "foo"
/questions/{id}/foo == A question called "foo"
/users/aiden == All users called aiden
/users/{id}/aiden == A user called aiden
/users/aiden?a=edit or /users/aiden/edit == Edit the list of users called Aiden?
/users/{id}/edit or /users/{id}?a=edit is better
/rss/users/aiden == An RSS update of users called aiden
/rss/users/{id} == An RSS feed of a user's activity
/rss/users/{id}/aiden == An RSS feed of Aiden's profile changes
I don't mind GET arguments personally and think that /x/y/z should refer to a mutable resource and GET/POST/PUT should act upon it.
My 2p
/question/how-do-i-bake-an-apple-pie
/question/how-do-i-bake-an-apple-pie-2
/question/how-do-i-bake-an-apple-pie-...

Recognize Missing Space

How can I recognize when a user has missed a space when entering a search term? For example, if the user enters "usbcable", I want to search for "usb cable". I'm doing a REGEX search in MySQL to match full words.
I have a table with every term used in a search, so I know that "usb" and "cable" are valid terms. Is there a way to construct a WHERE clause that will give me all the rows where the term matches part of the string?
Something like this:
SELECT st.term
FROM SearchTerms st
WHERE 'usbcable' LIKE '%' + st.term + '%'
Or any other ideas?
Text Segmentation is a part of Natural Language Processing, and is what you're looking for in this specific example. It's used in search engines and spell checkers, so you might have some luck finding example source code by looking at open-source spell checkers and search engines.
Spell checking might be the correct paradigm to consider anyway, as you first need to know whether it's a legitimate word or not before trying to pry it apart.
-Adam
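As a toy sketch of text segmentation, assuming the known terms have already been loaded from the SearchTerms table:

KNOWN_TERMS = {"usb", "cable", "hdmi"}   # e.g. loaded from the SearchTerms table

def segment(text, terms=KNOWN_TERMS):
    """Split `text` into known terms, trying the longest leading term first and
    backtracking if the remainder can't be segmented; returns None on failure."""
    if not text:
        return []
    for end in range(len(text), 0, -1):
        head = text[:end]
        if head in terms:
            rest = segment(text[end:], terms)
            if rest is not None:
                return [head] + rest
    return None

print(segment("usbcable"))  # -> ['usb', 'cable']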
Posted in the comments, but I thought it important to bring up as an answer:
Does that query not work? – Simon Buchan
Followed by:
Well, I should've tested it before I posted. That query does not work, but with a CONCAT it does, like so: WHERE 'usbcable' LIKE Concat('%', st.term, '%'). I think this is the simplest, and most relevant (specific to my site), way to go. – arnie0674
Certainly far easier than text segmentation for this application...
-Adam