Inverted search: Phrases per document - MySQL

I have a database full of phrases (80-100 characters each) and some longish documents (50-100 KB), and I would like a ranked list of phrases for a given document; rather than the usual output of a search engine, a list of documents for a given phrase.
I've used MySQL full-text indexing before, and looked into Lucene, but never used it.
They both seem geared to comparing the short (the search term) with the long (the document).
How would you get the inverse of this?

I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.
Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.
First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.
Then, you go through the document with the same tokenizer, keeping a running list of the last 1, 2, ..., n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hash database.
When you find a match, you do the actual string comparison to see if you found a match.
Here's some code to give you a taste of what I mean, though this example doesn't actually do the string comparison:
// params, Utils and params.lexicon are the poster's own helpers: params holds
// minPhrase/maxPhrase, Utils.longHash hashes a string to a long, and
// lexicon.isTermHash does the binary search against the phrase-hash database.
Map<Long, Integer> count = new HashMap<Long, Integer>(); // phrase hash -> document count (declaration assumed from usage below)
HashSet<Long> foundHashes = new HashSet<Long>();
LinkedList<String> words = new LinkedList<String>();
for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
Token t = new Token();
while (st.next(t) != null) {
    String token = new String(t.termBuffer(), 0, t.termLength());
    // keep a sliding window of the last maxPhrase tokens
    words.addLast(token);
    words.removeFirst();
    // hash every phrase length from minPhrase upward ending at this token
    for (int len = params.minPhrase; len < params.maxPhrase; len++) {
        String term = Utils.join(new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");
        long hash = Utils.longHash(term);
        if (params.lexicon.isTermHash(hash)) {
            foundHashes.add(hash);
        }
    }
}
// tally one occurrence per phrase found in this document
for (long hash : foundHashes) {
    if (count.containsKey(hash)) {
        count.put(hash, count.get(hash) + 1);
    } else {
        count.put(hash, 1);
    }
}

Would it be too slow to turn each phrase into a regex and run each one on the document, counting the number of occurrences?
If that doesn't work, maybe you can combine all the phrases into one huge regex (using |), and compile it. Then, run that huge regex starting from every character in the document. Count the number of matches as you go through the characters.
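If you go the combined-regex route, a minimal sketch in Java might look like this (assuming the phrase list fits comfortably in one compiled pattern; phrases and docText are placeholder names):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseCounter {
    public static Map<String, Integer> countPhrases(List<String> phrases, String docText) {
        // Quote each phrase so regex metacharacters are matched literally, then join with "|"
        List<String> quoted = new ArrayList<String>();
        for (String p : phrases) quoted.add(Pattern.quote(p));
        Pattern combined = Pattern.compile(String.join("|", quoted), Pattern.CASE_INSENSITIVE);

        Map<String, Integer> counts = new HashMap<String, Integer>();
        Matcher m = combined.matcher(docText);
        while (m.find()) {
            String hit = m.group().toLowerCase();
            counts.merge(hit, 1, Integer::sum);   // count occurrences per matched phrase
        }
        return counts;
    }
}

With tens of thousands of phrases the alternation may get slow, though, so the hash-based sliding window above is likely to scale better.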

How large is the database of phrases? I am assuming that it is very large.
I would do the following:
Index the phrases by one of the words in each of them; you might choose the least common word in the phrase. You can make the lookup more selective by assuming that the word is at least, say, 5 characters long, and padding it to 5 characters if it is shorter. The padding can be the space after the word followed by the start of the subsequent word (to reduce false matches), or some default character (e.g. "XX") if the word occurs at the end of the phrase.
Go through your document, converting each word (common ones can be discarded) to a key, padding if necessary.
Retrieve the relevant phrases by these keywords.
Use an in-memory text search to find the number of occurrences of each of the retrieved phrases.
I am assuming that phrases cannot cross a sentence boundary. In this case, you can read each sentence of the document into a substring in an array, and use the substring function to search through each sentence for each of the phrases and count occurrences, keeping a running sum for each phrase.
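A rough Java sketch of those last two steps, assuming the candidate phrases have already been retrieved via the keyword index (the sentence splitting here is deliberately naive):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseOccurrences {
    // Count how often each candidate phrase occurs, searching sentence by sentence.
    public static Map<String, Integer> count(String document, List<String> candidatePhrases) {
        String[] sentences = document.split("[.!?]+");   // naive sentence boundary
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (String phrase : candidatePhrases) {
            int sum = 0;
            for (String sentence : sentences) {
                int from = 0;
                while ((from = sentence.indexOf(phrase, from)) != -1) {
                    sum++;
                    from += phrase.length();
                }
            }
            if (sum > 0) totals.put(phrase, sum);
        }
        return totals;
    }
}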

Maybe reading Peter Turney on keyphrase extraction will give you some ideas. Overall, his approach has some similarity to what itsadok has suggested.

Related

How can I go about querying a database for non-similar, but almost matching items

How can I go about querying a database for items that not only match a sample exactly, but are also almost similar? Almost as search engines work, but only for a small project, preferably in Java. For example:
String sample = "Sample";
I would like to retrieve all the following whenever I query sample:
String exactMatch = "Sample";
String nonExactMatch = "S amp le";
String nonExactMatch_2 = "ampls";
You need to define what similar means in terms that your database can understand.
Some possibilities include Levenshtein distance, for example.
In your example, sample matches...
..."Sample", if you search without case sensitivity.
..."S amp le", if you remove a set of ignored characters (here space only) from both the query string and the target string. You can store the new value in the database:
ActualValue      SearchFor
John Q. Smith    johnqsmith%
When someone searches for "John Q. Smith, Esq." you can boil it down to johnqsmithesq and run
WHERE 'johnqsmithesq' LIKE SearchFor
"ampls" is more tricky. Why is it that 'ampls' is matched by 'sample'? A common substring? A number of shared letters? Does their order count (i.e. are anagrams valid)? Many approaches are possible, but it is you who must decide. You might use Levenshtein distance, or maybe store a string such as "100020010003..." where every digit encodes the number of letters you have, up to 9 (so 3 C's and 2 B's but no A's would give "023...") and then run the Levenshtein distance between this syndrome and the one from each term in the DB:
ActualValue      Search1        Rhymes   abcdefghij_Contains   anagramOf
John Q. Smith    johnqsmith%    ith      0000000211011...      hhijmnoqst
...and so on.
One approach is to ask yourself: how must I transform both the searched value and the value searched for so that they match? Then proceed and implement that in code.
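Since Levenshtein distance is suggested twice above, here is the standard dynamic-programming implementation for reference (Java; nothing in it is specific to this question):

public class Levenshtein {
    // Classic DP: dist[i][j] = edit distance between the first i chars of a and the first j chars of b.
    public static int distance(String a, String b) {
        int[][] dist = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dist[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dist[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int substitution = dist[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = dist[i - 1][j] + 1;
                int insertion = dist[i][j - 1] + 1;
                dist[i][j] = Math.min(substitution, Math.min(deletion, insertion));
            }
        }
        return dist[a.length()][b.length()];
    }
}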
You can use MATCH ... AGAINST on MyISAM full-text indexed columns.

Perl search lookup efficiency

I have a batch of URLs that I have to check against the database for a match, or rather, to see whether each URL contains a URL from the database.
An example of a url is
http://www.foodandnuts.com/login.html
The database has a table filled with urls
Currently my script creates an array at the start that holds all the URLs in my database:
my @urldb;
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
foreach my $j (keys %$results) {
    push(@urldb, $j);
}
It will then go through the array to see if the searched URL contains a URL from the database:
foreach (@urldb) {
    if ($searchedurl =~ /$_/) {
        # do things here
    }
}
The problem is that this is extremely slow: the array has more than 10,000 URLs, so each searched URL has to go through the whole array. Is there any way to make this faster?
The question can be answered differently depending on which of three kinds of URL match you want:
Exact full matches only (string equality). E.g. if DB url is "google.com", then search string "http://google.com" will NOT match, nor will "google.com/q=a".
In this case, drop using regexps, and either simply do SELECT * FROM urls WHERE url="$search", or do a hash lookup as Andreas' answer details.
Both search URL and URLs in DB are valid URLs (e.g. start with http://) and therefore MUST match starting with beginning of string, but the search URL can contain a DB URL+suffix to match. E.g. if DB URL is "http://google.com", then search strings "http://google.com" AND "http://google.com/q=a" match.
In this case, either do a start-anchored RegEx, or start-anchored "LIKE" DB match - see details in the next part of the answer.
Any substring match. E.g. if DB URL is "google", then any URL containing "google" string matches anywhere.
In this case, either build a word-lookup table, or use even smarter substring-lookup algorithms; or do batched regex matches using "|" to join multiple DB URLs. See details in the last part of the answer.
This part of the answer assumes your URLs in DB can be substrings of search URL but they all start with "http", meaning they always match at the beginning of the string; but are not exact matches.
Solution 1 for start-anchored match (Perl):
Fix your RegExes to be anchored at the beginning: if($searchedurl=~ /^$_/){
Solution 2 for start-anchored match (DB):
Index your URL table by URL field, and do (Sybase syntax)
$query = qq[SELECT * FROM urls WHERE url LIKE "$searchurl\%"];
This will do a very efficient DB search for start-anchored substrings.
NOTE: the tradeoff between doing matches in DB vs Perl is:
If you have 1 DB and 100s of clients, you don't want to overload the DB doing string matching. Distribute the CPU load onto clients.
If you only have 1-2 clients, DB may be better as you will transfer less data from disk IO in DB (index on the table will help) and over network.
This part of the answer assumes your URLs in DB can be full substrings of search URL, not necessarily exact or even anchored matches.
Solution 1 for random substring match (Perl):
One purely Perl way you can make this faster is by combining your search strings into batches:
Split off the first N elements from @urldb, in a loop:
my $N = 10;
my $start = 0;
my $end = $N - 1;
while ($start < @urldb) {
    search_with($searchedurl, @urldb[$start..$end]);   # see next bullet
    $start += $N;
    $end += $N;
    $end = $#urldb if $end > $#urldb;
}
For each length-N slice, join the elements with "|" and create a regex:
sub search_with {
    my $searchedurl = shift;
    my $regex_string = join("|", @_);
    if ($searchedurl =~ /($regex_string)/) {
        # Do stuff, $1 will contain what matched.
    }
}
Solution 2 for random substring match (DB):
Another, more algorithmic, way to do it is to build a "word lookup" table (aka index, but I'd rather not use the term index to avoid confusion with database indices).
Split off each URL into words.
In the DB, add a unique ID to URL table
In the DB, add a "word lookup" table mapping (1-to-N) each URL ID to every individual word (1 per row) in that URL
Use the "word lookup" table to narrow down the list of URLs to query out.
You can use a database index on "word lookup" table to make that search VERY fast.
You will of course need to split search URL into words as well.
Further speed up/narrow down by separately indexing domain name words from paths.
NOTE: using a simple "WHERE" clause in-database to search your URL table is a VERY bad idea if the URLs can be substrings that don't match on the first character - this way, you can't use an index and will simply scan the table.
NOTE2: For even more efficient substring matching against arrays of strings, there are more advanced algorithms based on graphs of substrings.
NOTE3: Tradeoff between doing matching in Perl and DB is same as in the first half of the answer.
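The word-lookup idea can also be illustrated in memory (Java here purely for illustration; in the answer above the lookup table lives in the database and is narrowed down with an indexed query):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WordLookup {
    private final List<String> urls = new ArrayList<String>();
    private final Map<String, Set<Integer>> wordToUrlIds = new HashMap<String, Set<Integer>>();

    // Split a URL into "words" on anything that is not a letter or digit.
    private static String[] words(String url) {
        return url.toLowerCase().split("[^a-z0-9]+");
    }

    public void add(String url) {
        int id = urls.size();
        urls.add(url);
        for (String w : words(url)) {
            if (w.isEmpty()) continue;
            wordToUrlIds.computeIfAbsent(w, k -> new HashSet<Integer>()).add(id);
        }
    }

    // Candidate DB URLs are those sharing at least one word with the searched URL;
    // only these few candidates need the expensive substring check.
    // Case normalization is glossed over here.
    public List<String> matches(String searchedUrl) {
        Set<Integer> candidates = new HashSet<Integer>();
        for (String w : words(searchedUrl)) {
            Set<Integer> ids = wordToUrlIds.get(w);
            if (ids != null) candidates.addAll(ids);
        }
        List<String> result = new ArrayList<String>();
        for (int id : candidates) {
            if (searchedUrl.contains(urls.get(id))) result.add(urls.get(id));
        }
        return result;
    }
}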
@DVK is right about the fact that it is usually more efficient if you can anchor the match at the beginning. That way you can use a standard btree index to search against (MySQL doesn't have PostgreSQL's richer range of index types, AFAIK).
I'd disagree with him/her about where to do the matching. It almost always makes sense to do this in the database itself. That's what a database is for.
The most efficient way is probably something like:
Create a TEMPORARY TABLE to hold your target urls
Bulk insert your targets to that temporary table
Create an index on them (assuming indexes will help here)
Join from your main url table to your targets using a LIKE match.
Even if you can't use indexes, the database should really be quicker than your Perl. You're reading the entire table, packaging up the raw data into the transport protocol, transferring it, parsing that into Perl values, assembling a hash and then checking it. Assuming your list of target URLs is much smaller than the full list in the database, you'll win just by not transferring so much data.
Note: OP asked for a solution where the search string should contain the URL. After getting comments on this, I've changed my solution to normalize the URLs so that hash matches are exact lookups.
This code is not tested; it should serve as a form of pseudocode that might work.
Create a hash instead of an array; hashes are much better suited for lookups.
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
my %urldb = map { normalize($_) => 1 } keys %$results;
sub normalize {
    my $url = shift;
    $url =~ s|http://||;   # strip away http:// if present
    $url =~ s|www\.||;     # strip away www. if present
    $url =~ s|/.*||;       # strip away anything after and including /
    return $url;
}
Then you would search with
if (exists($urldb{normalize($searchedurl)})) {
    # do things here
}

Synonym dictionary implementation?

How should I approach this problem? I basically need to implement a dictionary of synonyms. It takes as input some "word/synonym" pairs and I have to be able to "query" it for the list of all synonyms of a word.
For example:
Dictionary myDic;
myDic.Add("car", "automobile");
myDic.Add("car", "autovehicle");
myDic.Add("car", "vehicle");
myDic.Add("bike", "vehicle");
myDic.ListOSyns("car") // should return {"automobile", "autovehicle", "vehicle"}, possibly plus "car" itself
// but "bike" should NOT be among the words returned
I'll code this in C++, but I'm interested in an overall idea of the implementation, so the question is not exactly language-specific.
PS: The main idea is to have some groups of words (synonyms). In the example above there would be two such groups:
{"automobile","autovehicle","vehicle", "car"}
{"bike", "vehicle"}
"vehicle" belongs to both, "bike" just to the second one, the others just to the first
I would implement it as a Graph + hash table / search tree
Each keyword would be a vertex, and each connection between two keywords would be an edge.
A hash table or a search tree will map each word to its node (and vice versa).
When a query is submitted, you find the node via the hash/tree and do a BFS/DFS to the required depth (meaning you stop after a certain depth).
Complexity: O(E(d) + V(d)) for searching the graph (d = depth; E(d) = number of edges within that depth, likewise V(d)).
O(1) for creating an edge (not including finding the node, whose cost is detailed below).
O(log n) / O(1) for finding a node (for tree / hash table).
O(log n) / O(1) for adding a keyword to the tree / hash table, and O(1) to add a vertex.
P.S. As mentioned in the comments to the question, the designer should keep in mind whether a directed or undirected graph is needed.
hope that helps...
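A minimal sketch of that graph-plus-hash-table idea (Java here just to show the structure; the BFS is cut off at the given depth, so with a depth of 1 "bike" is not reported as a synonym of "car"):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SynonymGraph {
    // Adjacency list: the hash table maps each word to its neighbours (its vertex).
    private final Map<String, Set<String>> adjacency = new HashMap<String, Set<String>>();

    public void add(String word, String synonym) {
        adjacency.computeIfAbsent(word, k -> new HashSet<String>()).add(synonym);
        adjacency.computeIfAbsent(synonym, k -> new HashSet<String>()).add(word); // undirected
    }

    // BFS limited to maxDepth hops from the query word.
    public List<String> listSynonyms(String word, int maxDepth) {
        List<String> result = new ArrayList<String>();
        if (!adjacency.containsKey(word)) return result;
        Set<String> visited = new HashSet<String>();
        visited.add(word);
        Queue<String> frontier = new ArrayDeque<String>();
        frontier.add(word);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Queue<String> next = new ArrayDeque<String>();
            for (String current : frontier) {
                for (String neighbour : adjacency.get(current)) {
                    if (visited.add(neighbour)) {
                        result.add(neighbour);
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return result;
    }
}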
With the clarification in the comments to the question, it's relatively simple since you're not storing groups of mutual synonyms, but rather separately defining the acceptable synonyms for each word. The obvious container is either:
std::map<std::string, std::set<std::string> >
or:
std::multimap<std::string, std::string>
if you're not worried about duplicates being inserted, like this:
myDic.Add("car", "automobile");
myDic.Add("car", "auto");
myDic.Add("car", "automobile");
In the case of multimap, use the equal_range member function to extract the synonyms for each word, maybe like this:
struct Dictionary {
    multimap<string, string> innermap;

    vector<string> ListOSyns(const string &key) const {
        typedef multimap<string, string>::const_iterator constit;
        pair<constit, constit> range = innermap.equal_range(key);
        vector<string> retval;
        for (constit it = range.first; it != range.second; ++it)
            retval.push_back(it->second);   // copy only the mapped synonym, not the key/value pair
        retval.push_back(key);
        return retval;
    }
};
Finally, if you prefer a hashtable-like structure to a tree-like structure, then unordered_multimap might be available in your C++ implementation, and basically the same code works.

Parsing and formatting search results

Search:
Scripting+Language Web+Pages Applications
Results:
...scripting language originally...producing dynamic web pages. It has...graphical applications....purpose scripting language that is...d creating web pages as output...
Suppose I want a value that represents the amount of characters to allow as padding on either side of the matched terms, and another value that represents how many matches will be shown in the result (ie, I want to see only the first 5 matches, nothing more).
How exactly would you go about doing this?
This is pretty language-agnostic, but I will be implementing the solution in a PHP environment, so please restrict answers to options that do not require a specific language or framework.
Here's my thought process: create an array from the search words. Determine which search word is found at the lowest index in the article body. Gather that portion of the body into another variable, and then remove that section from the article body. Return to step 1. You might even add a counter to each word, skipping it when the counter reaches 3 or so.
Important:
The solution must match all search terms in a non-linear fashion. Meaning, term one should be found after term two if it actually occurs after term two; likewise, it should be found after term three as well. Term three should be found before terms one and two, if it happens to occur before them.
The solution should allow me to declare "Only allow up to three matches for each term, then terminate the summary."
Extra Credit:
Get the padding-variable to optionally pad words, rather than chars.
My thought process:
Create a results array that supports non-unique name/value pairs (PHP supports this in its standard array object)
Loop through each search term and find its character starting position in the search text
Add an item to the results array that stores this character position it has just found with the actual search term as the key
When you've found all the search terms, sort the array ascending by value (the character position of the search term)
Now, the search results will be in order that they were found in the search text
Loop through the results array and use the specified word padding to get words on each side of the search term while also keeping track of the word count in a separate name/value pair
Pseudocode, or my best attempt at it:
function string GetSearchExcerpt(searchText, searchTerms, wordPadding = 0, searchLimit = 3)
{
    results = new array()
    startIndex = 0
    foreach (searchTerm in searchTerms)
    {
        charIndex = searchText.FindByIndex(searchTerm, startIndex) // finds 1st position of searchTerm starting at startIndex
        results.Add(searchTerm, charIndex)
        startIndex = charIndex + 1
    }
    results = results.SortByValue()

    lastSearchTerm = ""
    searchTermCount = new array()
    outputText = ""
    foreach (searchTerm => charIndex in results)
    {
        searchTermCount[searchTerm]++
        if (searchTermCount[searchTerm] <= searchLimit)
        {
            // WordPadding is a simple function that moves left or right a given number of words
            // starting at a specified character index and returns those words
            outputText += "..." + WordPadding(-wordPadding, charIndex) + "<strong>" + searchTerm + "</strong>" + WordPadding(wordPadding, charIndex)
        }
    }
    return outputText
}
Personally I would convert the search terms into Regular Expressions and then use a Regex Find-Replace to wrap the matches in strong tags for the formatting.
Most likely the RegEx route would be your best bet. So in your example, you would end up getting three separate RegEx values.
Since you want a non-language dependent solution I will not put the actual expressions here as the exact syntax varies by language.
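As a sketch of that regex find-replace step (Java here purely as illustration; the PHP equivalent would be preg_quote plus preg_replace):

import java.util.regex.Pattern;

public class HighlightTerms {
    // Wrap every occurrence of each search term in <strong> tags.
    public static String highlight(String text, String[] searchTerms) {
        for (String term : searchTerms) {
            // Quote the term so it is matched literally, case-insensitively;
            // $0 re-inserts the matched text with its original casing.
            Pattern p = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
            text = p.matcher(text).replaceAll("<strong>$0</strong>");
        }
        return text;
    }
}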

A StringToken Parser which gives Google Search style "Did you mean:" Suggestions

Seeking a method to:
Take whitespace separated tokens in a String; return a suggested Word
ie:
Google Search can take "fonetic wrd nterpreterr",
and atop of the result page it shows "Did you mean: phonetic word interpreter"
A solution in any of the C* languages or Java would be preferred.
Are there any existing Open Libraries which perform such functionality?
Or is there a way to utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.
Using Norvig's code and this text as a training set, I get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
You can use the yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However it's only a web service... (i.e. there are no APIs for other language etc..) but it outputs JSON or XML, so... pretty easy to adapt to any language...
You can also use the Google API's to spell check. There is an ASP implementation here (I'm not to credit for this, though).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word suggestion limit of exactly one; it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
    set the spelling engine's word suggestion limit to 1
}

option 1()
{
    int currentPosition = engine.NextWord(start the search at word 0, querystring);
    if (currentPosition == -1)
        return empty string; // Query is a-ok.
    while (currentPosition != -1)
    {
        queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
        currentPosition = engine.NextWord(currentPosition, querystring);
    }
    return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance".
That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.
But usually this is also coupled with some sort of relevancy information; either by simple popularity (to assume most commonly used close-enough match is most likely correct word), or by contextual likelihood (words that follow preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigram and trigrams (sequences of words seen together). Google has very extensive freely available data sets for these.
For a simple initial solution, though, a dictionary coupled with Levenshtein-based matchers works surprisingly well.
You could plug in Lucene, which has a spell-checking facility based on Levenshtein distance.
Here's an example from the Wiki, where 2 is the distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
void match(trie t, char* w, string s, int budget){
if (budget < 0) return;
if (*w=='\0') print s;
foreach (char c, subtrie t1 in t){
/* try matching or replacing c */
match(t1, w+1, s+c, (*w==c ? budget : budget-1));
/* try deleting c */
match(t1, w, s, budget-1);
}
/* try inserting *w */
match(t, w+1, s + *w, budget-1);
}
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".