Perl search lookup efficiency - mysql

I have a batch of URLs that I have to check against the database for a match, or rather, whether each URL contains one of the URLs in the database.
An example of a url is
http://www.foodandnuts.com/login.html
The database has a table filled with urls
Currently my script creates an array at the start that holds all the URLs in my database:
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
foreach my $j (keys %$results) {
    push(@urldb, $j);
}
It will then go through the array to see if the searched URL contains a URL from the database:
foreach (@urldb) {
    if ($searchedurl =~ /$_/) {
        # do things here
    }
}
The problem is that this is extremely slow: the array has more than 10,000 URLs, so each searched URL has to be checked against every entry. Is there any way to make this faster?

The question can be answered differently depending on which of 3 kinds of URL matching you want:
Exact full matches only (string equality). E.g. if DB url is "google.com", then search string "http://google.com" will NOT match, nor will "google.com/q=a".
In this case, drop using regexps, and either simply do SELECT * FROM urls WHERE url="$search", or do a hash lookup as Andreas' answer details.
Both search URL and URLs in DB are valid URLs (e.g. start with http://) and therefore MUST match starting with beginning of string, but the search URL can contain a DB URL+suffix to match. E.g. if DB URL is "http://google.com", then search strings "http://google.com" AND "http://google.com/q=a" match.
In this case, either do a start-anchored RegEx, or start-anchored "LIKE" DB match - see details in the next part of the answer.
Any substring match. E.g. if DB URL is "google", then any URL containing "google" string matches anywhere.
In this case, either use a word-lookup table, or even smarter substring lookup algorithms; or do batched regex matches using "|" to join multiple DB URLs. See details in the last part of the answer.
This part of the answer assumes your URLs in DB can be substrings of search URL but they all start with "http", meaning they always match at the beginning of the string; but are not exact matches.
Solution 1 for start-anchored match (Perl):
Fix your RegExes to be anchored at the beginning: if ($searchedurl =~ /^$_/) {
Solution 2 for start-anchored match (DB):
Index your URL table by URL field, and do (Sybase syntax)
$query = qq[SELECT * FROM urls WHERE url LIKE "$searchurl\%"];
This will do a very efficient DB search for start-anchored substrings.
NOTE: the tradeoff between doing matches in DB vs Perl is:
If you have 1 DB and 100s of clients, you don't want to overload the DB doing string matching. Distribute the CPU load onto clients.
If you only have 1-2 clients, the DB may be better, as you will transfer less data out of the DB (an index on the table will help with disk IO) and over the network.
This part of the answer assumes your URLs in DB can be full substrings of search URL, not necessarily exact or even anchored matches.
Solution 1 for random substring match (Perl):
One purely Perl way you can make this faster is by combining your search strings into batches:
Split off the first N elements from @urldb, in a loop:
my $N = 10;
my $start = 0;
while ($start < @urldb) {
    my $end = $start + $N - 1;
    $end = $#urldb if $end > $#urldb;
    search_with($searchedurl, @urldb[$start .. $end]); # see next bullet
    $start += $N;
}
For each length-N slice, join the elements with "|" and create a regex:
sub search_with {
    my $searchedurl = shift;
    my $regex_string = join("|", @_);
    if ($searchedurl =~ /($regex_string)/) {
        # Do stuff, $1 will contain what matched.
    }
}
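One caveat not covered above: URLs routinely contain regex metacharacters such as "." and "?", so joining them verbatim can produce false matches. A small, hedged variation of the sub above escapes each URL with quotemeta first:
sub search_with {
    my $searchedurl = shift;
    # quotemeta escapes ".", "?", "+" etc. so each URL matches literally
    my $regex_string = join('|', map { quotemeta } @_);
    if ($searchedurl =~ /($regex_string)/) {
        # Do stuff, $1 will contain the DB URL that matched.
    }
}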
Solution 2 for random substring match (DB):
Another, more algorithmic, way to do it is to build a "word lookup" table (aka index, but I'd rather not use the term index to avoid confusion with database indices).
Split each URL into words.
In the DB, add a unique ID to the URL table.
In the DB, add a "word lookup" table mapping (1-to-N) each URL ID to every individual word (1 per row) in that URL.
Use the "word lookup" table to narrow down the list of URLs to query out (see the sketch after this list).
You can use a database index on the "word lookup" table to make that search VERY fast.
You will of course need to split the search URL into words as well.
Further speed up/narrow down by indexing domain-name words separately from path words.
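A rough sketch of what that schema and narrowing query could look like with DBI against MySQL; the table name url_word, the id column on urltable, and the @search_words array are illustrative assumptions, not part of the original question:
# Hypothetical "word lookup" table; one row per (URL id, word) pair.
$dbh->do(q{
    CREATE TABLE url_word (
        url_id INT NOT NULL,          -- id of the row in urltable
        word   VARCHAR(64) NOT NULL,  -- one word from that URL
        INDEX  (word)                 -- makes the narrowing query fast
    )
});

# Narrow the candidates to URLs sharing at least one word with the search URL.
my $placeholders = join(',', ('?') x @search_words);
my $candidates = $dbh->selectall_arrayref(
    "SELECT DISTINCT u.* FROM urltable u
     JOIN url_word w ON w.url_id = u.id
     WHERE w.word IN ($placeholders)",
    { Slice => {} },
    @search_words,
);
# Only @$candidates (a short list) then needs the full substring/regex check.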
NOTE: using a simple "WHERE" clause in-database to search your URL table is a VERY bad idea if the URLs can be substrings that don't match on the first character - that way, you can't use an index and will simply scan the table.
NOTE2: For even more efficient substring matching against arrays of strings, there are more advanced algorithms based on graphs of substrings.
NOTE3: Tradeoff between doing matching in Perl and DB is same as in the first half of the answer.

@DVK is right about the fact that it is usually more efficient if you can anchor the match at the beginning. That way you can use a standard btree index to search against (MySQL doesn't have PostgreSQL's richer range of index types afaik).
I'd disagree with him/her about where to do the matching. It almost always makes sense to do this in the database itself. That's what a database is for.
The most efficient way is probably something like:
Create a TEMPORARY TABLE to hold your target urls
Bulk insert your targets to that temporary table
Create an index on them (assuming indexes will help here)
Join from your main url table to your targets using a LIKE match.
Even if you can't use indexes, the database should really be quicker than your Perl. You're reading the entire table, packaging up the raw data into the transport protocol, transferring it, parsing that into Perl values, assembling a hash and then checking it. Assuming your list of target URLs is much smaller than the full list in the database, you'll win just by not transferring so much data.
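A rough DBI sketch of the temporary-table idea, assuming MySQL, the urltable/url names from the question, and a prefix-style match (the searched URL begins with the stored URL); the target_urls table and its u column are made up for the example:
# Hypothetical temporary table holding the URLs we want to look up.
$dbh->do(q{CREATE TEMPORARY TABLE target_urls (u VARCHAR(255), INDEX (u))});

my $ins = $dbh->prepare('INSERT INTO target_urls (u) VALUES (?)');
$ins->execute($_) for @searched_urls;

# One row per (searched URL, stored URL) pair where the stored URL is a prefix.
my $matches = $dbh->selectall_arrayref(q{
    SELECT t.u AS searched, u.url
    FROM target_urls t
    JOIN urltable u ON t.u LIKE CONCAT(u.url, '%')
}, { Slice => {} });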

Note: OP asked for a solution where the search string should contain the url. I've changed my solution to try to normalize the URLs so that hash matches are exact lookups, after getting comments about this.
This code is not tested; it should serve as some form of pseudocode that might work.
Create a hash instead of an array. Hash lookups are constant-time, which makes them much better suited for this than scanning an array.
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
my %urldb = map { normalize($_) => 1 } keys %$results;
sub normalize {
    my $url = shift;
    $url =~ s|http://||;  # strip away http:// if present
    $url =~ s|www\.||;    # strip away www. if present
    $url =~ s|/.*||;      # strip away anything after and including /
    return $url;
}
Then you would search with
if (exists($urldb{normalize($searchedurl)})) {
    # do things here
}

How to replace a character by another in a variable

I want to know if there is a way to replace a character with another in a variable. For example, replacing every dot with an underscore in a string variable.
I haven't tried it, but based on the Variables specification, the way I'd try to approach this would be to try to match on the text before and after a dot, and then make new variables based on the matches. Something like:
set "value" "abc.def";
if string :matches "${value}" "*.*" {
set "newvalue" "${1}_${2}
}
This will, of course, only match on a single period because Sieve doesn't include any looping structures. While there's a regex match option, I'm not aware of any regex replacement Sieve extensions.
Another approach to complex mail filtering you can do with Dovecot (if you do need loops and have full access to the mail server) is their Dovecot-specific extensions like vnd.dovecot.pipe which allows the mail administrator to define full programs (written in whatever language one wishes) to process mail on its way through.
Following @BluE's comment, if your use case is to store e-mails in folders per recipient address or something like that, perhaps you don't actually want a generic character-replace function but some way to create mailboxes with dots in their names. In the case of Dovecot, there seems to be a solution: [Dovecot] . (dot) in maildir folder names
https://wiki2.dovecot.org/Plugins/Listescape
Ensure one of the files in /etc/dovecot/conf.d contains this line:
mail_plugins = listescape
Then you can filter mailing lists into separate boxes based on their IDs.
This Sieve script snippet picks the ID from the x-list-id header:
if exists "x-list-id" {
if header :regex "x-list-id" "<([\.#a-z_0-9-]+)" {
set :lower "listname" "${1}";
fileinto :create "mailing_list\\${listname}";
} else {
keep;
}
stop;
}

How to compare condition with multiple values in TCL

I am trying to compare using an if condition:
xorg != "t8405" or "t9405" or "t7805" or "t8605" or "t8705"
I want to check whether xorg is not equal to any of these values on the right side, and then perform Y.
I am trying to figure out whether there is a smarter way to do this comparison, or should I compare xorg with each value one by one?
Regards
I think the in and ni (not in) operators are what you should look at. They test for membership (or non-membership) of a list. In this case:
if {$xorg ni {"t8405" "t9405" "t7805" "t8605" "t8705"}} {
    puts "it wasn't in there!"
}
If you've got a lot of these things and are testing frequently, you're actually better off putting the values into the keys of an array and using info exists:
foreach key {"t8405" "t9405" "t7805" "t8605" "t8705"} {
    set ary($key) 1
}
if {![info exists ary($xorg)]} {
    puts "it wasn't in there!"
}
It takes more setup doing it this way, but it's actually faster per test after that (especially from 8.5 onwards). The speedup is because arrays are internally implemented using fast hash tables; hash lookups are quicker than linear table scans. You can also use dictionaries (approximately dict set instead of set and dict exists instead of info exists) but the speed is similar.
The final option is to use lsearch -sorted if you put that list of things in order, since that switches from linear scanning to binary search. This can also be very quick and has potentially no setup cost (if you store the list sorted in the first place) but it's the option that is least clear in my experience. (The in operator uses a very simplified lsearch internally, but just in linear-scanning mode.)
# Note: I've pre-sorted this list
set items {"t7805" "t8405" "t8605" "t8705" "t9405"}
if {[lsearch -sorted -exact $items $xorg] < 0} {
    puts "it wasn't in there!"
}
I usually use either the membership operators (because they're easy) or info exists if I've got a convenient set of array keys. I often have the latter around in practice...

Rails advanced search query

I have a database with Lab models. I want to be able to search them using multiple different methods.
I chose to use one input field and to separate the query into a words array:
search = search.split(/[^[[:word:]]]+/).map{|val| val.downcase}
I use the Acts-as-taggable gem, so it would be nice to include those tags in the search too:
tag_results = self.tagged_with(search, any: true, wild: true)
For the methods below, it seemed necessary to use:
search = search.map{|val| "%#{val}%"}
Sunspot also seemed a great way to go for full-text search, so:
full_text_search = self.search {fulltext search}
full_text_results = full_text_search.results
I also decided to go with a simple database query searching for a Lab name:
name_results = self.where("LOWER(name) ILIKE ANY ( array[?] )", search)
Lastly I need all of the results in one array so:
result = (tag_results + name_results + full_text_results).uniq
It works perfectly (what I mean is that the result is what I expect), but it returns a plain array and not an ActiveRecord::Relation, so there is no way for me to use methods like .select() or .order() on the results.
I want to ask: is there some better way to implement such a search? I was searching for search engines, but it seems like there is nothing that would fit my idea.
If there is not, is there a way to convert an array into an ActiveRecord::Relation? (SO says there is no way.)
Answering this one:
is there a way to convert an array into ActiveRecord::Relation? (SO
says there is no way)
You can convert an array of ActiveRecord objects into an ActiveRecord::Relation by fetching the ids from the array and querying your AR model for objects with these ids:
Model.where(id: result.map(&:id)) # returns an AR::Relation, as expected.
It is the only way I am aware of.

Inverted search: Phrases per document

I have a database full of phrases (80-100 characters), and some longish documents (50-100Kb), and I would like a ranked list of phrases for a given document; rather than the usual output of a search engine, list of documents for a given phrase.
I've used MySQL full-text indexing before, and looked into Lucene, but never used it.
They both seem geared to compare the short (search term), with the long (document).
How would you get the inverse of this?
I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.
Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.
First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.
Then, you go through the document with the same tokenizer, keeping a running list of the last 1, 2, .., n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hashes database.
When a hash matches, you do the actual string comparison to confirm that it is a real match.
Here's some code to give you a taste of what I mean, though this example doesn't actually do the string comparison:
// Note: params.minPhrase/maxPhrase, params.lexicon and the Utils helpers
// are the answerer's own classes; count is declared here for completeness.
HashSet<Long> foundHashes = new HashSet<Long>();
Map<Long, Integer> count = new HashMap<Long, Integer>();

LinkedList<String> words = new LinkedList<String>();
for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
Token t = new Token();
while (st.next(t) != null) {
    String token = new String(t.termBuffer(), 0, t.termLength());
    words.addLast(token);
    words.removeFirst();
    for (int len = params.minPhrase; len < params.maxPhrase; len++) {
        String term = Utils.join(new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");
        long hash = Utils.longHash(term);
        if (params.lexicon.isTermHash(hash)) {
            foundHashes.add(hash);
        }
    }
}
for (long hash : foundHashes) {
    if (count.containsKey(hash)) {
        count.put(hash, count.get(hash) + 1);
    } else {
        count.put(hash, 1);
    }
}
Would it be too slow to turn each phrase into a regex and run each one on the document, counting the number of occurrences?
If that doesn't work, maybe you can combine all the phrases into one huge regex (using |), and compile it. Then, run that huge regex starting from every character in the document. Count the number of matches as you go through the characters.
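The answer doesn't name a language, but as a rough illustration in Perl (the phrase list and document variables are made up here), the combined-regex idea could look like this; note that the /g match resumes after each hit, so strictly overlapping occurrences are not counted:
# Build one alternation out of all phrases, escaping metacharacters.
my $big_re = join '|', map { quotemeta } @phrases;
$big_re = qr/($big_re)/;

my %count;
while ($document =~ /$big_re/g) {  # walk the document one match at a time
    $count{$1}++;                  # $1 is the phrase that matched
}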
How large is the database of phrases? I am assuming that it is very large.
I would do the following:
Index the phrases by one of the words in each. You might choose the least common word in each phrase. You might make the search better by assuming that the word is at least, e.g., 5 characters long, and padding the word to 5 chars if it is shorter. The padding can be the space after the word, followed by the subsequent word, to reduce matches, or some default character (e.g. "XX") if the word occurs at the end of the phrase.
Go through your document, converting each word (common ones can be discarded) to a key by padding if necessary, retrieving phrases.
Retrieve the relevant phrases by these keywords.
Use an in-memory text search to find the number of occurrences of each of the retrieved phrases.
I am assuming that phrases cannot cross a sentence boundary. In this case, you can read each sentence of the document into a substring in an array, and use the substring function to search through each sentence for each of the phrases and count occurrences, keeping a running sum for each phrase.
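Again no language is specified, so purely as a hedged sketch (in Perl, with an intentionally naive sentence splitter and made-up variable names), the per-sentence counting could look like this:
# Split the document into sentences; this regex is a simplistic assumption.
my @sentences = split /(?<=[.!?])\s+/, $document;

my %count;  # phrase => number of occurrences in this document
for my $sentence (@sentences) {
    for my $phrase (@phrases) {
        my $pos = 0;
        # index() is a plain substring search, so no regex compilation is needed
        while (($pos = index($sentence, $phrase, $pos)) >= 0) {
            $count{$phrase}++;
            $pos += length $phrase;
        }
    }
}

# Rank phrases by how often they occur in the document.
my @ranked = sort { $count{$b} <=> $count{$a} } keys %count;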
Maybe reading Peter Turney on keyphrase extraction will give you some ideas. Overall, his approach has some similarity to what itsadok has suggested.

Parsing and formatting search results

Search:
Scripting+Language Web+Pages Applications
Results:
...scripting language originally...producing dynamic web pages. It has...graphical applications....purpose scripting language that is...d creating web pages as output...
Suppose I want a value that represents the amount of characters to allow as padding on either side of the matched terms, and another value that represents how many matches will be shown in the result (ie, I want to see only the first 5 matches, nothing more).
How exactly would you go about doing this?
This is pretty language-agnostic, but I will be implementing the solution in a PHP environment, so please restrict answers to options that do not require a specific language or framework.
Here's my thought process: create an array from the search words. Determine which search word has the lowest index regarding where it's found in the article-body. Gather that portion of the body into another variable, and then remove that section from the article-body. Return to step 1. You might even add a counter to each word, skipping it when the counter reaches 3 or so.
Important:
The solution must match all search terms in a non-linear fashion. Meaning, term one should be found after term two if it exists after term two. Likewise, it should be found after term 3 as well. Term 3 should be found before term 1 and 2, if it happens to exist before them.
The solution should allow me to declare "Only allow up to three matches for each term, then terminate the summary."
Extra Credit:
Get the padding-variable to optionally pad words, rather than chars.
My thought process:
Create a results array that supports non-unique name/value pairs (PHP supports this in its standard array object)
Loop through each search term and find its character starting position in the search text
Add an item to the results array that stores this character position it has just found with the actual search term as the key
When you've found all the search terms, sort the array ascending by value (the character position of the search term)
Now, the search results will be in order that they were found in the search text
Loop through the results array and use the specified word padding to get words on each side of the search term while also keeping track of the word count in a separate name/value pair
Pseudocode, or my best attempt at it:
function string GetSearchExcerpt(searchText, searchTerms, wordPadding = 0, searchLimit = 3)
{
    results = new array()
    startIndex = 0
    foreach (searchTerm in searchTerms)
    {
        charIndex = searchText.FindByIndex(searchTerm, startIndex) // finds 1st position of searchTerm starting at startIndex
        results.Add(searchTerm, charIndex)
        startIndex = charIndex + 1
    }
    results = results.SortByValue()

    searchTermCount = new array()
    outputText = ""
    foreach (searchTerm => charIndex in results)
    {
        searchTermCount[searchTerm]++
        if (searchTermCount[searchTerm] <= searchLimit)
        {
            // WordPadding is a simple function that moves left or right a given number of words
            // starting at a specified character index and returns those words
            outputText += "..." + WordPadding(-wordPadding, charIndex) + "<strong>" + searchTerm + "</strong>" + WordPadding(wordPadding, charIndex)
        }
    }
    return outputText
}
Personally I would convert the search terms into Regular Expressions and then use a Regex Find-Replace to wrap the matches in strong tags for the formatting.
Most likely the RegEx route would be your best bet. So in your example, you would end up getting three separate RegEx values.
Since you want a non-language dependent solution I will not put the actual expressions here as the exact syntax varies by language.