Parsing and formatting search results - language-agnostic

Search:
Scripting+Language Web+Pages Applications
Results:
...scripting language originally...producing dynamic web pages. It has...graphical applications....purpose scripting language that is...d creating web pages as output...
Suppose I want a value that represents the number of characters to allow as padding on either side of the matched terms, and another value that caps how many matches are shown in the result (i.e., I want to see only the first 5 matches, nothing more).
How exactly would you go about doing this?
This is pretty language-agnostic, but I will be implementing the solution in a PHP environment, so please restrict answers to options that do not require a specific language or framework.
Here's my thought process: create an array from the search words. Determine which search word appears earliest in the article body. Copy that portion of the body into another variable, and then remove that section from the article body. Return to step 1. You might even add a counter for each word, skipping it once its counter reaches 3 or so.
Important:
The solution must match all search terms in document order, not input order. That is, term one should be matched after term two if it occurs after term two in the text, and after term three likewise; term three should be matched before terms one and two if it happens to occur before them.
The solution should allow me to declare "Only allow up to three matches for each term, then terminate the summary."
Extra Credit:
Get the padding variable to optionally pad by words rather than by characters.

My thought process:
Create a results array that supports non-unique name/value pairs (in PHP, where array keys must be unique, this can be a list of term/position pairs)
Loop through each search term and find its character starting position in the search text
Add an item to the results array that stores this character position it has just found with the actual search term as the key
When you've found all the search terms, sort the array ascending by value (the character position of the search term)
Now the search results will be in the order in which they were found in the search text
Loop through the results array and use the specified word padding to get words on each side of the search term while also keeping track of the word count in a separate name/value pair
Pseudocode, or my best attempt at it:
function string GetSearchExcerpt(searchText, searchTerms, wordPadding = 0, searchLimit = 3)
{
    results = new array()
    startIndex = 0
    foreach (searchTerm in searchTerms)
    {
        charIndex = searchText.FindByIndex(searchTerm, startIndex) // finds 1st position of searchTerm starting at startIndex
        results.Add(searchTerm, charIndex)
        startIndex = charIndex + 1
    }

    results = results.SortByValue()

    searchTermCount = new array()
    outputText = ""
    foreach (searchTerm => charIndex in results)
    {
        searchTermCount[searchTerm]++
        if (searchTermCount[searchTerm] <= searchLimit)
        {
            // WordPadding is a simple function that moves left or right a given number
            // of words starting at a specified character index and returns those words
            outputText += "..." + WordPadding(-wordPadding, charIndex) + "<strong>" + searchTerm + "</strong>" + WordPadding(wordPadding, charIndex)
        }
    }

    return outputText
}
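The WordPadding helper is the interesting part if you go for the extra credit (padding by words rather than characters). Here is one way it might look, sketched in Python purely for illustration; the name and signature simply mirror the pseudocode above and are not from any library:

import re

def word_padding(text, num_words, char_index):
    # Negative num_words: return the |num_words| words ending just before char_index.
    # Positive num_words: return the num_words words starting at char_index
    # (for the right-hand side, pass the index just past the end of the matched term).
    if num_words < 0:
        words = re.findall(r'\S+', text[:char_index])
        return ' '.join(words[num_words:])
    words = re.findall(r'\S+', text[char_index:])
    return ' '.join(words[:num_words])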

Personally I would convert the search terms into Regular Expressions and then use a Regex Find-Replace to wrap the matches in strong tags for the formatting.
Most likely the RegEx route would be your best bet. So in your example, you would end up with three separate RegEx values.
Since you want a non-language dependent solution I will not put the actual expressions here as the exact syntax varies by language.
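That said, the find-replace idea is simple enough to sketch in any language. Here is what it might look like in Python (illustrative only; the terms are escaped so they are treated literally, and longer terms are sorted first so overlapping terms prefer the longer match):

import re

def highlight(text, terms):
    # Build one alternation from the escaped terms, longest first.
    pattern = '|'.join(sorted(map(re.escape, terms), key=len, reverse=True))
    return re.sub(pattern, r'<strong>\g<0></strong>', text, flags=re.IGNORECASE)

print(highlight('producing dynamic web pages', ['web pages', 'dynamic']))
# producing <strong>dynamic</strong> <strong>web pages</strong>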

Extract tokens from grammar

I have been working through the Advent of Code problems in Perl 6 this year and was attempting to use a grammar to parse Day 3's input.
Given input in this form: #1 @ 1,3: 4x4 and this grammar that I created:
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
    }
    token digits {
        <digit>+
    }
    token id {
        <digits>
    }
    token coordinates {
        <digits> ',' <digits>
    }
    token dimensions {
        <digits> 'x' <digits>
    }
}
say Claim.parse('#1 @ 1,3: 4x4');
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse. I understand that I can pull them from the resulting Match object of Claim.parse(<input>), but I have to dig down through each grammar production to get the value I need e.g.
say $match<id>.hash<digits>.<digit>;
this seems a little messy, is there a better way?
For the particular challenge you're solving, using a grammar is like using a sledgehammer to crack a nut.
Like @Scimon says, a single regex would be fine. You can keep it nicely readable by laying it out appropriately. You can name the captures and keep them all at the top level:
/ ^
  '#' $<id>=(\d+) ' '
  '@ ' $<x>=(\d+) ',' $<y>=(\d+)
  ': ' $<w>=(\d+) 'x' $<d>=(\d+)
$
/;
say ~$<id x y w d>; # 1 1 3 4 4
(The prefix ~ calls .Str on the value on its right hand side. Called on a Match object it stringifies to the matched strings.)
With that out of the way, your question remains perfectly cromulent as it is, because it's important to know how P6 scales in this regard from simple regexes like the one above to the largest and most complex parsing tasks. So that's what the rest of this answer covers, using your example as the starting point.
Digging less messily
say $match<id>.hash<digits>.<digit>; # [「1」]
this seems a little messy, is there a better way?
Your say includes unnecessary code and output nesting. You could just simplify to something like:
say ~$match<id> # 1
Digging a little deeper less messily
I am interested in extracting the actual tokens that were matched i.e. id, x + y from coordinates, and height + width from the dimensions from the resulting parse.
For matches of multiple tokens you no longer have the luxury of relying on Perl 6 guessing which one you mean. (When there's only one, guess which one it guesses you mean. :))
One way to write your say to get the y coordinate:
say ~$match<coordinates><digits>[1] # 3
If you want to drop the <digits> you can mark which parts of a pattern should be stored in a list of numbered captures. One way to do so is to put parentheses around those parts:
token coordinates { (<digits>) ',' (<digits>) }
Now you've eliminated the need to mention <digits>:
say ~$match<coordinates>[1] # 3
You could also name the new parenthesized captures:
token coordinates { $<x>=(<digits>) ',' $<y>=(<digits>) }
say ~$match<coordinates><y> # 3
Pre-digging
I have to dig down through each grammar production to get the value I need
The above techniques still all dig down into the automatically generated parse tree which by default precisely corresponds to the tree implicit in the grammar's hierarchy of rule calls. The above techniques just make the way you dig into it seem a little shallower.
Another step is to do the digging work as part of the parsing process so that the say is simple.
You could inline some code right into the TOP token to store just the interesting data you've made. Just insert a {...} block in the appropriate spot (for this sort of thing that means the end of the token given that you need the token pattern to have already done its matching work):
my $made;
grammar Claim {
    token TOP {
        '#' <id> \s* '@' \s* <coordinates> ':' \s* <dimensions>
        { $made = ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
    }
...
Now you can write just:
say $made # 1 1 3 4 4
This illustrates that you can just write arbitrary code at any point in any rule -- something that's not possible with most parsing formalisms and their related tools -- and the code can access the parse state as it is at that point.
Pre-digging less messily
Inlining code is quick and dirty. So is using a variable.
The normal thing to do for storing data is to instead use the make function. This hangs data off the match object that's being constructed corresponding to a given rule. It can then be retrieved using the .made method. So instead of $made = you'd have:
{ make ~($<id>, $<coordinates><x y>, $<dimensions><digits>[0,1]) }
And now you can write:
say $match.made # 1 1 3 4 4
That's much tidier. But there's more.
A sparse subtree of a parse tree
.oO ( 🎶 On the first day of an imagined 2019 Perl 6 Christmas Advent calendar 🎶 a StackOverflow title said to me ... )
In the above example I constructed a .made payload for just the TOP node. For larger grammars it's common to form a sparse subtree (a term I coined for this because I couldn't find a standard existing term).
This sparse subtree consists of the .made payload for the TOP that's a data structure referring to .made payloads of lower level rules which in turn refer to lower level rules and so on, skipping uninteresting intermediate rules.
The canonical use case for this is to form an Abstract Syntax Tree after parsing some programming code.
In fact there's an alias for .made, namely .ast:
say $match.ast # 1 1 3 4 4
While this is trivial to use, it's also fully general. P6 uses a P6 grammar to parse P6 code -- and then builds an AST using this mechanism.
Making it all elegant
For maintainability and reusability, you typically should not insert code inline at the end of rules; instead, use Action objects.
In summary
There are a range of general mechanisms that scale from simple to complex scenarios and can be combined as best fits any given use case.
Add parentheses as I explained above, naming the capture that those parentheses zero in on, if that is a nice simplification for digging into the parse tree.
Inline any action you wish to take during parsing of a rule. You get full access to the parse state at that point. This is great for making it easy to extract just the data you want from a parse because you can use the make convenience function. And you can abstract all actions that are to be taken at the end of successfully matching rules out of a grammar, ensuring this is a clean solution code-wise and that a single grammar remains reusable for multiple actions.
One final thing. You may wish to prune the parse tree to omit unnecessary leaf detail (to reduce memory consumption and/or simplify parse tree displays). To do so, write <.foo>, with a dot preceding the rule name, to switch the default automatic capturing off for that rule.
You can refer to each of your named captures directly. So to get the coordinates you can access:
say $match.<coordinates>.<digits>
This will return the Array of digits matches. If you just want the values, the easiest way is probably:
say $match.<coordinates>.<digits>.map(*.Int) or say $match.<coordinates>.<digits>>>.Int or even say $match.<coordinates>.<digits>».Int
to cast them to Ints.
For the id field it's even easier: you can just cast the <id> match to an Int:
say $match.<id>.Int

Capture a value from a repeating group on every iteration (as opposed to just last occurrence)

How does one capture a value recursively with a regex, where the value is part of a group that repeats?
I have a serialized array in a MySQL database.
These are 3 examples of a serialized array
a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}
a:1:{i:0;s:2:"CA";}
a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}
a:1 stands for array:{number of elements}
then in between {} i:0 means element 0, i:1 means element 1 etc.
then the actual value s:2:"CA" means string with length of 2
so I have 2 elements in first array, 1 element in the second and 4 elements in the last
I have this data in mysql database and I DO NOT HAVE an option to parse this with back-end code - this has to be done in mysql (10.0.23-MariaDB-log)
the repeating pattern is inside of the curly braces
the number of repeats is variable (as in 3 examples each has a different number of repeating patterns),
the number of repeating patterns is defined by the number at 3rd position (if that helps)
for the first example it's a:2:
and so there are 2 repeating blocks:
i:0;s:2:"OR";
i:1;s:2:"WA";
I only care to extract the quoted values (OR and WA in the first example)
So I came up with this regex
^a:(?:\d+):\{(?:i:(?:\d+);s:(?:\d+):\"(\w\w)\";)+}$
it captures the values I want all right, but the problem is it only captures the last one in each repeating group
so going back to the example what would be captured is
WA
CA
WA
What I would want is
OR|WA
CA
CA|ID|OR|WA
these are the language specific regex functions available to me:
https://mariadb.com/kb/en/library/regular-expressions-functions/
I don't care which one is used to solve the problem
Ultimately I need this in a sensible form that can be presented to the client, e.g. CA,ID,OR or CA|ID|OR.
My current thought is that perhaps this isn't possible in a one-liner, and I have to write a multi-step function where I:
extract the repeating portion between the curly braces
then somehow iterate over each repeating portion
then use the regex on each
then return the results as one string with separated elements
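(For what it's worth, the "last capture wins" behaviour is easy to reproduce outside the database. A quick Python illustration; the same holds for the PCRE engine MariaDB uses:)

import re

s = 'a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}'
m = re.match(r'^a:\d+:\{(?:i:\d+;s:\d+:"(\w\w)";)+\}$', s)
print(m.group(1))                         # WA - a repeated group keeps only its last capture
print(re.findall(r's:\d+:"(\w\w)";', s))  # ['OR', 'WA'] - finding every match sees every value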
I doubt such a capture is possible. However, this would probably do the job for your specific purpose.
REGEXP_REPLACE(
    REGEXP_REPLACE(
        REGEXP_REPLACE(str1, '^a:\\d+:\{', ''),
        'i:\\d+;s:\\d+:\"(\\w\\w)\";',
        '\\1,'
    ),
    '\,?\}$',
    ''
)
Basically, this works on the input string (or column) str1 like this:
remove the leading a:N:{ part
replace every cell with the captured value plus a comma
remove the last 2 characters, ,}
and voila! You get a string like CA,ID,OR.
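To make the three replacements concrete, here is the same pipeline run outside the database (Python shown only to illustrate what each step does to the string; the patterns mirror the SQL above):

import re

s = 'a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}'
s = re.sub(r'^a:\d+:\{', '', s)                  # 1. remove the leading a:N:{
s = re.sub(r'i:\d+;s:\d+:"(\w\w)";', r'\1,', s)  # 2. replace every cell with "XX,"
s = re.sub(r',?\}$', '', s)                      # 3. remove the trailing ,}
print(s)                                         # CA,ID,OR,WA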
Afternote
It may or may not work well when the original array was empty before serialisation (it depends on how it is serialised).

How to remove string that is no longer needed in MySQL database?

This seemed like it should be very simple to do yet I've not been able to find an answer after weeks of looking.
I'm trying to remove strings that are no longer needed. REGEXP_REPLACE sounds perfect but is not available in MySQL (it only arrived in MySQL 8.0).
In MySQL how would I accomplish changing this:
[quote=ABC;xxxxxx]
to this:
[quote=ABC]
The issues are:
- this can appear anywhere in a text blob
- the xxxxxx can only be numeric but may be 6, 7 or 8 characters long
- not adding/removing any rows, just rewriting the contents of one column on one row at a time.
Thanks.
I don't think you really need REGEX_Replace (though it would make things easier of course).
Assuming that the example you presented is a real reflection of what you have:
Your starting point is the string [quote=<something>;, meaning that you can start by searching for [quote=,
Once you find it, search for ; and after that for ],
Once you have found both, you know what to extract and where to start the next search (if the pattern you mentioned can appear more than once within a single blob).
Did I get you correctly?
EDIT
This approach aims to convert all instances of [quote=ABC;xxxxxx] to [quote=ABC] under the following assumptions:
The pattern can appear any number of times within the input string,
The length of xxxxxx is not fixed,
The resulting string (after removing all the appearances of ;xxxxxx) should replace the value in the table,
Performance is not an issue since either this is going to be a one-time job (through the whole table) or it will run every time on a single string (e.g. before INSERTing a new record).
Some MySQL functions that will be used:
INSTR: Searches within a string for the first appearance of a sub-string and returns the position (offset) where the sub-string was found,
SUBSTR: Returns a substring from a string (several ways to use it),
CONCAT: Concatenates two or more strings.
The guidelines presented here apply for the manipulation of a single INPUT string. If this needs to be used over, say, a whole table, simply get the strings into a CURSOR and loop.
Here are the steps:
Declare five INT local variables to serve as indices and the total input string length, say l_start, l_found, l_total_length, l_temp1 and l_temp2, setting the initial values l_start = 1 and l_total_length = LENGTH(input_string),
Declare a string variable into which you will copy the "cleaned" result, initialising it to '', say l_output_str,
Start an infinite loop (you will set the exit conditions within it; see below),
Exit the loop if l_start >= l_total_length (this is one of the two exit points from the loop),
Find the first location of '[quote=' within the input string starting from l_start and save it in l_found,
If the returned value is 0 (i.e. substring not found), concatenate the current contents of l_output_str with whatever remains of the input string from position l_start (e.g. SET l_output_str = CONCAT(l_output_str, SUBSTR(input_string, l_start));) and exit the loop (the second exit point),
Search the input string for the ; symbol starting from l_found + 7 (i.e. the length of [quote=) and save the position in l_temp1,
Search the input string for the ] symbol starting from l_temp1 and save the position in l_temp2,
Append the text before the match, then the tag without its ;xxxxxx part: SET l_output_str = CONCAT(l_output_str, SUBSTR(input_string, l_start, l_found - l_start), '[quote=', SUBSTR(input_string, l_found + 7, l_temp1 - l_found - 7), ']');,
Set l_start = l_temp2 + 1;,
End of loop.
Notes:
As I neither ran nor tested the code, it is possible that I'm not setting the indices correctly; you will need to perform detailed tests to get it working as needed (the sketch below walks through the same loop in executable form);
The above IS the method I suggested;
If the input string is very long (many MBs), you might observe poor performance (i.e. it might take a few seconds to complete) because of the concatenations. There are some steps that can be taken to improve performance, but let's have this working first and then, if needed, tackle the performance issues.
Hope that the above is clear and comprehensive.
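To make the index arithmetic easier to check, here is the same loop written out in Python (an untested sketch mirroring the INSTR/SUBSTR/CONCAT steps, not MySQL code; the helper name is made up):

def strip_quote_ids(s):
    out = []
    start = 0
    while start < len(s):
        pos = s.find('[quote=', start)
        if pos == -1:                    # no more quote tags: copy the tail and stop
            out.append(s[start:])
            break
        close = s.find(']', pos + 7)     # 7 == len('[quote=')
        if close == -1:                  # malformed tag: copy the rest and stop
            out.append(s[start:])
            break
        semi = s.find(';', pos + 7)
        if semi == -1 or semi > close:   # no ;xxxxxx inside this tag: keep it as-is
            out.append(s[start:close + 1])
        else:                            # keep '[quote=ABC', drop ';xxxxxx', restore ']'
            out.append(s[start:semi])
            out.append(']')
        start = close + 1
    return ''.join(out)

print(strip_quote_ids('text [quote=ABC;1234567] more [quote=XYZ] end'))
# text [quote=ABC] more [quote=XYZ] end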

Perl search lookup efficiency

I have a batch of urls that I have to check against the database for a match - or rather, check whether each url contains a url stored in the database.
An example of a url is
http://www.foodandnuts.com/login.html
The database has a table filled with urls
Currently my script creates an array at the start that holds all the urls in my database
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
foreach my $j (keys %$results) {
    push(@urldb, $j);
}
It will then go through the array to see if the url contains the url from the database
foreach (@urldb) {
    if ($searchedurl =~ /$_/) {
        # do things here
    }
}
The problem is that this is extremely slow: the array has more than 10000 urls, so each searched url has to go through the whole array. Is there any way to make this faster?
The question can be answered differently depending on which of 3 kinds of URL matches you wish:
Exact full matches only (string equality). E.g. if DB url is "google.com", then search string "http://google.com" will NOT match, nor will "google.com/q=a".
In this case, drop using regexps, and either simply do SELECT * FROM urls WHERE url="$search", or do a hash lookup as Andreas' answer details.
Both search URL and URLs in DB are valid URLs (e.g. start with http://) and therefore MUST match starting with beginning of string, but the search URL can contain a DB URL+suffix to match. E.g. if DB URL is "http://google.com", then search strings "http://google.com" AND "http://google.com/q=a" match.
In this case, either do a start-anchored RegEx, or start-anchored "LIKE" DB match - see details in the next part of the answer.
Any substring match. E.g. if DB URL is "google", then any URL containing "google" string matches anywhere.
In this case, either build a word-lookup table, or use even smarter substring-lookup algorithms; or do batched regex matches using "|" to join multiple DB urls. See details in the last part of the answer.
This part of the answer assumes the URLs in your DB can be substrings of the search URL, but all start with "http", meaning they always match at the beginning of the string without being exact matches.
Solution 1 for start-anchored match (Perl):
Fix your RegExes to be anchored at the beginning: if($searchedurl=~ /^$_/){
Solution 2 for start-anchored match (DB):
Index your URL table by URL field, and do (Sybase syntax)
$query = qq[SELECT * FROM urls WHERE url LIKE "$searchurl\%"];
This will do a very efficient DB search for start-anchored substrings.
NOTE: the tradeoff between doing matches in DB vs Perl is:
If you have 1 DB and 100s of clients, you don't want to overload the DB doing string matching. Distribute the CPU load onto clients.
If you only have 1-2 clients, DB may be better as you will transfer less data from disk IO in DB (index on the table will help) and over network.
This part of the answer assumes the URLs in your DB can be substrings of the search URL anywhere, not necessarily exact or even anchored matches.
Solution 1 for random substring match (Perl):
One purely Perl way you can make this faster is by combining your search strings into batches:
Split off the first N elements from @urldb, in a loop
my $N = 10;
my $start = 0;
my $end = $N - 1;
while ($start < @urldb) {
    search_with($searchedurl, @urldb[$start..$end]); # see next bullet
    $start += $N;
    $end += $N;
    $end = $#urldb if $end > $#urldb;
}
For each of length-N arrays, join the elements with "|" and create a regex
sub search_with {
    my $searchedurl = shift;
    # consider escaping each URL with quotemeta if they can contain regex metacharacters
    my $regex_string = join("|", @_);
    if ($searchedurl =~ /($regex_string)/) {
        # Do stuff, $1 will contain what matched.
    }
}
Solution 2 for random substring match (DB):
Another, more algorithmic, way to do it is to build a "word lookup" table (aka an index, but I'd rather not use the term index to avoid confusion with database indices). A sketch follows the steps below.
Split off each URL into words.
In the DB, add a unique ID to URL table
In the DB, add a "word lookup" table mapping (1-to-N) each URL ID to every individual word (1 per row) in that URL
Use the "word lookup" table to narrow down the list of URLs to query out.
You can use a database index on "word lookup" table to make that search VERY fast.
You will of course need to split search URL into words as well.
Further speed up/narrow down by separately indexing domain name words from paths.
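An in-memory sketch of the same idea in Python (a hypothetical structure standing in for the two DB tables; the final substring test is still required, because sharing words does not guarantee a substring match):

import re
from collections import defaultdict

word_to_urls = defaultdict(set)   # the "word lookup" table: word -> URLs containing it

def words_of(url):
    return set(re.findall(r'\w+', url.lower()))

def add_url(url):
    for word in words_of(url):
        word_to_urls[word].add(url)

def matching_urls(searched_url):
    candidates = set()
    for word in words_of(searched_url):    # narrow down via the lookup table
        candidates |= word_to_urls.get(word, set())
    # verify each candidate with a real substring test
    return [u for u in candidates if u in searched_url]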
NOTE: using a simple "WHERE" clause in-database to search your URL table is a VERY bad idea if the URLs can be substrings that don't match on the first character - that way, you can't use an index and will simply scan the table.
NOTE2: For even more efficient substring matching against arrays of strings, there are more advanced algorithms based on graphs of substrings.
NOTE3: Tradeoff between doing matching in Perl and DB is same as in the first half of the answer.
@DVK is right about the fact that it is usually more efficient if you can anchor the match at the beginning. That way you can use a standard btree index to search against (MySQL doesn't have PostgreSQL's richer range of index types afaik).
I'd disagree with him/her about where to do the matching. It almost always makes sense to do this in the database itself. That's what a database is for.
The most efficient way is probably something like:
Create a TEMPORARY TABLE to hold your target urls
Bulk insert your targets to that temporary table
Create an index on them (assuming indexes will help here)
Join from your main url table to your targets using a LIKE match.
Even if you can't use indexes, the database should really be quicker than your perl. You're reading the entire table, packaging up the raw data into the transport protocol, transferring it, parsing that into perl values, assembling a hash and then checking it. Assuming your list of target urls is much smaller than the full list in the database you'll win just by not transferring so much data.
Note: OP asked for a solution where the search string should contain the url. I've changed my solution to normalize the urls so that hash matches are exact lookups, after getting comments about this.
This code is not tested; it should serve as a form of pseudocode that might work.
Create a hash instead of an array. Hash lookups take constant time, making a hash much better suited for lookups than scanning an array.
my $results = $dbh->selectall_hashref('SELECT * FROM urltable;', 'url');
my %urldb = map { normalize($_) => 1 } keys %$results;
sub normalize {
    my $url = shift;
    $url =~ s|http://||;  # strip away http:// if present
    $url =~ s|www\.||;    # strip away www. if present
    $url =~ s|/.*||;      # strip away anything after and including /
    return $url;
}
Then you would search with
if (exists($urldb{normalize($searchedurl)})) {
    # do things here
}

Inverted search: Phrases per document

I have a database full of phrases (80-100 characters), and some longish documents (50-100Kb), and I would like a ranked list of phrases for a given document; rather than the usual output of a search engine, list of documents for a given phrase.
I've used MYSQL fulltext indexing before, and looked into lucene, but never used it.
They both seem geared to compare the short (search term) with the long (document).
How would you get the inverse of this?
I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.
Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.
First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.
Then, you go through the document with the same tokenizer, keeping a running list of the last 1, 2, ..., n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hash database.
When you find a match, you do the actual string comparison to see if you found a match.
Here's some code to give you a taste of what I mean, though this example doesn't actually do the string comparison:
HashSet<Long> foundHashes = new HashSet<Long>();
// not declared in the original snippet: assumed to map phrase hash -> occurrence count
HashMap<Long, Integer> count = new HashMap<Long, Integer>();

LinkedList<String> words = new LinkedList<String>();
for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

// StandardTokenizer/Token are Lucene classes; Utils and params are the author's own helpers
StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
Token t = new Token();
while (st.next(t) != null) {
    String token = new String(t.termBuffer(), 0, t.termLength());
    words.addLast(token);
    words.removeFirst();
    for (int len = params.minPhrase; len < params.maxPhrase; len++) {
        String term = Utils.join(new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");
        long hash = Utils.longHash(term);
        if (params.lexicon.isTermHash(hash)) {
            foundHashes.add(hash);
        }
    }
}

for (long hash : foundHashes) {
    if (count.containsKey(hash)) {
        count.put(hash, count.get(hash) + 1);
    } else {
        count.put(hash, 1);
    }
}
Would it be too slow to turn each phrase into a regex and run each one on the document, counting the number of occurrences?
If that doesn't work, maybe you can combine all the phrases into one huge regex (using |), and compile it. Then, run that huge regex starting from every character in the document. Count the number of matches as you go through the characters.
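A sketch of the combined-regex variant in Python (illustrative only; note that finditer counts non-overlapping occurrences scanning left to right, so truly overlapping matches would still need the start-at-every-character scan described above):

import re

def count_phrases(document, phrases):
    # One big alternation of literal phrases; escape them so regex metacharacters stay literal.
    pattern = re.compile('|'.join(map(re.escape, phrases)))
    counts = {}
    for m in pattern.finditer(document):
        counts[m.group(0)] = counts.get(m.group(0), 0) + 1
    return counts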
How large is the database of phrases? I am assuming that it is very large.
I would do the following:
Index the phrases by one of the words in them. You might choose the least common word in each phrase. You might make the search better by assuming that the word is at least, e.g., 5 characters long, and padding the word to 5 chars if it is shorter. The padding can be the space after the word, followed by the subsequent word, to reduce matches, or some default character (e.g. "XX") if the word occurs at the end of the phrase. (A sketch of this approach follows below.)
Go through your document, converting each word (common ones can be discarded) to a key by padding if necessary,
Retrieve the relevant phrases by these keywords.
Use an in-memory text search to find the number of occurrences of each of the retrieved phrases.
I am assuming that phrases cannot cross a sentence boundary. In this case, you can read each sentence of the document into a substring in an array, and use the substring function to search through each sentence for each of the phrases and count occurrences, keeping a running sum for each phrase.
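Here is a rough Python sketch of the whole approach (hypothetical names throughout; word_freq is assumed to be a precomputed corpus frequency table, and the padding refinement is left out for brevity):

import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r'\w+', text.lower())

def index_phrases(phrases, word_freq):
    # Key each phrase under its least common word.
    index = defaultdict(list)
    for phrase in phrases:
        key = min(tokenize(phrase), key=lambda w: word_freq.get(w, 0))
        index[key].append(phrase)
    return index

def rank_phrases(document, index):
    doc = ' ' + ' '.join(tokenize(document)) + ' '   # normalised text with word boundaries
    counts = defaultdict(int)
    for word in set(tokenize(document)):             # each document word is a candidate key
        for phrase in index.get(word, ()):
            # str.count is non-overlapping; adjacent repeats sharing a boundary space
            # may be undercounted, which is usually fine for ranking
            n = doc.count(' ' + ' '.join(tokenize(phrase)) + ' ')
            if n:
                counts[phrase] += n
    return sorted(counts.items(), key=lambda kv: -kv[1])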
Maybe reading Peter Turney on keyphrase extraction will give you some ideas. Overall, his approach has some similarity to what itsadok has suggested.