Solr highlighting not working for long words (JSON)

I am using Solr 6.2 and want to get the matched words using the highlighting option.
When I search for the word "miss" I get the highlights, but I get nothing for the word "missing".
For example, when I search for "miss" I get the results below:
http://localhost:8983/solr/logbook1/select?debugQuery=on&defType=dismax&defType=edismax&hl.fl=*&hl=on&indent=on&q=miss&rows=5&wt=json
highlighting":{
"3246a347-874a-44e2-bb3d-949a358f435d":{
"String1":["IN REFERENCE CABIN LOG PAGE 22838. TWO EXTENSION SEAT BELT <em>MISS</em> ING"]},
"46a340f8-949f-41fe-b2ee-c1936bfc6b4f":{
"String1":["IN REFERENCE CABIN LOG PAGE 22838. TWO EXTENSION SEAT BELT <em>MISS</em> ING"]},
"df6eef1c-971d-48f7-a93a-07874011ae5b":{
"String1":["ACCESS PANEL 343EB ON R/H HORIZONTAL STAB FOUND WITH SCREW <em>MISS</em> ING AND LOOSE"]},
"9a124f6d-f32b-4e24-beb2-11f7aa22894d":{
"String1":["AFT GALLEY # 4 COFFEE MAKER SHIELDS ON COMPT 419 - 420 ARE <em>MISS</em> ING."]}},
When I search for "missing", I get no highlights, as shown below:
http://localhost:8983/solr/logbook1/select?debugQuery=on&defType=dismax&defType=edismax&hl.fl=*&hl=on&indent=on&q=missing&rows=5&wt=json
"highlighting":{
"0d2963a7-adea-40ab-af0a-bb8fe069c4d9":{},
"9f23f4c0-6989-471d-8c61-4016a8e38813":{},
"c77b6be1-547c-43fe-94f0-ae5c0849eab4":{},
"f5792594-7fd2-42b5-92c4-03257c05adba":{},
"68d9251a-74d9-409e-84ec-a67a0eb94866":{}},
I have checked the fragsize option. Please advise whether there is anything else to configure.

1) I assume you already have a lowercase filter on your index field, since your search fetches both upper- and lowercase results.
2) Has an extra space been indexed between "miss" and "ing"? If so, you need to remove it and try again.
3) Check your stopword dictionary to make sure you haven't accidentally added "missing" there, since stopwords are ignored during searching.
4) Try the Analysis screen in the Solr Admin console to see how Solr transforms your search term.

Have you set indexed and stored to true?
To me it looks like there are probably different settings for token handling at index time and at search time. Take a look at your schema.xml and try to work with the same settings for indexing and searching.
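For illustration only, here is a minimal fieldType whose index-time and query-time analyzers match; the field and type names are assumptions, not taken from your setup:

<!-- Hypothetical example: identical tokenizer and filters at index and query time,
     so "missing" is analyzed the same way when indexing and when searching. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="String1" type="text_general" indexed="true" stored="true"/>

If the Analysis screen shows "missing" being split into "miss" and "ing" at index time (which the highlighted fragments "<em>MISS</em> ING" suggest) but not at query time, that mismatch is the likely culprit.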

Related

How to convert/match a handwritten list of names? (HWR)

I would like to see if I can scan a sign-in sheet for a class. The good news is I know 90% of the names that might be written.
My idea was to use Tesseract to parse an image of names and then use the Levenshtein algorithm to compare each line with a list of names in my database; if I get a reasonably close match, then that name is right.
Does this approach sound like a good one? If not, other ideas?
I tried using tesseract on a sample sheet (see below)
I used:
tesseract simple.png -psm 4 outtxt
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
I am assuming it didn't like line 2 because I went below the line.
The results I got were:
1.. AM: (harm;
l. ’E (J 22 a 00k
2‘ wau \\) [HQ
4. KIM TAYLOE
5. LN] Davis
6‘ Mzflé! Ha K
Obviously not the greatest; my guess is the distance matches for 4 and 5 would work, but the rest are not even close.
I have control of my sign-in sheet, but not the handwriting of the folks coming in, so if there are any changes I can make to it that would help, please let me know.
Since your goal is to get names only, I would suggest you reduce tessedit_char_whitelist to alphanumeric characters ("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.") so that you will not get unexpected characters like \\) [ in the output.
Your initial approach of calculating Levenshtein distance is fine, provided you succeed in extracting text from the handwritten image (which is a hard task for Tesseract).
I would also suggest running some preprocessing on your image. For example, you can remove the horizontal lines and extract text ROIs around them. In the best case you will be able to extract separated characters, but even if you don't, you will get better results and will be able to distinguish the resulting names line by line.
You should also try the other recommended output quality improvement steps, which you can find in the Tesseract OCR wiki (link).
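Putting the whitelist and Levenshtein suggestions together, here is a minimal sketch of the matching step in Python; the file names and the acceptance threshold are assumptions:

# Match noisy OCR lines against a known roster using Levenshtein distance.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

roster = [line.strip().upper() for line in open("names.txt")]   # known names
for line in open("outtxt.txt"):                                 # OCR output
    cleaned = "".join(c for c in line.upper() if c.isalnum() or c == " ").strip()
    cleaned = cleaned.lstrip("0123456789 ")   # drop the sheet's line numbering
    if not cleaned:
        continue
    best = min(roster, key=lambda name: levenshtein(cleaned, name))
    # Accept only reasonably close matches; the threshold is a guess to tune.
    if levenshtein(cleaned, best) <= len(best) // 3:
        print(f"{line.strip()!r} -> {best}")

With your sample output, lines 4 and 5 ("KIM TAYLOE", "LN] Davis") should land within easy Levenshtein range of the real names, while line 1 is probably too garbled to match anything.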

How to realize a context search based on synonyms?

Let's say an internet user searches for "trouble with gmail".
How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?
I don't want to add these links between different keywords manually, so the links between "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completely unknown. They should be found in an automated process.
My approach to solving the problem
I could provide a synonym/thesaurus platform like thesaurus.com, synonym.com, etc., or use a synonym database/API, and use this user-generated input for my queries on a third website.
But this won't cover all synonyms, like the "gmail" example.
Which other options do I have? Maybe something based on the given data and the logged search phrases of the past?
You have to think about it while ignoring the language.
When you show a baby the same thing using two words, he understands that those words are synonyms. He might not understand perfectly, but he will learn when this is repeated.
You type "problem with gmail".
Two choices:
Your search give results: you click on one item.
The system identify that this item was already clicked before when searching for "google mail bug". That's a match, and we will call it a "relative search".
Your search give poor results:
We will search in our history for a matching search:
We propose : "do you mean trouble with yahoo mail? yes/no". You click no, that's a "no match". And we might propose others suggestions like a list of known "relative search" or a list of might be related playing with both full text search in our history and levenshtein distance.
When a term is scored highly enough to be considered a "synonym", you can treat it as one. The algorithm might be wrong, but in fact it depends on what you really expect.
If I search for "sending a message is difficult with google" and for "gmail issue", nothing is a synonym, but the searches are relatively the same. This is more important to me than true synonyms.
And if you really want to extract the synonyms, I would do it in a second phase, comparing words inside the "relative searches", and would include a manual check.
I think Google's algorithm uses synonyms mainly to highlight search terms in result pages, but not to do the actual search with the relative search terms, except in known situations; the results for "gmail" and "google mail" are not the same.
But if you identify 10 relative searches for "gmail" which all contain "google mail", that is a good starting point for guessing that they are synonyms.
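A minimal sketch of that click-overlap idea, assuming we simply log which result items each query led to; the names and the scoring rule are illustrative, not a known production algorithm:

from collections import defaultdict

# query text -> ids of the result items users clicked after running that query
clicks = defaultdict(set)

def log_click(query, item_id):
    clicks[query.lower()].add(item_id)

def relatedness(q1, q2):
    """Jaccard overlap of clicked items: 1.0 = identical clicks, 0.0 = nothing shared."""
    a, b = clicks[q1.lower()], clicks[q2.lower()]
    return len(a & b) / len(a | b) if (a or b) else 0.0

log_click("problem with gmail", "item42")
log_click("google mail bug", "item42")    # same item clicked -> a "relative search"
print(relatedness("problem with gmail", "google mail bug"))   # 1.0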
This is a bit long for a comment.
What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)
The work-around would be to modify queries before sending them to the database: parse the query into words, look up all the synonyms for those words, and reconstruct the query. This works better for natural-language searches than for boolean searches (which require more careful reconstruction).
Pseudo-code for getting the final word list with synonyms would be something like:
-- @words holds the comma-separated words parsed from the original query
SELECT @finalwords := GROUP_CONCAT(s.synonyms SEPARATOR ' ')
FROM synonyms s
WHERE FIND_IN_SET(s.baseword, @words) > 0;
Seems to me that you have two problems on your hands:
Lemmatisation, which breaks words down into their lemma, sometimes called the headword or root word. This is more difficult than stemming, as it doesn't just chop suffixes off words but tries to find a true root, e.g. "are" => "be". This is something that is often done programmatically, although it appears to be a complex task. Here is an online example of text being lemmatised: http://lemmatise.ijs.si/Services
Searching for synonymous lemmas. This is a very complex problem. One approach I have heard of is modifying the lemmatisation engine to return more than one lemma for a given set of words, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programmatically.
So you may consider a strategy whereby you lemmatise the text to be searched, then pass each lemma to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
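As an illustration only, here is roughly what that strategy looks like with NLTK's WordNet interface; whether WordNet's synsets are acceptable synonyms for your domain is exactly the open question above:

# Rough sketch; requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def expand(word):
    """Lemmatise a word, then widen it with every WordNet synonym of that lemma."""
    lemma = lemmatizer.lemmatize(word.lower())
    synonyms = {lemma}
    for synset in wordnet.synsets(lemma):
        synonyms.update(l.name().replace("_", " ") for l in synset.lemmas())
    return synonyms

query = "problems with gmail"
expanded = [expand(w) for w in query.split()]
print(expanded[0])   # e.g. {'problem', 'trouble', ...}; but nothing extra for "gmail"

Note that "gmail" comes back unexpanded, which is precisely the gap the question points out: a general-purpose thesaurus cannot know domain synonyms like "gmail" <> "google mail".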
I think you have bitten off a very large problem for yourself.
If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.
Seeing your revised question, what about using a public API?
http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open-source OCR (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only ever be used to search for certain names. Imagine I have a written master guest list and I want to scan it in under a second with the OCR and check the result against a database of names.
I understand that a traditional OCR can attempt to read every letter and then I could just cross-reference the results with the 100 names, but this takes too long. If the OCR were focusing on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database; the OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could be looking for 'Jack' directly, as a single symbol of sorts.
So when you train/calibrate your OCR, train it with images of whole words, similar to how you would for an individual symbol.
(If this feature is not directly available in your OCR, then first map images of whole words to a unique symbol, and later transform that symbol into the final word string.)
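If you want to prototype the whole-word-as-symbol idea without training an OCR engine, plain template matching shows the principle. This is only a sketch: it assumes one pre-rendered or pre-cropped template image per name, and it would only cope with printed or very consistent writing, which is exactly why real handwriting needs the training step above:

# Prototype of "whole word as one symbol": template-match each known name image.
# Requires: pip install opencv-python
import cv2

names = ["JACK", "KIM", "LIZ"]                        # the small known dictionary
sheet = cv2.imread("sheet.png", cv2.IMREAD_GRAYSCALE)

for name in names:
    # One template image per name; these files are assumed to exist,
    # rendered or cropped beforehand (the "training" step described above).
    template = cv2.imread(f"{name.lower()}.png", cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(sheet, template, cv2.TM_CCOEFF_NORMED)
    _, score, _, location = cv2.minMaxLoc(result)
    if score > 0.7:                                   # arbitrary confidence cut-off
        print(f"{name} found near {location} (score {score:.2f})")

Because the loop runs once per dictionary entry rather than once per character position, a 100-name dictionary stays cheap, which is where the split-second budget comes from.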

Banned words checking algo

I am building a text chat system. I want to add the ability to check for banned words/phrases.
The only technique I can think of, and I can't believe it could possibly be the best approach, is to do a FOR loop through all the words and search for matches in the text. This seems like it would be unbelievably slow once lots of words are added.
I'm using AS3, but an answer in most any language would probably be useful.
take care,
lee
Use an AS3 Dictionary, or a dict in Python, and just check whether the word is in the dict. I don't see any way to avoid going over all the words.
Consider concatenating all the entries in your Dictionary into a single RegExp, with which you only have to parse the text once. I've done some testing, and it's way faster than replacing word by word.
function censorWithDictionary(dict:Dictionary, text:String):String {
    var reg:String = "";
    for (var key:Object in dict) {
        reg += reg == "" ? "" : "|";       // add an "or" between search words
        reg += "\\b" + dict[key] + "\\b";  // match whole words only
    }
    var regExp:RegExp = new RegExp(reg, "gi");
    return text.replace(regExp, "----");
}
I had a similar problem: we run a gaming site and wanted to introduce a chat system which was not manually moderated. We went the "banned word" route and it's working really well.
I just counted, and we now have a list of (just) 79 banned words, which originated from something I found online and to which we have added words over time as chat messages crept through.
The way we check things is that we squash an entire chat message together by removing all spaces and non-alphabetic characters, and then search for banned words in what's left.
The key decisions we made are:
1) Don't tell people why you rejected their messages.
2) Don't let people post chat until you trust them a bit (on our site they have to have played 3 games).
3) 5 "bad" messages and we automatically block you.
4) We email out a daily report with all the chat which got through, which we scan.
5) We allow other users to complain about posted messages; if that happens, the message is automatically removed so we can check it later.
1, 3 and 5 hardly ever happen now, and it works wonderfully, even though messages like
"I wish it was hot!"
are sometimes rejected (the clue is the "sh" part of "wish" plus "it"), but even that doesn't happen often.
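For reference, that squash-and-scan check is only a few lines in Python; the word list here is an illustrative stand-in for our real 79-word list, and the sketch reproduces the "I wish it was hot!" false positive described above:

BANNED = ["shit", "badword"]   # illustrative stand-ins for the real banned list

def is_bad(message):
    # Squash the message: keep letters only, then substring-search for each word.
    squashed = "".join(c for c in message.lower() if c.isalpha())
    return any(word in squashed for word in BANNED)

print(is_bad("s p a m  b.a.d.w.o.r.d"))   # True: spacing and dots can't hide it
print(is_bad("I wish it was hot!"))       # True: the "wiSH IT" false positive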
This is more a comment than an answer, but comments are limited in length and there are big issues here.
I believe you are fundamentally asking the wrong question!
Certainly dictionaries and blacklists would highlight words or phrases that you want to ban, but would that list be acceptable to the users of your system? Would there be text that users of your system find offensive but you do not? Who decides?
For example, would people living here have trouble, or indeed people living here? What if you supported this football/soccer team? This person probably never visits the UK.
Then you get into the issue of anagrams and slang. FCUK is a high-street brand in the UK (and elsewhere, I'm sure). And then there's pr0n (no link!) or NAMBLA.
The real question is: how do I stop people using the system from using language that is generally unacceptable? And that's more a design / social engineering problem than a programming problem. I don't think this site has word/phrase filtering, and yet there's nothing here that would cause offense to anyone.
Here's an idea: let your users decide what is acceptable! Use a reputation-based system. Allow users to vote up users who behave and vote down users who cause offense (with the option of allowing users to give feedback on the vote, to give them a chance to mend their ways), and then have an option to filter out users with low or negative reputations.

How to separate an address string mashed together in MySQL

I have an address string in MySQL that has been mashed together at the source. I think it is possible to use a regular expression or some other method to separate the string into usable parts in MySQL, but I am not aware of how this could be achieved.
Basically each string looks something like these examples (I have added a marker to the top to show what each bit is):
<-------------><-------><-><-->
123 Fake StreetRESERVOIRVIC3001
<-----------------><--------------------><------><-><-->
Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001
<-------------------><-------------------><--- ><><-->
Grange Police StationShop 1 Fairytale LaneGRANGEWA8001
The address is supposed to be broken up into (optionally) two lines of address information, a suburb, a state and a postcode. I'm in Australia, so the state will be one of NSW, VIC, QLD, WA, SA, NT or ACT, and the postcode will always be a four-digit number at the very end.
The possible ways to break it up are: the suburb will always be capitalised, the state and postcode will be predictable within the last six or seven characters (depending on the state), and the first two lines of address information will be separated by a change in case with no space character in between.
I have some 100,000 records like this, so going through and doing it by hand would be very time-consuming. Any help with a way of doing this programmatically would be much appreciated.
With no spaces? Most gross...
MySQL doesn't have the tools to deal with that, so you'll have to access the database from an external program. I tend to use Perl for manipulations like this.
Start from the end and work backwards: we know the last four characters should be digits, and the letters preceding them one of seven options. Use that knowledge and you'll be down two fields and six or seven characters.
It looks like your examples have the town in all capital letters near the end... Parse that out, and it should match up with the state and postcode. I'm certain you can find a database of postcodes online within a few minutes.
With the name and street address remaining, there will be some variability, and I wish you a bit of luck there. You may have a head start in being able to concentrate on the lack of a space between a lowercase and a capital letter, or between a letter and a number, as a breaking point.
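A sketch of that back-to-front approach in Python; the regex details are mine rather than anything from the question, and I have only reasoned it through against the three examples given:

import re

# Peel fields off the end: a 4-digit postcode, then a state, then an ALL-CAPS
# suburb; whatever remains is one or two lines of address information.
PATTERN = re.compile(
    r"^(?P<rest>.*?)"
    r"(?P<suburb>[A-Z ]+?)"
    r"(?P<state>NSW|VIC|QLD|WA|SA|NT|ACT)"
    r"(?P<postcode>\d{4})$"
)

def parse(address):
    m = PATTERN.match(address)
    if m is None:
        raise ValueError(f"unparseable address: {address!r}")
    fields = m.groupdict()
    # A lowercase letter followed by an uppercase letter or a digit, with no
    # space between them, marks the hidden line break.
    fields["lines"] = re.split(r"(?<=[a-z])(?=[A-Z0-9])", fields.pop("rest"))
    return fields

print(parse("Brooks Nursing Home123 Little Fake StreetSMITHTONNSW2001"))
# {'suburb': 'SMITHTON', 'state': 'NSW', 'postcode': '2001',
#  'lines': ['Brooks Nursing Home', '123 Little Fake Street']}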
Challenge accepted. I'll even throw in some basic punctuation to allow for "101 St. Mark's St." and the like.
/^(([\w\'\.](?=[a-z \'\.])| )+[a-z\'\.])?(([\w\'\.](?=[a-z \d\'\.])| )+[a-z\.\'])([A-Z]+)(NSW|VIC|QLD|WA|SA|NT|ACT)(\d{4})/
It could probably use a little more clean-up, but it should work in any language which supports basic regex with lookahead (some implementations, like JavaScript's and (I think) Ruby's, support lookahead but not lookbehind). (That, and this puzzle kept me up well past my bedtime.) At the very least, it worked on the three examples you provided.
By the way, 2problems.com is a great site for quickly testing regular expressions. It's what I used to work this puzzle out. The guy who built it must have been a real genius. (koff koff)
Rubular is another good option, though since it works by making Ajax calls to a Ruby script behind-the-scenes, it's a bit slower. It does have the nice feature of being able to link to entered patterns and haystacks, though; here's this pattern on Rubular. The 2problems guy really should get around to implementing something like that some day.