Solr case-insensitive configuration

Hello,
I'm implementing an autocompletion feature in Solr and have run into a problem.
For autocompletion I am using:
<fieldType name="text_auto" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
I thought that the LowerCaseFilter would make the token case-insensitive, but that is wrong. In fact it just lowercases the token, which means that a query like "comput" leads to "computer" while "Comput" doesn't.
I actually want both "comput" and "Comput" to lead to "Computer".
I already tried this:
<fieldType name="text_auto_low" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_auto_up" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
For some reason that doesn't work either. My question is: why not, and how can I fix it?

Lucene has the Analyzer class, which you can use (implement) in three ways:
SimpleAnalyzer: converts all of the input to lower case.
StopAnalyzer: removes stop words that add noise to your search.
StandardAnalyzer: does both of the above filter processes and can thus 'clean up' your query.
Now, coming to your question, I would recommend a technique called n-grams, which splits up your query and searches for those fragments instead. That way you can still get excellent results even if there are typos.
To learn how to do this, I suggest you read this to get started. It also has other great info regarding queries.
This will not only solve your problem, but also enhance your app.
Have fun :D
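A minimal sketch of such a field type (the name and gram sizes are illustrative, not from your schema), using Solr's EdgeNGramFilterFactory on the index side only. Note that the query analyzer also lowercases, which is what makes "Comput" and "comput" behave the same:

```xml
<fieldType name="text_auto_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index prefix grams: "computer" -> "c", "co", "com", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase the query too, so "Comput" matches the indexed grams -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```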

Related

Get the release date of an album (last.fm API)

I need to get the release date of a song.
With the last.fm API, as described in the documentation, it is enough to make an HTTP request to the server and it will reply with an XML (or JSON) document that contains the "releasedate" field (as shown in the Sample Response on the website).
The problem is that if I make the same request as in the documentation, the reply is identical except for the field that I need.
Is it possible to get this information in another way?
Example:
portion of the sample response in the website:
<album>
<name>Believe</name>
<artist>Cher</artist>
<id>2026126</id>
<mbid>61bf0388-b8a9-48f4-81d1-7eb02706dfb0</mbid>
<url>http://www.last.fm/music/Cher/Believe</url>
<releasedate>6 Apr 1999, 00:00</releasedate> <!-- I need this -->
<image size="small">...</image>
<image size="medium">...</image>
<image size="large">...</image>
<listeners>47602</listeners>
<playcount>212991</playcount>
<toptags>
<tag>
<name>pop</name>
<url>http://www.last.fm/tag/pop</url>
</tag>
...
</toptags>
<tracks>
<track rank="1">
<name>Believe</name>
<duration>239</duration>
<mbid/>
<url>http://www.last.fm/music/Cher/_/Believe</url>
<streamable fulltrack="0">1</streamable>
<artist>
<name>Cher</name>
<mbid>bfcc6d75-a6a5-4bc6-8282-47aec8531818</mbid>
<url>http://www.last.fm/music/Cher</url>
</artist>
</track>
...
</tracks>
</album>
My response:
<album>
<name>Believe</name>
<artist>Cher</artist>
<mbid>63b3a8ca-26f2-4e2b-b867-647a6ec2bebd</mbid>
<url>https://www.last.fm/music/Cher/Believe</url>
<image size="small">
https://lastfm.freetls.fastly.net/i/u/34s/3b54885952161aaea4ce2965b2db1638.png
</image>
<image size="medium">
https://lastfm.freetls.fastly.net/i/u/64s/3b54885952161aaea4ce2965b2db1638.png
</image>
<image size="large">
https://lastfm.freetls.fastly.net/i/u/174s/3b54885952161aaea4ce2965b2db1638.png
</image>
<image size="extralarge">
https://lastfm.freetls.fastly.net/i/u/300x300/3b54885952161aaea4ce2965b2db1638.png
</image>
<image size="mega">
https://lastfm.freetls.fastly.net/i/u/300x300/3b54885952161aaea4ce2965b2db1638.png
</image>
<image size="">
https://lastfm.freetls.fastly.net/i/u/300x300/3b54885952161aaea4ce2965b2db1638.png
</image>
<listeners>405536</listeners>
<playcount>2644726</playcount>
<tracks>...</tracks>
<tags>...</tags>
<wiki>...</wiki>
</album>
</lfm>
The request is http://ws.audioscrobbler.com/2.0/?method=album.getinfo&api_key=MY_API_KEY&artist=Cher&album=Believe
The page where that information is documented is: https://www.last.fm/api/show/album.getInfo
Thanks a lot!
This is definitely an inconsistency between the Last.fm API documentation and the actual response.
The only other place where such information might be available is artist.getTopAlbums, but the release date is not available there either.
So, to answer your question: no, it is not possible to get the release date of the album via the API. Your best bet, if you really want this piece of information, is to extract it from the HTML page itself via web scraping, and not to rely on the API for this scenario.
I'm having the same issue: the release date doesn't come back. I've opted to use the MusicBrainz API to get the album details. However, the issue with that is the calls come back as 500s if too many requests are made in quick succession; the only workaround I have found is to add a one-second delay between calls, which becomes annoying with a lot of albums.
Anyway, here's a sample request : http://musicbrainz.org/ws/2/release/844eb096-2b84-4c8f-9922-7f287126b39e?fmt=json
If you find a workaround for the timing issue let me know
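A minimal Python sketch of that workaround (the function names, app name, and User-Agent string are my own; MusicBrainz asks clients to identify themselves and to stay at roughly one request per second):

```python
import json
import time
import urllib.request

MB_BASE = "https://musicbrainz.org/ws/2/release/"

def build_release_url(mbid: str) -> str:
    # fmt=json asks MusicBrainz for a JSON response instead of XML
    return MB_BASE + mbid + "?fmt=json"

def fetch_releases(mbids, delay=1.0):
    """Fetch release info one MBID at a time, sleeping between calls
    to stay under the rate limit and avoid 5xx responses."""
    results = {}
    for mbid in mbids:
        req = urllib.request.Request(
            build_release_url(mbid),
            # MusicBrainz rejects anonymous clients, so send a descriptive User-Agent
            headers={"User-Agent": "MyAlbumApp/1.0 (me@example.com)"},
        )
        with urllib.request.urlopen(req) as resp:
            results[mbid] = json.load(resp)
        time.sleep(delay)  # throttle between consecutive calls
    return results
```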

Strip, store and index HTML files in Solr

I'm trying to search a collection of HTML files and also provide excerpts in Solr 6.4.1. And since the highlighting needs to return clean readable text, the HTML needs to be stripped down to bare text and stored.
But no matter what I change in the core's configuration, the field I'm specifying does not get returned in the result and highlighting for the document is always empty {}.
managed-schema:
<fieldType name="text_en_splitting_html" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="content1" type="text_en_splitting_html" multiValued="true" indexed="true" stored="true"/>
solrconfig.xml is the default one, with the default /update/extract requestHandler. The response I'm getting is:
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"*:*",
"hl":"on",
"indent":"on",
"hl.fl":"content1",
"wt":"json",
"_":"1488077854581"}},
"response":{"numFound":100,"start":0,"docs":[
{
"id":"/home/me/files/d1/test.html",
"stream_size":[62963],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"creator":["createhtml"],
"stream_content_type":["text/html"],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["A nice read"],
"content_encoding":["UTF-8"],
"resourcename":["/home/me/files/d1/test.html"],
"title":["A nice read"],
"creator_url":["http://createhtml.net"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1560362957551960064}
...
},
"highlighting":{
"/home/me/files/d1/test.html":{},
...
I'm indexing with
/opt/solr/bin/post -c mycollection -filetypes html files/
I've also tried with the Tika extract handler
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
</lst>
</requestHandler>
but with limited success. A "content" field now appears in the response and it contains what appears to be a poorly stripped and badly formatted version of the initial document. Highlighting appears to work but it's not clean.
So what I need Solr to do is:
clean up my HTML entirely (no tags, class names, or inline styles - just like JavaScript's .text() method)
perform the search on the stripped content
return the stripped content if I ask it to
return the highlighting on the stripped content
It seems that no matter what I change (except Tika above), "content1" is ignored.
All I'm trying to do here, simply put, is be able to search HTML files and provide excerpts like any other search engine.
I was unable to make this work, and Tika would not correctly strip the HTML, so I fixed this by using the Solarium PHP client for Solr together with phpQuery to parse, strip, and extract data, then formed my own document to post directly to Solr.
The problem was the ExtractingRequestHandler (ERH) defined in solrconfig.xml, which was enforcing the use of Tika. By using Solarium, the ERH was bypassed, so all the fields I defined in managed-schema started being used by the /update request handler.
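The same approach can be sketched in Python (names and the Solr URL are illustrative, and this is not the actual Solarium/phpQuery code): strip the HTML to bare text yourself, then post a plain JSON document to /update so Tika never touches it.

```python
import json
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, dropping tags, attributes, scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html: str) -> str:
    p = TextExtractor()
    p.feed(html)
    # collapse runs of whitespace left behind by removed tags
    return " ".join(" ".join(p.parts).split())

def post_to_solr(doc: dict, solr_url: str):
    """Send one document straight to the JSON /update handler, bypassing Tika."""
    req = urllib.request.Request(
        solr_url + "/update?commit=true",
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

With this, the stored `content1` field contains only readable text, so highlighting returns clean excerpts.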

Searching in Solr

I am building an ecommerce project where I am using the Solr search engine, and I want to search based on a specific keyword. If I enter "c1234", it should display all the documents containing the keyword "c1234". That works fine. But if I enter "c12#34", it should also be treated as "c1234". So the problem is that I want Solr to ignore the hash character and display the same results in both cases.
The other problem is that I want to trim whitespace. If I search for "HP 940", it should trim the whitespace and display the same results as "HP940", so that I get similar results with or without the space. For example, if I enter "Hp 940", Solr should treat it as "HP940". So the problem is trimming the whitespace.
Thanks in advance.
Try using solr.WordDelimiterFilterFactory.
Test case:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" catenateWords="1"
generateNumberParts="1" catenateNumbers="0" splitOnNumerics="1"
catenateAll="0" splitOnCaseChange="1"
stemEnglishPossessive="1" preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
To replace # you should use a char filter; see https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
For the hashtag and other characters, take a look at solr.WordDelimiterFilterFactory with the catenateWords parameter, or alternatively solr.PatternReplaceCharFilterFactory.
For words like "HP 940", also consider something like phrase fields on the dismax handler with no slop.
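As a sketch, a char filter like this placed before the tokenizer would drop the "#" at analysis time (the field type name is illustrative; use it in both the index and query analyzers so both sides agree):

```xml
<fieldType name="text_nohash" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- remove "#" before tokenization, so "c12#34" is indexed and queried as "c1234" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```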

Work Item HTMLFieldControl Content

I am trying to pre-populate a few different fields in a User Story in my TFS 2012 work item list. The HTMLFieldControl can be found in the Work Item Types section, and I can see that it's created here:
<Tab Label="Details">
<Control FieldName="System.Description" Type="HtmlFieldControl" Dock="Fill" />
</Tab>
I've been looking on Google but I just can't seem to find anything about adding text to this field so that it is always available; perhaps I'm just doing something wrong.
Would doing something like the following give me the results I require? I can't really just try it and potentially break the system (which is unfortunate), so I need some guidance so that I could quickly fix any formatting or spelling mistakes without having to roll back changes.
<Tab Label="Details">
<Control FieldName="System.Description" Type="HtmlFieldControl" Dock="Fill" >
<FIELD name="Description" refname="System.Description" type="Text" Content="Hello there"/>
</Control>
</Tab>
Obviously the above is purely made up (the FIELD section), but I included it to show the lines along which I was thinking and hopefully what I'm trying to do.
You were looking in the wrong place; those are the definitions for the form (i.e., how the work item is displayed graphically). You need to scroll up to the top, under FIELDS, and find this:
<FIELD name="Description" refname="System.Description" type="HTML" />
...then change it to this:
<FIELD name="Description" refname="System.Description" type="HTML">
<DEFAULT from="value" value="Hello there" />
</FIELD>

Log4Net filters "OR"

Is it possible to make a filter, for example a PropertyFilter, that is neutral (and passes to the next filter in the chain) if either one value or another matches? Something like:
<filter type="log4net.Filter.PropertyFilter">
<Key value="myProperty" />
<StringsToMatch Operator="OR">
<Match>value1</Match>
<Match>value2</Match>
</StringsToMatch>
</filter>
I really don't want to write my own filter and would prefer to accomplish this with the normal Log4Net filters. Is this possible?
You could certainly develop such a filter yourself by subclassing FilterSkeleton.
But instead of making a specialized filter like this, I suggest you implement a more generic filter that can be configured with a collection of filters and applies the operator over them. The config could look something like this:
<filter type="CompositeFilter">
<operator value="Or" />
<filters>
<filter type="log4net.Filter.PropertyFilter">
<stringToMatch value="value1" />
</filter>
<filter type="log4net.Filter.PropertyFilter">
<stringToMatch value="value2" />
</filter>
</filters>
</filter>
If you make such a filter, I encourage you to submit it to the log4net project; it would certainly be useful to the general public :)
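If writing a custom filter is off the table, here is a sketch using only stock log4net filters, under the assumption that accepting on a match (rather than staying neutral, as the question asked) is acceptable for your appender: two PropertyFilters that accept on match, followed by a DenyAllFilter that drops everything else, behave like an OR at the appender level.

```xml
<!-- accept the event if myProperty matches either value; otherwise deny -->
<filter type="log4net.Filter.PropertyFilter">
  <key value="myProperty" />
  <stringToMatch value="value1" />
  <acceptOnMatch value="true" />
</filter>
<filter type="log4net.Filter.PropertyFilter">
  <key value="myProperty" />
  <stringToMatch value="value2" />
  <acceptOnMatch value="true" />
</filter>
<!-- reached only when neither filter above matched -->
<filter type="log4net.Filter.DenyAllFilter" />
```

The trade-off is that an accepted event skips any later filters in the chain, which is why this is only an approximation of a neutral OR filter.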