Searching in Solr - mysql

I am building an ecommerce project that uses the Solr search engine. I want to search on a specific keyword. If I enter "c1234", it should display all the documents containing the keyword "c1234". That works fine. But if I enter "c12#34", Solr should also treat it as "c1234". So I want Solr to ignore the hash character and return the same results in both cases.
The other problem is whitespace. If I search for "HP 940", Solr should strip the whitespace and return the same results as for "HP940". In other words, I want the same results with or without the whitespace.
Thanks in Advance

Try solr.WordDelimiterFilterFactory.
Test case:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" catenateWords="1"
generateNumberParts="1" catenateNumbers="0" splitOnNumerics="1"
catenateAll="0" splitOnCaseChange="1"
stemEnglishPossessive="1" preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
To remove the # character, use a char filter; see https://cwiki.apache.org/confluence/display/solr/CharFilterFactories

For the hash and other special characters, take a look at solr.WordDelimiterFilterFactory with the catenateWords parameter, or alternatively at solr.PatternReplaceCharFilterFactory.
For terms like "HP 940", also consider phrase fields (pf) on the dismax handler with no slop.
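As a rough sketch of the PatternReplaceCharFilterFactory route, the field type below removes "#" and whitespace before tokenizing, so "c12#34" and "HP 940" are indexed and queried as "c1234" and "hp940". The name text_sku is made up for illustration, and the approach assumes the field holds short codes that should match as a whole value. Depending on the Solr version and query parser settings, the query string may be split on whitespace before analysis, so the query may need to be quoted for the whitespace rule to apply.

```xml
<!-- Hypothetical field type name; not taken from the original posts. -->
<fieldType name="text_sku" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Drop every '#' before tokenizing: "c12#34" becomes "c1234". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#" replacement=""/>
    <!-- Drop whitespace: "HP 940" becomes "HP940". -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=""/>
    <!-- Keep the whole (cleaned) value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Make matching case-insensitive: "HP940" and "hp940" are the same term. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because one analyzer is declared without a type, it is used at both index and query time, which keeps the two sides consistent. Existing documents would need to be reindexed after the change.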

Related

How to apply Photoshop-style color curves filters to an HTML tag?

I'm trying to replicate the following Photoshop color curves filter in an HTML video tag.
The closest answer I've found so far is how to add Photoshop-like color levels with CSS and SVG Filters but it's not quite what I need.
Any approach using CSS, SVG Filters, or even a third-party library is welcome!
feComponentTransfer / table is how you implement color curves. This filter will - roughly - produce that curve combo and should give you a place to start. The first feComponentTransfer implements the color curves (I eyeballed the values - you'll want to go back and do it more carefully). The second implements the white-point adjustment.
I think the order is correct, but I'm not a Photoshop expert, so you may have to put the white point adjustment first.
<filter id="color-curve" color-interpolation-filters="sRGB">
<feComponentTransfer>
<feFuncR type="table" tableValues="0.0 0.22 0.4 0.6 0.4 0.8 0.86 0.92 0.96 0.98 1.0"/>
<feFuncG type="table" tableValues="0.0 0.0 0.05 0.1 0.22 0.4 0.6 0.83 0.92 0.97 1.0"/>
<feFuncB type="table" tableValues="0.1 0.8"/>
</feComponentTransfer>
<feComponentTransfer>
<feFuncR type="table" tableValues="0.1 1"/>
<feFuncG type="table" tableValues="0.1 1"/>
<feFuncB type="table" tableValues="0.1 1"/>
</feComponentTransfer>
</filter>
I wrote the docs for webplatform on feComponentTransfer linked here: http://www.webplatform.org/docs/svg/elements/feComponentTransfer/
There is no more comprehensive guide to how this filter primitive works so just read that carefully.
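For reference, a type="table" function splits the input range 0..1 into n-1 equal segments for n tableValues and interpolates linearly between neighbouring values. The fragment below is a minimal sketch of that idea and of wiring a filter to a video element through the CSS filter property; the id demo-curve and the file name clip.mp4 are placeholders, not part of the answer above.

```xml
<!-- Zero-size inline SVG that only carries the filter definition. -->
<svg xmlns="http://www.w3.org/2000/svg" width="0" height="0" style="position:absolute">
  <filter id="demo-curve" color-interpolation-filters="sRGB">
    <feComponentTransfer>
      <!-- Three values = two segments: 0 -> 0, 0.5 -> 0.8, 1 -> 1.
           An input of 0.25 lands halfway along the first segment, so it maps to 0.4. -->
      <feFuncR type="table" tableValues="0 0.8 1"/>
      <!-- "0 1" is the identity curve: green and blue pass through unchanged. -->
      <feFuncG type="table" tableValues="0 1"/>
      <feFuncB type="table" tableValues="0 1"/>
    </feComponentTransfer>
  </filter>
</svg>

<!-- Reference the filter by id from CSS; clip.mp4 is a placeholder source. -->
<video src="clip.mp4" controls="controls" style="filter: url(#demo-curve);"></video>
```

Adding more tableValues gives a finer curve, which is why the red and green channels in the answer use eleven values each.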

Dynamic Highlighted content filtering using page property as filter

I am using CAML to filter Highlighted content web parts. My query looks like this:
<View>
<Query>
<Where>
<Eq>
<FieldRef Name='Service' />
<Value Type='Text'>Wi-Fi</Value>
</Eq>
</Where>
</Query>
</View>
How can I replace the value Wi-Fi with a page property, so that the property filters all Highlighted content web parts on the page?
I have 70 pages with 7 Highlighted content web parts per page, and I do not want to enter a manual filter for each web part.

Strip, store and index HTML files in Solr

I'm trying to search a collection of HTML files and also provide excerpts in Solr 6.4.1. And since the highlighting needs to return clean readable text, the HTML needs to be stripped down to bare text and stored.
But no matter what I change in the core's configuration, the field I'm specifying does not get returned in the result and highlighting for the document is always empty {}.
managed-schema:
<fieldType name="text_en_splitting_html" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="content1" type="text_en_splitting_html" multiValued="true" indexed="true" stored="true"/>
solrconfig.xml is the default one, with the default /update/extract requestHandler. The response I'm getting is:
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"*:*",
"hl":"on",
"indent":"on",
"hl.fl":"content1",
"wt":"json",
"_":"1488077854581"}},
"response":{"numFound":100,"start":0,"docs":[
{
"id":"/home/me/files/d1/test.html",
"stream_size":[62963],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"creator":["createhtml"],
"stream_content_type":["text/html"],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["A nice read"],
"content_encoding":["UTF-8"],
"resourcename":["/home/me/files/d1/test.html"],
"title":["A nice read"],
"creator_url":["http://createhtml.net"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1560362957551960064}
...
},
"highlighting":{
"/home/me/files/d1/test.html":{},
...
I'm indexing with
/opt/solr/bin/post -c mycollection -filetypes html files/
I've also tried with the Tika extract handler
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
</lst>
</requestHandler>
but with limited success. A "content" field now appears in the response and it contains what appears to be a poorly stripped and badly formatted version of the initial document. Highlighting appears to work but it's not clean.
So what I need Solr to do is:
clean up my HTML entirely (no tags, class names, or inline styles - just like jQuery's .text() method)
perform the search on the stripped content
return the stripped content if I ask it to
return the highlighting on the stripped content
It seems that no matter what I change (except Tika above), "content1" is ignored.
All I'm trying to do here, simply put, is be able to search HTML files and provide excerpts like any other search engine.
I was unable to make this work and Tika would not correctly strip the HTML, so I worked around it by using the Solarium PHP Client for Solr and PHPQuery to parse, strip, and extract the data, then built my own document to post directly to Solr.
The problem was the ExtractingRequestHandler (ERH) defined in solrconfig.xml, which forces the use of Tika. Using Solarium bypassed the ERH, so all the fields I defined in managed-schema started being used by the /update request handler.
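One detail that makes client-side stripping necessary: char filters such as HTMLStripCharFilterFactory only affect the tokens that get indexed, while the stored value (what the response and the highlighter return) is the original input. So clean text has to be sent to Solr in the first place. A document posted straight to the /update handler could look roughly like the sketch below; the id is taken from the question, and the content1 text is a made-up placeholder for the stripped page body.

```xml
<!-- Solr XML update message, posted to /update with Content-Type: text/xml. -->
<add>
  <doc>
    <field name="id">/home/me/files/d1/test.html</field>
    <field name="title">A nice read</field>
    <!-- Placeholder: plain text produced by the client-side HTML stripping step. -->
    <field name="content1">Plain text extracted from the HTML body goes here.</field>
  </doc>
</add>
```

A commit is still needed afterwards before the document becomes searchable.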

Solr case insensitive

Hello,
I'm implementing an autocompletion feature in Solr and have one problem.
For autocompletion I am using
<fieldType name="text_auto" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
I thought that the LowerCaseFilter would make the token case insensitive, but that is wrong. In fact it just lowercases the token, which means that a query like "comput" leads to "computer" while "Comput" doesn't.
What I actually want is for both comput and Comput to lead to Computer.
I already tried this:
<fieldType name="text_auto_low" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_auto_up" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldType>
For some reason it doesn't work either. My question is: why, and how can I fix this?
Lucene has the Analyzer class, with three commonly used implementations:
SimpleAnalyzer: converts all of the input to lower case.
StopAnalyzer: removes stop words, i.e. words that only add noise to your search.
StandardAnalyzer: does both of the above, and thus can 'clean up' your query.
Now, coming to your question, I would recommend a technique called n-grams, which splits your text into fragments and then searches on those instead. That way you can still get good results even if there are typos.
To learn how to do this, I suggest you read this to get started. It also has other great info about queries.
This will not only solve your problem, but will also improve your app.
Have fun :D
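A minimal sketch of the n-gram idea for autocompletion, assuming edge n-grams (prefixes only) are enough: the index analyzer stores every lowercase prefix of the value, and the query analyzer only lowercases, so "comput" and "Comput" both become the term "comput" and match "Computer" with a plain, non-wildcard query. The field type name and the gram sizes are illustrative choices. One possible reason the original setup failed is that prefix or wildcard queries are often not run through the analyzer, so "Comput*" is never lowercased; that is a guess, since the question does not show the query.

```xml
<!-- Hypothetical field type name and gram sizes; adjust to your data. -->
<fieldType name="text_auto_ngram" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <!-- Treat the whole value as one token, then lowercase it. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "computer" is indexed as "c", "co", "com", ... up to 25 characters. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the typed prefix is matched as-is, lowercased. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is a larger index, since each value produces one term per prefix length.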

Log4Net filters "OR"

Is it possible to make a filter, for example a PropertyFilter that is neutral (and passed to next filter in the chain) if either one or another value matches? Something like:
<filter type="log4net.Filter.PropertyFilter">
<Key value="myProperty" />
<StringsToMatch Operator="OR">
<Match>value1</Match>
<Match>value2</Match>
</StringsToMatch>
</filter>
I really don't want to write my own filter and would prefer to accomplish this with the normal Log4Net filters. Is this possible?
You could certainly develop such a filter yourself by subclassing FilterSkeleton.
But instead of making a specialized filter like this I suggest you rather implement a more generic filter that could be configured to contain a collection of filters and apply the Operator over those. The config could look something like this:
<filter type="CompositeFilter">
<operator value="Or" />
<filters>
<filter type="log4net.Filter.PropertyFilter">
<key value="myProperty" />
<stringToMatch value="value1" />
</filter>
<filter type="log4net.Filter.PropertyFilter">
<key value="myProperty" />
<stringToMatch value="value2" />
</filter>
</filters>
</filter>
If you make such a filter I encourage you to submit it to the log4net project. It would certainly be useful for the general public :)
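If writing a custom filter is not an option, a regular expression may cover the simple "either value" case, assuming PropertyFilter's regexToMatch setting behaves like the one it inherits from StringMatchFilter. This is only a sketch, and it differs from what the question asks for in one respect: as far as I know a match yields Accept (or Deny with acceptOnMatch set to false) rather than Neutral, so later filters in the chain are skipped for matching events.

```xml
<!-- Sketch: accept events whose myProperty is exactly value1 or value2. -->
<filter type="log4net.Filter.PropertyFilter">
  <key value="myProperty" />
  <!-- Alternation in the regex plays the role of the OR operator. -->
  <regexToMatch value="^(value1|value2)$" />
</filter>
<!-- Events that did not match fall through to here and are dropped. -->
<filter type="log4net.Filter.DenyAllFilter" />
```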