Strip, store and index HTML files in Solr

I'm trying to search a collection of HTML files and also provide excerpts in Solr 6.4.1. And since the highlighting needs to return clean readable text, the HTML needs to be stripped down to bare text and stored.
But no matter what I change in the core's configuration, the field I'm specifying does not get returned in the result and highlighting for the document is always empty {}.
managed-schema:
<fieldType name="text_en_splitting_html" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" catenateAll="0" catenateWords="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content1" type="text_en_splitting_html" multiValued="true" indexed="true" stored="true"/>
solrconfig.xml is the default one, with the default /update/extract requestHandler. The response I'm getting is:
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"*:*",
"hl":"on",
"indent":"on",
"hl.fl":"content1",
"wt":"json",
"_":"1488077854581"}},
"response":{"numFound":100,"start":0,"docs":[
{
"id":"/home/me/files/d1/test.html",
"stream_size":[62963],
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.HtmlParser"],
"creator":["createhtml"],
"stream_content_type":["text/html"],
"viewport":["width=device-width, initial-scale=1"],
"dc_title":["A nice read"],
"content_encoding":["UTF-8"],
"resourcename":["/home/me/files/d1/test.html"],
"title":["A nice read"],
"creator_url":["http://createhtml.net"],
"content_type":["text/html; charset=UTF-8"],
"_version_":1560362957551960064}
...
},
"highlighting":{
"/home/me/files/d1/test.html":{},
...
I'm indexing with
/opt/solr/bin/post -c mycollection -filetypes html files/
I've also tried with the Tika extract handler
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.Last-Modified">last_modified</str>
  </lst>
</requestHandler>
but with limited success. A "content" field now appears in the response and it contains what appears to be a poorly stripped and badly formatted version of the initial document. Highlighting appears to work but it's not clean.
So what I need Solr to do is:
clean up my HTML entirely (no tags, class names, or inline styles - just like jQuery's .text() method)
perform the search on the stripped content
return the stripped content if I ask it to
return the highlighting on the stripped content
It seems that no matter what I change (except Tika above), "content1" is ignored.
All I'm trying to do here, simply put, is be able to search HTML files and provide excerpts like any other search engine.

I was unable to make this work and Tika would not correctly strip the HTML, so I fixed this by using the Solarium PHP Client for Solr and PHPQuery to parse, strip, extract data, then form my own document to post directly to Solr.
The problem was the ERH (ExtractingRequestHandler) defined in solrconfig.xml, which was enforcing the use of Tika. By using Solarium, the ERH was bypassed, so all fields I defined in managed-schema started being used by the /update request handler.
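As a rough illustration of this approach (not the actual Solarium code), a document that has already been stripped on the client could be sent to the plain /update handler in Solr's XML update format. The id and title below are taken from the response in the question; the body text is a made-up placeholder.

```xml
<!-- Hypothetical update message: POST it to /solr/mycollection/update?commit=true
     with Content-Type: text/xml. The content1 text is placeholder data. -->
<add>
  <doc>
    <field name="id">/home/me/files/d1/test.html</field>
    <field name="title">A nice read</field>
    <!-- Already reduced to plain text by the client, so the stored value is clean -->
    <field name="content1">Plain text extracted from the page body goes here.</field>
  </doc>
</add>
```

Solr stores a field value exactly as it is submitted; char filters such as HTMLStripCharFilterFactory only affect the indexed tokens. Stripping before posting is therefore what keeps the returned text and the highlighting snippets free of markup.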

Related

Can't use EWS UpdateItem operation from Outlook add-in

I am trying to update a dictionary element in the IPM.Configuration.OWA.UserOptions message using an UpdateItem via an EWS (SOAP) request from an Outlook web add-in with ReadWriteMailbox permissions. However, it is failing with the following error in the response:
ErrorAccessDenied: Office extension is not allowed to update this type of item.
The UpdateItem request I'm using is a fairly straightforward example of updating a message by its ID and setting the value of an extended property:
<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:t="http://schemas.microsoft.com/exchange/services/2006/types">
<soap:Header>
<RequestServerVersion Version="Exchange2013" xmlns="http://schemas.microsoft.com/exchange/services/2006/types" soap:mustUnderstand="0" />
</soap:Header>
<soap:Body>
<m:UpdateItem MessageDisposition="SaveOnly" ConflictResolution="AlwaysOverwrite">
<m:ItemChanges>
<t:ItemChange>
<t:ItemId Id="AAMkAGM0YTZmNjhiLTI0OWYtNGFlNC05ODAzLTNlZWQyODhmOTY2MABGAAAAAACxU7lpjO+oS5hB0UfA6muFBwDcAGmTk49MRrSCdR7rvVFPAAAAAAEBAADcAGmTk49MRrSCdR7rvVFPAAD2pXuVAAA=" ChangeKey="CQAAABYAAADcAGmTk49MRrSCdR7rvVFPAAD2uhNb" />
<t:Updates>
<t:SetItemField>
<t:ExtendedFieldURI PropertyTag="0x7c07" PropertyType="Binary" />
<t:Message>
<t:ExtendedProperty>
<t:ExtendedFieldURI PropertyTag="0x7c07" PropertyType="Binary" />
<t:Value>PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz4NCjxVc2VyQ29uZmlndXJhdGlvbj4NCgk8SW5mbyB2ZXJzaW9uPSJFeGNoYW5nZS4xMiIgLz4NCgk8RGF0YT4NCgkJPGUgaz0iMTgtSXNGb2N1c2VkSW5ib3hPbkxhc3RVcGRhdGVUaW1lIiB2PSIxOC0wMS8wMS8wMDAxIDAwOjAwOjAwIiAvPg0KCQk8ZSBrPSIxOC1hdXRvYWRkc2lnbmF0dXJlIiB2PSIzLVRydWUiIC8+DQoJCTxlIGs9IjE4LVVzZXJPcHRpb25zTWlncmF0aW9uU3RhdGUiIHY9IjktNSIgLz4NCgkJPGUgaz0iMTgtdGhlbWVTdG9yYWdlSWQiIHY9IjE4LSIgLz4NCgkJPGUgaz0iMTgtYXV0b2FkZHNpZ25hdHVyZW9ucmVwbHkiIHY9IjMtVHJ1ZSIgLz4NCgkJPGUgaz0iMTgtdGltZXpvbmUiIHY9IjE4LUNlbnRyYWwgU3RhbmRhcmQgVGltZSIgLz4NCgkJPGUgaz0iMTgtc2lnbmF0dXJldGV4dCIgdj0iMTgtJiN4RDsmI3hBOy0tJiN4RDsmI3hBO0l6enogYW0gSmFuaWNrJiN4RDsmI3hBOyYjeEQ7JiN4QTsiIC8+DQoJCTxlIGs9IjE4LUZhdm9yaXRlRm9sZGVycyIgdj0iMS0xOC0zLTEyMC1BQU1rQUdNMFlUWm1OamhpTFRJME9XWXROR0ZsTkMwNU9EQXpMVE5sWldReU9EaG1PVFkyTUFBdUFBQUFBQUN4VTdscGpPK29TNWhCMFVmQTZtdUZBUURjQUdtVGs0OU1SclNDZFI3cnZWRlBBQUFBQUFFTUFBQT0tMTIwLUFBTWtBR00wWVRabU5qaGlMVEkwT1dZdE5HRmxOQzA1T0RBekxUTmxaV1F5T0RobU9UWTJNQUF1QUFBQUFBQ3hVN2xwak8rb1M1aEIwVWZBNm11RkFRRGNBR21UazQ5TVJyU0NkUjdydlZGUEFBQUFBQUVKQUFBPS0xMjAtQUFNa0FHTTBZVFptTmpoaUxUSTBPV1l0TkdGbE5DMDVPREF6TFRObFpXUXlPRGhtT1RZMk1BQXVBQUFBQUFDeFU3bHBqTytvUzVoQjBVZkE2bXVGQVFEY0FHbVRrNDlNUnJTQ2RSN3J2VkZQQUFBQUFBRVBBQUE9IiAvPg0KCQk8ZSBrPSIxOC1Jc09wdGltaXplZEZvckFjY2Vzc2liaWxpdHkiIHY9IjMtRmFsc2UiIC8+DQoJCTxlIGs9IjE4LUlzRm9jdXNlZEluYm94RW5hYmxlZCIgdj0iMy1UcnVlIiAvPg0KCQk8ZSBrPSIxOC1OZXdFbmFibGVkUG9udHMiIHY9IjktMjE0NzQwMTcyNyIgLz4NCgkJPGUgaz0iMTgtc2lnbmF0dXJlaHRtbCIgdj0iMTgtJmx0O2h0bWwmZ3Q7JiN4RDsmI3hBOyZsdDtoZWFkJmd0OyYjeEQ7JiN4QTsmbHQ7L2hlYWQmZ3Q7JiN4RDsmI3hBOyZsdDtib2R5Jmd0OyYjeEQ7JiN4QTsmbHQ7cCZndDsmYW1wO25ic3A7Jmx0Oy9wJmd0OyYjeEQ7JiN4QTsmbHQ7cCZndDstLSZsdDsvcCZndDsmI3hEOyYjeEE7Jmx0O3AmZ3Q7SXp6eiBhbSBKYW5pY2smbHQ7L3AmZ3Q7JiN4RDsmI3hBOyZsdDtwJmd0OyZhbXA7bmJzcDsmbHQ7L3AmZ3Q7JiN4RDsmI3hBOyZsdDsvYm9keSZndDsmI3hEOyYjeEE7Jmx0Oy9odG1sJmd0OyYjeEQ7JiN4QTsiIC8+DQoJPC9EYXRhPg0KPC9Vc2VyQ29uZmlndXJhdGlvbj4=</t:Value>
</t:ExtendedProperty>
</t:Message>
</t:SetItemField>
</t:Updates>
</t:ItemChange>
</m:ItemChanges>
</m:UpdateItem>
</soap:Body>
</soap:Envelope>
The 0x7c07 property I'm updating in the UserOptions message contains a base64 encoded value of various signature related dictionary properties that I've modified:
<?xml version="1.0" encoding="utf-8"?>
<UserConfiguration>
<Info version="Exchange.12" />
<Data>
<e k="18-IsFocusedInboxOnLastUpdateTime" v="18-01/01/0001 00:00:00" />
<e k="18-autoaddsignature" v="3-True" />
<e k="18-UserOptionsMigrationState" v="9-5" />
<e k="18-themeStorageId" v="18-" />
<e k="18-autoaddsignatureonreply" v="3-True" />
<e k="18-timezone" v="18-Central Standard Time" />
<e k="18-signaturetext" v="18-
--
Izzz am Janick
" />
<e k="18-FavoriteFolders" v="1-18-3-120-AAMkAGM0YTZmNjhiLTI0OWYtNGFlNC05ODAzLTNlZWQyODhmOTY2MAAuAAAAAACxU7lpjO+oS5hB0UfA6muFAQDcAGmTk49MRrSCdR7rvVFPAAAAAAEMAAA=-120-AAMkAGM0YTZmNjhiLTI0OWYtNGFlNC05ODAzLTNlZWQyODhmOTY2MAAuAAAAAACxU7lpjO+oS5hB0UfA6muFAQDcAGmTk49MRrSCdR7rvVFPAAAAAAEJAAA=-120-AAMkAGM0YTZmNjhiLTI0OWYtNGFlNC05ODAzLTNlZWQyODhmOTY2MAAuAAAAAACxU7lpjO+oS5hB0UfA6muFAQDcAGmTk49MRrSCdR7rvVFPAAAAAAEPAAA=" />
<e k="18-IsOptimizedForAccessibility" v="3-False" />
<e k="18-IsFocusedInboxEnabled" v="3-True" />
<e k="18-NewEnabledPonts" v="9-2147401727" />
<e k="18-signaturehtml" v="18-<html><head></head><body><p>&nbsp;</p><p>--</p><p>John Doe</p><p>&nbsp;</p></body></html>;" />
</Data>
</UserConfiguration>
I can't find any documentation that states what is allowed or not allowed with UpdateItem operations (the list should be here). Updating the Outlook Online signature is also not currently possible with Graph or the Mail API. If what I'm trying to do ultimately cannot work then I just wasted 40+ hours of effort and will lose a client. :-|
Does anybody have any clever workarounds or know of a way to enable this operation? Note that using the EWS Managed API in server-side code is not currently an option for this solution.

By design, Outlook does not allow add-ins to create or update FAI (folder associated information) messages. In general, we don't allow direct modification of OWA options (or any other “service-type” internal item or data), since this is probably an internal data structure that is subject to change and could break the add-in.
If your scenario specifically requires access to Outlook's signature system, I would recommend making a request on UserVoice.

Searching in Solr

I am building an e-commerce project that uses the Solr search engine. I want to search on a specific keyword: if I enter "c1234", it should display all the documents containing the keyword "c1234". That works fine. But if I enter "c12#34", it should also be treated as "c1234". So the first problem is that I want Solr to ignore the hash sign and return the same results in both cases.
The other problem is that I want to ignore whitespace. If I search for "HP 940", Solr should treat it as "HP940", so that the same results are displayed with or without the whitespace.
Thanks in advance

Try using solr.WordDelimiterFilterFactory.
Test case:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"
            generateNumberParts="1" catenateNumbers="0" splitOnNumerics="1"
            catenateAll="0" splitOnCaseChange="1"
            stemEnglishPossessive="1" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
To replace # you should use https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
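As one possible sketch of the char filter route, solr.PatternReplaceCharFilterFactory can delete the hash sign before tokenization. The field type name text_sku is an illustrative assumption, not something from the question.

```xml
<!-- Hypothetical field type: "text_sku" is an illustrative name -->
<fieldType name="text_sku" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer is applied at both index time and query time -->
  <analyzer>
    <!-- Remove every '#' before tokenizing, so "c12#34" is analyzed as "c1234" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="#" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs on documents and on queries, "c12#34" and "c1234" end up as the same token on both sides.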

For the hashtag and other characters you should take a look at the solr.WordDelimiterFilterFactory for this with the catenateWords parameter or alternatively the solr.PatternReplaceCharFilterFactory.
For words like HP 940 also consider something like phrase fields on the dismax handler with no slop.
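The phrase-field suggestion could be sketched as request handler defaults like the following, assuming a hypothetical searchable field called name and the handler name /select. Note that pf only boosts documents in which the query terms appear next to each other; whether "HP 940" finds a document containing "HP940" still depends on the index-time analysis splitting that value into "hp" and "940".

```xml
<!-- Hypothetical defaults: the handler name and the field "name" are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields searched for the individual query terms -->
    <str name="qf">name</str>
    <!-- Fields in which the whole query is also tried as a phrase, for boosting -->
    <str name="pf">name</str>
    <!-- Phrase slop of 0: the terms must be directly adjacent to earn the boost -->
    <int name="ps">0</int>
  </lst>
</requestHandler>
```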

Where to put highlighting snippet configuration in Solr 3.4

I'm a beginner in Solr. I need to add the highlighting configuration (color, snippet, ...) in solrconfig.xml. Which tag should I use? Can anyone give an example?
thanks

You can specify the highlight parameters in the request URL as well as in solrconfig.xml.
The solrconfig.xml file shipped with the packaged Solr example includes highlighting settings, e.g.:
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    .....
    <!-- Highlighting defaults -->
    <str name="hl">on</str>
    <str name="hl.fl">text features name</str>
    <str name="f.name.hl.fragsize">0</str>
    <str name="f.name.hl.alternateField">name</str>
    ...
  </lst>
</requestHandler>
The highlight component can be configured with the fields to highlight, the snippet size and count, the snippet formatter and much more.
By default, matches are highlighted using <em></em> tags.
For colored highlighting you would need to use the colored fragmentsBuilder and the fast vector highlighter.
<str name="hl">on</str>
<str name="hl.fl">text features name</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.fragmentsBuilder">colored</str>
Also, FastVectorHighlighter requires the field to have termVectors, termPositions and termOffsets enabled:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
Detailed list of parameters: http://wiki.apache.org/solr/HighlightingParameters
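The name "colored" only resolves if a fragmentsBuilder with that name is defined in solrconfig.xml. A shortened sketch, modeled on the example configs shipped with Solr (the two colors are an arbitrary selection), might look like this:

```xml
<!-- Goes inside the <highlighting> element of the solr.HighlightComponent
     search component in solrconfig.xml; the color choice is illustrative -->
<fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
  <lst name="defaults">
    <!-- Comma-separated opening tags, used in turn for different query terms -->
    <str name="hl.tag.pre"><![CDATA[
      <b style="background:yellow">,<b style="background:aquamarine">]]></str>
    <str name="hl.tag.post"><![CDATA[</b>]]></str>
  </lst>
</fragmentsBuilder>
```

If a configuration lacks this definition, requesting hl.fragmentsBuilder=colored has nothing to refer to, which may be why the setting works with some config sets and not with others.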

I have tried
<str name="hl">on</str>
<str name="hl.fl">text features name</str>
<str name="hl.useFastVectorHighlighter">true</str>
<str name="hl.fragmentsBuilder">colored</str>
and it doesn't work.
If I use sample_techproducts_configs, highlighting works.

Solr case insensitive

Hello,
I'm implementing an autocompletion feature in Solr and have one problem.
For autocompletion I am using
<fieldType name="text_auto" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
I thought that the LowerCaseFilter would make the token case-insensitive, but that is wrong. In fact it just lowercases the token, which means that a query like "comput" leads to "computer" while "Comput" doesn't.
What I actually want is for both comput and Comput to lead to Computer.
I already tried this:
<fieldType name="text_auto_low" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
<fieldType name="text_auto_up" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
For some reason it doesn't work either. My question is: why, and how can I fix this?

Lucene has the Analyzer class, which you can use (or implement) in three ways:
SimpleAnalyzer: converts all of the input to lower case.
StopAnalyzer: removes stop words that add noise to your search.
StandardAnalyzer: does both of the above and can thus 'clean up' your query.
Now, coming to your question, I would recommend a technique called n-grams, which splits your text into fragments and then searches for those fragments instead. That way you can still get good results even if there are typos.
To learn how to do this, I suggest you read this to get started. It also has other great info regarding queries.
Not only will this solve your problem, it will also enhance your app.
Have fun :D
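For autocompletion specifically, one way to apply the n-gram idea is an edge n-gram field type. The following is a minimal sketch; the type name and the gram sizes are assumptions chosen for illustration.

```xml
<!-- Hypothetical field type: the name and gram sizes are illustrative assumptions -->
<fieldType name="text_auto_ngram" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every prefix of the lowercased value: "c", "co", ... "computer" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase the typed text so "Comput" and "comput" hit the same prefix -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this setup the typed text is sent as a plain term query without a trailing wildcard, so the query analyzer's lowercasing is applied and "Comput" matches the indexed prefix "comput".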

Log4Net filters "OR"

Is it possible to make a filter, for example a PropertyFilter that is neutral (and passed to next filter in the chain) if either one or another value matches? Something like:
<filter type="log4net.Filter.PropertyFilter">
  <Key value="myProperty" />
  <StringsToMatch Operator="OR">
    <Match>value1</Match>
    <Match>value2</Match>
  </StringsToMatch>
</filter>
I really don't want to write my own filter and would prefer to accomplish this with the normal Log4Net filters. Is this possible?

You could certainly develop such a filter yourself by subclassing FilterSkeleton.
But instead of making a specialized filter like this I suggest you rather implement a more generic filter that could be configured to contain a collection of filters and apply the Operator over those. The config could look something like this:
<filter type="CompositeFilter">
  <operator value="Or" />
  <filters>
    <filter type="log4net.Filter.PropertyFilter">
      <key value="myProperty" />
      <stringToMatch value="value1" />
    </filter>
    <filter type="log4net.Filter.PropertyFilter">
      <key value="myProperty" />
      <stringToMatch value="value2" />
    </filter>
  </filters>
</filter>
If you make such a filter I encourage you to submit it to the log4net project. It would certainly be useful for the general public :)
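If it is acceptable for a match to accept the event outright rather than stay neutral, the stock PropertyFilter may already cover the OR case through its regexToMatch property. This is a sketch under that assumption, using the property name and values from the question:

```xml
<!-- Accepts the event when myProperty equals value1 or value2.
     A match returns Accept (not Neutral), so later filters are skipped. -->
<filter type="log4net.Filter.PropertyFilter">
  <key value="myProperty" />
  <regexToMatch value="^(value1|value2)$" />
</filter>
<!-- Everything that was not accepted above is dropped -->
<filter type="log4net.Filter.DenyAllFilter" />
```

When the property does not match, the PropertyFilter returns Neutral and the event falls through to the next filter in the chain, here the DenyAllFilter.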
