Rails Sunspot Solr search is not working when the term is shorter than 4 characters - sunspot-rails

When I search for "beer" I get results, but when I search for "bee" I get no results at all. I cannot search for any word shorter than 4 characters. Is there a way to make this possible?

Check your Solr config in conf/schema.xml and adjust the settings to your needs. After making changes, rebuild your index and try again.
It is probably this part of the schema, but you will have to experiment with the settings.
Try this config as an example:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="0"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
splitOnNumerics="0"
splitOnCaseChange="1"
preserveOriginal="1" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="42" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
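If all you need is for short terms and prefixes to match, a lighter-weight variant is to add an edge n-gram filter to the index-time analyzer only. This is a minimal sketch rather than a drop-in replacement; the fieldType name and the gram sizes are assumptions you will want to adapt to the schema Sunspot generated for you:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes "b", "be", "bee", "beer" for the token "beer", so short queries can match -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Whichever variant you use, restart Solr after editing conf/schema.xml and reindex everything (with sunspot_rails that is typically rake sunspot:reindex), since analysis changes only apply to documents indexed afterwards.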

Related

Solr Spell Check returning false positives correctlySpelled()

I am currently using Solr 5.x on a local server and a Drupal instance to generate all the indexes. After a lot of configuration, I got to a point where I am fairly happy with the Solr implementation.
However, one of the problems I have just noticed is that correctly spelled words are still counted as misspelled and are still offered suggestions:
"correctlySpelled":false
As you can see in the JSON output, both words, license and vehicle, are spelled correctly and yet are still classed as incorrect.
"spellcheck":{
"suggestions":[
"license",
{
"numFound":3,
"startOffset":0,
"endOffset":7,
"suggestion":[
"licensed",
"licensee",
"licenser"
]
},
"vehicle",
{
"numFound":3,
"startOffset":8,
"endOffset":15,
"suggestion":[
"chicle",
"pedicle",
"vehiculate"
]
}
],
"correctlySpelled":false,
"collations":[
"collation",
"licensed chicle",
"collation",
"licensed pedicle",
"collation",
"licensed vehiculate",
"collation",
"licenser chicle",
"collation",
"licenser pedicle"
]
}
Does anyone have any idea why it would produce false positives?
URL Encoded Query:
http://192.168.33.10:8983/solr/drupal/spell?q=license+vehicle&spellcheck=true&spellcheck.accuracy=0.7&spellcheck.collate=true&defType=edismax&json.nl=flat&omitHeader=true&qf=ts_title^1&fl=*,score&start=0&fq=index_id:"new_index"&fq=hash:"96z3wm"&rows=10&wt=json&stopwords=true&lowercaseOperators=true
Query:
q = license+vehicle
spellcheck = true
spellcheck.accuracy = 0.7
spellcheck.collate = true
defType = edismax
json.nl = flat
omitHeader = true
qf = ts_title^1
fl = *,score
start = 0
fq = index_id:"new_index"
fq = hash:"96z3wm"
rows = 10
wt = json
stopwords = true
lowercaseOperators = true
Relevant part of schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal. -->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="multiterm">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="1"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Relevant part of solrconfig.xml
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="df">spell</str> <!--The default field for spell checking. -->
<str name="spellcheck.dictionary">file</str> <!--default or file or jarowinkler as mentioned above. -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.count">3</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">false</str>
<str name="spellcheck.collateExtendedResults">false</str>
<str name="spellcheck.maxCollationTries">10</str>
<str name="spellcheck.maxCollations">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spell</str>
<str name="spellcheckIndexDir">spellchecker</str>
<str name="buildOnOptimize">true</str>
</lst>
<lst name="spellchecker">
<str name="classname">solr.FileBasedSpellChecker</str>
<str name="name">file</str>
<str name="sourceLocation">spellings.txt</str>
<str name="characterEncoding">UTF-8</str>
<str name="spellcheckIndexDir">spellcheckerFile</str>
</lst>
</searchComponent>
This is something I have also experienced with Solr; it happens in an unpredictable fashion. The approach I use to avoid it is a spell pre-check with the edismax parameter "mm" set to 100. Try setting mm=100 in your edismax query and see if that works. Then build a flow where you first strictly spell-check the words and only afterwards pass them on to the search query handler. When you specify mm=100, don't wrap your phrase in any kind of double quotes; just pass it as is. Let me know if that helps :)
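To make that concrete, here is a hedged sketch only (not taken from the answer above) of what such a pre-check handler could look like in solrconfig.xml; the handler name /spellprecheck is made up for illustration and qf is copied from the question's query:
<requestHandler name="/spellprecheck" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">ts_title</str>
    <!-- require every query clause to match, so numFound > 0 means all words exist in the index -->
    <str name="mm">100%</str>
    <str name="rows">0</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
With rows=0 you only inspect numFound: if it is greater than zero under mm=100%, treat the phrase as correctly spelled and go straight to the normal search handler; otherwise fall back to the spellcheck suggestions.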

DIH MySQL to Solr import problems

I have trouble indexing documents from MySQL into Solr.
My config:
data-config.xml
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://xxx?characterEncoding=utf8"
user="xxx"
password="xxx"/>
<document name="articledata">
<entity name="outer"
transformer="HTMLStripTransformer"
query="SELECT
id,kundenid,LOWER(title) as title,LOWER(content) as content,
DATE_FORMAT(cr,'%Y-%m-%dT%H:%i:%sZ') as cr,
lang
FROM articledata
WHERE
DATE(cr) BETWEEN DATE(DATE_SUB(now(),INTERVAL 3 DAY)) AND DATE(now())
AND content IS NOT NULL
ORDER BY DATE(cr) DESC">
<field column="id" name="id" />
<field column="kundenid" name="kundenid" />
<field column="title" name="title" />
<field column="content" name="content" stripHTML="true" />
<field column="cr" name="cr" />
<field column="lang" name="lang" />
</entity>
</document>
</dataConfig>
schema.xml
<?xml version="1.0" ?>
<schema name="articledata core zero" version="1.1">
<types>
<fieldtype name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
<fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="dt" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0" />
<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<field name="kundenid" type="int" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="true" />
<field name="content" type="textgen" indexed="true" stored="true" />
<field name="cr" type="dt" indexed="true" stored="true" />
<field name="lang" type="string" indexed="true" stored="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
</schema>
With this configuration I get documents like this:
"docs": [
{
"content": "[B#7f017c71",
"id": 20785923,
"cr": "2014-07-24T08:01:58Z",
"title": "general motors entdeckt neue mängel bei hunderttausenden wagen - news - alle aktuellen news - dpa-afx - general motors dl-,01 - onvista",
"kundenid": 1,
"_version_": 1474502436614832000
},
The title gets indexed properly.
The content shows up as garbage characters and is not searchable.
Any ideas how I can fix that?
Thanks in advance.
I suspect that the content column in your DB is TEXT/BLOB rather than VARCHAR (title is presumably VARCHAR). That would explain why title is indexed correctly while content is not.
If you are storing BLOB or large text data in the DB, it helps to use a field type with the right set of tokenizers, analyzers, and filters.
For example, using a StandardTokenizerFactory keeps the tokens to a meaningful set of values.
An example of such a fieldType definition:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldtype>
If the problem still persists, the following steps will help you investigate the issue:
1) Check what values you get from MySQL when you run the query: SELECT id, kundenid, LOWER(title) AS title, LOWER(content) AS content, DATE_FORMAT(cr,'%Y-%m-%dT%H:%i:%sZ') AS cr, lang FROM articledata WHERE DATE(cr) BETWEEN DATE(DATE_SUB(NOW(), INTERVAL 3 DAY)) AND DATE(NOW()) AND content IS NOT NULL ORDER BY DATE(cr) DESC
2) Try changing the field type of content from textgen to string.
3) Try removing stripHTML="true" from the content field.
Hope this helps you resolve the issue, or at least investigate it further.
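As an additional, hedged suggestion not spelled out above: the "[B#..." value in the indexed document is how a raw Java byte array renders as a string, which also points at the content column arriving as a BLOB rather than text. Two common fixes are letting DIH convert column types, or casting the column to character data in the SQL itself; the snippet below is a sketch based on the question's data-config.xml, where only the convertType attribute and the CONVERT(... USING utf8) call are new:
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://xxx?characterEncoding=utf8"
            user="xxx"
            password="xxx"
            convertType="true"/>  <!-- ask DIH to convert BLOB/TEXT columns to strings -->
<!-- ...or cast in the SQL instead, keeping the rest of the entity as in the question: -->
<entity name="outer"
        transformer="HTMLStripTransformer"
        query="SELECT id, kundenid, LOWER(title) AS title,
                      CONVERT(LOWER(content) USING utf8) AS content,
                      DATE_FORMAT(cr,'%Y-%m-%dT%H:%i:%sZ') AS cr, lang
               FROM articledata
               WHERE DATE(cr) BETWEEN DATE(DATE_SUB(NOW(), INTERVAL 3 DAY)) AND DATE(NOW())
                 AND content IS NOT NULL
               ORDER BY DATE(cr) DESC">
  ...
</entity>
Either variant should make content reach Solr as text instead of a byte-array reference.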

SVG element with background image and background color and drop shadow

I'm using this answer to apply a drop shadow to an SVG element such as a circle. I have a fill attribute on my element to give it a background color, and now I'd like to combine all of that with a background image on the circle.
I've tried using a <pattern>, but then I can only have the background image; adding <feImage> to my drop-shadow filter makes the filter stop working.
Basically, what should I add to this code, knowing my image can be found in /public/images/...:
<filter id="dropshadow" width="130%" height="130%">
<feGaussianBlur in="SourceAlpha" stdDeviation="3"/>
<feOffset dx="0" dy="4" result="offsetblur"/>
<feComponentTransfer>
<feFuncA type="linear" slope="0.1"/>
</feComponentTransfer>
<feMerge>
<feMergeNode/>
<feMergeNode in="SourceGraphic"/>
</feMerge>
</filter>
<circle cx="50%" cy="50%" r="49%" filter="url(#dropshadow)" fill="#f8f8f8" stroke="#e7e7e7" stroke-width="1px"/>
Well, I don't know how hard you tried with feImage, but this code works perfectly. You want to pull in the feImage, then clip it to the source with an feComposite, and then composite the drop shadow under that result.
<svg>
  <defs>
    <filter id="dropshadow" width="130%" height="130%">
      <feGaussianBlur in="SourceAlpha" stdDeviation="3"/>
      <feOffset dx="0" dy="4" result="offsetblur"/>
      <feComponentTransfer in="offsetblur" result="dropshadow">
        <feFuncA type="linear" slope="0.1"/>
      </feComponentTransfer>
      <feImage result="bgpic" xlink:href="http://www.sencha.com/img/sencha-large.png"/>
      <feComposite operator="atop" in="bgpic" in2="SourceGraphic" result="coikelwimg"/>
      <feMerge>
        <feMergeNode in="coikelwimg"/>
        <feMergeNode in="dropshadow"/>
      </feMerge>
    </filter>
  </defs>
  <circle cx="50%" cy="50%" r="49%" filter="url(#dropshadow)" fill="#f8f8f8" stroke="#e7e7e7" stroke-width="1px"/>
</svg>

Indexing and querying BLOBs stored in MySQL

Greetings, friends.
Straight to the point: I have stored many BLOBs in a MySQL DB. These are mainly PDFs (80%) and .doc files. I also have plain text in the DB. So far I have indexed the text and can query it, but I cannot index the BLOBs. I am trying to build a single collection (document), but it isn't working. Is there any recipe for doing such a thing?
A portion of data-config.xml:
<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
<dataSource type="JdbcDataSource"
autoCommit="true" batchSize="-1"
convertType="false"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://127.0.0.1:3306/ktimatologio"
user="root"
password="********"
name="db"/>
<dataSource name="fieldReader" type="FieldStreamDataSource" />
<document>
<entity name="aitiologikes_ektheseis"
dataSource="db"
transformer="HTMLStripTransformer"
query="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT( body,' ',title) AS content from aitiologikes_ektheseis where type = 'text'"
deltaImportQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT( body,' ',title) AS content from aitiologikes_ektheseis where type = 'text' and id='${dataimporter.delta.id}'"
deltaQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, CONCAT( body,' ',title) AS content from aitiologikes_ektheseis where type = 'text' and last_modified > '${dataimporter.last_index_time}'">
<field column="id" name="ida" />
<field column="solr_id" name="solr_id" />
<field column="title" name="title" stripHTML="true" />
<field column="grid_title" name="grid_title" stripHTML="true" />
<field column="model" name="model" stripHTML="true" />
<field column="type" name="type" stripHTML="true" />
<field column="url" name="url" stripHTML="true" />
<field column="last_modified" name="last_modified" stripHTML="true" />
<field column="search_tag" name="search_tag" stripHTML="true" />
<field column="content" name="content" stripHTML="true" />
</entity>
<entity name="aitiologikes_ektheseis_bin"
query="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin'"
deltaImportQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and id='${dataimporter.delta.id}'"
deltaQuery="select id, title, title AS grid_title, model, type, url, last_modified, CONCAT_WS('_',id,model) AS solr_id, search_tag, bin_con AS text from aitiologikes_ektheseis where type = 'bin' and last_modified > '${dataimporter.last_index_time}'"
transformer="TemplateTransformer"
dataSource="db">
<field column="id" name="ida" />
<field column="solr_id" name="solr_id" />
<field column="title" name="title" stripHTML="true" />
<field column="grid_title" name="grid_title" stripHTML="true" />
<field column="model" name="model" stripHTML="true" />
<field column="type" name="type" stripHTML="true" />
<field column="url" name="url" stripHTML="true" />
<field column="last_modified" name="last_modified" stripHTML="true" />
<field column="search_tag" name="search_tag" stripHTML="true" />
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="aitiologikes_ektheseis_bin.text" format="text">
<field column="text" name="contentbin" stripHTML="true" />
</entity>
</entity>
...
...
</document>
</dataConfig>
A portion of schema.xml (the fieldType and field definitions):
<fieldType name="text_ktimatologio" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.HunspellStemFilterFactory" dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff" ignoreCase="true" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_el.txt" enablePositionIncrements="true"/>
<filter class="solr.GreekLowerCaseFilterFactory"/>
<filter class="solr.GreekStemFilterFactory"/>
<filter class="solr.HunspellStemFilterFactory" dictionary="dictionaries/el_GR.dic" affix="dictionaries/el_GR.aff" ignoreCase="true" />
</analyzer>
</fieldType>
<fields>
<field name="ida" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="solr_id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="title" type="text_ktimatologio" indexed="true" stored="true"/>
<field name="grid_title" type="text_ktimatologio" indexed="true" stored="true"/>
<field name="model" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="last_modified" type="string" indexed="true" stored="true"/>
<field name="search_tag" type="string" indexed="true" stored="true"/>
<field name="contentbin" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="content" type="text_ktimatologio" indexed="true" stored="true" multiValued="true"/>
</fields>
I really need help on this!
With respect,
Tom
Greece
Do you want to "index" a BLOB, meaning you eventually want to be able to search it? I am not sure I understand your question correctly.
I am guessing that you would first want to convert your PDF or .doc content using something like Apache Tika within Solr, and then let Solr index that for you. Also, if you want your users to access the PDF or doc itself, the best place to keep it is the DB, and you can retrieve it from there.
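For reference, the pattern the asker's data-config.xml already contains is the usual recipe for this: stream the BLOB column through a FieldStreamDataSource into a TikaEntityProcessor, which extracts plain text that Solr can index. A trimmed-down, hedged sketch using the names from the question:
<dataSource name="fieldReader" type="FieldStreamDataSource"/>
<entity name="aitiologikes_ektheseis_bin"
        dataSource="db"
        query="select id, bin_con AS text from aitiologikes_ektheseis where type = 'bin'">
  <field column="id" name="ida"/>
  <!-- the nested entity runs the streamed BLOB through Tika and maps the
       extracted plain text onto the contentbin field -->
  <entity dataSource="fieldReader"
          processor="TikaEntityProcessor"
          dataField="aitiologikes_ektheseis_bin.text"
          format="text">
    <field column="text" name="contentbin"/>
  </entity>
</entity>
If contentbin still comes out empty, check the DIH status response for Tika errors and make sure the solr-dataimporthandler-extras and Tika jars are loaded (via <lib> directives in solrconfig.xml).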

Keep relational database structure in solr index?

I was able to import data through Solr DIH.
In my database I have 4 tables:
threads: id, user_id, country_id
tags: id
thread_tag_map: thread_id, tag_id
countries: id
posts: id, thread_id
I want each document in solr to consist of:
thread_id
tag_id
country_id
post_id
For example:
thread_id: 1
tag_id: 23
tag_id: 34
country_id: 43
post_id: 4
post_id: 23
post_id: 23
How should I map it?
I haven't been able to configure data-config.xml for this. I have followed the DIH tutorial without success.
Here is my schema.xml:
<schema name="example" version="1.2">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="uuid" class="solr.UUIDField" indexed="true" />
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>
<fields>
<field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>
<field name="threads.title" type="text_rev" indexed="true" stored="true"/>
<field name="posts.body" type="text_rev" indexed="true" stored="true"/>
<dynamicField name="*id" type="int" indexed="false" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>posts.body</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
</schema>
It seems like you just want to define these fields:
thread_id
tag_id
country_id
post_id
as indexed 'string' fields in schema.xml. post_id should be multiValued="true". See the default schema.xml files for formatting guidelines, or see:
http://wiki.apache.org/solr/SchemaXml
The only tricky thing here is actually querying the database, not configuring Solr. Just write a JOIN query that returns all of the IDs you need and use a Solr client library for your language to build a simple data structure, e.g. (JSON-ish):
[{"thread_id":"1",
"tag_id":"14",
"country_id":"2",
"post_id":["5",
"7",
"18"
]
},...and more...]
Since Solr isn't an RDBMS, you'll have to fake relational searches by either doing multiple queries or using subqueries. Another option is to use Solr to retrieve your thread or post with a full-text search, and then use an ID from there to run a MySQL query that gets you everything else you need.
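If you would rather have DIH do the joins itself instead of building the documents in client code, the usual approach is nested entities, where each child query fills a multiValued field for its parent row. A hedged sketch of data-config.xml, using the table and column names from the question (the JDBC URL and credentials are placeholders):
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/yourdb"
              user="xxx"
              password="xxx"/>
  <document>
    <entity name="thread" query="SELECT id AS thread_id, country_id FROM threads">
      <field column="thread_id" name="thread_id"/>
      <field column="country_id" name="country_id"/>
      <!-- one row per tag; Solr collects them into a multiValued tag_id field -->
      <entity name="tag" query="SELECT tag_id FROM thread_tag_map WHERE thread_id='${thread.thread_id}'">
        <field column="tag_id" name="tag_id"/>
      </entity>
      <!-- one row per post; same idea for post_id -->
      <entity name="post" query="SELECT id AS post_id FROM posts WHERE thread_id='${thread.thread_id}'">
        <field column="post_id" name="post_id"/>
      </entity>
    </entity>
  </document>
</dataConfig>
For this to work, tag_id and post_id need to be declared multiValued="true" in schema.xml, exactly as suggested above for post_id.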