Solr - DIH define & import many-to-many field - MySQL

I have two MySQL tables, book and author, with a many-to-many relationship implemented via book_author_mapper, whose rows contain the columns book_id / author_id.
In Solr, I have a query to get the book list; for each book I need to get an array of author_ids for that book.
Currently, I am thinking about using a multi-valued field to store the author ids.
My questions are:
How do I define the field, and how do I write the SQL in DIH? It seems to need multiple SQL queries, right? Thanks.
If I want to get not just the author_id list, but also the author_name for each author_id, is that possible?

After reading the docs and googling, I have more or less solved the problem.
Tables
book
author
book_author_map (this is the join table for the many-to-many relationship)
DIH config file
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test?characterEncoding=utf8&amp;zeroDateTimeBehavior=convertToNull"
              user="root" password="123456" />
  <document>
    <entity name="book" pk="id"
            query="SELECT * FROM book where status = 0 limit 200000;"
            deltaImportQuery="SELECT * FROM book where status = 0 and id='${dih.delta.id}' limit 200000;"
            deltaQuery="select id from book where status = 0 and CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') > '${dih.last_index_time}'">
      <entity name="author"
              query="SELECT au.cn_name as author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id} limit 10;">
        <field name="authors" column="author_cn_name" />
      </entity>
    </entity>
  </document>
</dataConfig>
Field definition
<field name="cn_name" type="textComplex" indexed="true" stored="true" />
<field name="en_name" type="textComplex" indexed="true" stored="true" />
<field name="status" type="int" indexed="true" stored="true" />
<field name="authors" type="textComplex" indexed="true" stored="true" multiValued="true" />
TODOs
parentDeltaQuery: it returns the pk of the parent entity, but when is it called, and what does it do? Is it necessary? (See the sketch below.)
Are deltaQuery and parentDeltaQuery necessary on a sub-entity?
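For what it's worth, a hedged sketch of how the DIH wiki combines these on a sub-entity: deltaQuery on the child detects author rows changed since the last import, and parentDeltaQuery maps each changed author back to the pk of the parent book documents that must be rebuilt. Neither is needed for full imports; they only matter if delta imports should pick up changes to authors, not just to books. The update_date column on author is an assumption here:

<entity name="author"
        query="SELECT au.cn_name AS author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id}"
        deltaQuery="SELECT id FROM author WHERE update_date > '${dih.last_index_time}'"
        parentDeltaQuery="SELECT bam.book_id AS id FROM book_author_map AS bam WHERE bam.author_id = ${author.id}">
  <field name="authors" column="author_cn_name" />
</entity>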

Related

How to index CSV files using DIH in Solr

I am new to Solr, so I might have written the DIH wrongly. I already have data in my Solr db, and I need to extend my DIH file so that it can index a csv file which contains more information; the csv and the Solr db data are mapped by a common id. What I have done is shown in the code below.
This is my csv file-->
Node_IP probe_name Global_ID
10.53.3.87 ILRNAPSUD03 4b44aaff1e09f3d793fe9
10.224.47.26 ILRNAPSUD03 47eebea2c2d485b59
Here is a dih snippet-->
<entity name="tmin"
processor="XPathEntityProcessor"
dataSource="FileDataSource"
stream="true"
url="${pickupdir.fileAbsolutePath}"
onError="skip"
forEach="/execution/"
transformer="script:makePair,script:makeLogPair, TemplateTransformer,
RegexTransformer, HTMLStripTransformer"
>
<field column="jobid_t" xpath="/execution/#jobId" />
<field column="destinationid_t" xpath="/execution/#destinationid" />
<field column="id" template="${tmin.destinationid_t}" />
<field column="log_param" xpath="/execution/log/#severity" />
<field column="log" xpath="/execution/log" />
<entity name="importcsv"
processor="LineEntityProcessor"
url="C:\Users\arpiagar\Desktop\IP Probe name_ILRNAPSUD01.csv"
rootEntity="false"
dataSource="FileDataSource"
header="true"
separator=","
transformer="TemplateTransformer, RegexTransformer,script:mapcsv"
>
<field column="rawLine" groupNames="Node_IP,probe_name,Global_ID"/>
<field column="id" name="Global_ID" />
<field column="probe_name" name="probe_name" />
</entity>
</entity>
I need to map the id in the tmin entity to the id we get after indexing the csv data, and to index probe_name and Node_IP under that particular id.
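One possible direction, strictly as a sketch (untested here): recent DIH versions let a sub-entity be cached and joined on a key via cacheImpl/cacheKey/cacheLookup, so the CSV would be read once and then looked up by the parent's id instead of being re-read for every parent row. The regex that splits rawLine is an assumption about the CSV layout (comma-separated, three columns):

<entity name="importcsv"
        processor="LineEntityProcessor"
        url="C:\Users\arpiagar\Desktop\IP Probe name_ILRNAPSUD01.csv"
        rootEntity="false"
        dataSource="FileDataSource"
        transformer="RegexTransformer"
        cacheImpl="SortedMapBackedCache"
        cacheKey="Global_ID"
        cacheLookup="tmin.id">
  <!-- split each raw CSV line into its three columns; the regex is an assumption -->
  <field column="rawLine" regex="^(.*?),(.*?),(.*)$" groupNames="Node_IP,probe_name,Global_ID"/>
  <field column="probe_name" name="probe_name"/>
  <field column="Node_IP" name="node_ip"/>
</entity>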

Solr: some mdb files are not processed as documents in dataimport

I have to import some mdb files to Solr. Some of the mdb files are indexed well as documents, but others are not.
I use Solr 4.10.0 and UCanAccess ver. 2.0.9. The following is a screen shot from the log:
For some missing field values (in the screen shot's case, 6 fields) I have set onError="continue" in the dataimport config:
<document>
  <entity name="Book" dataSource="a" query="select bkid AS id, bkid AS BookID, bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok WHERE bkid = 29435">
    <field column="id" name="id"/>
    <field column="BookID" name="BookID"/>
    <field column="BookTitle" name="BookTitle"/>
    <field column="cat" name="cat"/>
    <entity name="Category" dataSource="a" query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
      <field column="CatName" name="CatName"/>
      <field column="CatWeight" name="CatWeight"/>
      <field column="CatLevel" name="CatLevel"/>
    </entity>
    <entity name="Pages" dataSource="a1" onError="continue" query="SELECT nass AS PageContent, page AS PageNum FROM book ORDER BY page">
      <field column="PageContent" name="PageContent"/>
      <field column="PageNum" name="PageNum"/>
      <entity name="Titles" dataSource="a1" onError="continue" query="SELECT * FROM title WHERE id = ${Pages.PAGENUM} ORDER BY sub">
        <field column="ID" name="TitleID"/>
        <field column="TIT" name="PageTitle"/>
        <field column="SUB" name="TitleWeight"/>
        <field column="LVL" name="TitleLevel"/>
      </entity>
    </entity>
  </entity>
</document>
This is a screen shot of the relevant table in the database, with the 6 undefined data fields:
At the end of the data import for this mdb file I got the following response:
Last Update: 09:12:04 Requests: 31,952, Fetched: 78,980, Skipped: 0,
Processed: 0 Started: 18 minutes ago
which shows that 0 documents were processed!
Other mdb files are processed, i.e. "1 processed" is shown in the response, but I got the following errors in the log:
10/7/2014 9:28:08 AM ERROR SolrWriter Exception while solr commit.
this writer hit an OutOfMemoryError; cannot commit...
and
SolrIndexWriter Error closing IndexWriter this writer hit an
OutOfMemoryError; cannot flush...
How can I solve this issue? And why does Solr request and fetch all these records and then process and index none of them?!
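No complete answer here, but the two OutOfMemoryError messages at least point at the JVM heap: with the Jetty example setup that Solr 4.x ships with, the first thing to try is starting Solr with a larger heap (the 2g figure below is just an example):

java -Xmx2g -jar start.jar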

Solr 4.6.0 DataImportHandler speed up performance

I am using Solr 4.6.0, indexing about 10,000 elements at a time, and I suffer from bad import performance: importing those 10,000 documents takes about 10 minutes. Of course I know that this depends heavily on the server hardware, but I would still like to know how any performance boosts could be achieved, and which of them are actually useful in real-world situations (joins etc.). I am also very thankful for precise examples, not just links to the official documentation.
Here is the data-config.xml:
<dataConfig>
  <dataSource name="mysql" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://xxxx"
              batchSize="-1"
              user="xxxx" password="xxxx" />
  <document name="publications">
    <entity name="publication" transformer="RegexTransformer" pk="id" query="
        SELECT
          sm_publications.id AS p_id,
          CONCAT(sm_publications.title, ' ', sm_publications.abstract) AS p_text,
          sm_publications.year AS p_year,
          sm_publications.doi AS p_doi,
          sm_conferences.full_name AS c_fullname,
          sm_journals.full_name AS j_fullname,
          GROUP_CONCAT(DISTINCT sm_query_publications.query_id SEPARATOR '_-_-_-_-_') AS q_id
        FROM sm_publications
        LEFT JOIN sm_conferences ON sm_conferences.id = sm_publications.conference_id
        LEFT JOIN sm_journals ON sm_journals.id = sm_publications.journal_id
        INNER JOIN sm_query_publications ON sm_query_publications.publication_id = sm_publications.id
        WHERE '${dataimporter.request.clean}' != 'false' OR
              sm_publications.modified > '${dataimporter.last_index_time}'
        GROUP BY sm_publications.id">
      <field column="p_id" name="id" />
      <field column="p_text" name="text" />
      <field column="p_text" name="text_tv" />
      <field column="p_year" name="year" />
      <field column="p_doi" name="doi" />
      <field column="c_fullname" name="conference" />
      <field column="j_fullname" name="journal" />
      <field column="q_id" name="queries" splitBy="_-_-_-_-_" />
      <entity name="publication_authors" query="
          SELECT
            CONCAT(
              IF(sm_authors.first_name != '', sm_authors.first_name, ''),
              IF(sm_authors.middle_name != '', CONCAT(' ', sm_authors.middle_name), ''),
              IF(sm_authors.last_name != '', CONCAT(' ', sm_authors.last_name), '')
            ) AS a_name,
            sm_affiliations.display_name AS aa_display_name,
            CONCAT(sm_affiliations.latitude, ',', sm_affiliations.longitude) AS aa_geo,
            sm_affiliations.country_name AS aa_country_name
          FROM sm_publication_authors
          INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
          LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id
          WHERE sm_publication_authors.publication_id = '${publication.p_id}'">
        <field column="a_name" name="authors" />
        <field column="aa_display_name" name="affiliations" />
        <field column="aa_geo" name="geo" />
        <field column="aa_country_name" name="countries" />
      </entity>
      <entity name="publication_keywords" query="
          SELECT sm_keywords.name FROM sm_publication_keywords
          INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
          WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
        <field column="name" name="keywords" />
      </entity>
    </entity>
  </document>
</dataConfig>
By query caching, I meant the CachedSqlEntityProcessor. I favor the merged solution, as in your other question MySQL GROUP_CONCAT duplicate entries. But CachedSqlEntityProcessor will help too if p_id is repeated over and over in the result set of the sub-query publication_authors, and you are less concerned about the extra memory usage.
Update: It looks like you have the two other questions solved, so you can probably go either way. I post the short example/pointer you requested anyway, in case others find it handy:
<entity name="x" query="select * from x">
<entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" where="xid=x.id">
</entity>
<entity>
This example was taken from wiki. This will still run each query "select * from y where xid=id" per id from the main query "select * from x". But it won't send in the same query repeatedly.
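Applied to your publication_authors entity, that would look roughly like the sketch below: the WHERE clause moves out of the SQL into the where attribute, and publication_id must be selected so the cache can key on it (the name CONCAT is abbreviated here for readability):

<entity name="publication_authors" processor="CachedSqlEntityProcessor"
        where="publication_id=publication.p_id"
        query="SELECT sm_publication_authors.publication_id AS publication_id,
                      CONCAT(sm_authors.first_name, ' ', sm_authors.last_name) AS a_name,
                      sm_affiliations.display_name AS aa_display_name
               FROM sm_publication_authors
               INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
               LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id">
  <field column="a_name" name="authors" />
  <field column="aa_display_name" name="affiliations" />
</entity>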

Is it possible to pass a parameter to MySQL in Apache Solr?

I need to pass a userID parameter to Apache Solr.
Example:
http://localhost.com:8983/solr/collection1/select?q=abc&wt=json&indent=true&userID=THIS-PARAMETR-NEED-PASS
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/mydatabase" user="root" password="root"/>
  <document name="t">
    <entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='THIS-PARAMETR-NEED-PASS'">
      <field column="Ac" name="acid"/>
      <field column="UserID" name="userid"/>
      <field column="Comment" name="comment"/>
      <entity name="m" query="SELECT * FROM `table2` WHERE `tid` = '${act.tid}'">
        <field column="Title" name="title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
The example you give is a bit mixed up, as the URL you show hints at a search request, but the configuration shows that you want to access a request parameter within a DataImportHandler.
Your concrete parameter can be accessed as ${dataimporter.request.userID}. Referring to the wiki, you would need to alter your dataconfig like this:
<entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='${dataimporter.request.userID}'">

Multivalued field in Solr returns one item only

I am a newbie in Solr.
I have a table like this:
id   infield   body
---------------------------------------------
1    ValX      Article1-Body
1    ValY      Article1-Body
1    ValZ      Article1-Body
2    ValW      Article2-Body
....
and my MySQL query looks like:
select A.id,B.infield, A.body from A inner join B on A.id=B.id;
and in schema.xml I have this:
<field indexed="true" multiValued="true" name="infield" stored="true" type="string"/>
Now when my query is *:*, I expect to get all the infield values, like below:
<str name="id">1</str>
<str name="body">Article1-Body</str>
<arr name="infield">
<str>ValX</str>
<str>ValY</str>
<str>ValZ</str>
</arr>
but I am getting this:
<str name="id">1</str>
<str name="body">Article1-Body</str>
<arr name="infield">
<str>ValX</str>
</arr>
EDIT
my dataconfig.xml contains:
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="true" driver="com.mysql.jdbc.Driver" password="pass" url="jdbc:mysql://127.0.0.1/test" user="root"/>
  <document name="items">
    <entity name="root" pk="id" preImportDeleteQuery="data_source:10" query="select A.id, B.infield, A.body from A inner join B on A.id=B.id;" transformer="TemplateTransformer">
      <field column="data_source" template="10"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="Test"/>
    </entity>
  </document>
</dataConfig>
Any idea what might be wrong?
Thanks
The query probably produces multiple records with the same id, and hence the separate records override one another because they share that id.
So you end up with just one infield value.
For multivalued fields you should populate the field from a sub-entity which returns multiple rows, as in the sketch below.
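A sketch of that rewrite for the dataconfig above, using the table and column names from the question:

<entity name="root" pk="id" preImportDeleteQuery="data_source:10"
        query="select id, body from A" transformer="TemplateTransformer">
  <field column="data_source" template="10"/>
  <field column="data_source_type" template="Jdbc"/>
  <field column="data_source_name" template="Test"/>
  <entity name="infields" query="select infield from B where id = '${root.id}'">
    <field column="infield" name="infield"/>
  </entity>
</entity>

Each root document then gets one infield value per matching row in B, and Solr collects them into the multiValued field.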