I am using Solr 4.6.0, indexing about 10'000 elements at a time, and I am suffering from poor import performance: importing those 10'000 documents takes about 10 minutes. Of course I know that this depends heavily on the server hardware, but I would still like to know what performance boosts can be applied and which of them are actually useful in real-world situations (joins etc.). I would also be very thankful for precise examples rather than just links to the official documentation.
Here is the data-config.xml:
<dataConfig>
  <dataSource name="mysql" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://xxxx"
              batchSize="-1"
              user="xxxx" password="xxxx" />
  <document name="publications">
    <entity name="publication" transformer="RegexTransformer" pk="id" query="
        SELECT
          sm_publications.id AS p_id,
          CONCAT(sm_publications.title, ' ', sm_publications.abstract) AS p_text,
          sm_publications.year AS p_year,
          sm_publications.doi AS p_doi,
          sm_conferences.full_name AS c_fullname,
          sm_journals.full_name AS j_fullname,
          GROUP_CONCAT(DISTINCT sm_query_publications.query_id SEPARATOR '_-_-_-_-_') AS q_id
        FROM sm_publications
        LEFT JOIN sm_conferences ON sm_conferences.id = sm_publications.conference_id
        LEFT JOIN sm_journals ON sm_journals.id = sm_publications.journal_id
        INNER JOIN sm_query_publications ON sm_query_publications.publication_id = sm_publications.id
        WHERE '${dataimporter.request.clean}' != 'false'
          OR sm_publications.modified > '${dataimporter.last_index_time}'
        GROUP BY sm_publications.id">
      <field column="p_id" name="id" />
      <field column="p_text" name="text" />
      <field column="p_text" name="text_tv" />
      <field column="p_year" name="year" />
      <field column="p_doi" name="doi" />
      <field column="c_fullname" name="conference" />
      <field column="j_fullname" name="journal" />
      <field column="q_id" name="queries" splitBy="_-_-_-_-_" />
      <entity name="publication_authors" query="
          SELECT
            CONCAT(
              IF(sm_authors.first_name != '', sm_authors.first_name, ''),
              IF(sm_authors.middle_name != '', CONCAT(' ', sm_authors.middle_name), ''),
              IF(sm_authors.last_name != '', CONCAT(' ', sm_authors.last_name), '')
            ) AS a_name,
            sm_affiliations.display_name AS aa_display_name,
            CONCAT(sm_affiliations.latitude, ',', sm_affiliations.longitude) AS aa_geo,
            sm_affiliations.country_name AS aa_country_name
          FROM sm_publication_authors
          INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
          LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id
          WHERE sm_publication_authors.publication_id = '${publication.p_id}'">
        <field column="a_name" name="authors" />
        <field column="aa_display_name" name="affiliations" />
        <field column="aa_geo" name="geo" />
        <field column="aa_country_name" name="countries" />
      </entity>
      <entity name="publication_keywords" query="
          SELECT sm_keywords.name FROM sm_publication_keywords
          INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
          WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
        <field column="name" name="keywords" />
      </entity>
    </entity>
  </document>
</dataConfig>
By query caching, I meant the CachedSqlEntityProcessor. I favor the merged solution, as in your other question MySQL GROUP_CONCAT duplicate entries. But CachedSqlEntityProcessor will help too, if p_id is repeated over and over in the result set of the main query (so the publication_authors sub-query would otherwise be issued repeatedly) and you are less concerned about the extra memory usage.
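To sketch the merged idea with your table names (a sketch only; it follows the same pattern your main query already uses for q_id, CONCAT_WS/NULLIF is just one way to mimic your IF-based name building, and group_concat_max_len may need raising so long author lists are not truncated):

SELECT
  sm_publications.id AS p_id,
  -- fold the per-publication author names into one delimited string
  GROUP_CONCAT(DISTINCT CONCAT_WS(' ',
    NULLIF(sm_authors.first_name, ''),
    NULLIF(sm_authors.middle_name, ''),
    NULLIF(sm_authors.last_name, '')
  ) SEPARATOR '_-_-_-_-_') AS a_name
FROM sm_publications
LEFT JOIN sm_publication_authors ON sm_publication_authors.publication_id = sm_publications.id
LEFT JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
GROUP BY sm_publications.id

You would then split the delimited value in the entity, which already declares the RegexTransformer:

<field column="a_name" name="authors" splitBy="_-_-_-_-_" />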
Update: It looks like your two other questions have been solved, so you can probably go either way. I am posting the short example/pointer you requested anyway, in case others find it handy:
<entity name="x" query="select * from x">
<entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" where="xid=x.id">
</entity>
<entity>
This example was taken from the wiki. With the where attribute, select * from y is executed only once and its rows are cached, keyed on xid; each id coming from the main query select * from x is then looked up in that cache instead of a separate query being sent for every row.
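Applied to your config, the keywords sub-entity could use the cache the same way. A sketch under the same table names; note the query now selects publication_id itself and drops the per-row WHERE clause:

<entity name="publication_keywords" processor="CachedSqlEntityProcessor"
        query="SELECT sm_publication_keywords.publication_id, sm_keywords.name
               FROM sm_publication_keywords
               INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id"
        where="publication_id=publication.p_id">
  <field column="name" name="keywords" />
</entity>

The whole keyword join is then fetched once and held in memory, keyed by publication_id.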
Related
I'm new to Solr and having an issue with MySQL data extraction in Solr 8.8; despite the declaration below, each document only gets its ID indexed instead of the whole set of fields.
<document>
  <entity name="foobars"
          query="SELECT *, 'test' AS ENTITY FROM foobar"
          deltaQuery="SELECT ID FROM foobar WHERE updated >= '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT *, 'MAT' AS ENTITY FROM foobar WHERE ID = ${dataimporter.delta.id}">
    <field column="ENTITY" name="entity" />
    <field column="ID" name="id" />
    <field column="FOO" name="foo" />
    <field column="BAR" name="bar" />
    <field column="BAZ" name="baz" />
    <field column="UPDATED" name="updated" />
  </entity>
</document>
This is a sample of what was imported:
{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "q":"*:*",
      "_":"1623166185835"}},
  "response":{"numFound":147,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"214768.0",
        "_version_":1702016739810738176},
      {
        "id":"296594.0",
        "_version_":1702016739840098304},
      ...
Does anyone know what I'm missing here? Thanks for any help.
The query below works perfectly fine when run in phpMyAdmin. I want to index these tables completely using Solr and generate the aggregated result using a single query.
"select biblio.biblionumber as 'id', biblio.*, biblioitems.*, items.*, branches.* from biblio
inner join biblioitems ON (biblioitems.biblionumber=biblio.biblionumber)
inner join items ON (items.biblionumber=biblio.biblionumber)
inner join branches ON (branches.uid=items.uid);
I gave it a try in Solr but could not get the desired result using this:
<document>
  <entity name="id" query="select biblio.biblionumber as 'id', biblio.* from biblio ;">
    <field column="BIBLIONUMBER" name="biblionumber" />
    <field column="AUTHOR" name="author" />
    <field column="TITLE" name="title" />
    <field column="SERIESTITLE" name="seriestitle" />
    <field column="COPYRIGHTDATE" name="copyrightdate" />
    <field column="ABSTRACT" name="abstract" />
    <entity name="id2" query="select biblioitems.biblioitemnumber as 'id2', biblioitems.* from biblioitems where biblionumber='${biblio.id}'">
      <field name="BIBLIOITEMNUMBER" column="biblioitemnumber" />
      <field name="ISBN" column="isbn" />
      <field name="ISSN" column="issn" />
      <field name="PUBLISHERCODE" column="publishercode" />
      <field name="EDITIONSTATEMENT" column="editionstatement" />
      <field name="PAGES" column="pages" />
      <field name="PLACE" column="place" />
      <field name="URL" column="url" />
    </entity>
    <entity name="id3" query="select items.uid as 'id3', items.* from items where biblionumber='${biblio.id}'">
      <field name="ITEMNUMBER" column="itemnumber" />
      <field name="PRICE" column="price" />
      <field name="BARCODE" column="barcode" />
      <field name="ENUMCHRON" column="enumchron" />
      <field name="UID" column="uid" />
      <field name="HOMEBRANCH" column="homebranch" />
      <entity name="id4" query="select branches.uid AS 'id4', branches.* from branches where uid = '${items.id3}'">
        <field name="UID" column="uid" />
        <field name="BRANCHNAME" column="branchname" />
      </entity>
    </entity>
  </entity>
</document>
The result is displayed only up to abstract; the moment the join operation comes into play, nothing further is indexed. I'm struggling with the query.
I request you all to help me with this query.
Thanks in advance!
I have two MySQL tables, book and author, with a many-to-many relationship between them via book_author_mapper, whose rows contain the columns book_id / author_id.
In Solr, I have a query to get a book list; for each book I need to get an array of the author_id values for that book.
Currently, I am thinking about using a multi-valued field to store the author ids.
My questions are:
How do I define the field, and how do I write the SQL in DIH? It seems to need multiple SQL queries, right? Thx.
If I want to get not just the author_id list, but also the author_name for each author_id, is that possible?
After viewing the docs and googling, I have kind of solved the problem.
Tables
book
author
book_author_map (this is the middle table for many-to-many relationship)
DIH config file
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test?characterEncoding=utf8&amp;zeroDateTimeBehavior=convertToNull"
              user="root" password="123456" />
  <document>
    <entity name="book" pk="id"
            query="SELECT * FROM book where status = 0 limit 200000;"
            deltaImportQuery="SELECT * FROM book where status = 0 and id='${dih.delta.id}' limit 200000;"
            deltaQuery="select id from book where status = 0 and CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') > '${dih.last_index_time}'">
      <entity name="author"
              query="SELECT au.cn_name as author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id} limit 10;">
        <field name="authors" column="author_cn_name" />
      </entity>
    </entity>
  </document>
</dataConfig>
Field definitions
<field name="cn_name" type="textComplex" indexed="true" stored="true" />
<field name="en_name" type="textComplex" indexed="true" stored="true" />
<field name="status" type="int" indexed="true" stored="true" />
<field name="authors" type="textComplex" indexed="true" stored="true" multiValued="true" />
TODOs
parentDeltaQuery: it gets the pk of the parent entity, but when is it called, and what does it do? Is it necessary?
Are deltaQuery and parentDeltaQuery necessary in a sub-entity? (See the sketch below for my current understanding.)
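My current understanding, following the delta-import example in the DIH wiki (a sketch only, assuming an update_date column on author): deltaQuery on the sub-entity finds the changed author rows, and parentDeltaQuery maps each changed author back to the pk of the parent book entity so the affected book documents get re-indexed; without them, a delta import only reacts to changes in the root book table.

<entity name="author"
        query="SELECT au.cn_name as author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id} limit 10;"
        deltaQuery="SELECT id FROM author WHERE CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') > '${dih.last_index_time}'"
        parentDeltaQuery="SELECT bam.book_id AS id FROM book_author_map AS bam WHERE bam.author_id = ${author.id}">
  <field name="authors" column="author_cn_name" />
</entity>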
I have to import some mdb files into Solr. Some of the mdb files are indexed well as documents, but others are not.
I use Solr 4.10.0 and UCanAccess ver. 2.0.9. The following is a screenshot from the log:
For some missing field values (in the screenshot's case, 6 fields) I have set onError="continue" in the dataimport config:
<document>
  <entity name="Book" dataSource="a" query="select bkid AS id, bkid AS BookID, bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok WHERE bkid = 29435">
    <field column="id" name="id"/>
    <field column="BookID" name="BookID"/>
    <field column="BookTitle" name="BookTitle"/>
    <field column="cat" name="cat"/>
    <entity name="Category" dataSource="a" query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
      <field column="CatName" name="CatName"/>
      <field column="CatWeight" name="CatWeight"/>
      <field column="CatLevel" name="CatLevel"/>
    </entity>
    <entity name="Pages" dataSource="a1" onError="continue" query="SELECT nass AS PageContent, page AS PageNum FROM book ORDER BY page">
      <field column="PageContent" name="PageContent"/>
      <field column="PageNum" name="PageNum"/>
      <entity name="Titles" dataSource="a1" onError="continue" query="SELECT * FROM title WHERE id = ${Pages.PAGENUM} ORDER BY sub">
        <field column="ID" name="TitleID"/>
        <field column="TIT" name="PageTitle"/>
        <field column="SUB" name="TitleWeight"/>
        <field column="LVL" name="TitleLevel"/>
      </entity>
    </entity>
  </entity>
</document>
This is a screenshot of the relevant table in the database, with the 6 undefined data fields:
At the end of the data import for this mdb file, I got the following response:
Last Update: 09:12:04 Requests: 31,952, Fetched: 78,980, Skipped: 0,
Processed: 0 Started: 18 minutes ago
which shows that 0 documents were processed!
Other mdb files are processed, i.e. a "1 processed" count is shown in the response, but I got the following errors in the log:
10/7/2014 9:28:08 AM ERROR SolrWriter Exception while solr commit.
this writer hit an OutOfMemoryError; cannot commit...
and
SolrIndexWriter Error closing IndexWriter this writer hit an
OutOfMemoryError; cannot flush...
How can I solve this issue? And why does Solr request and fetch all these records and then process and index none of them?!
I need to pass a userID parameter in Apache Solr.
Example:
http://localhost.com:8983/solr/collection1/select?q=abc&wt=json&indent=true&userID=THIS-PARAMETR-NEED-PASS
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/mydatabase" user="root" password="root"/>
  <document name="t">
    <entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='THIS-PARAMETR-NEED-PASS'">
      <field column="Ac" name="acid"/>
      <field column="UserID" name="userid"/>
      <field column="Comment" name="comment"/>
      <entity name="m"
              query="SELECT * FROM `table2` WHERE `tid` = '${act.tid}'">
        <field column="Title" name="title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
The example you give is a bit mixed up, as the URL you show hints at a search request, but the configuration shows that you want to access a request parameter within a DataImportHandler.
Your concrete parameter can be accessed as ${dataimporter.request.userID}. Referring to the wiki, you would need to alter your data config like this:
<entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='${dataimporter.request.userID}'">