I am using Solr 4.6.0, indexing about 10'000 elements at a time, and I am suffering from poor import performance: importing those 10'000 documents takes about 10 minutes. Of course I know that this depends heavily on the server hardware, but I would still like to know what performance boosts can be applied and which of them are actually useful in real-world situations (joins etc.). I would also be very thankful for precise examples rather than just links to the official documentation.
Here is the data-config.xml:
<dataConfig>
  <dataSource name="mysql" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://xxxx"
              batchSize="-1"
              user="xxxx" password="xxxx" />
  <document name="publications">
    <entity name="publication" transformer="RegexTransformer" pk="id" query="
        SELECT
          sm_publications.id AS p_id,
          CONCAT(sm_publications.title, ' ', sm_publications.abstract) AS p_text,
          sm_publications.year AS p_year,
          sm_publications.doi AS p_doi,
          sm_conferences.full_name AS c_fullname,
          sm_journals.full_name AS j_fullname,
          GROUP_CONCAT(DISTINCT sm_query_publications.query_id SEPARATOR '_-_-_-_-_') AS q_id
        FROM sm_publications
        LEFT JOIN sm_conferences ON sm_conferences.id = sm_publications.conference_id
        LEFT JOIN sm_journals ON sm_journals.id = sm_publications.journal_id
        INNER JOIN sm_query_publications ON sm_query_publications.publication_id = sm_publications.id
        WHERE '${dataimporter.request.clean}' != 'false'
          OR sm_publications.modified > '${dataimporter.last_index_time}'
        GROUP BY sm_publications.id">
      <field column="p_id" name="id" />
      <field column="p_text" name="text" />
      <field column="p_text" name="text_tv" />
      <field column="p_year" name="year" />
      <field column="p_doi" name="doi" />
      <field column="c_fullname" name="conference" />
      <field column="j_fullname" name="journal" />
      <field column="q_id" name="queries" splitBy="_-_-_-_-_" />
      <entity name="publication_authors" query="
          SELECT
            CONCAT(
              IF(sm_authors.first_name != '', sm_authors.first_name, ''),
              IF(sm_authors.middle_name != '', CONCAT(' ', sm_authors.middle_name), ''),
              IF(sm_authors.last_name != '', CONCAT(' ', sm_authors.last_name), '')
            ) AS a_name,
            sm_affiliations.display_name AS aa_display_name,
            CONCAT(sm_affiliations.latitude, ',', sm_affiliations.longitude) AS aa_geo,
            sm_affiliations.country_name AS aa_country_name
          FROM sm_publication_authors
          INNER JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
          LEFT JOIN sm_affiliations ON sm_affiliations.id = sm_authors.affiliation_id
          WHERE sm_publication_authors.publication_id = '${publication.p_id}'">
        <field column="a_name" name="authors" />
        <field column="aa_display_name" name="affiliations" />
        <field column="aa_geo" name="geo" />
        <field column="aa_country_name" name="countries" />
      </entity>
      <entity name="publication_keywords" query="
          SELECT sm_keywords.name FROM sm_publication_keywords
          INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id
          WHERE sm_publication_keywords.publication_id = '${publication.p_id}'">
        <field column="name" name="keywords" />
      </entity>
    </entity>
  </document>
</dataConfig>
By query caching, I meant the CachedSqlEntityProcessor. I favor the merged solution, as in your other question MySQL GROUP_CONCAT duplicate entries. But CachedSqlEntityProcessor will help too, if p_id is repeated over and over in the result set of the main query (so the publication_authors sub-query would otherwise be issued repeatedly) and you are less concerned about the extra memory usage.
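To sketch the merged idea with your table names (a sketch only; it follows the same pattern your main query already uses for q_id, CONCAT_WS/NULLIF is just one way to mimic your IF-based name building, and group_concat_max_len may need raising so long author lists are not truncated):

SELECT
  sm_publications.id AS p_id,
  -- fold the per-publication author names into one delimited string
  GROUP_CONCAT(DISTINCT CONCAT_WS(' ',
    NULLIF(sm_authors.first_name, ''),
    NULLIF(sm_authors.middle_name, ''),
    NULLIF(sm_authors.last_name, '')
  ) SEPARATOR '_-_-_-_-_') AS a_name
FROM sm_publications
LEFT JOIN sm_publication_authors ON sm_publication_authors.publication_id = sm_publications.id
LEFT JOIN sm_authors ON sm_authors.id = sm_publication_authors.author_id
GROUP BY sm_publications.id

You would then split the delimited value in the entity, which already declares the RegexTransformer:

<field column="a_name" name="authors" splitBy="_-_-_-_-_" />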
Update: It looks like your two other questions have been solved, so you can probably go either way. I am posting the short example/pointer you requested anyway, in case others find it handy:
<entity name="x" query="select * from x">
<entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" where="xid=x.id">
</entity>
<entity>
This example was taken from the wiki. With the where attribute, select * from y is executed only once and its rows are cached, keyed on xid; each id coming from the main query select * from x is then looked up in that cache instead of a separate query being sent for every row.
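Applied to your config, the keywords sub-entity could use the cache the same way. A sketch under the same table names; note the query now selects publication_id itself and drops the per-row WHERE clause:

<entity name="publication_keywords" processor="CachedSqlEntityProcessor"
        query="SELECT sm_publication_keywords.publication_id, sm_keywords.name
               FROM sm_publication_keywords
               INNER JOIN sm_keywords ON sm_keywords.id = sm_publication_keywords.keyword_id"
        where="publication_id=publication.p_id">
  <field column="name" name="keywords" />
</entity>

The whole keyword join is then fetched once and held in memory, keyed by publication_id.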
Related
I'm new to Solr and having an issue with MySQL data extraction in Solr 8.8; despite the declaration below, each document only gets its ID indexed instead of the whole set of fields.
<document>
  <entity name="foobars"
          query="SELECT *, 'test' AS ENTITY FROM foobar"
          deltaQuery="SELECT ID FROM foobar WHERE updated >= '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT *, 'MAT' AS ENTITY FROM foobar WHERE ID = ${dataimporter.delta.id}">
    <field column="ENTITY" name="entity" />
    <field column="ID" name="id" />
    <field column="FOO" name="foo" />
    <field column="BAR" name="bar" />
    <field column="BAZ" name="baz" />
    <field column="UPDATED" name="updated" />
  </entity>
</document>
This is a sample of what was imported:
{
  "responseHeader":{
    "status":0,
    "QTime":9,
    "params":{
      "q":"*:*",
      "_":"1623166185835"}},
  "response":{"numFound":147,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"214768.0",
        "_version_":1702016739810738176},
      {
        "id":"296594.0",
        "_version_":1702016739840098304},
      ...
Does anyone know what I'm missing here? Thanks for any help.
The query below works perfectly fine when run in phpMyAdmin. I want to index these tables completely using Solr and generate the aggregated result using a single query.
"select biblio.biblionumber as 'id', biblio.*, biblioitems.*, items.*, branches.* from biblio
inner join biblioitems ON (biblioitems.biblionumber=biblio.biblionumber)
inner join items ON (items.biblionumber=biblio.biblionumber)
inner join branches ON (branches.uid=items.uid);
I gave it a try in Solr but could not get the desired result using this:
<document>
  <entity name="id" query="select biblio.biblionumber as 'id', biblio.* from biblio ;">
    <field column="BIBLIONUMBER" name="biblionumber" />
    <field column="AUTHOR" name="author" />
    <field column="TITLE" name="title" />
    <field column="SERIESTITLE" name="seriestitle" />
    <field column="COPYRIGHTDATE" name="copyrightdate" />
    <field column="ABSTRACT" name="abstract" />
    <entity name="id2" query="select biblioitems.biblioitemnumber as 'id2', biblioitems.* from biblioitems where biblionumber='${biblio.id}'">
      <field name="BIBLIOITEMNUMBER" column="biblioitemnumber" />
      <field name="ISBN" column="isbn" />
      <field name="ISSN" column="issn" />
      <field name="PUBLISHERCODE" column="publishercode" />
      <field name="EDITIONSTATEMENT" column="editionstatement" />
      <field name="PAGES" column="pages" />
      <field name="PLACE" column="place" />
      <field name="URL" column="url" />
    </entity>
    <entity name="id3" query="select items.uid as 'id3', items.* from items where biblionumber='${biblio.id}'">
      <field name="ITEMNUMBER" column="itemnumber" />
      <field name="PRICE" column="price" />
      <field name="BARCODE" column="barcode" />
      <field name="ENUMCHRON" column="enumchron" />
      <field name="UID" column="uid" />
      <field name="HOMEBRANCH" column="homebranch" />
      <entity name="id4" query="select branches.uid AS 'id4', branches.* from branches where uid = '${items.id3}'">
        <field name="UID" column="uid" />
        <field name="BRANCHNAME" column="branchname" />
      </entity>
    </entity>
  </entity>
</document>
The result is displayed only up to abstract; the moment the join operation comes into play, nothing further is indexed. I'm struggling with the query.
I request you all to help me with this query.
Thanks in advance!
I have two MySQL tables, book and author, with a many-to-many relationship between them via book_author_mapper, whose rows contain the columns book_id / author_id.
In Solr, I have a query to get a book list; for each book I need to get an array of the author_id values for that book.
Currently, I am thinking about using a multi-valued field to store the author ids.
My questions are:
How do I define the field, and how do I write the SQL in DIH? It seems to need multiple SQL queries, right? Thx.
If I want to get not just the author_id list, but also the author_name for each author_id, is that possible?
After viewing the docs and googling, I have kind of solved the problem.
Tables
book
author
book_author_map (this is the middle table for many-to-many relationship)
DIH config file
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test?characterEncoding=utf8&amp;zeroDateTimeBehavior=convertToNull"
              user="root" password="123456" />
  <document>
    <entity name="book" pk="id"
            query="SELECT * FROM book where status = 0 limit 200000;"
            deltaImportQuery="SELECT * FROM book where status = 0 and id='${dih.delta.id}' limit 200000;"
            deltaQuery="select id from book where status = 0 and CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') > '${dih.last_index_time}'">
      <entity name="author"
              query="SELECT au.cn_name as author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id} limit 10;">
        <field name="authors" column="author_cn_name" />
      </entity>
    </entity>
  </document>
</dataConfig>
Field definitions
<field name="cn_name" type="textComplex" indexed="true" stored="true" />
<field name="en_name" type="textComplex" indexed="true" stored="true" />
<field name="status" type="int" indexed="true" stored="true" />
<field name="authors" type="textComplex" indexed="true" stored="true" multiValued="true" />
TODOs
parentDeltaQuery: it gets the pk of the parent entity, but when is it called, and what does it do? Is it necessary?
Are deltaQuery and parentDeltaQuery necessary in a sub-entity? (See the sketch below for my current understanding.)
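My current understanding, following the delta-import example in the DIH wiki (a sketch only, assuming an update_date column on author): deltaQuery on the sub-entity finds the changed author rows, and parentDeltaQuery maps each changed author back to the pk of the parent book entity so the affected book documents get re-indexed; without them, a delta import only reacts to changes in the root book table.

<entity name="author"
        query="SELECT au.cn_name as author_cn_name FROM author AS au JOIN book_author_map AS bam ON au.id = bam.author_id WHERE bam.book_id = ${book.id} limit 10;"
        deltaQuery="SELECT id FROM author WHERE CONVERT_TZ(`update_date`, @@session.time_zone, '+00:00') > '${dih.last_index_time}'"
        parentDeltaQuery="SELECT bam.book_id AS id FROM book_author_map AS bam WHERE bam.author_id = ${author.id}">
  <field name="authors" column="author_cn_name" />
</entity>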
I have to import some mdb files into Solr. Some of the mdb files are indexed well as documents, but others are not.
I use Solr 4.10.0 and UCanAccess ver. 2.0.9. The following is a screenshot from the log:
For some missing field values (in the screenshot's case, 6 fields) I have set onError="continue" in the dataimport config:
<document>
  <entity name="Book" dataSource="a" query="select bkid AS id, bkid AS BookID, bk AS BookTitle, betaka AS BookInfo, cat as cat from 0bok WHERE bkid = 29435">
    <field column="id" name="id"/>
    <field column="BookID" name="BookID"/>
    <field column="BookTitle" name="BookTitle"/>
    <field column="cat" name="cat"/>
    <entity name="Category" dataSource="a" query="select name as CatName, catord as CatWeight, Lvl as CatLevel from 0cat where id = ${Book.CAT}">
      <field column="CatName" name="CatName"/>
      <field column="CatWeight" name="CatWeight"/>
      <field column="CatLevel" name="CatLevel"/>
    </entity>
    <entity name="Pages" dataSource="a1" onError="continue" query="SELECT nass AS PageContent, page AS PageNum FROM book ORDER BY page">
      <field column="PageContent" name="PageContent"/>
      <field column="PageNum" name="PageNum"/>
      <entity name="Titles" dataSource="a1" onError="continue" query="SELECT * FROM title WHERE id = ${Pages.PAGENUM} ORDER BY sub">
        <field column="ID" name="TitleID"/>
        <field column="TIT" name="PageTitle"/>
        <field column="SUB" name="TitleWeight"/>
        <field column="LVL" name="TitleLevel"/>
      </entity>
    </entity>
  </entity>
</document>
This is a screenshot of the relevant table in the database, with the 6 undefined data fields:
At the end of the data import for this mdb file, I got the following response:
Last Update: 09:12:04 Requests: 31,952, Fetched: 78,980, Skipped: 0,
Processed: 0 Started: 18 minutes ago
which shows that 0 documents were processed!
Other mdb files are processed, i.e. a "1 processed" count is shown in the response, but I got the following errors in the log:
10/7/2014 9:28:08 AM ERROR SolrWriter Exception while solr commit.
this writer hit an OutOfMemoryError; cannot commit...
and
SolrIndexWriter Error closing IndexWriter this writer hit an
OutOfMemoryError; cannot flush...
How can I solve this issue? And why does Solr request and fetch all these records and then process and index none of them?!
I need to pass a userID parameter in Apache Solr.
Example:
http://localhost.com:8983/solr/collection1/select?q=abc&wt=json&indent=true&userID=THIS-PARAMETR-NEED-PASS
<dataConfig>
  <dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/mydatabase" user="root" password="root"/>
  <document name="t">
    <entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='THIS-PARAMETR-NEED-PASS'">
      <field column="Ac" name="acid"/>
      <field column="UserID" name="userid"/>
      <field column="Comment" name="comment"/>
      <entity name="m"
              query="SELECT * FROM `table2` WHERE `tid` = '${act.tid}'">
        <field column="Title" name="title"/>
      </entity>
    </entity>
  </document>
</dataConfig>
The example you give is a bit mixed up, as the URL you show hints at a search request, but the configuration shows that you want to access a request parameter within a DataImportHandler.
Your concrete parameter can be accessed as ${dataimporter.request.userID}. Referring to the wiki, you would need to alter your data config like this:
<entity name="act" dataSource="ds-1" query="SELECT * FROM mytable WHERE UserID='${dataimporter.request.userID}'">