How to properly set up DataImportHandler for a MySQL database with a large number of records? - mysql

I have set up Solr's DataImportHandler as instructed in the manual. Solr reads the records from a MySQL database. The database has a large number of records (billions are expected).
I have read that batchSize does not work for MySQL because the JDBC driver does not support it. I have tried setting it to -1. In this case, Solr performs one select, gets all records from the DB, and indexes them.
Now I have a problem: a timeout occurred while indexing and caused it to stop. I see that Solr hasn't written any id value to the properties file after the exception occurred, so I am not sure how to proceed with indexing the rest of the records.
Can anyone suggest how to set up Solr with MySQL for a proper data import?
Below is the data config I am currently using.
<dataConfig>
<dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/myowndb" batchSize="-1" />
<document name="statuses">
<entity name="status" query="select s.*, ti.id2, ti.value2 from tblTable1 s inner join tblTable2 ti on s.table2Id = ti.id;">
<field column="id" name="id" />
<field column="statusID" name="statusId" />
<field column="type" name="type" />
<field column="date" name="date" />
<field name="id2" column="id2" />
<field name="value2" column="value2" />
</entity>
</document>
</dataConfig>
EDIT:
Based on my tests today, it looks like batchSize is working. If batchSize is set to -1, Solr makes a single request to MySQL that retrieves all rows at once. If it is set to some value greater than 0, every record is put in memory before processing.
The new question is: how do I set up the data import handler so that it indexes in batches? That is, not only to select from the database in batches, but to index each collected set before collecting the next one.
EDIT: Specified question
A new question that came up from reading: is it possible to mark a row in the database as processed? There are only two events available in DIH, onImportStart and onImportEnd.
My current line of thinking leads me to implementing a custom EntityProcessor. If it were possible to know when a row has been indexed, it would be easy to set an isIndexed flag in the database for that row.
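One way to get restartable, chunked imports without a custom EntityProcessor might be to let the caller drive the batches through request parameters, since DIH exposes any request parameter as ${dataimporter.request.<name>}. The sketch below assumes that tblTable1.id is a numeric, indexed key; startId and endId are hypothetical parameter names picked for this example, not DIH built-ins.

```xml
<dataConfig>
  <!-- data source kept exactly as in the config above -->
  <dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/myowndb" batchSize="-1" />
  <document name="statuses">
    <!-- startId and endId are hypothetical request parameters; each request imports one id range -->
    <entity name="status"
            query="select s.*, ti.id2, ti.value2 from tblTable1 s inner join tblTable2 ti on s.table2Id = ti.id where s.id &gt; ${dataimporter.request.startId} and s.id &lt;= ${dataimporter.request.endId}">
      <field column="id" name="id" />
      <field column="statusID" name="statusId" />
      <field column="type" name="type" />
      <field column="date" name="date" />
      <field name="id2" column="id2" />
      <field name="value2" column="value2" />
    </entity>
  </document>
</dataConfig>
```

Each chunk would then be requested separately, for example with command=full-import&clean=false&startId=0&endId=1000000, followed by startId=1000000&endId=2000000, and so on. clean=false keeps the chunks that are already indexed, and since the caller knows which range it asked for, a chunk that times out can simply be requested again.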

Related

How can we use solr with both MongoDB and MySQL?

I want to use Solr with MongoDB and MySQL together and need to combine them in a single core.
For example, I have a MongoDB collection that depends on one MySQL table.
I tried both with separate Solr cores and that works fine, but I want them in a single core. I don't know whether that is possible, and if it is, how to do it.
Updated
Here are my DIH (DataImportHandler) configs:
- Solr with MySQL
<dataConfig>
<dataSource
name="MySQl"
type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/test"
user="root" password="root"
batchSize="-1"/>
<document>
<entity
query="select * from master_table"
name="master">
</entity>
</document>
</dataConfig>
- Solr with MongoDB
<dataConfig>
<dataSource
name="MyMongo"
type="MongoDataSource"
database="test" />
<document>
<entity
processor="MongoEntityProcessor"
query=""
collection="MarketCity"
datasource="MyMongo"
transformer="MongoMapperTransformer"
name="sample_entity">
<field column="_id" name="id" mongoField="_id" />
<field column="keyName" name="keyName" mongoField="keyName"/>
</entity>
</document>
</dataConfig>
So I want to do this with a single core.
You can read the data from MySQL and MongoDB, merge the records into a single record, and index that into Solr.
To get the data from MySQL, fetch it with any programming language.
For example, you can use Java to fetch the data from MySQL.
Apply the same logic to MongoDB: get all the required records from MongoDB using Java.
Now create the Solr document using the SolrJ APIs. Read more about SolrInputDocument and the other APIs in the SolrJ documentation.
Once you have created the SolrInputDocument instance, add the data that you fetched from MySQL and MongoDB to it using the method below.
addField(String name, Object value)
This adds a field to the document.
You can prepare the document like this:
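import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
// The core URL is a placeholder; point it at your own core.
SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", "123456");
document.addField("name", "Kevin Ross");
document.addField("price", "100.00");
solr.add(document);
solr.commit();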
The solr object used above is an instance of HttpSolrClient.
Once the SolrInputDocument is ready, index it into Solr with add() and commit().

How to create a new column in Apache Solr during full import?

This is the current query I am using for my Apache Solr full import.
<document>
<entity name="id" query="select f_name as 'id', f_name, l_name, AES_DECRYPT(l_name,'key')as decp_col from table_name;"/>
</document>
This is the format I get on Apache Solr.
l_name":["4\u0007{¾Nen•øÕ+·$Õ\u0002"],
"f_name":["Jayde"],
"id":"Jayde",
"_version_":1650635178055303168},
I want the import to produce a decp_col field containing the decrypted values.
I have tried multiple queries but am unable to get anything on the Solr side.
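A few things might be worth checking here; none of them is confirmed by the question. AES_DECRYPT returns a binary string in MySQL, so a cast to a character type may be needed to get readable text. It also returns NULL when decryption fails, and as far as I know DIH skips NULL column values. Finally, DIH only picks up an unmapped column automatically when the schema has a field of the same name. A sketch under those assumptions, with decp_col mapped explicitly:

```xml
<document>
  <!-- CAST turns the binary result of AES_DECRYPT into a character string -->
  <entity name="id" query="select f_name as 'id', f_name, l_name, CAST(AES_DECRYPT(l_name,'key') AS CHAR) as decp_col from table_name">
    <!-- explicit mapping; a field named decp_col is assumed to be defined in the schema -->
    <field column="decp_col" name="decp_col" />
  </entity>
</document>
```

Running the same select directly in a MySQL client first would show whether decp_col comes back as NULL, which would point at the key or the stored data rather than at Solr.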

solr data import handler working on localhost and not on server

I have been trying to configure Solr DIH on a server that has about 5 million documents, and it is not working, but it works well on my localhost with 100000 documents. What can be the problem?
This is the log I am getting:
Exception while processing: product_master document : SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT cs_product_id,title FROM product_master Processing Document # 1
16:10:56
SEVERE
DataImporter
Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: SELECT cs_product_id,title
My data-config is below:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://www.mysite.com/mydb" user="myusername" password="mypwd" batchSize="-1"/>
<document>
<entity name="product_master" pk="cs_product_id"
query="SELECT cs_product_id,title FROM product_master"
deltaImportQuery=" SELECT cs_product_id,title FROM product_master WHERE cs_product_id = '${dataimporter.delta.cs_product_id}'"
deltaQuery=" SELECT cs_product_id FROM product_master WHERE update_timestamp > '${dataimporter.last_index_time}'">
<field column="cs_product_id" name="cs_product_id"/>
<field column="title" name="title"/>
</entity>
</document>
</dataConfig>
There shouldn't be any difference, so I suggest you do the following:
- set the log level of the DIH components to DEBUG
- create a simple class with a main method that does something trivial with that database, using that exact connection URL, username, and password
I think one of the two steps above will reveal the problem (which, I guess, has nothing to do with Solr).

How can I set the Nhibernate type generator to increment when working with SQL Server 2008?

I am using HBM files to map my types.
One of my classes uses a bag of items called PartnerEnv. One of its fields is set to be the id, which should be generated using increment. For some reason I am getting the following error:
could not fetch initial value for increment generator[SQL: SQL not available]
Inner details: "{"Invalid object name 'jj.dbo.Partners2Env'."}"
If I change the generation method to assigned, everything is OK.
I would appreciate any help!
Can you set the Id column on the PartnerEnv table (or whatever that table is called) to an identity column and then use the following in the .hbm file for that class?
<id name="Id" type="Int32">
<column name="Id" />
<generator class="identity" />
</id>

update index in Solr, error: required field in SolrSchema not found in DataConfig

I'm trying to update my index, but I keep on getting the error:
org.apache.solr.handler.dataimport.DataImporter verifyWithSchema
INFO: UPC is a required field in SolrSchema . But not found in DataConfig
I can't figure out why it's complaining, since:
- the first time I ran the import, it worked fine, and the only thing I changed was adding a few fields (columns) to schema.xml
- the table I am querying does have a UPC column
Here is what my data-config.xml looks like:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost:3306/product"
user="root"
password="some_password"/>
<document>
<entity name="product"
query="select * from productdetails">
</entity>
</document>
</dataConfig>
But again, the interesting part is that the import worked a moment ago but fails on re-import. I'm hoping somebody has had this problem before. If not, maybe someone can suggest other things to check?
The reason for this is that when DataImportHandler starts, it checks its config against your loaded schema. It is not an error, merely a warning. To remove it, add an explicit field to your import config with a name that matches your required field.
This is not the cause of your failed re-import, as it is simply a warning.
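For illustration, the message should go away once the entity maps the required field explicitly. A minimal sketch based on the config from the question, assuming the column in productdetails is named UPC exactly like the schema field:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/product"
              user="root"
              password="some_password"/>
  <document>
    <entity name="product"
            query="select * from productdetails">
      <!-- explicit mapping for the required schema field; the column name UPC is an assumption -->
      <field column="UPC" name="UPC"/>
    </entity>
  </document>
</dataConfig>
```

The other columns returned by select * are still matched to schema fields by name, so only the required field needs an explicit entry.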