Creating dynamically named fields in Solr without declaring a Transformer - mysql

I needed to index multiple fields from a table dynamically. At first it didn't work; the field name was being read literally as "_t". Out of curiosity, I added a transformer declaration to the entity, and that gave me the list of fields I needed.
Why did adding a transformer that is never used fix my problem? Removing the transformer declaration causes the problem to come back, and it does not matter which transformer I declare.
I am running Solr 4.1, and this issue first happened on the example-DIH configuration included with the download. I read the Solr reference guide for 4.0 and could not find a reason why adding a transformer makes it work.
Below is the db-data-config.xml file:
<dataConfig>
  <dataSource name="foo" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/foo" user="fakeadmin" password="fakepass"/>
  <document>
    <entity name="products" query="select * from `product`">
      <field name="id" column="id" />
      <entity name="names" query="select * from names where productId = '${products.id}'" transformer="LogTransformer">
        <field name="${names.fieldName}_t" column="fieldValue"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Related

Unknown field errors when posting in Solr

I'm new to Solr. I'm running ver. 8.3.1 on Ubuntu 18.04. So far I've successfully crawled a website via a custom script, posted the crawled content to a schemaless core and run queries against the index. Now I'd like to do highlighting.
I found highlighting-related questions Question 1 and Question 2, which recommend storing the data to be indexed in files containing two fields, one field for the original, formatted HTML and the other for the same content as unformatted text, stripped of its HTML tags.
I turned off schemaless mode and the managed-schema, and have created a minimal schema.xml based on the default managed-schema. My schema contains only 3 of the predefined fields (id, _version_ and _text_) and two additional fields, content_html and content_text. Per the second answer to question 2 above, I also defined a text_html fieldType for content_html, but the core failed to restart with that fieldType, so I removed it. I have left all field types, dynamic fields, copy fields, etc. intact.
Here are the fields defined in schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_html" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
My crawler is a simple PHP script. I've tried storing the crawled content as XML and JSON in separate attempts. With XML, PHP converted special characters to their HTML equivalents (e.g., < becomes &lt;), so I abandoned this format. JSON seems to be working, as evidenced by the fact that my script doesn't die and spot checks of the output appear to be formatted correctly. Here's a simplified example of the script's output:
{
"content_html": "<!DOCTYPE html> <html lang=\"en\"><head><meta charset=\"utf-8\"> ... <ul> ... <\/ul> ... etc.",
"content_text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... etc."
}
I have followed the instructions for clearing out the index prior to re-posting the documents, which are to delete by query and commit as follows:
curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"*:*" }}' http://localhost:8983/solr/corename/update
curl http://localhost:8983/solr/corename/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
The instructions say the data dir should be empty when this finishes, but it isn't. The folder structure remains intact, and two files remain in the index folder, segments_a (117 bytes) and write.lock (0 bytes). There are also a number of files of various sizes in the tlog folder.
I’m sending the data to Solr with the following command:
sudo -u solr /opt/solr/bin/post -c corename /var/solr/data/userfiles/corename/*
When I post, Solr throws errors for each document and the index isn't updated. Here are the two errors I'm getting:
..."status":500, ... "msg":"org.apache.tika.exception.TikaException: Zip bomb detected!"
..."status":400, ... unknown field 'x_parsed_by'
The Tika error probably results from my putting large amounts of data in a single JSON field. Research tells me that this error is related to Tika's maxStringLength setting for its WriteOutContentHandler. I will ask about Tika in a separate question.
Regarding the unknown field error, my assumption is that Solr will index only the data contained in the fields I've defined in schema.xml, ignoring other fields it encounters. So, I don't know why a new field, x_parsed_by, is coming into the picture and causing trouble. Perhaps my assumption is incorrect. Must I account in advance for every field that will be encountered? It seems to me this would be impossible with a large set of data unless schemaless mode is used. Perhaps, instead of using a minimal schema.xml, I should rename my core's managed-schema, which was modified by indexing, to schema.xml so that all of the fields it defines are retained. I've learned, though, that it's wise to reduce the index size if possible by eliminating unnecessary fields. How should I approach this? Perhaps there's a recommended solution for highlighting HTML content that I've missed.
What additional information can I provide to facilitate an answer?
After receiving MatsLindh's advice and reading the reference guide more carefully, I made the following changes which allowed me to index the data and highlight query results.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_description" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_html" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
JSON output:
[
{
"id": "1",
"content_title": "Title Text Here",
"content_description": "Description text here",
"content_html": "<!DOCTYPE html> etc.",
"content_text": "Text from content_html, stripped of tags."
},
{
"id": "2",
"content_title": "Title Text Here",
"content_description": "Description text here",
"content_html": "<!DOCTYPE html> etc.",
"content_text": "Text from content_html, stripped of tags."
}
]
Command to index the data via /update:
curl 'localhost:8983/solr/corename/update?commit=true' --data-binary @/var/solr/data/userfiles/corename/content.json -H 'Content-type:application/json'
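For completeness, a minimal highlighting request against the stored text field might look like the sketch below; the query term, snippet count, and field choice are placeholders rather than values from the original setup:
curl 'http://localhost:8983/solr/corename/select?q=content_text:ipsum&hl=true&hl.fl=content_text&hl.snippets=3'
Highlighting requires the highlighted field to be stored, which is why content_text has stored="true" in the schema above.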

Can a partial number of columns in a CSV file be imported to Solr using post.jar?

I'm trying to import only certain columns of a CSV file into Solr, but I'm not sure how to do this, or whether it is even possible with Solr. Currently, I'm using one of the books.csv examples that came with the Solr installation (found in C:\solr-5.2.1\example\exampledocs).
The XML below, which I put in the schema.xml file, works if all fields are included, but if I comment some fields out, Solr complains that the commented-out fields are unknown.
<uniqueKey>id</uniqueKey>
<!-- Fields added for books.csv load-->
<field name="cat" type="text_general" indexed="true" stored="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="price" type="tdouble" indexed="true" stored="true"/>
<!-- these columns commented out
<field name="inStock" type="boolean" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
-->
Because the post script in C:\solr-5.2.1\bin is a shell script and won't run on Windows, which I am using, I need to use the post.jar file located in the same place as the books.csv file.
java -Dtype=text/csv -Durl=http://localhost:8983/solr/jcg/update -jar post.jar books.csv
The CSV update handler has a large number of parameters for controlling the CSV import process; the skip parameter seems the most relevant to your specific problem.
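For example, assuming the commented-out inStock and author columns are the ones you want to drop, a sketch of the same post.jar invocation with the skip parameter added to the update URL might look like this (the parameter values are illustrative, not tested against your setup):
java -Dtype=text/csv -Durl="http://localhost:8983/solr/jcg/update?skip=inStock,author&commit=true" -jar post.jar books.csv
Solr should then ignore those two columns while indexing the remaining ones, so they no longer need field definitions in schema.xml.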

Data import from csv file to Solr core using DIH

I am trying to upload the contents of a CSV file to Solr using DIH (data import handler). I have written a custom data config file and referenced it in solrconfig.xml. The content of the data config file is shown below:
<dataConfig>
  <dataSource name="ds1" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="entryline"
            processor="LineEntityProcessor"
            url="testSolr.csv"
            rootEntity="false"
            dataSource="ds1"
            header="true"
            separator="^"
            transformer="DateFormatTransformer"
            loglevel="debug">
      <field column="id" name="id"/>
      <field column="ab" name="ab"/>
      <field column="bc" name="bc"/>
      <field column="tt" name="tt" dateTimeFormat="EEE MMM dd HH:mm:ss yyyy" locale="en"/>
    </entity>
  </document>
</dataConfig>
The problem here is that Solr was able to fetch all the lines from the CSV, but it was not able to add/update those lines in its core (please note that I also have a schema.xml with the above-mentioned fields). Below is a snapshot of the Solr dashboard after executing the import command:
[Solr dashboard snapshot after executing the import command]
I am not getting any exception either. Could anybody help me understand the issue or provide a solution for the same? Thanks in advance.
Use baseDir in entity.
<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <!-- this outer processor generates a list of files satisfying the conditions specified in the attributes -->
    <entity name="f" processor="FileListEntityProcessor" fileName="SoapHeader_.*|atom.xml$" recursive="true" rootEntity="false" dataSource="null" baseDir="D:\Apache\Apache-Solr-5.1\example\exampledocs">
      <!-- the LineEntityProcessor entity from the question goes here as a nested entity, with url="${f.fileAbsolutePath}" -->
    </entity>
  </document>
</dataConfig>

How to import data from mysql to solr

I am trying to do a full DB import using the URL below:
`127.0.0.1:8983/solr/dataimport?command=full-import`
I installed Solr and am trying to configure it. I changed a few files and added the details (the file names and added code are described below). But when I try to import the table data into Solr (JSON format), it shows the error below:
HTTP ERROR 404
Problem accessing /solr/dataimport. Reason:
Not Found
Powered by Jetty://
Can anyone let me know what the actual problem is? Or did I misconfigure Solr?
My data-config.xml file has the code below:
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/sq_dbLoveOffers"
              user="pksqueak"
              password="passwd"/>
  <document>
    <entity name="id"
            query="select sq_prom_id, sq_prom_name, sq_prom_description, sq_latitude, sq_longitude from sq_offers">
    </entity>
  </document>
</dataConfig>
I added the code below to solrconfig.xml:
<lib dir="../../../../contrib/dataimporthandler/lib/" regex=".*\.jar" />
<lib dir="../../../dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
and
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
I added the code below to the schema.xml file:
<fields>
<field name="sq_prom_id" type="string" indexed="true" stored="true" required="true" />
<field name="sq_prom_name" type="string" indexed="true" stored="true" />
<field name="sq_prom_description" type="string" indexed="true" stored="true" />
<field name="sq_latitude" type="string" indexed="true" stored="true" />
<field name="sq_longitude" type="string" indexed="true" stored="true" />
</fields>
If the core you are addressing is not your default core, your request is missing the core's name in the URL. Your request should look like this:
127.0.0.1:8983/solr/<core-name>/dataimport?command=full-import
There you need to replace the <core-name> with the actual name of your core, as configured in your solr.xml.
I used the command below to run the Solr server for DIH:
java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar
and I did the full import using the URL below, which solved my problem:
http://127.0.0.1:8983/solr/db/dataimport?command=full-import
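If the import seems to run but nothing shows up, polling the handler's status can help; a hedged example, assuming the same db core as above:
curl 'http://127.0.0.1:8983/solr/db/dataimport?command=status'
The response includes counts such as the number of rows fetched and documents processed.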
I have a functioning Data Import Handler; you can compare configs with mine if necessary: http://amac4.blogspot.co.uk/2013/08/configuring-solr-4-data-import-handler.html

How to import XML file into MySQL database table using XML_LOAD(); function

I have an XML file which looks like this :
<?xml version="1.0" encoding="UTF-8"?>
<resultset statement="YOUR SQL STATEMENTS TO GENERATE THIS XML FILE" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<row>
<field name="personal_number">539</field>
<field name="firstname">Name</field>
<field name="lastname">Surname</field>
<field name="email">email.domain.com</field>
<field name="start_time">2011-04-02 13:30:00</field>
<field name="end_time">2011-04-02 18:15:00</field>
<field name="employee_category">1,2,4,5,22,37,38,39,41,43,44</field>
</row>
<row>
<field name="personal_number">539</field>
<field name="firstname">Name</field>
<field name="lastname">Surname</field>
<field name="email">email.domain.com</field>
<field name="start_time">2011-04-02 13:30:00</field>
<field name="end_time">2011-04-02 18:15:00</field>
<field name="employee_category">1,2,4,5,22,37,38,39,41,43,44</field>
</row>
<row>
<field name="personal_number">539</field>
<field name="firstname">Name</field>
<field name="lastname">Surname</field>
<field name="email">email.domain.com</field>
<field name="start_time">2011-04-02 13:30:00</field>
<field name="end_time">2011-04-02 18:15:00</field>
<field name="employee_category">1,2,4,5,22,37,38,39,41,43,44</field>
</row>
</resultset>
I am trying to import it into MySQL using the SQL statement:
use databasename;
LOAD XML LOCAL INFILE '/pathtofile/file.xml' INTO TABLE my_tablename;
The table my_tablename has the following fields:
id (auto increment id)
personal_number(varchar)
firstname(varchar)
lastname(varchar)
email(varchar)
start_time(varchar)
end_time(varchar)
employee_category(varchar)
I get the error:
Error Code: 1136
Column count doesn't match value count at row 1
I am using MySQL 5.1.56
I assume this error occurs because the database table has the field id, which is not present in the XML file. How is it possible to import this XML file using MySQL queries or built-in functions such that it skips the id column during the import and relies on auto increment for the id column?
Is there some smarter way of handling XML file imports in MySQL?
Maybe there is a better statement which allows specifying a column mapping?
Thank you!
You can specify the fields like this:
LOAD XML LOCAL INFILE '/pathtofile/file.xml'
INTO TABLE my_tablename(personal_number, firstname, ...);
Since ID is auto increment, you can also set ID=NULL, like this:
LOAD XML LOCAL INFILE '/pathtofile/file.xml' INTO TABLE my_tablename SET ID=NULL;
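If the client rejects the LOCAL INFILE statement, local file loading may need to be enabled explicitly when invoking the mysql client; below is a sketch combining the column list with that flag (user name, database name, and path are placeholders):
mysql --local-infile=1 -u username -p databasename \
  -e "LOAD XML LOCAL INFILE '/pathtofile/file.xml' INTO TABLE my_tablename (personal_number, firstname, lastname, email, start_time, end_time, employee_category);"
Listing every column except id lets MySQL fall back to auto increment for the missing id values.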