I'm taking my first steps with Solr, and to that end I'm trying to load my own data from CSV using a schema adapted from the supplied example. I replaced the example fields with my own fields:
<field name="publicationNumber" type="string" indexed="true" stored="true" required="true" />
Plus more fields. When I now try to load the data using the command from the Solr documentation:
curl http://localhost:8983/solr/update/csv?stream.file=exampledocs/test.csv^&stream.contentType=text/csv;charset=utf-8
(Windows hence the ^ before the &)
I get the error:
undefined field: "?publicationNumber"
The first column in the CSV is publicationNumber. However, the field is clearly defined, so what is the ? before the field name? How can I load the data?
Always the same: you try for hours, and as soon as you make a forum post you find the solution. Pretty simple: even though the command says charset=utf-8, the CSV file must not be in UTF-8 encoding but ANSI. After converting it to ANSI, the load worked without issue.
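For anyone hitting the same thing: the stray ? in front of the first field name is most likely the UTF-8 byte-order mark (BOM) at the start of the file, which Solr reads as part of the first column header. Saving the file as UTF-8 without a BOM should work just as well as converting to ANSI; a rough sketch, assuming iconv is available (e.g., via Git for Windows) and with hypothetical file names:
REM WINDOWS-1252 is the usual "ANSI" code page on Western-European Windows
iconv -f UTF-8 -t WINDOWS-1252 exampledocs/test.csv > exampledocs/test-ansi.csv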
In the rsreportserver.config file, how can I set the CSV export to have a field delimiter of none, so that I can extract a fixed-length file? I tried leaving it empty, but it still puts commas after each field value.
Also, if possible, please provide the script to get a fixed-length file; some of the fields can have commas inside their values.
I would really appreciate any input, as my client is fully determined to use SSRS and not SSIS.
You can achieve this by adding a new custom rendering extension to rsreportserver.config.
Use the following as an example:
<Extension Name="csvnoseperator" Type="Microsoft.ReportingServices.Rendering.DataRenderer.CsvReport,Microsoft.ReportingServices.DataRendering">
    <OverrideNames>
        <Name Language="en-US">csvnoseperator</Name>
    </OverrideNames>
    <Configuration>
        <DeviceInfo>
            <FieldDelimiter></FieldDelimiter>
            <UseFormattedValues>False</UseFormattedValues>
            <NoHeader>True</NoHeader>
            <Encoding>ASCII</Encoding>
            <FileExtension>csv</FileExtension>
        </DeviceInfo>
    </Configuration>
</Extension>
You should then be able to use this as an export format from the front end.
Please ensure you back up your config file before you make this change! If you get it wrong, you will not be able to start Reporting Services.
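As for the fixed-length part of the question: with the delimiter removed by the extension above, the remaining work is padding each field to a fixed width, which is easiest to do in the dataset query itself. A rough T-SQL sketch, assuming a SQL Server data source; the table, column names and widths are made up:
SELECT
    LEFT(CustomerName + REPLICATE(' ', 30), 30) AS CustomerName,             -- pad or truncate text to 30 characters
    RIGHT(REPLICATE('0', 10) + CAST(OrderId AS varchar(10)), 10) AS OrderId  -- zero-pad the id to 10 characters
FROM dbo.Orders;
Because every field is padded to a known width and the extension emits no delimiter, commas inside the values no longer matter.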
I'm new to Solr. I'm running ver. 8.3.1 on Ubuntu 18.04. So far I've successfully crawled a website via a custom script, posted the crawled content to a schemaless core and run queries against the index. Now I'd like to do highlighting.
I found highlighting-related questions Question 1 and Question 2, which recommend storing the data to be indexed in files containing two fields, one field for the original, formatted HTML and the other for the same content as unformatted text, stripped of its HTML tags.
I turned off schemaless mode and the managed-schema, and have created a minimal schema.xml based on the default managed-schema. My schema contains only 3 of the predefined fields (id, _version_ and _text_) and two additional fields, content_html and content_text. Per the second answer to question 2 above, I also defined a text_html fieldType for content_html, but the core failed to restart with that fieldType, so I removed it. I have left all the field types, dynamic and copy fields, etc. intact.
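For reference, the kind of definition I was aiming for is a field type with an HTML-stripping char filter, roughly like the following (based on the Solr analyzer documentation; I have not got it working in my schema yet):
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>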
Here are the fields defined in schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_html" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
My crawler is a simple PHP script. I've tried storing the crawled content as XML and JSON in separate attempts. With XML, PHP converted special characters to their HTML entities (e.g., < becomes &lt;), so I abandoned that format. JSON seems to be working, as evidenced by the fact that my script doesn't die, and spot checks of the output appear to be formatted correctly. Here's a simplified example of the script's output:
{
  "content_html": "<!DOCTYPE html> <html lang=\"en\"><head><meta charset=\"utf-8\"> ... <ul> ... <\/ul> ... etc.",
  "content_text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit ... etc."
}
I have followed the instructions for clearing out the index prior to re-posting the documents, which are to delete by query and commit as follows:
curl -X POST -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"*:*" }}' http://localhost:8983/solr/corename/update
curl http://localhost:8983/solr/corename/update --data '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
The instructions say the data dir should be empty when this finishes, but it isn't. The folder structure remains intact, and two files remain in the index folder, segments_a (117 bytes) and write.lock (0 bytes). There are also a number of files of various sizes in the tlog folder.
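(Side note: it looks like the delete and the commit could also be done in a single request by adding commit=true to the URL; corename is of course my core's name:)
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/corename/update?commit=true' --data-binary '{"delete":{"query":"*:*"}}'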
I’m sending the data to Solr with the following command:
sudo -u solr /opt/solr/bin/post -c corename /var/solr/data/userfiles/corename/*
When I post, Solr throws errors for each document and the index isn't updated. Here are the two errors I'm getting:
..."status":500, ... "msg":"org.apache.tika.exception.TikaException: Zip bomb detected!"
..."status":400, ... unknown field 'x_parsed_by'
The Tika error probably results from my putting large amounts of data in a single JSON field. Research tells me that this error is related to Tika's maxStringLength setting for its WriteOutContentHandler. I will ask about Tika in a separate question.
Regarding the unknown field error, my assumption is that Solr will index only the data contained in the fields I've defined in schema.xml, ignoring any other fields it encounters. So I don't know why a new field, x_parsed_by, is coming into the picture and causing trouble. Perhaps my assumption is incorrect. Must I account in advance for every field that will be encountered? It seems to me this would be impossible with a large set of data unless schemaless mode is used. Perhaps, instead of using a minimal schema.xml, I should rename my core's managed-schema, which was modified by indexing, to schema.xml so that all of the fields it defines are retained. I've learned, though, that it's wise to reduce the index size if possible by eliminating unnecessary fields. How should I approach this? Perhaps there's a recommended solution for highlighting HTML content that I've missed.
What additional information can I provide to facilitate an answer?
After receiving MatsLindh's advice and reading the reference guide more carefully, I made the following changes which allowed me to index the data and highlight query results.
schema.xml:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="content_title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_description" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_html" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="content_text" type="text_general" indexed="true" stored="true" multiValued="true"/>
JSON output:
[
  {
    "id": "1",
    "content_title": "Title Text Here",
    "content_description": "Description text here",
    "content_html": "<!DOCTYPE html> etc.",
    "content_text": "Text from content_html, stripped of tags."
  },
  {
    "id": "2",
    "content_title": "Title Text Here",
    "content_description": "Description text here",
    "content_html": "<!DOCTYPE html> etc.",
    "content_text": "Text from content_html, stripped of tags."
  }
]
Command to index the data via /update:
curl 'localhost:8983/solr/corename/update?commit=true' --data-binary @/var/solr/data/userfiles/corename/content.json -H 'Content-type:application/json'
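And, for completeness, the general form of the query I now use for highlighting (the query term is just an example; hl.fl names the field to highlight):
curl 'http://localhost:8983/solr/corename/select?q=content_text:lorem&hl=true&hl.fl=content_text&hl.snippets=3'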
I'm trying to load CSV data into MySQL from Talend Studio and getting the error mentioned below:
Couldn't parse value for column 'RegisterTime' in 'row1', value is '1974-10-22 08:46:40.000'. Details: java.lang.RuntimeException: Unparseable date: "1974-10-22 08:46:40.000"
The field RegisterTime has the data type 'Date' and the format yyyy-MM-dd hh:mm:ss.SSS in the schema defined as metadata in Talend Studio.
Am I using an incorrect format? Any help in suggesting the right format would be great.
This is probably a date-pattern problem, even though the one you indicate looks right. You should make sure this pattern is actually used in the component itself by going to the component view -> "Edit Schema".
Because the value has milliseconds, the 'RegisterTime' column's database type should be 'DATETIME' with a length (fractional precision) of 3. Try again.
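Not part of either answer, but a quick way to check the pattern outside Talend is to parse the value with plain Java, since Talend jobs generate Java code; note the difference between HH (24-hour) and hh (12-hour):
import java.text.SimpleDateFormat;

public class RegisterTimeCheck {
    public static void main(String[] args) throws Exception {
        // same value as in the error message; HH = 24-hour clock
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        System.out.println(fmt.parse("1974-10-22 08:46:40.000"));
    }
}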
I want to clarify that I have browsed similar questions about file encoding and importing CSV files into MySQL. None of them worked for me.
The situation is like this:
I was in charge of creating a Spring Boot application which generates CSV files filled with random data to populate a database (each CSV file corresponds to a table in the DB).
The first scenario was to import these .csv files into Oracle 12c. There was no problem; I did it and there were no errors in the data.
The actual problem is importing the same .csv files into MySQL; so far it has not been possible.
To clarify, both databases (Oracle and MySQL) have the same tables with equivalent data types.
I tried:
The LOAD DATA INFILE option, but the query never completes (it never shows any error message).
Exporting the data from Oracle (from SQL Developer) into a .csv file and then importing it into MySQL; I got the same error below.
The Table Data Import wizard in MySQL Workbench, which says:
Can't analyze file, please try to change encoding type, if that doesn't help, maybe the file is not: csv, or the file is empty.
Followed by:
Unhandled exception: 'ascii' codec can't encode character u'\xcd' in position 8: ordinal not in range(128).
I browsed questions about the encoding, but none of them applies to my case. And I'm pretty sure that I set the encoding to UTF-8 in my Spring Boot application.
My file writer is created with:
writer = new OutputStreamWriter(fileOutputStream, StandardCharsets.UTF_8);
How is it possible that the generated .csv files work perfectly in Oracle but not in MySQL?
Additional info:
Spring Boot 2.0.3.RELEASE
Java version 1.8.0_162
If you are working in a Java environment, I think the DBIS solution can work well for you.
In this solution you only need to map the CSV file columns (header name or index) to the column names of the database table in an XML file; no other code is required.
https://stackoverflow.com/a/50180272/2480620
Add one step to export the table to CSV and a second to load the CSV into the table:
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <connections>
        <jdbc name="mysql">
            <driver>com.mysql.jdbc.Driver</driver>
            <url>jdbc:mysql://localhost:3306/test1</url>
            <user>root</user>
            <password></password>
        </jdbc>
        <jdbc name="oracle">
            <driver>oracle.jdbc.OracleDriver</driver>
            <url>jdbc:oracle:thin:@localhost:1521:orcl</url>
            <user>root</user>
            <password></password>
        </jdbc>
    </connections>
    <component-flow>
        <execute name="oracle2csv" success="csv2mysql" enabled="true">
            <migrate>
                <source>
                    <table connection="oracle">locations</table>
                </source>
                <destination>
                    <file delimiter="," header="false" path="D:/test_source.csv"/>
                </destination>
                <mapping>
                    <column source="location" destination="1" />
                </mapping>
            </migrate>
        </execute>
        <execute name="csv2mysql" enabled="true">
            <migrate>
                <source>
                    <file delimiter="," header="false" path="D:/test_source.csv"/>
                </source>
                <destination>
                    <table connection="mysql">locations</table>
                </destination>
                <mapping>
                    <column source="1" destination="location" />
                </mapping>
            </migrate>
        </execute>
    </component-flow>
</config>
The full solution is explained at
https://dbisweb.wordpress.com/
I have used it with MySQL, Oracle and Postgres.
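Alternatively, if you prefer to stay with plain MySQL rather than a separate tool, LOAD DATA should work once the file's character set is stated explicitly. A sketch only, reusing the locations table from the example above; the path and the delimiter options are assumptions:
LOAD DATA LOCAL INFILE '/path/to/locations.csv'
INTO TABLE locations
CHARACTER SET utf8mb4    -- the files were written as UTF-8 by the Spring Boot app
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;          -- only if the files contain a header row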
I am trying to load a CSV file into a Solr 6.5 collection using the Solr Admin UI. Here are the steps I took, and the error I got:
Created a data-driven managed-schema config set in ZooKeeper. Changed the unique key to "MyId" (a string field) instead of the default id.
<uniqueKey>MyId</uniqueKey>
...
<field name="MyId" type="string" indexed="true" stored="true" required="true" multiValued="false" />
Created a collection and associated it with the config set mentioned above (using the new Admin UI).
Loaded the CSV file using the Admin UI (Collections --> collection name drop-down --> Documents). I added the request handler parameter &rowid=MyId. My CSV file has a MyId field in it. During the load I get this error:
Document contains multiple values for uniqueKey field: MyId=[82552329, 1]
at org.apache.solr.update.AddUpdateCommand.getHashableId(AddUpdateCommand.java:168)
Without changing the unique key and just using the default id field (with an auto-generated UUID), the CSV loads fine. But I need the unique key to be MyId.
I would like to know why my key field is reported as multi-valued; my CSV does not contain multi-valued data, just simple comma-separated numeric and string fields. Please suggest what could have gone wrong.
Note: I have also made the change from Solr Schemaless Mode creating fields as MultiValued in the schema (it does not help, as the problem is in the input data).
EDIT: Adding full exception trace
https://pastebin.com/raw/juRj7ZUi
I got a clue from the documentation on CSV update parameters that the issue has to do with the &rowid=MyId parameter I was passing. The documentation states that this parameter adds the line number as an id, which explains why my key (MyId) became multi-valued ([my actual key, line number]). But if I remove the parameter, I get an error that id is not populated, meaning an id field was still expected. So I added &literal.id=1, and now everything works fine (this is because my schema has a required id field). Thanks for helping out.
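For reference, the command-line equivalent of what ended up working should look roughly like this (the collection name and file path are placeholders):
curl 'http://localhost:8983/solr/collectionname/update?commit=true&literal.id=1' --data-binary @mydata.csv -H 'Content-type:text/csv; charset=utf-8'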