Skip faulty lines when using Solr's csv handler - csv

I want to parse a CSV file using the Solr CSV handler. The problem is that my file might contain problematic lines (lines with unescaped encapsulators). When Solr hits such a line, it fails with the following message and stops:
<str name="msg">CSVLoader: input=null, line=1941,can't read line: 1941
values={NO LINES AVAILABLE}</str><int name="code">400</int>
I understand that in that case the parser cannot fix the problematic line, and that is OK for me. I just want to skip the faulty line and continue with the rest of the file.
I tried using the TolerantUpdateProcessorFactory in my processor chain but the result was the same.
I use Solr 6.5.1, and the curl command I run is something like this:
curl '<path>/update?update.chain=tolerant&maxErrors=10&commit=true&fieldnames=<my fields are provided>,&skipLines=1' --data-binary @my_file.csv -H 'Content-type:application/csv'
Finally this is what I put in my solrconfig.xml
<updateRequestProcessorChain name="tolerant">
  <processor class="solr.TolerantUpdateProcessorFactory">
    <int name="maxErrors">10</int>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I would suggest that you pre-process and clean the data using UpdateRequestProcessors.
This is a mechanism to transform the documents that are submitted to Solr for indexing.
Read more about UpdateRequestProcessors.
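The answer points to UpdateRequestProcessors, but note that those operate on documents after the CSV has been parsed, so a parse failure in the loader may never reach them. As a cruder alternative outside Solr, here is a minimal shell sketch (assuming the encapsulator is the double quote and that records never contain embedded newlines) that drops lines with an unbalanced number of quote characters before posting:
# Keep only lines whose double-quote count is even (balanced);
# lines with a stray, unescaped encapsulator are dropped.
awk -F'"' 'NF % 2 == 1' my_file.csv > my_file.clean.csv
The cleaned file can then be posted with the same curl command as above, pointing --data-binary at my_file.clean.csv.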

Related

Loading multiple JSON records into BigQuery using the console

I'm trying to upload some data into BigQuery in JSON format using the BigQuery Console as described here.
If I have a single record in a JSON file I can upload it successfully. If I put two or more newline-delimited records in a JSON file, then I get this error:
Error while reading data, error message: JSON parsing error in row starting at position 0: Parser terminated before end of string
I tried searching Stack Overflow and Google but didn't have any luck finding information. The two records I tried to upload with newline delimiters do upload successfully as individual records in separate JSON files.
My editor must have been adding some other character to my newlines. I went back to my original JSON array of records and used:
cat test.json | jq -c '.[]' > testNDJSON.json
This fixed everything.
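For illustration, that command turns a JSON array into one complete record per line, which is the newline-delimited form the BigQuery console expects. With two hypothetical records:
echo '[{"id":1,"name":"a"},{"id":2,"name":"b"}]' > test.json
jq -c '.[]' test.json > testNDJSON.json
# testNDJSON.json now contains:
# {"id":1,"name":"a"}
# {"id":2,"name":"b"}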

Oracle SQLcl: Spool to json, only include content in items array?

I'm making a query via Oracle SQLcl. I am spooling into a .json file.
The query returns the correct data, but the format is strange.
Starting off as:
SET ENCODING UTF-8
SET SQLFORMAT JSON
SPOOL content.json
Followed by a query, this produces a JSON file as requested.
However, how do I remove the outer structure, meaning this part:
{"results":[{"columns":[{"name":"ID","type":"NUMBER"},
{"name":"LANGUAGE","type":"VARCHAR2"},{"name":"LOCATION","type":"VARCHAR2"},{"name":"NAME","type":"VARCHAR2"}],"items": [
// Here is the actual data I want to see in the file exclusively
]
I only want to spool everything in the items array, not including that key itself.
Is this possible to set as a parameter before querying? Reading the Oracle docs has not yielded any answers, hence asking here.
That's how I handle it.
After spooling the output to a file, I use the jq command to recreate the file with only the items array:
cat file.json | jq --compact-output --raw-output '.results[0].items' > items.json
Using this tool: https://stedolan.github.io/jq/
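A small end-to-end illustration with a made-up content.json (trimmed to one column and two rows so it fits here):
# Hypothetical SQLcl-style output
cat > content.json <<'EOF'
{"results":[{"columns":[{"name":"ID","type":"NUMBER"}],"items":[{"id":1},{"id":2}]}]}
EOF
jq --compact-output '.results[0].items' content.json > items.json
# items.json now holds just the items array: [{"id":1},{"id":2}]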

split big json files into small pieces without breaking the format

I'm using spark.read() to read a big JSON file on Databricks, and after a long time of running it failed with: spark driver has stopped unexpectedly and is restarting. I assumed it is because the file is too big, so I decided to split it. I used this command:
split -b 100m -a 1 test.json
This actually split my file into small pieces, and I can now read them on Databricks. But then I found that what I got is a set of null values. I think that is because I split the file only by size, so some pieces are no longer valid JSON. For example, I might get something like this at the end of a file:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the JSON file into small pieces without breaking the JSON format?
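One way to keep every piece valid, sketched here on the assumption that test.json is a top-level array small enough for jq to load, is to convert it to newline-delimited JSON first and then split on line boundaries instead of byte counts:
# One record per line, then split by line count rather than size
jq -c '.[]' test.json > test_ndjson.json
split -l 100000 -a 1 test_ndjson.json part_
# Each part_* file now ends on a record boundary, so every piece stays
# readable by spark.read.format("json").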

load csv file into BQ - too many positional args

I tried loading a sample data file [CSV] into BQ. Since the CSV has a header, I wanted to skip the first row. Following is the command:
project_id1> load prodtest.prod_det_test gs://bucketname/Prod_det.csv prodno:integer,prodname:string,instock:integer --skip_leading_rows=1
The issue: Too many positional args, still have['--skip_leading_rows=1']. How can I resolve this?
This should work:
bq load --skip_leading_rows=1 prodtest.prod_det_test gs://bucketname/Prod_det.csv prodno:integer,prodname:string,instock:integer
The -- flags come before the positional arguments.
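For reference, the rough argument order bq expects is:
bq [--global_flags] <command> [--command_flags] [args]
so global flags such as --project_id go before load, command flags such as --skip_leading_rows=1 go right after it, and the table, source URI, and schema come last.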

Solr 4.7.1 uploading CSV Document is missing mandatory uniqueKey field id

I'm new to Solr (4.7.1). I unzipped the Solr distribution and copied the schemaless example to its own directory. I then used start.jar, passing it -Dsolr.solr.home= set to the new location. Jetty came up and everything appears to be working on that front.
Now I wanted to upload/update a csv file to it. Here's what I used:
curl http://localhost:8983/solr/update/csv --data-binary @c:\solrschemaless\data.csv -H "Content-type:text/csv; charset=utf-8"
I received the following:
<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="error">
    <str name="msg">Document is missing mandatory uniqueKey field: id</str>
    <int name="code">400</int>
  </lst>
</response>
The CSV file has a column named XXXXID. I changed it to "id", "id_s", and "id_i", but I still get the same error. There are a lot of posts on SO and elsewhere, but so far I haven't seen one for the schemaless model.
EDIT: I reduced my CSV file down to this:
id,Contact,Address,Focus,Type
2,97087,1170,NULL,1
and I'm still getting the same error message of missing mandatory uniqueKey.
I'm on Windows 8.
Any ideas?
Figured it out. For some reason the id column cannot be the first column in the list. When I change
id,Contact,Address,Focus,Type
2,97087,1170,NULL,1
to this
foo,id,Contact,Address,Focus,Type
bar,2,97087,1170,NULL,1
it works.
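One possible explanation for the "id cannot be first" behaviour, offered as an assumption rather than something confirmed here, is a UTF-8 byte-order mark at the start of the file (common with Windows editors): the BOM gets attached to the first header, so Solr sees something other than a plain id. A quick check and strip with standard GNU tools (e.g. from Git Bash):
# Show the first three bytes; EF BB BF means a UTF-8 BOM is present
head -c 3 data.csv | xxd
# Remove a leading BOM in place (GNU sed)
sed -i '1s/^\xEF\xBB\xBF//' data.csv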