Neo4j CSV Load: How to avoid Null and escape characters - csv

I am trying to load a large volume of data into a graph using a LOAD CSV script (xyx.cpl) and Neo4jShell.
Mostly it is doing well, but sometimes I receive the following errors:
Cannot merge node using null property value ...
Error related to escape characters
So I am seeking assistance on the best way to handle these issues in the import script.
Thanks in advance

Cannot merge node using null property value
You can use a WITH statement to filter out rows that have a null value for the property you are using in the MERGE. For example:
LOAD CSV WITH HEADERS FROM "file:///file.csv" AS row
WITH row WHERE row.name IS NOT NULL
MERGE (p:Person {name: row.name})
SET p.age = row.age
...
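Alternatively, if you prefer to clean the file before LOAD CSV ever sees it, a small Python sketch along these lines could drop the rows with a missing merge key up front (it assumes the same file.csv and name column as the example above):
# sketch: drop rows whose merge key is empty before handing the file to LOAD CSV
# assumes the key column is "name" and the input file is file.csv
import csv

with open("file.csv", newline="", encoding="utf-8") as src, \
        open("file_clean.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if (row.get("name") or "").strip():  # keep only rows with a non-empty key
            writer.writerow(row)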
Error related to escape characters
Can you be a bit more specific about the error you are getting / show a Cypher and data example?
Without seeing your specific error or code, here is some info that might help:
the character for string quotation within your CSV file is a double quote "
the escape character is \
more info and some examples can be found in the Neo4j documentation for LOAD CSV
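If you generate the CSV yourself, one simple way to stick to those rules is to let a CSV library handle the quoting. A minimal Python sketch (the file and column names are made up for illustration):
# sketch: let Python's csv module quote fields so LOAD CSV sees valid values
# embedded double quotes are written as "" inside a quoted field
import csv

rows = [
    {"name": 'Jane "JJ" Doe', "bio": "likes commas, quotes and\nnewlines"},
]
with open("people.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "bio"], quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(rows)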

Related

Using a custom row delimiter in Azure Data Factory Copy Activity fails

I have a requirement to produce test data files using the same formatting as the source files they are supposed to mimic (row delimiter, column delimiter, encoding, etc.). There is a process that reads data from an MS SQL database and creates files as output.
I have made a dataset with parameters that can supply the definition of the dataset at runtime. The problem is that the following error is raised on execution:
Copy activity doesn't support multi-char or none row delimiter.
The parameter that is causing the error is the row delimiter. I have tried:
\r\n
\r
\n
n
r
r,n
\r,\n
I read Custom Row Delimiter in Azure Data Factory (ADF), where someone says they were able to make a similar solution work.
I can output a file using either r or n, but then there is no separation of data across lines. I also read in another post that this is not supported, which is hard to believe, because the default option produces exactly this row delimiter behavior.
As per the official documentation: "Currently, row delimiter as empty string is only supported for mapping data flow but not Copy activity."
So, passing "\r\n" as a parameter actually passes the literal characters "\\r\\n"; the backslashes are escaped rather than interpreted as control characters (the short sketch at the end of this answer illustrates the difference).
And you can circumvent the error by editing the source code.
There was a similar question about this before; it's not the solution you're looking for, but at least it tells you how to get around the error and why you've got the error.
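The escaping issue is easier to see outside ADF. A small Python illustration of the difference between the escaped literal and the real control characters (purely illustrative, not ADF code):
# sketch: the two-character escape sequences vs. the real control characters
literal = "\\r\\n"   # what the parameter ends up containing: backslash, r, backslash, n
real = "\r\n"        # an actual carriage return + line feed
print(len(literal), len(real))          # 4 vs. 2
print(literal.encode("unicode_escape"))
print(real.encode("unicode_escape"))
# if you post-process such a value yourself, the literal can be turned back into control chars:
decoded = literal.encode().decode("unicode_escape")
print(decoded == real)                  # True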

Bigquery - Handle Double Quotes & Pipe Field Separator in CSV (Federated Table)

I am currently facing issues loading data into BigQuery, or even creating a federated table, as the incoming data is delimited by the pipe symbol (|) and has backslash-escaped double quotes on fields inside the file.
Sample data (I also tried escaping double-quote values with doubled double quotes, i.e. "", at the field level):
13|2|"\"Jeeps, Trucks & Off-Roa"|"JEEPSTRU"
Create DDL
CREATE OR REPLACE EXTERNAL TABLE `<project>.<dataset>.<table>`
WITH PARTITION COLUMNS (
dt DATE
)
OPTIONS (
allow_jagged_rows=true,
allow_quoted_newlines=true,
format="csv",
skip_leading_rows=1,
field_delimiter="|",
uris=["gs://path/to/bucket/table/*"],
hive_partition_uri_prefix='gs://path/to/bucket/table'
)
Query
SELECT
*
FROM
`<project>.<dataset>.<table>`
WHERE field_ like '%Jeep%'
Error
Error while reading table: <project>.<dataset>.<table>, error message: Error detected while parsing row starting at position: 70908. Error: Data between close double quote (") and field separator.
However, it works if I create the table with an empty quote character (quote=""), which makes it hard to filter in the SQL query.
I need the field_ data to be loaded as "Jeeps, Trucks & Off-Roa
I tried to find answers in various documentation pages and Stack Overflow questions, but since everything is old or not working (or I am just unlucky), I am posting this question again.
I have a very basic question: what is the best way to escape double quotes in a column of a federated BigQuery table so that this problem is avoided without preprocessing the raw CSV/PSV data?
This is not a problem with external tables or BigQuery, but rather a feature of CSV files. I had a similar issue once when I uploaded data to a table through the UI. I found some sources (which I cannot find right now) saying that double quotes should be doubled ("") in a CSV file to get this behavior, as in your example:
13|2|"""Jeeps, Trucks & Off-Roa"|"JEEPSTRU"
I have tested it on your sample. When I loaded the data into a table from the CSV I got the same error, and after applying the change above it worked as expected. The resulting field value is:
"Jeeps, Trucks & Off-Roa
I suppose it will work for you as well.
EDIT: I found this in the basic rules of CSV on Wikipedia:
Each of the embedded double-quote characters must be represented by a pair of double-quote characters.
1997,Ford,E350,"Super, ""luxurious"" truck"
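If you ever do fall back to a one-off preprocessing pass (the question asks to avoid it, so treat this purely as a sketch), converting the backslash-escaped quotes into doubled quotes could be done with Python's csv module; the file names here are placeholders:
# sketch: rewrite \" escapes into the CSV-standard "" before the file lands in GCS
# the input is pipe-delimited with backslash-escaped quotes
import csv

with open("raw.psv", newline="", encoding="utf-8") as src, \
        open("clean.psv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter="|", quotechar='"',
                        escapechar="\\", doublequote=False)
    writer = csv.writer(dst, delimiter="|", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL)  # doubles embedded quotes by default
    for row in reader:
        writer.writerow(row)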

Difficulties creating CSV table in Google BigQuery

I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000 - 10,000 rows of data, depending on the month.
The error I am getting from the job history in the BigQuery API is:
Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2949; errors: 1. Please look into the
errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.
This usually points to an error in the structure of the data source (in this case your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows in the CSV before running the export.
Maybe something like:
cat myfile.csv | awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }'
Or you can tie it to a condition (let's say your number of columns should be 5):
ncols=$(cat myfile.csv | awk -F, '{ print NF }' | sort -u); if [ "$ncols" = "5" ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted. As a result, a single typo can confuse BQ into thinking there are thousands of errors. Let's say you have the following CSV file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma, everything in it is shifted one column over. With a large file, this results in thousands of errors as BigQuery attempts to insert strings into INT/FLOAT columns.
I suggest you run your CSV file through a CSV validator before uploading it to BQ. It might find something that breaks it. It's even possible that one of your fields has a comma inside the value, which breaks everything.
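If you want a check that, unlike the awk one-liner earlier, is not confused by commas inside quoted values, a small Python sketch like this could report the offending rows (the file name is a placeholder):
# sketch: report any row whose field count differs from the header's
import csv

with open("myfile.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != len(header):
            print(f"line {lineno}: expected {len(header)} columns, got {len(row)}: {row}")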
Another thing to check is whether all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
As mentioned by Scicrazed, this issue seems to occur because some file rows have an incorrect format, in which case you need to validate the content in order to figure out the specific error that is causing the issue.
I recommend checking the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus Stackdriver logs, which contains the same complete error data reported by the service.
I'm probably too late for this, but it seems the file has some errors (it could be a character that cannot be parsed, or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do it is by running this command on the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times, you will need to manipulate your CSV file before you upload it.
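For reference, a rough sketch of raising the error allowance with the BigQuery Python client library; the project, dataset, table and bucket path are placeholders:
# sketch: allow a handful of bad rows when loading a CSV from GCS
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    max_bad_records=10,  # tolerate up to 10 malformed rows
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/myfile.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # raises if the job still fails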
Hope it helps

Error code: Invalid in Loading Data on BigQuery

I have a large CSV file (nearly 10,000 rows) and I am trying to upload it to BigQuery, but it gives me this error:
file-00000000: CSV table references column position 8, but line starting at position:622 contains only 8 columns. (error code: invalid)
Can anyone please tell me a possible reason for it? I have double-checked my schema and it looks alright.
Thanks
I had this same issue when trying to import a large dataset from a CSV into a BigQuery table.
The issue turned out to be some ASCII control characters (\b, \t, \r, \n) in the data that was written to the CSV. When the CSV was sent to BigQuery, these characters caused the BigQuery CSV parser to misinterpret the lines and break, because the data no longer matched the number of columns in the header.
Replacing these characters with a space (to preserve formatting as best as possible) allowed me to import the data without further issues.
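A minimal sketch of that clean-up in Python, assuming the file is still parseable field by field (the file names are placeholders):
# sketch: replace stray control characters inside fields with a space
import csv
import re

CONTROL = re.compile(r"[\b\t\r\n]")  # inside a character class, \b is the backspace character

with open("myfile.csv", newline="", encoding="utf-8") as src, \
        open("myfile_clean.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([CONTROL.sub(" ", field) for field in row])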
The error message suggests that the load job failed because at least one row has fewer columns than the automatically detected schema dictates.
Add
allow_jagged_rows=true
in the options.
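If you load via the Python client library instead of the console, the equivalent setting would look roughly like this:
# sketch: the same option in a load job configuration
from google.cloud import bigquery

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    allow_jagged_rows=True,  # missing trailing columns are treated as NULLs
)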

Pentaho Kettle conversion from String to Integer/Number error

I am new to Pentaho Kettle and I am trying to build a simple data transformation (filter, data conversion, etc). But I keep getting errors when reading my CSV data file (whether using CSV File Input or Text File Input).
The error is:
... couldn't convert String to number : non-numeric character found at
position 1 for value [ ]
What does this mean exactly and how do I handle it?
Thank you in advance
I have solved it. The idea is similar to what @nsousa suggested, but I didn't use the Trim option because I tried it and it didn't work in my case.
What I did was specify that if the value is a single space, it is set to null: in the Fields tab of the Text File Input step, set the Null if column to a space.
That value looks like an empty space. Set the Format of the Integer field to # and set the trim type to both.