Loading CSV in Neo4j: "Neo.ClientError.Statement.SemanticError: Cannot merge node using null property value for Test1"

I am using the grades.csv data from the link below:
https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html
I noticed that all the strings in the CSV file were wrapped in "", and it caused this error:
Neo.ClientError.Statement.SemanticError: Cannot merge node using null property value for Test1
so I removed the "" from the headers.
The code I was trying to run:
LOAD CSV WITH HEADERS FROM 'file:///grades.csv' AS row
MERGE (t:Test1 {Test1: row.Test1})
RETURN count(t);
error message:
Neo.ClientError.Statement.SyntaxError: Type mismatch: expected Any, Map, Node, Relationship, Point, Duration, Date, Time, LocalTime, LocalDateTime or DateTime but was List<String> (line 2, column 24 (offset: 65))
"MERGE (t:Test1 {Test1: row.Test1})

Basically, you cannot merge a node using a null property value. In your case, Test1 must be null for one or more lines in your file. If you don't see blank values for Test1, please check whether there is a blank line at the end of the file.
You can also handle the null check before the MERGE using WHERE (note that WHERE needs a preceding WITH after LOAD CSV), like:
LOAD CSV ...
WITH row WHERE row.Test1 IS NOT NULL
MERGE (t:Test1 {Test1: row.Test1})
RETURN count(t);

The issues are:
The file is missing a comma after the Test1 value in the row for "Airpump".
The file has white spaces between the values in each row. (Search for the regexp ", +" and replace with ",".)
Your query should work after fixing the above issues.
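Once the stray comma and the padding are fixed, a defensive variant of the load (a sketch; it assumes the cleaned grades.csv keeps its original headers) can also trim whitespace and skip null keys in one pass:
LOAD CSV WITH HEADERS FROM 'file:///grades.csv' AS row
WITH row WHERE row.Test1 IS NOT NULL
MERGE (t:Test1 {Test1: trim(row.Test1)})
RETURN count(t);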

Create nodes from CSV in Neo4j

I have a CSV file and I want to make two nodes with a relation (node Country -reported_on-> node Report_date). I have tried this code, but it returns empty nodes showing numbers instead of country names.
Here is what my dataset looks like:
PEOPLE_POSITIVE_CASES_COUNT;REPORT_DATE;COUNTRY_SHORT_NAME;PEOPLE_DEATH_COUNT;LIFE_EXPECTANCY;GDP;DENSITY_POPULATION;WORKFORCE
0;22.01.2020;Lesotho;0;54.836;875.353432963926;70.5616600790514
134;09.07.2020;Lesotho;1;54.836;875.353432963926;70.5616600790514
79557;02.03.2021;Zambia;1104;64.194;985.132436038869;94.4781600309238
106470;02.03.2021;Kenya;1863;66.991;1878.58070251348;94.4781600309238
Here is the code that I used:
LOAD CSV WITH HEADERS FROM "file:///dataset.csv"
as row WITH row WHERE row.COUNTRY_SHORT_NAME IS NOT NULL
MERGE (c:Country {name: row.COUNTRY_SHORT_NAME,
life_exp: row.LIFE_EXPECTANCY,
gdp: row.GDP,
density_population: row.DENSITY_POPULATION,
worforce: row.WORKFORCE } )
MERGE ( d:Report_date { date: row.REPORT_DATE } )
MERGE (c)-[:reported_on {cases_count: row.PEOPLE_POSITIVE_CASES_COUNT,
death_count: row.PEOPLE_DEATH_COUNT}]->(d)
EDIT
I changed the delimiter to ';' because that is what we had in our dataset, but we still get bad results. Here is how it looks in Neo4j after running this code:
LOAD CSV WITH HEADERS FROM "file:///dataset.csv"
as row FIELDTERMINATOR ';' WITH row WHERE row.COUNTRY_SHORT_NAME IS NOT NULL
MERGE (c:Country {name: row.COUNTRY_SHORT_NAME,
life_exp: row.LIFE_EXPECTANCY,
gdp: row.GDP,
density_population: row.DENSITY_POPULATION,
worforce: row.WORKFORCE } )
MERGE ( d:Report_date { date: row.REPORT_DATE } )
MERGE (c)-[:reported_on {cases_count: row.PEOPLE_POSITIVE_CASES_COUNT,
death_count: row.PEOPLE_DEATH_COUNT}]->(d)
I think you got confused by the node caption in the Neo4j Browser: all nodes get assigned a node id by default, and the nodes must be showing that. You can change the caption to the country name property by clicking on the node label. Screenshot for reference.
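As a side note (a sketch, not part of the fix above): LOAD CSV delivers every field as a string, and merging on all properties at once creates a new Country node whenever any numeric value differs between rows. Merging on the name alone and casting the rest avoids both issues:
LOAD CSV WITH HEADERS FROM "file:///dataset.csv" AS row FIELDTERMINATOR ';'
WITH row WHERE row.COUNTRY_SHORT_NAME IS NOT NULL
// MERGE on the key only, then SET the non-key properties with casts
MERGE (c:Country {name: row.COUNTRY_SHORT_NAME})
SET c.life_exp = toFloat(row.LIFE_EXPECTANCY),
    c.gdp = toFloat(row.GDP),
    c.density_population = toFloat(row.DENSITY_POPULATION),
    c.workforce = toFloat(row.WORKFORCE)
MERGE (d:Report_date {date: row.REPORT_DATE})
MERGE (c)-[r:reported_on]->(d)
SET r.cases_count = toInteger(row.PEOPLE_POSITIVE_CASES_COUNT),
    r.death_count = toInteger(row.PEOPLE_DEATH_COUNT)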

How can I Download to CSV in Neo4j

I've been trying to download certain data from my graph and it returns this error:
Neo.ClientError.Statement.SyntaxError: Type mismatch: expected List<Node> but was Node (line 2, column 27 (offset: 77))
"CALL apoc.export.csv.data(c,[], "contrib.csv",{})"
This is the query I did :
MATCH (c:Contrib) WHERE c.nationality CONTAINS "|"
CALL apoc.export.csv.data(c, [], "contrib.csv", {})
YIELD file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
RETURN file, source, format, nodes, relationships, properties, time, rows, batchSize, batches, done, data
What went wrong? :(
Thanks
The signature of the procedure apoc.export.csv.data is:
apoc.export.csv.data(nodes, rels, file, config)
It exports the given nodes and relationships as CSV to the provided file.
The nodes argument is a collection of nodes rather than a single node.
OLD:
MATCH (c:Contrib) WHERE c.nationality CONTAINS "|"
CALL apoc.export.csv.data(c, [], "contrib.csv", {})
NEW:
MATCH (c:Contrib) WHERE c.nationality CONTAINS "|"
WITH collect(c) AS contribs
CALL apoc.export.csv.data(contribs, [], "contrib.csv", {})
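As a standalone query the CALL also needs its results yielded; a minimal runnable version (same file name, yielding just a few of the fields listed above):
MATCH (c:Contrib) WHERE c.nationality CONTAINS "|"
WITH collect(c) AS contribs
CALL apoc.export.csv.data(contribs, [], "contrib.csv", {})
YIELD file, nodes, properties, time
RETURN file, nodes, properties, time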

Need to add the "_corrupt_record" column explicitly to the schema when doing schema validation while reading JSON via Spark

When I read JSON through Spark (using Scala):
val rdd = spark.sqlContext.read.json("/Users/sanyam/Downloads/data/input.json")
val df = rdd.toDF()
df.show()
println(df.schema)
//val schema = df.schema.add("_corrupt_record",org.apache.spark.sql.types.StringType,true)
//val rdd1 = spark.sqlContext.read.schema(schema).json("/Users/sanyam/Downloads/data/input_1.json")
//rdd1.toDF().show()
this results in the following DataFrame:
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
| appId| appTimestamp|appVersion| bankCode|bankLocale| data|date| environment| event| id| logTime| logType| msid| muid| owner|recordType| uuid|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
|services| 1 446026400000 | 2.10.4|loadtest81| en|Properties : {[{"...|user|af593c4b000c29605c90|Payment| 1|152664593|AppActivityLog|90022384526564ffc...|22488dcc8b29-235c...|productOwner|event-logs|781ce0aaaaa82313e8c9|
+--------+----------------+----------+----------+----------+--------------------+----+--------------------+-------+---+---------+--------------+--------------------+--------------------+------------+----------+--------------------+
StructType(StructField(appId,StringType,true), StructField(appTimestamp,StringType,true), StructField(appVersion,StringType,true), StructField(bankCode,StringType,true), StructField(bankLocale,StringType,true), StructField(data,StringType,true), StructField(date,StringType,true), StructField(environment,StringType,true), StructField(event,StringType,true), StructField(id,LongType,true), StructField(logTime,LongType,true), StructField(logType,StringType,true), StructField(msid,StringType,true), StructField(muid,StringType,true), StructField(owner,StringType,true), StructField(recordType,StringType,true), StructField(uuid,StringType,true))
If I want to apply validation to any further JSON I read, I take the schema as a variable and pass it to .schema as an argument [refer to the commented lines of code]. But even then the corrupt records don't go into a _corrupt_record column (which should happen by default); instead Spark parses the bad records as null in all columns, and this results in data loss, as there is no record of them.
When you add the _corrupt_record column to the schema explicitly, everything works fine and the corrupt record goes into that column. I want to know why this is so.
(Also, if you give Spark malformed JSON without a schema, it automatically handles it by creating a _corrupt_record column, so why does schema validation need the explicit column addition?)
Reading corrupt JSON data returns the schema as [_corrupt_record: string]. But you are reading the corrupt data with a schema that lacks that column, hence you get the whole row as null.
When you add _corrupt_record explicitly, you get the whole JSON record in that column and, I assume, null in all the other columns.
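A minimal sketch of that workflow (paths are placeholders; it assumes the schema is derived from a known-good file first):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType

val spark = SparkSession.builder.appName("corrupt-record-demo").getOrCreate()

// Derive the schema from a known-good file, then append _corrupt_record
val goodDf = spark.read.json("/path/to/known_good.json")
val schemaWithCorrupt = goodDf.schema.add("_corrupt_record", StringType, true)

// Read further input with the widened schema; unparsable rows keep their
// raw text in _corrupt_record instead of silently becoming all-null rows
val validated = spark.read.schema(schemaWithCorrupt).json("/path/to/input_to_validate.json")
validated.cache() // recent Spark versions require this before filtering on _corrupt_record alone
validated.filter(validated("_corrupt_record").isNotNull).show(false)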

Export to csv empty string as NULL

I'm trying to export some data from MS SQL to a CSV file using sqsh.
Assume the SQL statement is SELECT * FROM [dbo].[searchengines].
The resulting CSV is something like this:
field1,field2,field3,field4,field5
,,"Google",,
,,"Yahoo",,
,,"Altavista",,
,,"Lycos",,
What can I do to turn it into something like this:
field1,field2,field3,field4,field5
NULL,NULL,"Google",NULL,NULL
NULL,NULL,"Yahoo",NULL,NULL
NULL,NULL,"Altavista",NULL,NULL
NULL,NULL,"Lycos",NULL,NULL
I basically want to change fields that are empty into NULL.
Any ideas?
Unfortunately, the empty string in CSV output for nullable columns is hard-coded in sqsh. See src/dsp_csv.c, line 144, where the call is made:
dsp_col( output, "", 0 );
You could replace it with
dsp_fputs( "NULL", output );
and rebuild sqsh. In the next release I will come up with a more elaborate solution.
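Until then, a server-side workaround (a sketch; it assumes the five columns are, or can be cast to, character types, and sqsh may still quote the substituted value) is to emit the literal string NULL from the query itself:
SELECT COALESCE(field1, 'NULL') AS field1,
       COALESCE(field2, 'NULL') AS field2,
       COALESCE(field3, 'NULL') AS field3,
       COALESCE(field4, 'NULL') AS field4,
       COALESCE(field5, 'NULL') AS field5
FROM [dbo].[searchengines];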

Use JSON Input step to process uneven data

I'm trying to process the following with a JSON Input step:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
However this seems not to be possible:
Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) :
The data structure is not the same inside the resource!
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values).
We MUST have the same number of values for all paths.
The step provides an Ignore Missing Path flag, but it only works if all the rows miss the same path. In that case the step acts as expected and fills the missing values with null.
This limits the step's ability to read uneven data, which was really one of my priorities.
My step Fields are defined as follows:
Am I missing something? Is this the correct behavior?
What I have done is use a JSON Input step with $.address[*] to read the full map of each element into a jsonRow field, e.g.:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
This results in 4 jsonRows, one for each element, e.g. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then, using a JavaScript step, I map my values with this:
var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);
In a second script tab I inserted minified JSON parse code from https://github.com/douglascrockford/JSON-js and the getFromMap function:
function getFromMap(key, jsonRow) {
    try {
        var map = JSON.parse(jsonRow);
    } catch (e) {
        var message = "Unparsable JSON: " + jsonRow + " Desc: " + e.message;
        var nr_errors = 1;
        var field = "jsonRow";
        var errcode = "JSON_PARSE";
        _step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
        trans_Status = SKIP_TRANSFORMATION;
        return null;
    }
    if (map[key] == undefined) {
        return null;
    }
    trans_Status = CONTINUE_TRANSFORMATION;
    return map[key];
}
You can solve this by changing the JSONPath and splitting the work across two JSON Input steps. The following website explains a lot about JSONPath: http://goessner.net/articles/JsonPath/
$..AddressId
does in fact return all the AddressId's in the address array. BUT since Pentaho uses grid rows for input and output [4 rows x 3 columns], it can't handle a missing (null) value when you ask it to return all the Streets (3 values) and all the Localities (2 values) at the same time, simply because there are no null placeholders in the array itself; you can't drive out of your garage with 3 wheels on your car instead of the usual 4.
I guess your script returns values like the following, where X marks a null:
A S X
A S X
A S L
A X L
The scripting step can be avoided by changing the Fields path of the first JSON Input step to:
$.address[*]
This retrieves all 4 address lines. Then create a second JSON Input step based on the new source field containing the address line(s), to retrieve the address details per line:
$.AddressId
$.Street
$.Locality
This yields null values on the four address lines whenever an address detail is not available in a line.
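For the sample above, the two-step approach should yield rows like this (a sketch of the expected output):
AddressId  Street           Locality
1_1        A Street         null
1_101      Another Street   null
1_102      One more street  Buenos Aires
1_102      null             New York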