Import huge data from CSV into Neo4j

When I try to import huge data into neo4j it gives following error:
there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported.

It seems that some information may be lacking from your question, especially about the data being parsed, the stack trace, and so on.
Anyway, I think you can get around this by changing which character is treated as the quote character. How are you calling the import tool, and which version of Neo4j are you running?
Try including the argument --quote %. I'm making this up by just using another character, %, as the quote character. Would that help you?
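If it helps to see the idea concretely, here is a sketch using Python's csv module of producing a CSV that uses % as the quote character, so that embedded double quotes need no special treatment (the column values are made up for illustration):

```python
import csv
import io

# Write a row whose field contains a double quote, using '%' as the
# CSV quote character so that '"' is treated as a plain character.
buf = io.StringIO()
writer = csv.writer(buf, delimiter=',', quotechar='%', quoting=csv.QUOTE_ALL)
writer.writerow(['1', 'He said "hello"', 'node'])
print(buf.getvalue().strip())  # %1%,%He said "hello"%,%node%
```

A file produced this way can then be imported with --quote % so the embedded double quotes no longer trip the parser.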

Related

Spark reading multiple files : double quotes replaced by %22

I have a requirement to read random JSON files in different folders where data has changed, so I can't apply a regex pattern to read them. I know which files they are and I can list them. But when I build a string with all the file paths and try reading the JSON in Spark, the double quotes are replaced by %22 and reading the files fails. Could anyone please help?
val FilePath = "\"/path/2019/02/01/*\"" + "," + "\"/path/2019/02/05/*\"" + "," + "\"/path/2019/02/24/*\""
FilePath: String = "/path/2019/02/01/*","/path/2019/02/05/*","/path/2019/02/24/*"
Now when I use this variable to read the JSON files, it fails with an error and the quotes are replaced by %22.
spark.read.json(FilePath)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: "/path/2019/02/01/*%22,%22/path/2019/02/05/*%22,%22/path/2019/02/24/*%22
I've just tried this with an older version of Spark (1.6.0) and it works fine if you supply separate paths or wildcard patterns as varargs to the json method, i.e.:
sqlContext.read.json("foo/*", "bar/*")
When you pass multiple patterns in a single string, Spark tries to construct a single URI from them, which is incorrect, and it URL-encodes the quote characters as %22.
As an aside, trying to create a URI is failing because your string starts with a double-quote, which is an illegal character in that position (RFC 3986):
Scheme names consist of a sequence of characters beginning with a
letter and followed by any combination of letters, digits, plus
("+"), period ("."), or hyphen ("-").
Add the paths to a list (e.g. pathList) and use it as below:
spark.read.option("basePath", basePath).json(pathList: _*)
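The %22 in the error message is just ordinary URI percent-encoding of the double-quote character, which is illegal in a URI. A quick sketch with Python's standard library shows the same encoding being applied:

```python
from urllib.parse import quote

# A double quote is not allowed in a URI, so URI construction
# percent-encodes it as %22 -- exactly what shows up in the
# Spark error message when the whole string is treated as one URI.
encoded = quote('"/path/2019/02/01/*"', safe='/*')
print(encoded)  # %22/path/2019/02/01/*%22
```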

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
Indented JSON won't work well with Hadoop/Hive: to distribute processing, Hadoop must be able to tell where a record ends, so it can split a file of N bytes across W workers into W chunks of roughly N/W bytes each.
The splitting is done by the particular InputFormat in use; in the case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, any other "\n" around would confuse Hadoop, and it would give you incomplete records.
As an alternative (I wouldn't recommend it, but if you really wanted to), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through Hadoop/Hive, using a character that won't appear in the JSON (for instance, \001 or CTRL-A, commonly used by Hive as a field delimiter). That can be tricky, though, since the delimiter also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of it, or they won't be able to parse the file correctly and will need special code to do so. Keeping "\n" as the delimiter, the files remain normal text files and can be used by other tools.
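To illustrate the escaping point: a standard JSON serializer already escapes an embedded newline as \n inside the string, so each record stays on exactly one physical line. A sketch using Python's json module:

```python
import json

# Hypothetical records; the second string contains a real newline.
posts = [
    {"id": "entry1", "post": "first line\nsecond line"},
    {"id": "entry2", "post": "plain text"},
]

# json.dumps escapes the embedded newline as the two characters \n,
# so no physical newline ends up inside any record.
lines = [json.dumps(p) for p in posts]
for line in lines:
    assert "\n" not in line  # one record per physical line
print("\n".join(lines))
```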
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

SPSS Syntax to import RFC 4180 CSV file with escaped double quotes

How do I read an RFC4180-standard CSV file into SPSS? Specifically, how to handle string values that have embedded double quotes which are (properly) escaped with a second double quote?
Here's one instance of a record with a problematic value:
2985909844,,3,3,3,3,3,3,1,2,2,"I recall an ad for ""RackSpace"", but I don't recall if this was here or in another page.",200,1,1,1,0,1,0,Often
The SPSS syntax I used is as follows:
GET DATA
/TYPE=TXT
/FILE="/Users/pieter/Work/Stackoverflow/2013_StackOverflowRecoded.csv"
/IMPORTCASE=ALL
/ARRANGEMENT=DELIMITED
/DELCASE=LINE
/FIRSTCASE=2
/DELIMITERS=","
/QUALIFIER='"'
/VARIABLES= ... list of column names...
The import succeeds, but gets off track and throws warnings after encountering such values.
I'm afraid this is a bug in SPSS and therefore not possible to solve.
You might want to ask the IBM Support team about this issue and post their answer here, if you find it helpful.
One workaround would be to change the escaped double quotes in your *.csv file(s) to some other quote type. This should be little work if you use an advanced text editor such as Notepad++ or the "sed" command-line tool on UNIX-like operating systems.
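If a script is preferable to an editor, here is a sketch using Python's csv module: the reader understands RFC 4180 doubled quotes, and the writer re-emits the rows with a different qualifier (the shortened sample row and the ~ qualifier are made up for illustration; pick any character that does not occur in your data):

```python
import csv
import io

# A shortened RFC 4180 row with a doubled-quote escape inside a field.
rfc4180 = '2985909844,,3,"I recall an ad for ""RackSpace"", but..."\n'

reader = csv.reader(io.StringIO(rfc4180))  # understands "" escapes
out = io.StringIO()
# Re-write with '~' as the qualifier instead of '"'.
writer = csv.writer(out, quotechar='~', quoting=csv.QUOTE_MINIMAL)
for row in reader:
    writer.writerow(row)
print(out.getvalue().strip())
```

The rewritten file can then be imported with /QUALIFIER='~' instead of /QUALIFIER='"'.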
Trying an example in the current version of Statistics (22), doubled quote characters are handled correctly. However, if you generate the syntax with the Text Wizard, the fields are too short in the generated syntax, so you would need to increase the widths.

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character for uploading; I need to use an extended ASCII character. I tried various approaches, but they fail.
The following works, but I need to use an extended ASCII character for my purpose:
copy <tablename> (<columnnames>) from <filename> with delimiter='|' and quote='"';
copy <tablename> (<columnnames>) from <filename> with delimiter='|' and quote='~';
When I give quote='ß', I get the error below:
"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance.
A note on the COPY documentation page suggests that for bulk loading (as in your case), the json2sstable utility should be used. You can then load the sstables into your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all the characters of the ASCII table.
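As a sketch of such a conversion script (the column names and the '|' delimiter here are made up; adjust them to the real schema):

```python
import csv
import io
import json

# Hypothetical column names and delimiter -- adjust to the real schema.
columns = ['id', 'name', 'comment']
csv_data = io.StringIO('1|Ann|said "hi"\n2|Bob|likes ß\n')

# Pair each CSV field with its column name, then serialize as JSON;
# quotes and non-ASCII characters need no special handling.
rows = [dict(zip(columns, row))
        for row in csv.reader(csv_data, delimiter='|')]
json_text = json.dumps(rows, ensure_ascii=False)
print(json_text)
```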
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.

Is JSON safe to use as a command line argument or does it need to be sanitized first?

Is the following dangerous?
$ myscript '<somejsoncreatedfromuserdata>'
If so, what can I do to make it not dangerous?
I realize that this can depend on the shell, OS, utility used for making system calls (if being done inside a programming language), etc. However, I'd just like to know what kind of things I should watch out for.
Yes. That is dangerous.
JSON can include single quotes in string values (they do not need to be escaped). See the string syntax diagrams ("the tracks") at json.org.
Imagine the data is:
{"pwned": "you' & kill world;"}
Happy coding.
I would consider piping the data into the program in question (e.g. use "popen", or even a version of "exec" that passes arguments directly); this avoids the issues that result from passing through the shell. Just as with SQL, using placeholders eliminates the need to trifle with "escaping".
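As a sketch of the "pass arguments directly" approach in Python: subprocess.run with an argument list hands the JSON to the program as a single argv element, with no shell parsing involved (a tiny Python one-liner stands in for a hypothetical myscript):

```python
import json
import subprocess
import sys

payload = json.dumps({"pwned": "you' & kill world;"})

# The list form of subprocess.run does not invoke a shell, so the
# JSON arrives as argv[1] verbatim -- no quoting or escaping needed.
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.argv[1])', payload],
    capture_output=True, text=True,
)
print(result.stdout.strip())
```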
If passing through a shell is the only way, then this may be an option (it is not tested, but something similar holds for a "<script>" context):
For every character in the JSON that is either outside the range of "space" to "~" in ASCII, or has a special meaning inside the shell's '' context (such as \ and ', but excluding " or any other character, such as digits, that can appear outside of "string" data, which is a limitation of this trivial approach), encode the character using the \uXXXX JSON form. Per the limitations above, this should only encode potentially harmful characters appearing within the "strings" in the JSON, and afterwards there should be no \\ pairs, no trailing \, and no 's, etc.
It's ok. Just escape the character you use to wrap the string:
' should become '\''
So the JSON string
{"pwned": "you' & kill world;"}
becomes
{"pwned": "you'\'' & kill world;"}
and your final command, as the shell sees it, will be:
$ myscript '{"pwned": "you'\'' & kill world;"}'
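If you happen to be building the command line from Python, shlex.quote performs this escaping for you (it emits an equivalent quoting sequence rather than '\'' specifically, but the shell decodes both to the same string):

```python
import shlex

json_arg = '{"pwned": "you\' & kill world;"}'

# shlex.quote wraps the argument in single quotes and safely encodes
# any embedded single quotes, so the shell passes it through verbatim.
quoted = shlex.quote(json_arg)
print('myscript ' + quoted)
```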