Ruby: malformed CSV when converting to json - json

I'm trying to convert a csv file of incident codes with their descriptions to a json file. with the following code.
require 'csv'
require 'json'
csv = File.open('incidentCodes.json').read
CSV.parse(csv).to_json
File.open("incidentCodes.json", "w") do |f|
f.write(csv)
end
Every time I run the code though it says "CSV::MalformedCSVError(Illegal quoting in line 1) "
This is the first few lines of my CSV
111, "Building fire. Excludes confined fires."
112, "Fire in structure, other than in a building. Included are fires on or in piers, quays, or pilings: tunnels or under-
ground connecting structures; bridges, trestles, or overhead elevated structures; transformers, power or utility
vaults or equipment; fences; and tents."
113, "Cooking fire involving the contents of a cooking vessel without fire extension beyond the vessel"
114, "Chimney or flue fire originating in and confined to a chimney or flue. Excludes fires that extend beyond the
chimney."
I tried bypassing using CSV.parse with other methods I saw but it was then only writing "[['incidentCodes.csv']]" to the json file.
Im very new to ruby so any help is a big help.

You have an extra space after the comma.
> CSV.parse '111, "Building fire. Excludes confined fires."'
=> CSV::MalformedCSVError: Illegal quoting in line 1.
> CSV.parse '111,"Building fire. Excludes confined fires."'
=> [["111", "Building fire. Excludes confined fires."]]

Related

Error in a line on big CSV imported to BigQuery

I'm trying to import a big CSV file to BigQuery (2.2 GB+). This is the error I get:
"Error while reading data, error message: CSV table references column position 33, but line starting at position:254025076 contains only 26 columns."
There are more errors on that file – and on that file only, out of one per state. Usually I would skip the faulty lines, but then I would lose a lot of data.
What can be a good way to check and correct the errors in a file that big?
EDIT: This is what seems to happen in the file. It's one single line and it breaks between "Instituto" and "Butantan". As a result, BigQuery parses it as one line with 26 columns and another with nine columns. That repeats a lot.
As far as I've seen, it's just with Butantan, but sometimes the first word is described differently (I caught "Instituto" and "Fundação"). Can I correct that maybe with grep on the command line? If so, what syntax?
Actually 2.2GB is quite manageble size. It can be quickly pre-processed with command line tools or simple python script on any +/- modern laptop/desktop or on a small VM in GCP.
You can start from looking at the problematic row:
head -n 254025076 your_file.csv | tail -n 1
If problematic rows just have missing values for last columns - you can use "--allow_jagged_rows" loading CSV option.
Otherwise I'm usually using simple python script like this:
import fileinput
def process_line(line):
# your logic to fix line
return line
if __name__ == '__main__':
for line in fileinput.input():
print(process_line(line))
and run it with:
cat your_file.csv | python3 preprocess.py > new_file.csv
UPDATE:
For newline characters in value - try BigQuery "Allow quoted newlines" option.

JMeter reaching EOF too early in CSV file

I have setup an SMTP sampler in JMeter that gets the body data from a csv file. It reads the first element and then stops. Any suggestions on what could be wrong?
The CSV file looks like this:
"This is
a multiline
record
"`"This is
a seond
multi line
record
"`"And this is a third record"
Result
Configuration
As per CSV Data Set Config documentation
JMeter supports CSV files with quoted data that includes new-lines.
By default, the file is only opened once, and each thread will use a different line from the file.
So the "line" with newline characters needs to start from the new line (hopefully it makes sense), you need to organize your CSV file a little bit differently to wit:
"This is
a multiline
record
"`
"This is
a seond
multi line
record
"`
"And this is a third record"
If you don't have possibility to amend your CSV file you will have to go for other options of reading the data, i.e. using JSR223 Test Elements and Groovy scripts or storing the data into the database and using JDBC Test Elements for retrieving it

CSV Data Set Config not looping

I'm using v5.1.1 of JMeter and attempting to use the "CSV Data Set Config". The file is read correctly as I can tell from the Debug Sampler/Results Tree, but the file is not being read line by line. In other words, it reads the first line and never proceeds to the next line for processing.
I would like to use the data inside the CSV to iterate over a series of HTTP Requests to an external API. I currently have a single thread with only the "CSV Data Set Config" and "HTTP Request".
Do I need to wrap this with a ForEach controller or another looping construct? Perhaps I'm missing it but I do not see in the documentation that would indicate it's necessary.
Thanks
You dont need to wrap this in a ForEach loop. First line in the CSV file is a var name:
Let's say your csv file looks like
foo, bar
1, John
2, George
3, Laura
And you use an http request sampler
then ${foo} and ${bar} will get iterated sequentially. However please make sure you are mindful about the CSV Data Set Config options. The following options works ok for me:
By default CSV Data Set Config doesn't trigged any "looping", it reads next line from the CSV file for each thread (virtual user) for each iteration.
So if you want to see more values from the CSV file - either add more users or loops or both.
Given
This CSV file:
line1
line2
line3
Following CSV Data Set Config setup:
And the following Thread Group setup:
You will get the following values (assuming __threadNum() function to visualize current virtual user number and ${__jm__Thread Group__idx} pre-defined variable to show current Thread Group iteration) :
Check out JMeter Parameterization - The Complete Guide article for more information on various approaches on parameterizing JMeter tests using external data sources

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS, then abstract a Hive table on top it using SerDe. I wonder though what would be the appropriate delimiter per JSON entry in a file? Is it new line ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
How about if we encounter a new line character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
Indented JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a records ends, so it can split processing of a file with N bytes with W workers in W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
As an alternative, I wouldn't recommend it, but if you really wanted you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, using a character that won't be in JSON (for instance, \001 or CTRL-A is commonly used by Hive as a field delimiter) but that can be tricky since it has to also be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies/uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly, and will need special code to do it, while keeping "\n" as a delimiter, the files will still be normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

Labels on Nodes and Relationships from a CSV file

I have problem when i want to add a label on a Node or to a Relatioship.
I do this in Neo4j with Cypher:
LOAD CSV WITH HEADERS FROM "file:c:/Users/Test/test.csv" AS line
CREATE (n:line.FROM)
and i get this error:
Invalid input '.': expected an identifier character, whitespace, NodeLabel, a property map, ')' or a relationship pattern (line 2, column 15 (offset: 99))
"CREATE (n:line.FROM)"
If there is not a possible way of doing this with the Cypher Language, can you recommend me an other way to do my job?
It is very important to find a solution on this problem even with a Cypher solution or any Java thing to do this job...
Depends on how dynamic you need it to be, for small variability:
LOAD CSV WITH HEADERS FROM "file:c:/Users/Test/test.csv" AS line
WHERE line.FROM = "Foo"
CREATE (n:Foo)
From Java you can use node.addLabel(DynamicLabel.label(line.from))
Otherwise you can look into my neo4j-shell-tools, which allow dynamic labels and rel-types: with #{FROM}.
see: https://github.com/jexp/neo4j-shell-tools#cypher-import
Thank you all for your answers but none of them helped me to solve my problem.
I found a solution to do exactly what i wanted. The solution was the Neo4jImporter tool (Link from official manual: Neo4jImporter tool Manual ) and not Cypher language nor Java.
So here is an example of what i have done and worked for me
A test.csv file contains the "PropertyTest" and ":LABEL". Firstly it creates one node with the label "TEST" and after the creation it adds the "proptest" property on the "TEST" node. So to add a Label on your node you use :LABEL and to add a Property on the same node you add any name you want as a header in .csv file.
Example of test.csv file:
PropertyTest,:LABEL
proptest,TEST
For windows i've done the Neo4jImport.bat command as it is described in the manual page of Neo4j.You can found the Neo4jImport.bat in Windows at "C:\Program Files\Neo4j Community\bin" and you run it from command line (cmd).
In details i opened the cmd, i followed the path to Neo4jImport.bat and finaly i wrote:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter ","
The default delimiter of Neo4jImporter is the "," but you can change it. For example if your information in .csv file is seperated with tab you can do the following:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter "TAB"
That was the way that i loaded dynamically a whole model of almost 2.000 nodes with different Labels and Properties.
Keep in mind from the manual that you can add as many labels and as many properties you want on a node by adding to your csv more headers
Example of two Labels in a node:
PropertyTest,:LABEL,:LABEL
proptest,TEST,SECOND_LABEL
Example of Neo4jImport.bat for two Labels and comma seperated CSV file:
Neo4jImport.bat --into path-to-save-your-neo4j-database --nodes path-to-your-csv\test.csv
--delimiter ","
I hope that you will find it useful to this certain problem of Labels from .csv files and please read the official manual, it helped me a lot to find a solution for my problem.
Below is the way for two csv files MIP_nodes.csv and MIP_edges.csv:
//Load csv data into the database - with dynamic label(s)
WITH "file:///MIP_nodes.csv" AS uri
LOAD CSV WITH HEADERS FROM uri AS row
WITH * WHERE row.label <> ""
call apoc.merge.node ([row.label],{nodeId:row.nodeId, name: row.name, type: row.type, created: row.created, property1: row.property1, property2: row.property2})
YIELD node as n1
//RETURN n1
WITH * WHERE row.label = ""
call apoc.merge.node (['DefaultNode'],{nodeId:row.nodeId, name: row.name, type: row.type, created: row.created, property1: row.property1, property2: row.property2})
YIELD node as n2
RETURN n1, n2
//Load csv data into the database - with dynamic relationship(s)
//:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM 'file:///MIP_edges.csv' AS row
MATCH (s)
WHERE s.nodeId = row.sourceId
//RETURN s
MATCH (d)
WHERE d.nodeId = row.destinationId
//RETURN d
CALL apoc.merge.relationship(s, row.label,{type:row.type, created: row.created, property1: row.property1, property2: row.property2},{}, d,{})
YIELD rel
//REMOVE rel.noOp;
RETURN rel;