GROOVY - Parsing CSV: Ignore commas inside double quotes - csv

I'm looking for a Groovy regex that can parse a CSV file while ignoring commas inside double quotes.
The following regex works well in Java but not in Groovy:
it.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")
Could you please help me solve this issue?
I want to validate the format of a CSV file. For example, the format of the following file is correct:
Header1, Header2, Header3
1, 2, 3
4, "5, 6", 7
But in this case, the format is not valid:
Header1, Header2, Header3
1, 2
I checked "Groovy Split CSV", but it didn't solve my problem, because the solution shown in that article, after parsing the following CSV:
Header1, Header2, Header3
1, "2, 3", 4, 5
will match:
Header1: 1
Header2: "2, 3"
Header3: 4
and it ignores the 5! What I want instead is to print a message saying that the format is not correct.
Thanks in advance.

Try changing it like this, escaping the $ (in a Groovy double-quoted string, an unescaped $ starts an interpolation):
it.split(",(?=(?:[^\"]\"[^\"]\")[^\"]\${1})")
Let me know.
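If the goal is only to validate the number of fields per row, a CSV parser may be simpler than a split regex. The following is a minimal sketch in Python rather than Groovy, assuming the file fits in memory and that a space may follow each comma as in the samples above. The function name validate_csv is hypothetical and not part of the original question.

```python
import csv
import io


def validate_csv(text):
    """Return (line_number, message) pairs for rows whose field count
    differs from the header's field count."""
    # skipinitialspace lets a quote that follows ", " open a quoted field
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader, None)
    if header is None:
        return []  # empty input: nothing to validate
    errors = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            message = "expected %d fields, found %d" % (len(header), len(row))
            errors.append((reader.line_num, message))
    return errors


if __name__ == "__main__":
    sample = 'Header1, Header2, Header3\n1, "2, 3", 4, 5\n'
    for line_number, message in validate_csv(sample):
        print("line %d: %s" % (line_number, message))
```

With this approach the row 1, "2, 3", 4, 5 is reported as having four fields instead of being silently truncated.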

Redshift JSON Parsing

I have some JSON data in Redshift table of type character varying. An example entry is:
[{"value":["*"], "key":"testData"}, {"value":"["GGG"], key: "differentData"}]
I want to return values based on keys. How can I do this? I'm attempting to do something like
json_extract_path_text(column, 'value') but unfortunately it errors out. Any ideas?
So the first issue is that your string isn't valid JSON. There are mismatched and missing quotes. I think you mean:
[{"value":["*"], "key":"testData"}, {"value":["GGG"], "key": "differentData"}]
I don't know if this is a data issue or a transcription error, but these functions won't work unless the JSON text is valid.
The next thing to consider is that at the top level this JSON is an array, so you will need to use the json_extract_array_element_text() function to pick out an element of the array. For example:
json_extract_array_element_text('json string', 0)
So putting this together we can extract the first "value" with (untested):
json_extract_path_text(
json_extract_array_element_text(
'[{"value":["*"], "key":"testData"}, {"value":["GGG"], "key": "differentData"}]', 0
), 'value'
)
This should return the string ["*"].
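To check the expected result outside the database, the same two steps can be mirrored in plain Python. This is only an illustrative sketch, not Redshift SQL: json.loads stands in for the array-element lookup and the dictionary access stands in for the path lookup. The helper values_by_key is a hypothetical name for the "values based on keys" lookup the question asks about.

```python
import json

raw = '[{"value":["*"], "key":"testData"}, {"value":["GGG"], "key": "differentData"}]'

# Step 1: pick element 0 of the top-level array
# (analogous to json_extract_array_element_text(..., 0)).
first = json.loads(raw)[0]

# Step 2: take its "value" member and serialise it back to JSON text
# (analogous to json_extract_path_text(..., 'value')).
first_value = json.dumps(first["value"])


def values_by_key(json_text):
    """Map each element's "key" to its "value", serialised as JSON text."""
    return {item["key"]: json.dumps(item["value"]) for item in json.loads(json_text)}


if __name__ == "__main__":
    print(first_value)
    print(values_by_key(raw)["differentData"])
```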

How can I loop with multiple conditional statements in OpenRefine (GREL)

I am geocoding using OpenRefine. I pulled data from OpenStreetMap into my dataset.
I am adding a "column based on this column" for the coordinates. I want to check that the display_name contains "Rheinland-Pfalz" and, if it does, extract the latitude and longitude, i.e. pair.lat + ',' + pair.lon. I want to do this iteratively, but I don't know how. I have tried the following:
if(display_name[0].contains("Rheinland-Pfalz"), with(value.parseJson()[0], pair, pair.lat + ',' + pair.lon),"nothing")
but I want to do this for each index, from [0] up to however many there are. I would appreciate it if anyone could help.
Edit: Thanks for your answer b2m.
How would I extract the display_name corresponding to the coordinates that we get? I want the output to be display_name lat,lon for each match (i.e. each entry that contains "Rheinland-Pfalz"), because I have a different column containing a string that I want to compare with the matches generated already.
For example, using b2m's code and incorporating the display_name in the output we get 2 matches:
Schaumburg, Balduinstein, Diez, Rhein-Lahn-Kreis, Rheinland-Pfalz, Deutschland 50.33948155,7.9784308849342604
Schaumburg, Horhausen, Flammersfeld, Landkreis Altenkirchen, Rheinland-Pfalz, Deutschland 52.622319,14.5865283
For each row, I have another string in a different column. Here the entry is "Rhein-Lahn-Kreis". I want to filter the two matches above and keep only those containing the string from the other column. In this case that is "Rhein-Lahn-Kreis", but the other column's entry is different for each row. I hope this is clear, and I would greatly appreciate any help.
Assuming we have the following JSON data:
[
{"display_name": "BW", "lat": 0, "lon": 1},
{"display_name": "NRW 1", "lat": 2, "lon": 3},
{"display_name": "NRW 2", "lat": 4, "lon": 5}
]
You can extract the combined elements lat and lon with forEach and filter, using the following GREL expression, e.g. in the "Add column based on this column" dialog.
forEach(
filter(
value.parseJson(), geodata, geodata.display_name.contains("NRW")
), el, el.lat + "," + el.lon)
.join(";")
This will result in a new field with the value 2,3;4,5.
You can then split the new multi valued field on the semicolon ";" to obtain separated values (2,3 and 4,5).
Another approach would be to split the JSON Array elements into separate rows, avoiding the forEach and filter functions.
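For comparison, the same filter-then-combine logic can be sketched in plain Python, including the follow-up requirement of keeping only matches that also contain a second, per-row string. This is a sketch of the logic only, not a GREL expression. The function name matching_coordinates and its parameters needle and extra are hypothetical, and the sample data is an assumption modelled on the JSON above.

```python
import json

sample = """[
  {"display_name": "BW", "lat": 0, "lon": 1},
  {"display_name": "NRW 1", "lat": 2, "lon": 3},
  {"display_name": "NRW 2", "lat": 4, "lon": 5}
]"""


def matching_coordinates(json_text, needle, extra=None):
    """Return "lat,lon" strings for entries whose display_name contains
    needle and, when extra is given, also contains extra."""
    results = []
    for entry in json.loads(json_text):
        name = entry["display_name"]
        if needle not in name:
            continue
        if extra is not None and extra not in name:
            continue
        results.append("%s,%s" % (entry["lat"], entry["lon"]))
    return results


if __name__ == "__main__":
    # Same join step as the GREL expression
    print(";".join(matching_coordinates(sample, "NRW")))
    # Second filter, standing in for the value of the other column
    print(matching_coordinates(sample, "NRW", extra="2"))
```

Here extra plays the role of the other column's entry (for example "Rhein-Lahn-Kreis"), which differs for each row.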

How to select right values in JSON file in pyspark

I have a JSON file similar to this:
"code": 298484,
"details": {
"date": "0001-01-01",
"code" : 0
}
code appears twice: one occurrence is filled and the other one is empty. I need the first one, together with the data in details. What is the approach in PySpark?
I tried to filter like this:
df = rdd.map(lambda r: (r['code'], r['details'])).toDF()
But it shows _1, _2 (no schema).
Please try the following:
spark.read.json("path to json").select("code", "details.date")

Highcharts: How to show only one data »category« from CSV as a line

My CSV file has three columns:
Date, Values1, Values2
1880.0417, -183.0, 24.2
1880.1250, -171.1, 24.2
1880.2083, -164.3, 24.2
Of these, I want to display only the second column (Values1) as a line chart.
I could prepare a CSV in Excel with only that column and the date column. But because the file is still being worked on, it would be much easier to parse the CSV as it is and ignore the second value column (Values2).
Is that possible? I tried using the »series« parameter, but in vain.
Thanks a lot for any hints!
You can use the seriesMapping property:
data: {
...,
seriesMapping: [{
x: 0,
y: 1
}, {}]
}
Live demo: https://jsfiddle.net/BlackLabel/tyLahrow/
API Reference: https://api.highcharts.com/gantt/data.seriesMapping

Converting json like data to csv

The data file looks like the one below.
Note that the tags and values are not necessarily fixed (i.e. one file may have 10, the next 30, 31, etc.):
{
1=6487490
2=905629
3=14959
4=85
7=1
8=
9=
16=1
20=1252903800
21=557
22=13
}
I want output like this, using Python 2:
1,2,3,4,7,8,9,16,20,21,22
6487490,905629,14959,85,1,,,1,1252903800,557,13
where all the tags and values are converted to the CSV structure.
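One possible approach is to read each tag=value line, collect the tags into a header row and the values into a data row, and let the csv module do the writing. The sketch below assumes one block per file, that braces appear on their own lines, and that each remaining line contains a single "=". The function name block_to_csv is hypothetical. It is written for Python 3; on Python 2 the in-memory text buffer would need to be replaced by a byte-oriented one.

```python
import csv
import io


def block_to_csv(text):
    """Convert a block of tag=value lines into two CSV lines:
    the tags as the header row, the values as the data row."""
    tags, values = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("{", "}"):
            continue  # skip blank lines and the enclosing braces
        tag, _, value = line.partition("=")
        tags.append(tag.strip())
        values.append(value.strip())  # an empty value stays an empty field
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(tags)
    writer.writerow(values)
    return out.getvalue()


if __name__ == "__main__":
    sample = "{\n1=6487490\n2=905629\n8=\n9=\n16=1\n}"
    print(block_to_csv(sample), end="")
```

Because the header is built from whatever tags are present, the number of tags can vary from file to file.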