I'm writing a Puppet (3.6.2) module that reads data fields from a CSV file via the extlookup function, and I cannot figure out how to tell extlookup that the first line is a header row. Does extlookup support this? If not, can anyone recommend an external function I could import and use?
Thanks,
Brandon
PS - Yes, I know about Hiera and about keeping the data in YAML or JSON files, but my requirement is CSV files only.
The behavior of extlookup() is pretty well documented. It makes no special provision for column headers, which are by no means an inherent feature of CSV format. Indeed, if your header line is not readable as a data line, then your file is not CSV at all.
Supposing that your file is indeed valid CSV, the absolute simplest solution would be to ignore the issue. It presents a problem only if the first column heading duplicates an actual or potential data name. If it does not, then you will never look up or use the pseudo-value represented by the first row.
If your file in fact is not CSV on account of its first line, or if the first column name conflicts with a real data name, then it seems the next best alternative would be to just remove that line, or to avoid creating it in the first place. I don't see any reason why one of these should not be possible.
I know about Hiera, and having the data in YAML or JSON files, but my requirement is CSV files only.
How sad. Do be aware that extlookup() has long been deprecated, and it was removed in Puppet 4.
I'm inclined to suggest you implement a translator from CSV to Hiera-friendly YAML, and use Hiera in your module. Alternatively, Hiera supports custom backends, and it's not too hard to write one. I am unaware of an existing CSV backend for Hiera, but you could write one. Ignoring a header line would then be under your control, and you would simultaneously achieve a measure of future-proofing.
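If you go the translator route, a minimal sketch of the idea could look like the following. It assumes a two-column key,value CSV whose first line is a header; the file names are placeholders, and the naive split does not handle quoted commas:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Hypothetical one-off converter: reads a two-column key,value CSV
    // (first line is a header) and emits Hiera-style "key: 'value'" YAML.
    // File names and the two-column layout are assumptions, not from the question.
    public class CsvToYaml {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Path.of("data.csv"));
            StringBuilder yaml = new StringBuilder("---\n");
            for (int i = 1; i < lines.size(); i++) {          // skip the header row
                String[] cells = lines.get(i).split(",", 2);  // naive split: no quoted commas
                if (cells.length == 2) {
                    yaml.append(cells[0].trim())
                        .append(": '")
                        .append(cells[1].trim().replace("'", "''"))  // escape single quotes for YAML
                        .append("'\n");
                }
            }
            Files.writeString(Path.of("common.yaml"), yaml.toString());
        }
    }

Running something like this as a pre-step would keep the YAML in sync with the CSV while letting Hiera do the lookups.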
I'm looking for ideas for Open Source ETL or data-processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row, the software will transform the row into JSON and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
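For reference, a bare-bones WatchService loop might look roughly like this; the folder path and the .csv filter are placeholder assumptions, not taken from the linked posts:

    import java.nio.file.*;

    // Minimal sketch: block until new files appear in a watched folder and
    // report any *.csv file that was created. The folder path is a placeholder.
    public class CsvFolderWatcher {
        public static void main(String[] args) throws Exception {
            Path folder = Path.of("/data/incoming");
            WatchService watchService = FileSystems.getDefault().newWatchService();
            folder.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);

            while (true) {
                WatchKey key = watchService.take();              // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path created = folder.resolve((Path) event.context());
                    if (created.toString().endsWith(".csv")) {
                        System.out.println("New CSV detected: " + created);
                        // hand the file over to the CSV-to-process-start logic here
                    }
                }
                key.reset();                                      // re-arm the key for further events
            }
        }
    }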
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic applied is very simple. Commas inside field values may break the example, and special characters may not be handled correctly. A more robust implementation could replace the simple String.split(",") with an existing CSV parser library such as OpenCSV.
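As an illustration only, an OpenCSV-based version of the row reading might look roughly like this; the file name and the header-to-value printing are assumptions, not code from the linked project:

    import com.opencsv.CSVReader;
    import java.io.FileReader;
    import java.util.List;

    // Sketch of swapping String.split(",") for OpenCSV, which copes with quoted
    // fields and embedded commas. File name and column handling are illustrative.
    public class OpenCsvExample {
        public static void main(String[] args) throws Exception {
            try (CSVReader reader = new CSVReader(new FileReader("input.csv"))) {
                List<String[]> rows = reader.readAll();
                String[] header = rows.get(0);
                for (int i = 1; i < rows.size(); i++) {
                    String[] row = rows.get(i);
                    // build the per-row variable map / JSON payload from header[j] -> row[j]
                    for (int j = 0; j < header.length && j < row.length; j++) {
                        System.out.println(header[j] + " = " + row[j]);
                    }
                }
            }
        }
    }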
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.
Hi, I'm trying to parse the files from the link below. I've tried reaching out to the owner of the data dumps, but nothing I try parses the files as proper JSON. None of the programs we use (Power BI, Jupyter, Excel, anything really) will recognise the files as JSON, and we can't figure out why. I was wondering if anyone could help figure out what the issue is here, as this dataset is very interesting to me and my fellow students. I hope I'm using the word 'parsing' correctly.
The data dumps are here:
https://files.pushshift.io/reddit/comments/
The file I downloaded (I just tried one at random) was handled just fine by jq, my preferred command-line tool for processing JSON files.
jq accepts an input consisting of a sequence of JSON objects, which is what I found when I decompressed the test file. This format is commonly known as JSON lines, and many tools can handle it. The Wikipedia article on JSON streaming contains more information and a (possibly outdated) list of tools.
If your tools can't handle more than one JSON object per input, you could turn the files into something you can handle by adding a comma to the end of every line except the last one (since each JSON object is on a single line) and then wrapping the whole input in a pair of brackets to turn the sequence into a JSON list. Since JSON does not actually care about newlines, it would be sufficient to add a line containing [ at the beginning and a line containing ] at the end. I don't know what command-line tools you have available and are comfortable with, but the task shouldn't be too difficult.
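(jq itself can do this with its --slurp/-s option.) If you would rather do it in code, here is a minimal sketch of the wrapping idea; the file names are placeholders, and it reads the whole file into memory, so it only suits files that fit in RAM (the larger Pushshift dumps likely will not):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;

    // Sketch: turn a file with one JSON object per line into a single JSON array.
    // Input and output file names are placeholders.
    public class JsonLinesToArray {
        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Path.of("comments.jsonl"));
            String array = lines.stream()
                    .filter(line -> !line.isBlank())
                    .collect(Collectors.joining(",\n", "[\n", "\n]"));
            Files.writeString(Path.of("comments.json"), array);
        }
    }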
I am trying to do some optimization in ADF. The setup: a third-party tool copies one JSON file per object to a blob storage container, and these feed a Mapping Data Flow. The individual files written by the third-party tool work great. If I copy these files to a different blob folder using an Azure Copy Data activity, the MDF can no longer parse the files and gives an error: "JSON parsing error, unsupported encoding or multiline." I started this with a Merge Files copy behavior, but the outcome is the same regardless of which copy behavior I choose.
2ND EDIT: After another day's work, I have found that the Copy activity's Merge Files from JSON to JSON definitely adds an EOL character to each single JSON object as it gets imported into the merged file. I have also found that the MDF definitely fails with those EOL characters in the merged file. If I remove all EOL characters from the merged file, the same MDF will work. For me, this is a bug: the Copy activity is adding a character that breaks the MDF. There also seems to be a second issue in some of my data, which doesn't fail as an individual file but does break the MDF when concatenated with the rest, but I have tested the basic behavior on 1-5000 files and been able to repeat the fail/success results.
I took the original file and the copied file, ran them through all sorts of tests, and here is what I eventually found when I dumped them into Notepad++:
Copied file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\r\n
Original file:
{"CustomerMasterData":{"Customer":[{"ID":"123456","name":"Customer Name",}]}}\n
If I change the copied file from ending with \r\n to \n, the MDF can read the file again. What is going on here? And how do I change the file write behavior or the MDF settings so that I can concatenate or copy files without the CRLF?
EDIT: NEW INFORMATION -- It seems on further review like maybe the minification/whitespace removal is the culprit. If I download the file created by the ADF copy and format it using a JSON formatter, it works. Maybe the CRLF -> LF issue masked something else. I'm not sure what to do at this point, but it's super frustrating.
Other possibly relevant information:
Both the source and sink JSON datasets are set to use UTF-8 (not default(UTF-8), although I tried that). Would a different encoding fix this?
I have tried remapping schemas, creating new data sets, creating new Mapping Data Flows, still get the same error.
EDITED for clarity based on comments:
In the case of a single JSON element in a file, I can get this to work -- the data preview returns the same success or failure as the pipeline when run.
In the case of multiple documents merged by ADF, I get the parsing error described above instead.
Repro: Create any valid JSON as a single file, put it in blob storage, use it as a source in a Mapping Data Flow, and do any sink operation. Create a second file with the same schema, and get them both to run in the same flow using wildcard paths. Then use a Copy activity with Merge Files as the sink copy behavior and Array of Objects as the file pattern, and try to make your MDF use this new file. If it fails, download the file created by ADF, run it through a formatter (I have used both VS Code -> "Format Document" from the standard VS Code JSON extension, and the VS 2019 "Unminify" command) and re-upload... It should work now.
I don't know if you already solved the problem: I came across the exact same problem 3 days ago, and after several tries I found a solution:
In the Copy Data activity, under sink settings, use "Set of objects" (instead of "Array of objects") for File Pattern, so that the merged big JSON has the value of each original small JSON file written on its own line.
In the MDF, after setting up the wildcard paths with the *.json pattern, under JSON settings select "Document per line" as the Document form.
After that you should be good to go; at least it solved my problem. The CRLF automatically written by the "Array of objects" setting in the Copy Data activity could stay the default, but MSFT should provide an option to omit it in the settings in the future.
According to my test:
1. The Copy Data activity can't change Unix (LF) line endings to Windows (CRLF).
2. The MDF can parse both Unix (LF) and Windows (CRLF) files.
Maybe there is something else wrong.
By the way, I see there is a comma after "name":"Customer Name" in your original file; I deleted it before my test.
I have a CSV template file with, say, 10 columns.
I would like to load this CSV template and then write data to the relevant cells (say, only 5 of the 10 cells) through a Java program.
I went through JSAPAR, SuperCSV, etc., but am not sure whether these libraries have the "stuff" I actually need.
Is there any framework that supports this kind of operation?
Check out FreeMarker: http://freemarker.org/
Open your text file.
Enter FreeMarker parameters for the required cells.
Your template file may look something like below:
"Templatetext1","text2","text4", "${myVal4}",${myVal5}","text6", ${myVal7}",${myVal8}",${myVal9}","textInCell10"
Pass in the values, and you have your CSV from the template.
If you want to generate multiple rows, you can use other directives like <#list>.
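On the Java side, a minimal sketch of rendering such a template with FreeMarker might look like the following; the directory, file names, version constant and data-model keys are illustrative assumptions:

    import freemarker.template.Configuration;
    import freemarker.template.Template;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.Writer;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical driver for the template line above: loads template.csv,
    // fills in the ${myVal...} placeholders, and writes the finished CSV.
    public class CsvFromTemplate {
        public static void main(String[] args) throws Exception {
            Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
            cfg.setDirectoryForTemplateLoading(new File("templates"));
            Template template = cfg.getTemplate("template.csv");

            Map<String, Object> model = new HashMap<>();
            model.put("myVal4", "value4");
            model.put("myVal5", "value5");
            model.put("myVal7", "value7");
            model.put("myVal8", "value8");
            model.put("myVal9", "value9");

            try (Writer out = new FileWriter("output.csv")) {
                template.process(model, out);   // renders the template with the supplied values
            }
        }
    }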
OpenCSV is generally considered the best CSV toolkit for Java. It's a very lightweight library that makes working with CSV dead simple. I would recommend looking at it, since it's not among the things you've tried yet.
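As a rough, hypothetical sketch of the "fill only some cells" use case with OpenCSV (file names and cell indexes are placeholders):

    import com.opencsv.CSVReader;
    import com.opencsv.CSVWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.util.List;

    // Read the CSV template, overwrite selected cells in the data row, write it back.
    public class FillCsvTemplate {
        public static void main(String[] args) throws Exception {
            List<String[]> rows;
            try (CSVReader reader = new CSVReader(new FileReader("template.csv"))) {
                rows = reader.readAll();
            }

            String[] dataRow = rows.get(1);   // assume row 0 is the header
            dataRow[3] = "value4";            // fill only the cells you care about
            dataRow[4] = "value5";
            dataRow[6] = "value7";

            try (CSVWriter writer = new CSVWriter(new FileWriter("filled.csv"))) {
                writer.writeAll(rows);
            }
        }
    }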
I've been trying to import a couple of .json files into LibreOffice Calc.
Although I can get the raw data in, it isn't sorting as I would think it might (by placing different pieces of info into each cell).
Does LibreOffice provide support for importing JSON files and sorting them out in cells? (In other words, import + sort)?
If there doesn't seem to be direct support for this, would converting to CSV be the next logical step in order to get the data into Calc?
Had the same problem myself (that's how I found this question).
So, for the next person finding this - the answer is no - LibreOffice Calc does not support direct import of JSON.
And the next logical step indeed is converting to CSV. There are free online JSON to CSV converters, and using one of them (http://www.convertcsv.com/json-to-csv.htm), I was easily able to make a correct CSV which Calc imports without a problem.
One possible caveat is if you have complex objects represented in JSON - that may not be convertible to CSV, but then again, if it doesn't fit into CSV, it probably doesn't fit into spreadsheet format either.
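If the objects are flat, the conversion itself is mechanical. As an illustration only (using org.json, with placeholder file names; nested objects would need flattening first):

    import org.json.JSONArray;
    import org.json.JSONObject;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch: convert a top-level JSON array of flat objects into CSV.
    public class JsonToCsv {
        public static void main(String[] args) throws Exception {
            JSONArray records = new JSONArray(Files.readString(Path.of("data.json")));
            List<String> columns = new ArrayList<>(records.getJSONObject(0).keySet());

            StringBuilder csv = new StringBuilder(String.join(",", columns)).append("\n");
            for (int i = 0; i < records.length(); i++) {
                JSONObject record = records.getJSONObject(i);
                List<String> cells = new ArrayList<>();
                for (String column : columns) {
                    // naive quoting: wrap every cell in double quotes, escape embedded quotes
                    cells.add("\"" + record.optString(column, "").replace("\"", "\"\"") + "\"");
                }
                csv.append(String.join(",", cells)).append("\n");
            }
            Files.writeString(Path.of("data.csv"), csv.toString());
        }
    }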
There's a LibreOffice GetRest plugin, with documentation written in broken English, that has a "parseJSON" formula. It won't convert JSON to CSV (without a lot of grunt work), but it might help your use case.
If you can run Python scripts in LibreOffice Calc, then it should be possible; see what's in here: http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-json-to-csv-using-python/