I am trying to convert XLSB files (Excel binary workbooks) to CSV in NiFi. I am currently using ConvertExcelToCSVProcessor, but it fails with an error. I have searched extensively and tried to make this work, but in vain. Please help.
I just looked through our code base and checked up on POI. The long and short of it is that XLSB support in POI is fairly limited at this point, and the APIs that NiFi calls don't appear to support it. As a workaround for now, you can look for a Python library that supports XLSB, write a Python script that generates XLSX or CSV from the file, and call that script with ExecuteStreamCommand.
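A minimal sketch of such a script, assuming the third-party pyxlsb package is installed and that its open_workbook accepts a file-like object (if it does not, write the bytes to a temporary file first). The function names and the choice of the first sheet are my own; nothing here comes from NiFi itself.

```python
import csv
import io
import sys


def rows_to_csv(rows, out):
    """Write an iterable of rows (lists of cell values) to `out` as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # Empty cells arrive as None; write them as empty strings.
        writer.writerow(["" if value is None else value for value in row])


def xlsb_rows(data, sheet_index=1):
    """Yield rows of cell values from XLSB bytes (pyxlsb sheet indexes start at 1)."""
    # Imported lazily so the CSV helper above works without pyxlsb installed.
    from pyxlsb import open_workbook

    # Assumption: open_workbook accepts a file-like object as well as a path.
    with open_workbook(io.BytesIO(data)) as workbook:
        with workbook.get_sheet(sheet_index) as sheet:
            for row in sheet.rows():
                yield [cell.v for cell in row]


def main():
    # ExecuteStreamCommand pipes the flow file content to stdin and
    # uses whatever the command writes to stdout as the new content.
    rows_to_csv(xlsb_rows(sys.stdin.buffer.read()), sys.stdout)
```

To use it from ExecuteStreamCommand, add a call to main() as the last line of the script and point the processor at your Python interpreter with the script path as its argument.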
I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row the software will transform the CSV into a JSON format and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic is very simple. Commas inside field values may break the example, and special characters may not be handled correctly. A more robust implementation could replace the simple Java String .split(",") with an existing CSV parser library such as OpenCSV.
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.
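To illustrate why the plain split is fragile, here is a small sketch in Python rather than Java (it is not taken from the linked project). The sample data and the function name rows_as_json are made up for illustration; the point is only the difference between splitting on commas and using a real CSV parser.

```python
import csv
import io
import json

# Made-up sample: the name field contains a comma inside quotes.
SAMPLE = 'id,name,city\n1,"Smith, Jane",Berlin\n'


def rows_as_json(text):
    """Parse CSV text with a real parser and return one JSON string per row."""
    return [json.dumps(row) for row in csv.DictReader(io.StringIO(text))]


# Naive splitting cuts the quoted name in two.
naive_fields = SAMPLE.splitlines()[1].split(",")
print(len(naive_fields))      # 4 fields, although the row has 3 columns
print(rows_as_json(SAMPLE)[0])
```

A parser library such as OpenCSV would play the same role on the Java side that the csv module plays here.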
I am trying to parse JSON data from an API into Flow, convert it into a CSV and then output the CSV to my Google Drive.
The API I am trying to work with is located here:
https://www.binance.com/api/v1/klines?symbol=BNBBTC&interval=1h&limit=24
Is this possible using Microsoft Flow? I have tried various things without much success.
Thanks in advance.
I'd say it is possible. What have you tried so far?
First, you have to get the response body. Then extract the values you need from each element, which has to be done with Flow expressions such as "body(response_body)[0]", depending on the format. Finally, feed those values into a newly created Excel file.
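For reference, the reshaping itself is simple once you see the shape of the response. This is a sketch in Python, not a Flow definition. It assumes the endpoint returns a JSON array of arrays whose first six entries are open time, open, high, low, close, and volume; the column names and the sample row are my own.

```python
import csv
import io
import json

# Column names are my own labels for the first six fields of each kline.
COLUMNS = ["open_time", "open", "high", "low", "close", "volume"]


def klines_to_csv(body):
    """Convert a klines response body (JSON array of arrays) to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(COLUMNS)
    for kline in json.loads(body):
        # Keep only the leading fields that have a column name.
        writer.writerow(kline[: len(COLUMNS)])
    return out.getvalue()


# Made-up sample row in the shape described above.
sample = '[[1500000000000, "0.001", "0.002", "0.0005", "0.0015", "1200.5", 1500003599999]]'
print(klines_to_csv(sample))
```

In Flow, the equivalent is to loop over the outer array and pick elements by index from each inner array.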
I'm trying out the MarkLogic Java API and want to bulk upload some files with the extension .csv.
I'm not sure what to use, since the Java API only supports JSON, XML, and TXT files.
How do I batch upload files using the MarkLogic Java API? Do I convert everything to JSON?
Do I convert everything to JSON?
Yes, that is a common way to do it.
If you would like additional examples of how you can wrangle CSV with the Java Client API, check out OpenCSVBatcherExample and JacksonDatabindTest.testDatabindingThirdPartyPojoWithMixinAnnotations. The first demonstrates converting the CSV to XML and using a custom REST extension. The second example (well, unit test...) demonstrates converting the CSV to JSON and using the batch upload (Bulk Writes) capabilities Justin linked to.
If you have CSV files on your filesystem, I’d start with mlcp, as suggested above. It will handle all of the parsing and splitting into multiple transactions/batches for you. Take a look at the mlcp documentation for more details and some example configurations.
If you’d like more control over the parsing and splitting logic than mlcp gives you out-of-the-box or you’re getting CSV from some other source (i.e. not files on the filesystem), you can use the Java Client API. The Java Client API allows you to efficiently write batches using a WriteSet. Take a look at the “Bulk Writes” example.
According to your reply to Justin, you cannot use MLCP because it is command line and you need to integrate it into a web portal.
Well, MLCP is released as open source software under the Apache 2.0 license. So if you are happy with this license, then you have the source to integrate.
But what I see as your main problem statement is more specific:
How can I create multiple XML or JSON documents from a CSV file [allowing the use of the Java API to then upload them as documents in MarkLogic]
With that specific problem statement:
1) Have a look at SplitDelimitedTextReader.java from the mlcp source.
2) Try a Java library built for this purpose, such as http://jsefa.sourceforge.net/quick-tutorial.html
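The splitting step itself can be sketched as follows. This is Python for brevity, not the Java Client API, and it is not how mlcp does it. The URI scheme ("/csv/row-N.json"), the function names, and the batch size are all hypothetical.

```python
import csv
import io
import json


def csv_to_documents(text, uri_prefix="/csv/row-"):
    """Yield (uri, json_string) pairs, one per CSV data row."""
    rows = csv.DictReader(io.StringIO(text))
    for number, row in enumerate(rows, start=1):
        # Hypothetical URI scheme: one document per row, numbered from 1.
        yield f"{uri_prefix}{number}.json", json.dumps(row)


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

In Java, each batch would roughly correspond to one WriteSet passed to a single bulk write call.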
I've been trying to import a couple of .json files into LibreOffice Calc.
Although I can get the raw data in, it isn't sorting as I would think it might (by placing different pieces of info into each cell).
Does LibreOffice provide support for importing JSON files and sorting them out in cells? (In other words, import + sort)?
If there doesn't seem to be direct support for this, would converting to CSV be the next logical step in order to get the data into Calc?
Had the same problem myself (that's how I found this question).
So, for the next person finding this - the answer is no - LibreOffice Calc does not support direct import of JSON.
And the next logical step indeed is converting to CSV. There are free online JSON to CSV converters, and using one of them (http://www.convertcsv.com/json-to-csv.htm), I was easily able to make a correct CSV which Calc imports without a problem.
One possible caveat is if you have complex objects represented in JSON - that may not be convertible to CSV, but then again, if it doesn't fit into CSV, it probably doesn't fit into spreadsheet format either.
There's a LibreOffice GetRest plugin with documentation written in broken English, that has a "parseJSON" formula. It won't convert JSON to CSV (without a lot of grunt work) but it might help your use case.
If you can run Python scripts in LibreOffice Calc, then it should be possible using the approach described here: http://blog.appliedinformaticsinc.com/how-to-parse-and-convert-json-to-csv-using-python/
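If you would rather convert locally than use an online converter, a rough sketch with only the Python standard library looks like this. It assumes the JSON is an array of objects; nested objects are flattened into dotted column names, and lists are not handled. The function names and sample data are made up.

```python
import csv
import io
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def json_to_csv(text):
    """Convert a JSON array of objects to CSV text."""
    records = [flatten(record) for record in json.loads(text)]
    # The union of all keys, in first-seen order, becomes the header row.
    columns = list(dict.fromkeys(key for record in records for key in record))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


# Made-up sample with one level of nesting and a missing field.
sample = '[{"id": 1, "user": {"name": "Ann", "city": "Oslo"}}, {"id": 2, "user": {"name": "Bo"}}]'
print(json_to_csv(sample))
```

Save the output to a .csv file and open it in Calc as usual.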
I have used the "extract" command, but it was never able to find as much information as FOCA found in the Excel spreadsheets I am dealing with.
For example, I am using the FOCA application to harvest and download files from the web. Afterwards, it is extracting metadata from all of the files.
With regard to Excel files, it appears that they contain more metadata than the average PDF file. FOCA is able to detect printer names, email addresses, and a few other things stored within a spreadsheet file. However, I cannot find any way to get this same information in Linux using the "extract" command.
Does anyone know a way to take a file in Linux and grab ALL of its metadata? From what I understand, the extract command may be limited.
Thanks,
Excel files store a lot of metadata within the file, so you would have to parse the file itself to get at it. Since you're on Linux and can't use the Excel interop, you could try to use an Excel library like ExcelWriter or something similar. ExcelWriter is written for .NET, so you'd have to use Mono.
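If the files are .xlsx rather than the older .xls format, parsing the file yourself is not too hard, because an .xlsx file is a ZIP archive and its document properties live in docProps/core.xml and docProps/app.xml. Below is a rough sketch using only the Python standard library. The function name is my own, and it only covers those two parts, so it will not find everything FOCA reports.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# The two standard document-property parts of an .xlsx package.
PARTS = ("docProps/core.xml", "docProps/app.xml")


def xlsx_metadata(source):
    """Return {property_name: text} from the docProps parts of an .xlsx file."""
    metadata = {}
    with zipfile.ZipFile(source) as archive:
        names = set(archive.namelist())
        for part in PARTS:
            if part not in names:
                continue
            root = ET.fromstring(archive.read(part))
            for element in root.iter():
                text = (element.text or "").strip()
                if text:
                    # Strip the "{namespace}" prefix ElementTree adds to tags.
                    metadata[element.tag.split("}")[-1]] = text
    return metadata
```

Call it with a path, for example xlsx_metadata("report.xlsx"), where the file name is a placeholder. Listing the archive with zipfile.ZipFile(path).namelist() shows the other parts you could inspect.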