MarkLogic Java API batch upload files (.csv) - csv

Im trying out the MarkLogic Java API and would want to bulk upload some files with the extension .csv
I'm not sure what to use, since the Java API only supports JSON, XML, and TXT files.
How do I batch upload files using the MarkLogic Java api? Do i convert everything to JSON?

Do i convert everything to JSON?
Yes, that is a common way to do it.
If you would like additional examples of how you can wrangle CSV with the Java Client API, check out OpenCSVBatcherExample and JacksonDatabindTest.testDatabindingThirdPartyPojoWithMixinAnnotations. The first demonstrates converting the csv to XML and using a custom REST extension. The second example (well, unit test...) demonstrates converting the csv to JSON and using the batch upload (Bulk Writes) capabilities Justin linked to.

If you have CSV files on your filesystem, I’d start with mlcp, as suggested above. It will handle all of the parsing and splitting into multiple transactions/batches for you. Take a look at the mlcp documentation for more details and some example configurations.
If you’d like more control over the parsing and splitting logic than mlcp gives you out-of-the-box or you’re getting CSV from some other source (i.e. not files on the filesystem), you can use the Java Client API. The Java Client API allows you to efficiently write batches using a WriteSet. Take a look at the “Bulk Writes” example.

According to your reply to Justin, you cannot use MLCP because it is command line and you need to integrate it into a web portal.
Well, MLCP is released as open cource software under the Apache2 licence. So if you are happy with this licence, then you have the source to integrate.
But what I see as your main problem statement is more specific:
How can I create miltiple XML OR JSON documents from a CSV file [allowing the use of the java API to then upload them as documents in MarkLogic]
With that specific problem statement:
1) have a look at SplitDelimitedTextReader.java from the mlcp source
2) try some java libraries for this purpose such as http://jsefa.sourceforge.net/quick-tutorial.html

Related

document or project criteria needed to use JSON Diff?

JSON Diff is used when we run web projects that use API features. Does this JSON Diff not work on projects that don't use the API feature? Are there any special criteria?
yes, we're using JSONDiff for find a differences between 2 file code.
The following criteria for use JSON Diff:
The project using API feature.
Want to do compare 2 file with a file format JSON.

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row the software will transform the CSV into a JSON format and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as a JSON process data.
I wanted to limit the dependencies of this example. The CSV parsing logic applied is very simple. Commas in the file may break the example, special characters may not be handled correctly. A more robust implementation could replace the simple Java String .split(",") with an existing CSV parser library such as Open CSV
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.

How properties.db is used in Forge Viewer?

The sqlite database file properties.db is usually the biggest file in the output from https://extract.autodesk.io/.
What is it used for in Forge Viewer, and if it's not used, why is it available in the ZIP file?
The reason this example is copying both is that the purpose of the sample is to demo how to extract the 'bubble' from the Autodesk server. The Design File' properties are extracted in 2 formats: aka json (json.gz) and sqlLite (sdb/db).
The Autodesk Viewer only uses the json format, but other systems may prefer using sqlLite. The json approach makes it easier when you code executes in client browsers.
It is fairly easier to modify the sample to exclude the sqlLite database if you are not interested to get this file. I can point you which code you need to modify if that's something you want to do.
That file contains the components properties as a sqlite database, which are also contained in objects_xxx.json.gz. The viewer only uses the json format.
That article shows how you can easily run the extraction code your your side, it doesn't extract the .db file:
Forge SVF Extractor in Node.js

Indexing flat XML files in elasticsearch

I'm working on a specific project where external data provided by external providers is to be indexed on our ElasticSearch Engine.
The data is provided as XML flat files.
The idea here is to script something out that reads each file, parse it and launch as many HTTP POST as needed for each one of them.
Is there a simpler way to do this? something like uploading the XML file that gets indexed automatically without any script?
You can use logstash with an xml filter to do this. Takes a bit of work to get setup the first time, but it's the most straightforward way to do it.

Migrating from Lighthouse to Jira - Problems Importing Data

I am trying to find the best way to import all of our Lighthouse data (which I exported as JSON) into JIRA, which wants a CSV file.
I have a main folder containing many subdirectories, JSON files and attachments. The total size is around 50MB. JIRA allows importing CSV data so I was thinking of trying to convert the JSON data to CSV, but all convertors I have seen online will only do a file, rather than parsing recursively through an entire folder structure, nicely creating the CSV equivalent which can then be imported into JIRA.
Does anybody have any experience of doing this, or any recommendations?
Thanks, Jon
The JIRA CSV importer assumes a denormalized view of each issue, with all the fields available in one line per issue. I think the quickest way would be to write a small Python script to read the JSON and emit the minimum CSV. That should get you issues and comments. Keep track of which Lighthouse ID corresponds to each new issue key. Then write another script to add things like attachments using the JIRA SOAP API. For JIRA 5.0 the REST API is a better choice.
We just went through a Lighthouse to JIRA migration and ran into this. The best thing to do is in your script, start at the top-level export directory and loop through each ticket.json file. You can then build a master CSV or JSON file to import into JIRA that contains all tickets.
In Ruby (which is what we used), it would look something like this:
Dir.glob("path/to/lighthouse_export/tickets/*/ticket.json") do |ticket|
JSON.parse(File.open(ticket).read).each do |data|
# access ticket data and add it to a CSV
end
end