Is there such a thing as an event, hook, or trigger that fires when a new file or file version is added to HDFS? I'm working on a system where we would like to write JSON files to HDFS. The clients writing to HDFS should be ignorant of the fact that a trigger fires to do downstream work with the JSON file.
Please feel free to tweak my question if the language I'm using doesn't fit the correct Hadoop terms.
I think you can use Oozie for this. Check out coordinator input events.
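For example, an Oozie coordinator can be made to wait for data to appear in HDFS and only then launch a workflow, so the writing clients never know a trigger exists. A minimal sketch, in which the dataset path, frequency, and workflow location are all hypothetical:

<coordinator-app name="json-trigger" frequency="${coord:minutes(5)}"
                 start="2015-01-01T00:00Z" end="2025-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- one directory of JSON files per 5-minute window -->
    <dataset name="json-in" frequency="${coord:minutes(5)}"
             initial-instance="2015-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs:///data/json/${YEAR}${MONTH}${DAY}${HOUR}${MINUTE}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- the workflow is held back until the current instance exists -->
    <data-in name="input" dataset="json-in">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs:///apps/json-wf</app-path>
    </workflow>
  </action>
</coordinator-app>

Oozie polls for the _SUCCESS done-flag on its own, so the downstream work starts without the writers doing anything beyond dropping the flag file when a directory is complete.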
I just deployed my latest project to Heroku, and it includes a JSON file that behaves like a database. You can add and delete data through the UI.
If I want to work on the project and push it again, and meanwhile the JSON file has been modified on the server, won't it be overwritten by the version in my development environment?
At first I thought that if I don't modify the file manually, it won't be staged, but what if I do want to update it manually?
In that case I assume I need to clone, modify, and push it again, but if someone updates it through the UI in the meantime, their data will be deleted.
So maybe the UI should be blocked from updating the JSON file while it is being modified manually?
Am I getting this wrong? Is there a better way to do this?
I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row, the software will transform the row into JSON and make an API call to start a Camunda BPM process, passing the cell data into the process as variables.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring's FileSystemWatcher, as discussed with examples here:
How to monitor folder/directory in spring?
which also references:
https://www.baeldung.com/java-nio2-watchservice
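A minimal sketch of the WatchService approach, assuming new CSVs land in a hypothetical /data/inbox directory:

import java.nio.file.*;

public class CsvFolderWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/data/inbox");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take();               // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dir.resolve((Path) event.context());
                if (created.toString().endsWith(".csv")) {
                    // hand the file to the CSV parser / process starter here
                    System.out.println("New CSV: " + created);
                }
            }
            key.reset();                                 // re-arm the key for further events
        }
    }
}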
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and passes the content of the row into the process as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic is very simple: commas inside field values may break it, and special characters may not be handled correctly. A more robust implementation could replace the simple Java String.split(",") with an existing CSV parser library such as OpenCSV.
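For illustration, the per-row loop with OpenCSV might look like this (the file path is a placeholder; quoted commas and escapes are handled by the library):

import com.opencsv.CSVReader;
import java.io.FileReader;

public class CsvParseExample {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("/data/inbox/rows.csv"))) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                // each cell in 'row' would become a process variable
            }
        }
    }
}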
The file watcher would be a nice extension to the example. I may add it when I get around to it, but I would also accept a pull request if you fork the project.
I want to have Solr watch a JSON file and automatically update an index from it. Is this doable, and if so, what is the best way?
No, Solr doesn't have any mechanism for watching for a file to change. You can, however, work around this, depending on your OS, by having a small program watch the file or directory for changes and then submit the JSON document to Solr.
See How to execute a command whenever a file changes on Superuser.
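On Linux, for instance, a couple of lines of shell can glue this together. A rough sketch, assuming inotify-tools is installed, Solr is listening on localhost:8983, and the core is named mycore (all of these are assumptions, and the update endpoint varies with the Solr version):

# re-post the JSON document every time it is written
inotifywait -m -e close_write /data/docs.json |
while read -r path events; do
  curl 'http://localhost:8983/solr/mycore/update/json/docs?commit=true' \
       -H 'Content-Type: application/json' \
       --data-binary @/data/docs.json
done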
I want to add scraped data to my database. I like the fact that the API enables validation, but I assume the overhead is too high. I'm writing maybe 10k rows at a time, at most. Is that accurate?
Alright, one other issue I was having, which was preventing me from testing this hypothesis, is that I'm currently unable to import my models module: I get an error claiming that DJANGO_SETTINGS_MODULE is undefined.
My django.wsgi script does define it, and it works within the context of Django. I assume that when I execute a Python file from the command line, the .wsgi script is not run. Again, assumptions, I know.
Do I have to add my Django project to my PYTHONPATH in my .bashrc to make this work?
You'll need to make the project importable and point Django at its settings module in ~/.bashrc if you want to use the ORM from a script:
export PYTHONPATH=$PYTHONPATH:/path/to/django/project
export DJANGO_SETTINGS_MODULE=settings
or, with the parent directory on the path instead:
export PYTHONPATH=$PYTHONPATH:/path/to/django
export DJANGO_SETTINGS_MODULE=project.settings
Note that DJANGO_SETTINGS_MODULE takes a dotted Python module path, not a filesystem path.
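Alternatively, a standalone script can set all of this up itself before importing any models. A minimal sketch, with hypothetical app and model names:

import os, sys

sys.path.append('/path/to/django/project')      # make the project importable
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')

# on Django 1.7+ you would also need:
# import django; django.setup()

from myapp.models import MyModel                # hypothetical app/model
print(MyModel.objects.count())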
Is there an iSeries command to export the data in a table to CSV format?
I know about the Windows utilities, but since this needs to be run automatically I need to run this from a CL program.
You can use CPYTOIMPF and specify the TOSTMF option to place a CSV file on the IFS.
Example:
CPYTOIMPF FROMFILE(DBFILE) TOSTMF('/outputfile.csv') STMFCODPAG(*PCASCII) RCDDLM(*CRLF)
If you want the data to be downloaded directly to a PC, you can use the "Data Transfer from iSeries" function of IBM iSeries Client Access to create a .CSV file. In the file output details dialog, set the file type to Comma Separated Variable (CSV).
You can save the transfer description to be reused later.
You could use a trigger. The iSeries Client Access software won't do, since that is a Windows application; as I understand it, you need the data to be exported each time the file is written. Check this link to learn more about triggers.
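For example, a trigger program can be attached to the physical file with ADDPFTRG so that it runs after every insert (library and program names are hypothetical):

ADDPFTRG FILE(MYLIB/DBFILE) TRGTIME(*AFTER) TRGEVENT(*INSERT) +
         PGM(MYLIB/TRGPGM)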
You are going to need FTP to perform that action.
If your iSeries shop uses ZMOD/FTP, your shortest solution is a few lines of code away, three lines to be exact: Start FTP, Put DBF, and finally, End FTP.
If you don't use ZMOD/FTP:
- You could use native FTP/400 to accomplish what you need to do, but it is quite involved.
- You will probably need an RPGLE program to parse, format, and move the data into a flat file, then use native FTP/400 to send that file out.
- And yes, a CL program will be needed as a wrapper.
You can do it all in one very simple CL program:
CPYTOIMPF the file TOSTMF -> the CSV file will be in the IFS
FTP the file elsewhere (to a server or a PC)
It works like a charm.
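Roughly, such a CL program could look like this (library, member, and host names are hypothetical; the FTP subcommands, such as user, put, and quit, are read from the member named in the override):

PGM
  /* Step 1: dump the table to a CSV stream file in the IFS */
  CPYTOIMPF  FROMFILE(MYLIB/DBFILE) TOSTMF('/outputfile.csv') +
             STMFCODPAG(*PCASCII) RCDDLM(*CRLF)
  /* Step 2: send the stream file elsewhere via FTP */
  OVRDBF     FILE(INPUT) TOFILE(MYLIB/QCLSRC) MBR(FTPCMDS)
  FTP        RMTSYS('target.example.com')
  DLTOVR     FILE(INPUT)
ENDPGM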