Extract information from Ab Initio graphs - ab-initio

I have an Ab Initio graph with multiple subgraphs in it. I need to extract the following information about the graphs: the list of input files, output files, input/output tables, lookup files, run programs, etc. How can I automate this extraction for all graphs without doing it manually in the GDE?

You can create a script and call the commands below for the details:
air sandbox get-required-files
air sandbox run $1 -script-only
Let me know if this resolves the issue.
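If you want to run this across every graph in the sandbox at once, here is a rough sketch in Python that shells out to the air command from the answer above. The sandbox path, the .mp glob, and the keywords to grep for are assumptions you would adapt to your environment; the exact shape of the generated script depends on your graphs.
import glob
import re
import subprocess

SANDBOX_MP_DIR = "/path/to/sandbox/mp"   # assumed location of the graph (.mp) files
# Assumed patterns to pull out of the generated script; adjust to your naming conventions.
KEYWORDS = re.compile(r"(INPUT_FILE|OUTPUT_FILE|LOOKUP|TABLE|RUN_PROGRAM)", re.IGNORECASE)

for graph in sorted(glob.glob(f"{SANDBOX_MP_DIR}/*.mp")):
    # "-script-only" generates the deployed script without executing the graph.
    result = subprocess.run(
        ["air", "sandbox", "run", graph, "-script-only"],
        capture_output=True, text=True,
    )
    print(f"=== {graph} ===")
    for line in result.stdout.splitlines():
        if KEYWORDS.search(line):
            print(line.strip())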

Related

How to upload CSV data that contains newlines with dbt

I have a 3rd party generated CSV file that I wish to upload to Google BigQuery using dbt seed.
I manage to upload it manually to BigQuery, but I need to enable "Quoted newlines" which is off by default.
When I run dbt seed, I get the following error:
16:34:43 Runtime Error in seed clickup_task (data/clickup_task.csv)
16:34:43 Error while reading data, error message: CSV table references column position 31, but line starting at position:304 contains only 4 columns.
There are 32 columns in the CSV. The file contains column values with newlines. I guess that's where the dbt parser fails. I checked the dbt seed configuration options, but I haven't found anything relevant.
Any ideas?
As far as I know, the seed feature is very limited by what is built into dbt-core, so seeds are not the way I would go here. You can see the history of requests for expanding the seed options on the dbt-core issues repo (including my own request for similar optionality, #3990), but I have yet to see any real traction on this.
That said, what has worked very well for me is to store flat files within the GCP project in a GCS bucket and then use the dbt-external-tables package for very similar but much more robust file structuring. Managing this adds some overhead, I know, but it becomes well worth it if your seed files continue expanding in a way that can take advantage of partitioning, for instance.
And more importantly, as mentioned in this answer from Jeremy on Stack Overflow:
The dbt-external-tables package supports passing a dictionary of options for BigQuery external tables, which maps to the options documented here.
For your case, that should be either the quote or allowQuotedNewlines option. If you do choose to use dbt-external-tables, your source .yml for this would look something like:
gcs.yml
version: 2
sources:
  - name: clickup
    database: external_tables
    loader: gcloud storage
    tables:
      - name: task
        description: "External table of Snowplow events, stored as CSV files in Cloud Storage"
        external:
          location: 'gs://bucket/clickup/task/*'
          options:
            format: csv
            skip_leading_rows: 1
            quote: "\""
            allow_quoted_newlines: true
Or something very similar.
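One practical note if you go this route: with dbt-external-tables the external table is created/refreshed by the package's run-operation rather than by dbt seed, so after adding the source definition you would run something like:
dbt run-operation stage_external_sources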
And if you end up taking this path and storing task data on a daily partition like tasks_2022_04_16.csv, you can access that file name and other metadata via the provided pseudocolumns, also shared with me by Jeremy here:
Retrieve "filename" from gcp storage during dbt-external-tables sideload?
I find it to be a very powerful set of tools for files specifically with BigQuery.

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for an Open Source ETL or Data Processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row the software will transform the CSV into a JSON format and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic applied is very simple. Commas inside field values may break the example, and special characters may not be handled correctly. A more robust implementation could replace the simple Java String.split(",") with an existing CSV parser library such as OpenCSV.
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but I would also accept a pull request in case you fork my project.
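If you are not tied to Java, a rough Python equivalent of the per-row logic (a sketch only, not part of the linked project) could look like the following. It uses the Camunda 7 REST API's start-by-key endpoint and the third-party requests package; the engine URL, process key, and CSV filename are placeholders.
import csv
import requests

CAMUNDA_URL = "http://localhost:8080/engine-rest"   # assumed engine endpoint
PROCESS_KEY = "csv-row-process"                     # assumed process definition key

with open("input.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Map every CSV cell to a Camunda String variable.
        variables = {col: {"value": val, "type": "String"} for col, val in row.items()}
        resp = requests.post(
            f"{CAMUNDA_URL}/process-definition/key/{PROCESS_KEY}/start",
            json={"variables": variables},
        )
        resp.raise_for_status()
        print("started process instance", resp.json()["id"])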

Splitting and merging JSON files from batch jobs in AWS

I am working on a project where I am splitting a single file with a bunch of sentences into chunks for further sending to a third-party API for sentiment analysis.
The third-party API has a limit of 5,000 characters, which is why I am splitting the file into chunks of 40 sentences each. Each chunk is sent to a batch job via AWS SQS and processed for sentiment analysis by the third-party API. I want to merge all of the processed files into one file, but I couldn't work out the logic to merge them.
For example,
the input file,
chunk1: sentence1....sentence1... sentence1....
chunk2: sentence2....sentence2... sentence2....
The input file is separated into chunks. Each of these chunks is sent separately to a batch job via SQS. The batch job will call the external API for sentiment analysis. Each file will be uploaded to the S3 bucket as separate files.
Output file:
{"Chunk1": "sentence1....sentence1...sentence1....",
"Sentiment": "positive."}
All I want is to have the output in a single file, but I couldn't find the logic to merge the output files.
Logic I tried:
For each input file, I attach a UUID to every chunk as metadata and merge the results with another Lambda function. The problem is that I am not sure when all of the chunks have been processed, and therefore when to invoke the Lambda function that merges the files.
If you have any better logic to merge the files, please share it here.
This sounds like a perfect use case for the AWS Step Function service. Step Functions allow you to have ordered tasks (which can be implemented by Lambdas). One of the state types, called Map, allows you to kick off many tasks in parallel and wait for all of those to finish before proceeding to the next step.
So a quick high level state flow would be something like:
First state takes a file as input and breaks up the file into multiple chunks
The second state would be a Map state with a task that takes a file as input, sends it to the sentiment analysis API, and saves the output. The Map state will kick off a task for each small file and retrieve its sentiment analysis.
The third and final task state will take all of the files and combine them in whatever way you deem appropriate.
It may take a bit of googling and reading the user guides, but your workflow is exactly the use case this service was designed for. It sounds like you already have some of these steps implemented as their own Lambda functions; you'll just need to tweak them to be compatible with how Step Functions pass data in and out, instead of using SQS.
That being said, I'm not sure how you want to merge the files, as each section was analyzed separately and may have its own sentiment; I'm also not sure how you would summarize the sentiment as a whole.
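For the third (combine) state, a minimal sketch of a merge Lambda in Python could look like this, assuming each analyzed chunk is written to S3 as a small JSON file under a common prefix. The bucket, prefix, and event shape are assumptions; adapt them to however your Map state names its outputs.
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["bucket"]   # assumed to be passed in by the state machine
    prefix = event["prefix"]   # e.g. "results/job-123/"
    merged = []
    # Collect every chunk result written under the prefix.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            merged.append(json.loads(body))
    # Write one combined file next to the chunk results.
    merged_key = prefix.rstrip("/") + "-merged.json"
    s3.put_object(Bucket=bucket, Key=merged_key, Body=json.dumps(merged).encode("utf-8"))
    return {"merged_key": merged_key, "chunks": len(merged)}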
Resources:
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/

Managing a large SPSS (*.sav) file (4.2 GB)

I have received an SPSS file from a survey fielded by another company that allegedly contains only ~1500 respondents, but the file size has somehow ballooned to 4.2 GB. My hunch is that the file comes from a global survey and the 1,500 selected records are from the US only, so the file contains a series of blank variables, plus metadata for those variables, possibly in multiple languages/alphabets.
I only need a subset of this data, and I can likely work with it if I remove the metadata, but my issue has been that I can't get the damn thing open to cut down the number of variables. I have been using the tools at my disposal to try the following workarounds, though I'm sure there are better options:
Opening the file using PSPP (freeware SPSS) - this causes the PSPP to stop responding
Using the R command read.spss (from the foreign package) to write a .csv - this claims that the file has a duplicate variable name and won't proceed further
Using the R command spss.system.file to write a .csv - when I tried this, R spent a long time thinking as it attempted to run it, and it ran for a couple of hours with no apparent success.
Using the PSPP text conversion tool (https://pspp.benpfaff.org/) to create either a dictionary or a .csv file - both of these options crash after the file has completed uploading.
I've gone back to the other company to try to have them reduce the file size; however, I wasn't sure whether anyone else had ideas for doing either of the following:
Open the file using another program/converter that could turn it into a .csv or other similarly skinny file format
Use another program to at least read only the variable names included in the file so that I can provide the other company with the specific variables I need
The following command from PSPP should do what you need:
$ pspp-convert originalFile.sav output.csv
In case it doesn't, please provide the terminal error message.
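For your second option (reading only the variable names), another route worth trying is the Python pyreadstat package, which can load just the metadata without pulling the 4.2 GB data matrix into memory. This is a sketch, not something tested against your file; pyreadstat is an extra install (pip install pyreadstat) and the file name is a placeholder.
import pyreadstat

# metadataonly=True returns an empty dataframe plus the full variable metadata.
df, meta = pyreadstat.read_sav("originalFile.sav", metadataonly=True)

print(len(meta.column_names), "variables")
for name, label in zip(meta.column_names, meta.column_labels):
    print(name, "-", label)

# Once you know which variables you need, read_sav's usecols argument
# lets you load only that subset of columns.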

Custom dynamic inventory scripts/plugins in Ansible

Ansible allows devs to write programs (in any language) that will return JSON describing the dynamic “snapshot” of current hosts. I’m using vSphere, which is currently not supported by Ansible OSS, and so I need to write such a "custom inventory plugin".
I can handle the querying of vSphere for a list of hosts, as well as constructing the JSON that is compatible with what Ansible is expecting.
Where the documentation completely (seemingly) falls flat is:
How do I “connect” Ansible with my inventory app? That is, say my inventory app is a simple bash script (inventory.sh)... how do I configure Ansible to call bash inventory.sh and obtain JSON from it? In reality the app will likely be a Java executable (inventory.jar), but I figure that if I can get it working with bash, I can extrapolate to Java; and
How does Ansible actually capture/fetch the JSON back from the app? STDOUT? Is this all supposed to happen over an HTTP connection? Examples? How does inventory.sh or inventory.jar communicate that JSON back to Ansible?
The inventory script has to be located on the same machine where Ansible runs. It does not communicate through HTTP; Ansible will simply parse the STDOUT of your program. The location does not matter at all; you just pass the path to Ansible when you invoke it:
ansible-playbook ... -i /path/to/your/inventory.sh
To avoid passing the inventory location every time, you could add this to your ansible.cfg (under the [defaults] section):
[defaults]
inventory = /path/to/your/inventory.sh
You could also copy the script to /etc/ansible/hosts, which is the default location where Ansible looks for inventory files/scripts, but I prefer to keep things together, so I suggest placing it close to your playbooks/roles etc.
And (3) Is any of this documented, anywhere? Don't see anything in the Ansible docs...
It is not mentioned on the page Developing Dynamic Inventory Sources, but it can be seen in some of the examples on the page Dynamic Inventory. The docs are community managed and at times a little unstructured and lacking important information.
BTW, there is a VMware inventory script included. Looking at the source, I have seen that it imports some vSphere stuff. I have little experience with VMware, so I can't judge whether this is actually what you need, in which case you wouldn't have to write your own.
This is completely user defined. Typically you would write your dynamic inventory in Python and use a json dump of the output to create the inventory.
Here is an example for the use case you mentioned (vSphere): https://github.com/RaymiiOrg/ansible-vmware/blob/master/query.py
In a nutshell, you write it like a normal Python script, define the options (as he does in main), and selectively execute functions based on which options are passed. These make REST calls and return the output as a JSON dump, which Ansible can parse for use as inventory.
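To make that concrete, here is a minimal skeleton of such a script (a sketch only; the group, hosts, and hostvars are placeholders where your vSphere queries would go). Ansible calls the script with --list for the whole inventory and --host <name> for per-host variables; returning a _meta block avoids the per-host calls.
#!/usr/bin/env python
import argparse
import json

def get_inventory():
    # In a real script this data would come from your vSphere REST calls.
    return {
        "vsphere_vms": {
            "hosts": ["vm01.example.com", "vm02.example.com"],
            "vars": {"ansible_user": "deploy"},
        },
        "_meta": {
            "hostvars": {
                "vm01.example.com": {"datacenter": "dc1"},
                "vm02.example.com": {"datacenter": "dc2"},
            }
        },
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--list", action="store_true")
    parser.add_argument("--host")
    args = parser.parse_args()

    if args.host:
        # Per-host variables requested by Ansible.
        print(json.dumps(get_inventory()["_meta"]["hostvars"].get(args.host, {})))
    else:
        # Full inventory as JSON on STDOUT.
        print(json.dumps(get_inventory()))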