Splitting and merging JSON files from batch jobs in AWS - json

I am working on a project where I split a single file containing many sentences into chunks, which are then sent to a third-party API for sentiment analysis.
The third-party API has a limit of 5000 characters per request, which is why I split the file into chunks of 40 sentences each. Each chunk is sent to a batch job via AWS SQS and processed for sentiment analysis by the third-party API. I want to merge all of the processed files into one file, but I couldn't work out the logic to merge them.
For example,
the input file,
chunk1: sentence1....sentence1... sentence1....
chunk2: sentence2....sentence2... sentence2....
The input file is separated into chunks. Each of these chunks is sent separately to a batch job via SQS. The batch job calls the external API for sentiment analysis, and each chunk's result is uploaded to the S3 bucket as a separate file.
Output file:
{"Chunk1": "sentence1....sentence1...sentence1....",
"Sentiment": "positive."}
All I want is to have the output in a single file, but I couldn't find the logic to merge the output files.
Logic I tried:
For each input file, I attach a UUID to every chunk as metadata and merge the results with another Lambda function. But the problem is that I don't know when all of the chunks have been processed, and therefore when to invoke the Lambda function that merges the files.
If you have any better logic to merge the files, please share it here.

This sounds like a perfect use case for AWS Step Functions. Step Functions let you run ordered tasks (which can be implemented as Lambdas). One of the state types, called Map, lets you kick off many tasks in parallel and wait for all of them to finish before proceeding to the next step.
So a quick high-level state flow would be something like:
The first state takes a file as input and breaks it up into multiple chunks.
The second state is a Map state whose task takes a chunk as input, sends it to the sentiment analysis API, and saves the output. The Map state kicks off a task for each small file and collects the sentiment analysis results.
The third and final task state takes all of the result files and combines them in whatever way you deem appropriate (see the sketch below).
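For the final state, a merge Lambda could be as simple as the sketch below. It assumes each Map task wrote its result to S3 under a common prefix that is passed in the state input; the bucket, prefix, and key names here are placeholders:

import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Hypothetical state input: {"bucket": "my-results-bucket", "prefix": "job-1234/"}
    bucket, prefix = event["bucket"], event["prefix"]

    merged = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            merged.append(json.loads(body))  # each chunk file holds one JSON result

    out_key = prefix.rstrip("/") + "/merged.json"
    s3.put_object(Bucket=bucket, Key=out_key, Body=json.dumps(merged).encode())
    return {"merged_key": out_key}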
It may take a bit of googling and reading the user guides, but your workflow is exactly the use case this service was designed for. It sounds like you already have some of these steps implemented as their own Lambda functions; you'll just need to tweak them to be compatible with how Step Functions pass data in and out, instead of using SQS.
That being said, I'm not sure how you want to merge the files, since each section was analyzed separately and may have its own sentiment, and I'm not sure how you would summarize the sentiment as a whole.
Resources:
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/

Related

Best data processing software to parse CSV file and make API call per row

I'm looking for ideas for open-source ETL or data-processing software that can monitor a folder for CSV files, then open and parse the CSV.
For each CSV row, the software will transform the row into JSON and make an API call to start a Camunda BPM process, passing the cell data as variables into the process.
Looking for ideas,
Thanks
You can use a Java WatchService or Spring FileSystemWatcher as discussed here with examples:
How to monitor folder/directory in spring?
referencing also:
https://www.baeldung.com/java-nio2-watchservice
Once you have picked up the CSV you can use my example here as inspiration or extend it: https://github.com/rob2universe/csv-process-starter specifically
https://github.com/rob2universe/csv-process-starter/blob/main/src/main/java/com/camunda/example/service/CsvConverter.java#L48
The example starts a configurable process for every row in the CSV and includes the content of the row as JSON process data.
I wanted to limit the dependencies of this example, so the CSV parsing logic applied is very simple. Commas inside field values may break the example, and special characters may not be handled correctly. A more robust implementation could replace the simple Java String.split(",") with an existing CSV parser library such as OpenCSV.
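As a quick illustration of that pitfall (shown in Python rather than Java, just to keep it short), a naive split breaks on quoted fields while a real CSV parser handles them; the column names here are made up:

import csv
import io
import json

line = '42,"Doe, Jane",EUR 10.50'

naive = line.split(",")                        # ['42', '"Doe', ' Jane"', 'EUR 10.50'] - wrong field count
parsed = next(csv.reader(io.StringIO(line)))   # ['42', 'Doe, Jane', 'EUR 10.50']

headers = ["id", "customer", "amount"]         # hypothetical column names
print(json.dumps(dict(zip(headers, parsed))))  # {"id": "42", "customer": "Doe, Jane", "amount": "EUR 10.50"}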
The file watcher would actually be a nice extension to the example. I may add it when I get around to it, but would also accept a pull request in case you fork my project.

Copying multiple files from one folder to another in the same S3 bucket

I am trying to copy files from one folder to another. However, the source folder has multiple folders inside it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has barely 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim - Target_folder - dev-bucket/final/20000 json files.
I tried writing below code, however, the processing time is also huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        # note: final_file should be a distinct target key per object,
        # otherwise every copy overwrites the same key
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from sftp location to base location: %s", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # list.append() returns None, so don't reassign response
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all of the files) and copy it into a new table using CREATE TABLE AS (a sketch follows below)
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
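As a rough sketch of the Athena step with boto3, assuming the crawler has already created a database and table (my_db and raw_json here are placeholder names, as are the S3 locations):

import boto3

athena = boto3.client("athena")

# CREATE TABLE AS reads every file behind the source table and writes the
# combined result to the external_location in one pass.
query = """
CREATE TABLE my_db.merged_json
WITH (format = 'JSON', external_location = 's3://dev-bucket/final/')
AS SELECT * FROM my_db.raw_json
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://dev-bucket/athena-query-results/"},
)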
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
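If you go the inventory route, a rough sketch of the copy loop could look like this (it assumes you have downloaded and unzipped one of the inventory CSV files locally, and that bucket name and key are the first two columns, which is the default layout):

import csv
import boto3

s3 = boto3.resource("s3")
target_bucket = s3.Bucket("dev-bucket")  # same bucket in this case

with open("inventory.csv", newline="") as f:       # hypothetical local file name
    for bucket_name, key, *rest in csv.reader(f):
        if not key.startswith("data/"):
            continue
        new_key = "final/" + key.split("/")[-1]    # flatten the folders into one prefix
        target_bucket.Object(new_key).copy({"Bucket": bucket_name, "Key": key})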
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.

Using Apache Nifi to collect files from 3rd party Rest APi - Flow advice

I am trying to create a flow within Apache-Nifi to collect files from a 3rd party RESTful APi and I have set my flow with the following:
InvokeHTTP - ExtractText - PutFile
I can collect the file that I am after, as I have specified this within my Remote URL; however, when I get all of the data from said file, it outputs hundreds of copies of the same file to my output directory.
3 things I need help with:
1: How do I get the flow to output the file as a readable .csv rather than just a file with no extension?
2: How can I stop the processor once I have all of the data that I need?
3: The JSON file that I have been supplied with gives me the option to get files for a certain date range:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/1553731200000
Or I can choose a specific file:
https://api.3rdParty.com/reports/v1/scheduledReports/download/877800/201904/CTDDaily/2019-04-02T01:50:00Z.csv
But how can I create a command in NiFi to automatically check for newer files, since this process will run daily and we will be downloading a new file each day?
If this is too broad, please help me by letting me know so I can edit this post.
Thanks.
Note: 3rdParty host name has been renamed to comply with security - therefore links will not directly work. Thanks.
1) You can change the filename of the flow file to anything you want using the UpdateAttribute processor. If you want it to have a ".csv" extension, then you can add a property named "filename" with a value of "${filename}.csv" (without the quotes when you enter it).
2) By default most processors have a scheduling strategy of timer-driven with a run schedule of 0 seconds, which means keep running as fast as possible. Go to the processor's configuration, and on the Scheduling tab set an appropriate schedule; it sounds like you probably want CRON scheduling to run it daily.
3) You can use NiFi expression language statements to create dynamic time ranges. I don't fully understand the syntax for the API that you have to communicate with, but you could do something like this for the URL:
https://api.3rdParty.com/reports/v1/scheduledReports/877800/${now()}
Where now() would return the current timestamp as an epoch.
You can also format it to a date string if necessary:
${now():format('yyyy-MM-dd')}
https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html

Data Masking on huge CSV files stored in AWS S3

I have huge CSV files of size ~15 GB in AWS S3 (s3://bucket1/rawFile.csv). Let's say the schema looks like this:
cust_id, account_num, paid_date, cust_f_name
1001, 1234567890, 01/01/2001, Jonathan
I am trying to mask the account number column and the customer name and create a new maskedFile.csv and store it in another aws s3 bucket (s3://bucket2/maskedFile.csv) as follows:
cust_id, account_num, paid_date, cust_f_name
1001, 123*******, 01/01/2001, Jon*******
This needs to be done just once with one snapshot of the payment data.
How can I do this, and what tools should I use to achieve it? Please let me know.
AWS Glue is AWS's managed ETL and data catalog tool, and it was made for exactly this kind of task.
You point it at the source folder on S3, tell it the destination folder where you want the results to land, and you are guided through the transformations you want. Basically, if you can write a bit of Python you can do a simple masking transform in no time.
Once that's set up, Glue will automatically transform any new file you drop into the source folder, so you have not only created the code necessary to do the masking, you have a completely automated pipeline that runs when new data arrives. I see that your case only calls for it to run once, but it's really not much easier to write the code to run just once.
To see an example of using Glue to set up a simple ETL job, take a look at: https://gorillalogic.com/blog/in-search-of-happiness-a-quick-etl-use-case-with-aws-glue-redshift/. And there are plenty of other tutorials out there to get you started.
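For instance, the core of a Glue (PySpark) job for this could look something like the sketch below. It assumes the column names from your example (without the extra spaces) and uses placeholder paths, so treat it as a starting point rather than a finished job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("s3://bucket1/rawFile.csv", header=True)

def mask(col_name):
    # keep the first 3 characters and replace the rest with '*'
    return F.expr(
        f"concat(substring({col_name}, 1, 3), "
        f"repeat('*', greatest(length({col_name}) - 3, 0)))"
    )

masked = (df.withColumn("account_num", mask("account_num"))
            .withColumn("cust_f_name", mask("cust_f_name")))

masked.write.csv("s3://bucket2/maskedFile/", header=True, mode="overwrite")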
You can try FileMasker.
It will mask CSV (and JSON) files in S3 buckets.
You can run it as an AWS Lambda function although the Lambda restrictions will limit the input file sizes to a few GB each.
If you can split the input files into smaller files then you'll be fine. Otherwise, contact the vendor for options. See https://www.dataveil.com/filemasker/
Disclaimer: I work for DataVeil.

Abinitio graph extract Information

I have an Abinitio graph with multiple subgraphs in it. I need to extract the following information about the graphs: the list of input files, output files, input/output tables, lookup files, run programs, etc. How can I automate this extraction across all graphs without doing it manually in the GDE?
You can create a script that calls the commands below for those details:
air sandbox get-required-files
air sandbox run $1 -script-only
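For example, a small wrapper along these lines could loop over the graphs in a sandbox and capture the generated scripts for parsing. This is a rough sketch: the sandbox path and the mp/*.mp layout are assumptions, and it assumes -script-only writes the generated script to standard output.

import glob
import subprocess

SANDBOX = "/path/to/sandbox"  # placeholder

# list the files the sandbox requires
subprocess.run(["air", "sandbox", "get-required-files"], cwd=SANDBOX, check=True)

# generate the run script for every graph without executing it
for graph in glob.glob(f"{SANDBOX}/mp/*.mp"):
    result = subprocess.run(
        ["air", "sandbox", "run", graph, "-script-only"],
        cwd=SANDBOX, capture_output=True, text=True,
    )
    with open(graph + ".ksh", "w") as out:
        out.write(result.stdout)  # parse this output for input/output files, tables, lookups, etc.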
Let me know if this resolves the issue.