Data Masking on huge CSV files stored in AWS S3

I have huge CSV files of size ~15GB in AWS S3 (s3://bucket1/rawFile.csv). Let's say the schema looks like this:
cust_id, account_num, paid_date, cust_f_name
1001, 1234567890, 01/01/2001, Jonathan
I am trying to mask the account number column and the customer name and create a new maskedFile.csv and store it in another aws s3 bucket (s3://bucket2/maskedFile.csv) as follows:
cust_id, account_num, paid_date, cust_f_name
1001, 123*******, 01/01/2001, Jon*******
This needs to be done just once with one snapshot of the payment data.
How can I do this, and what tools should I use to achieve it? Please let me know.

AWS Glue is AWS' managed ETL and data catalog tool, and it was made for exactly this kind of task.
You point it to the source folder on S3, tell it the destination folder where you want the results to land, and you are guided through the transformations you want. Basically, if you can write a bit of Python you can do a simple masking transform in no time.
Once that's set up, Glue will automatically transform any new file you drop into the source folder, so you have not only created the code necessary to do the masking, you have a completely automated pipeline that runs when new data arrives. I see that your case only calls for it to run once, but writing the code as a one-off isn't really much easier than setting up the job.
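To give a flavour of that Python, here is a minimal, untested sketch of the masking step written as a plain PySpark script (which is essentially what a Glue Python job runs). The bucket paths come from your question; the column names and the "keep the first three characters" rule are assumptions based on your sample row:

# Rough sketch only; not a complete or tested Glue job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mask-csv").getOrCreate()

df = spark.read.csv("s3://bucket1/rawFile.csv", header=True)

def mask(col_name):
    # Keep the first 3 characters and replace the remainder with '*'
    return F.expr(
        f"concat(substring({col_name}, 1, 3), "
        f"repeat('*', greatest(length({col_name}) - 3, 0)))"
    )

masked = (df
          .withColumn("account_num", mask("account_num"))
          .withColumn("cust_f_name", mask("cust_f_name")))

masked.write.csv("s3://bucket2/maskedFile.csv", header=True, mode="overwrite")

One caveat: Spark writes a folder of part files under s3://bucket2/maskedFile.csv rather than a single object; adding .coalesce(1) before the write would force one file at the cost of parallelism.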
To see an example of using Glue to set up a simple ETL job, take a look at: https://gorillalogic.com/blog/in-search-of-happiness-a-quick-etl-use-case-with-aws-glue-redshift/. And there are plenty of other tutorials out there to get you started.

You can try FileMasker.
It will mask CSV (and JSON) files in S3 buckets.
You can run it as an AWS Lambda function although the Lambda restrictions will limit the input file sizes to a few GB each.
If you can split the input files into smaller files then you'll be fine. Otherwise, contact the vendor for options. See https://www.dataveil.com/filemasker/
Disclaimer: I work for DataVeil.
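As a rough idea of the splitting step mentioned above (independent of any particular masking tool), something like the following could break a large CSV into Lambda-friendly pieces. The chunk size, output prefix and "repeat the header on every piece" choice are assumptions for illustration:

# Untested sketch: stream the big CSV from S3 and write it back out in
# smaller pieces, each with the original header line.
import boto3

s3 = boto3.client("s3")
LINES_PER_CHUNK = 1_000_000  # tune so each piece stays well under the size limit

body = s3.get_object(Bucket="bucket1", Key="rawFile.csv")["Body"]
lines = body.iter_lines()
header = next(lines)

chunk, part = [], 0

def flush(chunk, part):
    s3.put_object(Bucket="bucket1",
                  Key=f"split/rawFile_part{part:04d}.csv",
                  Body=b"\n".join([header] + chunk) + b"\n")

for line in lines:
    chunk.append(line)
    if len(chunk) >= LINES_PER_CHUNK:
        flush(chunk, part)
        chunk, part = [], part + 1

if chunk:  # write the final partial piece
    flush(chunk, part)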

Related

How to create a JSON from a template?

I have a JSON file that is used in monitoring software to monitor a specific device.
I want to monitor a similar device for which I don't have a JSON file.
I know everything about the new device that is needed to "fill in the blanks" in the existing JSON structure.
A brute-force approach would be to create a script that reads the new device's data from a hand-crafted input file and outputs the nodes and leaves in the same JSON structure.
Before I take this path I would like to know if there is a tool that could help with this task. Surely this "wheel" is not new, and I don't want to reinvent it.
Does anyone know of a tool that uses a JSON file as a template and generates another one with values taken from another source?
I am on Linux and can write scripts in bash and Python.
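(For what it's worth, the brute-force route described above could be as small as the sketch below. The file names, the flat "dotted.path=value" input format and the helper function are made-up illustrations, not an existing tool.)

# Hypothetical sketch: load the existing JSON as a template, overwrite
# selected leaf values from a simple "dotted.path=value" input file, and
# write the result out.
import json

def set_path(tree, dotted_path, value):
    # Walk a dotted path like 'device.interfaces.0.name' and set the leaf.
    keys = dotted_path.split(".")
    node = tree
    for key in keys[:-1]:
        node = node[int(key)] if isinstance(node, list) else node[key]
    last = keys[-1]
    if isinstance(node, list):
        node[int(last)] = value
    else:
        node[last] = value

with open("existing_device.json") as f:
    template = json.load(f)

with open("new_device_values.txt") as f:   # lines like: device.serial=ABC123
    for line in f:
        path, _, value = line.strip().partition("=")
        if path:
            set_path(template, path, value)

with open("new_device.json", "w") as f:
    json.dump(template, f, indent=2)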

Autodesk-Forge bucket system: New versioning

I am wondering what the best practice is for handling new versions of the same model in the Data Management API bucket system.
Currently, I have one bucket per user, and files with the same name overwrite the existing model when doing an SVF/SVF2 conversion.
In order to handle model versioning in the best manner, should I:
1) create one bucket per converted file, or
2) continue with one bucket per user?
If 1): is there a limit on the number of buckets that can be created?
Else 2): how do I get the translation to accept a bucketKey different from the file name? (As it is now, the uploaded object needs to use the file name for the translation to work.)
In advance, cheers for the assistance.
In order to translate a file, you do not have to keep the original file name, but you do need to keep the file extension (e.g. *.rvt) so that the Model Derivative service knows which translator to use. So you could just create files with different names: perhaps add a suffix like "_v1", etc., or generate random names and keep track in a database of which file is which version of which model. It's up to you.
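A tiny sketch of that naming idea (the "_v{n}" suffix is just one possible convention, not anything required by the API):

# Keep the original extension so the Model Derivative service can pick the
# right translator; only the base name carries the version suffix.
from pathlib import Path

def versioned_object_key(filename: str, version: int) -> str:
    p = Path(filename)
    return f"{p.stem}_v{version}{p.suffix}"

print(versioned_object_key("house.rvt", 2))   # -> house_v2.rvt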
There is no limit on the number of buckets, but it might be overkill to have a separate one for each file.

Splitting and merging JSON files from batch jobs in AWS

I am working on a project where I am splitting a single file containing a bunch of sentences into chunks to send to a third-party API for sentiment analysis.
The third-party API has a limit of 5000 characters per request, which is why I am splitting the file into chunks of 40 sentences each. Each chunk is sent to a batch job via AWS SQS and processed for sentiment analysis by the third-party API. I want to merge all of the processed files into one file, but I can't work out the logic to merge them.
For example,
the input file,
chunk1: sentence1....sentence1... sentence1....
chunk2: sentence2....sentence2... sentence2....
The input file is separated into chunks, and each chunk is sent separately to a batch job via SQS. The batch job calls the external API for sentiment analysis, and each result is uploaded to the S3 bucket as a separate file.
Output file:
{"Chunk1": "sentence1....sentence1...sentence1....",
"Sentiment": "positive."}
All I want is the output in a single file, but I can't find the logic to merge the output files.
Logic I tried:
For each input file, I attach a UUID to every chunk as metadata and merge them with another Lambda function. But the problem is that I am not sure when all of the chunks have been processed, and therefore when to invoke the Lambda function that merges the files.
If you have any better logic to merge the files, please share it here.
This sounds like a perfect use case for the AWS Step Function service. Step Functions allow you to have ordered tasks (which can be implemented by Lambdas). One of the state types, called Map, allows you to kick off many tasks in parallel and wait for all of those to finish before proceeding to the next step.
So a quick high level state flow would be something like:
First state takes a file as input and breaks up the file into multiple chunks
The second state would be a Map state with a task that takes a file as input, sends it for sentiment analysis and saves the output. The Map state will kick off a task for each small file and retrieve the sentiment analysis.
The third and final task state will take all of the files and combine them in whatever way you deem appropriate.
It may take a bit of googling and reading of the user guides, but your workflow is exactly the use case this service was designed for. It sounds like you already have some of these steps implemented as their own Lambda functions; you'll just need to tweak them to be compatible with how Step Functions pass data in and out, instead of using SQS.
That being said, I'm not sure how you want to merge the files, since each section was analyzed separately and may have its own sentiment, and it isn't obvious how to summarize the sentiment as a whole.
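As a rough sketch of that final "combine" state, assuming the Map state hands the closing task a list of the S3 keys of the per-chunk result files (the event shape and bucket name below are assumptions, not your actual setup):

# Hypothetical closing-state Lambda: read every per-chunk result and write
# one merged JSON file back to S3.
import json
import boto3

s3 = boto3.client("s3")
RESULT_BUCKET = "my-sentiment-results"   # assumed bucket name

def handler(event, context):
    merged = []
    for key in event["chunk_keys"]:      # e.g. ["results/chunk-0.json", ...]
        obj = s3.get_object(Bucket=RESULT_BUCKET, Key=key)
        merged.append(json.loads(obj["Body"].read()))

    out_key = "results/merged.json"
    s3.put_object(Bucket=RESULT_BUCKET,
                  Key=out_key,
                  Body=json.dumps(merged).encode("utf-8"))
    return {"merged_key": out_key}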
Resources:
https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/

Copying multiple files from one folder to another in the same S3 bucket

I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each containing multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each file has barely 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim - target_folder - dev-bucket/final/, containing all 20000 JSON files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.print("The process has failed to copy files from sftp location to base location", e)
    exit(1)
I was thinking of merging the data into a single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # note: list.append returns None, so don't reassign it
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS (see the Python sketch after this list)
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
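For the Athena step, a rough sketch of kicking off a CREATE TABLE AS (CTAS) query from Python might look like this; the database, table names and S3 locations are assumptions, and the Glue crawler is presumed to have already created the json_files_raw table:

# Untested sketch: run an Athena CTAS query that reads the crawled table of
# small JSON files and rewrites the data as a few compressed Parquet files.
import boto3

athena = boto3.client("athena")

query = """
CREATE TABLE merged_data
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://dev-bucket/final/'
) AS
SELECT * FROM json_files_raw
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://dev-bucket/athena-query-results/"},
)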
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.

Adding static files to Talend jobs

I'm using Talend Open Studio for Big Data and I have a job where I use tFileInputDelimited to load a CSV file and use it as a lookup with a tMap.
Currently the file is loaded from the disk using an absolute path (C:\work\jobs\lookup.csv) and everything works fine locally.
The issue is that when I deploy the task, it obviously doesn't take the lookup.csv file with it.
Which raises a question:
Is there any way to "bundle" this file (lookup.csv) into the job so I can later deploy them together?
With static data such as this, your best bet is to hard-code the data into the job using a tFixedFlowInput instead.
As an example, if you want to use a list of country names with their ISO2 and ISO3 codes, you might have these in a CSV that you'd normally access with a tFileInputDelimited. However, to save bundling this CSV with every build (which could be done with Ant/Maven), you can just hard-code this data into a tFixedFlowInput component.
You then just need to make sure your schema is set up the same as your delimited file's would have been (so in this case we have 3 columns: Country_Name, ISO2 and ISO3).