Copying multiple files from one folder to another in the same S3 bucket - json

I am trying to copy files from one folder to another. However, the source folder contains multiple sub-folders, each with many files. My requirement is to move all the files from each of these sub-folders into a single folder. There are millions of files in total, and each file holds barely 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have the following -
folder a - contains 10,000 JSON files
folder b - contains 10,000 JSON files
My aim - target_folder - dev-bucket/final/ containing all 20,000 JSON files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
try:
    file_count = 0
    for obj in bucket.objects.filter(Prefix=source_folder):
        if obj.key.endswith('/'):
            continue  # skip folder placeholder objects
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        # destination key: target prefix plus the bare file name (flattens the sub-folders);
        # assumes target_folder (e.g. 'final/') is defined alongside source_folder
        final_file = target_folder + obj.key.split('/')[-1]
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.error("The process has failed to copy files from sftp location to base location: %s", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)

response = []
for page in pages:
    for obj in page['Contents']:
        key = obj['Key']
        result = s3_client.get_object(Bucket=s3_bucket, Key=key)
        text = result['Body'].read().decode()
        response.append(text)  # append() returns None, so don't reassign response
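For the write-back step, I was imagining something roughly like the below (the target key name is just a placeholder), but I am not sure whether this is the right way to merge the records:
merged = "\n".join(response)  # assuming one JSON record per file, written out newline-delimited
s3_client.put_object(Bucket=s3_bucket, Key='final/merged.json', Body=merged.encode())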
Can you please guide me? Many thanks in advance.

If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
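For the paths in the question, that would be something like aws s3 cp s3://dev-bucket/data/ s3://dev-bucket/final/ --recursive. Note, though, that --recursive keeps the relative sub-folder structure (final/a/..., final/b/...) rather than flattening everything directly into final/.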

If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and that you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
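For example, the CREATE TABLE AS step could be kicked off from Python along these lines. This is only a rough sketch: the database, table names, and output locations are placeholders, and it assumes the Glue crawler has already created a table over dev-bucket/data/.
import boto3

athena = boto3.client('athena')
athena.start_query_execution(
    QueryString="""
        CREATE TABLE mydb.merged_parquet
        WITH (format = 'PARQUET', external_location = 's3://dev-bucket/final/')
        AS SELECT * FROM mydb.json_source
    """,
    QueryExecutionContext={'Database': 'mydb'},
    ResultConfiguration={'OutputLocation': 's3://dev-bucket/athena-query-results/'},
)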
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
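A rough sketch of that approach, assuming the inventory report is CSV-formatted with its default layout (bucket name, then URL-encoded object key) and has already been downloaded locally:
import csv
from urllib.parse import unquote_plus

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('dev-bucket')

with open('inventory.csv', newline='') as f:
    for row in csv.reader(f):
        src_bucket, key = row[0], unquote_plus(row[1])
        target_key = 'final/' + key.split('/')[-1]  # flatten: keep only the file name
        bucket.Object(target_key).copy({'Bucket': src_bucket, 'Key': key})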
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
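Producers would then write individual records to the stream instead of creating objects themselves. A minimal sketch, assuming a delivery stream (here called json-merge-stream, purely as a placeholder) is already configured to buffer and deliver to your bucket:
import json

import boto3

firehose = boto3.client('firehose')
record = {"id": 1, "value": "example"}
firehose.put_record(
    DeliveryStreamName='json-merge-stream',
    Record={'Data': (json.dumps(record) + '\n').encode()},
)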

Related

Best Approach to read large number of JSON Files (1 JSON per file) in Databricks

Hope everyone is doing well.
I am trying to read a large number of JSON files, around 150,000 of them in a folder, using Azure Databricks. Each file contains a single JSON object, i.e. 1 record per file. Currently the read alone takes over an hour despite a large cluster. The files are read using the pattern shown below.
val schema_variable = <schema>
val file_path = "src_folder/year/month/day/hour/*/*.json"
// e.g. src_folder/2022/09/01/10/*/*.json
val df = spark.read
  .schema(schema_variable)
  .json(file_path)
  .withColumn("file_name", input_file_name())
Is there any approach or option we can try to make the reads faster?
We have already considered copying the file contents into a single file and then reading it, but we would lose the lineage of the file contents, i.e. which record came from which file.
I have also gone through various links on SO, but most of them seem to be about single or multiple files of huge size, say 10GB to 50GB.
Environment - Azure Databricks 10.4 Runtime.
Thank you for all the help.

Azure Data Factory - Unzip single file with multiple csv files being copied to different destinations

We are ingesting a zip file from blob storage with Azure Data Factory. Within this zipped file there are multiple csv files which need to be copied to Azure SQL tables.
For example, let's say zipped.zip contains Clients.csv, Accounts.csv, and Users.csv. Each of these csv files will need to be copied to a different destination table - dbo.Client, dbo.Account, and dbo.User.
It seems that I would need to use a Copy Activity to unzip and copy these files to a staging location, and from there copy those individual files to their respective tables.
Just want to confirm that there is not a different way to unzip and copy all in one action, without using a staging location? Thanks!
Your thoughts are correct; there is no direct way to do this without a staging step, as long as you are not writing any custom code logic and are leveraging just the Copy activity.
A somewhat similar thread: https://learn.microsoft.com/en-us/answers/questions/989436/unzip-only-csv-files-from-my-zip-file.html

How can I pass multiple CSV files in a directory with same column headers to single REST API in JMETER and test with 1000 users

Test scenario: a folder contains multiple CSVs, all with the same columns. I have to pass these CSV files one after the other to a single REST API (GET call).
Each user (1000 users in total) should get assigned a set of records/rows from the CSV file currently in use.
I am new to JMeter and have been trying to find a solution using the CSV Data Set Config, but I realize I cannot pass multiple CSV files using it.
I also see the __CSVRead() function, but I could not pass the CSV file name to it dynamically using BeanShell scripting.
Can someone please help me with this?
The CSV file names in the folder can be read one by one using the Directory Listing Config plugin.
Depending on the nature of the CSV files, you might want to use either the __CSVRead() or the __StringFromFile() function directly in your HTTP Request sampler; you don't need to go for any scripting.
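For example, assuming the Directory Listing Config is set to store the current file name in a variable called filename, the request could reference ${__CSVRead(${filename},0)} to read the first column and ${__CSVRead(${filename},next)} to advance to the next row.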

avoid splitting json output by pyspark (v. 2.1)

Using Spark v2.1 and Python, I load JSON files with
sqlContext.read.json("path/data.json")
I have a problem with the output JSON. Using the command below,
df.write.json("path/test.json")
the data is saved in a folder called test.json (not a file), which contains two files: one empty and the other with a strange name:
part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f
Is there any way to have a clean, single JSON output file?
thanks
Yes, Spark writes the output to multiple files when you save. Since the computation is distributed, the output is written to multiple part files like part-r-00000-f9ec958d-ceb2-4aee-bcb1-fa42a95b714f. The number of files created equals the number of partitions.
If your data is small and fits in memory, then you can save your output to a single file. But if your data is large, saving to a single file is not the suggested way.
Actually, test.json is a directory, not a JSON file; it contains multiple part files inside it. This does not create any problem for you, since you can easily read it back later.
If you still want your output in a single file, then you need to repartition to 1, which brings all your data to a single node before saving. This may cause issues if you have a lot of data.
df.repartition(1).write.json("path/test.json")
Or, to the same effect:
df.coalesce(1).write.json("path/test.json")
(df.collect() returns a plain Python list, which has no .write attribute; coalesce(1) also avoids the full shuffle that repartition(1) triggers.)
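As mentioned above, the output directory can then be read back in a single call, just like the input was, e.g. sqlContext.read.json("path/test.json").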

How can I add file locations to a database after they are uploaded using a Perl CGI script?

I have a CGI program I have written using Perl. One of its functions is to upload pics to the server.
All of it is working well, including adding all kinds of info to a MySQL db. My question is: how can I get the uploaded pic files' locations and names added to the db?
I would rather not change the script to actually upload the pics into the db itself; I have heard horror stories about storing binary files in databases.
Since I am new to all of this, I am at a loss. I have tried doing research and web searches for 3 weeks now with no luck. Any suggestions or answers would be greatly appreciated. I would really hate to have to manually add all the locations/names to the db.
I am using a Perl CGI script, a MySQL db, and a Linux server, and the files are being uploaded to the server. I am NOT looking to add the actual files to the db, just their location(s).
It sounds like you already have the part where you take the upload, read it in as a string, and toss it into MySQL, much like reading any file in as a string. However, since CGI hands you a filehandle rather than a filename to read from, you are wondering where that file actually lives.
If you're using CGI.pm, then upload(), uploadInfo(), and the upload's param will help you deal with the uploaded file's source. Where the temporary files are stashed after the remote client and the CGI are done with them is usually not permanent and is, at a minimum, volatile.
You've got a bunch of already-uploaded files that need to be added to the db? It should be trivial to dash off a one-off script that loops through all the files and inserts the details into the DB. If they're all in one spot, then a simple opendir()/readdir() loop would catch them all; otherwise you can make a list of file paths and loop over that.
If you're talking about recording new uploads on the server, then it would be something along these lines (a rough sketch in code follows at the end of this answer):
user uploads file to server
script extracts any wanted/needed info from the file (name, size, mime-type, checksums, etc...)
start database transaction
insert file info into database
retrieve ID of new record
move uploaded file to final resting place, using the ID as its filename
if everything goes fine, commit the transaction
Using the ID as the filename solves the worries of filename collisions and new uploads overwriting previous ones. And if you store the uploads somewhere outside of the site's webroot, then the only access to the files will be via your scripts, providing you with complete control over downloads.
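A rough, language-agnostic sketch of that workflow, written here in Python for brevity (the question itself uses Perl with MySQL); all paths, table, and column names are placeholders:
import os
import sqlite3  # stand-in for the real MySQL connection

conn = sqlite3.connect('uploads.db')
conn.execute("CREATE TABLE IF NOT EXISTS uploads (id INTEGER PRIMARY KEY, name TEXT, size INTEGER)")

tmp_path = '/tmp/upload_abc123'           # wherever CGI stashed the uploaded file
original_name, size = 'photo.jpg', 12345  # metadata extracted from the upload

try:
    cur = conn.execute("INSERT INTO uploads (name, size) VALUES (?, ?)", (original_name, size))
    new_id = cur.lastrowid                           # ID of the new record
    os.rename(tmp_path, '/srv/uploads/%d' % new_id)  # final resting place, named by the ID
    conn.commit()                                    # commit only if the move succeeded
except Exception:
    conn.rollback()
    raise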