I am developing a node.js application. I am using an AWS EC2 instance with MySQL. I am using Amazon S3 for my storage. In my application, each user has a repository. Each repository has multiple folders and each folder has multiple files.
Is it a good idea to programmatically create an S3 folder for each user to achieve a directory structure?
Amazon S3 saves you the pain of creating the parent and nested sub-folders: when you put a key that contains multiple sub-folders, they are implied automatically.
You can certainly consider using folders programmatically.
For instance:
If you want to create a file under a subfolder, and then under a sub-subfolder, you can simply put the key as subfolder/subsubfolder/file.txt
The operation behaves like a "create if not exists".
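For example, here is a minimal sketch with boto3 in Python (the bucket and key names are made up; the same idea applies from the Node.js SDK): writing an object whose key contains slashes makes the intermediate "folders" appear automatically.
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key: putting this object implicitly creates the
# user1/repo1/folderA/ "folders" -- there is no separate mkdir-style call.
s3.put_object(
    Bucket="my-app-bucket",
    Key="user1/repo1/folderA/file.txt",
    Body=b"hello",
)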
We are ingesting a zip file from blob storage with Azure Data Factory. Within this zipped file there are multiple csv files which need to be copied to Azure SQL tables.
For example, let's say zipped.zip contains Clients.csv, Accounts.csv, and Users.csv. Each of these CSV files will need to be copied to a different destination table: dbo.Client, dbo.Account, and dbo.User.
It seems that I would need to use a Copy Activity to unzip and copy these files to a staging location, and from there copy those individual files to their respective tables.
Just want to confirm that there is not a different way to unzip and copy all in one action, without using a staging location? Thanks!
Your thoughts are correct: there is no direct way without a staging location, as long as you are not writing any custom code logic and are leveraging just the Copy activity.
a somewhat similar thread: https://learn.microsoft.com/en-us/answers/questions/989436/unzip-only-csv-files-from-my-zip-file.html
I am trying to copy files from one folder to another. However, the source folder has multiple folders in it, each with multiple files. My requirement is to move all the files from each of these folders into a single folder. I have millions of files, and each of these files has hardly 1 or 2 records.
Example -
source_folder - dev-bucket/data/
Inside this source_folder, I have following -
folder a - inside this folder, 10000 json files
folder b - inside this folder, 10000 json files
My aim: Target_folder - dev-bucket/final/ containing all 20000 json files.
I tried writing the code below; however, the processing time is huge. Is there any other way to approach this?
try:
    for obj in bucket.objects.filter(Prefix=source_folder):
        old_source = {'Bucket': obj.bucket_name, 'Key': obj.key}
        file_count = file_count + 1
        new_obj = bucket.Object(final_file)
        new_obj.copy(old_source)
except Exception as e:
    logger.print("The process has failed to copy files from sftp location to base location", e)
    exit(1)
I was thinking of merging the data into one single JSON file before moving it. However, I am new to Python and AWS and am struggling to understand how I should read and write the data. I was trying the code below but am kind of stuck.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=s3_bucket, Prefix=FOLDER)
response = []
for page in pages:
    for obj in page['Contents']:
        read_files = obj["Key"]
        result = s3_client.get_object(Bucket=s3_bucket, Key=read_files)
        text = result["Body"].read().decode()
        response.append(text)  # list.append returns None, so don't reassign response
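My rough idea was to then write everything back as one object, something like this (the target key is just a placeholder, and the result would be newline-delimited JSON rather than one JSON document):
# Hypothetical target key; joins the collected bodies into one object.
merged_body = "\n".join(response)
s3_client.put_object(
    Bucket=s3_bucket,
    Key="final/merged.json",
    Body=merged_body.encode("utf-8"),
)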
Can you please guide me? Many thanks in advance.
If you only need to copy the files once, I suggest using the AWS CLI:
aws s3 cp source destination --recursive
https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
If possible, it is best to avoid having high numbers of objects. They are slow to list and to iterate through.
From your question it seems that they contain JSON data and you are happy to merge the contents of the files. A good way to do this is:
Use an AWS Glue crawler to inspect the contents of a directory and create a virtual 'table' in the AWS Glue Catalog
Then use Amazon Athena to SELECT data from that virtual table (which reads all the files) and copy it into a new table using CREATE TABLE AS (a rough sketch of this step follows below)
Depending upon how you intend to use the data in future, Athena can even convert it into a different format, such as Snappy-compressed Parquet files that are very fast for querying
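As a rough sketch of that CREATE TABLE AS step (the database, table names, and S3 locations below are placeholders), it can be started from Python with boto3:
import boto3

athena = boto3.client("athena")

# Placeholder database/table/locations: Athena reads every file behind the
# crawled table and writes the merged result to the external_location.
query = """
CREATE TABLE merged_json
WITH (format = 'PARQUET', external_location = 's3://dev-bucket/merged/')
AS SELECT * FROM raw_json_table
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_glue_database"},
    ResultConfiguration={"OutputLocation": "s3://dev-bucket/athena-results/"},
)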
If you instead just wish to continue with your code for copying files, you might consider activating Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. Your Python program could then use that inventory file as the input list of files, rather than having to call ListObjects.
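A minimal sketch of that approach, assuming a CSV-format inventory report (the bucket, key, and column positions below are assumptions; the real data-file keys come from the manifest.json that S3 Inventory writes with each report):
import csv
import gzip
import io
import boto3

s3 = boto3.client("s3")

obj = s3.get_object(
    Bucket="dev-bucket-inventory",                            # hypothetical report bucket
    Key="dev-bucket/daily-inventory/data/part-00000.csv.gz",  # hypothetical report file
)
rows = csv.reader(io.StringIO(gzip.decompress(obj["Body"].read()).decode("utf-8")))

for row in rows:
    src_bucket, src_key = row[0], row[1]  # in a CSV inventory, bucket and key come first
    # ... copy or merge src_key here instead of paginating ListObjectsV2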
However, I would highly recommend a strategy that reduces the number of objects you are storing unless there is a compelling reason to keep them all separate.
If you receive more files every day, you might even consider sending the data to an Amazon Kinesis Data Firehose, which can buffer data by size or time and store it in fewer files.
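For reference, the producer side is just a put to the delivery stream, something like this (the stream name is made up; the stream itself is configured to buffer by size or time and write batched objects to S3):
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream name; Firehose buffers these records and
# writes them to S3 in far fewer, larger objects.
firehose.put_record(
    DeliveryStreamName="incoming-json-stream",
    Record={"Data": (json.dumps({"example": "record"}) + "\n").encode("utf-8")},
)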
I am building my Cloudformation template to create an S3 Bucket.
I wanted to create folders in the bucket at the same time but I have read that I need to use a lambda backed resource.
So I've prepared the Lambda part of my template, but I need to add a condition:
If the Lambda refers to a bucket which already exists,
and that bucket has been created in this same file (everything has to reside in one CloudFormation stack),
then call the Lambda to create my folders.
I do not want to check if my bucket exists in S3 or if my folders already exist as S3 objects in the bucket.
I would like the lambda backed resource to be created after the bucket has been created.
First of all, why do you need directories at all? S3 is in fact a key-value store; "paths" are just prefixes. Usually there is no benefit to doing so other than human-friendly presentation.
Secondly, you can either use DependsOn to enforce the proper order of resource provisioning, or (which I think would be good practice) make the Lambda generic and accept the bucket name as a custom resource parameter; you then pass it using the Ref function, which implicitly creates the dependency.
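To illustrate the second option, here is a rough sketch of such a generic custom-resource Lambda in Python (the BucketName and Folders property names are assumptions for your template; it relies on the cfnresponse helper module that AWS provides for inline Lambda code):
import boto3
import cfnresponse  # helper available to inline-code Lambdas

s3 = boto3.client("s3")

def handler(event, context):
    # BucketName/Folders are hypothetical custom-resource properties; passing
    # the bucket with Ref in the template makes CloudFormation create the
    # bucket before this custom resource runs.
    props = event["ResourceProperties"]
    try:
        if event["RequestType"] in ("Create", "Update"):
            for folder in props.get("Folders", []):
                # a zero-byte object whose key ends in "/" shows up as a folder
                s3.put_object(Bucket=props["BucketName"], Key=folder.rstrip("/") + "/")
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        cfnresponse.send(event, context, cfnresponse.FAILED, {})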
I have two Google Cloud Platform projects, one for staging and one for production. I would like the dynamic inventory system to retrieve the machines from both projects (http://docs.ansible.com/ansible/guide_gce.html). How can I do it?
Inventories are composable, so you can have as many dynamic inventory sources as you want. Just create a directory called inventory next to your playbooks, and create subdirs for each of your projects. Put a copy of the gce dynamic inventory script with its ini file in each of the project dirs, set it up the way you want, and just point ansible-playbook at the inventory directory with -i. Voila- you should see the hosts from both inventories.
You could also put two copies of the inventory scripts in the same directory (or symlinks) and rename them to be unique (ie, so they'd run twice), but in order to have a unique config for each, you'd have to hack the script to use a different config filename (or to dynamically use the name of the script itself, in the symlink case).
If you're using group_vars/host_vars dirs, it's best if you put them at the top level of the inventory directory (not down in the project subdirs). I haven't checked it under 2.0, but nested group_vars/host_vars didn't compose correctly in 1.9 last time I tried it.
My application has Couchbase views (map-reduce). Presently, I am writing them to a text file and loading them for each new Couchbase server from the Couchbase admin page (a tedious and error-prone process).
Is there any way I can load all those views from text files into Couchbase when I am deploying a fresh Couchbase server or when I create a fresh bucket?
I remember that in MySQL we used to write all insert queries and procedures to a file and feed the file to mysql (via the command prompt) for each new instance. Is there any such strategy available for Couchbase?
From your previous Couchbase-related questions, it seems you are using the Java SDK?
Both the 1.4 and 2.0 lines of the SDK allow for programmatically creating design documents and views.
With Java SDK 1.4.x
You have to load your view definitions (map functions, reduce functions, in which design document to put them) somehow, as Strings. See the documentation at http://docs.couchbase.com/couchbase-sdk-java-1.4/#design-documents.
Basically you create a ViewDesign in a DesignDocument that you insert into the database via the CouchbaseClient:
DesignDocument designDoc = new DesignDocument("beers");
designDoc.setView(new ViewDesign("by_name", "function (doc, meta) {" +
" if (doc.type == \"beer\" && doc.name) {" +
" emit(doc.name, null);" +
" }" +
"}"));
client.createDesignDoc(designDoc);
With Java SDK 2.0.x
In the same way, you have to load your view definitions (map functions, reduce functions, in which design document to put them) somehow, as Strings.
Then you deal with DesignDocument, adding DefaultViews to it, and insert the design document into the bucket via the Bucket's BucketManager:
List<View> viewsForCurrentDesignDocument = new ArrayList<View>(viewCountForCurrentDesignDoc);
//... for each view definition you loaded
View v = DefaultView.create(viewName, viewMapFunction, viewReduceFunction);
viewsForCurrentDesignDocument.add(v);
//... then create the designDocument proper
DesignDocument designDocument = DesignDocument.create(designDocName, viewsForCurrentDesignDocument);
//optionally you can insert it as a development design doc, retrieve an existing one and update, etc...
targetBucket.bucketManager().insertDesignDocument(designDocument);
At Rounds, we use Couchbase for some of our server-side apps and use Docker images for our development environment.
I wrote two scripts for dumping an existing Couchbase and re-creating Couchbase buckets and views from the dumped data.
The view map and reduce functions are dumped as plain javascript files in a directory hierarchy that represents the design docs and buckets in couchbase. It is very helpful to commit the whole directory tree into your repo so you can track changes made to your views.
As the files are plain javascript, you can edit them with your favourite IDE and enjoy automatic syntax checks.
You can use the scripts from the following github repo:
https://github.com/rounds/couchbase-dump
Dump all your couchbase buckets and views as javascript files in a directory hierarchy that you can commit to your repo. Then you can recreate the couchbase buckets and views from previously dumped data.
If you find this helpful and have something to add please create an issue or contribute on github.