Script to change property names in JSON files in an AWS S3 bucket

I have more than 1,000 files in an AWS S3 bucket, spread across different folders. All of the files are JSON, and each has 30 properties. I now need to rename two of those properties (e.g. code to httpCode and time to responseTime). Can a script be written to change these property names in all of the files?

Note: run the sed command below without the -i switch first, just to verify that you are getting the desired results; -i makes the changes in the files. Only add -i once the output looks right.
# Get the files from the s3 bucket
aws s3 sync s3://mybucket .
# Rename the two properties in every JSON file.
# Matching the quoted key with its trailing colon avoids renaming other
# properties or values that merely contain the words "code" or "time";
# it assumes there is no whitespace between the key and the colon.
find . -iname "*.json" -type f -exec sed -i 's/"code":/"httpCode":/g; s/"time":/"responseTime":/g' {} \;
# Sync the files back to s3 from the current local directory
aws s3 sync . s3://mybucket
ps: this is untested.
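If sed feels too blunt even with the quoted-key match, a jq-based variant that renames only top-level keys might look like this (equally untested; it assumes jq is installed and that the two keys sit at the top level of each document):
aws s3 sync s3://mybucket .
find . -iname "*.json" -type f | while IFS= read -r f; do
    # rewrite each file in place via a temporary file
    jq 'with_entries(if .key == "code" then .key = "httpCode"
                     elif .key == "time" then .key = "responseTime"
                     else . end)' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
aws s3 sync . s3://mybucket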

Wget -i gives no output or results

I'm learning data analysis in Zeppelin; I'm a mechanical engineer, so this is outside my expertise.
I am trying to download two CSV files using a file that contains the URLs, test2.txt. When I run it I get no output, but no error message either. I've included a link to a screenshot showing my code and the results.
When I go into the Ambari Sandbox I cannot find any files created. I'm assuming the directory the URL file is in is where the CSV files will be downloaded to. I've tried using -P as well with no luck. I've checked man wget but it did not help.
So I have several questions:
How do I show the output from running wget?
Where is the default directory that wget stores files?
Do I need additional data in the file other than just the URLs?
Screenshot: Code and Output for %sh
Thanks for any and all help.
%sh
wget -i /tmp/test2.txt
%sh
# list the current working directory
pwd # output: /home/zeppelin
# make a new folder, created in "tmp" because it is temporary
mkdir -p /home/zeppelin/tmp/Folder_Name
# change directory to new folder
cd /home/zeppelin/tmp/Folder_Name
# transfer the file from the sandbox to the current working directory
hadoop fs -get /tmp/test2.txt /home/zeppelin/tmp/Folder_Name/
# download the URL
wget -i test2.txt
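To see wget's progress inside Zeppelin, and to confirm where the files actually landed, something along these lines should help (a rough check: wget writes its progress messages to stderr, which the %sh paragraph may not display unless it is redirected):
%sh
cd /home/zeppelin/tmp/Folder_Name
# redirect stderr to stdout so the download log is visible in the note
wget -i test2.txt 2>&1
# list what ended up in the current working directory
ls -l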

Convert Multiple Files To UTF-8

I have a number of files that get dumped on a server every morning. There are 9 comma-delimited CSV files, varying in size from 59 KB to as large as 127 MB. I need to import these into a MySQL database, however the files are currently in Western ISO-8859-16 format and I need to convert them to UTF-8 for my import. The manual process of converting these is quite tedious, as you can imagine.
Is there a script / batch file I can create to run every morning via Task Scheduler? Or what is the best way to automate this process on Windows Server 2012?
Current script, which reads the files but doesn't seem to convert them:
#!/bin/bash
# Recursive file conversion CP1252 --> UTF-8
# Place this file in the root of your site, add execute permission and run.
# Converts *.csv, *.html, *.css, *.js files.
# To add a file type by extension, e.g. *.cgi, add '-o -name "*.cgi"' inside the parentheses of the find command.
find ./ -type f \( -name "*.csv" -o -name "*.html" -o -name "*.css" -o -name "*.js" \) |
while IFS= read -r file
do
    echo " $file"
    mv "$file" "$file.icv"
    iconv -f WINDOWS-1252 -t UTF-8 "$file.icv" > "$file"
    rm -f "$file.icv"
done
Any help is greatly appreciated.
Thanks.
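(For reference, a quick way to check whether a given CSV was actually converted, assuming the file utility is available on the server, e.g. via Cygwin or Git Bash:)
# report the detected character set of each CSV after conversion
for f in *.csv; do
    echo "$f: $(file -bi "$f")"
done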

AWS CLI Command

I'm trying to execute the following command using the AWS CLI:
aws s3 cp s3://my_bucket/folder/file_1234.txt -| pipe to sed command | pipe to jq command | aws s3 cp - s3://my_bucket/new_folder/final_file.txt
The above command is working fine: it pulls data from s3, performs some operations and pushes the result back to s3.
Now, I have some files in s3 that follow a pattern, for instance file_771.txt, file_772.txt, file_773.txt and so on.
In order to get all the files that match the pattern, I'm running the following command, which is not working as expected; it generates an empty output file in s3.
aws s3 cp --include file_77* s3://my_bucket/folder/ -| pipe to sed command | pipe to jq command | aws s3 cp - s3://my_bucket/new_folder/final_file.txt
This generates an empty final_file.txt. Any reason why? Am I missing something in the command?
To copy multiple files at once, you would have to use --recursive, in your case together with --exclude "*" --include "file_77*", but, as the aws s3 cp documentation notes:
Downloading as a stream is not currently compatible with the --recursive parameter.
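One possible workaround (an untested sketch; the sed and jq stages below are placeholders for your actual filters) is to list the matching keys and stream them one at a time:
# list the keys in the folder, keep the ones matching the pattern,
# run each through the same pipeline, and collect the output locally
aws s3 ls s3://my_bucket/folder/ | awk '{print $4}' | grep '^file_77' | while IFS= read -r key; do
    aws s3 cp "s3://my_bucket/folder/$key" - | sed 's/old/new/' | jq '.'
done > final_file.txt
# upload the combined result
aws s3 cp final_file.txt s3://my_bucket/new_folder/final_file.txt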

What will happen with a gsutil command if a DRA bucket's contents are unavailable?

I'm on a DRA (Durable Reduced Availability) bucket and I perform the gsutil rsync command quite often to upload/download files to/from the bucket.
Since some file(s) could be unavailable (because of the DRA), what exactly will happen during a gsutil rsync session when such a scenario is hit?
Will gsutil just wait until the unavailable files becomes available and complete the task, thus always downloading everything from the bucket?
Or will gsutil exit with a warning about a certain file not being available, and if so exactly what output is being used (so that I can make a script to look for this type of message)?
What will the return code be of the gsutil command in a session where files are found to be unavailable?
I need to be 100% sure that I download everything from the bucket, which I'm guessing can be difficult to keep track of when downloading hundreds of gigabytes of data. In case gsutil rsync completes without downloading unavailable files, is it possible to construct a command which retries the unavailable files until all such files have been successfully downloaded?
If your files exceed the resumable threshold (as of 4.7, this is 8MB), any availability issues will be retried with exponential backoff according to the num_retries and max_retry_delay configuration variables. If the file is smaller than the threshold, it will not be retried (this will be improved in 4.8 so small files also get retries).
If any file(s) fail to transfer successfully, gsutil will halt and output an exception depending on the failure encountered. If you are using gsutil -m rsync or gsutil rsync -C, gsutil will continue on errors and at the end, you'll get a CommandException with the message 'N file(s)/object(s) could not be copied/removed'
If retries are exhausted and/or either of the failure conditions described in #2 occurs, the exit code will be nonzero.
In order to ensure that you download all files from the bucket, you can simply rerun gsutil rsync until it finishes with a zero exit code.
Note that gsutil rsync relies on listing objects. Listing in Google Cloud Storage is eventually consistent, so if you upload files to the bucket and then immediately run gsutil rsync, it is possible you will miss newly uploaded files, but the next run of gsutil rsync should pick them up.
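For instance, a minimal retry loop along those lines (the bucket name and local path are placeholders) could look like:
#!/bin/bash
# keep rerunning rsync until it exits cleanly (exit code 0)
until gsutil -m rsync -r gs://mybucket ./local_copy; do
    echo "gsutil rsync reported errors, retrying in 60 seconds..." >&2
    sleep 60
done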
I did some tests on a project and could not get gsutil to throw any errors. AFAIK, gsutil operates at the directory level; it is not looking for a specific file.
When you run, for example, $ gsutil rsync local_dir gs://bucket, gsutil is not expecting any particular file, it just takes whatever you have in "local_dir" and uploads it to gs://bucket, so:
gsutil will not wait, it will complete.
you will not get any errors - the only errors I got were when the local directory or the bucket was missing entirely.
if, let's say, a file is missing in local_dir but it is available in the bucket, and you then run $ gsutil rsync -r local_dir gs://bucket, then nothing will change in the bucket. With the "-d" option, the file will be deleted on the bucket side.
As a suggestion, you could just add a crontab entry to rerun the gsutil command a couple of times a day or at night.
Another way is to create a simple script and add it to your crontab to run every hour or so. It checks whether your file exists locally, and if the file is missing it runs the gsutil command:
#!/bin/bash
FILE=/home/user/test.txt
if [ -f "$FILE" ]
then
    echo "file exists..or something"
else
    gsutil rsync /home/user gs://bucket
fi
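A matching crontab entry to run it hourly might look like this (the script path is an assumption):
# minute hour day-of-month month day-of-week command
0 * * * * /home/user/check_and_rsync.sh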
UPDATE:
I think this may be what you need. In ~/ you should have a .boto file.
~$ grep -E 'num_retries|max_retry_delay' .boto
# num_retries = <integer value>
# max_retry_delay = <integer value>
Uncomment those lines and add your numbers. The default is 6 retries, so you could do something like 24 retries and put 3600s in between. In theory this should always keep looping.
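For example, with the values suggested above the uncommented lines would read:
num_retries = 24
max_retry_delay = 3600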
Hope this helps!

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very fast with Redis via the command line, but I have not seen anything comparable for Couchbase.
I have tried using the PHP SDK and it imports about 500 documents per second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a bit of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size. You'll need to adjust this value for your bucket.
If successful you'll see output stating how many documents were loaded, success, etc.
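If you instead have many individual .json files, one way to prepare them (a rough sketch; the directory name is an assumption) is to zip each one so the loop below can pick them all up:
# create one .zip per .json file in ~/json_files
cd ~/json_files || exit 1
for f in *.json; do
    zip "${f%.json}.zip" "$f"
done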
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
JSON_Dir=~/json_files
for ZipFile in "$JSON_Dir"/*.zip
do
    /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
        -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
        -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind this script will only work if your data is formatted correctly and each document is under the maximum single-document size of 20 MB (not the zip file itself, but any document extracted from the zip).
I have created a blog post describing bulk loading from a single file as well and it is listed here:
Bulk Loading Documents Into Couchbase