How to accelerate a shell script for a MongoDB application?

I started shell scripting for my work, but I must admit I'm still far away from even being a rookie. Therefore I wanted to ask you for your help/advice.
I built a script for a big data application (taking the quick and dirty approach, patching stuff from the internet together) to recursively go through a folder structure and convert all XML files to JSON.
The status quo of my script is:
#!/bin/sh
# Shell script to find out all the files under a directory and
# its subdirectories. This also takes into consideration those files
# or directories which have spaces or newlines in their names
cd /Users/q337498/Desktop/Archiv/2014/01/10
DIR="."
function list_files()
{
    if !(test -d "$1")
    then echo "$1"; return;
    fi
    cd "$1"
    # echo; echo `pwd`:; # Display directory name
    for i in *
    do
        if test -d "$i"; then # if directory
            if [ "$(ls -A "$i")" ]; then
                list_files "$i" # recursively list files
                cd ..
            else
                echo "$i is Empty"
            fi
        else
            java -jar /Users/q337498/Desktop/XML2JSON/SaxonEE9-5-1-4J/saxon9ee.jar -s:"$i" -xsl:/Users/q337498/Desktop/xsltjson-master/conf/xml-to-json.xsl -o:output/$(pwd)/${i%%[.]*}
            # if jsonlint /Users/q337498/Desktop/Archiv/2014/01/08/$(pwd)/${i%%[.]*} -q; then
            #     echo "GOOD"
            # else
            #     echo "NOT GOOD"
            # fi
            # echo ${i%%[.]*}
            # echo "$i"; # Display file name
        fi
    done
}
if [ $# -eq 0 ]
then
    list_files .
    exit 0
fi
for i in "$@"
do
    DIR="$1"
    list_files "$DIR"
    shift 1 # To read the next directory/file name
done
This code works, but the problem is that for 60,000 files it takes up to 15 hours on a MacBook Pro with 16 GB RAM and a 2.8 GHz i7. And I need to convert 10 million files.
How do you think I could accelerate the script? Parallelize? Take some commands out? What options do I have, and how would I actually implement them?
The files are ultimately going to end up in MongoDB, so if someone knows a better way to convert XML to JSON and upload it to Mongo, their input is also welcome.
Cheers,
Dudu

I see two immediate problems here:
You are invoking java once for each file, therefore incurring the JVM startup time for every file, which is going to add up to a huge chunk of time.
You are running single-threaded.
So I would suggest that you:
Write a Java program that does the directory traversal and does your transformation.
Benchmark the performance difference.
Try other Java libraries for the XML->JSON conversion: https://github.com/beckchr/staxon/wiki/Benchmark
If necessary for performance, add multi-threading to your application using java.util.concurrent.
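If a full Java rewrite is more than you want to take on right away, a shell-level stopgap is to run several Saxon processes in parallel. This is only a minimal sketch, assuming the same Saxon jar and stylesheet paths as in the question and an xargs that supports -0 and -P (both the GNU and BSD/macOS versions do); it does not remove the per-file JVM startup cost pointed out above, it merely overlaps it across cores:
#!/bin/sh
# Parallel XML -> JSON conversion sketch; adjust -P to the number of cores.
export SAXON=/Users/q337498/Desktop/XML2JSON/SaxonEE9-5-1-4J/saxon9ee.jar
export XSL=/Users/q337498/Desktop/xsltjson-master/conf/xml-to-json.xsl
SRC=/Users/q337498/Desktop/Archiv/2014/01/10
find "$SRC" -type f -name '*.xml' -print0 |
xargs -0 -n 1 -P 8 sh -c '
    in=$1
    out="output/${in%.*}.json"      # mirror the source tree under ./output
    mkdir -p "$(dirname "$out")"
    java -jar "$SAXON" -s:"$in" -xsl:"$XSL" -o:"$out"
' sh
Once the JSON files exist, mongoimport can load them into MongoDB, for example mongoimport --db archive --collection docs --file some.json (database and collection names here are placeholders).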

How to do 'svn checkout URL' using exec?

I'm trying to check out Joe English's tile-extras from GitHub using svn, from a Tcl script.
The required command is
svn checkout https://github.com/jenglish/tile-extras.git path
I have some code that boils down to
exec C:/cygwin64/bin/svn.exe checkout \
    https://github.com/jenglish/tile-extras.git C:/cygwin64/tmp/TCL61416
which fails with the message
couldn't execute "C:\cygwin64\bin\svn.exe checkout
https:\github.com\jenglish\tile-extras.git
C:\cygwin64\tmp\TCL61416": No error
Pasting the command quoted in the error message into a Windows Command Prompt window, I see
svn: E125002: 'https:\github.com\jenglish\tile-extras.git' does not appear to be a URL
So, the problem seems to be that exec converts Tcl-style paths to Unix-style a little over-enthusiastically. Is there any way I can prevent it from converting https://github.com/jenglish... to https:\github.com\jenglish...?
For information, I'm running on Windows 10, with cygwin (setup version 2.889 (64 bit)), svn 1.9.7 and tcl version 8.6.7 (via ActiveTcl 8.6.7.0).
UPDATE
Here is my actual code, which I'm only slightly embarrassed by:
# svn wrapper proposed by Donal Fellows at
# http://stackoverflow/questions/49224268
proc svn {args} {
    exec {*}[auto_execok svn] {*}$args <@stdin >@stdout
}
# Checkout from github to a temporary repository
set repository https://github.com/jenglish/tile-extras.git
set svnImage [auto_execok svn]
set fil [file tempfile tempfnm]
close $fil
file delete $tempfnm
set tempRepo [file rootname $tempfnm]
puts stdout tempRepo:\ $tempRepo
file mkdir $tempRepo
set svnCmd [list svn checkout $repository [file nativename $tempRepo]]
puts stdout svnCmd:\ $svnCmd
eval $svnCmd
# Determine the tile-extras sources
set sourceFiles {keynav.tcl icons.tcl}
set targets [file nativename [file join $tempRepo trunk *.tcl]]
foreach filnam [split [svn ls $targets] \n] {
    if {[string match *.tcl $filnam] && [lsearch $sourceFiles $filnam] < 0} {
        lappend sourceFiles $filnam
    }
}
And here is the result
$ tclsh foo.tcl
tempRepo: C:/cygwin64/tmp/TCL61838
svnCmd: svn checkout
https://github.com/jenglish/tile-extras.git {C:\cygwin64\tmp\TCL61838}
A C:\cygwin64\tmp\TCL61838/branches
A C:\cygwin64\tmp\TCL61838/trunk
A C:\cygwin64\tmp\TCL61838/trunk/README.md
A C:\cygwin64\tmp\TCL61838/trunk/dialog.tcl
A C:\cygwin64\tmp\TCL61838/trunk/doc
A C:\cygwin64\tmp\TCL61838/trunk/doc/dialog.n
A C:\cygwin64\tmp\TCL61838/trunk/doc/keynav.n
A C:\cygwin64\tmp\TCL61838/trunk/icons.tcl
A C:\cygwin64\tmp\TCL61838/trunk/keynav.tcl
A C:\cygwin64\tmp\TCL61838/trunk/license.terms
A C:\cygwin64\tmp\TCL61838/trunk/pkgIndex.tcl
Checked out revision 7.
svn: E155007: '/home/alan/C:\cygwin64\tmp\TCL61838\trunk\*.tcl' is not a working copy
while executing "exec {*}[auto_execok svn] {*}$args <@stdin >@stdout"
(procedure "svn" line 2)
invoked from within "svn ls $targets"
invoked from within "split [svn ls $targets] \n"
invoked from within "foreach filnam [split [svn ls $targets] \n] {
if {[string match *.tcl $filnam] && [lsearch $sourceFiles $filnam] < 0} {
lappend sourceFiles $filn..."
(file "foo.tcl" line 30)
$ ls /tmp/TCL61838/
$
The directory /tmp/TCL61838 is empty, so it seems the svn checkout command didn't complete entirely happily. I also see an unpleasant mixture of forward slashes and backslashes in what svn reports.
Thanks in advance for any more help.
Given the error message, it looks like you're getting word boundaries wrong in the code that you've not shown us; while you might believe the code “boils down to” that exec, it hasn't actually done that. Also, you've flipped the slashes in the URL, which won't work, but that's probably a side effect of something else.
Alas, I can't quite guess how to fix things for you. There are just too many options. I provide a suggestion below, but there's no telling for sure whether it will work out.
Diagnosis Methodology
The evidence for why I believe that the problem is what I say? This interactive session log (on OSX, but the generic behaviour should be the same):
% exec cat asdkfajh
cat: asdkfajh: No such file or directory
% exec "cat akjsdhfdkj"
couldn't execute "cat akjsdhfdkj": no such file or directory
% exec "cat aksdjhfkdf" skdjfghd
couldn't execute "cat aksdjhfkdf": no such file or directory
The first case shows an error from an external program. The second case shows an error due to no-such-program. The third case shows that arguments are not reported when erroring due to no-such-program.
This lets me conclude that both C:\cygwin64\bin\svn.exe and its arguments (checkout, https:\github.com\jenglish\tile-extras.git and C:\cygwin64\tmp\TCL61416) were actually passed as a single argument to exec, a fairly common error, and that the problems lie in the preparatory code. You don't show us the preparatory code, so we can't truly fix things but we can make suggestions that address the common problems.
Suggested Approach
A good way to reduce these errors is to write a small wrapper procedure:
proc svn {args} {
# I add in the I/O redirections so svn can ask for a password
exec {*}[auto_execok svn] {*}$args <@stdin >@stdout
}
This would let you write your call to svn as:
svn checkout $theURL [file nativename $theDirectory]
and it would probably Just Work™. Also note that only the directory name goes through file nativename; the URL does not. (We could embed the call to file nativename in the procedure if we were making a specialised procedure to do checkouts, but there's too much variation in the full svn program to let us do that. The caller — you — has to deal with it.)

How can I automatically kill idle GCE instances based on CPU usage?

I'm running some slightly unreliable software on some instances in an instance group. The software is installed and run by a startup script, and most of the time it works without issue, but about 10% of the new instances run out of memory and crash due to some sort of memory leak in the software. I can't get this leak fixed myself, so in the meantime I've been checking the instances every few hours and killing any that show an idle CPU (the software normally consumes all available CPU power).
However, I'm using preemptible instances, and they can be killed off and restarted at any time, leaving dead instances running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard, the rest of which is wasted.
Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.
The following worked for me. It's a bash script which uses the uptime UNIX command to check whether the 15-minute average load on the CPU is below a threshold, and automatically shuts down the system if this is true on ten consecutive checks. You need to run this within your VM instance.
Credit, and more detailed explanation: Rohit Rawat's blog.
#!/bin/bash
threshold=0.4
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
res=$(echo $load'<'$threshold | bc -l)
if (( $res ))
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
sudo poweroff
fi
sleep 60
done
It seems like there are two parts to this question:
Identifying dead instances.
Killing off those instances.
In terms of identifying dead instances, one way to do this would be to have a separate, management instance that does not run this software and that keeps tabs on the other instances. For example, it could do this by periodically sending a health request to the various instances and marking non-responsive instances or instances reporting an overly high CPU usage as unhealthy.
Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset those other instances using the API (I'm guessing the reset command) or by executing the same operation using the gcloud command-line tool.
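For the killing part, here is a minimal sketch of what such a management instance could run, assuming gcloud is installed and authenticated there, that the unhealthy instance names have already been collected into a file (one per line), and that a plain reset is what you want; for a managed instance group you could substitute gcloud compute instance-groups managed recreate-instances:
#!/bin/bash
# Hypothetical sketch: reset instances previously flagged as unhealthy.
# unhealthy.txt and the zone are assumptions, not part of the question.
ZONE=us-central1-a
while read -r instance; do
    echo "Resetting $instance"
    gcloud compute instances reset "$instance" --zone "$ZONE"
done < unhealthy.txt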
I wish I could add this as a comment to viswajithiii's answer, but I'm just shy of the reputation necessary to comment.
I found the static threshold variable to be inappropriate when using cloud VMs with varying numbers of CPUs, because the output of uptime scales with the number of CPUs, as discussed here.
My updated script adds two lines below the threshold assignment to scale the threshold by the number of CPUs. This lets me set a percentage CPU utilisation that works across VMs with different numbers of CPUs.
Otherwise, the script is the same as viswajithiii's.
#!/bin/bash
threshold=0.4
n_cpu=$( grep 'model name' /proc/cpuinfo | wc -l )
threshold=$( echo $n_cpu*$threshold | bc )
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
res=$(echo $load'<'$threshold | bc -l)
if (( $res ))
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
sudo poweroff
fi
sleep 60
done
This works without bc (which is not in GCP Container OS), using viswajithiii's answer and this post:
How can I replace 'bc' tool in my bash script?
It also appends the history list to a file before poweroff. I set my threshold very low, but the load shows 0.00 even when I'm editing files via the CLI. It might work better if the instance is under heavy load.
#!/bin/bash
threshold=10
count=0
while true
do
load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
load2=$(awk -v a="$load" 'BEGIN {print a*100}')
echo $load2
if [ $load2 -lt $threshold ]
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
history -a
sudo poweroff
fi
sleep 60
done
That isn't working for my low-CPU instance, but this seems to:
#!/bin/bash
threshold=1
count=0
while true
do
load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat))
load2=$(printf "%.0f\n" $load)
echo $load
echo $load2
if [[ $load2 -lt $threshold ]]
then
echo "Idling.."
((count+=1))
fi
echo "Idle minutes count = $count"
if (( count>10 ))
then
echo Shutting down
# wait a little bit more before actually pulling the plug
sleep 300
history -a
sudo poweroff
fi
sleep 60
done
For some reason it only works with both echo lines for the load values in place.
credits:
How to get overall CPU usage (e.g. 57%) on Linux
https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer
FYI: according to here, GCP monitoring agent is not available for N type instances: Google Cloud Platform: how to monitor memory usage of VM instances
Put this in a startup script in /etc/my_init.d and make it executable:
sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh
Actually, that directory gets deleted. Instead, add a startup-script key to Custom Metadata in the instance's Edit page, with #! /bin/bash on the first line and ~/autooff.sh on the next.
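If you would rather set that metadata from the command line than through the Edit page, a hedged sketch, assuming the script sits in a local file named autooff.sh and that the instance name and zone shown are placeholders:
# Upload the local autooff.sh as the instance's startup script.
gcloud compute instances add-metadata my-instance \
    --zone us-central1-a \
    --metadata-from-file startup-script=autooff.sh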

TCL script not working anymore

I used to run a Tcl script for the Cadence tools on a server; however, now the script fails to run.
The script is based on the following one:
#### Template Script for RTL->Gate-Level Flow
#### all basic steps except for DFT-scan
#### Fill in the <...> fields for your module
#### and update library search paths for your system
if {[file exists /proc/cpuinfo]} {
sh grep "model name" /proc/cpuinfo
sh grep "cpu MHz" /proc/cpuinfo
}
#### Set up
set DESIGN test
set SYN_EFF medium
set MAP_EFF medium
set DATE test
set global_map_report 1
set map_fancy_names 1
set iopt_stats 1
set SYN_PATH "."
set _OUTPUTS_PATH outputs_${DATE}
set _LOG_PATH logs_${DATE}
set _REPORTS_PATH reports_${DATE}
set_attribute lib_search_path {. ./lib} /
set_attribute hdl_search_path {. ./rtl} /
set_attribute information_level 7 /
set_attribute map_timing true /
set_attribute retime_reg_naming_suffix __retimed_reg /
set_attribute library lib
... continues
First I open a csh in order to run a csh script that sets up the Cadence tools on the server, then I run source script.tcl. This used to work; however, now it fails with the following error:
Missing ].
And if I comment out the first if:
set: Syntax Error.
What may have changed on the server for this to happen, and how can I fix it? The script did not change, so its syntax is correct.
As your comments in the code describe, you should call source script.tcl inside the tool, not in csh, which doesn't understand Tcl syntax, nor in tclsh, which doesn't understand those Cadence-specific Tcl commands.
Also, the two lines
sh grep "model name" /proc/cpuinfo
sh grep "cpu MHz" /proc/cpuinfo
should be
exec grep "model name" /proc/cpuinfo
exec grep "cpu MHz" /proc/cpuinfo
since exec is the correct Tcl command for calling shell commands.

Multi-GPU profiling (Several CPUs , MPI/CUDA Hybrid)

I had a quick look on the forums and I don't think this question has been asked already.
I am currently working with an MPI/CUDA hybrid code, made by somebody else during his PhD.
Each CPU has its own GPU.
My task is to gather data by running the (already working) code, and implement extra things.
Turning this code into a single CPU / Multi-GPU one is not an option at the moment (later, possibly.).
I would like to make use of performance profiling tools to analyse the whole thing.
For now the idea is to have each CPU launch nvvp for its own GPU and gather data, while another profiling tool takes care of the general CPU/MPI part (I plan to use TAU, as I usually do).
The problem is that launching nvvp's interface 8 times simultaneously (if running with 8 CPUs/GPUs) is extremely annoying. I would like to avoid going through the interface and instead get a command line that writes the data directly to a file, which I can feed into nvvp's interface later for analysis.
I'd like a command line that will be executed by each CPU and will produce, for each of them, a file with data about its own GPU. 8 (GPUs/CPUs) = 8 files.
Then I plan to feed and analyse these files with nvvp one by one, comparing the data manually.
Any idea ?
Thanks !
Take a look at nvprof, part of the CUDA 5.0 Toolkit (currently available as a release candidate). There are some limitations - it can only collect a limited number of counters in a given pass and it cannot collect metrics (so for now you'd have to script multiple launches if you want more than a few events). You can get more information from the nvvp built-in help, including an example MPI launch script (copied here but I suggest you check out the nvvp help for an up-to-date version if you have anything newer than the 5.0 RC).
#!/bin/sh
#
# Script to launch nvprof on an MPI process. This script will
# create unique output file names based on the rank of the
# process. Examples:
# mpirun -np 4 nvprof-script a.out
# mpirun -np 4 nvprof-script -o outfile a.out
# mpirun -np 4 nvprof-script test/a.out -g -j
# In the case you want to pass a -o or -h flag to the a.out, you
# can do this.
# mpirun -np 4 nvprof-script -c a.out -h -o
# You can also pass in arguments to nvprof
# mpirun -np 4 nvprof-script --print-api-trace a.out
#
usage () {
echo "nvprof-script [nvprof options] [-h] [-o outfile] a.out [a.out options]";
echo "or"
echo "nvprof-script [nvprof options] [-h] [-o outfile] -c a.out [a.out options]";
}
nvprof_args=""
while [ $# -gt 0 ];
do
case "$1" in
(-o) shift; outfile="$1";;
(-c) shift; break;;
(-h) usage; exit 1;;
(*) nvprof_args="$nvprof_args $1";;
esac
shift
done
# If user did not provide output filename then create one
if [ -z $outfile ] ; then
outfile=`basename $1`.nvprof-out
fi
# Find the rank of the process from the MPI rank environment variable
# to ensure unique output filenames. The script handles Open MPI
# and MVAPICH. If your implementation is different, you will need to
# make a change here.
# Open MPI
if [ ! -z ${OMPI_COMM_WORLD_RANK} ] ; then
rank=${OMPI_COMM_WORLD_RANK}
fi
# MVAPICH
if [ ! -z ${MV2_COMM_WORLD_RANK} ] ; then
rank=${MV2_COMM_WORLD_RANK}
fi
# Set the nvprof command and arguments.
NVPROF="nvprof --output-profile $outfile.$rank $nvprof_args"
exec $NVPROF $*
# If you want to limit which ranks get profiled, do something like
# this. You have to use the -c switch to get the right behavior.
# mpirun -np 2 nvprof-script --print-api-trace -c a.out -q
# if [ $rank -le 0 ]; then
# exec $NVPROF $*
# else
# exec $*
# fi
Another option: since you are already using TAU to profile the CPU side of the application, you could also use TAU to collect the GPU performance data. TAU supports multi-GPU execution along with MPI; take a look at http://www.nic.uoregon.edu/tau-wiki/Guide:TAUGPU for instructions on how to get started with TAU's GPU profiling capabilities. TAU uses CUPTI (CUDA Performance Tools Interface) underneath, so the data you will be able to collect with TAU will be very similar to what you can collect with NVIDIA's Visual Profiler.
Things have changed since CUDA 5.0, and now we can simply use %h, %p and %q{ENV} as mentioned here, instead of using a wrapper script:
$ mpirun -np 2 -host c0-0,c0-1 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app
Apparently, since 2015 it has been possible to auto-annotate MPI calls via NVTX and the mpi_interceptions.so library when using the nvprof profiler:
https://devblogs.nvidia.com/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/
http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf
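A hedged example of what that invocation might look like, assuming Open MPI and an nvprof recent enough to provide the --annotate-mpi option described in the first link (the application name is the same placeholder as above):
$ mpirun -np 2 nvprof --annotate-mpi openmpi -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app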
TAU still does not support distributed deep learning, according to this presentation:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7684-allen-malony-performance-analysis-of-cuda-deep-learning-networks-using-tau.pdf

Saving security camera stream

I have an Axis M1011 camera and I want to continuously save the camera stream, splitting it into multiple files.
Then I want to register it in a MySQL database (I think only the information about each file).
How is it possible to do this?
I looked at ffmpeg, but I think I would lose some frames between the successive connections.
One simple script is this: it saves a video every minute at 1 fps.
The videos are saved in a year/month/day/hour/... directory structure. With a path built like that, I don't know whether it is also useful to store the path in the database.
b=.avi
while true; do
    path=`date +%Y/%m/%d/%k/`
    file=`date +%k:%M-%d_%m_%Y`
    mkdir -p "$path"
    e=$file$b
    echo "$e"
    ffmpeg -r 1 -t "00:01:00" -f mjpeg -i "http://address/mjpg/video.mjpg?streamprofile=lowprofile" "$path$e" &
    sleep 60
    i=`expr $i + 1`
done
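To cover the database half of the question, here is a minimal sketch of registering each finished segment in MySQL. The table, database name and credentials are assumptions, not something from the question, and the helper would be called from the loop above once a segment is complete:
#!/bin/bash
# Hypothetical helper: record one finished video segment in MySQL.
# Assumed schema: CREATE TABLE recordings (path VARCHAR(255),
#                 filename VARCHAR(255), created_at DATETIME);
register_segment() {
    seg_path=$1    # e.g. 2014/01/10/09/
    seg_file=$2    # e.g. 09:05-10_01_2014.avi
    mysql -u camuser -pcampass cameradb -e \
        "INSERT INTO recordings (path, filename, created_at)
         VALUES ('$seg_path', '$seg_file', NOW());"
}
# Usage inside the loop, after the minute-long ffmpeg run has finished:
# register_segment "$path" "$e"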