Multi-GPU profiling (several CPUs, MPI/CUDA hybrid) - cuda

I had a quick look on the forums and I don't think this question has been asked already.
I am currently working with a hybrid MPI/CUDA code, written by somebody else during his PhD.
Each CPU has its own GPU.
My task is to gather data by running the (already working) code and to implement extra things.
Turning this code into a single-CPU / multi-GPU one is not an option at the moment (later, possibly).
I would like to make use of performance profiling tools to analyse the whole thing.
For now, the idea is to have each CPU launch nvvp for its own GPU and gather data, while another profiling tool takes care of the general CPU/MPI part (I plan to use TAU, as I usually do).
The problem is that launching nvvp's interface 8 times simultaneously (if running with 8 CPUs/GPUs) is extremely annoying. I would like to avoid going through the interface and instead get a command line that directly writes the data to a file, which I can feed to nvvp's interface later for analysis.
I'd like to get a command line that will be executed by each CPU and will produce, for each of them, a file giving data about its own GPU. 8 CPUs/GPUs = 8 files.
Then I plan to feed and analyse these files in nvvp individually, one by one, comparing the data manually.
Any idea?
Thanks!

Take a look at nvprof, part of the CUDA 5.0 Toolkit (currently available as a release candidate). There are some limitations: it can only collect a limited number of counters in a given pass, and it cannot collect metrics (so for now you'd have to script multiple launches if you want more than a few events). You can get more information from the nvvp built-in help, including an example MPI launch script (copied here, but I suggest you check the nvvp help for an up-to-date version if you have anything newer than the 5.0 RC).
#!/bin/sh
#
# Script to launch nvprof on an MPI process. This script will
# create unique output file names based on the rank of the
# process. Examples:
#   mpirun -np 4 nvprof-script a.out
#   mpirun -np 4 nvprof-script -o outfile a.out
#   mpirun -np 4 nvprof-script test/a.out -g -j
# In case you want to pass a -o or -h flag to the a.out, you
# can do this:
#   mpirun -np 4 nvprof-script -c a.out -h -o
# You can also pass in arguments to nvprof:
#   mpirun -np 4 nvprof-script --print-api-trace a.out
#
usage () {
    echo "nvprof-script [nvprof options] [-h] [-o outfile] a.out [a.out options]";
    echo "or"
    echo "nvprof-script [nvprof options] [-h] [-o outfile] -c a.out [a.out options]";
}

nvprof_args=""
while [ $# -gt 0 ];
do
    case "$1" in
        (-o) shift; outfile="$1";;
        (-c) shift; break;;
        (-h) usage; exit 1;;
        (*) nvprof_args="$nvprof_args $1";;
    esac
    shift
done

# If the user did not provide an output filename then create one
if [ -z $outfile ] ; then
    outfile=`basename $1`.nvprof-out
fi

# Find the rank of the process from the MPI rank environment variable
# to ensure unique output filenames. The script handles Open MPI
# and MVAPICH. If your implementation is different, you will need to
# make a change here.

# Open MPI
if [ ! -z ${OMPI_COMM_WORLD_RANK} ] ; then
    rank=${OMPI_COMM_WORLD_RANK}
fi
# MVAPICH
if [ ! -z ${MV2_COMM_WORLD_RANK} ] ; then
    rank=${MV2_COMM_WORLD_RANK}
fi

# Set the nvprof command and arguments.
NVPROF="nvprof --output-profile $outfile.$rank $nvprof_args"
exec $NVPROF $*

# If you want to limit which ranks get profiled, do something like
# this. You have to use the -c switch to get the right behavior.
#   mpirun -np 2 nvprof-script --print-api-trace -c a.out -q
# if [ $rank -le 0 ]; then
#     exec $NVPROF $*
# else
#     exec $*
# fi

Another option: since you are already using TAU to profile the CPU side of the application, you could also use TAU to collect the GPU performance data. TAU supports multi-GPU execution along with MPI; take a look at http://www.nic.uoregon.edu/tau-wiki/Guide:TAUGPU for instructions on how to get started with TAU's GPU profiling capabilities. TAU uses CUPTI (CUDA Profiling Tools Interface) underneath, so the data you will be able to collect with TAU will be very similar to what you can collect with NVIDIA's Visual Profiler.

Things have changed since CUDA 5.0, and now we can simply use %h, %p and %q{ENV} as mentioned here, instead of using a wrapper script:
$ mpirun -np 2 -host c0-0,c0-1 nvprof -o output.%h.%p.%q{OMPI_COMM_WORLD_RANK} ./my_mpi_app

Apparently, since 2015 it has been possible to auto-annotate MPI calls via NVTX and the mpi_interceptions.so library when using the nvprof profiler:
https://devblogs.nvidia.com/gpu-pro-tip-track-mpi-calls-nvidia-visual-profiler/
http://on-demand.gputechconf.com/gtc/2017/presentation/s7495-jain-optimizing-application-performance-cuda-profiling-tools.pdf
TAU still does not support distributed deep learning, according to this presentation:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7684-allen-malony-performance-analysis-of-cuda-deep-learning-networks-using-tau.pdf

Related

Why are the binaries generated by using Meson / Ninja much larger than those compiled by plain valac?

Same source file.
Direct compile using valac:
⭕ valac --pkg gtk+-3.0 -X -lm --pkg libcanberra src/Application.vala
⭕ ls Application
-rwxrwxr-x 1 eexpss 48K 05-13 19:59 Application
Here is part of my meson.build:
project('com.github.eexpress.cairo-timer', 'vala', 'c')
# i18n = import('i18n')
executable(
    meson.project_name(),
    'src/Application.vala',
    dependencies: [
        dependency('gtk+-3.0'),
        # dependency('cairo'),
        dependency('libcanberra')
    ],
    # link_args : '-X',
    # link_args : '-lm',
    link_args : ['-X', '-lm',],
    install: true
)
and use ninja to compile it.
⭕ cd build; ninja
⭕ ls com.github.eexpress.cairo-timer
-rwxrwxr-x 1 eexpss 98K 05-13 17:02 com.github.eexpress.cairo-timer
So the binary file is much larger than the one above. Why?
Because you didn't enable debugging for valac, but meson enables it by default. Add -g to valac and the output sizes should be close to equal.
To see how ninja and valac run the underlying tools to build, enable the verbose option by passing -v to both commands.
The minor remaining size differences come, I assume, from the file names embedded in the binaries. Compare the outputs, for example, from readelf --debug-dump=line hello to see the diff.
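To check this quickly, you can either add debug info to the valac build or turn it off on the Meson side. A sketch, reusing the flags from the question (--buildtype is Meson's standard option, release is one of its built-in values):
# valac with debug info, matching Meson's default:
valac -g --pkg gtk+-3.0 -X -lm --pkg libcanberra src/Application.vala
# or Meson without debug info:
meson setup build --buildtype=release
ninja -C build
The two binaries should then come out close in size.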

How to send binary flashing file to embedded system with only serial console?

I have an embedded Linux system that boots from a ramdisk, so it has no persistent storage available at run time (it does have Flash to store the kernel and ramdisk).
The only connectivity is an RS-232 serial login console, so I am limited to what is provided by its built-in busybox. I want to retrieve the ramdisk, modify it, and rewrite it. The kernel does not have Flash filesystem support built in. The ramdisk partition size is about 10 MB. When all files in the user directory are deleted, the free ramdisk space is about 14 MB.
The dd command is available, so I can copy the ramdisk partition to the ramdisk, and I can write to the Flash from a ramdisk file. flashcp is also available.
So my problem is now how to receive and send binary files through the RS-232 serial console?
I researched the following, and none of it was useful for me:
Linux command to send binary file to serial port with HW flow control? on stackoverflow
Binary data over serial terminal on stackoverflow
Transferring files using serial console on k.japko.eu
File transfer over a serial line on superuser.com
How to get file to a host when all you have is a serial console? on stackexchange
Mostly because x/y/zmodem are not available in the busybox.
Any idea? Thanks!
Per the request, here's what I should have included in the first place.
Available u-boot commands:
U-Boot >?
? - alias for 'help'
askenv - get environment variables from stdin
base - print or set address offset
bdinfo - print Board Info structure
boot - boot default, i.e., run 'bootcmd'
bootd - boot default, i.e., run 'bootcmd'
bootm - boot application image from memory
cmp - memory compare
coninfo - print console devices and information
cp - memory copy
crc32 - checksum calculation
crc32_chk_uimage- checksum calculation of an image for u-boot
echo - echo args to console
editenv - edit environment variable
env - environment handling commands
exit - exit script
false - do nothing, unsuccessfully
fatinfo - print information about filesystem
fatload - load binary file from a dos filesystem
fatls - list files in a directory (default /)
fatwrite- write file into a dos filesystem
go - start application at address 'addr'
gpio - input/set/clear/toggle gpio pins
help - print command description/usage
i2c - I2C sub-system
iminfo - print header information for application image
imxtract- extract a part of a multi-image
itest - return true/false on integer compare
loadb - load binary file over serial line (kermit mode)
loads - load S-Record file over serial line
loady - load binary file over serial line (ymodem mode)
loop - infinite loop on address range
md - memory display
mdc - memory display cyclic
mm - memory modify (auto-incrementing address)
mw - memory write (fill)
mwc - memory write cyclic
nm - memory modify (constant address)
printenv- print environment variables
reset - Perform RESET of the CPU
run - run commands in an environment variable
saveenv - save environment variables to persistent storage
saves - save S-Record file over serial line
setenv - set environment variables
sf - SPI flash sub-system
showvar - print local hushshell variables
sleep - delay execution for some time
source - run script from memory
sspi - SPI utility command
test - minimal test like /bin/sh
true - do nothing, successfully
usb - USB sub-system
usbboot - boot from USB device
version - print monitor, compiler and linker version
U-Boot >
Available busybox commands:
BusyBox v1.13.2 (2015-03-16 10:50:56 EDT) multi-call binary
Copyright (C) 1998-2008 Erik Andersen, Rob Landley, Denys Vlasenko
and others. Licensed under GPLv2.
See source distribution for full notice.
Usage: busybox [function] [arguments]...
or: function [arguments]...
BusyBox is a multi-call binary that combines many common Unix
utilities into a single executable. Most people will create a
link to busybox for each function they wish to use and BusyBox
will act like whatever it was invoked as!
Currently defined functions:
[, [[, addgroup, adduser, ar, ash, awk, basename, blkid,
bunzip2, bzcat, cat, chattr, chgrp, chmod, chown, chpasswd,
chroot, chvt, clear, cmp, cp, cpio, cryptpw, cut, date,
dc, dd, deallocvt, delgroup, deluser, df, dhcprelay, diff,
dirname, dmesg, du, dumpkmap, dumpleases, echo, egrep, env,
expr, false, fbset, fbsplash, fdisk, fgrep, find, free,
freeramdisk, fsck, fsck.minix, fuser, getopt, getty, grep,
gunzip, gzip, halt, head, hexdump, hostname, httpd, hwclock,
id, ifconfig, ifdown, ifup, inetd, init, insmod, ip, kill,
killall, klogd, last, less, linuxrc, ln, loadfont, loadkmap,
logger, login, logname, logread, losetup, ls, lsmod, makedevs,
md5sum, mdev, microcom, mkdir, mkfifo, mkfs.minix, mknod,
mkswap, mktemp, modprobe, more, mount, mv, nc, netstat,
nice, nohup, nslookup, od, openvt, passwd, patch, pidof,
ping, ping6, pivot_root, poweroff, printf, ps, pwd, rdate,
rdev, readahead, readlink, readprofile, realpath, reboot,
renice, reset, rm, rmdir, rmmod, route, rtcwake, run-parts,
sed, seq, setconsole, setfont, sh, showkey, sleep, sort,
start-stop-daemon, strings, stty, su, sulogin, swapoff,
swapon, switch_root, sync, sysctl, syslogd, tail, tar, tcpsvd,
tee, telnet, telnetd, test, tftp, tftpd, time, top, touch,
tr, traceroute, true, tty, udhcpc, udhcpd, udpsvd, umount,
uname, uniq, unzip, uptime, usleep, vconfig, vi, vlock,
watch, wc, wget, which, who, whoami, xargs, yes, zcat
In U-Boot you could use loady/loadx to get a file from the PC via UART. I usually use Tera Term to send the file.
The process should be this:
run loady in U-Boot
use Tera Term to send the data
the file is transferred to your device's memory, located at 0x01000000.
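Put together with the sf command from your U-Boot command list, a session might look like this (a sketch: the load address comes from above, but the flash offset and length are placeholders you must adapt to your partition layout):
U-Boot > loady 0x01000000              # then start the YMODEM send in the terminal
U-Boot > sf probe 0                    # attach the SPI flash
U-Boot > sf erase 0x100000 0xA00000    # erase the target region first
U-Boot > sf write 0x01000000 0x100000 0xA00000
This avoids the Linux-side transfer entirely, at the cost of working at the raw partition level.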
Independently, I found a way to upload binary files through the Linux console, and I'll document the steps here in case others find them useful, since I had a hard time finding this information on the net.
Here's the theory: change the console mode to raw so the binary traffic isn't interpreted as console commands (e.g. Ctrl-C). Turn off echo so it doesn't add extra serial traffic. Run tar to accept input from stdin. Since Ctrl-C won't work and tar won't know when to terminate, use a background task to kill the login shell so you can log in again to do your stuff.
Steps:
Create a script to run in the background. Change the myvar variable so it kills the login shell after the transfer is complete. Currently 120 corresponds to 1200 seconds, sufficient for a 10 MB file. In addition, edit the 808 to match your login shell PID.
Create the bg file:
myvar=120
while [ $myvar -gt 0 ]
do
    myvar=$(( $myvar-1 ))
    echo -e " $myvar \n"
    ls -l
    sleep 10
done
kill -9 808
Launch the script in the background:
in console type:
source ./bg &
Use stty to change the console to raw mode with echo turned off
in console type:
stty raw -echo
Start tar to untar stdin. Note: I have to use Ctrl-J since Enter no longer works after the stty command.
In the console, type the following and end the line with Ctrl-J, not Enter:
tar zx -f - 1> 1.log 2> 2.log
Start Tera Term and send the binary file
Wait for completion and the new login prompt
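If your host is a Linux box rather than a Windows machine running Tera Term, a minimal way to push the archive down the line is the following sketch (/dev/ttyUSB0 and the 115200 baud rate are assumptions; adjust to your setup):
# on the host: raw mode at the right speed, then stream the archive
stty -F /dev/ttyUSB0 115200 raw -echo
cat files.tar.gz > /dev/ttyUSB0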
I forgot I asked this question. I figured out how to make an ssh connection, which in turn allows many more things to be done easily. Of course, it requires sshd in addition to nc and stty, so you are out of luck if these are not available on your embedded Linux. I have tried it several times and it seems to work well, allowing multiple ssh sessions to be established and mc to transfer files.
You will need two shell sessions on the host computer, one to loop the serial port to socket, and the other for the ssh, and more if you want to establish more ssh sessions.
First you need to set up the serial port. The '--noreset' option for picocom does this:
sudo picocom --noreset -b 115200 -e b /dev/ttyUSB3
Quit picocom once this is done (^B^X to exit).
Next we need to verify that the line endings are not translated, or else ssh won't work. In the first shell, run:
cat /dev/ttyUSB3 | hexdump -C
In the second shell run:
echo "echo -e \"LFLF\\n\\nCRCR\\r\\rEND\"" > /dev/ttyUSB3
You may see that \n (0x0A) is translated to \r\n (0x0D 0x0A).
Use stty to set raw mode without echo and you should see no more translation:
echo "stty raw -echo" > /dev/ttyUSB3
echo "echo -e \"LFLF\\n\\nCRCR\\r\\rEND\"" > /dev/ttyUSB3
Finally in the first shell run nc to funnel local traffic between the serial port and ssh socket:
cat /dev/ttyUSB3 | nc -l -p 2222 > /dev/ttyUSB3
and funnel remote serial traffic to sshd:
echo "while true ; do nc localhost 22 ; done" > /dev/ttyUSB3
and connect ssh with port forwarding:
ssh -vvv root@localhost -p 2222 -L 0.0.0.0:22022:localhost:22
You can make more ssh connections simultaneously:
ssh -vvv root@localhost -p 22022
If you use mc, you can connect to it so you can easily browse the remote file system and copy files:
sh://root@localhost:22022
Last words: nc strips the TCP headers, so the ssh packets are not checksummed and are not retried. If there is a data error, the connection will break. If you remember your login shell PID, you can kill it and log in again; otherwise you have to reboot. The '-vvv' flag for ssh is for debugging.

Does qemu emulator have checkpoint function?

I am using the QEMU emulator for aarch64 and want to create an outside checkpoint (or fast-forward) that saves everything needed to restart the system from the point where the checkpoint was created. (In fact, I want to skip the booting step.) I only found material on QEMU VM snapshots and fast-forwarding, but it does not work for the emulator. Is there any checkpoint function for the QEMU emulator?
A savevm snapshot should do what you want. The short answer is that you need to set up a QCOW2 disk for the snapshots to be saved to; then, in the monitor, you can use the 'savevm' command to take the snapshot. The command-line '-loadvm' option will then let you resume from there. This all works fine in emulation of AArch64.
https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/ has a more detailed tutorial.
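In outline, the flow looks like this (a sketch: the disk size, machine options and snapshot name are placeholders, while qemu-img, savevm and -loadvm are the standard tools and commands):
# create a qcow2 disk; snapshots are stored inside it
qemu-img create -f qcow2 disk.qcow2 8G
# boot with a monitor on stdio; in the monitor, take the snapshot:
#   (qemu) savevm after-boot
qemu-system-aarch64 -M virt -cpu cortex-a57 -m 1G \
    -drive file=disk.qcow2,format=qcow2 -monitor stdio ...
# later, resume directly from the snapshot, skipping the boot:
qemu-system-aarch64 ... -drive file=disk.qcow2,format=qcow2 -loadvm after-boot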
Minimal example
Peter's answer just worked for me, but let me provide a fully reproducible example.
I have fully automated everything at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/1e0f0b492855219351b0bfa2eec4d3a6811fcaaa#snapshot
The key step is to convert the image to qcow2 as explained at: https://docs.openstack.org/image-guide/convert-images.html
cd buildroot/output.x86_64~/images
qemu-img convert -f raw -O qcow2 rootfs.ext2 rootfs.ext2.qcow2
And the final QEMU command used was:
./buildroot/output.x86_64~/host/usr/bin/qemu-system-x86_64 \
    -m 128M \
    -monitor telnet::45454,server,nowait \
    -netdev user,hostfwd=tcp::45455-:45455,id=net0 \
    -smp 1 -M pc \
    -append 'root=/dev/vda nopat nokaslr norandmaps printk.devkmsg=on printk.time=y console=ttyS0' \
    -device edu \
    -device virtio-net-pci,netdev=net0 \
    -drive file=./buildroot/output.x86_64~/images/rootfs.ext2.qcow2,if=virtio,format=qcow2 \
    -kernel ./buildroot/output.x86_64~/images/bzImage \
    -nographic
To test it out, log in to the VM and run:
i=0; while true; do echo $i; i=$(($i + 1)); sleep 1; done
Then, in another shell, open the monitor:
telnet localhost 45454
savevm my_snap_id
The counting continues. Then, if we load the VM:
loadvm my_snap_id
the counting goes back to where we saved it. This shows that the CPU and memory states were reverted.
We can also verify that the disk state is reverted. Guest:
echo 0 >f
Monitor:
savevm my_snap_id
Guest:
echo 1 >f
Monitor:
loadvm my_snap_id
Guest:
cat f
And the output is 0.

How can I automatically kill idle GCE instances based on CPU usage?

I'm running some slightly unreliable software on some instances in an instance group. The software is installed and run by a startup script, and most of the time it works without issue, but about 10% of the new instances run out of memory and crash due to some sort of memory leak in the software. I can't get this leak fixed myself, so in the meantime I've been checking the instances every few hours and killing any that show an idle CPU (the software normally consumes all available CPU power).
However, I'm using preemptible instances, which can be killed off and restarted at any time, leaving dead instances running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard, and the rest is wasted.
Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.
The following worked for me. It's a bash script which uses the uptime UNIX command to check whether the 15-minute average CPU load is below a threshold, and automatically shuts down the system once this has been true on more than ten checks (the count is cumulative, so they need not be consecutive). You need to run this within your VM instance.
Credit, and more detailed explanation: Rohit Rawat's blog.
#!/bin/bash
threshold=0.4
count=0
while true
do
    load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
    res=$(echo $load'<'$threshold | bc -l)
    if (( $res ))
    then
        echo "Idling.."
        ((count+=1))
    fi
    echo "Idle minutes count = $count"
    if (( count>10 ))
    then
        echo Shutting down
        # wait a little bit more before actually pulling the plug
        sleep 300
        sudo poweroff
    fi
    sleep 60
done
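To keep the check running after you log out, you might launch it detached with nohup (a usage sketch; idle-shutdown.sh is a placeholder name for the script above):
chmod +x idle-shutdown.sh
nohup ./idle-shutdown.sh > idle-shutdown.log 2>&1 &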
It seems like there are two parts to this question:
Identifying dead instances.
Killing off those instances.
In terms of identifying dead instances, one way to do this would be to have a separate management instance that does not run this software and that keeps tabs on the other instances. For example, it could periodically send a health request to the various instances and mark non-responsive instances, or instances reporting an overly high CPU usage, as unhealthy.
Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset them using the API (I'm guessing the reset command) or by executing the same operation with the gcloud command-line tool, as sketched below.
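For example, something along these lines (a sketch; the instance, group and zone names are placeholders):
# reset one unhealthy instance in place
gcloud compute instances reset my-worker-3 --zone us-central1-a
# or, for a managed instance group, recreate it so the startup script runs again
gcloud compute instance-groups managed recreate-instances my-group \
    --instances my-worker-3 --zone us-central1-a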
I wish I could add this as a comment to viswajithiii's answer, but I'm just shy of the reputation necessary to comment.
I found the static threshold variable to be inappropriate when using cloud VMs with variable numbers of CPUs, as the output of uptime scales with the number of CPUs, as discussed here.
My updated script adds two lines below the threshold assignment to scale the threshold by the number of CPUs. This allows me to set a percentage CPU utilization that will work across VMs with different numbers of CPUs.
Otherwise, the script is the same as viswajithiii's.
#!/bin/bash
threshold=0.4
n_cpu=$( grep 'model name' /proc/cpuinfo | wc -l )
threshold=$( echo $n_cpu*$threshold | bc )
count=0
while true
do
    load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
    res=$(echo $load'<'$threshold | bc -l)
    if (( $res ))
    then
        echo "Idling.."
        ((count+=1))
    fi
    echo "Idle minutes count = $count"
    if (( count>10 ))
    then
        echo Shutting down
        # wait a little bit more before actually pulling the plug
        sleep 300
        sudo poweroff
    fi
    sleep 60
done
This works without bc (which is not available in GCP's Container-Optimized OS), using viswajithiii's answer and this post:
How can I replace 'bc' tool in my bash script?
It also appends the history list to a file before poweroff. I set my threshold very low, but the load shows 0.00 even when I'm editing files via the CLI. It might work better if the instance is under heavy load.
#!/bin/bash
threshold=10
count=0
while true
do
    load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
    load2=$(awk -v a="$load" 'BEGIN {print a*100}')
    echo $load2
    if [ $load2 -lt $threshold ]
    then
        echo "Idling.."
        ((count+=1))
    fi
    echo "Idle minutes count = $count"
    if (( count>10 ))
    then
        echo Shutting down
        # wait a little bit more before actually pulling the plug
        sleep 300
        history -a
        sudo poweroff
    fi
    sleep 60
done
That didn't work for my low-CPU instance, but this did:
#!/bin/bash
threshold=1
count=0
while true
do
    load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' \
        <(grep 'cpu ' /proc/stat) <(sleep 1; grep 'cpu ' /proc/stat))
    load2=$(printf "%.0f\n" $load)
    echo $load
    echo $load2
    if [[ $load2 -lt $threshold ]]
    then
        echo "Idling.."
        ((count+=1))
    fi
    echo "Idle minutes count = $count"
    if (( count>10 ))
    then
        echo Shutting down
        # wait a little bit more before actually pulling the plug
        sleep 300
        history -a
        sudo poweroff
    fi
    sleep 60
done
For some reason it only works with both echo lines in place.
credits:
How to get overall CPU usage (e.g. 57%) on Linux
https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer
FYI: according to here, the GCP monitoring agent is not available for N-type instances: Google Cloud Platform: how to monitor memory usage of VM instances
Put this in a startup script in /etc/my_init.d and make it executable:
sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh
Actually, that directory gets deleted. Instead, add a startup-script key to the instance's Custom Metadata (under Edit) whose value is #! /bin/bash on the first line, followed by a line that runs autooff.sh.
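The same thing can be done from the command line (a sketch; the instance name and zone are placeholders):
# attach the script as a startup script via instance metadata
gcloud compute instances add-metadata my-instance \
    --zone us-central1-a \
    --metadata-from-file startup-script=autooff.sh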

How to accelerate a shell script for a MongoDB application?

I started shell scripting for my work, but I must admit I'm still far from even being a rookie. Therefore I wanted to ask for your help/advice.
I built a script for a big-data application (taking the quick-and-dirty approach, patching stuff from the internet together) to recursively go through a folder structure and convert all XML files to JSON.
The status quo of my script is:
#!/bin/bash
# Shell script to find all the files under a directory and
# its subdirectories. This also takes into consideration those files
# or directories which have spaces or newlines in their names
cd /Users/q337498/Desktop/Archiv/2014/01/10

DIR="."

function list_files()
{
    if ! test -d "$1"
    then echo $1; return;
    fi

    cd "$1"
    # echo; echo `pwd`:; # Display directory name

    for i in *
    do
        if test -d "$i"; then # if directory
            if [ "$(ls -A $i)" ]; then
                list_files "$i" # recursively list files
                cd ..
            else
                echo "$i is Empty"
            fi
        else
            java -jar /Users/q337498/Desktop/XML2JSON/SaxonEE9-5-1-4J/saxon9ee.jar \
                -s:"$i" \
                -xsl:/Users/q337498/Desktop/xsltjson-master/conf/xml-to-json.xsl \
                -o:output/$(pwd)/${i%%[.]*}
            # if jsonlint /Users/q337498/Desktop/Archiv/2014/01/08/$(pwd)/${i%%[.]*} -q; then
            #     echo "GOOD"
            # else
            #     echo "NOT GOOD"
            # fi
            # echo ${i%%[.]*}
            # echo "$i"; # Display file name
        fi
    done
}

if [ $# -eq 0 ]
then
    list_files .
    exit 0
fi

for i in $*
do
    DIR="$1"
    list_files "$DIR"
    shift 1 # to read next directory/file name
done
This code works, but the problem is that for 60,000 files it takes up to 15 hours on a MacBook Pro with 16 GB RAM and a 2.8 GHz i7. And I need to convert 10 million files.
How do you think I could accelerate the script? Parallelize? Take some commands out? What options do I have, and how would I actually implement them?
The files are ultimately going to end up in MongoDB, so if someone knows a better way to convert XML to JSON and upload it to Mongo, their input is also welcome.
Cheers,
Dudu
I see two immediate problems here:
You are invoking java once for each file, therefore incurring the JVM startup cost for every file, which adds up to a huge chunk of time.
You are running single-threaded.
So I would suggest that you:
Write a Java program that does the directory traversal and performs your transformation.
Benchmark the performance difference.
Try other Java libraries for the XML->JSON conversion: https://github.com/beckchr/staxon/wiki/Benchmark
If necessary for performance, add multi-threading to your application using java.util.concurrent.
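If a Java rewrite isn't feasible right away, you can also get a large speedup by parallelizing the existing per-file Saxon invocation. A minimal sketch using find and xargs, reusing the paths from the question (-P 8 assumes 8 cores; note it still pays the JVM startup cost per file, so the Java-based traversal above will ultimately be faster):
#!/bin/bash
# Run one Saxon transformation per XML file, 8 in parallel.
cd /Users/q337498/Desktop/Archiv/2014/01/10
find . -name '*.xml' -print0 |
    xargs -0 -P 8 -I{} \
        java -jar /Users/q337498/Desktop/XML2JSON/SaxonEE9-5-1-4J/saxon9ee.jar \
            -s:'{}' \
            -xsl:/Users/q337498/Desktop/xsltjson-master/conf/xml-to-json.xsl \
            -o:'{}.json'
On my understanding this alone should cut the wall-clock time by roughly the core count, at the cost of running several JVMs at once.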