TPU training fails with a certain metric, succeeds on CPU

I'm trying to train a simple EfficientNet-style model on some images. Training works fine on a CPU, but when I switch to a TPU I get the following error:
(0) Invalid argument: {{function_node
__inference_train_function_38255}} Output shapes of then and else branches do not match: (s64[1,<=4]) vs. (s64[<=4])
[[{{node cond}}]]
[[TPUReplicate/_compile/_5430787790498024493/_4]]
[[tpu_compile_succeeded_assert/_6318656678166656164/_5/_289]]
(1) Invalid argument: {{function_node
__inference_train_function_38255}} Output shapes of then and else branches do not match: (s64[1,<=4]) vs. (s64[<=4])
[[{{node cond}}]]
[[TPUReplicate/_compile/_5430787790498024493/_4]]
[[tpu_compile_succeeded_assert/_6318656678166656164/_5/_225]]
This error only occurs when I'm using a particular metric, Cohen's Kappa; if I remove this metric, the model trains fine.
I've tried to track down the offending section in CohensKappa and narrowed it down to _update_confusion_matrix: if I override both that method and result, the model trains fine.
When I start training, I see this log message:
TPU has inputs with dynamic shapes: [<tf.Tensor 'Const:0' shape=() dtype=int32>, <tf.Tensor 'cond_8/Identity:0' shape=(None, 456, 456, 3) dtype=float32>, <tf.Tensor 'cond_8/Identity_1:0' shape=(None,) dtype=int64>]
That may be related; however, given that the model trains fine when I leave out this metric while that log still appears, it might be a red herring.
Any suggestions on solutions, or on how to debug this, would be very helpful. Switching to eager execution mode isn't an option, as it all works fine on the CPU.

Please share the code snippet that leads to this error. From what it shows, you seem to have a tensor shape inconsistency problem (i.e. (s64[1,<=4]) vs. (s64[<=4])).
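Not a confirmed fix, but here is a sketch of a workaround, assuming the metric is TensorFlow Addons' tfa.metrics.CohenKappa and that the rank mismatch in the error ((s64[1,<=4]) vs. (s64[<=4])) comes from dynamically shaped labels reaching its confusion-matrix update; forcing the labels to a fixed rank before the update runs may let XLA compile both branches with matching shapes:

import tensorflow as tf
import tensorflow_addons as tfa

# Hypothetical workaround: flatten the labels to rank 1 so both branches
# of any internal tf.cond emit the same shape under TPU/XLA compilation.
class FlatCohenKappa(tfa.metrics.CohenKappa):
    def update_state(self, y_true, y_pred, sample_weight=None):
        y_true = tf.reshape(y_true, [-1])  # force static rank-1 labels
        return super().update_state(y_true, y_pred, sample_weight)

# usage sketch (num_classes=5 is illustrative):
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
#               metrics=[FlatCohenKappa(num_classes=5)])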

Related

What is wrong when training an autoencoder on the MNIST dataset with Caffe?

I want to use the MNIST dataset to train a simple autoencoder in Caffe with NVIDIA DIGITS.
I have:
caffe: 0.16.4
DIGITS: 5.1
python 2.7
I use the structure provided here:
https://github.com/BVLC/caffe/blob/master/examples/mnist/mnist_autoencoder.prototxt
I then face two problems:
When I use the provided structure I get this error:
Traceback (most recent call last):
File "digits/scheduler.py", line 512, in run_task
task.run(resources)
File "digits/task.py", line 189, in run
self.before_run()
File "digits/model/tasks/caffe_train.py", line 220, in before_run
self.save_files_generic()
File "digits/model/tasks/caffe_train.py", line 665, in save_files_generic
'cannot specify two val image data layers'
AssertionError: cannot specify two val image data layers
When I remove the layer for "test-on-test", I get a bad result like this:
https://screenshots.firefox.com/8hwLmSmEP2CeiyQP/localhost
What is the problem?
The first problem occurs because the .prototxt has two layers named data with the TEST phase. The first layer that consumes data, i.e. flatdata, does not know which data to use (test-on-train or test-on-test); that's why the error goes away when you remove one of the TEST-phase data layers. Edit: I've checked the solver file, and it has a test_stage parameter that should switch between the test files, but it's clearly not working in your case.
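For reference, here is how the linked mnist_autoencoder example keeps the two TEST-phase data layers apart: each layer is tagged with a stage, and the solver activates one stage per test net (sources and batch sizes below are illustrative):

layer {
  name: "data"
  type: "Data"
  top: "data"
  include { phase: TEST stage: "test-on-train" }
  data_param { source: "examples/mnist/mnist_train_lmdb" backend: LMDB batch_size: 100 }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  include { phase: TEST stage: "test-on-test" }
  data_param { source: "examples/mnist/mnist_test_lmdb" backend: LMDB batch_size: 100 }
}

# solver.prototxt: one test_state per test net
test_state: { stage: 'test-on-train' }
test_iter: 500
test_state: { stage: 'test-on-test' }
test_iter: 100

If DIGITS ignores these stages, that would explain why the assertion still fires.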
The second problem is a little more difficult to solve. My knowledge of autoencoders is limited, but it seems your Euclidean loss changes very little across iterations; I would check the base learning rate in your solver.prototxt, decrease it, and watch how the losses fluctuate.
Besides that, for the epochs/iterations that achieved a low error, have you checked the output data/images? Do they make sense?

How to measure number of asserts per line of code in SonarQube

We are attempting to get another view of our code coverage beyond the standard line and branch coverage. We would like to get the number of asserts per line/method/class in order to see whether we are merely running through the code or actually checking for expected results.
So, how do we measure the number of asserts in a codebase in SonarQube?
There is a product called pitest (PIT mutation testing) that answers the goal of my question, and there is a Sonar plugin for pitest. So the answer to "how do we verify that our tests actually check anything" is: pitest.
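For illustration, a minimal sketch of wiring PIT into a Maven build (the version number is illustrative; check the current release):

<!-- pom.xml: PIT mutation-testing plugin; run it with
     mvn org.pitest:pitest-maven:mutationCoverage -->
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.2</version>
</plugin>

The resulting mutation-coverage report can then be picked up by the Sonar pitest plugin.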

lme4 glmm model convergence issue

I am trying to use the lme4 package for a GLMM and am getting a convergence code of 0 and the statement: Model failed to converge with max|grad| = 0.00791467 (tol = 0.001, component 1). I am interested in using the lme4 package because I would like AIC values to determine the appropriate model as I add in additional covariates.
Two weeks ago, when I tried the same approach, I got a warning message that the model failed to converge because of the max|grad| issue; this time there is no warning message, just the statement at the end of the summary output.
Does this mean that the model is not converging? I also used the glmmPQL method; the coefficient estimates are similar between the two model types.
Here is the glmer (lme4) model code. I increased maxfun to deal with other issues I had when I ran the model last time.
l1 <- glmer(Meat_Weight ~ logsh + SAMS_region_2015 + (1 | StationID),
            family = Gamma(link = "log"), data = datad,
            control = glmerControl(optCtrl = list(maxfun = 100000)))
Here is the glmmPQL code.
m1 <- glmmPQL(fixed = Meat_Weight ~ logsh + SAMS_region_2015,
              random = ~ 1 | StationID,
              family = Gamma(link = "log"), data = datad)
I am sure this is not enough information to diagnose the problem, but if anyone has suggestions I can provide more data.
Thanks
Try changing the optimizer:
l1 <- glmer(Meat_Weight ~ logsh + SAMS_region_2015 + (1 | StationID),
            family = Gamma(link = "log"), data = datad,
            control = glmerControl(optimizer = "bobyqa"))

How to throw exception in a .oct file in octave?

I am currently developing GeoTIFF reading and writing functions for Octave using .oct files. I went through the Octave documentation but could not find much on throwing exceptions. Does that mean I can throw exceptions the way I do in C++, by simply writing throw "error message"?
There are two ways. Admittedly, they are documented in two utterly separate places that are not cross-referenced, which makes no sense; if you didn't already know the function or keyword, you wouldn't find them:
error() raises an error, which stops the program. See 12.1 Raising Errors.
error("[%s] Here be wyrms", pkgname)
assert() both tests a condition and raises an error() with a customizable message (so you don't have to write if (cond) ... error (...) ... endif yourself).
See B.1 Test Functions.
% 1. Produce an error if the specified condition is zero (not met).
assert (cond)
assert (cond, errmsg)
assert (cond, errmsg, …)
assert (cond, msg_id, errmsg, …)
% 2a. Produce an error if observed (expression) is not the same as expected (expression); Note that observed and expected can be scalars, vectors, matrices, strings, cell arrays, or structures.
assert (observed, expected)
% 2b. a version that includes a (typically floating-point) tolerance
assert (observed, expected, tol)
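For example, a hypothetical square-matrix check using the condition-plus-printf-style-message form:

% raises "A must be square, got 2x3" for a 2-by-3 input
assert (issquare (A), "A must be square, got %dx%d", rows (A), columns (A))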
See also the command fail()
Yes, you could just use something like
error ("mynewlib: Hello %s world!", "foo");
to signal errors, which are then caught and displayed.
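To make that concrete, here is a minimal sketch of a .oct function that signals an error from C++; the function name and the validation are illustrative, and note that error() does not return:

#include <octave/oct.h>

// Hypothetical .oct entry point: validate input and signal failure via
// Octave's error(), rather than a raw C++ throw.
DEFUN_DLD (read_geotiff, args, , "read_geotiff (FILENAME)")
{
  if (args.length () != 1 || ! args(0).is_string ())
    error ("read_geotiff: FILENAME must be a string");

  const std::string filename = args(0).string_value ();
  // ... on a failed read, report it the same way:
  // error ("read_geotiff: cannot open %s", filename.c_str ());

  return octave_value_list ();
}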
(Personally, I think such questions should really go to the GNU Octave mailing list, where you'll find the core developers and Octave-Forge package maintainers.)
I guess you want to build a wrapper around libgeotiff? Have a look at the octave-image package! Where do you host your code?
./examples/code/unwinddemo.cc might also be interesting for you; it shows how to use unwind_protect and define user error handlers.
http://hg.savannah.gnu.org/hgweb/octave/file/3b0a9a832360/examples/code/unwinddemo.cc
Perhaps your function should then be merged into the octave-forge mapping package: http://sourceforge.net/p/octave/mapping/ci/default/tree/

How to force gfortran to stop at the first NaN?

My Configuration: Cygwin64, Win7 64 bit, gfortran 4.8.3
I went through all the Q&A pairs I managed to find; this one is the closest to my problem:
Force gfortran to stop program at first NaN
PROBLEM: the Fortran code
real :: init_with_NaN ! left uninitialized on purpose; -finit-real=nan fills it with a quiet NaN
real :: res
res = 10 * init_with_NaN
The code is compiled with the following gfortran flags
COLLECT_GCC_OPTIONS='-funderscoring' '-O0' '-g' '-c' '-fmessage-length=0' '-O0' '-g' '-ffpe-trap=invalid,zero,overflow,underflow,precision,denormal' '-Wall' '-c' '-fmessage-length=0' '-Wpedantic' '-std=f2003' '-fall-intrinsics' '-fbacktrace' '-fimplicit-none' '-Wextra' '-Wformat=1' '-Wconversion' '-Wfatal-errors' '-finit-real=nan' '-Wuninitialized' '-fbounds-check' '-v' '-o' 'fsignal_test.o' '-mtune=generic' '-march=x86-64'
Once the code runs, instead of raising a SIGFPE, res is set to NaN and execution continues.
This is a sub-problem of a million-dollar question:
Does the gfortran -ffpe-trap=invalid flag really stop the program once it produces a NaN anywhere in the program, especially in a Cygwin64 environment?
A little research showed that:
1) According to https://gcc.gnu.org/onlinedocs/gfortran/Debugging-Options.html, -ffpe-trap works on MOST systems, not all :(
2) According to Division by zero does not throw SIGFPE:
"You don't get a signal because the default behavior on most machines
is to pollute your data with NaNs (not-a-number) and infinities. You
have to enable floating point exceptions, and how you do that is
machine specific. Look at the system header fenv.h, if you have one.
The function fesettrapenable enables catching floating point
exceptions on many machines.
Unfortunately, there is no standard function to turn floating point
exceptions handling on"
3) I have checked my Cygwin environment; fenv.h exists in multiple places, but there is no fesettrapenable function in any *.h file under cygwin64.
4) I do get SIGFPEs in other cases: sqrt(-1.0), division by zero, overflow, underflow, precision.
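A possible reconciliation of 1)-4), offered as an assumption to verify rather than a confirmed answer: arithmetic on a quiet NaN (which is what -finit-real=nan produces) is not an IEEE invalid operation, so -ffpe-trap=invalid never fires for it, while sqrt(-1.0) and division by zero are genuine exceptions and do trap. gfortran also offers -finit-real=snan (a signaling NaN), and the Fortran 2003 IEEE modules (gfortran 5+, so newer than the 4.8.3 above) can request halting explicitly. A minimal sketch:

! Compile with, e.g.: gfortran -ffpe-trap=invalid -finit-real=snan trap_nan.f90
! (flags and gfortran >= 5 for the IEEE modules are assumptions to verify)
program trap_nan
  use, intrinsic :: ieee_exceptions
  implicit none
  real :: init_with_NaN, res
  call ieee_set_halting_mode (ieee_invalid, .true.)  ! halt on invalid operations
  res = 10 * init_with_NaN  ! traps only if init_with_NaN holds a signaling NaN
  print *, res
end program trap_nan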
Here is the full Fortran test project in Eclipse; use it to your heart's content to learn about exceptions on your own platform:
https://drive.google.com/file/d/0B9U1dlb_UCPqU194XzhDNTNkUUU/view?usp=sharing