I built a deep learning model, and when I add an attention mechanism the accuracy decreases.
Are there cases where attention mechanisms can decrease accuracy?
Any advice?
I have been working with deep reinforcement learning, and in the literature the learning rates are usually lower than those I have seen in other settings.
My model is the following one:
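import tensorflow as tf
from tensorflow.keras import regularizers
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

# HIDDEN_NODES, STATE_SIZE, STATE_SPACE, ACTION_SPACE and LEARNING_RATE are defined elsewhere.
def create_model(self):
    model = Sequential()
    model.add(LSTM(HIDDEN_NODES, input_shape=(STATE_SIZE, STATE_SPACE), return_sequences=False))
    model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
    model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
    model.add(Dense(ACTION_SPACE, activation='linear'))
    # Compile the model (recent Keras versions expect learning_rate rather than lr)
    model.compile(loss=tf.keras.losses.Huber(delta=1.0), optimizer=Adam(learning_rate=LEARNING_RATE, clipnorm=1))
    return model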
where the initial learning rate (lr) is 3e-5. For fine-tuning, I freeze the first two layers (this step is essential in my setting) and decrease the learning rate to 3e-9. During fine-tuning, the model might suffer from a distributional shift, since the samples now come from perturbed data. Besides this, is there another reason why such a low learning rate would be needed to keep my model improving?
First, show me a sample of your data.
Theoretical Answer:
We have learned how perturbation helps solve various issues related to training a neural network or to the trained model. Here we have seen perturbation applied to three components (gradients, weights, inputs): perturbing gradients tackles the vanishing gradient problem, perturbing weights helps escape saddle points, and perturbing inputs defends against malicious attacks. Overall, perturbations in their different forms strengthen the model against various instabilities. For example, they can keep the model from staying at a correctness wreckage point, since such a position will be tested by a perturbation (input, weight, gradient) that pushes the model towards a correctness attraction point.
As of now, perturbation mainly relies on empirical experimentation, designed from intuition to solve the problems encountered. One needs to check whether perturbing a component of the training process makes sense intuitively, and then verify empirically whether it helps mitigate the problem. Nevertheless, in the future we will see more on perturbation theory in deep learning, or machine learning in general, which might also be backed by theoretical guarantees.
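On the learning-rate part of the question, one thing that may be worth checking is whether steps of about 3e-9 survive float32 rounding at all. The sketch below is only an illustration under stated assumptions: the weights are stored as float32 (the Keras default), a single Adam step moves a weight by roughly the learning rate, and the weight value 0.1 is made up. The helper names `to_float32` and `apply_update` are hypothetical, not part of any library.

```python
import struct

def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def apply_update(weight, step):
    """Subtract one update step from a weight that is stored as float32."""
    return to_float32(to_float32(weight) - step)

weight = to_float32(0.1)  # hypothetical weight value
for lr in (3e-5, 3e-9):
    # Assumption: one Adam step changes a weight by roughly lr.
    new_weight = apply_update(weight, lr)
    print(f"lr={lr:g}: weight changed = {new_weight != weight}")
```

For a weight of this size, the spacing between neighbouring float32 values is about 7.5e-9, so a step of 3e-9 rounds back to the old value while a step of 3e-5 does not. Weights much closer to zero would still move, so this is a possible contributing factor, not a diagnosis.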
I read everywhere that, in addition to improving performances regarding accuracy, "Batch Normalization makes Training Faster".
I am probably misunderstanding something (since BN has proven effective more than once), but it seems kind of illogical to me.
Indeed, adding BN to a network increases the number of parameters to learn: BN comes with "scale" and "offset" parameters that must be learned. See: https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
How can the network train faster while having "more work to do"?
(I hope my question is legitimate or at least not too stupid).
Thank you :)
Batch normalization accelerates training by requiring fewer iterations to converge to a given loss value. This can be done by using higher learning rates, but even with smaller learning rates you can still see an improvement. The paper shows this pretty clearly.
Using ReLU also has this effect when compared to a sigmoid activation, as shown in the original AlexNet paper (without BN).
Batch normalization also makes the optimization problem "easier", as minimizing the covariate shift avoids many plateaus where the loss stagnates or decreases slowly. It can still happen, but it is much less frequent.
Batch normalization fixes the distribution of a lower layer's activations as seen by the next layer. The scale and offset just "move" that distribution to a more effective position, but it is still a fixed distribution at every training step. This means that parameter adjustments in the higher layer do not need to account for changes to the parameters in the lower layer(s), which makes training more efficient.
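To make the "scale" and "offset" discussion concrete, here is a minimal pure-Python sketch of the training-time batch normalization computation. It is an illustration only: the function name `batch_norm`, the `eps` value and the sample numbers are assumptions, and it leaves out the running averages that real layers keep for inference.

```python
import math

def batch_norm(batch, scale, offset, eps=1e-5):
    """Normalize each feature over the batch, then apply scale and offset.

    batch:  list of samples, each a list of feature values
    scale:  one learnable scale per feature (often called gamma)
    offset: one learnable offset per feature (often called beta)
    """
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        for i, x in enumerate(column):
            normalized = (x - mean) / math.sqrt(var + eps)
            out[i][j] = scale[j] * normalized + offset[j]
    return out

# Two features on very different ranges (made-up numbers).
batch = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
result = batch_norm(batch, scale=[1.0, 1.0], offset=[0.0, 0.0])
shifted = batch_norm(batch, scale=[2.0, 2.0], offset=[3.0, 3.0])
```

Note that the extra learnable parameters amount to only two numbers per feature, which is small next to the weight matrices of the layers around it. Whatever the input range, the output of each feature has a mean equal to its offset and a spread set by its scale.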
I am doing binary classification of infected/uninfected RBCs (something the pretrained DL models have never seen before) using models and weights from Keras. I find that the performance of the models (VGG16, VGG19, Xception) decreases as the number of training and validation instances increases. Why?
It may be related to resource management: during inference the model expands in memory, and that can decrease performance. This creates a lot of main-memory accesses for the forward-pass computations, and the page faults that occur can decrease performance.
Hope this helps.
I am looking into estimating a Markov regime-switching model with time-varying probabilities. Please let me know if you know a simpler way to estimate such a model.
This paper might answer your needs.
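For orientation, here is a rough sketch of the likelihood such a model typically maximizes. It is not necessarily what the paper does. It assumes two regimes that differ only in their mean, Gaussian noise with a common standard deviation, and "stay" probabilities that follow a logistic function of a covariate `z`. The function name `tvtp_filter`, the parameter names and the data are all hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(y, mean, sigma):
    u = (y - mean) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def tvtp_filter(y, z, means, sigma, a, b):
    """Hamilton-style filter for a two-regime model with time-varying transitions.

    y: observations; z: covariate driving the transition probabilities
    means: (mean in regime 0, mean in regime 1); sigma: common std deviation
    a, b: P(stay in regime i at time t) = logistic(a[i] + b[i] * z[t])
    Returns (log-likelihood, filtered probabilities of regime 1).
    """
    prob = [0.5, 0.5]  # assumed initial regime probabilities
    log_lik = 0.0
    filtered = []
    for y_t, z_t in zip(y, z):
        stay0 = logistic(a[0] + b[0] * z_t)
        stay1 = logistic(a[1] + b[1] * z_t)
        # Prediction step: regime probabilities before seeing y_t.
        pred0 = prob[0] * stay0 + prob[1] * (1.0 - stay1)
        pred1 = prob[0] * (1.0 - stay0) + prob[1] * stay1
        # Update step: weight by the likelihood of y_t in each regime.
        joint0 = pred0 * normal_pdf(y_t, means[0], sigma)
        joint1 = pred1 * normal_pdf(y_t, means[1], sigma)
        density = joint0 + joint1
        log_lik += math.log(density)
        prob = [joint0 / density, joint1 / density]
        filtered.append(prob[1])
    return log_lik, filtered

# Made-up data: the series jumps from around 0 to around 5.
y = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2]
z = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
log_lik, filtered = tvtp_filter(y, z, means=(0.0, 5.0), sigma=1.0,
                                a=(2.0, 2.0), b=(-1.0, 1.0))
```

Estimation would then mean passing the negative of `log_lik` to a numerical optimizer over `means`, `sigma`, `a` and `b`. The filter itself is short; most of the practical difficulty tends to lie in starting values and local optima.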
I recently read about OpenCL/CUDA for FPGA vs. GPU.
As I understand it, the FPGA wins on power.
I found this explanation in an article:
Reconfigurable devices can have much lower power consumption from peak
values since only configured portions of the chip are active
Based on the above, I have a question: does it mean that if some CU [Compute Unit] doesn't execute any work-item, it still consumes power? (And if yes, what does it consume power for?)
Yes, idle circuitry still consumes power. It doesn't consume as much, but it still consumes some. The reason for this is down to how transistors work, and how CMOS logic gates consume power.
Classically, CMOS logic (the type on all modern chips) only consumes power when it switches state. This made it very low power compared to the technologies that came before it, which consumed power all the time. Even so, every time a clock edge occurs, some logic changes state even if there's no work to do. The higher the clock rate, the more power used. GPUs tend to have high clock rates so they can do lots of work; FPGAs tend to have low clock rates. That's the first effect, but it can be mitigated by not clocking circuits that have no work to do (called 'clock gating').
As the size of transistors became smaller and smaller, the amount of power used when switching became smaller, but other effects (known as leakage) became more significant. Now we're at a point where the leakage power is very significant, and it's multiplied up by the number of gates you have in a design. Complex designs have high leakage power; simple designs have low leakage power (in very basic terms). This is the second effect.
Hence, for a simple task it may be more power efficient to have a small, dedicated, low-speed FPGA rather than a large, complex, but high-speed general-purpose CPU/GPU.
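The two effects can be put into a rough back-of-the-envelope model using the standard approximations: switching power is about activity × capacitance × voltage² × frequency, and leakage power is about voltage × leakage current. The numbers below are made up purely for illustration and do not describe any real chip.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency

def leakage_power(voltage, leakage_current):
    """Static power in watts, drawn even when nothing switches."""
    return voltage * leakage_current

def total_power(activity, capacitance, voltage, frequency, leakage_current):
    """Sum of the switching and leakage contributions."""
    return (dynamic_power(activity, capacitance, voltage, frequency)
            + leakage_power(voltage, leakage_current))

# Made-up numbers for one block of logic.
block = dict(capacitance=1e-9, voltage=1.0, frequency=1e9, leakage_current=0.05)
busy = total_power(activity=0.2, **block)         # block is doing work
clock_gated = total_power(activity=0.0, **block)  # clock stopped, no switching
print(busy, clock_gated)
```

With the clock gated, the switching term drops to zero, but the leakage term is still there. That is why an idle compute unit keeps drawing some power unless its supply is cut off entirely.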
As always, it depends on the workload. For workloads that are well-supported by native GPU hardware (e.g. floating point, texture filtering), I doubt an FPGA can compete. Anecdotally, I've heard about image processing workloads where FPGAs are competitive or better. That makes sense, since GPUs are not optimized to operate on small integers. (For that reason, GPUs often are uncompetitive with CPUs running SSE2-optimized image processing code.)
As for power consumption, for GPUs, suitable workloads generally keep all the execution units busy, so it's a bit of an all-or-nothing proposition.
Based on my research on FPGAs and the way they work, these devices can be designed to be very power efficient, fine-tuned for one special task (e.g., an algorithm), and made to use the smallest resources possible (and therefore the lowest energy consumption among all possible choices except an ASIC).
When implementing Turing-complete algorithms on FPGAs, designers have the option of either unrolling their algorithms to use the maximum parallelism offered or using a compact sequential design. Each method has its own costs and benefits: the former maximizes performance at the cost of higher resource consumption, and the latter minimizes area and resource consumption by reusing hardware at the cost of lower performance.
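The trade-off can be mimicked in a few lines of software. The sketch below is a toy model, not FPGA code: it assumes each hardware multiplier handles one multiplication per cycle and ignores the latency of the adder tree. The function name `dot_product` and the parameter `n_units` are made up for the example.

```python
def dot_product(a, b, n_units):
    """Dot product computed with up to n_units multiplications per simulated cycle.

    Returns (result, cycles). n_units stands in for the number of
    hardware multipliers the design instantiates.
    """
    total = 0
    cycles = 0
    for start in range(0, len(a), n_units):
        # One cycle: up to n_units multipliers work side by side.
        chunk_a = a[start:start + n_units]
        chunk_b = b[start:start + n_units]
        total += sum(x * y for x, y in zip(chunk_a, chunk_b))
        cycles += 1
    return total, cycles

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
sequential = dot_product(a, b, n_units=1)  # one multiplier, reused every cycle
unrolled = dot_product(a, b, n_units=8)    # eight multipliers, a single cycle
```

Both calls give the same result. The sequential version takes eight cycles on one multiplier, while the fully unrolled one takes a single cycle on eight multipliers. Values of `n_units` in between correspond to partially unrolled designs.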
This level of control over the implementation of algorithms doesn’t exist when developing for GPUs. The developers are free to use the most efficient algorithms, yet they are not the ones determining the final, precise hardware implementation of their algorithms. Unlike FPGA designers, who even count “nanoseconds” when calculating their design’s hardware implementation (using post-layout tools), GPU developers rely on the available frameworks to optimize all implementation details for them automatically. They develop at a much higher level than FPGA designers.
So the well-known topic of trade-offs pops up here too: you want exact control over the hardware implementation at the cost of longer development times? Choose FPGAs. You want parallelism, yet have made up your mind to give up exact control over the hardware implementation and want to develop using your existing software skills? Use OpenCL.
Kudos to #hamzed, but OpenCL does not take control away from the designer on FPGAs. It actually gives the best of both worlds: the full programmability of the FPGA with all the benefits of custom parallel algorithms, as well as much faster design closure than RTL. By being clever about when your algorithm moves data and when it does not, you can get near the theoretical performance of FPGAs. Please see the last chart in this reference: https://www.iwocl.org/wp-content/uploads/iwocl2017-andrew-ling-fpga-sdk.pdf