Slow prediction speed for translation model opus-mt-en-ro - deep-learning

I'm using the model Helsinki-NLP/opus-mt-en-ro from huggingface.
To produce output, I'm using the following code:
inputs = tokenizer(questions,
                   max_length=max_input_length,
                   truncation=True,
                   return_tensors='pt',
                   padding=True).to('cuda')
translation = model.generate(**inputs)
For a small number of inputs (i.e., few sentences in questions), it works fine. However, when the number of sentences increases (e.g., a batch of 128), it becomes very slow.
I have a dataset of 100K examples for which I have to produce output. How can I make it faster? (I already checked GPU utilization; it varies between 25% and 70%.)
Update: Following the comment of dennlinger, here is the additional information:
Average question length: Around 30 tokens
Definition of slowness: with a batch of 128 questions, it takes around 25 seconds, so my dataset of 100K examples will take more than 5 hours. I'm using an Nvidia V100 GPU (16 GB), hence the to('cuda') in the code. I cannot increase the batch size because it results in an out-of-memory error.
I didn't try different generation parameters, but I know that by default the number of beams equals 1.
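For reference, here is a minimal sketch of the batched generation loop described above, with two commonly suggested inference speedups (half-precision weights and torch.no_grad()); the helper name, the fp16 cast, the explicit num_beams=1, and the default argument values are illustrative assumptions, not taken from the question:

import torch
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-ro'
tokenizer = MarianTokenizer.from_pretrained(model_name)
# fp16 roughly halves memory use on a V100 and usually speeds up generation;
# whether it is acceptable for this model is an assumption, not a given.
model = MarianMTModel.from_pretrained(model_name).half().to('cuda').eval()

def translate(questions, batch_size=128, max_input_length=64):
    outputs = []
    for i in range(0, len(questions), batch_size):
        batch = questions[i:i + batch_size]
        inputs = tokenizer(batch,
                           max_length=max_input_length,
                           truncation=True,
                           padding=True,
                           return_tensors='pt').to('cuda')
        with torch.no_grad():  # inference only, no gradients needed
            generated = model.generate(**inputs, num_beams=1)
        outputs.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return outputs

Sorting the 100K sentences by length before batching also tends to reduce the amount of padding each batch carries, which often matters more than the generation flags themselves.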

Related

Need a formula to get total LUN size using lunSizeLow and lunSizeHigh SNMP objects

I have 2 SNMP Objects/OIDs. Below are the details:
Object1:
Name: lunSizeLow
OID: 1.3.6.1.4.1.43906.1.4.3.2.3.1.9
Description: `LUN` size in bytes - low order bytes
Object2:
Name: lunSizeHigh
OID: 1.3.6.1.4.1.43906.1.4.3.2.3.1.10
Description: `LUN` size in bytes - high order bytes
My requirement:
I want to monitor LUN size through a script, but I didn't find any SNMP object that gives the total LUN size directly. I found two separate objects (lunSizeLow and lunSizeHigh), so I need a formula to compute the total LUN size from these low-order and high-order SNMP objects.
I have gone through many articles on the internet and found a couple of formulas on community.hpe.com, but I'm not sure which one is correct.
Formula 1:
The maximum unsigned number that can be stored in a 32-bit counter is 4294967295.
The total size would be: LOW_ORDER_BYTES + HIGH_ORDER_BYTES * 4294967296
Formula 2:
Total size in GB is LOW_ORDER_BYTES / 1073741824 + HIGH_ORDER_BYTES * 4
Could anyone help me find the correct formula?
Most languages have a bit-shift operator, allowing you to do something similar to the following (pseudo-Java):
long myBigInteger = lunSizeHigh;
myBigInteger <<= 32;            // shifts the high-order value 32 positions to the left, into the high half of the long
myBigInteger += lunSizeLow;     // adds the low-order value into the low half
This has two advantages over multiplying:
Bit shifting is often faster than multiplication, even though most compilers would optimize that particular multiplication into a bit shift anyway.
It is easier to read the code and understand why this would provide the correct answer, given the description from the MIB. Magic numbers should be avoided where possible.
That aside, putting some numbers into the Windows Calculator (using Programmer Mode) and trying formula 1, we can see that it works.
Now, you don't specify what language or environment you're working in, and in some languages you won't have a number type that supports the size of the values you want to manipulate. (That's the same reason this number had to be split into two counters to begin with: it's larger than the largest number representation available on some platforms.) If you want to do it using multiplication, you'll have to make sure your implementation language can handle values that large.
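As a quick sanity check in Python (which has arbitrary-precision integers), the bit-shift form and formula 1 give the same result; the sample values below are made up:

def lun_size_bytes(lun_size_low, lun_size_high):
    # Combine the two 32-bit SNMP counters into a single 64-bit byte count.
    return (lun_size_high << 32) | lun_size_low

low, high = 305419896, 2            # made-up example readings
assert lun_size_bytes(low, high) == low + high * 4294967296   # formula 1
print(lun_size_bytes(low, high) / 1073741824, 'GB')           # formula 2 (size in GB)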

Faster RCNN (caffe) Joint Learning: Out of Memory with dedicated memory 5376 MB (changed batch sizes and no.of RPN proposals) what else?

I have worked with the alternating optimization MATLAB code before; currently I am trying to get joint learning running. I am able to run the test demo with my Tesla 2070 GPU. For training, I have set all the batch sizes to 1:
__C.TRAIN.IMS_PER_BATCH = 1
__C.TRAIN.BATCH_SIZE = 1
__C.TRAIN.RPN_BATCHSIZE = 1
(updated yaml to 1 as well since it was overridden)
But I still get the error: error == cudaSuccess (2 vs. 0) out of memory.
I have tried to experiment with lowering the number of proposals. The originals are below:
train:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TRAIN.RPN_PRE_NMS_TOP_N = 12000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TRAIN.RPN_POST_NMS_TOP_N = 2000
test:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TEST.RPN_PRE_NMS_TOP_N = 6000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TEST.RPN_POST_NMS_TOP_N = 300
I tried values as low as pre: 100, post: 50 as a sanity check.
And I still am not able to run without the out-of-memory problem. What am I missing here? My Tesla has 5376 MB of dedicated memory and I use it only for this (I have a separate GPU driving my screen). I am positive that 5376 MB should be enough, having read as much from one of the authors.
Thanks.
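For concreteness, here is a minimal sketch of how the overrides described above would look when applied in code, assuming a py-faster-rcnn-style fast_rcnn.config module; the import path and attribute names are assumptions and may differ in the joint-learning (end-to-end) code:

# Sketch only: follows the py-faster-rcnn config conventions shown in the question.
from fast_rcnn.config import cfg

cfg.TRAIN.IMS_PER_BATCH = 1         # images per minibatch
cfg.TRAIN.BATCH_SIZE = 1            # RoIs sampled per minibatch
cfg.TRAIN.RPN_BATCHSIZE = 1         # anchors sampled per image for the RPN loss
cfg.TRAIN.RPN_PRE_NMS_TOP_N = 100   # proposals kept before NMS (sanity-check value)
cfg.TRAIN.RPN_POST_NMS_TOP_N = 50   # proposals kept after NMS (sanity-check value)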

CURAND properties of generators

CURAND comes with an array of random number generators, but I have failed to find any comparison of the performance (and randomness) properties of each of them; mostly, I'd be interested in which generator to use for which application to gain maximum performance. I'd be happy if someone could quickly outline the differences between them or link me a resource that does so.
Thanks in advance.
[Figure: performance comparison of the different cuRAND generators.]
For randomness, it should be related only to the RNG type/algorithm, so you can refer to the Intel MKL documentation, which has detailed information and links to research papers. The generator names in cuRAND and MKL are very similar.
http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-3D7D2650-A414-4C95-AF33-BE291BAB2AC3.htm
The first difference is efficiency. XORWOW is the default generator, but it isn't always the most efficient; for instance, Philox is faster at generating normally distributed floats.
Another difference is that, in practice, some generators let you generate more than one float per call.
For example, with Philox you can generate up to 4 normally or uniformly distributed floats per call, while with XORWOW you can generate at most two.
__device__ float4 curand_normal4(curandStatePhilox4_32_10_t *state);
The next difference is the period of the pseudorandom sequence (the total state space of the PRNG before you start to see repeats). XORWOW has a period of about 2^190 (with the state set up 2^67 draws ahead for the same seed)*. For Philox, subsequence and offset together define an offset into a sequence with period 2^128.
Note that if you run millions of threads with the same seed you could run out of state space per thread and start seeing repeats: ((2^190) / (10^6)) / (2^67) = 1.0633824 × 10^31.
One more difference is the size of the state. For XORWOW, sizeof(curandState_t) is 48 bytes, while sizeof(curandStatePhilox4_32_10_t) is 64 bytes.
When you run millions of threads (each thread has its own curand state) you can run out of device memory: 1024^2 * 64 bytes ≈ 64 megabytes per million threads.
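As a back-of-the-envelope check of the numbers above, here is a small sketch in plain Python (values as stated in this answer):

threads = 1024 ** 2                  # one "million" (2^20) threads
xorwow_state_bytes = 48              # sizeof(curandState_t)
philox_state_bytes = 64              # sizeof(curandStatePhilox4_32_10_t)
print(threads * xorwow_state_bytes / 2**20, 'MiB of XORWOW state')   # 48.0
print(threads * philox_state_bytes / 2**20, 'MiB of Philox state')   # 64.0

# XORWOW period split across a million threads, measured in units of the
# 2^67 skip-ahead applied per seed -- the headroom before repeats:
print((2**190 / 1e6) / 2**67)        # ~1.0633824e31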
XORWOW, Philox, MRG32k3a, and MTGP32 are pseudo-random generators, while both Sobol variants are quasi-random generators.
*When calling curand_init with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is kind of expensive but has some nice properties)
sources:
https://developer.nvidia.com/cuRAND
http://cs.brown.edu/courses/cs195v/lecture/week11.pdf

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's about how the magnitude of the data relates to the scaling inside the exponential.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less than 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperature and cooldownSchedule (what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation I automatically calculate the cooldownSchedule based on the timeGradient (a double going from 0.0 to 1.0 over the solver's running time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.
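To see how the magnitude of tK interacts with the magnitude of delta in the acceptance rule from the question, here is a small sketch in plain Python; the specific values are only illustrative:

import math
import random

def accept(delta, temperature):
    # Metropolis acceptance rule used in the question.
    return delta < 0 or math.exp(-delta / temperature) > random.random()

# With deltas around 10-20, a temperature of the same order still accepts many
# uphill moves; scaling the temperature down by the question's .001 factor makes
# exp(-delta/T) vanishingly small, so essentially only improving moves pass.
for T in (20.0, 1.0, 20.0 * 0.001):
    print(T, math.exp(-10.0 / T))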

FFT Magnitude Output

I'm just getting into signal processing and need to do some DFT/FFT work.
If I take a signal with two frequencies of 2 Hz and 5 Hz, x(t) = sin(2*2pi*t) + sin(5*2pi*t), and sample at 100 Hz for 5 seconds, my DFT size is 500.
Because my inputs are real-valued, I get a symmetric DFT, so I can discard the second half and convert the DFT values into magnitudes via sqrt(re^2 + im^2).
My bin width is 100/500 = 0.2 Hz, and the resulting magnitude spectrum shows peaks at 2 Hz and 5 Hz as expected.
My question is: why are the magnitudes different?
On a related note, why are there not two perfect spikes at 2 Hz and 5 Hz, i.e. why does the graph have non-zero values at 1.5 Hz, 2.5 Hz, etc.? Is this spectral leakage?
I expect your 500 data points are being processed as a 512 point FFT (most FFT libraries do not support arbitrary size inputs and so typically they zero pad to the next highest power of 2). If that is the case then you will be seeing the effects of spectral leakage. Applying a window function prior to the FFT should fix this. Note that you will still see "skirts" on either side of your peaks - this is due to the uncertainty introduced by a finite sampling window.
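A small numpy sketch of the setup described in the question (100 Hz sampling, 5 s, 2 Hz + 5 Hz sines) makes it easy to check whether the library zero-pads and what the magnitudes come out as; the use of numpy here is an assumption, since the question doesn't name its FFT library:

import numpy as np

fs = 100.0                            # sample rate in Hz
t = np.arange(0, 5.0, 1.0 / fs)       # 5 seconds -> 500 samples
x = np.sin(2 * 2 * np.pi * t) + np.sin(5 * 2 * np.pi * t)

X = np.fft.rfft(x)                    # numpy handles N = 500 directly, no zero padding
mag = np.abs(X)                       # sqrt(re^2 + im^2)
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)   # bin width = 100/500 = 0.2 Hz

for f in (2.0, 5.0):
    print(f, 'Hz bin magnitude:', mag[int(round(f / 0.2))])   # both ~250 (= N/2)

With no zero padding, both tones fall exactly on bin centres and the two peaks come out equal (N/2 = 250 for unit-amplitude sines). Zero-padding the same data to 512 points (np.fft.rfft(x, 512)) moves the tones off the bin centres, which reproduces the leakage and unequal peak heights described in the question.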