rootbeer CUDA example code quantified throughput gain

The following is the rootbeer example code for Nvidia CUDA, which I ran on a laptop with Ubuntu 12.04 (Precise) using bumblebee and optirun. The laptop features Nvidia Optimus, hence the optirun. The GPU is an Nvidia GeForce GT 540M, which the Nvidia website says has 96 cores. I get almost no throughput gain. What is the problem?
package com.random.test;
import java.util.ArrayList;
import java.util.Formatter;
import java.util.List;
import edu.syr.pcpratts.rootbeer.runtime.Kernel;
import edu.syr.pcpratts.rootbeer.runtime.Rootbeer;
public class ArraySumApp {
final static int numberOfJobs = 1024; // 1024 in the original example
final static int sizeOfArray = 512; // 512 in the original example
final static int theAnswer = 130816;
public int[] sumArrays(List<int[]> arrays) {
List<Kernel> jobs = new ArrayList<Kernel>();
int[] ret = new int[arrays.size()];
for (int i = 0; i < arrays.size(); ++i) {
jobs.add(new ArraySum(arrays.get(i), ret, i));
}
Rootbeer rootbeer = new Rootbeer();
rootbeer.runAll(jobs);
return ret;
}
private static long measureOneJob() {
int[] source = new int[ArraySumApp.sizeOfArray];
int[] destination = new int[1];
for (int i = 0; i < ArraySumApp.sizeOfArray; i++)
source[i] = i;
Kernel job = new ArraySum(source, destination, 0);
ElapsedTimer et = new ElapsedTimer();
job.gpuMethod();
long timeInMs = et.stopInMilliseconds();
System.out.println("measureOneJob " + et.stringInMilliseconds());
assert destination[0] == ArraySumApp.theAnswer : "cosmic rays";
return timeInMs;
}
public static void main(String[] args) {
Helper.assertAssertionEnabled();
// measure the time to do one job
ArraySumApp.measureOneJob();
long oneJob = ArraySumApp.measureOneJob();
ArraySumApp app = new ArraySumApp();
List<int[]> arrays = new ArrayList<int[]>();
// you want 1000s of threads to run on the GPU all at once for speedups
for (int i = 0; i < ArraySumApp.numberOfJobs; ++i) {
int[] array = new int[ArraySumApp.sizeOfArray];
for (int j = 0; j < array.length; ++j) {
array[j] = j;
}
arrays.add(array);
}
ElapsedTimer et = new ElapsedTimer();
int[] sums = app.sumArrays(arrays);
long allJobs = et.stopInMilliseconds();
System.out.println("measureAllJobs " + et.stringInMilliseconds());
double gainFactor = ((double) ArraySumApp.numberOfJobs) * oneJob
/ allJobs;
System.out.println(String.format(
"throughput gain factor %.1f\nthroughput gain %.1f\n",
gainFactor, gainFactor - 1.0d));
// check the number of answers is correct
assert sums.length == ArraySumApp.numberOfJobs : "cosmic rays";
// check they all have the answer
for (int i = 0; i < ArraySumApp.numberOfJobs; i++)
assert sums[i] == ArraySumApp.theAnswer : "cosmic rays";
}
}
class ArraySum implements Kernel {
final static int repetitionFactor = 100000;
private int[] source;
private int[] ret;
private int index;
public ArraySum(int[] src, int[] dst, int i) {
source = src;
ret = dst;
index = i;
}
public void gpuMethod() {
for (int repetition = 0; repetition < ArraySum.repetitionFactor; repetition++) {
int sum = 0;
for (int i = 0; i < source.length; ++i) {
sum += source[i];
}
ret[index] = sum;
}
}
}
class Helper {
private Helper() {
}
static void assertAssertionEnabled() {
try {
assert false;
} catch (AssertionError e) {
return;
}
Helper.noteCosmicRays();
}
static void noteCosmicRays() // programmer design or logic error
{
throw new RuntimeException("cosmic rays");
}
}
class ElapsedTimer {
private org.joda.time.DateTime t0;
private long savedStopInMilliseconds;
public ElapsedTimer() {
this.t0 = new org.joda.time.DateTime();
}
public long stopInMilliseconds() {
return stop();
}
public String stringInMilliseconds() // relies on a saved stop
{
Formatter f = new Formatter();
f.format("%d ms", this.savedStopInMilliseconds);
String s = f.toString();
f.close();
return s;
}
public String stopStringInMilliseconds() {
stop();
return stringInMilliseconds();
}
public String stringInSecondsAndMilliseconds() // relies on a saved stop
{
Formatter f = new Formatter();
f.format("%5.3f s", this.savedStopInMilliseconds / 1000.0d);
String s = f.toString();
f.close();
return s;
}
public String stopStringInSecondsAndMilliseconds() {
stop();
return stringInSecondsAndMilliseconds();
}
public long stopInSeconds() {
return (stop() + 500L) / 1000L; // rounding
}
public String stringInSeconds() // relies on a saved stop
{
Formatter f = new Formatter();
long elapsed = (this.savedStopInMilliseconds + 500L) / 1000L; // rounding
f.format("%d s", elapsed);
String s = f.toString();
f.close();
return s;
}
public String stopStringInSeconds() {
stop();
return stringInSeconds();
}
/**
* This is private. Use the stopInMilliseconds method if this is what you
* need.
*/
private long stop() {
org.joda.time.DateTime t1 = new org.joda.time.DateTime();
savedStopInMilliseconds = t1.getMillis() - this.t0.getMillis();
return savedStopInMilliseconds;
}
}
This is the output:
measureOneJob 110 ms
measureOneJob 26 ms
CudaRuntime2 ctor: elapsedTimeMillis: 609
measureAllJobs 24341 ms
throughput gain factor 1.1
throughput gain 0.1
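The gain factor in the output follows directly from the two timings printed above it. Here is a minimal sketch of that arithmetic, using the same formula as main(); the class and method names are hypothetical and only the numbers come from the output.

```java
import java.util.Locale;

// Hypothetical helper that repeats the gain-factor formula used in main().
class GainFactorSketch {
    // gain factor = (number of jobs * time for one job) / time for all jobs
    static double gainFactor(int jobs, long oneJobMs, long allJobsMs) {
        return ((double) jobs) * oneJobMs / allJobsMs;
    }

    public static void main(String[] args) {
        // Values taken from the output: second measureOneJob and measureAllJobs.
        double g = gainFactor(1024, 26, 24341);
        // 1024 * 26 = 26624 ms of sequential work versus 24341 ms measured.
        System.out.println(String.format(Locale.ROOT, "throughput gain factor %.1f", g));
    }
}
```

Note that the single-job time comes from calling gpuMethod() directly rather than through rootbeer.runAll, so the factor compares that direct call against the whole Rootbeer run, including its setup.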

The rootbeer developer said the example code that takes the sum of array elements is not the best example and an alternative example would show throughput gains.

You can see: https://github.com/pcpratts/rootbeer1/tree/develop/gtc2013/Matrix
This is an example for the 2013 NVIDIA GTC conference. I obtained a 20x speedup over a 4-core Java Matrix Multiply that uses transpose.
The example is a tiled Matrix Multiply using shared memory on the GPU. According to the NVIDIA literature, using shared memory is one of the most important aspects of getting good speedups. To use shared memory, you have each thread in a block load values into a shared array. Then you reuse these shared values several times, which saves the time needed to fetch them from global memory.
A fetch from global memory takes about 200-300 clock cycles, and a fetch from shared memory takes about 2-3 clock cycles on the Tesla 2.0 architecture.
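To illustrate the load-once, reuse-many-times idea, here is a plain-Java CPU analogy of tiling. It is not the Rootbeer example from the link and uses no Rootbeer or CUDA API. The class name, the TILE constant, and the two small buffers that stand in for shared arrays are all assumptions made for this sketch.

```java
// CPU analogy of shared-memory tiling; not Rootbeer or CUDA code.
class TiledMatMulSketch {
    static final int TILE = 4; // hypothetical tile width

    static float[][] multiply(float[][] a, float[][] b) {
        int n = a.length, inner = b.length, m = b[0].length;
        float[][] c = new float[n][m];
        // Small buffers standing in for the per-block shared arrays.
        float[][] tileA = new float[TILE][TILE];
        float[][] tileB = new float[TILE][TILE];
        for (int i0 = 0; i0 < n; i0 += TILE) {
            for (int j0 = 0; j0 < m; j0 += TILE) {
                for (int k0 = 0; k0 < inner; k0 += TILE) {
                    // Load one tile of each input once (zero-padded at the edges).
                    for (int r = 0; r < TILE; r++) {
                        for (int s = 0; s < TILE; s++) {
                            tileA[r][s] = (i0 + r < n && k0 + s < inner) ? a[i0 + r][k0 + s] : 0f;
                            tileB[r][s] = (k0 + r < inner && j0 + s < m) ? b[k0 + r][j0 + s] : 0f;
                        }
                    }
                    // Reuse each loaded value up to TILE times from the small buffers.
                    for (int i = 0; i < TILE && i0 + i < n; i++) {
                        for (int j = 0; j < TILE && j0 + j < m; j++) {
                            float sum = 0f;
                            for (int k = 0; k < TILE; k++) {
                                sum += tileA[i][k] * tileB[k][j];
                            }
                            c[i0 + i][j0 + j] += sum;
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

On a GPU the two buffers would live in shared memory and the loads would be spread across the threads of a block; the sketch only shows the access pattern that makes the reuse possible.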

Related

How to use LSTM with a 2d array with DeepLearning4j

I am trying to learn how to use LSTM with the deeplearning4j library.
I created a dummy scenario where I want to get an output (3 classes) based on data that I collected.
I got the data from here (http://www.osservatoriodioropa.it/meteoropa/NOAAMO.TXT) if someone is curious :)
Back to the scenario.
I created 2 matrices, one with the features and the other with the classes that I want to output, just as a test.
When I try the classifier I get
Exception in thread "main" java.lang.IllegalStateException: 3D input expected to RNN layer expected, got 2
I think this is because the RnnOutputLayer expects a 3D matrix, but I am not able to understand how to populate it. How can I convert a 2D matrix into a 3D matrix that correlates the previous event with the new one? The data are a time series, and I want the classification of the new day to take the previous days into account as well. (I know that the data probably won't fit this scenario and that there are better ways to do this, but this is just for learning how to use LSTM, not how to classify this specific dataset.)
This is the code so far:
public class Test {
public static void main(String args[]) {
int events = 5;
int features = 6;
int classes = 3;
double[][] featureMatrix = new double[events][features];
double[][] labelMatrix = new double[events][classes];
for (int i = 0; i < events; i++) {
for (int f = 0; f < features; f++) {
featureMatrix[i][f] = getFeature(i, f);
}
for (int c = 0; c < classes; c++) {
labelMatrix[i][c] = getResult(i, c);
}
}
INDArray trainingIn = Nd4j.create(featureMatrix);
INDArray trainingOut = Nd4j.create(labelMatrix);
DataSet myData = new DataSet(trainingIn, trainingOut);
MultiLayerNetwork multiLayerNetwork = createModel(features,classes);
multiLayerNetwork.init();
multiLayerNetwork.fit(myData);
}
private static double getFeature(int i, int f) {
//dummy
return 1.;
}
private static double getResult(int i, int c) {
//dummy
return 1.;
}
public static MultiLayerNetwork createModel(int inputNum, int outputNum) {
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.trainingWorkspaceMode(ENABLED).inferenceWorkspaceMode(ENABLED)
.seed(123456)
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.updater(new RmsProp.Builder().learningRate(0.05).rmsDecay(0.002).build())
.l2(0.0005)
.weightInit(WeightInit.XAVIER)
.activation(Activation.TANH)
.list()
.layer(new LSTM.Builder().name("1").nIn(inputNum).nOut(inputNum).build())
.layer(new LSTM.Builder().name("2").nIn(inputNum).nOut(inputNum).build())
.layer(new RnnOutputLayer.Builder().name("output").nIn(inputNum).nOut(outputNum)
.activation(Activation.IDENTITY).lossFunction(LossFunctions.LossFunction.MSE).build())
.build();
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
return net;
}
}
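As far as I know, DL4J recurrent layers by default expect 3D input shaped [miniBatchSize, featureCount, timeSeriesLength]. Here is a minimal plain-Java sketch of the reshaping, assuming the five events form one single sequence (mini-batch of 1). The class and method names are hypothetical and no DL4J API is used.

```java
// Hypothetical helper: turns [timeSteps][features] into [1][features][timeSteps].
class SequenceReshapeSketch {
    static double[][][] toSingleSequence(double[][] stepsByFeature) {
        int timeSteps = stepsByFeature.length;
        int features = stepsByFeature[0].length;
        // One example in the mini-batch; features on axis 1, time on axis 2.
        double[][][] out = new double[1][features][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            for (int f = 0; f < features; f++) {
                out[0][f][t] = stepsByFeature[t][f];
            }
        }
        return out;
    }
}
```

The label matrix would need the same treatment, giving [1][classes][events], before both arrays are wrapped as INDArrays. Whether Nd4j.create accepts a double[][][] directly depends on the version in use, so treat that step as an assumption.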

TypedFactory vs Performance Monitor

I'm confused about how the new Windsor 3 performance counter tracks objects created via a TypedFactory.
Consider the following scenario:
public interface IBFactory
{
IB[] GetAll();
void FreeUp(IB cmps);
}
public class B1 : IB, IDisposable
{
public void Add(int i){}
public void Dispose()
{
Console.WriteLine("Disposing " + GetType().Name);
}
}
public class B2 : IB, IDisposable
{
public void Add(int i){}
public void Dispose()
{
Console.WriteLine("Disposing " + GetType().Name);
}
}
public class B3 : IB
{
public void Add(int i){}
public void Dispose()
{
Console.WriteLine("Disposing " + GetType().Name);
}
}
var container = new WindsorContainer();
var diagnostic = LifecycledComponentsReleasePolicy.GetTrackedComponentsDiagnostic(container.Kernel);
var counter = LifecycledComponentsReleasePolicy.GetTrackedComponentsPerformanceCounter(new PerformanceMetricsFactory());
container.Kernel.ReleasePolicy = new LifecycledComponentsReleasePolicy(diagnostic, counter);
Console.WriteLine("Enter number of iterations:");
int iterations = int.Parse(Console.ReadLine());
container.AddFacility<TypedFactoryFacility>();
container.Register
(
Component.For<IBFactory>()
.AsFactory()
.LifeStyle.Transient,
Classes.FromAssemblyContaining<IB>()
.BasedOn(typeof(IB))
.WithService.Base()
.Configure(c => c.LifestyleTransient())
);
Console.WriteLine("Create Memory Leak Y or N?");
var leak = Console.ReadLine().ToUpper() == "Y";
var sleepFor = 100;// int.Parse(Console.ReadLine());
for (var i = 1; i < iterations+1; i++)
{
var factory = container.Resolve<IBFactory>();
Console.WriteLine("Factory created.");
var cmp = factory.GetAll();
foreach (var b in cmp)
{
b.Add(i);
}
Console.WriteLine("Iteration {0} completed", i);
Thread.Sleep(sleepFor);
if (!leak)
{
foreach (var b in cmp)
{
factory.FreeUp(b);
}
}
Console.WriteLine("Releasing factory.");
container.Release(factory);
}
Console.WriteLine("container disposing.....");
container.Dispose();
Console.WriteLine("container disposed");
Console.ReadLine();
If I dispose of the objects, as I should, via the FreeUp factory method, the perf counter shows the expected tracking.
If instead I do not explicitly release the objects but dispose of them implicitly by releasing the factory (created as transient for testing purposes), the IB instances are disposed when the factory is disposed (as per the documentation), but the perf counter does not get updated and shows the IB instances as still tracked...
What does that mean?
Has the perf counter simply not been updated, or are the objects still tracked (that would be very scary) even though Dispose has been called on the IB instances due to the factory being disposed?

JCuda. Reusing already used pointer

I am having trouble working with JCuda. My task is to compute a 1D FFT using the CUFFT library, and the result should then be multiplied by 2. So I decided to compute the 1D FFT with type CUFFT_R2C. The class responsible for this is as follows:
public class FFTTransformer {
private Pointer inputDataPointer;
private Pointer outputDataPointer;
private int fftType;
private float[] inputData;
private float[] outputData;
private int batchSize = 1;
public FFTTransformer (int type, float[] inputData) {
this.fftType = type;
this.inputData = inputData;
inputDataPointer = new CUdeviceptr();
JCuda.cudaMalloc(inputDataPointer, inputData.length * Sizeof.FLOAT);
JCuda.cudaMemcpy(inputDataPointer, Pointer.to(inputData),
inputData.length * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyHostToDevice);
outputDataPointer = new CUdeviceptr();
JCuda.cudaMalloc(outputDataPointer, (inputData.length + 2) * Sizeof.FLOAT);
}
public Pointer getInputDataPointer() {
return inputDataPointer;
}
public Pointer getOutputDataPointer() {
return outputDataPointer;
}
public int getFftType() {
return fftType;
}
public void setFftType(int fftType) {
this.fftType = fftType;
}
public float[] getInputData() {
return inputData;
}
public int getBatchSize() {
return batchSize;
}
public void setBatchSize(int batchSize) {
this.batchSize = batchSize;
}
public float[] getOutputData() {
return outputData;
}
private void R2CTransform() {
cufftHandle plan = new cufftHandle();
JCufft.cufftPlan1d(plan, inputData.length, cufftType.CUFFT_R2C, batchSize);
JCufft.cufftExecR2C(plan, inputDataPointer, outputDataPointer);
JCufft.cufftDestroy(plan);
}
private void C2CTransform(){
cufftHandle plan = new cufftHandle();
JCufft.cufftPlan1d(plan, inputData.length, cufftType.CUFFT_C2C, batchSize);
JCufft.cufftExecC2C(plan, inputDataPointer, outputDataPointer, fftType);
JCufft.cufftDestroy(plan);
}
public void transform(){
if (fftType == JCufft.CUFFT_FORWARD) {
R2CTransform();
} else {
C2CTransform();
}
}
public float[] getFFTResult() {
outputData = new float[inputData.length + 2];
JCuda.cudaMemcpy(Pointer.to(outputData), outputDataPointer,
outputData.length * Sizeof.FLOAT, cudaMemcpyKind.cudaMemcpyDeviceToHost);
return outputData;
}
public void releaseGPUResources(){
JCuda.cudaFree(inputDataPointer);
JCuda.cudaFree(outputDataPointer);
}
public static void main(String... args) {
float[] inputData = new float[65536];
for(int i = 0; i < inputData.length; i++) {
inputData[i] = (float) Math.sin(i);
}
FFTTransformer transformer = new FFTTransformer(JCufft.CUFFT_FORWARD, inputData);
transformer.transform();
float[] result = transformer.getFFTResult();
HilbertSpectrumTicksKernelInvoker.multiplyOn2(transformer.getOutputDataPointer(), inputData.length+2);
transformer.releaseGPUResources();
}
}
The method responsible for the multiplication uses a CUDA kernel function.
Java method code:
public static void multiplyOn2(Pointer inputDataPointer, int dataSize){
// Enable exceptions and omit all subsequent error checks
JCudaDriver.setExceptionsEnabled(true);
// Create the PTX file by calling the NVCC
String ptxFileName = null;
try {
ptxFileName = FileService.preparePtxFile("resources\\HilbertSpectrumTicksKernel.cu");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
// Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet(device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
// Load the ptx file.
CUmodule module = new CUmodule();
cuModuleLoad(module, ptxFileName);
// Obtain a function pointer to the "calcSpectrumSamples" function.
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "calcSpectrumSamples");
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
int N = (dataSize + 1) / 2 + 1;
int pair = (dataSize + 1) % 2 > 0 ? 1 : -1;
Pointer kernelParameters = Pointer.to(Pointer.to(inputDataPointer),
Pointer.to(new int[] { dataSize }),
Pointer.to(new int[] { N }), Pointer.to(new int[] { pair }));
// Call the kernel function.
int blockSizeX = 128;
int gridSizeX = (int) Math.ceil((double) dataSize / blockSizeX);
cuLaunchKernel(function, gridSizeX, 1, 1, // Grid dimension
blockSizeX, 1, 1, // Block dimension
0, null, // Shared memory size and stream
kernelParameters, null // Kernel- and extra parameters
);
cuCtxSynchronize();
// Allocate host output memory and copy the device output
// to the host.
float freq[] = new float[dataSize];
cuMemcpyDtoH(Pointer.to(freq), (CUdeviceptr)inputDataPointer, dataSize
* Sizeof.FLOAT);
}
The kernel function is as follows:
extern "C"
__global__ void calcSpectrumSamples(float* complexData, int dataSize, int N, int pair) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if(i >= dataSize) return;
complexData[i] = complexData[i] * 2;
}
But when I try to pass the pointer that points to the FFT result (in device memory) to the multiplyOn2 method, an exception is thrown on the cuCtxSynchronize() call. Exception:
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_UNKNOWN
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:263)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1709)
at com.ifntung.cufft.HilbertSpectrumTicksKernelInvoker.multiplyOn2(HilbertSpectrumTicksKernelInvoker.java:73)
at com.ifntung.cufft.FFTTransformer.main(FFTTransformer.java:123)
I tried doing the same in Visual Studio C++ and there were no problems. Could you please help me?
P.S.
I can work around this problem, but then I need to copy the data from device memory to host memory and back again, creating new pointers every time before calling new CUDA functions, which slows down my program.

At which line exactly does the error occur?
A CUDA error can also be reported for an earlier call.
Why do you use Pointer.to(inputDataPointer)? You already have that device pointer. Are you now passing a pointer to the device pointer to the device?
Pointer kernelParameters = Pointer.to(Pointer.to(inputDataPointer),
I also recommend using the "this" qualifier, or some other marking, to identify instance variables. I refuse to look through code, especially code as long and nested as your example, if I cannot see which scope the variables in a method have when trying to debug it just by reading it.
I don't want to keep asking myself where a variable comes from.
If complex code in a question on SO is not formatted properly, I don't read it.

Java Reflection Problem

Hi, I am currently doing my final year project; I need to develop an algorithm visualization tool. I need to cater for user-defined algorithms, that is, animate the algorithm the user types into a text editor provided in my tool.
I am using the Java Compiler API to compile the code that the user has typed and saved. My tool offers a set of classes that the user can use in his/her algorithm.
For example:
myArray (this class is provided by my tool):
import java.awt.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.accessibility.AccessibleContext;
import javax.swing.*;
public class myArray extends JComponent {
int size = 0;
int count = 0;
int[]hold;
Thread th;
public myArray(int[]arr)//pass user array as parameter
{
//th = new Thread();
size=arr.length;
hold = arr;//make a copy of the array so as to use later in swap operation
}
public int length()
{
return hold.length;
}
public void setAccessibleContext(AccessibleContext accessibleContext) {
this.accessibleContext = accessibleContext;
}
public void paintComponent(Graphics g)
{
super.paintComponent(g);
Graphics2D g2d = (Graphics2D) g;
this.setPreferredSize(new Dimension(360,100));
for(int i=1; i<=size; i++)
{
g2d.drawRect((i*30), 30, 30, 50);
}
for(int i=1; i<=size; i++)
{
g2d.drawString(Integer.toString(hold[i-1]), (i*30)+15, 30+25);
}
}
public void set(int i, int j)//position of the two elements to swap in the array
{
try {
th.sleep(2000);//sleep before swapping because else user won't see original array since it would swap and then sleep
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
int temp = hold[i];
hold[i] = hold[j];
hold[j] = temp;
hold[i]=j;
this.repaint();//can use repaint with a class that extends JPanel
}
public void swap(int i, int j)//position of the two elements to swap in the array
{
try {
th.sleep(2000);//sleep before swapping because else user won't see original array since it would swap and then sleep
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
int temp = hold[i];
hold[i] = hold[j];
hold[j] = temp;
this.repaint();//can use repaint with a class that extends JPanel
}
public int get(int pos)
{
return hold[pos];
}
}
This is a portion of my GUI that will cause the compilation:
JavaCompiler jc = null;
StandardJavaFileManager sjfm = null;
File javaFile = null;
String[] options = null;
File outputDir = null;
URL[] urls = null;
URLClassLoader ucl = null;
Class clazz = null;
Method method = null;
Object object = null;
try
{
jc = ToolProvider.getSystemJavaCompiler();
sjfm = jc.getStandardFileManager(null, null, null);
File[] files = new File[1];
//files[0] = new File("C:/Users/user/Documents/NetBeansProjects/My_Final_Year_Project/myArray.java");
//files[1] = new File("C:/Users/user/Documents/NetBeansProjects/My_Final_Year_Project/Tool.java");
files[0] = new File("C:/Users/user/Documents/NetBeansProjects/My_Final_Year_Project/userDefined.java");
// getJavaFileObjects’ param is a vararg
Iterable fileObjects = sjfm.getJavaFileObjects(files);
jc.getTask(null, sjfm, null, null, null, fileObjects).call();
// Add more compilation tasks
sjfm.close();
options = new String[]{"-d", "C:/Users/user/Documents/NetBeansProjects/My_Final_Year_Project"};
jc.getTask(null, sjfm, null, Arrays.asList(options), null, fileObjects).call();
outputDir = new File("C:/Users/user/Documents/NetBeansProjects/My_Final_Year_Project");
urls = new URL[]{outputDir.toURL()};
ucl = new URLClassLoader(urls);
clazz = ucl.loadClass("userDefined");
method = clazz.getMethod("user", null);
object = clazz.newInstance();
Object ob = method.invoke(object, null);
}
This is an example of a user-defined algo(userDefined.java):
import java.awt.*;
import javax.swing.*;
public class userDefined
{
public void user()
{
int [] numArr = {1,3,1,-1,5,-5,0,7,12,-36};
myArray myArray = new myArray(numArr);
JFrame frame = new JFrame("Rectangles");
frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
frame.setSize(360, 300);
frame.setLocationRelativeTo(null);
frame.setVisible(true);
frame.add(myArray);
for (int i=myArray.length(); i>1; i--)
{
for (int j=0; j<i-1; j++)
{
if (myArray.get(j) > myArray.get(j+1))
{
myArray.swap(j, j+1);
}
}
}
}
}
The problem is that if I use reflection as above, I only get a white window that does not show the animation but just displays the result at the very end.
However, if I use this instead of reflection (and change the method void user() to static void main(String[] args) in userDefined.java):
JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
if(compiler.run(null, null, null, "userDefined.java") != 0) {
System.err.println("Could not compile.");
System.exit(0);
}
try {
Runtime rt = Runtime.getRuntime();
Process pr = rt.exec("java "+"userDefined");
BufferedReader input = new BufferedReader(new InputStreamReader(pr.getInputStream()));
String line=null;
while((line=input.readLine()) != null) {
System.out.println(line);
}
} catch(Exception e) {
System.out.println(e.toString());
e.printStackTrace();
}
It works, provided that after the first compilation I place the myArray class in the same folder as userDefined.java. In this case I can see the animation take place correctly.
How do I use reflection to invoke the main method instead of using an instance of the class?
Please, I really need some help with this. Thanks!

You are violating the first rule of Swing: access Swing components only on the EDT (Event Dispatch Thread).
When you start your program using the main method, you are violating that rule. This happens to work, but it might have all kinds of weird effects. This is not a theoretical warning; it has happened to me and it is not nice.
When you run it using reflection from your code, you are most likely on the EDT, so your algorithm runs to completion before the GUI gets updated again (which also happens on the EDT). That's why you see only the final result of the algorithm.
The correct way to do this would be:
Run the algorithm in a separate thread and make sure all changes to your myArray component happen on the EDT, using SwingUtilities.invokeAndWait or SwingUtilities.invokeLater.
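A minimal sketch of that approach, assuming a bubble sort like the one in the question. The class name, the sortInBackground helper, and the repaint callback are hypothetical stand-ins for the poster's tool; only SwingUtilities.invokeAndWait is actual Swing API.

```java
import java.lang.reflect.InvocationTargetException;
import javax.swing.SwingUtilities;

// Hypothetical helper: runs the algorithm on a worker thread and performs
// every change to the displayed data on the EDT.
class BackgroundSortSketch {
    static Thread sortInBackground(int[] data, Runnable repaint, long delayMs) {
        Thread worker = new Thread(() -> {
            try {
                for (int i = data.length; i > 1; i--) {
                    for (int j = 0; j < i - 1; j++) {
                        if (data[j] > data[j + 1]) {
                            // Sleeping here blocks only the worker, never the EDT.
                            Thread.sleep(delayMs);
                            final int pos = j;
                            // Swap and repaint on the EDT, then wait until it is done.
                            SwingUtilities.invokeAndWait(() -> {
                                int tmp = data[pos];
                                data[pos] = data[pos + 1];
                                data[pos + 1] = tmp;
                                repaint.run();
                            });
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (InvocationTargetException e) {
                throw new RuntimeException(e.getCause());
            }
        });
        worker.start();
        return worker;
    }
}
```

In the poster's tool, the callback would presumably be the component's repaint method and the delay the 2000 ms used in swap. The important design choice is that the sleep happens off the EDT, so the window can redraw between swaps.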

AS3: Optimizing Object Memory Size

I have a class that I wrote, and it seems bigger than it should be. It doesn't extend anything and has very little going on, or so I thought, but each instance is taking up just under 100 bytes (thanks back2dos). I guess that I don't have a very good understanding of what really affects how much memory an object takes up in AS3.
If anyone can point me to some reading on the subject that might be helpful, or perhaps explain some insight into how to think about this, that would be awesome.
I would like to keep a LOT of these objects in memory - and I thought I could until now, but at this size I'm going to have to create them or use an object pooling technique of some kind.
Thanks for the assistance.
Edit: Although I've got this in order, I'm keeping the code I posted here for completeness. The class has been heavily modified from the original version. Values that were referencing other files have been made static as to allow the code to run for someone else ( in theory hehehe... ).
Although my situation is sorted out, I'll give the answer to a good reference for information on classes and memory.
In this case the class has 15 variables. I'm only using a single String and a bunch of ints, Numbers, and Booleans, with some references to more of the same in globally available XML data. It also imports Point for the constructor, though no points are stored. In testing, even without the global XML references or the Point class, it's still around ~84k each. There are getters for 7 of the variables and a couple of methods in addition to the constructor, all of which are less than 20 lines (and I have a very sparse coding style).
The class mentioned for reference, but feel free to generalize:
package
{
public class AObject
{
private var _counter:int;
private var _frames:int;
private var _speed:int;
private var _currentState:String;
private var _currentFrame:int;
private var _offset:int;
private var _endFrame:int;
private var _type:int;
private var _object:int;
private var _state:int;
private var _x:Number;
private var _y:Number;
private var _w:int;
private var _h:int;
private var _update:Boolean;
public function AObject( targetX : int, targetY : int, state : int, object : int, type : int )
{
_x = targetX;
_y = targetY;
_type = type;
_object = object;
_state = state;
_counter = 0;
_w = 32;
_h = 32
_update = true;
setState( state );
}
public function setState( state:int ) : void
{
_currentState = "bob";
var frameCounter : int = 0;
var stateCounter : int = state - 1;
while ( state > 0 )
{
frameCounter += 4;
--stateCounter;
}
_offset = frameCounter;
_currentFrame = _offset;
_speed = 10;
_frames = 4;
_endFrame = _offset + _frames - 1;
}
public function get state() : int
{
return _state;
}
public function animate() : Boolean
{
if ( count() )
{
if( _currentFrame < _endFrame )
{
++_currentFrame;
}
else
{
_currentFrame = _offset;
}
_speed = 10;
return true;
}
else
{
return false;
}
}
private var adder: Number = 0;
private function count():Boolean
{
_counter++;
if ( _counter == _speed )
{
_counter = 0;
return true;
}
else
{
return false;
}
}
public function get x():int
{
return _x;
}
public function get y():int
{
return _y;
}
public function get type():int
{
return _type;
}
public function get object():int
{
return _object;
}
public function get currentFrame():int
{
return _currentFrame;
}
public function get w():int
{
return _w;
}
public function get h():int
{
return _h;
}
}
}

I am amazed this compiles at all... When I try to compile it with the Flex SDK, it creates an enormous collision with the built-in class Object, which is the base class of every class, making my trace output overflow...
Other than that, this is an infinite loop if you pass a value for state bigger than 0:
while ( state > 0 )
{
frameCounter += 4;
--stateCounter;
}
But it seems really strange that these objects are so big... After renaming the class and taking care to pass in 0 for the state (to avoid the infinite loop), I ran a test:
package {
import flash.display.Sprite;
import flash.sampler.getSize;
import flash.system.System;
public class Main extends Sprite {
public function Main():void {
const count:int = 100000;
var start:uint = System.totalMemory;
var a:Array = [];
for (var i:int = 0; i < count; i++) {
a.push(new MyObject(1, 2, 0, 4, 5));
}
var mem:uint = System.totalMemory - start - getSize(a);
trace("total of "+mem+" B for "+count+" objects, aprox. avg. size per object: "+(mem/count));
}
}
}
it yields:
total of 10982744 B for 100000 objects, aprox. avg. size per object: 109.82744
So that's quite OK... I think the actual size should be 4 (for the Boolean) + 4 * 11 (for the ints) + 4 (for the reference to the String) + 8 * 3 (for the three Numbers; you have the adder declared just above count) + 8 for an empty class (reference to the traits object + something else), giving you a total of 88 bytes, which is what you get if you call getSize on the object. Please note, however, that getSize will only give you the size of the object itself (as calculated here), ignoring the size of any strings or other objects that your object references.
So yeah, apart from that name, which you should definitely change, the problem must be somewhere else...
greetz
back2dos

If you really want to save space, you can fake shorts by using unsigned integers and using the upper and lower 16 bits for different values.
ints are 4 bytes by nature, so you can share one int between two values that are each less than 2^16.
width height
0xFFFF + 0xFFFF
offset endframe
0xFFFF + 0xFFFF
This gets ugly, though, when you want to write or read anything, because to write width or height you'd have to do:
writing:
size = (width & 0x0000FFFF) << 16 | (height & 0x0000FFFF);
reading:
function get width():uint { return (size & 0xFFFF0000) >>> 16; }
That's ugly. Since you're using getters anyway, and assuming computation speed is not an issue, you could use internal byte arrays, which could give you even more granularity in how you store your information. Assuming your strings are more than 4 bytes, it makes more sense to use a number rather than a string.
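The upper/lower-half packing described above can be sketched in Java, which has the same 32-bit int and the same shift operators. The class and method names are hypothetical, and both values are assumed to fit in 16 bits.

```java
// Hypothetical helper: stores two 16-bit values in one 32-bit int.
class PackedSizeSketch {
    // Width goes into the upper 16 bits, height into the lower 16 bits.
    static int pack(int width, int height) {
        return ((width & 0xFFFF) << 16) | (height & 0xFFFF);
    }

    // Unsigned shift (>>>) so a width of 0x8000 or more does not come back negative.
    static int width(int packed) {
        return packed >>> 16;
    }

    static int height(int packed) {
        return packed & 0xFFFF;
    }
}
```

For example, pack(32, 32) stores both dimensions of the question's 32x32 object in one int, and width and height read them back unchanged.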
Also, I believe you will actually get some memory increase by declaring the class as final, as I believe final functions get placed into the traits object, rather than
