I also understand that I can do this programmatically by creating a performance counter with System.Diagnostics.PerformanceCounter and reading its value with the NextValue() method.
Process p = Process.GetProcessById(10204);
PerformanceCounter ramCounter = new PerformanceCounter("Process", "Working Set - Private", p.ProcessName);
PerformanceCounter cpuCounter = new PerformanceCounter("Process", "% User Time", p.ProcessName);
while (true)
{
Thread.Sleep(1000);
double ram = ramCounter.NextValue();
double cpu = cpuCounter.NextValue();
Console.WriteLine("RAM: " + (ram / 1024 / 1024) + "MB");
Console.WriteLine("CPU: " + (cpu) + " %");
}
I found this code online. What I am more interested in is calculating the average CPU and average RAM at the end of the test and storing each in a variable to compare against another variable. Any nice ideas?
Thanks
using System;
using System.Diagnostics;
using System.Threading;
namespace ConsoleApplication4
{
class Program
{
static void Main(string[] args)
{
double totalRam = 0.0d;
double cpu = 0.0d;
Process p = Process.GetProcessById(1188);
var ramCounter = new PerformanceCounter("Process", "Working Set - Private", p.ProcessName);
var cpuCounter = new PerformanceCounter("Process", "% User Time", p.ProcessName);
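int n = 0;
// Rate counters such as "% User Time" return 0 on the first read,
// so prime the counter once before sampling.
cpuCounter.NextValue();
while (n < 20)
{
Thread.Sleep(1000);
double ram = ramCounter.NextValue();
cpu += cpuCounter.NextValue();
totalRam += (ram / 1024 / 1024); // bytes -> MB
n++;
}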
double avgRam = totalRam/n;
double avgCpu = cpu/n;
Console.WriteLine("Average Ram is {0} ", avgRam);
Console.WriteLine("Average Cpu is {0} ", avgCpu);
Console.ReadLine();
}
}
}
I am working on a project to write to and read from a TP-Link / Kasa power strip or smart plug.
The data that is sent is JSON that has been "autokey encrypted".
So far I have been able to convert a TypeScript encrypt function, and it works well: I get the expected result. However, I need to add a "header" to my encrypted data. That header is 3 null bytes followed by a byte holding the length of the encrypted data.
The TypeScript example has this bit of code to "encrypt with headers"; however, I've hit a bit of a wall trying to convert it to something usable. Can someone nudge me along the path?
First are the two typescript functions: (borrowed from https://github.com/plasticrake/tplink-smarthome-crypto/blob/master/src/index.ts)
/**
* Encrypts input where each byte is XOR'd with the previous encrypted byte.
*
* @param input - Data to encrypt
* @param firstKey - Value to XOR first byte of input
* @returns encrypted buffer
*/
export function encrypt(input: Buffer | string, firstKey = 0xab): Buffer {
const buf = Buffer.from(input);
let key = firstKey;
for (let i = 0; i < buf.length; i += 1) {
// eslint-disable-next-line no-bitwise
buf[i] ^= key;
key = buf[i];
}
return buf;
}
/**
* Encrypts input that has a 4 byte big-endian length header;
* each byte is XOR'd with the previous encrypted byte.
*
* @param input - Data to encrypt
* @param firstKey - Value to XOR first byte of input
* @returns encrypted buffer with header
*/
export function encryptWithHeader(
input: Buffer | string,
firstKey = 0xab
): Buffer {
const msgBuf = encrypt(input, firstKey);
const outBuf = Buffer.alloc(msgBuf.length + 4);
outBuf.writeUInt32BE(msgBuf.length, 0);
msgBuf.copy(outBuf, 4);
return outBuf;
}
Second is what I have so far.
// This part works well and produces the expected results
String encrypt(String input)
{
int16_t firstKey = 0xab;
String buf;
int key;
int i;
buf = input;
key = firstKey;
i = 0;
for (;i < buf.length();(i = i + 1))
{
buf[i] ^= key;
key = buf[i];
}
return buf;
}
// This does not function yet, as I'm pretty lost..
// This was originally converted from typescript with https://andrei-markeev.github.io/ts2c/
// I started work on converting this, but ran into errors I don't know how to solve.
String encryptWithHeader(String input){
String msgBuf;
String outBuf;
int16_t firstKey = 0xab;
char * null = NULL;
msgBuf = encrypt(input);
outBuf = msgBuf.length() +1;
//this is where I got lost...
assert(null != NULL);
null[0] = '\0';
strncat(null, outBuf, msgBuf.length());
str_int16_t_cat(null, 4);
outBuf = msgBuf + 4
return outBuf;
}
Finally, the data:
//this is the unencrypted json
String offMsg = "{\"system\":{\"set_relay_state\":{\"state\":0}}}";
//current encrypt function produces:
d0f281f88bff9af7d5ef94b6c5a0d48bf99cf091e8b7c4b0d1a5c0e2d8a381f286e793f6d4eedea3dea3
//the working "withheaders" should produce:
0000002ad0f281f88bff9af7d5ef94b6c5a0d48bf99cf091e8b7c4b0d1a5c0e2d8a381f286e793f6d4eedea3dea3
Admittedly my C/C++ ability is very limited and I can spell typescript, that's about all. I have a very extensive history with PHP. As useful as that is. So, I understand the basics of data structures and whatnot, but I'm venturing off into areas I've never been in. Any help would be greatly appreciated.
It looks like the encryption is fairly simple: write the current character XORed with the key to the buffer and make that newly written character the new key. It also looks like the "withHeaders" version adds the length of the encrypted string as a 4 byte integer to the start of the buffer. I think it might be easier to allocate a character array and pass that array to a function that writes the result to that buffer. For example:
void encryptWithHeader(byte buffer[], int bufferLength, byte key, String message) {
int i;
uint32_t messageLength = message.length();
Serial.println(message);
Serial.println(message.length());
// check that the 4-byte header plus the message fits in the buffer
if (bufferLength >= 4 && messageLength <= (uint32_t)(bufferLength - 4)) {
buffer[0] = messageLength >> 24 & 0xFF;
buffer[1] = messageLength >> 16 & 0xFF;
buffer[2] = messageLength >> 8 & 0xFF;
buffer[3] = messageLength & 0xFF;
for (i = 0; i < messageLength; i++) {
buffer[i + 4] = message[i] ^ key;
key = buffer[i + 4];
}
}
else { // we would have overrun the buffer
Serial.println("not enough room in buffer for message");
}
}
void setup() {
// put your setup code here, to run once:
Serial.begin(9600);
}
void loop() {
byte theBuffer[64];
int i;
String offMsg = "{\"system\":{\"set_relay_state\":{\"state\":0}}}";
encryptWithHeader(theBuffer, 64, 0xab, offMsg);
// now print it out to check
for (i = 0; i < offMsg.length() + 4; i++) {
if (theBuffer[i] < 0x10) // adds an extra zero if a byte prints as only 1 char
Serial.print("0");
Serial.print(theBuffer[i], HEX);
}
while (true)
;
}
If you want to send the character buffer to a remote device you can send it out one byte at a time:
for (i = 0; i < offMsg.length() + 4; i++)
Serial.write(theBuffer[i]);
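For checking the bytes off-device, the same scheme (a 4-byte big-endian length header followed by the autokey-XORed payload) can be sketched in a few lines of Python. This is only a desktop cross-check under the assumptions stated in the question; the function names below are made up for illustration and are not part of any TP-Link library.

```python
import struct

def autokey_encrypt(data: bytes, first_key: int = 0xAB) -> bytes:
    """XOR each byte with the previous encrypted byte (the first byte uses first_key)."""
    key = first_key
    out = bytearray()
    for b in data:
        key = b ^ key          # the encrypted byte becomes the next key
        out.append(key)
    return bytes(out)

def autokey_encrypt_with_header(data: bytes, first_key: int = 0xAB) -> bytes:
    """Prefix the encrypted payload with its length as a 4-byte big-endian integer."""
    payload = autokey_encrypt(data, first_key)
    return struct.pack(">I", len(payload)) + payload

if __name__ == "__main__":
    off_msg = b'{"system":{"set_relay_state":{"state":0}}}'
    print(autokey_encrypt_with_header(off_msg).hex())
```

For the offMsg string from the question, the payload part matches the hex string quoted there, preceded by the header bytes 00 00 00 2a (42, the payload length).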
I'm using the ESP32 module and I am trying to get the NTP time in milliseconds. I managed to get the time in seconds without any problem using a struct tm and the function getLocalTime().
I read on forums and on the internet that I had to use struct timeval and the function gettimeofday() instead to achieve this. So I replaced the struct and the function accordingly in my code but now I can't get the time anymore...
My code is as follows:
void printLocalTime()
{
//When using struct tm and getLocalTime() I can get the time without problem in seconds
struct timeval tv;
if (!gettimeofday(&tv, NULL)) {
Serial.println("Failed to obtain time");
return;
}
long int sec = tv.tv_sec*1000LL;
Serial.println(sec);
long int temp = tv.tv_usec/1000LL;
Serial.println(temp);
}
When I run this, all I'm getting is "Failed to obtain time"...
PS: I'm using arduino IDE and have included sys/time.h
Can anyone help me with this?
Many thanks
The (original) POSIX function has the following signature
int gettimeofday(struct timeval *tv, struct timezone *tz);
and it returns 0 on success and -1 on failure, so !gettimeofday(...) is true exactly when the call succeeded. The check should be
if (gettimeofday(&tv, NULL) != 0) {
Serial.println("Failed to obtain time");
return;
}
because it returns an int and not a bool, unlike the function you used before, which is defined as
bool getLocalTime(struct tm * info, uint32_t ms)
in esp32-hal-time.c and as
extern "C" bool getLocalTime(struct tm * info, uint32_t ms = 5000);
in Arduino.h
EDIT
As gettimeofday() represents the time since UNIX_Epoch (1970) try this first:
printf("TimeVal-sec = %lld\n", (long long) tv.tv_sec);
printf("TimeVal-usec = %lld\n", (long long) tv.tv_usec);
will print something like
TimeVal-sec = 1493735463
TimeVal-usec = 525199 // already usec part
To "rip" apart the seconds you do the following
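// Form the seconds of the day.
// Assumes SEC_PER_MIN = 60, SEC_PER_HOUR = 3600, SEC_PER_DAY = 86400, and a
// struct timezone tz filled in by gettimeofday(&tv, &tz) rather than passing NULL.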
long hms = tv.tv_sec % SEC_PER_DAY;
hms += tz.tz_dsttime * SEC_PER_HOUR;
hms -= tz.tz_minuteswest * SEC_PER_MIN;
// mod `hms` to ensure positive range of [0...SEC_PER_DAY)
hms = (hms + SEC_PER_DAY) % SEC_PER_DAY;
// Tear apart hms into h:m:s
int hour = hms / SEC_PER_HOUR;
int min = (hms % SEC_PER_HOUR) / SEC_PER_MIN;
int sec = (hms % SEC_PER_HOUR) % SEC_PER_MIN; // or hms % SEC_PER_MIN
This function gives you all the usec
static int64_t getNowUs() {
struct timeval tv;
gettimeofday(&tv, NULL);
return (int64_t)tv.tv_usec + tv.tv_sec * 1000000ll;
}
and if you need the time relative to the NTP epoch (1900) instead of the UNIX epoch (1970) you have to add the offset between the two
const unsigned long long EPOCH = 2208988800ULL;
uint64_t tv_ntp = tv.tv_sec + EPOCH;
For measuring elapsed time you process sec with sec and usec with usec. Hope this EDIT solves another POSIX/UNIX mystery.
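The arithmetic above is easy to sanity-check off-device. The following Python sketch uses the sample tv_sec/tv_usec values printed earlier. It combines them into milliseconds, splits the seconds of the day into h:m:s (UTC only, with no time zone or DST correction), and shifts to the NTP epoch. The helper names are hypothetical.

```python
SEC_PER_MIN = 60
SEC_PER_HOUR = 3600
SEC_PER_DAY = 86400
NTP_UNIX_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def to_millis(tv_sec: int, tv_usec: int) -> int:
    """Milliseconds since the UNIX epoch from a (seconds, microseconds) pair."""
    return tv_sec * 1000 + tv_usec // 1000

def utc_hms(tv_sec: int) -> tuple:
    """Hour, minute and second of the day in UTC."""
    hms = tv_sec % SEC_PER_DAY
    return (hms // SEC_PER_HOUR,
            (hms % SEC_PER_HOUR) // SEC_PER_MIN,
            hms % SEC_PER_MIN)

def to_ntp_seconds(tv_sec: int) -> int:
    """Seconds since the NTP epoch (1900) from seconds since the UNIX epoch."""
    return tv_sec + NTP_UNIX_OFFSET

if __name__ == "__main__":
    sec, usec = 1493735463, 525199
    print(to_millis(sec, usec))
    print("%02d:%02d:%02d" % utc_hms(sec))
    print(to_ntp_seconds(sec))
```

On a 32-bit target the same multiplication needs a 64-bit type (for example int64_t), since milliseconds since 1970 do not fit in 32 bits.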
I'm using arduino IDE & I tried what you said like this:
char usec[30];
struct timeval tv;
if (gettimeofday(&tv, NULL)!= 0) {
Serial.println("Failed to obtain time");
return;
}
sprintf(sec, "%lld",(long long)tv.tv_sec);
sprintf(usec, "%lld", (long long)tv.tv_usec);
//long long seconds = (long long)tv.tv_sec;
//long long microseconds = (long long)tv.tv_usec;
Serial.print("TimeVal-sec = ");
Serial.print(sec);
Serial.print("TimeVal-usec = ");
Serial.print(usec);
But what I get is this: TimeVal-sec = 5TimeVal-usec = 792802
I'm also on Windows, is it a problem?
import ntplib
server = 'pool.ntp.org'
ntp = ntplib.NTPClient()
ntpResponse = ntp.request(server)
if ntpResponse:
    # calculate the NTP time and convert it into microseconds
    ntp_time = float(ntpResponse.tx_time * 1000000)
    #print("ntp_time", ntp_time)
The above code gets the NTP server time in microseconds in Python (it needs the ntplib package).
I am trying to implement simple matrix multiplication program using shared memory in JCuda.
Following is my JCudaSharedMatrixMul.java code:
import static jcuda.driver.JCudaDriver.cuCtxCreate;
import static jcuda.driver.JCudaDriver.cuCtxSynchronize;
import static jcuda.driver.JCudaDriver.cuDeviceGet;
import static jcuda.driver.JCudaDriver.cuInit;
import static jcuda.driver.JCudaDriver.cuLaunchKernel;
import static jcuda.driver.JCudaDriver.cuMemAlloc;
import static jcuda.driver.JCudaDriver.cuMemFree;
import static jcuda.driver.JCudaDriver.cuMemcpyDtoH;
import static jcuda.driver.JCudaDriver.cuMemcpyHtoD;
import static jcuda.driver.JCudaDriver.cuModuleGetFunction;
import static jcuda.driver.JCudaDriver.cuModuleLoad;
import static jcuda.runtime.JCuda.cudaEventCreate;
import static jcuda.runtime.JCuda.cudaEventRecord;
import static jcuda.runtime.JCuda.*;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.CUcontext;
import jcuda.driver.CUdevice;
import jcuda.driver.CUdeviceptr;
import jcuda.driver.CUfunction;
import jcuda.driver.CUmodule;
import jcuda.driver.JCudaDriver;
import jcuda.runtime.cudaEvent_t;
public class JCudaSharedMatrixMul
{
public static void main(String[] args) throws IOException
{
// Enable exceptions and omit all subsequent error checks
JCudaDriver.setExceptionsEnabled(true);
// Create the PTX file by calling the NVCC
String ptxFilename = preparePtxFile("JCudaSharedMatrixMulKernel.cu");
//Initialize the driver and create a context for the first device.
cuInit(0);
CUdevice device = new CUdevice();
cuDeviceGet (device, 0);
CUcontext context = new CUcontext();
cuCtxCreate(context, 0, device);
//Load PTX file
CUmodule module = new CUmodule();
cuModuleLoad(module,ptxFilename);
//Obtain a function pointer to the kernel function
CUfunction function = new CUfunction();
cuModuleGetFunction(function, module, "jCudaSharedMatrixMulKernel");
int numRows = 16;
int numCols = 16;
//Allocate and fill Host input Matrices:
float hostMatrixA[] = new float[numRows*numCols];
float hostMatrixB[] = new float[numRows*numCols];
float hostMatrixC[] = new float[numRows*numCols];
for(int i = 0; i<numRows; i++)
{
for(int j = 0; j<numCols; j++)
{
hostMatrixA[i*numCols+j] = (float) 1;
hostMatrixB[i*numCols+j] = (float) 1;
}
}
// Allocate the device input data, and copy the
// host input data to the device
CUdeviceptr devMatrixA = new CUdeviceptr();
cuMemAlloc(devMatrixA, numRows * numCols * Sizeof.FLOAT);
//This is the part where it gives me the error
cuMemcpyHtoD(devMatrixA, Pointer.to(hostMatrixA), numRows * numCols * Sizeof.FLOAT);
CUdeviceptr devMatrixB = new CUdeviceptr();
cuMemAlloc(devMatrixB, numRows * numCols * Sizeof.FLOAT);
//This is the part where it gives me the error
cuMemcpyHtoD(devMatrixB, Pointer.to(hostMatrixB ), numRows * numCols * Sizeof.FLOAT);
//Allocate device matrix C to store output
CUdeviceptr devMatrixC = new CUdeviceptr();
cuMemAlloc(devMatrixC, numRows * numCols * Sizeof.FLOAT);
// Set up the kernel parameters: A pointer to an array
// of pointers which point to the actual values.
Pointer kernelParameters = Pointer.to(
Pointer.to(new int[]{numCols}),
Pointer.to(devMatrixA),
Pointer.to(devMatrixB),
Pointer.to(devMatrixC));
//Kernel thread configuration
int blockSize = 16;
int gridSize = 1;
cudaEvent_t start = new cudaEvent_t();
cudaEvent_t stop = new cudaEvent_t();
cudaEventCreate(start);
cudaEventCreate(stop);
long start_nano=System.nanoTime();
cudaEventRecord(start, null);
cuLaunchKernel(function,
gridSize, 1, 1,
blockSize, 16, 1,
250, null, kernelParameters, null);
cuCtxSynchronize();
cudaEventRecord(stop, null);
long end_nano=System.nanoTime();
float elapsedTimeMsArray[] = { Float.NaN };
cudaEventElapsedTime(elapsedTimeMsArray, start, stop);
float elapsedTimeMs = elapsedTimeMsArray[0];
System.out.println("Time Required (Using cudaevent elapsed time) = " + " " +elapsedTimeMs+
"Time Required (Using nanotime)= "+(end_nano-start_nano)/1000000);
// Allocate host output memory and copy the device output
// to the host.
//This is the part where it gives me the error
cuMemcpyDtoH(Pointer.to(hostMatrixC), devMatrixC, numRows * numCols * Sizeof.FLOAT);
//verify the result
for (int i =0; i<numRows; i++)
{
for (int j =0; j<numCols; j++)
{
System.out.print(" "+ hostMatrixC[i*numCols+j]);
}
System.out.println("");
}
cuMemFree(devMatrixA);
cuMemFree(devMatrixB);
cuMemFree(devMatrixC);
}
private static String preparePtxFile(String cuFileName) throws IOException
{
int endIndex = cuFileName.lastIndexOf('.');
if (endIndex == -1)
{
endIndex = cuFileName.length()-1;
}
String ptxFileName = cuFileName.substring(0, endIndex+1)+"ptx";
File ptxFile = new File(ptxFileName);
if (ptxFile.exists())
{
return ptxFileName;
}
File cuFile = new File(cuFileName);
if (!cuFile.exists())
{
throw new IOException("Input file not found: "+cuFileName);
}
String modelString = "-m"+System.getProperty("sun.arch.data.model");
String command = "nvcc " + modelString + " -ptx "+ cuFile.getPath()+" -o "+ptxFileName;
System.out.println("Executing\n"+command);
Process process = Runtime.getRuntime().exec(command);
String errorMessage = new String(toByteArray(process.getErrorStream()));
String outputMessage = new String(toByteArray(process.getInputStream()));
int exitValue = 0;
try
{
exitValue = process.waitFor();
}
catch (InterruptedException e)
{
Thread.currentThread().interrupt();
throw new IOException(
"Interrupted while waiting for nvcc output", e);
}
if (exitValue != 0)
{
System.out.println("nvcc process exitValue "+exitValue);
System.out.println("errorMessage:\n"+errorMessage);
System.out.println("outputMessage:\n"+outputMessage);
throw new IOException(
"Could not create .ptx file: "+errorMessage);
}
System.out.println("Finished creating PTX file");
return ptxFileName;
}
private static byte[] toByteArray(InputStream inputStream) throws IOException
{
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte buffer[] = new byte[8192];
while (true)
{
int read = inputStream.read(buffer);
if (read == -1)
{
break;
}
baos.write(buffer, 0, read);
}
return baos.toByteArray();
}
}
Following is my JCudaSharedMatrixMulKernel.cu code:
extern "C"
__global__ void jCudaSharedMatrixMulKernel(int N,float *ad,float *bd,float *cd)
{
float pvalue=0;
int TILE=blockDim.x;
int ty=threadIdx.y;
int tx=threadIdx.x;
__shared__ float ads[4][4];
__shared__ float bds[4][4];
int Row = blockIdx.y * blockDim.y + threadIdx.y;
int Col = blockIdx.x * blockDim.x + threadIdx.x;
for(int i=0;i< N/TILE;++i)
{
ads[ty][tx] = ad[Row * N + (i * TILE) + tx];
bds[ty][tx] = bd[(i * TILE + ty) * N + Col];
__syncthreads();
for(int k=0;k<TILE;k++)
pvalue += ads[ty][k] * bds[k][tx];
__syncthreads();
}
cd[Row * N + Col] = pvalue;
}
In the example above, the total shared memory used per block is 2*4*4*4 = 128 bytes.
When I set the sharedMemBytes parameter of cuLaunchKernel to 0 (zero), I get the following error:
**Exception in thread "main" jcuda.CudaException: CUDA_ERROR_LAUNCH_FAILED
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:282)
at jcuda.driver.JCudaDriver.cuCtxSynchronize(JCudaDriver.java:1795)
at JCudaSharedMatrixMul.main(JCudaSharedMatrixMul.java:121)**
When I set it to 128, I get the same error. But when I set it to 129, I get the correct output! Any value from 129 to 49024 gives the correct result.
My question is: why don't I get the correct output when I set it to 128? Also, what is the maximum amount of shared memory that can be requested? Why does the 129-49024 range work here?
You're launching blocks of 16x16 threads:
cuLaunchKernel(function,
gridSize, 1, 1,
blockSize, 16, 1, <-- the first two params are block.x and block.y
250, null, kernelParameters, null);
so __shared__ float ads[4][4]; should not be working at all. For example, these lines of kernel code would be accessing those shared arrays out-of-bounds for some threads:
ads[ty][tx] = ad[Row * N + (i * TILE) + tx];
bds[ty][tx] = bd[(i * TILE + ty) * N + Col];
^ ^
| tx goes from 0..15 for a 16x16 threadblock
ty goes from 0..15 for a 16x16 threadblock
Your code is broken in this respect. If you run your code with cuda-memcheck it may catch these out-of-bounds accesses, even in your "passing" case. Looking at the matrixMulDrv CUDA sample code will be instructive: you'll see that the shared memory allocation there is 2*block_size*block_size, as it should be for your case as well, so your shared memory definitions should be [16][16], not [4][4]. It may be that the shared memory allocation granularity just happens to make things work when you request more than 128 bytes, but there is still a defect in your code.
Your shared definitions should be:
__shared__ float ads[16][16];
__shared__ float bds[16][16];
Since the above allocations are static allocations, and the sharedMemBytes parameter is defined as dynamic shared memory allocation, for this example you don't need to allocate any (0 is OK) dynamic shared memory, and it still works. The difference between static and dynamic is covered here.
The maximum shared memory per block is available in the documentation, or if you run the cuda deviceQuery sample code. It is 48K bytes for cc2.0 and newer devices.
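To see why the tile arrays must match the block dimensions, here is a small CPU-side sketch in Python that mirrors the kernel's loop structure. It is a plain simulation, assuming N is a multiple of the tile size; it is not JCuda or CUDA code, and the function name is made up. Every simulated thread (ty, tx) writes ads[ty][tx], so each tile needs tile x tile entries.

```python
def tiled_matmul(a, b, n, tile):
    """Multiply two n x n row-major matrices using tile x tile staging buffers."""
    assert n % tile == 0, "sketch assumes n is a multiple of tile"
    c = [0.0] * (n * n)
    for block_row in range(0, n, tile):
        for block_col in range(0, n, tile):
            # one accumulator per simulated thread in this block
            acc = [[0.0] * tile for _ in range(tile)]
            for i in range(n // tile):
                # stage one tile of A and one tile of B (the "shared memory")
                ads = [[a[(block_row + ty) * n + i * tile + tx]
                        for tx in range(tile)] for ty in range(tile)]
                bds = [[b[(i * tile + ty) * n + block_col + tx]
                        for tx in range(tile)] for ty in range(tile)]
                for ty in range(tile):
                    for tx in range(tile):
                        for k in range(tile):
                            acc[ty][tx] += ads[ty][k] * bds[k][tx]
            for ty in range(tile):
                for tx in range(tile):
                    c[(block_row + ty) * n + block_col + tx] = acc[ty][tx]
    return c

if __name__ == "__main__":
    n = 16
    ones = [1.0] * (n * n)
    result = tiled_matmul(ones, ones, n, tile=16)
    # one output element, and the static shared bytes for two 16x16 float tiles
    print(result[0], 2 * 16 * 16 * 4)
```

With 16x16 blocks the two statically declared tiles take 2*16*16*4 = 2048 bytes, well under the 48K limit mentioned above.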
How would one convert from 2x32bit uints to a Number and back (assume max value of 2^52)?
I believe the following would theoretically work (passing data around as a ByteArray for clarity, though an Array could work as storage as well), but it doesn't, because bitwise operators evidently force Number into 32 bits :\
(see: Binary math on Number objects limited to 32 bits?):
public static function read64BitNumberFromBuffer(buffer:ByteArray):Number {
var ch1:uint = buffer.readUnsignedInt();
var ch2:uint = buffer.readUnsignedInt();
var num:Number = ((ch1 << 32) | ch2);
return(num);
}
public static function write64BitNumberToBuffer(num:Number):ByteArray {
var ch1:uint = uint((num & 0xFFFFFFFF00000000) >> 32);
var ch2:uint = uint(num & 0xFFFFFFFF);
var buffer:ByteArray = new ByteArray();
buffer.writeUnsignedInt(ch1);
buffer.writeUnsignedInt(ch2);
return(buffer);
}
One could use a library like as3crypto's BigInteger to handle this, but that seems like an awful lot of bloat for such a discrete need. Is there a robust bit of code that could be injected into the above functions to make them return the correct values?
Although I'd prefer a pure Actionscript solution, as a point of interest- are bitwise operators in Crossbridge also limited to 32 bits? (btw- I need 1500 reputation to create a tag "crossbridge", can someone do it on my behalf?)
EDIT: I tried readDouble()/writeDouble() as well, but under a more thorough test it seemed to reverse the bytes for some reason (I tried playing with the endian setting, to no avail; it only affected the output in the wrong way).
OK- this seems to work perfectly:
package
{
import flash.display.Sprite;
import flash.utils.ByteArray;
public class TEMP extends Sprite
{
public function TEMP()
{
var targetNumber:Number = 6697992365;
var buffer:ByteArray = new ByteArray();
var testNumber:Number;
write64BitNumberToBuffer(buffer, targetNumber);
buffer.position = 0;
testNumber = read64BitNumberFromBuffer(buffer);
if(targetNumber == testNumber) {
trace("Passed! Both numbers are", targetNumber);
} else {
trace("Failed! Test number is", testNumber, "When it should be", targetNumber);
}
}
public static function read64BitNumberFromBuffer(buffer:ByteArray):Number {
var finalNumber:Number;
var str:String = '';
var byte:uint;
var chr:String;
while(str.length < 16) {
byte = buffer.readUnsignedByte();
chr = byte.toString(16);
if(chr.length == 1) {
chr = '0' + chr;
}
str += chr;
}
finalNumber = Number('0x' + str);
return(finalNumber);
}
public static function write64BitNumberToBuffer(buffer:ByteArray, num:Number):void {
var hexString:String = num.toString(16);
var idx:uint = 16 - hexString.length;
var byte:uint;
while(idx--) {
hexString = '0' + hexString;
}
for(idx = 0; idx < hexString.length; idx += 2) {
byte = uint('0x' + hexString.substr(idx, 2));
buffer.writeByte(byte);
}
}
}
}
Output: Passed! Both numbers are 6697992365
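An alternative to going through a hex string, shown here as a Python sketch rather than ActionScript, is to avoid bitwise operators entirely and split the value with division and modulo by 2^32. This is a swapped-in technique, not what the code above does, and it assumes non-negative integers below 2^53 (the exact-integer range of a double). The function names are hypothetical.

```python
import struct

TWO_32 = 4294967296  # 2**32

def write_u64(num: int) -> bytes:
    """Pack a non-negative integer below 2**53 as two big-endian uint32 words."""
    if not 0 <= num < 2 ** 53:
        raise ValueError("value out of the supported range")
    hi = num // TWO_32      # arithmetic instead of (num >> 32)
    lo = num % TWO_32       # arithmetic instead of (num & 0xFFFFFFFF)
    return struct.pack(">II", hi, lo)

def read_u64(buf: bytes) -> int:
    """Inverse of write_u64: read two big-endian uint32 words."""
    hi, lo = struct.unpack(">II", buf[:8])
    return hi * TWO_32 + lo  # arithmetic instead of (hi << 32) | lo

if __name__ == "__main__":
    target = 6697992365
    assert read_u64(write_u64(target)) == target
    print(write_u64(target).hex())
```

In ActionScript the same idea would presumably use Math.floor(num / 4294967296) and num % 4294967296 on Number values, which stay exact below 2^53; that translation is untested here.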
I have an image that is 512px wide. This piece of code throws
RasterFormatException (x+width) is outside Raster
I don't understand what I'm doing wrong; when I check the raster size, it says it's 512.
private void automaticStaticSpriteLoader(String loadedName, String imgLoc, BufferedImage[] biArray, int numberOfSpritesToLoad, int numberOfSpritesInImage, int percentComplete){
try {
temporaryBigImg = ImageIO.read(getClass().getResource(imgLoc + ".png"));
} catch (IOException e) {
System.out.println(classNumber + " Error Loading Sprite Images. Shutting Down.");
e.printStackTrace();
}
for(int i = 0; i<numberOfSpritesToLoad;i++){
biArray[i] = temporaryBigImg.getSubimage(counterX, counterY, 32, 32);
System.out.println("CounterX = " + counterX + " CounterY = " + counterY + " Current Index = " + i);
if(counterX == 512){
counterY += 32;
counterX = -32;
}
counterX+=32;
}
}
You are updating counterX and counterY too late.
You have to check whether counterX >= 512 and, if so, increment counterY and reset counterX before you call getSubimage.
The code as posted will eventually call getSubimage(512, 0, 32, 32) and throw before the counterX == 512 test is ever reached. Try printing the actual values you pass in, and you will see what is wrong.
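The fix described above can be sketched as follows (in Python for brevity, with hypothetical names): wrap the x coordinate before it is used, so no tile ever starts at x = 512.

```python
def sprite_origins(count, sheet_width=512, tile=32):
    """Yield the top-left (x, y) of each tile, scanning the sheet row by row."""
    x = y = 0
    for _ in range(count):
        if x + tile > sheet_width:   # wrap BEFORE using the coordinates
            x = 0
            y += tile
        yield (x, y)
        x += tile

if __name__ == "__main__":
    origins = list(sprite_origins(18))
    print(origins[15], origins[16], origins[17])
```

In the Java loop the equivalent change would be to move the bounds test above the getSubimage call, so that every (x, y) passed in satisfies x + 32 <= 512.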