I have a CUDA kernel in which each thread holds a value and needs to add that value to one or more lists in shared memory. For each of those lists, the thread needs to get an index (unique within that list) at which to store the value.
The real code is different, but there are lists like:
typedef struct {
unsigned int numItems;
float items[MAX_NUM_ITEMS];
} List;
__shared__ List lists[NUM_LISTS];
The numItems values are all initially set to 0, and then a __syncthreads() is done.
To add its value to the lists, each thread does:
for(int list = 0; list < NUM_LISTS; ++list) {
if(should_add_to_list(threadIdx, list)) {
unsigned int index = atomicInc(&lists[list].numItems, 0xffffffff);
assert(index < MAX_NUM_ITEMS); // always true
lists[list].items[index] = my_value;
}
}
This works most of the time, but after making unrelated changes in other parts of the kernel (such as no longer checking asserts that always succeed), sometimes two threads get the same index for one list, or indices are skipped.
The final value of numItems is always correct, however.
With the following custom implementation, atomicInc_, used instead, it seems to work correctly:
__device__ static inline uint32_t atomicInc_(uint32_t* ptr) {
uint32_t value;
do {
value = *ptr;
} while(atomicCAS(ptr, value, value + 1) != value);
return value;
}
Are the two atomicInc functions equivalent, and is it valid to use atomicInc that way to get unique indices?
According to the CUDA programming guide, the atomic functions do not imply memory ordering constraints, and different threads can access the numItems of different lists at the same time: could this cause it to fail?
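For reference, the CUDA programming guide documents atomicInc(address, val) as storing ((old >= val) ? 0 : (old + 1)) and returning old. The sketch below is a single-threaded Python model of that rule next to the CAS loop from the question. It is not CUDA, it says nothing about memory ordering or about what the hardware does, and the function names are made up for illustration. It only shows that, with val = 0xffffffff, both variants return the same old value and leave the same counter behind, including at the wrap-around.

```python
UINT32_MAX = 0xFFFFFFFF


def atomic_inc(mem, addr, limit):
    """Model of the documented atomicInc rule: store 0 if old >= limit, else old + 1."""
    old = mem[addr]
    mem[addr] = 0 if old >= limit else old + 1
    return old


def atomic_cas(mem, addr, compare, val):
    """Model of atomicCAS on a 32-bit word: store val only if the word equals compare."""
    old = mem[addr]
    if old == compare:
        mem[addr] = val & UINT32_MAX  # emulate uint32_t wrap-around
    return old


def atomic_inc_cas(mem, addr):
    """Model of the custom atomicInc_ loop from the question."""
    while True:
        value = mem[addr]
        if atomic_cas(mem, addr, value, value + 1) == value:
            return value


counter = [0]
print([atomic_inc(counter, 0, UINT32_MAX) for _ in range(4)])
```

In this sequential model every call hands out a distinct index, which is the behaviour the kernel relies on.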
Edit:
The real kernel looks like this:
Basically there is a list of spot blocks, containing spots. Each spot has XY coordinates (col, row). The kernel needs to find, for each spot, the spots that are in a certain window (col/row difference) around it, and put them into a list in shared memory.
The kernel is called with a fixed number of warps. A CUDA block corresponds to a group of spot blocks (here 3). These are called the local spot blocks.
First it takes the spots from the block's 3 spot blocks, and copies them into shared memory (localSpots[]).
For this it uses one warp for each spot block, so that the spots can be read coalesced. Each thread in the warp is a spot in the local spot block.
The spot block indices are here hardcoded (blocks[]).
Then it goes through the surrounding spot blocks: these are all the spot blocks that may contain spots close enough to a spot in the local spot blocks. The surrounding spot blocks' indices are also hardcoded here (sblocks[]).
In this example it only uses the first warp for this, and traverses sblocks[] iteratively. Each thread in the warp is a spot in the surrounding spot block.
It also iterates through the list of all the local spots. If the thread's spot is close enough to the local spot, it inserts it into the local spot's list, using atomicInc to get an index.
When executed, the printf shows that for a given local spot (here the one with row=37, col=977), indices are sometimes repeated or skipped.
The real code is more complex/optimized, but this code already has the problem. Here it also only runs one CUDA block.
#include <assert.h>
#include <stdio.h>
#define MAX_NUM_SPOTS_IN_WINDOW 80
__global__ void Kernel(
const uint16_t* blockNumSpotsBuffer,
XGPU_SpotProcessingBlockSpotDataBuffers blockSpotsBuffers,
size_t blockSpotsBuffersElementPitch,
int2 unused1,
int2 unused2,
int unused3 ) {
typedef unsigned int uint;
if(blockIdx.x!=30 || blockIdx.y!=1) return;
int window = 5;
ASSERT(blockDim.x % WARP_SIZE == 0);
ASSERT(blockDim.y == 1);
uint numWarps = blockDim.x / WARP_SIZE;
uint idxWarp = threadIdx.x / WARP_SIZE;
int idxThreadInWarp = threadIdx.x % WARP_SIZE;
struct Spot {
int16_t row;
int16_t col;
volatile unsigned int numSamples;
float signalSamples[MAX_NUM_SPOTS_IN_WINDOW];
};
__shared__ uint numLocalSpots;
__shared__ Spot localSpots[3 * 32];
numLocalSpots = 0;
__syncthreads();
ASSERT(numWarps >= 3);
int blocks[3] = {174, 222, 270};
if(idxWarp < 3) {
uint spotBlockIdx = blocks[idxWarp];
ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);
uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
ASSERT(numSpots < WARP_SIZE);
size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;
uint outOffset;
if(idxThreadInWarp == 0) outOffset = atomicAdd(&numLocalSpots, numSpots);
outOffset = __shfl_sync(0xffffffff, outOffset, 0, 32);
if(idxThreadInWarp < numSpots) {
Spot* outSpot = &localSpots[outOffset + idxThreadInWarp];
outSpot->numSamples = 0;
uint32_t coord = blockSpotsBuffers.coord[inOffset];
UnpackCoordinates(coord, &outSpot->row, &outSpot->col);
}
}
__syncthreads();
int sblocks[] = { 29,30,31,77,78,79,125,126,127,173,174,175,221,222,223,269,270,271,317,318,319,365,366,367,413,414,415 };
if(idxWarp == 0) for(int block = 0; block < sizeof(sblocks)/sizeof(int); ++block) {
uint spotBlockIdx = sblocks[block];
ASSERT(spotBlockIdx < numSpotBlocks.x * numSpotBlocks.y);
uint numSpots = blockNumSpotsBuffer[spotBlockIdx];
uint idxThreadInWarp = threadIdx.x % WARP_SIZE;
if(idxThreadInWarp >= numSpots) continue;
size_t inOffset = (spotBlockIdx * blockSpotsBuffersElementPitch) + idxThreadInWarp;
uint32_t coord = blockSpotsBuffers.coord[inOffset];
if(coord == 0) return; // invalid surrounding spot
int16_t row, col;
UnpackCoordinates(coord, &row, &col);
for(int idxLocalSpot = 0; idxLocalSpot < numLocalSpots; ++idxLocalSpot) {
Spot* localSpot = &localSpots[idxLocalSpot];
if(localSpot->row == 0 && localSpot->col == 0) continue;
if((abs(localSpot->row - row) >= window) && (abs(localSpot->col - col) >= window)) continue;
int index = atomicInc_block((unsigned int*)&localSpot->numSamples, 0xffffffff);
if(localSpot->row == 37 && localSpot->col == 977) printf("%02d ", index); // <-- sometimes indices are skipped or duplicated
if(index >= MAX_NUM_SPOTS_IN_WINDOW) continue; // index out of bounds, discard value for median calculation
localSpot->signalSamples[index] = blockSpotsBuffers.signal[inOffset];
}
} }
Output looks like this:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 23
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 02 03 03 04 05 06 07 08 09 10 11 12 06 13 14 15 16 17 18 19 20 21
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 23
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Each line is the output of one execution (the kernel is run multiple times). It is expected that indices appear in different orders. But for example on the third-last line, index 23 is repeated.
Using atomicCAS seems to fix it. Calling __syncwarp() between iterations of the outer for loop also seems to fix it. But it is not clear why, or whether that always fixes it.
Edit 2:
This is a full program (main.cu) that shows the problem:
https://pastebin.com/cDqYmjGb
The CMakeLists.txt:
https://pastebin.com/iB9mbUJw
Must be compiled with -DCMAKE_BUILD_TYPE=Release.
It produces this output:
00(0:00000221E40003E0)
01(2:00000221E40003E0)
02(7:00000221E40003E0)
03(1:00000221E40003E0)
03(2:00000221E40003E0)
04(3:00000221E40003E0)
04(1:00000221E40003E0)
05(4:00000221E40003E0)
06(6:00000221E40003E0)
07(2:00000221E40003E0)
08(3:00000221E40003E0)
09(6:00000221E40003E0)
10(3:00000221E40003E0)
11(5:00000221E40003E0)
12(0:00000221E40003E0)
13(1:00000221E40003E0)
14(3:00000221E40003E0)
15(1:00000221E40003E0)
16(0:00000221E40003E0)
17(3:00000221E40003E0)
18(0:00000221E40003E0)
19(2:00000221E40003E0)
20(4:00000221E40003E0)
21(4:00000221E40003E0)
22(1:00000221E40003E0)
For example, the lines with 03 show that two threads (1 and 2) get the same result (3) after calling atomicInc_block on the same counter (at 0x00000221E40003E0).
According to my testing, this problem is fixed in CUDA 11.4.1 with driver 470.52.02. It may also be fixed in some earlier versions of CUDA 11.4 and 11.3, but the problem is present in CUDA 11.2.
Is there a way to perform multiway (>2) stable partition in Thrust?
Either stable partition or stable partition copy would be equally interesting. Currently I can only use the two-way stable partition copy for this purpose. It is clear how to use it to partition a sequence into three parts using two predicates and two calls of thrust::stable_partition_copy. But I am sure it is technically possible to implement a multiway stable partition.
I can imagine the following multiway stable partition copy (pseudocode):
using F = float;
thrust::device_vector< F > triangles{N * 3};
// fill triangles here
thrust::device_vector< F > A{N}, B{N}, C{N};
auto vertices_begin = thrust::make_tuple(A.begin(), B.begin(), C.begin());
using U = unsigned int;
auto selector = [] __host__ __device__ (U i) -> U { return i % 3; };
thrust::multiway_stable_partition_copy(p, triangles.cbegin(), triangles.cend(), selector, vertices_begin);
A.begin(), B.begin(), C.begin() should be incremented individually.
Also, I can imagine a hypothetical dispatch iterator, which would do the same (and would be more useful, I think).
From my knowledge of the thrust internals, there is no readily adaptable algorithm to do what you envisage.
A simple approach would be to extend your theoretical two pass three way partition to M-1 passes using a smart binary predicate, something like
template<typename T>
struct divider
{
    int pass;
    __host__ __device__ divider(int p) : pass(p) { }
    __host__ __device__ int classify(const T &val) { .... }
    __host__ __device__ bool operator()(const T &val) { return !(classify(val) > pass); }
};
which enumerates a given input into M possible subsets and returns true if the input is in subset number pass or a lower one, and then a loop
auto start = input.begin();
for(int i=0; i<(M-1); ++i) {
    divider<T> pred(i);
    result[i] = thrust::stable_partition(
        thrust::device,
        start,
        input.end(),
        pred);
    start = result[i];
}
[ note all code written in a browser on a tablet while floating on a boat in the Baltic. Obviously never compiled or run. ]
This will certainly be the most space efficient, as a maximum of len(input) temporary storage is required, whereas a hypothetical single pass implementation would require M * len(input) storage, which would quickly get impractical for a large M.
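To see why M-1 two-way passes are enough, here is a small host-side model of the loop in plain Python. This is not Thrust code. The helper names (stable_partition, multiway_stable_partition) are hypothetical, and the list-based partition is only a stand-in for thrust::stable_partition. Each pass moves the elements whose class is at most the pass number to the front of the remaining tail, and the next pass starts at the returned boundary.

```python
def stable_partition(seq, start, pred):
    """Stably partition seq[start:] in place; return the index of the first 'false' element."""
    tail = seq[start:]
    kept = [x for x in tail if pred(x)]
    rest = [x for x in tail if not pred(x)]
    seq[start:] = kept + rest
    return start + len(kept)


def multiway_stable_partition(seq, classify, m):
    """Run m - 1 two-way passes; return the m - 1 boundaries between the m groups."""
    boundaries = []
    start = 0
    for p in range(m - 1):
        # Same test as the divider functor: true if the class is not greater than the pass.
        start = stable_partition(seq, start, lambda x, p=p: classify(x) <= p)
        boundaries.append(start)
    return boundaries


data = list(range(12))
bounds = multiway_stable_partition(data, lambda x: x % 3, 3)
print(data, bounds)
```

With 12 inputs and three classes (x % 3), two passes leave the classes grouped in order, with the original relative order preserved inside each group.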
Edit to add that now I'm back on land with a compiler, this seems to work as expected:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/partition.h>
struct divider
{
int pass;
__host__ __device__
divider(int p) : pass(p) { };
__host__ __device__
int classify(const int &val) { return (val % 12); };
__host__ __device__
bool operator()(const int &val) { return !(classify(val) > pass); };
};
int main()
{
const int M = 12;
const int N = 120;
thrust::device_vector<int> input(N);
thrust::counting_iterator<int> iter(0);
thrust::copy(iter, iter+N, input.begin());
thrust::device_vector<int>::iterator result[M];
auto start = input.begin();
for(int i=0; i<(M-1); ++i) {
divider pred(i);
result[i] = thrust::stable_partition(
thrust::device,
start,
input.end(),
pred);
start = result[i];
}
int i = 0;
for(auto j=input.begin(); j!=input.end(); ++j) {
if (j == result[i]) {
i++;
std::cout << std::endl;
}
std::cout << *j << " ";
}
return 0;
}
$ nvcc -std=c++11 -arch=sm_52 -o partition partition.cu
$ ./partition
0 12 24 36 48 60 72 84 96 108
1 13 25 37 49 61 73 85 97 109
2 14 26 38 50 62 74 86 98 110
3 15 27 39 51 63 75 87 99 111
4 16 28 40 52 64 76 88 100 112
5 17 29 41 53 65 77 89 101 113
6 18 30 42 54 66 78 90 102 114
7 19 31 43 55 67 79 91 103 115
8 20 32 44 56 68 80 92 104 116
9 21 33 45 57 69 81 93 105 117
10 22 34 46 58 70 82 94 106 118
11 23 35 47 59 71 83 95 107 119
I have a SQL table 'animals' that stores images as blobs. I found out how to upload images but not how to display them.
I would like to display the image which is called 'Dog' in my table.
Here is my code, where I print the result of my blob img.
let sql = 'SELECT * FROM animals WHERE file_name=\'Dog\''
connection.query(sql, (err,result) => {
console.log(result[0].img)
})
Here is the result of my code:
<Buffer ff d8 ff e0 00 10 4a 46 49 46 00 01 01 00 00 01 00 01 00 00 ff db 00 84 00 09 06 07 13 12 12 15 13 13 13 16 16 15 15 18 18 17 16 18 15 17 15 17 17 16 ... >
Is there any way to display that picture?
Thank you.
You can use the Fetch API to get the resource on your web page.
You can display the image like this :
fetch('http://localhost:1234/your_api')
.then(function(response) {
return response.blob();
})
.then(function(myBlob) {
var objectURL = URL.createObjectURL(myBlob);
document.querySelector("#databaseimage").src = objectURL;
});
In HTML :
<img id="databaseimage"/>
You can read more about Fetch API here :
https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API/Using_Fetch
I'm trying to write to a hex file using PB12.5, I'm able to write to it without any issues but through testing noticed I will need to send a null value (00) to the file at certain points.
I know if I assign null to a string, it will null out the entire string so I tried using a Blob where I can insert a null value when needed (BlobEdit(blb_data, ll_pos, CharA(0)) )
But BlobEdit() automatically inserts a null value in between each position, I don't want this as it's causing issues as I'm trying to update the hex file. I just need to add my CharA(lb_byte) to each consecutive position in the Blob.
Is there any way around this or is PB just unable to do this? Below is the code:
ll_test = 1
ll_pos = 1
ll_length = Len(ls_output)
Do While ll_pos <= (ll_length)
ls_data = Mid(ls_output, ll_pos, 2)
lb_byte = Event ue_get_decimal_value_of_hex(ls_data)
ll_test = BlobEdit(blb_data, ll_test, CharA(lb_byte), EncodingANSI!)
ll_pos = ll_pos + 2
Loop
Hex file appears as follows:
16 35 2D D8 08 45 29 18 35 27 76 25 30 55 66 85 44 66 57 A4 67 99
After Blob update:
16 00 48 00 5D 00 C3 92 00 08 00 48 00 51 00 E2
I hope this helps:
//////////////////////////////////////////////////////////////////////////
// Function: f_longtohex
// Description: LONG to HEXADECIMAL
// Scope: public
// Arguments: as_number //Long value to convert to hexadecimal
// as_digitos //Number of digits to return
// Return: String
// Example:
// f_longtohex(198 , 2) --> 'C6'
// f_longtohex(198 , 4) --> '00C6'
//////////////////////////////////////////////////////////////////////////
long ll_temp0, ll_temp1
char lc_ret
if isnull(as_digitos) then as_digitos = 2
IF as_digitos > 0 THEN
ll_temp0 = abs(as_number / (16 ^ (as_digitos - 1)))
ll_temp1 = ll_temp0 * (16 ^ (as_digitos - 1))
IF ll_temp0 > 9 THEN
lc_ret = char(ll_temp0 + 55)
ELSE
lc_ret = char(ll_temp0 + 48)
END IF
RETURN lc_ret + f_longtohex(as_number - ll_temp1 , as_digitos - 1)
END IF
RETURN ''
I'm trying to connect to the Safecom TA-810 (badge/registration system) to automate the process of calculating how long employees have worked each day. Currently this is done by:
Pulling the data into the official application
Printing a list of all 'registrations'
Manually entering the values from the printed lists into our HR application
This is a job that can take multiple hours which we'd like to see automated. So far the official tech support has been disappointing and refused to share any details.
Using Wireshark I have been capturing the UDP transmissions and have pretty much succeeded in understanding how the protocol is built up. I'm only having issues with what I suppose is a CRC field. I don't know how it is calculated (CRC type and parameters) or over which fields.
This is what a message header looks like:
D0 07 71 BC BE 3B 00 00
D0 07 - Message type
71 BC - This, I believe, is the CRC
BE 3B - Some kind of session identifier. Stays the same for every message after the initial message (initial message has '00 00' as value)
00 00 - Message number. '01 00', '02 00', '03 00'
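The header layout above can be sketched as a small parser. This is only an illustration in Python: the function name and the dictionary keys are made up, and reading the message type, session identifier and message number as little-endian 16-bit values is an assumption (it is suggested by message numbers such as '01 00' and '06 00', but nothing in the captures confirms it). The checksum field is left as raw bytes.

```python
import struct


def parse_header(hex_text):
    """Split the 8-byte header into its four 2-byte fields (field meanings as listed above)."""
    raw = bytes.fromhex(hex_text)  # spaces between byte pairs are ignored
    # '<' = little-endian (assumed); H = unsigned 16-bit; 2s = two raw bytes
    msg_type, checksum, session, number = struct.unpack_from("<H2sHH", raw, 0)
    return {
        "type": msg_type,
        "checksum": checksum,
        "session": session,
        "number": number,
        "payload": raw[8:],
    }


print(parse_header("4C 04 EF BF BE 3B 06 00"))
```

With this reading, the captured header that the question labels as message #6 parses to message number 6, and the larger example yields the ASCII payload that follows the header.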
Some examples:
Header only examples
E8 03 17 FC 00 00 00 00 -> initial request (#0, no session nr)
D0 07 71 BC BE 3B 00 00 -> Initial response (#0, device sends a session nr)
4C 04 EF BF BE 3B 06 00 -> Message #6, still using the same session # as the initial response
Larger example, which has data
0B 00 07 E1 BE 3B 01 00 7E 45 78 74 65 6E 64 46 6D 74
I've also been trying to figure this out by reading the disassembled code from the original application. The screenshot below happens before the socket.sendto and seems to be related.
Any help will be extremely appreciated.
EDIT: Made some success with debugging the application using ollydbg. The CRC appears in register (reversed) EDX at the selected line in the following screenshot.
Take a look at CRC RevEng. If you can correctly identify the data that the CRC is operating on and the location of the CRC, you should be able to determine the CRC parameters. If it is a CRC.
I've managed to create a php script that does the CRC calculation by debugging the application using OllyDbg.
The CRC is calculated by adding up every 2 bytes (every short), with the CRC field itself set to zero. If the result is larger than a short, the 'most significant short' is added to the 'least significant short' until the result fits in a short. Finally, the CRC (short) is bitwise inverted.
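The same calculation can be sketched compactly in Python. The function name is hypothetical; the logic follows the description above (sum the 16-bit words with the checksum field zeroed, fold the carries back in, invert), with an odd trailing byte padded with a zero byte as the PHP script does. This kind of folded, inverted 16-bit sum resembles the Internet checksum used in IP headers rather than a polynomial CRC.

```python
def telegram_checksum(message):
    """Folded, inverted 16-bit sum over the message, with bytes 2-3 (the checksum field) zeroed."""
    data = bytearray(message)
    data[2:4] = b"\x00\x00"      # the checksum field does not contribute to the sum
    if len(data) % 2:
        data.append(0)           # pad an odd trailing byte with a zero low byte
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total > 0xFFFF:        # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF       # invert and keep 16 bits


msg = bytes.fromhex("D0 07 71 BC BE 3B 00 00")
print(f"{telegram_checksum(msg):04X}")
```

Comparing the result with bytes 2-3 of a captured message, read in the order they appear on the wire, checks the message in the same way as CompareHash in the PHP script.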
I'll add my php script for completeness:
<?php
function CompareHash($telegram)
{
$telegram = str_replace(" ", "", $telegram);
$telegram_crc = substr($telegram, 4, 4);
$telegram = substr_replace($telegram, "0000", 4, 4); // zero only the CRC field, not other occurrences of the same digits
echo "Telegram: ", $telegram, ', Crc: ', $telegram_crc, ' (', hexdec($telegram_crc), ')<br />';
$crc = 0;
$i = 0;
while ($i < strlen($telegram))
{
$short = substr($telegram, $i, 4);
if (strlen($short) < 4) $short = $short . '00';
$crc += hexdec($short);
$i += 4;
}
echo "Crc: ", $crc, ', inverse: ', ~$crc;
// Region "truncate CRC to Int16"
while($crc > hexdec('FFFF'))
{
$short = $crc & hexdec ('FFFF');
// Region "unsigned shift right by 16 bits"
$crc = $crc >> 16;
$crc = $crc & hexdec ('FFFF');
// End region
$crc = $short + $crc;
}
// End region
// Region "invert Int16"
$crc = ~$crc;
$crc = $crc & hexdec ('FFFF');
// End region
echo ', shifted ', $crc;
if (hexdec($telegram_crc) == $crc)
{
echo "<br />MATCH!!! <br />";
}
else
{
echo "<br />failed .... <br />";
}
}
$s1_full = "E8 03 17 FC 00 00 00 00";
$s2_full = "D0 07 71 BC BE 3B 00 00";
$s3_full = "D0 07 4E D4 E1 23 00 00";
$s4_full = "D0 07 35 32 BE 3B 07 00 7E 44 65 76 69 63 65 4E 61 6D 65 3D 54 41 38 31 30 00";
$s5_full = "0B 00 39 6C BE 3B 05 00 7E 52 46 43 61 72 64 4F 6E";
CompareHash($s1_full);
CompareHash($s2_full);
CompareHash($s3_full);
CompareHash($s4_full);
CompareHash($s5_full);
?>
Thanks for the feedback!
I have a binary data stream which carries geolocation coordinates (latitude and longitude). I need to find out how they are encoded.
4adac812 = 74°26.2851' = 74.438085
2b6059f9 = 43°0.2763' = 43.004605
4adaee12 = 74°26.3003' = 74.438338
2a3c8df9 = 42°56.3177' = 42.938628
4ae86d11 = 74°40.1463' = 74.669105
2afd0efb = 42°59.6263' = 42.993772
1st value is hex value. 2nd & 3rd are values that I get in output (not sure which one is used in conversion).
I've found that first byte represents integer part of value (0x4a = 74). But I cannot find how decimal part is encoded.
I would really appreciate any help!
Thanks.
--
Upd: This stream comes from some "chinese" GPS server software over TCP. I have no sources or documentation for the client software. I suppose it was written in VC++6 and uses some standard implementations.
--
Upd: Here are the packets I get:
Hex data:
41 00 00 00 13 bd b2 2c
4a e8 6d 11 2a 3c 8d f9
f6 0c ee 13
Log data in client soft:
[Lng] 74°40.1463', direction:1
[Lat] 42°56.3177', direction:1
[Head] direction:1006, speed:3318, AVA:1
[Time] 2011-02-25 19:52:19
Result data in client (UI):
74.669105
42.938628
Head 100 // floor(1006/10)
Speed 61.1 // floor(3318/54.3)
41 00 00 00 b1 bc b2 2c
4a da ee 12 2b 60 59 f9
00 00 bc 11
[Lng] 74°26.3003', direction:1
[Lat] 43°0.2763', direction:1
[Head] direction:444, speed:0, AVA:1
[Time] 2011-02-25 19:50:49
74.438338
43.004605
00 00 00 00 21 bd b2 2c
4a da c8 12 aa fd 0e fb
0d 0b e1 1d
[Lng] 74°26.2851', direction:1
[Lat] 42°59.6263', direction:1
[Head] direction:3553, speed:2829, AVA:1
[Time] 2011-02-25 19:52:33
74.438085
42.993772
I don't know what the first 4 bytes mean.
I found that the lower 7 bits of the 5th byte represent the number of seconds (maybe bytes 5-8 are the time?).
Byte 9 is the integer part of Lng.
Byte 13 is the integer part of Lat.
Bytes 17-18, byte-reversed (as a word), are the speed.
Bytes 19-20, byte-reversed, are AVA(?) and direction (4 + 12 bits). (By the way, does anybody know what AVA is?)
One more note: in the 3rd packet you can see that only the lower 7 bits of the 13th byte are used. I guess the top bit doesn't mean anything (I removed it in the listing at the beginning, sorry if I'm wrong).
I have reordered your data so that we first have 3 longitudes and then 3 latitudes:
74.438085, 74.438338, 74.669105, 43.004605, 42.938628, 42.993772
The best fit to the hexadecimal values I can come up with is:
74.437368, 74.439881, 74.668392, 42.993224, 42.961388, 42.982391
The differences are: -0.000717, 0.001543, -0.000713, -0.011381, 0.022760, -0.011381
The program that generates these values from the complete Hex'es (4 not 3 bytes) is:
#include <stdio.h>

int main(int argc, char** argv) {
    /* the first three values are longitudes, the last three are latitudes */
    int a[] = { 0x4adac812, 0x4adaee12, 0x4ae86d11, 0x2b6059f9, 0x2a3c8df9, 0x2afd0efb };
    int i = 0;
    /* linear fit for the longitudes */
    while(i<3) {
        double b = (double)a[i] / (2<<(3*8)) * 8.668993 -250.0197;
        printf("%f\n",b);
        i++;
    }
    /* linear fit for the latitudes */
    while(i<6) {
        double b = (double)a[i] / (2<<(3*8)) * 0.05586007 +41.78172;
        printf("%f\n",b);
        i++;
    }
    printf("press key");
    getchar();
    return 0;
}
Brainstorming here.
If we look at the lower 6 bits of the second byte (data[1]&0x3f) we get the "minutes" value for the longitude examples, but not for the latitudes.
0xda & 0x3f = 0x1a = 26; // ok
0x60 & 0x3f = 0x20 = 32; // should be 0
0xe8 & 0x3f = 0x28 = 40; // ok
0x3c & 0x3f = 0x3c = 60; // should be 56
0xfd & 0x3f = 0x3d = 61; // should be 59
Perhaps this is the right direction?
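Pushing the bit-masking idea a little further: the sketch below reads each 4-byte field as a little-endian integer and splits it into degrees, 6 bits of minutes and 14 bits holding the four decimal digits of the minutes. The layout is an assumption fitted to the six samples in the question, not something taken from documentation: the degrees take 8 bits in the longitude field and 7 bits in the latitude field, and the remaining high bits are left uninterpreted. The function names are made up. The third latitude has to be taken from the raw packet bytes (aa fd 0e fb), not from the 2afd0efb form with the top bit of the first byte removed, because under this layout that bit belongs to the minutes.

```python
def decode_field(raw_hex, degree_bits):
    """Return (degrees, minutes, ten_thousandths_of_a_minute) under the assumed bit layout."""
    value = int.from_bytes(bytes.fromhex(raw_hex), "little")
    degrees = value & ((1 << degree_bits) - 1)         # lowest bits: whole degrees
    minutes = (value >> degree_bits) & 0x3F            # next 6 bits: whole minutes
    fraction = (value >> (degree_bits + 6)) & 0x3FFF   # next 14 bits: minutes * 10^-4
    return degrees, minutes, fraction


def to_decimal_degrees(degrees, minutes, fraction):
    """Convert degrees and decimal minutes to decimal degrees."""
    return degrees + (minutes + fraction / 10000) / 60


LNG_DEGREE_BITS = 8  # assumption: longitude keeps 8 bits for the degrees
LAT_DEGREE_BITS = 7  # assumption: latitude keeps only 7 bits for the degrees

for raw in ("4a da c8 12", "4a da ee 12", "4a e8 6d 11"):
    print(raw, decode_field(raw, LNG_DEGREE_BITS))
for raw in ("2b 60 59 f9", "2a 3c 8d f9", "aa fd 0e fb"):
    print(raw, decode_field(raw, LAT_DEGREE_BITS))
```

Under these assumptions all six fields decode to the degree and minute values shown in the client log, for example 74°26.2851' for 4a da c8 12 and 42°59.6263' for aa fd 0e fb. Six samples are a small basis, so the layout should be checked against more packets before relying on it.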
I have tried your new data packets:
74+40.1463/60
74+26.3003/60
74+26.2851/60
42+56.3177/60
43+0.2763/60
42+59.6263/60
74.66910, 74.43834, 74.43809, 42.93863, 43.00460, 42.99377
My program gives:
74.668392, 74.439881, 74.437368, 42.961388, 42.993224, 39.407346
The differences are:
-0.000708, 0.001541, -0.000722, 0.022758, -0.011376, -3.586424
I re-used the 4 constants I derived from your first packet, as those are probably stored in your client somewhere. The slight differences might be the result of some randomization the client does to prevent you from getting the exact value or reverse-engineering their protocol.