How to read 3x3xN coordinates string into matlab array efficiently - json

I have a MATLAB script that takes a JSON that was created by myself in a remote server and contains a long list of 3x3xN coordinates e.g. for N=1:
str = '[1,2,3.14],[4,5.66,7.8],[0,0,0],';
I want to avoid splitting the string. Is there a way to use strread or something similar to read this 3×3×N tensor?
It's a multi-particle system and N can be large, though I have enough memory to hold it all at once.
Any suggestion of how to format the array string in the JSON is very welcome as well.

If you can guarantee the format is always the same, I think it's easiest, safest and fastest to use sscanf:
fmt = '[%f,%f,%f],[%f,%f,%f],[%f,%f,%f],';
data = reshape(sscanf(str, fmt), 3, 3).';
Depending on the rest of your data (how is that "N" represented?), you might need to adjust that reshape/transpose.
EDIT
Based on your comment, I think this will solve your problem quite efficiently:
% Strip unneeded concatenation characters
str(str == ',') = ' ';
str(str == ']' | str == '[') = [];
% Reshape into workable dimensions
data = permute( reshape(sscanf(str, '%f '), 3,3,[]), [2 1 3]);
As noted by rahnema1, you can avoid the permute and/or character removal by adjusting your JSON generators to spit out the data column-major and without brackets, but you'll have to ask yourself these questions:
whether that is really worth the effort, considering that this code right here is already quite tiny and pretty efficient
whether other applications are going to use the JSON interface, because in essence you're de-generalizing the JSON output just to fit your processing script on the other end. I think that's a pretty bad design practice, but oh well.
Just something to keep in mind:
emitting 500k 3×3 matrices as 8-byte doubles in binary is about 34 MB (500,000 × 9 × 8 bytes)
doing the same in ASCII is about 110 MB
Now depending a bit on your connection speed, I'd be getting really annoyed really quickly because every little test run takes about 3 times as long as it should be taking :)
So if an API call straight to the raw data is not possible, I would at least base64 that data in the JSON.
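A minimal sketch of the base64 idea, assuming the server side is written in Python (the question does not say what generates the JSON). The helpers `encode_coords` and `decode_coords` and the field names `shape` and `data` are hypothetical. Values are packed as little-endian doubles in column-major order, so the receiving end should need only a reshape and no permute.

```python
import base64
import json
import struct


def encode_coords(matrices):
    """Pack a list of 3x3 matrices (lists of rows) into a JSON string.

    Values are written column-major within each matrix as little-endian
    float64, then base64-encoded. Field names are illustrative only.
    """
    flat = [m[r][c] for m in matrices for c in range(3) for r in range(3)]
    raw = struct.pack("<%dd" % len(flat), *flat)
    return json.dumps({
        "shape": [3, 3, len(matrices)],
        "data": base64.b64encode(raw).decode("ascii"),
    })


def decode_coords(payload):
    """Inverse of encode_coords; returns the flat column-major value list."""
    obj = json.loads(payload)
    raw = base64.b64decode(obj["data"])
    return list(struct.unpack("<%dd" % (len(raw) // 8), raw))


payload = encode_coords([[[1, 2, 3.14], [4, 5.66, 7.8], [0, 0, 0]]])
values = decode_coords(payload)
print(values)
```

Base64 inflates the binary payload by roughly a third, which is still well below the ASCII figure quoted above.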

You can use eval function:
str = '[1,2,3.14],[4,5.66,7.8],[0,0,0],';
result=permute(reshape(eval(['[' ,str, ']']),3,3,[]),[2 1 3])
result =
1.00000 2.00000 3.14000
4.00000 5.66000 7.80000
0.00000 0.00000 0.00000
Using eval, all elements are concatenated into a row vector, which is then reshaped into a 3D array. Since MATLAB stores matrix elements column-wise, the array must be permuted so that each 3×3 matrix is transposed.
note1: There is no need to place [] in the JSON string, so you can use str2num instead of eval:
result=permute(reshape(str2num(str),3,3,[]),[2 1 3])
note2:
if you save data columnwise there is no need to permute:
str='1 4 0 2 5.66 0 3.14 7.8 0';
result=reshape(str2num(str),3,3,[])
Update: Ander Biguri and excaza noted the security and speed issues with eval and str2num, and Rody Oldenhuis suggested using sscanf, so I tested 3 methods in Octave:
a=num2str(rand(1,60000));
disp('-----SSCANF---------')
tic
sscanf(a,'%f ');
toc
disp('-----STR2NUM---------')
tic
str2num(a);
toc
disp('-----STRREAD---------')
tic
strread(a,'%f ');
toc
and here is the result:
-----SSCANF---------
Elapsed time is 0.0344398 seconds.
-----STR2NUM---------
Elapsed time is 0.142491 seconds.
-----STRREAD---------
Elapsed time is 0.515257 seconds.
So it is more secure and faster to use sscanf, in your case:
str='1 4 0 2 5.66 0 3.14 7.8 0';
result=reshape(sscanf(str,'%f '),3,3,[])
or
str='1, 4, 0, 2, 5.66, 0, 3.14, 7.8, 0';
result=reshape(sscanf(str,'%f,'),3,3,[])


Stable Baselines - PPO Iterate through the data frame for learning

The PPO model doesn't iterate through the whole dataframe. It seems to be repeating the first step many times (10,000 in this example).
In this case, the DF's shape is (5476, 28) and each step's obs shape is (60, 28). I don't see it iterating through the whole DF.
# df shape - (5476, 28)
env = MyRLEnv(df)
model = PPO("MlpPolicy", env, verbose=4)
model.learn(total_timesteps=10000)
MyRLEnv:
self.action_space = spaces.Discrete(4)
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(60, 28) , dtype=np.float64)
Thanks!
I also got stuck on something similar a few days ago. After digging deeper, I found that the learn method runs the environment n times, where n = total_timesteps / size_of_df. In your case that is 10000/5476, or about 1.8. So the algorithm resets the environment at the beginning, runs the step method over the entire dataframe, resets the environment again, and then runs the step method over only about 80% of the dataframe. That is why, when PPO stops, only about 80% of the dataframe has been covered on the last pass.
Actor-critic algorithms run the environment many times to improve their results, which is why it is usually suggested to keep total_timesteps fairly high, so that the algorithm can run over the same data several times and learn better.
Example:
Say my total_timesteps = 10000 and len(df) = 5000,
then in that case it would run for, n = total_timesteps/len(df) = 2 Full scans of the entire dataframe .
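The arithmetic in this answer can be checked with a few lines of Python. This is only a sketch of the answer's reasoning: it assumes one episode is one pass over the dataframe and that the environment ends the episode at the last row, which depends on how MyRLEnv is written. The helper name `passes_over_data` is made up for illustration.

```python
def passes_over_data(total_timesteps, rows_per_episode):
    """Return (full passes, fraction of one more pass) under the assumption above."""
    full, leftover = divmod(total_timesteps, rows_per_episode)
    return full, leftover / rows_per_episode


full, partial = passes_over_data(10000, 5476)
print(full, round(partial, 2))  # 1 full pass plus about 0.83 of a second one
```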

How to manipulate binary numbers efficiently in Crystal?

I'm trying to implement the Bitcoin specification BIP-39, specifically the part Generating the mnemonic. The following causes some headaches:
Next, these concatenated bits are split into groups of 11 bits, each encoding a number from 0-2047, serving as an index into a wordlist. Finally, we convert these numbers into words and use the joined words as a mnemonic sentence.
Splitting a binary number into groups of 11 bits. But how would I do this efficiently in Crystal?
Here is what I do, I personally find it a bit embarrassing but admittedly it works:
seed = "87C1B129FBADD7B6E9ABC0A9EF7695436D767AECE042BEC198A97E949FCBE14C0d"
# => "87C1B129FBADD7B6E9ABC0A9EF7695436D767AECE042BEC198A97E949FCBE14C0d"
bin = BigInt.new(seed, 16).to_s(2)
# => "100001111100000110110001001010011111101110101101110101111011011011101001101010111100000010101001111011110111011010010101010000110110110101110110011110101110110011100000010000101011111011000001100110001010100101111110100101001001111111001011111000010100110000001101"
iter = 0
size = 11
while iter < bin.size
  p bin[iter, size]
  # => "10000111110"
  # [...]
  iter += size # advance to the next 11-bit group; without this the loop never ends
end
Now, as I said, it works, I can take the binary strings and convert them back to numbers and continue, but this cannot be it. I'm wondering, what is a more elegant, more efficient, or more correct way to approach this?
Sorry for the succinct answer, but I think what you're looking for is BitArray. Hope it serves you well!
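For comparison, the 11-bit grouping can also be done with integer shifts and masks instead of a binary string. The sketch below is in Python rather than Crystal, and it is not the BitArray approach suggested above; `split_bits` is a hypothetical helper name. It derives the bit length from the hex string itself, so leading zero bits are kept, which a plain to_s(2) conversion would drop.

```python
def split_bits(hex_string, group_size=11):
    """Split the bits of a hex string into fixed-size groups, most significant first."""
    total_bits = len(hex_string) * 4  # 4 bits per hex digit, leading zeros included
    if total_bits % group_size:
        raise ValueError("bit length is not a multiple of the group size")
    value = int(hex_string, 16)
    mask = (1 << group_size) - 1
    count = total_bits // group_size
    return [(value >> (total_bits - group_size * (i + 1))) & mask
            for i in range(count)]


seed = "87C1B129FBADD7B6E9ABC0A9EF7695436D767AECE042BEC198A97E949FCBE14C0d"
indices = split_bits(seed)
print(len(indices), format(indices[0], "011b"))  # 24 10000111110
```

Each entry of `indices` is already a number from 0 to 2047, so it can be used directly as a wordlist index.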

Octave -inf and NaN

I searched the forum and found this thread, but it does not cover my question
Two ways around -inf
From a Machine Learning class, week 3, I am getting -inf when using log(0), which later turns into an NaN. The NaN results in no answer being given in a sum formula, so no scalar for J (a cost function which is the result of matrix math).
Here is a test of my function
>> sigmoid([-100;0;100])
ans =
3.7201e-44
5.0000e-01
1.0000e+00
This is as expected, but the hypothesis requires ans = 1-sigmoid
>> 1-ans
ans =
1.00000
0.50000
0.00000
and log(0) gives -Inf
>> log(ans)
ans =
0.00000
-0.69315
-Inf
-Inf rows do not add to the cost function, but the -Inf carries through to NaN, and I do not get a result. I cannot find any material on -Inf, but am thinking there is a problem with my sigmoid function.
Can you provide any direction?
The typical way to avoid infinity in these cases is to add eps to the operand:
log(ans + eps)
eps is a very, very small value, and won't affect the output for values of ans unless ans is zero:
>> z = [-100;0;100];
>> g = 1 ./ (1+exp(-z));
>> log(1-g + eps)
ans =
0.0000
-0.6931
-36.0437
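The same eps trick, sketched in Python for comparison. It assumes `sys.float_info.epsilon` plays the role of Octave's eps (both are 2^-52 for doubles); the `sigmoid` helper mirrors the expression used above.

```python
import math
import sys


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


eps = sys.float_info.epsilon  # machine epsilon for doubles, 2**-52
values = [math.log(1.0 - sigmoid(z) + eps) for z in (-100.0, 0.0, 100.0)]
print([round(v, 4) for v in values])
```

The last entry is log(eps), a large negative but finite number, so later sums stay finite.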
Adding to the answers here, I really do hope you would provide some more context to your question (in particular, what you are actually trying to do).
I will go out on a limb and guess the context, just in case this is useful. You are probably doing machine learning, and trying to define a cost function based on the negative log likelihood of a model, and then trying to differentiate it to find the point where this cost is at its minimum.
In general for a reasonable model with a useful likelihood that adheres to Cromwell's rule, you shouldn't have these problems, but, in practice it happens. And presumably in the process of trying to calculate a negative log likelihood of a zero probability you get inf, and trying to calculate a differential between two points produces inf / inf = nan.
In this case, this is an 'edge case', and generally in computer science edge cases need to be spotted as exceptional circumstances and dealt with appropriately. The reality is that you can reasonably expect that inf isn't going to be your function's minimum! Therefore, whether you remove it from the calculations, or replace it by a very large number (whether arbitrarily or via machine precision) doesn't really make a difference.
So in practice you can do either of the two things suggested by others here, or even just detect such instances and skip them from the calculation. The practical result should be the same.
-inf means negative infinity. Which is the correct answer because log of (0) is minus infinity by definition.
The easiest thing to do is to check your intermediate results and if the number is below some threshold (like 1e-12) then just set it to that threshold. The answers won't be perfect but they will still be pretty close.
Using the following as the sigmoid function:
function g = sigmoid(z)
g = 1 ./ (1 + e.^-z);
end
Then the following code runs with no issues. Choose the threshold value in the 'max' statement to be less than the expected noise in your measurements and then you're good to go
>> a = sigmoid([-100, 0, 100])
a =
3.7201e-44 5.0000e-01 1.0000e+00
>> b = 1-a
b =
1.00000 0.50000 0.00000
>> c = max(b, 1e-12)
c =
1.0000e+00 5.0000e-01 1.0000e-12
>> d = log(c)
d =
0.00000 -0.69315 -27.63102
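A sketch of the same clamping idea applied to a single cost term, written in Python. It assumes the cost is the usual logistic-regression form -y*log(h) - (1-y)*log(1-h), which the question does not spell out; `safe_log` and `cost_term` are illustrative names, and the 1e-12 floor is the threshold from the answer above.

```python
import math

FLOOR = 1e-12  # threshold taken from the answer; pick one suited to your data


def safe_log(p, floor=FLOOR):
    """Natural log of p, with p clamped from below so log(0) never occurs."""
    return math.log(max(p, floor))


def cost_term(y, h):
    """One term of the assumed logistic-regression cost."""
    return -y * safe_log(h) - (1 - y) * safe_log(1 - h)


print(cost_term(1, 1.0))  # label 1, prediction saturated at 1: contributes nothing
print(cost_term(0, 1.0))  # label 0, prediction saturated at 1: large but finite
```

Without the clamp, the first case would multiply 0 by -Inf, which is where the NaN in the question comes from.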

For-Loop for finding combinations of springs?

I need to use a for-loop in a function in order to find spring constants of all possible combinations of springs in series and parallel. I have 5 springs with data therefore I found the spring constant (K) of each in a new matrix by using polyfit to find the slope (using F=Kx).
I have created a function that does so, however it returns data not in a matrix, but as individual outputs. So instead of KP (Parallel)= [1 2 3 4 5] it says KP=1, KP=2, KP=3, etc. Because of this, only the final output is stored in my workspace. Here is the code I have for the function. Keep in mind that the reason I need to use the +2 in the for loop for b is because my original matrix K with all spring constants is ten columns, with every even-numbered column being 0. Ex: K=[1 0 2 0 3 0 4 0 5 0] --- This is because my original dataset to find K (slope) was ten columns wide.
function[KP,KS]=function_name1(K)
L=length(K);
c=1;
for a=1:2:L
for b=a+2:2:L
KP=K(a)+K(b)
KS=1/((1/K(a))+(1/K(b)))
end
end
c=c+1;
and then a program calling that function
[KP,KS]=function_name1(K);
What I tried: - Suppressing and unsuppressing lines of code (unsuccessful)
Any help would be greatly appreciated.
hmmm...
your code seems workable, but you aren't dealing with things in the most practical manner
I'd start by redimensioning K so that it makes sense, that is, so that it's 5 entries wide instead of your current 10 - you'll see why in a minute.
Then I'd adjust KP and KS to the size that you want (I'm going to do a 5X5 as that will give all the permutations - right now it looks like you are doing some triangular thing, I wouldn't worry too much about space unless you were to do this for say 50,000 spring constants or so)
So my code would look like this
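function [KP,KS] = function_name1(K)
% K: vector of spring constants, one entry per spring
L = length(K);
KP = zeros(L);   % KP(a,b): springs a and b in parallel
KS = zeros(L);   % KS(a,b): springs a and b in series
for a = 1:L
    for b = 1:L
        KP(a,b) = K(a) + K(b);
        KS(a,b) = 1/((1/K(a)) + (1/K(b)));
    end
end
end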
then when you want the parallel combination of springs 1 and 4 KP(1,4) or KP(4,1) will do the trick
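If only the distinct pairs are wanted (the "triangular thing" from the question), the enumeration can be sketched as follows. This is Python rather than MATLAB, `spring_combinations` is a made-up name, and the spring constants 1 to 5 are placeholder values.

```python
from itertools import combinations


def spring_combinations(k):
    """Map each pair of spring indices (i, j), i < j, to (parallel, series) constants."""
    result = {}
    for (i, ki), (j, kj) in combinations(enumerate(k), 2):
        parallel = ki + kj
        series = 1.0 / (1.0 / ki + 1.0 / kj)
        result[(i, j)] = (parallel, series)
    return result


pairs = spring_combinations([1.0, 2.0, 3.0, 4.0, 5.0])
print(len(pairs))     # 10 unordered pairs from 5 springs
print(pairs[(0, 3)])  # springs 1 and 4 (zero-based indices 0 and 3)
```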

What is the probability of collision with a 6 digit random alphanumeric code?

I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.
perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'
Because of the Birthday Paradox it's more likely than you might think.
There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).
I ran a few tests and this is the number of codes I could generate before the first collision occurred:
73366
59307
79297
36909
Collisions will become more frequent as the number of codes increases.
Here was my test code (written in Python):
>>> import random
>>> codes = set()
>>> while 1:
...     code = ''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm') for x in range(6))
...     if code in codes: break
...     codes.add(code)
...
>>> len(codes)
36909
Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately
1 - ((d-1)/d)**(n*(n-1)/2)
For any n over 50,000 or so, that's pretty high.
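To put numbers on that formula, here is a small Python sketch. It evaluates the same approximation, using log1p and expm1 so the tiny 1/d term is not lost to rounding; the function name `collision_probability` is just for illustration.

```python
import math


def collision_probability(n, d):
    """Approximate P(at least one collision) = 1 - ((d-1)/d)**(n*(n-1)/2)."""
    pairs = n * (n - 1) / 2
    # log1p/expm1 keep precision when 1/d is tiny.
    return -math.expm1(pairs * math.log1p(-1.0 / d))


d = 36 ** 6  # 2,176,782,336 possible codes
for n in (50_000, 1_000_000, 3_000_000):
    print(n, collision_probability(n, d))
```

With these inputs the probability is already a bit over 40% at 50,000 codes and indistinguishable from 1 at a million.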
Looks like a 10-character code has a collision probability of only about 1/800 at 3,000,000 rows. So go with 10 or more.
Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:
http://wolfr.am/niaHIF
Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.
As mentioned previously, the birthday paradox makes this event quite likely. In particular, an accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trials. Then, we can show that p(n; d) is approximately equal to:
1 - ((d-1)/d)^(n*(n-1)/2)
We can easily plot this in R:
> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')
which gives a curve of collision probability against n (plot not shown).
The collision probability increases very quickly with the number of trials/rows.
While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique.
See Fisher-Yates shuffle on Wikipedia.
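A minimal sketch of the shuffle idea in Python (the question itself uses Perl). The helper name `unique_ids` and the count of 1000 are placeholders; CPython's `random.shuffle` is a Fisher-Yates shuffle, so no hand-written loop is needed.

```python
import random


def unique_ids(count, seed=None):
    """Return the integers 1..count in random order; uniqueness is guaranteed."""
    ids = list(range(1, count + 1))
    random.Random(seed).shuffle(ids)  # in-place Fisher-Yates shuffle
    return ids


ids = unique_ids(1000)
print(ids[:5])
```

Each record would then take the next unused number from the list, so no collision check is needed at insert time.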
A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:
The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.
The module provides a drop in replacement for rand which is handy.
You can generate the sequence of keys with the following code:
#!/usr/bin/env perl
use warnings; use strict;
use Math::Random::MT::Auto qw( rand );
my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;
for my $i (1 .. $SEQUENCE_LENGTH) {
my $pick = pick_one();
$picks += 1;
redo if exists $dict{ $pick };
$dict{ $pick } = undef;
}
printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;
sub pick_one {
join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}
Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.