The usage of function cutPoints - partitioning

The usage of the function cutPoints in DolphinDB
I used the function cutPoints in the statements below:
symbols = array(SYMBOL, 0, 100)
symbols = symbols.distinct().sort!().append!("999999");
symRanges = symbols.cutPoints(100)
But the error occurred:
binNum is larger than the number of data points.

Generally, cutPoints is used to divide the elements of a vector into a given number of buckets, and it returns a vector of cut points (the bucket boundaries). This function can also be used to generate the partition scheme of a range domain in a distributed database.
The symbols vector in your script contains only one element, so the error is raised because a single element cannot be divided into 100 ranges.
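For intuition, here is a rough Python sketch of the idea behind this kind of range bucketing (an illustration only, not DolphinDB's actual implementation; the cut_points helper and the sample symbols are made up):
# Illustrative sketch: split a sorted, de-duplicated vector into bin_num
# roughly equal buckets and return the bucket boundaries.
def cut_points(sorted_vals, bin_num):
    n = len(sorted_vals)
    if bin_num > n:
        raise ValueError("binNum is larger than the number of data points.")
    # boundaries at evenly spaced positions, plus the maximum value at the end
    idx = [i * n // bin_num for i in range(bin_num)]
    return [sorted_vals[i] for i in idx] + [sorted_vals[-1]]

symbols = sorted({"000001", "000063", "600000", "600036", "999999"})
print(cut_points(symbols, 2))   # 3 boundaries => 2 ranges
# cut_points(symbols, 100) would raise, just like the DolphinDB error above
Appending more distinct symbols before calling cutPoints, or requesting fewer buckets, avoids the error.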

Related

"iteration limit reached" in lme4 GLMM - what does it mean?

I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc) I consistently get "iteration limit reached", without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :..."
Here's what I know:
judging by the colour of the console output it is a warning, but it is not labelled as such
you can also get this warning with GLMs and lmer models
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into this or similar problems: Using glmer.nb(), the error message: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: did the model converge? What caused this? What can one do to fix it? Can we read more about this somewhere (a link to the GLMM FAQ, brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1|f), family = "poisson",
                 newdata = dd,
                 newparams = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1|f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
  iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
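For a rough sense of why a huge theta signals (near-)equidispersion, recall that under the usual NB2 parameterization Var(Y) = mu + mu^2/theta, which can never fall below the mean. A tiny Python sketch of that formula (illustrative only, not lme4 internals):
# NB2 variance: always >= mean, collapsing to the Poisson case as theta grows
mu = 5.0
for theta in [1, 10, 100, 10000, 62376]:
    var = mu + mu**2 / theta
    print(f"theta = {theta:>6}: variance = {var:.4f} (mean = {mu})")
# With equi- or underdispersed data the fit keeps improving slightly as theta
# grows, so the optimizer pushes theta toward infinity and eventually hits
# its iteration limit.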
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)

Octave force deepcopy

The question
What are the ways of coercing octave to create a real copy of whatever object? Structures are the main interest.
My underlying problem
In my problem I'm obtaining a rather large structure from another function in a loop but for the current task only a few pieces of it are needed. For example:
for i=1:many
  res=solver(params);
  store1{i}=res.string1;
  store2{i}=res.arr(:,1);
end
res is a sizable chunk of data, and due to lazy copying those store-s are references to tiny portions of bytes within that chunk. After I store those tiny portions I no longer need res itself; however, since the middle of that chunk is referenced by the stores, the memory area cannot be reused for the res obtained on the next iteration (they are of the same size), so another sizable piece of memory is allocated, which is then again pinned by a few tiny references, and so on.
Without storing parts of res, the program successfully keeps memory consumption constant after the first couple of iterations.
So how do I make a complete copy of structure field?
I've tried using struct-related functions like rmfield but those keep references instead of their own objects.
I've tried to wrap the assignment in its own function:
new_struct=copy( rmfield(old_struct,"bigdata"));
function c=copy(a);
  c=a;
end;
This by the way doesn't work even for arrays.
I'm interested in method applicable to any generic variable.
Minimal working example of the problem
a=cell(3,1);
for i=1:length(a);
  r=rand(100000,1000);
  a{i}=r(1:100,end);
  whos; fflush(stdout);
  pause(2);
end;
The above code will cause memory usage to grow gradually by far more than the 8.08 kB reported by whos, because the references stored in a{i} keep a much bigger memory block alive than they actually need. If you force a proper copy, the problem is not present.
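For what it's worth, the same pattern exists in NumPy, where a basic slice is a view that keeps the whole parent array alive; a short Python sketch of the analogous situation (illustrative only, not Octave):
import numpy as np

# a basic slice is a *view*: holding it keeps the whole parent array alive
a = [None] * 3
for i in range(3):
    r = np.random.rand(100000, 1000)   # a large block of doubles (~800 MB)
    a[i] = r[:100, -1]                 # view: r cannot be freed while a[i] lives
    # a[i] = r[:100, -1].copy()        # explicit copy: r can be freed next iteration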
Numerical arrays
For numeric types, adding zero is enough to force a new array.
c=a+0;
Strings
For a string, which is a 1 x n char array, something along the following lines will work:
c=[a "a"](1:end-1);
Multidimensional char arrays will require concatenation with a column:
c=[a true(size(a,1),1)](:,1:end-1);
Here true is used to generate a dummy array of a size compatible with char (there seems to be no procedural method of generating a char array of arbitrary size); char(zeros(size(a,1),1)) and char(true(size(a,1),1)) caused excess memory usage during their creation on some calls.
Note that the empty concatenation c=[a ""]; will not result in a copy. It is also possible to do c=[a+0 ""];, which will result in a copy due to the +0, but that incurs type conversions to and from double, which is 8 times larger in size (char(zeros(...)) doesn't seem to cause that).
Other types
In general you can use typecasting for the types that allow it, so that you don't have to tailor the expressions manually as I did above:
typelist={"double","single","char"}; %full list of supported types is available in the link
class_of_a = typelist{ isa(a,typelist) };
c=typecast( [typecast(a,'single'); single(1)] (1:end-1), class_of_a);
single seems to be the smallest datatype available in Octave.
Note that logical is not supported by this method.
Copying structures
Apparently you'd have to write your own function that walks over the struct fields, copies them with the methods above, and recurses into substructs.
(As it doesn't involve complexities relevant here, I'd rather leave that to those who actually need it; my own problem was solved by the +0 trick.)
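If someone does need it, here is a generic sketch of such a recursive walker, with nested Python dicts standing in for Octave structs (illustrative only; copy_leaf stands in for one of the type-specific tricks above):
import copy

# recurse through "struct" fields and force a real copy at each leaf
def deep_copy_struct(s, copy_leaf=copy.copy):
    if isinstance(s, dict):          # a "struct": recurse into its fields
        return {k: deep_copy_struct(v, copy_leaf) for k, v in s.items()}
    return copy_leaf(s)              # a leaf value: force a real copy

copied = deep_copy_struct({"a": [1, 2, 3], "b": {"c": "text"}})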

Generate unique serial from id number

I have a database whose id increases incrementally. I need a function that converts that id to a unique number between 0 and 1000 (the actual maximum is much larger, but just for simplicity's sake).
1 => 3301,
2 => 0234,
3 => 7928,
4 => 9821
The numbers generated cannot have duplicates.
They cannot be incremental.
They need to be generated on the fly (not read from a pre-built table of uniform numbers).
I thought of a hash function, but there is a possibility of collisions.
Random numbers could also have duplicates.
I need a minimal perfect hash function but cannot find a simple solution.
Since the criteria are sort of vague (good enough to fool the average person), I am unsure exactly which route to take. Here are some ideas:
You could use a Pearson hash. According to the Wikipedia page:
Given a small, privileged set of inputs (e.g., reserved words for a compiler), the permutation table can be adjusted so that those inputs yield distinct hash values, producing what is called a perfect hash function.
You could just use a complicated looking one-to-one mathematical function. The drawback of this is that it would be difficult to make one that was not strictly increasing or strictly decreasing due to the one-to-one requirement. If you did something like (id ^ 2) + id * 2, the interval between ids would change and it wouldn't be immediately obvious what the function was without knowing the original ids.
You could do something like this:
new_id = (old_id << 4) + arbitrary_4bit_hash(old_id);
This would give unique IDs, and it wouldn't be immediately obvious that the low 4 bits are just garbage (especially when reading the numbers in decimal format). Like the last option, the new IDs would be in the same order as the old ones; I don't know if that would be a problem. (A short sketch of this option follows this list.)
You could just hardcode all ID conversions by making a lookup array full of "random" numbers.
You could use some kind of hash function generator like gperf.
GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
You could encrypt the ids with a key using a cryptographically secure mechanism.
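Here is a short Python sketch of the shift-plus-small-hash option above (arbitrary_4bit_hash is a made-up placeholder; any deterministic 4-bit function would do):
def arbitrary_4bit_hash(n):
    return (n * 2654435761) & 0xF          # any deterministic 4-bit value works

def new_id(old_id):
    return (old_id << 4) + arbitrary_4bit_hash(old_id)

def old_id(new):
    return new >> 4                        # the low 4 bits are just the hash

ids = [new_id(i) for i in range(10)]
assert len(set(ids)) == len(ids)           # still unique, collisions are impossible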
Hopefully one of these works for you.
Update
Here is the rotational shift the OP requested:
function map($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end.
    // Also, mask out all but 10 bits. This allows unique mappings
    // from 0-1023 to 0-1023.
    $high_bits = 0b0000001111111000 & $number;
    $new_low_bits = $high_bits >> 3;
    $low_bits = 0b0000000000000111 & $number;
    $new_high_bits = $low_bits << 7;
    // Recombine the bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
function demap($number)
{
    // Shift the high bits down to the low end and the low bits
    // up to the high end.
    $high_bits = 0b0000001110000000 & $number;
    $new_low_bits = $high_bits >> 7;
    $low_bits = 0b0000000001111111 & $number;
    $new_high_bits = $low_bits << 3;
    // Recombine the bits
    $new_number = $new_high_bits | $new_low_bits;
    return $new_number;
}
This method has its advantages and disadvantages. The main disadvantage that I can think of (besides the security aspect) is that for lower IDs consecutive numbers will be exactly the same (multiplicative) interval apart until digits start wrapping around. That is to say
map(1) * 2 == map(2)
map(1) * 3 == map(3)
This happens, of course, because with lower numbers, all the higher bits are 0, so the map function is equivalent to just shifting. This is why I suggested using pseudo-random data for the lower bits rather than the higher bits of the number. It would make the regular interval less noticeable. To help mitigate this problem, the function I wrote shifts only the first 3 bits and rotates the rest. By doing this, the regular interval will be less noticeable for all IDs greater than 7.
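For reference, the same 10-bit rotation written in Python, with a quick check that it really is a bijection on 0-1023 and that demap inverts map:
def map_id(number):
    high_bits = 0b0000001111111000 & number
    new_low_bits = high_bits >> 3
    low_bits = 0b0000000000000111 & number
    new_high_bits = low_bits << 7
    return new_high_bits | new_low_bits

def demap_id(number):
    high_bits = 0b0000001110000000 & number
    new_low_bits = high_bits >> 7
    low_bits = 0b0000000001111111 & number
    new_high_bits = low_bits << 3
    return new_high_bits | new_low_bits

mapped = [map_id(i) for i in range(1024)]
assert len(set(mapped)) == 1024                          # one-to-one on 0-1023
assert all(demap_id(map_id(i)) == i for i in range(1024))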
It seems that it doesn't have to be numerical? What about an MD5-Hash?
select md5(id+rand(10000)) from ...

How are functions modified at run-time then propagated to multiple threads?

With Clojure (and other Lisp dialects) you can modify running code. So, when a function is modified at runtime, is that change made available to multiple threads?
I'm trying to figure out how it works technically in a concurrent setting: if several threads are using a function foo, what happens when I redefine (say using defn) the function foo?
There has to be some synchronization going on: when and how does such synchronization happen and what does it cost?
Say on the JVM, is the function referenced through a volatile reference? If so, does that mean one has to pay the volatile-read cost on every single "function lookup"?
In Clojure, functions are instances of the IFn class and they are almost always stored in vars; vars are Clojure's mechanism for thread-local values.
When you define a function, that sets the "root binding" of the var to reference the new function.
Other threads get whatever the current value of the root binding for the var is, but they can't change that value. This prevents any two threads from having to fight over the value of the var, because only the root thread can set it.
Threads can choose to use a new value of the var if they need to by calling binding, which gives them their own thread-local value that they are free to change at will, because no other thread can read it.
A good understanding of vars is well worth a little study; they are a very useful concurrency device once you get used to them.
ps: the root thread is usually the REPL
pss: you are of course free to store your functions in something other than vars, if for instance you needed to atomically update a group of functions, though this is rare.
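If it helps, here is a very loose Python analogy of the root-binding / thread-local-binding idea (this is not how Clojure implements vars; the Var class below is made up purely for illustration):
import threading

class Var:
    def __init__(self, root):
        self._root = root                    # the root binding, visible to all threads
        self._local = threading.local()      # per-thread overrides, like `binding`

    def deref(self):
        return getattr(self._local, "value", self._root)

    def set_root(self, value):
        self._root = value                   # like re-running `defn`

    def bind(self, value):
        self._local.value = value            # thread-local rebinding

foo = Var(lambda x: x + 1)
foo.set_root(lambda x: x * 2)                # "redefinition": other threads pick it
print(foo.deref()(21))                       # up on their next deref -> prints 42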

Accessing an array element directly vs. assigning it to a variable

Performance-wise, is it better to access an array element 'directly' multiple times, or assign its value to a variable and use that variable? Assuming I'll be referencing the value several times in the following code.
The reasoning behind this question is that accessing an array element presumably involves some computing cost each time it is done, without requiring extra space. On the other hand, storing the value in a variable eliminates this access cost but takes up extra space.
' use a variable to store the value
Temp = ArrayOfValues(0)
If Temp > 100 Or Temp < 50 Then
    Dim Blah = Temp
    ...

' reference the array element 'directly'
If ArrayOfValues(0) > 100 Or ArrayOfValues(0) < 50 Then
    Dim Blah = ArrayOfValues(0)
    ...
I know this is a trivial example, but assuming we're talking about a larger scale in actual use (where the value will be referenced many times) at what point is the tradeoff between space and computing time worth considering (if at all)?
This is tagged language agnostic, but I don't really believe that it is. This post answers the C and C++ version of the question.
An optimizing compiler can take care of "naked" array accesses; in C or C++ there's no reason to think that the compiler wouldn't remember the value of a memory location if no functions were called in between. E.g.
int a = myarray[19];
int b = myarray[19] * 5;
int c = myarray[19] / 2;
int d = myarray[19] + 3;
However, if myarray is not just defined as int[] but is actually something "fancy", especially some user-defined container type with an operator[]() defined in another translation unit, then that function must be called each time the value is requested (since the function returns the data at a location in memory, and the calling function doesn't know that the result is intended to be constant).
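The effect is easy to see in any language with user-defined indexing; here is a small, purely illustrative Python sketch (the Fancy class is made up) that counts how often the indexing operator gets called:
class Fancy:
    def __init__(self, data):
        self.data, self.calls = data, 0

    def __getitem__(self, i):       # analogous to a user-defined operator[]()
        self.calls += 1
        return self.data[i]

m = Fancy(list(range(100)))
a, b, c = m[19], m[19] * 5, m[19] / 2    # three separate calls
cached = m[19]                           # one call, then reuse `cached`
print(m.calls)                           # 4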
Even with 'naked' arrays though, if you access the same thing multiple times around function calls, the compiler similarly must assume that the value has been changed (even if it can remember the address itself). E.g.
int a = myarray[19];
NiftyFunction();
int b = myarray[19] * 8;
There's no way that the compiler can know that myarray[19] will have the same value before and after the function call.
So- generally speaking, if you know that a value is constant through the local scope, "cache" it in a local variable. You can program defensively and use assertions to validate this condition you've put on things:
int a = myarray[19];
NiftyFunction();
assert(myarray[19] == a);
int b = a * 8;
A final benefit is that it's much easier to inspect the values in a debugger if they're not buried in an array somewhere.
The overhead in memory consumption is very limited, because for reference types it's just a pointer (a couple of bytes) and most value types also require just a few bytes.
Arrays are very efficient structures in most languages. Getting to an index doesn't involve any lookup, just some math (if each array slot takes 4 bytes, the 11th slot is at offset 40). Then there is probably a bit of overhead for bounds checking. Allocating the memory for a new local variable and freeing it requires a few CPU cycles as well. So in the end it also depends on how many array lookups you eliminate by copying to a local variable.
The fact is that you really need exceptionally crappy hardware or really big loops for this to matter, and if it does, run a decent test on it. I personally often choose the separate variable, as I find that it makes the code more readable.
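If you do want such a test, a throwaway micro-benchmark is enough to get a feel for it; for example, a quick Python sketch with timeit (illustrative only, results vary wildly across languages and runtimes):
import timeit

setup = "values = list(range(1000))"
direct = "values[0] > 100 or values[0] < 50"          # index the array twice
cached = "t = values[0]; t > 100 or t < 50"           # cache in a local first
print(timeit.timeit(direct, setup=setup))
print(timeit.timeit(cached, setup=setup))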
Your example is odd btw since you do 2 array lookups before you create the local var :)
This makes more sense (eliminating 2 more lookups):
Dim blah = ArrayOfValues(0)
If blah > 100 Or blah < 50 Then
    ...