I am completely new to Perl, an absolute newbie. I am trying to develop a system which reads a database and, according to the results, generates a queue which launches another script.
HERE is the source code.
Now the script works as expected, except that I have noticed it doesn't really run the threads in parallel. Whether I use 1 thread or 50 threads, the execution time is the same; 1 thread is even faster.
When I have the script display which thread did what, I can see that the threads don't run at the same time: it does thread 1, then 2, then 3, and so on.
Does anyone know what I did wrong here? Again, the script itself works, just not with parallel threads.
You need to learn what semaphores actually are before you start using them. You've explicitly told the threads not to run in parallel:
my $s = Thread::Semaphore->new;
#...
while ($queue_id_list->pending > 0) {
    $s->down;
    my $info = $queue_id_list->dequeue_nb;
    if (defined($info)) {
        my @details = split(/#/, $info);
        #my $result = system("./match_name db=user_".$details[0]." id=".$details[1]);
        # normally the script above would be launched; it is a PHP script run via php-cli that does some database work
        sleep(0.1);
        #print "Thread: ". threads->self->tid. " - Done user: ".$details[0]. " and addressbook id: ". $details[1]."\r\n";
        #print $queue_id_list->pending."\r\n";
    }
    $s->up;
}
You've created a semaphore $s, which by default has a count of 1. Then in the function you're trying to run, you call $s->down at the start, which decreases the count by 1 (or blocks if the count is already below 1), and $s->up at the end, which increases the count by 1.
Once a thread calls down, no other threads will run until it calls up again.
You should carefully read the Thread::Semaphore docs, and probably the Wikipedia article on semaphores, too.
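For contrast, a minimal sketch of the loop without the semaphore. Thread::Queue already serializes access to the queue itself, so the workers are free to run their real work in parallel; the queue entries and thread count here are placeholders:
use strict;
use warnings;
use threads;
use Thread::Queue;

# Thread::Queue is itself thread-safe, so workers can dequeue
# concurrently without any extra lock around the loop body.
sub worker {
    my ($queue) = @_;
    while (defined(my $info = $queue->dequeue_nb)) {
        my @details = split(/#/, $info);
        # ... do the real work here (launch the external script, etc.) ...
    }
}

my @work_items = map { "user$_#$_" } 1 .. 100;   # placeholder queue entries
my $queue_id_list = Thread::Queue->new(@work_items);
my @workers = map { threads->create(\&worker, $queue_id_list) } 1 .. 10;
$_->join for @workers;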
Related
I want to design a scheduler that works in an ARINC 653 manner, just for experimental purposes.
Is this possible to manipulate the scheduler in this way?
I know there is time-slicing in ThreadX, but all the examples I've encountered use TX_NO_TIME_SLICE (and my attempts with time-slicing did not work either).
Besides, I'm not sure whether a time slice makes the thread wait until its deadline is met or puts it to sleep so that other threads get to run.
In short: an ARINC 653 scheduler defines a constant major frame in which each 'thread' has a definite amount of running time, and it repeats the major frame endlessly. If a thread is assigned, e.g., 3 ms within a major frame and finishes its job in 1 ms, the kernel still waits 2 ms before switching to the next 'thread'.
You can use time slicing to limit the amount of time each thread runs: https://learn.microsoft.com/en-us/azure/rtos/threadx/chapter4#tx_thread_create
I understand that the characteristic of the ARINC 653 scheduler that you want to emulate is time partitioning. The ThreadX scheduling policy is based on priority, preemption threshold, and time-slicing.
You can emulate time partitioning with ThreadX. To achieve that you can use a timer, with which you suspend/resume the threads for each frame. Timers execute in a different context than threads; they are lightweight and not affected by priorities. By default ThreadX uses a timer thread, set to the highest priority, to execute timers; but to get better performance you can compile ThreadX to run the timers inside an IRQ (define the option TX_TIMER_PROCESS_IN_ISR).
An example:
Threads thd1, thd2, thd3 belong to frame A
Threads thd4, thd5, thd6 belong to frame B
Timer tm1 is triggered once every frame change
Pseudo code for tm1:
tm1()
{
    static int i = 0;
    i = !i;    /* toggle between the two frames */
    if (i)
    {
        /* switch from frame A to frame B */
        tx_thread_suspend(thd1);
        tx_thread_suspend(thd2);
        tx_thread_suspend(thd3);
        tx_thread_resume(thd4);
        tx_thread_resume(thd5);
        tx_thread_resume(thd6);
    }
    else
    {
        /* switch from frame B to frame A */
        tx_thread_suspend(thd4);
        tx_thread_suspend(thd5);
        tx_thread_suspend(thd6);
        tx_thread_resume(thd1);
        tx_thread_resume(thd2);
        tx_thread_resume(thd3);
    }
}
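For completeness, a sketch of how tm1 could be registered as a periodic application timer. The tick count, the place you call this from, and the error handling are placeholders; tx_timer_create itself is the standard ThreadX API:
#include "tx_api.h"

#define FRAME_TICKS 100          /* placeholder: ticks per frame, depends on your tick rate */

TX_TIMER frame_timer;

VOID tm1(ULONG input);           /* the frame-switching function shown above */

void create_frame_timer(void)    /* e.g. called from tx_application_define */
{
    /* Periodic application timer firing once per frame;
       TX_AUTO_ACTIVATE starts it immediately. */
    UINT status = tx_timer_create(&frame_timer, "frame timer",
                                  tm1, 0,        /* expiration function and its input */
                                  FRAME_TICKS,   /* first expiration */
                                  FRAME_TICKS,   /* then periodic */
                                  TX_AUTO_ACTIVATE);
    (void)status;                /* check against TX_SUCCESS in real code */
}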
I am targeting 10x load for an API. This API contains 6 endpoints that should be under test, but each endpoint has its own throughput, which should be multiplied by 10.
Right now I put all the endpoints in one script file, but it doesn't make sense to apply the same throughput to every endpoint. I want to run k6 and have it stop automatically once the needed throughput is reached for a specific group.
Example:
api/GetUser > current 1k RPM > target 10k RPM
api/GetManyUsers > current 500 RPM > target 5k RPM
The main problem is that when I put each endpoint in a separate group in one single script, k6 iterates over both groups/endpoints with the same iteration count and the same virtual users, which results in 10x for both endpoints; that is not what is required at the moment.
One more thing: I already tried separating the endpoints into individual scripts, but that is difficult to manage and makes monitoring harder, because all 6 endpoints should run in parallel.
What you need can currently be approximated roughly with the __ITER and/or __VU execution context variables. Have a single default function that has something like this:
if (__ITER % 3 == 0) {
    CallGetManyUsers(); // ~33% of iterations
} else {
    CallGetUser(); // ~67% of iterations
}
In the very near future we plan to also add a more elegant way of supporting multi-scenario tests in a single script: https://github.com/loadimpact/k6/pull/1007
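For reference, scenarios did ship later (k6 v0.27+). A sketch of per-endpoint arrival rates using them; the URLs, rates, durations, and VU counts below are placeholders:
import http from 'k6/http';

// Two independent scenarios, each driving one endpoint at its own rate.
export const options = {
  scenarios: {
    get_user: {
      executor: 'constant-arrival-rate',
      rate: 10000, timeUnit: '1m',   // 10k RPM
      duration: '10m',
      preAllocatedVUs: 200,
      exec: 'getUser',
    },
    get_many_users: {
      executor: 'constant-arrival-rate',
      rate: 5000, timeUnit: '1m',    // 5k RPM
      duration: '10m',
      preAllocatedVUs: 100,
      exec: 'getManyUsers',
    },
  },
};

export function getUser() { http.get('https://example.com/api/GetUser'); }
export function getManyUsers() { http.get('https://example.com/api/GetManyUsers'); }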
Does it make sense to set batchSize = 1 if I would like to process files one at a time?
I tried batchSize = 1000 and batchSize = 1; they seem to have the same effect.
{
  "version": "2.0",
  "functionTimeout": "00:15:00",
  "aggregator": {
    "batchSize": 1,
    "flushTimeout": "00:00:30"
  }
}
Edited:
Added to app settings:
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1
The function is still triggered simultaneously (using a blob trigger) when two more files are uploaded.
From https://github.com/Azure/azure-functions-host/wiki/Configuration-Settings
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1
Set a maximum number of instances that a function app can scale to. This limit is not yet fully supported - it does work to limit your scale out, but there are some cases where it might not be completely foolproof. We're working on improving this.
I think I can close this issue. There is no easy way to get one-message-at-a-time processing across multiple function app instances.
I think you misunderstand the meaning of batchSize under aggregator. That batchSize is the maximum number of requests to aggregate. You can check here; the aggregator configures how the runtime aggregates data about function executions over a period of time.
From your description, what you want is similar to the Azure Queue batchSize. That setting controls the number of queue messages that the Functions runtime retrieves simultaneously and processes in parallel. If you want to avoid parallel execution for messages received on one queue, you can set batchSize to 1 (meaning one message at a time).
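For the queue-trigger case, a host.json sketch (Functions v2 puts the queue settings under extensions; this assumes a storage queue trigger rather than the blob trigger from the question):
{
  "version": "2.0",
  "extensions": {
    "queues": {
      "batchSize": 1,
      "newBatchThreshold": 0
    }
  }
}
With batchSize 1 and newBatchThreshold 0, each instance processes one message at a time, though other scaled-out instances can still run in parallel.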
I'm using an implementation of parallel reduction in CUDA that uses Kepler's new shuffle instructions, similar to this:
http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/
I was searching for the minima of the rows in a given matrix, and at the end of the kernel I had the following code:
my_register = min(my_register, __shfl_down(my_register,8,16));
my_register = min(my_register, __shfl_down(my_register,4,16));
my_register = min(my_register, __shfl_down(my_register,2,16));
my_register = min(my_register, __shfl_down(my_register,1,16));
My blocks are 16*16, so everything worked fine; with that code I was getting the minima of two sub-rows in the very same kernel.
Now I also need to return the indices of the smallest elements in every row of my matrix, so I was going to replace "min" with an "if" statement and handle the indices in a similar fashion, but I got stuck at this code:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
if (my_reg > __shfl_down(my_reg,4,16)){my_reg = __shfl_down(my_reg,4,16);};
if (my_reg > __shfl_down(my_reg,2,16)){my_reg = __shfl_down(my_reg,2,16);};
if (my_reg > __shfl_down(my_reg,1,16)){my_reg = __shfl_down(my_reg,1,16);};
No CUDA errors whatsoever, but the kernel returns garbage now. Nevertheless, I have a fix for that:
myreg_tmp = __shfl_down(myreg,8,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,4,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,2,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
myreg_tmp = __shfl_down(myreg,1,16);
if (myreg > myreg_tmp){myreg = myreg_tmp;};
So, allocating a new tmp variable to sneak into neighboring registers fixes everything for me.
Now the question: are the Kepler shuffle instructions destructive, in the sense that invoking the same instruction twice doesn't yield the same result? I haven't assigned anything to those registers in saying "my_reg > __shfl_down(my_reg,8,16)", which adds to my confusion. Can anyone explain to me what the problem is with invoking shuffle twice? I'm pretty much a newbie in CUDA, so a detailed explanation for dummies is welcome.
Warp shuffle is not destructive. The operation, if repeated under the exact same conditions, will return the same result each time. The var value (myreg in your example) does not get modified by the warp shuffle function itself.
The problem you are experiencing is due to the fact that the number of participating threads on the second invocation of __shfl_down() in your first method is different than the other invocations, in either method.
First, let's remind ourselves of a key point in the documentation:
Threads may only read data from another thread which is actively participating in the __shfl() command. If the target thread is inactive, the retrieved value is undefined.
Now let's take a look at your first "broken" method:
if (my_reg > __shfl_down(my_reg,8,16)){my_reg = __shfl_down(my_reg,8,16);};
The first time you call __shfl_down() above (within the if-clause), all threads are participating. Therefore all values returned by __shfl_down() will be what you expect. However, once the if clause is complete, only threads that satisfied the if-clause will participate in the body of the if-statement. Therefore, on the second invocation of __shfl_down() within the if-statement body, only threads for which their my_reg value was greater than the my_reg value of the thread 8 lanes above them will participate. This means that some of these assignment statements probably will not return the value you expect, because the other thread may not be participating. (The participation of the thread 8 lanes above would be dependent on the result of the if comparison done by that thread, which may or may not be true.)
The second method you propose has no such issue, and works correctly according to your statements. All threads participate in each invocation of __shfl_down().
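To extend the working second method so it also tracks indices, which is what the question set out to do, here is a hedged sketch. It assumes my_val and my_idx hold each thread's value and column index, and it keeps the question's pre-Volta __shfl_down (on current CUDA this would be __shfl_down_sync):
__device__ void min_with_index(float &my_val, int &my_idx)
{
    // Unconditional shuffles: every thread participates in every call.
    for (int offset = 8; offset > 0; offset >>= 1) {
        float other_val = __shfl_down(my_val, offset, 16);
        int   other_idx = __shfl_down(my_idx, offset, 16);
        if (other_val < my_val) {
            my_val = other_val;  // keep the smaller value...
            my_idx = other_idx;  // ...and the index it came from
        }
    }
    // Lane 0 of each 16-lane segment now holds that segment's minimum
    // and the index where it occurred.
}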
First, please refer to this block of code:
while (1) {
    lt = time(NULL);
    ptr = localtime(&lt);
    int n = read(fd, buf, sizeof(buf));
    strftime(str, 100, "%c", ptr);
    int temp = sprintf(tempCommand, "UPDATE roomtemp SET Temperature='%s' WHERE Date='Today'", buf);
    temp = sprintf(dateCommand, "UPDATE roomtemp SET Date='%s' WHERE Type='DisplayTemp'", str);
    printf("%s", buf);
    mysql_query(conn, tempCommand);
    mysql_query(conn, dateCommand);
}
The read function is actually reading data coming in from a serial port. It works great, but the problem I am experiencing (I think) is the time it takes for the loop to execute. I have data being sent to the serial port every second. Suppose the data is "22" every second. What this loop does is read in "2222" or sometimes "222222". What I think is happening is that the loop takes too long to iterate, and that causes data to accumulate in the serial buffer. The read statement reads in everything in the buffer, hence giving me repeated values.
Is there any way to get around this? Perhaps at the end of the loop, I can flush the buffer. But I am not certain I know how to do this. Or perhaps there is some way to cut down the code inside the loop in order to reduce the overall time each iteration takes in the first place. My guess is that the MySQL queries are what take the most time anyway.
To start with, you should check for errors from read, and also properly terminate the received "string".
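A sketch of that first point as a small helper (the name read_string is hypothetical; the question's loop would call it in place of the bare read):
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one chunk from fd into buf and NUL-terminate it.
   Returns the byte count, or -1 on error. */
static ssize_t read_string(int fd, char *buf, size_t bufsize)
{
    ssize_t n = read(fd, buf, bufsize - 1);  /* leave room for '\0' */
    if (n < 0) {
        perror("read");
        return -1;
    }
    buf[n] = '\0';  /* terminate before using buf with printf/sprintf */
    return n;
}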
To continue with your problem, there are a couple of ways to solve this. One is to put either the reading from the serial port or the database updates in a separate thread. Then you can pass "messages" between the threads. Be careful though: as your database seems slow, the message queue might build up. This build-up can be averted by having a message queue of size one, which always contains the latest temperature reading. Then you only need a single flag that the temperature-reading thread sets, and the database-updating thread checks and then clears. A sketch of that follows below.
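Here is a sketch of that single-slot "mailbox" with POSIX threads; the names are placeholders, and the real code would plug in the serial read on one side and the MySQL updates on the other:
#include <pthread.h>
#include <string.h>

/* Single-slot mailbox: always holds only the latest reading. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static char latest[64];
static int  fresh = 0;              /* set by the reader, cleared by the consumer */

/* Called by the serial-reading thread after each complete reading. */
void publish_reading(const char *reading)
{
    pthread_mutex_lock(&lock);
    strncpy(latest, reading, sizeof(latest) - 1);
    latest[sizeof(latest) - 1] = '\0';
    fresh = 1;                      /* older unconsumed readings are simply overwritten */
    pthread_mutex_unlock(&lock);
}

/* Called by the database-updating thread. Returns 1 if a new reading
   was copied into out, 0 if nothing new has arrived. */
int take_reading(char *out, size_t outsize)
{
    int got = 0;
    pthread_mutex_lock(&lock);
    if (fresh) {
        strncpy(out, latest, outsize - 1);
        out[outsize - 1] = '\0';
        fresh = 0;                  /* clear the flag after consuming */
        got = 1;
    }
    pthread_mutex_unlock(&lock);
    return got;
}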
Another solution is to modify the protocol used for the communication so that it includes a digit telling how big the message is.