Simple SQLAlchemy script runs slow

Getting started with SQLAlchemy, I wrote a script that dumps CSV output of a database I was already using. On my development machine (a VirtualBox Ubuntu VM on Windows 10), it executes in about 2 seconds. When I moved it to its intended destination, it takes 6.5+ minutes of near-100% CPU utilization to complete. While my development machine is faster, I don't see how that could possibly explain the difference in observed execution times. Any ideas on where to look? In addition to having a faster CPU, the Ubuntu VM is SSD-backed.
Here's the code:
virts = session.query(Virtual, Pool, Pool_node, Node, Partition, Ltm) \
    .filter(Virtual.virtual_pool == Pool.pool_id) \
    .filter(Pool_node.pool_id == Pool.pool_id) \
    .filter(Pool_node.node_id == Node.node_id) \
    .filter(Virtual.virtual_partition == Partition.partition_id) \
    .filter(Partition.partition_ltm == Ltm.ltm_id)

for v in virts:
    print "%s,%s,%d,%s,%s,%s,%s,%s,%s,%s,%s,%d,%s" \
        % (v.Ltm.ltm_fqdn, v.Virtual.virtual_destination_ip,
           v.Virtual.virtual_destination_port,
           v.Virtual.virtual_destination_dns,
           v.Virtual.virtual_name, v.Virtual.virtual_descrip,
           v.Virtual.virtual_ena_status, v.Virtual.virtual_avail_status,
           v.Pool.pool_name, v.Node.node_name,
           v.Node.node_ip, v.Pool_node.pool_node_port, v.Node.node_dns)
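One way to see where the time goes (not part of the original post) is to have SQLAlchemy log every SQL statement it emits and time the run on each machine; a minimal sketch, with a placeholder connection URL:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# echo=True makes SQLAlchemy log each statement it sends to the database,
# which shows whether the query above runs as one big SELECT or as many
# small queries. The connection URL below is a placeholder; use your own.
engine = create_engine("postgresql://user:password@host/dbname", echo=True)
Session = sessionmaker(bind=engine)
session = Session()

Comparing the logged SQL (and running it directly against each database) should show whether the time is spent in the database itself or in the Python loop.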

Related

Pytorch DirectML computational inconsistency

I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device I use in different runs (CPU or GPU computing, via PyTorch's to("cpu") or to("dml") calls).
Here is my code:
# Put the networks on either the CPU or the DML device, e.g. .to("cpu") for CPU, .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, some methods are not yet supported in pytorch-directml, such as .gather(), .max(), MSE_Loss(), etc. That is why I need to move the data from the GPU to the CPU, do the computations, calculate the loss, and put it back on the GPU for further steps. See below.
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu") # Calculate target value from the Bellman equation
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1 - dones.to("cpu"))) # Calculate expected value from local network
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu"))
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code is bug-free: when I run it with args.device = "cpu" it works perfectly. However, when I run the exact same code with args.device = "dml", it performs terribly and the network does not learn anything.
I noticed that in every iteration the CPU and GPU results differ only slightly (on the order of 1e-5), but over many iterations this compounds into a huge difference, and the GPU and CPU results end up almost completely different.
What am I missing here? Is there something I need to pay attention to when moving tensors between CPU and GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-directml library?
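(Added illustration, not from the original post.) One way to quantify the per-step divergence described above is to run the same forward pass on the CPU and on the accelerator with identical weights and inputs, then compare the outputs; the layer sizes and tolerance below are arbitrary assumptions:

import torch
import torch.nn as nn

device = "cpu"  # substitute the DirectML device here if torch-directml is installed

torch.manual_seed(0)
net_cpu = nn.Linear(8, 4)
net_dev = nn.Linear(8, 4)
net_dev.load_state_dict(net_cpu.state_dict())  # identical weights on both copies
net_dev = net_dev.to(device)

x = torch.randn(16, 8)
out_cpu = net_cpu(x)
out_dev = net_dev(x.to(device)).to("cpu")

print("max abs difference:", (out_cpu - out_dev).abs().max().item())
print("equal within atol=1e-5:", torch.allclose(out_cpu, out_dev, atol=1e-5))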

How can I get a kernel's execution time with NSight Compute 2019 CLI?

Suppose I have an executable myapp which needs no command-line argument, and launches a CUDA kernel mykernel. I can invoke:
nv-nsight-cu-cli -k mykernel myapp
and get output looking like this:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 1234
[1234] myapp#127.0.0.1
mykernel(), 2020-Oct-25 01:23:45, Context 1, Stream 7
Section: GPU Speed Of Light
--------------------------------------------------------------------
Memory Frequency cycle/nsecond 1.62
SOL FB % 1.58
Elapsed Cycles cycle 4,421,067
SM Frequency cycle/nsecond 1.43
Memory [%] % 61.76
Duration msecond 3.07
SOL L2 % 0.79
SM Active Cycles cycle 4,390,420.69
(etc. etc.)
--------------------------------------------------------------------
(etc. etc. - other sections here)
So far, so good. But now I just want the overall kernel duration of mykernel, and no other output. Looking at nv-nsight-cu-cli --query-metrics, I see, among others:
gpu__time_duration incremental duration in nanoseconds; isolated measurement is same as gpu__time_active
gpu__time_active total duration in nanoseconds
So, it must be one of these, right? But when I run
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_duration,gpu__time_active
I get:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 12345
[12345] myapp#127.0.0.1
mykernel(), 2020-Oct-25 12:34:56, Context 1, Stream 7
Section: GPU Speed Of Light
Section: Command line profiler metrics
---------------------------------------------------------------
gpu__time_active (!) n/a
gpu__time_duration (!) n/a
---------------------------------------------------------------
My questions:
Why am I getting "n/a" values?
How can I get the actual values I'm after, and nothing else?
Notes:
I'm using CUDA 10.2 with NSight Compute version 2019.5.0 (Build 27346997).
I realize I can filter the standard output stream of the unqualified invocation, but that's not what I'm after.
I actually just want the raw number, but I'm willing to settle for using --csv and taking the last field.
Couldn't find anything relevant in the nvprof transition guide.
tl;dr: You need to specify the appropriate 'submetric':
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg
(Based on @RobertCrovella's comments)
CUDA's profiling mechanism collects 'base metrics', which are indeed listed with --query-metrics. For each of these, multiple samples are taken. In version 2019.5 of NSight Compute you can't just get the raw samples; you can only get 'submetric' values.
'Submetrics' are essentially some aggregation of the sequence of samples into a scalar value. Different metrics have different kinds of submetrics (see this listing); for gpu__time_active, these are: .min, .max, .sum, .avg. Yes, if you're wondering - they're missing second-moment metrics like the variance or the sample standard deviation.
So you must either specify one or more submetrics (see the example above), or upgrade to a newer version of NSight Compute, which apparently does let you retrieve the samples directly.
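As an aside (not in the original answer): since the question mentions settling for --csv and taking the last field, the same submetric request can presumably be combined with CSV output, something along the lines of:
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg --csv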

Testing k6 groups with separate throughput

I am targeting 10x load for an API. This API has 6 endpoints which should be under test, but each endpoint has its own throughput, which should be multiplied by 10.
Right now I have all the endpoints in one script file, but it doesn't make sense to apply the same throughput to every endpoint. I want to run k6 and have it stop automatically once the target throughput has been reached for a specific group.
Example:
api/GetUser > current 1k RPM > target 10k RPM
api/GetManyUsers > current 500 RPM > target 5k RPM
The main problem is that when I put each endpoint in a separate group within a single script, k6 iterates over both groups/endpoints with the same iteration count and the same virtual users, which drives both endpoints to 10x, and that is not what I need at the moment.
One more thing: I already tried splitting the endpoints into separate scripts, but that is difficult to manage and makes monitoring harder, because all 6 endpoints should run in parallel.
What you need can currently be approximated roughly with the __ITER and/or __VU execution context variables. Have a single default function that has something like this:
if (__ITER % 3 == 0) {
    CallGetManyUsers(); // 33% of iterations
} else {
    CallGetUser(); // 66% of iterations
}
In the very near future we plan to also add a more elegant way of supporting multi-scenario tests in a single script: https://github.com/loadimpact/k6/pull/1007

What is reasonable top speed for reading a CSV file into a 2-dimensional array?

What is a reasonable time to load a CSV file into a 2-dimensional array in memory, where the number of columns is fixed (406) and the number of rows is about 87,000? In Perl it takes about 12 seconds from either hard disk (SATA) or SSD. Other languages are fine if the speed can be greatly improved.
I expected the time to be much less!
Size on disk of the referenced CSV file is 302MB!
Snip of the interesting Perl below:
while ($iline = <$CSVFILE>)
{
    chomp($iline);
    @csv_values = split /,/, $iline;
    # Create a hash key from $csv_values[0], which is the CODE/label!
    $hashname = $csv_values[0];
    $Greeks{$hashname} = [@csv_values];  # Create the reference & copy the array!
}
For the above, the majority of the time is consumed by the "split" and the hash-key insertion lines!
I tried a similar test in Python (not my strong suit), and the performance was much, much worse!
FYI: the CPU is an Intel 3.2 GHz i7-3930K with 32 GB RAM and a 64-bit OS (Windows 10), for the quoted performance.
Thanks for constructive ideas!
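(Added for illustration, not part of the original question.) Since the poster mentions trying Python, a minimal Python version of the same pattern could look like the sketch below; the file name is a placeholder and the timing will of course depend on the machine:

import csv

# Mirror of the Perl loop above: key each row by its first column (the CODE/label).
# "data.csv" is a placeholder file name.
greeks = {}
with open("data.csv", newline="") as fh:
    for row in csv.reader(fh):
        greeks[row[0]] = row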

Perl script multi thread not running parallel

I am completely new to Perl, an absolute newbie. I am trying to develop a system that reads a database and, based on the results, builds a queue which launches another script.
HERE is the source code.
Now the script works as expected, except that I have noticed it doesn't really run the threads in parallel. Whether I use 1 thread or 50 threads, the execution time is the same; 1 thread is even faster.
When I have the script display which thread did what, I can see the threads don't run at the same time: it does thread 1, then 2, then 3, and so on.
Does anyone know what I did wrong here? Again the script itself works, just not in parallel threads.
You need to learn what semaphores actually are before you start using them. You've explicitly told the threads not to run in parallel:
my $s = Thread::Semaphore->new;
#...
while ($queue_id_list->pending > 0) {
    $s->down;
    my $info = $queue_id_list->dequeue_nb;
    if (defined($info)) {
        my @details = split(/#/, $info);
        #my $result = system("./match_name db=user_".$details[0]." id=".$details[1]);
        # normally the script above would be launched, which is a PHP script run in php-cli and does some database things
        sleep(0.1);
        #print "Thread: ". threads->self->tid. " - Done user: ".$details[0]. " and addressbook id: ". $details[1]."\r\n";
        #print $queue_id_list->pending."\r\n";
    }
    $s->up;
}
You've created a semaphore $s, which by default has a count of 1. Then, in the function you're trying to run, you call $s->down at the start, which decrements the count by 1 (or blocks if the count is already below 1), and $s->up at the end, which increments the count by 1.
Once a thread has called down, every other thread blocks at its own down call until the first thread calls up again, so only one thread at a time ever does any work.
You should carefully read the Thread::Semaphore docs, and probably this Wikipedia article on semaphores, too.
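(Illustration added here, not part of the original answer.) The same pitfall sketched in Python, for readers more familiar with it: holding a count-1 semaphore around the whole worker body serializes the threads exactly as described above.

import threading
import time

sem = threading.Semaphore(1)  # count of 1, like Thread::Semaphore->new

def worker(n):
    # Because the semaphore is held for the entire body, only one worker can be
    # inside this block at a time, so the threads effectively run one by one.
    with sem:
        time.sleep(0.1)  # stand-in for the real work
        print("worker %d done" % n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Total runtime is roughly 5 * 0.1 s, not 0.1 s, because the work is serialized.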