GCP Cloud SQL - 250 of 2000 - mysql

In GCP Cloud SQL for MySQL, the Network throughput (MB/s) field shows
250 of 2000
Is 250 the average value and 2000 the maximum attainable value?

If you click on the question mark (to the left of your red square) it will lead you to this doc. You will see that the limiting factor is 16 Gbps.
Converting units, 16 Gbps is 2000 MB/s.
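The arithmetic: 16 Gbps = 16 x 10^9 bits/s, and at 8 bits per byte that is 2 x 10^9 bytes/s, i.e. 2000 MB/s (decimal megabytes).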
If you change your machine type to high-mem with 8 or 16 vCPUs, you will see the cap at 2000. So I suspect 250 MB/s is the value allocated for the machine type you chose, the 1-vCPU one.

Related

What is the cause of the low CPU utilization in rllib PPO? What does 'cpu_util_percent' measure?

I implemented multi-agent PPO in RLlib with a custom environment; it learns and works well except for speed. I wonder whether an underutilized CPU may be causing the issue, so I want to know what ray/tune/perf/cpu_util_percent measures. Does it measure only the rollout workers, or is it averaged over the learner as well? And what might be the cause? (All my runs average 13% CPU usage.)
Run on GCP
Ray 2.0
Python 3.9
torch 1.12
head: n1-standard-8 with 1 V100 GPU
2 workers: c2-standard-60
num_workers: 120 # this worker != machine, num_workers = num_rollout_workers
num_envs_per_worker: 1
num_cpus_for_driver: 8
num_gpus: 1
num_cpus_per_worker: 1
num_gpus_per_worker: 0
train_batch_size: 12000
sgd_minibatch_size: 3000
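For reference, the same settings expressed through Ray 2.0's PPOConfig builder API. This is only a sketch: the environment name is a placeholder, and I am assuming num_cpus_for_driver maps to num_cpus_for_local_worker in AlgorithmConfig.

# Sketch of the config above with Ray 2.0's builder API.
# "MyMultiAgentEnv" is a placeholder; keyword names assumed from AlgorithmConfig.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment(env="MyMultiAgentEnv")
    .rollouts(num_rollout_workers=120, num_envs_per_worker=1)
    .resources(
        num_gpus=1,
        num_cpus_per_worker=1,
        num_gpus_per_worker=0,
        num_cpus_for_local_worker=8,  # assumed equivalent of num_cpus_for_driver
    )
    .training(train_batch_size=12000, sgd_minibatch_size=3000)
)
algo = config.build()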
I tried a smaller batch size (4096) and a smaller number of workers (10), and a larger batch size (480000); all resulted in 10-20% CPU usage.
I cannot share the code.
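As a simple cross-check, independent of Ray's own reporting, CPU utilization can be sampled directly on each machine with psutil (a minimal sketch; it does not show what ray/tune/perf/cpu_util_percent aggregates, it just measures the node it runs on):

# Independent CPU-utilization probe; run on the head node and on each
# c2-standard-60 worker while training, then compare with Ray's metric.
import psutil

for _ in range(10):
    # percentage over a 1-second window, averaged across all cores
    print(f"{psutil.cpu_percent(interval=1.0):.1f}% CPU")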

Faster RCNN (caffe) Joint Learning: Out of Memory with 5376 MB dedicated memory (changed batch sizes and no. of RPN proposals), what else?

I have worked with the alternative-optimization MATLAB code before; currently I am trying to get joint learning running. I am able to run the test demo on my Tesla 2070 GPU. For training, I have set all the batch sizes to 1:
__C.TRAIN.IMS_PER_BATCH = 1
__C.TRAIN.BATCH_SIZE = 1
__C.TRAIN.RPN_BATCHSIZE = 1
(I updated the yaml to 1 as well, since it was being overridden)
But I still get the error == cudaSuccess (2 vs. 0) out of memory failure.
I have tried experimenting with lowering the number of proposals (the originals are below):
train:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TRAIN.RPN_PRE_NMS_TOP_N = 12000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TRAIN.RPN_POST_NMS_TOP_N = 2000
test:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TEST.RPN_PRE_NMS_TOP_N = 6000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TEST.RPN_POST_NMS_TOP_N = 300
I tried values as low as pre: 100, post: 50 as a sanity check.
And I am still not able to run without the out-of-memory problem. What am I missing here? My Tesla has 5376 MB of dedicated memory and I use it only for this (I have a separate GPU for my screen). I am positive I read, from one of the authors himself, that 5376 MB should be enough.
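For reference, here is roughly how I apply these overrides (a sketch assuming the py-faster-rcnn style fast_rcnn.config module and its cfg_from_file helper; the yaml path and values are illustrative):

# Sketch: load the yaml first, then override the cfg values from Python,
# so the yaml cannot silently win. Path and values are illustrative.
from fast_rcnn.config import cfg, cfg_from_file

cfg_from_file('experiments/cfgs/faster_rcnn_end2end.yml')
cfg.TRAIN.IMS_PER_BATCH = 1
cfg.TRAIN.BATCH_SIZE = 1
cfg.TRAIN.RPN_BATCHSIZE = 1
cfg.TRAIN.RPN_PRE_NMS_TOP_N = 100   # sanity-check values
cfg.TRAIN.RPN_POST_NMS_TOP_N = 50
print(cfg.TRAIN)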
Thanks.

How to get 100% GPU usage using CUDA

I wonder how I can generate a high load on a GPU, step by step.
What I'm trying to do is write a program which puts the maximum load on one multiprocessor (MP), then on another, until it reaches the total number of MPs.
It would be similar to executing a "while true" on every single core of a CPU, but I'm not sure whether the same paradigm would work on a GPU with CUDA.
Can you help me?
If you want to do a stress test / power consumption test, you'll need to pick the workload. You'll most likely get the highest power consumption from compute-only code with a synthetic benchmark that feeds the GPU the optimal mix and sequence of operations. Otherwise, BLAS level 3 is probably quite close to optimal.
Putting load only on a certain number of multi-processors will require that you tweak the workload to limit the block-level parallelism.
Briefly, this is what I'd do:
Pick a code that is well-optimized and known to utilize the GPU to a great extent (high IPC, high power consumption, etc.). Have a look around on the CUDA developer forums; you should be able to find hand-tuned BLAS code or something similar.
Change the code to force it to run on a given number of multi-processors. This will require that you tune the number of blocks and threads to produce exactly the right amount of load for the number of processors you want to utilize.
Profile: the profiler counters can show you the number of instructions per multiprocessor, which lets you check that you are indeed only running on the desired number of processors, as well as other counters that can indicate how efficiently the code is running.
Measure. If you have a Tesla or Quadro you get power consumption out of the box. Otherwise, try the nvml fix. Without a power measurement it will be hard for you to know how far you are from the TDP and especially whether the GPU is throttling.
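To make the "limit the block-level parallelism" point concrete, here is a minimal sketch of a compute-only burn kernel written with Numba's CUDA JIT (my own illustration, not from the answer above). While the number of blocks is at or below the number of multiprocessors, each resident block occupies one MP, so launching N blocks loads roughly N MPs:

# Compute-only "burn" kernel: a long dependent multiply-add chain per thread.
import numpy as np
from numba import cuda

@cuda.jit
def burn(out, iters):
    i = cuda.grid(1)                  # global thread index
    if i < out.size:
        x = out[i]
        for _ in range(iters):
            x = x * 0.999999 + 1e-6   # dependent FMA chain, no memory traffic
        out[i] = x

blocks, threads = 2, 1024             # e.g. target roughly 2 multiprocessors
data = cuda.to_device(np.ones(blocks * threads, dtype=np.float32))
burn[blocks, threads](data, 1_000_000)
cuda.synchronize()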
Some of my benchmarks carry out the same calculations via CUDA, OpenMP and programmed multithreading. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds/subtracts and multiplies on each data element. A range of data sizes is also used. Free benchmarks, source code and results for Linux are available via my page:
http://www.roylongbottom.org.uk/linux%20benchmarks.htm
I also provide Windows varieties. Below are some CUDA results, showing a maximum speed of 412 GFLOPS using a GeForce GTX 650. On the quad-core/8-thread Core i7, OpenMP produced up to 91 GFLOPS, and multithreading up to 93 GFLOPS using SSE instructions and 178 GFLOPS with AVX 1. See also the section on Burn-In and Reliability Apps, where the most demanding CUDA test is run for a period to show temperature gains, at the same time as CPU stress tests.
Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650
Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014
CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 0.837552 597 0.9295383095741 Yes
Data out only 100000 2 2500 0.389646 1283 0.9295383095741 Yes
Calculate only 100000 2 2500 0.085709 5834 0.9295383095741 Yes
Data in & out 1000000 2 250 0.441478 1133 0.9925497770309 Yes
Data out only 1000000 2 250 0.229017 2183 0.9925497770309 Yes
Calculate only 1000000 2 250 0.051727 9666 0.9925497770309 Yes
Data in & out 10000000 2 25 0.369060 1355 0.9992496371269 Yes
Data out only 10000000 2 25 0.201172 2485 0.9992496371269 Yes
Calculate only 10000000 2 25 0.048027 10411 0.9992496371269 Yes
Data in & out 100000 8 2500 0.708377 2823 0.9571172595024 Yes
Data out only 100000 8 2500 0.388206 5152 0.9571172595024 Yes
Calculate only 100000 8 2500 0.092254 21679 0.9571172595024 Yes
Data in & out 1000000 8 250 0.478644 4178 0.9955183267593 Yes
Data out only 1000000 8 250 0.231182 8651 0.9955183267593 Yes
Calculate only 1000000 8 250 0.053854 37138 0.9955183267593 Yes
Data in & out 10000000 8 25 0.370669 5396 0.9995489120483 Yes
Data out only 10000000 8 25 0.202392 9882 0.9995489120483 Yes
Calculate only 10000000 8 25 0.049263 40599 0.9995489120483 Yes
Data in & out 100000 32 2500 0.725027 11034 0.8902152180672 Yes
Data out only 100000 32 2500 0.407579 19628 0.8902152180672 Yes
Calculate only 100000 32 2500 0.113188 70679 0.8902152180672 Yes
Data in & out 1000000 32 250 0.497855 16069 0.9880878329277 Yes
Data out only 1000000 32 250 0.261461 30597 0.9880878329277 Yes
Calculate only 1000000 32 250 0.060132 133042 0.9880878329277 Yes
Data in & out 10000000 32 25 0.375882 21283 0.9987964630127 Yes
Data out only 10000000 32 25 0.207640 38528 0.9987964630127 Yes
Calculate only 10000000 32 25 0.054718 146204 0.9987964630127 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.018107 27613 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.007775 64308 0.9992496371269 Yes
Calculate 10000000 8 25 0.025103 79671 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.008724 229241 0.9995489120483 Yes
Calculate 10000000 32 25 0.036397 219797 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.019414 412070 0.9987964630127 Yes
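For reference, the per-element arithmetic described above is easy to reproduce on the CPU; a rough NumPy sketch (constants, data size and pass count are my own choices):

# Repeated passes of the benchmark's per-element expression over a float32 array.
import numpy as np

a, b, c, d, e, f = 0.1, 0.9, 0.2, 0.8, 0.3, 0.7
x = np.full(10_000_000, 0.999999, dtype=np.float32)
for _ in range(25):   # repeat passes, as in the tables above
    x = (x + a) * b - (x + c) * d + (x + e) * f
print(x[0])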

Does 1 KB (kilobyte) really equal 1024 bytes?

Until now I believed that 1024 bytes equals 1 KB (kilobyte), but I have been reading on the internet about the decimal and binary systems.
So is 1024 bytes = 1 KB actually the correct definition, or is there simply general confusion?
What you are seeing is a marketing stunt.
Since non-technical people don't know the difference between the metric meg, gig, etc. and the binary meg, gig, etc., storage marketers use the metric calculation, thus 1000 bytes == 1 kilobyte.
This can cause issues for developers and other highly technical people, so you get the idea of a binary meg, gig, etc., designated with a "bi" syllable instead of the standard prefix (e.g. mebibyte vs megabyte, or gibibyte vs gigabyte).
There are two ways to represent big numbers: you can display them in multiples of 1000 (base 10) or of 1024 (base 2). If you divide by 1000, you probably use the SI prefix names; if you divide by 1024, you probably use the IEC prefix names. The problem starts with dividing by 1024: many applications use the SI prefix names for it, and some use the IEC prefix names. So it is important how it is written:
Using IEC standard:
1 KiB = 1,024 bytes (Note: big K)
1 MiB = 1,024 KiB = 1,048,576 bytes
Using SI standard:
1 kB = 1,000 bytes (Note: small k)
1 MB = 1,000 kB = 1,000,000 bytes
Source: Ubuntu units policy: https://wiki.ubuntu.com/UnitsPolicy
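To make the two conventions concrete, a small sketch (my own illustration) that formats the same byte count both ways:

# Format a byte count using either SI (base 1000) or IEC (base 1024) prefixes.
def fmt(n_bytes, base, units):
    value = float(n_bytes)
    for unit in units:
        if value < base:
            return f"{value:.2f} {unit}"
        value /= base
    return f"{value:.2f} {units[-1]}"

n = 500 * 10**9  # a drive advertised as "500 GB"
print(fmt(n, 1000, ["B", "kB", "MB", "GB", "TB"]))      # 500.00 GB
print(fmt(n, 1024, ["B", "KiB", "MiB", "GiB", "TiB"]))  # 465.66 GiB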
In the normal world, most things go by powers of 10. This includes electricity, for example.
But the computer world is about half binary. For example, hard drives are sold by powers of 10, so a 1 KB drive holds 1000 B. But when the computer reads it, the OS usually reads by units of 1024. This is why, when you read the available space on a drive, it shows much less than what was advertised. A 500 GB drive will read as only about 466 GB, because the computer is reading the drive by the binary 1024 convention, not the power of 10 it was sold and advertised by. The same goes for flash drives. RAM, however, is both sold and read by the binary 1024 convention.
One thing to note: it is "B", not "b". There are 8 bits ("b") in a byte ("B"). The reason I bring this up is that when you get internet service, the speed is usually advertised in bits, not bytes, while the download box on the computer shows the speed in bytes. Say you have a 50 Mb internet connection; that is actually a 6.25 MB connection in the download-speed box, because you divide 50 by 8, since there are 8 bits in a byte. That is how the computer reads it. Another marketing strategy, too: after all, 50 Mb sounds much faster than 6.25 MB. Other than speeds through a network, most things are read in bytes ("B"). Some people do not realize that there is a difference between "B" and "b".
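And the bits-versus-bytes arithmetic from the paragraph above, as a one-line check:

# 8 bits per byte: advertised link speed (bits) vs download-box speed (bytes)
link_mbit = 50
print(link_mbit / 8)   # 6.25 -> a "50 Mb" connection downloads at 6.25 MB/s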
Quite simple...
The word "byte" is a computing reference for which the letter "B" is used as an abbreviation.
It must follow, then, that any reference to bytes, e.g. KB, MB, etc., must be based on the well-known and widely accepted base of 1024.
Therefore 1 KB must equal 1024 bytes, and 1 MB must equal 1,048,576 bytes (1024 x 1024), etc.
Any non-computing reference to kilo/mega etc. is based on the decimal base of 1000, e.g. 1 kW (kilowatt), which is 1000 watts.

Only one node owns data in a Cassandra cluster

I am new to Cassandra and have just set up a Cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on it. However, I found that all the data is stored on one node (in the output below, I have manually replaced IP addresses with node numbers):
Datacenter: 105
==========
Address  Rack  Status  State   Load       Owns      Token
                                                    4
node-1   155   Up      Normal  249.89 KB  100.00%   0
node-2   155   Up      Normal  265.39 KB  0.00%     1
node-3   155   Up      Normal  262.31 KB  0.00%     2
node-4   155   Up      Normal  98.35 KB   0.00%     3
node-5   155   Up      Normal  113.58 KB  0.00%     4
In their cassandra.yaml files, I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list the non-IP-address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "node-1, node-2"
internode_compression: none
All nodes use the same seeds.
Could you tell me where I might have gone wrong in the config? And please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
Since you are starting with Cassandra 1.2.8, you should try using the vnodes feature. Instead of setting initial_token, uncomment # num_tokens: 256 in cassandra.yaml, and leave initial_token blank or comment it out. Then you don't have to calculate token positions: each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes also means that you don't have to "rebalance" your cluster every time you add or remove nodes.
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. The assigned token determines the node's position in the ring and the range of data it stores. When you generate tokens, the aim is to cover the entire range from 0 to (2^127 - 1). Tokens aren't IDs as in a MySQL cluster, where you increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of the tokens. And if you want to understand the meaning of the numbers that are generated, check this article out.
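For illustration, the usual evenly-spaced token calculation for an N-node ring can be sketched like this (assuming the 0 to 2^127 - 1 range mentioned above):

# Evenly spaced initial_token values for an N-node ring (range 0 .. 2**127 - 1).
def generate_tokens(n_nodes):
    return [i * (2**127) // n_nodes for i in range(n_nodes)]

for i, token in enumerate(generate_tokens(5), start=1):
    print(f"node-{i}: initial_token = {token}")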
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you run DESCRIBE KEYSPACE x in cqlsh, you'll see which replication_factor is currently set for your keyspace (I suspect the answer is 1).
More details here