While studying Ethereum smart contracts, I ran into the question of how to store large structures (an array or a map) inside a contract. Here is the contract:
pragma solidity ^0.8.6;

contract UselessWorker {
    uint32 public successfullyExecutedIterations = 0;

    mapping(uint32 => uint32[6]) items;

    event ItemAdded(uint32 result);

    function doWork(int _iterations) public {
        for (int i = 0; i < _iterations; i++) {
            items[successfullyExecutedIterations] = [uint32(block.timestamp), successfullyExecutedIterations, 10, 10, 10, 10];
            successfullyExecutedIterations++;
        }
        emit ItemAdded(successfullyExecutedIterations);
    }
}
The doWork method, which is called by an external script, fills the items map with arrays of numbers. The more records appear in items, the faster disk space is consumed (for 1,000,000 records the blockchain size is about 350 MB, for 2,000,000 about 1.1 GB, for 19,000,000 about 22 GB; this is the size of the .ethereum/net/geth/chaindata/ folder).
I am testing it on a private network, so the price of gas does not bother me. I run it with the command
geth --mine --networkid 999 --datadir ~/.ethereum/net --rpccorsdomain "*" --allow-insecure-unlock --miner.gastarget 900000000 --rpc --ws --ws.port 8546 --ws.addr "127.0.0.1" --ws.origins "*" --ws.api "web3,eth" --rpc.gascap 800000000
By my estimate, one record in the map should take about 224 bytes (7 * 32 bytes), so 19M records should take about 4.2 GB.
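Spelled out, the arithmetic behind that estimate is:

\[ 7 \times 32\,\text{B} = 224\,\text{B per record}, \qquad 19{,}000{,}000 \times 224\,\text{B} \approx 4.26 \times 10^{9}\,\text{B} \approx 4.26\,\text{GB}. \]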
It feels like there is a memory leak, or else I misunderstand how storage is allocated for the map.
Can anyone suggest why the blockchain is consuming so much disk space?
Related
I am trying to make a mesh network using the ESP32 module. WiFi.h's softAPConfig() can be used to set the starting address for DHCP leasing, but the leased addresses progress upwards without reusing already-leased addresses that are no longer in use. So I want to limit the leasing range between two addresses.
I found this piece of code in dhcpserver.h:
/* Defined in esp_misc.h */
typedef struct {
    bool enable;
    ip4_addr_t start_ip;
    ip4_addr_t end_ip;
} dhcps_lease_t;
This is the code I compiled and uploaded to the ESP32 module:
#include "WiFi.h"
char *ssid = "AirMesh";
IPAddress local_IP(192,168,1,0);
IPAddress gateway(192,168,1,1);
IPAddress subnet(255,255,255,0);
void setup()
{
Serial.begin(9600);
Serial.println();
Serial.print("Setting soft-AP configuration ... ");
Serial.println(WiFi.softAPConfig(local_IP, gateway, subnet) ? "Ready" : "Failed!");
Serial.print("Setting soft-AP ... ");
Serial.println(WiFi.softAP("ESPsoftAP_01") ? "Ready" : "Failed!");
Serial.print("Soft-AP IP address = ");
Serial.println(WiFi.softAPIP());
WiFi.softAP(ssid);
}
void loop() {}
The first device to connect gets IP 192.168.1.1 and the second device gets 192.168.1.2; when I disconnect the first device and reconnect it, it gets 192.168.1.3 (every connection uses a different physical address). This progression keeps going.
EDIT:
After digging around, I think I found the code responsible for the IP lease range, but I couldn't figure out what it means.
lease.start_ip.addr = static_cast<uint32_t>(local_ip) + (1 << 24);
lease.end_ip.addr = static_cast<uint32_t>(local_ip) + (11 << 24);
After some trial and error, I managed to find the answer.
Change the code in the WiFiAP.cpp file (I have forked the repo and opened a pull request replacing 11 with 10, since the maximum number of connections possible for the ESP32 is 10):
lease.start_ip.addr = static_cast<uint32_t>(local_ip) + (1 << 24);
lease.end_ip.addr = static_cast<uint32_t>(local_ip) + (n << 24);
where n is the number of IP addresses that can be leased to external devices.
For example:
lease.end_ip.addr = static_cast<uint32_t>(local_ip) + (20 << 24);
means that if the starting IP is 192.168.1.0, the DHCP server will assign addresses from 192.168.1.1 to 192.168.1.20, while 192.168.1.0 (the starting IP) will be the address of the ESP32 module.
Doesn't the access point use the first IP, 192.168.1.1, not .1.0?
So 2-11 is 10 connections.
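To make the (n << 24) arithmetic concrete: lwIP stores an IPv4 address as a uint32_t in network byte order, so on the little-endian ESP32 the last octet sits in the most significant byte, and adding n << 24 changes only that octet. Below is a minimal standalone sketch (plain desktop C++, not ESP32 code; make_ip and print_ip are helpers invented just for this illustration):

#include <cstdint>
#include <cstdio>

// Pack a dotted quad into a uint32_t the way it sits in memory on a
// little-endian MCU (network byte order => last octet is the MSB).
static uint32_t make_ip(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return (uint32_t)a | ((uint32_t)b << 8) | ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

static void print_ip(const char* label, uint32_t ip) {
    printf("%s %u.%u.%u.%u\n", label,
           (unsigned)(ip & 0xFF), (unsigned)((ip >> 8) & 0xFF),
           (unsigned)((ip >> 16) & 0xFF), (unsigned)((ip >> 24) & 0xFF));
}

int main() {
    uint32_t local_ip = make_ip(192, 168, 1, 0);
    int n = 20;                                       // number of leasable addresses
    uint32_t start = local_ip + (1u << 24);           // 192.168.1.1
    uint32_t end   = local_ip + ((uint32_t)n << 24);  // 192.168.1.20
    print_ip("start:", start);
    print_ip("end:  ", end);
    return 0;
}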
I am trying to test out the effectiveness of the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm using a code snippet provided in CUDAcast #10, available here and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000

    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the CUDAcast demo, the presenter gets a 100x speedup by sending the large array operation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
#vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' my runtime is 0.22 seconds. I'm using a DELL Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler tool) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is this reasonable based on my hardware and code?
This question is interesting to me as well.
I've tried your code and got similar results.
To investigate the issue I wrote a CUDA kernel using cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also take into account the time for copying the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.
For the case above (times in seconds):
CPU - 0.0033;
GPU - 0.0096;
Vectorize (target='cuda') - 0.15 (for my PC).
If the copying time is not counted:
GPU - 0.000245
So, what I have learned: (1) Copying from host to device and from device to host is time-consuming. This is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down the calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize the memory copying).
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in this case the Python program's execution time is comparable with that of a C program and gives about a 100x speedup. That is because, fortunately, in this case you can do many iterations without data exchange between host and device.
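For illustration, here is a minimal CUDA C sketch of that pattern (my actual test used numba's @cuda.jit; the grid size, step count and coefficient r below are made-up values, not the ones from my test). All time steps run on the device, and host memory is touched only once before and once after the loop:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One explicit finite-difference step of the 1D heat equation.
__global__ void heat_step(const float* t_old, float* t_new, int n, float r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i == 0 || i == n - 1) { t_new[i] = t_old[i]; return; }   // fixed boundaries
    t_new[i] = t_old[i] + r * (t_old[i - 1] - 2.0f * t_old[i] + t_old[i + 1]);
}

int main()
{
    const int n = 1 << 20, steps = 10000;
    const float r = 0.25f;                    // explicit scheme is stable for r <= 0.5
    size_t bytes = n * sizeof(float);

    float* h = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = (i == n / 2) ? 1.0f : 0.0f;   // initial spike

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);               // one copy in

    for (int s = 0; s < steps; ++s) {                                // all steps stay on the GPU
        heat_step<<<(n + 255) / 256, 256>>>(d_a, d_b, n, r);
        float* tmp = d_a; d_a = d_b; d_b = tmp;                      // ping-pong buffers
    }

    cudaMemcpy(h, d_a, bytes, cudaMemcpyDeviceToHost);               // one copy out
    printf("t[n/2] = %f\n", h[n / 2]);

    free(h); cudaFree(d_a); cudaFree(d_b);
    return 0;
}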
UPD. Used Software & Hardware: Win7 64bit, CPU: Intel Core2 Quad 3GHz, GPU: NVIDIA GeForce GTX 580.
Is it possible to access a hard disk/flash disk directly from the GPU (CUDA/OpenCL) and load/store content directly from the GPU's memory?
I am trying to avoid copying stuff from disk to memory and then copying it over to GPU's memory.
I read about NVIDIA GPUDirect, but I'm not sure whether it does what I explained above. It talks about remote GPU memory and disks, but the disks in my case are local to the GPU.
The basic idea is to load contents (something like DMA) -> do some operations -> store the contents back to disk (again in DMA fashion).
I am trying to involve CPU and RAM as little as possible here.
Please feel free to offer any suggestions about the design.
For anyone else looking for this, 'lazy unpinning' did more or less what I wanted.
Go through the following to see if it can be helpful for you:
The most straightforward implementation using RDMA for GPUDirect would
pin memory before each transfer and unpin it right after the transfer
is complete. Unfortunately, this would perform poorly in general, as
pinning and unpinning memory are expensive operations. The rest of the
steps required to perform an RDMA transfer, however, can be performed
quickly without entering the kernel (the DMA list can be cached and
replayed using MMIO registers/command lists).
Hence, lazily unpinning memory is key to a high performance RDMA
implementation. What it implies, is keeping the memory pinned even
after the transfer has finished. This takes advantage of the fact that
it is likely that the same memory region will be used for future DMA
transfers thus lazy unpinning saves pin/unpin operations.
An example implementation of lazy unpinning would keep a set of pinned
memory regions and only unpin some of them (for example the least
recently used one) if the total size of the regions reached some
threshold, or if pinning a new region failed because of BAR space
exhaustion (see PCI BAR sizes).
Here is a link to an application guide and to the NVIDIA docs.
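As a rough illustration of the idea (not code from the guide), a lazily-unpinning cache might look like the sketch below: regions stay registered with cudaHostRegister after use and are only unregistered, least recently used first, when a size budget is exceeded. The PinnedCache class, the budget, and the std::list-based LRU are all illustrative choices.

#include <cstddef>
#include <list>
#include <cuda_runtime.h>

class PinnedCache {
    struct Region { void* ptr; size_t size; };
    std::list<Region> lru_;          // front = most recently used
    size_t total_ = 0;
    size_t budget_;
public:
    explicit PinnedCache(size_t budget) : budget_(budget) {}

    // Pin ptr if it is not already pinned; otherwise just refresh its LRU position.
    cudaError_t acquire(void* ptr, size_t size) {
        for (auto it = lru_.begin(); it != lru_.end(); ++it) {
            if (it->ptr == ptr) {                      // already pinned: mark as recently used
                lru_.splice(lru_.begin(), lru_, it);
                return cudaSuccess;
            }
        }
        while (total_ + size > budget_ && !lru_.empty())
            evict_lru();                               // lazily unpin only under pressure
        cudaError_t err = cudaHostRegister(ptr, size, cudaHostRegisterDefault);
        if (err == cudaSuccess) {
            lru_.push_front({ptr, size});
            total_ += size;
        }
        return err;
    }

    ~PinnedCache() { while (!lru_.empty()) evict_lru(); }

private:
    void evict_lru() {
        Region r = lru_.back();
        lru_.pop_back();
        cudaHostUnregister(r.ptr);                     // unpin the least recently used region
        total_ -= r.size;
    }
};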
Trying to use this feature, I wrote a small example on Windows x64 to implement it. In this example, the kernel "directly" accesses disk space. Actually, as @RobertCrovella mentioned previously, the operating system is doing the job, probably with some CPU work, but no supplemental coding is needed.
__global__ void kernel(int4* ptr)
{
    int4 val;
    val.x = threadIdx.x; val.y = blockDim.x; val.z = blockIdx.x; val.w = gridDim.x;
    ptr[threadIdx.x + blockDim.x * blockIdx.x] = val;
    ptr[160 * 1024 * 1024 + threadIdx.x + blockDim.x * blockIdx.x] = val;
}

#include "Windows.h"
#include <cstdio>

int main()
{
    // 4GB - larger than installed GPU memory
    size_t size = 256 * 1024 * 1024 * sizeof(int4);

    HANDLE hFile = ::CreateFile("GPU.dump", (GENERIC_READ | GENERIC_WRITE), 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE hFileMapping = ::CreateFileMapping(hFile, 0, PAGE_READWRITE, (size >> 32), (int)size, 0);
    void* ptr = ::MapViewOfFile(hFileMapping, FILE_MAP_ALL_ACCESS, 0, 0, size);

    ::cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaError_t er = ::cudaHostRegister(ptr, size, cudaHostRegisterMapped);
    if (cudaSuccess != er)
    {
        printf("could not register\n");
        return 1;
    }

    void* d_ptr;
    er = ::cudaHostGetDevicePointer(&d_ptr, ptr, 0);
    if (cudaSuccess != er)
    {
        printf("could not get device pointer\n");
        return 1;
    }

    kernel<<<256,256>>>((int4*)d_ptr);

    if (cudaSuccess != ::cudaDeviceSynchronize())
    {
        printf("error in kernel\n");
        return 1;
    }

    if (cudaSuccess != ::cudaHostUnregister(ptr))
    {
        printf("could not unregister\n");
        return 1;
    }

    ::UnmapViewOfFile(ptr);
    ::CloseHandle(hFileMapping);
    ::CloseHandle(hFile);
    ::cudaDeviceReset();

    printf("DONE\n");
    return 0;
}
The real solution is on the horizon!
Early access: https://developer.nvidia.com/gpudirect-storage
GPUDirect® Storage (GDS) is the newest addition to the GPUDirect family. GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
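For a feel of what this looks like in code, here is a rough, untested sketch based on the publicly documented cuFile API that ships with the GDS package (cufile.h, linked with -lcufile). The file name, sizes, and the omitted error handling are placeholders, and the exact flags and signatures are assumptions to verify against the GDS documentation:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufile.h>

int main()
{
    const size_t nbytes = 1 << 20;                     // 1 MiB, illustrative only

    cuFileDriverOpen();                                // bring up the GDS driver

    int fd = open("data.bin", O_RDONLY | O_DIRECT);    // placeholder file name
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);

    void* d_buf;
    cudaMalloc(&d_buf, nbytes);
    cuFileBufRegister(d_buf, nbytes, 0);               // optional: register the GPU buffer

    // DMA from storage into GPU memory, bypassing a host bounce buffer
    ssize_t got = cuFileRead(fh, d_buf, nbytes, 0 /*file offset*/, 0 /*buffer offset*/);
    printf("read %zd bytes into GPU memory\n", got);

    cuFileBufDeregister(d_buf);
    cudaFree(d_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}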
I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate) {
    #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
    {
        #pragma acc region if(accelerate)
        {
            #pragma acc loop independent vector(32)
            for (int j = 0; j < N; j++)
            {
                #pragma acc loop independent vector(32)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++) {
                        sum += a[i + k*N] * b[k + j*N];
                    }
                    r[i + j*N] = sum;
                }
            }
        }
    }
}
This results in thread blocks of size 32x32 threads and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: GeForce GT 650M, 64-bit Linux
Data size: 1500 x 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions on how to get better performance from OpenACC?
Perhaps my choice of directives is not right?
You're getting right at a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.
I've always worked with linear shared memory (load, store, access neighbours), but I've made a simple test in 2D to study bank conflicts, and the results have confused me.
The following code reads data from a one-dimensional global memory array into shared memory and copies it back from shared memory to global memory.
__global__ void update(int* gIn, int* gOut, int w) {

    // shared memory space
    __shared__ int shData[16][16];

    // map from threadIdx/BlockIdx to data position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global id into the one dimensional array
    int gid = x + y * w;

    // load shared memory
    shData[threadIdx.x][threadIdx.y] = gIn[gid];
    // synchronize threads not really needed but keep it for convenience
    __syncthreads();
    // write data back to global memory
    gOut[gid] = shData[threadIdx.x][threadIdx.y];
}
The Visual Profiler reported conflicts in shared memory. The following code avoids those conflicts (only the differences are shown):
// load shared memory
shData[threadIdx.y][threadIdx.x] = gIn[gid];
// write data back to global memory
gOut[gid] = shData[threadIdx.y][threadIdx.x];
This behavior has confused me, because in Programming Massively Parallel Processors: A Hands-on Approach we can read:
matrix elements in C and CUDA are placed into the linearly addressed locations according to the row major convention. That is, the elements of row 0 of a matrix are first placed in order into consecutive locations.
Is this related to the shared memory arrangement, or to the thread indexes? Or am I missing something?
The kernel configuration is as follows:
// kernel configuration
dim3 dimBlock = dim3(16, 16, 1);
dim3 dimGrid = dim3(64, 64);
// Launching a grid of 64x64 blocks with 16x16 threads -> 1048576 threads
update<<<dimGrid, dimBlock>>>(d_input, d_output, 1024);
Thanks in advance.
Yes, shared memory is arranged in row-major order as you expected. So your [16][16] array is stored row-wise, something like this:
bank0 .... bank15
row 0 [ 0 .... 15 ]
1 [ 16 .... 31 ]
2 [ 32 .... 47 ]
3 [ 48 .... 63 ]
4 [ 64 .... 79 ]
5 [ 80 .... 95 ]
6 [ 96 .... 111 ]
7 [ 112 .... 127 ]
8 [ 128 .... 143 ]
9 [ 144 .... 159 ]
10 [ 160 .... 175 ]
11 [ 176 .... 191 ]
12 [ 192 .... 207 ]
13 [ 208 .... 223 ]
14 [ 224 .... 239 ]
15 [ 240 .... 255 ]
col 0 .... col 15
Because there are 16 32-bit shared memory banks on pre-Fermi hardware, every integer entry in a given column maps onto the same shared memory bank. So how does that interact with your choice of indexing scheme?
The thing to keep in mind is that threads within a block are numbered in the equivalent of column major order (technically the x dimension of the structure is the fastest varying, followed by y, followed by z). So when you use this indexing scheme:
shData[threadIdx.x][threadIdx.y]
threads within a half-warp will be reading from the same column, which implies reading from the same shared memory bank, and bank conflicts will occur. When you use the opposite scheme:
shData[threadIdx.y][threadIdx.x]
threads within the same half-warp will be reading from the same row, which implies reading from each of the 16 different shared memory banks, so no conflicts occur.
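To see this numerically, here is a small host-side sketch (assuming the pre-Fermi layout described above: 16 banks of 4-byte words and a 16x16 int array) that prints which bank each thread of one half-warp (threadIdx.y == 0, threadIdx.x = 0..15) hits under both indexing schemes:

#include <cstdio>

int main() {
    const int BANKS = 16, W = 16;               // 16 banks, 16x16 int array
    for (int tx = 0; tx < 16; ++tx) {
        int ty = 0;                             // first half-warp of a 16x16 block
        int word_conflict = tx * W + ty;        // shData[threadIdx.x][threadIdx.y]
        int word_ok       = ty * W + tx;        // shData[threadIdx.y][threadIdx.x]
        printf("tx=%2d  [x][y] -> bank %2d   [y][x] -> bank %2d\n",
               tx, word_conflict % BANKS, word_ok % BANKS);
    }
    // [x][y]: every thread of the half-warp lands in bank 0 -> 16-way conflict
    // [y][x]: the threads land in banks 0..15               -> no conflict
    return 0;
}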