My single-page web application runs into a memory leak with Chromium on Linux. The app fetches HTML markup and embeds it into the main page via .innerHTML. The remote HTML snippets may reference JPGs or PNGs. The leak does not occur in the JS heap, as the table below shows.
Column A shows the size of the JS heap, in bytes, as reported by performance.memory.usedJSHeapSize. (I am running Chromium with the --enable-precise-memory-info flag to get precise values.)
Column B shows the total amount of memory occupied by Chromium as shown by top (only the start and end values were sampled).
Column C shows the amount of "available memory" in Linux as reported by /proc/meminfo.
Step A B C
01 1337628 234.5 MB 522964 KB
02 1372198 499404 KB
03 1500304 499864 KB
04 1568540 485476 KB
05 1651320 478048 KB
06 1718846 489684 KB GC
07 1300169 450240 KB
08 1394914 475624 KB
09 1462320 472540 KB
10 1516964 471064 KB
11 1644589 459604 KB GC
12 1287521 441532 KB
13 1446901 449220 KB
14 1580417 436504 KB
15 1690518 457488 KB
16 1772467 444216 KB GC
17 1261924 418896 KB
18 1329657 439252 KB
19 1403951 436028 KB
20 1498403 434292 KB
21 1607942 429272 KB GC
22 1298138 403828 KB
23 1402844 412368 KB
24 1498350 412560 KB
25 1570854 409912 KB
26 1639122 419268 KB
27 1715667 399460 KB GC
28 1327934 379188 KB
29 1438188 417764 KB
30 1499364 401160 KB
31 1646557 406020 KB
32 1720947 402000 KB GC
33 1369626 283.3 MB 378324 KB
While the JS heap only oscillates between 1.3 MB and 1.8 MB during my 33-step test, Chromium's memory (as reported by top) grows by 48.8 MB (from 234.5 MB to 283.3 MB). And according to /proc/meminfo, the available memory shrinks by about 141 MB (from 522964 KB to 378324 KB in column C) over the same period. I assume Chromium is occupying a good amount of cache memory outside of the reported 283.3 MB. Note that I invoked GC manually 6 times via the developer tools (the rows marked GC).
Before running the test, I have stopped all unnecessary services and killed all unneeded processes. No other code is doing any essential work in parallel. There are no browser extensions and no other open tabs.
The memory leak appears to be in native memory and probably involves the images being shown; they never seem to be freed. The issue appears to be similar to [1] (bugs.webkit.org). I have applied all the usual suggestions as listed in [2]. If I keep the app running, the amount of memory occupied by Chromium grows without bound until everything slows to a crawl or the Linux OOM killer strikes. The one thing I can do (before it is too late) is to point the browser at a different URL. This releases all the native memory, and I can then return to my app and continue with a completely fresh memory situation. But this is not a real solution.
Q: Can the native memory caches be freed in a more programmatic way?
One of my SGE jobs ran slowly and was killed by qmaster to enforce the h_rt=1200 limit.
Is it possible that an SGE admin dynamically changed a setting to make the job (id=2771780) run slowly? If so, which setting could do that? If not, what else could cause this?
qname test.q
hostname abc
group domain
owner jenkins
project NONE
department defaultdepartment
jobname top
jobnumber 2771780
taskid undefined
account sge
priority 0
qsub_time Mon Dec 20 11:46:06 2021
start_time Mon Dec 20 11:46:07 2021
end_time Mon Dec 20 12:06:08 2021
granted_pe NONE
slots 1
failed 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status 137 (Killed)
ru_wallclock 1201s
ru_utime 0.088s
ru_stime 8.797s
ru_maxrss 5.559KB
ru_ixrss 0.000B
ru_ismrss 0.000B
ru_idrss 0.000B
ru_isrss 0.000B
ru_minflt 23574
ru_majflt 0
ru_nswap 0
ru_inblock 128
ru_oublock 240
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 24156
ru_nivcsw 66
cpu 1454.650s
mem 54.658GBs
io 495.010GB
iow 0.000s
maxvmem 1014.082MB
arid undefined
ar_sub_time undefined
category -U arusers,digital -q test.q -l h_rt=1200
If you are saying that the job usually finishes within 1200 s but ran slowly on this particular occasion, this could be due to various external factors, such as contention for storage or network bandwidth. You may also have landed on a different compute node type with a slower CPU. An SGE admin can change various resource settings before the job starts executing, such as the number of cores, but the more likely issue is contention for storage/IO, or even a CPU throttled for thermal reasons.
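To rule out a changed queue limit, you can inspect the queue definition directly. A sketch (qconf -sq prints the current configuration of a queue; the grep pattern is just for convenience):
% qconf -sq test.q | grep -E 'h_rt|h_cpu|h_vmem'
Comparing these values against the h_rt=1200 recorded in the accounting record above can hint at whether the limits were changed after the job was submitted.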
The machine has 4 NUMA nodes and is booted with the kernel boot parameter default_hugepagesz=1G. I start the VM with libvirt/virsh, and I can see that QEMU launches with -m 65536 ... -mem-prealloc -mem-path /mnt/hugepages/libvirt/qemu, i.e. it starts the virtual machine with 64 GB of memory and is asked to allocate the guest memory from a temporarily created file in /mnt/hugepages/libvirt/qemu:
% fgrep Huge /proc/meminfo
AnonHugePages: 270336 kB
ShmemHugePages: 0 kB
HugePages_Total: 113
HugePages_Free: 49
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 118489088 kB
%
% numastat -cm -p `pidof qemu-system-x86_64`
Per-node process memory usage (in MBs) for PID 3365 (qemu-system-x86)
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ -----
Huge 29696 7168 0 28672 65536
Heap 0 0 0 31 31
Stack 0 0 0 0 0
Private 4 9 4 305 322
------- ------ ------ ------ ------ -----
Total 29700 7177 4 29008 65889
...
Node 0 Node 1 Node 2 Node 3 Total
------ ------ ------ ------ ------
MemTotal 128748 129017 129017 129004 515785
MemFree 98732 97339 100060 95848 391979
MemUsed 30016 31678 28957 33156 123807
...
AnonHugePages 0 4 0 260 264
HugePages_Total 29696 28672 28672 28672 115712
HugePages_Free 0 21504 28672 0 50176
HugePages_Surp 0 0 0 0 0
%
This output confirms that the host's 512 GB of memory is split equally across the NUMA nodes, and that hugepages are also distributed roughly equally across the nodes. Since the default hugepage size is 1 GiB, the 64 GB guest consumes 64 of the 113 reserved pages, matching HugePages_Total=113 and HugePages_Free=49 above.
The question is: how does QEMU (or KVM?) determine how many hugepages to allocate? Note that the libvirt XML has the following directive:
<memoryBacking>
<hugepages/>
<locked/>
</memoryBacking>
However, it is unclear from https://libvirt.org/formatdomain.html#memory-tuning what the defaults for hugepage allocation are, and from which nodes the pages are taken. Is it possible to have all of the VM's memory allocated from node 0? What is the right way to do this?
UPDATE
Since my VM workload is actually pinned to a set of cores on a single NUMA node 0 using the <vcpupin> element, I thought it would be a good idea to force QEMU to allocate memory from the same NUMA node:
<numatune>
<memory mode="strict" nodeset="0"/>
</numatune>
However, this didn't work; QEMU reported the following error in its log:
os_mem_prealloc insufficient free host memory pages available to allocate guest ram
Does it mean it fails to find free huge pages on the numa node 0?
If you use a plain <hugepages/> element, then libvirt will configure QEMU to allocate from the default hugepage pool. Given your default_hugepagesz=1G, that should mean that QEMU allocates 1 GB sized pages. QEMU will allocate as many as are needed to satisfy the requested RAM size. Given your configuration, these hugepages can potentially be allocated from any NUMA node.
With more advanced libvirt configuration it is possible to request allocation of a specific size of hugepage, and to pick them from specific NUMA nodes (a sketch follows below). The latter is only really needed if you are also locking vCPUs to a specific host NUMA node.
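For illustration, here is a minimal sketch of such a configuration; the nodeset values are examples, not a recommendation. Note that the nodeset attribute of a <page> element refers to guest NUMA nodes and requires a guest <numa> topology to be defined, while <numatune> controls which host nodes the memory is allocated from:
<memoryBacking>
<hugepages>
<page size="1" unit="G" nodeset="0"/>
</hugepages>
<locked/>
</memoryBacking>
<numatune>
<memory mode="strict" nodeset="0"/>
</numatune>
With a single hugepage size, a plain <hugepages/> element combined with <numatune> is already enough to force the host-side placement.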
Does it mean it fails to find free huge pages on the numa node 0?
Yes, it does.
numastat -m can be used to find out how many hugepages there are in total, and how many are free, on each node.
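The kernel also exposes per-node hugepage counters in sysfs. A quick check for 1 GiB pages on node 0 (with the numbers above, this would report 0 free pages):
% cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
0
The matching nr_hugepages file in the same directory can be written (as root) to grow the per-node reservation, provided enough contiguous memory is still free on that node.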
I'm in the process of writing some N-body simulation code with short-ranged interactions in CUDA, targeted at Volta and Turing series cards. I plan on using shared memory, but it's not quite clear to me how to avoid bank conflicts when doing so. Since my interactions are local, I was planning on sorting my particle data into local groups that I can send to each SM's shared memory (not yet worrying about particles that have a neighbor being worked on by another SM). In order to get good performance (avoiding bank conflicts), is it sufficient that each thread reads/writes from/to a different address of shared memory, even if each thread accesses that memory non-sequentially?
All of the information I see seems to only mention that memory accesses be coalesced for the copy from global memory to shared memory, but I don't see anything about whether threads in a warp (or the whole SM) care about coalescing in shared memory.
In order to get good performance (avoiding bank conflicts), is it sufficient that each thread reads/writes from/to a different address of shared memory, even if each thread accesses that memory non-sequentially?
Bank conflicts are only possible between threads in a single warp that are performing a shared memory access, and then only on a per-(issued-)instruction basis. The instructions I am talking about here are SASS (GPU assembly code) instructions, but they should nevertheless be directly identifiable from shared memory references in CUDA C++ source code.
There is no such thing as a bank conflict:
between threads in different warps
between shared memory accesses arising from different (issued) instructions
A given thread may access shared memory in any pattern, with no concern about or possibility of bank conflicts due to its own activity. Bank conflicts only arise as a result of 2 or more threads in a single warp executing a particular shared memory instruction or access, issued warp-wide.
Furthermore, it is not sufficient that each thread reads/writes from/to a different address. For a given issued instruction (i.e. a given access), roughly speaking, each thread in the warp must read from a different bank, or else it must read from an address that is the same as the address read by another thread in the warp (broadcast).
Let's assume that we are referring to 32-bit banks, and an arrangement of 32 banks.
Shared memory can readily be imagined as a 2D arrangement:
Addr Bank
v 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 96 97 98 ...
We see that addresses/indexes/offsets/locations 0, 32, 64, 96, etc. are in the same bank; addresses 1, 33, 65, 97, etc. are in the same bank; and so on, for each of the 32 banks. Banks are like columns of locations when the addresses of shared memory are visualized in this 2D arrangement.
The requirement for non-bank-conflicted access for a given instruction (load or store) issued to a warp is:
no 2 threads in the warp may access locations in the same bank/column.
a special case exists if the locations in the same column are actually the same location. This invokes the broadcast rule and does not lead to bank conflicts.
And to repeat some statements above in a slightly different way:
If I have a loop in CUDA code, there is no possibility for bank conflicts to arise between separate iterations of that loop.
If I have two separate lines of CUDA C++ code, there is no possibility for bank conflicts to arise between those two separate lines (see the sketch below).
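To make these rules concrete, here is a small sketch of my own (not code from the question), assuming 32 banks of 32-bit words and a single 32-thread warp per block (launch as bankDemo<<<1, 32>>>(d_out);):
__global__ void bankDemo(float *out)
{
    __shared__ float tile[32][32]; // element [r][c] lives in bank c
    int t = threadIdx.x;           // assume blockDim.x == 32 (one warp)

    tile[0][t] = (float)t;         // conflict-free: each thread hits a different bank/column
    tile[t][0] = (float)t;         // 32-way conflict: every thread hits bank/column 0
    __syncwarp();                  // make the writes visible within the warp
    float b = tile[0][0];          // broadcast: all threads read one identical address, no conflict

    // No conflicts can arise between loop iterations: each iteration is a
    // separately issued instruction, and conflicts exist only within one.
    float s = 0.0f;
    for (int i = 0; i < 32; ++i)
        s += tile[t][i];

    out[t] = s + b;
}
The usual fix for the column-access case is to pad the array to tile[32][33]; the extra column skews each row by one bank, so a column walk touches 32 different banks.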
The kernel uses (per --ptxas-options=-v):
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info: Used 45 registers, 49152+0 bytes smem, 64 bytes cmem[0], 12 bytes cmem[16]
Launching with kernelA<<<20,512>>>(parmA, paramB); works fine.
Launching with kernelA<<<20,513>>>(parmA, paramB); fails with "too many resources requested for launch" (the out-of-resources error).
The Fermi device properties: 48 KB of shared memory per SM, 64 KB of constant memory, 32K registers per SM, 1024 maximum threads per block, compute capability 2.1 (sm_21).
I'm using all my shared mem space.
I'll run out of per-block register space around 700 threads/block (32768 / 45 ≈ 728). The kernel will not launch if I ask for more than half the maximum number of threads per block. That may just be a coincidence, but I doubt it.
Why can't I use a full block of threads (1024)?
Any guess as to which resource I'm running out of?
I have often wondered where the stalled thread data/state goes between warps. What resource holds these?
When I did the reg count, I had commented out the printf's: reg count = 45.
When it was actually running, it had the printf's compiled in: reg count = 63, with plenty of spill "regs".
I suspect each thread really has 64 regs, with only 63 available to the program.
64 regs * 512 threads = 32K, the maximum available to a single block.
So I suggest the number of available "code" regs for a block = cudaDeviceProp::regsPerBlock - blockDim, i.e. the kernel doesn't have access to all 32K registers.
The compiler currently limits the number of regs per thread to 63 (beyond that, they spill to lmem). I suspect this 63 is a HW addressing limitation.
So it looks like I'm running out of register space.
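If full 1024-thread blocks matter more than per-thread registers, one workaround (my sketch, not from the original post) is to cap register usage at compile time and accept some spilling:
// Ask ptxas to keep the kernel launchable with 1024-thread blocks; with
// 32K registers per SM that forces at most 32768 / 1024 = 32 registers
// per thread, spilling the remainder to local memory.
__global__ void __launch_bounds__(1024) kernelA(float parmA, int paramB)
{
    // ... kernel body ...
}
The same cap can be applied to the whole compilation unit with nvcc --maxrregcount=32.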
I'm currently developing a practice application in node.js. The application consists of a JSON REST web service that offers two operations:
Insert log (a PUT request to /log, with the message to log)
Last 100 logs (a GET request to /log, that returns the latest 100 logs)
The current stack is formed by a node.js server that has the application logic and a mongodb database that takes care of the persistence. To offer the JSON REST web services I'm using the node-restify module.
I'm currently executing some stress tests using ApacheBench (5000 requests with a concurrency of 10) and get the following results:
1) Insert log
Requests per second: 754.80 [#/sec] (mean)
2) Last 100 logs
Requests per second: 110.37 [#/sec] (mean)
I'm surprised by the difference in performance, since the query I'm executing uses an index. Interestingly, in deeper tests I have performed, the JSON output generation seems to account for most of the time.
Can node applications be profiled in detail?
Is this behaviour normal? Should retrieving data take so much longer than inserting it?
EDIT:
Full test information
1) Insert log
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Server Software: log-server
Server Hostname: localhost
Server Port: 3010
Document Path: /log
Document Length: 0 bytes
Concurrency Level: 10
Time taken for tests: 6.502 seconds
Complete requests: 5000
Failed requests: 0
Write errors: 0
Total transferred: 2240634 bytes
Total PUT: 935000
HTML transferred: 0 bytes
Requests per second: 768.99 [#/sec] (mean)
Time per request: 13.004 [ms] (mean)
Time per request: 1.300 [ms] (mean, across all concurrent requests)
Transfer rate: 336.53 [Kbytes/sec] received
140.43 kb/s sent
476.96 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 3
Processing: 6 13 3.9 12 39
Waiting: 6 12 3.9 11 39
Total: 6 13 3.9 12 39
Percentage of the requests served within a certain time (ms)
50% 12
66% 12
75% 12
80% 13
90% 15
95% 24
98% 26
99% 30
100% 39 (longest request)
2) Last 100 logs
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Server Software: log-server
Server Hostname: localhost
Server Port: 3010
Document Path: /log
Document Length: 4601 bytes
Concurrency Level: 10
Time taken for tests: 46.528 seconds
Complete requests: 5000
Failed requests: 0
Write errors: 0
Total transferred: 25620233 bytes
HTML transferred: 23005000 bytes
Requests per second: 107.46 [#/sec] (mean)
Time per request: 93.057 [ms] (mean)
Time per request: 9.306 [ms] (mean, across all concurrent requests)
Transfer rate: 537.73 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 1
Processing: 28 93 16.4 92 166
Waiting: 26 85 18.0 86 161
Total: 29 93 16.4 92 166
Percentage of the requests served within a certain time (ms)
50% 92
66% 97
75% 101
80% 104
90% 113
95% 121
98% 131
99% 137
100% 166 (longest request)
Retrieving data from the database
To query the database I use the mongoosejs module. The log schema is defined as:
{
date: { type: Date, 'default': Date.now, index: true },
message: String
}
and the query I execute is the following:
Log.find({}, ['message']).sort('date', -1).limit(100)
Can node applications be profiled in detail?
Yes. Use node --prof app.js to create a v8.log, then use linux-tick-processor, mac-tick-processor or windows-tick-processor.bat (in deps/v8/tools in the node src directory) to interpret the log. You have to build d8 in deps/v8 to be able to run the tick processor.
Here's how I do it on my machine:
apt-get install scons
cd ~/development/external/node-0.6.12/deps/v8
scons arch=x64 d8
cd ~/development/projects/foo
node --prof app.js
D8_PATH=~/development/external/node-0.6.12/deps/v8 ~/development/external/node-0.6.12/deps/v8/tools/linux-tick-processor > profile.log
There are also a few tools to make this easier, including node-profiler and v8-profiler (with node-inspector).
Regarding your other question, I would like some more information on how you fetch your data from Mongo, and what the data looks like (I agree with beny23 that it looks like a suspiciously low amount of data).
I strongly suggest taking a look at the DTrace support of Restify. It will likely become your best friend when profiling.
http://mcavage.github.com/node-restify/#DTrace