My program use libpcap like this:
while pcaket = pcap_next() {
...
(modify the pcaket and do checksum)
...
pcap_sendpacket(pcaket)
}
Recently, I found there is a memory leak in my program...
My question is:
Will the libpcap free the pcaket after pcap_next? or I have to do the free work myself ?
Will the libpcap free the pcaket after pcap_next?
The packet is contained in a buffer internal to libpcap (attached to the pcap_t); a new buffer is not allocated for each packet, so the buffer isn't freed after pcap_next(), it's freed after the pcap_t is closed. You do not have to free it yourself.
(This also means that the packet data from a particular call to pcap_next() or pcap_next_ex() is not guaranteed to remain valid after the next call to pcap_next() or pcap_next_ex() - or pcap_loop() or pcap_dispatch(); it might be overwritten with data from the next packet or the next batch of packets.)
Related
I have a library which uses libpcap to capture packets. I'm using pcap_loop() in a dedicated thread for the capture and pcap_breakloop() to stop the capture.
The packet buffer timeout is set to 500ms.
In some rare cases I am missing the last packets that my application sends before calling pcap_breakloop().
Reading the libpcap documentation I ended up wondering if the packet loss is related to the packet buffer timeout. The documentation says:
packets are not delivered as soon as they arrive, but are delivered after a short delay (called a "packet buffer timeout")
What happens if pcap_breakloop() is called during this delay ? Are the packets in the buffer passed to the callback or are they dropped before pcap_loop() returns ?
I was unable to find the answer in the documentation.
Are the packets in the buffer passed to the callback
No.
or are they dropped before pcap_loop() returns ?
Yes. In capture mechanisms that buffer packets in kernel code and deliver them only when the buffer fills up or the timeout expires pcap_breakloop() doesn't force the packets to be delivered.
For some of those capture mechanisms there might be a way to force the timeout to, in effect, expire, but I don't know of any documented way to do that with Linux PF_PACKET sockets, BPF, or WinPcap/Npcap NPF.
Update, giving more details:
On Linux and Windows, pcap_breakloop() attempt to wake up anything that's blocked waiting for packets on the same pcap_t.
On Linux, this is implemented by having the poll() call in libpcap block on both the PF_PACKET socket being used for capturing and on an "event" descriptor; pcap_breakloop() causes the "event" descriptor to supply an event, so that the poll() wakes up even if there are no packets to pick up from the socket yet. That does not force the current chunk in the buffer (memory shared between the kernel and userland code) to be assigned to userland, so they're not provided to the caller of libpcap.
On Windows, with Npcap, an "event object" is used by the driver and Packet32 library (the libpcap part of Npcap calls routines in the Packet32 library) to allow the library to block waiting for packets and the driver to wake the library up when packets are available. pcap_breakloop() does a SetEvent() call on the handle for that object, which forces userland code waiting for packets to wake up; it then tries to read from the device. I'd have to spend more time looking at the driver code to see whether, if there are be buffered-but-not-delivered packets at that point, they will be delivered.
On all other platforms, pcap_breakloop() does not deliver a wakeup, as the capture mechanism either does no buffering or provides no mechanism to force a wakeup, so:
if no buffering is done, there's no packet buffer to flush;
if there's a timeout, code blocked on a read will be woken up when the timeout expires, and that buffer will be delivered to userland;
if there's no timeout, code blocked on a read could be blocked for an indefinite period of time.
The ideal situation would be if the capture mechanism provided, on all platforms that do buffering, a way for userland code to force the current buffer to be delivered, and thus to cause a wakeup. That would require changes to the NPF driver and Packet32 library in Npcap, and would require kernel changes in Linux, *BSD, macOS, Solaris, and AIX.
Update 2:
Note also that "break loop" means break out of the loop immediately, so even if all of the above were done, when the loop is exited, there might be packets remaining in libpcap's userland buffer. If you want those packets - even though, by calling pcap_breakloop(), you told libpcap "stop giving me packets" - you'll have put the pcap_t in non-blocking mode and call pcap_dispatch() to drain the userland buffer. (That won't drain the kernel buffer.)
I meet this problem when finishing the lab of my OS course. We are trying to implement a kernel with the function of system call (platform: QEMU/i386).
When testing the kernel, problem occurred that after kernel load user program to memory and change the CPU state from kernel mode to user mode using 'iret' instruction, CPU works in a strange way as following.
%EIP register increased by 2 each time no matter how long the current instrution is.
no instruction seems to be execute, for no other registers change meantime.
Your guest has probably ended up executing a block of zeroed out memory. In i386, zeroed memory disassembles to a succession of "add BYTE PTR [rax],al" instructions, each of which is two bytes long (0x00 0x00), and if rax happens to point to memory which reads as zeroes, this will effectively be a 2-byte-insn no-op, which corresponds to what you are seeing. This might happen because you set up the iret incorrectly and it isn't returning to the address you expected, or because you've got the MMU setup wrong and the userspace program isn't in the memory where you expect it to be, for instance.
You could confirm this theory using QEMU's debug options (eg -d in_asm,cpu,exec,int,unimp,guest_errors -D qemu.log will log a lot of execution information to a file), which should (among a lot of other data) show you what instructions it is actually executing.
My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Some other thread will consume the results that whill be stored in a queue Some pseudo-code here:
while(true) {
cudaMemcpyAsync(d_mem, h_mem, some_stream)
kernel_launch(some_stream)
cudaMemcpyAsync(h_queue_results[i++], d_result, some_stream)
}
Is safe to reuse the h_mem after the first cudaMemcpyAsync returns? or should I use N host buffers for issuing the gpu computation?
How to know when the h_mem can be reused? should I make some synchronization using cudaevents?
BTW. h_mem is host-pinned. If it was pageable, could I reuse h_mem inmediatly? from what I have read here it seems I could reuse inmediatly after memcpyasync returns, am i right?
Asynchronous
For transfers from pageable host memory to device memory, host memory
is copied to a staging buffer immediately (no device synchronization
is performed). The function will return once the pageable buffer has
been copied to the staging memory. The DMA transfer to final
destination may not have completed. For transfers between pinned host
memory and device memory, the function is fully asynchronous. For
transfers from device memory to pageable host memory, the function
will return only once the copy has completed. For all other transfers,
the function is fully asynchronous. If pageable memory must first be
staged to pinned memory, this will be handled asynchronously with a
worker thread. For transfers from any host memory to any host memory,
the function is fully synchronous with respect to the host.
MemcpyAsynchronousBehavior
Thanks!
In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next cuda operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
My program runs 2 threads - Thread A (for input) and B (for processing). I also have a pair of pointers to 2 buffers, so that when Thread A has finished copying data into Buffer 1, Thread B starts processing Buffer 1 and Thread A starts copying data into Buffer 2. Then when Buffer 2 is full, Thread A copies data into Buffer 1 and Thread B processes Buffer 2, and so on.
My problem comes when I try to cudaMemcpy Buffer[] into d_Buffer (which was previously cudaMalloc'd by the main thread, i.e. before thread creation. Buffer[] were also malloc'd by the main thread). I get a "invalid argument" error, but have no idea which is the invalid argument.
I've reduced my program to a single-threaded program, but still using 2 buffers. That is, the copying and processing takes place one after another, instead of simultaneously. The cudaMemcpy line is exactly the same as the double-threaded one. The single-threaded program works fine.
I'm not sure where the error lies.
Thank you.
Regards,
Rayne
If you are doing this with CUDA 3.2 or earlier, the reason is that GPU contexts are tied to a specific thread. If a multi-threaded program allocated memory on the same GPU from different host threads, the allocations wind up establishing different contexts, and pointers from one context are not portable to another context. Each context has its own "virtualised" memory space to work with.
The solution is to either use the context migration API to transfer a single context from thread to thread as they do work, or try the new public CUDA 4.0rc2 release, which should support what you are trying to do without the use of context migration. The downside is that 4.0rc2 is a testing release, and it requires a particular beta release driver. That driver won't work will all hardware (laptops for example).
I have a 100MB character array (h_array) that is allocated using cudaHostAlloc() with the flag cudaHostAllocWriteCombined.
The program first copies data into h_array on the host. When h_array is full, it will copy h_array to d_array on the device and some processing is done. When the processing is completed, h_array is reused in the sense that new data is copied to it again, starting from h_array[0]. The new data is meant to overwrite what was previously stored in h_array.
However, I'm getting segmentation fault when the new data is copied to h_array after processing is complete. There are no seg fault errors when I'm using regular malloc.
What is wrong? Can I not rewrite the memory when it's pinned?
Thank you!
Your CUDA context is probably getting yanked out from under you somehow.
For example, if you allocate the pinned host memory in a thread that then exits, the memory will go away.
Make sure the thread that performs the allocation sticks around!