Response signal when performing a store into the L1 Dcache of Rocket Chip Core - chisel

If I perform a store into the L1 Dcache, does the Rocket Chip core produce a resp valid signal, or is that only for loads? For a load you are requesting something and get something back in response, whereas for a store you just need to check that the memory interface is ready and send the request out. (I am talking about the io.mem.resp field.)

I figured out the answer. In short, a store goes through a plain ready-valid handshake on the request side: once io.mem.req.valid and io.mem.req.ready are both high the store is accepted, and there is no need to wait on io.mem.resp the way a load does.

Related

Using eventlet to process large number of https requests with large data returns

I am trying to use eventlets to process a large number of data requests, approx. 100,000 requests at a time to a remote server, each of which should generate a 10k-15k byte JSON response. I have to decode the JSON, then perform some data transformations (some field name changes, some simple transforms like English->metric, but a few require minor parsing), and send all 100,000 records out the back end as XML in a couple of formats expected by a legacy system.

I'm using the code from the eventlet example which uses imap() ("for body in pool.imap(fetch, urls): ..."), lightly modified. eventlet is working well so far on a small sample (5K urls) for fetching the JSON data.

My question is whether I should add the non-I/O processing (JSON decode, field transform, XML encode) to the fetch() function so that all that transform processing happens in the greenthread, or whether I should do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop. I'm concerned that if I do the latter, the amount of data from completed threads will start building up and bloat memory, whereas the former would essentially throttle the process so that the XML output keeps up. Suggestions as to the preferred way to implement this are welcome. Oh, and this will eventually run hourly from cron, so it really has a time window it has to fit into. Thanks!
Ideally, you put each data processing operation into a separate green thread. Then, only when required, combine several operations into a batch or use a pool to throttle concurrency.
When you do non-IO-bound processing in one loop, you essentially throttle concurrency to one simultaneous task. But you can run those in parallel using the (OS) thread pool in the eventlet.tpool module.
Throttle concurrency only when you have too much parallel CPU-bound code running.

Handling lots of req / sec in go or nodejs

I'm developing a web app that needs to handle bursts of very high load: once per minute I get a burst of requests within a few seconds (~1M-3M/sec), and then for the rest of the minute I get nothing.

What's my best strategy for handling as many req/sec as possible at each front server? Just send a reply and store the request in memory somehow, to be processed in the background by a DB-writer worker later? The aim is to do as little as possible during the burst, and write the requests to the DB as soon as possible after the burst.

Edit: the order of transactions is not important, and we can lose some transactions, but 99% need to be recorded. The latency of getting all requests into the DB can be a few seconds after the last request has been received; let's say no more than 15 seconds.
This question is kind of vague. But I'll take a stab at it.
1) You need limits. A simple implementation will open millions of connections to the DB, which will obviously perform badly. At the very least, each connection eats megabytes of RAM on the DB. Even with connection pooling, each 'thread' could take a lot of RAM to record its (incoming) state.
If your app server has a limited number of processing threads, you can use HAProxy to "pick up the phone" and buffer the request in a queue for a few seconds until there is a free thread on your app server to handle it.
In fact, you could just use a web server like nginx to take the request and say "200 OK". Then later, a simple app reads the web log and inserts into DB. This will scale pretty well, although you probably want one thread reading the log and several threads inserting.
2) If your language has coroutines, it may be better to handle the buffering yourself. You should measure the overhead of relying on your language runtime for scheduling.
For example, if each HTTP request is 1K of headers + data, you want to parse it and throw away everything but the one or two pieces of data that you actually need (e.g. the DB ID). If you rely on your language's coroutines as an 'implicit' queue, each coroutine will hold a 1K buffer while its request is being parsed. In some cases, it's more efficient/faster to have a finite number of workers and manage the queue explicitly. When you have a million things to do, small overheads add up quickly, and the language runtime won't always be optimized for your app.
Also, Go will give you far better control over your memory than Node.js. (Structs are much smaller than objects. The 'overhead' for the Keys to your struct is a compile-time thing for Go, but a run-time thing for Node.js)
3) How do you know it's working? You want to be able to know exactly how you are doing. When you rely on the language co-routines, it's not easy to ask "how many threads of execution do I have and what's the oldest one?" If you make an explicit queue, those questions are much easier to ask. (Imagine a handful of workers putting stuff in the queue, and a handful of workers pulling stuff out. There is a little uncertainty around the edges, but the queue in the middle very explicitly captures your backlog. You can easily calculate things like "drain rate" and "max memory usage" which are very important to knowing how overloaded you are.)
My advice: Go with Go. Long term, Go will be a much better choice. The Go runtime is a bit immature right now, but every release is getting better. Node.js is probably slightly ahead in a few areas (maturity, size of community, libraries, etc.)
How about a channel with a buffer size equal to what the DB writer can handle in 15 seconds? When the request comes in, it is sent on the channel. If the channel is full, give some sort of "System Overloaded" error response.
Then the DB writer reads from the channel and writes to the database.
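The same bounded-buffer pattern, sketched below in C with POSIX threads for concreteness (db_write() is a hypothetical stand-in for your actual insert). In Go, a buffered channel plus a non-blocking send (a select with a default case) gives you the same behavior for free.

#include <pthread.h>

#define QUEUE_CAP 4096                   /* ~ what the DB writer can drain in 15 seconds */

typedef struct { int id; } request_t;    /* the one or two fields you actually keep */

void db_write(const request_t *r);       /* hypothetical: your actual INSERT */

static request_t queue[QUEUE_CAP];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

/* Called from the front end: returns 0 on success, -1 for "System Overloaded". */
int enqueue(request_t r)
{
    pthread_mutex_lock(&lock);
    if (count == QUEUE_CAP) {            /* full: reject instead of blocking */
        pthread_mutex_unlock(&lock);
        return -1;
    }
    queue[tail] = r;
    tail = (tail + 1) % QUEUE_CAP;
    count++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
    return 0;
}

/* The DB-writer worker: drains the queue in the background. */
void *db_writer(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&nonempty, &lock);
        request_t r = queue[head];
        head = (head + 1) % QUEUE_CAP;
        count--;
        pthread_mutex_unlock(&lock);
        db_write(&r);
    }
    return NULL;
}

The fixed-size array is what bounds memory during the burst, and the reject-on-full branch is where the "System Overloaded" response comes from.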

Node.js: does JSON.parse block the event loop?

Using JSON.parse is the most common way to parse a JSON string into a JavaScript object.
It is synchronous code, but does it actually block the event loop (since it's much lower level than the user's code)?
Is there an easy way to parse JSON asynchronously? Should it matter at all for a few KB to a few hundred KB of JSON data?
A function that does not accept a callback or return a promise blocks until it returns a value.
So yes, JSON.parse blocks. Parsing JSON is a CPU-intensive task, and JS is single-threaded, so the parsing has to block the main thread at some point. Async only makes sense when waiting on another process or system (which is why disk I/O and networking are a good fit for async: they have far more latency than raw CPU processing).
I'd first prove that parsing JSON is actually a bottleneck for your app before you start optimizing its parsing. I suspect it's not.
If you think that you might have a lot of heavy JSON decoding to do, consider moving it out to another process. I know it might seem obvious, but the key to using node.js successfully is in the name.
To set up another 'node' to handle a CPU-heavy task, use IPC. Simple sockets will do, but ØMQ adds a touch of radioactive magic in that it supports a variety of transports.
It might be that the overhead of connecting a socket and sending the JSON is more costly overall, but it will certainly stop the blocking.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated through device memory, but how do I sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide, end of section 2.2, Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try, you may find partial success. However, this will fail on different OSes and different executions of your application, and it will be broken by future driver updates and new hardware (GK110 supports an enhanced concurrency model).
If I understood your question correctly, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As far as I know, there are no reliable facilities for inter-kernel synchronization provided by CUDA, and I'm not aware of any suitable trick that can be applied here.
The CUDA C Programming Guide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. Off the top of my head, calling cudaStreamSynchronize on each stream in turn (or a single cudaDeviceSynchronize) should do the trick, but it may not be that easy.
Your data must be in global memory
You need to get the device pointer on the host
You must pass that same pointer to the second kernel
Your code would look something like this:
float *dataToExchange_d;                              // host pointer to device memory (float is just an example type)
cudaMalloc((void **)&dataToExchange_d, dataSize);     // dataSize = number of bytes to share
kernel1<<<M1, N1, 0, stream1>>>(dataToExchange_d);
cudaStreamSynchronize(stream1);                       // ensure kernel1 finished before kernel2 reads
kernel2<<<M2, N2, 0, stream2>>>(dataToExchange_d);
But note that stream synchronization slows down the process, so you should avoid it as much as possible.
You can also get stream synchronization through CUDA events. It's less obvious and gives no special advantage here, but it's useful to know ;-)
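Concretely, here is a minimal sketch of the event-based variant. Unlike cudaStreamSynchronize, cudaStreamWaitEvent makes all work issued to stream2 after the call wait for the event recorded in stream1, without blocking the host thread (kernel names and launch parameters are reused from the snippet above):

cudaEvent_t kernel1Done;
cudaEventCreateWithFlags(&kernel1Done, cudaEventDisableTiming);

kernel1<<<M1, N1, 0, stream1>>>(dataToExchange_d);
cudaEventRecord(kernel1Done, stream1);        // marks the point after kernel1 in stream1

// stream2 waits for the event; the host thread does not block
cudaStreamWaitEvent(stream2, kernel1Done, 0);
kernel2<<<M2, N2, 0, stream2>>>(dataToExchange_d);

cudaEventDestroy(kernel1Done);                // clean up when no longer needed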

pcap_dispatch - callback processing questions

I am writing a fairly simple pcap "live" capture engine; however, my packet-processing callback implementation for pcap_dispatch takes a relatively long time per packet.
Does pcap run every "pcap_handler" callback in a separate thread? If so, is "pcap_handler" thread-safe, or should care be taken to protect it with critical sections?
Alternatively, does the pcap_dispatch callback work serially? I.e., is "pcap_handler" for packet 2 called only after "pcap_handler" for packet 1 is done? If so, is there an approach to avoid accumulating latency?
Thanks,
-V
Pcap basically works like this: there is a kernel-mode driver capturing packets and placing them in a buffer of size B. The user-mode application may request any number of packets at any time using pcap_loop, pcap_dispatch, or pcap_next (the latter is basically pcap_dispatch for one packet).
Therefore, when you use pcap_dispatch to request some packets, libpcap goes to the kernel, asks for the next packet in the buffer (if there isn't one, the timeout code and such kicks in, but that is irrelevant to this discussion), transfers it into userland, and deletes it from the buffer. After that, pcap_dispatch calls your handler, decrements its packets-to-do counter, and starts from the beginning. As a result, pcap_dispatch only returns once the requested number of packets has been processed, an error occurred, or a timeout happened.
As you can see, libpcap is completely non-threaded, as most C APIs are. The kernel-mode driver, however, is obviously happy to deliver packets to multiple threads (otherwise you wouldn't be able to capture from more than one process), and is completely thread-safe (there is a separate buffer for each user-mode handle).
This implies that you must implement all parallelism yourself. You'd want to do something like this:
pcap_dispatch(p, count, handler, user);
.
.
.
struct pcap_work_item {
    struct pcap_pkthdr header;
    u_char data[];                      /* flexible array member: the packet bytes */
};

void handler(u_char *user, const struct pcap_pkthdr *header, const u_char *data)
{
    struct pcap_work_item *item =
        malloc(sizeof(struct pcap_work_item) + header->caplen);
    if (!item)
        return;                         /* drop the packet if out of memory */
    item->header = *header;
    memcpy(item->data, data, header->caplen);
    queue_work_item(item);
}
Note that we have to copy the packet into the heap, because the header and data pointers are invalid after the callback returns.
The function queue_work_item should find a worker thread and assign it the task of handling the packet. Since you said that your callback takes a 'relatively long time', you likely need a large number of worker threads; finding a suitable number is a matter of fine-tuning.
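For illustration, queue_work_item could be a small condition-variable queue feeding a fixed pool of workers. This is a minimal sketch that continues the struct pcap_work_item definition above; process_packet() is hypothetical, standing in for your slow per-packet logic:

#include <pthread.h>
#include <stdlib.h>

void process_packet(struct pcap_work_item *item);   /* hypothetical: your slow handler */

struct work_node {
    struct work_node *next;
    struct pcap_work_item *item;
};

static struct work_node *q_head, *q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

void queue_work_item(struct pcap_work_item *item)
{
    struct work_node *n = malloc(sizeof *n);
    n->item = item;
    n->next = NULL;
    pthread_mutex_lock(&q_lock);
    if (q_tail) q_tail->next = n; else q_head = n;   /* append at the tail */
    q_tail = n;
    pthread_cond_signal(&q_nonempty);
    pthread_mutex_unlock(&q_lock);
}

void *worker(void *arg)        /* start N of these with pthread_create */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_lock);
        while (!q_head)
            pthread_cond_wait(&q_nonempty, &q_lock);
        struct work_node *n = q_head;                /* pop from the head */
        q_head = n->next;
        if (!q_head) q_tail = NULL;
        pthread_mutex_unlock(&q_lock);
        process_packet(n->item);
        free(n->item);
        free(n);
    }
    return NULL;
}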
At the beginning of this post I said that the kernel-mode driver has a buffer to collect incoming packets which await processing. The size of this buffer is implementation-defined. The snaplen parameter to pcap_open_live only controls how many bytes of one packet are captured; the number of packets that can be buffered cannot be controlled in a portable fashion. It might be fixed-size; it might grow as more and more packets arrive. However, if it overflows, all further packets are discarded until there is enough space for the next one to arrive. If you want to use your application in a high-traffic environment, you want to make sure that your pcap_dispatch callback completes quickly. My sample callback simply hands the packet to a worker, so it works fine even in high-traffic environments.
I hope this answers all your questions.