Payload size when upserting via SODA.NET API - socrata

I am using the SODA.NET library to update a very large dataset. I've found the larger the payload, thus less chatty, is more performant. I'm currently sending 1000 records at a time.
What is the recommended upper limit to the payload size for an upsert call? Are there any hidden gotchas to be aware of as I increase the size?
Thanks!

With SODA 2 endpoints, you can throw all of your data at it (that's the default with the R-based libraries, for instance). Socrata will ingest at its own pace. Performance will depend a lot on network performance and number of columns.
Older SODA 1 endpoints capped around 50k.

Related

Offline Data Augmentation in Google Colab - Problems with RAM

What would be most efficient way to perform OFFLINE data augmentation in Google Colab?
Since I am not from US I cannot purchase Google Chrome for bigger RAM, so I am trying to be "smart" about it. For example, when I finish loading 11000 images, first as NumPy arrays and then creating pandas DataFrame from them, it occupies around 7.5GB of RAM. Problem is, I tried to del every object (NumPy array, tf.data object etc) in order to check if RAM changes, and RAM does not changed.
Is it better to try and be smart about RAM or maybe write to disk any time I augment image and do not keep anything in RAM? If this is the case, is using TFRecords a smart approach for this?

When to use tensorflow datasets api versus pandas or numpy

There are a number of guides I've seen on using LSTMs for time series in tensorflow, but I am still unsure about the current best practices in terms of reading and processing data - in particular, when one is supposed to use the tf.data.Dataset API.
In my situation I have a file data.csv with my features, and would like to do the following two tasks:
Compute targets - the target at time t is the percent change of
some column at some horizon, i.e.,
labels[i] = features[i + h, -1] / features[i, -1] - 1
I would like h to be a parameter here, so I can experiment with different horizons.
Get rolling windows - for training purposes, I need to roll my features into windows of length window:
train_features[i] = features[i: i + window]
I am perfectly comfortable constructing these objects using pandas or numpy, so I'm not asking how to achieve this in general - my question is specifically what such a pipeline ought to look like in tensorflow.
Edit: I guess that I'd also like to know whether the 2 tasks I listed are suited for the dataset api, or if i'm better off using other libraries to deal with them?
First off, note that you can use dataset API with pandas or numpy arrays as described in the tutorial:
If all of your input data fit in memory, the simplest way to create a
Dataset from them is to convert them to tf.Tensor objects and use
Dataset.from_tensor_slices()
A more interesting question is whether you should organize data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From "Performance Guide":
While feeding data using a feed_dict offers a high level of
flexibility, in most instances using feed_dict does not scale
optimally. However, in instances where only a single GPU is being used
the difference can be negligible. Using the Dataset API is still
strongly recommended. Try to avoid the following:
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
But, as they say themselves, the difference may be negligible and the GPU can still be fully utilized with ordinary feed_dict input. When the training speed is not critical, there's no difference, use any pipeline you feel comfortable with. When the speed is important and you have a large training set, the Dataset API seems a better choice, especially you plan distributed computation.
The Dataset API works nicely with text data, such as CSV files, checkout this section of the dataset tutorial.

Using eventlet to process large number of https requests with large data returns

I am trying to use eventlets to process a large number of data requests, approx. 100,000 requests at a time to a remote server, each of which should generate a 10k-15k byte JSON response. I have to decode the JSON, then perform some data transformations (some field name changes, some simple transforms like English->metric, but a few require minor parsing), and send all 100,000 requests out the back end as XML in a couple of formats expected by a legacy system. I'm using the code from the eventlet example which uses imap() "for body in pool.imap(fetch, urls):...."; lightly modified. eventlet is working well so far on a small sample (5K urls), to fetch the JSON data. My question is whether I should add the non-I/O processing (JSON decode, field transform, XML encode) to the "fetch()" function so that all that transform processing happens in the greenthread, or should I do the bare minimum in the greenthread, return the raw response body, and do the main processing in the "for body in pool.imap():" loop? I'm concerned that if I do the latter, the amount of data from completed threads will start building up, and will bloat memory, where doing the former would essentially throttle the process to where the XML output would keep up. Suggestions as to preferred method to implement this welcome. Oh, and this will eventually run off of cron hourly, so it really has a time window it has to fit into. Thanks!
Ideally, you put each data processing operation into separate green thread. Then, only when required, combine several operations into batch or use a pool to throttle concurrency.
When you do non-IO-bound processing in one loop, essentially you throttle concurrency to 1 simultaneous task. But you can run those in parallel using (OS) thread pool in eventlet.tpool module.
Throttle concurrency only when you have too many parallel CPU-bound code running.

Handling lots of req / sec in go or nodejs

I'm developing a web app that needs to handle bursts of very high loads,
once per minute I get a burst of requests in very few seconds (~1M-3M/sec) and then for the rest of the minute I get nothing,
What's my best strategy to handle as many req /sec as possible at each front server, just sending a reply and storing the request in memory somehow to be processed in the background by the DB writer worker later ?
The aim is to do as less as possible during the burst, and write the requests to the DB ASAP after the burst.
Edit : the order of transactions in not important,
we can lose some transactions but 99% need to be recorded
latency of getting all requests to the DB can be a few seconds after then last request has been received. Lets say not more than 15 seconds
This question is kind of vague. But I'll take a stab at it.
1) You need limits. A simple implementation will open millions of connections to the DB, which will obviously perform badly. At the very least, each connection eats MB of RAM on the DB. Even with connection pooling, each 'thread' could take a lot of RAM to record it's (incoming) state.
If your app server had a limited number of processing threads, you can use HAProxy to "pick up the phone" and buffer the request in a queue for a few seconds until there is a free thread on your app server to handle the request.
In fact, you could just use a web server like nginx to take the request and say "200 OK". Then later, a simple app reads the web log and inserts into DB. This will scale pretty well, although you probably want one thread reading the log and several threads inserting.
2) If your language has coroutines, it may be better to handle the buffering yourself. You should measure the overhead of relying on our language runtime for scheduling.
For example, if each HTTP request is 1K of headers + data, want to parse it and throw away everything but the one or two pieces of data that you actually need (i.e. the DB ID). If you rely on your language coroutines as an 'implicit' queue, it will have 1K buffers for each coroutine while they are being parsed. In some cases, it's more efficient/faster to have a finite number of workers, and manage the queue explicitly. When you have a million things to do, small overheads add up quickly, and the language runtime won't always be optimized for your app.
Also, Go will give you far better control over your memory than Node.js. (Structs are much smaller than objects. The 'overhead' for the Keys to your struct is a compile-time thing for Go, but a run-time thing for Node.js)
3) How do you know it's working? You want to be able to know exactly how you are doing. When you rely on the language co-routines, it's not easy to ask "how many threads of execution do I have and what's the oldest one?" If you make an explicit queue, those questions are much easier to ask. (Imagine a handful of workers putting stuff in the queue, and a handful of workers pulling stuff out. There is a little uncertainty around the edges, but the queue in the middle very explicitly captures your backlog. You can easily calculate things like "drain rate" and "max memory usage" which are very important to knowing how overloaded you are.)
My advice: Go with Go. Long term, Go will be a much better choice. The Go runtime is a bit immature right now, but every release is getting better. Node.js is probably slightly ahead in a few areas (maturity, size of community, libraries, etc.)
How about a channel with a buffer size equal to what the DB writer can handle in 15 seconds? When the request comes in, it is sent on the channel. If the channel is full, give some sort of "System Overloaded" error response.
Then the DB writer reads from the channel and writes to the database.

Very large HTTP request vs many small requests

I need a 2D array (as Json) to be sent from server to client. It would be around 400x400 in size with each entry around 4 characters of text. So that makes it around 640KB of data.
Which of the following extreme approaches is better ?
I make a large HTTP request of all the data at one go.
I make 400 requests - each asking for a single row (around 1.6 KB)
I believe optimal approach would be somewhere in middle. Could anyone give me an idea what might be the optimal single request size for this data?
Thanks.
Couple of considerations for choosing one big vs several small:
In the single request case, you can't do progressive data processing as the data arrives; you need to wait for the full packet to arrive before you can do anything. If it fails, you need to start everything from scratch.
In the multiple requests case, you can do progressive data processing. However, you now have to consider the potential for multiple failures and how to recover from these.
Multiple requests incur overhead for each request. This is additional bandwidth you app will be consuming.
Some HTTP agents limit the number of concurrent requests to the same server, and you might need to do some logic to work around that.
Response compression will work better for the single request case.
Multiple requests won't require you to allocate the full memory for your data. Granted, 640KB is not that big chunk of memory, so that might not be a big consideration for you, depending on how often you will allocate it.
In the case of early terminate of the process (either a Cancel button or the app is terminated or the browser navigates away from your page), the single request will still finish the full response download; however, for the multiple requests case, any request your code hasn't started yet will not be executed.
Honestly, I wouldn't be that worried about the last two and would base my choice on 1) is progressive data processing important; and 2) what your app tolerance is for failures and partial data.
Unless you are dealing with slow (very slow by today's standards) connections and really need incremental updates, do it in one request.
That gives you better efficiency for compressing the response, and avoids the overhead of the extra HTTP requests and response headers.
And you have to keep in mind that the servers may have vulnerabilities with large requests.