Is there really no bulk insert in the AppFabric Cache 1.1 API?

I'm looking at the API and I don't see any way to send a collection of individual objects to the cache at the same time.
Given that latency can be a significant issue, wouldn't bulk inserts afford the possibility of huge performance gains under certain circumstances?
I sure hope I'm missing something; this is probably a deal breaker for us, since we need to initialize our cache with large amounts of data at system start and time is precious.

You are right that bulk operations give a performance boost: the usual Insert, Add, and Remove methods work with single key-value pairs, while bulk operations let you work with a collection of key-value pairs in one call. The idea behind bulk operations is to minimize cluster traffic by bundling multiple operations into a single call, which is handy when you want to work with a collection of items. I don't think AppFabric provides for this. You can try NCache, which comes with bulk operations and also supports bulk loading at the cluster level on startup:
http://www.alachisoft.com/ncache/
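To make the latency point concrete, here is a minimal TypeScript sketch, against a hypothetical CacheClient interface rather than the real AppFabric API, comparing serialized per-item puts with a client-side workaround that overlaps the round trips:

// Hypothetical cache client: each call costs one network round trip.
interface CacheClient {
  put(key: string, value: unknown): Promise<void>;
}

// Naive initialization: N items => N sequential round trips.
async function initOneByOne(cache: CacheClient, items: Map<string, unknown>) {
  for (const [key, value] of items) {
    await cache.put(key, value); // pays the full latency for every item
  }
}

// Without a server-side bulk API, the best a client can do is overlap
// the round trips instead of serializing them.
async function initConcurrently(
  cache: CacheClient,
  items: Map<string, unknown>,
  window = 32
) {
  const entries = [...items];
  for (let i = 0; i < entries.length; i += window) {
    await Promise.all(entries.slice(i, i + window).map(([k, v]) => cache.put(k, v)));
  }
}

Overlapping requests only hides latency; a genuine bulk call, like the ones NCache exposes, also saves the per-operation network and serialization overhead.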

Related

Should I implement my own caching or rely on read-replicas?

We have an enterprise application that uses an SQL database. The database access characteristics are about 90% reads. The data that does get updated or created needs to be up-to-date immediately. The cache needs to be correctly invalidated with high certainty. The entities are referred to by their primary key for 98% of the cases.
The application is based on Node.js and is AWS-native. Since the application is AWS-native, I'd like to rely on managed services from AWS rather than hosting my own. One option is to implement our own read-through Redis-based cache. Upon retrieving the entities, we'd check the cache, and if the data is not cached we'd put it into the cache before returning it to the user. The parts of the code that update those entities will invalidate the cache by primary key.
Generally speaking, in computer science cache coherency is one of the most challenging problems to get right. I am of the opinion that rather than implementing a Redis cache and thinking through all of the possible scenarios for correctly invalidating it, it is wiser to instead configure an Aurora read-replica specifically for reading frequently accessed entities. The RDBMS will do a much better job at caching than anything we can build ourselves.
So, I am facing two options -- go through the effort of implementing my own caching, or use read replicas. My personal opinion is to use a read replica.
Any advice is greatly appreciated, as always.
Yes, you're right, cache invalidation is a tough problem. The simplest solution is to add code to your data writes that replaces the cached values, so they're always current. But this is easy only if the cached values have a pretty much 1-to-1 correlation with rows in your database.
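A minimal sketch of that pattern in TypeScript with the node-redis client, since your stack is Node.js; Entity, db.findById, and db.update are hypothetical stand-ins for your data layer:

import { createClient } from "redis";

interface Entity { id: string; [field: string]: unknown }

// Hypothetical data-access layer backed by the primary database.
declare const db: {
  findById(id: string): Promise<Entity>;
  update(id: string, patch: Partial<Entity>): Promise<void>;
};

const redis = createClient(); // assumes a reachable Redis/ElastiCache node
await redis.connect();

// Read path: check the cache first, fall back to the database.
async function getEntity(id: string): Promise<Entity> {
  const hit = await redis.get(`entity:${id}`);
  if (hit) return JSON.parse(hit);
  const row = await db.findById(id);
  await redis.set(`entity:${id}`, JSON.stringify(row), { EX: 3600 });
  return row;
}

// Write path: update the database, then drop the stale cache entry
// so the next read repopulates it with current data.
async function updateEntity(id: string, patch: Partial<Entity>): Promise<void> {
  await db.update(id, patch);
  await redis.del(`entity:${id}`);
}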
An advantage of your own cache is that you can cache data that is not 1-to-1 with rows of data in the database. You might cache an entire HTML fragment for a drop-down menu for example. That could be the result of several SQL queries. It could be quite an advantage to cache data that is higher up the "food chain" so to speak. But cache invalidation becomes less straightforward. Best for storing results of queries that don't change often.
Using a read-replica is not a substitute for using a cache. Querying a read-replica still has overhead of making a database connection, authentication, SQL query parsing and optimization, locking, and all the other overhead that goes into RDBMS workings.
Querying data from a cache can be orders of magnitude faster.
Both have their place. It's best to use both a cache and a read-replica for different tasks. I would also add message queues as an important technology. I believe database, cache, and queue form a three-legged stool.
But you must have experience and judgment to know when each is the best tool for a given case.

What would be a preferable approach for rendering time series data

We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph based on it (say, for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as and when new data is received. The data could be retrieved every minute. Should I use node.js for this?
Ours is a Rails application, and I would like to know what libraries/databases I should use to model this capability.
It depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is better. If most of the queries are get and set with key->value, meaning you are going to fetch the data using one key, and you need to support many clients and many sets/gets per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a great review of some of the most popular ones.
There are so many ways to do this. If getting an update once a minute is enough, have the client make an AJAX call every minute to get the updated data; you can then build your server side using PHP, .NET, a Java servlet, or node.js, again depending on the expected user concurrency. PHP is very easy to develop in, while node.js can support many short I/O requests. Another option you might want to consider is server push (Node's socket.io, for example) instead of client AJAX calls. That way the client is notified immediately on an update.
Personally, I like both node.js and Redis and have used them in a couple of production applications, supporting many concurrent users on a single server. I like node since it's easy to develop with and supports many users, and Redis for its amazing speed and concurrency. Having said that, I also use MySQL for saving relational data, and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are so many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are going for low latency and ease of implementation. You can use Pub/Sub to push updates to clients (e.g. via Node's socket.io) in real time and run a second Redis instance to store the JSON data as a sorted set, using the timestamp as the score. I've used the same approach with much success for storing time-based statistical data. The downside to this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
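For illustration, a sketch of that layout with the node-redis client; the key pattern ticks:<symbol> and the channel name are my assumptions:

import { createClient } from "redis";

const redis = createClient();
await redis.connect();

// Store each tick in a sorted set scored by its timestamp, and fan it
// out over Pub/Sub (e.g. to a socket.io bridge) at the same time.
async function recordTick(symbol: string, price: number, ts: number) {
  const member = JSON.stringify({ ts, price });
  await redis.zAdd(`ticks:${symbol}`, [{ score: ts, value: member }]);
  await redis.publish("ticks", JSON.stringify({ symbol, ts, price }));
}

// Range query: all ticks for a symbol in a time window, already ordered.
async function ticksBetween(symbol: string, from: number, to: number) {
  const raw = await redis.zRangeByScore(`ticks:${symbol}`, from, to);
  return raw.map((member) => JSON.parse(member));
}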
If you are looking to store a lot of data in JSON format and want the client to pull data every minute, then using ElasticSearch to store/retrieve the data is another possibility. You can use ElasticSearch's range query to search on a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API. 
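If you went that way, the query might look like this through the official @elastic/elasticsearch Node client; the index name ticks, the @timestamp field, and the endpoint are assumptions:

import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" }); // assumed endpoint

// Fetch every tick indexed since dateFrom, oldest first.
async function latestTicks(dateFrom: string) {
  const result = await es.search({
    index: "ticks",
    query: {
      range: { "@timestamp": { gte: dateFrom, lte: "now" } },
    },
    sort: [{ "@timestamp": "asc" }],
  });
  return result.hits.hits.map((hit) => hit._source);
}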
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
Also, MongoDB might be a good fit, since it stores JSON as BSON.
But if it's just serving data, even something as simple as memcached would suffice.
If you have a lot of data to keep updated in real time, like stock ticker prices, the solution should involve the server publishing to the client, not the client continually hitting the server for updates. A publish/subscribe (pub/sub) model over websockets might be a good choice at the moment, depending on your client requirements.
For plotting the data using data from websockets there is already a question about that here.
Ruby-toolbox has a category called HTTP Pub Sub, which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with the data aside from just streaming stock prices. Redis may be the better choice for performance. Note also that websocket-rails assumes Redis, if you were to use that, just as an example.
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
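As a rough sketch of the push side (using the plain ws library for Node rather than any of the Rails gems above):

import { WebSocketServer, WebSocket } from "ws";

// Every connected browser gets each tick pushed to it; nobody polls.
const wss = new WebSocketServer({ port: 8080 });

function broadcastTick(tick: { symbol: string; price: number; ts: number }) {
  const payload = JSON.stringify(tick);
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

// Call this from wherever the feed arrives (upstream socket, queue, poller):
// broadcastTick({ symbol: "MSFT", price: 20, ts: Date.now() });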
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from Square.

SQL Azure performance considerations

What performance considerations should I keep in mind when planning an SQL Azure application? Azure Storage and the worker and web roles look very scalable, but if in the end they are all using one database, it looks like the bottleneck.
I was trying to find numbers about:
How many concurrent connections does SQL Azure support?
What is the bandwidth?
But no luck.
For example, I'm planning an application that does a very high volume of inserts, but I need to return the result of an aggregate function each time (e.g. the sum of all records with the same key in a column), so I cannot go with table storage.
Batching is an option, but response time is critical as well, so I'm afraid the database will be flooded with lots of connections.
Sharding is another option, but even though the number of inserts is massive, the amount of data is very small: 4 to 6 columns with one PK and no FK. So even a 1 GB DB would be overkill (and an overpay :D) for a partition.
What performance factors should I keep in mind when facing this kind of application?
Cheers.
Achieving both scalability and performance can be very difficult, even in the cloud. Your question was primarily about scalability, so you may want to design your application in such a way that your data becomes "eventually" consistent, using queues for example. A worker role would listen for incoming insert requests and would perform the insert asynchronously.
To minimize the number of round trips to the database and optimize connection pooling, make sure to batch your inserts as well, so you could send 100 inserts in one shot. Also keep in mind that SQL Azure now supports MARS (Multiple Active Result Sets), so you can return multiple SELECTs in a single batch back to the calling code. The use of batching and MARS should reduce the number of database connections to a minimum.
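A rough sketch of client-side insert batching; execute is a hypothetical stand-in for whatever data-access call you use against SQL Azure, and the Readings table is made up:

// Hypothetical executor: one database round trip per call.
declare function execute(sql: string, params: unknown[]): Promise<void>;

// Accumulate rows and flush them as one multi-row INSERT, so 100
// inserts cost a single round trip instead of 100.
const pending: Array<[string, number]> = [];

function queueInsert(key: string, value: number) {
  pending.push([key, value]);
}

async function flush(batchSize = 100) {
  while (pending.length > 0) {
    const batch = pending.splice(0, batchSize);
    const placeholders = batch.map(() => "(?, ?)").join(", ");
    const params = batch.flatMap(([k, v]) => [k, v]);
    await execute(`INSERT INTO Readings (K, V) VALUES ${placeholders}`, params);
  }
}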
Sharding usually helps for Read operations; not so much for inserts (although I never benchmarked inserts with sharding). So I am not sure sharding will help you that much for your requirements.
Remember that the Azure offering is designed first for scalability and reasonable performance in a multitenancy environment, where your database is shared with others on the same server. So if you need strong performance with guaranteed response time you may need to reevaluate your hosting choices or indeed test the performance boundaries of Azure for your needs as suggested by tijmenvdk.
SQL Azure will throttle your connections if any form of resource contention occurs (this includes heavy load but might also occur when your database is physically moved around). Throttling is non-deterministic, meaning that you cannot predict if and when it happens. When throttling, SQL Azure will drop your connection, requiring you to retry. The number of connections supported and the bandwidth are not published "by design" due to the flexible nature of the underlying infrastructure. Having said that, the setup is optimized for high availability, not high throughput.
If the bursts happen at a known time, you might consider sharding just during those bursts and consolidating the data after the burst has passed. Another way to handle this is to start queueing/batching writes if and only if throttling occurs. You can use an Azure Queue for that, plus a worker role to empty the queue later. This "overflow mechanism" has the advantage of engaging automatically when throttling occurs.
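A sketch of that overflow mechanism; insertDirect and enqueue are hypothetical, and the throttling check (error 40501) is an assumption you should verify against the current SQL Azure error codes:

interface Row { key: string; value: number }

declare function insertDirect(row: Row): Promise<void>; // hypothetical direct write
declare function enqueue(row: Row): Promise<void>;      // hypothetical Azure Queue put

// SQL Azure signals throttling with specific error numbers; 40501
// ("service is busy") is used here as an assumed example.
function isThrottlingError(err: unknown): boolean {
  return (err as { number?: number })?.number === 40501;
}

async function write(row: Row) {
  try {
    await insertDirect(row); // normal path: straight to the database
  } catch (err) {
    if (isThrottlingError(err)) {
      await enqueue(row);    // overflow engages only under throttling
    } else {
      throw err;
    }
  }
}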
As an alternative you could use Azure Table Storage and keep a separate table of running totals that you can report back instead of performing an aggregation over the data to return the required sum of all records (this might be tricky due to the lack of locking on the tables though).
Apologies for stating the obvious, but the first step would be to test if you run into throttling at all in your scenario. I would give the overflow solution a try.

Storing MySQL writes in Memory

This is a bit of an oddball question. I'm aware of using memcached to cache "read heavy" data in memory, but is it possible to do the same for writes?
For example: you have a chunk of data in memory (in memcached), and if you have to make any changes to that data, you make them in memory itself. At the end of a certain time period (an hour, or a day) you replicate all those changes to MySQL. So you are storing things in memory rather than on disk, and at the end of the time period those changes become permanent when they are copied over to MySQL.
Is there a piece of software that can accomplish this? Sample code, maybe?
Not possible for critical data, as memcached does not guarantee data consistency; entries can be evicted at any time and are lost on restart.
You can use such behavior for session data, though, and no special software is needed: just retrieve your data, alter it, and save it back.
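For example, a read-modify-write sketch against a hypothetical memcached-style client:

// Hypothetical memcached-style client interface.
declare const mc: {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
};

// Retrieve, alter in memory, save back; fine for session data where an
// occasional lost update is acceptable.
async function touchSession(sessionId: string) {
  const raw = await mc.get(`session:${sessionId}`);
  const session = raw ? JSON.parse(raw) : {};
  session.lastSeen = Date.now();
  await mc.set(`session:${sessionId}`, JSON.stringify(session), 3600);
}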
Yes, it can be done that way to reduce I/O load.
I believe what you described is a "write-behind" scenario for cached data.
The implementation can be done by creating task queues and setting up workers to do the job.
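A bare-bones write-behind sketch in TypeScript, flushing on a timer as the question describes; writeToMySQL is a hypothetical persistence call, and a production version would use a durable task queue and workers instead of an in-process timer:

declare function writeToMySQL(key: string, value: unknown): Promise<void>; // hypothetical

const cache = new Map<string, unknown>();
const dirty = new Set<string>();

// Writes hit memory only and are merely marked for later persistence.
function put(key: string, value: unknown) {
  cache.set(key, value);
  dirty.add(key);
}

// Periodically replicate the dirty entries to MySQL.
async function flush() {
  for (const key of dirty) {
    await writeToMySQL(key, cache.get(key));
    dirty.delete(key);
  }
}

// e.g. flush once an hour; a crash before a flush loses the unpersisted
// writes, which is exactly why this is unsafe for critical data.
setInterval(() => { void flush(); }, 60 * 60 * 1000);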
Perhaps MySQL Proxy can be of use for you? I haven't tried it so I don't know that it does what you require.

Is it a good idea to build in-memory indexes and circumvent the DB when operating intensively on a small subset?

I'm working on a program to automatically find optimal shift assignments, subject to lots of constraints. I'm using Grails, i.e. the data about workers, shifts, and assignments will be kept in a DBMS.
For the optimization itself, I'll have to work very intensively on a small subset of the data (about 600 rows total from about 5 different tables). I'll have to iterate over and search through various sub-subsets dozens of times to compute fitness functions, change some values, compute fitness again, lather, rinse, repeat, perhaps hundreds of times.
Now, while searching and iteration are exactly what a DBMS is for, I believe that in this case the overhead of hundreds of DB requests would dwarf the actual work being done, even for an in-memory DBMS like HSQLDB. So instead, I'm planning to slurp the entire subset into memory at the beginning, build my own indexes (HashMaps, mainly) for the lookups I'll have to do, and then work only with those, staying away from the DB until I'm done, at which point I'll write my results back to it.
Is this a sound approach? Any better ideas?
I'm assuming you must issue hundreds of commands to the database? There's no way to execute the code inside the DB?
The main thing I'd be worried about is integrity; make sure you handle locking correctly. You'd probably want a version number stored somewhere so you don't need to lock the entire set of data for the duration of processing. In the update transaction, you'd first ensure the version number is the same as when you started reading.
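A sketch of that version check as an optimistic-lock UPDATE; query is a hypothetical executor, and the plan table and its columns are made up for illustration:

declare function query(
  sql: string,
  params: unknown[]
): Promise<{ rowsAffected: number }>; // hypothetical executor

// The UPDATE only succeeds if nobody bumped the version while the
// optimizer was working; otherwise re-read the data and retry.
async function saveAssignments(planId: number, readVersion: number, result: string) {
  const res = await query(
    `UPDATE plan
        SET assignments = ?, version = version + 1
      WHERE id = ? AND version = ?`,
    [result, planId, readVersion]
  );
  if (res.rowsAffected === 0) {
    throw new Error("stale data: version changed during optimization");
  }
}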
Finally, benchmark it. I've done some apps over the last year or so that had a similarly intensive compute process per request. Using in-process objects to represent the data was orders of magnitude more efficient than hitting the database per request. But every app is different, and there might be things you haven't considered that will impact it.