Will a Node.js app consuming a JSON file share it in memory between connections?

I am trying to decide between holding static data (gets updated on a nightly basis, not real-time) in a database or in flat JSON files to supply a Node.js application. In preliminary tests the flat file method is twice as fast. My question is about the issue of memory when using the file method.
If my Node.js app reads the data from the file and then does JSON.parse and passes the object to the template to render... does the in-memory size of that data get duplicated with each user connection?
i.e. if the data file is 1MB and there are 1000 concurrent users, does it consume 1000MB of server memory during that period?

Each connection runs separately, so if you have 1000 concurrent users, they aren't really running their requests all at the same time, because Node.js is single threaded. It runs a single connection until it either finishes or hits a non-blocking operation such as async I/O. Assuming you are using async file I/O, you could have a few connections in progress at the same time, but as soon as one finishes, its memory can be reclaimed by the garbage collector.
Your operation sounds ideal for an in-memory cache. You can decide what lifetime works best for the cache, but the idea is: load the JSON, store it in memory, and set an expiration time 10 minutes from now; as long as the current time has not passed the expiration time, you return the result from the cache with no disk access. That way you'd only ever retrieve the data from disk once every 10 minutes at most, the data would be returned even faster, and the average memory used per request would be significantly lower.
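A minimal sketch of that idea in Node.js (the file name data.json and the 10-minute lifetime are just placeholders, adjust to taste):

    const fs = require('fs/promises');

    const CACHE_TTL_MS = 10 * 60 * 1000; // 10-minute cache lifetime
    let cache = { data: null, expiresAt: 0 };

    // Returns the parsed JSON, hitting the disk at most once per TTL window.
    async function getData(filePath = './data.json') {
      const now = Date.now();
      if (cache.data && now < cache.expiresAt) {
        return cache.data; // served from memory, no disk access
      }
      const raw = await fs.readFile(filePath, 'utf8');
      cache = { data: JSON.parse(raw), expiresAt: now + CACHE_TTL_MS };
      return cache.data;
    }

Because the cache object lives in module scope, every connection handled by the same Node.js process shares the one parsed copy rather than duplicating it per request.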
The only downside to this cache approach is that when the underlying data is updated, it could take up to 10 minutes (on average half the cache time, or 5 minutes) for the cached data to expire and the new data to be returned. Since the update only happens once nightly, that may not be a big deal to you, but there are ways to deal with it if you want to. For example, you can check the date/time of the data file on every request; if it hasn't changed since the last time, you just keep using your cached version of the data, and when it does change, you read it from disk and replace the cached version. This adds an extra disk I/O operation on each request, but it guarantees that the user always gets the latest version while still keeping the benefit of a cache that only has to be re-read into memory when the data has actually changed.
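A rough sketch of that mtime check, again with an assumed data.json file; an fs.stat call is cheap compared to re-reading and re-parsing the whole file:

    const fs = require('fs/promises');

    let cached = { data: null, mtimeMs: 0 };

    // Re-reads and re-parses the file only when its modification time changes.
    async function getData(filePath = './data.json') {
      const { mtimeMs } = await fs.stat(filePath); // one small stat() per request
      if (!cached.data || mtimeMs !== cached.mtimeMs) {
        const raw = await fs.readFile(filePath, 'utf8');
        cached = { data: JSON.parse(raw), mtimeMs };
      }
      return cached.data;
    }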
One other thing to consider: if the data is 1MB and you're generating a giant HTML page from it, your page rendering may be where the largest memory consumption is, since expanding a large data structure into HTML can often make it 10-20x larger. How well your rendering engine does with memory consumption depends entirely on the rendering engine.
If there is no per-user customization in the HTML or anything else in the rendered HTML that varies from one rendering to the next (as long as the JSON hasn't changed), you might want to actually cache the rendered HTML so all you have to do is stream it to each request.
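A sketch of that last idea, assuming a hypothetical render() function standing in for whatever your template engine exposes, and the same nightly-updated data file:

    const fs = require('fs/promises');

    let htmlCache = { html: null, mtimeMs: 0 };

    // Renders the page once per change of the data file, then reuses the HTML.
    async function getPage(render, filePath = './data.json') {
      const { mtimeMs } = await fs.stat(filePath);
      if (!htmlCache.html || mtimeMs !== htmlCache.mtimeMs) {
        const data = JSON.parse(await fs.readFile(filePath, 'utf8'));
        htmlCache = { html: render(data), mtimeMs }; // render() is whatever your engine provides
      }
      return htmlCache.html;
    }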

Related

Max size of docs sent via JSON to SOLR for indexing

I have a solr instance that gets its data from mysql via cron / python scripts. Right now, it sends 1000 documents at a time. What's the maximum number I can send without problems?
Thanks
Solr does not have any such limit on the maximum number of documents that can be indexed in a single load. The practical limit is the memory allocated to the JVM running the Solr instance. I have personally tried 50K documents at once, with a Java heap of 24GB. The advantage of doing a bulk load is saving the network latency of many smaller requests.
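As a rough illustration of the batching pattern (the original setup uses Python cron scripts, so treat this Node.js sketch as pseudocode; the core name, URL and batch size are assumptions), posting documents in chunks to Solr's JSON update endpoint:

    // Sends documents to Solr in fixed-size batches instead of one giant request.
    async function indexInBatches(docs, batchSize = 1000,
                                  url = 'http://localhost:8983/solr/mycore/update?commit=true') {
      for (let i = 0; i < docs.length; i += batchSize) {
        const batch = docs.slice(i, i + batchSize);
        const res = await fetch(url, {          // global fetch, Node.js 18+
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify(batch),          // Solr's JSON update format accepts an array of docs
        });
        if (!res.ok) throw new Error(`Solr returned ${res.status} for batch starting at ${i}`);
      }
    }

Tuning batchSize against your JVM heap is usually more productive than hunting for a hard maximum.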

Very frequent Couchbase document updates

I'm new to Couchbase and was wondering if very frequent updates to a single document (possibly every second) will cause all updates to pass through the disk write queue, or only the last update made to the document?
In other words, does Couchbase optimize disk writes by only writing the document to disk once, even if it is updated multiple times between writes?
Based on the docs, http://docs.couchbase.com/admin/admin/Monitoring/monitor-diskqueue.html, it sounds like all updates are processed. If anyone can confirm this, I'd be grateful.
thanks
Updates are held in a disk queue before being written to disk. If a write to a document occurs and a previous write is still in the disk queue, then the two writes will be coalesced, and only the more recent version will actually be written to disk.
Exactly how fast the disk queue drains will depend on the storage subsystem, so whether writes to the same key get coalesced will depend on how quick the writes come in compared to the storage subsystem speed / node load.
Jako, you should worry more about updates happening in the millisecond time frame, or more than one update happening within one millisecond. The disk write isn't the problem; Couchbase handles that intelligently itself. The issue is that you will run into concurrency problems when you operate in the millisecond time frame.
I ran into them fairly easily when I tested my application and at first couldn't understand why Node.js (in my case) would sometimes write data to Couchbase and sometimes not. When it failed to write, it was usually for the first record.
More problems arose when I first checked whether a document with a specific key existed and, if it didn't, tried to write it to Couchbase, only to find out that in the meantime an earlier callback had finished and there was now indeed a document for the same key.
In those cases you have to work with the CAS value and program the update iteratively, so that your app keeps pulling the current document for that key and retrying the update. Keep this in mind, especially when running tests where updates to the same document are being made!
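A rough sketch of such a CAS retry loop, assuming the Couchbase Node.js SDK 3.x API (collection.get / collection.replace with a cas option; the key and mutation function are placeholders):

    // Optimistic-concurrency update: fetch, mutate, replace with CAS, retry on conflict.
    async function updateWithCas(collection, key, mutate, maxRetries = 10) {
      for (let attempt = 0; attempt < maxRetries; attempt++) {
        const current = await collection.get(key);      // result carries content and cas
        const updated = mutate(current.content);
        try {
          return await collection.replace(key, updated, { cas: current.cas });
        } catch (err) {
          // If another writer changed the document in the meantime, the CAS no longer
          // matches and replace() fails; loop and retry with the fresh version.
          // (A real implementation would check that err is a CAS mismatch before retrying.)
        }
      }
      throw new Error(`Could not update ${key} after ${maxRetries} attempts`);
    }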

Does Caching always enhance performance?

I have a number of sites with PHP and MySQL, especially running MediaWiki, and I need to enhance the performance. However, I have only a limited percentage of CPU that I'm allowed to use.
The best thing I can think about to improve performance is to enable caching. However, I'm confused: Does that really enhance performance overall or just enhance speed?
The way I think about it: if caching uses files, then it takes extra processing to read the content of those files. If it uses SQL tables, then it takes extra processing to query those tables as well; perhaps the time will be shorter, but the CPU usage will be higher.
Is that correct or not? Does caching consume more CPU to give faster results, or does it improve performance overall?
At the most basic level caching should be used to store the result of CPU intensive processes. For example, if you have a server side image handler that creates an image on-the-fly (say a thumbnail and larger preview) then you don't want this operation to occur on every request - you'd want to run this process once and store the results; then every other request gets the saved result.
This is obviously a hugely over-simplified description of basic caching, and the use of an image is fine in this case as you don't have to worry about stale data, i.e. how often will the actual image change? In your case, databases are hugely different. If you cache data, then how can you guarantee that there won't be an instant mismatch between your real data and your cached data? Also, querying a database is not always a CPU intensive task (granted you have to consider how the database is designed in terms of indexing, table size, etc.), but in most cases querying a well designed database is far more intensive on disk I/O than it is on CPU cycles.
First, you need to look at your database design, and secondly at your queries. For example: are you normalizing your database correctly, are your queries trawling through huge amounts of data when you could just archive, are you joining tables on non-indexed fields, are your WHERE clauses querying fields that could be indexed (IN is particularly bad in these cases)?
I recommend you get hold of a query analyzer and spend some time optimizing your table structure and queries to find that bottleneck before looking into more drastic changes.
Reference : http://msdn.microsoft.com/en-us/library/ee817646.aspx
Performance : Caching techniques are commonly used to improve application performance by storing relevant data as close as possible to the data consumer, thus avoiding repetitive data creation, processing, and transportation.
For example, storing data that does not change, such as a list of countries, in a cache can improve performance by minimizing data access operations and eliminating the need to recreate the same data for each request.
Scalability : The same data, business functionality, and user interface fragments are often required by many users and processes in an application. If this information is processed for each request, valuable resources are wasted recreating the same output. Instead, you can store the results in a cache and reuse them for each request. This improves the scalability of your application because as the user base increases, the demand for server resources for these tasks remains constant.
For example, in a Web application the Web server is required to render the user interface for each user request. You can cache the rendered page in the ASP.NET output cache to be used for future requests, freeing resources to be used for other purposes.
Caching data can also help scale the resources of your database server. By storing frequently used data in a cache, fewer database requests are made, meaning that more users can be served.
Availability : Occasionally the services that provide information to your application may be unavailable. By storing that data in another place, your application may be able to survive system failures such as network latency, Web service problems, or hardware failures.
For example, each time a user requests information from your data store, you can return the information and also cache the results, updating the cache on each request. If the data store then becomes unavailable, you can still service requests using the cached data until the data store comes back online.
You need to profile your system and find out where the bottlenecking is happening. Caching gives you the best kind of page load: one where the server doesn't have to do the work again. You can build a very simple caching system that only regenerates the content every 15 minutes. If the page was cached in the last 15 minutes, you serve the pre-rendered page. The first load creates a temp file, and every 15 minutes a new one is created (if someone loads that page).
Caching only stores a file that the server has already done the work for. The work to create the file is already done and you're simply storing it.
You use the terms 'performance' and 'speed'. I'll assume 'performance' relates to CPU cycles on your web server and that 'speed' relates to the time it takes to serve the page to the user. You want to maximize web server 'performance' (by lowering the total number of CPU cycles needed to serve pages) whilst maximizing 'speed' (lowering the time it takes to serve a web page).
The good news for you is that caching can improve both of these metrics at the same time. By caching content you create an output page that is stored in the cache and can be served repeatedly to users directly, without having to re-execute the PHP code that originally created it (thus lowering CPU cycles). Fetching a cached page from the cache consumes fewer CPU cycles than re-executing PHP code.
Caching is particularly good for web pages that are generally the same for all users who request the page - for example in a wiki, and for pages that generally do not change all too often - again, a wiki.
"Enhance performance" sounds like some of the email I get...
There are two interrelated things that happen here. One is "how long does it take to serve a given request?", and the other is "how many requests can I serve concurrently given my limited resources?". People tend to use either or both of those concepts when talking about performance.
Caching can help with both those things.
The most effective caching strategy uses resources outside your machines to cache your stuff - the most obvious examples are the user's browser, or a CDN. I'll assume you can't use a CDN, but by spending a bit of effort on setting the HTTP cache headers, you can reduce the number of requests to your server for static or rarely-changing resources quite dramatically.
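For example (the question is about PHP, but the header mechanism is the same everywhere; shown here with Node's built-in http module for brevity, and the max-age values are just illustrations):

    const http = require('http');

    http.createServer((req, res) => {
      if (req.url.startsWith('/static/')) {
        // Let browsers (and any intermediate proxies) reuse the response for a day.
        res.setHeader('Cache-Control', 'public, max-age=86400');
      } else {
        // Dynamic pages: allow brief reuse, then revalidate.
        res.setHeader('Cache-Control', 'private, max-age=60');
      }
      res.end('...'); // actual response body goes here
    }).listen(3000);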
For dynamic content - usually the web page you generate by querying your database - the next most effective caching strategy is to cache the HTML generated by (parts of) your page. For instance, if you have a "most popular items" box on your homepage, this will usually run a couple of moderately complex database queries, and then some "turn data to HTML" back-end code. If you can cache the HTML, you save both the database queries and the CPU effort of turning the data into HTML.
If that's not possible, you may be able to cache the result of some database queries. That helps in reducing the database load, and usually also reduces the load on your web server - the code required to run the database query and deal with the results is usually more onerous than retrieving the item from cache. Because it's faster, it allows your request to be handled quicker, which frees up resources more quickly. This reduces the load on your servers for an individual request, and thus allows you to serve more concurrent requests.
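A minimal sketch of query-result caching with a TTL, where runQuery() is a hypothetical stand-in for whatever database client you actually use:

    const queryCache = new Map(); // cache key -> { value, expiresAt }

    // Returns a cached result if it is still fresh, otherwise runs the query.
    async function cachedQuery(key, runQuery, ttlMs = 60 * 1000) {
      const hit = queryCache.get(key);
      if (hit && Date.now() < hit.expiresAt) return hit.value;

      const value = await runQuery();           // e.g. the "most popular items" query
      queryCache.set(key, { value, expiresAt: Date.now() + ttlMs });
      return value;
    }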

MySQL concatenating large string

I have a web crawler that saves information to a database as it crawls the web. While it does this, it also saves a log of its actions and any errors it encounters to a log field in a MySQL database (the field grows to anywhere from 64KB to 100KB). It accomplishes this by concatenating onto the field using the MySQL CONCAT function.
This seems to work fine, but I am concerned about the CPU usage / impact it has on the MySQL database. I've noticed that the web crawling is performing slower than before I implemented saving the log to the database.
I view this log file from a management webpage, and the current implementation seems to work fine other than the slow loading. Any recommendations for speeding this up, or implementation recommendations?
You are reading 100KB strings into memory numerous times and then writing them to disk via a database. Of course you're going to experience slowdown! Every part of what you are doing taxes memory, disk, and CPU (especially if memory usage hits the system maximum and you start swapping to disk). Let me count some of the ways you're likely to decrease overall site performance:
1. SQL connections max out and back up, as the time to store 100KB records increases the time a single process holds a connection.
2. Web server processes eat up the free process pool and max out, and take longer to free up because they have to wait on database connections to free.
3. Web server processes begin to bloat and each take more memory, possibly more than the system can handle without swapping. This is compounded by using the maximum number of processes due to #2.
... A book could be written on your situation.
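One mitigation (my suggestion, not something from the question): buffer log lines in the crawler and flush them periodically in a single statement, instead of paying one CONCAT round-trip per line. A sketch using the mysql2 Node.js client; the table name crawl_runs, the log column, and the flush interval are made up for illustration:

    const mysql = require('mysql2/promise');

    const pool = mysql.createPool({ host: 'localhost', user: 'crawler', database: 'crawl' });
    const pending = [];                      // log lines waiting to be flushed

    function log(line) {
      pending.push(line);
    }

    // Flush every 30 seconds: one UPDATE (and one CONCAT) per flush, not per line.
    async function flush(crawlId) {
      if (pending.length === 0) return;
      const chunk = pending.splice(0).join('\n') + '\n';
      await pool.query(
        "UPDATE crawl_runs SET log = CONCAT(IFNULL(log, ''), ?) WHERE id = ?",
        [chunk, crawlId]
      );
    }

    setInterval(() => flush(1).catch(console.error), 30 * 1000); // 1 = example crawl id

An even cheaper option is to append the log to a flat file and store only its path in the database, so the wide text column never enters the query path at all.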

What is the actual use of buffer temp and blob temp in SSIS

I have changed the memory setting in the SQL Server properties to a low value. I have also changed the buffer temp path to a particular location on my system. But why is the package failing with an insufficient-memory message? If we set buffer temp and blob temp, the data should swap to that temp location, right? Then, if it is still failing, what is the use of buffer temp?
Somewhat related: What is the default file path location for BufferTempStoragePath in SSIS 2005? In particular, read the linked article from bimonkey concerning the accessibility of these locations on disk from the SQL Agent service account.
Generally speaking, when your package is reporting low memory, it is due to the use of fully blocking transformations and Lookup tasks pulling back too much data. If your package makes heavy use of blocking transformations, try to offload the work to the source systems.
If lookups are to blame, try being more selective in your query. Do not pull back the entire table; only pull the columns you need. If that isn't selective enough, try filtering that dataset with a WHERE clause (I only need the current year's data, etc.). Failing that, switch the lookup from full cache mode to partial cache or no cache. No cache will result in one-off queries to the source system for every row that comes through; it has no memory that it ran the exact same query 2 rows ago. Partial cache solves that dilemma by keeping X MB of data in memory.
If you want more details about how to reduce memory usage, post some screenshots of what your package looks like. Also note that settings like BufferTempStoragePath are per data flow, so if you have multiple data flows in a package, each one will need to be configured.
The architecture of the data flow is such that data is read into memory buffers and the addresses of those buffers are passed to the various tasks. Instead of each task needing its own allocation to hold the data that's passing through it, they all work off the same shared set of memory. Copying that memory from task to task would be slow and very expensive in terms of memory consumption.
With that preamble said, what are BufferTempStoragePath and BlobTempStoragePath? Any time you pull large object types (n/varchar(max), xml, image, etc.) into the data flow, that data is not kept in memory buffers like native types. Instead it's written to disk, and a pointer to that location is what is put into the memory buffer. BufferTempStoragePath is used when your data flow task still has work to do but one of the following applies:
you've fragmented your memory so much (through fully/partially blocking transformations) that the engine can't get any more;
you're trying to do too damn many things in a single task. My rule of thumb is that I should be able to trace a line from any transformation in the package to all the sources and destinations. If you've created a package from the import/export wizard, those data flows are prime candidates for being split into separate flows, as the wizard loves to group unrelated things into a single data flow, which makes them memory hungry;
the box simply doesn't have sufficient resources to process the data. I generally prefer to avoid throwing more hardware at a job, but if you've addressed the first two bullets, this would be the last one in my pistol.