I've developed a Chrome extension that cycles through approximately 1500-2000 pages to collect information from a website and push it to my own server. I use a Chrome extension because, given the requirements, it's much easier to configure than making the requests and parsing the pages server-side.
The extension is used only for internal purposes, and we trigger the job with the chrome.alarms API so that it runs at 3:00 am every day. When the alarm fires, it pops open a new tab and cycles through the 1500-2000 pages. The problem is that when we check in the morning, we see the "Aw, Snap!" dialog after the extension has gotten through about 1500 pages (approx. 3/4). I'm presuming this is due to the memory demand placed on Chrome by maintaining such an unusually large history?
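For reference, the scheduling is roughly the following (a minimal sketch, assuming the "alarms" and "tabs" permissions; next3am and START_URL are my placeholder names):

function next3am() {
  var t = new Date();
  t.setHours(3, 0, 0, 0);       // today at 03:00
  if (t <= new Date()) {
    t.setDate(t.getDate() + 1); // already past 3 am, schedule for tomorrow
  }
  return t.getTime();
}

chrome.alarms.create('nightlyScrape', { when: next3am(), periodInMinutes: 24 * 60 });

chrome.alarms.onAlarm.addListener(function (alarm) {
  if (alarm.name === 'nightlyScrape') {
    chrome.tabs.create({ url: START_URL }); // START_URL: hypothetical first page to visit
  }
});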
My question is: what would be the best way to mitigate this? Presumably killing the tab and reopening it (after x number of pages) would work, but that would slow down the feed and require quite a bit of code refactoring. Is there any way to force Chrome to dump the history, and would doing so free memory in the immediate session?
Just to add some context: I am running this on an extra-small VM with only 1 GB of memory. I appreciate that I could upgrade the VM, but that really just defers the problem.
It's hard to make a solid recommendation without seeing your code, but here's something a bit more general:
The problem is likely the information that is being collected: it keeps building up in memory until you run out. You might consider saving the results to local storage using Chrome's unlimited storage capability. In the manifest file:
"permissions": [
"unlimitedStorage"
],
What I would probably do is save to local storage after every page, or something similar, and then reset the array (or whatever you're using).
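A minimal sketch of that pattern, assuming the "storage" permission is declared alongside "unlimitedStorage" (pendingRows and flushPage are hypothetical names):

var pendingRows = []; // rows scraped from the current page

function flushPage(pageIndex) {
  var items = {};
  items['results_' + pageIndex] = pendingRows;
  chrome.storage.local.set(items, function () {
    pendingRows = []; // drop the in-memory copy once it is persisted
  });
}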
UPDATE
If indeed history is the problem, you could consider deleting it on the fly. This function will delete entries for a specific URL:
https://developer.chrome.com/extensions/history.html#method-deleteUrl
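For example, assuming the "history" permission is declared, each page could be dropped from history right after it has been scraped (pageUrl is a placeholder):

chrome.history.deleteUrl({ url: pageUrl }, function () {
  // the history entry for pageUrl is gone at this point
});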
Related
Besides the browser cache, there are a few other ways browsers cache data. For Chrome, there is another cache in the rendering engine Blink that stores images, styles, scripts and fonts (maybe more) in memory.
This cache is used for consecutive navigations on a site. Resources delivered from the Blink cache are tagged with (from memory cache) in the network tab. Resources served from the browser cache are tagged with (from disk cache).
My question now is: which resources are stored in and delivered from this very fast cache? From my tests, it varies a lot:
It works extremely well for images and script tags that are directly in the HTML.
It works sometimes for style (link) tags that are directly in the HTML; sometimes it does not (in the same browser with the same session).
It almost never works for script tags that are inserted into the HTML programmatically, though sometimes it does.
One huge difference between disk cache hits and memory cache hits becomes visible in combination with Service Workers. Requests that are served by the in-memory cache cannot be observed in the Service Worker (because the cache comes before the Service Worker). Requests that are served by the disk cache pass through the Service Worker (since the Browser Cache lies behind Service Worker).
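This is easy to observe with a minimal fetch handler (a sketch, not taken from the test page): memory-cache hits never reach it, while disk-cache hits do.

// sw.js
self.addEventListener('fetch', function (event) {
  // only requests that actually pass through the Service Worker are logged
  console.log('SW saw a request for', event.request.url);
});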
To show the explained behavior, I built a test page with all resource types: https://dm-clone-optimized.app.baqend.com/
You can navigate through the site with the links at the top and observe how the requests behave in the network tab and console. Every page loads the same resources.
After a bit of navigating (Chrome 70.0.3538.67), I get this behavior most of the time:
HTML is fetched from network
Script tags scripts.js and scripts2.js are from in-memory cache
Image tag logo.png is from in-memory as well
Style link tag styles.css is from disk cache :(
Programmatically added script tag scripts2.js?id=1 is from disk cache as well :(
Sometimes though, I get really lucky and everything is served from in-memory cache:
I would love to understand how the Blink in-memory cache works and how I can tune my site so that all resources with appropriate cache-control headers are served from it.
---- edit ----
What concerns me the most is: why are dynamically added scripts almost never served from the in-memory cache? This has a noticeable impact on frameworks like require.js, since they insert all dependencies as dynamically added script tags.
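For clarity, this is the kind of dynamically added script tag meant here (the file name matches the test page above):

var s = document.createElement('script');
s.src = 'scripts2.js?id=1'; // inserted programmatically, the way require.js loads dependencies
document.head.appendChild(s);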
How the Blink in-memory cache works
Blink has four memory allocators:
PartitionAlloc, Oilpan, tcmalloc, and the system allocator.
The team working on Blink has removed tcmalloc and the system allocator from Blink.
Blink (PartitionAlloc + Oilpan) is the second-largest consumer of the renderer's memory; it consumes 10-20% in typical cases and retains some of that memory in Discardable, CC and V8.
Inside Blink, the primary memory consumers are:
Large StringImpls (used by JavaScript source code)
shared buffers (used by Resources)
Vectors and HashTables
The recommendation is to: "identify caches that have an impact on Blink’s total memory and implement purgeMemory() only on them."
Reducing the size of DOM objects won't have an impact.
Discarding caches won't have an effect in most cases.
They are working on getting rid of "DiscardableMemory" items, which will help with things like forcibly detaching all layout objects, which in turn releases the memory retained by the layout tree.
I believe it is the result of an optimization in Chrome, and the network tab simply makes it visible to you.
Files always go into the disk cache. They also go into memory, and are flushed very soon.
Chrome is smart enough to ask its running processes whether they still have a loaded copy in memory before seeking the disk.
This step has a high hit rate, as those images/scripts are actively being used for something.
You will not have any control over how Chrome manages their TTL or how much memory may be used to keep blobs hot. The Chrome dev team does quite a lot of dynamic tuning based on actual hardware capacity and system load.
P.S. If you are asking how to keep YOUR APP in memory, you are falling into the Sun/Adobe way of evil: keeping their app DLLs hot in memory (via a tray icon/service) and slowing everyone else down.
P.P.S. If it is an app that end users might want to use, use Electron and follow WhatsApp/Slack/etc. in building an app that is always running.
I am trying to make the page as fast as possible. The problem is that I have one file in particular that takes a long time to download, even when its size is 0 (i.e. it is loaded from cache).
Here is how it looks without cache (first load):
And with cache:
I agree that the time has dropped by almost 50%, but why is the Content Download time from my local cache higher than the one from the server?
NOTE:
I am not sure whether Safari breaks the times down the same way, but it doesn't seem to take that long to get the file there.
Image from Safari Network tab:
After running tests on several projects, and without any code or a further breakdown of the content-download process, here is the conclusion I've reached.
System threads set a priority level for each process they are assigned. When a file is stored in the cache, the download performance is, for the most part, dependent on the performance of the machine. If the threads are bogged down, that can result in a slower content download than a download from a server. A download from a server, however, is typically slower because of all of the processes involved in contacting, requesting, receiving and handling the download.
The advantage of cache is easily seen when handling slow connections and/or large files. For example, if there's 10+ seconds of waiting due to a slow connection then it's not ideal to load the full page more than once. It would take an incredibly slow system to justify sending that page twice.
Since the browser cache removes the send/receive process, it is faster, but it doesn't remove load times entirely. In this case, it happens to yield a higher load time in one category.
Side note: it seems odd to classify a file read from the cache as content downloaded, but this seems to be consistent amongst all of the projects I tested.
I have an embedded system that runs firmware and exposes a USB mass-storage device of 79 kB. So when you plug the device into any computer (Mac/Windows), it pops up as a 79 kB flash drive. The firmware creates files containing transaction records. The objective is to display these transactions (tables and simple graphs) to the user. I've narrowed the viewer down to a web browser, so the user (with a Mac/Windows PC) can plug in the USB mass-storage device, open an HTML file on the drive, and view all the transactions in the form of tables and simple bar graphs. The tricky part comes here: the device (firmware) needs to update its clock, and this time input has to be sourced from the Mac/Windows PC. How can this be achieved?
This is the minimum requirement. Further, through the web browser the user wants to write some configuration parameters, e.g. through a text box and a submit button in the HTML page.
NOTE: The USB mass-storage class and the web-browser approach were selected so that there are no prerequisites for the user.
Please suggest an alternative if this can be done using another approach, e.g. a different USB class or some other application available locally on a Mac/Windows desktop/laptop. The application should run on both Mac and Windows, i.e. the source code should be the same but can be built into separate packages, one for Mac and one (.exe) for Windows. Please suggest a platform for this. Thanks!
As far as I know, there is no way a web browser can write to a file. If such a thing were possible, it would be a huge security issue.
You have to write a piece of native software to do all the tasks you name. That can be done in pretty much any programming language, and if you're developing embedded systems I reckon you must have some experience in programming.
I'm looking at doing something similar and have an idea, though you may be better equipped to run with it than I am. Have the device contain a directory called "SET_DATE" with files "YEAR15" through "YEAR99", "MON01" through "MON12", "DATE01" through "DATE31", "H00" through "H23", "M00" through "M59", "S00" through "S59", and "SET"; each such file should start at a different sector, though none of the sectors in question need to contain any data (they need not physically be stored anywhere). To set the date to July 4, 2020 at 12:34:56 pm, read the following files in sequence:
SET_DATE/YEAR20
SET_DATE/MON07
SET_DATE/DATE04
SET_DATE/H12
SET_DATE/M34
SET_DATE/S56
SET_DATE/SET
The last access should cause the unit to set its clock. If a user might want to set the clock more than once, that could be accommodated by having a bunch of essentially identical directories under SET_DATE (so setting the date the first time would use SET_DATE/00/YEAR20, the second time SET_DATE/01/YEAR20, etc.) and/or by having the drive unmount/remount itself if necessary to clear out any caching.
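A host-side sketch of that read sequence (Node.js here, as a hypothetical helper script; only the access pattern matters, the file contents are irrelevant, and OS-level caching may require the unmount/remount trick above):

var fs = require('fs');
var path = require('path');

function pad(n) { return (n < 10 ? '0' : '') + n; }

function setDeviceClock(driveRoot) {
  var now = new Date();
  var names = [
    'YEAR' + pad(now.getFullYear() % 100),
    'MON' + pad(now.getMonth() + 1),
    'DATE' + pad(now.getDate()),
    'H' + pad(now.getHours()),
    'M' + pad(now.getMinutes()),
    'S' + pad(now.getSeconds()),
    'SET'
  ];
  names.forEach(function (name) {
    // each read touches that file's sector, which the firmware can detect
    fs.readFileSync(path.join(driveRoot, 'SET_DATE', name));
  });
}

setDeviceClock('E:\\'); // hypothetical drive letter of the device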
I would think it unwise to have directory fetches trigger actions, since Windows or an anti-virus tool might decide to pre-cache all the directories in a drive when it is mounted. I would not expect Windows or a browser to eagerly load files, however, so I would think one could have read accesses trigger actions.
Suppose I've got the following parts in my system: Storage (S) and a number of Clients (C). The clients are separate Web Workers and I'm actually trying to emulate something like shared memory for them.
Right now I've got just one Client, and it's communicating with the Storage pretty intensively. For the sake of testing, it spins in a for-loop, requesting some information from the Storage and processing it (the processing is really cheap).
It turns out that this is slow. I checked the process list and noticed chrome --type=renderer eating lots of CPU, so I thought that it might be redrawing the page or doing some kind of DOM processing after each message, since the Storage runs in the page context. OK, I decided to try moving the Storage to a separate Worker so that the page is totally idle now and… ended up with even worse performance: exactly twice as slow (I tried a Shared Worker and a Dedicated Worker with explicit MessageChannels, with the same results).
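Roughly, the MessageChannel variant was wired like this (a sketch; file names are placeholders):

var storage = new Worker('storage.js');
var client = new Worker('client.js');
var channel = new MessageChannel();

// hand one port to each worker so they can talk to each other directly
storage.postMessage({ cmd: 'connect' }, [channel.port1]);
client.postMessage({ cmd: 'connect' }, [channel.port2]);

// inside each worker:
// onmessage = function (e) { var port = e.ports[0]; /* then use port.postMessage / port.onmessage */ };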
So, here is my question: why is sending a message from one Worker to another Worker exactly twice as slow as sending a message from a Worker to the page? Are the Workers sending messages through the page? Is it "by design" or a bug? I was going to check the source code, but I'm afraid it's a bit too complex and, probably, someone is already familiar with this part of Chromium's internals…
P.S. I'm testing in Chrome 27.0.1453.93 on Linux and Chrome 28.0.1500.20 on Windows.
As far as I know, at the current moment (late 2011), the max-connections-per-server limit remains 6. Please correct me if I am wrong. It is bad that we cannot fix this easily, as we can in Firefox. As far as I know, this value is hardcoded.
One of the solutions is to download Chromium's sources and rebuild them. Is there an easier solution?
Is there any tricky way to hack this without creating a dozen mirror domains?
Why I'm asking the question: my task is to create an HTML/JavaScript slideshow that will run inside a fullscreened browser, with a huge monitor hanging on the wall. The JavaScript is really complicated: it preloads photos and makes a lot of AJAX calls to my web services. If the Wi-Fi connection is slow and 6 photos are loading, the AJAX calls fail and the application runs badly. I want a fast solution based on HTTP, the browser, an Ubuntu tweak or something else, because rebuilding the JavaScript app would take days.
Offtopic: do you know any other things that can be tweaked in my concrete situation?
IE is even worse, with a 2-connections-per-domain limit. But I wouldn't rely on fixing client browsers. Even if you have control over them, browsers like Chrome will auto-update, and a future release might behave differently than you expect. I'd focus on solving the problem within your system design.
Your choices are to:
Load the images in sequence so that only 1 or 2 XHR calls are active at a time (use the success event from the previous image to check whether there are more images to download, and start the next request); there is a sketch of this after the list.
Use sub-domains like serverA.myphotoserver.com and serverB.myphotoserver.com. Each sub-domain will have its own pool for connection limits. This means you could have 2 requests going to each of 5 different sub-domains if you wanted to. The downside is that the photos will be cached according to these sub-domains. BTW, these don't need to be "mirror" domains; you can just make additional DNS pointers to the exact same website/server. This means you don't have the headache of administrating many servers, just one server with many DNS records.
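A rough sketch of the first option (the URL list is a placeholder):

var urls = ['photo1.jpg', 'photo2.jpg', 'photo3.jpg']; // your photo list

function loadNext(i) {
  if (i >= urls.length) return; // all done
  var img = new Image();
  img.onload = function () { loadNext(i + 1); }; // chain the next request on success
  img.onerror = function () { loadNext(i + 1); }; // skip failures and keep going
  img.src = urls[i];
}

loadNext(0); // only one request is in flight at any time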
I don't know that you can do it in Chrome outside of Windows -- some Googling shows that Chrome (and therefore possibly Chromium) might respond well to a certain registry hack.
However, if you're just looking for a simple solution without modifying your code base, have you considered Firefox? In the about:config you can search for "network.http.max" and there are a few values in there that are definitely worth looking at.
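From memory (so verify the exact names in your Firefox version before relying on them), the values worth looking at include:

network.http.max-connections
network.http.max-connections-per-server
network.http.max-persistent-connections-per-server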
Also, for a device that will not be moving (i.e. it is mounted in a fixed location), you should consider not using Wi-Fi (even a HomePlug powerline adapter would be a step up as far as latency/stability/dropped connections go).
BTW, the HTTP/1.1 specification (RFC 2616) suggests no more than 2 connections per server:
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
There doesn't appear to be an external way to hack the behaviour of the executables.
You could modify the Chrome(ium) executables, as this information is obviously compiled in. That approach brings a lot of problems with support and automatic upgrades, so you probably want to avoid it. You would also need to understand how to make the changes to the binaries, which is not something most people can pick up in a few days.
If you compile your own browser you are creating a support issue for yourself as you are stuck with a specific revision. If you want to get new features and bug fixes you will have to recompile. All of this involves tracking Chrome development for bugs and build breakages - not something that a web developer should have to do.
I'd follow @BenSwayne's advice for now, but it might be worth thinking about doing some of the work outside of the client (the web browser) and putting it in a background process running on the same or a different machine. That process can handle many more connections, and you are just responsible for getting the data back from it. Since it is local(ish), you'll get results back quickly even with minimal connections.