Chrome localStorage Limits

From what I have understood and been able to test so far, Chrome (v19 on Windows) "limits" localStorage to 2.49 MB, a figure I have verified for myself. However, the storage scenario I have to deal with is rather more complicated. I have an IDE-like interface for which I fetch context-sensitive help from the server when the user hovers over something relevant. Once this is done, I store that help text (HTML, typically between 120 and 1024 chars) in localStorage. No problems thus far. However, the IDE is a very large and complex one, and in due course the localStorage will contain hundreds or even thousands of keys. What is not clear to me is this: will the results of the rather rudimentary localStorage limit tests (my own, and ones I ran into on the web) still be valid? The tests are done by storing one long character string under a single key, which is significantly different from what I have described above. I assume that, at the very least, there is an overhead associated with the space consumed by key storage itself.
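For what it's worth, one way to probe the many-small-keys scenario directly is to fill storage with entries shaped like the help-text case and count what fits. Below is a sketch, not a definitive measurement: it takes any Storage-like object (so it can be pointed at `localStorage` in a browser console) and tracks key lengths too, since keys consume quota alongside values. The `help:` key prefix is just an illustration.

```javascript
// Sketch: probe how many small entries fit, assuming a browser environment.
// `storage` is anything with a setItem method (localStorage in a browser).
// Keys count against the quota too, so we track both key and value lengths.
function fillWithSmallEntries(storage, entryCount, valueSize) {
  const value = 'x'.repeat(valueSize);
  let stored = 0;
  let chars = 0; // UTF-16 code units consumed by keys + values
  try {
    for (let i = 0; i < entryCount; i++) {
      const key = 'help:' + i; // key text is part of the footprint
      storage.setItem(key, value);
      stored++;
      chars += key.length + value.length;
    }
  } catch (e) {
    // a real browser throws QuotaExceededError when the origin's quota is hit
  }
  return { stored, chars };
}
```

In a browser console, something like `fillWithSmallEntries(localStorage, 5000, 512)` would show whether thousands of modest entries hit the quota earlier than one giant string does.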

Trying to find connectivity in the Wikipedia Graph: Should I use the API or the relevant SQL Dump?

I apologize for the slightly obnoxious nature of my question.
A few years ago, I came across a game that could be played on Wikipedia. The goal is to start from a random page and try to get to the 'Adolf Hitler' page by following internal wikilinks within 5 clicks (6 degrees of separation). I found this to be an interesting take on the small-world experiment, tried it a few times, and was able to reach the target article within 5-6 clicks almost every time (however, there was no way for me to know whether that was the shortest path or not).
As a project, I want to find the degree of separation between a few (maybe hundreds, or thousands, if feasible) random Wikipedia pages and Adolf Hitler's page, in order to create a histogram of sorts. My intention is to do an exhaustive search in a DFS manner from the root page (restricting the 'depth' of the search to 10, to ensure that the search terminates if it has selected a bad path or is running in cycles). So the program would visit every article reachable within 10 clicks of the root article, and find the shortest way in which the target article is reachable.
I realise that the algorithm described above would certainly take too long to run. I have ideas for optimizations which I will play around with.
Perhaps I will use a single-source shortest-path BFS-based approach, which seems more feasible considering that the degree of the graph would be quite high (mentioned later).
I will not list all my ideas for the algorithms here, as they are not relevant to the question: in any possible implementation, I will have to query (either directly, by downloading the relevant tables to my machine, or through the API) the:
pagelinks table, which contains information about all internal links in all pages, in the form of the 'id' of the page containing the wikilink and the 'title' of the page being linked to
page table, which contains the information needed to map a page 'title' to its 'id'. This mapping is needed because of the way data is stored in the pagelinks table.
Before I knew Wikipedia's schema, naturally, I started exploring the Wikipedia API and quickly found out that the following API query returns the list of all internal links on a given page, 500 at a time:
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=links&titles=Mahatma%20Gandhi&pllimit=max
Running this on MediaWiki's API sandbox a few times gives a request time of about 30 ms for 500 link results returned. This is not ideal, as even 100 queries of this nature would end up taking 3 seconds, which means I would certainly have to reduce the scope of my search somehow.
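Draining all of a page's links means following the continuation token the API hands back with each 500-link batch. Here is a minimal sketch of that loop, with the actual HTTP request abstracted behind a `fetchPage` callback; the callback and its return shape are assumptions of this sketch, not a real API client.

```javascript
// Sketch: drain a paginated MediaWiki `prop=links` query.
// `fetchPage(cont)` is a stand-in: it should perform one request, passing
// `plcontinue=cont` when set, and return
// { links: [...titles], plcontinue: <token, or undefined on the last batch> }.
function collectAllLinks(fetchPage) {
  const titles = [];
  let cont;
  do {
    const page = fetchPage(cont);
    titles.push(...page.links);
    cont = page.plcontinue; // undefined once the API has no more batches
  } while (cont !== undefined);
  return titles;
}
```

With ~30 ms per batch, a page with thousands of links costs several sequential round trips, which is where the latency concern above comes from.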
On the other hand, I could download the relevant SQL dumps for the two tables here and use them with MySQL. English Wikipedia has around 5.5 million articles (which can be considered the vertices of the graph). The compressed sizes of the tables are around 5 GB for the pagelinks table and 1 GB for the page table. A single tuple of the page table is a lot bigger than one of the pagelinks table, however (max sizes of around 630 and 270 bytes, by my estimate). Sorry I could not provide the number of tuples in each table, as I haven't yet downloaded and imported the database.
Downloading the database seems appealing because, since I have the entire list of pages in the page table, I could resort to a single-source shortest-path BFS from Adolf Hitler's page by tracking all the internal backlinks. This would end up finding the degree of separation of every page in the database. I would also imagine that eliminating the bottleneck (the internet connection) would speed up the process.
Also, I would not be overusing the API.
However, I'm not sure that my desktop would be able to perform even on par with the API, considering the size of the database.
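The backward BFS idea mentioned above can be sketched over an in-memory backlink map. The map itself would have to be built from the pagelinks data first; the function and variable names here are illustrative, not part of any Wikipedia tooling.

```javascript
// Sketch: single-source BFS from the target over *backlinks*, so the
// distance found for each page is its degree of separation *to* the target.
// `backlinks` maps a page to the list of pages that link to it.
function degreesOfSeparation(target, backlinks) {
  const dist = new Map([[target, 0]]);
  const queue = [target];
  while (queue.length > 0) {
    const page = queue.shift();
    for (const linker of backlinks.get(page) || []) {
      if (!dist.has(linker)) {
        dist.set(linker, dist.get(page) + 1);
        queue.push(linker);
      }
    }
  }
  return dist; // pages absent from the map cannot reach the target at all
}
```

One pass of this visits each edge once, so the histogram for every page falls out of a single traversal instead of one DFS per random page.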
What would be a better approach to pursue?
More generally, what kind of workload requires the use of an offline copy of the database rather than just the API?
While we are on the topic, in your experience, is such a problem even feasible without the use of supercomputing, or perhaps a server to run the SQL queries?

How does Chrome update URL bar completions?

I really enjoy using Chrome's URL bar because it remembers commonly visited sites and often suggests a good completion based on what I've typed and/or visited before. So, for example, I can type t in the URL bar and Chrome will automatically fill it in with twitter.com, or I can type maps and Chrome will fill in the rest of maps.google.com. This gives me the convenience of data-driven domain-name shortcuts without having to maintain an explicit list.
What I'm wondering, though, is how Chrome determines that an old shortcut should be replaced with a new one. For example, if I visit twitter.com often, then that becomes the completion when I type t. But if I then start visiting twilio.com often enough, then, after some time, Chrome will start to fill that in as the default completion for t. What I can't figure out is how or when that transition takes place. It also seems that there are (at least) two cases involved: one for domain names, and another for path strings, because if I visit a certain full URL often, and then want to get to the root of the same domain, I end up having to type the entire domain name out to get Chrome to ignore the full-URL completion.
If I had to guess, I'd imagine that Chrome stores the things that I type in the URL bar in a trie whose values are the number of times that a particular string has been typed (and/or visited?). Then I'd imagine it has some sort of exponential decay model for the "counts" in the trie. But this is just a guess. Does anyone know how this updating process happens?
Well, I ended up finding some answers by having a look at the Chromium source code; I'd imagine that Chrome itself uses this code without too much modification.
When you type something into the search/URL bar (which is apparently called the "Omnibox"), Chrome starts looking for suggestions and completions that match what you've typed. To do this, there are several "providers" registered with the browser, each of which knows how to make a particular type of suggestion. The URL history provider is one of these.
The querying process is pretty cool, actually. It all happens asynchronously, with particular attention paid to which activity happens in which thread (the main thread being especially important not to block). When the providers find suggestions, they call back to the omnibox, which appears to merge and sort things before updating the UI widget.
History provider
It turns out that URLs in Chrome are stored in at least one, and probably two, SQLite databases (one is on disk; the second, which I know less about, seems to be in memory).
This comment at the top of HistoryURLProvider explains the lookup process, complete with multithreaded ASCII art!
SQLite lookup
Basically, typing in the omnibox causes SQLite to run this SQL query for looking up URLs by prefix. The suggestions are ordered by the number of visits to the URL, as well as by the number of times that a URL has been typed.
Interestingly, this is not a trie! The lookup is indeed based on prefix, but the scoring of those lookups does not appear to be aggregated by prefix, as I'd imagined.
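That filter-by-prefix-then-rank shape can be sketched in plain JavaScript. To be clear, the real lookup is a SQL query against Chromium's history database; the row fields and tie-breaking order below are illustrative, not Chromium's actual scoring.

```javascript
// Sketch: filter history rows by URL prefix, then rank them by typed count
// and visit count -- the two signals the SQL query above orders by.
function suggestByPrefix(rows, prefix, limit) {
  return rows
    .filter(r => r.url.startsWith(prefix))
    .sort((a, b) => b.typedCount - a.typedCount || b.visitCount - a.visitCount)
    .slice(0, limit)
    .map(r => r.url);
}
```

Note how a row with a high typed count can outrank one with far more raw visits, which matches the observation that what you type matters as much as what you browse.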
I had a little less success in determining how the scores in the database are updated. This part of the code updates a URL after a visit, but I haven't yet run across where the counts are decremented (if at all?).
Updating suggestions
What I think is happening regarding the updating of suggestions -- and this is still just a guess right now -- is that the in-memory sqlite database essentially has priority over the on-disk DB, and then whenever Chrome restarts or otherwise flushes the contents of the in-memory DB to disk, the visit and typed counts for each URL get updated at that time. Again, just a guess, but I'll keep looking as I get time.
The code is really nice to read through, actually. I definitely recommend it if you have similar questions about Chrome.

HTML localStorage setItem and getItem performance near 5MB limit?

I was building out a little project that made use of HTML localStorage. While I was nowhere close to the 5MB limit for localStorage, I decided to do a stress test anyway.
Essentially, I loaded data objects into a single localStorage object until it was just slightly under that limit, and then made requests to set and get various items.
I then timed the execution of setItem and getItem informally using the JavaScript Date object and event handlers (bound get and set to buttons in HTML and just clicked =P).
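A slightly sturdier harness than one-off Date arithmetic is to repeat the call and take the median, which damps GC pauses and scheduling noise. Here is a minimal sketch using performance.now(), which is available in browsers and in Node; the run count is an arbitrary choice.

```javascript
// Sketch: time `fn` several times and report the median, which is less
// noisy than a single measurement taken with the Date object.
function medianTimeMs(fn, runs = 11) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(runs / 2)];
}
```

In the browser this could wrap the stress-test calls, e.g. `medianTimeMs(() => localStorage.getItem('someBigKey'))`.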
The performance was horrendous, with requests taking between 600 ms and 5,000 ms, and memory usage coming close to 200 MB in the worse cases. This was in Google Chrome with a single extension (Google Speed Tracer), on Mac OS X.
In Safari, it's basically >4,000ms all the time.
Firefox was a surprise, having pretty much nothing over 150ms.
These were all done in a basically idle state: no YouTube (Flash) getting in the way, not many tabs (nothing but Gmail), and no applications open other than background processes and the browser. Once a memory-intensive task popped up, localStorage slowed down proportionately as well. FWIW, I'm running a late-2008 Mac: a 2.0 GHz Core 2 Duo with 2 GB DDR3 RAM.
===
So the questions:
Has anyone done any benchmarking of localStorage get and set for various key and value sizes, and on different browsers?
I'm assuming the large variance in latency and memory usage between Firefox and the rest is a Gecko vs. WebKit issue. I know the answer could be found by diving into those code bases, but I'd definitely like to know if anyone can explain relevant details about the implementation of localStorage on these two engines that would account for the massive difference in efficiency and latency across browsers.
Unfortunately, I doubt we'll be able to solve it, but the closest one can get is at least understanding the limitations of the browser in its current state.
Thanks!
Browser and version become a major issue here. The thing is, while there are so-called "WebKit-based" browsers, they add their own patches as well. Sometimes these make it into the main WebKit repository, sometimes they do not. With regard to versions, browsers are always moving targets, so this benchmark could look completely different with a beta or nightly build.
Then there is the overall use case. If your use case is not the norm, the issues will not be as apparent, and they're less likely to get noticed and addressed. Even if there are patches, browser vendors have a lot of issues to address, so there's a chance a fix is slated for another build (again, nightly builds might produce different results).
Honestly, the best course of action would be to discuss these results on the appropriate browser mailing list / forum if they haven't been addressed already. People will be more likely to do testing and see if the results match.

Tools for viewing logs of unlimited size

It's no secret that application logs can go well beyond the limits of naive log viewers, and the desired viewer functionality (say, filtering the log based on a condition, highlighting particular message types, splitting it into sublogs based on a field value, merging several logs along a time axis, bookmarking, etc.) is beyond the abilities of large-file text viewers.
I wonder:
Whether decent specialized applications exist (I haven't found any)
What functionality might one expect from such an application? (I'm asking because my student is writing such an application, and the functionality above has already been implemented to a certain degree of usability.)
I've been using Log Expert lately.
It can take a while to load large files, but it will in fact load them. I couldn't find a file-size limit (if there is one) on the site, but just to test it, I loaded a 300 MB log file, so it can at least go that high.
Windows Commander has a built-in program called Lister which works very quickly for any file size. I've used it with GBs worth of log files without a problem.
http://www.ghisler.com/lister/
A slightly more powerful tool I sometimes use is Universal Viewer, from http://www.uvviewsoft.com/.

Does anyone know how I can store large binary values in Riak?

Does anyone know how I can store large binary values in Riak?
For now, they don't recommend storing files larger than 50MB in size without splitting them. See: FAQ - Riak Wiki
If your files are smaller than 50 MB, then proceed as you would with storing non-binary data in Riak.
Another reason one might pick Riak is for flexibility in modeling your data. Riak will store any data you tell it to in a content-agnostic way — it does not enforce tables, columns, or referential integrity. This means you can store binary files right alongside more programmer-transparent formats like JSON or XML. Using Riak as a sort of “document database” (semi-structured, mostly de-normalized data) and “attachment storage” will have different needs than the key/value-style scheme — namely, the need for efficient online queries, conflict resolution, increased internal semantics, and robust expressions of relationships. (Schema Design in Riak - Introduction)
Brian Mansell's answer is on the right track -- you don't really want to store large binary values (over 50 MB) as a single object in Riak (the cluster becomes unusably slow after a while).
You have 2 options, instead:
1) If a binary object is small enough, store it directly. If it's over a certain threshold (50 MB is a decent arbitrary value to start with, but really, run some performance tests to see what the average object size is, for your cluster, after which it starts to crawl) -- break up the file into several chunks, and store the chunks separately. (In fact, most people that I've seen go this route, use chunks of 1 MB in size).
This means, of course, that you have to keep track of the "manifest" -- which chunks got stored where, and in what order. And then, to retrieve the file, you would first have to fetch the object tracking the chunks, then fetch the individual file chunks and reassemble them back into the original file. Take a look at a project like https://github.com/podados/python-riakfs to see how they did it.
2) Alternatively, you can just use Riak CS (Riak Cloud Storage), to do all of the above, but the code is written for you. That's exactly how RiakCS works -- it breaks an incoming file into chunks, stores and tracks them individually in plain Riak, and reassembles them when it comes time to fetch it back. And provides an Amazon S3 API for file storage, for your convenience. I highly recommend this route (so as not to reinvent the wheel -- chunking and tracking files is hard enough). Yes, CS is a paid product, but check out the free Developer Trial, if you're curious.
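The chunk-and-manifest scheme from option 1 can be sketched in a few lines. The names below are illustrative, not python-riakfs's or Riak CS's actual API, and strings stand in for binary data to keep the sketch short.

```javascript
// Sketch: split a file into fixed-size chunks and record a manifest listing
// the chunk keys in order, so the file can be reassembled later.
const CHUNK_SIZE = 1024 * 1024; // 1 MB, the commonly used chunk size

function chunkFile(name, data, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  const manifest = { name, chunkKeys: [] };
  for (let off = 0, i = 0; off < data.length; off += chunkSize, i++) {
    const key = `${name}:chunk:${i}`; // each chunk gets its own Riak key
    chunks.push({ key, bytes: data.slice(off, off + chunkSize) });
    manifest.chunkKeys.push(key);
  }
  return { manifest, chunks };
}

function reassemble(manifest, fetchChunk) {
  // fetchChunk(key) -> bytes; in real use this would be a Riak GET per key
  return manifest.chunkKeys.map(fetchChunk).join('');
}
```

The manifest object is itself stored under one key, so a file fetch is one manifest GET followed by N chunk GETs, which is exactly the bookkeeping Riak CS automates.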
Just like every other value. Why would it be different?
Use either the Erlang interface ( http://hg.basho.com/riak/src/461421125af9/doc/basic-client.txt ) or the "raw" HTTP interface ( http://hg.basho.com/riak/src/tip/doc/raw-http-howto.txt ). It should "just work."
Also, you'll generally find a better response on the riak-users mailing list than you will here. http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com (No offense to z8000, who seems to also have answers.)