Chrome Extension: How to get data out? - google-chrome

I appreciate this question may appear broad, but that is because I am looking anywhere and everywhere for a possible solution to something very simple.
The goal is, from a web page opened in Chrome, to scan the DOM, extract specific elements, and silently save them in some way that I can then access.
There is no intention for any of this to be published as an app or extension, it is simply me wanting to access my own rendered browser data and extract and store this data on my own computer. For this reason, I am currently finding Chrome's exhaustive sandboxing security frustrating and irrelevant to say the least.
I have a working Chrome Extension which extracts all of the data I want, has a list of 5 strings that I want to save and that's as far as I have gotten.
I have looked into these areas:
Existing NPAPI plugins (I could not get NPAPI file I/O to work).
Creating my own NPAPI plugin, which seems like a huge overhead and learning exercise simply to get external access to 5 strings.
Every aspect of the Chrome extension (and even app) APIs, particularly their local storage, which is not accessible from outside the extension.
Any other thoughts?
I realise there is a solution through creating my own NPAPI plugin, but I would like to believe there is another approach that allows me to link a constructed DOM with my local system. I am open to any other option. (I have considered a purely Bash approach on Linux, but I need to generate the DOM as though it were in my browser.)
I just want to be able to access specifically extracted parts of a DOM on my local system, not write an entirely new C++ plugin to facilitate this very basic feature.
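For context, a rough sketch of the kind of content script described above (the selector is a hypothetical placeholder for whatever elements are being scraped):

var extracted = [];
// Scan the rendered DOM and collect the text of the target elements.
document.querySelectorAll('.some-target-selector').forEach(function (el) {
  extracted.push(el.textContent.trim());
});
// "extracted" now holds the handful of strings; the open question is how to
// get them out of the extension and onto the local file system.
console.log(extracted);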

Related

How to access stored values of other Chrome or Firefox extensions

I need a clarification regarding the Chrome local storage access. Say, I have designed two Chrome extensions - Extension1 and Extension2. I set a value val_ext1 in Extension1 and val_ext2 in Extension2 by calling "chrome.storage.sync.set".
I need to know whether I can access val_ext2 from Extension1 and val_ext1 from Extension2. If yes, it would be helpful if someone could share some pointers on how to do so. Additionally, I would also like to know whether it is the same for Mozilla extensions.
Chrome
No, this is not possible. chrome.storage.* is only accessible to the extension which stored the data. There is no API which would allow you to access data stored in a StorageArea.
If you want to gain access to that data, you will need to use Cross-extension messaging. However, this requires both extensions to be written to facilitate doing so. This is unlikely to be the case for an extension which you do not control.
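When you do control both extensions, a minimal sketch of cross-extension messaging looks like this (the extension ID below is a placeholder):

// In Extension1: ask Extension2 for one of its stored values.
chrome.runtime.sendMessage('EXTENSION2_ID_PLACEHOLDER', { get: 'val_ext2' },
  function (response) {
    console.log('Value from Extension2:', response.value);
  });

// In Extension2: answer requests coming from other extensions.
chrome.runtime.onMessageExternal.addListener(function (msg, sender, sendResponse) {
  if (msg.get) {
    chrome.storage.sync.get(msg.get, function (items) {
      sendResponse({ value: items[msg.get] });
    });
    return true; // keep the channel open for the asynchronous response
  }
});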
Firefox
In this regard (and in most other ways), Firefox WebExtensions have the same restrictions as Chrome extensions.
Other types of Firefox extensions are significantly more capable than WebExtensions/Chrome extensions. Using any of the other types of Firefox extensions, you should be able to write code which grants you access to the data stored in a different WebExtension's StorageArea. The WebExtensions APIs are primarily implemented in JavaScript, and other types of Firefox extensions have the ability, if they choose, to modify the JavaScript that ships with Firefox. Thus, while there is no specific API available that would allow you to do so, you could write an extension to do it, even if that requires the extension to modify the JavaScript code which ships with Firefox. Obviously, modifying the stock Firefox code should not be your first choice for accomplishing something, but it may be what is needed to accomplish what you desire.
One of my extensions modifies the stock Firefox code in order to change a value in one file from being defined as a const to being configurable via an options page. In that particular extension, the code is completely replaced. However, it is also possible to programmatically change portions of the Firefox code.

File protocol to get directory listing

I hope you can help me.
I am writing a desktop program that will run in a web browser (in HTML/CSS/JavaScript, in case that wasn't clear). It will be entirely disconnected from the internet and will obtain files and data using only the file protocol. My question is: how can you obtain a listing of the contents of a directory referenced this way?
I've been searching for months and have turned up almost nothing! Maybe I just don't know how to search, but there doesn't seem to be much information about how browsers actually deal with the file protocol.
For example, when you open a directory in Chrome, it gives you a nice table with hyperlinks of all the parent directory's children. However, when you look at the source code, it's as if Chrome just magically knew exactly what files were in the directory. I feel that if I could understand how it knew that, maybe I'd be able to get somewhere...
Also, I am open to other ideas about how to get a directory listing. I've read about being able to do it with PHP, but that requires running a server. Does anyone know if it is possible to run PHP code with the file protocol rather than HTTP?
Thanks for reading this far, and truly, any information that could remove me from this standstill is appreciated!
Web apps do not have access to the user's file system, so you will not be able to do what the Chrome file browser does with a web app. I believe Chrome is using some sort of native code to do this.
I would recommend trying something a little more on the native side. A Chrome app will let you use HTML, CSS, and JavaScript while also giving you access to the file system. https://developer.chrome.com/apps/app_storage#filesystem
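As a rough sketch, listing a directory from a Chrome packaged app looks something like this, assuming the manifest requests the fileSystem permission (with directory access):

chrome.fileSystem.chooseEntry({ type: 'openDirectory' }, function (dirEntry) {
  var reader = dirEntry.createReader();
  // readEntries may need to be called repeatedly for very large directories.
  reader.readEntries(function (entries) {
    entries.forEach(function (entry) {
      console.log(entry.isDirectory ? '[dir] ' + entry.name : entry.name);
    });
  });
});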
Another alternative is that you could write some sort of native Java application. That would allow you to read/write all the files you want.

Get URL for all tabs in browser using plugin OSX

I want to get all the open URLs from browsers running on the device without having to develop extensions. There are two reasons I don't want to develop extensions. First, for Chrome, the user has to go to the Chrome Web Store to install the extension. Second, I would have to write an extension for every installed browser.
So I started off by looking into Scripting Bridge, but it turns out it doesn't work for Chrome without GUI scripting (for which users have to enable assistive devices).
So instead, I am looking into building a plugin. The thing, though, is that plugins can only support certain MIME types. How do I make sure my plugin is called from any webpage? Unless there is a universal MIME type present in all webpages, I am not sure how to solve this problem.
In any case, do you think this is the best way to go? Or is there any other way to get the URLs of all open tabs?
The only way to get a plugin automatically added to all pages would be with an extension, and without the plugin being loaded in every page there is no way for it to know about any page other than the one a given plugin instance is loaded in.
Plugins are not aware of the browser, only of the page they are inserted into (or loaded to handle, in the case of a plugin that handles a MIME type such as .pdf). See http://npapi.com/extensions for more information on the capabilities of a plugin vs. an extension.
Because plugins only know about a page, though, that means that they can't find out about other pages in the same browser process, including tabs. They simply don't have any method for doing this, and that is by design; the API developers didn't want anyone to be able to have a plugin that handles a media type that could somehow tie into your banking site window in another tab without you realizing it. Of course, certain extension frameworks might allow you to find a way to do that anyway, but a plugin itself cannot.
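For comparison, here is roughly all the extension route (which the question is trying to avoid) would take in Chrome; a sketch assuming the "tabs" permission is declared in the manifest:

// List the URL of every open tab in every window.
chrome.tabs.query({}, function (tabs) {
  tabs.forEach(function (tab) {
    console.log(tab.url);
  });
});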

Access Google Chrome's cache

Is it possible to access Google Chrome's cache from within an extension?
I'd like to write an extension that loads a cached version of a page when the online one can't be accessed (e.g. Internet connectivity issue).
Updated: I know I could write an NPAPI plugin accessible through an extension to accomplish this but I'd rather not suffer writing one... I am after a solution without resorting to NPAPI, please.
Note: as far as I can tell, Google Chrome doesn't support this functionality (at least not out-of-the-box): I just had an episode of "no Internet access" and I was stranded...
Unfortunately, I'm 99% sure that this is impossible without using an NPAPI plugin in your extension.
Chrome extensions are sandboxed to their own process, and can only access files within the extension's folder.
There is some support for things like chrome://favicon/. But that's about it, at least for now.
Source (Google Chrome Extensions Reference)
P.S. I just had a crazy idea. Extensions only have access to files in their folder... but Chrome stores its cache in the Cache folder. What you might try is to copy (or move) the Cache folder into a subfolder within the extension. The extension should then be able to access the cache.
Whether this is enough to actually enable offline mode... I don't know. I do see some HTML files (and obviously a lot of images) within my Cache folder, though.
In fact, even without using an extension, I can open up the HTML files in Chrome. And because they're stored on your computer, you should be able to access them even without internet.
P.S. the Cache folder is stored at PATH-TO-CHROME/Default/Cache
P.P.S. there is a way to store an entire webpage and archive it for later use. Check out this extension:
https://chrome.google.com/extensions/detail/mpiodijhokgodhhofbcjdecpffjipkle
Just make a simple extension manifest that injects a script which loads jQuery from a CDN via AJAX, and then uses it to parse all the <a> elements on the page and alter the href values to have this prefix: http://webcache.googleusercontent.com/search?q=cache:
So <a href="http://stackoverflow.com/questions/blah"> becomes:
<a href="http://webcache.googleusercontent.com/search?q=cache:http://stackoverflow.com/questions/blah">
Voilà, you are cache surfing, but you still need to be able to reach Google. I understand this answer is a bit outside the scope of the question, but it still solves a lot of web connectivity issues.
I'm tempted to just go write this plugin but I bet it'd be taboo in Google's eyes, so it'd get blocked or removed rather quickly. :)
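For what it's worth, a sketch of the content-script rewrite described above, assuming jQuery has already been made available on the page (bundled with the extension or loaded from a CDN):

var CACHE_PREFIX = 'http://webcache.googleusercontent.com/search?q=cache:';

$('a[href^="http"]').each(function () {
  var href = $(this).attr('href');
  if (href.indexOf(CACHE_PREFIX) !== 0) {  // don't rewrite a link twice
    $(this).attr('href', CACHE_PREFIX + href);
  }
});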

What are the pros and cons of various ways of analyzing websites?

I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.
I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.
Ideas about how I might progress.
1) Load the site into an iframe. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?
2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.
3) Grab the site on the server side. Use libraries on the server to parse.
4) YQL. Isn't this pretty much built for parsing sites?
My suggestion would be:
a) Choose a scripting language. I suggest Perl or Python; curl + Bash is also an option, but it has no exception handling.
b) Load the home page via a script, using a Python or Perl library.
Try Perl's WWW::Mechanize module.
Python has plenty of built-in modules; also take a look at www.feedparser.org.
c) Inspect the server headers (via an HTTP HEAD request) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress, etc.) there; see the sketch after this list.
d) Use the Google XML API to query for something like "link:sitedomain.com" to find links pointing to the site; again, you will find Python code examples on Google's pages. Asking Google for the domain's ranking can also be helpful.
e) You can collect the data in an SQLite DB, then post-process it in Excel.
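Although this answer suggests Perl or Python, here is the header-inspection step (c) sketched in JavaScript (Node.js 18+, using its built-in fetch) just to illustrate the idea:

// Send a HEAD request and inspect response headers for server/CMS hints.
async function inspectHeaders(url) {
  const res = await fetch(url, { method: 'HEAD' });
  console.log('Server:      ', res.headers.get('server'));
  console.log('X-Powered-By:', res.headers.get('x-powered-by'));
  // Some CMSes identify themselves in a header, e.g. X-Generator.
  console.log('X-Generator: ', res.headers.get('x-generator'));
}

inspectHeaders('https://example.com');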
You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.
An iframe is a widget for displaying HTML content; it is not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.
Tools in languages like Python, Java, and PHP are certainly more powerful for your tasks than JavaScript or whatever you have in those Firefox extensions.
It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters, no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object", etc.
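A rough sketch of that fetch-and-parse idea (Node.js 18+ assumed); a real tool would use a proper HTML parser rather than regular expressions:

// Fetch a page and count a few asset-related tags.
async function countAssets(url) {
  const html = await (await fetch(url)).text();
  const stats = {
    images: (html.match(/<img\b/gi) || []).length,
    objects: (html.match(/<object\b/gi) || []).length,
    links: (html.match(/<a\b/gi) || []).length,
  };
  console.log(stats);
}

countAssets('https://example.com');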
I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow has been developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and JavaScript summaries).
I suggest you try option #4 first (YQL):
The reason being that it looks like this might get you all the data you need, and you could then build your tool as a website or similar, where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox add-on).
I think you should probably try to stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.
Also, I have used option #3 (grabbing the site on the server side), and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX - SO BE WARY OF THAT OBSTACLE! Other people here have run into that also; see this: Scrape a dynamic website
THE AJAX DYNAMIC CONTENT ISSUE:
There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content and using the evalScripts:true parameter. See the following articles for more info and an issue you might need to be aware of with how evaluated JavaScript from the grabbed content works:
Prototype library: http://www.prototypejs.org/api/ajax/updater
Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17
Or if you are willing to spend money, take a look at this:
http://aptana.com/jaxer/guide/develop_sandbox.html
Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic AJAX-enabled site such as Digg.com.
http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx
Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure if this article and the cURL library cover the AJAX content issue:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
One thing I just thought of that might work is:
grab the content and evaluate it using AJAX.
send the content to your server.
evaluate the page, links, etc..
[OPTIONAL] save the content as a local page on your server.
return the statistics info back to the page.
[OPTIONAL] display cached local version with highlighting.
Note (on the optional local copy above): if saving a local version, you will want to use regular expressions to convert relative link paths (for images especially) so that they point to the right place.
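A sketch of that rewrite, using the URL constructor to resolve relative paths against the page's base URL; it skips URLs that are already absolute, protocol-relative, data: URIs, or fragments:

function absolutize(html, baseUrl) {
  return html.replace(
    /(src|href)=(["'])(?!https?:|\/\/|data:|#)([^"']*)\2/gi,
    function (m, attr, quote, path) {
      // Resolve the relative path against the page's base URL.
      return attr + '=' + quote + new URL(path, baseUrl).href + quote;
    }
  );
}

// absolutize('<img src="images/logo.png">', 'https://example.com/page/')
// returns '<img src="https://example.com/page/images/logo.png">'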
Good luck!
Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX. Digg.com does, MSN.com does for its news feeds, etc...
That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.
I’m right in the middle of a similar project. It has to analyze the DOM of a page generated using JavaScript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.
A Fx addon is great if you don’t need the automation. A script can analyze the page, show you the results, ask you to correct the parts that it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don’t need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!).
Another way is Adobe AIR. Here, you have more control over the application: you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don’t have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some JavaScript (to create a trusted-untrusted sandbox bridge)… It’s a dirty hack, but it works.
Edit:
In Adobe AIR, there are two ways to access a foreign website’s DOM:
Load it via Ajax, create an HTMLLoader object, and feed the response into it (the loadString method, IIRC).
Create an iframe, and load the site in an untrusted sandbox.
I don’t remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn’t work around). And I had to create a sandbox to access the site’s DOM. Here’s a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS, which creates childSandboxBridge and exposes some methods to the parent (in this case, the AIR application). The script content is something like:
window.childSandboxBridge = {
  // ... some methods returning data
};
(Be careful: there are limitations on what can be passed via the sandbox bridge. No complex objects, for sure! Use only primitive types.)
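For completeness, a heavily hedged sketch of the parent (application sandbox) side, assuming the remote page sits in an iframe with id "remote" and that the injected script defined a hypothetical getTitle() method on the bridge:

// The parent reaches the child's bridge through the iframe's contentWindow.
var bridge = document.getElementById('remote').contentWindow.childSandboxBridge;
var title = bridge.getTitle(); // hypothetical method; only primitives cross the bridge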
So, the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I’ve done this using Apache + PHP, but it could certainly be done with a real proxy with some plugins/custom modules. This way I had access to the DOM of any site.
end of edit.
The third way I know of, and the hardest, is to set up an environment similar to those on Browsershots. Then you’re using Firefox with automation. If you have Mac OS X on a server, you could play with ActionScript to do the automation for you.
So, to sum up:
PHP/server-side script: you have to implement your own browser, JS engine, CSS parser, and so on, but it is fully under your control and automated.
Firefox Addon: has access to the DOM and all that stuff. Requires a user to operate it (or at least an open Firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.
Adobe AIR: requires a working desktop computer; more difficult than creating a Fx addon, but more powerful.
Automated browser: more of a desktop programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)
Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might want to include in your stats are backlinks/PageRank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings for the page for various search strings (dunno if that's programmatically available, though).
I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.
Perl
Python
.NET language of choice
Java
whatever language you are most comfortable with. A basic stand-alone script/app keeps you from needing to worry too much about browser integration and security issues.