I'm creating a Python web crawler that can walk through web history, parse the information it finds, and store the important pieces in a database for forensic/academic purposes. I understand how to crawl websites; the part I'm struggling with is crawling through web history. I will give a scenario:
During a forensic investigation, you have been given a full forensic image of a suspect's computer. You locate the AppData folder for Google Chrome, which stores all information about the suspect, including form information, credentials, and web history.
How would I set up the web crawler to only search through data in the suspect's web history?
As a starting point, I am also having trouble accessing the information stored in my own Google Chrome User Data folder. I am currently attempting to open the files with DB Browser to view my own web history, but I'm not having much luck. Any suggestions?
For those interested in this project, I can update this thread as I go so you can follow the crawler's progress. The end result will take web history and data from public and private websites, sort the important information (i.e. name, address, D.O.B.) into a database, and use it later as a biographic dictionary.
I WILL STRESS THIS AGAIN: THIS IS ALL FOR ACADEMIC PURPOSES, IN A CONTROLLED ENVIRONMENT, AND USED ON A TEST/FAKE ACCOUNT.
Hindsight (https://github.com/obsidianforensics/hindsight) is an open source tool written in Python that can parse a ton of information from the files in the /Google/Chrome/User Data/ directory.
You could look at its source for inspiration, or just run the tool and parse its output (it can produce XLSX, JSON, or SQLite) in your crawler.
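If you want to query the History database directly rather than going through Hindsight, note that it is just a SQLite file. Below is a minimal sketch, assuming you work on a copy of the History file extracted from the image (Chrome locks the live file, which is the usual reason DB Browser fails to open it); the urls table and Chrome's microseconds-since-1601 timestamps are standard, but verify them against the Chrome version in your image:

```python
import sqlite3
from datetime import datetime, timedelta

# Path to a *copy* of the History file from the image, e.g.
# .../AppData/Local/Google/Chrome/User Data/Default/History
HISTORY_DB = "History"

def chrome_time(microseconds):
    """Chrome stores timestamps as microseconds since 1601-01-01 UTC."""
    return datetime(1601, 1, 1) + timedelta(microseconds=microseconds)

conn = sqlite3.connect(HISTORY_DB)
rows = conn.execute(
    "SELECT url, title, visit_count, last_visit_time "
    "FROM urls ORDER BY last_visit_time DESC"
)
for url, title, visit_count, last_visit in rows:
    print(chrome_time(last_visit), visit_count, title, url)
conn.close()
```

The URLs pulled this way can then serve as the crawler's seed list, which restricts the crawl to pages that actually appear in the suspect's history.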
Related
I'm trying to download (or alternatively open and save) approximately 30,000 PDF documents. The documents are only accessible through a third-party service provider's website/platform (there are no ethical dilemmas here).
The website is secure and needs to be logged into (I have access) and the table is generated via AJAX. The report I intend on reading from has a URL of the form https://sub.website.com/au/report/index?id=1001# that doesn't change when dates or other filters change. In total there are 180,000+ table entries, not all have an associated invoice and not all invoices are required.
Using Chrome DevTools I can see the elements; the table's selector is #reportResults, and the invoice details are in an HTML element.
There also looks to be an API, but I don't know where to start with that either.
How do I scrape this data using VBA? I have downloaded the JSON.bas module recommended in other answers for parsing JSON from AJAX responses, but in this situation I don't know how to use it or where to go from here.
I'm handy with VBA but have no experience with any other languages.
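I can't write the VBA for you, but the general approach is the same in any language: open DevTools' Network tab, filter to XHR, change a date or other filter on the page, and note the request the table makes; then replay that request with your authenticated session and page through the results. Here is a rough Python sketch of that flow; every URL, parameter, and field name in it is hypothetical and must be read out of DevTools for the real site. In VBA the same steps map onto MSXML2.XMLHTTP plus the JSON.bas parser you already have:

```python
import requests

session = requests.Session()
# Hypothetical login request; copy the real one from DevTools.
session.post("https://sub.website.com/au/login",
             data={"user": "me", "pass": "secret"})

page = 1
while True:
    # Hypothetical AJAX endpoint that feeds the #reportResults table.
    resp = session.get(
        "https://sub.website.com/au/report/data",
        params={"id": 1001, "page": page},
    )
    resp.raise_for_status()
    rows = resp.json().get("rows", [])
    if not rows:
        break
    for row in rows:
        # Hypothetical field; not every entry has an associated invoice.
        if row.get("invoiceUrl"):
            pdf = session.get(row["invoiceUrl"])
            with open(f"invoice_{row['id']}.pdf", "wb") as f:
                f.write(pdf.content)
    page += 1
```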
Let's say I have a web URL to a file on cloud storage (like Dropbox, Google Drive, etc.). How do I convert that to the corresponding file path on my PC? On Android? On iOS?
Assuming of course I have the utilities/apps installed locally.
EDIT: I'm interested in the reverse direction too. (I.e., when I have the local file path, what is the web path?)
EDIT 2: @Greg just made me realize that the problem is much worse on Google Drive than on Dropbox.
And that is very bad. :-(
The reason? Google has good search capabilities on Drive, and therefore I, and many, many others, have put our documents on Drive. However, once I have found a document I must locate it on my own computer/device (if I want to edit a PDF, for example).
EDIT 3: @Dan McGrath kindly asked what parts remain unsolved.
Short answer: All. ;-)
Long answer: My actual use case, see below.
My actual use case is a Zotero web app. Zotero is a reference database in which you store references to scientific articles, web pages, etc. The items stored in Zotero may include PDF files or, which I prefer, links to PDF files.
I just want to be able to easily access (read) these PDF files from any computer through the web app. And on my own computer I want to be able to edit the files with my local PDF editor (be it on Android, Windows, or whatever).
By using a cloud storage I do not have to download/upload the files myself. The cloud storage takes care of that part.
For the "reverse" scenario, that is, you have a file and you want the Dropbox shared link, you can use this API endpoint, assuming you're connected to the account via the API:
https://www.dropbox.com/developers/core/docs#shares
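That link points at the old core API; in the current Dropbox Python SDK the equivalent call is sharing_create_shared_link_with_settings. A minimal sketch, assuming you have an access token for the account and that the file exists at the given Dropbox path:

```python
import dropbox

# Access token for the linked account (assumption: you have created one
# in the Dropbox App Console).
dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")

# Map a path inside the account's Dropbox to a shareable web URL.
link = dbx.sharing_create_shared_link_with_settings("/papers/report.pdf")
print(link.url)  # e.g. https://www.dropbox.com/s/.../report.pdf?dl=0
```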
I have some data for a webapp that I would like to store on the server. What would be a good location to put those files?
I have a couple of static HTML pages that contain instance specific information. They need to survive a re-deploy of the webapp. They need to be editable by the server's administrator. They are included in other HTML pages using the html object tag.
I want to store preferences on the server, but cannot use a database. I am using JSP to write and read the preferences. There is no sensitive data in the preferences. Currently I am using the log directory, but that is obviously not a great choice.
I am using Tomcat. I thought of creating an appdata/myapp directory under the webapp directory. Is that good or bad?
If the server's administrator can also deploy the app, I would add the data file itself into the source control for the app, and deploy it all together. This way you get revision control of the data, and you get the ability to revert to known good data if the server fails.
If the administrator can't deploy the app, but can only edit the file, then you need plans to back up that file in the case that the server or server filesystem dies.
A third solution would be a hybrid: put the app in one source code repository. Put the data in a second source code repository. The administrator can edit the data and deploy the data. The developer can edit the app source code, and deploy the source code. This way, both are revision controlled, but you've separated responsibility for who maintains what.
I am making a Chrome extension which needs to add/delete/modify files in any location on the hard drive. The location can be a temporary folder. How can this be done? Please give comments and helpful links that can help me get this working.
You cannot do this directly, but it would work to pair the extension with a local server (Node.js, Deno, cs-script, Go, Python, Lua, ...) that implements fixed, security-checked file operations and answers the extension over HTTP in an AJAX/JSONP request.
The extension will not be able to install the software part.
Edit: if you want to get started using Node.js, this could help.
Edit 2: with the File and Directory Entries API (this could help) you can get hold of a file or a complete folder (getDirectory(), showDirectoryPicker()).
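To make the local-server suggestion above concrete, here is a minimal sketch in Python (one of the options listed above) using only the standard library. It accepts a POST from the extension and writes the payload into one fixed temporary directory, which is exactly the kind of fixed, limited logic meant by "security" here; the port, directory, and JSON shape are all arbitrary choices for the sketch:

```python
import json
import os
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

# The only directory this helper will ever touch: the security boundary.
SAFE_DIR = os.path.join(tempfile.gettempdir(), "extension-files")
os.makedirs(SAFE_DIR, exist_ok=True)

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        # Strip path components so the caller cannot escape SAFE_DIR.
        name = os.path.basename(body["name"])
        with open(os.path.join(SAFE_DIR, name), "w") as f:
            f.write(body["content"])
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")  # allow the extension's fetch
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

HTTPServer(("127.0.0.1", 8765), Handler).serve_forever()
```

The extension then just issues a fetch("http://127.0.0.1:8765", ...) with a JSON body; delete and modify operations would be additional, equally constrained handlers.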
Thankfully, this is impossible.
Google, or any other company, wouldn't have many friends if installing their extensions could compromise your machine by giving complete control over any files on your hard drive. An extension can save information to disk, but only in the location set aside for storing local data, as mentioned. You will have no execute permission on the root or anywhere else, nor any read or write permission outside that storage location.
However, extensions can still be malicious if they gather information from a user of a web page (I am sure that Google can filter some suspicious extensions).
If you really need to make changes on your hard drive, you can store information on a server and poll for changes with a Windows client application, or perhaps you can find where the extension's stored data is kept on disk and access it from there with a Windows app.
I have a requirement to build an application with the following features:
Statistical and Source data is presented on simple HTML pages
Some missing Source data can be added from that HTML page (the data will be both exact numerical values and descriptive text)
Some new Source data can be added from those pages
Confirmed and verified data will NOT be editable via the HTML interface
Data is stored and made continuously available via the HTML interface
Periodically the data added/changed from the interface needs to be pulled back into the source data, but in a VERY controlled way. All data changes and submissions will need verification and checking, and some will trigger re-runs of models (some of which take hours to run).
In terms of overview architecture I have:
Large DB that stores and manages the data; this is designed for import processes and analysis, and is not ideal for web presentation or an interface
Code servers that manipulate the data for imports and analysis
Frontend server that works as a proxy, adding a layer of security in front of S3
Collection of generated html files on S3 presenting the data required
Before reading about the Google Drive Realtime API my rough plan was to simply serialize data from the HTML interface and post to S3. The import server scripts would then check for new information, grab it, check it, log it and process it into the main data set.
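The import-side polling in that plan is simple enough; here is a rough sketch with boto3, where the bucket name and prefix layout are placeholders for whatever the HTML interface actually writes:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-bucket"        # placeholder
PREFIX = "submissions/pending/"  # where the web interface posts serialized changes

# Pick up anything the interface has posted since the last run.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    raw = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    submission = json.loads(raw)
    # ... check, log, and queue the submission for the main import here ...
    # Move the object out of "pending" so it is processed exactly once.
    done_key = obj["Key"].replace("pending/", "processed/", 1)
    s3.copy_object(Bucket=BUCKET, Key=done_key,
                   CopySource={"Bucket": BUCKET, "Key": obj["Key"]})
    s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```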
That basic process, however, would mean that once changes were submitted from the web page, they would disappear from the user's view until the backend had processed them.
With the Google Drive Realtime API it would appear I could get the best of both worlds.
However, for the above to work, I would need to be able to access the Collaboration Document in code from the code servers and export the data.
The Realtime API gives JavaScript access to export the document and hand it off to a function; in my use case, however, I want to automate the export from the Collaboration Document.
The Google Drive SDK, as far as I can see, gives no hints on downloading/exporting a file of type "Collaboration File".
What "non-browser-user" triggered methods are there for interfacing with the Collaboration Documents and exporting them?
David
Server-side export is not supported right now. What you could do is save the realtime model to a regular Drive file and read from that using the standard Drive API. See https://developers.google.com/drive/realtime/models-files for some discussion of different ways to set up interactions between realtime models and Drive files.
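The read-back half of that workaround, fetching the regular Drive file from the code servers, might look like this with the Python Drive client; the credentials setup and FILE_ID are assumptions (the file must be accessible to whatever account the server authenticates as):

```python
import io
from google.oauth2 import service_account
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

# Assumption: a service account that the Drive file has been shared with.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/drive.readonly"],
)
service = build("drive", "v3", credentials=creds)

FILE_ID = "..."  # the regular Drive file the realtime model was saved into

# Stream the file's contents down to the server.
request = service.files().get_media(fileId=FILE_ID)
buf = io.BytesIO()
downloader = MediaIoBaseDownload(buf, request)
done = False
while not done:
    _, done = downloader.next_chunk()

print(buf.getvalue().decode("utf-8"))
```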