I am using Playwright and am having difficulty launching 4 scripts at the same time.
I have used these variables for the local browser:
let CHROMIUM_DATA_DIR = `/Users/yo/dataDir/datadir${this.cmd}`
let CHROMIUM_EXEC_PATH = `/Applications/Google-Chrome${this.cmd}.app/Contents/MacOS/Google Chrome`
I made 4 copies of the same data directory and the same executable, renaming only the files/directories.
It does not work well. What would you recommend for quickly scaling up the launch of the scrapers? How can I install several Chrome instances and manage a matching data directory for each (to save login sessions, etc.)?
Thanks
Since you are using playwright, you can use persistent contexts.
You do not need to create your own data directories or executables by copying them. Simply pass the location of an empty directory when launching the browser, and Playwright will populate it itself, storing any session data.
I do not use Node.js, but just to give an idea, here is sample code in Python:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch_persistent_context(user_data_dir=r'C:\Users\me\Desktop\dir', headless=False)
    page = browser.new_page()
    page.goto("http://playwright.dev")
    print(page.title())
    browser.close()
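To run several scrapers at once, each one needs its own data directory, while the browser install itself can be shared. Here is a minimal sketch of that idea. The base directory `profiles`, the helper names `make_profile_dirs`, `run_scraper` and `launch_all`, and the choice of one process per scraper are all assumptions for illustration; only `launch_persistent_context` comes from the sample above.

```python
import multiprocessing
from pathlib import Path


def make_profile_dirs(base_dir, count):
    """Create one user-data directory per scraper and return their paths."""
    dirs = []
    for index in range(count):
        profile = Path(base_dir) / f"datadir{index}"
        profile.mkdir(parents=True, exist_ok=True)
        dirs.append(str(profile))
    return dirs


def run_scraper(user_data_dir):
    """Open one persistent context that stores its session in user_data_dir."""
    # Imported lazily so the directory helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(user_data_dir=user_data_dir, headless=False)
        page = context.new_page()
        page.goto("http://playwright.dev")
        print(user_data_dir, page.title())
        context.close()


def launch_all(base_dir="profiles", count=4):
    """Start `count` scrapers in parallel, each with its own profile directory."""
    workers = [
        multiprocessing.Process(target=run_scraper, args=(profile,))
        for profile in make_profile_dirs(base_dir, count)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Call `launch_all()` from under an `if __name__ == "__main__":` guard. Because the directories are reused on the next run, any login session saved in them should still be there.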
I'm making a web scraper that I use for drop/add at my school, but I'm having trouble getting it to run on my Debian VPS. It works perfectly on my Mac, but for some reason pywebcopy just doesn't want to work on the VPS. I'm also willing to use something else, because I only need the HTML, whereas from what I have seen pywebcopy is built for getting everything from a website. So recommendations for other ways to go about it, or a fix for my issue, would be a blessing. Also, this code is run every second and is intended to keep running until I get all of my classes added. On my Mac it runs with Python 3.6, but on the VPS it runs with Python 3.7, so that might be another issue.
# Assumed imports (not shown in the original snippet)
import os
from pywebcopy import WebPage, config

def getwebpage(crn):
    # 18139
    kwargs = {'project_name': 'site folder'}
    url = 'https://.edu/bprod/bwckschd.p_disp_listcrse?term_in=202101&subj_in=CSC&crse_in=4780&crn_in=' + crn
    config.setup_config(
        # URL of the website
        url,
        # folder where the copy will be saved
        project_folder='rawhtml/' + crn,
        **kwargs
    )
    wp = WebPage()
    wp.get(url)
    wp.save_html()
    dirs = os.listdir("rawhtml/" + crn + "/site folder/.edu/bprod")
    file_path = "rawhtml/" + crn + "/site folder/.edu/bprod/" + dirs[0]
    return file_path
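Since only the HTML is needed, one possible alternative is to skip pywebcopy and fetch the page with the standard library. This is just a sketch: the host in `BASE_URL` is a hypothetical placeholder (the real one is elided in the question), and `build_url` and `fetch_html` are names made up for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical placeholder: the real host is elided in the question.
BASE_URL = "https://example.edu/bprod/bwckschd.p_disp_listcrse"


def build_url(crn, base_url=BASE_URL):
    """Build the course-listing URL for one CRN using the query parameters from the question."""
    query = urllib.parse.urlencode({
        "term_in": "202101",
        "subj_in": "CSC",
        "crse_in": "4780",
        "crn_in": crn,
    })
    return f"{base_url}?{query}"


def fetch_html(url, timeout=10):
    """Download a single page and return its HTML as text (no assets are saved)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

`fetch_html(build_url("18139"))` would then return the page source as a string, with nothing written to disk, so there is no folder layout to depend on between the Mac and the VPS.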
I've looked through various sources online and done a number of Google searches, but I can't seem to find any specific instructions on how to work with the V8 --trace-* flags in Google Chrome. I've seen a few "You can do this as well in Chrome" remarks, but I haven't been able to find what I'm looking for, which is output like this (snippets are near the bottom of the post): Optimizing for V8.
I found a reference saying that the data is logged to a file: Profiling Chromium with V8. I've also found that the file is likely named v8.log (I lost that link), but I haven't found any clues as to how to generate that file or where it is located. It didn't appear to be in the Chrome directory or the user directory.
Apparently I need to enable .map files for chrome.dll as well, but I wasn't able to find anything to help me with that.
The reason I would prefer to use Chrome's V8 for this as opposed to building V8 and using a shell is because the JavaScript I would like to test makes use of DOM, which I do not believe would be included in the V8 shell. However if it is, that would be great to know, then I can rewrite the code to work sans-html file and test. But my guess is that V8 by itself is sans-DOM access, like node.js
So to sum things up:
Running Google Chrome Canary on Windows 7 ultimate x64
Shortcut target is "C:\Users\ArkahnX\AppData\Local\Google\Chrome SxS\Application\chrome.exe" --no-sandbox --js-flags="--trace-opt --trace-bailout --trace-deop" --user-data-dir=C:\chromeDebugProfile
Looking for whether this type of output can be logged from chrome
If so, where would the log be?
If not, what sort of output should I expect, and again, where could I find it?
Thank you for any assistance!
Amending with how I got the answer to work for me
Using the answer below, I installed Python to its default directory and modified the script so it had the full path to Chrome. From there I associated .py files with Python and executed the script. Now every time I open Chrome Canary it will run that Python script (at least until I restart my PC, after which I'll have to run the script again).
The result is exactly what I was looking for!
On Windows, stdout output is suppressed because chrome.exe is a GUI application. You need to flip the Subsystem field in the PE header from IMAGE_SUBSYSTEM_WINDOWS_GUI to IMAGE_SUBSYSTEM_WINDOWS_CUI to see what V8 outputs to stdout.
You can do it with the following (somewhat hackish) Python script:
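import mmap
import ctypes

GUI = 2  # IMAGE_SUBSYSTEM_WINDOWS_GUI
CUI = 3  # IMAGE_SUBSYSTEM_WINDOWS_CUI

with open("chrome.exe", "r+b") as f:
    # Map the first 1 KB of the executable, where the PE headers are expected to be
    pe = mmap.mmap(f.fileno(), 1024, None, mmap.ACCESS_WRITE)
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature
    e_lfanew = ctypes.c_uint.from_buffer(pe, 30 * 2).value
    # Skip the signature (4 bytes) and the file header (20 bytes);
    # Subsystem sits at offset 68 in the optional header
    subsystem = ctypes.c_ushort.from_buffer(pe, e_lfanew + 4 + 20 + (17 * 4))
    if subsystem.value == GUI:
        subsystem.value = CUI
        print("patched: gui -> cui")
    elif subsystem.value == CUI:
        subsystem.value = GUI
        print("patched: cui -> gui")
    else:
        print("unknown subsystem: %x" % subsystem.value)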
Close all Chrome instances and execute this script. When you restart chrome.exe you should see a console window appear, and you should be able to redirect stdout via >.
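If you would rather capture the output from a script than from a shell redirect, something like the following might work once the executable writes to stdout. It is only a sketch: `build_chrome_command` and `run_with_log` are hypothetical helper names, and the flags are simply the ones from the question's shortcut.

```python
import subprocess


def build_chrome_command(chrome_path, user_data_dir, js_flags):
    """Assemble the command line from the question's shortcut as an argument list."""
    return [
        chrome_path,
        "--no-sandbox",
        "--js-flags=" + " ".join(js_flags),
        "--user-data-dir=" + user_data_dir,
    ]


def run_with_log(command, log_path):
    """Start Chrome and send whatever it writes to stdout/stderr into log_path."""
    with open(log_path, "w") as log_file:
        return subprocess.call(command, stdout=log_file, stderr=subprocess.STDOUT)
```

Passing the arguments as a list avoids having to quote the space-separated V8 flags inside `--js-flags` by hand.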
If you're not keen on hacking the PE header of Chrome, there is an alternative for Windows.
Because the Chrome app doesn't create a console stdout on Windows, all tracing in V8 (and in d8) is sent to OutputDebugString instead.
OutputDebugString writes to a shared memory object that can be read by any other application.
Microsoft has a tool called DebugView, which monitors this output and, if required, also streams it to a log file.
DebugView is free and can be downloaded from Microsoft: http://technet.microsoft.com/en-us/sysinternals/bb896647.aspx
A forum I frequent was down today, and upon restoration, I discovered that the last two days of forum posting had been rolled back completely.
Needless to say, I'd like to get back what data I can from the forum loss, and I am hoping I have at least some of it stored in the cache files that Chrome created.
I face two problems -- the cache files have no filetype, and I'm unsure how to read them in an intelligent manner (trying to open them in Chrome itself seems to "redownload" them in a .gz format), and there are a ton of cache files.
Any suggestions on how to read and sort these files? (A simple string search should fit my needs)
EDIT: The below answer no longer works see here
In Chrome or Opera, open a new tab and navigate to chrome://view-http-cache/
Click on whichever file you want to view.
You should then see a page with a bunch of text and numbers.
Copy all the text on that page.
Paste it in the text box below.
Press "Go".
The cached data will appear in the Results section below.
Try Chrome Cache View from NirSoft (free).
EDIT: The below answer no longer works see here
Chrome stores the cache as a hex dump. OSX comes with xxd installed, which is a command line tool for converting hex dumps. I managed to recover a jpg from my Chrome's HTTP cache on OSX using these steps:
Go to: chrome://cache
Find the file you want to recover and click on its link.
Copy the 4th section to your clipboard. This is the content of the file.
Follow the steps on this gist to pipe your clipboard into the python script which in turn pipes to xxd to rebuild the file from the hex dump:
https://gist.github.com/andychase/6513075
Your final command should look like:
pbpaste | python chrome_xxd.py | xxd -r - image.jpg
If you're unsure what section of Chrome's cache output is the content hex dump take a look at this page for a good guide:
http://www.sparxeng.com/blog/wp-content/uploads/2013/03/chrome_cache_html_report.png
Image source: http://www.sparxeng.com/blog/software/recovering-images-from-google-chrome-browser-cache
More info on XXD: http://linuxcommand.org/man_pages/xxd1.html
Thanks to Mathias Bynens above for sending me in the right direction.
EDIT: The below answer no longer works see here
If the file you try to recover has Content-Encoding: gzip in the header section, and you are using linux (or as in my case, you have Cygwin installed) you can do the following:
visit chrome://view-http-cache/ and click the page you want to recover
copy the last (fourth) section of the page verbatim to a text file (say: a.txt)
xxd -r a.txt| gzip -d
Note that other answers suggest passing the -p option to xxd. I had trouble with that, presumably because the fourth section of the cache is not in the "postscript plain hexdump style" but in the "default style".
It also does not seem necessary to replace double spaces with a single space, as chrome_xxd.py does (in case it is necessary, you can use sed 's/  / /g' for that).
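For anyone without xxd at hand, the same hexdump-to-gzip step can be sketched in Python. The line layout assumed here (an offset, a colon, then up to 16 space-separated two-digit hex bytes, then an ASCII column) is my reading of the dump format and is not guaranteed; `hexdump_to_bytes` and `gunzip_hexdump` are hypothetical names.

```python
import gzip
import re

# Assumed row layout: "<offset>: <up to 16 two-digit hex bytes>  <ASCII column>".
# Bytes are expected to be separated by one or two spaces, so the wider gap
# before the ASCII column on a short last row is not mistaken for data.
_ROW = re.compile(r"^\s*[0-9a-fA-F]+:((?: {1,2}[0-9a-fA-F]{2}(?![0-9a-fA-F])){1,16})")


def hexdump_to_bytes(text):
    """Rebuild the raw bytes from the hex columns of a dump, ignoring other lines."""
    data = bytearray()
    for line in text.splitlines():
        match = _ROW.match(line)
        if match:
            data.extend(bytes.fromhex(match.group(1)))
    return bytes(data)


def gunzip_hexdump(text):
    """Convert the dump back to bytes and gunzip it (the `xxd -r | gzip -d` step)."""
    return gzip.decompress(hexdump_to_bytes(text))
```

With the fourth section saved in a.txt, `gunzip_hexdump(open("a.txt").read())` would return the decoded body as bytes.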
Note: The flag show-saved-copy has been removed and the below answer will not work
You can read cached files using Chrome alone.
Chrome has a feature called Show Saved Copy Button:
Show Saved Copy Button Mac, Windows, Linux, Chrome OS, Android
When a page fails to load, if a stale copy of the page exists in the browser cache, a button will be presented to allow the user to load that stale copy. The primary enabling choice puts the button in the most salient position on the error page; the secondary enabling choice puts it secondary to the reload button. #show-saved-copy
First, disconnect from the Internet to make sure that the browser doesn't overwrite the cache entry. Then navigate to chrome://flags/#show-saved-copy and set the flag value to Enable: Primary. After you restart the browser, the Show Saved Copy button will be enabled. Now enter the cached file's URI in the browser's address bar and hit Enter. Chrome will display the "There is no Internet connection" page along with a Show saved copy button.
After you hit the button, the browser will display the cached file.
I've made a short, crude script that extracts JPG and PNG files:
#!/usr/bin/php
<?php
$dir="/home/user/.cache/chromium/Default/Cache/"; // Chrome or Chromium cache folder
$ppl="/home/user/Desktop/temporary/"; // Place for extracted files
$list=scandir($dir);
foreach ($list as $filename)
{
    if (is_file($dir.$filename))
    {
        $cont=file_get_contents($dir.$filename);
        if (strstr($cont,'JFIF'))
        {
            echo ($filename." JPEG \n");
            $start=(strpos($cont,"JFIF",0)-6);
            $end=strpos($cont,"HTTP/1.1 200 OK",0);
            $cont=substr($cont,$start,$end-6);
            $wholename=$ppl.$filename.".jpg";
            file_put_contents($wholename,$cont);
            echo("Saving :".$wholename." \n" );
        }
        elseif (strstr($cont,"\211PNG"))
        {
            echo ($filename." PNG \n");
            $start=(strpos($cont,"PNG",0)-1);
            $end=strpos($cont,"HTTP/1.1 200 OK",0);
            $cont=substr($cont,$start,$end-1);
            $wholename=$ppl.$filename.".png";
            file_put_contents($wholename,$cont);
            echo("Saving :".$wholename." \n" );
        }
        else
        {
            echo ($filename." UNKNOWN \n");
        }
    }
}
?>
I had some luck with this open-source Python project, seemingly inactive:
https://github.com/JRBANCEL/Chromagnon
I ran:
python2 Chromagnon/chromagnonCache.py path/to/Chrome/Cache -o browsable_cache/
And I got a locally-browsable extract of all my open tabs cache.
The Google Chrome cache directory $HOME/.cache/google-chrome/Default/Cache on Linux contains one file per cache entry named <16 char hex>_0 in "simple entry format":
20 Byte SimpleFileHeader
key (i.e. the URI)
payload (the raw file content i.e. the PDF in our case)
SimpleFileEOF record
HTTP headers
SHA256 of the key (optional)
SimpleFileEOF record
If you know the URI of the file you're looking for it should be easy to find. If not, a substring like the domain name, should help narrow it down. Search for URI in your cache like this:
fgrep -Rl '<URI>' $HOME/.cache/google-chrome/Default/Cache
Note: If you're not using the default Chrome profile, replace Default with the profile name, e.g. Profile 1.
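The same search can be sketched in Python, which may help on systems without fgrep. The function name `find_cache_entries` is made up for illustration; it only looks for the URI as a raw byte string and does not parse the entry format described above.

```python
import os


def find_cache_entries(cache_dir, needle):
    """Return paths of cache files whose raw bytes contain `needle` (like fgrep -Rl)."""
    needle_bytes = needle.encode("utf-8")
    matches = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read():
                        matches.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(matches)
```

For example, `find_cache_entries(os.path.expanduser("~/.cache/google-chrome/Default/Cache"), "example.com")` would list every entry whose key or content mentions that domain.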
It was removed on purpose and it won't be coming back.
Both chrome://cache and chrome://view-http-cache were removed starting with Chrome 66. They work in version 65.
Workaround
You can check the chrome://chrome-urls/ for complete list of internal Chrome URLs.
The only workaround that comes to mind is to open menu / More tools / Developer tools and select the Network tab.
The reason why it was removed is this bug:
https://chromium.googlesource.com/chromium/src.git/+/6ebc11f6f6d112e4cca5251d4c0203e18cd79adc
https://bugs.chromium.org/p/chromium/issues/detail?id=811956
The discussion:
https://groups.google.com/a/chromium.org/forum/#!msg/net-dev/YNct7Nk6bd8/ODeGPq6KAAAJ
The JPEXS Free Flash Decompiler has Java code to do this in its source tree for both Chrome and Firefox (no support for Firefox's more recent cache2, though).
EDIT: The below answer no longer works see here
Google Chrome cache file format description.
Cache files list, see URLs (copy and paste to your browser address bar):
chrome://cache/
chrome://view-http-cache/
Cache folder on Linux: ~/.cache/google-chrome/Default/Cache
Check whether a file contains GZIP-encoded data:
$ head f84358af102b1064_0 | hexdump -C | grep --before-context=100 --after-context=5 "1f 8b 08"
Extract a Chrome cache file with a PHP one-liner (skipping the gzip header and the CRC32 and ISIZE blocks):
$ php -r "echo gzinflate(substr(strchr(file_get_contents('f84358af102b1064_0'), \"\x1f\x8b\x08\"), 10, -8));"
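A rough Python counterpart of the extraction step, for comparison. This is a sketch under the assumption that the first `1f 8b 08` sequence in the file really is the start of the gzip stream; `extract_gzip_member` is a hypothetical name. Unlike the PHP one-liner, it lets zlib parse the gzip header and trailer itself instead of cutting fixed offsets.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"


def extract_gzip_member(blob):
    """Decompress the first gzip stream found in `blob`; return None if there is none."""
    start = blob.find(GZIP_MAGIC)
    if start == -1:
        return None
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    # Any bytes after the end of the stream are left in inflater.unused_data.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return inflater.decompress(blob[start:])
```

If the magic bytes happen to appear by chance before the real stream, `zlib.error` is raised, so a caller may want to catch that and try the next match.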
Note: The below answer is out of date since the Chrome disk cache format has changed.
Joachim Metz provides some documentation of the Chrome cache file format with references to further information.
For my use case, I only needed a list of cached URLs and their respective timestamps. I wrote a Python script to get these by parsing the data_* files under C:\Users\me\AppData\Local\Google\Chrome\User Data\Default\Cache\:
import datetime

with open('data_1', 'rb') as datafile:
    data = datafile.read()

for ptr in range(len(data)):
    fourBytes = data[ptr : ptr + 4]
    if fourBytes == b'http':
        # Found the string 'http'. Hopefully this is a Cache Entry
        endUrl = data.index(b'\x00', ptr)
        urlBytes = data[ptr : endUrl]
        try:
            url = urlBytes.decode('utf-8')
        except UnicodeDecodeError:
            continue
        # Extract the corresponding timestamp
        try:
            timeBytes = data[ptr - 72 : ptr - 64]
            timeInt = int.from_bytes(timeBytes, byteorder='little')
            secondsSince1601 = timeInt / 1000000
            jan1601 = datetime.datetime(1601, 1, 1, 0, 0, 0)
            timeStamp = jan1601 + datetime.timedelta(seconds=secondsSince1601)
        except OverflowError:
            # Value too large to be a real timestamp: not a cache entry
            continue
        print('{} {}'.format(str(timeStamp)[:19], url))
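The timestamp conversion on its own, as a small sketch. The helper names are hypothetical; the only facts taken from the script above are that the value is an 8-byte little-endian integer counting microseconds since 1 January 1601. Passing the integer as `microseconds` avoids the float division.

```python
import datetime

CHROME_EPOCH = datetime.datetime(1601, 1, 1)


def chrome_time_to_datetime(microseconds):
    """Convert microseconds since 1601-01-01 to a datetime."""
    return CHROME_EPOCH + datetime.timedelta(microseconds=microseconds)


def decode_timestamp(raw):
    """Decode an 8-byte little-endian timestamp field."""
    return chrome_time_to_datetime(int.from_bytes(raw, byteorder="little"))
```

As a sanity check, 11644473600 seconds separate 1601-01-01 from 1970-01-01, so `chrome_time_to_datetime(11644473600 * 10**6)` gives the Unix epoch.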
I've got a Perl script that groks a bunch of log files looking for "interesting" lines, for some definition of interesting. It generates an HTML file which consists of a table whose columns are a timestamp, a filename/linenum reference and the "interesting" bit. What I'd love to do is have the filename/linenum be an actual link that will bring up that file with the cursor positioned on that line number, in emacs.
emacsclientw will allow such a thing (e.g. emacsclientw +60 foo.log), but I don't know what kind of URL/URI to construct that will let Firefox call out to emacsclientw. The original HTML file will be local, so there's no problem there.
Should I define my own MIME type and hook in that way?
Firefox version is 3.5 and I'm running Windows, in case any of that matters. Thanks!
Go to the about:config page in Firefox. Add a new string:
network.protocol-handler.app.emacs
value: the path to a script that parses the URL without the protocol (what comes after emacs://) and then calls emacsclient with the proper arguments.
You can't just put the path of emacsclient, because everything after the protocol is passed as one argument to the executable, so your +60 foo.log would be a new file named that way.
But you could easily imagine something like emacs:///path/to/your/file/LINENUM and have a little script that removes the final / and number and calls emacsclient with the number and the file :-)
EDIT: I could do that in bash if you want, but I don't know how to do that with the Windows "shell" or whatever it is called.
EDIT2: I was wrong about something: the protocol is passed in the argument string too!
Here is a little bash script that I just made for myself. BTW, thanks for the idea :-D
#!/bin/bash

ARG=${1##emacs://}
LINE=${ARG##*/}
FILE=${ARG%/*}
if wmctrl -l | grep emacs@romuald &>/dev/null; then # if there's already an emacs frame
    ARG="" # then just open the file in the existing emacs frame
else
    ARG="-c" # else create a new frame
fi
emacsclient $ARG -n +$LINE "$FILE"
exit $?
and my network.protocol-handler.app.emacs in my iceweasel (firefox) is /home/p4bl0/bin/ffemacsclient. It works just fine !
And yes, my laptop's name is romuald ^^.
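For a handler that is not tied to bash, the same parsing could be sketched in Python. This is only an illustration: `parse_emacs_uri` and `open_in_emacs` are hypothetical names, the frame detection via wmctrl is left out, and it assumes an emacsclient binary on the PATH that accepts `-n` and `+LINE FILE` as in the script above.

```python
import re
import subprocess
import sys

_URI = re.compile(r"^emacs://(.*)/([0-9]+)$")


def parse_emacs_uri(uri):
    """Split emacs:///path/to/file/LINENUM into (path, line); return None if it does not match."""
    match = _URI.match(uri)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def open_in_emacs(uri, emacsclient="emacsclient"):
    """Call emacsclient with +LINE FILE for a parsed URI and return its exit status."""
    parsed = parse_emacs_uri(uri)
    if parsed is None:
        print("Unable to parse the URI <%s>" % uri, file=sys.stderr)
        return 1
    path, line = parsed
    return subprocess.call([emacsclient, "-n", "+%d" % line, path])
```

A wrapper script would call `open_in_emacs(sys.argv[1])`, since the browser passes the whole URI, protocol included, as a single argument.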
Thanks for the pointer, p4bl0. Unfortunately, that only works on a real OS; Windows uses a completely different method. See http://kb.mozillazine.org/Register_protocol for more info.
But, you certainly provided me the start I needed, so thank you very, very much!
Here's the solution for Windows:
First you need to set up the registry correctly to handle this new URL type. For that, save the following to a file, edit it to suit your environment, save it and double click on it:
Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\emacs]
@="URL:Emacs Protocol"
"URL Protocol"=""

[HKEY_CLASSES_ROOT\emacs\shell]

[HKEY_CLASSES_ROOT\emacs\shell\open]

[HKEY_CLASSES_ROOT\emacs\shell\open\command]
@="\"c:\\product\\emacs\\bin\\emacsclientw.exe\" --no-wait -e \"(emacs-uri-handler \\\"%1\\\")\""
This is not as robust as p4bl0's shell script, because it does not make sure that Emacs is running first. Then add the following to your .emacs file:
(defun emacs-uri-handler (uri)
  "Handles emacs URIs in the form: emacs:///path/to/file/LINENUM"
  (save-match-data
    (if (string-match "emacs://\\(.*\\)/\\([0-9]+\\)$" uri)
        (let ((filename (match-string 1 uri))
              (linenum (match-string 2 uri)))
          (with-current-buffer (find-file filename)
            (goto-line (string-to-number linenum))))
      (beep)
      (message "Unable to parse the URI <%s>" uri))))
The above code will not check to make sure the file exists, and the error handling is rudimentary at best. But it works!
Then create an HTML file that has lines like the following:
<a href="emacs://c:/temp/my.log/60">file: c:/temp/my.log, line: 60</a>
and then click on the link.
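Generating such links from a script could look roughly like this. It is a sketch in Python rather than the Perl mentioned in the question, `emacs_link` is a hypothetical name, and the href simply follows the emacs://PATH/LINENUM pattern that emacs-uri-handler parses.

```python
import html


def emacs_link(path, line):
    """Return an HTML anchor whose href follows the emacs://PATH/LINENUM form."""
    href = "emacs://%s/%d" % (path, line)
    label = "file: %s, line: %d" % (path, line)
    return '<a href="%s">%s</a>' % (html.escape(href, quote=True), html.escape(label))
```

For a Unix-style path such as /var/log/foo.log, this produces an href with three slashes (emacs:///var/log/foo.log/7), matching the form in the handler's docstring.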
Post Script:
I recently switched to Linux (Ubuntu 9.10) and here's what I did for that OS:
$ gconftool -s /desktop/gnome/url-handlers/emacs/command '/usr/bin/emacsclient --no-wait -e "(emacs-uri-handler \"%s\")"' --type String
$ gconftool -s /desktop/gnome/url-handlers/emacs/enabled --type Boolean true
Using the same emacs-uri-handler from above.
Might be a great reason to write your first FF plugin ;)