HTTrack returns file not found - html

I downloaded a website with HTTrack using the following command:
/usr/local/bin/httrack https://www.website.com -O /Users/mainuser/Desktop/website -n -j
I then located the index.html file in the website folder and opened it. Chrome returned the message: file not found. That's funny, because normally the websites I mirror with HTTrack work just fine on my file system. What could be the reason for this behaviour?

Try the following, using the -v (verbose) flag so that HTTrack gives you as much debugging information as possible:
/usr/local/bin/httrack "https://www.website.com" -O "/Users/mainuser/Desktop/website" -n -j -v
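A further thing worth checking (my addition, assuming HTTrack's usual behaviour of writing its own log into the folder given with -O): the hts-log.txt file records which files were skipped or failed and why.
# hts-log.txt should sit in the output folder; the path assumes the -O value used above
less "/Users/mainuser/Desktop/website/hts-log.txt"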

Related

Running Chromium headless from a shell, how do I get a log of the page resources being loaded?

I have a shell script that runs chromium-browser headless, using it to print HTML into a PDF file.
$ chromium-browser --headless --disable-gpu \
--print-to-pdf=/home/www-data/build/out.pdf \
/home/www-data/build/out.html
If there are any errors with the page (anything from JavaScript errors to unavailable resources), I would like my shell script to behave accordingly. But how do I even tell if such errors occur?
Below is a screenshot from Chromium where an environment variable was not set, and therefore the path to the CSS files is incorrect.
I have tried in various ways to get this information out of Chrome headless, but without luck.
My best lead was to look in the log under the user data directory, but even with the highest logging level I don't see anything about resources failing to load. All I can find about loading those resources is shown below:
$ rm /home/www-data/userdata -rf
$ chromium-browser --headless --disable-gpu \
--enable-logging \
--v=4 \
--user-data-dir=/home/www-data/userdata \
--print-to-pdf=/home/www-data/build/out.pdf \
/home/www-data/build/out.html
$ grep css userdata -r
userdata/Default/chrome_debug.log:[0921/163104.096612:VERBOSE1:file_url_loader_factory.cc(451)] FileURLLoader::Start: file:///css/bootstrap.css
userdata/Default/chrome_debug.log:[0921/163104.096746:VERBOSE1:file_url_loader_factory.cc(451)] FileURLLoader::Start: file:///css/email.css
userdata/Default/chrome_debug.log:[0921/163104.096873:VERBOSE1:file_url_loader_factory.cc(451)] FileURLLoader::Start: file:///css/print_common.css
userdata/Default/chrome_debug.log:[0921/163104.097021:VERBOSE1:file_url_loader_factory.cc(451)] FileURLLoader::Start: file:///css/print_A4P.css
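For what it's worth, one way a script could react to such errors (my sketch, not a confirmed solution: whether missing file:// resources actually surface in the log depends on the Chromium version) is to send Chromium's logging to stderr and scan it after the run:
# Sketch: capture the log on stderr and fail the script if a load error appears.
# The grep patterns are assumptions; adjust them to whatever your Chromium version logs.
chromium-browser --headless --disable-gpu --enable-logging=stderr --v=1 \
  --print-to-pdf=/home/www-data/build/out.pdf \
  /home/www-data/build/out.html 2> /tmp/chrome_stderr.log
if grep -Eq "ERR_FILE_NOT_FOUND|Failed to load resource" /tmp/chrome_stderr.log; then
  echo "page resources failed to load" >&2
  exit 1
fi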

Getting wget past the "infected with a virus" screen on Google Drive

So I've been trying to get wget to download a Google Drive file that I uploaded. Unfortunately, Google Drive incorrectly flags the file as a virus, so wget can't get the direct download link.
Things I've tried:
using the gdrive.pl file that someone made, but I'm on Windows, and /tmp/cookies.txt does not exist.
doing wget --no-check-certificate https://docs.google.com/uc?export=download&id=FILEID -O FILENAME, but it says 400 Bad Request
using https://docs.google.com/uc?export=download&id=ID, but it fails because of the download infected file warning.
Does anyone have any suggestions to solve this?
Here is what I was able to do, based on a starting point I found at https://medium.com/@acpanjan/download-google-drive-files-using-wget-3c2c025a8b99 :
Edit: I noticed you said Windows, so this command with sed won't work natively on Windows; I'll put steps without sed for Windows below.
You of course start by sharing the file and getting the file ID from the share link on Google Drive. Then:
wget --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p' > /tmp/confirm && wget --load-cookies /tmp/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm="$(cat /tmp/confirm)"&id=SHARE_LINK_ID" -O YOUR_FILENAME && rm /tmp/cookies.txt /tmp/confirm
Replace SHARE_LINK_ID with your ID from your shared file link. Replace YOUR_FILENAME with your desired output file name.
This attempts to download the file and gets the html of the warning message about potential viruses in the file. It uses cookies as you need to use the same session ID for the subsequent download with the confirmation code.
It then gets the generated confirm code from that response and writes it to a temporary file.
It then does another wget, adding the confirmation code to the query string to download the file, using the saved cookie so the confirmation code works for the same session.
Most likely this could be worked into a script that takes the share link ID as an argument to make it more useful; a rough sketch follows below.
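For example, a minimal bash wrapper might look like this (the script name and variable names are just placeholders, and it assumes the confirm-token flow described above still works on Google's side):
#!/bin/bash
# gdrive-get.sh SHARE_LINK_ID OUTPUT_FILENAME -- rough sketch of the two-step download above
set -e
ID="$1"
OUT="$2"
COOKIES=$(mktemp)
# Step 1: save the session cookie and pull the confirm token out of the warning page
CONFIRM=$(wget --quiet --save-cookies "$COOKIES" --keep-session-cookies --no-check-certificate \
  "https://docs.google.com/uc?export=download&id=${ID}" -O- \
  | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')
# Step 2: re-use the cookie and pass the confirm token to fetch the real file
wget --load-cookies "$COOKIES" --no-check-certificate \
  "https://docs.google.com/uc?export=download&confirm=${CONFIRM}&id=${ID}" -O "$OUT"
rm -f "$COOKIES"
Run it as ./gdrive-get.sh SHARE_LINK_ID YOUR_FILENAME.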
For Windows (without sed)
wget --save-cookies %TMP%/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O %TMP%/confirm.txt
Downloads the confirmation html.
notepad %TMP%/confirm.txt
Opens %TMP%/confirm.txt in Notepad so you can find the confirm code (Ctrl+F for "confirm=" and copy the code right after it). Put that code into the command below, along with your desired filename and the share link ID from Google Drive:
wget --load-cookies %TMP%/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm=CONFIRM_CODE&id=SHARE_LINK_ID" -O YOUR_FILENAME
Delete the temp files:
del %TMP%/cookies.txt %TMP%/confirm.txt
Try this. Don't forget to replace the two FILEID fields and the one FILENAME field with your file's ID and your desired output file name, respectively.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
source: https://medium.com/geekculture/wget-large-files-from-google-drive-336ba2e1c991

How to translate my cURL command into Chrome command?

I want to fire a POST request from the command line, to post my image to an image-searching site. At first I tried cURL and got this command, which works:
curl -i -X POST -F file=@search.png http://saucenao.com/search.php
It posts a file as form data to the search site and returns an HTML page full of JavaScript, which makes it hard to read in a terminal. It's also hard to preview online images in a terminal.
Then I remembered that I can open Chrome with arguments from the command line, which I thought might solve my problem. After some digging I found the Chrome switches, but it seems these are just startup flags (I'm not sure if this is right, but I didn't find a way to fire a POST request the way cURL does).
So, can I start Chrome from the command line with a POST request, just like my cURL command above?
There are a couple of things you could do.
1. You could write a script in JavaScript that will send the POST request and display the results inside the <body> element or the like;
2. You could keep the cURL command and use the -o (or --output) switch to save the resulting HTML in a file (but lose the -i switch, to avoid having the headers in the file), then open the file in Chrome or whichever browser you prefer. You could combine the two commands as a one-liner in any operating system. If you use Ubuntu, for example:
$ curl -o search.html -X POST -F file=@search.png http://saucenao.com/search.php && google-chrome search.html && rm search.html
According to this answer you could use bcat in order to avoid using a temporary file. Install it by apt-get install ruby-bcat and then just run
$ curl -X POST -F file=@search.png http://saucenao.com/search.php | bcat
I think the easier option is #2, but whichever you prefer.

Wget -i gives no output or results

I'm learning data analysis in Zeppelin; I'm a mechanical engineer, so this is outside my expertise.
I am trying to download two CSV files using a file that contains the URLs, test2.txt. When I run it I get no output, but no error message either. I've included a link to a screenshot showing my code and the results.
When I go into the Ambari Sandbox I cannot find any files created. I'm assuming the directory the file is in is where the CSV files will be downloaded to. I've tried using -P as well, with no luck. I've checked man wget, but it did not help.
So I have several questions:
How do I show the output from running wget?
Where is the default directory that wget stores files?
Do I need additional data in the file other than just the URLs?
Screenshot: Code and Output for %sh
Thanks for any and all help.
%sh
wget -i /tmp/test2.txt
%sh
# list the current working directory
pwd # output: /home/zeppelin
# make a new folder, created in "tmp" because it is temporary
mkdir -p /home/zeppelin/tmp/Folder_Name
# change directory to new folder
cd /home/zeppelin/tmp/Folder_Name
# transfer the file from the sandbox to the current working directory
hadoop fs -get /tmp/test2.txt /home/zeppelin/tmp/Folder_Name/
# download the URL
wget -i test2.txt
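On the first two questions above (my addition, assuming the Zeppelin %sh interpreter only shows stdout): wget prints its progress on stderr and, without -P, saves files into the current working directory. Redirecting stderr and listing the folder makes both visible:
%sh
cd /home/zeppelin/tmp/Folder_Name
# wget reports on stderr; send it to stdout so the note displays it (-nv keeps it short)
wget -nv -i test2.txt 2>&1
# confirm the CSV files actually arrived
ls -l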

Wget recognizes some part of my URL address as a syntax error

I am quite new to wget and I have done my research on Google, but I found no clue.
I need to save a single HTML file of a webpage:
wget yahoo.com -O test.html
and it works, but, when I try to be more specific:
wget http://search.yahoo.com/404handler?src=search&p=food+delicious -O test.html
here comes the problem: the &p=food+delicious part is treated as a separate command, and I get: 'p' is not recognized as an internal or external command
How can I solve this problem? I really appreciate your suggestions.
The & has a special meaning in the shell. Escape it with \ or put the URL in quotes to avoid this problem.
wget http://search.yahoo.com/404handler?src=search\&p=food+delicious -O test.html
or
wget "http://search.yahoo.com/404handler?src=search&p=food+delicious" -O test.html
In many Unix shells, putting an & after a command causes it to be executed in the background.
Wrap your URL in single quotes to avoid this issue.
i.e.
wget 'http://search.yahoo.com/404handler?src=search&p=food+delicious' -O test.html
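One caveat worth adding (my note, not from the original answers): the error message 'p' is not recognized as an internal or external command comes from Windows cmd.exe, where & separates commands and single quotes are not quoting characters, so there the double-quoted form is the one to use:
REM Windows cmd.exe: single quotes are passed through literally, double quotes group the URL
wget "http://search.yahoo.com/404handler?src=search&p=food+delicious" -O test.html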
If you are using a Jupyter notebook, first check that you have installed wget:
pip install wget
before running the command on the URL.