Wget copy text from html

I'm really new to programming and Linux/Unix, so I was wondering what command I can use to copy only the text of a webpage and save it to a file in the current directory. I want to copy the text of something like this:
http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words
Would wget do it? Also, what specific command saves it into the directory?

Another option, using wget as you wondered, would be:
wget -O file.txt "http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words"
The -O option lets you specify which file name you want to save it to.

One option would be:
curl -s http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words > file
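If the target were a regular HTML page rather than a plain-text file like the one linked above, you would also want to strip the markup. A hedged sketch, assuming lynx is installed (this is an extra step, not part of the answers above, and the URL is just a placeholder):
lynx -dump "http://example.com/page.html" > file.txt   # renders the page and keeps only the text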

Related

How can I create a hyperlink via the command line?

I would like to generate a directory of links for some friends who are not technologically savvy. I'm running Ubuntu and would like to do this via the command line.
My attempts so far have been:
touch https:...
which returns:
touch: cannot touch 'https:...': No such file or directory
cat >> https://...
which also returns the No such file or directory exception.
I also tried echo where the link was the filename and the file type was .html, which returned the same exception.
If I drag and drop the link from the address bar into a folder, it creates the hyperlink - however I would like to batch these according to a list of links.
EDIT: This can be done in Python.
I was able to find a question on SO which supplied an alternative using Python.
You could try the following:
$ ln -s {source-filename} {symbolic-filename}
source-filename - the target file which you want to create the link for
symbolic-filename - the name of the symbolic link
Example:
ln -s source_file.txt link_file.txt
You can verify that the link was created using the following command:
ls -l link_file.txt
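To batch these according to a list of links, as the question asks, here is a hedged sketch (not from the answer above) that writes one freedesktop .desktop link file per URL; it assumes a desktop environment that honors Type=Link entries and a file links.txt with one URL per line:
# Create one browser-openable .desktop shortcut per URL listed in links.txt
while read -r url; do
  name=$(basename "$url")
  printf '[Desktop Entry]\nType=Link\nName=%s\nURL=%s\n' "$name" "$url" > "$name.desktop"
done < links.txt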

wget command to download web-page & rename file with html title?

I would like to download an html web-page and have the filename be the title of the html page.
I have found a command to get the html title:
wget -qO- 'https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/' | gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
And it prints this: Austrumi Linux Has Great Potential if You Speak Its Language | Reviews | LinuxInsider
Found on: https://unix.stackexchange.com/questions/103252/how-do-i-get-a-websites-title-using-command-line
How could I pipe the title back into wget to use it as the filename when downloading that web page?
EDIT: In case there is no way to do this directly in wget, I found a way to simply rename the html files once downloaded:
Renaming HTML files using <title> tags
You can't wget a file, analyze its contents, and then make the same wget execution that downloaded the file magically go back in time and output it to a new file named after the contents you analyzed in step 2. Just do this:
wget '...' > tmp &&
name=$(gawk '...' tmp) &&
mv tmp "$name"
Add protection against / in name as necessary.
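Putting those pieces together with the title-extraction command from the question, a hedged sketch might look like this (tmp.html is just an illustrative temporary name, and the parameter expansion at the end replaces any / in the title):
url='https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/'
wget -q "$url" -O tmp.html &&
title=$(gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}' tmp.html) &&
mv tmp.html "${title//\//_}.html"   # strip / from the title before using it as a filename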

Wget get past "infected with a virus" screen on Google Drive

So I've been trying to get wget to download a Google Drive file that I uploaded. Unfortunately, Google Drive incorrectly flags the file as a virus, so wget can't get the direct download link.
Things I've tried:
using the gdrive.pl file that someone made, but I'm on Windows, and /tmp/cookies.txt does not exist.
doing wget --no-check-certificate https://docs.google.com/uc?export=download&id=FILEID -O FILENAME, but it says 400 Bad Request
using https://docs.google.com/uc?export=download&id=ID, but it fails because of the download infected file warning.
Does anyone have any suggestions to solve this?
Here is what I was able to do, based on a starting point I found at https://medium.com/@acpanjan/download-google-drive-files-using-wget-3c2c025a8b99:
Edit: I noticed you said Windows, so this command with sed won't work natively in Windows; I'll put steps without sed for Windows below.
You of course start by sharing the file and getting the file ID from the share link on google drive. Then:
wget --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p' > /tmp/confirm && wget --load-cookies /tmp/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm="$(cat /tmp/confirm)"&id=SHARE_LINK_ID" -O YOUR_FILENAME && rm /tmp/cookies.txt /tmp/confirm
Replace SHARE_LINK_ID with your ID from your shared file link. Replace YOUR_FILENAME with your desired output file name.
This attempts to download the file and gets the html of the warning message about potential viruses in the file. It uses cookies as you need to use the same session ID for the subsequent download with the confirmation code.
It then gets the generated confirm code from that response and writes it to a temporary file.
It then does another wget, adding the confirmation code to the query string to download the file, using the saved cookie so the confirmation code works for the saved session.
Most likely this could be worked into a script, passing an argument of the share link ID to make it more useful.
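As a hedged sketch of such a script (the script name, argument handling, and mktemp usage are my own additions, not from the original answer):
#!/usr/bin/env bash
# Usage: ./gdrive-get.sh SHARE_LINK_ID YOUR_FILENAME
id="$1"; out="$2"
cookies=$(mktemp); page=$(mktemp)
# First request: save the session cookie and the virus-warning page.
wget --quiet --save-cookies "$cookies" --keep-session-cookies --no-check-certificate \
  "https://docs.google.com/uc?export=download&id=${id}" -O "$page"
# Pull the confirm code out of the warning page.
confirm=$(sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' "$page")
# Second request: repeat the download with the confirm code and the same session.
wget --load-cookies "$cookies" --no-check-certificate \
  "https://docs.google.com/uc?export=download&confirm=${confirm}&id=${id}" -O "$out"
rm -f "$cookies" "$page"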
For Windows (without sed)
wget --save-cookies %TMP%/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O %TMP%/confirm.txt
Downloads the confirmation html.
notepad %TMP%/confirm.txt
Opens %TMP%/confirm.txt in Notepad to get the confirm code string (CTRL+F to look for "confirm=" and grab the code right after it). Replace CONFIRM_CODE in the command line below with that code (along with putting in the filename you want and the share link ID from Google Drive):
wget --load-cookies %TMP%/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm=CONFIRM_CODE&id=SHARE_LINK_ID" -O YOUR_FILENAME
Delete the temp files:
del %TMP%/cookies.txt %TMP%/confirm.txt
Try this. Don't forget to replace the two FILEID fields and the one FILENAME field with your file's ID and your desired output file name, respectively.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
source: https://medium.com/geekculture/wget-large-files-from-google-drive-336ba2e1c991

download html page for offline use

I want to make an html page available for offline viewing by downloading the html and all images / css resources from it, but not other pages which are links.
I was looking at httrack and wget but could not find the right set of arguments (I need the command line).
Any ideas?
If you want to download using the newest version of wget, get it via the Cygwin installer and use this version:
wget -m -w 2 -p -E -k -P {target-dir} http://{website}
to mirror {website} to {target-dir} (without images in 1.11.4).
Leave out -w 2 (the two-second wait between requests) to speed things up.
For one page, the following wget command line parameters should be enough. Keep in mind that it might not download everything, such as background images referenced from CSS files.
wget -p <webpage>
Also try wget --help for a list of all command line parameters.
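Combining the flags mentioned above for a single page, a hedged example might be (the -H and -nd flags are my additions, to also pull page requisites hosted on other domains and keep everything in one directory):
wget -p -k -E -H -nd -P offline-copy "http://{website}/page.html"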

wget SSRS report

I'm trying to pull a report down using the following:
https://user:password@domain.com/ReportServer?%2fFolder+1%2fReportName&rs:Format=CSV&rs:Command=Render
And it just pulls an html page and not the csv file. Any ideas?
What does the HTML file say? Something like "access denied"? And while you're at it, try:
wget --user bob --password 123456 'https://domain.com/ReportServer?%2fFolder+1%2fReportName&rs:Format=CSV&rs:Command=Render'
Make sure you are using quotes. Otherwise, the shell will cut off the command before the first ampersand.
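If the credentials are accepted, adding -O (as in the first question above) writes the report to a file of your choosing; report.csv here is just an illustrative name:
wget --user bob --password 123456 -O report.csv 'https://domain.com/ReportServer?%2fFolder+1%2fReportName&rs:Format=CSV&rs:Command=Render'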