wget command to download web-page & rename file with html title? - html

I would like to download an html web-page and have the filename be the title of the html page.
I have found a command to get the html title:
wget -qO- 'https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/' | gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
And it prints this: Austrumi Linux Has Great Potential if You Speak Its Language | Reviews | LinuxInsider
Found on: https://unix.stackexchange.com/questions/103252/how-do-i-get-a-websites-title-using-command-line
How could I pipe the title back into wget to use it as the filename when downloading that web-page?
EDIT: In case there is no way to do this directly in wget, I found a way to simply rename the html files once downloaded:
Renaming HTML files using <title> tags

You can't wget a file, analyze its contents, and then make the same wget execution that downloaded the file magically go back in time and output it to a new file named after the contents you analyzed in step 2. Just do this:
wget -O tmp '...' &&
name=$(gawk '...' tmp) &&
mv tmp "$name"
Add protection against / in name as necessary.
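The three steps above can be sketched as a small set of shell functions. This is only a sketch under assumptions: the function names (html_title, safe_name, fetch_as_title) are hypothetical, the title is extracted with tr and sed instead of the gawk command from the question, and "/" is replaced with "_" as one possible form of the protection mentioned above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of wget.

# Print the text of the first <title> element found in an HTML file.
html_title() {
    # Join lines so a title split across lines still matches.
    tr '\n' ' ' < "$1" |
        sed -n 's/.*<[Tt][Ii][Tt][Ll][Ee][^>]*>\([^<]*\)<\/[Tt][Ii][Tt][Ll][Ee].*/\1/p'
}

# Make a title usable as a filename: replace "/" and trim outer spaces.
safe_name() {
    printf '%s' "$1" | tr '/' '_' | sed 's/^ *//; s/ *$//'
}

# Download a URL to a temp file, then rename it after its title.
fetch_as_title() {
    url=$1
    tmp=$(mktemp) || return 1
    wget -qO "$tmp" "$url" || { rm -f "$tmp"; return 1; }
    name=$(safe_name "$(html_title "$tmp")")
    [ -n "$name" ] || name=untitled      # fall back if no title was found
    mv "$tmp" "$name.html"
}

# Usage (performs a network request):
#   fetch_as_title 'https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/'
```

Titles containing HTML entities or nested tags are not handled here; a real HTML parser would be more robust for those cases.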

Related

How can I create a hyperlink via the command line?

I would like to generate a directory of links for some friends who are not technologically savvy. I'm running Ubuntu and would like to do this via the command line.
My attempts so far have been:
touch https:...
which returns:
touch: cannot touch 'https:...': No such file or directory
cat >> https://...
which also returns the No such file or directory exception.
I also tried echo where the link was the filename and the file type was .html, which returned the same exception.
If I drag and drop the link from the address bar into a folder, it creates the hyperlink - however I would like to batch these according to a list of links.
EDIT: This can be done in Python.
I was able to find a question on SO which supplied an alternative using Python.
You could try the following:
$ ln -s {source-filename} {symbolic-filename}
source-filename - the target file for which you want to create the link
symbolic-filename - the name of the symbolic link
Example:
ln -s source_file.txt link_file.txt
You can verify that the link was created using the following command:
ls -l link_file.txt
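Note that ln -s links one filesystem path to another, so it does not by itself produce a clickable web link. For the batch case in the question, one alternative is to generate a single HTML page of links from a text file of URLs. The sketch below assumes one URL per line; the function name make_link_page and the file names urls.txt and links.html are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: make_link_page, urls.txt and links.html are hypothetical names.

# Write an HTML page containing one clickable link per line of the list file.
make_link_page() {
    list=$1
    out=$2
    {
        printf '<!DOCTYPE html>\n<html><body><ul>\n'
        # The "|| [ -n ... ]" part also handles a last line without a newline.
        while IFS= read -r url || [ -n "$url" ]; do
            [ -n "$url" ] || continue    # skip blank lines
            # Escape characters that are special inside HTML attributes.
            esc=$(printf '%s' "$url" | sed 's/&/\&amp;/g; s/"/\&quot;/g')
            printf '<li><a href="%s">%s</a></li>\n' "$esc" "$esc"
        done < "$list"
        printf '</ul></body></html>\n'
    } > "$out"
}

# Usage:
#   make_link_page urls.txt links.html
```

The resulting file can be opened in any browser, which may be easier for non-technical users than one file per link.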

Wget get past "infected with a virus" screen on Google Drive

So I've been trying to get wget to download a Google Drive file that I uploaded. Unfortunately, Google Drive incorrectly flags the file as a virus, so wget can't get the direct download link.
Things I've tried:
using the gdrive.pl file that someone made, but I'm on Windows, and /tmp/cookies.txt does not exist.
doing wget --no-check-certificate https://docs.google.com/uc?export=download&id=FILEID -O FILENAME, but it says 400 Bad Request
using https://docs.google.com/uc?export=download&id=ID, but it fails because of the download infected file warning.
Does anyone have any suggestions to solve this?
Here is what I was able to do, based on a starting point I found at https://medium.com/@acpanjan/download-google-drive-files-using-wget-3c2c025a8b99 :
Edit: I noticed you said Windows, so this command with sed won't work natively in Windows. I'll put steps without sed for Windows below.
You of course start by sharing the file and getting the file ID from the share link on google drive. Then:
wget --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p' > /tmp/confirm && wget --load-cookies /tmp/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm="$(cat /tmp/confirm)"&id=SHARE_LINK_ID" -O YOUR_FILENAME && rm /tmp/cookies.txt /tmp/confirm
Replace SHARE_LINK_ID with your ID from your shared file link. Replace YOUR_FILENAME with your desired output file name.
This attempts to download the file and gets the html of the warning message about potential viruses in the file. It uses cookies as you need to use the same session ID for the subsequent download with the confirmation code.
It then gets the generated confirm code from that response and writes it to a temporary file.
It then does another wget, adding the confirmation code to the query string to download the file and using the saved cookie so that the confirmation code works for the saved session.
Most likely this could be worked into a script, passing an argument of the share link ID to make it more useful.
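A minimal sketch of such a script is below. It assumes Google Drive still returns a confirm= token in the warning page as described above, which may no longer hold. The function names extract_confirm and gdrive_download are hypothetical, --no-check-certificate is left out, and sed -E is used in place of sed -r.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, and the confirm= token
# flow is an assumption taken from the answer above.

# Read warning-page HTML on stdin and print the first confirm code found.
extract_confirm() {
    sed -En 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' | head -n 1
}

# Download a shared Drive file by ID, reusing the session cookie.
gdrive_download() {
    id=$1
    out=$2
    base='https://docs.google.com/uc?export=download'
    cookies=$(mktemp) || return 1
    # First request: fetch the warning page and keep the session cookie.
    code=$(wget --quiet --save-cookies "$cookies" --keep-session-cookies \
        "$base&id=$id" -O- | extract_confirm)
    # Second request: same session, now with the confirmation code.
    wget --load-cookies "$cookies" "$base&confirm=$code&id=$id" -O "$out"
    status=$?
    rm -f "$cookies"
    return $status
}

# Usage (performs network requests):
#   gdrive_download SHARE_LINK_ID YOUR_FILENAME
```

Using mktemp for the cookie file avoids clashes if two downloads run at the same time.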
For Windows (without sed)
wget --save-cookies %TMP%/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O %TMP%/confirm.txt
Downloads the confirmation html.
notepad %TMP%/confirm.txt
Opens %TMP%/confirm.txt in Notepad to get the confirm code string (CTRL+F to look for "confirm=" and get the code right after that). Replace it in the below command line (along with putting in the filename you want and the share link ID from google drive)
wget --load-cookies %TMP%/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm=CONFIRM_CODE&id=SHARE_LINK_ID" -O YOUR_FILENAME
Delete the temp files:
del %TMP%\cookies.txt %TMP%\confirm.txt
Try this. Don't forget to replace the two FILEID placeholders and the one FILENAME placeholder with your file's ID and the output file's name, respectively.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
source: https://medium.com/geekculture/wget-large-files-from-google-drive-336ba2e1c991

Wget -i gives no output or results

I'm learning data analysis in Zeppelin. I'm a mechanical engineer, so this is outside my expertise.
I am trying to download two csv files using a file that contains the urls, test2.txt. When I run it I get no output, but no error message either. I've included a link to a screenshot showing my code and the results.
When I go into Ambari Sandbox I cannot find any files created. I'm assuming the directory the file is in is where the csv files will be downloaded to. I've tried using -P as well with no luck. I've checked in man wget but it did not help.
So I have several questions:
How do I show the output from running wget?
Where is the default directory that wget stores files?
Do I need additional data in the file other than just the URLs?
Screenshot: Code and Output for %sh
Thanks for any and all help.
%sh
wget -i /tmp/test2.txt
%sh
# list the current working directory
pwd # output: /home/zeppelin
# make a new folder, created in "tmp" because it is temporary
mkdir -p /home/zeppelin/tmp/Folder_Name
# change directory to new folder
cd /home/zeppelin/tmp/Folder_Name
# transfer the file from the sandbox to the current working directory
hadoop fs -get /tmp/test2.txt /home/zeppelin/tmp/Folder_Name/
# download the URL
wget -i test2.txt
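On the questions about output and location: wget writes its progress and log messages to standard error, and without -P it saves files into the current working directory. The sketch below redirects standard error to standard output so the messages can appear in the paragraph result, and uses -P to make the destination explicit. It assumes the Zeppelin %sh interpreter shows only standard output in your setup; the function name download_list and the paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: download_list and the example paths are hypothetical.

# Download every URL listed in a file (one URL per line) into a directory,
# with wget's log visible on standard output.
download_list() {
    list=$1
    dest=$2
    mkdir -p "$dest" || return 1
    echo "Saving into: $dest"
    # -nv keeps the log short; 2>&1 merges wget's stderr log into stdout.
    wget -nv -i "$list" -P "$dest" 2>&1
}

# Usage (performs network requests):
#   download_list /home/zeppelin/tmp/Folder_Name/test2.txt /home/zeppelin/tmp/Folder_Name
#   ls -l /home/zeppelin/tmp/Folder_Name
```

The list file needs nothing other than the URLs themselves, one per line.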

Wget copy text from html

I'm really new to programming and linux/unix so I was wondering what command I can use to copy the text only of a webpage and save it in a file in the directory. I want to copy the text of something like this
http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words
Would wget do it? Also, what specific commands get it saved into the directory?
Another option using wget like you wondered about would be:
wget -O file.txt "http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words"
The -O option lets you specify which file name you want to save it to.
One option would be:
curl -s http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words > file

Recursive directory parsing with Pandoc on Mac

I found this question which had an answer to the question of performing batch conversions with Pandoc, but it doesn't answer the question of how to make it recursive. I stipulate up front that I'm not a programmer, so I'm seeking some help on this here.
The Pandoc documentation is slim on details regarding passing batches of files to the executable, and based on the script it looks like Pandoc itself is not capable of parsing more than a single file at a time. The script below works just fine in Mac OS X, but only processes the files in the local directory and outputs the results in the same place.
find . -name \*.md -type f -exec pandoc -o {}.txt {} \;
I used the following code to get something of the result I was hoping for:
find . -name \*.html -type f -exec pandoc -o {}.markdown {} \;
This simple script, run using Pandoc installed on Mac OS X 10.7.4 converts all matching files in the directory I run it in to markdown and saves them in the same directory. For example, if I had a file named apps.html, it would convert that file to apps.html.markdown in the same directory as the source files.
While I'm pleased that it makes the conversion, and it's fast, I need it to process all files located in one directory and put the markdown versions in a set of mirrored directories for editing. Ultimately, these directories are in Github repositories. One branch is for editing while another branch is for production/publishing. In addition, this simple script is retaining the original extension and appending the new extension to it. If I convert back again, it will add the HTML extension after the markdown extension, and the file size would just grow and grow.
Technically, all I need to do is parse one branch's directory and sync it with the production one. Then, when all changed, removed, and new content is verified correct, I can run commits to publish the changes. It looks like the find command can handle all of this, but I just have no clue how to configure it properly, even after reading the Mac OS X and Ubuntu man pages.
Any kind words of wisdom would be deeply appreciated.
TC
Create the following Makefile:
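TXTDIR=sources
HTMLS=$(wildcard *.html)
MDS=$(patsubst %.html,$(TXTDIR)/%.markdown, $(HTMLS))
.PHONY : all
all : $(MDS)
$(TXTDIR) :
	mkdir -p $(TXTDIR)
# order-only prerequisite (after |): create the directory first, but do not rebuild when its timestamp changes
$(TXTDIR)/%.markdown : %.html | $(TXTDIR)
	pandoc -f html -t markdown -s $< -o $@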
(Note: The indented lines must begin with a TAB -- this may not come through in the above, since markdown usually strips out tabs.)
Then you just need to type 'make', and it will run pandoc on every file with a .html extension in the working directory, producing a markdown version in 'sources'. An advantage of this method over using 'find' is that it will only run pandoc on a file that has changed since it was last run.
Just for the record: here is how I achieved the conversion of a bunch of HTML files to their Markdown equivalents:
for file in *.html; do pandoc -f html -t markdown "${file}" -o "${file%html}md"; done
If you look at the -o argument, you'll see it uses shell string manipulation to replace the existing html ending with md.
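For the recursive, mirrored-directory case asked about in the question, the find command and the suffix replacement can be combined. This is a sketch under assumptions: the function name convert_tree and the directory names are hypothetical, file names are assumed not to contain newlines, and the output extension is .markdown to match the question.

```shell
#!/usr/bin/env bash
# Sketch only: convert_tree and the example directory names are hypothetical.

# Convert every .html file under src_root to Markdown, writing the result
# to the same relative path under dst_root with a .markdown extension.
convert_tree() {
    src_root=${1%/}                      # drop a trailing slash, if any
    dst_root=${2%/}
    find "$src_root" -type f -name '*.html' | while IFS= read -r src; do
        rel=${src#"$src_root"/}          # path relative to the source root
        out=$dst_root/${rel%.html}.markdown
        mkdir -p "$(dirname "$out")"     # mirror the directory structure
        pandoc -f html -t markdown "$src" -o "$out"
    done
}

# Usage:
#   convert_tree ./production ./editing
```

Because the original extension is stripped before the new one is added, converting back and forth does not keep appending extensions.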