Convert HTML with missing external references to epub? - html

I have saved this web page as a single HTML file on disk, using "Save as..." in the browser:
http://pedrokroger.net/2012/10/using-sphinx-to-write-books/
Now, I'd like to convert it to epub.
pandoc -f html -t epub -S -R -s Using\ Sphinx\ to\ Write\ Technical\ Books\ -\ Pedro\ Kroger.html -o Using\ Sphinx\ to\ Write\ Technical\ Books\ -\ Pedro\ Kroger.epub
But pandoc throws an error:
pandoc: /images/pages/profile.png: openBinaryFile: does not exist (No such file or directory)
Is there an option to tell pandoc to ignore all external references and just convert the bare text?
This would be very convenient!

Related

wget command to download web-page & rename file with html title?

I would like to download an html web-page and have the filename be the title of the html page.
I have found a command to get the html title:
wget -qO- 'https://www.linuxinsider.com/story/Austrumi-Linux-Has-Great-Potential-if-You-Speak-Its-Language-86285.html/' | gawk -v IGNORECASE=1 -v RS='</title' 'RT{gsub(/.*<title[^>]*>/,"");print;exit}'
And it prints this: Austrumi Linux Has Great Potential if You Speak Its Language | Reviews | LinuxInsider
Found on: https://unix.stackexchange.com/questions/103252/how-do-i-get-a-websites-title-using-command-line
How could I pipe the title back into wget to use it as the filename when downloading that web-page?
EDIT: in case there is no way to do this directly in wget, I found a way to simply rename the HTML files once downloaded:
Renaming HTML files using <title> tags
You can't wget a file, analyze its contents, and then have the same wget execution that downloaded the file magically go back in time and write it to a new file named after the contents you analyzed in step 2. Just do this:
wget -qO- '...' > tmp &&
name=$(gawk '...' tmp) &&
mv tmp "$name"
Add protection against / in name as necessary.
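A minimal sketch of that protection, assuming the title has already been extracted into a shell variable. The helper name safe_name is hypothetical and not part of wget or gawk; it only handles slashes and surrounding whitespace, so other awkward characters are left alone.

```shell
# Turn a page title into a string that is safe to use as a single filename.
# safe_name is a hypothetical helper, not part of wget or gawk.
safe_name() {
    # Replace every "/" with "_" so the title cannot be read as a path,
    # then strip leading and trailing whitespace.
    printf '%s' "$1" | tr '/' '_' | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

With this in place, the last step above could become mv tmp "$(safe_name "$name").html".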

Pandoc fails to embed metadata from the supplied YAML file

I need to convert some .xhtml files to regular .html (html5) with pandoc, and during the conversion I would like to embed some metadata (supplied via a YAML file) in the final files.
The conversion runs smoothly, but any attempt to embed the metadata invariably fails.
I tried many variations of this command, but it should be something like:
pandoc -s -H assets/header -c css/style.css -B assets/prefix -A assets/suffix --metadata-file=metadata.yaml input_file -o output_file --to=html5
The error I get is:
pandoc: unrecognized option `--metadata-file=metadata.yaml'
Try pandoc --help for more information.
I really don't get what's wrong with this, since I found this option in the pandoc manual
Any ideas?
Your pandoc version is too old. Update to pandoc 2.3 or later.
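To check this before running the conversion, something like the following sketch may help. The helper name version_ge is hypothetical, and it assumes a sort that supports -V (version sort), as GNU sort does.

```shell
# version_ge A B succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on the -V (version sort) option of GNU sort.
version_ge() {
    # After a version sort, the smaller version comes first.
    # A >= B holds exactly when that first line is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage (left commented out because it needs pandoc installed):
#   installed=$(pandoc --version | head -n 1 | awk '{print $2}')
#   version_ge "$installed" 2.3 || echo "pandoc $installed is too old for --metadata-file"
```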

Wget get past "infected with a virus" screen on Google Drive

So I've been trying to get wget to download a Google Drive file that I uploaded. Unfortunately, Google Drive incorrectly flags the file as a virus, so wget can't get the direct download link.
Things I've tried:
using the gdrive.pl file that someone made, but I'm on Windows, and /tmp/cookies.txt does not exist.
doing wget --no-check-certificate https://docs.google.com/uc?export=download&id=FILEID -O FILENAME, but it says 400 Bad Request
using https://docs.google.com/uc?export=download&id=ID, but it fails because of the download infected file warning.
Does anyone have any suggestions to solve this?
Here is what I was able to do, based on a starting point I found at https://medium.com/@acpanjan/download-google-drive-files-using-wget-3c2c025a8b99 :
Edit: I noticed you said Windows, so this command, which uses sed, won't work natively on Windows. Steps without sed for Windows are further down.
Start by sharing the file and copying the file ID from the share link on Google Drive. Then:
wget --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p' > /tmp/confirm && wget --load-cookies /tmp/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm="$(cat /tmp/confirm)"&id=SHARE_LINK_ID" -O YOUR_FILENAME && rm /tmp/cookies.txt /tmp/confirm
Replace SHARE_LINK_ID with your ID from your shared file link. Replace YOUR_FILENAME with your desired output file name.
This attempts to download the file and gets the html of the warning message about potential viruses in the file. It uses cookies as you need to use the same session ID for the subsequent download with the confirmation code.
It then gets the generated confirm code from that response and writes it to a temporary file.
It then runs a second wget with the confirmation code added to the query string to download the file, using the saved cookie so that the confirmation code is accepted for the saved session.
Most likely this could be worked into a script, passing an argument of the share link ID to make it more useful.
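One way such a script might look is sketched below. It assumes Google Drive still returns a confirm= token in the warning page, as described above. The function names extract_confirm and gdrive_fetch are hypothetical, and the temporary cookie file is created with mktemp instead of a fixed /tmp path.

```shell
# Pull the confirmation token out of the warning page HTML read from stdin.
# Same pattern as the one-liner above, written with sed's -E flag.
extract_confirm() {
    sed -En 's/.*confirm=([0-9A-Za-z_]+).*/\1/p' | head -n 1
}

# gdrive_fetch FILE_ID OUTPUT_FILE  (hypothetical name)
gdrive_fetch() {
    file_id=$1
    out=$2
    base='https://docs.google.com/uc?export=download'
    cookies=$(mktemp) || return 1

    # First request: save the session cookie and read the token from the warning page.
    code=$(wget --quiet --save-cookies "$cookies" --keep-session-cookies \
        --no-check-certificate "${base}&id=${file_id}" -O- | extract_confirm)

    # Second request: same session, with the token added to the query string.
    wget --load-cookies "$cookies" --no-check-certificate \
        "${base}&confirm=${code}&id=${file_id}" -O "$out"
    status=$?

    rm -f "$cookies"
    return "$status"
}
```

After sourcing the file, it would be called as gdrive_fetch SHARE_LINK_ID YOUR_FILENAME.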
For Windows (without sed)
wget --save-cookies %TMP%/cookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=SHARE_LINK_ID" -O %TMP%/confirm.txt
Downloads the confirmation html.
notepad %TMP%/confirm.txt
Opens %TMP%/confirm.txt in Notepad so you can find the confirm code (press CTRL+F, search for "confirm=", and copy the code right after it). Substitute that code for CONFIRM_CODE in the command below, and fill in the filename you want and the share link ID from Google Drive:
wget --load-cookies %TMP%/cookies.txt --no-check-certificate "https://docs.google.com/uc?export=download&confirm=CONFIRM_CODE&id=SHARE_LINK_ID" -O YOUR_FILENAME
Delete the temp files:
del "%TMP%\cookies.txt" "%TMP%\confirm.txt"
Try this. Don't forget to replace the two FILEID placeholders with your file's ID and the one FILENAME placeholder with the name of the output file.
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
source: https://medium.com/geekculture/wget-large-files-from-google-drive-336ba2e1c991

HTTrack returns file not found

I downloaded a website with HTTrack using the following command:
/usr/local/bin/httrack https://www.website.com -O /Users/mainuser/Desktop/website -n -j
I then located the index.html file in the website folder and opened it. Chrome returns the message: file not found. That's odd, because the websites I mirror with HTTrack normally work just fine from my file system. What could be the reason for this behaviour?
Try the following, using the v (verbose) flag to enable HTTrack to give you as much debugging information as possible:
/usr/local/bin/httrack "https://www.website.com" -O "/Users/mainuser/Desktop/website" -n -j -v

Use Pandoc to convert tex file

I am trying to use pandoc to convert my tex file to HTML or EPUB. It is not a complex LaTeX file with math formulas. It is something like a book.
But I have a problem. When I convert the file to PDF with pdflatex, the whole file comes out fine. But when I use
pandoc book.tex -s --webtex -o book.html
or
pandoc -S book.tex -o book.epub
It is as if nothing had been compiled: << is not replaced by «, and each command, like \emph{something}, is simply ignored and its word is deleted from the paragraph.
In fact, it is as if I had made a simple copy and paste, without commands.
In older versions of pandoc, I think you needed to tell it the input format, otherwise it would assume it was markdown.
pandoc -f latex book.tex -S -o book.epub
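If several kinds of input file are involved, the reader could be picked from the extension instead of typed by hand. The sketch below assumes a hypothetical helper named pandoc_input_format that covers only a few extensions and falls back to markdown for anything else.

```shell
# pandoc_input_format FILE prints a pandoc reader name based on FILE's extension.
# Hypothetical helper covering only a few extensions; anything else falls back to markdown.
pandoc_input_format() {
    case "$1" in
        *.tex|*.latex)        echo latex ;;
        *.html|*.htm|*.xhtml) echo html ;;
        *)                    echo markdown ;;
    esac
}
```

It could then be used as pandoc -f "$(pandoc_input_format book.tex)" book.tex -o book.epub. The -S flag is left out here because pandoc 2.0 and later no longer accept it.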