I need to convert some .xhtml files to regular .html (html5) with pandoc, and during the conversion I would like to embed some metadata (supplied via a YAML file) in the final files.
The conversion runs smoothly, but any attempt to embed the metadata invariably fails.
I tried many variations of this command, but it should be something like:
pandoc -s -H assets/header -c css/style.css -B assets/prefix -A assets/suffix --metadata-file=metadata.yaml input_file -o output_file --to=html5
The error I get is:
pandoc: unrecognized option `--metadata-file=metadata.yaml'
Try pandoc --help for more information.
I really don't get what's wrong with this, since I found this option in the pandoc manual.
Any ideas?
Your pandoc version is too old; the --metadata-file option was only added in pandoc 2.3. Update to pandoc 2.3 or later.
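You can check what is installed with pandoc --version. If upgrading right away isn't possible, one stopgap, assuming the installed release at least understands -M/--metadata, is to pass the fields individually on the command line instead of through a YAML file (the title and author values below are only placeholders):
pandoc -s -H assets/header -c css/style.css -B assets/prefix -A assets/suffix -M title="Some Title" -M author="Some Author" input_file -o output_file --to=html5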
I am trying to extract a particular section of a HTML document from a bash shell script and have been using xmlstarlet sel but I can't quite get it to return actual HTML, rather than just the text values from the HTML tags.
I'm trying a command line as follows:
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -v "." wiki.html
But it is giving text only, without any HTML/XML markup. For info, I'm trying to export this data into a HTML format outside the mediawiki instance it has come from.
If xmlstarlet is the wrong tool, any suggestions for other tools also gratefully received!
-v means --value-of, which prints only the text content of the matched nodes. You should use -c or --copy-of to get the tags themselves.
xmlstarlet sel -t -m "//div[@id='mw-content-text']" -c "." wiki.html
Or just
xmlstarlet sel -t -c "//div[@id='mw-content-text']" wiki.html
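For example, with a small test file, say sample.xml, containing <div id="d"><p>Hello <b>world</b></p></div>,
xmlstarlet sel -t -v "//div[@id='d']" sample.xml
prints only the text content, Hello world, while
xmlstarlet sel -t -c "//div[@id='d']" sample.xml
prints the element itself, markup included.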
I have saved the web page via "Save as..." in the browser as an HTML file (single file) on disk.
http://pedrokroger.net/2012/10/using-sphinx-to-write-books/
Now, I'd like to convert it to epub.
pandoc -f html -t epub -S -R -s Using\ Sphinx\ to\ Write\ Technical\ Books\ -\ Pedro\ Kroger.html -o Using\ Sphinx\ to\ Write\ Technical\ Books\ -\ Pedro\ Kroger.epub
But pandoc throws an error:
pandoc: /images/pages/profile.png: openBinaryFile: does not exist (No such file or directory)
Is there an option to tell pandoc to ignore all external references and just convert the bare text?
This would be very convenient!
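One workaround, sketched here on the assumption that simply dropping the images is acceptable, is to strip the <img> tags before the HTML reaches pandoc (page.html and page.epub stand in for the real file names):
sed 's/<img[^>]*>//g' page.html | pandoc -f html -t epub -s -o page.epub
This assumes each <img> tag sits on a single line.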
I'm really new to programming and Linux/Unix, so I was wondering what command I can use to copy just the text of a web page and save it to a file in the current directory. I want to copy the text of something like this:
http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words
Would wget do it? Also, what specific commands get it saved into the directory?
Another option, using wget as you wondered about, would be:
wget -O file.txt "http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words"
The -O option lets you specify which file name you want to save it to.
One option would be:
curl -s http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words > file
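curl can also write straight to a file with its -o flag, if you'd rather avoid the redirect:
curl -s -o file "http://cseweb.ucsd.edu/classes/wi12/cse130-a/pa5/words"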
I found this question, which answers how to perform batch conversions with Pandoc, but it doesn't cover how to make the process recursive. I'll stipulate up front that I'm not a programmer, so I'm seeking some help on this here.
The Pandoc documentation is slim on details regarding passing batches of files to the executable, and based on the script it looks like Pandoc itself is not capable of parsing more than a single file at a time. The script below works just fine in Mac OS X, but only processes the files in the local directory and outputs the results in the same place.
find . -name \*.md -type f -exec pandoc -o {}.txt {} \;
I used the following code to get something of the result I was hoping for:
find . -name \*.html -type f -exec pandoc -o {}.markdown {} \;
This simple script, run using Pandoc installed on Mac OS X 10.7.4, converts all matching files in the directory I run it in to markdown and saves them in the same directory. For example, if I had a file named apps.html, it would convert that file to apps.html.markdown in the same directory as the source files.
While I'm pleased that it makes the conversion, and it's fast, I need it to process all files located in one directory and put the markdown versions in a set of mirrored directories for editing. Ultimately, these directories are in GitHub repositories. One branch is for editing while another branch is for production/publishing. In addition, this simple script retains the original extension and appends the new extension to it. If I convert back again, it will add the HTML extension after the markdown extension, and the file name would just keep growing.
Technically, all I need to do is be able to parse one branch's directory and sync it with the production one; then, when all changed, removed, and new content is verified correct, I can run commits to publish the changes. It looks like the find command can handle all of this, but I just have no clue how to configure it properly, even after reading the Mac OS X and Ubuntu man pages.
Any kind words of wisdom would be deeply appreciated.
TC
Create the following Makefile:
TXTDIR=sources
HTMLS=$(wildcard *.html)
MDS=$(patsubst %.html,$(TXTDIR)/%.markdown, $(HTMLS))
.PHONY : all
all : $(MDS)
$(TXTDIR) :
	mkdir $(TXTDIR)
$(TXTDIR)/%.markdown : %.html $(TXTDIR)
	pandoc -f html -t markdown -s $< -o $@
(Note: The indented lines must begin with a TAB -- this may not come through in the above, since markdown usually strips out tabs.)
Then you just need to type 'make', and it will run pandoc on every file with a .html extension in the working directory, producing a markdown version in 'sources'. An advantage of this method over using 'find' is that it will only run pandoc on a file that has changed since it was last run.
Just for the record: here is how I achieved the conversion of a bunch of HTML files to their Markdown equivalents:
for file in *.html; do pandoc -f html -t markdown "${file}" -o "${file%html}md"; done
If you look at the -o argument, you'll see it uses shell string manipulation ("${file%html}md") to replace the existing html file ending with md.
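For the recursive case from the original question, where the converted copies should land in a separate, mirrored tree rather than next to the sources, a rough sketch along the same lines would be the following; it assumes the output should go under a sources/ directory (borrowing the name from the Makefile answer) and that a plain .markdown extension replaces .html:
find . -name '*.html' -type f -exec sh -c 'out="sources/${1%.html}.markdown"; mkdir -p "$(dirname "$out")"; pandoc -f html -t markdown -s "$1" -o "$out"' _ {} \;
Unlike the Makefile approach, this reconverts every file on each run rather than only the ones that have changed.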