HTML-to-RTF document conversion, preserving classes as styles - html

I need an HTML-to-RTF tool, that is, software that converts HTML to RTF. But not just any conversion: I need the HTML class attributes (e.g. on paragraphs) to be preserved as MS Word "styles".
My first choice was a LibreOffice terminal command, such as
libreoffice --convert-to
because LibreOffice Writer has the largest community and, I assumed, the best conversion. But I was disappointed: it does not preserve class attributes as styles, even when I test it as a user in the graphical interface.
I need a Linux solution (AbiWord did not solve it either) or, as a last resort, a web service that is easy to plug into an intranet Windows server.
Input sample:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sample1 doc</title>
<!-- no style need, but can be declarated with anything, don't matter -->
<style type="text/css">
.myStyle1 {color: #F00;} .myStyle2 {color: #880;}
.a {color: #00F;} .b {color: #088;}
</style>
</head>
<body><!-- important to preserve class names -->
<p class="myStyle1">Hello in <i>style#1</i>.
<span class="a">SPAN S1</span>.</p>
<p class="myStyle2">... Hello in style#2...</p>
<p class="myStyle1">Bye <span class="b">S2</span>.</p>
</body>
</html>
In MS Word this sample is imported and looks fine, with styles where the classes were.
In LibreOffice (and the libreoffice terminal tools) it does not.
So, is there another tool for LibreOffice? Is there a tool for Linux?
PS: as a last resort, if there is nothing for Linux, a web service for Windows and MS Office would do.

Works for me in LibreOffice 4.3.3.2. I just opened the HTML file you provided and I can see styles named Text.Body.myStyle1 and myStyle2.
Hints for Debian Stable and Ubuntu LTS (64-bit): see this How-To. Basic steps:
sudo apt-get remove libreoffice*
wget http://download.documentfoundation.org/libreoffice/stable/4.3.3/deb/x86_64/LibreOffice_4.3.3_Linux_x86-64_deb.tar.gz
tar -xzvf LibreOffice_4.3.3_Linux_x86-64_deb.tar.gz
cd LibreOffice_4.3.3*_Linux_x86-64_deb/DEBS
sudo dpkg -i *.deb
For versions after 4.3.3, you also need to install:
sudo apt-get install libreoffice-writer
Then run the command mentioned in the question:
libreoffice --headless --convert-to rtf libreTeste.html
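The conversion step can also be scripted and its result checked. The following is only a sketch: the helper names build_convert_command, convert and missing_styles are made up for illustration, --outdir is the LibreOffice option that selects the output directory, and the check is a crude substring search of the RTF text, not a real RTF parser.

```python
import subprocess


def build_convert_command(html_path, outdir="."):
    """Build the headless LibreOffice command that converts html_path to RTF."""
    return ["libreoffice", "--headless", "--convert-to", "rtf",
            "--outdir", outdir, html_path]


def convert(html_path, outdir="."):
    """Run the conversion; requires LibreOffice to be installed and on PATH."""
    subprocess.run(build_convert_command(html_path, outdir), check=True)


def missing_styles(rtf_text, class_names):
    """Return the class names that appear nowhere in the RTF text.

    Assumption: a preserved class shows up by name in the RTF stylesheet,
    so a plain substring test is enough for a first check.
    """
    return [name for name in class_names if name not in rtf_text]
```

After calling convert("libreTeste.html"), reading libreTeste.rtf and passing its text to missing_styles with ["myStyle1", "myStyle2"] should return an empty list if the class names survived as style names.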


Bash: grep only the number of the latest database

curl -s http://virusradar.com/en/update/info/latest | grep -oP "(?\<=\<h1\>Update )\[0-9\]+"
When there was a base number in the header, it worked.
Now that it has been removed, how do I change the request?
Any advice is appreciated.
That URL appears to be defunct:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved here.</p>
</body></html>
What do you mean by "base" ?
Please provide the original form of the command, so we could see how it was (incorrectly?) modified.
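For offline testing, the lookbehind pattern can be tried against saved page text. This is a minimal Python sketch, assuming the backslashes in the command above are escaping residue and the intended pattern is (?<=<h1>Update )[0-9]+. The sample strings are invented for illustration and are not real responses from the site.

```python
import re

# Assumed unescaped form of the pattern from the question.
UPDATE_NUMBER = re.compile(r"(?<=<h1>Update )[0-9]+")


def extract_update_number(html):
    """Return the digits that follow '<h1>Update ', or None if there are none."""
    match = UPDATE_NUMBER.search(html)
    return match.group(0) if match else None


# Hypothetical page fragments: one with the number in the header, one without.
page_with_number = "<html><body><h1>Update 12345</h1></body></html>"
page_without_number = "<html><body><h1>Moved Permanently</h1></body></html>"
```

With the redirect page shown above, the function returns None, which matches the symptom: the pattern itself is fine, but the text it looks for is no longer served at that URL.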

Pandoc: Change font family to sans while converting from Markdown to HTML

I can successfully get nicely formatted text that I can paste anywhere using:
cat myFile.md | pandoc -s -f markdown -t html | xclip -selection clipboard -t text/html
xclip is a command line interface to X selections (clipboard). Using ... -t html -o myFile.html works fine too.
I'm trying to change the font family from the default serif to some sans-serif font family. I found a lot of examples for LaTeX, PDF and DOC, but none that work in this scenario. I tried a lot of fonts (listed with fc-list : family, even after installing the texlive-xetex package). The closest answer I could find was this one.
I would like to use only CLI parameters and avoid things like --css source/styles.css.
Using pandoc 1.19.2.4 on Ubuntu 18.04.
Some --variable options I tried:
-V fontfamily:arev
-V fontfamily:Ubuntu
-V fontfamilyoptions:sfdefault
-V "mainfont:DejaVuSans"
-V mainfont="DejaVu Sans Serif"
-V "sansfont:DejaVuSans"
Edit 1:
Based on mb21's answer: since Pandoc 1.12.x (source) it is possible to provide more metadata to Pandoc by adding a YAML block.
On newer Pandoc versions, I also added a title key to avoid the "[WARNING] This document format requires a nonempty <title> element.".
---
title: My File
header-includes: |
  <style>
  body {
    font-family: "Liberation Sans";
  }
  </style>
---
I still don't see the fundamental difference in this aspect between coming from Markdown instead of LaTeX, and going to HTML instead of PDF.
Update: This is possible in pandoc 2.11. For details, see the MANUAL, but for example:
---
mainfont: sans-serif
---
my markdown
If your font name includes spaces, specify the name in quotes escaped with backslashes:
---
mainfont: \"Sanskrit 2020\"
---
Old answer: The font variables you mention are only for LaTeX/PDF output. To style HTML, you need CSS. You can for example put this in your markdown file:
---
header-includes: |
  <style>
  body {
    font-family: sans-serif;
  }
  </style>
---
my markdown
Alternatively you can:
use --css
copy the default styles.html partial to ~/.pandoc/templates/styles.html and modify it. (You can just create the directories if they don't exist.)
use a template like this one...
Also: pandoc 1.19 is ancient, see https://pandoc.org/installing.html
Another solution, based on mb21's, is to put that code in a separate YAML file, e.g. metadata.yaml, and pass it with the --metadata-file option.
I provide the title with --metadata.
cat myFile.md | pandoc -s -f markdown -t html --metadata-file metadata.yaml --metadata title="My File" -o myFile.html
A metadata.yaml content example:
---
header-includes: |
  <style>
  body {
    font-family: "DejaVu Sans";
  }
  </style>
---
AFAIK it is not possible to provide the whole styling through --metadata alone in the same one-liner.
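If the font needs to change often, the metadata file can be generated instead of edited by hand. Below is a small sketch of that idea. The function name build_metadata_yaml is hypothetical, and it assumes the font name contains no double quotes. The point it illustrates is that the lines under header-includes: | must be indented to belong to the YAML block.

```python
def build_metadata_yaml(font_family):
    """Return metadata.yaml text that injects a <style> block via header-includes."""
    css_lines = [
        "<style>",
        "body {",
        f'  font-family: "{font_family}";',
        "}",
        "</style>",
    ]
    # Lines of a YAML literal block scalar (introduced by "|") must be indented.
    indented = "\n".join("  " + line for line in css_lines)
    return f"---\nheader-includes: |\n{indented}\n---\n"


metadata_text = build_metadata_yaml("DejaVu Sans")
```

Writing metadata_text to metadata.yaml gives the same content as the example above, ready to be passed with --metadata-file.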
Another very useful one-liner, which converts the clipboard contents to formatted, rendered text (on Mac use pbpaste):
xsel -b | pandoc -s -f markdown -t html | xclip -selection clipboard -t text/html

brew package with rapache: R code is not interpreted

I have installed rapache and created the r.load and r.conf files in /etc/apache2/mods-available. In r.conf I have placed the following lines:
<Location /R>
SetHandler r-script
RHandler sys.source
</Location>
And similarly for RApacheInfo. So far so good: I have put files with R code inside /R and I can see the output in the browser.
But I had problems with brew. In r.conf, for brew, I have:
<Location /brew>
SetHandler r-script
RHandler brew::brew
</Location>
I installed brew as follows:
su root
R
> install.packages("brew")
> 127
127 selects the mirror, but that only downloaded the package, so I then had to install it with
> install.packages('package.tar.gz', lib="/usr/local/lib/R/site-library", repos = NULL)
Then I put a file containing both HTML and R code inside /brew.
<html>
<head>
<title>R, HTML </title>
</head>
<body>
<h3>Test R and HTML </h3>
<p> rnorm </p>
<%
print(rnorm(100));
%>
</body>
</html>
But only the HTML is rendered; the R code is shown as text.
Why is the R code not being interpreted?
Also, how do I install several libraries so that they all work in /R?

Pandoc HTML variables: `quotes` and `math`

Pandoc's default HTML template contains these two variables:
quotes,
math.
How are they supposed to be used?
More specifically I see that quotes sets the values for the tag <q>. Is this tag used in markdown to HTML conversion?
tl;dr: they seem to be mostly obsolete legacies from previous versions of pandoc
quotes
A little archaeology in the pandoc commits shows that quotes was added when pandoc switched from using <q> tags to inserting quotation marks directly. A new option, --html-q-tags, was added to keep the previous behavior: it wraps quotes in <q> and sets quotes to true, so that a piece of CSS is added, as explained in the HTML template. See this commit to pandoc and this commit to pandoc-templates. The behavior can be seen with the following file:
"hello world"
This:
pandoc test.md -t html --smart --standalone
Produces (skipping the usual head, with no css affecting <q>)
<p>“hello world”</p>
While this
pandoc test.md -t html --standalone --html-q-tags --smart
produces (skipping the usual header)
<style type="text/css">q { quotes: "“" "”" "‘" "’"; }</style>
</head>
<body>
<p><q>hello world</q></p>
</body>
You have to use --smart though.
math
It looks like this was introduced to include math rendering scripts inside the standalone file. See this commit from 2010. I think some command-line options that select a math rendering system other than the current default, like --mathml, set this variable to a value that actually makes sense (such as a copy of the math rendering scripts). Try:
pandoc -t html --mathml
For the quotes variable, see @scoa's answer.
As for the math variable, here is what I found.
When using MathML, that is the option --mathml, the code block:
$if(math)$
$math$
$endif$
in the default HTML conversion template adds a portability script to the HTML output.
Anyway, Chrome and Edge do not currently support MathML and Firefox seems to support it without this script.
So, for a custom template, removing the $if(math)$ ... code block will not affect MathML rendering.
When using MathJax, that is the option --mathjax, $if(math)$ ... adds to the HTML output the script block:
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML-full" type="text/javascript"></script>
This is always necessary to render the maths formulae.
When using --latexmathml, a giant script that converts LaTeX-style math into MathML is inserted by the $if(math)$ ... code block. Without this code block in the conversion template, the script is not inserted and the maths can't be rendered.
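The effect of the conditional block can be imitated with a toy renderer. This is emphatically not pandoc's template engine, only a sketch under simplifying assumptions (a single $if(name)$ ... $endif$ construct, no nesting, no $else$), and the script tag value is a placeholder. It shows why the block emits nothing unless a math option has set the variable.

```python
import re

# Toy patterns: one non-nested $if(name)$ ... $endif$ block and plain $name$ variables.
IF_BLOCK = re.compile(r"\$if\((\w+)\)\$(.*?)\$endif\$", re.DOTALL)
VARIABLE = re.compile(r"\$(\w+)\$")


def render(template, variables):
    """Expand conditional blocks first, then substitute the remaining variables."""
    def expand_if(match):
        name, body = match.group(1), match.group(2)
        # Keep the body only when the variable is set to a non-empty value.
        return body if variables.get(name) else ""

    text = IF_BLOCK.sub(expand_if, template)
    return VARIABLE.sub(lambda m: str(variables.get(m.group(1), "")), text)


TEMPLATE = "<head>\n$if(math)$\n$math$\n$endif$\n</head>"

without_math = render(TEMPLATE, {})
with_math = render(TEMPLATE, {"math": '<script src="placeholder.js"></script>'})
```

Here without_math contains no script tag at all, while with_math contains the placeholder script between the head tags, mirroring what happens when a custom template keeps or drops the block.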

Single and Double Backslash in Path

I have tried everything to get a background image to work but have had no luck.
I'm using the most current versions of Windows and IE.
It works fine server side.
Does anyone have an example?
Note: The img tag in the body renders the image just fine.
I also tried background:url...
<!DOCTYPE html>
<html>
<head>
<style>
html { height:100%; width:100%;
background-image:url("file://C:\Users\Public\Pictures\Sample Pictures\florida-orlando-resort.jpg");
}
</style>
</head>
<body>
123...<img src="C:\Users\Public\Pictures\Sample Pictures\florida-orlando-resort.jpg"
style="width:100px; height:100px; display:cover;">...456
</body>
</html>
1. Use a single forward slash / like C:/Folder/Images/image.jpg (preferred)
2. Escape your backslashes \\ like C:\\Folder\\Images\\image.jpg
Theoretically, you should also escape the backslashes you use in the image's src:
<img src="file://C:\\Folder\\Images\\image.jpg">
(or again use simply a single /).
Due to some accidents of programming history, Windows paths use \. You would normally access your image using: C:\Folder\Images\image.jpg.
Browsers try to normalize this for you, and it appears to work in HTML. In CSS, however, a backslash starts an escape sequence, so (I believe because of the way it is parsed) the "unwise" character \ has to be escaped before the value can be translated into a path Windows understands.
I encourage you to simply forget about \ and use it the way you'd do on a live server:
background-image: url("C:/Folder/Images/image.jpg");
and respectively in HTML
src="C:/Folder/Images/image.jpg"
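The forward-slash form can be produced mechanically from a Windows path. Here is a short sketch using only the Python standard library. The function names are hypothetical, and it assumes an absolute path that starts with a drive letter. Besides swapping the slashes, it percent-encodes characters such as the space in "Sample Pictures".

```python
from urllib.parse import quote


def windows_path_to_file_url(path):
    """Turn an absolute Windows path (drive letter first) into a file: URL."""
    forward = path.replace("\\", "/")
    # Keep "/" and the drive colon as they are; encode spaces and other unsafe characters.
    return "file:///" + quote(forward, safe="/:")


def css_background_rule(path):
    """Build a background-image declaration that contains no backslashes."""
    return f'background-image: url("{windows_path_to_file_url(path)}");'


image = r"C:\Users\Public\Pictures\Sample Pictures\florida-orlando-resort.jpg"
print(windows_path_to_file_url(image))
# file:///C:/Users/Public/Pictures/Sample%20Pictures/florida-orlando-resort.jpg
```

The same URL can be used in both the CSS url(...) value and the img src attribute, so neither needs any escaping.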
An additional note: you should preferably use lowercase folder names.
P.S.: in a file: environment on Windows (NTFS filesystem), an all-lowercase path may match the desired file even when the actual names differ in case, but the same path may not work on a live server. Such a mistake can cause small headaches, so always try to use lowercase.