Converting HTML pages with equations to docx

I am trying to convert an html document to docx using pandoc.
pandoc -s Template.html --mathjax -o Test.docx
During the conversion to docx everything goes smoothly except for the equations.
In the html file the equations look like this:
<div class="jp-Cell jp-MarkdownCell jp-Notebook-cell">
<div class="jp-Cell-inputWrapper">
<div class="jp-Collapser jp-InputCollapser jp-Cell-inputCollapser">
</div>
<div class="jp-InputArea jp-Cell-inputArea"><div class="jp-RenderedHTMLCommon jp-RenderedMarkdown jp-MarkdownOutput " data-mime-type="text/markdown">
\begin{equation}
\log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391}
\end{equation}
</div>
</div>
</div>
</div>
After running the pandoc command the result in the docx document is:
\begin{equation} \log_{10}(\mu)={-2.64}+\frac{4437.038}{T-544.391} \end{equation}
Do you have any idea how I can overcome this issue?
Thanks

A Lua filter can help here. The code below looks for div elements with a data-mime-type="text/markdown" attribute and, somewhat paradoxically, parses their content as LaTeX. The original div is then replaced with the parse result. (Pandoc's HTML reader drops the data- prefix, which is why the filter checks for an attribute named mime-type.)
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    return pandoc.read(stringify(div), 'latex').blocks
  end
end
Save the code to a file parse-math.lua and let pandoc use it with the --lua-filter / -L option:
pandoc --lua-filter parse-math.lua ...
As noted in a comment, this gets slightly more complicated if there are other HTML elements with the text/markdown media type. In that case we'll check if the parse result contains only math, and keep the original content otherwise.
local stringify = pandoc.utils.stringify
function Div (div)
  if div.attributes['mime-type'] == 'text/markdown' then
    local result = pandoc.read(stringify(div), 'latex').blocks
    local first = result[1] and result[1].content or {}
    return (#first == 1 and first[1].t == 'Math')
      and result
      or nil
  end
end
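The "contains only math" test can also be illustrated outside pandoc. The Python sketch below is only a rough approximation of that idea: it uses a regular expression instead of a LaTeX parser, it only knows about the equation environment from the question, and the name is_only_equation is made up for this example.

```python
import re

# Text that consists of one equation environment plus optional whitespace.
# This is a regex approximation, not a LaTeX parser.
EQUATION_ONLY = re.compile(
    r"\s*\\begin\{equation\}(?P<body>.*?)\\end\{equation\}\s*",
    re.DOTALL,
)

def is_only_equation(text):
    """Return True if text holds a single equation environment and nothing else."""
    match = EQUATION_ONLY.fullmatch(text)
    # Reject matches whose body spans more than one environment.
    return match is not None and r"\end{equation}" not in match.group("body")
```

A div whose text passes this check would be a candidate for conversion; anything else would be left alone, which is the same decision the second filter makes.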


Inserting html produced in r function

I want to construct html on the fly and have that html rendered
in Quarto.
The actual application involves inserting an iFrame,
but for simplicity, let's just make an <img> tag.
Here is my .qmd code:
```{r}
source("awash-functions.r")
```
How do you inject html text produced in an r function into a **quarto** document?
In R markdown, I had the function `sprintf` a string. That doesn't seem to work here!
Here is `awash-functions.r`:
imageLink <- function(iUrl, iText) {
sprintf("<img src = '%s' width='24'> %s", iUrl, iText)
}
let's call the function and see what appears:
```{r echo=FALSE}
imageLink("https://www.united.com/8cd8323f017505df6908afc0b72b4925.svg", "united logo")
```
and now, here's what it's supposed to look like:
<img src = 'https://www.united.com/8cd8323f017505df6908afc0b72b4925.svg'> united logo
It renders, and the function clearly gets called,
but it shows the html code, not the image:
I know it's something simple, but I can't find it. Many thanks!
Two things to note:
Firstly, Quarto by default wraps any code chunk output in <pre><code> tags. To get the output as is, you need to use the chunk option results: asis.
Secondly, the string returned by sprintf is auto-printed the way print would show it, with a [1] prefix and enclosed in quotes. So after using results: asis, you would get the HTML tags but also the quotes. You therefore need to wrap the sprintf call in cat to get the intended result.
---
format: html
---
```{r}
#| echo: false
imageLink <- function(iUrl, iText) {
cat(sprintf("<img src = '%s'> %s", iUrl, iText))
}
```
```{r}
#| echo: false
#| results: asis
imageLink("https://www.united.com/8cd8323f017505df6908afc0b72b4925.svg", "united logo")
```
and now, here's what it's supposed to look like:
<img src = 'https://www.united.com/8cd8323f017505df6908afc0b72b4925.svg'> united logo

Formatting divs using Pandoc

I am using Pandoc to convert Pandoc Markdown documents to HTML5 documents. In my md input, I write custom divs using Pandoc's fenced div syntax, for example:
::: Resources
A nice document
Informative website
:::
The resulting HTML is this:
<div class="Resources">
<p>A nice document Informative website</p>
</div>
I would like the output to be something like this instead:
<div class="Resources">
<div>A nice document</div>
<div>Informative website</div>
</div>
I.e., I want the two resources to be in two different containers. I did not find any solution for this (pandoc filters may be able to do it, but I don't quite understand how to write them).
Thank you very much for any kind of help. Cheers.
If the main goal is to have separate Resource blocks, I'd suggest using a list inside the div:
::: Resources
- A nice document
- Informative website
:::
This will give
<div class="Resources">
<ul>
<li>A nice document</li>
<li>Informative website</li>
</ul>
</div>
It's not what you want yet, but it gets us halfway there. It already marks all resources as separate blocks. This simplifies our task of refining the document structure further through filtering. The following uses pandoc's Lua filter functionality; put the code into a file and pass it to pandoc via the --lua-filter command line parameter.
local list_to_resources = {
  BulletList = function (el)
    local resources = {}
    local resource_attr = pandoc.Attr('', {'Resource'}, {})
    for i, item in ipairs(el.content) do
      resources[i] = pandoc.Div(item, resource_attr)
    end
    return resources
  end
}
function Div (el)
  -- return div unaltered unless it is of class "Resources"
  if not el.classes:includes'Resources' then
    return nil
  end
  return pandoc.walk_block(el, list_to_resources)
end
Calling pandoc with this filter will produce your desired output:
<div class="Resources">
<div class="Resource">
A nice document
</div>
<div class="Resource">
Informative website
</div>
</div>
For the sake of completeness, I'll also add a solution to the question when taking it literally. However, I do not recommend using it for various reasons:
It is far less "markdowny". Using only linebreaks to separate items is uncommon in Markdown and goes against its philosophy of having readable text without surprises.
The necessary code is more complex and fragile.
You won't be able to add additional information to the Resources div, as it will always be mangled by the filter. With the previous solution, only bullet lists have a special meaning.
That being said, here's the code:
-- table to collect elements in a line
local elements_in_line = {}
-- produce a span from the collected elements
local function line_as_span()
  local span = pandoc.Span(elements_in_line)
  elements_in_line = {}
  return span
end
local lines_to_blocks = {
  Inline = function (el)
    -- a SoftBreak ends the current line
    if el.t == 'SoftBreak' then
      return line_as_span()
    end
    table.insert(elements_in_line, el)
    return {}
  end,
  Para = function (el)
    local resources = {}
    local content = el.content
    -- last line is not followed by SoftBreak, add it here
    table.insert(content, line_as_span())
    local attr = pandoc.Attr('', {'Resource'})
    for i, line in ipairs(content) do
      resources[i] = pandoc.Div(pandoc.Plain(line.content), attr)
    end
    return resources
  end
}
function Div (el)
  -- return div unaltered unless it is of class "Resources"
  if not el.classes:includes'Resources' then
    return nil
  end
  return pandoc.walk_block(el, lines_to_blocks)
end
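The core of this second filter is grouping a paragraph's inline elements into lines, with each SoftBreak starting a new line. Here is a minimal Python sketch of just that grouping step. It works on a simplified list of strings instead of real pandoc elements, and the name split_lines is invented for the illustration.

```python
def split_lines(inlines, break_marker="SoftBreak"):
    """Group a flat list of inline tokens into lines, splitting at each break marker."""
    lines = [[]]
    for token in inlines:
        if token == break_marker:
            lines.append([])        # a break starts a new line
        else:
            lines[-1].append(token)
    return lines

# Simplified token stream for the two-line paragraph from the question.
tokens = ["A", "nice", "document", "SoftBreak", "Informative", "website"]
resources = [" ".join(line) for line in split_lines(tokens)]
print(resources)
```

Each resulting group corresponds to one Resource div in the filter's output.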

How can I display a TemporaryUploadedFile from Django in HTML as an image?

In Django, I have programmed a form in which you can upload one image. After uploading the image, the image is passed to another method with the type TemporaryUploadedFile, after executing the method it is given to the HTML page.
What I would like to do is display that TemporaryUploadedFile as an image in HTML. It sounds quite simple to me but I could not find the answer on StackOverflow or on Google to the question: How to display a TemporaryUploadedFile in HTML without having to save it first, hence my question.
All help is appreciated.
Edit 1:
To give some more information about the code and the variables while debugging.
input_image = next(iter(request.FILES.values()))
output_b64 = (input_image.content_type, str(base64.b64encode(input_image.read()), 'utf8'))
Well, you can encode the image to base64 and use a data url as the value for src.
A base64 data url looks like this:
<img src="data:image/png;base64,iVBORw0KGgo...">
               \_______/        \____________/
                   |                  |
               File type     base64 encoded data
Read the Mozilla docs for more on data urls.
Here's some relevant code:
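import base64
from django.shortcuts import render

def my_view(request):
    # take the first uploaded file; `image` is a <TemporaryUploadedFile object>
    image = next(iter(request.FILES.values()))
    image_b64 = base64.b64encode(image.read())
    image_b64 = image_b64.decode('utf8')  # convert bytes to string
    image_type = image.content_type  # MIME type, e.g. image/png or image/jpeg
    # 'template.html' is a placeholder template name
    return render(request, 'template.html', {'image_b64': image_b64, 'image_type': image_type})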
Then in your template:
<img src="data:{{ image_type }};base64,{{ image_b64 }}">
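The string assembled in the template can also be built in one place in Python. The following is a small sketch using only the standard library; the helper name to_data_url is made up here and is not part of Django.

```python
import base64

def to_data_url(content_type, raw):
    """Build a base64 data URL from a MIME type and raw bytes."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{content_type};base64,{encoded}"

# With an upload this would be to_data_url(image.content_type, image.read()).
print(to_data_url("image/png", b"\x89PNG"))
```

The view could then pass this single string to the template and use it directly as the value of src.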
I want to thank xyres for pushing me in the right direction. As you can see, I used some parts of his solution in the code below:
# Imports assumed by this snippet
import base64
from io import BytesIO
from PIL import Image as pil_image

# As input I take one image from the form.
temp_uploaded_file = next(iter(request.FILES.values()))
# The TemporaryUploadedFile is converted to a Pillow Image
input_image = pil_image.open(temp_uploaded_file)
# The input image does not have a name so I set it afterwards. (This step, of course, is not mandatory)
input_image.filename = temp_uploaded_file.name
# The image is saved to an InMemoryFile
output = BytesIO()
input_image.save(output, format=input_image.format)
# Then the InMemoryFile is encoded
img_data = str(base64.b64encode(output.getvalue()), 'utf8')
output_b64 = ('image/' + input_image.format.lower(), img_data)
# Pass it to the template
return render(request, 'visualsearch/similarity_output.html', {
    "output_image": output_b64
})
In the template:
<img id="output_image" src="data:{{ output_image.0 }};base64,{{ output_image.1 }}">
The current solution works but I don't think it is perfect because I expect that it can be done with less code and faster, so if you know how this can be done better you are welcome to post your answer here.

Creating HTML links from images in :colons: with Ruby

I have a simple HTML document:
<div should-not-be-replaced=":smile:">
Hello :smile:!
</div>
How would I replace the :smile: text with <img src="smile.png">, but keeping the first :smile: unchanged, to get this:
<div should-not-be-replaced=":smile:">
Hello <img src="smile.png">!
</div>
I tried this, but Nokogiri escapes my HTML as plain text:
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
next unless x.text?
x.content = x.text.gsub(':smile:', '<img src="smile.png">')
end
My solution is very similar to Ku's, although I've tried to handle situations where the replaced text could be in the source text multiple times by completely replacing the content text with an HTML document fragment:
doc = Nokogiri::HTML::DocumentFragment.parse(DATA.read)
doc.traverse do |x|
  next unless x.text?
  if x.text.match(%r{:(\w+):})
    # block form, so that $1 refers to each match as it is replaced
    replace_text = x.text.gsub(%r{:(\w+):}) { "<img src='#{$1}.png'>" }
    x.content = ""
    x.add_next_sibling replace_text
  end
end
I think this might be what you want. It also deals with any string between two colons, like :something:, and produces "something.png" for it. Note that it handles only the first match in each text node and places the image after the remaining text.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.traverse do |x|
  if x.text? && x.content =~ /:\w+:/
    x.content = x.content.sub(/:(\w+):/, '')
    img = Nokogiri::HTML::DocumentFragment.parse('<img src="'+$1+'.png">')
    x.add_next_sibling(img)
  end
end
You are making it much too hard, and using traverse, which is slow because it forces Nokogiri to walk through every node in the document. In a large page that is costly.
Instead take advantage of selectors to find the specific node(s) you want:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div parm=":smile:">
Hello :smile:!
</div>
EOT
div = doc.at('div[parm=":smile:"]')
div.inner_html = div.text.sub(/:smile:/, '<img src="smile.png">')
puts doc.to_html
Running that results in:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div parm=":smile:">
Hello <img src="smile.png">!
</div>
</body></html>
I'm using at, which finds the first occurrence. If you need to process more than one then use search. search returns a NodeSet, which is like an array, so you'll want to iterate over it. There are innumerable examples of doing so on Stack Overflow and elsewhere.
Do you mean it returns &lt; or &gt;?
I recommend passing the output through the CGI.unescape_html method.
Try:
require 'cgi'
CGI::unescape_html(doc.to_s)
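The underlying idea in these answers, rewriting text nodes while leaving attribute values alone, can be sketched in Python with the standard library's html.parser. This is an illustration under assumptions rather than a port of any of the Ruby answers: the class and function names are invented, and the parser re-emits markup as it reads it without validating it.

```python
import re
from html.parser import HTMLParser

EMOJI = re.compile(r":(\w+):")

class EmojiRewriter(HTMLParser):
    """Re-emit HTML unchanged, except that :name: in text becomes an <img> tag."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # raw tag text, so attribute values are never rewritten
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # only text between tags reaches this method
        self.out.append(EMOJI.sub(r'<img src="\1.png">', data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

def replace_emoji(html):
    parser = EmojiRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling replace_emoji on the document from the question leaves the should-not-be-replaced attribute as it is and turns the :smile: in the text into an image tag.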

Splitting HTML file using AWK

I was wondering if it's possible to split an HTML file into separate .html files using awk? I'd like to look for the pattern:
<div class="post">
I'd like a new file to be created for each instance of this pattern. I've tried to put together a command but can't get it working. My file is called working.html, and this is the command I've constructed:
awk '/<div class="post">/{x="F"++i;}{print > x;}' working.html
Any ideas?
It looks like it's bombing out because x is not initialized and can't be used as a filename until it is first set on a <div> line.
One way to fix that is to add a BEGIN pattern to initialize it.
BEGIN {
x = "F0"
}
/<div class="post">/ {
x = "F" ++i
}
{ print > x }
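For comparison, here is the same splitting logic as a short Python sketch. It groups lines in memory rather than writing files, so it is easy to check; the function name split_posts and its parameters are made up for this example.

```python
def split_posts(lines, marker='<div class="post">', prefix="F"):
    """Group lines into chunks keyed by output name, like the awk script."""
    chunks = {}
    count = 0
    name = f"{prefix}0"             # default target, mirrors the BEGIN block
    for line in lines:
        if marker in line:          # a marker line starts the next chunk
            count += 1
            name = f"{prefix}{count}"
        chunks.setdefault(name, []).append(line)
    return chunks
```

Everything before the first marker ends up under F0, just as in the awk version, and each chunk could then be written to a file of the same name.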