R knitr Add linebreak in table header kable() - html

I am using knitr to generate some reports. I use kable to generate an HTML table in the document. In the headers I want to use line breaks (or other HTML tags) to improve the table:
<!--begin.rcode results='asis'
s <- rbind(c(1,2,3,4),c(1,2,3,4),c(1,2,3,4))
kable(s, col.names=c("Try Newline\nn","Try HTML break<br>%","Past 6 months\nn","\n%"))
end.rcode-->
As you can see I am trying different options without much success.
In my result, line breaks (\n) are just carried over as line breaks in the HTML source, and HTML tags are escaped into HTML special characters.
Any suggestions?

As far as I know, the pipe table syntax does not support line breaks in the cells, so if using pandoc to convert markdown to HTML (this is what RStudio uses), then you'd better choose some more feature-rich table syntax, e.g. multiline or grid. Not sure how to do that with kable, but pander supports those:
> library(pander)
> colnames(s) <- c("Try Newline\nn","Try HTML break<br>%","Past 6 months\nn","\n%")
> pander(s, keep.line.breaks = TRUE)
-------------------------------------------------------
 Try Newline   Try HTML break<br>%   Past 6 months   %
      n                                    n
------------- --------------------- --------------- ---
      1                 2                  3         4
      1                 2                  3         4
      1                 2                  3         4
-------------------------------------------------------
But this is not enough, as line breaks are automatically removed by pandoc, so you have to put hard line-breaks ("a backslash followed by a newline") there based on the related docs. E.g. the following code converts to HTML as expected:
> colnames(s) <- c("Try Newline\\\nn","Try HTML break\\\n%","Past 6 months\\\nn","\\\n%")
> pander(s, keep.line.breaks = TRUE)
-----------------------------------------------------
 Try Newline\   Try HTML break\   Past 6 months\   \
      n                %                n          %
-------------- ----------------- ---------------- ---
      1                2                3          4
      1                2                3          4
      1                2                3          4
-----------------------------------------------------

There is a way to limit column width, which you could use to help achieve this with kable: column_spec() from the kableExtra package lets you specify which columns to target and their width in units such as cm, in, or em.

So it appears that kable converts < and > to their HTML entities, i.e. "&lt;" and "&gt;", so I have a quick fix that will work as long as you don't actually need literal < or > anywhere else. This has allowed me to get a line break in the column headings in my table.
Essentially, once your table is complete, just replace the "&lt;" and "&gt;" in the HTML with < and >, and then save it as an HTML file. Like so:
tbl_output <- gsub("&lt;", "<", tbl_output)
tbl_output <- gsub("&gt;", ">", tbl_output)
write(tbl_output, "TableOutput.html")
where tbl_output is the output from kable.
Alternatively, and in particular if you need to use <> elsewhere in your table, you could create your own string for a newline and then gsub it for <br> at the end.
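
The placeholder idea can be sketched outside R as well. The following Python snippet only illustrates the technique and says nothing about kable's internals: `NEWLINE_TOKEN` and `render_header` are hypothetical names, and `html.escape` stands in for whatever escaping the table generator applies.

```python
import html

# Hypothetical marker; pick any string that cannot occur in real header text.
NEWLINE_TOKEN = "[[BR]]"

def render_header(text):
    """Escape a header cell, then turn the placeholder into a real <br>."""
    escaped = html.escape(text)  # '<' and '>' become &lt; and &gt;
    return escaped.replace(NEWLINE_TOKEN, "<br>")

headers = ["Past 6 months[[BR]]n", "a < b[[BR]]%"]
cells = [render_header(h) for h in headers]
```

Because the swap happens after escaping, a literal < elsewhere in the header stays escaped and only the marker becomes a tag.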

Related

Get 2 separate xpath values from one span with a line break

I've got my HTML which looks like this:
<span>
Word 1
Sentence 1
</span>
I can extract it with:
//span/text()
which gives me
Word 1
Sentence 1
Is it possible in XPath to extract Word 1 and Sentence 1 separately?
(XPath extractor in Python for Scrapy)
I've tried:
//span/text()[1]
//span/text()[2]
substring-before(//span/text(),'\n')
but these were wild guesses and did not work.
You can get the first item "Word 1" with
normalize-space(substring-before(substring-after(translate(span/text(),'&#13;',''),'&#10;'),'&#10;'))
and get the second item "Sentence 1" with
normalize-space(substring-after(substring-after(translate(span/text(),'&#13;',''),'&#10;'),'&#10;'))
You can remove the normalize-space(...) if you don't need it.
The context node should be the parent of span, otherwise you should prefix the expression with //. Your main problem has been that there was a line feed (\n) before the first item.
EDIT:
I added a solution for handling the CR char for Windows' CRLF. It simply removes the CR char and acts on the LF char.
See a previous question to understand how to properly access the inner content of the element.
Then, process the output string to fit your needs.
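
The "process the output string" step can be sketched in Python roughly as follows. This is a minimal illustration that uses the standard library's `xml.etree.ElementTree` in place of Scrapy's selectors (an assumption made to keep it self-contained); with Scrapy, the string returned by the `//span/text()` query would take the place of `span.text`.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, with Windows-style CRLF line endings.
snippet = "<span>\r\nWord 1\r\nSentence 1\r\n</span>"

span = ET.fromstring(snippet)

# splitlines() handles both LF and CRLF; blank lines are dropped and
# surrounding whitespace is stripped from each remaining line.
lines = [line.strip() for line in span.text.splitlines() if line.strip()]

word, sentence = lines
```

Splitting in the host language avoids having to encode CR and LF characters inside the XPath expression itself.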

Issues with parsing HTML with ragel

In my project I need to extract links from an HTML document.
For this purpose I've prepared ragel HTML grammar, primarily based on this work:
https://github.com/brianpane/jitify-core/blob/master/src/core/jitify_html_lexer.rl
(mentioned here: http://ragel-users.complang.narkive.com/qhjr33zj/ragel-grammars-for-html-css-and-javascript )
Almost everything works well (thanks for the great tool!), except one issue I have not been able to overcome so far:
If I specify this text as an input:
bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">
my parser correctly extracts the first link, but not the second one.
The difference between them is that there is a space between 'bbbb' and '<a', but no space between 'cccc' and '<a'.
In general, if any text other than spaces immediately precedes the '<a' tag, the parser treats the tag as content and does not recognize the tag opening.
This repo, https://github.com/amdei/ragel_html_sample , contains an intentionally simplified sample with the grammar, meant to be built as a C program ( ngx_url_html_portion.rl ).
There is also an input file, input-nbsp.html , which is expected to contain the input for the application.
In order to play with it, make .c-file from grammar:
ragel ngx_url_html_portion.rl
then compile the resulting .c file and run the program.
Input file should be in the same directory.
Will be sincerely grateful for any clue.
The issue with the defined FSM is that 'content' includes every character up to the next space, so a '<' that directly follows text is swallowed as content. You should exclude the HTML tag-opening character '<' from the rule. Here is the diff for illustration:
$ git diff
diff --git a/ngx_url_html_portion.rl b/ngx_url_html_portion.rl
index ccef0ca..1f8dcf0 100644
--- a/ngx_url_html_portion.rl
+++ b/ngx_url_html_portion.rl
@@ -145,7 +145,7 @@ void copy2hrefbuf(par_t* par, u_char* p){
);
content = (
- any - (space )
+ any - (space ) - '<'
)+;
html_space = (
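
The effect of that one-character change can be illustrated with a rough analogy in Python's `re` module. This is not Ragel and not the grammar from the repository: `extract_links` and its token pattern are hypothetical, and only mimic the idea that a 'content' rule which accepts '<' swallows the tag opening.

```python
import re

def extract_links(text, content_class):
    """Tokenize text as <a href="..."> tags, content runs, or whitespace."""
    token = re.compile(
        r'<a\s+href="(?P<href>[^"]*)"\s*>'          # an opening <a> tag
        r"|(?P<content>" + content_class + r"+)"    # a run of content characters
        r"|\s+"                                     # whitespace between tokens
    )
    links, pos = [], 0
    while pos < len(text):
        m = token.match(text, pos)
        if m is None:          # unrecognized character: skip it
            pos += 1
            continue
        if m.group("href") is not None:
            links.append(m.group("href"))
        pos = m.end()
    return links

sample = 'bbbb <a href="first_link.aspx"> cccc<a href="/second_link.aspx">'

before_fix = extract_links(sample, r"[^\s]")    # like: any - space
after_fix = extract_links(sample, r"[^\s<]")    # like: any - space - '<'
```

With the first character class, 'cccc<a' is consumed as a single content run, so the second tag is never seen; excluding '<' ends the run at 'cccc' and lets the tag rule match.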

Displaying an EBNF grammar using HTML and CSS

What would be a good way to display an EBNF (or EBNF-like) grammar using HTML and CSS? I do not want to use the code tag. I want to be able to display things aligned (viz. definition symbols ‘=’ with alternation symbols ‘|’, but also remarks).
Here’s an example of a grammar I want to display, only as a simple plain text:
<expression> = <integral>
             | <variable>   (only variables of type Integer are allowed)
<integral>   = <digit>+
<variable>   = x | y | …    (any lower case latin letter)
…
The spacing within definitions like ‘<integral> = <digit>+’ should be “as usual”, i.e. not compromised by the way the definition symbols are aligned (as would be the case when using tables, for example).

Find and replace Heading tags with regex in Notepad++

There's an OCR-scanned book and a tool that converts the OCR'd PDF to XML, but most of the XML tags are wrong, so there's another tool to fix them. I need to put each heading (<h1> to <h5>) and each numbered item (1., 1.1., 1.1.1.) on its own line so it's easy to re-tag using the tool.
The XML code looks like this:
<h1>text</h1><h2>text</h2><h3>text</h3>
and
1.text.2.text.3.text.1.1.text.1.1.1.text
And I need to break the lines like this using a Regex in notepad++.
<h1>text</h1>
<h2>text</h2>
<h3>text</h3>
and
1.text.
2.text.
3.text.
and
1.1.text.
1.1.1.text.
I searched for </h1>\s* and replaced it with </h1>\n, but that only breaks after h1 tags. I need to break at all "h" tags and at the 1., 2., 1.1., 1.1.1. numbers too.
At the risk of getting downvoted, I think you may be better served by a parser. In the past, when I've had to manage similar tasks, I would write a small script/program to parse the file and rewrite it as needed. Parsing the XML first and then reformatting with a regex might be an easier way to accomplish your goal.
You can use this search and replace (if your h1, h2, ... tags don't contain other tags):
search: (?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)
replace: \n$1
note: if you need Windows newlines, replace \n with \r\n.
pattern details:
(?<!^)                   # not preceded by the beginning of the string
(                        # open capture group 1
  <h[1-6][^<]*           # <h, a digit from 1 to 6, then all characters up to
                         # the next < (to skip the content between
                         # h1, h2, ... tags)
  |                      # OR
  (?<![0-9]\.)[0-9]+\.   # one or more digits and a dot, not preceded by a
                         # digit and a dot
)                        # close capture group 1
$1 is a reference to the content of capture group 1.
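
If you go the script route instead of Notepad++, the same pattern can be tried in Python. This is a minimal sketch under the same assumption as above (heading tags contain no nested tags); `break_lines` is a hypothetical helper name, and Python's `re` writes the backreference as `\1` instead of `$1`.

```python
import re

# Same pattern as in the Notepad++ search box.
BREAK_BEFORE = re.compile(r"(?<!^)(<h[1-6][^<]*|(?<![0-9]\.)[0-9]+\.)")

def break_lines(text):
    """Insert a newline before each heading tag or top-level number."""
    return BREAK_BEFORE.sub(r"\n\1", text)

tags = break_lines("<h1>text</h1><h2>text</h2><h3>text</h3>")
numbers = break_lines("1.text.2.text.3.text.1.1.text.1.1.1.text")

print(tags)
print(numbers)
```

The second lookbehind is what keeps 1.1. and 1.1.1. together: a number that directly follows "digit dot" is treated as a continuation, not as the start of a new line.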

Advanced HTML multiline formatting - removing not need spaces from new lines

The question is very simple, but I have not found a solution yet. Perhaps it is not possible, or it is hard to search for because it is so trivial.
How can I avoid spaces being added to formatted HTML after a newline, especially in a list of values?
Here is the first example:
1, 2
It produces required HTML like this:
1, 2
Now another example, which does not work:
1
,
2
It renders with an unwanted space before the comma:
1 , 2 (required: 1, 2)
How can I get the same result as in the first example while laying the text out over several lines? I know it could be done on one line, but I want to use several lines to simplify the program code (not the HTML).
It works as defined: in normal content, a newline is equivalent to a space. There is no way to change this principle in HTML. Just divide your content into lines so that the principle works for you, not against you. That is, break a line only at a point where a space is OK.
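
The rule can be checked with a rough model in Python. This is only a sketch: `collapse_whitespace` is a hypothetical helper that approximates how normal HTML flow renders any run of whitespace (including newlines) as a single space, and it ignores cases such as `pre` elements or the CSS `white-space` property.

```python
import re

def collapse_whitespace(source):
    """Rough model of normal HTML flow: a whitespace run renders as one space."""
    return re.sub(r"\s+", " ", source).strip()

# Newline placed before the comma: it turns into a visible space.
bad = collapse_whitespace("1\n,\n2")

# Newline placed only where a space is acceptable, i.e. after the comma.
good = collapse_whitespace("1,\n2")

print(repr(bad))
print(repr(good))
```

So the multi-line layout can be kept as long as each line break falls after the comma and not before it.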