How To Display Text in Bokeh While Preserving Line Gaps - html

I am trying to generate a report consisting of plots and some text using Bokeh 2.0.1, and I am trying to preserve the original line gaps in the text that I feed into bokeh.layouts.gridplot.
However, I see that all the line gaps get removed after the HTML file is generated, and essentially, all paragraphs get merged into one continuous sentence. I am not sure why that is happening. I didn't find much help in the Bokeh documentation regarding this issue.
Here's a minimal example of the section of the code I am trying to get to work.
from bokeh.layouts import gridplot
from bokeh.models import Paragraph
from bokeh.plotting import output_file, save
sampText = ("PARAGRAPH 1: This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text.\n"
            "PARAGRAPH 2: This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text. This is a minimal example of a text.")
fig1 = ...
fig2 = ...
text = Paragraph(text = sampText, width = 1500, height_policy = "auto", style = {'font-size': '10pt', 'color': 'black', 'font-family': 'arial'})
output_file(filename = "myMinimalExampleOutput" + ".html")
output = gridplot([fig1, fig2, text], ncols = 1, plot_width = 1500, sizing_mode = 'scale_width')
save(output)
Could someone please point out why this is happening?

This is how HTML works: when rendering text, it collapses runs of whitespace, including line breaks, into a single space.
For your two paragraphs to work, you have to wrap each of them in <p></p>. But since Paragraph generates the same tag, you can't use that class. Use the regular Div and set its text input parameter to the HTML that you want to be rendered.
from bokeh.models import Div
Div(text='<p>paragraph 1</p><p>paragraph 2</p>')
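If the text arrives as a plain string with newlines, the conversion to paragraph markup can be automated. The helper below is only a sketch: text_to_paragraph_html is a made-up name, not part of Bokeh, and it assumes that every non-empty line is one paragraph.

```python
from html import escape


def text_to_paragraph_html(text):
    """Wrap each non-empty line in <p>...</p> (hypothetical helper, not a Bokeh API)."""
    # Split on newlines and drop blank lines, so "\n" and "\n\n" both separate paragraphs.
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    # Escape the text so characters such as "<" are not interpreted as markup.
    return "".join(f"<p>{escape(p)}</p>" for p in paragraphs)


samp_text = "PARAGRAPH 1: first block of text.\n\nPARAGRAPH 2: second block of text."
paragraph_html = text_to_paragraph_html(samp_text)
print(paragraph_html)
# The resulting string can then be passed to Div, e.g. Div(text=paragraph_html, width=1500)
```

Escaping is a design choice here: drop the escape() call if the input is already meant to contain HTML.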

Related

Rotate strip text with ggtext

I'm trying to make a plot with a two layer strip. I want the first layer of strips to have a horizontal text orientation and the second layer to have a vertical text orientation.
In the example below, I want the strip layers that say 'horizontal' to be horizontal and I want '1999' and '2008' to remain vertical.
library(ggplot2)
library(ggtext)
library(glue)
df <- mpg
df$outer <- "horizontal"
p <- ggplot(df, aes(displ, cty)) +
geom_point() +
theme(
strip.text.y.left = element_markdown()
)
p + facet_grid(
outer + year ~ .,
switch = "y"
)
The ggtext package is great, because it allows us to use ggtext::element_markdown() to conditionally format layers of a strip with html tags, such as in the example below:
p + facet_grid(
glue("<span style = 'color:red'>{outer}</span>") + year ~ .,
switch = "y"
)
Created on 2021-07-11 by the reprex package (v1.0.0)
Instead of applying a red color, is there an (HTML) tag I could use to make the text orientation horizontal? I'm not very fluent in HTML. After googling some options, I've tried the following spans with no success:
"<span style = 'transform:rotate(90deg)'>"
"<span style = 'text-orientation:sideways'>"
As a side-note: I know that I can edit the gtable of a plot to manually make edits to labels and whatnot. That is exactly what I'm trying not to do!
In addition to a solution to my problem, there are two other ways I'd consider my question answered.
A link to some documentation that says it is not (yet) possible to do this with ggtext. If this is the case, please post it as an answer with a small description so I can accept it. A post by ggtext's creator Claus O. Wilke commenting on this is also fine.*
A code example where an attempt to use canonical HTML tags (besides the two I already tried) fails to rotate the text. I'd then know that someone with more knowledge than me about HTML tried and my question has no apparent solution.
* I'm aware of the paragraph in ggtext's readme that reads the following:
As a general rule, any Markdown, HTML, or CSS feature that isn’t shown in any of the ggtext or gridtext documentation likely doesn’t exist.
I'm fishing for a more explicit statement that says text cannot be rotated with tags.

Python - Beautifulsoup, differentiate parsed text inside of an html element by using internal tags

So, I'm working on an HTML parser to extract some text data from a list of <div> elements and format it before giving an output. I have a title that I need to set as bold, and a description which I'll leave as it is. I've found myself stuck when I reached this situation:
<div class ="Content">
<Strong>Title:</strong>
description
</div>
As you can see the strings are actually already formatted but I can't seem to find a way to get the tags and the text out together.
What my script does kinda looks like:
article = ""  # all the formatted text is accumulated here as one long string before output
temp1 = ""
temp2 = ""
result = soup.findAll("div", {"class": "Content"})
if result is not None:
    x = 0
    for i in result.find("strong"):  # AttributeError: result is a ResultSet, not a single tag
        if x == 0:
            temp1 = "<strong>" + i.text + "</strong>"
            article += temp1
            x = 1
        else:
            temp2 = i.nextSibling  # I know this is wrong
            article += temp2
            x = 0
print(article)
It throws an AttributeError, but the message is misleading, since it says "Did you call find_all() when you meant to call find()?".
I also know I can't just use .nextSibling like that, and I'm literally losing it over something that looks so simple to solve...
What I need to get is: "Title: description"
Thanks in advance for any response.
I'm sorry if I couldn't explain well what I'm trying to accomplish, but it is somewhat involved: I need the data to generate a POST request to a CKEditor session so that it adds the text to the HTML page, and the text has to be formatted in a certain way before uploading. In this case I need to get the element inside the <strong> tags and format it, then do the same with the description and print them one after the other. For example, a request could look like:
http://server/upload.php?desc=<ul>%0D%0A%09<li><strong>Title%26nbsp%3B<%2strong>description<%2li><%2ul>
So that the result is:
Title1: description
So what I need to do is differentiate between the element inside the tag and the one outside of it, using the tag itself as a reference.
EDIT
To select the <strong> use:
soup.select_one('div.Content strong')
and then to select its nextSibling:
strong.nextSibling
You may need to strip it to get rid of surrounding whitespace:
strong.nextSibling.strip()
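Put together, the steps above could look like the following sketch. It assumes the markup from the question and that the description is always the text node directly after the <strong> tag; format_content is a hypothetical helper name, not something from the question's script.

```python
from bs4 import BeautifulSoup

html = '''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''


def format_content(markup):
    """Return '<strong>title</strong> description' for every matching div (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    parts = []
    # html.parser lowercases tag names, so <Strong> is matched by "strong".
    for strong in soup.select("div.Content strong"):
        title = strong.get_text(strip=True)
        sibling = strong.next_sibling
        # The sibling is a text node (a str subclass) when plain text follows the tag.
        description = sibling.strip() if isinstance(sibling, str) else ""
        parts.append(f"<strong>{title}</strong> {description}")
    return "".join(parts)


article = format_content(html)
print(article)
```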
Just in case
You can use ANSI escape sequences to print something in bold, but I am not sure why you would do that. That is something that should be clarified in your question.
Example
from bs4 import BeautifulSoup
html='''
<div class ="Content">
<Strong>Title:</strong>
description
</div>
'''
soup = BeautifulSoup(html,'html.parser')
text = soup.find('div', {'class': 'Content'}).get_text(strip=True).split(':')
print('\033[1m'+text[0]+': \033[0m'+ text[1])
Output
Title: description
You may want to use htql for this. Example:
text="""<div class ="Content">
<Strong>Title:</strong>
description
</div>"""
import htql
ret = htql.query(text, "<div>{ col1=<strong>:tx; col2=<strong>:xx &trim }")
# ret=[('Title:', 'description')]

I need to pull all paragraphs from any website

I need to take any random website and pull all chunks of text from the website.
I am calling this "paragraph disambiguation" (see "sentence disambiguation" in Wikipedia).
I don't care if these chunks themselves contain other HTML tags, as I can get rid of these after I extract the paragraph text.
I also need to distinguish between the paragraphs, as in: this is paragraph 1, this is paragraph 2, and so on.
I am aware that most paragraphs would typically be contained in a <p> tag. But this is not always the case. Text can also be contained in the following:
<div>
<span>
<td>
<li>
Are there any other HTML elements that might contain a block of text?
Is there any other methodology of extracting text blocks from a random webpage, like looking for "white words" and then finding their boundaries?
Thanks in advance
Jeff
Nearly all HTML elements may contain text:
p
dt
dd
td
th
And many more I can't recall at the moment. Take a look at the Complete list of HTML tags and see which is suitable to contain text, and which is not.
Use Python's Beautiful Soup and call .get_text() on the body element. This will give you all the text in the page.
From Documentation on get_text():
>>> from bs4 import BeautifulSoup
>>> markup = '\nI linked to <i>example.com</i>\n'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
'\nI linked to example.com\n'
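Calling get_text() on the whole body returns one string, so the paragraph boundaries are lost. To keep the blocks separate and numbered, one option is to collect the text per element instead. This is a rough sketch: BLOCK_TAGS is an assumed list of text-bearing tags and extract_blocks is a hypothetical helper name. Nested matches (for example a <p> inside a <td>) would be reported twice.

```python
from bs4 import BeautifulSoup

# Assumed set of text-bearing tags; extend it for the pages you actually encounter.
BLOCK_TAGS = ["p", "li", "td", "th", "dt", "dd"]


def extract_blocks(markup):
    """Return the text of each block-level element as a separate string (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    blocks = []
    for element in soup.find_all(BLOCK_TAGS):
        # Join the pieces of text inside the element with single spaces,
        # which drops inline tags such as <b> or <a>.
        text = element.get_text(" ", strip=True)
        if text:
            blocks.append(text)
    return blocks


page = "<body><h1>Title</h1><p>First <b>bold</b> paragraph.</p><ul><li>Second block</li></ul></body>"
for number, block in enumerate(extract_blocks(page), start=1):
    print(number, block)
```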

Using XPath to get text of paragraph with links inside

I'm parsing an HTML page with XPath and want to grab the whole text of a specific paragraph, including the text of links.
For example, I have the following paragraph:
<p class="main-content">
This is sample paragraph with <a href="...">link</a> inside.
</p>
I need to get the following text as the result: "This is sample paragraph with link inside". However, applying "//p[@class='main-content']/text()" gives me only "This is sample paragraph with inside".
Could you please assist? Thanks.
To get the whole text content of a node, use the string function:
string(//p[@class="main-content"])
Note that this gets a string value. If you want text nodes (as returned by text()), you need to search at all depths:
//p[@class="main-content"]//text()
This returns three text nodes: "This is sample paragraph with", "link" and "inside".
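Both expressions can be tried out from Python with lxml. The snippet is a sketch under assumptions: the link target "#" is a placeholder, and the paragraph is wrapped in a <div> only to give the fragment a single root.

```python
from lxml import html

# Placeholder document; the href value is made up for illustration.
doc = html.fromstring(
    '<div><p class="main-content">'
    'This is sample paragraph with <a href="#">link</a> inside.'
    '</p></div>'
)

# string() concatenates all descendant text into one value.
whole = doc.xpath('string(//p[@class="main-content"])')

# //text() returns every text node at any depth as a separate item.
parts = doc.xpath('//p[@class="main-content"]//text()')

print(whole)
print(parts)
```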

Inserting LTR marks automatically

I am working with bidirectional text (mixed English and Hebrew) for a project. The text is displayed in HTML, so sometimes an LTR or RTL mark (&lrm; or &rlm;) is required to make 'weak characters' like punctuation display properly. These marks are not present in the source text due to technical limitations, so we need to add them in order for the final displayed text to appear correct.
For instance, the following text: (example: מדגם) sample renders as sample (מדגם :example) in right-to-left mode. The corrected string would look like &lrm;(example:&lrm; מדגם) sample and would render as sample (מדגם (example:.
We'd like to do on-the-fly insertion of these marks rather than re-authoring all the text. At first this seems simple: just append an &lrm; to each instance of punctuation. However, some of the text that needs to get modified on-the-fly contains HTML and CSS. The reasons for this are unfortunate and unavoidable.
Short of parsing HTML/CSS, is there a known algorithm for on-the-fly insertion of Unicode directional marks (pseudo-strong characters)?
I don't know of an algorithm to insert directional marks into an HTML string safely without parsing it. Parsing the HTML into a DOM and manipulating the text nodes is the safest way of ensuring you don't accidentally add directional marks to text inside <script> and <style> tags.
Here is a short Python script which might help you transform your files automatically. The logic should be easy to translate into other languages if necessary. I'm not familiar enough with the RTL rules you're trying to encode, but you can tweak the regexp r'(\W)([^\W]+)(\W)' and the substitution pattern '\u200e\\1\\2\\3\u200e' to get your expected result:
import re
import lxml.html
# re.ASCII keeps the Python 2 matching behavior, where \W also matches non-ASCII letters
_RE_REPLACE = re.compile(r'(\W)([^\W]+)(\W)', re.M | re.ASCII)
def _replace(text):
    if not text:
        return text
    # wrap each "punctuation, word, punctuation" run in LTR marks (U+200E)
    return _RE_REPLACE.sub('\u200e\\1\\2\\3\u200e', text)
text = '''
<html><body>
<div>sample (\u05de\u05d3\u05d2\u05dd :example)</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>
'''
# convert the text into an html dom
tree = lxml.html.fromstring(text)
body = tree.find('body')
# iterate over all descendants of the <body> tag
for node in body.iterdescendants():
    # transform the text that trails after the current html tag
    node.tail = _replace(node.tail)
    # ignore text inside script and style tags
    if node.tag in ('script', 'style'):
        continue
    # transform text inside the current html tag
    node.text = _replace(node.text)
# render the modified tree back to html
print(lxml.html.tostring(tree, encoding='unicode'))
Output:
python convert.py
<html><body>
<div>sample (מדגם ‎:example)‎</div>
<script type="text/javascript">var foo = "ignore this";</script>
<style type="text/css">div { font-size: 18px; }</style>
</body></html>