escaping raw html in rmarkdown - html

In R Markdown, with html output, I want to use the following raw HTML code, together with some R Markdown code to include an image in a grid.
The HTML code starting with "<" appears to propagate to HTML output without problems. However, HTML code such as "::before" gets converted to <p>::before</p> which is not what I want.
How can I specify to R Markdown that I want to 'escape' certain pieces of code, such as "::before" and "::after", preventing the automatic encapsulation of them in <p> tags?
<div class = "row", id = "abc">
::before
<div class = "col-md-4">
![](images/logo.png){ style="height: 70px"}
</div>
::after
</div>

You can output those parts of your code using knitr::raw_html(). For your example, the middle line is Markdown, not HTML, so you'd want:
```{r echo=FALSE}
knitr::raw_html(
'<div class = "row", id = "abc">
::before
<div class = "col-md-4">')
```
![](images/logo.png){ style="height: 70px"}
```{r echo=FALSE}
knitr::raw_html(
'</div>
::after
</div>')
```

Related

cannot get tag however it is appear on html

I am trying a scraping job using BeatifulSoup and find methods, I get the HTML with lxml parser as following :
result = requests.get('https://wuzzuf.net/jobs/p/xgUqkfYngXZL-Senior-Python-Developer-Remote---Part-Time-Cairo-Egypt?o=2&l=sp&t=sj&a=python|search-v3|hpb')
#print(result.status_code)
soup1 =BeautifulSoup(result.content , "html5lib")
sections = soup1.find( 'section' ,class_="css-3kx5e2")
divs = sections.find_all('div')
spans = sections.find_all('span')
span = divs[3].find('span' , class_ ='css-47jx3m')
divs[3]
I get the following
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span></div>
however, the original HTML is
<div class="css-rcl8e5"><span class="css-wn0avc">Salary<!-- -->:</span>
<span class="css-47jx3m"><span class="css-8il94u">Confidential, Hourly Based</span>
</span>
</div>
I need to get the ('span class="css-8il94u"') which have the text ('Confidential, Hourly Based') but it does not appear
thanks

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
children = divs.children
children.each do |child|
itemhash[child['class']] = child.text
end
end
Result should be similar to:
{\"r12321\"=>\"Plains\", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", \"lknoijojkljl98799999\"=>\"Love\", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
But I'm ending up with a mess like this:
{nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"r12321\"=>\"Plains\", nil=>\" \", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"lknoijojkljl98799999\"=>\"Love\", nil=>\" \", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes—even when they are empty.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the children_elements method, which would be closer to what you are already doing, and which only returns children that are elements:
children = divs.element_children

Why is HTML I put in a Mongodb document as JSON displaying as text on the webpage?

I have a mongo database with a document that contains some html in it, and when I try to take it and put it on a webpage, it just displays as the actual text and not the html. Here is the json with the html:
db.games.insert({
title: "Minecraft",
background: "/images/minecraft.jpg",
code: "<div id=\"gameBackround\" class=\"col-lg-2 popular-games view view-first\" style=\"background-image:url( /images/gameArt/minecraft.jpg )\"> <div class=\"mask\"> <h2>Minecraft</h2> <p>Amount of groups playing this title now: 11,075</p> Join Lobby </div> </div> <style> </style>"
})
and here is how I display it on the page:
<template name="example">
{{code}}
</template>
Use should use triple curly braces to escape html code returned from a helper.
<template name="example">
{{{code}}}
</template>
Here is an example.

how to parse nested html tag using xpath

This is my sample html code.
using HtmlXpathSelector i need to parse the html file.
def parse(self, response):
edxData = HtmlXpathSelector(response)
first i need to get all the tag which contain
edxData.xpath('//h2[#class = "title course-title"]')
inside of that tag i need to check a tag value.
then need to parse the div tag with class name subtitle course-subtitle copy-detail.
how can i parse this value kindly give some suggestion
sample html response data:
<html>
<body>
<h2 class="title course-title">
<a href="https://www.edx.org/course/mitx/mitx-14-73x-challenges-global-poverty-1350">The Challenges of Global Poverty
</a>
</h2>
<div class="subtitle course-subtitle copy-detail">A course for those who are interested in the challenge posed by massive and persistent world poverty.
</div>
</body>
</html>
one way to loop over the inner tag could be:
>>> for h2 in sel.xpath('//h2[#class = "title course-title"]'):
... print h2.xpath('a')
...
[<Selector xpath='a' data=u'<a href="https://www.edx.org/course/mitx'>]
or even simply:
>>> sel.xpath('//h2[#class = "title course-title"]/a')
[<Selector xpath='//h2[#class = "title course-title"]/a' data=u'<a href="https://www.edx.org/course/mitx'>]
to find another xpath, simply do:
>>> sel.xpath('//div[#class="subtitle course-subtitle copy-detail"]')
[<Selector xpath='//div[#class="subtitle course-subtitle copy-detail"]' data=u'<div class="subtitle course-subtitle cop'>]
it seem like you're using scrapy, pls also tag that question as such

Indenting generated markup in Jekyll/Ruby

Well this is probably kind of a silly question but I'm wondering if there's any way to have the generated markup in Jekyll to preserve the indentation of the Liquid-tag. World doesn't end if it isn't solvable. I'm just curious since I like my code to look tidy, even if compiled. :)
For example I have these two:
base.html:
<body>
<div id="page">
{{content}}
</div>
</body>
index.md:
---
layout: base
---
<div id="recent_articles">
{% for post in site.posts %}
<div class="article_puff">
<img src="/resources/images/fancyi.jpg" alt="" />
<h2>{{post.title}}</h2>
<p>{{post.description}}</p>
Read more
</div>
{% endfor %}
</div>
Problem is that the imported {{content}}-tag is rendered without the indendation used above.
So instead of
<body>
<div id="page">
<div id="recent_articles">
<div class="article_puff">
<img src="/resources/images/fancyimage.jpg" alt="" />
<h2>Gettin' down with responsive web design</h2>
<p>Everyone's talking about it. Your client wants it. You need to code it.</p>
Read more
</div>
</div>
</div>
</body>
I get
<body>
<div id="page">
<div id="recent_articles">
<div class="article_puff">
<img src="/resources/images/fancyimage.jpg" alt="" />
<h2>Gettin' down with responsive web design</h2>
<p>Everyone's talking about it. Your client wants it. You need to code it.</p>
Read more
</div>
</div>
</div>
</body>
Seems like only the first line is indented correctly. The rest starts at the beginning of the line... So, multiline liquid-templating import? :)
Using a Liquid Filter
I managed to make this work using a liquid filter. There are a few caveats:
Your input must be clean. I had some curly quotes and non-printable chars that looked like whitespace in a few files (copypasta from Word or some such) and was seeing "Invalid byte sequence in UTF-8" as a Jekyll error.
It could break some things. I was using <i class="icon-file"></i> icons from twitter bootstrap. It replaced the empty tag with <i class="icon-file"/> and bootstrap did not like that. Additionally, it screws up the octopress {% codeblock %}s in my content. I didn't really look into why.
While this will clean the output of a liquid variable such as {{ content }} it does not actually solve the problem in the original post, which is to indent the html in context of the surrounding html. This will provide well formatted html, but as a fragment that will not be indented relative to tags above the fragment. If you want to format everything in context, use the Rake task instead of the filter.
-
require 'rubygems'
require 'json'
require 'nokogiri'
require 'nokogiri-pretty'
module Jekyll
module PrettyPrintFilter
def pretty_print(input)
#seeing some ASCII-8 come in
input = input.encode("UTF-8")
#Parsing with nokogiri first cleans up some things the XSLT can't handle
content = Nokogiri::HTML::DocumentFragment.parse input
parsed_content = content.to_html
#Unfortunately nokogiri-pretty can't use DocumentFragments...
html = Nokogiri::HTML parsed_content
pretty = html.human
#...so now we need to remove the stuff it added to make valid HTML
output = PrettyPrintFilter.strip_extra_html(pretty)
output
end
def PrettyPrintFilter.strip_extra_html(html)
#type declaration
html = html.sub('<?xml version="1.0" encoding="ISO-8859-1"?>','')
#second <html> tag
first = true
html = html.gsub('<html>') do |match|
if first == true
first = false
next
else
''
end
end
#first </html> tag
html = html.sub('</html>','')
#second <head> tag
first = true
html = html.gsub('<head>') do |match|
if first == true
first = false
next
else
''
end
end
#first </head> tag
html = html.sub('</head>','')
#second <body> tag
first = true
html = html.gsub('<body>') do |match|
if first == true
first = false
next
else
''
end
end
#first </body> tag
html = html.sub('</body>','')
html
end
end
end
Liquid::Template.register_filter(Jekyll::PrettyPrintFilter)
Using a Rake task
I use a task in my rakefile to pretty print the output after the jekyll site has been generated.
require 'nokogiri'
require 'nokogiri-pretty'
desc "Pretty print HTML output from Jekyll"
task :pretty_print do
#change public to _site or wherever your output goes
html_files = File.join("**", "public", "**", "*.html")
Dir.glob html_files do |html_file|
puts "Cleaning #{html_file}"
file = File.open(html_file)
contents = file.read
begin
#we're gonna parse it as XML so we can apply an XSLT
html = Nokogiri::XML(contents)
#the human() method is from nokogiri-pretty. Just an XSL transform on the XML.
pretty_html = html.human
rescue Exception => msg
puts "Failed to pretty print #{html_file}: #{msg}"
end
#Yep, we're overwriting the file. Potentially destructive.
file = File.new(html_file,"w")
file.write(pretty_html)
file.close
end
end
We can accomplish this by writing a custom Liquid filter to tidy the html, and then doing {{content | tidy }} to include the html.
A quick search suggests that the ruby tidy gem may not be maintained but that nokogiri is the way to go. This will of course mean installing the nokogiri gem.
See advice on writing liquid filters, and Jekyll example filters.
An example might look something like this: in _plugins, add a script called tidy-html.rb containing:
require 'nokogiri'
module TextFilter
def tidy(input)
desired = Nokogiri::HTML::DocumentFragment.parse(input).to_html
end
end
Liquid::Template.register_filter(TextFilter)
(Untested)