Issue parsing HTML using Nokogiri - html

I have some HTML and want to get the content of the <body> element. However, whatever I try, once the HTML is parsed with Nokogiri, everything from the doctype and the <head> also ends up inside <body>, so when I retrieve the <body> element I see the doctype text and the <meta>, <title>, <link>, and <script> tags too.
My original HTML is:
<!DOCTYPE html \"about:legacy-compat\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<title>Some Title</title>
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
</head>
<body marginwidth=\"6\" marginheight=\"6\" leftmargin=\"6\" topmargin=\"6\">
<div class=\"hello-status\">Hello World</div>
<div valign=\"top\"></div>
</body>
</html>
The solution I am using is:
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
What I am getting:
<p>about:legacy-compat\"></p>
\n
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
\n
<title>Some title</title>
\n
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
\n
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>
What I am expecting:
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>
Any idea what's happening here?

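I got your example to work by first cleaning up the original HTML. I removed the "about:legacy-compat" string from the doctype, which seemed to be confusing Nokogiri (the legacy-compat form of the doctype is <!DOCTYPE html SYSTEM "about:legacy-compat">, and the SYSTEM keyword is missing here):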
# clean up the junk in the doctype
my_html.sub!("\"about:legacy-compat\"", "")
# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
# => "\n <div class=\"hello-status\">Hello World</div>\n <div valign=\"top\"></div>\n "
In general, when you're parsing potentially dirty third-party data such as HTML, you should clean it up first so the parser doesn't choke and do unexpected things. You could run the HTML through a linter or "tidy" tool to try and automatically clean it up. When all else fails, you'll have to clean it by hand as above.
See also: HTML tidy/cleaning in Ruby 1.9
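The clean-up advice above is not specific to Ruby. As a rough illustration, here is a minimal Python sketch (standard library only, not Nokogiri) that replaces whatever doctype a document carries with the plain HTML5 one before the markup goes to a parser. The name normalize_doctype and the regular expression are assumptions made for this example.

```python
import re

# Matches a <!DOCTYPE ...> declaration, whatever it contains.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>]*>", re.IGNORECASE)

def normalize_doctype(html: str) -> str:
    """Replace the first doctype declaration with the plain HTML5 doctype."""
    return DOCTYPE_RE.sub("<!DOCTYPE html>", html, count=1)

# The malformed doctype from the question, followed by a tiny document.
dirty = '<!DOCTYPE html "about:legacy-compat">\n<html><body><div>Hello World</div></body></html>'
cleaned = normalize_doctype(dirty)
print(cleaned)
```

Replacing the whole declaration, rather than deleting one known bad string, also covers other malformed doctypes, at the cost of discarding any legitimate public or system identifier.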

I want to take the input value from an HTML input block using PyScript, but it doesn't work. Why?

I am trying to use PyScript in my HTML projects, since I'm more fluent in Python than in JavaScript. I need to read the value of an input box, but the approach I expected to work doesn't.
When the user presses the submit button, it's supposed to change the words above it to the value of the input box. However, the code only returns "None", as if the value doesn't exist.
HTML code:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>PyScript Template</title>
<link href="style.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" href="https://pyscript.net/alpha/pyscript.css" />
<script defer src="https://pyscript.net/alpha/pyscript.js"></script>
<py-env>
- paths:
  - /demo.py
</py-env>
</head>
<body style="font-family:Times New Roman;margin-left:40px">
<br>
<p id=main_description class=text>
Hello, world.
</p>
<input class=player_input type=text id=input_box name=input autocomplete=off placeholder="What will you do?">
<button type=button id=submit pys-onclick=submit_input>submit</button>
<py-script src=/demo.py></py-script>
</body>
</html>
PyScript code:
# demo.py
def submit_input(e):
    input_box = Element("input_box")
    player_input = input_box.select(".value")
    main_description = Element("main_description")
    main_description.write(player_input)
I have searched for a solution to this everywhere, but it looks like PyScript is still relatively new, so I couldn't find anything except traditional Python solutions. Any advice would be greatly appreciated.
On the line player_input = input_box.select(".value") you are asking for a child element with the class "value", but the input has no child elements. To retrieve the input's value, read its value property instead:
# demo.py
def submit_input(e):
    input_box = Element("input_box")
    player_input = input_box.value
    main_description = Element("main_description")
    main_description.write(player_input)

how to force a line break inside a textarea?

I want to force line breaks inside a textarea, like this: <textarea>hello!\nbye<br>hello again\r\n i'm here</textarea>
But the page shows the text literally: hello!\nbye<br>hello again\r\n i'm here
It should appear something like:
bye
hello again
I'm here
Is there something like &nbsp; for this?
Which element should I use to make the line break?
Use a literal newline:
<textarea>bye
hello again
I'm here</textarea>
Alternatively, set the value from JavaScript, where '\r\n' inside a string literal is a real line break:
<!DOCTYPE html>
<html>
<head>
<title>Hello, World!</title>
<link rel="stylesheet" href="styles.css" />
</head>
<body>
<textarea id="message">Hello</textarea>
<script type="text/javascript" charset="utf-8">
const message = document.getElementById('message');
message.value = 'bye' + '\r\n' + 'hello again' + '\r\n' + 'Im here';
console.log(message.value);
</script>
</body>
</html>

Remove a specific tag and replace it with awk

I'm using a tool to generate some html that looks something like this:
<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>
But I'd like a way to replace that style tag with some custom styling
<link rel="stylesheet" href="style.css">
possibly with awk or sed so that I can add it to my Makefile.
Is this possible?
awk to the rescue!
This is not xml/html aware but a basic text substitution...
$ awk '/<style>/ {f=1}
!f;
/<\/style>/ {f=0;
print "<link rel=\"stylesheet\" href=\"style.css\">"}' file
will give
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
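For comparison, the same flag-based idea can be sketched in Python (not awk). Like the awk version, it is plain line matching and not HTML-aware. The function name replace_style_block and its default replacement string are assumptions made for this example.

```python
def replace_style_block(lines, replacement='<link rel="stylesheet" href="style.css">'):
    """Drop every line from <style> through </style> and emit one replacement line."""
    out = []
    skipping = False
    for line in lines:
        if "<style>" in line:
            skipping = True          # start dropping lines, including this one
        if not skipping:
            out.append(line)
        if "</style>" in line:
            skipping = False         # stop dropping and insert the replacement
            out.append(replacement)
    return out

html = """<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>"""
result = replace_style_block(html.splitlines())
print("\n".join(result))
```

The three checks run in the same order as the three awk rules, so the opening and closing lines of the style block are both dropped.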
If you like tricks, also check this out:
$ ht=$'<html>\n<head>\n<title>Blah</title>\n<style>\n/* stuff */\n</style>\n</head>\n'
$ st=$'<link rel="stylesheet" href="style.css">'
$ echo "$ht"
<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>
$ echo "$ht" |perl -0777 -pe "s/\n/\0/g;s/<style>.*<\/style>/$st/g;s/\0/\n/g"
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
echo '<html>
<head>
<title>Blah</title>
<style>
/* stuff */
</style>
</head>'|sed -e '/<style>/{:a;/<\/style>/!{N;ba};c <link rel="stylesheet" href="style.css">' -e'}'
<html>
<head>
<title>Blah</title>
<link rel="stylesheet" href="style.css">
</head>
@mef51: try the following, adapted from karafka's nice code. The only difference is that this version also prints the <style> and </style> tags along with your new line.
awk '/<style>/ {print;f=1}
!f;
/<\/style>/ {f="";
print "<link rel=\"stylesheet\" href=\"style.css\">" ORS $0}' Input_file
Explanation: when a line contains <style>, print it and set the variable f to 1 (true). The condition !f prints the current line whenever f is unset or empty, so the lines between the two tags are skipped. When a line contains </style>, reset f and print the new link line followed by ORS (the output record separator, a newline by default) and the current line.

Multimarkdown well configured header data

Hi, I'm trying to get the top of my MultiMarkdown output to look like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test of markdown</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<link rel="stylesheet" type="text/css" href="../main.css" />
</head>
I know how to add the following metadata:
Title: Test of markdown
CSS: ../main.css
Quotes language: english
which gives me:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Test of markdown</title>
<link type="text/css" rel="stylesheet" href="../main.css"/>
</head>
But I'm not sure how to add the rest. Would appreciate any help. Thanks
I can't find any native markdown way to do this but you could run a little script across the generated HTML if you really feel you need to do this.
This is a simple Python 3 option that might get you started. This could be improved in many ways but wanted to keep it simple. An obvious idea would be to give it a folder and have it process every HTML file in the folder. But I hope this gives the idea.
Example code:
filepath = input('What is the full file path to the file? - ')
# Replacement lines for the HTML5 doctype and the bare <html> tag
htmldoctype = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">\n'
)
htmlinfo = '<html xmlns="http://www.w3.org/1999/xhtml">\n'
inlines = []
# Read the file, swapping the two header lines as they are seen
try:
    with open(filepath, mode='r', encoding='utf-8') as infile:
        for line in infile:
            if line.strip() == '<!DOCTYPE html>':
                inlines.append(htmldoctype)
            elif line.strip() == '<html>':
                inlines.append(htmlinfo)
            else:
                inlines.append(line)
except Exception:
    print('something went wrong in get')
    raise SystemExit(1)  # do not overwrite the file with partial content
# Write the modified lines back to the same file
try:
    with open(filepath, mode='w', encoding='utf-8') as outfile:
        for line in inlines:
            outfile.write(line)
except Exception:
    print('something went wrong in write')
Input:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>Test of markdown</title>
<link type="text/css" rel="stylesheet" href="../main.css"/>
</head>
<body>
test
</body>
</html>
Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<title>Test of markdown</title>
<link type="text/css" rel="stylesheet" href="../main.css"/>
</head>
<body>
test
</body>
</html>
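The folder idea mentioned in the answer could look roughly like the sketch below. It uses the same line-matching rule as the script above. The names rewrite_header and process_folder, and the choice to look only at *.html files directly inside the folder, are assumptions made for this example.

```python
from pathlib import Path

XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">\n'
)
XHTML_ROOT = '<html xmlns="http://www.w3.org/1999/xhtml">\n'

def rewrite_header(text: str) -> str:
    """Swap the HTML5 doctype and the bare <html> tag for their XHTML forms."""
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped == '<!DOCTYPE html>':
            out.append(XHTML_DOCTYPE)
        elif stripped == '<html>':
            out.append(XHTML_ROOT)
        else:
            out.append(line)
    return ''.join(out)

def process_folder(folder: str) -> int:
    """Rewrite every *.html file directly inside folder; return how many changed."""
    changed = 0
    for path in sorted(Path(folder).glob('*.html')):
        original = path.read_text(encoding='utf-8')
        updated = rewrite_header(original)
        if updated != original:
            path.write_text(updated, encoding='utf-8')
            changed += 1
    return changed
```

Calling process_folder('build') (where 'build' is a hypothetical output directory) would update the generated files in place; files that already have the XHTML header are left untouched.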

Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil, but this would have to be done in reverse: the first element (the root) has an inner_html containing the entire rest of the document, so ideally I'd start at the innermost element and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking regexes might do it, but probably not as efficiently as an HTML tokenizer/parser.
This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
You can scan the string to create an array of "tokens", and then only select those that are html tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for html tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
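The tag-scanning one-liner carries over to other languages almost unchanged. Here is a minimal Python sketch of the same regular expression (Python rather than Ruby; the name tags_only is an assumption made for this example). It shares the limitation of the original: a ">" inside an attribute value or a comment will confuse it.

```python
import re

# Same pattern as the Ruby scan: an opening or closing tag up to the next ">".
TAG_RE = re.compile(r"</?[^>]+>")

def tags_only(html: str) -> str:
    """Return the document with everything except the tags removed."""
    return "".join(TAG_RE.findall(html))

some_html = "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
stripped = tags_only(some_html)
print(stripped)
```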
To grab everything not in a tag, you can use nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] # The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
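As a point of comparison for the blacklist approach, here is a minimal Python sketch using the standard library's html.parser (not Nokogiri). The class name TextCollector and the helper visible_text are assumptions made for this example. One difference from the XPath version above: this sketch skips all text nested anywhere inside a blacklisted element, not only its direct text children.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text nodes, skipping any text nested inside blacklisted tags."""

    def __init__(self, blacklist=("title", "script", "style")):
        super().__init__()
        self.blacklist = set(blacklist)
        self.skip_depth = 0  # greater than 0 while inside a blacklisted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.blacklist:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.blacklist and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def visible_text(html: str) -> str:
    """Return the concatenated text outside the blacklisted elements."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)

sample = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello <b>World</b></p><script>var x = 1;</script></body></html>"
)
text = visible_text(sample)
print(text)
```

A whitelist variant would invert the test in handle_data and track the depth of allowed elements instead.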
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
# Parse the document and blank out the content of every text node
def strip_text doc
  Nokogiri(doc).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html>
  <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
  <head>
  <meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
  <title>Test HTML Document</title>
  <meta http-equiv='content-language' content='en' />
  </head>
  <body>
  <h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
  <div class='main'>
  <p>
  <strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
  </p>
  </div>
  </body>
  </html>
- |
  <!DOCTYPE html>
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title></title>
  <meta http-equiv="content-language" content="en">
  </head>
  <body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
  </html>