HTML to DITA through Perl

I am attempting to convert DOCX to DITA topics through an intermediate HTML step.
With simple substitutions in sed, Emacs, or vi I can make most of the changes, but not certain types; for those I may need Perl or Python. Below is an example of what I am trying to accomplish:
From:
<h1> Head 1 </h1>
<body>
</body>
<h2>Sub Head 1 </h2>
<body>
</body>
<h3>SubSub Head 1 </h3>
<body>
</body>
<h2>Sub Head 2 </h2>
<body>
</body>
<h1>Head 2 </h1>
<body>
</body>
To:
<topic><title> Head 1 </title>
<body>
</body>
<topic><title> Sub Head 1 </title>
<body>
</body>
<topic><title> SubSub Head 1 </title>
<body>
</body>
</topic>
</topic>
<topic><title> Sub Head 2 </title>
<body>
</body>
</topic>
</topic>
<topic><title> Head 2 </title>
<body>
</body>
</topic>
The part I have trouble with is placing the tags for nested topics (and yes, I do have nested topics; my needs are somewhat unusual since I am migrating existing documents). If someone can suggest a Perl snippet (or a pointer to a similar one) for placing tags on a per-tag basis, I can build my script around it.
Thanks in advance for looking and suggestions.

That's the kind of processing I often use XML::Twig for.
The wrap_children method is designed for just this: you give it a regexp-like expression over the child elements, and each run of children that matches is wrapped in a new element. See the example below and the docs for more:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 1;
use XML::Twig;
# reads the DATA section, the input doc first, then the expected result
my( $in, $expected)= do{ local $/="\n\n"; <DATA>};
my $t=XML::Twig->new->parse( $in);
my $root= $t->root;
# that's where the wrapping occurs, from the inside out
$root->wrap_children( '<h3><body>', topic => { level => 3 });
$root->wrap_children( '<h2><body><topic level="3">*', topic => { level => 2 });
$root->wrap_children( '<h1><body><topic level="2">*', topic => { level => 1 });
# now we clean up: the levels are not used any more
foreach my $to ($t->descendants( 'topic'))
{ $to->del_att( 'level'); }
# the wrapping will have generated tons of additional id's,
# you may not need this if your elements had id's before the wrapping
foreach my $to ($t->descendants( 'topic|body|h1|h2|h3'))
{ $to->del_att( 'id'); }
# now we can deal with titles
foreach my $h ($t->descendants( 'h1|h2|h3')) { $h->set_tag( 'title'); }
# how did we do?
is( $t->sprint( pretty_print => 'indented'), $expected, 'just one test');
__DATA__
<doc>
<h1> Head 1 </h1>
<body></body>
<h2> Sub Head 1 </h2>
<body></body>
<h3> SubSub Head 1 </h3>
<body></body>
<h2> Sub Head 2 </h2>
<body></body>
<h1> Head 2 </h1>
<body></body>
</doc>
<doc>
<topic>
<title> Head 1 </title>
<body></body>
<topic>
<title> Sub Head 1 </title>
<body></body>
<topic>
<title> SubSub Head 1 </title>
<body></body>
</topic>
</topic>
<topic>
<title> Sub Head 2 </title>
<body></body>
</topic>
</topic>
<topic>
<title> Head 2 </title>
<body></body>
</topic>
</doc>
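
For comparison, the same nesting can be done with an explicit stack instead of wrap_children. This is a minimal sketch in Python using only the standard library's xml.etree.ElementTree; it is a different technique from the XML::Twig code above, and the names nest_topics and SAMPLE are hypothetical. It assumes the input is well-formed XML with a single root whose direct children are the headings and body elements, and that headings contain plain text only.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input shape: a flat list of hN/body siblings under one root.
SAMPLE = """<doc>
<h1>Head 1</h1><body/>
<h2>Sub Head 1</h2><body/>
<h3>SubSub Head 1</h3><body/>
<h2>Sub Head 2</h2><body/>
<h1>Head 2</h1><body/>
</doc>"""


def nest_topics(root):
    """Return a new root whose flat hN/body children are nested as topics."""
    out = ET.Element(root.tag)
    # Each stack entry is (heading level, element that receives new children).
    stack = [(0, out)]
    for child in root:
        match = re.fullmatch(r"h([1-6])", child.tag)
        if match:
            level = int(match.group(1))
            # A heading closes every open topic at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            topic = ET.SubElement(stack[-1][1], "topic")
            title = ET.SubElement(topic, "title")
            title.text = child.text  # plain-text headings only (assumption)
            stack.append((level, topic))
        else:
            # Anything else belongs to the innermost open topic.
            # The element is reused, not copied, so it is shared with the input.
            stack[-1][1].append(child)
    return out


if __name__ == "__main__":
    nested = nest_topics(ET.fromstring(SAMPLE))
    print(ET.tostring(nested, encoding="unicode"))
```

Because a heading pops every open topic at its own level or deeper before opening a new one, the closing tags end up in the right place without any lookahead.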


Add HTML paragraph to the beginning of Dominate Document

From the Dominate GitHub page:
The document class also provides helpers to allow you to directly add nodes to the body tag.
d = document()
d += h1('Hello, World!')
d += p('This is a paragraph.')
print(d)
<!DOCTYPE html>
<html>
<head>
<title>Dominate</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</body>
</html>
How do I add a paragraph before the existing paragraph?
I tried:
d = p("Offer Ends Soon") + d
Got this error
Error: TypeError
unsupported operand type(s) for +: 'p' and 'document'
I tried:
d += p("Offer Ends Soon")
But this puts the new paragraph at the bottom, not the top
<!DOCTYPE html>
<html>
<head>
<title>Dominate</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
<p>Offer Ends Soon</p>
</body>
</html>
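This is not possible with these helpers. On document(), += always appends to the body, and adding a tag to a document with + is not defined, so you can't prepend tags this way.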
From here:
Dominate is NOT an HTML parser. It is strictly for creating new documents, not parsing existing html files

Issue parsing HTML using Nokogiri

I have some HTML and want to get the content under the <body> element. However, whatever I try, after the HTML is parsed with Nokogiri, everything inside the doctype and <head> also becomes part of the <body> element, and when I retrieve <body> I see content from the doctype and the <meta> and <script> tags too.
My original HTML is:
<!DOCTYPE html \"about:legacy-compat\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<title>Some Title</title>
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
</head>
<body marginwidth=\"6\" marginheight=\"6\" leftmargin=\"6\" topmargin=\"6\">
<div class=\"hello-status\">Hello World</div>
<div valign=\"top\"></div>
</body>
</html>
The solution I am using is:
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
What I am getting:
<p>about:legacy-compat\"></p>
\n
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
\n
<title>Some title</title>
\n
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
\n
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>
What I am expecting:
<div class=\"hello-status\">Hello World</div>
\n
<div valign=\"top\">\n\n</div>
Any idea what's happening here?
I got your example to work by first cleaning up the original HTML. I removed the "about:legacy-compat" from the Doctype which seemed to be messing Nokogiri up:
# clean up the junk in the doctype
my_html.sub!("\"about:legacy-compat\"", "")
# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
# => "\n <div class=\"hello-status\">Hello World</div>\n <div valign=\"top\"></div>\n "
In general, when you're parsing potentially dirty third-party data such as HTML, you should clean it up first so the parser doesn't choke and do unexpected things. You could run the HTML through a linter or "tidy" tool to try and automatically clean it up. When all else fails, you'll have to clean it by hand as above.
See also the related question "HTML tidy/cleaning in Ruby 1.9".
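
The clean-up step above can also be written so that it does not depend on the exact junk in the doctype. The sketch below is in Python rather than Ruby and uses only the standard library; normalize_doctype is a hypothetical helper name. It assumes it is acceptable to replace whatever doctype the input carries with the plain HTML5 one. (For reference, the legacy form allowed by the HTML standard is <!DOCTYPE html SYSTEM "about:legacy-compat">; the input in the question omits the SYSTEM keyword.)

```python
import re


def normalize_doctype(html):
    """Replace the first <!DOCTYPE ...> declaration with <!DOCTYPE html>."""
    # count=1 so that only the leading declaration is touched.
    return re.sub(
        r"<!DOCTYPE[^>]*>",
        "<!DOCTYPE html>",
        html,
        count=1,
        flags=re.IGNORECASE,
    )


if __name__ == "__main__":
    dirty = '<!DOCTYPE html "about:legacy-compat">\n<html><body></body></html>'
    print(normalize_doctype(dirty))
```

The cleaned string can then be handed to the HTML parser as before.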

HTML DOM childNodes

I was playing around with HTML DOM, and I noticed that two properties don't agree with each other with no apparent reason. Consider this simple HTML file:
<html>
<head>
<title>DOM Example</title>
</head>
<body>
<p>Hello World!</p>
<p>Isn't this exciting?</p>
<p>You're learning to use the DOM!</p>
</body>
<script type="text/javascript" src="script.js"></script>
</html>
I expected body and alt_body to be identical, but .childNodes insists on giving me a text node. (Below is the content of script.js.)
body = document.documentElement.childNodes[1]
alt_body = document.documentElement.lastChild;
console.log(body.nodeType) //prints 3 (Node.TEXT_NODE)
console.log(alt_body.nodeType) //prints 1 (Node.ELEMENT_NODE)
console.log(body.childNodes.length) //prints 0
console.log(alt_body.childNodes.length) //prints 8
Does anyone know why it's acting that way?
It's because childNodes includes text nodes as well as elements, and the line break between </head> and <body> is a text node:
<html>
<head>
<title>DOM Example</title>
</head>
<body>
<p>Hello World!</p>
<p>Isn't this exciting?</p>
<p>You're learning to use the DOM!</p>
</body>
<script type="text/javascript" src="script.js"></script>
</html>
0 : will be <head> node
1 : will be empty text ( considered as a node )
2 : will be <body> node ( lastChild in this HTML )
Try after getting rid of all the linefeeds and spaces like this below.
<html><head><title>DOM Example</title></head><body><p>Hello World!</p><p>Isn't this exciting?</p><p>You're learning to use the DOM!</p></body><script type="text/javascript" src="script.js"></script></html>
Then the result will be what you expected.
childNodes[1] isn't the last child, because there are three children:
[0] The Head Element
[1] The Text Node containing the newline character between your </head> and <body>.
[2] The body element
If you were to remove the newline and all spaces between </head> and <body> you would find that childNodes[1] === lastChild:
<html>
<head>
<title>DOM Example</title>
</head><body>
<p>Hello World!</p>
<p>Isn't this exciting?</p>
<p>You're learning to use the DOM!</p>
</body>
<script type="text/javascript" src="script.js"></script>
</html>
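
The same effect can be reproduced outside the browser. The following is a minimal sketch using Python's standard xml.dom.minidom, which exposes the same childNodes, lastChild, and nodeType interface; MARKUP and element_children are hypothetical names. Note that minidom parses XML, not HTML, so the sample markup is kept to one line break between </head> and <body> to make the indices match the explanation above.

```python
from xml.dom import Node, minidom

# One newline between </head> and <body>, and no other whitespace.
MARKUP = "<html><head></head>\n<body></body></html>"

root = minidom.parseString(MARKUP).documentElement


def element_children(node):
    """Return only the element children, skipping text and comment nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


if __name__ == "__main__":
    # The node at index 1 is the newline text node, not <body>.
    print(root.childNodes[1].nodeType == Node.TEXT_NODE)
    # Filtering to elements makes index 1 and lastChild agree.
    print(element_children(root)[1] is root.lastChild)
```

In browser JavaScript the element-only equivalents are the children collection and the firstElementChild / lastElementChild properties, which skip text nodes in the same way.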

How do I put CSS inside a string? [closed]

I am saving an HTML file:
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
my_str fills in the contents of the HTML page. Inside my_str are list items that I want to indent. To do this I tried adding a CSS rule at the bottom to indent all li tags:
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
li {
padding-left: 20px;
}
Unfortunately, the rule is displayed as text on the page instead of being applied as padding to the li items:
li {
padding-left: 20px;
}
Just add a <style> tag:
File.open("features/output/all_test_breakdown.html", "w") { |file| file.write(
" <!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
<style>
li {
padding-left: 20px;
}</style>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
" )}
Ugh. Here's how to write this more idiomatically. Starting with this rewrite:
my_str = 'foo'
File.open("my_output.html", "w") do |file|
file.write(<<EOT)
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
EOT
end
I'd refine it further using:
my_str = 'foo'
File.write("my_output.html", <<EOT)
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
EOT
If sticking a heredoc in the write call bugs you, you could do:
my_str = 'foo'
html = <<EOT
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
EOT
File.write("my_output.html", html)
Or:
my_str = 'foo'
html = "
<!DOCTYPE html>
<html>
<head>
<title>'Previous Test Run Breakdown'</title>
</head>
<body>
<h1> Breakdown of results by structure</h1>
#{my_str}
</body>
</html>
"
File.write("my_output.html", html)
In any case:
File.new("features/output/my_output.html", "w")
File.open("features/output/my_output.html", "w") { |file| file.write(
...
is a code smell. You don't need to use new to create a file stub, then open it, and then call ios.write. Simply IO.write it.
If you're just learning Ruby, the difference between the two will seem hard to decipher. The first writes to a file handle, AKA "ios", AKA "IO stream". The second is a class method of IO, "IO.write", which handles the intermediate steps of opening the file, writing the content, and closing it automatically.

Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly.
What I am trying to do is the opposite, remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil, but you'd have to do this in reverse: the first element (the root) has an inner_html of the entire rest of the document, so ideally I'd start at the innermost element and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking perhaps regexes might do it, but probably not as efficiently as an HTML tokenizer/parser might.
This works too:
doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").remove
You can scan the string to create an array of "tokens", and then only select those that are html tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for html tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
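
If a regex over raw markup feels fragile (for example, a ">" inside an attribute value would split a tag), the same "keep only the tags" idea can be done with a tokenizer. Below is a minimal sketch in Python using the standard library's html.parser; TagOnlyParser and strip_text are hypothetical names. It re-emits start tags as written in the source, rebuilds end tags from the lowercased tag name, keeps the doctype, and drops text and comments.

```python
from html.parser import HTMLParser


class TagOnlyParser(HTMLParser):
    """Collect tags and the doctype; ignore text, comments, and script bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_text(html):
    """Return the markup of html with all text content removed."""
    parser = TagOnlyParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    some_html = ("<div>foo bar</div><p>I like <em>this</em> stuff "
                 "<a href='http://foo.bar'> long time</a></p>")
    print(strip_text(some_html))
```

Text is dropped simply by not overriding handle_data, so nothing has to be deleted from a tree afterwards.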
To grab everything not in a tag, you can use nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
blacklist.each do |tag|
nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] #The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
whitelist.each do |tag|
nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'
def strip_text doc
Nokogiri(doc).tap { |doc|
doc.traverse do |node|
node.content = nil if node.text?
end
}.to_s
end
require 'test/unit'
require 'yaml'
class TestHTMLStripping < Test::Unit::TestCase
def test_that_all_text_gets_stripped_from_the_document
dirty, clean = YAML.load DATA
assert_equal clean, strip_text(dirty)
end
end
__END__
---
- |
<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
<head>
<meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
<title>Test HTML Document</title>
<meta http-equiv='content-language' content='en' />
</head>
<body>
<h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
<div class='main'>
<p>
<strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
</p>
</div>
</body>
</html>
- |
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
<meta http-equiv="content-language" content="en">
</head>
<body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
</html>