perl HTML::HTMLDoc how to include a base64 img - html

I am trying to include either a base64 image or an image served by a script (src="getImage.pl?image.jpg") when creating a PDF with HTML::HTMLDoc, with no luck so far.
Does anybody have experience with this module and have some wisdom to share?
Thank You,
~D
+-------------------------------------------------+
#!/usr/bin/perl
use HTML::HTMLDoc;
$html = new HTML::HTMLDoc('mode'=>'file', 'tmpdir'=>'/tmp'); # Start instance
$html->set_page_size('letter'); # set page size
$html->set_bodyfont('Arial'); # set font
$html->set_fontsize(8.0); # set fontsize
$html->set_permissions('no-copy');
$html->set_permissions('no-modify');
$html->set_permissions('no-annotate');
$html->set_html_content(
qq{
<html><body>Hello World...
<br />
<img src="" border="0" alt="Hello Image">
</body></html>});
$html->title();
$html->set_header('.', 't', '.');
$html->set_footer('D', '.', '/');
$pdf = $html->generate_pdf(); # generate document
$http_headers_out{'Content-Type'} = 'application/pdf';
print $pdf->to_string();

It looks like HTML::HTMLDoc will NOT handle img src from base64 data NOR a cgi script.
There was a great response to this question here:
http://www.perlmonks.org/?node_id=1081554
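One possible workaround, given the limitation described above, is to decode each embedded image to a temporary file first and point the img src at that file. The sketch below shows the idea in Python rather than Perl, purely for illustration. It assumes that the converter can read an image from a local file path, and the function name materialize_data_uris is hypothetical, not part of HTML::HTMLDoc.

```python
import base64
import os
import re
import tempfile

# Matches src="data:image/<type>;base64,<payload>" as it would appear in an <img> tag.
DATA_URI = re.compile(r'src="data:image/(\w+);base64,([A-Za-z0-9+/=\s]+)"')


def materialize_data_uris(html, tmpdir):
    """Write each embedded base64 image into tmpdir and rewrite src to the file path.

    Hypothetical helper: assumes the PDF converter accepts local file paths.
    """
    def to_file(match):
        extension, payload = match.group(1), match.group(2)
        fd, path = tempfile.mkstemp(suffix="." + extension, dir=tmpdir)
        with os.fdopen(fd, "wb") as handle:
            handle.write(base64.b64decode(payload))
        return 'src="{}"'.format(path)

    return DATA_URI.sub(to_file, html)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    encoded = base64.b64encode(b"not really an image").decode("ascii")
    page = '<img src="data:image/png;base64,{}" alt="Hello Image">'.format(encoded)
    print(materialize_data_uris(page, workdir))
```

The rewritten HTML would then be passed to set_html_content() in place of the original string. The same steps (decode, write to tmpdir, substitute the path) carry over to Perl with MIME::Base64 and File::Temp.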

Related

extracting &lt and &gt from html using python

I have an HTML response in UTF-8 encoding like the one below. I want to extract the OWNER, NVCODE and CKHEWAT tags from it using Python and bs4, but the angle brackets around those tags arrive escaped as &lt; and &gt;, so I am not able to extract their text.
Kindly guide me on how to extract the text from these tags.
<?xml version="1.0" encoding="utf-8"?><html><body><string xmlns="http://tempuri.org/"><root><OWNER>अराजी मतरुका वासीदेह </OWNER><NVCODE>00108</NVCODE><CKHEWAT>811</CKHEWAT></root></string></body></html>
My code
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
soup.find('string').text
Check this
By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into “&amp;”, “&lt;”, and “&gt;”, so that Beautiful Soup doesn’t inadvertently generate invalid HTML or XML:
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode(). Beautiful Soup recognizes six possible values for formatter.
The default is formatter="minimal". Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
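For the extraction itself, one approach is to parse twice: the text of the string element is itself an XML document once unescaped, so it can be handed to a parser a second time. The sketch below uses only the standard library (xml.etree.ElementTree) instead of bs4 so that it runs without extra packages. It assumes the raw response is a string element in the http://tempuri.org/ namespace whose content is the escaped inner XML; the sample payload and the name extract_fields are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the question; the wrapper element holds escaped XML as text.
STRING_TAG = "{http://tempuri.org/}string"


def extract_fields(response_text, names=("OWNER", "NVCODE", "CKHEWAT")):
    """Parse the outer document, then parse the wrapper's text as a second document."""
    outer = ET.fromstring(response_text)
    # iter() also yields the root itself, so this works whether <string> is the
    # root element or nested deeper in the response.
    wrapper = next(outer.iter(STRING_TAG))
    # The parser has already turned &lt; and &gt; back into < and >.
    inner = ET.fromstring(wrapper.text)
    return {name: inner.findtext(name) for name in names}


# Illustrative sample only; the real service response may differ.
sample = (
    '<string xmlns="http://tempuri.org/">'
    "&lt;root&gt;&lt;OWNER&gt;Some Owner &lt;/OWNER&gt;"
    "&lt;NVCODE&gt;00108&lt;/NVCODE&gt;&lt;CKHEWAT&gt;811&lt;/CKHEWAT&gt;&lt;/root&gt;"
    "</string>"
)

fields = extract_fields(sample)
print(fields)
```

With bs4 the same idea would be to pass soup.find('string').text to a second parser call and search the resulting tree for the three tags.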

mPDF error on Codeigniter

I tried to convert an HTML page to PDF and decided to use mPDF, following what the documentation says. When I run the code, the browser does not prompt me to save the PDF. Instead I get the output shown further below.
Here is the code from Controller.
//this data will be passed on to the view
$data['the_content']='mPDF and CodeIgniter are cool!';
//load the view, pass the variable and do not show it but "save" the output into $html variable
$html=$this->load->view('ajax/pdf_output', $data, true);
//this the the PDF filename that user will get to download
$pdfFilePath = "the_pdf_output.pdf";
//load mPDF library
$this->load->library('m_pdf');
//actually, you can pass mPDF parameter on this load() function
$pdf = $this->m_pdf->load();
//generate the PDF!
$pdf->WriteHTML($html);
//offer it to user via browser download! (The PDF won't be saved on your server HDD)
$pdf->Output($pdfFilePath, "I");
Below is the result I get:
%PDF-1.4 ... (raw binary PDF data printed in the browser as text; remainder omitted)
Can anyone tell me what is happening here?
It looks like the "I" parameter is causing trouble because the browser doesn't recognize your file.
According to the documentation you have the following possibilities:
I: send the file inline to the browser. The plug-in is used if available. The name given by $filename is used when one selects the “Save as” option on the link generating the PDF.
D: send to the browser and force a file download with the name given by $filename.
F: save to a local file with the name given by $filename (may include a path).
S: return the document as a string. $filename is ignored.
Try something like this:
$pdf->Output($pdfFilePath, "D");
die;
Alternatively, you can try adding a header that tells the browser explicitly that this is a PDF document:
header('Content-Type: application/pdf');
$pdf->Output($pdfFilePath, "I");
die;
This is because CodeIgniter's output class may be overwriting mPDF's header (but this is just a hunch).
$html=$this->load->view("ajax/pdf_output",$data,true);
//load mPDF library
$this->load->library('m_pdf');
//generate the PDF from the given html
$this->m_pdf->pdf->WriteHTML($html);
//download it.
ob_clean();
$this->m_pdf->pdf->Output($pdfFilePath,'F');
Then check your folder for the generated file.
If you want to show the download dialog, you need to use the code below:
$filename = time()."_order.pdf"; //your file name
$html = $this->load->view('unpaid_voucher2',$data,true);
// $data holds your dynamic data; if you have no dynamic data, you can pass an empty string instead of the variable, like this:
$html = $this->load->view('unpaid_voucher2','',true);
$this->load->library('M_pdf');
$this->m_pdf->pdf->WriteHTML($html);
//For download pass D and save on server pass F.
$this->m_pdf->pdf->Output("./uploads/".$filename, "D");
Here is full configuration to integrate mpdf into codeigniter
The string is a binary PDF representation, and its presence means the Content-type: application/pdf header is not sent correctly or is overridden by your code or setup, most likely by text/plain or text/html.
Try to figure out the following:
Are you resetting Content-type header in PHP code somewhere after calling the mPDF Output method?
Is your server forcing a different Content-type somewhere in your setup?
Does your browser support displaying application/pdf Content-type directly?
Given that the D Output mode gives you the same result, I'd guess the Content-type header is being overridden somewhere after calling the Output method, presumably by CodeIgniter.
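To make the header discussion concrete, here is a minimal sketch of the two headers a browser relies on to treat a response as a PDF. It is written as a small Python WSGI app rather than PHP, and it is not mPDF code: the function name pdf_response and the placeholder bytes are assumptions for illustration, and the mode letters only mirror the I and D options listed earlier.

```python
def pdf_response(pdf_bytes, filename, mode="I"):
    """Build (status, headers, body) for inline ("I") or forced-download ("D") delivery."""
    disposition = "inline" if mode == "I" else "attachment"
    headers = [
        # Without this header the browser may render the bytes as text.
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", '{}; filename="{}"'.format(disposition, filename)),
        ("Content-Length", str(len(pdf_bytes))),
    ]
    return "200 OK", headers, pdf_bytes


def app(environ, start_response):
    """Minimal WSGI app that serves placeholder bytes as a PDF."""
    placeholder = b"%PDF-1.4\n"  # stands in for a real generated document
    status, headers, body = pdf_response(placeholder, "the_pdf_output.pdf", mode="I")
    start_response(status, headers)
    return [body]
```

If a framework layer replaces the Content-Type after these headers are set, the browser receives text/html or text/plain and prints the raw bytes, which matches the symptom in the question.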

Internal links in HTML slides made from markdown with pandoc

According to pandoc(1), pandoc supports internal links in HTML slides. But nothing happens for me when I click one.
A minimal example:
% A minimal example
% moi
% 2015-04-04
# Section 1
la la la
# Section 2
cf. [Section 1](#section-1)
I save the foregoing as example.md. Then in bash I run
file=example && \
pandoc -fmarkdown -tslidy --standalone --self-contained -o$file.html $file.md
Having opened the resulting HTML slides in a web browser, I click "Section 1" on slide "Section 2", but nothing happens. This I have tried in multiple browsers on multiple devices: xombrero on a Macbook running Arch Linux, Chrome on a Moto X running Android and Chrome on a Sony laptop running Windows 8.1. The results are the same. I am using pandoc version 1.13.2.
The link produced by pandoc for the internal reference is different from the link of the relevant slide: in the present example, the former ends in #section-1 and, the latter, in #(2). I suppose that this is why clicking the internal link does not return to the relevant slide. Is there some way to achieve that internal links do go to their relevant slides?
Here's the relevant HTML:
<body>
<div class="slide titlepage">
<h1 class="title">A minimal example</h1>
<p class="author">
moi
</p>
<p class="date">2015-04-04</p>
</div>
<div id="section-1" class="slide section level1">
<h1>Section 1</h1>
<p>la la la</p>
</div>
<div id="section-2" class="slide section level1">
<h1>Section 2</h1>
<p>cf. <a href="#section-1">Section 1</a></p>
</div>
</body>
Thanks for any help!
Your problem is not with Pandoc but with Slidy. Pandoc is creating the right HTML for an ordinary HTML page but the Slidy slide software does not support going to a <div> - only going to a slide number.
If you change your link to cf. [Section 1](#(2)) ('2' being the number of the slide with 'Section 1') then it will work fine.
BTW - It works perfectly in a reveal.js slideshow created by Pandoc.
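To find the right number for a given heading without counting slides by hand, a small script can walk the generated HTML and note which slide each id falls on. This is only a sketch: it assumes every div carrying the class "slide" is one Slidy slide and that slides are numbered from 1 in document order, which matches the example above. The names SlideIndex and slidy_anchor are made up for illustration.

```python
from html.parser import HTMLParser


class SlideIndex(HTMLParser):
    """Record, for each element id, the number of the slide that contains it."""

    def __init__(self):
        super().__init__()
        self.slide_number = 0
        self.slide_of = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # Assumption: each <div class="slide ..."> starts a new slide.
        if tag == "div" and "slide" in classes:
            self.slide_number += 1
        if attributes.get("id"):
            self.slide_of[attributes["id"]] = self.slide_number


def slidy_anchor(html, element_id):
    """Return the '#(n)' anchor for the slide that holds element_id."""
    index = SlideIndex()
    index.feed(html)
    return "#({})".format(index.slide_of[element_id])


sample = """
<div class="slide titlepage"><h1 class="title">A minimal example</h1></div>
<div id="section-1" class="slide section level1"><h1>Section 1</h1></div>
<div id="section-2" class="slide section level1"><h1>Section 2</h1></div>
"""

print(slidy_anchor(sample, "section-1"))  # prints #(2)
```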
Although the question was asked more than five years ago, I recently had the same problem and wrote a postprocessing script in Python, which works for me. Essentially, it reads the Pandoc -> Slidy HTML output, scans for internal links, and replaces each one with the number of the slide on which the link id is defined.
def Fix_Internal_Slidy_Links(infilename, outfilename):
    """Replace all internal link targets with the Slidy page number of their slide."""
    page_pattern = ' class="slide'
    id_pattern = ' id="'
    internal_link_pattern = '<a href="#'
    id_dict = dict()
    whole_text = []
    cur_page = 0

    # First read all ids and associate them with the current page in id_dict
    with open(infilename, 'r', encoding='utf-8') as filecontent:
        for cur_line in filecontent:
            whole_text.append(cur_line)
            if page_pattern in cur_line:
                cur_page += 1
            while id_pattern in cur_line:
                startidx = cur_line.index(id_pattern)
                cur_line = cur_line[startidx + len(id_pattern):]
                lineparts = cur_line.split('"')
                # Only record the id if its closing quote is present
                if len(lineparts) > 1:
                    id_dict[lineparts[0]] = cur_page

    # Then process the text again and replace all internal links known in id_dict
    with open(outfilename, 'w', encoding='utf-8') as filecontent:
        for cur_line in whole_text:
            temp_line = ''
            while internal_link_pattern in cur_line:
                startidx = cur_line.index(internal_link_pattern)
                # Copy everything up to and including '<a href="#'
                temp_line += cur_line[:startidx + len(internal_link_pattern)]
                cur_line = cur_line[startidx + len(internal_link_pattern):]
                lineparts = cur_line.split('"')
                if len(lineparts) < 2:
                    # The link target is not properly terminated
                    break
                link = lineparts[0]
                if link in id_dict:
                    # Link to the page assigned to that id
                    temp_line += '(' + str(id_dict[link]) + ')"'
                else:
                    # The target is not known in id_dict, so leave it unchanged
                    temp_line += link + '"'
                cur_line = cur_line[len(link) + 1:]
            filecontent.write(temp_line + cur_line)

Opening multiple html files & outputting to .txt with Nokogiri

Just wondering if these two functions are to be done using Nokogiri or via more basic Ruby commands.
require 'open-uri'
require 'nokogiri'
require "net/http"
require "uri"
doc = Nokogiri.parse(open("example.html"))
doc.xpath("//meta[@name='author' or @name='Author']/@content").each do |metaauth|
  puts "Author: #{metaauth}"
end
doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content").each do |metakey|
  puts "Keywords: #{metakey}"
end
etc...
Question 1: I'm just trying to parse a directory of .html documents, get the information from the meta html tags, and output the results to a text file if possible. I tried a simple *.html wildcard replacement, but that didn't seem to work (at least not with Nokogiri.parse(open()) maybe it works with ::HTML or ::XML)
Question 2: But more important, is it possible to output all of those meta content outputs into a text file to replace the puts command?
Also forgive me if the code is overly complicated for the simple task being performed, but I'm a little new to Nokogiri / xpath / Ruby.
Thanks.
I have similar code.
Please refer to:
require 'csv'
require 'nokogiri'

module MyParser
  HTML_FILE_DIR = 'your html file dir'  # placeholder: directory holding the .html files

  def self.run(options = {})
    # Skip ".", ".." and hidden files
    file_list = Dir.entries(HTML_FILE_DIR).reject { |f| f =~ /^\./ }
    result = file_list.map do |file|
      html = File.read("#{HTML_FILE_DIR}/#{file}")
      doc = Nokogiri::HTML(html)
      parse_to_hash(doc)
    end
    write_csv(result)
  end

  def self.parse_to_hash(doc)
    array = []
    array << doc.css('your select conditions').first.content  # placeholder selector
    # ... add your selector code (css or xpath)
    array
  end

  def self.write_csv(result)
    ::CSV.open('your output file name', 'w') do |csv|  # placeholder file name
      result.each { |row| csv << row }
    end
  end
end

MyParser.run
You can output to a file like so:
File.open('results.txt','w') do |file|
file.puts "output" # See http://ruby-doc.org/core-2.1.2/IO.html#method-i-puts
end
Alternatively, you could do something like:
authors = doc.xpath("//meta[@name='author' or @name='Author']/@content")
keywrds = doc.xpath("//meta[@name='keywords' or @name='Keywords']/@content")
# Combine both lists first so every entry ends up on its own line
results = (authors.map{ |x| "Author: #{x}" } +
           keywrds.map{ |x| "Keywords: #{x}" }).join("\n")
File.open('results.txt','w'){ |f| f << results }

how to fill html file using ruby script

I have an HTML file that has the general design (some divs), and I need to fill these divs with some HTML code using a Ruby script.
Any suggestions?
example
I have page.html
<html>
<title>html Page</title>
<body>
<div id="main">
</div>
<div id="side">
</div>
</body>
</html>
and a Ruby script in which I collect some data and do some processing on it, and I want to present the result in a nice format**
so I want to fill the div whose id is main with some HTML code, like this:
<html>
<title>html Page</title>
<body>
<div id="main">
<h1>you have 30 files in games folder</h1>
</div>
<div id="side">
</div>
</body>
</html>
** Why don't I use RoR? Because I don't want to build a web site. I just need to build a desktop tool whose presentation layer is HTML interpreted by the browser, to avoid working with graphics libraries.
My problem isn't "how can I write to this HTML file"; I can handle that.
My problem is that if I want to create a table inside the main div,
I have to write the whole HTML code inside the Ruby script and print it to the HTML file. Is there any lib or gem that I can tell I want a table with 3 rows and 2 columns, and it generates the HTML code?
I historically have used ERB and REXML for things like this, since they both ship with Ruby (removing gem dependencies). You can combine one XML file (content) with one .erb file (for layout) and get simple merging. Here's a script I wrote for this (most of which is argument handling and extending REXML with some convenience methods):
USAGE = <<ENDUSAGE
Usage:
rubygen source_xml [-t template_file] [-o output_file]
-t,--template The ERB template file to merge (default: xml_name.erb)
-o,--output The output file name to write (default: template.txt)
If the template_file is named "somefile_XXX.yyy",
the output_file will default instead to "somefile.XXX"
ENDUSAGE
ARGS = {}
UNFLAGGED_ARGS = [ :source_xml ]
next_arg = UNFLAGGED_ARGS.first
ARGV.each{ |arg|
case arg
when '-t','--template'
next_arg = :template_file
when '-o','--output'
next_arg = :output_file
else
if next_arg
ARGS[next_arg] = arg
UNFLAGGED_ARGS.delete( next_arg )
end
next_arg = UNFLAGGED_ARGS.first
end
}
if !ARGS[:source_xml]
puts USAGE
exit
end
extension_match = /\.[^.]+$/
template_match = /_([^._]+)\.[^.]+$/
xml_file = ARGS[ :source_xml ]
template_file = ARGS[ :template_file] || xml_file.sub( extension_match, '.erb' )
output_file = ARGS[ :output_file ] || ( ( template_file =~ template_match ) ? template_file.sub( template_match, '.\\1' ) : template_file.sub( extension_match, '.txt' ) )
require 'rexml/document'
include REXML
class REXML::Element
# Find all descendant nodes with a specified tag name and/or attributes
def find_all( tag_name='*', attributes_to_match={} )
self.each_element( ".//#{REXML::Element.xpathfor(tag_name,attributes_to_match)}" ){}
end
# Find all child nodes with a specified tag name and/or attributes
def kids( tag_name='*', attributes_to_match={} )
self.each_element( "./#{REXML::Element.xpathfor(tag_name,attributes_to_match)}" ){}
end
def self.xpathfor( tag_name='*', attributes_to_match={} )
out = "#{tag_name}"
unless attributes_to_match.empty?
out << "["
out << attributes_to_match.map{ |key,val|
if val == :not_empty
"@#{key}"
else
"@#{key}='#{val}'"
end
}.join( ' and ' )
out << "]"
end
out
end
# A hash to tag extra data onto a node during processing
def _mydata
@_mydata ||= {}
end
end
start_time = Time.new
@xmldoc = Document.new( IO.read( xml_file ), :ignore_whitespace_nodes => :all )
@root = @xmldoc.root
@root = @root.first if @root.is_a?( Array )
end_time = Time.new
puts "%.2fs to parse XML file (#{xml_file})" % ( end_time - start_time )
require 'erb'
File.open( output_file, 'w' ){ |o|
start_time = Time.new
output_code = ERB.new( IO.read( template_file ), nil, '>', 'output' ).result( binding )
end_time = Time.new
puts "%.2fs to run template (#{template_file})" % ( end_time - start_time )
start_time = Time.new
o << output_code
}
end_time = Time.new
puts "%.2fs to write output (#{output_file})" % ( end_time - start_time )
puts " "
This can be used for HTML or automated source code generation alike.
However, these days I would advocate using Haml and Nokogiri (if you want structured XML markup) or YAML (if you want simple-to-edit content), as these will make your markup cleaner and your template logic simpler.
Edit: Here's a simpler file that merges YAML with Haml. The last four lines do all the work:
#!/usr/bin/env ruby
require 'yaml'; require 'haml'; require 'trollop'
EXTENSION = /\.[^.]+$/
opts = Trollop.options do
banner "Usage:\nyamlhaml [opts] <sourcefile.yaml>"
opt :haml, "The Haml file to use (default: sourcefile.haml)", type:String
opt :output, "The file to create (default: sourcefile.html)", type:String
end
opts[:source] = ARGV.shift
Trollop.die "Please specify an input Yaml file" unless opts[:source]
Trollop.die "Could not find #{opts[:source]}" unless File.exist?(opts[:source])
opts[:haml] ||= opts[:source].sub( EXTENSION, '.haml' )
opts[:output] ||= opts[:source].sub( EXTENSION, '.html' )
Trollop.die "Could not find #{opts[:haml]}" unless File.exist?(opts[:haml])
@data = YAML.load(IO.read(opts[:source]))
File.open( opts[:output], 'w' ) do |output|
output << Haml::Engine.new(IO.read(opts[:haml])).render(self)
end
Here's a sample YAML file:
title: Hello World
main: "<h1>you have 30 files in games folder</h1>"
side: "I dunno, something goes here."
...and a sample Haml file:
!!! 5
%html
  %head
    %title= @data['title']
  %body
    #main= @data['main']
    #side= @data['side']
...and finally the HTML they produce:
<!DOCTYPE html>
<html>
<head>
<title>Hello World</title>
</head>
<body>
<div id='main'><h1>you have 30 files in games folder</h1></div>
<div id='side'>I dunno, something goes here.</div>
</body>
</html>
Are you trying to create a dynamic website? For that use Rails.
Are you trying to create a static website? Something like Jekyll is probably best.
Are you trying to just create some simple .html files you can FTP up somewhere? Jekyll might be a good option, or even hand-coding a quick little HTML generator might be a better one.
UPDATE:
Is this what you are looking for?
hash = {
:games => "you have 30 files in games folder",
:puppies => "you have 12 puppies in your pocket",
:pictures => "You have 9 files in pictures folder",
}
array = [
['run','x','y'],
[1,10,3],
[2,12,9],
[3,14,7],
]
hash.each do |key, value|
  # The block form closes the file automatically
  File.open("#{key}.html", "w") do |myfile|
    myfile.puts "<html>"
    myfile.puts "<title>html Page</title>"
    myfile.puts "<body>"
    myfile.puts "<div id=\"main\">"
    myfile.puts "<h1>#{value}</h1>"
    myfile.puts "<table border=\"1\">"
    array.each do |row|
      myfile.puts "<tr>"
      row.each do |cell|
        myfile.puts "<td> #{cell} </td>"
      end
      myfile.puts "</tr>"
    end
    myfile.puts "</table>"
    myfile.puts "</div>"
    myfile.puts "<div id=\"side\">"
    myfile.puts "</div>"
    myfile.puts "</body>"
    myfile.puts "</html>"
  end
end
Continuing from @Phrogz's work: the ERB idea is a great one. I was able to use it to build a simple Rake script that does the work for me. I find this approach to be a little easier.
rakefile.rb
task :default => :generate
task :generate do
require 'erb'
template_file = "page.erb"
output_file = "page.html"
File.open(output_file, 'w') do |o|
puts "Processing file: #{template_file}"
o << ERB.new( IO.read( template_file ), nil, '>', 'output' ).result( binding )
end
end
def render(file)
puts "Rendering file: #{file}"
IO.read(file)
end
$game_count = 30
def game_count
puts "Rendering game count: #{$game_count}"
$game_count
end
page.erb
<html>
<title>html Page</title>
<body>
<div id="main">
<h1>you have <%= game_count %> files in games folder</h1>
</div>
<div id="side">
<%= render "side.html" %>
</div>
</body>
</html>
side.html
<ul class="side">
<li>Side item 1</li>
<li>Side item 2</li>
</ul>
Running it
$ rake
Processing file: page.erb
Rendering game count: 30
Rendering file: side.html
Newly created file page.html
<html>
<title>html Page</title>
<body>
<div id="main">
<h1>you have 30 files in games folder</h1>
</div>
<div id="side">
<ul class="side">
<li>Side item 1</li>
<li>Side item 2</li>
</ul>
</div>
</body>
</html>