Bad conversion from HTML to PDF with htmldoc

I'm trying to convert HTML to PDF using htmldoc, but even basic HTML does not convert properly. I have this HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>pdf test</title>
</head>
<body>
<table border="1">
<tr>
<td width="50%">
a
</td>
<td>
<p>
some address
</p>
<p>
some other text
</p>
</td>
</tr>
<tr>
<td>
test<br>
test2<br>
asdfasdf<br>
qwerqwer<br>
fasdfasdf
</td>
<td>
bla
</td>
</tr>
</table>
</body>
</html>
but it renders incorrectly (see test.pdf) when I use this command:
htmldoc --webpage --color --charset utf-8 -t pdf14 --size a4 test.html -f test.pdf
This is HTMLDOC version 1.9svn. I tried changing the charset, adding thead, tbody, etc., and nothing helped. Do you know what the problem could be?
It also doesn't accept style="padding: 10px" on those paragraphs.

The command:
htmldoc --size universal --webpage -t pdf --firstpage p1 -f test.pdf test.html
renders the page well for me. It is unclear from the question whether the --charset utf-8, --color, and -t pdf14 options you used are actually needed for your result or are the cause of the incorrect rendering.

Related

Can the <style> tag and DOCTYPE be used in many places in HTML, and not only in the head?

I searched for this and found nothing, so I'm asking here.
Recently I worked on a Grails project in which nested HTML templates produced the HTML for emails, as in the following example. Each DOCTYPE + style pair corresponds to a different template, chosen according to business rules:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
...(styles here)
</style>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<style type="text/css">
...(more styles here)
</style>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="greeting">
Hi
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<style type="text/css">
...(more styles here)
</style>
</td>
</tr>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<style type="text/css">
...(more styles here)
</style>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<style type="text/css">
...(more styles here)
</style>
...(and a lot more of code like this)
</td>
</tr>
</tbody>
</table>
But the HTML 4.01 standard says that <style> elements should go in the <head> of the document, not to mention that the DOCTYPE should come first in a valid HTML document.
Why does it work?
Why doesn't it need to respect the standard to work?
Why is there a standard if you can skip it this way?
Surely I'm missing something.
I learned today that the <html> root tag is optional too, like many others, but I found nothing about this matter.
Thanks.-
Why does it work?
HTML parsers perform a lot of error recovery to handle bad HTML.
Why doesn't it need to respect the standard to work?
Because it is generally considered better to show end users a "best effort" rendering of a webpage instead of an error message.
Why is there a standard if you can skip it this way?
Because pages that follow the standard can expect more predictable behaviour in how parsers handle them (since they won't be triggering error recovery).
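The tolerance described above can be observed with any lenient parser. Below is a minimal sketch using Python's standard html.parser module, which is only a forgiving tokenizer and not a browser engine, so it illustrates the "no error, keep going" behaviour rather than a browser's exact recovery rules. The fragment and the EventLogger class are made up for illustration, modelled on the email template in the question.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record what the parser sees instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Called for <!DOCTYPE ...>, wherever it appears in the input.
        self.events.append(("doctype", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Hypothetical fragment: a DOCTYPE and a <style> inside <tbody>.
fragment = (
    "<table><tbody>"
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">'
    '<style type="text/css">.greeting { color: red; }</style>'
    '<tr><td class="greeting">Hi</td></tr>'
    "</tbody></table>"
)

logger = EventLogger()
logger.feed(fragment)  # no exception is raised for the misplaced pieces
logger.close()

for event in logger.events:
    print(event)
```

The parser reports the stray DOCTYPE and the <style> element like any other token and carries on, which is the same general attitude browsers take toward invalid markup.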

XPath syntax to edit specific node in HTML using xmlstarlet

Supposing I have an HTML document like what's below:
mypage.html
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tbody>
...
<tr>
<td id="MY_ID">123</td>
How would I edit the element whose id is MY_ID? I've used the following command successfully when the document contained just the table, but placing the table in a larger document broke it:
xmlstarlet ed --update '//td[@id="MY_ID"]' --value '456' mypage.html
Your td element needs to be closed (</td>) for it to be valid XML.
You can try the following:
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tbody>
...
<tr>
<td id="MY_ID">123</td>
<td id="NOTMYID">127</td>
</tr>
</tbody>
</table>
</body>
</html>
Using your own expression:
//td[@id="MY_ID
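One likely reason the expression stops matching in the larger document is the default namespace declared on the <html> element: once xmlns="http://www.w3.org/1999/xhtml" is in effect, an unprefixed td in an XPath 1.0 expression no longer selects the XHTML td elements. The sketch below demonstrates this with Python's xml.etree.ElementTree rather than xmlstarlet; the prefix x is an arbitrary choice made for this example. xmlstarlet's -N option binds a prefix to a namespace in a similar way.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Same shape as the document above, with the "..." rows left out.
document = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tbody>'
    '<tr><td id="MY_ID">123</td><td id="NOTMYID">127</td></tr>'
    "</tbody></table></body></html>"
)
root = ET.fromstring(document)

# Unprefixed name: finds nothing, because every element is in the XHTML namespace.
unprefixed = root.findall(".//td[@id='MY_ID']")

# Prefixed name, with the prefix bound to the XHTML namespace: finds the cell.
prefixed = root.findall(".//x:td[@id='MY_ID']", {"x": XHTML_NS})

for cell in prefixed:
    cell.text = "456"

print(len(unprefixed), len(prefixed))
```

Only the cell with id MY_ID is updated; the neighbouring cell keeps its value.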

The ' symbol in an HTML 4.01 email newsletter turns into different symbols. What to do?

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Butiqs newsletter</title>
Some internal CSS follows, and then this text in the first table:
<!--Tekst.-->
<table id="inleiding" width="600" cellpadding="0" cellspacing="0">
<tr height="30"><td></td></tr>
<tr>
<td width="20"></td>
<td><font size="5">Hi yara,</font></td>
<td width="20"></td>
</tr>
<tr height="10"><td></td></tr>
<tr>
<td width="20"></td>
<td><font>We are excited to have you as part of our Butiqs’ family and we’d like to keep you in the loop of our progress. So, without further ado, here’s where we are now, 5 months after we kickstarted the project…
</font></td>
<td width="26"></td>
</tr>
<tr height="20"><td></td></tr>
</table>
Now the strange thing is that my text looks perfectly normal in the editor. Only when I live-preview it in my browser do I see strange symbols in the text.
This happens every time I use a ' symbol in HTML. What is the problem? Even when I switch to a normal HTML doctype, the problem stays.
Can someone help me?
For HTML 4.01 you would need to put
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
in your head tag.
This is a problem with your encoding.
You can try fixing this by adding this in your <head>:
<meta charset="utf-8" />
Note: This is for HTML 5 documents.
For HTML 4 documents, use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
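The symptom can be reproduced outside the browser. Here is a minimal Python sketch, assuming the file is saved as UTF-8 and the browser falls back to Windows-1252 when no charset is declared. That fallback is an assumption, since the question does not say which encoding the browser guessed.

```python
# The curly apostrophe from the newsletter text ("we’d").
text = "we’d"

# Saved to disk as UTF-8, the apostrophe becomes three bytes: E2 80 99.
raw = text.encode("utf-8")

# Assumption: with no charset declared, the bytes are read as Windows-1252,
# so each of the three bytes turns into its own character.
misread = raw.decode("cp1252")

# With the charset declared, the same bytes decode back to the original text.
correct = raw.decode("utf-8")

print(misread)
print(correct)
```

The bytes on disk are identical in both cases. Only the encoding used to interpret them changes, which is why declaring the charset fixes the display without touching the text.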

Generate html with images from pdf using Linux -poppler-utils-pdftohtml

I am currently working with pdftohtml (from poppler-utils) on CentOS. The concept is simple: a user uploads a PDF file and sees an HTML version of it. I use this simple command:
$> pdftohtml source.pdf target.html
but it doesn't work! Later, I tried generating the HTML in complex mode with no frames:
$> pdftohtml -c -noframes source.pdf target.html
Still no luck! The problem is that the images embedded in the PDF file don't appear in the HTML, and sometimes they overlap. Any ideas?
Here is the PHP Code -
Add.php
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<link href="css/style.css" rel="stylesheet" type="text/css"/>
<title>CompleteView</title>
</head>
<body>
<form method="post" action="save.php" enctype="multipart/form-data">
<input type="hidden" name="action" value="add">
<tr class="dark_bgcolor text-content">
<td align="left" width="20%">Upload</td>
<td align="left" width="1%">:</td>
<td align="left">
<input type="file" name="img_full" class="look" size="50">
(Only .pdf)
</td>
</tr>
<tr class="bottom_bgcolor">
<td align="center" colspan="3"><input type="submit" name="" value="Upload" class="look"></td>
</tr>
</form>
</body>
</html>
Save.php
<?php
$myNewFolderPath=rand();
mkdir($myNewFolderPath);
$fname="full_".uniqid("");
$filename=$fname.'.pdf';
//$uploadpath=SPL_IMG_UPLOADPATH.$filename;
move_uploaded_file($_FILES['img_full']['tmp_name'], $myNewFolderPath.'/'.$filename);
chmod($myNewFolderPath.'/'.$filename, 0777);
echo ('/usr/local/bin/pdftohtml '.$myNewFolderPath.'/'.$filename);
exec('/usr/local/bin/pdftohtml -c -noframes'.$myNewFolderPath.'/'.$filename);
header('Location:'.$fname.'.html');
//exec('/usr/local/bin/pdftohtml 2098602105/EssentialC.pdf');
?>
One more thing: the pdftohtml version is 0.36.
$ pdftohtml -c source.pdf target.html
This will output in complex mode. You can't use -noframes with the complex flag.
$ man pdftohtml
-noframes generate no frames. Not supported in complex output mode.
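Separately from the flag conflict, the exec() line in Save.php concatenates '-noframes' directly with the folder path, so no space separates the flag from the file name. Building the command as a list of arguments avoids that class of mistake. The following is a rough Python sketch of the idea, not a drop-in replacement for the PHP: the function name and the example paths are hypothetical, and only the -c flag from the answer above is used.

```python
import shlex


def build_pdftohtml_command(pdf_path, html_path, binary="pdftohtml"):
    """Return the argument list for a complex-mode (-c) conversion."""
    # One list element per argument: no separator can be forgotten,
    # and paths containing spaces need no manual quoting.
    return [binary, "-c", pdf_path, html_path]


# Hypothetical upload location, for illustration only.
command = build_pdftohtml_command(
    "2098602105/full_abc.pdf",
    "2098602105/full_abc.html",
)

# Show the command as a shell would see it.
print(shlex.join(command))

# To actually run it (requires pdftohtml to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to subprocess.run also bypasses the shell entirely, which matters when the file name comes from a user upload.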

How to detect file path in a string of html codes

I need to read in a string of HTML code using Ajax and rewrite every file path found in the string.
For example:
I need to detect "../../images/home.jpg"
and change it to "http://www.mywebsite.com/report/images/home.jpg".
I am building a dashboard. Using an iframe to display content in the dashboard costs me performance. To avoid the iframe, I intend to read in the HTML of the target URL as a string. The catch is that if the paths of any images, JavaScript, or CSS files are not absolute, my dashboard will not be able to display them.
Please share your thoughts on how to detect file paths within a string of HTML code, or whether you know of an existing library that does this.
Here is a simple example of the HTML I am reading:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1"><title>
Web Report: Customer Complains
</title><link rel="stylesheet" type="text/css" href="../../../css/Concept1.css" /></head>
<style type="text/css">
</style>
<body>
<div class="SectionTitleContainer1">
<div style="text-align: left;" class="SectionTitle">
Customer Complaints
</div>
</div>
<table cellspacing="0" cellpadding="5" style="width: 100%;" border="0">
<tr>
<td width="1%" align="left">
<a>
Refund
</a>
</td>
<td width="1%" style="text-align: right">139</td>
<td align="left" style="">
<img src="../../../images/tile-horiz-bar.png" style="height: 10px; width: 100px" />
</td>
</tr>
<tr>
<td width="1%" align="left">
<a>
Appointment Rescheduled/Delayed
</a>
</td>
<td width="1%" style="text-align: right">96</td>
<td align="left" style="">
<img src="../../../images/tile-horiz-bar.png" style="height: 10px; width: 69px" />
</td>
</tr>
</table>
</body>
</html>
After a long period of discovery, I learned that if two different systems need to display an identical UI, the image files (and any other script files) should be placed in a common directory on a server. The web app should be given proper permission to access the files in that local directory. The directory holding the image files should be left unsecured for the apps but, if necessary, not exposed to the public.
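For the path rewriting the question originally asked about, resolving each relative path against the URL of the page it came from is usually enough. The sketch below shows the idea in Python with urllib.parse.urljoin. The page URL is a hypothetical one chosen so that "../../images/home.jpg" resolves to the address given in the question, and the regular expression only handles double-quoted src and href attributes, so it is a starting point rather than a complete HTML rewriter.

```python
import re
from urllib.parse import urljoin

# Matches double-quoted src="..." or href="..." attributes only.
ATTR_RE = re.compile(r'\b(src|href)="([^"]+)"')


def absolutize(html, base_url):
    """Rewrite relative src/href values as absolute URLs."""

    def replace(match):
        name, value = match.group(1), match.group(2)
        # urljoin resolves "../" segments against the page's own URL
        # and leaves values that are already absolute unchanged.
        return f'{name}="{urljoin(base_url, value)}"'

    return ATTR_RE.sub(replace, html)


# Hypothetical URL of the page the HTML string was fetched from.
page_url = "http://www.mywebsite.com/report/pages/complaints/index.html"
snippet = '<img src="../../images/home.jpg" />'

print(absolutize(snippet, page_url))
```

In a browser, the same resolution is available through the URL constructor, which takes a relative path and a base URL.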