I'm using wkhtmltopdf to convert a webpage to a PDF document, but some of the images in the resulting file are rather low-resolution.
The source webpage actually uses the srcset attribute to provide higher-resolution versions for higher pixel density displays, but wkhtmltopdf doesn't appear to be using them.
I figured that, as this is a WebKit-based tool and WebKit definitely supports this attribute, there might be something I could set to make WebKit use the highest-resolution version available.
Edit: While I'm waiting to figure out a better way of doing this, I have managed to preprocess the HTML successfully with xmlstarlet, by stripping all but the last URL out of the srcset, renaming that attribute to src, and removing the second-to-last src attribute (the HTML is generated, so all images follow an identical format in their usage of src/srcset).
xmlstarlet ed -P \
--update "//img/@srcset" \
-x "substring-before(substring-after(.,', '),' ')" \
--rename "//img/@srcset" -v "src" \
--delete "//img/@src[position()=last()-1]" file.html
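To illustrate, here is a minimal before/after check (the sample file and image URLs are made up; it assumes every img carries a src plus a two-entry srcset, as described above):
cat > sample.html <<'EOF'
<html><body>
<img src="photo-1x.jpg" srcset="photo-1x.jpg 1x, photo-2x.jpg 2x"/>
</body></html>
EOF
xmlstarlet ed -P \
--update "//img/@srcset" \
-x "substring-before(substring-after(.,', '),' ')" \
--rename "//img/@srcset" -v "src" \
--delete "//img/@src[position()=last()-1]" sample.html
This should print the document with the image reduced to <img src="photo-2x.jpg"/>.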
A useful workaround until I find a more elegant solution!
Whenever I copy embed code from an infographic provider (free services), they add their logo to the infographic. Is there any option or code to remove their logo from the embed code?
I have already tried a few things, but without success.
The logo can be removed easily using ffmpeg's delogo filter. All you need to supply are the coordinates and dimensions of the logo on the video.
Example for the filter syntax:
ffmpeg -i your_video_url -vf "delogo=x=0:y=0:w=100:h=77:band=10" output_file_url
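If you first want to verify the region you are about to erase, the delogo filter's show option draws a rectangle around the specified area (a quick check; the file names are placeholders):
ffmpeg -i your_video_url -vf "delogo=x=0:y=0:w=100:h=77:show=1" preview.mp4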
Find the complete filter info here.
Simple answer: no. They put their logo there as their copyright, to give them credit.
This question of how to include text that will appear in HTML output but not in PDF has been answered twice (LaTeX multicolumn block in Pandoc markdown and Pandoc markdown: Omit text in PDF version but include in HTML version), and I applied the recommended solution:
Here is the source:
My text.
<div>
This will be ignored in non-HTML output
</div>
Other text
The command I used was:
pandoc essai.md --from=markdown-markdown_in_html_blocks -o essai.pdf
Yet the text between the <div> tags is still displayed in the PDF file.
Am I missing something obvious? Or has the behaviour of pandoc changed since 2013/2014 (my version is 1.19, from 2016)? If so, what would the solution be?
I know you asked for a solution that works with pandoc 1.19, but for completeness, here is one which is a bit cleaner but only works with pandoc 2.0 and later:
There is a new raw_attribute extension, which allows you to do the following:
`text in html`{=html}
`text in pdf`{=latex}
It is enabled by default in pandoc markdown.
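A minimal end-to-end check (the file name echoes the question; it assumes pandoc 2.0 or later, where raw_attribute is on by default):
cat > essai.md <<'EOF'
My text.

`<p>This only appears in HTML output</p>`{=html}

Other text
EOF
pandoc essai.md -o essai.pdf    # the {=html} span is dropped here
pandoc essai.md -o essai.html   # here it is passed through verbatim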
I found an answer: disable the native_divs extension by appending -native_divs to the input format.
This way:
$ pandoc essai.md --from=markdown-markdown_in_html_blocks-native_divs -t latex
My text.
Other text
My understanding is that pandoc has started interpreting <div> elements as "native" elements, i.e. represented in the AST, which has the interesting side effect that a <p> gets added in the HTML output. This behaviour needs to be deactivated in order to obtain the desired effect.
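If you want to verify this, comparing the AST with and without the extensions makes the difference visible (a quick sketch; the echoed snippet is arbitrary):
# Default reader: the div becomes a native Div node in the AST.
echo '<div>hidden</div>' | pandoc -f markdown -t native
# Both extensions disabled: the tags stay raw HTML blocks,
# which the LaTeX writer then silently drops.
echo '<div>hidden</div>' | pandoc -f markdown-markdown_in_html_blocks-native_divs -t native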
If someone has a better solution or a better explanation, please let me know.
This has nothing to do with rendering an individual image on a webpage. The goal is to render an entire webpage and save it as a screenshot. I want to show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message; ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the ImageMagick package:
html2ps index.html index.ps
convert index.ps index.png
The second program produces one PNG per page for a long HTML page; the page layout was done by html2ps.
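If the rasterized pages look coarse, you can ask convert to rasterize the PostScript at a higher density (the 150 dpi here is an arbitrary example value):
convert -density 150 index.ps index.png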
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work on a simple first test.
If you'd like to combine multiple pages into one larger image, convert will surely help you.
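For example, assuming convert produced index-0.png, index-1.png, ... for the individual pages, you can stack them vertically into one image:
convert index-*.png -append combined.png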
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the size of the images is nearly identical.
If you have a multipart MIME email, you typically have the mail headers, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately, but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr 14, so I extract that one mail from the mail folder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
The second sed then extracts just the HTML part of that.
wkhtmltopdf needs - - to read from stdin and write to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace the wkhtmltopdf - - step with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
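For reference, the working combination looks roughly like this (a sketch; the file names are placeholders, and xvfb-run supplies the virtual X server wkhtmltoimage needs on a headless machine):
xvfb-run -a wkhtmltoimage input.html output.png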
How to reproduce:
Create a new image with Paint (any size)
Add the letter A to this image
Try to recognize it -> tesseract will not find any letters
Copy-paste this letter 5-6 times into the image
Try to recognize it -> tesseract will find all the letters
Why?
You must set the "page segmentation mode" to "single char".
For example, on Android you would do the following:
api.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_CHAR);
The Python code to set that configuration looks like this:
import cv2
import pytesseract

img = cv2.imread("path to some image")
# --psm 10 treats the image as a single character; the whitelist
# restricts recognition to alphanumeric characters.
text = pytesseract.image_to_string(
    img,
    config=("-c tessedit_char_whitelist="
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
            " --psm 10"))
print(text)
The --psm flag defines the page segmentation mode.
According to the Tesseract documentation, 10 means:
Treat the image as a single character.
So to recognize a single character you just need the --psm 10 flag.
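The same works from the command line (this assumes Tesseract 4.x; older 3.x versions spell the flag -psm, and the image name is a placeholder):
tesseract single_char.png stdout --psm 10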
You need to set Tesseract's page segmentation mode to "single character."
Have you seen this?
https://code.google.com/p/tesseract-ocr/issues/detail?id=581
The bug list shows it as "no longer an issue".
Be sure to have high resolution images.
If you are resizing the image, be sure to keep a high DPI and don't resize too small
Be sure to train your tesseract system
Use the baseApi.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"); call before the Tesseract init
Also, you may look into which font to use with OCR
I am on Linux and I want to fetch an HTML page from the web and then output it in the terminal. I found that html2text essentially does the job, but it converts my HTML to plain text, whereas I would rather convert it into ANSI-colored text in the spirit of ls --color=auto. Any ideas?
The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
The above example produces a text version of http://example.com/ using 16 colors. The output format can be customized further depending on need.
The -dump option enables the dump mode, which just prints the whole page as text, with the link destinations printed out in a kind of "email-style".
-dump-color-mode 1 enables coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator, this can go up to ~16 million colors (true color). The values are documented in elinks.conf(5).
The colors used for the output can be configured as well, likewise documented in elinks.conf(5).
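The command-line switch corresponds to a configuration option, so the same effect can be made permanent in ~/.elinks/elinks.conf (a sketch; see elinks.conf(5) for the authoritative option names):
set document.dump.color_mode = 1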
The w3m browser supports coloring the output text.
You can use the lynx browser to output the text with this command:
lynx -dump http://example.com