I am on Linux and I want to fetch an HTML page from the web and then output it on the terminal. I found out that html2text essentially does the job, but it converts my HTML to plain text, whereas I would rather convert it into ANSI-colored text in the spirit of ls --color=auto. Any ideas?
The elinks browser can do that. Other text browsers such as lynx or w3m might be able to do that as well.
elinks -dump -dump-color-mode 1 http://example.com/
The above example provides a text version of http://example.com/ using 16 colors. The output format can be customized further depending on your needs.
The -dump option enables dump mode, which prints the whole page as plain text, with the link destinations listed in a kind of "email style".
-dump-color-mode 1 enables coloring of the output using the 16 basic terminal colors. Depending on the value and the capabilities of the terminal emulator, this can go up to ~16 million colors (true color). The possible values are documented in elinks.conf(5).
The colors used for the output can be configured as well; this is also documented in elinks.conf(5).
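If your terminal supports more colors, you can ask for more than the basic 16; a sketch, where mode 3 should give 256 colors on most builds (the exact values depend on your ELinks build, see elinks.conf(5)):
elinks -dump -dump-color-mode 3 http://example.com/
To keep the colors when paging a long page, pipe the dump through less -R so the ANSI escape sequences are passed through:
elinks -dump -dump-color-mode 1 http://example.com/ | less -R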
The w3m browser supports coloring the output text.
You can use the lynx browser to output the page as text with this command.
lynx -dump http://example.com
I have a bizarre problem: somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it, I identified it as U+FEFF, the 'zero width no-break space'. It shows up as a non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it; no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
You don't see the character in the editor because editors that understand it tend to hide it. U+FEFF (or U+FFFE when byte-swapped) is the byte order mark (BOM): it tells a program reading a Unicode file in which order the bytes of multi-byte characters are stored, and Windows tools in particular like to prepend it to files.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without a BOM. If your editor can't do that, you'll either have to switch editors (sadly) or use something like a hex editor that lets you see how the file really looks.
On googling, it seems that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The BOM is always the very first character in a text file, and editors that support it will, as mentioned, not show it to you at all.
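If you would rather not open Vim interactively, the same fix can be run from the shell; a minimal sketch, with yourfile.php standing in for the affected file:
vim -c 'set nobomb' -c 'wq' yourfile.php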
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in the file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
It's a byte order mark. Under Mac OS X, open a terminal window, go to your sources, and search for the UTF-8 encoding of the BOM (the bytes EF BB BF):
grep -rn $'\xEF\xBB\xBF' *
This will show you the line numbers and file names containing a BOM.
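To check a single file, you can also look at its first three bytes (suspect.php is just a placeholder name); if the output starts with efbb bf, the file begins with a UTF-8 BOM:
head -c 3 suspect.php | xxd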
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" in WikiPedia.
I know it is a little late to answer this question, but I am adding how to change the encoding in Visual Studio; I hope it will be helpful for someone reading this later:
Go to File -> Save (your filename) as...
In the Save As dialog, click the small arrow next to the Save button -> click Save with Encoding...
Click Yes in the "Do you want to replace the existing file?" dialog
Finally, select e.g. Unicode (UTF-8 without signature) - that removes the BOM
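If you would rather strip the BOM from the shell instead of an editor, GNU sed can delete the three bytes of a UTF-8 BOM in place. A sketch, assuming GNU sed (the stock BSD sed on Mac OS X does not understand \xNN escapes) and index.php as a placeholder file name:
sed -i '1s/^\xEF\xBB\xBF//' index.php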
By default, G-WAN strips whitespace from HTML files to minimize their size.
What's the best way to let preformatted text inside a <pre> tag pass through untouched?
@Richard Heath:
Interesting -- I'm using a vanilla installation of G-WAN with the <pre> block starting like this: <pre class="fragment">.
See the sample of Doxygen-generated documentation.
It is hosted on a vanilla installation of G-WAN.
Update:
As a temporary workaround (not a clean fix), I've changed the startup to look like this:
START=""
...
nohup ./$NAME $START &>/dev/null &
Later I will try to write a handler to filter the response.
Updated sample files for comparison:
./gwan -d
http://alex4u2nv.com/test/test.html
nohup ./gwan &> /dev/null &
http://alex4u2nv.com/docs/test.html
If you look at this link, which shows both preformatted source code and text, then it is clear that G-WAN v3.3 respects the <pre> tag even when running in daemon mode.
If you have an example of a broken page, then publish the broken text rather than such a huge HTML page.
Further, in the link you provide, the text is NOT broken, but there is a client-side script that blocks the browser (one has to disable JavaScript to see the whole page).
I want to indent the text in the article I am writing in DocBook 5. I also need to add colors to my text. Is that possible? If so, how? I tried indenting as follows, but it was not visible in the HTML output (here I tried to align the text "Kerfun" to the center). I have no idea how to change the colour. Can someone please tell me how? Where have I gone wrong?
<dbk:para text-indent="center">Kerfun</dbk:para>
<dbk:para text-indent="center">
<dbk:emphasis role="bold">Fadiah</dbk:emphasis>
</dbk:para>
You haven't specified your OS or toolchain.
To format your xml:
I'd suggest using the "xmllint --format" command
To validate your xml:
The same command can be used to ensure your document is valid against the DocBook schema.
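For example (article.xml is a placeholder; the RELAX NG schema can also be a local copy instead of the docbook.org URL):
xmllint --format article.xml > article-pretty.xml
xmllint --noout --relaxng http://docbook.org/xml/5.0/rng/docbook.rng article.xml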
To colorize your xml:
That very much depends on what editor you use. Personally, I'm a fan of gvim, which has XML highlighting enabled by default.
Update
As stated, I'm not a Windows guy, but 2 minutes of googling led me to the following:
Notepad++ appears to have an XML plugin. Source was the following link
This has nothing to do with rendering an individual image on a webpage. The goal is to render the entire webpage and save that as a screenshot, so I can show a thumbnail of an HTML file to the user. The HTML file I will be screenshotting will be an HTML part in a MIME email message - ideally I would like to snapshot the entire MIME file, but if I can do this to an HTML file I'll be in good shape.
An API would be ideal but an executable is also good.
You need html2ps, and convert from the ImageMagick package:
html2ps index.html index.ps
convert index.ps index.png
The second command produces one PNG per page for a long HTML page; the page layout was done by html2ps.
I found a program evince-thumbnailer, which was reported as:
apropos postscript | grep -i png
evince-thumbnailer (1) - create png thumbnails from PostScript and PDF documents
but it didn't work in a simple first test.
If you would like to combine multiple pages into one larger image, convert will surely help you.
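For instance, if convert produced one file per page (index-0.png, index-1.png, ...), you can stack them vertically into a single image:
convert index-*.png -append combined.png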
Now I see that convert operates on HTML directly, so
convert index.html index.png
should work too. I don't see a difference in the output, and the sizes of the images are nearly identical.
If you have a multipart MIME email, you typically have a mail header, maybe some pre-HTML text, the HTML, and maybe attachments.
You can extract the HTML and format it separately - but rendering it embedded might not be that easy.
Here is a file I tested, which was from Apr. 14, so I extract the one mail from the mailfolder:
sed -n "/From - Sat Apr 14/,/From -/p" /home/stefan/.mozilla-thunderbird/k2jbztqu.default/Mail/Local\ Folders-1/Archives.sbd/sample | \
sed -n '/<html>/,/<\/html>/p' | wkhtmltopdf - - > sample.pdf
The second sed then extracts just the HTML part of that.
wkhtmltopdf needs "- -" to read from stdin and write to stdout. The PDF is rendered, but I don't know how to integrate it into your workflow.
You can replace wkhtml ... with
convert - sample.jpg
I'm going with wkhtmltoimage. This worked once xvfb was set up correctly. The PostScript suggestion did not render correctly, and we need an image, not a PDF.
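For reference, a rough sketch of that setup, in case wkhtmltoimage was built without its own graphics layer and needs a virtual X server via xvfb-run:
xvfb-run --server-args="-screen 0 1024x768x24" wkhtmltoimage http://example.com/ screenshot.png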
Is there a good way to convert html to PDF from bash with Unicode (UTF-8) support?
I would expect the same result as if I were to use a PDF printer and print a page from Firefox.
Usage Example:
curl http://www.wikipedia.org/ | html2pdf_bash_command > /tmp/wikipedia.org.pdf
You can do something with xvfb and a browser, or use wkhtmltopdf, a small Qt-based component. Also, if you have a full GNOME installation in your environment, you can use gnome-web-print.
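With wkhtmltopdf, the usage example above would look roughly like this (a sketch; --encoding sets the text encoding used when the page does not declare one):
curl -s http://www.wikipedia.org/ | wkhtmltopdf --encoding utf-8 - /tmp/wikipedia.org.pdf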