Getting HTML source or rich text from the X clipboard

How can rich text or HTML source code be obtained from the X clipboard? For example, if you copy some text from a web browser and paste it into KompoZer, it pastes as HTML, with links etc. preserved. However, xclip -o for the same selection just outputs plain text, reformatted in a way similar to that of elinks -dump. I'd like to pull the HTML out and into a text editor (specifically vim).
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses. The X clipboard API is to me yet a mysterious beast; any tips on hacking something up to pull this information are most welcome. My language of choice these days is Python, but pretty much anything is okay.

To complement @rkhayrov's answer, there exists a command for that already: xclip. More exactly, a patch adding this capability was merged into xclip in 2010, but went unreleased for years. So, assuming your OS, like Debian, ships the subversion head of xclip (2019 edit: version 0.13 with those changes was eventually released in 2016, and pulled into Debian in January 2019):
To list the targets for the CLIPBOARD selection:
$ xclip -selection clipboard -o -t TARGETS
TIMESTAMP
TARGETS
MULTIPLE
SAVE_TARGETS
text/html
text/_moz_htmlcontext
text/_moz_htmlinfo
UTF8_STRING
COMPOUND_TEXT
TEXT
STRING
text/x-moz-url-priv
To select a particular target:
$ xclip -selection clipboard -o -t text/html
 rkhayrov
$ xclip -selection clipboard -o -t UTF8_STRING
rkhayrov
$ xclip -selection clipboard -o -t TIMESTAMP
684176350
And xclip can also set and own a selection (-i instead of -o).
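When setting a selection, -i also accepts an explicit target, so you can put an HTML fragment on the clipboard yourself (a minimal sketch; the fragment is just an illustration):
$ echo '<b>hello</b>' | xclip -selection clipboard -t text/html -i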

In X11 you have to communicate with the selection owner, ask about supported formats, and then request data in the specific format. I think the easiest way to do this is using existing windowing toolkits, e.g. with Python and GTK:
#!/usr/bin/python
import glib, gtk

def test_clipboard():
    clipboard = gtk.Clipboard()
    targets = clipboard.wait_for_targets()
    print "Targets available:", ", ".join(map(str, targets))
    for target in targets:
        print "Trying '%s'..." % str(target)
        contents = clipboard.wait_for_contents(target)
        if contents:
            print contents.data

def main():
    mainloop = glib.MainLoop()

    def cb():
        test_clipboard()
        mainloop.quit()

    glib.idle_add(cb)
    mainloop.run()

if __name__ == "__main__":
    main()
Output will look like this:
$ ./clipboard.py
Targets available: TIMESTAMP, TARGETS, MULTIPLE, text/html, text/_moz_htmlcontext, text/_moz_htmlinfo, UTF8_STRING, COMPOUND_TEXT, TEXT, STRING, text/x-moz-url-priv
...
Trying 'text/html'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/_moz_htmlcontext'...
<html><body class="question-page"><div class="container"><div id="content"><div id="mainbar"><div id="question"><table><tbody><tr><td class="postcell"><div><div class="post-text"><p></p></div></div></td></tr></tbody></table></div></div></div></div></body></html>
...
Trying 'STRING'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/x-moz-url-priv'...
http://stackoverflow.com/questions/3261379/getting-html-source-or-rich-text-from-the-x-clipboard

Extending the ideas from Stephane Chazelas, you can:
Copy from the formatted source.
Run this command to extract the HTML from the clipboard and then (with a pipe |) put that HTML source back on the clipboard as plain text, again using the same xclip:
xclip -selection clipboard -o -t text/html | xclip -selection clipboard
Next, when you paste with Ctrl+v, it will paste the HTML source.
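To double-check the result, you can list the clipboard's targets again (the same command used earlier on this page); after the round-trip the markup is offered as plain text:
$ xclip -selection clipboard -o -t TARGETS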
Going further, you can make it a shortcut, so that you don't have to open the terminal and run the exact command each time. ✨
To do that:
Open the settings for your OS (in my case it's Ubuntu)
Find the section for the Keyboard
Then find the section for shortcuts
Create a new shortcut
Set a Name, e.g.: Copy as HTML
Then as the command for the shortcut, put:
bash -c "xclip -selection clipboard -o -t text/html | xclip -selection clipboard"
Note: it's the same command as above, but wrapped in an inline Bash script. This is necessary to be able to use the | (pipe) to send the output of one command as input to the next.
Set the shortcut to whatever combination you want, preferably not overwriting another shortcut you use. In my case, I set it to: Ctrl+Shift+c
After this, you can copy some formatted text as normal with: Ctrl+c
And then, before pasting it, convert it to HTML with: Ctrl+Shift+c
Next, when you paste it with: Ctrl+v, it will paste the contents as HTML. 🧙✨

Related

Is it possible to directly 'convert' a google drive image url in Imagemagick?

I have had a good search around for this, but cannot find a concrete answer. I have been trying to run the convert command against a url for an image I have stored on google drive, in a public folder.
I have been able to get this to work if I wget the url with -qO- and pipe this to convert, e.g.
wget 'https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1' -qO- | convert - -resize 100x100 MGFIN01.png
Ideally I would prefer to be able to directly run the url through convert:
convert https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1 -resize 100x100 MGFIN01.png
With the ultimate intention of creating an HTML image map like: http://www.imagemagick.org/Usage/montage/#html, which requires a list of URLs and names (I can probably work this out once I have resolved the URL part).
I am on Ubuntu with imagemagick 6.9. I see in delegates.xml that I have this:
<delegate decode="https" command=""curl" -s -k -L -o "%o" "https:%M""/>
I also tried the download with curl and those options, and that worked as well.
This works for me on ImageMagick 6.9.10.97 Q16 Mac OSX. Put the URL in double quotes. You may also have to edit your policy.xml file to give permission for HTTPS. See policy.xml at https://imagemagick.org/script/resources.php
convert "https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1" -resize 100x100 MGFIN01.png
Just to give a tidier response than possible in comments:
Open policy.xml
sudo nano /etc/ImageMagick-6/policy.xml
Scroll down to find:
<policy domain="delegate" rights="none" pattern="HTTPS" />
Edit this to show:
<policy domain="delegate" rights="read" pattern="https" />
Save (CTRL+X, Y)
Run the convert command again. Tada.
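To confirm the new policy is active, you can ask ImageMagick to print its current policies (a quick sanity check, not part of the original steps):
convert -list policy | grep -i -A 2 delegate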

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions, where contestants had to segment and label various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command-line tool (as in the examples above). @nguyenq has pointed me to the API reference; unfortunately I have no C++ experience. If the only solution is to use the API, please can you provide a quick Python example?
Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10, using the Homebrew package manager. I use Wine to run the Windows executables.
Overview
Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid
Code
brew install wine # takes a little while >10m
brew install gs # only for generating a tif example. Not required, you can use Preview
brew install wget # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be omitted and you can do the conversion to tiff with Preview
gs \
-o paper-%d.tif \
-sDEVICE=tiff24nc \
-r300x300 \
paper.pdf
cd ~/Downloads
dl=~  # base directory used in the paths below
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe" \
    -inp-img "$dl/Downloads/paper-3.tif" \
    -out-xml "$dl/Downloads/paper-3-tool.xml" \
    -rec-mode layout >> "$ttptool/log.txt"

# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
Results
Document with overlays (rollover to see text and type)
Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
Unfortunately, I kept getting null pointers.
Could not convert to target XML schema format.
java.lang.NullPointerException
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.
If you are familiar with Python, you can directly use the tesserocr library, which is a nice Python wrapper around the C++ API. Here is a code snippet that draws polygons at block level using PIL:
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM

filename = 'myimage.tiff'  # path to the page image to analyse

img = Image.open(filename)
results = []
with PyTessBaseAPI() as api:
    api.SetImage(img)
    api.SetPageSegMode(PSM.AUTO_ONLY)
    iterator = api.AnalyseLayout()
    for w in iterate_level(iterator, RIL.BLOCK):
        if w is not None:
            results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))

draw = ImageDraw.Draw(img)
for block_type, poly in results:
    # you can define a color per block type (see tesserocr.PT for block types list)
    draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)
With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.
More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
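If you only need the raw coordinates rather than a visualisation, you can also pull the bbox values straight out of the generated .hocr file with standard tools (a rough sketch; file names are illustrative):
tesseract myimage.tiff basefilename hocr
grep -o 'bbox [0-9]* [0-9]* [0-9]* [0-9]*' basefilename.hocr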
Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.
hOCR output with individual character boxes has been available in Tesseract since 4.1.
Once you have verified the installation, use:
tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

Is there something like a "CSS selector" or XPath grep?

I need to find all places in a bunch of HTML files that lie within the following structure (CSS):
div.a ul.b
or XPath:
//div[@class="a"]//ul[@class="b"]
grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places therein) that match this criterion? I.e., one that returns file names if the file matches a certain HTML or XML structure.
Try this:
Install http://www.w3.org/Tools/HTML-XML-utils/.
Ubuntu: aptitude install html-xml-utils
MacOS: brew install html-xml-utils
Save a web page (call it filename.html).
Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"
Where "label.black" is the CSS selector that uniquely identifies the name of the HTML element. Write a helper script named cssgrep:
#!/bin/bash
# Ignore errors, write the results to standard output.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"
You can then run:
cssgrep filename.html "label.black"
This will generate the content for all HTML label elements of the class black.
The -l 240 argument is important to avoid spurious line-breaks in the output. For example, if <label class="black">Text to \nextract</label> is the input, then -l 240 will reformat the HTML to <label class="black">Text to extract</label>, wrapping lines only at column 240, which simplifies parsing. Extending out to 1024 or beyond is also possible.
See also:
https://superuser.com/a/529024/9067 - similar question
https://gist.github.com/Boldewyn/4473790 - wrapper script
I have built a command-line tool with Node.js which does just this. You enter a CSS selector and it will search through all of the HTML files in the directory and tell you which files have matches for that selector.
You will need to install Element Finder, cd into the directory you want to search, and then run:
elfinder -s "div.a ul.b"
For more info please see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/
There are two tools:
pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.
htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.
Examples:
$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html
$ pup --color 'title' < robots.html
<title>
Robots exclusion standard - Wikipedia
</title>
$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia
Per Nat's answer here:
How to parse XML in Bash?
Command-line tools that can be called from shell scripts include:
4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet
xpath - command-line wrapper around Perl's XPath library
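For instance, with XMLStarlet the XPath from the question could be run like this (a sketch, assuming well-formed XML input):
xmlstarlet sel -t -c '//div[@class="a"]//ul[@class="b"]' file.xml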

Solution to programmatically generate an export script from a directory hierarchy

I often have the following scenario: in order to reproduce a bug for reporting, I create a small sample project, sometimes a Maven multi-module project. So there may be a hierarchy of directories, and it will usually contain a few small text files. Standard procedure would of course be to create a zip file and send that. But on some mailing lists attachments are not allowed, so I am looking for a way to automatically create an installation script that I can post to such mailing lists.
Basically I would be happy with a Unix-flavor-only solution that creates mkdir statements to create directories and >> statements to write the file contents. (Actually, apart from the relative path delimiters, the Windows and Unix versions can probably be identical.)
Does such a tool exist somewhere? If not, I'll probably write one in java, but I'm happy to accept solutions in all kinds of languages.
(The tool could run under windows or unix, but the target platform for the generated scripts should be either unix or configurable)
I think you're looking for shar, which creates a shell archive (shell script that when run produces a given directory hierarchy). It is available on most systems; you can use GNU sharutils if you don't already have it.
Normal usage for packing up a directory tree would be something like:
shar `find somedirectory -print` > archive.sh
If you're using GNU sharutils and want to create "vanilla" archives which use only the most portable of shell builtins, mkdir, and sed, then you should invoke it as shar -V. You can remove some more baggage from the scripts by using -xQ: -x to remove checks for existing files, and -Q to remove verbose output from the archive.
shar -VxQ `find somedir -print` > archive.sh
If you really want something even simpler, here's a dirt-simple version of shar as a shell script. It takes filenames on standard input instead of arguments for simplicity and to be a little more robust.
#!/bin/sh
# Read file names from standard input; emit a script that recreates them.
while read filename
do
    if test -d "$filename"
    then
        echo "mkdir -p '$filename'"
    else
        # Quote the delimiter so the generated archive does not expand
        # $variables or `backticks` found in the file contents.
        echo "sed 's/^X//' <<'EOF' > '$filename'"
        sed 's/^/X/' < "$filename"
        echo 'EOF'
    fi
done
Invoke as:
find somedir -print | simpleshar > archive.sh
You still need to invoke sed, as you need some way of ensuring that no lines in the here document begin with the delimiter, which would close the document and cause later lines to be interpreted as part of the script. I can't think of any really good way to solve the quoting problem using only shell builtins, so you will have to rely on sed (which is standard on any Unix-like system, and has been practically forever).
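For a tree containing one small file, the generated archive would look something like this (illustrative):
mkdir -p 'somedir'
sed 's/^X//' <<'EOF' > 'somedir/hello.txt'
Xhello world
EOF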
If your problem is filters that hate non-text files:
In times long forgotten, we used uuencode to get past 8-bit-eating relays. Is that a way to get past attachment-eating mailboxes these days?
So why not zip and uuencode?
(or base64, which is its younger cousin)
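For example (a sketch; myproject stands in for your sample tree):
zip -r - myproject | base64 > myproject.b64
# the recipient reverses it:
base64 -d myproject.b64 > myproject.zip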

How to configure firefox to run emacsclientw on certain links?

I've got a Perl script that groks a bunch of log files looking for "interesting" lines, for some definition of interesting. It generates an HTML file which consists of a table whose columns are a timestamp, a filename/linenum reference and the "interesting" bit. What I'd love to do is have the filename/linenum be an actual link that will bring up that file with the cursor positioned on that line number, in emacs.
emacsclientw will allow such a thing (e.g. emacsclientw +60 foo.log), but I don't know what kind of URL/URI to construct that will let Firefox call out to emacsclientw. The original HTML file will be local, so there's no problem there.
Should I define my own MIME type and hook in that way?
Firefox version is 3.5 and I'm running Windows, in case any of that matters. Thanks!
Go to the about:config page in Firefox. Add a new string:
network.protocol-handler.app.emacs
value: the path to a script that parses the URL without the protocol (what's after emacs://) and then calls emacsclient with the proper arguments.
You can't just put the path of emacsclient there, because everything after the protocol is passed as one argument to the executable, so your +60 foo.log would be treated as a new file named that way.
But you could easily imagine something like emacs:///path/to/your/file/LINENUM and have a little script that removes the final / and number and calls emacsclient with the number and the file :-)
EDIT: I could do that in bash if you want, but I don't know how to do that with the Windows "shell" or whatever it is called.
EDIT2: I was wrong about something: the protocol is passed in the arg string too!
Here is a little bash script that I just made for myself. BTW, thanks for the idea :-D
#!/bin/bash
ARG=${1##emacs://}   # strip the protocol prefix
LINE=${ARG##*/}      # trailing component: the line number
FILE=${ARG%/*}       # everything before it: the file path
if wmctrl -l | grep emacs@romuald &>/dev/null; then # if there's already an emacs frame
    ARG="" # then just open the file in the existing emacs frame
else
    ARG="-c" # else create a new frame
fi
emacsclient $ARG -n +$LINE "$FILE"
exit $?
and my network.protocol-handler.app.emacs in my iceweasel (firefox) is /home/p4bl0/bin/ffemacsclient. It works just fine!
And yes, my laptop's name is romuald ^^.
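You can also test such a handler script directly from a terminal, without going through Firefox (using p4bl0's example path from above):
/home/p4bl0/bin/ffemacsclient 'emacs:///home/p4bl0/somefile/42'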
Thanks for the pointer, p4bl0. Unfortunately, that only works on a real OS; Windows uses a completely different method. See http://kb.mozillazine.org/Register_protocol for more info.
But, you certainly provided me the start I needed, so thank you very, very much!
Here's the solution for Windows:
First you need to set up the registry correctly to handle this new URL type. For that, save the following to a file, edit it to suit your environment, save it and double click on it:
Windows Registry Editor Version 5.00
[HKEY_CLASSES_ROOT\emacs]
#="URL:Emacs Protocol"
"URL Protocol"=""
[HKEY_CLASSES_ROOT\emacs\shell]
[HKEY_CLASSES_ROOT\emacs\shell\open]
[HKEY_CLASSES_ROOT\emacs\shell\open\command]
#="\"c:\\product\\emacs\\bin\\emacsclientw.exe\" --no-wait -e \"(emacs-uri-handler \\\"%1\\\")\""
This is not as robust as p4bl0's shell script, because it does not make sure that Emacs is running first. Then add the following to your .emacs file:
(defun emacs-uri-handler (uri)
  "Handles emacs URIs in the form: emacs:///path/to/file/LINENUM"
  (save-match-data
    (if (string-match "emacs://\\(.*\\)/\\([0-9]+\\)$" uri)
        (let ((filename (match-string 1 uri))
              (linenum (match-string 2 uri)))
          (with-current-buffer (find-file filename)
            (goto-line (string-to-number linenum))))
      (beep)
      (message "Unable to parse the URI <%s>" uri))))
The above code will not check to make sure the file exists, and the error handling is rudimentary at best. But it works!
Then create an HTML file that has links like the following:
<a href="emacs://c:/temp/my.log/60">file: c:/temp/my.log, line: 60</a>
and then click on the link.
Post Script:
I recently switched to Linux (Ubuntu 9.10) and here's what I did for that OS:
$ gconftool -s /desktop/gnome/url-handlers/emacs/command '/usr/bin/emacsclient --no-wait -e "(emacs-uri-handler \"%s\")"' --type String
$ gconftool -s /desktop/gnome/url-handlers/emacs/enabled --type Boolean true
Using the same emacs-uri-handler from above.
Might be a great reason to write your first FF plugin ;)