Is it possible to directly 'convert' a google drive image url in Imagemagick? - google-drive-api

I have had a good search around for this, but cannot find a concrete answer. I have been trying to run the convert command against a url for an image I have stored on google drive, in a public folder.
I have been able to get this to work if I wget the URL with -qO- and pipe it to convert, e.g.
wget 'https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1' -qO- | convert - -resize 100x100 MGFIN01.png
Ideally I would prefer to be able to directly run the url through convert:
convert https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1 -resize 100x100 MGFIN01.png
My ultimate intention is to create an HTML image map like http://www.imagemagick.org/Usage/montage/#html, which will require a list of URLs and names (I can probably work that part out once I have resolved the URL issue).
I am on Ubuntu with imagemagick 6.9. I see in delegates.xml that I have this:
<delegate decode="https" command="&quot;curl&quot; -s -k -L -o &quot;%o&quot; &quot;https:%M&quot;"/>
I also tried the download with curl and the same options, and that worked too.

This works for me on ImageMagick 6.9.10.97 Q16 Mac OSX. Put the URL in double quotes. You may also have to edit your policy.xml file to give permission for HTTPS. See policy.xml at https://imagemagick.org/script/resources.php
convert "https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1" -resize 100x100 MGFIN01.png

Just to give a tidier response than possible in comments:
Open policy.xml
sudo nano /etc/ImageMagick-6/policy.xml
Scroll down to find:
<policy domain="delegate" rights="none" pattern="HTTPS" />
Edit this to show:
<policy domain="delegate" rights="read" pattern="https" />
Save (CTRL+X, Y)
Run the convert command again. Tada.
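For the image-map goal mentioned in the question, once the HTTPS delegate is allowed, a small loop can pull the thumbnails and montage can build the page. This is only a rough sketch: urls.txt is a hypothetical file with one "name url" pair per line, and the final step relies on the HTML output approach described on the linked Usage page.
while read -r name url; do
  convert "$url" -resize 100x100 "$name.png"   # each url is a hypothetical Google Drive export link
done < urls.txt
# per the IM Usage montage page linked above, an index.html image map can then be generated:
montage -label '%f' ./*.png -tile 5x -geometry +2+2 index.html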

Related

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions, where contestants had to segment various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command line tool (as in the examples above). @nguyenq has pointed me to the API reference; unfortunately I have no C++ experience. If the only solution is to use the API, please could you provide a quick Python example?
Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.
Overview
Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid
Code
brew install wine # takes a little while >10m
brew install gs # only for generating a tif example. Not required, you can use Preview
brew install wget # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be omitted and you can do the conversion to tiff with Preview
gs \
-o paper-%d.tif \
-sDEVICE=tiff24nc \
-r300x300 \
paper.pdf
cd ~/Downloads
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
dl=~   # home directory that contains the Downloads folder (used below)
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe" \
  -inp-img "$dl/Downloads/paper-3.tif" \
  -out-xml "$dl/Downloads/paper-3-tool.xml" \
  -rec-mode layout >> "$ttptool/log.txt"
# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
Results
Document with overlays (rollover to see text and type)
Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
# Note that the pvtool does accept HOCR xml as input, but it ignores the region type
brew install tesseract --devel # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
Unfortunately, I kept getting null pointers.
Could not convert to target XML schema format.
java.lang.NullPointerException
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.
If you are familiar with Python, you can directly use the tesserocr library, which is a nice Python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM

img = Image.open(filename)
results = []
with PyTessBaseAPI() as api:
    api.SetImage(img)
    api.SetPageSegMode(PSM.AUTO_ONLY)
    iterator = api.AnalyseLayout()
    for w in iterate_level(iterator, RIL.BLOCK):
        if w is not None:
            results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))

draw = ImageDraw.Draw(img)
for block_type, poly in results:
    # you can define a color per block type (see tesserocr.PT for block types list)
    draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)
With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.
More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
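As a concrete (and hedged) illustration of the command above, with placeholder file names, you can generate the hOCR output and then count the box-carrying elements it contains:
tesseract myimage.tiff out -l eng hocr          # myimage.tiff and out are placeholders
# hOCR labels regions with these classes; each element carries a bbox in its title attribute
grep -oE "class='(ocr_carea|ocr_par|ocr_line|ocrx_word)'" out.hocr | sort | uniq -c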
Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.
HOCR output with individual character boxes has been available in Tesseract since 4.1.
Once it is installed, use:
tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

How to download HTTP directory with all files and sub-directories as they appear on the online files/folders list?

There is an online HTTP directory that I have access to. I have tried to download all sub-directories and files via wget. But, the problem is that when wget downloads sub-directories it downloads the index.html file which contains the list of files in that directory without downloading the files themselves.
Is there a way to download the sub-directories and files without depth limit (as if the directory I want to download is just a folder which I want to copy to my computer).
Solution:
wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
Explanation:
It will download all files and subfolders in the ddd directory:
-r : recursively
-np : not going to upper directories, like ccc/…
-nH : not saving files to the hostname folder
--cut-dirs=3 : saving to ddd by omitting the first 3 folders aaa, bbb, ccc
-R index.html : excluding index.html files
Reference: http://bmwieczorek.wordpress.com/2008/10/01/wget-recursively-download-all-files-from-certain-directory-listed-by-apache/
I was able to get this to work thanks to this post utilizing VisualWGet. It worked great for me. The important part seems to be to check the -recursive flag (see image).
I also found that the -no-parent flag is important, otherwise it will try to download everything.
You can use lftp, the Swiss army knife of downloading. If you have bigger files you can add --use-pget-n=10 to the command:
lftp -c 'mirror --parallel=100 https://example.com/files/ ;exit'
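Putting both suggestions together (the URL and the local target directory are placeholders), the call might look like:
lftp -c 'mirror --parallel=10 --use-pget-n=10 https://example.com/files/ ./files ; exit'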
wget -r -np -nH --cut-dirs=3 -R index.html http://hostname/aaa/bbb/ccc/ddd/
From man wget
‘-r’
‘--recursive’
Turn on recursive retrieving. See Recursive Download, for more details. The default maximum depth is 5.
‘-np’
‘--no-parent’
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded. See Directory-Based Limits, for more details.
‘-nH’
‘--no-host-directories’
Disable generation of host-prefixed directories. By default, invoking Wget with ‘-r http://fly.srk.fer.hr/’ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.
‘--cut-dirs=number’
Ignore number directory components. This is useful for getting a fine-grained control over the directory where recursive retrieval will be saved.
Take, for example, the directory at ‘ftp://ftp.xemacs.org/pub/xemacs/’. If you retrieve it with ‘-r’, it will be saved locally under ftp.xemacs.org/pub/xemacs/. While the ‘-nH’ option can remove the ftp.xemacs.org/ part, you are still stuck with pub/xemacs. This is where ‘--cut-dirs’ comes in handy; it makes Wget not “see” number remote directory components. Here are several examples of how ‘--cut-dirs’ option works.
No options -> ftp.xemacs.org/pub/xemacs/
-nH -> pub/xemacs/
-nH --cut-dirs=1 -> xemacs/
-nH --cut-dirs=2 -> .
--cut-dirs=1 -> ftp.xemacs.org/xemacs/
...
If you just want to get rid of the directory structure, this option is similar to a combination of ‘-nd’ and ‘-P’. However, unlike ‘-nd’, ‘--cut-dirs’ does not lose with subdirectories—for instance, with ‘-nH --cut-dirs=1’, a beta/ subdirectory will be placed to xemacs/beta, as one would expect.
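Tying the excerpt back to the answer's command, a hedged end-to-end example for the xemacs directory used above would be:
# mirrors ftp://ftp.xemacs.org/pub/xemacs/ into a local xemacs/ directory
wget -r -np -nH --cut-dirs=1 ftp://ftp.xemacs.org/pub/xemacs/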
No Software or Plugin required!
(only usable if you don't need recursive depth)
Use a bookmarklet. Drag this link into your bookmarks, then edit it and paste in this code:
javascript:(function(){ var l=document.links; var ext=prompt("Select extension for download (all links containing it will be downloaded).", ".mp3"); for(var i=0; i<l.length; i++) { if(l[i].href.indexOf(ext) !== -1){ l[i].setAttribute("download",l[i].text); l[i].click(); } } })();
and go on page (from where you want to download files), and click that bookmarklet.
wget is an invaluable resource and something I use myself. However, sometimes there are characters in the address that wget identifies as syntax errors. I'm sure there is a fix for that, but as this question did not ask specifically about wget, I thought I would offer an alternative for those people who will undoubtedly stumble upon this page looking for a quick fix with no learning curve required.
There are a few browser extensions that can do this, but most require installing download managers, which aren't always free, tend to be an eyesore, and use a lot of resources. Here's one that has none of these drawbacks:
"Download Master" is an extension for Google Chrome that works great for downloading from directories. You can choose to filter which file-types to download, or download the entire directory.
https://chrome.google.com/webstore/detail/download-master/dljdacfojgikogldjffnkdcielnklkce
For an up-to-date feature list and other information, visit the project page on the developer's blog:
http://monadownloadmaster.blogspot.com/
You can use this Firefox addon to download all files in HTTP Directory.
https://addons.mozilla.org/en-US/firefox/addon/http-directory-downloader/
wget generally works in this way, but some sites may have problems and it may create too many unnecessary html files. In order to make this work easier and to prevent unnecessary file creation, I am sharing my getwebfolder script, which is the first Linux script I wrote for myself. This script downloads all content of a web folder passed as a parameter.
When you try to download an open web folder with wget which contains more than one file, wget downloads a file named index.html. This file contains a file list of the web folder. My script converts the file names written in the index.html file to web addresses and downloads them cleanly with wget.
Tested on Ubuntu 18.04 and Kali Linux; it may work on other distros as well.
Usage :
extract getwebfolder file from zip file provided below
chmod +x getwebfolder (only for first time)
./getwebfolder webfolder_URL
such as ./getwebfolder http://example.com/example_folder/
Download Link
Details on blog
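Since the script itself is only linked above, here is a minimal sketch of the same idea, assuming a plain Apache-style index page; the URL and the href filtering are my assumptions, not the actual getwebfolder code:
url="http://example.com/example_folder/"
# list hrefs from the index page, skip sort links, absolute paths and the parent directory,
# then fetch each remaining file with wget
wget -qO- "$url" |
  grep -oE 'href="[^"]+"' |
  sed -e 's/^href="//' -e 's/"$//' |
  grep -vE '^(\?|/|\.\.)' |
  while read -r f; do
    wget "${url}${f}"
  done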

getting HTML source or rich text from the X clipboard

How can rich text or HTML source code be obtained from the X clipboard? For example, if you copy some text from a web browser and paste it into kompozer, it pastes as HTML, with links etc. preserved. However, xclip -o for the same selection just outputs plain text, reformatted in a way similar to that of elinks -dump. I'd like to pull the HTML out and into a text editor (specifically vim).
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses. The X clipboard API is to me yet a mysterious beast; any tips on hacking something up to pull this information are most welcome. My language of choice these days is Python, but pretty much anything is okay.
To complement @rkhayrov's answer, there already exists a command for that: xclip. Or more exactly, there is a patch that adds this to xclip; it was merged in 2010 but then went unreleased for years. So this assumes your OS, like Debian, ships the subversion head of xclip (2019 edit: version 0.13 with those changes was eventually released in 2016, and pulled into Debian in January 2019):
To list the targets for the CLIPBOARD selection:
$ xclip -selection clipboard -o -t TARGETS
TIMESTAMP
TARGETS
MULTIPLE
SAVE_TARGETS
text/html
text/_moz_htmlcontext
text/_moz_htmlinfo
UTF8_STRING
COMPOUND_TEXT
TEXT
STRING
text/x-moz-url-priv
To select a particular target:
$ xclip -selection clipboard -o -t text/html
 rkhayrov
$ xclip -selection clipboard -o -t UTF8_STRING
rkhayrov
$ xclip -selection clipboard -o -t TIMESTAMP
684176350
And xclip can also set and own a selection (-i instead of -o).
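For example, assuming a recent enough xclip (0.13+) honours -t on input as well as output, you can push a snippet onto the clipboard as text/html and read it back; the snippet itself is just an illustration:
echo '<b>hello</b>' | xclip -selection clipboard -t text/html -i
xclip -selection clipboard -o -t text/html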
In X11 you have to communicate with the selection owner, ask about supported formats, and then request data in the specific format. I think the easiest way to do this is using existing windowing toolkits. E.g. with Python and GTK:
#!/usr/bin/python
import glib, gtk

def test_clipboard():
    clipboard = gtk.Clipboard()
    targets = clipboard.wait_for_targets()
    print "Targets available:", ", ".join(map(str, targets))
    for target in targets:
        print "Trying '%s'..." % str(target)
        contents = clipboard.wait_for_contents(target)
        if contents:
            print contents.data

def main():
    mainloop = glib.MainLoop()
    def cb():
        test_clipboard()
        mainloop.quit()
    glib.idle_add(cb)
    mainloop.run()

if __name__ == "__main__":
    main()
Output will look like this:
$ ./clipboard.py
Targets available: TIMESTAMP, TARGETS, MULTIPLE, text/html, text/_moz_htmlcontext, text/_moz_htmlinfo, UTF8_STRING, COMPOUND_TEXT, TEXT, STRING, text/x-moz-url-priv
...
Trying 'text/html'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/_moz_htmlcontext'...
<html><body class="question-page"><div class="container"><div id="content"><div id="mainbar"><div id="question"><table><tbody><tr><td class="postcell"><div><div class="post-text"><p></p></div></div></td></tr></tbody></table></div></div></div></div></body></html>
...
Trying 'STRING'...
I asked the same question on superuser.com, because I was hoping there was a utility to do this, but I didn't get any informative responses.
Trying 'text/x-moz-url-priv'...
http://stackoverflow.com/questions/3261379/getting-html-source-or-rich-text-from-the-x-clipboard
Extending the ideas from Stephane Chazelas, you can:
Copy from the formatted source.
Run this command to extract from the clipboard, convert to HTML, and then (with a pipe |) put that HTML back in the clipboard, again using the same xclip:
xclip -selection clipboard -o -t text/html | xclip -selection clipboard
Next, when you paste with Ctrl+v, it will paste the HTML source.
Going further, you can make it a shortcut, so that you don't have to open the terminal and run the exact command each time. ✨
To do that:
Open the settings for your OS (in my case it's Ubuntu)
Find the section for the Keyboard
Then find the section for shortcuts
Create a new shortcut
Set a Name, e.g.: Copy as HTML
Then as the command for the shortcut, put:
bash -c "xclip -selection clipboard -o -t text/html | xclip -selection clipboard"
Note: notice that it's the same command as above, but put inside of an inline Bash script. This is necessary to be able to use the | (pipe) to send the output from one command as input to the next.
Set the shortcut to whatever combination you want, preferably not overwriting another shortcut you use. In my case, I set it to: Ctrl+Shift+c
After this, you can copy some formatted text as normally with: Ctrl+c
And then, before pasting it, convert it to HTML with: Ctrl+Shift+c
Next, when you paste it with: Ctrl+v, it will paste the contents as HTML. 🧙✨

curl: downloading from dynamic url

I'm trying to download an html file with curl in bash. Like this site:
http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=10S&subareasel=PHYSICS&idxcrs=0001B+++
When I download it manually, it works fine. However, when I try and run my script through crontab, the output html file is very small and just says "Object moved to here." with a broken link. Does this have something to do with the sparse environment the crontab commands run in? I found this question:
php ssl curl : object moved error
but i'm using bash, not php. What are the equivalent command line options or variables to set to fix this problem in bash?
(I want to do this with curl, not wget)
Edit: well, sometimes downloading the file manually (via an interactive shell) works, but sometimes it doesn't (I still get the "Object moved here" message). So it may not specifically be a problem with cron's environment, but with curl itself.
the cron entry:
* * * * * ~/.class/test.sh >> ~/.class/test_out 2>&1
test.sh:
#! /bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/sbin
cd ~/.class
course="physics 1b"
url="http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=10S<URL>subareasel=PHYSICS<URL>idxcrs=0001B+++"
curl "$url" -sLo "$course".html --max-redirs 5
Edit: Problem solved. The issue was the stray <URL> tags in the url. It was because I was generating the scripts with sed s,"<URL>",\""$url"\", template.txt > test.sh, and in a sed replacement an unescaped & stands for the text matched by the pattern, so sed replaced every & in the URL with <URL>. After fixing the url, curl works fine.
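As a stripped-down illustration of that sed pitfall (a sketch, not the original template): in a sed replacement, an unescaped & expands to whatever the pattern matched, so the ampersands have to be escaped first.
url='https://example.com/?a=1&b=2'
echo 'curl "<URL>"' | sed "s,<URL>,$url,"    # each & becomes the matched text: ...a=1<URL>b=2
esc=${url//&/\\&}                            # escape every & as \&
echo 'curl "<URL>"' | sed "s,<URL>,$esc,"    # now prints ...a=1&b=2 as intended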
You want the -L or --location option, which follows 300 series redirects. --max-redirs [n] will limit curl to n redirects.
It's curious that this works from an interactive shell. Are you fetching the same url? You could always try sourcing your environment scripts in your cron entry:
* * * * * . /home/you/.bashrc ; curl -L --max-redirs 5 ...
EDIT: the example url is somewhat different than the one in the script. $url in the script has an additional pair of <URL> tags. Replacing them with &, the conventional argument separators for GET requests, works for me.
Without seeing your script it's hard to guess what exactly is going on, but it's likely that it's an environment problem as you surmise.
One thing that often helps is to specify the full path to executables and files in your script.
If you show your script and crontab entry, we can be of more help.

How to configure firefox to run emacsclientw on certain links?

I've got a Perl script that groks a bunch of log files looking for "interesting" lines, for some definition of interesting. It generates an HTML file which consists of a table whose columns are a timestamp, a filename/linenum reference and the "interesting" bit. What I'd love to do is have the filename/linenum be an actual link that will bring up that file with the cursor positioned on that line number, in emacs.
emacsclientw will allow such a thing (e.g. emacsclientw +60 foo.log) but I don't know what kind of URL/URI to construct that will let Firefox call out to emacsclientw. The original HTML file will be local, so there's no problem there.
Should I define my own MIME type and hook in that way?
Firefox version is 3.5 and I'm running Windows, in case any of that matters. Thanks!
Go to the about:config page in Firefox. Add a new string:
network.protocol-handler.app.emacs
value: the path to a script that parses the URL without the protocol (what's after emacs://) and then calls emacsclient with the proper arguments.
You can't just put the path of emacsclient, because everything after the protocol is passed as one arg to the executable, so your +60 foo.log would be treated as a new file named that way.
But you could easily imagine something like emacs:///path/to/your/file/LINENUM and have a little script that removes the final / and number and calls emacsclient with the number and the file :-)
EDIT: I could do that in bash if you want, but I don't know how to do that with the Windows "shell" or whatever it is called.
EDIT2: I was wrong about something: the protocol is passed in the arg string too!
Here is a little bash script that i just made for me, BTW thanks for the idea :-D
#!/bin/bash
ARG=${1##emacs://}
LINE=${ARG##*/}
FILE=${ARG%/*}
if wmctrl -l | grep emacs#romuald &>/dev/null; then # if there's already an emacs frame
    ARG="" # then just open the file in the existing emacs frame
else
    ARG="-c" # else create a new frame
fi
emacsclient $ARG -n +$LINE "$FILE"
exit $?
and my network.protocol-handler.app.emacs in my iceweasel (firefox) is /home/p4bl0/bin/ffemacsclient. It works just fine!
And yes, my laptop's name is romuald ^^.
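For what it's worth, the handler script can presumably be tested straight from a terminal before wiring it into Firefox; the path and target file below are assumptions, and an Emacs server must already be running:
/home/p4bl0/bin/ffemacsclient 'emacs:///var/log/syslog/42'   # should open /var/log/syslog at line 42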
Thanks for the pointer, p4bl0. Unfortunately, that only works on a real OS; Windows uses a completely different method. See http://kb.mozillazine.org/Register_protocol for more info.
But, you certainly provided me the start I needed, so thank you very, very much!
Here's the solution for Windows:
First you need to set up the registry correctly to handle this new URL type. For that, save the following to a .reg file, edit it to suit your environment, and double-click it:
Windows Registry Editor Version 5.00
[HKEY_CLASSES_ROOT\emacs]
#="URL:Emacs Protocol"
"URL Protocol"=""
[HKEY_CLASSES_ROOT\emacs\shell]
[HKEY_CLASSES_ROOT\emacs\shell\open]
[HKEY_CLASSES_ROOT\emacs\shell\open\command]
#="\"c:\\product\\emacs\\bin\\emacsclientw.exe\" --no-wait -e \"(emacs-uri-handler \\\"%1\\\")\""
This is not as robust as p4bl0's shell script, because it does not make sure that Emacs is running first. Then add the following to your .emacs file:
(defun emacs-uri-handler (uri)
  "Handles emacs URIs in the form: emacs:///path/to/file/LINENUM"
  (save-match-data
    (if (string-match "emacs://\\(.*\\)/\\([0-9]+\\)$" uri)
        (let ((filename (match-string 1 uri))
              (linenum (match-string 2 uri)))
          (with-current-buffer (find-file filename)
            (goto-line (string-to-number linenum))))
      (beep)
      (message "Unable to parse the URI <%s>" uri))))
The above code will not check to make sure the file exists, and the error handling is rudimentary at best. But it works!
Then create an HTML file that has links like the following, where the href uses the emacs:// form handled above:
<a href="emacs://c:/temp/my.log/60">file: c:/temp/my.log, line: 60</a>
and then click on the link.
Post Script:
I recently switched to Linux (Ubuntu 9.10) and here's what I did for that OS:
$ gconftool -s /desktop/gnome/url-handlers/emacs/command '/usr/bin/emacsclient --no-wait -e "(emacs-uri-handler \"%s\")"' --type String
$ gconftool -s /desktop/gnome/url-handlers/emacs/enabled --type Boolean true
Using the same emacs-uri-handler from above.
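If I remember the GNOME 2 tooling correctly (an assumption, not something stated above), gnome-open consulted those url-handlers keys, so the handler could be smoke-tested from a terminal:
gnome-open 'emacs:///tmp/my.log/60'   # hypothetical file; should land on line 60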
Might be a great reason to write your first FF plugin ;)