Is there a way to force tcpdump to print only hex format when reading CAN packets? - tcpdump

Although the can-utils package is available in Linux to interact with CAN network devices, I am trying to confirm if tcpdump can print just hex format when reading CAN packets without including ASCII characters?
tcpdump version 4.2.1
libpcap version 1.1.1
The only work around was I found was to direct the tcpdump output to a file then read from said file using a util like hexdump, xxd, or OD, etc.
The upper left screen is based on the script below:
# tcpdump -ivcan0 -s0 -x -w - -s0 -l > canpackets.hex & tail -f canpackets.hex | hexdump -vC
The upper right screen represents the tcpdump output using (-x) without redirecting to hexdump; which still prints ASCII.
The bottom left screen is candump; which is here just to illustrate the hex values that are generated by the cangen util in the bottom right screen.
while :; do ./cansend vcan0 001#1122334455667788; sleep .25; done

This is not a direct answer to my original question.
A solution is, "Just use tshark; it works better with can packets by not printing garble."

Related

Is it possible to directly 'convert' a google drive image url in Imagemagick?

I have had a good search around for this, but cannot find a concrete answer. I have been trying to run the convert command against a url for an image I have stored on google drive, in a public folder.
I have been able to get this to work if I wget the url with -qO- and pipe this to convert- e.g.
wget 'https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1' -qO- | convert - -resize 100x100 MGFIN01.png
Ideally I would prefer to be able to directly run the url through convert:
convert https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1 -resize 100x100 MGFIN01.png
With the ultimate intention of creating an html image map like: http://www.imagemagick.org/Usage/montage/#html therefore requiring a list of urls and names (I can probably work this out once I have resolved the url part)
I am on Ubuntu with imagemagick 6.9. I see in delegates.xml that I have this:
<delegate decode="https" command=""curl" -s -k -L -o "%o" "https:%M""/>
Also tried the download with curl and options and that also worked.
This works for me on ImageMagick 6.9.10.97 Q16 Mac OSX. Put the URL in double quotes. You may also have to edit your policy.xml file to give permission for HTTPS. See policy.xml at https://imagemagick.org/script/resources.php
convert "https://drive.google.com/uc?export=download&id=1ydsWevwDxARqrabo5yZEYozez0eZK4K1" -resize 100x100 MGFIN01.png
Just to give a tidier response than possible in comments:
Open policy.xml
sudo nano /etc/ImageMagick-6/policy.xml
Scroll down to find:
<policy domain="delegate" rights="none" pattern="HTTPS" />
Edit this to show:
<policy domain="delegate" rights="read" pattern="https" />
Save (CTRL+X, Y)
Run convert command again. Tada.

OCR detecting E as £

I am using pytesseract (version 5 of tesseract) to scan an image. I have changed image to black and white to remove the noise but still E is being detected as £196893 .
Also tried setting the language, dpi and psm values which has been suggested by most of people. Below are the settings I am using now. Please suggest.
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")
Once of sample picture is shown below. For some samples it is working fine but for some samples it is giving such strange characters.
A solution to overcome this issue is to limit the characters that Tesseract looks for. To do so you must:
Create a file with arbitrary name (i.e. "whitelist") in tesseract config directory. In linux that directory is usually placed in /usr/share/tesseract/tessdata/configs.
Adding a line in that file containing only the characters that are you want to search in text:
tessedit_char_whitelist *list_of_characters*
Then call your script using the whitelist vocabulary:
tesseract input.tif output nobatch whitelist
In this case the parameters must be setted in your Python script as:
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l nobatch whitelist")

Apache forbidden error when trying to run a Perl script [duplicate]

I have a Perl CGI script that isn't working and I don't know how to start narrowing down the problem. What can I do?
Note: I'm adding the question because I really want to add my very lengthy answer to Stack Overflow. I keep externally linking to it in other answers and it deserves to be here. Don't be shy about editing my answer if you have something to add.
This answer is intended as a general framework for working through
problems with Perl CGI scripts and originally appeared on Perlmonks as Troubleshooting Perl CGI Scripts. It is not a complete guide to every
problem that you may encounter, nor a tutorial on bug squashing. It
is just the culmination of my experience debugging CGI scripts for twenty (plus!) years. This page seems to have had many different homes, and I seem
to forget it exists, so I'm adding it to the StackOverflow. You
can send any comments or suggestions to me at
bdfoy#cpan.org. It's also community wiki, but don't go too nuts. :)
Are you using Perl's built in features to help you find problems?
Turn on warnings to let Perl warn you about questionable parts of your code. You can do this from the command line with the -w switch so you don't have to change any code or add a pragma to every file:
% perl -w program.pl
However, you should force yourself to always clear up questionable code by adding the warnings pragma to all of your files:
use warnings;
If you need more information than the short warning message, use the diagnostics pragma to get more information, or look in the perldiag documentation:
use diagnostics;
Did you output a valid CGI header first?
The server is expecting the first output from a CGI script to be the CGI header. Typically that might be as simple as print "Content-type: text/plain\n\n"; or with CGI.pm and its derivatives, print header(). Some servers are sensitive to error output (on STDERR) showing up before standard output (on STDOUT).
Try sending errors to the browser
Add this line
use CGI::Carp 'fatalsToBrowser';
to your script. This also sends compilation errors to the browser window. Be sure to remove this before moving to a production environment, as the extra information can be a security risk.
What did the error log say?
Servers keep error logs (or they should, at least).
Error output from the server and from your script should
show up there. Find the error log and see what it says.
There isn't a standard place for log files. Look in the
server configuration for their location, or ask the server
admin. You can also use tools such as CGI::Carp
to keep your own log files.
What are the script's permissions?
If you see errors like "Permission denied" or "Method not
implemented", it probably means that your script is not
readable and executable by the web server user. On flavors
of Unix, changing the mode to 755 is recommended:
chmod 755 filename. Never set a mode to 777!
Are you using use strict?
Remember that Perl automatically creates variables when
you first use them. This is a feature, but sometimes can
cause bugs if you mistype a variable name. The pragma
use strict will help you find those sorts of
errors. It's annoying until you get used to it, but your
programming will improve significantly after awhile and
you will be free to make different mistakes.
Does the script compile?
You can check for compilation errors by using the -c
switch. Concentrate on the first errors reported. Rinse,
repeat. If you are getting really strange errors, check to
ensure that your script has the right line endings. If you
FTP in binary mode, checkout from CVS, or something else that
does not handle line end translation, the web server may see
your script as one big line. Transfer Perl scripts in ASCII
mode.
Is the script complaining about insecure dependencies?
If your script complains about insecure dependencies, you
are probably using the -T switch to turn on taint mode, which is
a good thing since it keeps you have passing unchecked data to the shell. If
it is complaining it is doing its job to help us write more secure scripts. Any
data originating from outside of the program (i.e. the environment)
is considered tainted. Environment variables such as PATH and
LD_LIBRARY_PATH
are particularly troublesome. You have to set these to a safe value
or unset them completely, as I recommend. You should be using absolute
paths anyway. If taint checking complains about something else,
make sure that you have untainted the data. See perlsec
man page for details.
What happens when you run it from the command line?
Does the script output what you expect when run from the
command line? Is the header output first, followed by a
blank line? Remember that STDERR may be merged with STDOUT
if you are on a terminal (e.g. an interactive session), and
due to buffering may show up in a jumbled order. Turn on
Perl's autoflush feature by setting $| to a
true value. Typically you might see $|++; in
CGI programs. Once set, every print and write will
immediately go to the output rather than being buffered.
You have to set this for each filehandle. Use select to
change the default filehandle, like so:
$|++; #sets $| for STDOUT
$old_handle = select( STDERR ); #change to STDERR
$|++; #sets $| for STDERR
select( $old_handle ); #change back to STDOUT
Either way, the first thing output should be the CGI header
followed by a blank line.
What happens when you run it from the command line with a CGI-like environment?
The web server environment is usually a lot more limited
than your command line environment, and has extra
information about the request. If your script runs fine
from the command line, you might try simulating a web server
environment. If the problem appears, you have an
environment problem.
Unset or remove these variables
PATH
LD_LIBRARY_PATH
all ORACLE_* variables
Set these variables
REQUEST_METHOD (set to GET, HEAD, or POST as appropriate)
SERVER_PORT (set to 80, usually)
REMOTE_USER (if you are doing protected access stuff)
Recent versions of CGI.pm ( > 2.75 ) require the -debug flag to
get the old (useful) behavior, so you might have to add it to
your CGI.pm imports.
use CGI qw(-debug)
Are you using die() or warn?
Those functions print to STDERR unless you have redefined
them. They don't output a CGI header, either. You can get
the same functionality with packages such as CGI::Carp
What happens after you clear the browser cache?
If you think your script is doing the right thing, and
when you perform the request manually you get the right
output, the browser might be the culprit. Clear the cache
and set the cache size to zero while testing. Remember that
some browsers are really stupid and won't actually reload
new content even though you tell it to do so. This is
especially prevalent in cases where the URL path is the
same, but the content changes (e.g. dynamic images).
Is the script where you think it is?
The file system path to a script is not necessarily
directly related to the URL path to the script. Make sure
you have the right directory, even if you have to write a
short test script to test this. Furthermore, are you sure
that you are modifying the correct file? If you don't see
any effect with your changes, you might be modifying a
different file, or uploading a file to the wrong place.
(This is, by the way, my most frequent cause of such trouble
;)
Are you using CGI.pm, or a derivative of it?
If your problem is related to parsing the CGI input and you
aren't using a widely tested module like CGI.pm, CGI::Request,
CGI::Simple or CGI::Lite, use the module and get on with life.
CGI.pm has a cgi-lib.pl compatibility mode which can help you solve input
problems due to older CGI parser implementations.
Did you use absolute paths?
If you are running external commands with
system, back ticks, or other IPC facilities,
you should use an absolute path to the external program.
Not only do you know exactly what you are running, but you
avoid some security problems as well. If you are opening
files for either reading or writing, use an absolute path.
The CGI script may have a different idea about the current
directory than you do. Alternatively, you can do an
explicit chdir() to put you in the right place.
Did you check your return values?
Most Perl functions will tell you if they worked or not
and will set $! on failure. Did you check the
return value and examine $! for error messages? Did you check
$# if you were using eval?
Which version of Perl are you using?
The latest stable version of Perl is 5.28 (or not, depending on when this was last edited). Are you using an older version? Different versions of Perl may have different ideas of warnings.
Which web server are you using?
Different servers may act differently in the same
situation. The same server product may act differently with
different configurations. Include as much of this
information as you can in any request for help.
Did you check the server documentation?
Serious CGI programmers should know as much about the
server as possible - including not only the server features
and behavior, but also the local configuration. The
documentation for your server might not be available to you
if you are using a commercial product. Otherwise, the
documentation should be on your server. If it isn't, look
for it on the web.
Did you search the archives of comp.infosystems.www.authoring.cgi?
This use to be useful but all the good posters have either died or wandered off.
It's likely that someone has had your problem before,
and that someone (possibly me) has answered it in this
newsgroup. Although this newsgroup has passed its heyday, the collected wisdom from the past can sometimes be useful.
Can you reproduce the problem with a short test script?
In large systems, it may be difficult to track down a bug
since so many things are happening. Try to reproduce the problem
behavior with the shortest possible script. Knowing the problem
is most of the fix. This may be certainly time-consuming, but you
haven't found the problem yet and you're running out of options. :)
Did you decide to go see a movie?
Seriously. Sometimes we can get so wrapped up in the problem that we
develop "perceptual narrowing" (tunnel vision). Taking a break,
getting a cup of coffee, or blasting some bad guys in [Duke Nukem,Quake,Doom,Halo,COD] might give you
the fresh perspective that you need to re-approach the problem.
Have you vocalized the problem?
Seriously again. Sometimes explaining the problem aloud
leads us to our own answers. Talk to the penguin (plush toy) because
your co-workers aren't listening. If you are interested in this
as a serious debugging tool (and I do recommend it if you haven't
found the problem by now), you might also like to read The Psychology
of Computer Programming.
I think CGI::Debug is worth mentioning as well.
Are you using an error handler while you are debugging?
die statements and other fatal run-time and compile-time errors get
printed to STDERR, which can be hard to find and may be conflated with
messages from other web pages at your site. While you're debugging your
script, it's a good idea to get the fatal error messages to display in your
browser somehow.
One way to do this is to call
use CGI::Carp qw(fatalsToBrowser);
at the top of your script. That call will install a $SIG{__DIE__} handler (see perlvar)display fatal errors in your browser, prepending it with a valid header if necessary. Another CGI debugging trick that I used before I ever heard of CGI::Carp was to
use eval with the DATA and __END__ facilities on the script to catch compile-time errors:
#!/usr/bin/perl
eval join'', <DATA>;
if ($#) { print "Content-type: text/plain:\n\nError in the script:\n$#\n; }
__DATA__
# ... actual CGI script starts here
This more verbose technique has a slight advantage over CGI::Carp in that it will catch more compile-time errors.
Update: I've never used it, but it looks like CGI::Debug, as Mikael S
suggested, is also a very useful and configurable tool for this purpose.
I wonder how come no-one mentioned the PERLDB_OPTS option called RemotePort; although admittedly, there aren't many working examples on the web (RemotePort isn't even mentioned in perldebug) - and it was kinda problematic for me to come up with this one, but here it goes (it being a Linux example).
To do a proper example, first I needed something that can do a very simple simulation of a CGI web server, preferably through a single command line. After finding Simple command line web server for running cgis. (perlmonks.org), I found the IO::All - A Tiny Web Server to be applicable for this test.
Here, I'll work in the /tmp directory; the CGI script will be /tmp/test.pl (included below). Note that the IO::All server will only serve executable files in the same directory as CGI, so chmod +x test.pl is required here. So, to do the usual CGI test run, I change directory to /tmp in the terminal, and run the one-liner web server there:
$ cd /tmp
$ perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'
The webserver command will block in the terminal, and will otherwise start the web server locally (on 127.0.0.1 or localhost) - afterwards, I can go to a web browser, and request this address:
http://127.0.0.1:8080/test.pl
... and I should observe the prints made by test.pl being loaded - and shown - in the web browser.
Now, to debug this script with RemotePort, first we need a listener on the network, through which we will interact with the Perl debugger; we can use the command line tool netcat (nc, saw that here: Perl如何remote debug?). So, first run the netcat listener in one terminal - where it will block and wait for connections on port 7234 (which will be our debug port):
$ nc -l 7234
Then, we'd want perl to start in debug mode with RemotePort, when the test.pl has been called (even in CGI mode, through the server). This, in Linux, can be done using the following "shebang wrapper" script - which here also needs to be in /tmp, and must be made executable:
cd /tmp
cat > perldbgcall.sh <<'EOF'
#!/bin/bash
PERLDB_OPTS="RemotePort=localhost:7234" perl -d -e "do '$#'"
EOF
chmod +x perldbgcall.sh
This is kind of a tricky thing - see shell script - How can I use environment variables in my shebang? - Unix & Linux Stack Exchange. But, the trick here seems to be not to fork the perl interpreter which handles test.pl - so once we hit it, we don't exec, but instead we call perl "plainly", and basically "source" our test.pl script using do (see How do I run a Perl script from within a Perl script?).
Now that we have perldbgcall.sh in /tmp - we can change the test.pl file, so that it refers to this executable file on its shebang line (instead of the usual Perl interpreter) - here is /tmp/test.pl modified thus:
#!./perldbgcall.sh
# this is test.pl
use 5.10.1;
use warnings;
use strict;
my $b = '1';
my $a = sub { "hello $b there" };
$b = '2';
print "YEAH " . $a->() . " CMON\n";
$b = '3';
print "CMON " . &$a . " YEAH\n";
$DB::single=1; # BREAKPOINT
$b = '4';
print "STEP " . &$a . " NOW\n";
$b = '5';
print "STEP " . &$a . " AGAIN\n";
Now, both test.pl and its new shebang handler, perldbgcall.sh, are in /tmp; and we have nc listening for debug connections on port 7234 - so we can finally open another terminal window, change directory to /tmp, and run the one-liner webserver (which will listen for web connections on port 8080) there:
cd /tmp
perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'
After this is done, we can go to our web browser, and request the same address, http://127.0.0.1:8080/test.pl. However, now when the webserver tries to execute the script, it will do so through perldbgcall.sh shebang - which will start perl in remote debugger mode. Thus, the script execution will pause - and so the web browser will lock, waiting for data. We can now switch to the netcat terminal, and we should see the familiar Perl debugger text - however, output through nc:
$ nc -l 7234
Loading DB routines from perl5db.pl version 1.32
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(-e:1): do './test.pl'
DB<1> r
main::(./test.pl:29): $b = '4';
DB<1>
As the snippet shows, we now basically use nc as a "terminal" - so we can type r (and Enter) for "run" - and the script will run up do the breakpoint statement (see also In perl, what is the difference between $DB::single = 1 and 2?), before stopping again (note at that point, the browser will still lock).
So, now we can, say, step through the rest of test.pl, through the nc terminal:
....
main::(./test.pl:29): $b = '4';
DB<1> n
main::(./test.pl:30): print "STEP " . &$a . " NOW\n";
DB<1> n
main::(./test.pl:31): $b = '5';
DB<1> n
main::(./test.pl:32): print "STEP " . &$a . " AGAIN\n";
DB<1> n
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1>
... however, also at this point, the browser locks and waits for data. Only after we exit the debugger with q:
DB<1> q
$
... does the browser stop locking - and finally displays the (complete) output of test.pl:
YEAH hello 2 there CMON
CMON hello 3 there YEAH
STEP hello 4 there NOW
STEP hello 5 there AGAIN
Of course, this kind of debug can be done even without running the web server - however, the neat thing here, is that we don't touch the web server at all; we trigger execution "natively" (for CGI) from a web browser - and the only change needed in the CGI script itself, is the change of shebang (and of course, the presence of the shebang wrapper script, as executable file in the same directory).
Well, hope this helps someone - I sure would have loved to have stumbled upon this, instead of writing it myself :)
Cheers!
For me, I use log4perl . It's quite useful and easy.
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init( { level => $DEBUG, file => ">>d:\\tokyo.log" } );
my $logger = Log::Log4perl::get_logger();
$logger->debug("your log message");
Honestly you can do all the fun stuff above this post.
ALTHOUGH, the simplest and most proactive solution I found was to just "print it".
In example:
(Normal code)
`$somecommand`;
To see if it's doing what I really want it to do:
(Trouble shooting)
print "$somecommand";
It will probably also be worth mentioning that Perl will always tell you on what line the error occurs when you execute the Perl script from the command line. (An SSH Session for example)
I will usually do this if all else fails. I will SSH into the server and manually execute the Perl script. For example:
% perl myscript.cgi
If there is a problem then Perl will tell you about it. This debugging method does away with any file permission related issues or web browser or web server issues.
You may run the perl cgi-script in terminal using the below command
$ perl filename.cgi
It interpret the code and provide result with HTML code.
It will report the error if any.

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants had to segment and various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command line tool (as in examples above). #nguyenq has pointed me to the API reference, unfortunately I have no c++ experience. If the only solution is to use the API, please can you provide a quick python example?
Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.
Overview
Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid
Code
brew install wine # takes a little while >10m
brew install gs # only for generating a tif example. Not required, you can use Preview
brew install wget # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be ommitted and you can do the conversion to tiff with Preview
gs \
-o paper-%d.tif \
-sDEVICE=tiff24nc \
-r300x300 \
paper.pdf
cd ~/Downloads
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe" \
-inp-img "$dl/Downloads/paper-3.tif" \
-out-xml "$dl/Downloads/paper-3-tool.xml" \
-rec-mode layout>>log.txt
# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
Results
Document with overlays (rollover to see text and type)
Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
Unfortunately, I kept getting null pointers.
Could not convert to target XML schema format.
java.lang.NullPointerException
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.
If you are python familiar, you can directly use tesserocr library which is a nice python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM
img = Image.open(filename)
results = []
with PyTessBaseAPI() as api:
api.SetImage(img)
api.SetPageSegMode(PSM.AUTO_ONLY)
iterator = api.AnalyseLayout()
for w in iterate_level(iterator, RIL.BLOCK):
if w is not None:
results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))
draw = ImageDraw.Draw(img)
for block_type, poly in results:
# you can define a color per block type (see tesserocr.PT for block types list)
draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)
With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.
More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.
The HOCR individual character step is now available in Tesseract since 4.1.
Once the installation check, use :
tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

curl: downloading from dynamic url

I'm trying to download an html file with curl in bash. Like this site:
http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=10S&subareasel=PHYSICS&idxcrs=0001B+++
When I download it manually, it works fine. However, when i try and run my script through crontab, the output html file is very small and just says "Object moved to here." with a broken link. Does this have something to do with the sparse environment the crontab commands run it? I found this question:
php ssl curl : object moved error
but i'm using bash, not php. What are the equivalent command line options or variables to set to fix this problem in bash?
(I want to do this with curl, not wget)
Edit: well, sometimes downloading the file manually (via interactive shell) works, but sometimes it doesn't (I still get the "Object moved here" message). So it may not be a a specifically be a problem with cron's environment, but with curl itself.
the cron entry:
* * * * * ~/.class/test.sh >> ~/.class/test_out 2>&1
test.sh:
#! /bin/bash
PATH=/usr/local/bin:/usr/bin:/bin:/sbin
cd ~/.class
course="physics 1b"
url="http://www.registrar.ucla.edu/schedule/detselect.aspx?termsel=10S<URL>subareasel=PHYSICS<URL>idxcrs=0001B+++"
curl "$url" -sLo "$course".html --max-redirs 5
Edit: Problem solved. The issue was the stray tags in the url. It was because I was doing sed s,"<URL>",\""$url"\", template.txt > test.sh to generate the scripts and sed replaced all instances of & with the regular expression <URL>. After fixing the url, curl works fine.
You want the -L or --location option, which follows 300 series redirects. --maxredirs [n] will limit curl to n redirects.
Its curious that this works from an interactive shell. Are you fetching the same url? You could always try sourcing your environment scripts in your cron entry:
* * * * * . /home/you/.bashrc ; curl -L --maxredirs 5 ...
EDIT: the example url is somewhat different than the one in the script. $url in the script has an additional pair of <URL> tags. Replacing them with &, the conventional argument seperators for GET requests, works for me.
Without seeing your script it's hard to guess what exactly is going on, but it's likely that it's an environment problem as you surmise.
One thing that often helps is to specify the full path to executables and files in your script.
If you show your script and crontab entry, we can be of more help.