How do I force Tika server to exclude the TesseractOCRParser using curl?

I'm running tika-server-1.23.jar with Tesseract and extracting text from files using curl via PHP. Sometimes OCR takes too long to run, so occasionally I'd like to skip running Tesseract. I can do this by inserting
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
in the Tika config XML file, but then Tesseract never runs at all.
Can I force the Tika server to skip Tesseract selectively on each request via curl and, if so, how?
I've got a workaround where I run two instances of the Tika server, each with a different config file, listening on different ports, but this is sub-optimal.
Thanks in advance.

You can set the OCR strategy using headers for PDF files, which includes an option not to OCR:
curl -T test.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: no_ocr"
There isn't really an equivalent for other file types, but there is a similar header prefix called X-Tika-OCR that lets you set configuration on the TesseractOCRConfig instance for any file type.
You have some options that could be of interest in your scenario:
maxFileSizeToOcr - which you could set to 0
timeout - which you could set to whatever timeout you are willing to allow
tesseractPath - which you can set to a bogus path; if Tika can't find the Tesseract binary, it can't run it
So, for example, if you want to skip OCR for a file, you could set the max file size to 0, which means it will not be processed:
curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRmaxFileSizeToOcr: 0"
Or set the path to /dummy:
curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRtesseractPath: /dummy"
You can of course use these headers with PDF files too, should you wish.
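The timeout option follows the same pattern. As a hedged sketch (the exact header name is an assumption, based on the X-Tika-OCR prefix mapping onto TesseractOCRConfig's setters), this should cap OCR at 30 seconds:
curl -T testOCR.jpg http://localhost:9998/tika --header "X-Tika-OCRtimeout: 30"  # assumed header name; value in seconds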

Related

Modify an Etherpad instance with a command line tool

Most Etherpad instances accept replacing a pad's entire contents by uploading an HTML file. Is there a way to automate this process with a command line tool such as cURL?
You can use the HTTP API to create or change the content of a pad. There are several client libraries available.
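For example, a minimal curl sketch against the API's setHTML function:
# Assumed: Etherpad on localhost:9001, API version 1.2.13, a pad named "mypad",
# and the API key read from the instance's APIKEY.txt file.
curl -s "http://localhost:9001/api/1.2.13/setHTML" \
  --data-urlencode "apikey=$(cat APIKEY.txt)" \
  --data-urlencode "padID=mypad" \
  --data-urlencode "html=<html><body><p>Hello</p></body></html>"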

Linux shell script command - gzip

I have a shell script on Linux whose output is generated in .csv format.
At the end of the script I compress the .csv to .gz to save space on my machine.
The generated file comes in this format: Output_04-07-2015.csv
The command I have written to compress it is: gzip Output_*.csv
The issue I'm facing is that if the .gz file already exists, the script should instead create a new file carrying the current timestamp.
Can anyone help me with this?
If all you want is to just overwrite the file if it already exists, gzip has a -f flag for it.
gzip -f Output_*.csv
What the -f flag does is forcibly create the gzip file, overwriting any existing archive of the same name.
Have a look at the man pages by typing man gzip or even this link for many other options.
If instead you want to handle this more elegantly, you could add a check in your script for an existing archive and timestamp the new one before compressing. The exact syntax depends on which shell you use (bash, csh, etc.); see the sketch below.
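For example, a minimal bash sketch of that approach (the file pattern and timestamp format are assumptions based on your example):
#!/bin/bash
# Compress each CSV; if a .gz already exists, write a timestamped archive instead.
for f in Output_*.csv; do
  if [ -e "$f.gz" ]; then
    gzip -c "$f" > "${f%.csv}_$(date +%H-%M-%S).csv.gz" && rm "$f"
  else
    gzip "$f"
  fi
done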

Importing large datasets into Couchbase

I am having difficulty importing large datasets into Couchbase. I have experience doing this very quickly with Redis via the command line, but I have not seen anything comparable for Couchbase.
I have tried using the PHP SDK, and it imports about 500 documents/second. I have also tried the cbdocloader script in the Couchbase bin folder, but it seems to want each document in its own JSON file. It is a lot of work to create all these files and then load them. Is there some other import process I am missing? If cbdocloader is the only way to load data fast, is it possible to put multiple documents into one JSON file?
Take the file that has all the JSON documents in it and zip up the file:
zip somefile.zip somefile.json
Place the zip file(s) into a directory. I used ~/json_files/ in my home directory.
Then load the file or files by the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 \
~/json_files/somefile.zip
Note: '-s 1000' is the memory size. You'll need to adjust this value for your bucket.
If successful, you'll see output stating how many documents were loaded, whether the load succeeded, and so on.
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
# Load every .zip archive in the given directory with cbdocloader.
JSON_Dir=~/json_files/
for ZipFile in "$JSON_Dir"/*.zip; do
  /Applications/Couchbase\ Server.app/Contents/Resources/couchbase-core/bin/cbdocloader \
    -u Administrator -p s3kre7Pa55 -b MyBucketToLoad \
    -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
UPDATED: Keep in mind this script will only work if your data is formatted correctly and if each document is smaller than the maximum single-document size of 20MB (not the zip file itself, but any document extracted from it).
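If you're not sure whether any document exceeds that limit, a quick pre-check before zipping (assuming your source .json files live in ~/json_files) is:
find ~/json_files -name '*.json' -size +20M  # lists any file over 20MB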
I have also written a blog post describing bulk loading from a single file; it is listed here:
Bulk Loading Documents Into Couchbase

How to get AWStats to generate static HTML files?

I want to get AWStats running on my webserver that runs Debian 4.4.5-8 with Apache 2.
There are several websites that all have their own configuration file, similar to this:
Include "/etc/awstats/awstats.model.conf"
LogFile="/var/customers/logs/myname-example.com-access.log"
LogType=W
LogFormat = 1
LogSeparator=" "
SiteDomain="example.com"
HostAliases="*.example.com"
DirData="/www/myname/awstats/example.com/"
What I expect is that HTML files are written to /www/myname/awstats/example.com/, which I can then access through Apache. However, when I run /usr/share/awstats/tools/buildstatic.sh, what happens is that .txt files are written to that directory and the HTML files I want are written to /var/cache/awstats. The error file in /tmp remains empty.
Why is this happening and how do I make it work the way I want?
DirData is not supposed to be read directly by the web server; it is used by awstats.pl.
The fact is that /var/cache/awstats is hardcoded in buildstatic.sh, so you have to change the two lines mentioning it:
mkdir -p /var/cache/awstats/$c/$Y/$m/
and
-dir=/var/cache/awstats/$c/$Y/$m/ >$TMPFILE 2>&1
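For example, a hedged one-liner to repoint both lines at your DirData tree (the target path is an assumption based on your config; -i.bak keeps a backup of the original script):
sed -i.bak 's|/var/cache/awstats|/www/myname/awstats|g' /usr/share/awstats/tools/buildstatic.sh  # assumed target path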

--no-clobber still overwrites file if --html-extension used in wget?

I have a script for downloading all of my Chrome bookmarks. I use wget with --html-extension because some of the bookmarks end in .php and can't be opened in a web browser unless the --html-extension option is used. The problem I am having is that when I use --html-extension with --no-clobber, wget doesn't recognize that most of the files are already there, so it goes through the whole process of re-downloading things it already has.
An example:
wget -nc http://www.test.com/
Run once, this saves the file like it is supposed to. If you run it again, it says the file is already there and does not retrieve it. That is the behaviour I would expect.
However, delete the file that was just saved and run:
wget -nc http://www.test.com/ --html-extension
Then run that same command again: it overwrites the file instead of saying the file is already there. What is going on?
When the .html suffix is added, wget can't tell which remote file it should compare the local copy to.
man wget: http://unixhelp.ed.ac.uk/CGI/man-cgi?wget
======================
--html-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.
Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig.
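Putting that last paragraph into practice, a hedged sketch of the suggested combination (short forms: -E for --html-extension, -k for --convert-links, -K for --backup-converted; note that some wget builds warn that -nc and -k conflict, in which case timestamping with -N is the usual alternative):
wget -nc -E -k -K http://www.test.com/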