Train tesseract for Hindi language


I want to train Tesseract for the Hindi language. I have many images of Hindi text in a specific font, and I would like to train Tesseract OCR on those images.
I have tried several times to train Tesseract by following https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 . When I run the makebox command it produces a box file, but the characters are recognised as English characters. I don't understand why this happens. Please help me train Tesseract OCR for the Hindi language.
You can check a sample image at the following link.
sample file

I have been wanting to train a few character sets myself, and have been gathering information first. Maybe this info is of use to you too.
Did you read this document:
http://blog.cedric.ws/how-to-train-tesseract-301
If none of the characters are recognized you will have to train all of the characters, I'm afraid. But the important steps seem to be:
include the language code in the makebox command line ('eng' in that document; this would probably be 'hin' in your case).
be aware of the Tesseract version. I have the impression that the training procedure has changed in recent versions.

Sample program that recognizes the Hindi characters in an image and writes each character, together with its bounding box values, to a file.
/*
* Char_OCR.cpp
*
* Created on: Jun 23, 2016
* Author: pratik
*/
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <iostream>
#include <fstream>
#include <cstring> // strlen
#include <cstdlib> // exit
using namespace std;
using namespace cv;
void dumpIntoFile(const char *ocrResult , ofstream &myfile1 ,int x1, int y1,
int x2, int y2, int &);
int main(int argc ,char **argv)
{
Pix *image = pixRead(argv[1]);
if (image == 0) {
cout << "Cannot load input file!\n";
return 1; // do not continue without an image
}
tesseract::TessBaseAPI tess;
if (tess.Init("/usr/share/tesseract/tessdata", "hin")) {
fprintf(stderr, "Could not initialize tesseract.\n");
exit(1);
}
tess.SetImage(image);
tess.Recognize(0);
tesseract::ResultIterator *ri = tess.GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_SYMBOL;
cout << ri << endl;
ofstream myfile1("Word.txt");
myfile1 << "ID" << '\t' << "CORD_X" << '\t' << "CORD_Y" << '\t' <<
"CORD_W" << '\t' << "CORD_H" << '\t' << "STRING" << endl;
int i=1;
if(ri!=0)
{
do {
const char *word = ri->GetUTF8Text(level);
// cout << word << endl;
//float conf = ri->Confidence(level);
int x1, y1, x2, y2;
ri->BoundingBox(level, &x1, &y1, &x2, &y2);
dumpIntoFile(word, myfile1, x1, y1, x2, y2, i);
delete []word;
} while (ri->Next(level));
delete ri; // ri points to a single object, not an array
}
}
void dumpIntoFile(const char *ocrResult , ofstream &myfile1 ,int x1, int y1,
int x2, int y2,int &i)
{
int length = strlen(ocrResult);
myfile1 << i++ << '\t' << x1 << '\t' << y1 << '\t' <<
x2 << '\t' << y2 << '\t' ;
//cout << "in the string (" << length << ") ::";
for(int j = 0; j < length && ocrResult[j] != '\n'; j++)
{
myfile1 << ocrResult[j];
}
myfile1 << endl;
}

Currently the Tesseract API provides pre-trained language models for most of the popular languages:
https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html

Related

Parameter Passing in C-like

int i, a[] = {0, 1, 2};
void foo(int x) {
i++;
x++;
cout << a[0] << " " << a[1] << " " << a[2];
}
void main() {
i = 0;
foo(a[i]);
}
So, the printing output will be:
By value-result: 0 - 1 - 2
By reference: 1 - 1 - 2
By name: 0 - 2 - 2
By constant reference: 0 - 1 - 2
Right ?

Beware: the cout stream and the << operator are C++ features, so you are NOT in C!
You also have to understand that side effects inside a sequence such as cout << foo << bar << fuzz; can produce unpredictable output that depends on choices made by the compiler. Before C++17 the language did not fix the order in which the operands of such a chain are evaluated, and modifying the same variable more than once within it is undefined behavior.
To illustrate what I am saying, try to compile the following small program (example.cpp):
#include <iostream>
using namespace std;
int main ()
{
int i = 0;
cout << "i++: " << i++ << i++ << i++ << i++ << endl;
i = 0;
cout << "++i: " << ++i << ++i << ++i << ++i << endl;
cout << "The address of i is: " << &i << endl;
return 0;
}
When compiled with g++ -o example example.cpp, you should get something like:
i++: 3210
++i: 4321
The address of i is: 0xbfb1c44c
Then, try to compile it with g++ -O2 -o example example.cpp, you should get something like:
i++: 3210
++i: 4444
The address of i is: 0xbf9af0cc
In fact, the difference between these two runs comes from the fact that once you enable optimization in g++, the compiler assumes that your code conforms to the C++ specification and that there are no conflicting side effects inside the cout << ... << endl; expression. Here that lets it reuse the final value of i for every operand.
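A minimal sketch of how to get a predictable result: perform each increment in its own statement, so the order is fixed by statement sequencing rather than by how the compiler evaluates the operands. The helper name post_increments is made up for this illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: appends the value of i++ `count` times, with one
// increment per statement, so the order of the side effects is fixed.
std::string post_increments(int start, int count)
{
    std::ostringstream out;
    int i = start;
    for (int k = 0; k < count; ++k) {
        int value = i++;  // the side effect is complete at the end of this statement
        out << value;     // streaming a plain value does not touch i again
    }
    return out.str();
}
```

With this helper, std::cout << "i++: " << post_increments(0, 4) << std::endl; prints i++: 0123 regardless of the optimization level.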

C++, why does adding an int to an unsigned int. produce a weird value?

#include <iostream>
using namespace std;
int main()
{
unsigned u = 10;
int i = -42;
cout << i + i << endl; // prints -84
cout << u + i << endl; // if 32-bit ints, prints 4294967264
}
With that code, the second arithmetic expression, u + i, gives the value 4294967264. Why is that?
Can you explain?
I have only just started with C++, so please explain step by step and avoid complicated terminology.
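The arithmetic can be reproduced step by step. This is only a sketch, assuming a 32-bit unsigned as in the comment in the code above; it uses std::uint32_t to make that assumption explicit, and the function name add_as_unsigned is made up for illustration. In u + i the int operand is converted to unsigned before the addition, and that conversion wraps around modulo 2^32.

```cpp
#include <cstdint>

// Hypothetical helper mirroring `u + i` with an explicit 32-bit unsigned type.
std::uint32_t add_as_unsigned(std::uint32_t u, std::int32_t i)
{
    // Step 1: the signed operand is converted to unsigned.
    // For i = -42 this gives 2^32 - 42 = 4294967254.
    std::uint32_t converted = static_cast<std::uint32_t>(i);
    // Step 2: unsigned addition, wrapping modulo 2^32 if needed.
    // For u = 10 this gives 10 + 4294967254 = 4294967264.
    return u + converted;
}
```

By contrast, i + i involves only int operands, so no conversion takes place and the result is the expected -84.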

CUDA Thrust - Run Length Encoding with run index

I am trying to build a "run length encoder" which produces a report of occurrences of runs within a file using CUDA Thrust. I will use this "report" to perform the run length encoding step later.
e.g.
Input sequence:
inputSequence = [a, a, b, c, a, a, a];
Output sequences:
runChar = [a, a];
runCount = [2, 3];
runPosition = [0, 4];
The output describes a run of 2 a's starting at position 0 and a run of 3 a's starting at position 4.
The Thrust run length encoder example described below outputs two arrays - one for the output char and one for its length.
I would like to modify this so runs of less than 2 are excluded and it also outputs the position each run occurs.
// input data on the host
const char data[] = "aaabbbbbcddeeeeeeeeeff";
const size_t N = (sizeof(data) / sizeof(char)) - 1;
// copy input data to the device
thrust::device_vector<char> input(data, data + N);
// allocate storage for output data and run lengths
thrust::device_vector<char> output(N);
thrust::device_vector<int> lengths(N);
// print the initial data
std::cout << "input data:" << std::endl;
thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
std::cout << std::endl << std::endl;
// compute run lengths
size_t num_runs = thrust::reduce_by_key
(input.begin(), input.end(), // input key sequence
thrust::constant_iterator<int>(1), // input value sequence
output.begin(), // output key sequence
lengths.begin() // output value sequence
).first - output.begin(); // compute the output size
// print the output
std::cout << "run-length encoded output:" << std::endl;
for(size_t i = 0; i < num_runs; i++)
std::cout << "(" << output[i] << "," << lengths[i] << ")";
std::cout << std::endl;
return 0;

One possible approach, building on what you have shown already:
Take your output lengths, and do an exclusive_scan on them. This creates a corresponding vector of the starting indexes of each run.
Use stream compaction (remove_if, or equivalently copy_if with the opposite predicate, which is what the code below uses) to drop the elements of all arrays (output, lengths, and indexes) whose corresponding length is 1. We do this in two steps: the first operation cleans up output and indexes, using lengths as the stencil, and the second operates directly on lengths. This can probably be improved significantly by operating on all 3 at once, at the cost of making the output length calculation a bit more complicated. How you handle this exactly will depend on which sets of data you intend to retain.
Here is a fully worked example, extending your code:
$ cat t601.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/zip_iterator.h>
struct is_not_one{
template <typename T>
__host__ __device__
bool operator()(T data){
return data != 1;
}
};
int main(){
// input data on the host
const char data[] = "aaabbbbbcddeeeeeeeeeff";
const size_t N = (sizeof(data) / sizeof(char)) - 1;
// copy input data to the device
thrust::device_vector<char> input(data, data + N);
// allocate storage for output data and run lengths
thrust::device_vector<char> output(N);
thrust::device_vector<int> lengths(N);
// print the initial data
std::cout << "input data:" << std::endl;
thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
std::cout << std::endl << std::endl;
// compute run lengths
size_t num_runs = thrust::reduce_by_key
(input.begin(), input.end(), // input key sequence
thrust::constant_iterator<int>(1), // input value sequence
output.begin(), // output key sequence
lengths.begin() // output value sequence
).first - output.begin(); // compute the output size
// print the output
std::cout << "run-length encoded output:" << std::endl;
for(size_t i = 0; i < num_runs; i++)
std::cout << "(" << output[i] << "," << lengths[i] << ")";
std::cout << std::endl;
thrust::device_vector<int> indexes(num_runs);
thrust::exclusive_scan(lengths.begin(), lengths.begin()+num_runs, indexes.begin());
thrust::device_vector<char> foutput(num_runs);
thrust::device_vector<int> findexes(num_runs);
thrust::device_vector<int> flengths(num_runs);
thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(output.begin(), indexes.begin())), thrust::make_zip_iterator(thrust::make_tuple(output.begin()+num_runs, indexes.begin()+num_runs)), lengths.begin(), thrust::make_zip_iterator(thrust::make_tuple(foutput.begin(), findexes.begin())), is_not_one());
size_t fnum_runs = thrust::copy_if(lengths.begin(), lengths.begin()+num_runs, flengths.begin(), is_not_one()) - flengths.begin();
std::cout << "output: " << std::endl;
thrust::copy_n(foutput.begin(), fnum_runs, std::ostream_iterator<char>(std::cout, ","));
std::cout << std::endl << "lengths: " << std::endl;
thrust::copy_n(flengths.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "indexes: " << std::endl;
thrust::copy_n(findexes.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -arch=sm_20 -o t601 t601.cu
$ ./t601
input data:
aaabbbbbcddeeeeeeeeeff
run-length encoded output:
(a,3)(b,5)(c,1)(d,2)(e,9)(f,2)
output:
a,b,d,e,f,
lengths:
3,5,2,9,2,
indexes:
0,3,9,11,20,
$
I'm certain that this code can be improved upon, but my purpose is to show you one possible general approach.
In my opinion, for future reference, it's not very helpful for you to strip off the include headers from your sample code. I think it's better to provide a complete, compilable code. Not a big deal in this case.
Also note that there are thrust example codes for run length encoding and decoding.
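For checking the GPU output, a sequential host-side reference can be handy. The following is a plain C++ sketch, not Thrust code; the names Run and find_runs are made up for illustration, and the minimum run length of 2 is taken from the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One run of identical characters in the input.
struct Run {
    char value;            // the repeated character
    int length;            // how many times it repeats
    std::size_t position;  // index of the first character of the run
};

// Hypothetical host-side reference: returns every run of length >= min_length.
std::vector<Run> find_runs(const std::string &input, int min_length = 2)
{
    std::vector<Run> runs;
    std::size_t start = 0;
    while (start < input.size()) {
        // Advance `end` to one past the last character of the current run.
        std::size_t end = start;
        while (end < input.size() && input[end] == input[start]) ++end;
        int length = static_cast<int>(end - start);
        if (length >= min_length)
            runs.push_back({input[start], length, start});
        start = end;  // the next run begins where this one ended
    }
    return runs;
}
```

Calling find_runs("aaabbbbbcddeeeeeeeeeff") yields the same characters, lengths, and indexes as the Thrust program above.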

qserialport does not send a char to arduino

I'm having trouble sending a char (i.e. "R") from my Qt5 application on Windows 7 to the COM port that is connected to an Arduino.
I intend to blink an LED on the Arduino, and the Arduino part works OK.
Here is my qt code:
#include <QTextStream>
#include <QCoreApplication>
#include <QtSerialPort/QSerialPortInfo>
#include <QSerialPort>
#include <iostream>
#include <QtCore>
QT_USE_NAMESPACE
using namespace std;
QSerialPort serial;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
QTextStream out(stdout);
QList<QSerialPortInfo> serialPortInfoList = QSerialPortInfo::availablePorts();
out << QObject::tr("Total number of ports available: ") << serialPortInfoList.count() << endl;
foreach (const QSerialPortInfo &serialPortInfo, serialPortInfoList) {
out << endl
<< QObject::tr("Port: ") << serialPortInfo.portName() << endl
<< QObject::tr("Location: ") << serialPortInfo.systemLocation() << endl
<< QObject::tr("Description: ") << serialPortInfo.description() << endl
<< QObject::tr("Manufacturer: ") << serialPortInfo.manufacturer() << endl
<< QObject::tr("Vendor Identifier: ") << (serialPortInfo.hasVendorIdentifier() ? QByteArray::number(serialPortInfo.vendorIdentifier(), 16) : QByteArray()) << endl
<< QObject::tr("Product Identifier: ") << (serialPortInfo.hasProductIdentifier() ? QByteArray::number(serialPortInfo.productIdentifier(), 16) : QByteArray()) << endl
<< QObject::tr("Busy: ") << (serialPortInfo.isBusy() ? QObject::tr("Yes") : QObject::tr("No")) << endl;
}
serial.setPortName("COM5");
serial.open(QIODevice::ReadWrite);
serial.setBaudRate(QSerialPort::Baud9600);
serial.setDataBits(QSerialPort::Data8);
serial.setParity(QSerialPort::NoParity);
serial.setStopBits(QSerialPort::OneStop);
serial.setFlowControl(QSerialPort::NoFlowControl);
if(!serial.isOpen())
{
std::cout<<"port is not open"<<endl;
//serial.open(QIODevice::ReadWrite);
}
if(serial.isWritable()==true)
{
std::cout<<"port writable..."<<endl;
}
QByteArray data("R");
serial.write(data);
serial.flush();
std::cout<<"value sent!!! "<<std::endl;
serial.close();
return 0;
}
My source code consists of two parts:
1- the serial port info list, which works just fine
2- opening the port and writing data. I get no errors when running the code, and the output looks as if nothing has gone wrong!
HOWEVER, the LED on the board does not turn on when I run this code.
When I test this with the Arduino Serial Monitor the LED turns on, but I can't turn it on from Qt.

Are you waiting for CR LF (0x0D 0x0A) in your Arduino code?
QByteArray ba;
ba.resize(3);
ba[0] = 0x52; //'R'
ba[1] = 0x0d;
ba[2] = 0x0a;
Or append it to your string with
QByteArray data("R\r\n");
Or
QByteArray data("R\n");

I think I have found a partial solution, but it is still incomplete.
When I press debug the first time, Qt does not send anything to the Arduino, but when I press debug a second time it behaves as expected.
Isn't it weird that one has to run it twice to get it working?
Let me know if the problem lies somewhere else.
Any help is appreciated.

std::find with type T** vs T*[N]

I prefer to work with std::string, but I would like to figure out what is going wrong here.
I am unable to understand why std::find isn't working properly for type T** even though pointer arithmetic works on it correctly, as in:
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
But it works fine for the type T*[N].
#include <iostream>
#include <algorithm>
int main( int argc, const char ** argv )
{
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
const char ** cmdPtr = std::find(argv+1, argv+argc, "Hello") ;
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
if( cmdPtr == argv+argc )
std::cout << "String not found" << std::endl;
if( testPtr != testAr+2 )
std::cout << "String found: " << *testPtr << std::endl;
return 0;
}
Arguments passed: Hello World
Output:
Hello World
String not found
String found: Hello
Thanks.

Comparing values of type char const* amounts to comparing addresses. The address of "Hello" is guaranteed to be different unless you compare it to the address of another string literal "Hello" (in which case the pointers may compare equal). Your compare() function compares the characters being pointed to.

In the first case, you're comparing the pointer values themselves and not what they're pointing to. And the constant "Hello" doesn't have the same address as the first element of argv.
Try using:
const char ** cmdPtr = std::find(argv+1, argv+argc, std::string("Hello")) ;
std::string knows to compare contents and not addresses.
For the array version, the compiler can fold all literals into a single one, so every time "Hello" is seen throughout the code it's really the same pointer. Thus, comparing for equality in
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
yields the expected result, although the language does not guarantee it.
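If you want to stay with C strings instead of constructing a std::string, one option is std::find_if with a predicate that compares contents. This is a minimal sketch; the helper name find_c_string is made up for illustration.

```cpp
#include <algorithm>
#include <cstring>

// Hypothetical helper: finds the first C string in [first, last) whose
// characters equal `wanted`, or returns `last` if there is none.
const char *const *find_c_string(const char *const *first,
                                 const char *const *last,
                                 const char *wanted)
{
    return std::find_if(first, last, [wanted](const char *s) {
        return std::strcmp(s, wanted) == 0;  // compare contents, not addresses
    });
}
```

With argv declared as in the question, find_c_string(argv + 1, argv + argc, "Hello") != argv + argc is true whenever one of the arguments is Hello, no matter where the strings are stored.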
