Aspose.OCR fails to read simple JPEG files

I am testing Aspose.OCR by attempting to OCR a simple document, but OcrEngine.Process() returns gibberish for both my own sample document and the sample provided by Aspose.
My code:
var license = new License();
license.SetLicense("Aspose.OCR.lic");
OcrEngine ocrEngine = new OcrEngine();
string text = null;
ocrEngine.Image = ImageStream.FromFile("Sample.Aspose.jpg");
if (ocrEngine.Process())
{
    text = ocrEngine.Text.ToString();
}
Assert.IsTrue(text.Contains("TRUTH"), text);
Sample.Aspose.jpg is a copy of Aspose's Git sample.
The text returned (truncated for brevity) starts with:
Avi [hhhBuyahLITITI Ll r h u -- - ] ---hhh --III-f LIII-fhh l t} ITI r
F8 4 1 T Y L h IiRlm'kpfan order 081Dec
- -
hh - hh - - h - h j : t ITI lblel tljehrerlly }}ollnatffst/t trun IT IT } li IIIckaigf nigh ''I.. } : :;;.et}:
fc.'IL:ef:t;;e;atc{1';;;:L IT':c:, ,.,.:,, ., ,...,. ''I
Equivalent gibberish is returned from a sample GIF.
Am I missing some simple settings for the OcrEngine?

The sample file you used is an example for the OMR operation, not OCR. Use the file “Sample.bmp” for the OCR example instead. The code snippet itself is fine and will work with that file.
I work with Aspose as a Developer Evangelist.

Deleting commas in R Markdown html output

I am using R Markdown to create an html file for regression results tables, which are produced by stargazer and lfe in a code chunk.
library(lfe); library(stargazer)
data <- data.frame(x = 1:10, y = rnorm(10), z = rnorm(10))
result <- stargazer(felm(y ~ x + z, data = data), type = 'html')
I create an HTML file with the inline code r result after the chunk above. However, a bunch of commas appear at the top of the table.
When I check the html code, I see almost every </tr> is followed by a comma.
How can I delete these commas?
Maybe not exactly what you are looking for, but I am a huge fan of modelsummary. I knit to HTML to see how it looks and then usually knit to PDF. The modelsummary equivalent would look something like this:
library(lfe)
library(modelsummary)
data = data.frame(x = 1:10, y = rnorm(10), z = rnorm(10))
results = felm(y ~ x + z, data = data)
modelsummary(results)
There are a lot of ways to customize it through kableExtra and other packages. The documentation is really good. Here is a somewhat silly example:
library(kableExtra)
modelsummary(results,
             coef_map = c("x" = "Cool Treatment",
                          "z" = "Confounder",
                          "(Intercept)" = "(Intercept)")) %>%
  row_spec(1, background = "#F5ABEA")

links.append() does not show all the image links as a list

I need to scrape 29 images of this hotel. With the code below, each run of the cell outputs only a single link. Even though I use links.append(), I have to re-run the cell to get another image.
import requests as rq
from bs4 import BeautifulSoup
r = rq.get("https://uk.hotels.com/ho177101/?q-check-out=2020-04-18&FPQ=3&q-check-in=2020-04-17&WOE=6&WOD=5&q-room-0-children=0&pa=1&tab=description&JHR=2&q-room-0-adults=2&YGF=2&MGT=1&ZSX=0&SYE=3#:WO")
soup = BeautifulSoup(r.text, "html.parser")
links = []
x = soup.select('img[src^="https://exp.cdn-hotels.com/hotels/1000000/560000/558400/558353"]')
for img in x:
    links.append(img['src'])
#os.mkdir("hotel_photos")
for l in links:
    print(l)
Thank you in advance!
Try this:
links = []
x = soup.select('a[href^="https://exp.cdn-hotels.com/hotels/1000000/560000/558400/558353"]')
for img in x:
    links.append(img['href'])
However, this only works for that specific page, because the prefix "https://exp.cdn-hotels.com/hotels/1000000/560000/558400/558353" changes depending on the hotel. If you need code that scrapes any link you provide, this is a better approach:
x = soup.select("li[id^='thumb-']")
for i in x:
    links.append(next(i.children, None)["href"])
for l in links:
    print(l)
PS: If you then need to download the pictures, make sure to replace "&w=82&h=82" with "&w=773&h=530" in each link to match the size of the picture displayed.
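The size substitution from the PS can be sketched as a small helper. This is only a sketch: the name full_size_url and the example link's file name and query string are hypothetical, and it assumes every thumbnail link carries the "&w=82&h=82" parameters exactly as written above.

```python
THUMB_SIZE = "&w=82&h=82"    # size parameters found on the thumbnail links
FULL_SIZE = "&w=773&h=530"   # size parameters of the picture displayed on the page


def full_size_url(thumb_url):
    """Return thumb_url with the thumbnail size parameters swapped for the full size."""
    return thumb_url.replace(THUMB_SIZE, FULL_SIZE)


# Hypothetical thumbnail link; only the size parameters matter here.
thumb = "https://exp.cdn-hotels.com/hotels/example.jpg?id=0&w=82&h=82"
print(full_size_url(thumb))
```

A link that does not contain the thumbnail parameters is returned unchanged, so the helper can be applied to every entry in links before downloading.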

Executing all cases one after another inside a for loop in octave

Until now, I have changed req manually. The code works, including saving the result to a file.
But now I want to run the code for all possible values of req.
Without saving to a file, the code works, but obviously it overwrites the result on each iteration.
That is why I added the line that saves the result under a different name depending on the value of req. But this gives me an error.
The error:
error: sprintf: wrong type argument 'cell'
error: called from
testforloop at line 26 column 1
My code:
clear all;
clc;
for req = {"del_1", "del_2", "del_3"}
request = req;
if (strcmp(request, "del_1"))
tarr = 11;
# and a bunch of other variables
elseif (strcmp(request, "del_2"))
tarr = 22;
# and a bunch of other variables
elseif (strcmp(request, "del_3"))
tarr = 33;
# and a bunch of other variables
else
# do nothing
endif
#long calculation producing many variable including aa, bb, cc.
aa = 2 * tarr;
bb = 3 * tarr;
cc = 4 * tarr;
#collecting variables of interest: aa, bb, cc and save it to a file.
result_matrix = [aa bb cc];
dlmwrite (sprintf('file_result_%s.csv', request), result_matrix);
endfor
If I use ["del_1" "del_2" "del_3"] instead, the error is:
error: 'tarr' undefined near line 20 column 10
error: called from
testforloop at line 20 column 4
Inside the loop
for req = {"del_1", "del_2", "del_3"}
req gets as value each of the cells of the cell array, not the contents of the cells (weird design decision, IMO, but this is the way it works). Thus, req={"del_1"} in the first iteration. The string itself can then be obtained with req{1}. So all you need to change is:
request = req{1};
However, I would implement this differently, like so:
function myfunction(request, tarr)
  % long calculation producing many variable including aa, bb, cc.
  aa = 2 * tarr;
  bb = 3 * tarr;
  cc = 4 * tarr;
  % collecting variables of interest: aa, bb, cc and save it to a file.
  result_matrix = [aa bb cc];
  dlmwrite (sprintf('file_result_%s.csv', request), result_matrix);
end
myfunction("del_1", 11)
myfunction("del_2", 22)
myfunction("del_3", 33)
I think this gives a clearer view of what you're actually doing, and the code is less complicated.
Note that in Octave, ["del_1" "del_2" "del_3"] evaluates to "del_1del_2del_3". That is, you concatenate the strings. In MATLAB this is not the case, but Octave doesn't know the string type, and uses " in the same way as ' to create char arrays.
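For comparison, the same refactoring (one function per calculation, called once per case) can be sketched in Python. This is an illustration rather than a translation of the real calculation: write_result, cases and out_dir are hypothetical names, and Python's csv module stands in for dlmwrite.

```python
import csv
import os
import tempfile


def write_result(request, tarr, out_dir="."):
    """Compute the values of interest for one request and save them as a CSV row."""
    aa = 2 * tarr
    bb = 3 * tarr
    cc = 4 * tarr
    path = os.path.join(out_dir, "file_result_%s.csv" % request)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow([aa, bb, cc])
    return path


# One (name, tarr) pair per case. The loop variable is the pair itself,
# so no unwrapping step such as Octave's req{1} is needed.
cases = [("del_1", 11), ("del_2", 22), ("del_3", 33)]

if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()  # write into a scratch directory
    for request, tarr in cases:
        print(write_result(request, tarr, out_dir))
```

Keeping the name and its tarr value together in one list means that adding a case is a one-line change and no if/elseif chain is required.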

Reading from binary file into several labels on a form in C#

I'm writing a trivia game app in C# that writes data to a binary file, then reads the data from the file into six labels. The six labels are as follows:
lblQuestion // This is where the question text goes.
lblPoints // This is where the question points goes.
lblAnswerA // This is where multiple choice answer A goes.
lblAnswerB // This is where multiple choice answer B goes.
lblAnswerC // This is where multiple choice answer C goes.
lblAnswerD // This is where multiple choice answer D goes.
Here is the code for writing to the binary file:
{
    bw.Write(Question);
    bw.Write(Points);
    bw.Write(AnswerA);
    bw.Write(AnswerB);
    bw.Write(AnswerC);
    bw.Write(AnswerD);
}
Now for the code to read data from the file into the corresponding labels:
{
    FileStream fs = File.OpenRead(ofd.FileName);
    BinaryReader br = new BinaryReader(fs);
    lblQuestion.Text = br.ReadString();
    lblPoints.Text = br.ReadInt32() + " points";
    lblAnswerA.Text = br.ReadString();
    lblAnswerB.Text = br.ReadString();
    lblAnswerC.Text = br.ReadString();
    lblAnswerD.Text = br.ReadString();
}
The Question string reads correctly into lblQuestion.
The Points value reads correctly into lblPoints.
AnswerA, AnswerB, and AnswerC DO NOT read into lblAnswerA, lblAnswerB and lblAnswerC respectively.
lblAnswerD, however, gets the string meant for lblAnswerA.
Looking at the code for reading data into the labels, is there something missing, some sort of incremental value that needs to be inserted into the code in order to get the strings to the correct labels?

Character confidence for Tesseract 3.02 using config file

How would I get the % confidence per character detected?
By searching around I found that you should set save_blob_choices to T.
So I added that as a line to the hocr config file in tessdata/configs and called tesseract with it.
This is all I'm getting in the generated html file:
<span class='ocr_line' id='line_1' title="bbox 0 0 50 17"><span class='ocrx_word' id='word_1' title="bbox 3 2 45 15"><strong>31,835</strong></span>
As you can see, there aren't any confidence annotations, not even per word.
I don't have visual studio so I'm not able to make any code changes. But I'm also open to answers describing code changes as well as how I would compile the code without VS.
Here is sample code for getting the confidence of each word.
You can replace RIL_WORD with RIL_SYMBOL to get the confidence of each character instead.
mTess.Recognize(0);
tesseract::ResultIterator* ri = mTess.GetIterator();
if (ri != 0)
{
    do
    {
        const char* word = ri->GetUTF8Text(tesseract::RIL_WORD);
        if (word != 0)
        {
            float conf = ri->Confidence(tesseract::RIL_WORD);
            printf(" word:%s, confidence: %f", word, conf);
        }
        delete[] word;
    } while (ri->Next(tesseract::RIL_WORD));
    delete ri;
}
You will have to write a program to do this. Take a look at the ResultIterator API example on the Tesseract site. For your case, be sure to set the save_blob_choices variable and iterate at the RIL_SYMBOL level.