Using C# to measure the width of a string in pixels in a cross platform way - skiasharp

I have some existing C# code that uses System.Drawing.Common to measure the approximate width of a string in pixels:
var text = "abc123 this is some long text my dog's name is fido.";
using (var bitmap = new Bitmap(500, 50))
using (var graphics = Graphics.FromImage(bitmap))
{
// Size: 9 Points
using var font = new System.Drawing.Font(familyName: "Times New Roman", emSize: 9f);
var ms = graphics.MeasureString(text, font);
// Output: 'abc123 this is some long text my dog's name is fido.' via System.Drawing: 394.00195 x 22.183594
Console.WriteLine($"'{text}' via System.Drawing: {ms.Width} x {ms.Height}");
}
After upgrading to .NET 6.0, I got a bunch of warning messages telling me that these graphics primitives are only supported on Windows. I want this measurement to work on other platforms, so I tried to do something similar, first with SkiaSharp:
var text = "abc123 this is some long text my dog's name is fido.";
using (var paint = new SKPaint())
{
    paint.Typeface = SKTypeface.FromFamilyName("Times New Roman");
    // Size: 12px
    paint.TextSize = 12f;
    var skBounds = SKRect.Empty;
    var textWidth = paint.MeasureText(text.AsSpan(), ref skBounds);
    // Output: 'abc123 this is some long text my dog's name is fido.' via SkiaSharp: 251.13867 x 12
    Console.WriteLine($"'{text}' via SkiaSharp: {skBounds.Width} x {skBounds.Height}");
}
And ImageSharp:
var text = "abc123 this is some long text my dog's name is fido.";
// Size: 12px
var imgSharpFont = SixLabors.Fonts.SystemFonts.CreateFont("Times New Roman", 12f);
var imgSharpMeasurement = TextMeasurer.Measure(text, new RendererOptions(imgSharpFont));
// Output: 'abc123 this is some long text my dog's name is fido.' via ImageSharp: 251.13869 x 14.589844
Console.WriteLine($"'{text}' via ImageSharp: {imgSharpMeasurement.Width} x {imgSharpMeasurement.Height}");
However, as you can see, I can't get SkiaSharp or ImageSharp to produce the same width as System.Drawing, although the two of them agree closely with each other:
'abc123 this is some long text my dog's name is fido.' via System.Drawing: 394.00195
'abc123 this is some long text my dog's name is fido.' via SkiaSharp: 251.13867
'abc123 this is some long text my dog's name is fido.' via ImageSharp: 251.13869
I don't understand graphics programming well enough to know what I'm missing. It might be a unit conversion between Points and Pixels, or perhaps I'm not setting the correct properties. Any ideas on how to make SkiaSharp and/or ImageSharp return the same width measurement as System.Drawing.Common?
Thank you.

Basically, the issue is that System.Drawing.Graphics uses the DPI of your machine by default (the reason is that System.Drawing is built around GDI+, a Windows subsystem for drawing images to screens and printers).
ImageSharp and SkiaSharp, in contrast, both default to 72 DPI. (I know ImageSharp definitely does, and as the numbers from SkiaSharp and ImageSharp are within a rounding error of each other, SkiaSharp must be using a DPI of 72 as well.)
Based on the numbers you provided (394.00195 / 251.13867 × 72 ≈ 113), I would have to guess you are running a roughly 112 DPI monitor?
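If you want the two stacks to agree, measure at the same effective pixel size everywhere. Here is a minimal sketch of the conversion (my own illustration, not from the original post; the 96 DPI is an assumption, so substitute your machine's actual value, e.g. graphics.DpiX):
// One typographic point is 1/72 inch, so: pixels = points * dpi / 72.
const float pointSize = 9f;    // the System.Drawing font size above
const float assumedDpi = 96f;  // assumption: replace with your real Graphics DPI
float pixelSize = pointSize * assumedDpi / 72f; // 12px at 96 DPI

using var paint = new SKPaint();
paint.Typeface = SKTypeface.FromFamilyName("Times New Roman");
paint.TextSize = pixelSize; // now in the same units System.Drawing renders at
float width = paint.MeasureText("abc123 this is some long text my dog's name is fido.");
Note that Graphics.MeasureString also adds a little padding around the text (StringFormat.GenericTypographic suppresses it), so expect the numbers to be close rather than bit-identical.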

Related

C# - Cannot access file which is already being used - Iron OCR

I am using "Iron OCR", something like "Tesseract" to detect and scan certain Text from Screenshots.
Now I have the following error. Every time Iron OCR is used to scan an image for text it tries to access the Iron OCR log file which is somehow still used by the process before. So every time I get the error message that it can't access the log file because it is already in use. Nevertheless the Scan still works and I get a valid result even tho it gives me an exception because of that error.
My program works like this:
it takes a screenshots of certain areas of my screen.
it analyzes that image with Iron OCR and looks for text.
this process repeats itself infinitely.
I have following code:
//------------------------- # Capture Screenshot of specific Area # -------------------------\\
Rectangle bounds3;
Rect rect3 = new Rect();
bounds3 = new Rectangle(rect3.Left + 198, rect3.Top + 36, rect3.Right + 75 - rect3.Left - 10, rect3.Bottom + 30 - rect3.Top - 10);
CursorPosition = new Point(Cursor.Position.X - rect.Left, Cursor.Position.Y - rect.Top);
Bitmap result3 = new Bitmap(40, 14);
using (Graphics g = Graphics.FromImage(result3))
{
    g.CopyFromScreen(new Point(bounds3.Left, bounds3.Top), Point.Empty, bounds3.Size);
}
//------------------------- # Analyze Image for Text # -------------------------\\
var Ocr = new IronTesseract();
using (var Input = new OcrInput(result3))
{
    Input.Contrast();
    Input.EnhanceResolution(300);
    Input.Invert();
    Input.Sharpen();
    Input.ToGrayScale();
    try
    {
        //------------------- # This causes the Error - Using Try Catch to Ignore it # -------------------\\
        var Result = Ocr.Read(Input);
        text = Result.Text;
    }
    catch
    {
    }
}
Removing all of the above and only using their "1 Line Code" gives the same error message:
var Result = new IronTesseract().Read(@"images\image.png").Text;
I hope someone can help me figure out what exactly causes this issue.

Can Tesseract OCR recognize subscripts and superscripts?

I have problems with the general recognition of subscript and superscript in text fragments.
Example image:
I used Tesseract 4.1.1 with the training data available under https://github.com/tesseract-ocr/tessdata_best. The numerous options had default values except:
tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font infos like font size)
hocr_char_boxes = 1 (to get character-based result)
The language was set to eng. The subscript/superscript was not recognized correctly with page segmentation mode 3 (PSM_AUTO_OSD), 11 (PSM_SPARSE_TEXT), or 12 (PSM_SPARSE_TEXT_OSD).
In the output the sub/sup-fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR, is there a way to ...
optimize subscript/superscript handling?
get info about recognized subscripts/superscripts (in the HOCR output, ideally for each character)?
Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.
Following these 2 links from the tesseract-google-newsgroup, at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found out that the OEM_DEFAULT OCR engine mode just doesn't bring up the needed information. I found a partial solution to the problem. Partial, because I now get most info about sub/sup, and the recognized characters are also right in most cases, but not for all of them.
Using the OEM_TESSERACT_ONLY OCR engine mode (= the legacy mode) and some API methods provided by Tess4J, I came up with the following Java test class:
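// Note (added for clarity): the unqualified TessBaseAPI* / TessPageIterator* calls
// below are assumed to come from static imports of net.sourceforge.tess4j.TessAPI1.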
public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);
            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);
            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory
                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);
                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                        rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                        TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}
The legacy mode only works with 'normal' training data. Using the '-best' training data throws an error.
There is very little information on this topic.
One option to enhance sub/superscript character recognition (even if not the position itself) is to preprocess the image, e.g. with cv2 / PIL (Pillow), and then run Tesseract on it; a sketch of the idea follows after the link below.
See
How to detect subscript numbers in an image using OCR?
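A minimal sketch of that preprocessing idea (my own illustration; the file name, the 3x scale factor, and the pytesseract wrapper are assumptions):
import cv2
import pytesseract

# Upscale so small sub/superscript glyphs get more pixels, then binarize.
img = cv2.imread("fragment.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print(pytesseract.image_to_string(img, config="--psm 6"))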
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
What do you guys think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10:
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

Limiting text by the amount of space it takes up

Is there a way to limit the number of letters that can be put into an input field depending on the space they take up?
For example, W's and M's are wide enough that only 34 of them fit.
But if I type normal sentences, the same number of letters takes up only half the space compared to the W's and M's.
What I want is to be able to write basically anything as long as the input field is big enough for it.
Is that possible?
First, you can access the size of your input with something like this:
const ele = document.getElementById('yourInputId');
const eleStyle = window.getComputedStyle(ele);
const inputSize = {width: eleStyle.width, height: eleStyle.height};
Then, in another method, assuming you know the font and font size of your text (if you don't, check the Angular documentation), you can set up a method like this:
pixelLength(txt: string, font: number) {
    const canva = document.createElement('CANVAS');
    const attr = document.createAttribute('id');
    attr.value = 'myId';
    canva.setAttributeNode(attr);
    document.body.appendChild(canva);
    const c: any = document.getElementById('myId');
    const ctx = c.getContext('2d');
    ctx.font = font.toString() + 'px Helvetica';
    const result = ctx.measureText(txt).width;
    document.body.removeChild(canva);
    return result;
}
You might want to call this method in a custom form control, to check whether the text will be longer than inputSize.width; a rough sketch follows below.
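For example, a hypothetical wiring of the two pieces (plain DOM for illustration; 'yourInputId' and the 16px font size are assumptions, and pixelLength is the method above, which inside an Angular component you would call as this.pixelLength):
const input = document.getElementById('yourInputId') as HTMLInputElement;
input.addEventListener('input', () => {
    const maxWidth = parseFloat(window.getComputedStyle(input).width);
    // Trim characters until the rendered text fits the visible box.
    while (input.value.length > 0 && pixelLength(input.value, 16) > maxWidth) {
        input.value = input.value.slice(0, -1);
    }
});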
Let me know if something's unclear ;)
Happy coding.

How can I search pipeline with another pipeline value on google cloud dataflow

I would like to search for text which includes a specified word in stream data with Google Cloud Dataflow.
In detail, I will deal with the following two streams:
stream A: each element of the stream is a "word".
stream B: each element of the stream is a "text", and each text consists of "words". A text may contain a "word" from stream A.
Many "texts" flow into stream B frequently. On the other hand, a "word" flows into stream A only occasionally.
When a "word" flows into stream A, I would like to find every "text" which contains that "word" and flowed into stream B within the last 5 minutes.
Example:
time   stream A : stream B
00:01  -        : this is an apple
00:02  -        : this is an orange
00:03  -        : I have an apple
00:04  apple    : <= "this is an apple" and "I have an apple" are found
00:05  this     : <= "this is an apple" and "this is an orange" are found
Can I do this kind of search with Google Cloud Dataflow?
If I understand your question correctly, there are multiple ways to achieve something like what you want. I will describe two variations.
The basic idea in my example code is to use an inner join and SlidingWindows of five minutes. You can implement the join using ParDo side inputs or CoGroupByKey, depending on your data sizes.
Here is how you set up your inputs and windowing:
PCollection<String> streamA = ...;
PCollection<String> streamB = ...;

PCollection<String> windowedStreamA = streamA.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
PCollection<String> windowedStreamB = streamB.apply(
    Window.into(
        SlidingWindows.of(Duration.standardMinutes(5)).every(...)));
You may want to adjust the window size or period to meet your specification & performance needs.
Here is a sketch of how to do the join with side inputs. This will iterate over the entire five minute window of streamB for each element of streamA, so performance will suffer if windows get large.
PCollectionView<Iterable<String>> streamBview =
    windowedStreamB.apply(View.asIterable());

PCollection<String> matches = windowedStreamA.apply(
    ParDo.withSideInputs(streamBview).of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext context) {
        for (String text : context.sideInput(streamBview)) {
          if (split(text).contains(context.element())) {
            context.output(text);
          }
        }
      }
    }));
Here is a sketch of how to do this with CoGroupByKey by pre-splitting the text and joining each keyword with the lines that contain that keyword. There is similar logic in the TfIdf example included with the SDK.
PCollection<KV<String, Void>> keyedStreamA = windowedStreamA.apply(
    MapElements
        .via(word -> KV.of(word, (Void) null))
        .withOutputType(new TypeDescriptor<KV<String, Void>>() {}));
PCollection<KV<String, String>> keyedStreamB = windowedStreamB.apply(
    FlatMapElements
        .via(text -> split(text).stream()
            .map(word -> KV.of(word, text))
            .collect(Collectors.toList()))
        .withOutputType(new TypeDescriptor<KV<String, String>>() {}));

TupleTag<Void> tagA = new TupleTag<Void>() {};
TupleTag<String> tagB = new TupleTag<String>() {};
KeyedPCollectionTuple<String> coGbkInput = KeyedPCollectionTuple
    .of(tagA, keyedStreamA)
    .and(tagB, keyedStreamB);

PCollection<String> matches = coGbkInput
    .apply(CoGroupByKey.<String>create())
    .apply(FlatMapElements
        // For a strict join, also check that result.getValue().getAll(tagA) is non-empty.
        .via(result -> result.getValue().getAll(tagB))
        .withOutputType(new TypeDescriptor<String>() {}));
The best approach will depend on your data. If you are OK with getting more matches than just the last five minutes, you can tune the amount of data duplication in windows by enlarging your sliding windows and using a larger period. You can also use triggers to tune when output is produced.
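For instance, here is a hedged sketch of adding a trigger to the windowed stream (the 1-minute period and 30-second delay are example values, not from the question):
// Emit speculative results every ~30s of processing time while a window
// is still open, instead of waiting for the window to close.
PCollection<String> triggeredStreamB = streamB.apply(
    Window.<String>into(
            SlidingWindows.of(Duration.standardMinutes(5))
                .every(Duration.standardMinutes(1)))
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(30))))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes());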

Can I configure Tesseract to recognize text from an image with only a specified length?

I am working on some OCR experiments where I would like to improve the quality of Tesseract output. Basically the test subject is things like CAPTCHAs: random characters on an obfuscated image. Right now Tesseract isn't doing a very good job, partially because it sometimes identifies a certain character as several characters/digits separately.
I am wondering if telling Tesseract that my specific image should always contain a text of a certain length, say six characters, could improve the OCR recognition result a bit. But I am not sure if this is even supported in Tesseract.
I didn't find documentation on that point. Could someone help point out whether such a feature exists, and if it does, what configuration parameter I can set? Thanks!
Try this example for specifying the length of the text. Set the bound of the for loop to the length of text you need to recognize.
Consider following code:
Pix *image = pixRead("/usr/src/tesseract-3.02/phototest.tif");
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
api->Init(NULL, "eng");
api->SetImage(image);
Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
printf("Found %d textline image components.\n", boxes->n);
for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box->x, box->y, box->w, box->h, conf, ocrResult);
}
In for (int i = 0; i < boxes->n; i++), replace boxes->n with 20 if you want a specified length of 20.