I am trying to convert the attached scanned JPEG file to text with OCR. When I use pytesseract or tesseract, diacritics show up and my output contains a lot of junk characters. As a result, the conversion from JPEG to text is not working.
I tried to read the image file, extract the text, and type it out using keystrokes. The output is not as expected.
The code is as follows:
from PIL import Image
from pytesseract import image_to_string
import keyboard
image = Image.open('8001.jpg')
text = image_to_string(image, lang='eng')
keyboard.write(text)
I am getting some unwanted characters like these:
>) ) 7? ) 7 0
Daybreak: appeared. Ihe mowing miosls ourvounded us, bub Urey 2001 cleared ch J Wea
> pm 0. 0 ) ) aeaboul lo examine the hull, which formed on deely a kind of horizontal
2
fatfoun, w fen a J felt ils
op
nel, kicking the resounding plate. “Open,
) me " 57
gradually sinking. Oh! confound i! cried Nod
0 Q yi
you inhoapitable zasealy!
Says Pp iy ui
0 0
cide, came from the interior of the Boal. One iton plate was moved, a men appeared, ullered
When using Puppeteer to print a page as PDF, Puppeteer may convert images in that page to a different format.
For example, printing a JPEG image will result in a PDF with (roughly) the same size as the image. That means Puppeteer is using the same exact JPEG image in the generated PDF. Same happens with other formats like PNG and SVG (the output size matches the size of the original images).
However, printing a WebP image will result in a PDF with a much bigger size (10x more than expected). This seems to be because Puppeteer is converting the WebP image into a JPEG/PNG image before generating the PDF.
I am guessing this is because WebP is not supported (maybe not even by the PDF standard and that may be the reason Puppeteer converts the WebP image in the first place).
Is there a way to control this image conversion? In particular, is it possible to set the target format (ideally JPEG) and quality (ideally < 100) to try to maintain the output size of the PDF in the same range as the input WebP image size?
It may help to look, at two levels, at what happens to images when they are saved as PDF. Note that this is a basic demo, not a real-world case; it is only meant to explain the considerations.
Upper left we have 5x5 pixels; screen rendering applies blurring so that images do not look "sharp", whereas upper right a PDF viewer tries to maintain vector sharpness.
So what about different formats? GIF, TIF and PNG (middle line) are lossless and behave in a roughly similar fashion. All should maintain colour pixel fidelity in a PDF.
However (lower line), JPEG is lousy at maintaining colour fidelity because it spreads the colours between adjoining pixels, which is "perfect" for fuzzy text or photographs but not much good for PDF colours.
OK, moving on: your focus is the input to the PDF, so what do those images look like when stored?
Each may be written in many ways, but let's focus on the most versatile, PNG.
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/MediaBox[0 0 3.75 3.75]/Rotate 0/Resources<</XObject<</Img3 6 0 R>>>>/Contents 5 0 R/Parent 2 0 R>>endobj
5 0 obj<</Length 34>>
stream
q
3.75 0 0 3.75 0 0 cm
/Img3 Do
Q
endstream
endobj
6 0 obj<</Length 75/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB>>
stream
ÿ ÿÿÿ ÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿ###ÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿÿÿ ÿÿÿ
endstream
endobj
7 0 obj
<</Length 5/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 1/ColorSpace/DeviceGray>>
stream
ÿÿÿÿÿ
endstream
endobj
xref
0 7
0000000000 65536 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000334 00000 n
0000000472 00000 n
0000000555 00000 n
trailer
<</Size 7/Info<</Producer(Me)>>/Root 1 0 R>>
startxref
684
%%EOF
Again, for illustration, this is a non-typical stream, as it shows the bitmap uncompressed, but note that the main image is defined by
6 0 obj<</Length 75/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB>>
So, of interest: it is 5 pixels wide by 5 pixels high, with NO hint of how many inches. It is just 8 bits R, 8 bits G and 8 bits B per pixel (again, only 3 colour channels); the alpha is in a separate image (the SMask, object 7). So 3x5x5=75 bytes is the uncompressed storage. Now we can compress in many ways, such as "Flate" (similar to what is used in, say, a zip file),
which will convert the stream from lots of ÿs into a more compact form.
Again, there are many encodings, so if we wish to keep the PDF as text for editing in a text editor, we can first hex-encode the Flate output:
/Length 72
/Filter [ /ASCIIHexDecode /FlateDecode ]
>>
stream
789cfbcfc0f0ffff7f8605601244a002b0888383038c892a091582e86500009c
663a28>
endstream
Well, that was not much compression, down from 75 to 72!
Let's use something better by not using plain text.
6 0 obj
<</Length 36/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB/Filter/FlateDecode>>
stream
xœû¯ ðÿÿ…`D °ˆƒƒŒ‰*©€P¤ ÞF;è
endstream
OK, much better: we halved the storage from 72 down to 36. Good, it's small, compact and well formed.
So, what about keeping the JPEG structure? Ahhhh! Maintaining its lossy nature needs 730:
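The size relationships above can be checked with a short sketch. This is only an illustration under stated assumptions: the 5x5 bitmap below is a made-up one (all white with one black pixel), not the exact image from the demo, so the compressed sizes it prints will differ from the 72 and 36 quoted above. What it does show is that the raw RGB data is 3x5x5=75 bytes and that hex-encoding the Flate output roughly doubles its length.

```python
import zlib

WIDTH, HEIGHT = 5, 5
WHITE, BLACK = b"\xff\xff\xff", b"\x00\x00\x00"

# Hypothetical 5x5 RGB bitmap: all white with one black pixel in the centre.
pixels = [WHITE] * (WIDTH * HEIGHT)
pixels[12] = BLACK
raw = b"".join(pixels)          # 3 bytes per pixel, no alpha channel

# What a /FlateDecode stream would hold (zlib-wrapped deflate data).
flate = zlib.compress(raw, 9)

# What an /ASCIIHexDecode + /FlateDecode stream would hold:
# two hex digits per byte, terminated by the '>' end-of-data marker.
hex_text = flate.hex() + ">"

print("raw bytes:      ", len(raw))
print("flate bytes:    ", len(flate))
print("hex-text chars: ", len(hex_text))
```

The text-friendly variant is always about twice the size of the binary Flate stream, which is why it only makes sense while the PDF is being hand-edited.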
<</Filter/DCTDecode/Type/XObject/Subtype/Image/BitsPerComponent 8/Width 5/Height 5/ColorSpace/DeviceRGB/Length 730>>
stream
ÿØÿà JFIF ` ` ÿÛ C
ÿÛ C
ÿÀ " ÿÄ
ÿÄ µ } !1AQa "q2‘¡#B±ÁRÑð$3br‚
%&'()*456789:CDEFGHIJSTUVWXYZcdefghijstuvwxyzƒ„…†‡ˆ‰Š’“”•–—˜™š¢£¤¥¦§¨©ª²³´µ¶·¸¹ºÂÃÄÅÆÇÈÉÊÒÓÔÕÖ×ØÙÚáâãäåæçèéêñòóôõö÷øùúÿÄ
ÿÄ µ w !1AQ aq"2B‘¡±Á #3RðbrÑ
$4á%ñ&'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz‚ƒ„…†‡ˆ‰Š’“”•–—˜™š¢£¤¥¦§¨©ª²³´µ¶·¸¹ºÂÃÄÅÆÇÈÉÊÒÓÔÕÖ×ØÙÚâãäåæçèéêòóôõö÷øùúÿÚ ? ëÿ j¯Ú[Á_ >|×õ¯„šgŽ¡ñ–™ý¥mg¨.˜ƒJ
§é€Æ„鮬
´`©þ¬ ㌢Šõiæ¸ì,ëáèUq„*ÖŒRÑ%³I}Ëüõ<ìFŽ*£¯QËšVnÓšWi_E$—ÉÿÙ
endstream
endobj
So this test piece is not real world, but it may help in deciding on the best storage for different inputs.
My preference is to use PNG where possible for charts and document text, and to use JPEG only when essential, for photos or fuzzy OCR.
For your offered sample, JPEG is necessary, but even with quality set high, reducing the size from the maximum can cause collateral damage.
However, it's not very noticeable unless you zoom in closer than the point where it gets blobby anyway; here it is at 4x zoom:
Source 58-59 KB
Slightly reduced 50-51 KB
As shown in the photo I uploaded, I can't correctly plot on TF M15 the vertical line that appears on TF M5.
I already backtested to see what happens when the price meets the condition on M5.
It does appear correctly on TF M15 when the condition on M5 is true, but when I switch between TFs or just reload the page, it disappears.
I uploaded my code here; please help me fix this. Many thanks!
//@version=4
// Big bar size can be used as a support/resistance level.
study("bar_size 191022", shorttitle="Moment detect ", overlay=true)
body_limit=input(15,minval=1,title="Body Limit")
open0=security(syminfo.tickerid, "5",open)
close0=security(syminfo.tickerid, "5",close)
body = abs(open0-close0)
con_momen1=body > body_limit
con_up1=(open0 < close0)
con_buy1=con_momen1 and con_up1
bgcolor(con_buy1?color.silver:na,transp=0)
I have an ultrasound wave (graph axes: volts vs. microseconds) and need to cut the signal between two specific values to analyze the clipped part further. My idea is to cut the signal at 0.2 V (y-axis). The wave is sine shaped, as shown in the figure, with the desired cutoff points in red.
In my current code, I'm cutting the signal between 1900 and 4000 ms (x-axis) (Aa = A(1900:4000);), and then I want to apply the aforementioned clipping and proceed with the code.
Does anyone know how I could do this y-axis clipping?
Thanks!! :)
clear
clf
pkg load signal
for k=1:2
w=1
filename=strcat("PCB 2.1 (",sprintf("%01d",k),").mat")
load(filename)
Lthisrun=length(A);
Pico(k,1:Lthisrun)=A;
Aa = A(1900:4000);
Ah= abs(hilbert(Aa));
step=100;
hold on
i=1;
Ac=0;
for index=1:step:3601
Ac(i+1)=Ac(i)+Ah(i);
i=i+1
r(k)=trapz(Ac)
end
end
OK, you want to look only at values 'above the noise' in your data or, in this case, to 'clip out' everything below 0.2 V. The easiest way to do this is with logical indexing: you can take an array and create a sub-array that eliminates everything that doesn't meet a certain logical condition. See this example:
f = @(x) sin(x)./x;
x = [-100:.1:100];
y = f(x);
plot(x,y);
figure;
x_trim = x(y>0.2);
y_trim = y(y>0.2);
plot(x_trim, y_trim);
From your question it looks like you want to do the clipping after applying the horizontal windowing from 1900-4000 (you say that this is in milliseconds, but your image shows the pulse occurring much sooner than 1900 ms). In any case, something like
Ab = Aa(Aa > 0.2);
will create another array Ab that will only contain the portions of Aa with values above 0.2. You may need to do something similar (see the example) for the horizontal axis if your x-data is not just the element index.
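For readers outside Octave, the same idea can be sketched in plain Python. This is a minimal illustration, not the asker's data: the decaying sine burst is a synthetic stand-in for Aa, and the names clip_above, times and signal are hypothetical. The point is that the time axis is filtered with the same condition as the amplitudes, so the two stay aligned.

```python
import math

def clip_above(times, values, threshold):
    """Keep only the samples whose value exceeds threshold, with their times."""
    kept = [(t, v) for t, v in zip(times, values) if v > threshold]
    kept_times = [t for t, _ in kept]
    kept_values = [v for _, v in kept]
    return kept_times, kept_values

# Synthetic stand-in for the windowed signal Aa: a decaying sine burst.
times = [i * 0.01 for i in range(1000)]
signal = [math.exp(-0.5 * t) * math.sin(2 * math.pi * t) for t in times]

# Equivalent of Ab = Aa(Aa > 0.2), plus the matching x values.
t_clip, a_clip = clip_above(times, signal, 0.2)
print(len(signal), "samples in,", len(a_clip), "samples above 0.2")
```

With NumPy arrays the same filter is usually written with a boolean mask, for example a[a > 0.2] and t[a > 0.2].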
I've installed tesseract in my Linux environment.
It works when I execute something like
# tesseract myPic.jpg /output
But my picture has some small labels and tesseract doesn't see them.
Is there an option to set a pitch or something like that?
Example of text labels:
With this pic, tesseract doesn't recognize any value...
But with this pic:
I have the following output:
J8
J7A-J7B P7 \
2
40 50 0 180 190
200
P1 P2 7
110 110
\ l
For example, in this case, the 90 (top left) is not seen by tesseract.
I think it's just an option to define, or something like that, no?
Thx
In order to get accurate results from Tesseract (as well as any OCR engine) you will need to follow some guidelines as can be seen in my answer on this post:
Junk results when using Tesseract OCR and tess-two
Here is the gist of it:
Use a high-resolution image (if needed); 300 DPI is the minimum
Make sure there are no shadows or bends in the image
If there is any skew, you will need to fix the image in code prior to OCR
Use a dictionary to help get good results
Adjust the text size (12 pt font is ideal)
Binarize the image and use image processing algorithms to remove noise
It is also recommended to spend some time training the OCR engine to receive better results as seen in this link: Training Tesseract
I took the 2 images that you shared and ran some image processing on them using the LEADTOOLS SDK (disclaimer: I am an employee of this company). With the processed images I was able to get better results than you were getting, but since the original images aren't the greatest, it still was not 100%. Here is the code I used to try and fix the images:
//initialize the codecs class
using (RasterCodecs codecs = new RasterCodecs())
{
    //load the file
    using (RasterImage img = codecs.Load(filename))
    {
        //Run the image processing sequence starting by resizing the image
        double newWidth = (img.Width / (double)img.XResolution) * 300;
        double newHeight = (img.Height / (double)img.YResolution) * 300;
        SizeCommand sizeCommand = new SizeCommand((int)newWidth, (int)newHeight, RasterSizeFlags.Resample);
        sizeCommand.Run(img);
        //binarize the image
        AutoBinarizeCommand autoBinarize = new AutoBinarizeCommand();
        autoBinarize.Run(img);
        //change it to 1BPP
        ColorResolutionCommand colorResolution = new ColorResolutionCommand();
        colorResolution.BitsPerPixel = 1;
        colorResolution.Run(img);
        //save the image as PNG
        codecs.Save(img, outputFile, RasterImageFormat.Png, 0);
    }
}
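The resize arithmetic in the first step may be easier to follow in isolation. The sketch below is library-independent Python, not LEADTOOLS code; the function name size_at_dpi and the sample numbers are hypothetical. It converts the pixel size to physical inches using the image's stored resolution, then back to pixels at 300 DPI. It rounds to the nearest pixel, whereas the (int) casts above truncate.

```python
def size_at_dpi(width_px, height_px, x_dpi, y_dpi, target_dpi=300):
    """Pixel size that covers the same physical area at target_dpi."""
    if x_dpi <= 0 or y_dpi <= 0:
        raise ValueError("resolution must be positive")
    # width in inches = width_px / x_dpi; multiply by target_dpi for new pixels
    new_width = round(width_px * target_dpi / x_dpi)
    new_height = round(height_px * target_dpi / y_dpi)
    return new_width, new_height

# A 640x480 screenshot tagged as 96 DPI needs to be about 3x larger.
print(size_at_dpi(640, 480, 96, 96))  # (2000, 1500)
```

An image that is already tagged as 300 DPI comes back unchanged, which matches the "if needed" wording in the guidelines.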
Here are the output images from this process:
I am using the Tesseract library to extract text from images. The language is Vietnamese. I have two images. The first one is from a website. The second is a screenshot taken from the Wordpad program. They are shown in links below:
1
2
The first one has 95% accuracy.
Bán căn hộ tầng 5 khu tập thể Thành công Bắc, DT 28m2, gần chợ ThànhCông,
số
đỏ, chính chủ, giá 800 triệu.LH:A.Châu, 0979622551,0905685336
The second image is much larger but the accuracy is just about 60%.
Bặn căn hộ tầng ậ khu tập thể Ỉhành gông
Băc. llĩ 28 m2. gân chợ ĩllành Bông. sũ Ilỏ.
chính l:lIlì. giá 800 lriệu. l.ll: A.BhâU,
0979622551, 0905685336
What do I have to fix in the second image to get text as accurate as from the first one?
As stated by @user898678 in "image processing to improve tesseract OCR accuracy",
the following operations can improve OCR accuracy:
fix DPI (if needed); 300 DPI is the minimum
fix text size (e.g. 12 pt should be OK)
try to fix text lines (deskew and dewarp text)
try to fix illumination of the image (e.g. no dark parts in the image)
binarize and de-noise the image
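As a rough illustration of the binarize step in the list above, here is a minimal sketch in plain Python. The list does not say which thresholding method to use, so Otsu's method is an assumption here, and the function names and the tiny pixel list are hypothetical. In practice an imaging library would do this on a real grayscale image.

```python
def otsu_threshold(pixels):
    """Return the 0..255 threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels with value <= t
    sum0 = 0    # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t

def binarize(pixels, threshold):
    """Map each grayscale value to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]

# Hypothetical grayscale samples: dark text pixels and light background.
gray = [10, 12, 11, 200, 210, 205]
print(binarize(gray, otsu_threshold(gray)))  # [0, 0, 0, 255, 255, 255]
```

A global threshold like this tends to work when illumination is even, which is one reason the illumination fix is listed before binarization.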