Tesseract gives junk output for the Japanese language (OCR)

I'm trying to build a sample Java application for Japanese that reads an image file and prints the text extracted from it. I found a sample application online that works perfectly for English, but for Japanese it produces unrecognizable text. Here is my code:
BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
// Initialize tesseract-ocr with Japanese, without specifying tessdata path
if (api.Init(".", "jpn") != 0) {
System.err.println("Could not initialize tesseract.");
System.exit(1);
}
// Open input image with leptonica library
PIX image = pixRead("test.png");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
String string = outText.getString();
assertTrue(!string.isEmpty());
System.out.println("OCR output:\n" + string);
// Destroy used object and release memory
api.End();
outText.deallocate();
pixDestroy(image);
My output is:
OCR output:
ETCカー-ード申 込書
�申�込�日 09/02/2017
ETC FeatureID ETCFFL
ー申込枚輩交 画 枚
I have used jpn.tessdata, and my application does read the tessdata file. Is any further configuration needed? I'm using Tesseract 3.02 with a very clean image.

Yes, I found the solution! What we need to do is set the locale in our Java code as follows:
olocale = new Locale.Builder().setLanguage("ja").setRegion("JP").build();
We can also set a locale for English in order to extract both Japanese and English text from the image.
Now it works like a charm for me!


Apps script JSON.parse() returns unexpected result, how can I solve this?

I am currently working on an external app that uses Google Sheets and JSON for data transmission via the Fetch API. For debugging, I decided to mock the scenario in which simple JSON arrives from my external app and the prepared Code.gs posts it to Google Sheets. The snippet I run in Apps Script looks like this:
function _doPost(/* e */) {
// const body = e.postData.contents;
const bodyJSON = JSON.parse("{\"coords\" : \"123,456,789,112,113,114,115,116\"}" /* instead of : body */);
const db = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
db.getRange("A1:A10").setValue(bodyJSON.coords).setNumberFormat("#"); // get range, set value, set text format
}
The problem is the result I get: 123,456,789,112,113,000,000,000. As you can see, from 114 onward it outputs 000,... instead. I thought I would explicitly specify that the value should be saved as text. When I select the range in the Google Sheets UI, Format -> Number shows Text.
However, something odd happens. If I change the JSON body so that the sequence also contains 2-digit numbers instead of only 3-digit ones (note: these are part of a string, not true numbers, separated by commas), for example "{\"coords\" : \"123,456,789,112,113,114,115,116,17,18\"}", the result not only appears as expected, it also restores the "corrupted" values that were hidden behind 000,...: "{"coords" : "123,456,789,112,113,114,115,116,17,18 "}".
Even Logger.log() returns the initial JSON input as expected. I really have no clue what is going on, and I would appreciate any help solving this issue. Thank you.
You can try assigning an object literal directly to your bodyJSON variable instead of parsing a string with JSON.parse.
Part of your code should look like this:
const bodyJSON = {
"coords" : "123,456,789,112,113,114,115,116"
}
I found a simple workaround after all: I added a pair of leading zeros, 0,0,123,..., at the very beginning of coords. This prevents the problem I described in my question. If anyone is interested, the external app I am building is called Hotspot widget: it plays around with the DOM and appends a marker whose coordinates (coords) are pushed through Apps Script and saved to Google Sheets. I am providing a link with instructions on how to set up your own copy of the app. It's a decent starting point for learning vanilla JavaScript basics, including a simple on-the-fly database approach. Thank you and good luck!
Hotspot widget on Github
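One possible reading of the behaviour described in this thread, offered as an assumption and not confirmed by anyone above: the string is a valid thousands-grouped number, so Sheets may convert it to a number and keep only about 15 significant digits, whereas a string with a 2-digit group (or the leading 0,0) is not a valid grouped number and stays text. The Python sketch below only models that hypothesis; simulate_cell and the 15-digit limit are illustrative assumptions, not Sheets or Apps Script API.

```python
import re
from decimal import Context, Decimal

# A string counts as a "grouped number" only if every group after the
# first has exactly three digits (assumed model of the sheet's parsing).
GROUPED_NUMBER = re.compile(r"^\d{1,3}(,\d{3})+$")


def simulate_cell(text: str, significant_digits: int = 15) -> str:
    """Return what a cell might display under the assumed conversion rule."""
    if not GROUPED_NUMBER.match(text):
        return text  # not a valid grouped number: kept as plain text
    number = Decimal(text.replace(",", ""))
    # Round to the assumed number of significant digits.
    rounded = Context(prec=significant_digits).create_decimal(number)
    return f"{int(rounded):,}"


print(simulate_cell("123,456,789,112,113,114,115,116"))
print(simulate_cell("123,456,789,112,113,114,115,116,17,18"))
print(simulate_cell("0,0,123,456,789,112,113,114,115,116"))
```

If this reading is right, it would also explain why the 0,0 prefix works as a workaround: the prefixed string no longer looks like a number.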

How to add w:altChunk and its relationship with python-docx

My use case relies on the <w:altChunk/> element in a Word document: I inject (a fragment of) an HTML file as an alternate chunk and let Word do its work when the file is opened. The current implementation uses XML/XSL to compose the WordML XML, modify relationships, and do all the packaging manually, which is a real pain.
I want to move to python-docx, but the API doesn't support this directly. So far I have found a way to add the <w:altChunk/> element to the document XML, but I am still struggling to add the relationship and the related file to the package.
I think I should create a compatible part and pass it to the document.part.relate_to function to do its job, but I still can't figure out how:
from docx import Document
from docx.oxml import OxmlElement, qn
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, chunk_part):
    ''' TODO: figuring how to add files and relationships'''
    r_id = doc.part.relate_to(chunk_part, RT.A_F_CHUNK)
    alt = OxmlElement('w:altChunk')
    alt.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt)
Update:
As per scanny's advice, below is my working code. Thank you very much Steve!
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.part import Part
from docx.opc.constants import RELATIONSHIP_TYPE as RT
def add_alt_chunk(doc: Document, html: str):
    package = doc.part.package
    partname = package.next_partname('/word/altChunk%d.html')
    alt_part = Part(partname, 'text/html', html.encode(), package)
    r_id = doc.part.relate_to(alt_part, RT.A_F_CHUNK)
    alt_chunk = OxmlElement('w:altChunk')
    alt_chunk.set(qn('r:id'), r_id)
    doc.element.body.sectPr.addprevious(alt_chunk)
doc = Document()
doc.add_paragraph('Hello')
add_alt_chunk(doc, "<body><strong>I'm an altChunk</strong></body>")
doc.add_paragraph('Have a nice day!')
doc.save('test.docx')
Note: the altChunk parts only work/appear when the document is opened in MS Word.
Well, some hints here anyway. Maybe you can post your working code at the end as a full "answer":
The alt-chunk part needs to start its life as a docx.opc.part.Part object.
The blob argument should be the bytes of the file, which is often but not always plain text. It must be bytes though, not unicode (characters), so any encoding has to happen before calling Part().
I expect you can work out the other arguments:
package is the overall OPC package, available on document.part.package.
You can use docx.opc.package.OpcPackage.next_partname() to get an available partname based on a root template like: "altChunk%s" for a name like "altChunk3". Check what partname prefix Word uses for these, possibly with unzip -l has-an-alt-chunk.docx; should be easy to spot.
The content-type is one in docx.opc.constants.CONTENT_TYPE. Check the [Content_Types].xml part in a .docx file that has an altChunk to see what they use.
Once formed, the document_part.relate_to() method will create the proper relationship. If there is more than one relationship (not common) then you need to create each one separately. There would only be one relationship from a particular part, just some parts are related to more than one other part. Check the relationships in an existing .docx to see, but pretty good guess it's only the one in this case.
So your code would look something like:
package = document.part.package
partname = package.next_partname("altChunkySomethingPrefix")
content_type = docx.opc.constants.CONTENT_TYPE.THE_RIGHT_MIME_TYPE
blob = make_the_altChunk_file_bytes()
alt_chunk_part = Part(partname, content_type, blob, package)
rId = document.part.relate_to(alt_chunk_part, RT.A_F_CHUNK)
etc.
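The hints above suggest inspecting an existing .docx (for example with unzip -l) to see which partname and content type Word uses for an altChunk. A minimal standard-library sketch of that inspection follows. The function name part_content_types is hypothetical, and the sketch assumes the package declares its types in [Content_Types].xml using the usual Default and Override elements.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by [Content_Types].xml in OPC packages.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def part_content_types(docx_file):
    """Map every part in a .docx package to its declared content type."""
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
        # Default entries apply by file extension, Override entries by partname.
        defaults = {
            d.get("Extension").lower(): d.get("ContentType")
            for d in root.findall(CT_NS + "Default")
        }
        overrides = {
            o.get("PartName"): o.get("ContentType")
            for o in root.findall(CT_NS + "Override")
        }
        result = {}
        for name in package.namelist():
            if name == "[Content_Types].xml":
                continue
            partname = "/" + name
            extension = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            result[partname] = overrides.get(partname, defaults.get(extension))
        return result
```

Calling part_content_types("has-an-alt-chunk.docx") and looking for entries whose partname contains "altChunk" should show the prefix and MIME type to reuse when building the Part.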

How to set decimal mark character or localization in CSVeed?

New problem: my employer wants me to use the CSVeed utility in a project. It works just fine, except that number formatting is not recognised correctly. The data to read uses a semicolon (;) as field separator and a comma (,) as decimal mark. The project's home page says that decimal conversion is done automatically, but, for example, the string 0,5 in the CSV file is interpreted as 5, and the string 9,5 is read as 95. In the project's source code I found the note "Makes sure that a specific Locale is used to convert numbers.", but I am not sure where to tell the CSVeed library which locale (l10n) to use. Elsewhere the source documentation says the utility will use the l10n of the framework. Does that come from Eclipse RCP, which I am using, or from the machine? Sorry for not posting any code, but I could barely find a hint about where to set the decimal mark in the utility...
Does anyone have an idea?
Greetings :)
My Goodness, why this verbose ? ^^
// reader is the java.io.Reader over the CSV input
CsvClient<BeanClass> csvClient = new CsvClientImpl<BeanClass>(reader, BeanClass.class);
csvClient.setConverter("[name of property]", new CustomNumberConverter(Double.class, NumberFormat.getNumberInstance(Locale.[whereever]), false));
[name of property] has to be the name of the actual instance variable.
Greetings :)

Append string to a Flash AS3 shared object: For science

So I have a little flash app I made for an experiment where users interact with the app in a lab, and the lab logs the interactions.
The app currently traces a timestamp and a string when the user interacts, it's a useful little data log in the console:
trace(Object(root).my_date + ": User selected the cupcake.");
But I need to move away from using traces that show up in the debug console, because it won't work outside of the developer environment of Flash CS6.
I want to make a log, instead, in a SO ("Shared Object", the little locally saved Flash cookies.) Ya' know, one of these deals:
submit.addEventListener("mouseDown", sendData)
function sendData(evt:Event){
so = SharedObject.getLocal("experimentalflashcookieWOWCOOL")
so.data.Title = Title.text
so.data.Comments = Comments.text
so.data.Image = Image.text
so.flush()
}
I don't want to create any kind of architecture or server interaction, just append my timestamps and strings to an SO. Screw complexity! I intend to use all 100kb of the SO allocation with pride!
But I have absolutely no clue how to append data to the shared object. (Cough)
Any ideas how I could create a log file out of a shared object? I'll be logging about 200 lines per so it'd be awkward to generate new variable names for each line then save the variable after 4 hours of use. Appending to a single variable would be awesome.
You could just replace your so.data.Title line with this:
so.data.Title = (so.data.Title is String) ? so.data.Title + Title.text : Title.text; //check whether so.data.Title is a String, if it is append to it, if not, overwrite it/set it
Please consider not capitalizing the first letter of instance names (as in Title). In ActionScript (and most C-based languages), instance and variable names usually start with a lowercase letter.

When trying to select a column of type BLOB, SQLStatement throws a RangeError #2006: The supplied index is out of bounds

var imageData:ByteArray = new ByteArray();
var headshotStatement:SQLStatement = new SQLStatement();
headshotStatement.sqlConnection = dbConnection;
var headshotStr:String = "SELECT headshot FROM ac_images WHERE id = " + idx;
headshotStatement.text = headshotStr;
headshotStatement.execute();
The error references the final line in this block. I found a few references online where people are unable to select a BLOB using AS3 when the data was inserted by an external program, which is my exact situation.
Exact error trace is:
RangeError: Error #2006: The supplied index is out of bounds.
at flash.data::SQLStatement/internalExecute()
at flash.data::SQLStatement/execute()
at edu.bu::DataManager/getHeadshotImageData()[omitted/DataManager.as:396]
at edu.bu::CustomStudentProfileEditorWindow/editStudent()[omitted/CustomStudentProfileEditorWindow.mxml:194]
at edu.bu::CustomStudentProfileEditorWindow/profileEditor_completeHandler()[omitted/CustomStudentProfileEditorWindow.mxml:37]
at edu.bu::CustomStudentProfileEditorWindow/___CustomStudentProfileEditorWindow_Window1_creationComplete()[omitted/CustomStudentProfileEditorWindow.mxml:9]
at flash.events::EventDispatcher/dispatchEventFunction()
at flash.events::EventDispatcher/dispatchEvent()
at mx.core::UIComponent/dispatchEvent()[E:\dev\4.x\frameworks\projects\framework\src\mx\core\UIComponent.as:12528]
at mx.core::UIComponent/set initialized()[E:\dev\4.x\frameworks\projects\framework\src\mx\core\UIComponent.as:1627]
at mx.managers::LayoutManager/doPhasedInstantiation()[E:\dev\4.x\frameworks\projects\framework\src\mx\managers\LayoutManager.as:759]
at mx.managers::LayoutManager/doPhasedInstantiationCallback()[E:\dev\4.x\frameworks\projects\framework\src\mx\managers\LayoutManager.as:1072]
Any ideas what could be wrong with the data that is causing AS3 to choke?
Edit: I've tried to insert the same data into a binary type field using SQLite Manager 3.5 for OS X with the same result. I'm working with a JPG, going to try other image formats, though I can't begin to guess which supplied index is out of bounds given that trying to jump to the definition just states that it's getting it from a SWC and I can't see. Grasping at straws before I distribute the data in a zip and decompress it on the fly, which I don't want to do.
Edit 2, Workaround Edition: For now, I've written a quick MXML Window that prompts for the images and inserts the data into the database. Would still love to know why another application can't store BLOB data in a way that SQLStatement understands. I'm essentially setting up a few buttons that prompt for file paths to the images I want in the database, reading them in with readBytes() to a ByteArray, and storing that into the SQLite db using SQLStatement. I am then able to read the image back without difficulty.
Edit 3: Here is an image of the table structure. Note, this same table is being used successfully when I have Flash do the database image insert, as opposed to some other tool which properly stores the data in the BLOB field for every other application but Flash...
According to the AIR docs, CAST is supported for reading non-AMF blobs as of AIR 2.5. I had the same issue with a database/blob field created in Python and used in AIR, and this solved it for me.
Example from the docs:
stmt.text = "SELECT CAST(data AS ByteArray) AS data FROM pictures;";
stmt.execute();
var result:SQLResult = stmt.getResult();
var bytes:ByteArray = result.data[0].data;
http://help.adobe.com/en_US/as3/dev/WSd47bd22bdd97276f1365b8c112629d7c47c-8000.html
I found your question because I have experienced the same problem.
I was finally able to solve my problem, and my solution may help you solve yours: What is the first bytes in a Blob column SQlite Adobe AIR? Blob Sizeinfo?
It is all due to Adobe AIR and AMF (Action Message Format), which store the ByteArray object as formatted binary. AIR adds a first byte, 12, which is the type code for ByteArray, followed by an int-29 integer for the length, and finally the raw data.
So if you want to store binary blob data in a SQLite database that must be read by Adobe AIR, you must add this information so that Adobe AIR can convert everything back into a ByteArray.
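The layout described above (marker byte 12, a variable-length 29-bit integer, then the raw bytes) can be sketched in Python, the language the other answer used to create its database. This is a minimal sketch, not a tested AIR recipe: the function names are hypothetical, and the assumption that the 29-bit integer holds (length << 1) | 1 comes from my reading of the AMF3 format, not from this thread.

```python
def encode_u29(value: int) -> bytes:
    """Encode an integer as an AMF3-style variable-length 29-bit integer."""
    if not 0 <= value < 0x20000000:
        raise ValueError("U29 values must fit in 29 bits")
    if value < 0x80:
        return bytes([value])
    if value < 0x4000:
        return bytes([(value >> 7) | 0x80, value & 0x7F])
    if value < 0x200000:
        return bytes([
            (value >> 14) | 0x80,
            ((value >> 7) & 0x7F) | 0x80,
            value & 0x7F,
        ])
    # Four-byte form: the last byte carries a full 8 bits.
    return bytes([
        (value >> 22) | 0x80,
        ((value >> 15) & 0x7F) | 0x80,
        ((value >> 8) & 0x7F) | 0x80,
        value & 0xFF,
    ])


def wrap_as_amf_bytearray(raw: bytes) -> bytes:
    """Prefix raw bytes with the ByteArray type code and length header."""
    # 0x0C (decimal 12) is the type code mentioned above; the low bit of
    # the length field is assumed to flag an inline value, not a reference.
    header = bytes([0x0C]) + encode_u29((len(raw) << 1) | 1)
    return header + raw
```

An external tool would then store wrap_as_amf_bytearray(image_bytes) in the BLOB column instead of the bare image bytes. The CAST approach from the other answer avoids this step entirely when AIR 2.5 or later is available.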