Write a CSV file in Shift-JIS (MFC VC++, Windows Embedded - WinCE)

As the title says, I have been trying to write data that the user enters into a CEdit control to a file.
The system is a handheld terminal running Windows CE, on which my test application runs. I enter test data (Japanese characters in Romaji, Hiragana, Katakana and Kanji, mixed with normal English alphanumeric data) that is initially displayed in a CListCtrl. The characters display properly on the handheld's screen in my test application's UI.
Finally, I read the data back from the list control and write it to a CSV text file. The data I read back from the control is correct, but on writing it to the CSV file things go wrong: the file is unreadable, showing strange symbols and alphanumeric garbage.
I searched about this, and ended up at a similar question on Stack Overflow:
UTF-8, CString and CFile? (C++, MFC)
I tried some of their suggestions and finally ended up with a proper UTF-8 CSV file.
The write-to-csv-file code goes like this:
CStdioFile cCsvFile;
cCsvFile.Open(cFileName, CFile::modeCreate | CFile::modeWrite);
BYTE BOM[3] = {0xEF, 0xBB, 0xBF}; // UTF-8 BOM
cCsvFile.Write(BOM, 3);           // Write the BOM first
for (int i = 0; i < M_cDataList.GetItemCount(); i++)
{
    CString cDataStr = _T("\"") + M_cDataList.GetItemText(i, 0) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 1) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 2) + _T("\"\r\n");
    CT2CA outputString(cDataStr, CP_UTF8); // convert the wide string to UTF-8
    cCsvFile.Write(outputString, ::strlen(outputString));
}
cCsvFile.Close();
So far it is OK.
Now, for my use case, I would like to change things a bit such that the CSV file is encoded as Shift-JIS, not UTF-8.
For Shift-JIS, what BOM do I use, and what changes should I make to the above code?
Thank you for any suggestions and help.

The code page for Shift-JIS is 932. Use WideCharToMultiByte and MultiByteToWideChar for the conversion; the CW2A and CA2W helpers used below wrap these calls. For example:
CStringW source = L"日本語ABC平仮名ABCひらがなABC片仮名ABCカタカナABC漢字ABC①";
CStringA destination = CW2A(source, 932);
CStringW convertBack = CA2W(destination, 932);
//Testing:
ASSERT(source == convertBack);
AfxMessageBox(convertBack);
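Applied to the write loop from the question, only the conversion code page changes. A sketch, reusing the question's cFileName and M_cDataList; note that any character with no mapping in code page 932 will be silently replaced by a default character:
CStdioFile cCsvFile;
cCsvFile.Open(cFileName, CFile::modeCreate | CFile::modeWrite);
// No BOM is written for Shift-JIS (see below).
for (int i = 0; i < M_cDataList.GetItemCount(); i++)
{
    CString cDataStr = _T("\"") + M_cDataList.GetItemText(i, 0) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 1) + _T("\",");
    cDataStr += _T("\"") + M_cDataList.GetItemText(i, 2) + _T("\"\r\n");
    CT2CA outputString(cDataStr, 932); // 932 = Shift-JIS code page
    cCsvFile.Write(outputString, ::strlen(outputString));
}
cCsvFile.Close();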
As far as I can tell there is no BOM for Shift-JIS. Perhaps you just want to work with UTF-16 instead. For example:
CStdioFile file;
file.Open(L"utf16.txt", CFile::modeCreate | CFile::modeWrite| CFile::typeUnicode);
BYTE bom[2] = { 0xFF, 0xFE };
file.Write(bom, 2);
CString str = L"日本語";
file.WriteString(str);
file.Close();
P.S. According to this page there are some differences between code page 932 and standard Shift-JIS, although I couldn't reproduce any errors.
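If you want to detect up front whether a string contains characters that cannot be represented in code page 932, WideCharToMultiByte can report when it had to substitute the system default character. A sketch (error handling kept minimal; MapsToShiftJis is a hypothetical helper name):
// Sketch: returns TRUE if every character of 'source' can be
// represented in code page 932 (Shift-JIS).
BOOL MapsToShiftJis(const CStringW& source)
{
    // First call: ask how many bytes the converted string needs.
    int len = WideCharToMultiByte(932, 0, source, -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return FALSE;
    CStringA buffer;
    BOOL usedDefaultChar = FALSE;
    // Second call: do the conversion and check whether any character
    // had to be replaced by the default character.
    WideCharToMultiByte(932, 0, source, -1,
                        buffer.GetBuffer(len), len, NULL, &usedDefaultChar);
    buffer.ReleaseBuffer();
    return !usedDefaultChar;
}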

Related

Can Tesseract OCR recognize subscripts and superscripts?

I have problems with the general recognition of subscript and superscript in text fragments.
Example image:
I used Tesseract 4.1.1 with the training data available at https://github.com/tesseract-ocr/tessdata_best. The numerous options had default values, except:
tessedit_create_hocr = 1 (to get result as HOCR)
hocr_font_info = 1 (to get additional font infos like font size)
hocr_char_boxes = 1 (to get character-based result)
The language was set to eng. Neither with page segmentation mode 3 (PSM_AUTO_OSD) nor 11 (PSM_SPARSE_TEXT) nor 12 (PSM_SPARSE_TEXT_OSD) was the subscript/superscript recognized correctly.
In the output the sub/sup fragments were all more or less wrong:
"SubtextSub" is recognized as "Subtextsu,"
"SuptextSub" is recognized as "Suptexts?"
"P0" is recognized as "Po"
"P100" is recognized as "P1go"
"a2+b2" is recognized as "a+b?"
Using Tesseract for OCR, is there a way to
optimize subscript/superscript handling
get information about recognized subscripts/superscripts (in the hOCR output, ideally for each character)?
Working on the quality of the image, as suggested in other questions/answers on this topic, didn't really change anything.
Following these 2 links from the tesseract-ocr Google group, at first it really seemed to be a question of training:
link1 and link2.
But after doing some experiments I found that the OEM_DEFAULT OCR engine mode just doesn't bring up the needed information. I found a partial solution to the problem. Partial, because I now get most of the info about sub/sup, and the recognized characters are also right in most cases, but not for all characters.
Using the OEM_TESSERACT_ONLY OCR engine mode (the legacy mode) and some API methods provided by Tess4J, I came up with the following Java test class:
public class SubSupEvaluator {

    public void determineSubSupCharacters(BufferedImage image) {
        // 1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            // 2. start the actual OCR run
            TessBaseAPIRecognize(handle, null);

            // 3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                // determine the character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); // release memory
                // determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);
                // write the info to the console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b",
                        character, leftB.get(), topB.get(), rightB.get(), bottomB.get(),
                        TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                        TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); // release memory
        }
    }
}
The legacy mode only works with 'normal' training data; using the '-best' training data produces an error.
There is very little information on this topic.
One option to enhance sub/superscript character recognition (even if not the position detection itself) is to preprocess the image, e.g. with cv2 or PIL (Pillow), and then run Tesseract on it.
See
How to detect subscript numbers in an image using OCR?
Related (but otherwise not answering the question):
https://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg19434.html
https://github.com/tesseract-ocr/tesseract/blob/master/src/ccmain/superscript.cpp
What do you guys think about getting Tesseract to recognize single letters?
Tesseract does not recognize single characters
I tried it with the option --psm 10:
tesseract imTstg.png out5 --psm 10
but it did not seem to work. I am thinking about just running YOLO to detect the single letters.

Read a TIFF tag from a TIFF file using LibTiff [Edited: adding sample code]

I have a requirement where I need to read a couple of TIFF tags from an input TIFF file, and the user can provide any tag ID to read. For this, I need to know the type of the tag's value, so that I can read the tag and return the value to the user.
const char* filename = "C:\\test\\Modified.tif";
TIFF* mtif = TIFFOpen(filename, "r");
uint16 flor, w, h;
uint16 gotcount = 0;
TIFFGetField(mtif, TIFFTAG_FILLORDER, &flor);
TIFFGetField(mtif, TIFFTAG_IMAGEWIDTH, &w);
TIFFGetField(mtif, TIFFTAG_IMAGELENGTH, &h);
I am using the LibTiff library. Here I am able to read the width and height properly, whereas the FillOrder tag value is not received.
I opened the file in a TIFF editor and can see that FillOrder has a valid value.
Can someone help me with this? Thanks.
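One way to handle the "any tag ID" requirement is to look the tag up in libtiff's field directory first, which reports the declared data type. A sketch (assuming libtiff 4.x, where TIFFFieldWithTag, TIFFFieldDataType and TIFFFieldName are available; PrintTagType is a hypothetical helper):
#include "tiffio.h"
#include <stdio.h>

// Sketch: report a tag's declared data type before reading its value.
void PrintTagType(TIFF* tif, uint32 tagId)
{
    const TIFFField* fip = TIFFFieldWithTag(tif, tagId);
    if (fip == NULL)
    {
        printf("Tag %u is not known to libtiff\n", tagId);
        return;
    }
    // TIFFDataType is e.g. TIFF_SHORT, TIFF_LONG, TIFF_RATIONAL, ...
    TIFFDataType type = TIFFFieldDataType(fip);
    printf("Tag %s has data type %d\n", TIFFFieldName(fip), (int)type);
}
Note also that TIFFGetField returns 0 when a tag is not actually present in the file; for tags with a specification-defined default, such as FillOrder, TIFFGetFieldDefaulted will fall back to that default value.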

Character confidence for Tesseract 3.02 using config file

How would I get the % confidence per character detected?
By searching around I found that you should set save_blob_choices to T.
So I added it as a line in the hocr config file in tessdata/configs and called tesseract with it.
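For reference, a Tesseract config file is plain text with one name/value pair per line, so the line I added is simply:
save_blob_choices T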
This is all I'm getting in the generated html file:
<span class='ocr_line' id='line_1' title="bbox 0 0 50 17"><span class='ocrx_word' id='word_1' title="bbox 3 2 45 15"><strong>31,835</strong></span>
As you can see, there aren't any confidence annotations, not even per word.
I don't have Visual Studio, so I'm not able to make any code changes, but I'm also open to answers describing code changes, as well as how I would compile the code without VS.
Here is sample code for getting the confidence of each word.
You can even replace RIL_WORD with RIL_SYMBOL to get the confidence of each character.
mTess.Recognize(0);
tesseract::ResultIterator* ri = mTess.GetIterator();
if (ri != 0)
{
    do
    {
        const char* word = ri->GetUTF8Text(tesseract::RIL_WORD);
        if (word != 0)
        {
            float conf = ri->Confidence(tesseract::RIL_WORD);
            printf(" word: %s, confidence: %f", word, conf);
        }
        delete[] word;
    } while (ri->Next(tesseract::RIL_WORD));
    delete ri;
}
You will have to write a program to do this. Take a look at the ResultIterator API example on the Tesseract site. For your case, be sure to set the save_blob_choices variable and iterate at the RIL_SYMBOL level.
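A minimal sketch of that approach (assuming a tesseract::TessBaseAPI instance named api that has already been initialized and given an image; the helper name is hypothetical):
#include <tesseract/baseapi.h>
#include <tesseract/resultiterator.h>
#include <cstdio>

// Sketch: per-character (RIL_SYMBOL) confidences.
// save_blob_choices must be set before Recognize() is called.
void PrintSymbolConfidences(tesseract::TessBaseAPI& api)
{
    api.SetVariable("save_blob_choices", "T");
    api.Recognize(0);
    tesseract::ResultIterator* ri = api.GetIterator();
    if (ri != 0)
    {
        do
        {
            const char* symbol = ri->GetUTF8Text(tesseract::RIL_SYMBOL);
            if (symbol != 0)
            {
                float conf = ri->Confidence(tesseract::RIL_SYMBOL);
                printf("symbol: %s, confidence: %f\n", symbol, conf);
            }
            delete[] symbol;
        } while (ri->Next(tesseract::RIL_SYMBOL));
        delete ri;
    }
}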

Actionscript problems with social share encoding

I'm trying to make some "social share" buttons for my site, but the URLs I generate just don't get decoded by these services.
One example, for twitter:
private function twitter(e:Event):void {
    var message:String = "Message with special chars âõáà";
    var url:String = "http://www.twitter.com/home?status=";
    var link:URLRequest = new URLRequest( url + escape(message) );
}
But when twitter opens up, the message is:
Message with special chars
%E2%F5%E1%E0
Something similar is happening with Facebook and Orkut (but these two hide the special chars).
Does someone know why this is happening?
The problem is that the escape() function doesn't take UTF-8 encoding into account. The function you want for encoding the query string as UTF-8 is encodeURIComponent().
So, let's say you have an "ñ" (eñe in Spanish, or n plus tilde). I'm using "ñ" because I remember both its code point and its UTF-8 representation, since I always use it for debugging, but the same applies to any other non-ASCII, non-alphanumeric character.
Say you have the string "Año" ("year" in Spanish, by the way).
The code points (both in Unicode and ISO-8859-1) are:
A: 0x41
ñ: 0xf1
o: 0x6f
If you call escape(), you'll get this:
A: A
ñ: %F1
o: o
"A" and "o" don't need to be encoded. The "ñ" is encoded as "%" plus its code point, which is 0xf1.
But Twitter, Facebook, etc. expect UTF-8, and 0xf1 is not a valid UTF-8 sequence; the character should be represented by a two-byte sequence. Meaning, "ñ" should be encoded as:
0xC3 0xB1
This is what encodeURIComponent does. It will encode "Año" this way:
A: A
ñ: %C3%B1
o: o
So, to sum up, instead of this:
var link:URLRequest = new URLRequest( url + escape(message) );
try this
var link:URLRequest = new URLRequest( url + encodeURIComponent(message) );
And it should work fine.

Unicode, VBScript and HTML

I have the following radio button:
<input type="radio" value="&#39321;">&#39321;</input>
As you can see, the value is a Unicode numeric character reference. It represents the following Chinese character: 香
So far so good.
I have a VBScript that reads the value of that particular radio button and saves it into a variable. When I display the content with a message box, the Chinese character appears. Additionally, I have a variable called uniVal to which I assign the character reference directly:
radioVal = < read value of radio button >
MsgBox radioVal ' yields the Chinese character
uniVal = "&#39321;"
MsgBox uniVal ' yields the character reference
Is there a possibility to read the radio button value in such a way that the character reference string is preserved and NOT interpreted as the Chinese character?
For sure, I could try to recreate the character reference from the character, but the methods I found in VBScript are not working correctly due to VBScript's implicit UTF-16 setting (instead of UTF-8). So the following method does not work correctly for all characters:
Function StringToUnicode(str)
    result = ""
    For x = 1 To Len(str)
        result = result & "&#" & AscW(Mid(str, x, 1)) & ";"
    Next
    StringToUnicode = result
End Function
Cheers
Chris
I found a solution:
JavaScript allows writing a function that actually works:
function convert(value) {
    var tstr = value;
    var bstr = '';
    for (var i = 0; i < tstr.length; i++) {
        if (tstr.charCodeAt(i) > 127) {
            bstr += '&#' + tstr.charCodeAt(i) + ';';
        } else {
            bstr += tstr.charAt(i);
        }
    }
    return bstr;
}
I call this function from my VBScript... :)
Here is a VBScript function that will always return a positive value for the Unicode code point of a given character:
Function PositiveUnicode(s)
    Dim val : val = AscW(s)
    If (val And &h8000) <> 0 Then
        PositiveUnicode = (val And &h7FFF) + &h8000&
    Else
        PositiveUnicode = CLng(val)
    End If
End Function
This will save you loading two script engines to achieve a simple operation.
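For example, the StringToUnicode function from the question could then be rewritten to stay entirely in VBScript, a sketch using PositiveUnicode from above:
Function StringToUnicode(str)
    Dim x, result
    result = ""
    For x = 1 To Len(str)
        ' Use the sign-corrected code point for every character.
        result = result & "&#" & PositiveUnicode(Mid(str, x, 1)) & ";"
    Next
    StringToUnicode = result
End Function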
"not working correctly due to VBScripts implicit UTF-16 setting (instead of UTF-8)."
This issue has nothing to do with UTF-8. It is purely the result of AscW use of the signed integer type.
As to why you have to recreate the &#xxxxx; encodings that you sent this is result of how HTML (and XML) work. The use of this character encoding entity is a convnience that the specification does not require to remain intact. Since the character encoding of the document is quite capable or representing that character the DOM is at liberty to convert it.