A good practice to work with dictionaries? - actionscript-3

I'm starting to develop a game (AS3), and in one step the participants have to type a word in one of 5 available languages, and then that word is translated into the other 4.
For the sake of example:
I choose the word "home" in English, and then these fields are filled:
Spanish: casa
Russian: домой
German: Zuhause
French: maison
So the question is: what would be the best approach? Are there downloadable dictionaries available for different languages, or would it be better to use a web service?
Also something to consider is that the translations shouldn't consist of more than one word.
I have never worked with dictionaries before, so I'd rather investigate a bit instead of starting off on the wrong foot. Thanks.

You should use a properties file. This is the best approach for building a multi-language application.
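As a rough illustration of the lookup itself, here is a minimal sketch in Python rather than AS3. The table `ENTRIES` and the function `translate` are hypothetical names; the single entry reuses the example words from the question, and in practice the data could be loaded from one properties or resource file per language.

```python
# Hypothetical in-memory word table; in practice this could be loaded
# from one properties/resource file per language.
ENTRIES = [
    {"en": "home", "es": "casa", "ru": "домой", "de": "Zuhause", "fr": "maison"},
]


def translate(word, source_lang):
    """Return the translations of `word` in every other language, or None."""
    needle = word.casefold()
    for entry in ENTRIES:
        if entry.get(source_lang, "").casefold() == needle:
            # Drop the source language so only the other fields are filled.
            return {lang: text for lang, text in entry.items() if lang != source_lang}
    return None
```

Calling `translate("home", "en")` would fill the other four fields, and an unknown word returns `None` so the game can decide how to react.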

Related

Custom translator - How can I train the machine to recognize the right translation solution (synonyms)?

I'm pretty new with Custom Translator and I'm working on a fashion-related EN_KO project.
There are many cases where a single English term has two possible translations into Korean. For example, if "fastening" relates to "bags, backpacks..." it is 잠금, but if it relates to "clothes, shoes..." it is 여밈.
I'd like to train the machine to recognize these differences. Could it be useful to upload a phrase dictionary? Any ideas? Thanks!
The purpose of training a custom translation system is to teach it how to translate terms in context.
The best way to teach the system how to translate is training with parallel documents of full sentence prose: the same document in two languages. A translation memory extract in a TMX or XLIFF file is the best material, but many other document formats are suitable as well, as long as you have both languages. Have at least 10000 sentences in both languages, upload to http://customtranslator.ai, and build a custom system with it.
If you have documents in Korean that are representative of the terminology and style you want to achieve, without an English match, you can automatically translate those to English, and add to the training material as parallel documents. Be sure to not use the automatically translated documents in the other direction.
A phrase dictionary is of limited help, because it is unaware of context. It is useful only in bootstrapping your custom system or for very rare terms where you cannot find or create a sentence.

Ad-Hoc dictionary

I'm currently working on a small project with the FineReader 11 SDK. To improve my results I would like to work with an ad-hoc dictionary. The content of the dictionary is based on the first word of a given line.
Example:
Samsung Galaxy S3 ... many other word in this line
Apple Iphone 4 ... much more words
some more lines
My idea is to recognize the first word (Samsung or Apple) and fill the dictionary with all possible words based on it (for Samsung: Galaxy, S3, ...).
Any idea how to solve this with FineReader?
Regards
Thank you for the clarification. So here is what you can do in my opinion. This applies to FineReader product line, and of course in the SDK you have more specific control via API.
FineReader OCR has these dictionaries:
Built-in dictionary - a large set of common words and their variations, one of the strengths of ABBYY OCR technology. It does not contain specialized words such as "Samsung" and "S3". By selecting a popular language, you automatically turn on the built-in dictionary for that language.
Custom Dictionary - this is a dictionary that you can build, and use alone or in conjunction with built-in dictionary.
So for your project, I believe it makes sense to use built-in dictionary, because your phrases may have standard English words (you did not provide full phrases for me to see, so decide on this yourself).
I also strongly believe that you need to create a custom dictionary with brands, models, etc., if you have that option, and it sounds like you do. It will greatly improve recognition, especially for unnatural words like "S3", because common language rules indicate that letters and numbers should not be mixed. This is very easy to do.
I presently do not see the benefit of reading each line with a separate dictionary, unless you believe you will have an intersection of very similar words applicable to different lines, and you would want those words in separate dictionaries relative to each line. Then you can create separate dictionaries, and turn on each dictionary for secondary recognition based on the initial word. However, to achieve that, you first need to separate the page into lines (in memory, or by actually cropping images) in order to process each line separately with its own dictionary. That is possible only in the SDK, with a substantial amount of work.
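The selection step of that two-pass idea can be sketched independently of the OCR engine. The Python below is only an illustration: `BRAND_WORDS` and `words_for_line` are hypothetical names, the word lists are made up from the question's example, and handing the chosen words to FineReader is SDK-specific and deliberately not shown.

```python
# Hypothetical per-brand word lists; the entries are illustrative only.
BRAND_WORDS = {
    "Samsung": ["Galaxy", "S3"],
    "Apple": ["Iphone", "4"],
}


def words_for_line(line):
    """Pick the word list keyed by the first word of an already-recognized line."""
    parts = line.split()
    if not parts:
        return []
    # Unknown brands fall back to an empty list (built-in dictionary only).
    return BRAND_WORDS.get(parts[0], [])
```

The first pass would recognize the line with the built-in dictionary, and the list returned here would seed the custom dictionary for the second pass.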

Coding a domain specific text generator

A friend of mine is in the real estate business, and after being shown the art of writing copy for real estate ads, I realized that it is very formulaic, especially when advertising online, as there are predefined fields you fill in.
Naturally, I thought about creating a generator that pretty much automates writing the ads. I don't expect it to generate outstanding or even very good copy, just to put together words and sentences like a human would.
I have a skeleton/template that defines an ad, and I've also put together a set of phrases and words that can be randomly selected, but I am interested in the more general aspects of coding such a generator. Any suggestions, tips or literature that would help me understand this little project better?
Using metadata about the listing would be one way.
Say for a given house, you have these attributes:
(type: bungalow, sq feet: <= 1400) You could use the phrase "cozy cottage".
bedrooms: obvious, same thing with bathrooms. Assume using the words large, medium, etc.
garage spots: if > 2 then "Can park many vehicles", etc.
You could go even further with this: given the lat/lon for the address, there are web services that can tell you the number of parks nearby, crime in the neighborhood, etc.
Rick
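Rick's attribute rules could be sketched roughly as follows. The field names (`type`, `sq_feet`, `garage_spots`, `bedrooms`) and the function `describe` are assumptions made for illustration; the thresholds and phrases come from the answer above.

```python
def describe(listing):
    """Turn listing attributes into a list of stock phrases."""
    phrases = []
    sq_feet = listing.get("sq_feet")
    # Small bungalows get the "cozy cottage" treatment.
    if listing.get("type") == "bungalow" and sq_feet is not None and sq_feet <= 1400:
        phrases.append("cozy cottage")
    # More than two garage spots is worth mentioning.
    if listing.get("garage_spots", 0) > 2:
        phrases.append("can park many vehicles")
    bedrooms = listing.get("bedrooms")
    if bedrooms is not None:
        phrases.append(f"{bedrooms} bedrooms")
    return phrases
```

The resulting phrases would then be slotted into the ad template, with a copywriter polishing the draft afterwards.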
I'd say there are three basic approaches you could take to a problem like this, depending on how flexible you want the system to be and on how much work you want to put into it. The simplest is to treat it as a report generation problem, along the lines of Rick's suggestion. That's probably the way I'd go to produce a first draft of a listing. The results would be pure boilerplate, but each listing could be quickly punched up by the copywriter.
If you wanted to get fancy, though, you could come at it as a natural language generation problem. You'd start with some kind of a knowledge representation describing the meaning of the listing and set of rules (finite state transducers, say) for mapping meanings to linguistic forms. There's a sizable academic literature on that kind of stuff, though it's kind of out of fashion these days. Places to start might be Blackburn & Bos's book or the NLTK suite (especially some of the projects in the contrib package).
The third way of doing it would be to treat it as a translation problem, essentially "translating" database entries into ad copy. You'd start with a large collection of listings and the corresponding human-written ads and construct a statistical model of the relationship between the two. Moses/Giza++ is a general purpose tool for building and applying such models.

Programmers dictionary/lexicon for non native speakers

I'm not a native English speaker, and I'm not very good at English. I'm self-taught. I have not worked together with others on a common codebase. I don't have any friends who program. I don't work with other programmers (at least nobody who cares about these things).
I guess this might explain some of my problems in finding good, unambiguous class names. I have tried to find some sort of "Programmers dictionary" containing words often used and their meanings. When reading others' code I have to look up words quite often, and as many people use abbreviations this poses an additional challenge.
My very limited vocabulary "forces" me to use bad class names like xxManager, xxProvider, xxWhatever. It's usually less problematic choosing variable and method names.
Other non English people out here: How have you managed to cope with this? Have you studied English so well it's not a problem? Or have you read so much code naming comes natural? Or discussed a lot with English speakers? Found any good websites, articles or other publications? As I've never read anything regarding programming in my own language, I often have more problems trying to find the words in my language...
PS: All other posts I've found were about mixing native tongue and English... And I understand this might be a bit off topic and might be closed.
Edit: Some resources from the answers and other stuff I use:
Jargon / The New Hacker's Dictionary
Common design patterns
Google translate
Dictionary
The Jargon file will help with the more obscure references people will give in the industry.
http://catb.org/jargon/html/go01.html
Other than that, finding good names for your variables/classes/etc. is hard. Oftentimes, it's harder than actually solving the problem. Here's a good resource for some common design pattern names people like to use: http://en.wikipedia.org/wiki/Design_pattern_%28computer_science%29
Examples:
AbcFactory
XyzBridge
Could be an unorthodox suggestion, but I would recommend studying English more deeply (I am also a non-native speaker).
Expose yourself to as much English as possible! Watch movies, read English fiction, listen to technical podcasts.
Mind you, if you really want to deepen your knowledge of English, you're probably not going to learn a lot watching "Transformers". On the other hand, diving into Ulysses probably is not a good strategy either.
If you're feeling adventurous, you could always get a subscription to the New Yorker magazine. It'll do things to you - yes this is flamebaiting. :P
Other non English people out here:
How have you managed to cope with this?
Good naming in code matters. Using English is preferred, but if you don't know English very well the result could be counterproductive.
I had a friend who just guessed what the correct name would be, and the result was horrible, e.g.
String employiiNeim; // employeeName
int eich; // age
The problem with English is that it is not pronounced as it is written (French has this minor... ehrm, characteristic too). In other languages, such as Spanish, German, and Dutch, words are pronounced largely as they are spelled.
This becomes particularly relevant when what you are coding are business rules or business models. In this case it is much better to use your native language.
String nombreEmpleado;
int edad;
Much better, especially when you work with others.
Have you studied English so well it's not a problem?
Yep, there is no other way, and a lot of practice.
You can study English the same way you study programming languages, though. You can have a teacher, attend a class, and study an hour a day. Or (what I did) you can just grab something that interests you and try to understand it. For instance, read a small document describing something you care about, read blogs or content here at StackOverflow, translate a song you like, etc.
All of these are forms of study. There is no other way; you won't wake up one day and say: "...I know kung fu", I mean, "...I know English".
Or have you read so much code naming comes natural?
Also helps, but if you don't understand what the code means, you ... well won't make any progress.
You'll learn the programming language, and that will help you understand English a bit better, but it won't help you learn it. That's because when we program we learn the programming language, not the natural language.
Or discussed a lot with English speakers?
Eerhh... nope. If you have that chance go ahead, it will improve your listening and speaking, but not necessarily your writing.
The most effective way to improve your English vocabulary and grammar is by READING ( reading in your native language also improves your own language btw )
So, I would say, read as much as you can. Use your native language while you gain more confidence, and keep studying.
The English will come with time.
If you can't find the "Programmer's Dictionary" you're looking for, start one. Post a new question: "What entries are missing from this Dictionary for English-as-a-Second-Language-Programmers?" and seed it with 10 or 20 words/definitions you've already discovered. Once posters have suggested enough additions, move it to a wiki somewhere and keep accepting contributions. You might end up creating a valuable resource.
Documenting your code with excellent prose like your question above will go a long way!
If you stick to common design patterns endemic to the language, platform, and architecture you're working with, other engineers should understand your nomenclature fairly easily.
If you are worried about it in terms of naming your own objects, just think of what your native word is for what you want to do, then get an English translation dictionary, and use the English version.
How about using your native language?
Of course (for me as an Austrian, for example) some letters may not be allowed - but who cares whether the class name contains Mörder or Moerder (murderer) :)
Or (as I do) use a dictionary like dict.cc or something else.
What I do: think about what the class does. It manages a game session (for example), so it becomes GameSessionManager.
Abbreviations are (at least for me) a problem - but what I've learned from other code is that even native speakers use different abbreviations.
And whether the class is called GameSessionMgr or GameSessionMngr doesn't make a difference.
You are not writing books or some kind of "English poem" where spelling, grammar and so on count.
You write code - and if you follow "your special rules", you and others will (after some time) be able to understand your code and class names.
It will come with time and experience. Above all, try to document things (as @Mike A says) until the code becomes clearer, and try to be consistent.
This is an issue that I run into as well, even as a native English speaker. As a programmer, I often find that I need a descriptive word for a class, variable, function, etc. I often find myself asking a friend or coworker what verbiage they would use by explaining my idea, carefully excluding any words I myself have considered as a possible choice for the class/function/variable name so as not to inhibit their creativity.
It seems to me that the English Language & Usage site proposal over at Area51 is a good place to ask such questions as "What would you call a class (or thing) that does this, this and that, and has properties x, y, and z?"

Internationalization in your projects

How have you implemented Internationalization (i18n) in actual projects you've worked on?
I took an interest in making software cross-cultural after I read the famous post by Joel, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, I have yet to be able to take advantage of this in a real project, besides making sure I used Unicode strings where possible. But making all your strings Unicode and ensuring you understand what encoding everything you work with is in is just the tip of the i18n iceberg.
Everything I have worked on to date has been for use by a controlled set of US English speaking people, or i18n just wasn't something we had time to work on before pushing the project live. So I am looking for any tips or war stories people have about making software more localized in real world projects.
It has been a while, so this is not comprehensive.
Character Sets
Unicode is great, but you can't get away with ignoring other character sets. The default character set on Windows XP (English) is Cp1252. On the web, you don't know what a browser will send you (though hopefully your container will handle most of this). And don't be surprised when there are bugs in whatever implementation you are using. Character sets can have interesting interactions with filenames when they move between machines.
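A small Python sketch of what goes wrong when bytes are decoded with the wrong character set. The sample word is arbitrary; the point is only that UTF-8 bytes read as Cp1252 decode without any error and produce garbage.

```python
text = "Krüger"
utf8_bytes = text.encode("utf-8")

# Decoding with the wrong character set silently produces garbage
# instead of raising an error.
wrong = utf8_bytes.decode("cp1252")

# Decoding with the character set the bytes were written in round-trips.
right = utf8_bytes.decode("utf-8")
```

Here `wrong` comes out as "KrÃ¼ger", which is the kind of damage that tends to show up in filenames and form input when the encoding is guessed rather than known.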
Translating Strings
Translators are, generally speaking, not coders. If you send a source file to a translator, they will break it. Strings should be extracted to resource files (e.g. properties files in Java or resource DLLs in Visual C++). Translators should be given files that are difficult to break and tools that don't let them break them.
Translators do not know where strings come from in a product. It is difficult to translate a string without context. If you do not provide guidance, the quality of the translation will suffer.
While on the subject of context, you may see the same string "foo" crop up multiple times and think it would be more efficient to have all instances in the UI point to the same resource. This is a bad idea. Words may be very context-sensitive in some languages.
Translating strings costs money. If you release a new version of a product, it makes sense to reuse the translations from the old version. Have tools to recover strings from your old resource files.
String concatenation and manual manipulation of strings should be minimized. Use the format functions where applicable.
Translators need to be able to modify hotkeys. Ctrl+P is print in English; the Germans use Ctrl+D.
If you have a translation process that requires someone to manually cut and paste strings at any time, you are asking for trouble.
Dates, Times, Calendars, Currency, Number Formats, Time Zones
These can all vary from country to country. A comma may be used to denote decimal places. Times may be in 24hour notation. Not everyone uses the Gregorian calendar. You need to be unambiguous, too. If you take care to display dates as MM/DD/YYYY for the USA and DD/MM/YYYY for the UK on your website, the dates are ambiguous unless the user knows you've done it.
Especially Currency
The Locale functions provided in the class libraries will give you the local currency symbol, but you can't just stick a pound (sterling) or euro symbol in front of a value that gives a price in dollars.
User Interfaces
Layout should be dynamic. Not only are strings likely to double in length on translation, the entire UI may need to be inverted (Hebrew; Arabic) so that the controls run from right to left. And that is before we get to Asia.
Testing Prior To Translation
Use static analysis of your code to locate problems. At a bare minimum, leverage the tools built into your IDE. (Eclipse users can go to Window > Preferences > Java > Compiler > Errors/Warnings and check for non-externalised strings.)
Smoke test by simulating translation. It isn't difficult to parse a resource file and replace strings with a pseudo-translated version that doubles the length and inserts funky characters. You don't have to speak a language to use a foreign operating system. Modern systems should let you log in as a foreign user with translated strings and foreign locale. If you are familiar with your OS, you can figure out what does what without knowing a single word of the language.
Keyboard maps and character set references are very useful.
Virtualisation would be very useful here.
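The pseudo-translation smoke test mentioned above could look roughly like this. It is a sketch that assumes a very simple `key=value` resource format with `#` comments; the names `FUNKY`, `pseudo_translate` and `pseudo_translate_properties` are hypothetical, and a real properties parser would also need to handle escapes and line continuations.

```python
# Hypothetical mapping from ASCII vowels to accented look-alikes.
FUNKY = str.maketrans("aeiouAEIOU", "àéîöûÀÉÎÖÛ")


def pseudo_translate(value):
    """Accent the vowels and pad the string to at least double its length."""
    accented = value.translate(FUNKY)
    padding = "~" * len(value)
    # Brackets make truncated strings easy to spot in the UI.
    return "[" + accented + padding + "]"


def pseudo_translate_properties(text):
    """Pseudo-translate the values in simple `key=value` lines."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in line:
            out.append(line)  # keep blanks, comments and odd lines untouched
            continue
        key, value = line.split("=", 1)
        out.append(key + "=" + pseudo_translate(value))
    return "\n".join(out)
```

Any string that still shows up unaccented in the running application was probably hard-coded, and any string missing its closing bracket was cut off by the layout.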
Non-technical Issues
Sometimes you have to be sensitive to cultural differences (offence or incomprehension may result). A mistake you often see is the use of flags as a visual cue choosing a website language or geography. Unless you want your software to declare sides in global politics, this is a bad idea. If you were French and offered the option for English with St. George's flag (the flag of England is a red cross on a white field), this might result in confusion for many English speakers - assume similar issues will arise with foreign languages and countries. Icons need to be vetted for cultural relevance. What does a thumbs-up or a green tick mean? Language should be relatively neutral - addressing users in a particular manner may be acceptable in one region, but considered rude in another.
Resources
C++ and Java programmers may find the ICU website useful: http://www.icu-project.org/
Some fun things:
Having a PHP and MySQL application that works well with German and French, but now needs to support Russian and Chinese. I think I'll move this over to .NET, as PHP's Unicode support is - in my opinion - not really good. Sure, juggling utf8_decode/utf8_encode or the mbstring functions is fun. Almost as fun as having Freddy Krüger visit you at night...
Realizing that some languages are a LOT more Verbose than others. German is a LOT more verbose than English usually, and seeing how the German Version destroys the User Interface because too little space was allocated was not fun. Some products gained some fame for their creative ways to work around that, with Oblivion's "Schw.Tr.d.Le.En.W." being memorable :-)
Playing around with date formats, woohoo! Yes, there ARE actually people in the world who use date formats where the day goes in the middle. Sooooo much fun trying to find out what 07/02/2008 is supposed to mean, just because some users might believe it could be July 2... But then again, you guys over the pond may believe the same about users who put the month in the middle :-P, especially because in English, July 2 sounds a lot better than 2nd of July, something that does not necessarily apply to other languages (e.g. in German, you would never say Juli 2 but always Zweiter Juli). I use 2008-02-07 whenever possible. It's clear that it means February 7 and it sorts properly, but dd/mm vs. mm/dd can be a really tricky problem.
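To make the ambiguity concrete, here is a short Python sketch using only the standard `datetime` module. The same input string parses to two different dates depending on the assumed convention, while the ISO form is unambiguous and sorts correctly as plain text.

```python
from datetime import date, datetime

raw = "07/02/2008"

# The same text yields two different dates depending on the convention.
as_us = datetime.strptime(raw, "%m/%d/%Y").date()  # month first: July 2
as_uk = datetime.strptime(raw, "%d/%m/%Y").date()  # day first: February 7

# ISO 8601 puts the largest unit first, so string order is date order.
iso = date(2008, 2, 7).isoformat()
```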
Another fun thing, number formats! 10.000,50 vs 10,000.50 vs. 10 000,50 vs. 10'000,50... This is my biggest nightmare right now, having to support a multi-cultural environment but not having any way to reliably know what number format the user will use.
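If the format can be pinned down somehow, for example through an explicit user preference (an assumption, since the text above says it often cannot be detected reliably), parsing becomes straightforward. The helper `parse_number` below is a hypothetical sketch in Python that takes the separators as arguments instead of guessing them.

```python
from decimal import Decimal


def parse_number(text, group_sep, decimal_sep):
    """Parse `text` using explicitly supplied separators."""
    # Drop grouping characters first, then normalize the decimal mark.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)
```

All four spellings from the paragraph above then parse to the same value, provided the caller passes the matching separators.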
Formal or Informal. In some languages, there are two ways to address people, a formal way and a more informal way. In English, you just say "You", but in German you have to decide between the formal "Sie" and the informal "Du", same for French Tu/Vous. It's usually a safe bet to choose the formal way, but this is easily overlooked.
Calendars. In Europe, the first day of the Week is Monday, whereas in the US it's Sunday. Calendar Widgets are nice. Showing a Calendar with Sunday on the left and Saturday on the right to a European user is not so nice, it confuses them.
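A calendar widget can take the first day of the week as a setting rather than hard-coding it. As a sketch, Python's standard `calendar` module exposes this through the `firstweekday` argument; the helper names below are hypothetical.

```python
import calendar


def week_order(first_weekday):
    """Return weekday numbers (0 = Monday) in display order."""
    return list(calendar.Calendar(firstweekday=first_weekday).iterweekdays())


def week_header(first_weekday):
    """Return abbreviated weekday names in display order (locale-dependent)."""
    return [calendar.day_abbr[day] for day in week_order(first_weekday)]
```

`week_order(calendar.MONDAY)` gives the European layout and `week_order(calendar.SUNDAY)` the US one, so the widget only needs the user's preference to pick between them.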
I worked on a project for my previous employer that used .NET, and there was a built-in .resx format we used. We basically had one base .resx file containing all the strings, and then one additional file per language with the translations. The consequence of this is that you have to be very diligent about ensuring that all strings visible in the application are stored in the .resx, and any time one is changed you have to update all the languages you support.
If you get lazy and don't notify the people in charge of translations, or you embed strings without going through your localization system, it will be a nightmare to try and fix it later. Similarly, if localization is an afterthought, it will be very difficult to put in place. Bottom line, if you don't have all visible strings stored externally in a standard place, it will be very difficult to find all that need to be localized.
One other note, very strictly avoid concatenating visible strings directly, such as
String message = "The " + item + " is on sale!";
Instead, you must use something like
String message = String.Format("The {0} is on sale!", item);
The reason for this is that different languages often order the words differently, and concatenating strings directly will need a new build to fix, but if you used some kind of string replacement mechanism like above, you can modify your .resx file (or whatever localization files you use) for the specific language that needs to reorder the words.
I was just listening to a podcast from Scott Hanselman this morning, where he talks about internationalization, especially the really tricky things, like Turkish (with its four i's) and Thai. Also, Jeff Atwood had a post:
Besides all the previous tips, remember that i18n is not just about swapping words for their equivalents in other languages, especially for non-Latin scripts (Korean, Arabic). Arabic, for example, is written right to left, so the whole UI will have to conform, like
item 1
item 2
item 3
would have to be
arabic text 1 -
arabic text 2 -
arabic text 3 -
(reversed bullet list doesn't seem to work :P)
which can be a UI nightmare if your system has to apply changes dynamically once the user changes the language being used.
Another very hard thing is testing different languages, not just for the correctness of the words: languages like Korean usually need a bigger font size for their characters, which may lead to language-specific bugs (like the "SAVE" text on a button being larger than the button itself in some languages).
One of the funnier things to discover: italic and bold text markup does not work well with CJK (Chinese/Japanese/Korean) characters. They simply become unreadable. (OK, I couldn't really read them before either, but bolding especially just creates ink blots.)
I think everyone working in internationalization should be familiar with the Common Locale Data Repository, which is now a sub-project of Unicode:
Common Locale Data Repository
Those folks are working hard to establish a standard resource for all kinds of i18n issues: currency, geographical names, tons of stuff. Any project that's maintaining its own core locale data given that this project exists is pretty bonkers, IMHO.
I suggest using something like 99translations.com to maintain your translations. Otherwise you won't be able to tell which of your translations are up to date in every language.
Another challenge will be accepting input from your users. In many cases, this is eased by the input processing provided by the operating system, such as IME in Windows, which works transparently with common text widgets, but this facility will not be available for every possible need.
One website I use has a translation method the owner calls "wiki + machine translation". This is a community based site so is obviously different to the needs of companies.
http://blog.bookmooch.com/2007/09/23/how-bookmooch-does-its-translations/
One thing no one has mentioned yet is strings with a varying part, as in "The unit will arrive in 5 days" or "On Monday something happens", where 5 and Monday change depending on state. It is not a good idea to split those in two and concatenate them. With only one varying part and good documentation you might get away with it; with two varying parts, some language will prefer to change their order.
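Keeping the whole sentence as one template with named placeholders lets each translation put the varying parts wherever its grammar wants them. The Python sketch below is illustrative only: `TEMPLATES` and `render` are hypothetical names, and the German wording is an example rather than a vetted translation.

```python
# Hypothetical per-language templates; the German wording is illustrative.
TEMPLATES = {
    "en": "Order {order} will arrive in {days} days.",
    "de": "In {days} Tagen kommt Bestellung {order} an.",
}


def render(lang, **values):
    """Fill the named placeholders of the template for `lang`."""
    return TEMPLATES[lang].format(**values)
```

The calling code passes the same values for every language, for example `render("de", order="A17", days=5)`, and only the resource file decides the word order.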