If I were to go to a dev company and say "Please build me 'this' scripting language", what specific types of documents should I create so they know exactly what I am asking for?
Any guidance is much appreciated.
Thank you
As Dave Swersky says, there are lots and lots of very well developed scripting languages already out there. Lua, for example, is very simple yet surprisingly powerful and has tiny yet efficient implementations.
But if you really, really want a scripting language custom built for you, then the nicest thing you could do for the implementors would be to provide them with a description of the language in Backus-Naur form. Ideally, that will allow them to create a scanner and parser for your language with very little effort. In addition, you will need to define your language's data types and how they map to the hardware's data types, and you will need to specify which library or built-in functions you want to provide, and how those will have effects in the outside world. The Lua Reference Manual is concise yet complete and defines that language and its library well. You'll want to come up with something comparable, or make do with a hand-waving simplification and hope your implementors invest their own brain power in filling the gaps left by your description.
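For a flavour of what such a description looks like, here is a toy BNF fragment; the rules are invented purely for illustration and are nowhere near a complete grammar:

<statement>      ::= <assignment> | <if-statement>
<assignment>     ::= <identifier> "=" <expression> ";"
<if-statement>   ::= "if" "(" <expression> ")" "{" <statement-list> "}"
<statement-list> ::= <statement> | <statement> <statement-list>

A real specification would pin down every production, along with operator precedence and the lexical rules, so the implementors never have to guess.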
It would help if you could offer some detail on why you would want to go to the time and expense to develop your own scripting language when there are at least three or four excellent options out there for free. If you have a targeted need for a language that does something specific, read up on Domain Specific Languages.
Most people would agree that internationalizing an existing app is more expensive than developing an internationalized app from scratch.
Is that really true? Or when you write an internationalized app from scratch, is the cost of doing I18N just amortized over multiple small assignments, so that nobody feels the whole weight of the internationalization task on their shoulders?
You could even claim that a mature app has many, many lines of code that were deleted during the project's history, which never need to be I18Ned if internationalization is done as an afterthought, but which would have been I18Ned if the project had been internationalized from the very beginning.
So do you think a project starting today must be internationalized, or can that decision be deferred to the future based on the success (or not) the software enjoys and the geographic distribution of the demand?
I am not talking about the ability to manipulate unicode data. That you have for free in most mainstream languages, databases and libraries. I am talking specifically of supporting your own software's user interface in multiple languages and locales.
"when you write an internationalized app from scratch the cost of doing I18N is ... amortized"
However, that's not the whole story.
Retroactively tracking down every message to the users is -- in some cases -- impossible.
Not hard. Impossible.
Consider this.
theMessage = "Some initial part" + some_function() + "some following part";
You're going to have a terrible time finding all of these kinds of situations. After all, some_function just returns a String. You don't know if it's a database key (never shown to a person) or a message which must be translated. And when it's translated, grammar rules may reveal that a 3-part string concatenation was a dumb idea.
You can't simply GREP every String-valued function as containing a possible I18N message that must be translated. You have to actually read the code, and possibly rewrite the function.
Clearly, when some_function has any complexity to it at all, you're stumped as to why one part of your application is still in Swedish while the rest was successfully I18N'd into other languages. (Not to pick on Swedes in particular, replace this with any language used for development different from final deployment.)
Worse, of course, if you're working in C or C++, you might have some of this split between pre-processor macros and proper C-language syntax.
And in a dynamic language -- where code can be built on the fly -- you'll be paralyzed by a design in which you can't positively identify all the code. While dynamically generating code is a bad idea, it also makes your retroactive I18N job impossible.
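For contrast, a minimal sketch of what the maintainable alternative looks like, using Python's gettext (the catalog name, locale directory and placeholder are made up for the example): the message stays whole and parameterized, so tooling can find it and translators can reorder it.

import gettext

# Hypothetical catalog; fallback=True degrades to the original strings
# if no translation is installed, so the sketch runs as-is.
t = gettext.translation("myapp", localedir="locale", languages=["sv"], fallback=True)
_ = t.gettext

def some_function():
    return "WIDGET-42"  # stand-in for whatever the real function returns

# One whole sentence with a named parameter, instead of a 3-part concatenation.
message = _("Some initial part {value} some following part").format(value=some_function())
print(message)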
I'm going to have to disagree that it costs more to add i18n to an existing application than to build it into a new one from scratch.
A lot of the time i18n is not required until the application gets 'big'. When you do get big, you will likely have a bigger development team to devote to i18n so it will be less of a burden.
You may not actually need it. A lot of small teams put great effort into supporting internationalization when they have no customers who require it.
Once you have internationalized, it makes incremental changes more time consuming. It doesn't take a lot of extra time, but every time you need to add a string to the product, you need to add it to the bundle first and then add a reference. No, it is not a lot of work, but it is effort and does take a bit of time.
I prefer to 'cross that bridge when we come to it' and internationalize only when you have a paying customer looking for it.
Yes, internationalizing an existing app is definitely more expensive than developing the app as internationalized from day one. And it's almost never trivial.
For instance
Message = "Do you want to load the " & fileType() & " file?"
cannot be internationalised without some code alterations, because many languages have grammatical rules like gender agreement. You often need a different message string for loading every possible file type, unlike in English, where it's possible to bolt together substrings.
There are many other issues like this: you need more UI space because some languages need more characters than English to express the same concept, you need bigger fonts for East Asia, you need to use localised dates/times in the user interface but perhaps English US when communicating with databases, you need to use semicolon as a delimiter for CSV files, string comparisons and sorting are cultural, phone numbers & addresses...
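Going back to the file-type example, a minimal sketch of the workaround (keys and translations invented for illustration): give each file type its own complete message per language, rather than splicing fileType() into a single English template, so every language can apply its own gender and word-order rules.

# Hypothetical message table; a real app would keep this in resource files.
MESSAGES = {
    ("en", "image"): "Do you want to load the image file?",
    ("en", "text"):  "Do you want to load the text file?",
    ("fr", "image"): "Voulez-vous charger le fichier image ?",
    ("fr", "text"):  "Voulez-vous charger le fichier texte ?",
}

def load_prompt(locale_code, file_type):
    return MESSAGES[(locale_code, file_type)]

print(load_prompt("fr", "image"))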
So do you think a project starting today must be internationalized, or can that decision be deferred to the future based on the success (or not) the software enjoys and the geographic distribution of the demand?
It depends. How likely is the specific project to be internationalised? How important is it to get a first version out fast?
If you truly think you get "unicode handling" "for free", you may have a surprise coming your way when you try.
Unless you use a framework that has proven i18n ability beyond languages with the ANSI or very similar character sets, you will find several niggles and more major issues where the unicode handling isn't quite right, or simply unavailable. Even with relatively common languages (e.g. German) you can run into difficulty with shrinking or expanding letter counts and APIs that don't support unicode.
And then think of languages with a different reading order!
This is one of the reasons you should really plan it in from the beginning, and test the stuff to destruction on the set of languages you plan to support.
The concept of i18n and l10n is broader than merely translating strings to and from various languages.
Example: Consider the input of date and time by users. If you don't have internationalization in mind when you design (a) the user interface and (b) the storage, retrieval and display mechanisms, you will have a really bad time when you want to enable other input schemes.
Agreed, in most cases i18n is not necessary in the first place. But, and that is my point, if you don't spend a thought on the areas that must be touched for i18n, you will find yourself rewriting large portions of the original code. And then adding i18n is a lot more expensive than having spent some thought beforehand.
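As a rough sketch of the date/time separation described above (the input patterns are assumptions for the example): accept dates in a locale-specific format at the UI edge, but store and exchange them in one unambiguous internal form.

from datetime import datetime, timezone

# Assumed per-locale input patterns; a real app would take these from a
# locale library rather than hard-coding them.
LOCALE_INPUT_FORMATS = {"en_US": "%m/%d/%Y %H:%M", "de_DE": "%d.%m.%Y %H:%M"}

def parse_user_datetime(text, user_locale):
    local = datetime.strptime(text, LOCALE_INPUT_FORMATS[user_locale])
    # Simplification: real code would convert from the user's time zone.
    return local.replace(tzinfo=timezone.utc)

stored = parse_user_datetime("07.02.2008 14:30", "de_DE")
print(stored.isoformat())  # 2008-02-07T14:30:00+00:00 -- one form for storage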
One thing that can be a big issue is the different character counts for a message in various languages. I do some work on iPhone apps, and especially on a small screen, if you design the UI for a message that has 10 characters and then try to internationalize later and find you need 20 characters to display the same thing, you now have to redo your UI to accommodate it. Even with desktop apps this can still be a large PITA.
It depends on your project and how your team is organised.
I've been involved in the internationalization of a website, and it was one developer full-time for a year, probably about 6-8 months part-time for me to handle installation impacts when needed (reorganising files, etc), and other developers getting involved from time to time when their projects needed heavy refactoring. This was in an application that was at v3.
So that's definitely expensive. What you have to ask is how expensive is it to provide a localization system from the start, and how will that impact the project in the early stages. Your project at v1 may not be able to survive delays and setbacks caused by issues with a hastily-designed internationalization framework, while a stable v3 project with a wide customer base may have the capital to invest in doing that properly.
It also depends on whether you want to internationalize everything including log messages, or just the UI strings, and how many of those UI strings there are, and who you have available to do localization and the QA that goes with it, and even what languages you want to support - for example, does your system need to support unicode strings (which is a requirement for Asian languages).
And don't forget that changing the database backend to support internationalized data can be costly as well. Just try to change that varchar field to nvarchar when you already have 20,000,000 records.
I think it depends on the language. Every J2EE (Java web) app is i18n-ready, because it's very easy (the IDE can even extract the strings for you; you just name them).
In J2EE it's cheaper to add it later; however, the culture is to add it as soon as possible. I think that's because J2EE uses a lot of open source, and almost all open-source libraries are i18n'd. That's a great idea for them, but not for most J2EE apps: most enterprise apps are built for one company that speaks one language.
Plus, if you have bad testers, adding it too soon means they give you bug reports about labels and translations (only once have I seen translations done by someone other than the developers). After the testers are done with it, you have a buggy app with excellent i18n support. It might be fun for users to switch the language and see whether they can still use the app, but using your app is just boring work for them, so they won't even do that. The only users of i18n are the testers.
Weird string joining is not part of J2EE culture, since you know that one day someone might want to internationalize the app. The only problem is extracting labels from HTML templates.
I can't say which is more expensive, but I can tell you that a clean API lets you internationalize your application at very low cost.
Many programming languages share generic and even fairly universal features. For example, if you compared Java, VB6, .NET, PHP, Python, then you would find common functions such as control structures, numeric and string manipulation, etc.
What has been done to define these features at a meta-language (or language-agnostic) level?
UML offers a descriptive reference of software in every aspect, but the real-world focus seems to be data processes. Is UML relevant?
I'm not asking "Why we don't have a single language that replaces the current plethora." We need many different tools (at least in this eon).
I'm not asking that all languages fit a template -- assembly vs. compiled languages are different enough to make that unfeasible (and some folks call HTML a language, though I wouldn't). Any attempt would start with a properly narrow scope. In line with this, I wouldn't expect the model to cover even a small selection with full validity.
I would expect, however, that such a model could be used to transpose from one language to another (with limited goals -- think gist translation).
There have been many attempts at this, but none have been very successful. The earliest I'm aware of is UNCOL more than 50 years ago.
You've given a list of languages that have a lot in common because they're pretty similar -- they're all procedural languages with common roots and some OO extensions thrown in, so that's not too surprising. If you start looking at different languages like LISP, Haskell, Erlang, Prolog, or even SQL, you start seeing very different things.
What you're describing sounds like the formal semantics of programming languages. There are a variety of approaches and each will give a way to formally specify the meaning of a program in some programming language. In some cases, this specification is essentially a translation into another language such as lambda calculus, or compilation for a formally specified abstract machine such as SECD.
There is so much work here it's hard to pick a specific reference. But I hope I've given you some useful keywords to continue your search.
UML is typically used to define algorithms/code in simpler terms before moving on to real code.
To answer what I am guessing to be your question: there is already a well-understood set of constructs that languages share (while, for, if, else...). Will this ever be set as a standard, or made into a base library that is used by all languages? No, because the different developers of languages like to do it themselves.
I think the closest you can get to this without loss of generality is a Turing machine, which is not very useful for practical purposes. But if you allow Turing machine languages to be "labeled" and reused, you could build up the concepts you need, working from low- to high-level.
I think that MOF is the universal language.
You can, for example, create UML diagrams from MOF via a UML metamodel. If you save this metamodel information as XMI, you can store whatever information you need, and even more than in any one language. XMI's semantics are so rich that there is no limit to its use. If you map UML to XMI on top of a metamodel synchronized live with MOF, then that, for me, is the universal language.
The author of Pattern Calculus seems to propose such a universal model. I expect that it will turn out to be just as useful as previous attempts to define a universal model, that is to say, good in parts but not the last word.
What tools are available to aid in decoding unknown binary data formats?
I know Hex Workshop and 010 Editor both support structures. These are okay to a limited extent for a known fixed format but get difficult to use with anything more complicated, especially for unknown formats. I guess I'm looking at a module for a scripting language or a scriptable GUI tool.
For example, I'd like to be able to find a structure within a block of data from limited known information, perhaps a magic number. Once I've found a structure, then follow known length and offset words to find other structures. Then repeat this recursively and iteratively where it makes sense.
In my dreams, perhaps even automatically identify possible offsets and lengths based on what I've already told the system!
Here are some tips that come to mind:
From my experience, interactive scripting languages (I use Python) can be a great help. You can write a simple framework to deal with binary streams and some simple algorithms. Then you can write scripts that will take your binary and check various things. For example:
Do some statistical analysis on various parts. Random-looking data, for example, will tell you that a part is probably compressed or encrypted. Zeros may mean padding between parts. Scattered zeros may mean integer values or Unicode strings, and so on. Try to spot various offsets. Try to convert parts of the binary into 2- or 4-byte integers or into floats, print them, and see if they make sense. Write some functions that search for repeating or very similar parts in the data; this way you can easily spot headers.
Try to find as many strings as possible; try different encodings (C strings, Pascal strings, UTF-8/16, etc.). There are some good tools for that (I think Hex Workshop has such a tool). Strings can tell you a lot.
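As a taste of such a framework, here is a small Python sketch of the statistical idea above: byte entropy per chunk (high values suggest compression or encryption, long low-entropy zero runs suggest padding). The file name is a placeholder.

import math
from collections import Counter

def chunk_entropy(data):
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

data = open("unknown.bin", "rb").read()  # placeholder file name
for offset in range(0, len(data), 256):
    chunk = data[offset:offset + 256]
    if chunk:
        print(f"{offset:08x}  entropy={chunk_entropy(chunk):.2f}")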
Good luck!
For Mac OS X, there's a great tool that's even better than my iBored: Synalyze It!
(http://www.synalysis.net/)
Compared to iBored, it is better suited for non-blocked files, while also giving full control over structures, including scriptability (with Lua). And it visualizes structures better, too.
Tupni; to my knowledge not directly available out of Microsoft Research, but there is a paper about this tool which can be of interest to someone wanting to write a similar program (perhaps open source):
Tupni: Automatic Reverse Engineering of Input Formats (ACM Digital Library)
Abstract
Recent work has established the importance of automatic reverse engineering of protocol or file format specifications. However, the formats reverse engineered by previous tools have missed important information that is critical for security applications. In this paper, we present Tupni, a tool that can reverse engineer an input format with a rich set of information, including record sequences, record types, and input constraints. Tupni can generalize the format specification over multiple inputs. We have implemented a prototype of Tupni and evaluated it on 10 different formats: five file formats (WMF, BMP, JPG, PNG and TIF) and five network protocols (DNS, RPC, TFTP, HTTP and FTP). Tupni identified all record sequences in the test inputs. We also show that, by aggregating over multiple WMF files, Tupni can derive a more complete format specification for WMF. Furthermore, we demonstrate the utility of Tupni by using the rich information it provides for zero-day vulnerability signature generation, which was not possible with previous reverse engineering tools.
My own tool "iBored", which I released just recently, can do parts of this. I wrote the tool to visualize and debug file system formats (UDF, HFS, ISO9660, FAT etc.), and implemented search, copy and later even structure and templates support. The structure support is pretty straight-forward, and the templates are a way to identify structures dynamically.
The entire thing is programmable in a Visual BASIC dialect, allowing you to test values, read specific blocks, and all.
The tool is free and works on all platforms (Win, Mac, Linux), but as it's a personal tool that I just released to the public to share, it's not much documented.
However, if you want to give it a try, and like to give feedback, I might add more useful features.
I'd even open source it, but as it's written in REALbasic, I doubt many people will join such a project.
Link: iBored home page
I still occasionally use an old hex editor called A.X.E., Advanced Hex Editor. It seems to have largely disappeared from the Internet now, though Google should still be able to find it for you. The last version I know of was version 3.4, but I've really only used the free-for-personal-use version 2.1.
Its most interesting feature, and the one I've had the most use for deciphering various game and graphics formats, is its graphical view mode. That basically just shows you the file with each byte turned into a color-coded pixel. And as simple as that sounds, it has made my reverse-engineering attempts a lot easier at times.
I suppose doing it by eye is quite the opposite of doing automatic analysis, though, and the graphical mode won't be much use for finding and following offsets...
The later version has some features that sound like they could fit your needs (scripts, regularity finder, grammar generator), but I have no idea how good they are.
There is Hachoir, which is a Python library for parsing any binary format into fields that you can then browse. It has lots of parsers for common formats, but you can also write your own parsers for your files (e.g. when working with code that reads or writes binary files, I usually write a Hachoir parser first to have a debugging aid). It looks like the project is pretty much inactive by now, though.
Kaitai is an open-source language for describing binary structures in data streams. It comes with a translator that can output parsing code for many programming languages, for inclusion in your own program code.
My project icebuddha.com supports this, using Python to describe the format in the browser.
A cut'n'paste of my answer to a similar question:
One tool is WinOLS, which is designed for interpreting and editing vehicle engine management computer binary images (mostly the numeric data in their lookup tables). It has support for various endian formats (though not PDP, I think) and viewing data at various widths and offsets, defining array areas (maps) and visualising them in 2D or 3D with all kinds of scaling and offset options. It also has a heuristic/statistical automatic map finder, which might work for you.
It's a commercial tool, but the free demo will let you do everything but save changes to the binary and use engine management features you don't need.
How have you implemented Internationalization (i18n) in actual projects you've worked on?
I took an interest in making software cross-cultural after I read the famous post by Joel, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). However, I have yet to be able to take advantage of this in a real project, besides making sure I used Unicode strings where possible. But making all your strings Unicode and ensuring you understand the encoding of everything you work with is just the tip of the i18n iceberg.
Everything I have worked on to date has been for use by a controlled set of US English speaking people, or i18n just wasn't something we had time to work on before pushing the project live. So I am looking for any tips or war stories people have about making software more localized in real world projects.
It has been a while, so this is not comprehensive.
Character Sets
Unicode is great, but you can't get away with ignoring other character sets. The default character set on Windows XP (English) is Cp1252. On the web, you don't know what a browser will send you (though hopefully your container will handle most of this). And don't be surprised when there are bugs in whatever implementation you are using. Character sets can have interesting interactions with filenames when files move between machines.
Translating Strings
Translators are, generally speaking, not coders. If you send a source file to a translator, they will break it. Strings should be extracted to resource files (e.g. properties files in Java or resource DLLs in Visual C++). Translators should be given files that are difficult to break and tools that don't let them break them.
Translators do not know where strings come from in a product. It is difficult to translate a string without context. If you do not provide guidance, the quality of the translation will suffer.
While on the subject of context, you may see the same string "foo" crop up multiple times and think it would be more efficient to have all instances in the UI point to the same resource. This is a bad idea. Words may be very context-sensitive in some languages.
Translating strings costs money. If you release a new version of a product, it makes sense to reuse the old versions' translations. Have tools to recover strings from your old resource files.
String concatenation and manual manipulation of strings should be minimized. Use the format functions where applicable.
Translators need to be able to modify hotkeys. Ctrl+P is print in English; the Germans use Ctrl+D.
If you have a translation process that requires someone to manually cut and paste strings at any time, you are asking for trouble.
Dates, Times, Calendars, Currency, Number Formats, Time Zones
These can all vary from country to country. A comma may be used to denote decimal places. Times may be in 24-hour notation. Not everyone uses the Gregorian calendar. You need to be unambiguous, too. If you take care to display dates as MM/DD/YYYY for the USA and DD/MM/YYYY for the UK on your website, the dates are still ambiguous unless the user knows you've done it.
Especially Currency
The Locale functions provided in the class libraries will give you the local currency symbol, but you can't just stick a pound (sterling) or euro symbol in front of a value that gives a price in dollars.
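A small sketch of the formatting half of that in Python (locale names are platform-dependent and may not all be installed):

import locale

try:
    # Formats the number for the locale; it does NOT convert currency --
    # a dollar amount shown with a euro symbol is simply a wrong price.
    locale.setlocale(locale.LC_MONETARY, "de_DE.UTF-8")
    print(locale.currency(9.99, grouping=True))  # e.g. 9,99 €
except locale.Error:
    print("de_DE locale not installed on this system")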
User Interfaces
Layout should be dynamic. Not only are strings likely to double in length on translation, the entire UI may need to be inverted (Hebrew; Arabic) so that the controls run from right to left. And that is before we get to Asia.
Testing Prior To Translation
Use static analysis of your code to locate problems. At a bare minimum, leverage the tools built into your IDE. (Eclipse users can go to Window > Preferences > Java > Compiler > Errors/Warnings and check for non-externalised strings.)
Smoke test by simulating translation. It isn't difficult to parse a resource file and replace strings with a pseudo-translated version that doubles the length and inserts funky characters. You don't have to speak a language to use a foreign operating system. Modern systems should let you log in as a foreign user with translated strings and foreign locale. If you are familiar with your OS, you can figure out what does what without knowing a single word of the language.
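A rough sketch of that pseudo-translation trick in Python (file names are placeholders; Java .properties files are traditionally Latin-1): pad every value and wrap it in markers, so hard-coded strings and clipped layouts jump out.

def pseudo_translate(line):
    if "=" not in line or line.lstrip().startswith("#"):
        return line
    key, value = line.split("=", 1)
    # Lengthen the value and add accented characters to stress the layout.
    padded = value.rstrip("\n") + " \u00e5\u00df\u00e7" * max(1, len(value) // 4)
    return f"{key}=[!! {padded} !!]\n"

with open("messages.properties", encoding="latin-1") as src, \
     open("messages_xx.properties", "w", encoding="latin-1") as dst:
    for line in src:
        dst.write(pseudo_translate(line))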
Keyboard maps and character set references are very useful.
Virtualisation would be very useful here.
Non-technical Issues
Sometimes you have to be sensitive to cultural differences (offence or incomprehension may result). A mistake you often see is the use of flags as a visual cue for choosing a website language or geography. Unless you want your software to declare sides in global politics, this is a bad idea. If you were French and were offered English under St. George's flag (the flag of England is a red cross on a white field), this might result in confusion for many English speakers -- assume similar issues will arise with foreign languages and countries. Icons need to be vetted for cultural relevance. What does a thumbs-up or a green tick mean? Language should be relatively neutral -- addressing users in a particular manner may be acceptable in one region but considered rude in another.
Resources
C++ and Java programmers may find the ICU website useful: http://www.icu-project.org/
Some fun things:
Having a PHP and MySQL application that works well with German and French, but now needs to support Russian and Chinese. I think I'll move it over to .NET, as PHP's Unicode support is, in my opinion, not really good. Sure, juggling around with utf8_decode/utf8_encode or the mbstring functions is fun. Almost as fun as having Freddy Krüger visit you at night...
Realizing that some languages are a LOT more verbose than others. German is usually a LOT more verbose than English, and seeing how the German version destroys the user interface because too little space was allocated was not fun. Some products gained some fame for their creative workarounds, with Oblivion's "Schw.Tr.d.Le.En.W." being memorable :-)
Playing around with date formats, woohoo! Yes, there ARE actually people in the world who use date formats where the day goes in the middle. Sooooo much fun trying to find out what 07/02/2008 is supposed to mean, just because some users might believe it could be July 2... But then again, you guys over the pond may believe the same about users who put the month in the middle :-P, especially because in English "July 2" sounds a lot better than "2nd of July", something that does not necessarily apply to other languages (e.g. in German, you would never say "Juli 2" but always "Zweiter Juli"). I use 2008-02-07 whenever possible. It's clear that it means February 7 and it sorts properly, but dd/mm vs. mm/dd can be a really tricky problem.
Another fun thing: number formats! 10.000,50 vs. 10,000.50 vs. 10 000,50 vs. 10'000,50... This is my biggest nightmare right now, having to support a multicultural environment but not having any way to reliably know what number format the user will use (see the sketch after this list).
Formal or informal. In some languages there are two ways to address people, a formal way and a more informal way. In English you just say "you", but in German you have to decide between the formal "Sie" and the informal "Du"; the same goes for the French tu/vous. It's usually a safe bet to choose the formal way, but this is easily overlooked.
Calendars. In Europe the first day of the week is Monday, whereas in the US it's Sunday. Calendar widgets are nice, but showing a calendar with Sunday on the left and Saturday on the right to a European user is not so nice; it confuses them.
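Here is the kind of thing I mean with number formats, as a minimal Python sketch (locale names are platform-dependent and may not all be installed):

import locale

for name in ("en_US.UTF-8", "de_DE.UTF-8"):
    try:
        locale.setlocale(locale.LC_NUMERIC, name)
        # Same value, different grouping and decimal separators per locale.
        print(name, locale.format_string("%.2f", 10000.50, grouping=True))
    except locale.Error:
        print(name, "locale not installed on this system")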
I worked on a project for my previous employer that used .NET, and we used the built-in .resx format. We basically had a default .resx file with all the strings, and then one file per language with the translations. The consequence of this is that you have to be very diligent about ensuring that all strings visible in the application are stored in the .resx, and any time one is changed you have to update all the languages you support.
If you get lazy and don't notify the people in charge of translations, or you embed strings without going through your localization system, it will be a nightmare to try and fix it later. Similarly, if localization is an afterthought, it will be very difficult to put in place. Bottom line, if you don't have all visible strings stored externally in a standard place, it will be very difficult to find all that need to be localized.
One other note, very strictly avoid concatenating visible strings directly, such as
String message = "The " + item + " is on sale!";
Instead, you must use something like
String message = String.Format("The {0} is on sale!", item);
The reason for this is that different languages often order the words differently, and concatenating strings directly will need a new build to fix, but if you used some kind of string replacement mechanism like above, you can modify your .resx file (or whatever localization files you use) for the specific language that needs to reorder the words.
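A tiny sketch of that reordering in action (the German pattern is an illustrative translation, not from a real product): the placeholder moves with the language, and only the resource changes.

patterns = {
    "en": "The {0} is on sale!",
    "de": "Der Artikel {0} ist im Angebot!",  # illustrative translation
}
item = "Widget"
for lang, pattern in patterns.items():
    print(lang, pattern.format(item))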
I was just listening to a podcast from Scott Hanselman this morning, where he talks about internationalization, especially the really tricky things, like Turkish (with its four i's) and Thai. Also, Jeff Atwood had a post:
Besides all the previous tips, remember that i18n is not just about changing words for their equivalents in other languages; right-to-left alphabets such as Arabic and Hebrew require the whole UI to conform, like
item 1
item 2
item 3
would have to be
arabic text 1 -
arabic text 2 -
arabic text 3 -
(reversed bullet list doesn't seem to work :P)
which can be a UI nightmare if your system has to apply changes dynamically once the user changes the language being used.
Another very hard thing is testing different languages, not just for the correctness of the words: languages like Korean usually need bigger fonts for their characters, which may lead to language-specific bugs (like the "SAVE" text on a button being larger than the button itself in some language).
One of the funnier things to discover: italic and bold text markup does not work with CJK (Chinese/Japanese/Korean) characters. They simply become unreadable. (OK, I couldn't really read them before either, but bolding especially just creates ink blots.)
I think everyone working in internationalization should be familiar with the Common Locale Data Repository, which is now a sub-project of Unicode:
Common Locale Data Repository
Those folks are working hard to establish a standard resource for all kinds of i18n issues: currency, geographical names, tons of stuff. Any project that maintains its own core locale data, given that this project exists, is pretty bonkers, IMHO.
I suggest using something like 99translations.com to maintain your translations. Otherwise you won't be able to tell which of your translations are up to date in every language.
Another challenge will be accepting input from your users. In many cases, this is eased by the input processing provided by the operating system, such as IME in Windows, which works transparently with common text widgets, but this facility will not be available for every possible need.
One website I use has a translation method the owner calls "wiki + machine translation". This is a community-based site, so its needs are obviously different from those of companies.
http://blog.bookmooch.com/2007/09/23/how-bookmooch-does-its-translations/
One thing no one has mentioned yet is strings with a varying part, as in "The unit will arrive in 5 days" or "On Monday something happens", where 5 and Monday change depending on state. It is not a good idea to split those in two and concatenate the parts. With only one varying part and good documentation you might get away with it; with two varying parts there will be some language that prefers to change their order.
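For the quantity case, gettext-style plural support is the usual way out; a minimal Python sketch (NullTranslations stands in for a real catalog, which would carry each language's own plural rules):

import gettext

t = gettext.NullTranslations()  # a real app would load a translated catalog
n = 5
# ngettext picks the right plural form for n; languages with more than two
# plural forms are handled by rules in the catalog, not in the code.
print(t.ngettext("The unit will arrive in {n} day.",
                 "The unit will arrive in {n} days.", n).format(n=n))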