Convert chinese characters to hanyu pinyin - language-agnostic

How to convert from chinese characters to hanyu pinyin?
E.g.
你 --> Nǐ
马 --> Mǎ
More Info:
Either accents or numerical forms of hanyu pinyin are acceptable, the numerical form being my preference.
A Java library is preferred, however, a library in another language that can be put in a wrapper is also OK.
I would like anyone who has personally used such a library before to recommend or comment on it, in terms of its quality/ reliabilitty.

The problem of converting hanzi to pinyin is a fairly difficult one. There are many hanzi characters which have multiple pinyin representations, depending on context. Compare 长大 (pinyin: zhang da) to 长城 (pinyin: chang cheng). For this reason, single-character conversion is often actually useless, unless you have a system that outputs multiple possibilities. There is also the issue of word segmentation, which can affect the pinyin representation as well. Though perhaps you already knew this, I thought it was important to say this.
That said, the Adso Package contains both a segmenter and a probabilistic pinyin annotator, based on the excellent Adso library. It takes a while to get used to though, and may be much larger than you are looking for (I have found in the past that it was a bit too bulky for my needs). Additionally, there doesn't appear to be a public API anywhere, and its C++ ...
For a recent project, because I was working with place names, I simply used the Google Translate API (specifically, the unofficial java port, which, for common nouns at least, usually does a good job of translating to pinyin. The problem is commonly-used alternative transliteration systems, such as "HongKong" for what should be "XiangGang". Given all of this, Google Translate is pretty limited, but it offers a start. I hadn't heard of pinyin4j before, but after playing with it just now, I have found that it is less than optimal--while it outputs a list of potential candidate pinyin romanizations it makes no attempt to statistically determine their likelihood. There is a method to return a single representation, but it will soon be phased out, as it currently only returns the first romanization, not the most likely. Where the program seems to do well is with conversion between romanizations and general configurability.
In short then, the answer may be either any one of these, depending on what you need. Idiosyncratic proper nouns? Google Translate. In need of statistics? Adso. Willing to accept candidate lists without context information? Pinyin4j.

In Python try
from cjklib.characterlookup import CharacterLookup
cjk = CharacterLookup('C')
cjk.getReadingForCharacter(u'北', 'Pinyin')
You would get
['běi', 'bèi']
Disclaimer: I'm the author of that library.

For Java, I'd try the pinyin4j library

As mentioned in other answers the conversion is fuzzy and even google translate apparently gets a certain percentage of character combinations wrong.
A reasonable result which will not be 100% accurate can be achieved with open-source libraries available for some programming languages.
The simplest code to do the conversion with python with the pypinyin library (to install it use pip3 install pypinyin):
from pypinyin import pinyin
def to_pinyin(chin):
return ' '.join([seg[0] for seg in pinyin(chin)])
print(to_pinyin('好久不见'))
# OUTPUT: hǎo jiǔ bú jiàn
NOTE: The pinyin method from the module returns a list of possible candidate segments, and the to_pinyin method takes the first variant whenever more than one conversion is available. For tricky corner cases this is likely to produce incorrect results, but generally you'll probably get at least a ~90..95% success rate.
There are a few other python libraries for pinyin conversion but in my tests they proved to have a higher error rate than pypinyin. Also, they don't appear to be actively maintained.
If you need better accuracy then you'll need a more complex approach that will rely on bigger datasets and possibly some machine learning.

Related

Jenkins API tree filter using regex?

I'm using the tree parameter to filter JSON data coming back from the API and that works great. My issue is I need to fetch some specific data from an array with a bunch of stuff I don't care about. I'm wondering if there is a way, using the tree command, to filter using a regex or contains string?
For example, to give me back all fileNames that start with MyProject:
http://myapi.com?tree=fileName=MyProject*
Regular expressions are great for regular grammars.
Trees tend to follow context free grammars. You might do a lot better with a language that can support context aware operations, like XPath. Yes, a few very simple items might work without the extra features of XPath; however, once you do step on a use case that is beyond what is possible with regular grammars (they only support a small subset of what can be searched), it is literally impossible to accomplish the search with the tool in hand.
If you care to see how regular grammars tend to come with limitations, study the pumping lemma, and then think deeply about it's implications. A quick brush-up on parsing theory might also be useful. You are up against mathematics, including the parts of mathematics which include logical operations. It's not a matter of being a difficult problem to solve, it has been proven that regular expressions cannot match context free grammars.
If you are just more interested in getting the job done quickly. I suggest you start off by reading up on XPath and try to leverage one of the already available tools, or at least use it as a guide in your tree matching efforts.
I found that using not using the JSON and instead switching to XML you can filter with XPATH. An example on finding all job urls with name starting with "Test" is this:
https://{jenkins_instance_url}/view/All/api/xml?tree=jobs[name,url]&xpath=/*/job[starts-with(name,'Test')]/url&wrapper=jobs

How to translate from a language to another?

I don't want an automated solution.
When you have to translate a program from a language to another, what do you do? You prefer to rewrite it from the beginning or copy and paste it and change only what need to be changed?
What's the best choice?
It depends on
the goals (quick hack for one time use? long-lived production project for work?)
the resources I have (how many man-hours? Test suit and/or functional spec for old code? familiarity with both languages?)
most importantly, differences between the languages. Both the conceptual (OO? functional? reflection? control structures?) as well as available libraries.
Please note that this bullet is not as trivial as it seems - this depends in large part on how idiomatic the original program is - as an example, some people write very "C-like" Perl code (e.g. using C control flow and very C++ like OO design) which can be trivially copied to C or C++, and some people write incredibly intricate idiomatic Perl using functional programming, closures and reflective capabilties; which can't be obviously translated into C/C++.
Also, quality of the original code. E.g. good program will have separated business logic, usually expressed in standard configuration and control flow that's easier to directly clone.
E.g. translating from PHP to Perl for a hack job, you can often start out with copying, since many PHP constructs can be 1-to-1 mapped onto equivalent Perl constructs (just take your pick of Perl templating web library). The resulting code won't be GOOD Perl but will be Good Enough for some purposes.
On the other hand, translating, say, LISP code to Java, you're better off just translating the original code into a functionality specification and re-write from scratch. Your example of Python and JavaScript is probably in the same box.
Usually you have two languages that share at least some concepts (e.g. both have OO, and some imperative control structures) and thus you end up with some combination of the two approaches - parts of the code can be "thoughtlessly" translated, parts need to be re-written from scratch.
The more of the second (complete rewrite) approach, the better quality idiomatic and powerful code you end up with.
Normally, a rewrite using the original for inspiration and direction is the best choice.
The way you may do something in one language might very well not be the right way to do it in another.
When it comes to copy-paste, it is rare that you can just do that - languages are different and follow different syntax rules.
Of course, this all depends on the source and destination languages.
With your comment - javascript and python, I would say a rewrite is the best option.
That probably mostly depends on the similarity between the two languages and the meaning of "translation".
For instance, translating a bit of C89 code to C++ might not be so hard considering it should compile out of the box when you copy-paste (C++ is a compatible super-set to C89). I would hardly consider that a "translation", though.
On the other hand, translating Java to Haskell would certainly require a complete rewrite as the language paradygms, even worse than the syntax, are completely different.
Consider Wikipedia's list of programming languages. I'm too lazy to count how many of them are on this list, but let's assume there are 100.
If you want to translate one of them into another, that means that there are at least 100*99 = 9900 possible combinations for translation.
And that's an awful lot. Since most languages are unique, translation is very, very dependent on source and destination language.
Consider this Pascal to C converter. The author states it took him one and a half year to make a good translator for these particular languages. Obviously, this isn't a trivial task.
Depending on your ambitions, you might spend either one day, or many years translating a program from language A to language B.
How long this takes depends on your skill, size of your source code, complexity of languages A and B and their similarity.
As you can see, this isn't a trivial task and is highly dependent on your situation.

Parsing language for both binary and character files

The problem:
You have some data and your program needs specified input. For example strings which are numbers. You are searching for a way to transform the original data in a format you need.
And the problem is: The source can be anything. It can be XML, property lists, binary which
contains the needed data deeply embedded in binary junk. And your output format may vary
also: It can be number strings, float, doubles....
You don't want to program. You want routines which gives you commands capable to transform the data in a form you wish. Surely it contains regular expressions, but it is very good designed and it offers capabilities which are sometimes much more easier and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If it can't, they are doomed or use programs like business
intelligence. That is NOT the problem.
I am talking of a tool for a developer who knows what is he doing, but who is also dissatisfied to write every time routines in a regular language. A professional data manipulation tool, something like a hex editor, regex, vi, grep, parser melted together
accessible by routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allows to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read xml strings embedded in a binary file with variable length records. Your
tool reads the record length and deletes the junk surrounding your text. Now it splits open
the xml and extracts the strings. Being Indian number glyphs and containing decimal commas instead of decimal points, your tool transforms it into ASCII and replaces commas with points. Now the results must be stored into matrices of variable length....etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like or even, if it does not fulfill the conditions, wouldn't you want to miss ?
EDIT: The question is if a solution for the problem exists and if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort is available. You DO NOT invent your own text parsing
method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions or at least papers describing the problem and giving suggestions. And there are people who may have worked and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and I should work out and implement it myself without background
knowledge seems for me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve in the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who will give it a try to implement a good parsing language, the smallest set of operators to directly transform any input data to any output data if (!) they were powerful enough seems to be:
Insert/Remove: Self-explaining
Group/Ungroup: Split the input data into a set of tokens and organize them into groups
and supergroups (datastructures, lists, tables etc.)
Transform
Substituition: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap,merge etc.)
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and if it would generally meet their needs, without investing much time and learn the system in earnest.
Also, I realize that the scope and typical uses of of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonnable size])
merge sources
Filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of functionality of SnapLogic (for the described purpose of the OP) is with regards to parsing. Standard components will only read generic file formats (XML, RSS, CSV, Fixed Len, DBMSes...) therefore structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such are unlikely to ever be a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways, with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parse the file at "record" level (or line level or block level whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different files format, whereby each new format requires piecing together components with no or little coding. Whatever the approach used for the input / parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashion.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the gap in feature concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regards to creating various processing pipelines.
I also edged my suggestion, with an expression like "inspire onself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per-se, nor a critique of business-oriented use of open-source, but rather a warning that business/commercial pressures may shape the offering in various direction)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python which is a very swell language for coding various connectors, convertion logic etc.
In reply to Paul Nathan you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfect we wrote it. So my opinion is that writing throwaway code is pretty much ok, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your solution: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and storing it in some specific structure) or b) use some general purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might as well be wrong, so please if you find a perfect tool, drop here a link (I myself am doing lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Beside Perl which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.

runnable pseudocode?

I am attempting to determine prior art for the following idea:
1) user types in some code in a language called (insert_name_here);
2) user chooses a destination language from a list of well-known output candidates (javascript, ruby, perl, python);
3) the processor translates insert_name_here into runnable code in destination language;
4) the processor then runs the code using the relevant system call based on the chosen language
The reason this works is because there is a pre-established 1 to 1 mapping between all language constructs from insert_name_here to all supported destination languages.
(Disclaimer: This obviously does not produce "elegant" code that is well-tailored to the destination language. It simply does a rudimentary translation that is runnable. The purpose is to allow developers to get a quick-and-dirty implementation of algorithms in several different languages for those cases where they do not feel like re-inventing the wheel, but are required for whatever reason to work with a specific language on a specific project.)
Does this already exist?
The .NET CLR is designed such that C++.Net, C#.Net, and VB.Net all compile to the same machine language, and you can "decompile" that CLI back in to any one of those languages.
So yes, I would say it already exists though not exactly as you describe.
There are converters available for different languages. The problem you are going to have is dealing with libraries. While mapping between language statements might be easy, finding mappings between library functions will be very difficult.
I'm not really sure how useful that type of code generator would be. Why would you want to write something in one language and then immediately convert it to something else? I can see the rationale for 4th Gen languages that convert diagrams or models into code but I don't really see the point of your effort.
Yes, a program that transform a program from one representation to another does exist. It's called a "compiler".
And as to your question whether that is always possible: as long as your target language is at least as powerful as the source language, then it is possible. So, if your target language is Turing-complete, then it is always possible, because there can be no language that is more powerful than a Turing-complete language.
However, there does not need to be a dumb 1:1 mapping.
For example: the Microsoft Volta compiler which compiles CIL bytecode to JavaScript sourcecode has a problem: .NET has threads, JavaScript doesn't. But you can implement threads with continuations. Well, JavaScript doesn't have continuations either, but you can implement continuations with exceptions. So, Volta transforms the CIL to CPS and then implements CPS with exceptions. (Newer versions of JavaScript have semi-coroutines in the form of generators; those could also be used, but Volta is intended to work across a wide range of JavaScript versions, including obviously JScript in Internet Explorer.)
This seems a little bizarre. If you're using the term "prior art" in its most common form, you're discussing a potentially patentable idea. If that is the case, you have:
1/ Published the idea, starting the clock running on patent filing - I'm assuming, perhaps incorrectly, that you're based in the U.S. Other jurisdictions may have other rules.
2/ Told the entire planet your idea, which means it's pretty much useless to try and patent it, unless you act very fast.
If you're not thinking about patenting this and were just using the term "prior art" in a laypersons sense, I apologize. I work for a company that takes patents very seriously and it's drilled into us, in great detail, what we're allowed to do with information before filing.
Having said that, patentable ideas must be novel, useful and non-obvious. I would think that your idea would not pass on the third of these since you're describing a language translator which would have the prior art of the many pascal-to-c and fortran-to-c converters out there.
The one glimmer of hope would be the ability of your idea to generate one of multiple output languages (which p2c and f2c don't do) but I think even that would be covered by the likes of cross compilers (such as gcc) which turn source into one of many different object languages.
IBM has a product called Visual Age Generator in which you code in one (proprietary) language and it's converted into COBOL/C/Java/others to run on different target platforms from PCs to the big honkin' System z mainframes, so there's your first problem (thinking about patenting an idea that IBM, the biggest patenter in the world, is already using).
Tons of them. p2c, f2c, and the original implementation s of C++ and Objective C strike me immediately. Beyond that, it's kind of hard to distinguish what you're describing from any compiler, especially for us old guys whose compilers generated ASM code for an intermediate represetation anyway.

Theory, examples of reversible parsers?

Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?
Such thing is called a Visitor. Is traverses the tree and does whatever has to be done, for example optimize or generate code.
Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms something like Latex, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have see all the ugly detail.
DMS has some 30+ sets of these language front ends for a various programming langauge and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages from some machines, temporaral property specifications, specs for composable abstract algebras, etc.
I rather like lewap's response:
find a mathematical way to express a
visitor and you have a dual to the
parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. The way both it and the editors are implemented, all constitute views of the model, and editing either modifies the model resulting in all remaining in synch.
Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, there is no unique way to parse them, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.
There are theory, working implementations and examples of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json
I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.
In addition to 'Visitor', 'unparser' is another good keyword to web-search for.
That sounds a lot like the back end of a non-optimizing compiler that has it's target language the same as it's source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.
I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure, that represents the result of scanning/parsing some source code. Then you have read each item's data, and transform it into destination code.
There are several "lens languages" that allow bidirection transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.