Theory, examples of reversible parsers? - language-agnostic

Does anyone out there know about examples and the theory behind parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice-versa. Mathematically, at least intuitively, I believe the function of code->AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?

Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimization or code generation.
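To make that concrete, here is a minimal sketch in Python of such a visitor walking a toy expression AST and regenerating source text; the node classes and method names are made up for illustration, not taken from any particular library.

class Num:
    def __init__(self, value): self.value = value

class Add:
    def __init__(self, left, right): self.left, self.right = left, right

class CodeGenVisitor:
    def visit(self, node):
        # Dispatch on the node's class name, e.g. visit_Add for an Add node.
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Num(self, node):
        return str(node.value)

    def visit_Add(self, node):
        return "(" + self.visit(node.left) + " + " + self.visit(node.right) + ")"

# (1 + 2) + 3  ->  "((1 + 2) + 3)"
print(CodeGenVisitor().visit(Add(Add(Num(1), Num(2)), Num(3))))

A real code generator or optimizer would do richer work per node type, but the traversal skeleton stays the same.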

Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as "poker-ante" to mechanical processing (analyzing/transforming) of arbitrary languages. These provide full round-trip: source text to ASTs with captured position information (file/line/column) and comments, and AST to legal source text including regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting") options, including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, somewhat like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But like the parser generator, you don't have to see all the ugly detail.
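(This is not DMS's actual notation, but to give a feel for the box idea, here is a toy Python sketch in which a "box" is simply a list of lines, with horizontal and vertical composition and indentation as the combinators.)

# Toy box combinators: H lays boxes out side by side, V stacks them.
def text(s):
    return [s]

def H(*boxes, sep=" "):
    height = max(len(b) for b in boxes)
    padded = [b + [""] * (height - len(b)) for b in boxes]
    widths = [max(len(line) for line in b) for b in padded]
    return [sep.join(line.ljust(w) for line, w in zip(rows, widths))
            for rows in zip(*padded)]

def V(*boxes):
    return [line for b in boxes for line in b]

def indent(box, n=4):
    return [" " * n + line for line in box]

# Prettyprint a hypothetical if-statement node as a box layout.
body = V(text("x = x + 1;"), text("return x;"))
print("\n".join(V(H(text("if (x < 10)"), text("{")), indent(body), text("}"))))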
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.

I rather like lewap's response: "find a mathematical way to express a visitor and you have a dual to the parser."
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. The UML editor and the code editors are both implemented as views of the same underlying model, and editing either one modifies the model so that all views stay in sync.

Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars that are ambiguous, that is, some strings can be parsed in more than one way, but a given parse tree can always be converted back to a string in a unique way, modulo whitespace.
The Dragon book gives a good description of the theory of parsers.
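A tiny illustration of that asymmetry, sketched in Python with a made-up tuple representation: the string "1 - 2 - 3" has two parses under an ambiguous grammar, but each tree unparses to exactly one string.

def unparse(tree):
    # A tree is either a number or a tuple (op, left, right).
    if isinstance(tree, tuple):
        op, left, right = tree
        return "(" + unparse(left) + " " + op + " " + unparse(right) + ")"
    return str(tree)

left_assoc  = ("-", ("-", 1, 2), 3)   # (1 - 2) - 3
right_assoc = ("-", 1, ("-", 2, 3))   # 1 - (2 - 3)

print(unparse(left_assoc))    # ((1 - 2) - 3)
print(unparse(right_assoc))   # (1 - (2 - 3))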

There is theory, and there are working implementations and examples, of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at the following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json

I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (Spirit 2 was new in Boost 1.38 and is still in beta), but there are a few bits of Karma sample code around, and AFAIK the library is in a working state.

In addition to 'Visitor', 'unparser' is another good keyword to web-search for.

That sounds a lot like the back end of a non-optimizing compiler whose target language is the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.

I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
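One common way to do that tucking (a sketch of the idea, not any particular tool's scheme) is to attach the leading whitespace and comments to each token, so that deparsing is just concatenation:

class Token:
    def __init__(self, text, leading_trivia=""):
        self.text = text
        self.leading_trivia = leading_trivia  # whitespace/comments before the token

def deparse(tokens):
    return "".join(t.leading_trivia + t.text for t in tokens)

tokens = [
    Token("x"),
    Token("=", leading_trivia=" "),
    Token("1", leading_trivia=" "),
    Token(";"),
    Token("# a comment", leading_trivia="  "),
]
print(deparse(tokens))   # x = 1;  # a comment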

The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you read each item's data and transform it into destination code.
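As a rough sketch of that idea (Python, with a hypothetical dictionary-based tree): a hierarchical visitor exposes enter/leave hooks around each node's children, which is a natural place to emit the destination code.

def walk(node, enter, leave):
    enter(node)
    for child in node.get("children", []):
        walk(child, enter, leave)
    leave(node)

out = []
tree = {"kind": "function", "name": "f",
        "children": [{"kind": "return", "value": "42", "children": []}]}

def enter(node):
    if node["kind"] == "function":
        out.append("def " + node["name"] + "():")
    elif node["kind"] == "return":
        out.append("    return " + node["value"])

def leave(node):
    pass  # e.g. emit closing braces for brace-delimited target languages

walk(tree, enter, leave)
print("\n".join(out))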

There are several "lens languages" that allow bidirection transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.

Related

Custom parser for HTML5 and other languages

I'm attempting to write my very own custom parser (in C#) for (X)HTML5 and whatever might be embedded (EcmaScript, CSS) - just to learn and have fun. Although I'm an intermediate programmer I don't know much about parsers and all the technical stuff. I am able to create a lexical analyser (tokeniser) for HTML5 fairly easily, but the syntactical analysis (parsing) is a bit tricky. I'm not sure if I should first lexically analyse all the source input and then do the syntactic analysis, or try both at the same time: get chars until I have a token, realise what the token syntactically means, and then expect a certain token relevant to the previous one. The problem I face is that HTML might have other languages such as CSS and JavaScript embedded, and they, as far as I can see, will have different categories of tokens, so I'm not sure how to "know" where I am in the code as I tokenise it in order to have varying definitions of what a token "is". Any thoughts? Also, what are the benefits/drawbacks of analysing lexically first and then syntactically vs. doing both at the same time?
If this is purely for your own education regarding parsing, I would suggest tackling a much smaller/easier field than HTML, CSS and JS parsing, as HTML and JS both represent some really quite nasty parsing problems which even the most experienced parser writer would feel nervous tackling.
A language based off Scheme or Basic would probably be my first pick.
(A personal favourite is building a parser / interpreter as I go through http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-10.html )
(Also picking up a copy of something like Modern Compiler Design probably wouldn't hurt: http://www.amazon.com/Modern-Compiler-Design-D-Grune/dp/0471976970 )
If it has to be web related in order to keep your interest, I'd take a stab at doing your parser for one of the smaller web related languages such as sass ( http://sass-lang.com )
On the other hand, if this is something work related where you really need to parse those specific things, I'd suggest skipping the effort of writing your own parser entirely and hook into something like the Razor or Chromium libraries.
And to directly answer at least the second half of your question: I would recommend always splitting the various phases of parsing / interpreting out as far as possible from each other.
Each problem is difficult enough on its own without trying to be "too smart" and attempting to combine functionality into a single sweep.
Wherever possible I'd suggest keeping things as high level, abstract and "clean" as possible... thus construct a tree of nodes specifically for lexical parsing and another for syntactical parsing... and in the case of combined languages such as HTML, CSS and JS, a different AST and parsing code for each.
There is a great course on Udacity [1] called Programming Languages that covers the full concept of HTML and JavaScript processing.
It covers in depth the lexical analysis, parsing and interpretation. It only covers a subset of JavaScript, so you have further development ahead of you after you finish the course, but you will have acquired the general structure and the concepts.
[1] http://www.udacity.com/overview/Course/cs262/CourseRev/apr2012

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?
They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.
The main benefit of using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters implemented over ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax, you look at the AST parser; if a piece of code is not producing the expected results, then you look at the code that interprets the AST.
The other great advantage is when your syntax requires "lookahead", e.g. if your syntax allows a subroutine to be used before it is defined, it is trivial to validate the existence of a subroutine when you are using an AST - it's much more difficult with an "on the fly" parser.
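For example, checking forward references becomes two simple passes over the tree (here flattened to a list of nodes for brevity); a sketch with a made-up node layout:

# Pass 1 collects the defined subroutines, pass 2 checks every call site.
program = [
    {"kind": "call", "name": "helper"},          # used before it is defined
    {"kind": "sub",  "name": "helper", "body": []},
]

defined = {node["name"] for node in program if node["kind"] == "sub"}   # pass 1
for node in program:                                                    # pass 2
    if node["kind"] == "call" and node["name"] not in defined:
        raise NameError("call to undefined subroutine " + node["name"])
print("all calls resolve")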
You need "syntax trees" to represent the structure of most programming langauges, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specifically used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax, such as statement-terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs takes less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually an isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice, as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.
In general you are going to parse your code into some form of AST; it may be more or less of a formal model. So I think what Kirk Woll was getting at with his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages, and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST, which you can operate on like a much simpler version of the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with it once you have parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terence Parr, the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.
Late to the question, but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied by the parsing grammar. For simple languages, especially dynamically typed languages, this is a perfectly OK strategy. For more complex languages, or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, i.e. your variables are declared with fixed types, then the AST can be used to check that you're not assigning the wrong type to a variable. E.g. assigning a string to a variable that is declared to hold an integer would be wrong, and this can be caught more conveniently with the AST.
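A sketch of that kind of check, again with a hypothetical node layout: collect the declared types, then verify each assignment against them.

# Declarations record each variable's declared type; assignments are checked.
decls = {"count": "int", "name": "string"}

assignments = [
    {"var": "count", "value_type": "int"},
    {"var": "count", "value_type": "string"},   # should be rejected
]

for node in assignments:
    expected = decls[node["var"]]
    if node["value_type"] != expected:
        print("type error: cannot assign " + node["value_type"]
              + " to " + node["var"] + " (declared " + expected + ")")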
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.

Parsing language for both binary and character files

The problem:
You have some data and your program needs specified input, for example strings which are numbers. You are searching for a way to transform the original data into the format you need.
And the problem is: the source can be anything. It can be XML, property lists, or binary which contains the needed data deeply embedded in binary junk. And your output format may vary also: it can be number strings, floats, doubles....
You don't want to program. You want routines which give you commands capable of transforming the data into the form you wish. Surely such a tool would contain regular expressions, but it would be well designed and offer capabilities which are sometimes much easier to use and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If they can't, the users are doomed or have to resort to business intelligence tools. That is NOT the problem.
I am talking about a tool for a developer who knows what he is doing, but who is also dissatisfied with writing routines in a general-purpose language every time. A professional data manipulation tool, something like a hex editor, regex, vi, grep and a parser melted together, accessible through routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
writing macros which allow a command chain to be executed repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (No, I am not looking for a solution to this, it is just an example):
You want to read XML strings embedded in a binary file with variable-length records. Your tool reads the record length and deletes the junk surrounding your text. Now it splits open the XML and extracts the strings. The digits are Indic number glyphs and use decimal commas instead of decimal points, so your tool transliterates them into ASCII and replaces the commas with points. Now the results must be stored into matrices of variable length.... etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like, or, even if it does not fulfill the conditions, which one wouldn't you want to miss?
EDIT: The question is whether a solution for the problem exists and, if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort are available. You DO NOT invent your own text parsing method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions, or at least papers describing the problem and giving suggestions. And there are people who may have worked on and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and that I should work out and implement it myself without background knowledge seems to me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve into the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime, and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who wants to give implementing a good parsing language a try: the smallest set of operators needed to directly transform any input data into any output data, if (!) they were powerful enough, seems to be the following (a rough code sketch follows the list):
Insert/Remove: self-explanatory
Group/Ungroup: split the input data into a set of tokens and organize them into groups and supergroups (data structures, lists, tables, etc.)
Transform:
Substitution: change the content of the tokens (special operation: replace)
Transposition: change the order of tokens (swap, merge, etc.)
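Here is the rough sketch promised above: the operators as small composable functions over a token list (my own Python interpretation, not any existing tool's API), applied to a decimal-comma example like the one earlier.

import re

def group(data, pattern):            # Group: split raw input into tokens
    return re.findall(pattern, data)

def remove(tokens, predicate):       # Remove: drop unwanted tokens
    return [t for t in tokens if not predicate(t)]

def substitute(tokens, old, new):    # Substitution: change token content
    return [t.replace(old, new) for t in tokens]

def transpose(tokens, key=None):     # Transposition: reorder tokens
    return sorted(tokens, key=key)

raw = "junk 1.234,5 junk 9,75 junk"
tokens = group(raw, r"[0-9.,]+")                 # ['1.234,5', '9,75']
tokens = substitute(tokens, ".", "")             # strip thousands separators
tokens = substitute(tokens, ",", ".")            # decimal comma -> decimal point
print(transpose(tokens, key=float))              # ['9.75', '1234.5']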
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to do some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do, and whether it would generally meet their needs, without investing much time to learn the system in earnest.
Also, I realize that the scope and typical uses of SnapLogic differ, somewhat, from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonable size])
merge sources
filter records within a source (to select and, at a later step, work on say only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of SnapLogic's functionality (for the purpose described by the OP) is parsing. Standard components will only read generic file formats (XML, RSS, CSV, fixed length, DBMSes...); structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such, are unlikely to ever be handled by a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways: with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parses the file at "record" level (or line level or block level, whatever this may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different file formats, whereby each new format requires piecing together components with little or no coding. Whatever the approach used for the input/parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashions.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because I understood the feature gap concerning the "codeless" parsing of odd-formatted files, but also saw some commonality of features with regard to creating various processing pipelines.
I also hedged my suggestion with an expression like "get inspiration from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per se, nor a critique of business-oriented use of open source, but rather a warning that business/commercial pressures may shape the offering in various directions.)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python, which is a very swell language for coding various connectors, conversion logic, etc.
In reply to Paul Nathan: you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfectly we write it. So my opinion is that writing throwaway code is pretty much OK, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your problem: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and store it in some specific structure) or b) use some general-purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable, because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I might well be wrong, so please, if you find a perfect tool, drop a link here (I myself do lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with a bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and may be gradually turned into a DSL for the very purpose. Besides Perl, which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.

How do HTML parsers work if they're not using regexp?

I see questions every day asking how to parse or extract something from some HTML string and the first answer/comment is always "Don't use RegEx to parse HTML, lest you feel the wrath!" (that last part is sometimes omitted).
This is rather confusing to me; I always thought that, in general, the best way to parse any complicated string is to use a regular expression. So how does an HTML parser work? Doesn't it use regular expressions to parse?
One particular argument for using a regular expression is that there's not always a parsing alternative (such as JavaScript, where DOMDocument isn't a universally available option). jQuery, for instance, seems to manage just fine using a regex to convert an HTML string to DOM nodes.
Not sure whether or not to CW this, it's a genuine question that I want to be answered and not really intended to be a discussion thread.
So how does a HTML parser work? Doesn't it use regular expressions to parse?
Well, no.
If you reach back in your brain to a theory of computation course, if you took one, or a compilers course, or something similar, you may recall that there are different kinds of languages and computational models. I'm not qualified to go into all the details, but I can review a few of the major points with you.
The simplest type of language & computation (for these purposes) is a regular language. These can be generated with regular expressions, and recognized with finite automata. Basically, that means that "parsing" strings in these languages uses state, but not auxiliary memory. HTML is certainly not a regular language. If you think about it, tags can be nested arbitrarily deeply. For example, tables can contain tables, and each table can contain lots of nested tags. With regular expressions, you may be able to pick out a pair of tags, but certainly not anything arbitrarily nested.
A classic simple language that is not regular is correctly matched parentheses. Try as you might, you will never be able to build a regular expression (or finite automaton) that will always work. You need memory to keep track of the nesting depth.
A state machine with a stack for memory is the next strength of computational model. This is called a push-down automaton, and it recognizes languages generated by context-free grammars. Here, we can recognize correctly matched parentheses--indeed, a stack is the perfect memory model for it.
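For instance, the matched-parentheses check is trivial with a counter (the degenerate one-symbol stack), but no single regular expression handles arbitrary nesting depth; a quick Python sketch:

def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a closing paren with nothing open
                return False
    return depth == 0

print(balanced("((()))"))   # True
print(balanced("(()"))      # False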
Well, is this good enough for HTML? Sadly, no. Maybe for super-duper carefully validated XML, actually, in which all the tags always line up perfectly. In real-world HTML, you can easily find snippets like <b><i>wow!</b></i>. This obviously doesn't nest, so in order to parse it correctly, a stack is just not powerful enough.
The next level of computation is languages generated by general grammars, and recognized by Turing machines. This is generally accepted to be effectively the strongest computational model there is--a state machine, with auxiliary memory, whose memory can be modified anywhere. This is what programming languages can do. This is the level of complexity where HTML lives.
To summarize everything here in one sentence: to parse general HTML, you need a real programming language, not a regular expression.
HTML is parsed the same way other languages are parsed: lexing and parsing. The lexing step breaks down the stream of individual characters into meaningful tokens. The parsing step assembles the tokens, using states and memory, into a logically coherent document that can be acted on.
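A very rough two-phase sketch in Python (nothing like a production HTML parser, and it ignores all the error-recovery rules): the lexer chops the character stream into tag and text tokens, and the parser assembles them into a tree using a stack of open elements. Note that a regular expression is fine for the token-level chopping; it is the tree assembly that needs the stack.

import re

def lex(html):
    # Tokens are either tags ("<p>", "</p>") or runs of text.
    return [t for t in re.split(r"(<[^>]+>)", html) if t.strip()]

def parse(tokens):
    root = {"tag": "#root", "children": []}
    stack = [root]
    for tok in tokens:
        if tok.startswith("</"):
            stack.pop()                                # close current element
        elif tok.startswith("<"):
            node = {"tag": tok.strip("<>"), "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)                         # open a new element
        else:
            stack[-1]["children"].append(tok)          # text node
    return root

print(parse(lex("<p>Hello <b>world</b></p>")))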
Usually by using a tokeniser. The draft HTML5 specification has an extensive algorithm for handling "real world HTML".
Regular expressions are just one form of parser. An honest-to-goodness HTML parser will be significantly more complicated than can be expressed in regexes, using recursive descent, prediction, and several other techniques to properly interpret the text. If you really want to get into it, you might check out lex & yacc and similar tools.
The prohibition against using regexes for HTML parsing should probably be written more correctly as: "Don't use naive regular expressions to parse HTML..." (lest ye feel the wrath) "...and treat the results with caution." For certain specific goals, a regex may well be perfectly adequate, but you need to be very careful to be aware of the limitations of your regex and as cautious as is appropriate to the source of the text you're parsing (e.g., if it's user input, be very careful indeed).
Parsing HTML is the transformation of a linear text into a tree structure. Regular expressions cannot generally handle tree structures. The regular expression you need at each point to get the next token changes all the time. You can use regular expressions in a parser, but you will need a whole array of regular expressions for each possible state of parsing.
If you want to have a 100% solution: You need to write your own custom code that iterates through the HTML character-by-character and you need to have a tremendous amount of logic to determine if you should stop the current node and start the next.
The reason is that this is valid HTML:
<ul>
<li>One
<li>Two
<li>Three
</ul>
But so is this:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
If you are OK with a "90% solution": then using an XML parser to load a document is fine. Or using regex (though the XML parser is easier if you are the master of the content).
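To illustrate the kind of logic the "100% solution" needs, here is a toy Python sketch (operating on an already tokenised list) that closes an open <li> implicitly when the next <li> starts, so both <ul> examples above yield the same items:

def parse_list_items(tokens):
    items, current = [], None
    for tok in tokens:
        if tok == "<li>":
            if current is not None:       # implicit </li> before a new <li>
                items.append(current)
            current = []
        elif tok in ("</li>", "</ul>"):
            if current is not None:
                items.append(current)
                current = None
        elif current is not None:
            current.append(tok)           # text inside the current <li>
    return items

print(parse_list_items(["<ul>", "<li>", "One", "<li>", "Two", "</ul>"]))
print(parse_list_items(["<ul>", "<li>", "One", "</li>", "<li>", "Two", "</li>", "</ul>"]))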

runnable pseudocode?

I am attempting to determine prior art for the following idea:
1) user types in some code in a language called (insert_name_here);
2) user chooses a destination language from a list of well-known output candidates (javascript, ruby, perl, python);
3) the processor translates insert_name_here into runnable code in the destination language;
4) the processor then runs the code using the relevant system call based on the chosen language
The reason this works is because there is a pre-established 1 to 1 mapping between all language constructs from insert_name_here to all supported destination languages.
(Disclaimer: This obviously does not produce "elegant" code that is well-tailored to the destination language. It simply does a rudimentary translation that is runnable. The purpose is to allow developers to get a quick-and-dirty implementation of algorithms in several different languages for those cases where they do not feel like re-inventing the wheel, but are required for whatever reason to work with a specific language on a specific project.)
Does this already exist?
The .NET CLR is designed such that C++.NET, C#.NET, and VB.NET all compile to the same intermediate language (CIL), and you can "decompile" that CIL back into any one of those languages.
So yes, I would say it already exists though not exactly as you describe.
There are converters available for different languages. The problem you are going to have is dealing with libraries. While mapping between language statements might be easy, finding mappings between library functions will be very difficult.
I'm not really sure how useful that type of code generator would be. Why would you want to write something in one language and then immediately convert it to something else? I can see the rationale for 4th Gen languages that convert diagrams or models into code but I don't really see the point of your effort.
Yes, a program that transforms a program from one representation to another does exist. It's called a "compiler".
And as to your question whether that is always possible: as long as your target language is at least as powerful as the source language, then it is possible. So, if your target language is Turing-complete, then it is always possible, because there can be no language that is more powerful than a Turing-complete language.
However, there does not need to be a dumb 1:1 mapping.
For example: the Microsoft Volta compiler which compiles CIL bytecode to JavaScript sourcecode has a problem: .NET has threads, JavaScript doesn't. But you can implement threads with continuations. Well, JavaScript doesn't have continuations either, but you can implement continuations with exceptions. So, Volta transforms the CIL to CPS and then implements CPS with exceptions. (Newer versions of JavaScript have semi-coroutines in the form of generators; those could also be used, but Volta is intended to work across a wide range of JavaScript versions, including obviously JScript in Internet Explorer.)
This seems a little bizarre. If you're using the term "prior art" in its most common form, you're discussing a potentially patentable idea. If that is the case, you have:
1/ Published the idea, starting the clock running on patent filing - I'm assuming, perhaps incorrectly, that you're based in the U.S. Other jurisdictions may have other rules.
2/ Told the entire planet your idea, which means it's pretty much useless to try and patent it, unless you act very fast.
If you're not thinking about patenting this and were just using the term "prior art" in a layperson's sense, I apologize. I work for a company that takes patents very seriously, and it's drilled into us, in great detail, what we're allowed to do with information before filing.
Having said that, patentable ideas must be novel, useful and non-obvious. I would think that your idea would not pass on the third of these, since you're describing a language translator for which there is the prior art of the many Pascal-to-C and Fortran-to-C converters out there.
The one glimmer of hope would be the ability of your idea to generate one of multiple output languages (which p2c and f2c don't do) but I think even that would be covered by the likes of cross compilers (such as gcc) which turn source into one of many different object languages.
IBM has a product called Visual Age Generator in which you code in one (proprietary) language and it's converted into COBOL/C/Java/others to run on different target platforms from PCs to the big honkin' System z mainframes, so there's your first problem (thinking about patenting an idea that IBM, the biggest patenter in the world, is already using).
Tons of them. p2c, f2c, and the original implementations of C++ and Objective-C strike me immediately. Beyond that, it's kind of hard to distinguish what you're describing from any compiler, especially for us old guys whose compilers generated ASM code as an intermediate representation anyway.