Data structures - what determines inclusion in a language library? - language-agnostic

For an assignment dealing with access control lists, we were required to construct a doubly-linked list first, as Java doesn't include that in the Sun API. I can understand the professor wanting us to create the doubly-linked list from scratch to understand how it works (like writing a sort routine instead of using the built-in methods), but why exclude it from the library?
That got me to thinking, what determines if a data structure is included in the basic language library? For example, in Java, why is there a LinkedList class but not a DoublyLinkedList?

According to the Java Docs:
All of the operations perform as could be expected for a doubly-linked list. Operations that index into the list will traverse the list from the beginning or the end, whichever is closer to the specified index.
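In other words, java.util.LinkedList is already a doubly-linked list; it just isn't named one. A quick sketch (nothing here beyond the standard library) shows the tell-tale behaviour of cheap operations at both ends and backwards iteration:

    import java.util.Iterator;
    import java.util.LinkedList;

    public class DoublyLinkedDemo {
        public static void main(String[] args) {
            LinkedList<String> list = new LinkedList<>();
            list.addLast("b");            // O(1) append at the tail
            list.addLast("c");
            list.addFirst("a");           // O(1) prepend at the head, thanks to the back links
            // walking backwards is only cheap because each node also links to its predecessor
            Iterator<String> it = list.descendingIterator();
            while (it.hasNext()) {
                System.out.println(it.next());   // prints c, b, a
            }
        }
    }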

I believe (and this is of course just opinion) that the only data structures all programming languages have built into them are the following: Array, List, Tree, Graph, and Bag. I think List and Bag are pretty much interchangeable in terms of vocabulary; the keyword being think.

Related

Jenkins API tree filter using regex?

I'm using the tree parameter to filter JSON data coming back from the API and that works great. My issue is I need to fetch some specific data from an array with a bunch of stuff I don't care about. I'm wondering if there is a way, using the tree parameter, to filter using a regex or a "contains" string?
For example, to give me back all fileNames that start with MyProject:
http://myapi.com?tree=fileName=MyProject*
Regular expressions are great for regular grammars.
Trees tend to follow context-free grammars. You might do a lot better with a language that supports context-aware operations, like XPath. Yes, a few very simple cases might work without the extra features of XPath; however, once you hit a use case that is beyond what is possible with regular grammars (they only support a small subset of what can be searched), it is literally impossible to accomplish the search with the tool in hand.
If you care to see how regular grammars come with limitations, study the pumping lemma, and then think deeply about its implications. A quick brush-up on parsing theory might also be useful. You are up against mathematics, including the parts of mathematics which include logical operations. It's not merely a difficult problem to solve: it has been proven that regular expressions cannot match context-free grammars.
If you are more interested in getting the job done quickly, I suggest you start off by reading up on XPath and trying to leverage one of the already available tools, or at least using it as a guide in your tree-matching efforts.
I found that by not using JSON and instead switching to XML, you can filter with XPath. An example that finds the URLs of all jobs whose name starts with "Test" is this:
https://{jenkins_instance_url}/view/All/api/xml?tree=jobs[name,url]&xpath=/*/job[starts-with(name,'Test')]/url&wrapper=jobs
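If the server-side xpath parameter isn't an option for you, the same starts-with filter can also be applied on the client after fetching the tree=... XML. A rough sketch using only the standard javax.xml APIs (the Jenkins URL is a placeholder and authentication is left out):

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class JenkinsXPathFilter {
        public static void main(String[] args) throws Exception {
            String api = "https://jenkins.example.com/api/xml?tree=jobs[name,url]";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL(api).openStream());
            XPath xpath = XPathFactory.newInstance().newXPath();
            // same filter as the server-side example: jobs whose name starts with "Test"
            NodeList urls = (NodeList) xpath.evaluate(
                    "/*/job[starts-with(name,'Test')]/url/text()", doc, XPathConstants.NODESET);
            for (int i = 0; i < urls.getLength(); i++) {
                System.out.println(urls.item(i).getNodeValue());
            }
        }
    }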

What's the use of abstract syntax trees?

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.
Why are ASTs useful?
They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.
The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.
The main benefit of using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters implemented over ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax, you look at the AST parser; if a piece of code is not producing the expected results, then you look at the code that interprets the AST.
The other great advantage is when your syntax requires "lookahead", e.g. if your syntax allows a subroutine to be used before it is defined, it is trivial to validate the existence of a subroutine when you are using an AST; it's much more difficult with an "on the fly" parser.
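As a rough illustration of why that lookahead becomes trivial (the node classes below are hypothetical, not from any particular parser): with the whole tree in hand you can make one pass to collect the subroutine definitions and a second pass to validate the calls, regardless of the order they appear in the source.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // hypothetical AST nodes for a tiny language
    interface Node { }
    record SubDef(String name) implements Node { }
    record SubCall(String name) implements Node { }

    class ForwardReferenceChecker {
        void check(List<Node> program) {
            Set<String> defined = new HashSet<>();
            // pass 1: collect every subroutine definition in the program
            for (Node n : program) {
                if (n instanceof SubDef def) {
                    defined.add(def.name());
                }
            }
            // pass 2: every call must name something collected in pass 1,
            // no matter whether the definition appears before or after the call
            for (Node n : program) {
                if (n instanceof SubCall call && !defined.contains(call.name())) {
                    throw new IllegalStateException("undefined subroutine: " + call.name());
                }
            }
        }
    }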
You need "syntax trees" to represent the structure of most programming langauges, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).
Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is typically used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax, such as statement-terminating semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, so writing analyzers that process ASTs takes less conceptual and engineering effort than writing the same analyzer over a CST. A better way to understand this is to realize that the AST is usually an isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice, as it loses less information from the original program (and my fancy example uses this approach).
I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.
In general you are going to parse your code into some form of AST; it may be more or less of a formal model. So I think what Kirk Woll was getting at with his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.
I use ANTLR often for parsing complex languages, and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST, which you can operate on like a much simpler version of the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with it once you have parsed it.
A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terence Parr, the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.
Late to the question, but I thought I'd add something. You don't actually have to build an AST. It is possible to emit instructions directly as you parse the source code. In this case, the AST is implied by the parsing grammar. For simple languages, especially dynamically typed ones, this is a perfectly OK strategy. For more complex languages, or where you need to further analyze the source code, an AST can be very useful. For example, if your language is statically typed, i.e. your variables are declared with fixed types, then the AST can be used to check that you're not assigning the wrong type to a variable, e.g. assigning a string to a variable that is declared to hold an integer would be wrong, and this can be caught more conveniently with the AST.
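A minimal sketch of that kind of check over a made-up AST (the node classes and type names are purely illustrative): the declared type of the variable and the type of the assigned value both sit on the tree, so the mismatch is easy to spot.

    import java.util.HashMap;
    import java.util.Map;

    // hypothetical AST nodes for a tiny statically typed language
    enum Type { INT, STRING }
    record VarDecl(String name, Type declaredType) { }
    record Assignment(String name, Type valueType) { }

    class TypeChecker {
        private final Map<String, Type> symbols = new HashMap<>();

        void visit(VarDecl decl) {
            // remember what type each variable was declared with
            symbols.put(decl.name(), decl.declaredType());
        }

        void visit(Assignment assign) {
            Type declared = symbols.get(assign.name());
            if (declared != null && declared != assign.valueType()) {
                // e.g. assigning a STRING to a variable declared as INT
                throw new IllegalStateException("cannot assign " + assign.valueType()
                        + " to " + assign.name() + " declared as " + declared);
            }
        }
    }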
Also, as others have mentioned, the AST offers a clean separation between syntax analysis and code generation and makes the code much more modular.

Parsing language for both binary and character files

The problem:
You have some data and your program needs a specified input, for example strings which are numbers. You are searching for a way to transform the original data into the format you need.
And the problem is: the source can be anything. It can be XML, property lists, or binary which contains the needed data deeply embedded in binary junk. And your output format may also vary: it can be number strings, floats, doubles....
You don't want to program. You want routines which give you commands capable of transforming the data into the form you wish. Surely it contains regular expressions, but it is very well designed, and it offers capabilities which are sometimes much easier to use and more powerful.
ADDITION:
Many users have this problem and hope that their programs can convert, read and write data which is given by other sources. If they can't, they are doomed or have to use programs like business intelligence suites. That is NOT the problem.
I am talking about a tool for a developer who knows what he is doing, but who is also tired of writing routines in a regular language every single time. A professional data-manipulation tool: something like a hex editor, regex, vi, grep and a parser melted together, accessible via routines or a REPL.
If you have the spec of the data format, you can access and transform the data at once. No need to debug or plan meticulously how to program the transformation. I am searching for a solution because I don't believe the problem is new.
It allows:
joining/grouping/merging of results
inserting/deleting/finding/replacing
write macros which allow you to execute a command chain repeatedly
meta-grouping (lists->tables->n-dimensional tables)
Example (no, I am not looking for a solution to this, it is just an example):
You want to read XML strings embedded in a binary file with variable-length records. Your tool reads the record length and deletes the junk surrounding your text. Now it splits open the XML and extracts the strings. Because they are Indian number glyphs and contain decimal commas instead of decimal points, your tool transforms them into ASCII and replaces the commas with points. Now the results must be stored into matrices of variable length.... etc. etc.
I am searching for a good language / language-design and if possible, an implementation.
Which design do you like, or, even if it does not fulfill the conditions, which one wouldn't you want to miss?
EDIT: The question is whether a solution for the problem exists and, if yes, which implementations are available. You DO NOT implement your own sorting algorithm if Quicksort, Mergesort and Heapsort are available. You DO NOT invent your own text-parsing method if you have regular expressions. You DO NOT invent your own 3D language for graphics if OpenGL/Direct3D is available. There are existing solutions, or at least papers describing the problem and giving suggestions. And there are people who may have worked on and experienced such problems and who can give ideas and suggestions. The idea that this problem is totally new and that I should work it out and implement it myself without background knowledge seems to me, I must admit, totally off the mark.
UPDATE:
Unfortunately I had less time than anticipated to delve into the subject because our development team is currently in a hot phase. But I have contacted the author of TextTransformer and he kindly answered my questions.
I have investigated TextTransformer (http://www.texttransformer.de) in the meantime and so far I can see it offers a complete and efficient solution if you are going to parse character data.
For anyone who wants to give implementing a good parsing language a try, the smallest set of operators needed to directly transform any input data into any output data, if (!) they were powerful enough, seems to be the following (a rough sketch follows the list):
Insert/Remove: Self-explanatory
Group/Ungroup: Split the input data into a set of tokens and organize them into groups and supergroups (data structures, lists, tables etc.)
Transform
Substitution: Change the content of the tokens (special operation: replace)
Transposition: Change the order of tokens (swap, merge etc.)
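As a very rough sketch of what that could look like as an API (everything below is hypothetical and only meant to make the idea concrete): each command maps one token stream to another, so commands compose freely and a macro is just a recorded chain of commands.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.UnaryOperator;

    // every command maps one token list to another, so commands chain naturally
    interface Command extends UnaryOperator<List<String>> { }

    final class Commands {
        // Insert/Remove: drop tokens matching a pattern
        static Command remove(String pattern) {
            return tokens -> tokens.stream().filter(t -> !t.matches(pattern)).toList();
        }
        // Substitution: change the content of tokens
        static Command replace(String pattern, String replacement) {
            return tokens -> tokens.stream().map(t -> t.replaceAll(pattern, replacement)).toList();
        }
        // Transposition: change the order of tokens
        static Command reverse() {
            return tokens -> {
                List<String> copy = new ArrayList<>(tokens);
                Collections.reverse(copy);
                return copy;
            };
        }
        // a macro is simply a recorded chain of commands applied in sequence
        static Command macro(List<Command> steps) {
            return tokens -> {
                List<String> current = tokens;
                for (Command step : steps) current = step.apply(current);
                return current;
            };
        }
    }

For instance, macro(List.of(remove("junk.*"), replace(",", "."))) would drop junk tokens and turn decimal commas into points, one small step of the example above. Group/Ungroup is left out of the sketch because it changes the shape of the data (tokens become nested structures) and needs a richer type than a flat token list.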
Have you investigated TextTransformer?
I have no experience with this, but it sounds pretty good and the author makes quite competent posts in the comp.compilers newsgroup.
You still have to do some programming work.
For a programmer, I would suggest:
Perl against a SQL backend.
For a non-programmer, what it sounds like you're looking for is some sort of business intelligence suite.
This suggestion may broaden the scope of your search too much... but here it is:
You could either reuse, as-is, or otherwise get "inspiration" from the [open source] code of the SnapLogic framework.
Edit (answering the comment on SnapLogic documentation etc.)
I agree, the SnapLogic documentation leaves a bit to be desired, in particular for people in your situation, i.e. when just trying to quickly get an overview of what SnapLogic can do and whether it would generally meet their needs, without investing much time or learning the system in earnest.
Also, I realize that the scope and typical uses of SnapLogic differ somewhat from the requirements expressed in the question, and I should have taken the time to better articulate the possible connection.
So here goes...
A salient and powerful feature of SnapLogic is its ability to [virtually] codelessly create "pipelines" i.e. processes made from pre-built components;
Components addressing the most common needs of Data Integration tasks at-large are supplied with the SnapLogic framework. For example, there are components to
read in and/or write to files in CSV or XML or fixed length format
connect to various SQL backends (for either input, output or both)
transform/format [readily parsed] data fields
sort records
join records for lookup and general "denormalized" record building (akin to SQL joins but applicable to any input [of reasonable size])
merge sources
filter records within a source (to select and, at a later step, work on, say, only records with attribute "State" equal to "NY")
see this list of available components for more details
A relatively weak area of SnapLogic's functionality (for the purpose described by the OP) is parsing. Standard components will only read generic file formats (XML, RSS, CSV, fixed-length, DBMSes...), so structured (or semi-structured?) files such as the one described in the question, with mixed binary and text and such, are unlikely to ever be covered by a standard component.
You'd therefore need to write your own parsing logic, in Python or Java, respecting the SnapLogic API of course so the module can later "play nice" with the other ones.
BTW, the task of parsing the files described could be done in one of two ways: with a "monolithic" reader component (i.e. one which takes in the whole file and produces an array of readily parsed records), or with a multi-component approach, whereby an input component reads in and parses the file at "record" level (or line level or block level, whatever that may be), and other standard or custom SnapLogic components are used to create a pipeline which effectively expresses the logic of parsing a record (or block or...) into its individual fields/attributes.
The second approach is of course more modular and may be applicable if the goal is to process many different file formats, whereby each new format requires piecing together components with little or no coding. Whatever the approach used for the input/parsing of the file(s), the SnapLogic framework remains available to create pipelines to then process the parsed input in various fashions.
My understanding of the question therefore prompted me to suggest SnapLogic as a possible framework for the problem at hand, because, while I understood the feature gap concerning the "codeless" parsing of odd-formatted files, I also saw some commonality of features with regard to creating various processing pipelines.
I also hedged my suggestion, with an expression like "inspire oneself from", because of the possible feature gap, but also because of the relative lack of maturity of the SnapLogic offering and its apparent commercial/open-source ambivalence.
(Note: this statement is neither a critique of the technical maturity/value of the framework per se, nor a critique of the business-oriented use of open source, but rather a warning that business/commercial pressures may shape the offering in various directions.)
To summarize:
Depending on the specific details of the vision expressed in the question, SnapLogic may be worthy of consideration, provided one understands that "some-assembly-required" will apply, in particular in the area of file parsing, and that the specific shape and nature of the product may evolve (but then again it is open source so one can freeze it or bend it as needed).
A more generic remark is that SnapLogic is based on Python, which is a very swell language for coding various connectors, conversion logic etc.
In reply to Paul Nathan: you mentioned writing throwaway code as something rather unpleasant. I don't see why it should be so. After all, all of our code will be thrown away and replaced eventually, no matter how perfectly we wrote it. So my opinion is that writing throwaway code is pretty much OK, if you don't spend too much time writing it.
So, it seems that there are two approaches to solving your problem: either a) find some specific tool intended for the purpose (parse data, perform some basic operations on it and store it in some specific structure), or b) use some general-purpose language with lots of libraries and code it yourself.
I don't think that approach a) is viable, because sooner or later you'll bump into an obstacle not covered by the tool and you'll spend your time and nerves hacking the tool, or mailing the authors and waiting for them to implement what you need. I may well be wrong, so please, if you find a perfect tool, drop a link here (I myself do lots of data processing in my day job and I can't swear that I couldn't do it more efficiently).
Approach b) may at first seem "unpleasant", but given a nice high-level expressive language with a bunch of useful libraries (regexps, XML manipulation, creating parsers...) it shouldn't be too hard, and it may gradually be turned into a DSL for the very purpose. Besides Perl, which was already mentioned, Python and Ruby sound like good candidates for these languages (I bet some Lisp derivatives would be too, but I have no experience there).
You might find AntlrWorks useful if you go so far as defining formal grammars for what you're parsing.

Use of Facade Pattern

How can I know that I need a Facade pattern at a point in my application development?
How can I draw the line between the Facade pattern and the Template Method pattern?
For example: in [this] article, we see that int placeOrder(int CustomerID, List<BasketItem> Products) has a number of predefined steps in the algorithm. So why doesn't the author use the Template Method pattern here?
Facade deals with interface, not implementation. Its purpose is to hide internal complexity behind a single interface that appears simple on the outside. In the example from your question, the facade hides four classes (Order, OrderLine, Address, BasketItem) behind a single method.
Template method deals with implementation. Its purpose is to extract the common algorithm from several ones that differ only in a 'fill in the blanks' way. The template method in the superclass implements the common algorithm and each subclass 'fills in the blanks' in its own specific way.
So why doesn't the author use the Template Method pattern here?
It would make sense to make placeOrder a template method if there were several similar versions of the operation. Maybe a few methods like placePhoneOrder, placeInternetOrder, placeManuallyEnteredOrder could be refactored into a single template placeOrder with some subclasses implementing only the {phone,internet,manual}-specific differences.
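A minimal sketch of that refactoring (the class and method names here are hypothetical, just to show the shape): the template method in the superclass fixes the order of the steps, and each subclass fills in only the channel-specific blank.

    import java.util.List;

    abstract class OrderPlacer {
        // the template method: the sequence of steps is fixed here
        final void placeOrder(int customerId, List<String> products) {
            validate(customerId, products);
            capturePayment(customerId);      // the only step that differs per channel
            dispatch(products);
        }

        void validate(int customerId, List<String> products) { /* common checks */ }
        void dispatch(List<String> products) { /* common shipping logic */ }

        abstract void capturePayment(int customerId);
    }

    class PhoneOrderPlacer extends OrderPlacer {
        @Override
        void capturePayment(int customerId) { /* take card details over the phone */ }
    }

    class InternetOrderPlacer extends OrderPlacer {
        @Override
        void capturePayment(int customerId) { /* call the online payment gateway */ }
    }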
The facade pattern is appropriate when you have a complex system that you want to expose to clients in a simplified way, or you want to make an external communication layer over an existing system which is incompatible with your system. It is a structural pattern. See here: http://en.wikipedia.org/wiki/Facade_pattern
The template pattern, on the other hand, is a behavioral pattern that will help you when dealing with the inner implementation of a component. See here: http://en.wikipedia.org/wiki/Template_method_pattern
Suppose you have a few services, libraries or whatever. These libraries need to interoperate in order to perform some higher-level services. Then you may wish to wrap those calls and the initialization code that usually go together, and offer a bunch of functions to hide those details and make it simple to use those services for specific scenarios. That is a good use of the Facade pattern.
UPDATE: In the article mentioned, the PlaceOrder method has one single implementation that works for all orders. The Template pattern is meant to prescribe a series of steps that have to be followed, while allowing subclasses to offer their own implementations of those fixed steps. For example, if you needed orders for televisions to be processed differently from orders for microwaves, you could use the Template pattern to redefine some imaginary DispatchParcel method (a microwave is sent as a simple package, but a television comes with an extra service to help lift the heavy device to an upper floor). In our case there is no need to reimplement the ProcessOrder steps, so there is no need for the Template pattern, as one single implementation suits all types of orders.

Theory, examples of reversible parsers?

Does anyone out there know about examples of, and the theory behind, parsers that will take (maybe) an abstract syntax tree and produce code, instead of vice versa? Mathematically, at least intuitively, I believe the function code -> AST is reversible, but I'm trying to find work/examples of this... besides the usual resources like the Dragon book and such. Any ideas?
Such a thing is called a Visitor. It traverses the tree and does whatever has to be done, for example optimize or generate code.
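For example, here is a minimal sketch of a visitor over a toy expression tree that regenerates source text (the node classes are made up purely for illustration):

    // toy AST for arithmetic expressions
    interface Expr {
        String accept(ExprVisitor v);
    }

    record Num(int value) implements Expr {
        public String accept(ExprVisitor v) { return v.visit(this); }
    }

    record Add(Expr left, Expr right) implements Expr {
        public String accept(ExprVisitor v) { return v.visit(this); }
    }

    interface ExprVisitor {
        String visit(Num n);
        String visit(Add a);
    }

    // the "reverse parser": walks the tree and emits source text again
    class Unparser implements ExprVisitor {
        public String visit(Num n) { return Integer.toString(n.value()); }
        public String visit(Add a) {
            return "(" + a.left().accept(this) + " + " + a.right().accept(this) + ")";
        }
    }

Here new Add(new Num(1), new Num(2)).accept(new Unparser()) yields "(1 + 2)"; a real code generator would emit instructions instead of strings, but the traversal is the same.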
Our DMS Software Reengineering Toolkit insists on parsers and parser-inverses (called "prettyprinters") as the "poker ante" for mechanical processing (analyzing/transforming) of arbitrary languages. These provide a full round trip: source text to ASTs with captured position information (file/line/column) and comments, and AST back to legal source text, either regenerating the original token positions ("fidelity printing") or nicely formatted ("prettyprinting"), including regeneration of the comments.
Parsers are often specified by a combination of grammars and lexical definitions of tokens; these notations are typically compiled into efficient parsing engines, and DMS does that for the "parser" side, as you might expect. Other folks here suggest that a "visitor" is the way to do prettyprinting, and, like assembly code, it is the right way to implement prettyprinting at the lowest level of abstraction. However, DMS prettyprinters are specified in terms of a text-box construction language over grammar terms, somewhat like LaTeX, that enables one to control the placement of the various language elements horizontally, vertically, embedded, spaced, concatenated, laminated, etc. DMS compiles these into efficient low-level visitors (as other answers suggest) that implement the box generation. But, like with the parser generator, you don't have to see all the ugly detail.
DMS has some 30+ sets of these language front ends for various programming languages and formal notations, ranging from C++, C, Java, C#, COBOL, etc. to HTML, XML, assembly languages for some machines, temporal property specifications, specs for composable abstract algebras, etc.
I rather like lewap's response:
find a mathematical way to express a visitor and you have a dual to the parser
But you asked for a sample, so try this on for size: Visual Studio contains a UML editor with excellent symmetry. Both it and the code editors are implemented as views of the model, and editing either one modifies the model so that all views remain in sync.
Actually, generating code from a parse tree is strictly easier than parsing code, at least in a mathematical sense.
There are many grammars which are ambiguous, that is, a given string has no unique parse, but a parse tree can always be converted to a string in a unique way, modulo whitespace.
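As a concrete instance (a standard textbook example, not from the original answer): with the grammar E -> E - E | int, the string 1 - 2 - 3 has two parse trees, one grouping it as (1 - 2) - 3 and one as 1 - (2 - 3), so parsing is not unique; yet flattening either tree back out from left to right reproduces exactly the same string 1 - 2 - 3, so the unparsing direction is.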
The Dragon book gives a good description of the theory of parsers.
There are theory, working implementations and examples of reversible parsing in Haskell. The library is by Paweł Nowak. Please refer to
https://hackage.haskell.org/package/syntax
as your starting point. You can find the examples at following URLs.
https://hackage.haskell.org/package/syntax-example
https://hackage.haskell.org/package/syntax-example-json
I don't know where to find much about the theory, but boost::spirit 2.0 has both qi (parser) and karma (generator), sharing the same underlying structure and grammar, so it's a practical implementation of the concept.
Documentation on the generator side is still pretty thin (spirit2 was new in Boost 1.38, and is still in beta), but there are a few bits of karma sample code around, and AFAIK the library's in a working state and there are at least some examples available.
In addition to 'Visitor', 'unparser' is another good keyword to web-search for.
That sounds a lot like the back end of a non-optimizing compiler whose target language is the same as its source language.
One question would be whether you require the "unparsed" code to be identical to the original, or just functionally equivalent.
For example, would it be OK for the output to use a different indentation style than the original? That information wouldn't normally be stored in the AST because it's not semantically important.
One thing to look at would be automatic code refactoring tools.
I've been doing these forever, and calling them "DeParse".
It only gets tricky if you also want to recapture whitespace and comments. You have to tuck them into the parse tree so you can regenerate them on output.
The "Visitor Pattern" idea is good. But, I should consider "Visitor" pattern as a lineal list pattern, or, as a generic pattern, and add patterns for more specific cases like Lists, Matrices, and Trees.
Look for a "Hierarchical Visitor Pattern" or "Tree Visitor Pattern" on the web.
You have a tree data structure ("Collection") and want to do something with the data, each time you "visit", "iterate" or "read" an item from the tree.
In your case, you have a tree data structure that represents the result of scanning/parsing some source code. Then you read each item's data and transform it into destination code.
There are several "lens languages" that allow bidirectional transformation of source code.
It is also possible to implement reversible parsers using definite clause grammars in Prolog. In SWI-Prolog, the phrase/3 predicate converts parse trees into text and vice-versa. This book provides some additional examples of reversible parsing in Prolog.