Abstract syntax tree construction and traversal

I am unclear on the structure of abstract syntax trees. To go "down (forward)" in the source of the program that the AST represents, do you go right on the very top node, or do you go down? For instance, would the example program
a = 1
b = 2
c = 3
d = 4
e = 5
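result in an AST that looks like this (diagram missing; from the description below, a single "main" node whose children are the five assignments in order):
or this (diagram missing; from the description below, the five assignment nodes chained one after another through "next" pointers):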
Where in the first one, going "right" on the main node will advance you through the program, but in the second one simply following the next pointer on each node will do the same.
It seems like the second one would be more correct since you don't need something like a special node type with a potentially extremely long array of pointers for the very first node. Although, I can see the second one becoming more complicated than the first when you get into for loops and if branches and more complicated things.

The first representation is the more typical one, though the second is compatible with the construction of a tree as a recursive data structure, as may be used when the implementation platform is functional rather than imperative.
Consider:
This is your first example, except shortened and with the "main" node (a conceptual straw man) more appropriately named "block," to reflect the common construct of a "block" containing a sequence of statements in an imperative programming language. Different kinds of nodes have different kinds of children, and sometimes those children include collections of subsidiary nodes whose order is important, as is the case with "block." The same might arise from, say, an array initialization:
int[] arr = {1, 2}
Consider how this might be represented in a syntax tree:
Here, the array-literal-type node also has multiple children of the same type whose order is important.
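The two shapes discussed above carry the same information, which a small Python sketch can make concrete. The names Assign, Block, Seq, block_to_seq and seq_to_list are hypothetical and chosen only for illustration: Block is the list-of-children form, Seq is the linked ("next pointer") form that suits a recursive, functional style.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assign:
    name: str
    value: int


@dataclass
class Block:
    # First representation: one node owning an ordered list of statements.
    statements: List[Assign]


@dataclass
class Seq:
    # Second representation: a statement plus a link to the rest of the program.
    first: Assign
    rest: Optional["Seq"] = None


def block_to_seq(block: Block) -> Optional[Seq]:
    """Build the linked form by prepending statements from the last to the first."""
    seq = None
    for stmt in reversed(block.statements):
        seq = Seq(stmt, seq)
    return seq


def seq_to_list(seq: Optional[Seq]) -> List[Assign]:
    """Follow the 'rest' links to recover the statements in source order."""
    out = []
    while seq is not None:
        out.append(seq.first)
        seq = seq.rest
    return out


block = Block([Assign("a", 1), Assign("b", 2), Assign("c", 3)])
linked = block_to_seq(block)
```

Converting in either direction is mechanical, so the choice mostly comes down to which form is more convenient for the code that builds and walks the tree.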

"Where in the first one, going 'right' on the main node will advance you through the program, but in the second one simply following the next pointer on each node will do the same. It seems like the second one would be more correct since you don't need something like a special node type with a potentially extremely long array of pointers for the very first node."
I'd nearly always prefer the first approach, and I think you'll find it much easier to construct your AST when you don't need to maintain a pointer to the next node.
I think it's generally easier to have all objects descend from a common base class, similar to this:
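// Common base class for every node in the tree
abstract class Expr { }

// A sequence of statements whose order matters
class Block : Expr
{
    Expr[] Statements { get; set; }
    public Block(Expr[] statements) { Statements = statements; }
}

class Assign : Expr
{
    Var Variable { get; set; }
    Expr Expression { get; set; }
    public Assign(Var variable, Expr expression)
    {
        Variable = variable;
        Expression = expression;
    }
}

class Var : Expr
{
    string Name { get; set; }
    // The constructor must carry the class name (Var)
    public Var(string name) { Name = name; }
}

class Int : Expr
{
    int Value { get; set; }
    public Int(int value) { Value = value; }
}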
Resulting AST is as follows:
Expr program =
new Block(new Expr[]
{
new Assign(new Var("a"), new Int(1)),
new Assign(new Var("b"), new Int(2)),
new Assign(new Var("c"), new Int(3)),
new Assign(new Var("d"), new Int(4)),
new Assign(new Var("e"), new Int(5)),
});

It depends on the language. In C, you'd have to use the first form to capture the notion of a block, since a block has a variable scope:
{
{
int a = 1;
}
// a doesn't exist here
}
The variable scope would be an attribute of what you call the "main node".
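As a rough illustration of a scope being an attribute of a block node, here is a minimal Python sketch. The Scope class and its declare/lookup methods are hypothetical names, not part of any real compiler; each block would own one Scope that points at the scope of its enclosing block.

```python
class Scope:
    """Symbol table for one block; 'parent' is the enclosing block's scope."""

    def __init__(self, parent=None):
        self.parent = parent
        self.names = {}

    def declare(self, name, value):
        self.names[name] = value

    def lookup(self, name):
        # Search this block first, then walk outwards through enclosing blocks.
        scope = self
        while scope is not None:
            if name in scope.names:
                return scope.names[name]
            scope = scope.parent
        raise NameError(name)


outer = Scope()        # the outer { ... }
inner = Scope(outer)   # the nested { int a = 1; }
inner.declare("a", 1)
```

Looking up "a" through inner succeeds, while the same lookup through outer raises NameError, mirroring the "a doesn't exist here" comment in the C snippet.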

I believe your first version makes more sense, for a couple of reasons.
Firstly, the first version more clearly demonstrates the "nestedness" of the program, and it is clearly implemented as a rooted tree (which is the usual concept of a tree).
The second, and more important, reason is that your "main node" could really have been a "branch node" (for example), which can simply be another node within a larger AST. This way, your AST can be viewed in a recursive sense, where each AST is a node with other ASTs as its children. This makes the design of the first version much simpler, more general, and very homogeneous.

Suggestion: When dealing with tree data structures, whether it is a compiler-related AST or some other kind, always use a single "root" node. It may help you perform operations and have more control:
class ASTTreeNode {
bool isRoot() {...}
string display() { ... }
// ...
}
void main ()
{
ASTTreeNode MyRoot = new ASTTreeNode();
// ...
// renders the root node, plus each subnode recursively
MyRoot.display();
}
Cheers.

Related

Returning by reference from struct method in D

I'm starting my journey in D from C++. In C++ passing by reference or value is quite explicit, but in D it seems to vary between structs and classes.
My question is how can I force a return by reference?
I have a simple XmlNode class for building Xml trees (which is a lift from my C++ code):
import std.stdio;
struct XmlNode
{
string _name;
string _data;
XmlNode[] _children;
this(string name, string data="")
{
_name = name;
_data = data;
}
//Trying to return a reference to the added Node
ref XmlNode addChild(string name,string data = "")
{
_children ~= XmlNode(name,data);
return _children[$-1];
}
string toString(bool bPlain = true, string indent = "")
{
//Omitted for brevity
}
}
And here is the testing code:
int main()
{
auto root = XmlNode("root");
//Chained call
root.addChild("Level 1").addChild("Level 2","42");
//Call in two parts
auto n = root.addChild("Level 1");
n.addChild("Level 2","101"); //n seems to be a copy not a reference
//Chained call
root.addChild("Level 1").addChild("Level 2","999");
writeln(root.toString(false));
return 0;
}
which gives the following output:
root
Level 1
Level 2
42
Level 1
Level 1
Level 2
999
As you can see the 'chained' use of addChild() performs as hoped. But if I try to break it up into two separate calls, only the first has an effect, and the second seems to operate on a copy of the first, not a reference. I optimistically added a ref qualifier to the addChild() signature, but that doesn't seem to help.
As ever, I'd be grateful for any advice (using DMD / Visual D / Visual Studio / Windows 10).

auto n = root.addChild("Level 1");
Here, though addChild returns a reference, it is assigned to a variable, and thus dereferenced and copied. Instead, you probably want:
auto n = &root.addChild("Level 1");
Note that D does not have reference variables like C++ does. A variable can only hold a pointer (though it's possible to write a wrapper template with reference-like semantics).
Also note that in the current design of XmlNode, the returned reference will only be valid until the next time _children is modified (as that may cause a reallocation and thus move the contents to another address, making any extant references outdated). It is a common footgun, which could be avoided by storing references of XmlNode (or making it a reference type i.e. a class), at the cost of extra dereferences and allocations.

Liskov Substitution Principle (LSP) with Code example

Liskov Substitution Principle requires that
Preconditions cannot be strengthened in a subtype.
Postconditions cannot be weakened in a subtype.
Invariants of the supertype must be preserved in a subtype.
History constraint (the "history rule"). Objects are regarded as being modifiable only through their methods (encapsulation). Since subtypes may introduce methods that are not present in the supertype, the introduction of these methods may allow state changes in the subtype that are not permissible in the supertype. The history constraint prohibits this.
Can anybody please post an example violating each of these points and another example solving those?

All four items in the question have been thoroughly reviewed in this article.
Preconditions cannot be strengthened in a subtype.
This answer presents a "real duck" and "electric duck" example; I suggest you go check it out. I'll use it in this item, for brevity.
The rule means that a subtype can't demand more from its callers than the base class did. In the above-mentioned answer's code, both ducks can swim, but the ElectricDuck will only swim if it's turned on. Therefore, any unit of code that requires a duck (through the interface IDuck) to swim won't work unless it explicitly checks that the duck is an ElectricDuck (and then turns it on), and that check needs to be implemented everywhere.
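The duck example can be sketched in Python as follows. The referenced code is not reproduced here, so the class and function names (Duck, ElectricDuck, SelfStartingElectricDuck, make_all_swim) are assumptions made for illustration. ElectricDuck strengthens the precondition of swim; SelfStartingElectricDuck shows one possible way to keep the subtype substitutable.

```python
class Duck:
    def swim(self) -> str:
        # Precondition: none. Any duck can be asked to swim at any time.
        return "swimming"


class ElectricDuck(Duck):
    def __init__(self):
        self.is_on = False

    def turn_on(self):
        self.is_on = True

    def swim(self) -> str:
        # Strengthened precondition: the caller must have turned the duck on first.
        if not self.is_on:
            raise RuntimeError("duck is switched off")
        return "swimming"


class SelfStartingElectricDuck(ElectricDuck):
    def swim(self) -> str:
        # One possible fix: satisfy the extra requirement internally,
        # so callers need nothing beyond what Duck.swim asks for.
        if not self.is_on:
            self.turn_on()
        return super().swim()


def make_all_swim(ducks):
    """Client code written against Duck only."""
    return [duck.swim() for duck in ducks]
```

Passing a plain ElectricDuck to make_all_swim raises, because the client has no reason to know about turn_on; the self-starting variant works without any change to the client.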
Postconditions cannot be weakened in a subtype.
For this one, we can step back from the duck analogy. Let's take this answer as a base. Assume we have a base class whose getter is guaranteed to return only non-negative integers. If a subtype overrides the getter and drops that guarantee, then every unit of code that took the guarantee for granted is at risk of breaking. Here's a representation of this idea:
public class IndexBaseClass
{
protected int index;
public virtual int Index
{
get
{
//Postcondition: never returns a negative number
return index < 0 ? 0 : index;
}
set
{
index = value;
}
}
}
public class IndexSubClass : IndexBaseClass
{
public override int Index
{
get
{
//Will pay no mind whether the number is positive or negative
return index;
}
}
}
public class Testing
{
public static int GetIndexOfList(IndexBaseClass indexObject)
{
var list = new List<int>
{
1, 2, 3, 4
};
return list[indexObject.Index];
}
}
If we call GetIndexOfList passing an IndexSubClass object, there's no guarantee that the index will be non-negative, so the list access may throw and break the application. Imagine you're already calling this method from all over your code: you'd have to waste time checking for negative values at every call site.
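For contrast, a subtype that keeps the postcondition might look like the Python sketch below. IndexBase, ClampedToLength and get_item are hypothetical names. The subtype narrows the range of results, which strengthens the postcondition and is therefore allowed. Note that in Python a negative index would not raise but would silently count from the end of the list, which makes a weakened postcondition even harder to notice.

```python
class IndexBase:
    def __init__(self, index: int):
        self._index = index

    @property
    def index(self) -> int:
        # Postcondition: the result is never negative.
        return max(self._index, 0)


class ClampedToLength(IndexBase):
    """Compliant subtype: it narrows the result range instead of widening it."""

    def __init__(self, index: int, length: int):
        if length < 1:
            raise ValueError("length must be at least 1")
        super().__init__(index)
        self._length = length

    @property
    def index(self) -> int:
        # Still never negative, and additionally never past the last element.
        return min(super().index, self._length - 1)


def get_item(index_object: IndexBase) -> int:
    """Client code that relies only on the base-class postcondition."""
    items = [1, 2, 3, 4]
    return items[index_object.index]
```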
Invariants of the supertype must be preserved in a subtype.
A parent class may have some invariants, that is, conditions that must remain true for as long as the object exists. No subclass should inherit from the class and eliminate such an invariant, at the risk of breaking all existing code that relies on it. In the example below, the parent class constructor throws an exception unless the flat rate is positive, but the subclass adds a setter that assigns the value without any check, so the invariant can be broken after construction.
The following code was taken from here:
public class ShippingStrategy
{
public ShippingStrategy(decimal flatRate)
{
if (flatRate <= decimal.Zero)
throw new ArgumentOutOfRangeException("flatRate", "Flat rate must be positive and non-zero");
this.flatRate = flatRate;
}
protected decimal flatRate;
}
public class WorldWideShippingStrategy : ShippingStrategy
{
public WorldWideShippingStrategy(decimal flatRate)
: base(flatRate)
{
//The subclass inherits the parent's constructor, but neglects the invariant (the value must be positive)
}
public decimal FlatRate
{
get
{
return flatRate;
}
set
{
flatRate = value;
}
}
}
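A version that preserves the invariant could route every write through the same check. The following Python sketch assumes a helper named _checked (a hypothetical name) that the subclass setter reuses:

```python
from decimal import Decimal


class ShippingStrategy:
    def __init__(self, flat_rate: Decimal):
        self._flat_rate = self._checked(flat_rate)

    @staticmethod
    def _checked(flat_rate: Decimal) -> Decimal:
        # Invariant: the flat rate is always strictly positive.
        if flat_rate <= Decimal(0):
            raise ValueError("Flat rate must be positive and non-zero")
        return flat_rate


class WorldWideShippingStrategy(ShippingStrategy):
    @property
    def flat_rate(self) -> Decimal:
        return self._flat_rate

    @flat_rate.setter
    def flat_rate(self, value: Decimal) -> None:
        # The setter reuses the parent's check, so the invariant survives mutation.
        self._flat_rate = self._checked(value)
```

With this shape, assigning a negative rate raises ValueError and leaves the previous valid value in place.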
History constraint (the "history rule").
This one is closely related to the previous rule. It states that a subtype should not introduce methods that mutate state the parent class treats as immutable, such as adding a setter in a subclass for a property that previously could only be set through the constructor.
An example:
public class Parent
{
protected int a;
public Parent(int a)
{
this.a = a;
}
}
public class Child : Parent
{
public Child(int a) : base(a)
{
this.a = a;
}
public void SetA(int a)
{
this.a = a;
}
}
A previously immutable property in the parent class is now mutable, thanks to the subclass. That is also a violation of the LSP.
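One way a subclass can offer "changed" values without breaking the history constraint is to return a new object instead of mutating the existing one. This is only a sketch in Python; the method name with_a is a hypothetical choice.

```python
class Parent:
    def __init__(self, a: int):
        self._a = a

    @property
    def a(self) -> int:
        # Read-only: no setter is defined, so 'a' is fixed after construction.
        return self._a


class Child(Parent):
    def with_a(self, a: int) -> "Child":
        # Instead of mutating in place, hand back a new object;
        # existing references keep observing the original value.
        return Child(a)


original = Child(1)
changed = original.with_a(2)
```

Code holding a reference to original still sees a == 1, so nothing it could observe through Parent has changed.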

Do you know the ICollection<T> interface?
Imagine you are writing a method that takes an ICollection<T> and manipulates it using its Add method or, better yet, its Clear method.
If someone passes a ReadOnlyCollection<T> (which implements ICollection<T>), you'll get a NotSupportedException for calling Add.
You would never expect that, since the interface says the call is fine; therefore ReadOnlyCollection<T> violates the LSP.
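A common way around this kind of problem is to keep read-only and mutable operations in separate interfaces, so that a read-only collection never has to promise an Add it cannot honour. Python's collections.abc module is organised this way (Sequence versus MutableSequence); the functions total and reset below are hypothetical and only illustrate the idea.

```python
from collections.abc import MutableSequence, Sequence


def total(items: Sequence) -> int:
    # Only reads, so any Sequence (list, tuple, ...) is an acceptable argument.
    return sum(items)


def reset(items: MutableSequence) -> None:
    # Mutates, so the parameter asks for MutableSequence explicitly.
    if not isinstance(items, MutableSequence):
        raise TypeError("reset() needs a mutable sequence")
    items.clear()
```

A tuple is accepted by total but rejected up front by reset, because tuple does not claim to be a MutableSequence in the first place.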

Communication between visitor and visitee

My current project contains a complex object hierarchy. The following structure is a simplified example of this hierarchy for demonstration purposes:
Library
Category "Fiction"
Category "Science Fiction"
Book A (Each book contains pages, not displayed here)
Book B
Category "Crime"
Book C
Category "Non-fiction"
(Many subcategories)
Now, I want to avoid having nested loops all over my code whenever I need some information from the data structure, because when the structure changes I'd have to update all the loops.
So I plan on using the visitor pattern, which seems to give me the flexibility I need. It would look something like this:
class Library
{
void Accept(ILibraryVisitor visitor)
{
IterateCategories(this.categories, visitor);
}
void IterateCategories(
IEnumerable<Category> categorySequence,
ILibraryVisitor visitor)
{
foreach (var category in categorySequence)
{
visitor.VisitCategory(category.Name);
IterateCategories(category.Subcategories, visitor);
foreach (var book in category.Books)
{
// Could also pass in a book instance, not sure about that yet...
visitor.VisitBook(book.Title, book.Author, book.PublishingDate);
foreach (var page in book.Pages)
{
visitor.VisitPage(page.Number, page.Content);
}
}
}
}
}
interface ILibraryVisitor
{
void VisitCategory(string name);
void VisitBook(string title, string author, DateTime publishingDate);
void VisitPage(int pageNumber, string content);
}
I'm already seeing some possible problems though, so I'm hoping you can give me some advice.
Question 1
If I wanted to create a list of book titles prefixed by the (sub)categories it belongs to (e.g. Fiction » Science Fiction » Book A), a simple visitor implementation would appear to do the trick:
// LibraryVisitor is a base implementation with no-op methods
class BookListingVisitor : LibraryVisitor
{
private Stack<string> categoryStack = new Stack<string>();
void VisitCategory(string name)
{
this.categoryStack.Push(name);
}
// Other methods
}
Here I have already run into a problem: I have no clue on when to pop the stack, because I don't know when a category ends. Is it a common approach to split up the VisitCategory method into two methods, like below?
interface ILibraryVisitor
{
void VisitCategoryStart(string name);
void VisitCategoryEnd();
// Other methods
}
Or are there other ways of dealing with structures like this, which have a clear scope with a start and end?
Question 2
Suppose I only want to list the books that were published in 1982. A decorator visitor would separate the filtering from the listing logic:
class BooksPublishedIn1982 : LibraryVisitor
{
private ILibraryVisitor visitor;
public BooksPublishedIn1982(ILibraryVisitor visitor)
{
this.visitor = visitor;
}
void VisitBook(string title, string author, DateTime publishingDate)
{
if (publishingDate.Year == 1982)
{
this.visitor.VisitBook(title, author, publishingDate);
}
}
// Other methods that simply delegate to this.visitor
}
The problem here is that VisitPage will still be called for books that are not published in 1982. So the decorator somehow needs to communicate with the visited object:
Visitor: 'Hey, this book isn't from 1982, so please don't tell me anything about it.'
Library: 'Oh ok, then I won't show you its pages.'
The visit methods currently return void. I could change it to return a boolean which indicates whether to visit sub-items, but that feels kind of dirty. Are there common practices for letting the visitee know that it should skip certain items? Or perhaps I should look into a different design pattern?
P.S. If you think these should be two separate questions, just let me know and I'll be happy to split them up.

The Visitor pattern, as described by the GoF book, deals with class hierarchies and not with object hierarchies. To put it simply, adding a new Visitor type acts like adding a new virtual function to the base class and all the children, without touching their code.
The machinery of a Visitor consists of one Visitor::Visit function per class in the hierarchy, and the Accept function in the parent class and in all the descendants. It works by calling Accept(visitor) through a parent class reference. The implementation of Accept in the object that happens to be referenced calls the right kind of Visitor::Visit(this). It is all fully orthogonal to any object hierarchy that may exist between instances of different subclasses of our root class.
In your case, the ILibraryVisitor interface would have a VisitLibrary(Library) method, a VisitCategory(Category) method, a VisitBook(Book) method, and so on, while each of Library, Category, Book and so on would inherit a common base class and reimplement its Accept(ILibraryVisitor) method.
So far so good. But from this point on your implementation seems to get a bit disoriented. A Visitor does not call its own Visit functions! Members of the hierarchy do; the Visitor implements these functions for their benefit. So how do we go down the category tree?
Remember that calling Accept(FooVisitor) stands in for calling a method Foo declared in the root of the hierarchy, and FooVisitor::VisitBar stands in for the implementation Bar::Foo. When we want to do something with an object, we call its methods, don't we? So let's do it (in pseudocode).
class LibraryVisitor : ILibraryVisitor
{
IterateChildren (List<ILibraryObject> objects) {
foreach obj in objects {
obj.Accept(this);
}
}
IterateSubcategories (Category cat) {
stack.push (cat); # we need a stack here to build a path
IterateChildren (cat.children); # both books and subcategories
stack.pop();
}
VisitLibrary (Library) = abstract
VisitCategory (Category) = abstract
VisitBook (Book) = abstract
VisitPage (Page) = abstract
}
class MyLibraryVisitor : LibraryVisitor {
VisitLibrary (Library l ) { ... IterateChildren (categories) ... }
VisitCategory (Category c) = { ... IterateSubcategories (c) ... }
VisitBook (Book) = { ... IterateChildren (pages) ... }
VisitPage (Page) = { ... no children here, end of walk ... }
}
Note the ping-pong action between Visit and Accept. Visitor calls Accept on the children of the current visitee, the children call Visitor::Visit back, and Visitor calls Accept on their children etc.
This is how your second question is answered:
class BooksPublishedIn1982 : LibraryVisitor
{
VisitBook (Book b) {
if b.publishedIn (1982) {
IterateChildren(b.pages)
}
}
}
Once again, it is apparent that the tree walk and the visitor machinery have just about nothing to do with each other.
I have left the decision of iterating or not iterating children entirely with each Visit implementation. This need not be the case; you can easily split each VisitXYZ into two functions, VisitXYZProper and VisitXYZChildren. By default, VisitXYZ will call both, and each concrete visitor may override that decision.
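The approach described in this answer can be sketched in runnable Python. Everything here is an assumption made for illustration: the Node, Category and Book classes, the visit_* naming convention, and the sample library. The visitor drives the descent itself, so it knows when a category ends (question 1) and can decline to descend (question 2).

```python
class Node:
    def __init__(self, children=()):
        self.children = list(children)

    def accept(self, visitor):
        # Double dispatch: pick the visit_* method matching this node's class.
        return getattr(visitor, "visit_" + type(self).__name__.lower())(self)


class Category(Node):
    def __init__(self, name, children=()):
        super().__init__(children)
        self.name = name


class Book(Node):
    def __init__(self, title, year):
        super().__init__()
        self.title = title
        self.year = year


class BookListing:
    """Collects 'Category » Subcategory » Title' lines; the visitor drives the walk."""

    def __init__(self, year=None):
        self.year = year
        self.path = []
        self.lines = []

    def visit_category(self, category):
        self.path.append(category.name)   # entering the category
        for child in category.children:
            child.accept(self)
        self.path.pop()                   # leaving it: the end of the scope is known here

    def visit_book(self, book):
        if self.year is not None and book.year != self.year:
            return                        # skip the book and anything below it
        self.lines.append(" » ".join(self.path + [book.title]))


library = Category("Fiction", [
    Category("Science Fiction", [Book("Book A", 1982), Book("Book B", 1990)]),
    Category("Crime", [Book("Book C", 1982)]),
])
```

Running BookListing() over library lists all three books with their category paths, while BookListing(1982) lists only the two books from that year; neither needs a boolean return value from the visit methods.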

Proper usage of "this." keyword in C#?

I'm working through the book Head First C# (and it's going well so far), but I'm having a lot of trouble wrapping my head around the syntax involved with using the "this." keyword.
Conceptually, I get that I'm supposed to use it to avoid having a parameter mask a field of the same name, but I'm having trouble actually tracking it through their examples (also, they don't seem to have a section dedicated to that particular keyword, they just explain it and start using it in their examples).
Does anyone have any good rules of thumb they follow when applying "this."? Or any tutorials online that explain it in a different way than Head First C#?
Thanks!

Personally, I only use it when I have to, which is:
Constructor chaining:
public Foo(int x) : this(x, null)
{
}
public Foo(int x, string name)
{
...
}
Copying from a parameter name into a field (not as common in C# as in Java, as you'd usually use a property - but common in constructors)
public void SetName(string name)
{
// Just "name = name" would be no-op; within this method,
// "name" refers to the parameter, not the field
this.name = name;
}
Referring to this object without any members involved:
Console.WriteLine(this);
Declaring an extension method:
public static TimeSpan Days(this int days)
{
return TimeSpan.FromDays(days);
}
Some other people always use it (e.g. for other method calls) - personally I find that clutters things up a bit.

StyleCop's default coding style enforces the following rule:
A1101: The call to {method or property
name} must begin with the 'this.'
prefix to indicate that the item is a
member of the class.
Which means that every method, field, property that belongs to the current class will be prefixed by this. I was initially resistant to this rule, which makes your code more verbose, but it has grown on me since, as it makes the code pretty clear. This thread discusses the question.

I write this. if and only if it enhances readability, for example, when implementing a Comparable interface (Java, but the idea is the same):
public int compareTo(MyClass other) {
if (this.someField > other.someField) return 1;
if (this.someField < other.someField) return -1;
return 0;
}
As to parameter shadowing (e.g. in constructors): I usually give the parameters a shortened form of the corresponding field's name, such as:
class Rect {
private int width, height;
public Rect(int w, int h) {
width = w;
height = h;
}
}

Basically, this gives you a reference to the current object. You can use it to access members on the object, or to pass the current object as parameters into other methods.
It is entirely unnecessary in almost all cases to place it before accessing member variables or method calls, although some style guidelines recommend it for various reasons.
Personally, I make sure I name my member variables to be clearly different from my parameters to avoid ever having to use 'this.'. For example:
private String _someData;
public String SomeData
{
get{return _someData;}
set{_someData = value;}
}
It's very much an individual preference though, and some people will recommend that you name the property and member variable the same (just case difference - 'someData' and 'SomeData') and use the this keyword when accessing the private member to indicate the difference.
So as for a rule of thumb - Avoid using it. If you find yourself using it to distinguish between local/parameters variables and member variables then rename one of them so you don't have to use 'this'.
The cases where I would use it are multiple constructors, passing a reference to other methods and in extension methods. (See Jon's answer for examples)

If you have a method inside a class which uses same class's fields, you can use this.
public class FullName
{
public string fn { set; get; }
public string sn { set; get; }
//overriding Equals method
public override bool Equals(object obj)
{
// "is" already evaluates to false for null, so no separate null check is needed
if (!(obj is FullName))
return false;
return this.fn == ((FullName)obj).fn &&
this.sn == ((FullName)obj).sn;
}
//overriding GetHashCode
public override int GetHashCode()
{
return this.fn.GetHashCode() ^ this.sn.GetHashCode();
}
}

How many variables are too much for a class?

I want to see if anyone has a better design for a class (class as in OOP) I am writing. We have a script that puts shared folder stats in a CSV file. I am reading that in and putting it in a Share class.
My boss wants to know information like:
Total Number of Files
Total Size of Files
Number of Office Files
Size of Office Files
Number of Exe Files
Size of Exe Files
etc ....
I have a class with variables like $numOfficeFiles, $sizeOfficeFiles, etc. with a ton of get/set methods. Isn't there a better way to do this? What is the general rule if you have a class with a lot of variables/properties?
I think of this as a language agnostic question, but if it matters, I am using PHP.

Whenever I see more than 5 or 6 non-final variables in a class I get antsy.
Chances are that they should probably be placed in a smaller class as suggested by Outlaw Programmer. There's also a good chance it could just be placed in a hashtable.
Here's a good rule of thumb: If you have a variable that has nothing but a setter and a getter, you have DATA, not code--get it out of your class and place it into a collection or something.
Having a variable with a setter and a getter just means that either you never do anything with it (it's data) or the code that manipulates it is in another class (terrible OO design, move the variable to the other class).
Remember--every piece of data that is a class member is something you will have to write specific code to access; for instance, when you transfer it from your object to a control on a GUI.
I often tag GUI controls with a name so I can iterate over a collection and automatically transfer data from the collection to the screen and back, significantly reducing boilerplate code; storing the data as member variables makes this process much more complicated (requires reflection).

Sometimes, data can be just data:
files = {
'total': { count: 200, size: 3492834 },
'office': { count: 25, size: 2344 },
'exe': { count: 30, size: 342344 },
...
}

"A class should do one thing, and do it well"
If you're not breaking this rule, then I'd say there aren't too many.
However it depends.
If by too many you mean hundreds, then you might want to break it into a data class and a collection, as shown in the edit below.
Then you have only one get/set operation; however, there are pros and cons to this "laziness".
EDIT:
On second glance, you have pairs of variables: a count and a size.
There should be another class, e.g. FileInfo, with a count and a size; your first class then just holds FileInfo objects.
You can also put file type e.g. "All", "Exe" . . . on the File Info class.
Now the parent class becomes a collection of FileInfo objects.
Personally, I think I'd go for that.
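A minimal Python sketch of that design follows. FileInfo is the name used above; the Share class comes from the question, while the record method and the "all" key are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FileInfo:
    count: int = 0
    size: int = 0   # total size in bytes

    def add(self, file_size: int) -> None:
        self.count += 1
        self.size += file_size


class Share:
    """Holds one FileInfo per file type instead of a pair of fields per type."""

    def __init__(self):
        self.stats: Dict[str, FileInfo] = {}

    def record(self, file_type: str, file_size: int) -> None:
        # Update both the per-type entry and the running total.
        self.stats.setdefault(file_type, FileInfo()).add(file_size)
        self.stats.setdefault("all", FileInfo()).add(file_size)


share = Share()
share.record("office", 2000)
share.record("exe", 300)
share.record("office", 344)
```

Adding a new file type then needs no new fields or accessors; it is just another key in the dictionary.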

I think the answer is "there's no such thing as too many variables."
But then, if this data is going to be kept for a while, you might just want to put it in a database and make your functions calls to the database.
I assume you don't want to recalculate all these values every time you're asked for them.

Each class' "max variables" count really is a function of what data makes sense for the class in question. If there are truly X different values for a class and all data is related, that should be your structure. It can be a bit tedious to create depending on the language being used, but I wouldn't say there is any "limit" that you shouldn't exceed. It is dictated by the purpose.

Sounds like you might have a ton of duplicate code. You want the # of files and the size of files for a bunch of different types. You can start with a class that looks like this:
public class FileStats
{
public FileStats(String extension)
{
// logic to discover files goes here
}
public int getSize() { }
public int getNumFiles() { }
}
Then, in your main class, you can have an array of all the file types you want, and a collection of these helper objects:
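import java.util.HashMap;
import java.util.Map;

public class Statistics
{
    private static final String[] TYPES = { "exe", "doc", "png" };
    // Keyed by file type so that the stats for one type can be looked up directly
    private final Map<String, FileStats> stats = new HashMap<String, FileStats>();

    public void collectStats()
    {
        stats.clear();
        for (String type : TYPES)
            stats.put(type, new FileStats(type));
    }
}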
You can clean up your API by passing a parameter to the getter method:
public int getNumFiles(String type)
{
return stats.get(type).getNumFiles();
}

There is no "hard" limit. OO design does however have a notion of coupling and cohesion. As long as your class is loosely coupled and highly cohesive I believe that you are ok with as many members/methods as you need.

Maybe I didn't understand the goal, but why do you load all the values into memory by using the variables, just to dump them to the csv file (when?). I'd prefer a stateless listener to the directory and writing values immediately to the csv.

I always try to think of a Class as being the "name of my container" or the "name of the task" that I am going to compute. Methods in the Class are "actions" part of the task.
In this case seems like you can start grouping things together, for example you are repeating the number and the size actions many times. Why not create a super class that other classes inherit from, for example:
class NameOfSuperClass {
public $type;
function __construct($type) {
$this->type = $type;
$this->getNumber();
$this->getSize();
}
public function getNumber() {
// do something with the type and the number
}
public function getSize() {
// do something with the type and the size
}
}
class OfficeFiles extends NameOfSuperClass {
function __construct() {
// PHP calls the parent constructor through parent::__construct()
parent::__construct("office");
}
}
I'm not sure if this is right in PHP, but you get my point. Things will start to look a lot cleaner and better to manage.

Just from what I glanced at:
If you keep an array with all of the file names in it, all of those variables can be computed on the fly.
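A sketch of the "compute on the fly" idea in Python, assuming the CSV rows have already been read into (name, size) pairs. The summarize function and the OFFICE_EXTENSIONS set are hypothetical; the real list of Office extensions would depend on what your boss counts as an Office file.

```python
import os

OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}  # assumed list


def summarize(files, extensions=None):
    """Return (count, total_size) for the (name, size) pairs matching 'extensions'.

    With extensions=None every file is counted.
    """
    matching = [
        size for name, size in files
        if extensions is None or os.path.splitext(name)[1].lower() in extensions
    ]
    return len(matching), sum(matching)


files = [("report.docx", 1200), ("setup.exe", 5000), ("Budget.XLSX", 800)]
```

Each statistic becomes one call, such as summarize(files, OFFICE_EXTENSIONS), rather than a dedicated pair of member variables.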

It's more of a readability issue.
I would wrap all the data into an array. And use just one pair of get/set methods.
Something like:
class Test
{
    private $DATA = array();
    function set($what, $data) {
        // Properties must be accessed through $this
        $this->DATA[$what] = $data;
    }
    function get($what) {
        return $this->DATA[$what];
    }
}
