Extracting 'useful' information out of sentences? - language-agnostic

I am currently trying to understand sentences of this form:
The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.
I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.
What I am trying to do is to identify what the problem was so in this case, set-top box and whether the action that was taken resolved the problem so in this case, yes because restarting fixed the problem. So if all the sentences were of this form, my life would have been easier but because it is natural language, the sentences could also be of the following form:
I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine
So in this case, the problem was with the car. The action taken did not resolve the problem because of the presence of the word suspect. And the potential problem could be with the engine.
I am not looking for an absolute answer as I suspect this is very complex. What I am looking for is more rather a high-level overview that will point me in the right direction. If there is an easier/alternate way to do this, that is welcome as well.

Really the best you could hope for is a Naive Bayesian Classifier with a sufficiently large (probably more than you have) training set and be willing to tolerate a fair rate of false determinations.
Seeking the holy grail of NLP is bound to leave you somewhat unsatisfied.

Probably, if the sentences are well-formed, I would experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph of the constituents of a sentence and you can tell the relations between the lexical items. Later, you can extract phrases from the output of a dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2) That could help you to extract the direct object of a sentence, or the verb phrase in a sentence.
If you just want to get phrases or "chunks" from a sentence, you can try chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It's usually used to extract instances of places, organizations or people names but it could work in your case as well.
Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them out to ease the job of your domain expert (too many phrases could overwhelm a judge). You may carry out a frequency analysis on your phrases, remove very frequent ones that are not usually related to the problem domain, or compile a white-list and keep the phrases that contain a pre-defined set of words, etc.

Related

Why IATA chose XML over JSON for NDC services?

May be it looks silly.. :)
But I am trying to find out what was the strong reason for IATA to choose XML over JSON.
I found lot of documentation on NDC XMLs, but couldn't find a proper functionality/feature of XML which is not possible in JSON, considering the advantages of using JSON
I could use some help in understanding it..
Thanks in advance..
A "why" question like this can be interpreted in two ways: (a) is there any historical evidence to show who made the decision, when they made it, and what their arguments were for making it? Or (b) can you think of any good reason why intelligent people might have made this choice?
I can't answer (a), but for (b) you have to look at the timeline. With something as big as IATA, it's likely they've been talking about this for at least 10 and maybe 20 years. Ten years ago, JSON was being promoted as "lightweight" - it didn't carry all the baggage of schemas, validation, transformation and query languages that went with XML. If you were in an airline, you didn't think of that as baggage, you thought of it as essential infrastructure. Being "lightweight" simply isn't a benefit in that world; on the contrary, the word is almost a suggestion that it's not up to doing heavyweight tasks.
Frankly (and at the risk of straying towards question (a)) I think it's very unlikely that the question of using JSON ever came up; they would all have been far too heavily committed to XML before anyone ever took JSON seriously. Don't forget that in 2005 XML was delivering things no one dreamt possible ten years earlier: a robust and rigorous data syntax, completely standardised, with full Unicode support, available at low cost on all platforms, with lots of tools around to support declarative processing. JSON was a new kid on the block, threatening to disrupt the consensus and fragment the industry, and for the people in this kind of community, that wasn't seen as something they needed or wanted.

How do I find methods?

Here's a somewhat general computer question. I've always been able to follow the LOGIC of programming, but when I go to code something, I always find that I don't know some method or another to get what I need to get done. When I see it, I always think, "OF COURSE!".
How do you go about finding relevant methods for your programming needs that are "built-in?" I don't enjoy re-inventing the wheel, but I find it difficult to find what I need to do what I want to do.
First try Google:
You can use google to search your required method. For example If I want to search a value in array in PHP then I go to Google and type "Search values in array in PHP". I find my required function at first place.
Then try Standard Documentation:
Try standard documentation to search for your required method. For example if my problem is related to strings in PHP then I go to String Functions documentation and find the required function.
Finally try Stackoverflow:
Otherwise you can ask your problem at Stackoverflow for your required methods and libraries. You will always get a shortest way.
What you are asking here is for the best way to do research. Well, that's hard skill to explain, even more so to teach.
Nevertheless here are some tips:
Go to a search engine. It makes no
sense to start in a place like MSDN,
since all of its content is indexed
by the search engines anyway.
Phrase your question several
different ways.
As you learn more
about the issue you will learn new
vocabulary about it. Use that new
vocabulary to do even more searches.
If the searches turn out empty,
switch to browsing a specific
section of the official
documentation that you think is the
most related to what you are doing. If nothing else, it will expand your horizons around the issue and give you more vocabulary to do more searches.
Finally, if all else fails ask a question on StackOverflow explaining what you want to do as clearly as possible.
Note that if there's a simple API that does what you need, you will rarely reach step 4.
You say:
It's very frustrating to suddenly find
an "easy" button mid-way through.
Try to see it differently. Think of these moments as blessings. You've just learned something. You invested a lot of effort - and instead of seeing that effort as wasted, see it as critical to proper learning. You - better than the guy who just happened across the magic method - really understand what it's for and something about how it works. And you really, really, understand why you need it, and you properly appreciate its value. You're never going to forget that method.
So it was costly, but you learned something important. Celebrate, and move on.
It is usually included in some form of documentation. Most IDEs support the documentation format and gives you auto-complete functionality.
if you are using MVS so MSDN is really good for it
In addition to this and this answer above, google's basic and advanced searching tips prove very helpful.
In addition to above, changing the order of keywords in search criteria also sorts the list in different orders.
In essence I believe that searching is still an art rather than a science, and is best learnt - quoting from David Reis' answer above: "2. As you learn more about the issue you will learn new vocabulary about it. Use that new vocabulary to do even more searches."
Search in the API documentation. But the best way to (I found so) is to search on the internet for multiple solutions and then choose the one that you think is best. Make your search as narrow as possible. For example you want to implement random number generation function, then search like this, "How to generate random numbers in Java?".
Namespaces, namingconventions, Autocomplete/Intellisence
I assume that you are trying to find some kind of Object-Oriented-apis . I use .net in my example.
First try to find a class that might be responsable for the method you are looking for.
Example: If you want to "Make a new Directory in the Filesystem" you must know (or learn) that (in dotnet) these classes are in the namespace System.IO:
This namespace contains subnamespaces like Compresseion and Classes like File, Path, Directory, ...
Second you sould know NamingConventions. There are common Naming-Prefixes for methods like Get, Set, Insert, Create. In the documentation for class Directory you will find a CreateDirectory-Method.
If you have an intelligent editor that knows your programming language and the classes and namespaces learning is much easier. In the dotnet-world this feature is called Autocomplete/Intellisence

Improving the way we write code?

While thinking about software-engineering in general I came across the question why we don't see any improvements in the way we write/document code.
Think about it: There has not been a revolutionary improvement since we've moved from punch cards to text editing. The last improvement I've seen is syntax highlighting and context sensitive help (e.g. Intellisense or ctags). Not something I would call revolutionary.
That makes me wonder: Why is it so?
I'll start with something I miss badly:
Lots of my code deals with geometry.
For documentation describing geometric relationships always ends up in a big heap of hard to read mathematical stuff (due to the lack of proper equation type-setting in ASCII). However, if I could embed a little drawing or scribble into the code everything would be much easier, neater and better to be understood.
What can you think up that would make your coding/text editing/documention tasks easier?
I'm surprised that nobody has yet mentioned No Silver Bullet. In 1986 (!), Frederick Brooks predicted that:
There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude [tenfold] improvement within a decade in productivity, in reliability, in simplicity. [...] We cannot expect ever to see two-fold gains every two years."
And in 23 years, he's been proven right. We've come up with a number of things such as syntax highlighting and Intellisense which have improved productivity significantly, but certainly not by an order of magnitude. As time marches on, we'll continue to make several incremental improvements, but the fact is there is no silver bullet: there's not going to be some magical revelation in the way we write code that will improve productivity by an order of magnitude.
I'm suprised that no one seems to have mentioned Donald Knuth's seminal Literate Programming - write your code as if it were a book or a scientific paper.
There has not been a revolutionary improvement since we've moved from punch cards to text editing
Never used a line editor, have you?
But seriously, text (especially in the representations chosen for modern languages) is
easily processed
fairly easy to specify
information dense
precise
Anything that comes along to replace it has to be a net win across all four of those properties. Not easy.
I disagree. We do have changes, small, but changes.
How common is the "for each" construct? Compare it to 20 years ago. How about the Domain Specific Languages movement? What about the idea that we should code in layers? How about Behavior Driven Development? Coding by complying to a specification... which writes a nice document as output when all runs fine. How about the standarization of regular expressions? PCRE. What about Alan Kay's group's DSL related work on "Moore's Law for Software", which explored a more advanced implementation of Cairo and generated TCP/IP code using diagrams from RFCs?
Documentation is a two way dialog. Both as code being more understandable and people learning this special language. You wouldn't say that German needs documentation if you know German. I know natural languages are very far away from computer languages, but there's a movement to make code more expressive. It's not about the new tools, it's about how we are coding.
One thing I've done recently in some of the more math-heavy sections of my application is to include the LaTeX markup for the particular equation as a comment/docstring. Right now, I just copy-paste into an online equation editor, but it would be very helpful to see the formula itself (with things like Greek letters and sub/superscripts) rather than a bunch of ASCII code.
Source Code In Database. In a nutshell, source code is parsed and put into a database. You'd then need an integrated IDE to view and edit the code, but at this point, syntax is decoupled from format. YOUR IDE could show you a program in a way that's completely different from someone else's, tuned to the task you're working on. I'd list some specific examples, but that article covers pretty much everything.
I'm surprised nobody mentioned it - javadoc is basically HTML, so there's nothing preventing you from embedding images (or anything else) in code. Simple, effective and ubiquitous, it's one of the things Java did right.
DrScheme let's you do these things. Here's the things you can insert from the PLT website:
http://docs.plt-scheme.org/drscheme/Menus.html#(part._.Insert)
3.1.6 Insert
Insert Comment Box : Inserts a box that is ignored by DrScheme; use it to write comments for people who read your program.
Insert Image... : Opens a find-file dialog for selecting an image file in GIF, BMP, XBM, XPM, PNG, or JPG format. The image is treated as a value.
Insert Fraction... : Opens a dialog for a mixed-notation fraction, and inserts the given fraction into the current editor.
Insert Large Letters... : Opens a dialog for a line of text, and inserts a large version of the text (using semicolons and spaces).
Insert λ : Inserts the symbol λ (as a Unicode character) into the program. The λ symbol is normally bound the same as lambda.
Insert Java Comment Box : Inserts a box that is ignored by DrScheme. Unlike the Insert Comment Box menu item, this is designed for the ProfessorJ language levels. See ProfessorJ.
Insert Java Interactions Box : Inserts a box that will allows Java expressions and statements within Scheme programs. The result of the box is a Scheme value corresponding to the result(s) of the Java expressions. At this time, Scheme values cannot enter the box. The box will accept one Java statement or expression per line.
Insert XML Box : Inserts an XML; see XML Boxes and Scheme Boxes for more information.
Insert Scheme Box : Inserts a box to contain Scheme code, typically used inside an XML box; see XML Boxes and Scheme Boxes.
Insert Scheme Splice Box : Inserts a box to contain Scheme code, typically used inside an XML box; see also XML Boxes and Scheme Boxes.
Insert Pict Box : Creates a box for generating a Slideshow picture. Inside the pict box, insert and arrange Scheme boxes that produce picture values.
You also insert your unit tests with the code that you're testing. Pretty neat stuff.
I think integrated IDEs with semantic highlighting and **semantically-constrained suggestions* (a la IDEA or Eclipse) are a huge advancement.
But that happened 8-10 years ago.
Template-based programming feels useful never seems to catch on. Recently I was impressed with a demo of the Meta-programming system, which leverages the interactive nature of the IDE to simplify the task of writing templates and what are (essentially) type-aware macros.
Meta-programming might help you define geometry-based macros that would substitute for a number of lines of code. I could imagine something that let you embed a more-readable 'math language' inside Java, and then parses its contents into something machine-readable.
I'd say version control was a pretty huge leap in how we work. The ability to keep a full record of every change anyone has made to the codebase, and to revert changes where necessary, has made a big difference.
I certainly respect Fred Brooks' argument No Silver Bullet, but I think the way we write code is nowhere near optimal, so there is lots more room for improvement. I tried to explain this in my book.
We're all familiar with "code golf", where you compete relentlessly to minimize something. That is a good way to approach the minimum possible value of that something.
What's great about this is that you are allowed, even encouraged, to break from traditions, prior conceptions, accepted wisdom, in the quest for winning. In short, you learn new things.
If the measure to be minimized is wall-clock execution time, you can do aggressive optimization.
If the measure is source code size (lines or characters) you get "code golf".
The measure I like best is "edit count". That is, given a code base, suppose a new requirement comes along. That requirement is implemented, completely, by editing the code base. Then a "diff" is done from old to new code base. The number of differences found is the edit count. Averaged over the set of likely new functional requirements, that is the measure to minimize.
If this is done aggressively, being free to contradict all conventional wisdom, the code base approaches a state I would call a domain-specific language (DSL). In this language, concepts expressed in code are in nearly 1-1 correspondence with problem-oriented concepts. In this state, it is not easy for the source code to be self-inconsistent (i.e. have bugs) because the fewer edits that have to be made to the source code, the fewer chances there are to make a mistake. It's also the case that such code tends to be short. But unlike "code golf" it tends to be very clear, because it maps the problem concepts so clearly.
So, tools and techniques that help in minimizing edit count can, in my opinion, be considered "silver bullets". DSL is one such. Code generation is another. My favorite optimization technique is another. For coding dynamically changing UIs there is differential execution. There are bound to be more, waiting to be discovered. Of course, everything depends on the training and experience of the "marksman" (the coder).
I think there are lots of new ideas to be discovered. The trick is to tell the difference between the ones that move us forward, versus the ones that hold us back.
I think this is where Doxygen and other documentation systems help. If we can embed small, discrete comments that link to other information such as:
/* help: fooimg.png */
And then have an external documentation system do that, then great.
Even better would be allowing our text-editor to treat those things as hyperlinks to external documentation.
I would reference a drawing as a reference in the code documentation. I see no reason why you can't have footnotes in code.
The ability to make a section of code read-only is something I've wanted
It sounds like you might be interested in Jonathan Edward's research. See, for example:
"The Summer of Code"
"What's next?"
"The future of programming"
Diffing and searching pictures is hard. Diff and search are very important to programmers. Using pictures instead of text is only a marginal improvement in many situations, it has some drawbacks, and it requires general acceptance before it's really worth doing (since you don't make things more understandable if your reader doesn't grok what you've done).
Plus, programmers have a million little tricks that make their lives easier, based on text representations of code, that they'd lose if you gave them code to read that was expressed in anything other than text. Sure, they might replace or re-implement those tricks over time, but in the short term they're gone.
You don't see lawyers switching from English to little back-of-a-napkin diagrams in contracts, either (the Creative Commons licenses try, but cannot make the picture be the formal representation of the contract). Probably for similar reasons.
If someone comes up with a programming language and IDE that, on balance, beats text-based ones; and successfully markets it; then you'll see the start of a revolutionary shift from text to a new format. If nobody comes up with any such thing, then we're not missing out. If someone comes up with something that is more productive but it doesn't gain traction because of independent advantages of other technologies, then that loss is the price we pay for free-market capitalism. Perhaps the ideas will be recycled eventually...
That said, integration between code and documentation could clearly be improved, and there are many efforts underway to do so, using various techniques with varying success. Again, the problem is that any particular cunning plan can in practice only really be implemented in one or a few languages and development environments at a time, and so has difficulty proving that it really is better. Embedding documentation in code is possibly the only universal advance since the invention of the API...
I think there's still a lot that can be done with text, though. For example, debugger technology makes a big difference to programmer productivity in certain common circumstances (namely: when a test fails or something else unexpected happens, but it's not obvious what the faulty assumption is in the code you're looking at). There may be lower-hanging fruit in terms of making programming better, than the actual business of expressing the program.
The last improvement I've seen is
syntax highlighting and context
sensitive help
Then you haven't looked much. Modern IDEs can do far, FAR more than that, namely show you the semantic structure of code (e.g. inheritance hierarchies) and even manipulate it (automatic refactoring) or enrich it with external data (such as who last changed a particular line of code).
I've used emacs, I like text macros. But, what I really want is parse macros. I'd like my editor to expose the machinery behind refactoring in such a way that I can write my transformations on the parse tree of the language itself.
For example, Python added += at one point when my code was littered with x = x + 1 lines. If I could have written a search and replace command that worked on the parse tree, I could have quickly cleaned up large amounts of my source code.
So, I want standard search and replace, but I want it at the level where the structure of my code has meaning, at the abstract syntax tree.
If you've ever used ReSharper, each of its refactorings and recommendations are written in the manner I describe, they find a pattern in the parse tree and suggest a replacement, or for a refactoring, apply a known replacement. I want access to that machinery for my own tasks!
Have you used Doxygen or similar for documenting your code? You can add links to images, and other file types (often stored in same directory as source code) that will get sucked into the generated documentation. I realize that this is one step removed from seeing the detail directly in you favorite editor but it definitely improves how we document our code.
Programming languages are a specialized form of mathematical notation, since you can express a programming language mathematically. Notation changes slowly, and so we don't get fast progress in our languages. Mostly, we advance when we come up with a new thing to fit into the notation, like using i to refer to the square root of negative one.
There are documentation schemes that allow you to embed things other than text. There was at least one programming scheme, Donald Knuth's Web, that allowed you to have a presentation and an execution version of a program (unfortunately, the base source code, the stuff you'd actually hack, was rather messy).
You could easily have a text editor that could treat comments as HTML, provided of course it could recognize comments as it saw them.
I've been thinking a lot about how to make coding faster and more efficient for the past years, always trying to keep it realistic and doing minimalistic implementations. Those are not revolutionary ideas, but since the original poster talked about the punching card to code typing transition, I thought of talking about other ways of communicating to the computer what we want to program.
My ideas are visual or vocal programming. The motivation behind is that there are only a number of ways a loop can be efficiently programmed, and an aware IDE could make some smart code substitution decisions depending on inputs other than typed lines of code.
Visual programming vs Coding: encapsulate (literally) code into "boxes" which have inputs and outputs, and connect them together across a horizontal timeline. This is a high-level concept that would be intrinsically interesting for multithreading development since you can have multiple lines or threads happening at the same time. Every process can be divided into a "box", no matter how you see it. Sending an e-mail in its most basic form is a box which takes an email as input and outputs a success/fail signal. Since the boxes and the lines are distributed across a timeline, the notion of time and event chronology isn't lost and feedback lines are possible.
Vocal programming vs Coding: The effectiveness of this technique would revolve around the effectiveness of the vocal syntax decided to create code and move the cursor. For example, you can say to the microphone "for variable zero to 10" and the system will automatically generate the following code placing the cursor inside:
for (x=0;x<10;x++){
// Cursor would be there after after the call
}
In terms of usability, you would need to be in a relatively silent room to minimize other sounds that might harm the voice recognition so this technology could be used in specialized environments mostly.
This is the result of my extensive programming experience using a wide range of hardware and programming languages. Let me know what you guys think, I'd love having a constructive discussion about that.
A few weeks back the "Intentional Software" created quite a buzz about their new language. I've yet to watch the presentation, but here is a quote from a review by Martin Fowler:
They started worryingly, with the
usual unrevealing Powerpoints, but
then they switched to showing the
workbench and the curtain finally
opened. To gauge the reaction, take a
look at Twitter.
#pandemonial Quite impressed! This is sweet! Multiple domains, multiple
langs, no question is going unanswered
#csells OK, watching a live electrical circuit rendered and
working in a C# file is pretty damn
cool.
#jolson Two words to say about the Electronics demo for Intentional
Software: HOLY CRAPOLA. That's it, my
brain has finally exploded.
#gblock This is not about snazzy demos, this is about completely
changing the world we know it.
#twleung ok, the intellisense for the actuarial formulas is just awesome
#lobrien This is like seeing a 100-mpg carburetor : OMG someone is
going to buy this and put it in a
vault!
Two quotes come instantly to mind:
"If it ain't broke, don't fix it."
"Use the best tool for the job."
Of course, although the core code is still written as text, alll the tools and libraries have changed massively since the days of punched cards.
This has been touched on by others, and it wouldn't revolutionise programming, but anyway...
I think it would be nice if code editors moved slightly beyond plain text editors. Even with syntax highlighting and code completion (which I think are incredibly good things), the editors of today (at least, the ones I use) still display exactly the same ASCII text (or whatever encoding is used) that is in the source files. I would be interested to see how well it would work if editors displayed, for example (some examples are more adventurous than others):
Comments in a text box with a light-blue background and no // or /* ... */ visible
Javadoc comments could have semi-rich text editing support (for those who do HTML Javadoc comments) (seriously, I would appreciate it if code editors rendered Javadoc comments as HTML because they're not the easiest to skim over when their HTML as plain text)
Functions in text boxes that could be collapsed to show only the signature (the collapsing can be done by current editors) and can be dragged around as boxes
Lines between function boxes to indicate how functions are connected
Zooming out so that rather than seeing a single source file (class in many languages) you can see multiple files and the way connect to each other (this would essentially be building UML-like diagramming directly into the code editor)
I think this (in my mind at least) would work without requiring additional markup in the source files so users of plain text editors wouldn't be disadvantaged by having all this extra markup cluttering the files.
Part of the problem might stem from the fact that when you don't code we don't call it programming: Assembling modular components using a GUI for instance.
You might be interested in these alternative programming "languages".
[Ladder][1], designed to mimic the way relay-logic-schemes work. Horrible IMO, but easy to understand for the old guys who did logic with sticks and stones. [http://www.amci.com/tutorials/images/ladder-diagram.gif][2]
[SFC, Sequential function chart][3], designed to simplify parallell programming. Code is written into boxes and these boxes can be placed paralell to each other and will thus execute simultaneously. By connecting the end of several boxes you can syncronize events. Very common for automation applications.
[Mathematica][5]!!!, Might not be the best programming language but the syntax highlighting(if you can call it that) is awesome! For example you can input a matrix by seeing the matrix nicly aligned instead of a huge double[][]. Graphs can be inserted in the code and the formatting of mathematical expressions looks like it does when you write on a paper. No more paranthesis-madness or long Math.PI expressions that really only need one character. And best of all, the files are just plain text even if it is rendered nicely in the editor!
Debuggers is also an area where lots of improvement has been done. Debuggers with replay are starting to come and also visual debuggers where data can be modified in real time. Edit and continiue is also a feature i wouldn't want to live withot.
WTF "new users can only post a maximum of one hyperlink", you will have to google the stuff i originally added to this post >:(
A brain-to-computer translator. Typing is the real
bottleneck. It really just needs to derive the algorithms I
think up and convert that to machine code.
I would say a lot of the newer languages are pretty great at
quickly creating algorithms. The improvements aren't so much
revolutionary now as they are evolutionary.
Dare I say it might actually be a new development language (perhaps even a new paradigm) to take us through such revolution;
I think you might want to take a look at Leo. This is one guys attempt at answering what you're asking about. I still can't wrap my VIM head around it personally, but others take to it quickly. It's not just a programming IDE, but more of an information organizer. It's written in Python, but I don't see why you can't code in other languages with it. The power of Leo is not so much the language, but the ability to express your thoughts and organize them whether it be in code, diagrams, images, or diagrams. Look over the tutorial and examples to get a feel for it. You might like it.
Automated semantic source code transformations, where a program can be reliably examined and manipulated by using an abstract interface/frontend to it that is aware of the underlying semantics.
So that source code can be queried and dealt with pretty much like a SQL database.
Allowing you to do static analysis of source code and refactor even complex source code by doing something along the lines of:
FIND CALLERS OF FUNCTION "foo" WHERE SIGNATURE("int","int","char*") AND RETURN_TYPE("bool");
...
RENAME MACRO "max" TO "maximum" IN FILE "macros.hxx";
RENAME NAMESPACE "prj" TO "project";
RENAME SYMBOL "OLDFOO" IN NAMESPACE "project";
RENAME FUNCTION "log" TO "show_log";
RENAME CLASS "FOO" TO "OLDFOO";
RENAME METHOD "FOO::inc" TO "FOO::increment";
...
CHANGE SIGNATURE IN FUNCTION "foo" WHERE SIGNATURE("int","int") TO SIGNATURE("double","double");
CHANGE SIGNATURE IN METHOD "myClass::handle" WHERE SIGNATURE("char") TO SIGNATURE("unsigned char")
MOVE FUNCTION "foo" in FILE "stuff.cc" TO "foo_funcs.cc";

How to partition a problem into smaller understandable portions?

I'm not sure if it's possible to give general advice on this topic, but please try. It's hard to explain my case because it's too complex to explain. And that's exactly the problem.
I seem to constantly stumble on a situation where I try to design some part of my project, but it has so many things to take into consideration that I'm unable to get a grasp of it.
Are there any general tips or advice on how to look at my system in smaller pieces at a time? How to find smaller portions that could be designed separately on their own?
Create a glossary.
In other words, identify the terms that are meaningful to the project domain — not from the programmer's point of view, but from a user's, who is familiar with the subject matter.
Then define the terms as precisely and discretely as you can. A good definition in this form can serve as a kind of pseudocode.
Since you have not identified even the domain of your problem, I'll choose a random example. In a civilian personnel system, you might have terms like:
billet: a term of service (from start date to end date) at a particular grade and step
employee: a series of billets associated with a particular SSN
grade and step: row and column in the federal general schedule
And so on. This isn't to identify functional units, as it sounds like you are trying to do, but it's a good preparatory step before doing so, so that you can express your functional steps in well-defined terms.
Your key goals are:
High cohesion: Code (methods, fields, classes) within one piece/module/partition should interact intensively; it should make sense for these elements to know about each other. If you find that some of them don't interact much with the rest, they probably belong somwhere else or should form their own partition. If you find code outside interacting intensively with the partition and knowing too much about its inner workings, it probably belongs inside. The typical example is found in OO code written in procedural style, with "dumb" data objects and "manager" code that operates on them but should really be part of the data objects.
Loose coupling: Interaction between pieces/modules/partitions should only happen through narrow, well-defined, well-documented APIs. Try to identify such APIs and see what code is needed to implement them and what code will use them.
It's useful to approach problem decomposition both top-down and bottom-up.
If you're having trouble splitting a big problem into two or more smaller problems, try to think of the smallest possible problems that will need to be solved. Once those are handled, you may start to see ways to combine them into larger problems as you approach your original large problem.
When I find myself copying and pasting chunks of code with minimal adjustments I realize that's a "partition" and then create a class, method, function, or whatever.
Actually, the whole object oriented approach is what it's all about. Try thinking of your application as tangible things that do stuff. Write pseudo code describing what the things are and what they do, I find lots of "partitions" this way.
Here's a try, kind of wild guess.
People usually underestimate how long it will take them to do the work. If your project is large, then most likely you'll need several people to work on it, so you can try planning with that in mind. Now a person can be expected to hold just one area in the head, so you'll need to explain to him exactly what kind of task he's supposed to do.
So I'd say you should try to write a job description that should encompass as much as possible for one person to seriously concentrate on. Repeat, until you have broken your project into parts you wanted to. As a benefit, you're ready to assemble your team. But if you find out the parts are small, maybe you'll still be able to do it yourself.

The best way to familiarize yourself with an inherited codebase

Stacker Nobody asked about the most shocking thing new programmers find as they enter the field.
Very high on the list, is the impact of inheriting a codebase with which one must rapidly become acquainted. It can be quite a shock to suddenly find yourself charged with maintaining N lines of code that has been clobbered together for who knows how long, and to have a short time in which to start contributing to it.
How do you efficiently absorb all this new data? What eases this transition? Is the only real solution to have already contributed to enough open-source projects that the shock wears off?
This also applies to veteran programmers. What techniques do you use to ease the transition into a new codebase?
I added the Community-Building tag to this because I'd also like to hear some war-stories about these transitions. Feel free to share how you handled a particularly stressful learning curve.
Pencil & Notebook ( don't get distracted trying to create a unrequested solution)
Make notes as you go and take an hour every monday to read thru and arrange the notes from previous weeks
with large codebases first impressions can be deceiving and issues tend to rearrange themselves rapidly while you are familiarizing yourself.
Remember the issues from your last work environment aren't necessarily valid or germane in your new environment. Beware of preconceived notions.
The notes/observations you make will help you learn quickly what questions to ask and of whom.
Hopefully you've been gathering the names of all the official (and unofficial) stakeholders.
One of the best ways to familiarize yourself with inherited code is to get your hands dirty. Start with fixing a few simple bugs and work your way into more complex ones. That will warm you up to the code better than trying to systematically review the code.
If there's a requirements or functional specification document (which is hopefully up-to-date), you must read it.
If there's a high-level or detailed design document (which is hopefully up-to-date), you probably should read it.
Another good way is to arrange a "transfer of information" session with the people who are familiar with the code, where they provide a presentation of the high level design and also do a walk-through of important/tricky parts of the code.
Write unit tests. You'll find the warts quicker, and you'll be more confident when the time comes to change the code.
Try to understand the business logic behind the code. Once you know why the code was written in the first place and what it is supposed to do, you can start reading through it, or as someone said, prolly fixing a few bugs here and there
My steps would be:
1.) Setup a source insight( or any good source code browser you use) workspace/project with all the source, header files, in the code base. Browsly at a higher level from the top most function(main) to lowermost function. During this code browsing, keep making notes on a paper/or a word document tracing the flow of the function calls. Do not get into function implementation nitti-gritties in this step, keep that for a later iterations. In this step keep track of what arguments are passed on to functions, return values, how the arguments that are passed to functions are initialized how the value of those arguments set modified, how the return values are used ?
2.) After one iteration of step 1.) after which you have some level of code and data structures used in the code base, setup a MSVC (or any other relevant compiler project according to the programming language of the code base), compile the code, execute with a valid test case, and single step through the code again from main till the last level of function. In between the function calls keep moting the values of variables passed, returned, various code paths taken, various code paths avoided, etc.
3.) Keep repeating 1.) and 2.) in iteratively till you are comfortable up to a point that you can change some code/add some code/find a bug in exisitng code/fix the bug!
-AD
I don't know about this being "the best way", but something I did at a recent job was to write a code spider/parser (in Ruby) that went through and built a call tree (and a reverse call tree) which I could later query. This was slightly non-trivial because we had PHP which called Perl which called SQL functions/procedures. Any other code-crawling tools would help in a similar fashion (i.e. javadoc, rdoc, perldoc, Doxygen etc.).
Reading any unit tests or specs can be quite enlightening.
Documenting things helps (either for yourself, or for other teammates, current and future). Read any existing documentation.
Of course, don't underestimate the power of simply asking a fellow teammate (or your boss!) questions. Early on, I asked as often as necessary "do we have a function/script/foo that does X?"
Go over the core libraries and read the function declarations. If it's C/C++, this means only the headers. Document whatever you don't understand.
The last time I did this, one of the comments I inserted was "This class is never used".
Do try to understand the code by fixing bugs in it. Do correct or maintain documentation. Don't modify comments in the code itself, that risks introducing new bugs.
In our line of work, generally speaking we do no changes to production code without good reason. This includes cosmetic changes; even these can introduce bugs.
No matter how disgusting a section of code seems, don't be tempted to rewrite it unless you have a bugfix or other change to do. If you spot a bug (or possible bug) when reading the code trying to learn it, record the bug for later triage, but don't attempt to fix it.
Another Procedure...
After reading Andy Hunt's "Pragmatic Thinking and Learning - Refactor Your Wetware" (which doesn't address this directly), I picked up a few tips that may be worth mentioning:
Observe Behavior:
If there's a UI, all the better. Use the app and get a mental map of relationships (e.g. links, modals, etc). Look at HTTP request if it helps, but don't put too much emphasis on it -- you just want a light, friendly acquaintance with app.
Acknowledge the Folder Structure:
Once again, this is light. Just see what belongs where, and hope that the structure is semantic enough -- you can always get some top-level information from here.
Analyze Call-Stacks, Top-Down:
Go through and list on paper or some other medium, but try not to type it -- this gets different parts of your brain engaged (build it out of Legos if you have to) -- function-calls, Objects, and variables that are closest to top-level first. Look at constants and modules, make sure you don't dive into fine-grained features if you can help it.
MindMap It!:
Maybe the most important step. Create a very rough draft mapping of your current understanding of the code. Make sure you run through the mindmap quickly. This allows an even spread of different parts of your brain to (mostly R-Mode) to have a say in the map.
Create clouds, boxes, etc. Wherever you initially think they should go on the paper. Feel free to denote boxes with syntactic symbols (e.g. 'F'-Function, 'f'-closure, 'C'-Constant, 'V'-Global Var, 'v'-low-level var, etc). Use arrows: Incoming array for arguments, Outgoing for returns, or what comes more naturally to you.
Start drawing connections to denote relationships. Its ok if it looks messy - this is a first draft.
Make a quick rough revision. Its its too hard to read, do another quick organization of it, but don't do more than one revision.
Open the Debugger:
Validate or invalidate any notions you had after the mapping. Track variables, arguments, returns, etc.
Track HTTP requests etc to get an idea of where the data is coming from. Look at the headers themselves but don't dive into the details of the request body.
MindMap Again!:
Now you should have a decent idea of most of the top-level functionality.
Create a new MindMap that has anything you missed in the first one. You can take more time with this one and even add some relatively small details -- but don't be afraid of what previous notions they may conflict with.
Compare this map with your last one and eliminate any question you had before, jot down new questions, and jot down conflicting perspectives.
Revise this map if its too hazy. Revise as much as you want, but keep revisions to a minimum.
Pretend Its Not Code:
If you can put it into mechanical terms, do so. The most important part of this is to come up with a metaphor for the app's behavior and/or smaller parts of the code. Think of ridiculous things, seriously. If it was an animal, a monster, a star, a robot. What kind would it be. If it was in Star Trek, what would they use it for. Think of many things to weigh it against.
Synthesis over Analysis:
Now you want to see not 'what' but 'how'. Any low-level parts that through you for a loop could be taken out and put into a sterile environment (you control its inputs). What sort of outputs are you getting. Is the system more complex than you originally thought? Simpler? Does it need improvements?
Contribute Something, Dude!:
Write a test, fix a bug, comment it, abstract it. You should have enough ability to start making minor contributions and FAILING IS OK :)! Note on any changes you made in commits, chat, email. If you did something dastardly, you guys can catch it before it goes to production -- if something is wrong, its a great way to get a teammate to clear things up for you. Usually listening to a teammate talk will clear a lot up that made your MindMaps clash.
In a nutshell, the most important thing to do is use a top-down fashion of getting as many different parts of your brain engaged as possible. It may even help to close your laptop and face your seat out the window if possible. Studies have shown that enforcing a deadline creates a "Pressure Hangover" for ~2.5 days after the deadline, which is why deadlines are often best to have on a Friday. So, BE RELAXED, THERE'S NO TIMECRUNCH, AND NOW PROVIDE YOURSELF WITH AN ENVIRONMENT THAT'S SAFE TO FAIL IN. Most of this can be fairly rushed through until you get down to details. Make sure that you don't bypass understanding of high-level topics.
Hope this helps you as well :)
All really good answers here. Just wanted to add few more things:
One can pair architectural understanding with flash cards and re-visiting those can solidify understanding. I find questions such as "Which part of code does X functionality ?", where X could be a useful functionality in your code base.
I also like to open a buffer in emacs and start re-writing some parts of the code base that I want to familiarize myself with and add my own comments etc.
One thing vi and emacs users can do is use tags. Tags are contained in a file ( usually called TAGS ). You generate one or more tags files by a command ( etags for emacs vtags for vi ). Then we you edit source code and you see a confusing function or variable you load the tags file and it will take you to where the function is declared ( not perfect by good enough ). I've actually written some macros that let you navigate source using Alt-cursor,
sort of like popd and pushd in many flavors of UNIX.
BubbaT
The first thing I do before going down into code is to use the application (as several different users, if necessary) to understand all the functionalities and see how they connect (how information flows inside the application).
After that I examine the framework in which the application was built, so that I can make a direct relationship between all the interfaces I have just seen with some View or UI code.
Then I look at the database and any database commands handling layer (if applicable), to understand how that information (which users manipulate) is stored and how it goes to and comes from the application
Finally, after learning where data comes from and how it is displayed I look at the business logic layer to see how data gets transformed.
I believe every application architecture can de divided like this and knowning the overall function (a who is who in your application) might be beneficial before really debugging it or adding new stuff - that is, if you have enough time to do so.
And yes, it also helps a lot to talk with someone who developed the current version of the software. However, if he/she is going to leave the company soon, keep a note on his/her wish list (what they wanted to do for the project but were unable to because of budget contraints).
create documentation for each thing you figured out from the codebase.
find out how it works by exprimentation - changing a few lines here and there and see what happens.
use geany as it speeds up the searching of commonly used variables and functions in the program and adds it to autocomplete.
find out if you can contact the orignal developers of the code base, through facebook or through googling for them.
find out the original purpose of the code and see if the code still fits that purpose or should be rewritten from scratch, in fulfillment of the intended purpose.
find out what frameworks did the code use, what editors did they use to produce the code.
the easiest way to deduce how a code works is by actually replicating how a certain part would have been done by you and rechecking the code if there is such a part.
it's reverse engineering - figuring out something by just trying to reengineer the solution.
most computer programmers have experience in coding, and there are certain patterns that you could look up if that's present in the code.
there are two types of code, object oriented and structurally oriented.
if you know how to do both, you're good to go, but if you aren't familiar with one or the other, you'd have to relearn how to program in that fashion to understand why it was coded that way.
in objected oriented code, you can easily create diagrams documenting the behaviors and methods of each object class.
if it's structurally oriented, meaning by function, create a functions list documenting what each function does and where it appears in the code..
i haven't done either of the above myself, as i'm a web developer it is relatively easy to figure out starting from index.php to the rest of the other pages how something works.
goodluck.