How do you refactor a large messy codebase?


I have a big mess of code. Admittedly, I wrote it myself - a year ago. It's not well commented but it's not very complicated either, so I can understand it -- just not well enough to know where to start as far as refactoring it.
I violated every rule that I have read about over the past year. There are classes with multiple responsibilities, there are indirect accesses (I forget the term - something like foo.bar.doSomething()), and like I said it is not well commented. On top of that, it's the beginnings of a game, so the graphics are coupled with the data, or, in the places where I tried to decouple graphics and data, I made the data public so that the graphics could access the data they need...
It's a huge mess! Where do I start? How would you start on something like this?
My current approach is to take variables and switch them to private and then refactor the pieces that break, but that doesn't seem to be enough. Please suggest other strategies for wading through this mess and turning it into something clean so that I can continue where I left off!
Update two days later: I have been drawing out UML-like diagrams of my classes, and catching some of the "Low Hanging Fruit" along the way. I've even found some bits of code that were the beginnings of new features, but as I'm trying to slim everything down, I've been able to delete those bits and make the project feel cleaner. I'm probably going to refactor as much as possible before rigging my test cases (but only the things that are 100% certain not to impact the functionality, of course!), so that I won't have to refactor test cases as I change functionality. (do you think I'm doing it right or would it, in your opinion, be easier for me to suck it up and write the tests first?)
Please vote for the best answer so that I can mark it fairly! Feel free to add your own answer to the bunch as well, there's still room for you! I'll give it another day or so and then probably mark the highest-voted answer as accepted.
Thanks to everyone who has responded so far!
June 25, 2010: I discovered a blog post which directly answers this question from someone who seems to have a pretty good grasp of programming: (or maybe not, if you read his article :) )
To that end, I do four things when I need to refactor code:
Determine what the purpose of the code was
Draw UML and action diagrams of the classes involved
Shop around for the right design patterns
Determine clearer names for the current classes and methods

Pick yourself up a copy of Martin Fowler's Refactoring. It has some good advice on ways to break down your refactoring problem. About 75% of the book is little cookbook-style refactoring steps you can do. It also advocates automated unit tests that you can run after each step to prove your code still works.
As for a place to start, I would sit down and draw out a high-level architecture of your program. You don't have to get fancy with detailed UML models, but some basic UML is not a bad idea. You need a big picture idea of how the major pieces fit together so you can visually see where your decoupling is going to happen. Just a page or two of some basic block diagrams will help with the overwhelming feeling you have right now.
Without some sort of high level spec or design, you just risk getting lost again and ending up with another unmaintainable mess.
If you need to start from scratch, remember that you never truly start from scratch. You have some code and the knowledge you gained from your first time. But sometimes it does help to start with a blank project and pull things in as you go, rather than put out fires in a messy code base. Just remember not to completely throw out the old; use it for its good parts and pull them in as you go.

What was most important for me on different occasions were unit tests: I took a few days to write tests for the old code and then I was free to refactor with confidence. How exactly is a different question, but having the tests made it possible for me to make real, substantial changes to the code.

I'll second everyone's recommendations for Fowler's Refactoring, but in your specific case you may want to look at Michael Feathers' Working Effectively with Legacy Code, which is really perfect for your situation.
Feathers talks about Characterization Tests, which are unit tests written not to assert known behaviour of the system but to explore and pin down its existing (unclear) behaviour. When you wrote the legacy code yourself and are fixing it yourself, this may not be so important, but if your design is sloppy then it's quite possible there are parts of the code that work by 'magic' and whose behaviour isn't clear, even to you. In that case, characterization tests will help.
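To make the idea concrete, here is a minimal sketch of a characterization test in Python. It is not taken from the book: legacy_score is a made-up stand-in for an old routine, and the expected values are simply whatever the code returned when it was run once.

```python
import unittest


def legacy_score(hits, misses, bonus=None):
    # Hypothetical stand-in for a legacy routine whose rules nobody remembers.
    total = hits * 10 - misses * 3
    if bonus:
        total += bonus * 2
    if total < 0:
        total = 0
    return total


class LegacyScoreCharacterization(unittest.TestCase):
    """Pins down what the code does today, not what it 'should' do."""

    def test_typical_round(self):
        # Expected value copied from an actual run of the existing code.
        self.assertEqual(legacy_score(5, 2), 44)

    def test_score_never_goes_negative(self):
        self.assertEqual(legacy_score(0, 4), 0)

    def test_bonus_is_doubled(self):
        self.assertEqual(legacy_score(1, 0, bonus=3), 16)
```

If one of these fails after a refactoring, the behaviour changed. Whether the old behaviour was actually correct is a separate question.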
One great part of the book is the discussion about finding (or creating) seams in your codebase -- seams are natural 'fault lines', if you like, where you can break into the existing system to start testing it, and pulling it towards a better design. Hard to explain but well worth a read.
There's a brief paper where Feathers fleshes out some of the concepts from the book, but it really is well worth hunting down the whole thing. It's one of my favourites.

Just an additional refactoring that is more important than you think: Name things correctly!
This goes for any variable name and method name. If the name does not accurately reflect what the thing is used for, then rename it to something more accurate. This might require several iterations. If you cannot find a short and entirely accurate name, then that item does too much and you have an excellent candidate for a piece of code that needs to be split. The names also clearly indicate where the cuts are to be made.
Also, document your stuff. Whenever the answer to WHY? is not clearly conveyed by the answer to HOW? (being the code) you will need to add some documentation. Capturing design decisions is probably the most important task as it is very hard to do in code.

You could always start from "scratch". That doesn't mean scrap it and start from nothing, but try to rethink high-level things from the beginning, since you seem to have learned a lot since the last time you worked on it.
Start from a higher level, and as you build the scaffolding of your new and improved structure, take all the code you can reuse, which will probably be more than you think if you're willing to read through it and make some small changes.
When you're making the changes, be sure to be strict with yourself about following all the good practices you now know, because you will really thank yourself later.
It can be surprisingly refreshing to properly re-make a program to do exactly what it did before, only more "cleanly". ;)
As others have mentioned as well, unit-tests are your best friend! They help you ensure that your refactoring works, and if you're starting from "scratch", it's the perfect time to write them.

You're in a much better position than many people facing this problem in that you understand what the code is supposed to do.
Taking variables out of a shared scope, as you're doing, is a great start, in that you're partitioning responsibilities. Ultimately you want each class to express a single responsibility. A few other things you might look at:
Easy targets for refactoring are code that's duplicated in lots of places and long methods.
If you're managing application state through statically initialized singletons or, worse, a global state that everything is talking to, consider moving it to a managed initialization system (e.g. a dependency injection framework like Spring or Guice), or at least make sure that the initialization isn't entangled with the rest of the code.
Centralize and standardize how you're accessing outside resources, especially if you've got things like file locations or URLs hardcoded.
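You don't need a framework to get the main benefit. Here is a rough sketch in Python of moving state out of a global and handing it in through the constructor. GameState, ScoreKeeper and build_game are invented names for illustration, not anything from the question.

```python
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Plain data that used to live in a global singleton."""
    score: int = 0
    events: list = field(default_factory=list)


class ScoreKeeper:
    def __init__(self, state: GameState):
        # The dependency is handed in, not looked up from a global.
        self._state = state

    def add_points(self, points: int) -> None:
        self._state.score += points
        self._state.events.append(f"scored {points}")


def build_game():
    # Composition root: the only place that knows how objects are wired.
    state = GameState()
    return state, ScoreKeeper(state)


state, keeper = build_game()
keeper.add_points(10)
print(state.score)  # 10
```

Because every call to build_game() creates fresh state, tests can build their own instances without stepping on each other.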

Buy an IDE that has good refactoring support. I think IntelliJ is the best, but Eclipse has it now, too.
The unit test idea is key as well. You will want to have a suite of large, overall transactions that will give you the overall behavior of the code.
Once you have those, start creating unit tests for classes and smaller packages. Write the tests to demonstrate proper behavior, make your changes, and re-run the tests to demonstrate that you haven't broken everything.
Track code coverage as you go. You'll want to work it up to 70% or better. For the classes you change, you'll want those to be 70% or better before you make your changes.
Build up that safety net over time and you'll be able to refactor with some confidence.

very slowly :D
No seriously... take it one step at a time. For instance, refactor something only if it affects or helps you write the current bug/feature that you are working on right now and do no more than that. And before you refactor make darn sure that you have some kind of automated test in place that gets run on each build that will actually test what you are writing/refactoring. Even if you don't have unit tests, it is never too late to start adding them for all new and modified code that is being written. Over time, your code base will get better in small increments daily or weekly instead of worse - all without you making monumental heaps of changes.
In my personal opinion and experience, it's not worth it to just refactor a (legacy) codebase en masse for the sake of refactoring. In those cases, it's best to just start from scratch and do it right all over again (and very rarely are you afforded the opportunity to do such a thing). Hence, refactoring incrementally is the way to go.

For Java code, my favorite first step is to run FindBugs and then remove all the dead stores, unused fields, unreachable catch blocks, unused private methods and likely bugs.
Next I run CPD to look for evidence of cut-copy-paste code.
It isn't unusual to be able to reduce the code base by 5% by doing this. It also saves you from refactoring code that is never used.

I think you should use Eclipse as your IDE because it is free and has many plugins. You should follow the MVC pattern from now on, and yes, you must write test cases using JUnit. Eclipse has a JUnit plugin and provides refactoring support too, which will save you some work. And always remember that just writing code is not the important thing; the main thing is to write clean code. So add comments everywhere, so that not only you but anyone else who reads the code feels as if they are reading an essay.

Refactor the low-hanging fruit. Nibble away at the easy bits, and as you do that, the harder bits will begin to be a little easier. When there aren't any bits left to refactor, you're done.
The refactorings you'll probably find most useful are Rename Method (and even more trivial Renamings like Field, Variable, and Parameter), Extract Method, and Extract Class. For each refactoring you perform, write the necessary unit tests to make the refactoring safe, and run the entire suite of unit tests after each refactoring. It's tempting - and, let's be honest, pretty safe - to rely on the automated refactorings of your IDE, without the tests - but it's good practice and will be good to have the tests into the future as you add functionality to your project.
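As an illustration of Extract Method, here is a small before/after sketch in Python. The invoice example and all the names in it are made up. The point is only that the "after" version gives each step a name while the result stays the same.

```python
def invoice_before(customer, items):
    # One function doing three jobs: totalling, discounting, formatting.
    total = 0
    for name, price, qty in items:
        total += price * qty
    if total > 100:
        total = total * 0.9
    return f"{customer}: {total:.2f}"


def subtotal(items):
    return sum(price * qty for _name, price, qty in items)


def apply_bulk_discount(amount):
    return amount * 0.9 if amount > 100 else amount


def invoice_after(customer, items):
    # Same behaviour, but each step now has a name and can be tested alone.
    return f"{customer}: {apply_bulk_discount(subtotal(items)):.2f}"


items = [("sword", 60.0, 1), ("potion", 12.5, 4)]
assert invoice_before("Ann", items) == invoice_after("Ann", items)
```

The final assert is the kind of check you would keep in the test suite while the old and new versions still exist side by side.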

You might want to look at Martin Fowler's book Refactoring. This is the book that popularized the term and technique (my thought when taking his course: "I've been doing a lot of this all along, I didn't know it had a name"). A quote from the link:
Refactoring is a controlled technique for improving the design of an existing code base. Its essence is applying a series of small behavior-preserving transformations, each of which "too small to be worth doing". However the cumulative effect of each of these transformations is quite significant. By doing them in small steps you reduce the risk of introducing errors. You also avoid having the system broken while you are carrying out the restructuring - which allows you to gradually refactor a system over an extended period of time.
As others have pointed out, unit tests will allow you to refactor with confidence. And start by reducing code duplication. The book will give you lots of other insights.
Here is a catalog of refactorings.

The correct definition of messy code is code that is hard to maintain and change.
For a more mathematical definition, you can check your code with code metrics tools.
That way you keep the code that is already good enough and very quickly find the bad code.
In my experience, this is a very powerful way to improve the quality of your code (especially if your tool can show you the results on each build or in real time).

Throw it away, build it new.


When is it time to refactor code?

On one hand:
1. You never get time to do it.
2. "Context switching" is mentally expensive (difficult to leave what you're doing in the middle of it).
3. It usually isn't an easy task.
4. There's always the fear you'll break something that's now working.
On the other:
1. Using that code is error-prone.
2. Over time you might realize that if you had refactored the code the first time you saw it, that would have saved you time in the long run.
So my question is - Practically - When do you decide it's time to refactor your code?
Thanks.

A couple of observations:
On one hand:
1. You never get time to do it.
If you treat re-factoring as something separate from coding (instead of an intrinsic part of coding decently), and if you can't manage time, then yeah, you'll never have time for it.
"Context switching" is mentally expensive (difficult to leave what you're doing in the middle of it).
See previous point above. Refactoring is an active component of good coding practices. If you separate the two as if they were two different tasks, then 1) your coding practices need improvement/maturing, and 2) you will engage in severe context switching if your code is in a severe need of refactoring (again, code quality.)
It usually isn't an easy task.
Only if the code you produce is not amenable to refactoring. That is, code that is hard to refactor exhibits one or more of the following (list is not universally inclusive):
High cyclomatic complexity,
No single responsibility per class (or procedure),
High coupling and/or low cohesion (aka poor LCOM metrics),
poor structure
Not following the SOLID principles.
No adherence to the Law of Demeter when appropriate.
Excessive adherence to the Law of Demeter when inappropriate.
Programming against implementations instead of interfaces.
There's always the fear you'll break something that's now working.
Testing? Verification? Analysis? Any of these before being checked into source control (and certainly before being delivered to the users)?
On the other:
1. Using that code is error-prone.
Only if it has never been tested/verified and/or if there is no clear understanding of the conditions and usage patterns under which the potentially error-prone code operates acceptably.
Over time you might realize that if you had refactored the code the first time you saw it, that would have saved you time in the long run.
That realization should not occur over time. Good engineering and work ethics call for that realization to occur when the artifact (be it hardware or software) is in the making.
So my question is - Practically - When do you decide it's time to refactor your code?
Practically: when I'm coding, I detect an area that needs improvement (or something that needs correction after a change in requirements or expectations), and I get an opportunity to improve it without sacrificing a deadline. If I cannot refactor at that moment, I simply document the perceived defect and create a workable, realistic plan to revisit the artifact for refactoring.
In real life, there will be moments that we'll code some ugly kludge just to get things running, or because we are drained and tired or whatever. It's reality. Our job is to make sure that those incidents do not pile up and remain unattended. And the key to this is to refactor as you code, keep the code simple and with a good, simple and elegant structure. And by "elegant" I don't mean "smart-ass" or esoteric, but that displays what is typically considered readable, simple, composable attributes (and mathematical attributes when they apply practically.)
Good code lends itself to refactoring; it displays good metrics; its structure resembles both computer science function composition and mathematical function composition; it has a clear responsibility; it makes its invariants, pre and post-conditions evident; and so on and so on.
Hope it helps.

One of the most common mistakes I see is people associating the word "Refactor" with "Big Change".
Refactoring code does not always have to be big. Even small changes such as changing a bool to a proper enum, or renaming a method to be closer to the actual function vs. the intent is refactoring your code. With the exception of the end of a milestone, I try to make at least a very small refactoring every single time I check in. And you'd be amazed at how quickly this makes a visible difference in the code.
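For example, the bool-to-enum change might look something like this in Python. This is only a sketch: draw_sprite and Visibility are invented names, not code from any real project.

```python
from enum import Enum


class Visibility(Enum):
    VISIBLE = "visible"
    HIDDEN = "hidden"


def draw_sprite_old(name, flag):
    # Call sites read draw_sprite_old("orc", True): True meaning what?
    return f"{name}:{'hidden' if flag else 'visible'}"


def draw_sprite(name, visibility=Visibility.VISIBLE):
    # Call sites now say what they mean.
    return f"{name}:{visibility.value}"


print(draw_sprite("orc", Visibility.HIDDEN))  # orc:hidden
```

It is a tiny change, but every call site becomes self-explanatory, and adding a third state later no longer means adding a second bool.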
Bigger changes do take bigger planning though. I try and schedule about 1/2 a day every two weeks during a normal development cycle to tackle a bigger refactoring change. This is enough time to make a substantial improvement to the code base. If the refactoring fails 1/2 a day is not that much of a loss. And it's rarely a total loss because even the failed refactoring will teach you something about your code.

1. Whenever it smells, I refactor. I may not make it perfect now, but I can at least make a small step towards a better state. And those small changes do add up over time...
2. If I am in the middle of something when I notice the smell, and fixing it isn't trivial (or I am just before release), I may make a (mental or paper) note to return to it once I am finished with the primary task.
3. Practice makes one better :-) But if I don't see a solution to a problem, I put it aside and let it brew for a while, discuss it with coworkers, or even post it on SO ;-)
4. If I don't have unit tests and the fix isn't trivial, I start with the tests. If the tests aren't trivial either, I apply point 2.

I start to refactor as soon as I find I am repeating myself. DRY principles ftw.
Also, if methods/functions get too long, to the point where they look unwieldy, or their purpose is being obscured by the length of the function, I break it into private subfunctions that explain what is really going on.
Lastly, if everything's up and running, and the code is dog-slow, I start to look at refactoring for the sake of performance.

When implementing a new feature I often notice that the task would be much simpler if the code I'm working on was structured in a different way. In this case I usually step back, try to do the refactoring first, and only after this is done I continue implementing the new feature.
I also have a habit to track all potential improvements that come to my mind either in notes or the bug tracker. The ideas bake there for some time, some of them don't feel so compelling anymore, and the reasonable ones are implemented during a day which I dedicate to smaller tasks.

Refactor code when it needs to be refactored. Some symptoms I look for:
duplicate code in similar objects.
duplicate code within methods of one object.
anytime the requirements have changed twice or more.
anytime somebody says "we will clean that up later".
any time I read through code and shake my head thinking "what goofball did this" (even when the goofball in question is me)
In general, less design and/or less clear requirements means more opportunities for refactoring.

This might sound like a joke, but really, I only refactor when things "get messy": when a simple task starts taking more time than usual, or when I have to twist my mind around to remember what function is doing what and such. Also, if the code starts running slowly and it's not because I'm running in a development environment (a lot of variable outputs and such), and I can't optimise it, I refactor the code. As you said, it's worth it in the long run.
Still, I always make sure I have enough time to think things through before I start, so I don't get into this situation.
Cheers!

I usually refactor when one of the following is true:
I have nothing better to do and am waiting for the next project to come to my inbox
The additions/changes I'm making to the code cannot work unless, or would be better if, I refactor
I am aesthetically displeased with the way the code is laid out

Martin Fowler, in his book of the same name, suggests you do it the third time you're in a block of code to make another change. First time in the block, you happen to notice you should refactor, but don't have time. Second time back... same thing. Third time back: now refactor.
Also, I read that the developers of a current release of Smalltalk (squeak.org, I think) say they go through a couple of weeks of intense coding... then they step back and look at what can be refactored.
Personally I have to resist the impulse to refactor as I code or I get 'paralyzed'.

How to convince your fellow developer to write short methods?

Long methods are evil on several grounds:
They're hard to understand
They're hard to change
They're hard to reuse
They're hard to test
They have low cohesion
They may have high coupling
They tend to be overly complex
How to convince your fellow developer to write short methods? (weapons are forbidden =)
question from agiledeveloper

Ask them to write unit tests for the methods.

That depends on your definitions of "short" and "long".
When I hear someone say "write short methods", I immediately react badly because I've encountered too much spaghetti written by people who think the ideal method is two lines long: One line to do the tiniest possible unit of work followed by one line to call another method. (You say long methods are evil because "they're hard to understand"? Try walking into a project where every trivial action generates a call stack 50 methods deep and trying to figure out which of those 50 layers is the one you need to change...)
On the other hand, if, by "short", you mean "self-contained and limited to a single conceptual function", then I'm all for it. But remember that this can't be measured simply by lines of code.
And, as tydok pointed out, you catch more flies with honey than vinegar. Try telling them why your way is good instead of why their way is bad. If you can do this without making any overt comparisons or references to them or their practices (unless they specifically ask how your ideas would relate to something they're doing), it'll work even better.

You made a list of drawbacks. Try to make a list of what you'll gain by using short methods. Concrete examples. Then try to convince him again.

I read this quote from somewhere:
Write your code as if the person who has to maintain it is a violent psycho, who knows where you live.

In my experience the best way to convince a peer in these cases is by example. Just find opportunities to show them your code and discuss with them the benefits of short functions vs. long functions. Eventually they'll realize what's better on their own, without the need to make them feel like "bad" programmers.

Code Reviews!
I suggest you try and get some code reviews going. This way you could have a little workshop on best practices and whatever formatting your company adheres to. This adds the context that short methods are a way to make code more readable, easier to understand, and compliant with the SRP.

If you've tried to explain good design and people just aren't getting it, or are just refusing to get it, then stop trying. It's not worth the effort. All you'll get is a bad rep for yourself. Some people are just hopeless.
Basically what it comes down to is that some programmers just aren't cut out for development. They can understand code that's already written, but they can't create it on their own.
These folks should be steered toward a support role, but they shouldn't be allowed to work on anything new. Support is a good place to see lots of different code, so maybe after a few years they'll come to see the benefits of good design.
I do like the idea of Code Reviews that someone else suggested. These sloppy programmers should not only have their own code reviewed, they should sit in on reviews of good code as well. That will give them a chance to see what good code is. Possibly they've just never seen good code.

To expand upon rvanider's answer, performing the cyclomatic complexity analysis on the code did wonders to get attention to the large method issue; getting people to change was still in the works when I left (too much momentum towards big methods).
The tipping point was when we started linking the cyclomatic complexity to the bug database. A CC of over 20 that wasn't a factory was guaranteed to have several entries in the bug database and oftentimes those bugs had a "bloodline" (fix to Bug A caused Bug B; fix to Bug B caused Bug C; etc). We actually had three CC's over 100 (max of 275) and those methods accounted for 40% of the cases in our bug database -- "you know, maybe that 5000 line function isn't such a good idea..."
It was more evident in the project I led when I started there. The goal was to keep CC as low as possible (97% were under 10) and the end result was a product that I basically stopped supporting because the 20 bugs I had weren't worth fixing.
Bug-free software isn't going to happen because of short methods (and this may be an argument you'll have to address) but the bug fixes are very quick and are often free of side-effects when you are working with short, concise methods.
Though writing unit tests would probably cure them of long methods, your company probably doesn't use unit tests. Rhetoric only goes so far and rarely works on developers who are stuck in their ways; show them numbers about how those methods are creating more work and buggy software.

Finding the right blend between function length and simplicity can be complex. Try to apply a metric such as Cyclomatic Complexity to demonstrate the difficulty in maintaining the code in its present form. Nothing beats a non-personal measurement that is based on testing factors such as branch and decision counts.
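If you have no metrics tool at hand, even a rough count is enough to start the conversation. The sketch below approximates cyclomatic complexity for Python functions using the standard ast module. It counts one for the function plus one per branch point, which is a simplification of the real metric and an assumption of this example, not how any particular tool computes it.

```python
import ast

# Node types counted as one extra path each. This is a rough approximation of
# McCabe's metric, not a replacement for a real metrics tool.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func_node):
    score = 1  # a function with no branches has exactly one path
    for node in ast.walk(func_node):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra decision points
            score += len(node.values) - 1
    return score


def complexity_report(source):
    """Map each function name in `source` to its approximate complexity."""
    tree = ast.parse(source)
    return {
        node.name: cyclomatic_complexity(node)
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


SAMPLE = '''
def simple(x):
    return x + 1

def tangled(x, y):
    if x > 0 and y > 0:
        for i in range(x):
            if i % 2:
                y += i
    elif x < 0:
        y = -y
    return y
'''

print(complexity_report(SAMPLE))  # {'simple': 1, 'tangled': 6}
```

Sorting such a report by score and lining it up against the bug tracker gives you the kind of non-personal numbers described above.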

Not sure where this great quote comes from, but:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it"

Force him to read Code Complete by Steve McConnell. Say that every good developer has to read this.

Get him drunk? :-)
The serious point to this answer is the question, "why do I consistently write short functions, and hate myself when I don't?"
The reason is that I have difficulty understanding complex code, be that long functions, things that maintain and manipulate a lot of state, or that sort of thing. I noticed many years ago that there are a fair number of people out there that are significantly better at dealing with this sort of complexity than I am. Ironically enough, it's probably because of that that I tend to be a better programmer than many of them: my own limitations force me to confront and clean up that sort of code.
I'm sorry I can't really provide a real answer here, but perhaps this can provide some insight to help lead us to an answer.

Force them to read the book "Clean Code". There are many others, but this one is new, good and an easy read.

Asking them to write unit tests for the complex code is a good avenue to take. This person needs to see for himself the debt that complexity brings when performing maintenance or analysis.
The question I always ask my team is: "It's 11 pm and you have to read this code - can you? Do you understand under pressure? Can you, over the phone, no remote login, lead them to the section where they can fix an error?" If the answer is no, the follow up is "Can you isolate some of the complexity?"
If you get an argument in return, it's a lost cause. Throw something then.

I would give them 100 lines of code all under 1 method and then another 100 lines of code divided up between several methods and ask them to write down an explanation of what each does.
Time how long it takes to write both paragraphs and then show them the result.
...Make sure to pick code that will take twice or three times as long to understand if it were all under one method - Main() -
Nothing is better than learning by example.

"Short" and "long" are terms that can be interpreted differently. For one person, short is a 2-line method, while someone else will think that methods with no more than 100 lines of code are pretty short.
I think it would be better to state that a single method should not do more than one thing at the same time, meaning it should only have one responsibility.
Maybe you could let your fellow developers read something about how to practice the SOLID principles.

I'd normally show them older projects which have well written methods. I would then step through these methods while explaining the reasons behind why we developed them that way.
Hopefully when looking at the bigger picture, they would understand the reasons behind this.
PS. This exercise could also double as a mini knowledge transfer on older projects.

Show him how much easier it is to test short methods. Prove that writing short methods will make it easier and faster for him to write the tests for his methods (he is testing these methods, right?)
Bring it up when you are reviewing his code. "This method is rather long, complicated, and seems to be doing four distinct things. Extract method here, here, and here."

Long methods usually mean that the object model is flawed, i.e. one class has too many responsibilities. Chances are that you don't want just more functions, each one shorter, in the same class, but those responsibilities properly assigned to different classes.

No use teaching a pig to sing. It wastes your time and annoys the pig.
Just outshine someone.
When it comes time to fix a bug in the 5000 line routine, then you'll have a ten-line routine and a 4990-line routine. Do this slowly, and nobody notices a sudden change except that things start working better and slowly the big ball of mud evaporates.

You might want to tell him that he might have a really good memory, but you don't. Some people are able to handle much longer methods than others. If you both have to be able to maintain the code, it can only be done if the methods are smaller.
Only do this if he doesn't have a superiority complex.
[edit]
why is this collecting negative scores?

You could start refactoring every single method they wrote into multiple methods, even while they're currently working on them. Add extra time to your schedule for "refactoring others' methods to make the code maintainable". Do it the way you think it should be done, and - here comes the educational part - when they complain, tell them you wouldn't have to refactor the methods if they had done it right the first time. This way, your boss learns that you have to correct others' laziness, and your co-workers learn that they should do it differently.
That's at least some theory.

What are the disadvantages of code reuse?

A few years ago, we needed a C++ IPC library for making function calls over TCP. We chose one and used it in our application. After a while, it became clear it didn't provide all the functionality we needed. In the next version of our software, we threw the third-party IPC library out and replaced it with one we wrote ourselves. Since then, I have sometimes doubted whether this was a good decision, because it has proven to be quite a lot of work and it obviously felt like reinventing the wheel. So my question is: are there disadvantages to code reuse that justify this reinvention?

I can suggest a few:
The bugs get replicated if you reuse buggy code :)
Sometimes it adds overhead. For example, if you just need to do one simple thing, it is not advisable to pull in a big, complex library just because it implements the required feature.
You might face licensing concerns.
You may need to spend time learning and configuring the external library. This may not pay off if redeveloping the functionality would take much less time.
Reusing a poorly documented library may take more time than expected or estimated.

P.S. The reasons for writing our own library were:
Evaluating external libraries is often very difficult and it takes a lot of time. Also, some problems only become visible after a thorough evaluation.
It made it possible to introduce some features that are specific for our project.
It is easier to do maintenance and to write extensions, as you know the library through and through.

It's pretty much always case by case. You have to look at the suitability and quality of what you're trying to reuse.

The number one issue is: you can only successfully reuse code if that code is GOOD code. If it was designed poorly, has bugs, or is very fragile then you'll run into the same issues you already did run into -- you have to go do it yourself anyway because it's so hard to modify the existing code.
However, if it's a third-party library that you are considering using that you don't have the source code for, it's a little different. You can try and get the source if it's that kind of library. Some commercial library vendors are open to suggestions and feature requests.

The Golden Wisdom :: It Has To Be Usable Before It Can Be Reusable.

The biggest disadvantage of reusing third-party libraries (you mention it yourself) is that you are strongly coupled to, and dependent on, how that library works and how it's supposed to be used, unless you manage to create a middle interface layer that can take care of it.
But it's hard to create a generic interface, since replacing an existing library with another one more or less requires that the new one works in a similar way. You can always rewrite the code that uses it, but that might be very hard and take a long time.
Another aspect is that if you reinvent the wheel, you have complete control over what's happening and you can make modifications as you see fit. This can be completely impossible if you depend on a third-party library staying alive and constantly providing you with updates and bug fixes. On the other hand, reusing code this way enables you to focus on other things in your software, which sometimes might be the thing to do.
There's always a trade off.
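To make the "middle interface layer" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `RpcTransport`, the `invoke` method on the vendor client, and `FakeVendorClient` are invented for illustration and do not describe any real IPC library.

```python
from abc import ABC, abstractmethod


class RpcTransport(ABC):
    """The narrow interface the application codes against (hypothetical)."""

    @abstractmethod
    def call(self, function_name, *args):
        ...


class FakeVendorClient:
    """Stand-in for a third-party client; 'invoke' is an assumed method name."""

    def invoke(self, name, params):
        if name == "add":
            return sum(params)
        raise KeyError(name)


class ThirdPartyAdapter(RpcTransport):
    """Only this class knows how the vendor's API is shaped."""

    def __init__(self, vendor_client):
        self._client = vendor_client

    def call(self, function_name, *args):
        return self._client.invoke(function_name, list(args))


class InHouseTransport(RpcTransport):
    """A home-grown replacement that satisfies the same interface."""

    def __init__(self, handlers):
        self._handlers = handlers

    def call(self, function_name, *args):
        return self._handlers[function_name](*args)


def fetch_sum(transport, a, b):
    """Application code: unaware of which implementation it was given."""
    return transport.call("add", a, b)
```

Swapping the library then means writing one new adapter rather than touching every call site, though as noted above this only works smoothly when the replacement behaves in a broadly similar way.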

If your code relies on external resources and those go away, you may be crippling portions of many applications.

Since most reused code comes from the internet, you run into all the issues with the Bathroom Wall of Code Atwood talks about. You can run into issues with insecure or unreliable borrowed code, and the more black boxed it is, the worse.

Disadvantages of code reuse:
Debugging takes a whole lot longer since it's not your code and it's likely to be somewhat bloated.
Any specific requirements will also take more work, since you are constrained by the code you're re-using and have to work around its limitations.
Constant code reuse will, in the long run, result in bloated and disorganized applications with hard-to-chase bugs - programming hell.
Re-using code can (depending on the case) reduce the challenge and satisfaction factor for the programmer, and also waste an opportunity to develop new skills.
It depends on the case, the language and the code you want to re-use or re-write. In general I believe that the higher-level the language is, the more I tend towards rewriting rather than reuse. Bugs in higher-level code can have a bigger impact, and such code is easier to rewrite. High-level code must stay readable, neat and flexible. Of course that could be said of all code, but, somehow, rewriting a C library sounds like less of a good idea than rewriting (or rather re-factoring) PHP model code.
So anyway, these are some of the arguments I'd use to promote "reinventing the wheel".
Sometimes it's just faster, more fun, and better in the long run to rewrite from scratch than to work around the bugs and limitations of a current codebase.

Wondering what you are using to keep this library you reinvented?

Creating reusable code costs more time and money up front.
When the master branch has an update, you need to sync it and deploy again.
The bugs get replicated if you reuse buggy code.
Reusing poorly documented code may take more time than expected or estimated.

Maintaining code that is close to software rot

Let's say you're the lucky programmer who inherited code that is close to software rot. Software rot, as described in The Pragmatic Programmer, is code that has become too ugly (in this case, unrefactored code). It is compared to a broken window that no one wants to fix, which in turn damages the house and lets crime thrive in that city.
But it is the same kind of code that Joel Spolsky values in Joel on Software, because it contains valuable patches that have been debugged throughout its lifetime (even though it can look unstructured and ugly).
How would you maintain this?

Have a look at Working Effectively with Legacy Code by Michael Feathers. Lots of good advice there.

WELC is a great book. You should certainly check it out.
If you don't want to wait for the book to arrive, I can summarise the bits I think are important:
You need to understand your system. Do some throwaway coding to understand the part you need to work on. E.g. be prepared to try to get the system under test, knowing that you will probably break it (and then work out what went wrong).
Look for areas where you can break dependencies. Michael Feathers calls these seams. They are points where you can take a bit of the legacy system and refactor it so it will be testable.
As you work on the system, add tests as you go.
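For anyone who hasn't met the term, here is one small illustration of a seam, sketched in Python. The function and the 30-day rule are invented for the example, and this is only one kind of seam (an optional parameter); the book covers several others.

```python
import datetime


def is_trial_expired(start_date, today=None):
    """Legacy-style logic with a seam: 'today' can be supplied by a test.

    Imagine the original version called datetime.date.today() directly,
    so its result depended on the day the test happened to run.
    """
    if today is None:
        today = datetime.date.today()  # production callers are unaffected
    return (today - start_date).days > 30
```

Existing callers keep working unchanged, while a test can now pin the date and assert on the result.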

You can do a few things:
Refactor the code to make it more maintainable. If the code is being used for feature development as well then refactoring will make sense.
If the code is legacy code and is touched only for bug fixes then I would suggest you only fix as much as required and when required.
Often, the first impression people get from inherited legacy code is that it's messy. Give it some time and get comfortable with it. With time you may see some valid reasons why the code looks the way it does...

First, make sure that you have a robust test procedure for it, and that it will actually be tested again in depth, by several people (you, QA, ...).
Then, take some time, day after day, to improve the small parts you have to modify. The key is to have management that understands why it takes longer than expected. Explain that you have to do refactoring and that it is important for both the short and the long term, and ask other developers to review the existing code and confirm your arguments.

What's the best way to become familiar with a large codebase? [closed]

Joining an existing team with a large codebase already in place can be daunting. What's the best approach?
Broad; try to get a general overview of how everything links together, from the code
Narrow; focus on small sections of code at a time, understanding how they work fully
Pick a feature to develop and learn as you go along
Try to gain insight from class diagrams and uml, if available (and up to date)
Something else entirely?
I'm working on what is currently an approx 20k line C++ app & library (Edit: small in the grand scheme of things!). In industry I imagine you'd get an introduction by an experienced programmer. However if this is not the case, what can you do to start adding value as quickly as possible?
--
Summary of answers:
Step through code in debug mode to see how it works
Pair up with someone more familiar with the code base than you, taking turns to be the person coding and the person watching/discussing. Rotate partners amongst team members so knowledge gets spread around.
Write unit tests. Start with an assertion of how you think code will work. If it turns out as you expected, you've probably understood the code. If not, you've got a puzzle to solve and or an enquiry to make. (Thanks Donal, this is a great answer)
Go through existing unit tests for functional code, in a similar fashion to above
Read UML, Doxygen generated class diagrams and other documentation to get a broad feel of the code.
Make small edits or bug fixes, then gradually build up
Keep notes, and don't jump in and start developing; it's more valuable to spend time understanding than to generate messy or inappropriate code.
this post is a partial duplicate of the-best-way-to-familiarize-yourself-with-an-inherited-codebase

Start with some small task if possible, debug the code around your problem.
Stepping through code in debug mode is the easiest way to learn how something works.

Another option is to write tests for the features you're interested in. Setting up the test harness is a good way of establishing what dependencies the system has and where its state resides. Each test starts with an assertion about the way you think the system should work. If it turns out to work that way, you've achieved something and you've got some working sample code to reproduce it. If it doesn't work that way, you've got a puzzle to solve and a line of enquiry to follow.
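A tiny sketch of what such "assert what you believe" tests can look like, in Python. `legacy_slugify` is a made-up stand-in for whatever existing function you are trying to understand.

```python
def legacy_slugify(title):
    """Stand-in for existing code whose behaviour we are trying to learn."""
    return "-".join(title.lower().split())


def test_slugify_collapses_whitespace():
    # Our guess about the behaviour, written down as an assertion.
    assert legacy_slugify("  Hello   World ") == "hello-world"


def test_slugify_keeps_punctuation():
    # A second guess: punctuation is left alone. Had this failed, that would
    # be the puzzle to investigate, not automatically a bug to fix.
    assert legacy_slugify("Hello, World!") == "hello,-world!"
```

Each passing test records something you now know about the code; each failing one tells you where your mental model is off.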

One thing that I usually suggest to people that has not yet been mentioned is that it is important to become a competent user of the existing code base before you can be a developer. When new developers come into our large software project, I suggest that they spend time becoming expert users before diving in trying to work on the code.
Maybe that's obvious, but I have seen a lot of people try to jump into the code too quickly because they are eager to start making progress.

This is quite dependent on what sort of learner and what sort of programmer you are, but:
Broad first - you need an idea of scope and size. This might include skimming docs/uml if they're good. If it's a long term project and you're going to need a full understanding of everything, I might actually read the docs properly. Again, if they're good.
Narrow - pick something manageable and try to understand it. Get a "taste" for the code.
Pick a feature - possibly a different one to the one you just looked at if you're feeling confident, and start making some small changes.
Iterate - assess how well things have gone and see if you could benefit from repeating an early step in more depth.

Pairing with strict rotation.
If possible, while going through the documentation/codebase, try to employ pairing with strict rotation. Meaning, two of you sit together for a fixed period of time (say, a 2 hour session), then you switch pairs, one person will continue working on that task while the other moves to another task with another partner.
In pairs you'll both pick up a piece of knowledge, which can then be fed to other members of the team when the rotation occurs. What's good about this also, is that when a new pair is brought together, the one who worked on the task (in this case, investigating the code) can then summarise and explain the concepts in a more easily understood way. As time progresses everyone should be at a similar level of understanding, and hopefully avoid the "Oh, only John knows that bit of the code" syndrome.
From what I can tell about your scenario, you have a good number for this (3 pairs), however, if you're distributed, or not working to the same timescale, it's unlikely to be possible.

I would suggest running Doxygen on it to get an up-to-date class diagram, then going broad-in for a while. This gives you a quickie big picture that you can use as you get up close and dirty with the code.

I agree that it depends entirely on what type of learner you are. Having said that, I've been at two companies which had very large code-bases to begin with. Typically, I work like this:
If possible, before looking at any of the functional code, I go through unit tests that are already written. These can generally help out quite a lot. If they aren't available, then I do the following.
First, I largely ignore implementation and look only at header files, or just the class interfaces. I try to get an idea of what the purpose of each class is. Second, I go one level deep into the implementation starting with what seems to be the area of most importance. This is hard to gauge, so occasionally I just start at the top and work my way down in the file list. I call this breadth-first learning. After this initial step, I generally go depth-wise through the rest of the code. The initial breadth-first look helps to solidify/fix any ideas I got from the interface level, and then the depth-wise look shows me the patterns that have been used to implement the system, as well as the different design ideas. By depth-first, I mean you basically step through the program using the debugger, stepping into each function to see how it works, and so on. This obviously isn't possible with really large systems, but 20k LOC is not that many. :)

Work with another programmer who is more familiar with the system to develop a new feature or to fix a bug. This is the method that I've seen work out the best.

I think you need to tie this to a particular task. When you have time on your hands, go for whichever approach you are in the mood for.
When you have something that needs to get done, give yourself a narrow focus and get it done.

Get the team to put you on bug fixing for two weeks (if you have two weeks). They'll be happy to get someone to take responsibility for that, and by the end of the period you will have spent so much time problem-solving with the library that you'll probably know it pretty well.

If it has unit tests (I'm betting it doesn't), start small and make sure the unit tests don't fail. If you stare at the entire codebase at once your eyes will glaze over and you will feel overwhelmed.
If there are no unit tests, you need to focus on the feature that you want. Run the app and look at the results of things that your feature should affect. Then start looking through the code trying to figure out how the app creates the things you want to change. Finally change it and check that the results come out the way you want.
You mentioned it is an app and a library. First change the app and stick to using the library as a user. Then after you learn the library it will be easier to change.
From a top-down perspective, the app probably has a main loop or a main GUI that controls all the action. It is worth reading that code to get a broad overview of the main control flow of the application. If it is a GUI app, sketch on paper which screens there are and how to get from one screen to another. If it is a command-line app, sketch how the processing is done.
Even in companies it is not unusual to have this approach. Often no one fully understands how an application works. And people don't have time to show you around. They prefer specific questions about specific things so you have to dig in and experiment on your own. Then once you get your specific question you can try to isolate the source of knowledge for that piece of the application and ask it.

Start by understanding the 'problem domain' (is it a payroll system? inventory? real time control or whatever). If you don't understand the jargon the users use, you'll never understand the code.
Then look at the object model; there might already be a diagram or you might have to reverse engineer one (either manually or using a tool as suggested by Doug). At this stage you could also investigate the database (if any). It should follow the object model but it may not, and that's important to know.
Have a look at the change history or bug database, if there's an area that comes up a lot, look into that bit first. This doesn't mean that it's badly written, but that it's the bit everyone uses.
Lastly, keep some notes (I prefer a wiki).
The existing guys can use it to sanity check your assumptions and help you out.
You will need to refer back to it later.
The next new guy on the team will really thank you.
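If the project is in version control, the change-history tip above can be roughed out with a few lines of script. This sketch assumes git and counts paths in the text produced by `git log --name-only --pretty=format:`; the sample paths are invented.

```python
from collections import Counter


def most_changed_files(log_text, top=3):
    """Count how often each path appears in file-name-only log output."""
    paths = [line.strip() for line in log_text.splitlines() if line.strip()]
    return Counter(paths).most_common(top)


# Hypothetical output of: git log --name-only --pretty=format:
sample_log = """
src/engine.cpp
src/render.cpp

src/engine.cpp

src/engine.cpp
docs/README
"""

hot_spots = most_changed_files(sample_log, top=1)
```

A file that tops the list is not necessarily bad code, but it is probably the code you will be asked to touch first.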

I had a similar situation. I'd say you go like this:
If it's a database-driven application, start from the database and try to make sense of each table, its fields and then its relation to the other tables.
Once you're fine with the underlying store, move up to the ORM layer. Those tables must have some kind of representation in code.
Once done with that, move on to where these objects come from and how. An interface? Which interface? Any validations? What preprocessing takes place on them before they go to the datastore?
This will familiarize you better with the system. Remember that writing or understanding unit tests is only possible when you know very well what is being tested and why it needs to be tested in just that way.
And in the case of a large application that is not database-driven, I'd recommend another approach:
What is the main goal of the system?
What are the major components of the system that solve this problem?
What interactions does each component have with the others? Make a graph that depicts component dependencies. Ask someone already working on it. These components must be exchanging something with each other, so try to figure that out as well (e.g. IO might be returning a File object back to the GUI, and the like).
Once comfortable with this, dive into the component that depends least on the others. Now study how that component is further divided into classes and how they interact with each other. This way you get the hang of a single component as a whole.
Move to the next least dependent component.
At the very end, move to the core component, which typically has dependencies on many of the other components you've already tackled.
While looking at the core component, you might find yourself referring back to the components you examined earlier, so don't worry, keep working hard!
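The "least dependent first, core last" order in the steps above is essentially a topological sort of the dependency graph. A small sketch using Python's standard `graphlib` module (Python 3.9+); the component names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical components, each mapped to the components it depends on.
dependencies = {
    "core": {"io", "ui", "page"},
    "ui": {"page"},
    "page": {"io"},
    "io": set(),
}


def study_order(deps):
    """Return components so that each comes after everything it depends on."""
    return list(TopologicalSorter(deps).static_order())


order = study_order(dependencies)
```

In a real codebase the graph would come from a colleague's whiteboard sketch or a tool, and a cycle (which `graphlib` reports as an error) is itself useful information about the design.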
For the first strategy:
Take this stackoverflow site as an example. Examine the datastore: what is being stored, how it is stored, what representations those items have in the code, and how and where they are presented on the UI. Where do they come from, and what processing takes place on them once they go back to the datastore?
For the second one:
Take a word processor as an example. What components are there? IO, UI, Page and the like. How do these interact with each other? Move along as you learn more.
Be relaxed. Written code is someone's mindset, frozen logic and thinking style, and it takes time to read that mind.

First, if you have team members available who have experience with the code you should arrange for them to do an overview of the code with you. Each team member should provide you with information on their area of expertise. It is usually valuable to get multiple people explaining things, because some will be better at explaining than others and some will have a better understanding than others.
Then, you need to start reading the code for a while without any pressure (a couple of days or a week if your boss will provide that). It often helps to compile/build the project yourself and be able to run the project in debug mode so you can step through the code. Then, start getting your feet wet, fixing small bugs and making small enhancements. You will hopefully soon be ready for a medium-sized project, and later, a big project. Continue to lean on your team-mates as you go - often you can find one in particular who is willing to mentor you.
Don't be too hard on yourself if you struggle - that's normal. It can take a long time, maybe years, to understand a large code base. Actually, it's often the case that even after years there are still some parts of the code that are still a bit scary and opaque. When you get downtime between projects you can dig in to those areas and you'll often find that after a few tries you can figure even those parts out.
Good luck!

You may want to consider looking at source code reverse engineering tools. There are two tools that I know of:
SWAG Kit (Linux only)
Bauhaus (academic and commercial versions)
Both tools offer similar feature sets that include static analysis that produces graphs of the relations between modules in the software.
This mostly consists of call graphs and type/class dependencies. Viewing this information should give you a good picture of how the parts of the code relate to one another. Using this information, you can dig into the actual source for the parts that you are most interested in and that you need to understand/modify first.

I find that just jumping into the code can be a bit overwhelming. Try to read as much documentation on the design as possible. This will hopefully explain the purpose and structure of each component. It's best if an existing developer can take you through it, but that isn't always possible.
Once you are comfortable with the high-level structure of the code, try to fix a bug or two. This will help you get to grips with the actual code.

I like all the answers that say you should use a tool like Doxygen to get a class diagram, and first try to understand the big picture. I totally agree with this.
That said, this largely depends on how well factored the code is to begin with. If it's a gigantic mess, it's going to be hard to learn. If it's clean and organized properly, it shouldn't be that bad.

See this answer on how to use test coverage tools to locate the code for a feature of interest, without knowing anything about where that feature is, or how it is spread across many modules.

(shameless marketing ahead)
You should check out nWire. It is an Eclipse plugin for navigating and visualizing large codebases. Many of our customers use it to break-in new developers by printing out visualizations of the major flows.
