February 19, 2006

Programming in the small - variable declarations.

There are some programming habits that I see very frequently that lead to code that's more difficult to understand and maintain than necessary. I'll just cover one in this article and then maybe others in future articles.

Variable declarations in outer scope.

Consider the run method of Graph2Csv from the previous article. I often see code where variables are declared outside of the scope where they are used. For example:

	private void run() throws IOException {
		BufferedReader input = new BufferedReader(new InputStreamReader(System.in));
		String line = input.readLine();
		String[] parts;
		String source;
		String destination;
		while(line != null){
			if(line.contains(" -> ")){
				parts = line.split(" -> ");
				source = parts[0];
				destination = parts[1].substring(0,parts[1].length()-1);
				...

The variables parts, source, destination could have been declared inside the loop (as shown previously).

I think this habit comes from experience with some other languages, or earlier education, in which it was either necessary, or deemed good practice, to declare variables at the beginning of functions/procedures/methods.

The consequences

Delaring these variables outside the loop takes slightly more lines without being any clearer. You don't know, without reading through the whole method, whether the variables are accessed later in the method (if declared within the loop then, of course, they can't possibly be being used outside of the loop).

Extracting methods

It's not as obvious whether the code within the loop can be extracted as a method. Looking at this code, I'm wondering whether:

			if(line.contains(" -> ")){
				parts = line.split(" -> ");
				source = parts[0];
				destination = parts[1].substring(0,parts[1].length()-1);
				node(destination).incoming++;
				node(source).outgoing++;
			}

can be extracted as a method - maybe called "parseLine", but having the declarations outside of this scope makes me unsure. Fortunately, Eclipse is good enough to do the refactoring correctly, even if it doesn't do a fantastic job of it. You end up with the declarations of parts, source, destination being left behind but unused.

Where it gets really ugly

The declaration of variables in an outer scope isn't such a big deal in this example because the code is small and simple enough, but when combined with long methods, particularly if the variable is used later in the method for some other purpose (e.g. if there was another loop later in the method), it can be really difficult to see how to split a long method up into understandable parts.

I'll tell you what I want, what I really, really want.

It would be great if Eclipse (or other IDEs) spotted that these variable declarations could be put into an inner scope and offer to refactor the code to move the declarations for you (if you know of a tool or IDE that does this, please post a comment).

Posted by ivan at February 19, 2006 11:07 AM
Copyright (c) 2004-2008 Ivan Moore
Comments

I think it's a matter of personal choice where you declare the variables. I personally like to declare just before the loop for the following reasons:

1) The sequence (of steps) in the while/for loop looks clean.
2) Don't have to make the decision which variables are needed after the loop and end up declaring some variables inside the loop and some before the loop.
3) Ability to initialize variables used in the loop.
4) In loops with large number of iterations (say a million), GC needs to reclaim only 1 object as opposed to million objects if the object is defined inside the loop.

Posted by: Sagar at February 19, 2006 3:35 PM

Nice summary Ivan -- this is one of those "personal preferences" (sorry Sagar) that make for all kinds of headaches when refactoring (just coming off one such an adventure, and was talking about this last week!).

Variable entanglement is one of the larger [I tried using the word c*sts here, as in the expensive, but you're content filter stopped me!] that make rewriting a function cheaper than refactoring it (particularly because it makes testability much more difficult).

Of course when programming is practiced as a team sport, it it isn't simply personal preference, but team standard. Our current standard is something like: "inline where possible, minimize the distance between declaration and *last* use, factor out for clarity and performance after it works."

BTW, there's a good discussion of this on the c2 wiki:
http://www.c2.com/cgi/wiki?DeclareVariablesAtFirstUse

Also, Steve McConnell (I can't remember which book, but I think its Code Complete) is where I first remember reading about the benefits of minimizing the distance between declaration and use. (The gist of his argument if I remember correctly, is that it reduces the amount of information one needs to have to understand what's going on in a section of code).

Posted by: Bill Caputo at February 19, 2006 4:24 PM

Oooh - what a great idea for an intention action! (I think they're the words you were looking for). You should submit the suggestion to the forums for IntelliJ and Eclipse.

Posted by: Marty Andrews at February 20, 2006 6:27 AM

This is already a built in intention in IntelliJ - if you turn it on, it will híghlight the declarations of parts, source and destination as you're editing - you can then just right-click or Alt-Enter on them and select "Narrow scope of Variable".
Alternatively, you can scan your entire project and fix all occurrences.

Posted by: Niels at February 20, 2006 3:17 PM

Sagar said, "In loops with large number of iterations (say a million), GC needs to reclaim only 1 object as opposed to million objects if the object is defined inside the loop."

This is not true in general, because we're talking about declaring a reference on the stack, not an Object on the heap. In the compiled bytecode it makes no difference where your variables are declared, only how many different new Objects you assign to them.

To illustrate, the following loop allocates zero objects (no 'new' calls):
for(int i = 0; i < 1000; i++) {
Object t = Boolean.TRUE;
Object f = Boolean.FALSE;
}

But this bad code allocates 2000 objects, because there's Objects being allocated on the heap within the loop:
for(int i = 0; i < Integer.MAX_VALUE; i++) {
Object t = new Boolean(true);
Object f = new Boolean(false);
}

Posted by: Jesse Wilson at February 20, 2006 4:12 PM

There is an exception that Sagar has reminded me of; if there is a value that is calculated but invariant, but used only inside the loop, then I'll still declare and assign it outside the loop. Not necessarily for efficiency, but to make it clear that it is invariant (I might also declare it final if I'm feeling really fussy).

As far as his point (2) goes - it is because I want to know that a variable is not used after the loop that it is declared inside the loop. Unless it's a loop invariant (and not assigned within the loop), when I see a variable declared outside a loop it immediately makes me think it must be used outside the loop, otherwise, why wasn't it declared inside the loop? That then can put me off an "extract method" refactoring without spending more time reading the code than if it had been declared inside the loop.

Posted by: ivan at February 21, 2006 9:53 AM

Thanks Bill for the pointer.

Jesse, I was somehow thinking that declaring new variable references every time in the loop creates more work for GC. Thanks for explaining.

I was also concerned that there may be stack overflows with many variable references getting created for each iteration in the loop. After reading about JVM stack frames and variable allocation (JVM Spec §3.6) - I am convinced that declaring variables inside the loop makes total sense.

Thank you Ivan for posting this, very helpful. I am changed!

Looking forward to more..

Posted by: Sagar at February 21, 2006 4:33 PM

In Demetra builds, Intellij has gone further to have both "Move assignment to initializer" and "Convert to Local".

I've seen the former when initializing after declaration, but not in a loop; and the latter when declaring outside the loop.

To further Ivan's post, I actually believe most everytime you have a loop of any complexity, the internals should be a method. To the point that I don't even code there, I just declare a method passing the object of this loop (aka "line"), adding vars to the sig as I need them.

Not having a bunch of variables lying around helps in a lot of subtle ways, like enforcing the Law of Demeter, better encapsulation, ....

Posted by: RedBeard at February 21, 2006 7:57 PM

I prefer declaring them as deeply nested as possible. In this example you can even make the variables final, which is another advantage.

Posted by: Wouter Lievens at February 22, 2006 8:23 AM