CPU profiling

A question about performance benchmarks came up. In case there is interest in the current performance hogs in the code, I did a quick run in a profiler, using the following steps:

preconditions:

  • use the Stalingrad '42 module, a module with a fairly big map and a few hundred counters
  • use the current master branch with the latest commits
  • make sure the tiling process is complete for this module, as that is the common scenario in everyday use

profiling:

  • start the Player, pass it the module and -standalone as arguments
  • select one of the premade scenarios in the wizard; this should be very similar to opening an existing savegame
  • wait until the map fully loads; this takes a long time, which is good for profiling
  • once the map loads up, scroll scroll scroll, zoom zoom zoom, and watch the map tiles appear on the screen one by one; this is also a bit slow at times, which is also good for profiling
  • move some counters around, move some stacks around, do some right-click actions on some counters; this went rather quickly, so it probably has little impact on CPU time
  • close the Player (I should have saved the game to profile the save code as well, I know, but for a user the time until the game loads and they can start moving pieces around probably matters more than the performance of the save code)

Results are in the image below; the main performance hogs seem to be:

  • VASSAL.counters.PieceCloner.clonePiece() - 17% + 14% of total CPU time (17% in the KeyEvent recursion, 14% in the GlobalKeyCommand/GlobalCommand subtree)
  • VASSAL.counters.Decorator.getType() - 8%
  • VASSAL.build.module.Map.StackMetrics.draw() - 8%
  • VASSAL.build.module.BasicCommandEncoder.createPiece() - 10%
  • VASSAL.tools.imageop.ScaleOpTiledBitmapImpl$TileOp.eval() - 10%

I think it is safe to say that PieceCloner.clonePiece() is the big bad guy here; if CPU performance is a concern, then optimization efforts should be aimed at this one method. It might make sense to start by splitting it up into several small methods, as the JVM can optimize those much better. Maybe also replace the recursion with iteration. The C++ devs among us probably have even more ideas on how to make this code faster.

The image is quite big and the forums won’t let me upload it, so it’s here: drive.google.com/file/d/12lXK-y … sp=sharing

I’m not sure investing time in optimizing the PieceCloner is worth it. Its primary purpose is to create new Pieces that are duplicates of an existing piece or piece definition. It is an inherently expensive process and there is a limit to what you will be able to achieve.

A better approach is to look at why it is being called so often after a game has started.

Piece cloning should only happen relatively rarely, at game start when pieces are created, or when pieces are created manually by players (drag and drop from palette, replace etc.).

The real culprit is this little gem:

piece.setProperty(Properties.SNAPSHOT, PieceCloner.getInstance().clonePiece(piece));

which is called all over the place to create a backup copy of the current piece before doing something to it.

The only place that this backup is used is in the ReportState trait to report on the previous values of properties maintained by this piece. In early Vassal, this was the only way to save the property values from an earlier instance of a piece.

In 2009(!), we changed Decorator (and BasicPiece) to implement the PropertyNameSource interface which allows every Decorator to expose the names of the properties they maintain. I use this in the Beanshell expression builder to create lists of properties that can be inserted into the current expression.

I expect it will be FAR more efficient to extract the exposed property values into a Map and save that as the SNAPSHOT to use in the ReportState trait, than cloning the piece. You will also need to review the existing Decorator.getPropertyNames() implementations to make sure they are up to date.
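
For illustration, here is a minimal sketch of what such a snapshot could look like. The helper class and its capture() method are made up for this example (they are not VASSAL code), and the import paths for the VASSAL classes are from memory:

import java.util.HashMap;
import java.util.Map;

import VASSAL.build.module.properties.PropertyNameSource;
import VASSAL.counters.GamePiece;

// Hypothetical helper, not part of VASSAL: collects the values of the
// properties a piece exposes via PropertyNameSource into a plain Map,
// as a cheaper SNAPSHOT substitute for cloning the whole piece.
public final class PropertySnapshot {
  private PropertySnapshot() {}

  public static Map<String, Object> capture(GamePiece piece) {
    final Map<String, Object> values = new HashMap<>();
    if (piece instanceof PropertyNameSource) {
      for (final String name : ((PropertyNameSource) piece).getPropertyNames()) {
        // Record the current value under its exposed name.
        values.put(name, piece.getProperty(name));
      }
    }
    return values;
  }
}

ReportState could then read old values out of a map like this, and the setProperty(Properties.SNAPSHOT, …) call sites could store the map instead of a clone.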

I think you will get far more improvement for substantially less work.

Rgds.

Thus spake Brent Easton:

In 2009(!), we changed Decorator (and BasicPiece) to implement the
PropertyNameSource interface which allows every Decorator to expose the
names of the properties they maintain. I use this in the Beanshell
expression builder to create lists of properties that can be inserted
into the current expression.

I expect it will be FAR more efficient to extract the exposed property
values into a Map and save that as the SNAPSHOT to use in the
ReportState trait, than cloning the piece. You will also need to review
the existing Decorator.getPropertyNames() implementations to make sure
they are up to date.

I think you will get far more improvement for substantially less work.

I did some work on this which never made it to a release. What I had was
a top-level piece wrapper containing a HashMap from property names to
getters and setters for the properties. What this does is change getting
or setting a property from being O(n) to O(1), which should make a
massive difference with pieces that have a long list of traits. (It
might be the single most significant optimization we could do for a
lot of modules.)

I still have it somewhere. I should dig that out.


J.

Sounds good! This leads to another question though: at the moment the PieceCloner works on GamePiece-type objects, but GamePiece is not related to PropertyNameSource. All types that implement GamePiece except for Stack and Deck also extend or implement PropertyNameSource. For this change, GamePiece would have to extend PropertyNameSource, but how would Stack and Deck implement this interface? Whose properties would they return: those of all their contents, just the properties of a single piece inside them, or no properties at all?

Another two slightly off-topic questions about these properties: are the property keys always of type String? Are the only possible property keys the ones defined in the Properties interface, and could they be changed into an enum?

For the generation of the SNAPSHOTS for the ReportState, the PieceCloner is only ever called with a Decorator, or potentially a BasicPiece (for a piece consisting of a BasicPiece trait only), both of which implement PropertyNameSource.

Even though the Inner piece of a Decorator is declared as a GamePiece, it can never be anything except a Decorator or a BasicPiece.

Sadly, a BasicPiece is not a Decorator, which will lead to some instanceof and casting ugliness somewhere.

There is no need to try and support Stacks or Decks.

Yes, the property names are always Strings. But the Properties interface only contains the property names that are commonly used by components outside a piece. There are many more properties used internally by various traits, and some of those names vary between modules because they are created by the module designer (case in point: Dynamic Property names). getPropertyNames() in the PropertyNameSource interface returns those names.

I have now made a rough proposal under PR #75 (github.com/vassalengine/vassal/pull/75). Can someone, maybe Brent, have a look at it and give me some feedback? The downcasting sure looks unsafe, but if we are 100% sure that only objects of the proper type ever pass through this code, then casting like this is ok.

Also:

You haven’t used reflection for this have you? It’s ok to look at the method names and deduce the property names based on JavaBean convention (getX setX isX etc) but most other kinds of reflection are evil.

And either way, can you dig that out if you find the time?

Thus spake Flint1b:

You haven’t used reflection for this have you? It’s ok to look at the
method names and deduce the property names based on JavaBean convention
(getX setX isX etc) but most other kinds of reflection are evil.

And either way, can you dig that out if you find the time?

I will, I just haven’t gotten to it yet.


J.

Thus spake Flint1b:

I have made a rough proposal now under PR #75
github.com/vassalengine/vassal/pull/75[1], can someone, maybe
Brent, have a look at this and give me some feedback? This downcasting
sure looks unsafe but if we are 100% sure that only objects of the
proper type ever pass through this code then casting like this is ok.

Of possible relevance here is what’s in the VASSAL.property package.
It contains some machinery for getting and setting properties in a
type-safe way which I added for 3.2 as a start for cleaning up property
handling, and then it never got used.

The two interfaces PropertyExporter extends, PropertyNameSource and
PropertySource, are from just before I started introducing generics,
and are really unfortunate in that the keys and values are Objects
even though we do know their types…

The VASSAL.property package was intended to address this. There’s
a lot of simplification to be had if it, or something like it, were
to be used (and everything converted to use it). There’s also support
for separating persistent state from ephemeral state, which could be
some help toward effecting model-view separation.
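
To make the generics point concrete, here is a rough, hypothetical illustration of typed property keys. This is not the actual contents of VASSAL.property; the PropertyKey class and its methods are invented for the example:

// Hypothetical illustration only; VASSAL.property's real classes may differ.
// The key carries its value type, so callers get checked access instead of
// casting raw Objects around.
public final class PropertyKey<T> {
  private final String name;
  private final Class<T> type;

  public PropertyKey(String name, Class<T> type) {
    this.name = name;
    this.type = type;
  }

  public String name() { return name; }

  // Narrow an untyped value (e.g. from GamePiece.getProperty()) to T,
  // failing fast with a ClassCastException if the types don't match.
  public T cast(Object value) { return type.cast(value); }
}

// Usage sketch:
//   static final PropertyKey<Boolean> SELECTED_KEY =
//     new PropertyKey<>(Properties.SELECTED, Boolean.class);
//   Boolean sel = SELECTED_KEY.cast(piece.getProperty(SELECTED_KEY.name()));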

Also:

“uckelman” wrote:

What I had was a top-level piece wrapper containing a HashMap from
property names to getters and setters for the properties. What this
does is change getting or setting a property from being O(n) to O(1),
which should make a massive difference with pieces that have a long
list of traits. (It might be the single most significant optimization
we could do for a lot of modules.)

I still have it somewhere. I should dig that out.

You haven’t used reflection for this have you? It’s ok to look at the
method names and deduce the property names based on JavaBean convention
(getX setX isX etc) but most other kinds of reflection are evil.

And either way, can you dig that out if you find the time?

I had a look around and turned up a few of my benchmark cases, but not
the code I had in mind. No matter, as I can still describe it:

Every trait class has a pair of getProperty() and setProperty() methods, which are more or less similar structurally. Here are the ones from Decorator:

@Override
public Object getProperty(Object key) {
  if (Properties.KEY_COMMANDS.equals(key)) {
    return getKeyCommands();
  }
  else if (Properties.INNER.equals(key)) {
    return piece;
  }
  else if (Properties.OUTER.equals(key)) {
    return dec;
  }
  else if (Properties.VISIBLE_STATE.equals(key)) {
    return myGetState() + piece.getProperty(key);
  }
  else if (Properties.SELECTED.equals(key)) {
    return selected;
  }
  else {
    return piece.getProperty(key);
  }
}

@Override
public void setProperty(Object key, Object val) {
  if (Properties.INNER.equals(key)) {
    setInner((GamePiece) val);
  }
  else if (Properties.OUTER.equals(key)) {
    dec = (Decorator) val;
  }
  /**
   * Cache Selection status and pass it on to all inner traits.
   */
  else if (Properties.SELECTED.equals(key)) {
    if (val instanceof Boolean) {
      setSelected((Boolean) val);
    }
    else {
      setSelected(false);
    }
    piece.setProperty(key, val);
  }
  else {
    piece.setProperty(key, val);
  }
}

These are painfully slow. Whenever you want to do a lookup of a
property, you have to do a number of string comparisons, which is O(n)
in the number of total properties—and not just the total number of
properties in the trait where your property is, but the total number of
properties in the whole stack of traits. This is tolerable when you
have simple pieces, but becomes a disaster when you have pieces with 50
traits (as could be seen in one of Brian’s stack traces a week or two
ago). If you’re unlucky and the property you want is in a trait near the
bottom of the stack, you will have to do hundreds of string comparisons
to get that value. The Gang of Four book should have included a note
that the Decorator pattern was not suitable for stacks of decorators as
deep as we have.

There are at least two ways the above could be improved: locally and globally.

Locally, one could put getters and setters for a trait into a map and
have O(1) access to properties within each trait:

// Each trait's property handles operate on the Decorator that declares them.
protected interface Getter {
  Object get(Decorator p, Object key);
}

protected interface Setter {
  void set(Decorator p, Object key, Object value);
}

// This is a map containing the getting code from the old getProperty()
protected static final Map<Object, Getter> getters = new HashMap<>();

// This is a map containing the setting code from the old setProperty()
protected static final Map<Object, Setter> setters = new HashMap<>();

static {
  getters.put(Properties.KEY_COMMANDS, (p, k) -> p.getKeyCommands());
  getters.put(Properties.INNER, (p, k) -> p.piece);
  getters.put(Properties.OUTER, (p, k) -> p.dec);
  getters.put(Properties.VISIBLE_STATE,
              (p, k) -> p.myGetState() + p.piece.getProperty(k));
  getters.put(Properties.SELECTED, (p, k) -> p.selected);

  setters.put(Properties.INNER, (p, k, v) -> p.setInner((GamePiece) v));
  setters.put(Properties.OUTER, (p, k, v) -> p.dec = (Decorator) v);
  setters.put(Properties.SELECTED, (p, k, v) -> {
    // Cache selection status and pass it on to all inner traits.
    p.setSelected(v instanceof Boolean ? (Boolean) v : false);
    p.piece.setProperty(k, v);
  });
}

// … or one could have a Map<Object, Pair<Getter, Setter>> to avoid
// repetition

// These cover the default case, when the decorator delegates getting or
// setting to the piece it wraps
protected static final Getter delegatedGetter =
  (p, k) -> p.piece.getProperty(k);

protected static final Setter delegatedSetter =
  (p, k, v) -> p.piece.setProperty(k, v);

@Override
public Object getProperty(Object key) {
  return getters.getOrDefault(key, delegatedGetter).get(this, key);
}

@Override
public void setProperty(Object key, Object value) {
  setters.getOrDefault(key, delegatedSetter).set(this, key, value);
}

This ought to be quite a bit faster for property lookup. You still have
to traverse the chain of traits, but each step will be faster, since you
can avoid doing so many string comparisons at each step. (If you have a
stack of 20 traits, each with 9 properties and you have to go all the
way to the bottom to retrieve a property, it’s about 2.5x faster. It’s
still O(n), but now n is the number of traits rather than the number of
properties, so n is much smaller.) This also has the virtue of being
invisible from the outside.

The best solution, however, would be to store the outermost getter and
setter for each property in a map held by the outermost decorator. That
way, there would be completely flat constant-time access to all
properties. This isn’t as straightforward to do as what I described
above, as it’s not static—it depends on trait order, so would have to
be set up for each piece (or, if one wanted to be clever, piece type).
Basically what you’d do at creation time is traverse the trait chain and
grab the first getter/setter you find for each property as you walk
toward the end. It may be possible to manage this without breaking the
API as well, though doing so could be rather fiddly. I know of nothing
which would improve performance across all modules more than this
change.
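
To make that more concrete, here is a rough sketch of the creation-time walk. PieceIndex is a made-up name, the step down the chain reuses Properties.INNER (as in the getProperty() code above), and for brevity the index records the owning trait rather than a raw getter/setter, which already gives one hash lookup plus one trait-local dispatch per access. Names that aren't indexed (including combining properties like VISIBLE_STATE) fall back to the normal traversal:

import java.util.HashMap;
import java.util.Map;

import VASSAL.build.module.properties.PropertyNameSource;
import VASSAL.counters.Decorator;
import VASSAL.counters.GamePiece;
import VASSAL.counters.Properties;

// Hypothetical sketch, not VASSAL code: built once per piece (or piece type),
// it walks the trait chain outermost-first and records which trait answers
// for each exposed property name.
final class PieceIndex {
  private final GamePiece outermost;
  private final Map<Object, GamePiece> owners = new HashMap<>();

  PieceIndex(GamePiece outermost) {
    this.outermost = outermost;
    for (GamePiece p = outermost; p != null;
         p = p instanceof Decorator
               ? (GamePiece) p.getProperty(Properties.INNER)
               : null) {
      if (p instanceof PropertyNameSource) {
        for (final String name : ((PropertyNameSource) p).getPropertyNames()) {
          // The first (outermost) trait found for a name wins, matching the
          // current lookup order.
          owners.putIfAbsent(name, p);
        }
      }
    }
  }

  // One hash lookup plus one trait-local dispatch; unindexed names fall back
  // to the normal trait-chain traversal starting at the outermost piece.
  Object getProperty(Object key) {
    return owners.getOrDefault(key, outermost).getProperty(key);
  }

  void setProperty(Object key, Object value) {
    owners.getOrDefault(key, outermost).setProperty(key, value);
  }
}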


J.

Wow it sounds like just what the doctor ordered…

This sounds good in theory, but in practice, so far we have only determined the PieceCloner.clonePiece() method to be the one big performance hog. Once we remove most calls to it, we can measure again and go from there.

Remember one of the better things Donald Knuth wrote, “premature optimization is the root of all evil”, or the full quote: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

Some smartass professor at the uni has actually taught us this:

for (int i = 0; i < getSize(); i++) {
  // do something
}

… is better written like this, cause it only calls getSize() once:

int size = getSize();
for (int i = 0; i < size; i++) {
  // do something
}

This is (a) premature optimization, and (b) modern HotSpot VMs optimize this on their own if they see the need for it.

Thus spake Flint1b:

This sounds good in theory, but in practice, so far we have only
determined the PieceCloner.clonePiece() method to be the one big performance
hog. Once we remove most calls to it, we can measure again and go from
there.

I don’t know what else I can say other than that my claims are based on
having profiled this, in 2010 or so.

If you want the most egregious example I know of, try the VG Vietnam
1965-1975 module and check yourself.


J.

Yes, it will very much depend on the trait depth, the number of GamePieces on the Map, plus the rate at which they are being hit by GKCs. Try also the new scenario load and card play for Paths of Glory. (Sorry, Brian, but it runs like a dog.)

Haha, yes for sure, but an ancient dog loved by its community!

I tried the Stalingrad '42 module both before and after the PieceCloner changes, and didn’t see a noticeable difference.

I tried a bunch of actions on pieces for a minute or so; some, like Rail Movement, were catastrophically slow. VisualVM’s profiler isn’t showing me any time taken by PieceCloner.clonePiece(), but I’m getting 13% in Zone.getMutablePropertiesContainerId(), 10% in ChangePropertyCommandEncoder.encode(), almost 10% in DynamicProperty$1.buildChildConfigurer(), and 9% in Zone.getMap().

Exactly, this is why I don’t trust this profiler. There wouldn’t be any companies that earn money only by selling profilers if the VisualVM profiler was any good.

I just took a good old stopwatch and measured the following in Vassal 3.3.2 and PR 75:

  • start stalingrad 42
  • in wizard select “new game”, “campaign game”, wait until that loads
  • select “solitaire”, click “finish” and start the stopwatch
  • wait until the map is fully loaded and CPU usage goes down to idle levels
  • stop stopwatch

I got 1:20min for 3.3.2 versus 0:40 for PR 75.

And the results with “Paths of Glory” were similar; I had 2:20min versus 1:00min there.

Thus spake Flint1b:

“uckelman” wrote:

VisualVM’s profiler isn’t showing me any time taken by
PieceCloner.clonePiece()

Exactly, this is why I don’t trust this profiler. There wouldn’t be any
companies that earn money only by selling profilers if the VisualVM
profiler was any good.

That doesn’t follow. There are companies selling compilers, and there are
very good free compilers available…

I just took a good old stopwatch and measured the following in Vassal
3.3.2 and PR 75:

  • start stalingrad 42
  • in wizard select “new game”, “campaign game”, wait until that loads
  • select “solitaire”, click “finish” and start the stopwatch
  • wait until the map is fully loaded and CPU usage goes down to idle
    levels
  • stop stopwatch

I got 1:20min for 3.3.2 versus 0:40 for PR 75.

Ah. We were not timing the same things. I will check again.


J.

They have to add things on top to compete with the very good free stuff: usually support, a good IDE tailored to their compiler, features that the free compilers don’t have, licensing.

Thus spake Flint1b:

I got 1:20min for 3.3.2 versus 0:40 for PR 75.

I got 0:08.59 for master vs. 0:05.31 for PR 75. Yes, those are seconds,
that’s not a mistake.

As I’m getting times which are about 1/9 of yours, you can see why I didn’t
notice the difference.


J.

I know, I know, I need a faster computer. OTOH, this makes me a human performance profiler; maybe I should open a profiling company and offer my services :smiley:

So, VisualVM counted Decorator.getProperty() being called 5 MILLION TIMES while loading the Campaign scenario of Stalingrad ’42, and Decorator.getMap() was called another 2 million times. That could be largely avoided by having direct property access via the outermost piece; traversal of the trait stack is not free.