
bench_isotonic.py

3753563c · ENH - Improve performance of isotonic regression
    Andrew Tulloch authored
This supersedes #2940.
    
PAVA can run in linear time, but the existing implementation scaled
approximately quadratically, because the array pop in its inner loop
is O(N).
    
I implemented the O(N) version of PAVA using the decreasing-subsequences
trick, and benchmarking shows it is significantly faster.
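A minimal sketch of the linear-time idea (function name hypothetical; this is not the committed Cython code): keep a stack of already-pooled blocks, and merge an incoming point backwards while it violates monotonicity. Each point is pushed and popped at most once, so the total work is O(N).

```python
import numpy as np

def pava_sketch(y, w=None):
    """Linear-time Pool Adjacent Violators for a non-decreasing fit.

    Maintains a stack of pooled blocks (mean, weight, count); each new
    point is merged backwards while it violates monotonicity.  Every
    point is pushed and popped at most once, so total work is O(N).
    """
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    means, weights, counts = [], [], []
    for yi, wi in zip(y, w):
        m, wt, c = float(yi), float(wi), 1
        # Merge with the previous block while it violates monotonicity.
        while means and means[-1] > m:
            pm, pw, pc = means.pop(), weights.pop(), counts.pop()
            m = (pw * pm + wt * m) / (pw + wt)
            wt += pw
            c += pc
        means.append(m)
        weights.append(wt)
        counts.append(c)
    # Expand the pooled blocks back to one value per input point.
    return np.concatenate([np.full(c, m) for m, c in zip(means, counts)])
```

For example, `pava_sketch([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean, giving `[1.0, 2.5, 2.5, 4.0]`.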
    
The benchmarks in this diff (adapted from one of the existing unit
tests) show significant improvements: roughly 7x faster for problems
of size ~1,000, ~30x faster at size 10,000, and ~250x faster at size
100,000.
    
On correctness: the unit tests cover the isotonic regression code fairly
well, and all of them pass before and after the change. PAVA is a
well-known algorithm with many existing implementations, so I believe
this version is correct. While coding it up I made some mistakes that
the unit tests caught, which gives me more confidence in the result.
Still, the performance improvements are surprisingly large.
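As a quick, self-contained illustration of that kind of cross-checking (helper names hypothetical; this is not the project's test suite): a deliberately quadratic PAVA that pools with `list.pop`, as in the old code path, should agree exactly with a stack-based linear-time version on random inputs.

```python
import numpy as np

def pava_naive(y):
    """Quadratic PAVA: repeatedly pool adjacent violators via list.pop."""
    vals = [float(v) for v in y]
    wts = [1.0] * len(vals)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:
            # Pool the violating pair into its weighted mean, then back up,
            # since the merge may create a new violation to the left.
            pooled = (wts[i] * vals[i] + wts[i + 1] * vals[i + 1]) / (wts[i] + wts[i + 1])
            vals[i] = pooled
            wts[i] += wts[i + 1]
            vals.pop(i + 1)  # O(N) pop -- the source of the quadratic scaling
            wts.pop(i + 1)
            i = max(i - 1, 0)
        else:
            i += 1
    out = []
    for v, c in zip(vals, wts):
        out.extend([v] * int(c))
    return np.array(out)

def pava_fast(y):
    """Linear-time PAVA using a stack of pooled (mean, weight, count) blocks."""
    means, weights, counts = [], [], []
    for yi in map(float, y):
        m, wt, c = yi, 1.0, 1
        while means and means[-1] > m:
            pm, pw, pc = means.pop(), weights.pop(), counts.pop()
            m = (pw * pm + wt * m) / (pw + wt)
            wt += pw
            c += pc
        means.append(m)
        weights.append(wt)
        counts.append(c)
    return np.concatenate([np.full(c, m) for m, c in zip(means, counts)])

# The two implementations should agree exactly on random problems.
rng = np.random.RandomState(0)
for _ in range(100):
    y = rng.randn(rng.randint(1, 50))
    assert np.allclose(pava_naive(y), pava_fast(y))
```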
    
Added a benchmark script. For example usage, run:
    
    ```
    python benchmarks/bench_isotonic.py --iterations 10 --log_min_problem_size 2 --log_max_problem_size 8 --dataset logistic
    ```