Finding minimum qualified subset in Haskell

27 04 2010

I recently stumbled upon an interesting blog post about efficiently selecting a minimal subset of a potentially large set. It references a paper that discusses different ways of enumerating subsets. The crux is to select k items out of n. One classic approach is lexicographical order, enumerating all n-bit integers from 0..2n-1. Each integer maps to a subset made by including from the n items those whose corresponding bit in the integer is 1. Another variant uses Gray code. The paper focuses on a different approach, called the Banker’s Sequence, a straightforward algorithm to generate all subsets, working from smallest up to largest.

The blog post discusses the challenge of creating a function banker(n, k, i) that returns an n-bit binary vector representing the ith subset of length k. The binary vector will have k 1 bits and n-k 0 bits. The author of the blog post provides a Java implementation capable of quickly generating vectors for specified positions.

I thought it would be interesting to experiment with this in Haskell. Here is what I came up with:

import Data.Foldable (foldl')

-- factorial of n
fact n = foldl' (*) 1 [1..n]

-- # of combinations of n things, taken k at a time (AKA binomial coefficient)
comb :: Integer -> Integer -> Integer
comb n k = (foldl' (*) 1 [(n-k+1)..n]) `div` fact k

-- return boolean vector for ith subset of size k from total n items
banker :: Int -> Int -> Integer -> [Int]
banker n k i = take n $ bankerR n k i [] ++ repeat 0
bankerR n k i acc = let k' = k - 1
                        (prefix, rangeLow, n') = bankpart n k' i
                        acc' = acc ++ prefix
                        i' = i - rangeLow
                    in if k' == 0
                         then acc'
                         else bankerR n' k' i' acc'

bankpart :: Int -> Int -> Integer -> ([Int], Integer, Int)
bankpart n k i = let !rangeHi = comb (fromIntegral (n-1)) (fromIntegral k)
                 in bankR1 n (n-k) 0 rangeHi [1]
  where
    bankR1 n nminusk rangeLow rangeHi acc =
      let n' = n - 1
          nminusk' = nminusk - 1
          rangeHi' = (rangeHi * (fromIntegral nminusk)) `div` (fromIntegral n)
          rangeLow' = rangeLow + rangeHi
          acc' = 0:acc
      in if i < rangeLow'
        then (acc, rangeLow, n')
        else bankR1 n' nminusk' rangeLow' rangeHi' acc'

This is a fairly straightforward implementation of the algorithm suggested on the blog post. I started to think of ways to cleanup/improve the code. Some ideas include returning a Bool vector rather than 1s and 0s, consider collapsing the double recursion into a single recursive loop, and consider adding bang patterns to force strict evaluation for possible speed improvement. I actually had bang patterns in initially (a habit I got in while solving Project Euler problems in Haskell), but took them out and didn’t notice much performance difference, although I wasn’t measuring.

Then I looked over the PDF paper again and realized the whole point to this exercise is to efficiently iterate over the subsets, finding the minimal subset that satisfies a certain criterion. I realized that implementing banker(n, k, i) (and presumably needing to wrap that in a couple loops, including one that varies k from 0..n) is not thinking about this in a good lazy functional programming way. Far better to directly provide a lazy list of all subsets, from smallest to largest, then filter this with a predicate, and then the head of the resulting list is the minimal subset we are looking for.

So here’s what I came with for this:

import Data.Maybe (listToMaybe)

-- return smallest sublist of l satisfying filter predicate
minFilterSubset :: ([a] -> Bool) -> [a] -> Int -> Maybe [a]
minFilterSubset p l minLen = listToMaybe $ filterSubsets p l minLen

-- return sub-lists of l, in increasing order of size, that satisfy predicate
filterSubsets :: ([a] -> Bool) -> [a] -> Int -> [[a]]
filterSubsets p l minLen = filter p $ subsets l minLen

-- return sub-lists of l, in increasing order of size
subsets :: [a] -> Int -> [[a]]
subsets l minLen = let maxlen = length l
                       subsetsR len = if len > maxlen
                                        then []
                                        else subsetsRR l len ++ subsetsR (len+1)
                   in subsetsR minLen
subsetsRR _ 0 = [[]]
subsetsRR [] _ = []
subsetsRR (l:ls) curlen = (map (l:) $ subsetsRR ls (curlen-1))
                              ++ subsetsRR ls curlen

No need for factorials or really any numerical manipulation. This seems to operate reasonably quickly. I initially wrote it without the minLen argument since in principle it is not necessary: the predicate can test the length. But for large values of n, if you have a good idea of non-zero lower bound on subset size, it can save a lot of time, and it seems fairly natural to include minLen as an argument.

This does not allow you to immediately jump immediately to the 160,000,000,000,000,000,000,000,000,000th subset of 12 items chosen out of 10,000. But do we really want that? Supposedly we’re looking for the minimal subset, so we’ll be iterating over these subsets, from smallest to largest, looking for the first one that satisfies some criteria. The Haskell implementation implements this directly, cleanly, and elegantly. If I get a chance, I will try to set up some test scenarios, that incorporate the actual iteration over subsets and quantify the performance of the Haskell code with the Java code.

Is any of this new or novel? I doubt it. The PDF paper and blog refers to the Banker’s sequence as though it is a best-kept secret, however the recursive algorithm seems so obvious, and reminiscent of permutation generators, that I would not be surprised to find this already implemented in some library, although maybe not referring to it as Banker’s sequence. If you know of this implemented in some Haskell library, please let me know.

Addendum 5/4/2010

I noticed back in the original blog entry and comments that Calvin Miracle actually wants the maximal subset satisfying the criteria, not the minimal. So I thought about how to accommodate that without a lot of repeated code. This seems to do the trick, and allows you to specify a minimum or maximum size to start with, depending upon which direction you are coming from. Still seems like duplicated code, could probably be factored out to use min, max, and delta, however it’s not clear that would be better. What do you think?

import Data.Maybe (listToMaybe)

-- return smallest sublist of l satisfying filter predicate
minFilterSubset :: ([a] -> Bool) -> [a] -> Int -> Maybe [a]
minFilterSubset p l minLen = listToMaybe $ filter p $ subsetsSmallToLarge l minLen

-- return largest sublist of l satisfying filter predicate
maxFilterSubset :: ([a] -> Bool) -> [a] -> Int -> Maybe [a]
maxFilterSubset p l maxLen = listToMaybe $ filter p $ subsetsLargeToSmall l maxLen

-- return sub-lists of l, in increasing order of size
subsetsSmallToLarge :: [a] -> Int -> [[a]]
subsetsSmallToLarge l minLen = concatMap (subsetsR l) [minLen..length l]

-- return sub-lists of l, in decreasing order of size, from maxLen to 0
subsetsLargeToSmall :: [a] -> Int -> [[a]]
subsetsLargeToSmall l maxLen = concatMap (subsetsR l) [maxLen, maxLen-1..0]

subsetsR _ 0 = [[]]
subsetsR [] _ = []
subsetsR (l:ls) curlen = (map (l:) $ subsetsR ls (curlen-1)) ++ subsetsR ls curlen
Advertisement

Actions

Information

One response

30 06 2010
Calvin Miracle

Hi Alex,

I figure that a *maximal subset* that IS selected is the same as a *minimal subset* that is NOT selected.. In other words, invert the bits in the boolean vector.. This reflects the symmetry of the C(n, k) function, so your function can count either UP or DOWN..

– Cal in Louisville

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s




Follow

Get every new post delivered to your Inbox.