At Rapleaf, we periodically have a need to copy a large set of files F_1, F_2, ... , F_k of sizes S_1, S_2, ..., S_k, respectively, in a large distributed filesystem. To speed this process up, we assign M identical machines to work on the copying.
The nature of our filesystem setup lets us assume that the filesystem reads and writes are not slowed down due to filesystem load (so the time it takes us to copy a file is a constant that doesn't depend on the number of files being processed concurrently). It also lets us assume that the time it takes to copy a file is exactly proportional to its size.
Each machine can only copy one file at a time, however. We want to find a way to allocate file-copying jobs to machines that keeps the total time we have to wait until all jobs are finished low.
Describe an approach to solving the problem. Note that since we have a large number of files of many different sizes, the speed of the algorithm that allocates copy-jobs to machines must run quickly enough to be feasible (and certainly in polynomial time with respect to k and M). It's okay for your algorithm to be suboptimal as long as you have an argument as to why it is likely to be "pretty good" in practice.
It is sometimes important to have a performance guarantee on the time it takes to finish all the copying jobs. If we let T be the time it takes to finish all the copy-jobs when they are allocated optimally, and T' the time it takes to finish when they are allocated using the algorithm you described in (a), can you find some upper bound for T'/T ? Ideally this ratio would be a small constant, say, less than 3.
Please send your solution (and your resume if you have one ready) to challenge@rapleaf.com.

Follow