Write a spider which reads in every html from disk and outputs every "href" attribute of an "a" element (can be done with a parser). Keep in mind which links belong to a certain page (this is common task for a MultiMap datastructre). After this you can produce a mapping file which acts as the input for the 404 handler.