You are going on a one-way indirect flight trip that includes billions an unknown very large number of transfers.
If you assume a joinable list structure that can store everything (probably on disk):
O(n)
time. As for space, the birthday paradox (or something much like it) will keep your data set a lot smaller than the full set. In the bad luck case where it still gets to large (worst case is O(n)
), you can evict random runs from the hash table and insert them at the end of the processing queue. Your speed could go to pot but as long as you can far excede the threashold for expecting collisions (~O(sqrt(n))
) you should expect to see your dataset (the tables and input queue combined) regularly shrink.
Construct a hashtable and add each airport into the hash table.
<key,value> = <airport, count>
Count for the airport increases if the airport is either the source or the destination. So for every airport the count will be 2 ( 1 for src and 1 for dst) except for the source and the destination of your trip which will have the count as 1.
You need to look at each ticket at least once. So complexity is O(n).
Put in two Hashes: to_end = src -> des; to_beg = des -> src
Pick any airport as a starting point S.
while(to_end[S] != null)
S = to_end[S];
S is now your final destination. Repeat with the other map to find your starting point.
Without properly checking, this feels O(N), provided you have a decent Hash table implementation.
Construct two hash tables (or tries), one keyed on src and the other on dst. Choose one ticket at random and look up its dst in the src-hash table. Repeat that process for the result until you hit the end (the final destination). Now look up its src in the dst-keyed hash table. Repeat the process for the result until you hit the beginning.
Constructing the hash tables takes O(n) and constructing the list takes O(n), so the whole algorithm is O(n).
EDIT: You only need to construct one hash table, actually. Let's say you construct the src-keyed hash table. Choose one ticket at random and like before, construct the list that leads to the final destination. Then choose another random ticket from the tickets that have not yet been added to the list. Follow its destination until you hit the ticket you initially started with. Repeat this process until you have constructed the entire list. It's still O(n) since worst case you choose the tickets in reverse order.
Edit: got the table names swapped in my algorithm.
Each airport is a node
. Each ticket is an edge
. Make an adjacency matrix to represent the graph. This can be done as a bit field to compress the edges. Your starting point will be the node that has no path into it (it's column will be empty). Once you know this you just follow the paths that exist.
Alternately you could build a structure indexable by airport. For each ticket you look up it's src
and dst
. If either is not found then you need to add new airports to your list. When each is found you set a the departure airport's exit pointer to point to the destination, and the destination's arrival pointer to point to the departure airport. When you are out of tickets you must traverse the entire list to determine who does not have a path in.
Another way would be to have a variable length list of mini-trips that you connect together as you encounter each ticket. Each time you add a ticket you see if the ends of any existing mini-trip match either the src or dest of you ticket. If not, then your current ticket becomes it's own mini-trip and is added to the list. If so then the new ticket is tacked on to the end(s) of the existing trip(s) that it matches, possibly splicing two existing mini-trips together, in which case it would shorten the list of mini-trips by one.
A hash table won't work for large sizes (such as the billions in the original question); anyone who has worked with them knows that they're only good for small sets. You could instead use a binary search tree, which would give you complexity O(n log n).
The simplest way is with two passes: The first adds them all to the tree, indexed by src. The second walks the tree and collects the nodes into an array.
Can we do better? We can, if we really want to: we can do it in one pass. Represent each ticket as a node on a liked list. Initially, each node has null values for the next pointer. For each ticket, enter both its src and dest in the index. If there's a collision, that means that we already have the adjacent ticket; connect the nodes and delete the match from the index. When you're done, you'll have made only one pass, and have an empty index, and a linked list of all the tickets in order.
This method is significantly faster: it's only one pass, not two; and the store is significantly smaller (worst case: n/2 ; best case: 1; typical case: sqrt(n)), enough so that you might be able to actually use a hash instead of a binary search tree.