I came across the following approach to transposing a matrix in place (pseudocode):
For square matrices:
    for n = 0 to N - 2
        for m = n + 1 to N - 1
            swap A(n,m) with A(m,n)
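As a rough illustration, the square case could be written in Python like this. It is only a sketch: the function name `transpose_square` and the list-of-lists representation are my own assumptions, not part of the pseudocode.

```python
def transpose_square(a):
    """Transpose a square matrix (list of equal-length lists) in place."""
    n = len(a)
    for i in range(n - 1):            # n = 0 .. N - 2 in the pseudocode
        for j in range(i + 1, n):     # m = n + 1 .. N - 1 in the pseudocode
            # Swap the element above the diagonal with its mirror below it.
            a[i][j], a[j][i] = a[j][i], a[i][j]


m = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
transpose_square(m)
print(m)
```

Only the elements strictly above the diagonal are visited, so each pair is swapped exactly once.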
For rectangular matrices:
    for each length>1 cycle C of the permutation
        pick a starting address s in C
        let D = data at s
        let x = predecessor of s in the cycle
        while x ≠ s
            move data from x to successor of x
            let x = predecessor of x
        move data from D to successor of s
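The cycle-following idea above might look like this in Python. This is a minimal sketch under some assumptions of mine: the matrix is stored row-major in a flat list, the names `transpose_flat`, `succ` and `pred` are hypothetical, and cycles are found with a `visited` list. That list costs one flag per element, so the sketch is simpler than a truly constant-space version.

```python
def transpose_flat(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list a.

    Afterwards a holds the cols x rows transpose, also row-major.
    """
    size = rows * cols

    def succ(i):
        # Address that the element currently at address i must move to.
        return (i % cols) * rows + i // cols

    def pred(j):
        # Address whose element must end up at address j.
        return (j % rows) * cols + j // rows

    visited = [False] * size  # assumption: extra storage used to find cycles
    for s in range(size):
        if visited[s]:
            continue
        d = a[s]                  # let D = data at s
        visited[s] = True
        x = pred(s)               # let x = predecessor of s in the cycle
        while x != s:
            a[succ(x)] = a[x]     # move data from x to successor of x
            visited[x] = True
            x = pred(x)
        a[succ(s)] = d            # move data from D to successor of s


data = [1, 2, 3,
        4, 5, 6]                  # 2 x 3 matrix
transpose_flat(data, 2, 3)
print(data)                       # now a 3 x 2 matrix, row-major
```

Cycles of length 1 (for example the first and last elements) fall through the `while` loop untouched, which matches the pseudocode skipping them.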
For more details, see the Wikipedia article on in-place matrix transposition: http://en.wikipedia.org/wiki/In-place_matrix_transposition