We are running nearly 700k jobs a day (mostly test projects). Due to hardware issues such as unstable machines, some of these jobs or test projects fail.
How can I write Gro