What changes when your input is giga/terabyte sized?

无人共我 2021-01-30 14:54

I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several …)

4 answers
  •  忘掉有多难
    2021-01-30 15:25

    In a nutshell, the main differences IMO:

    1. You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address it. I/O is quite frequently the bottleneck.
    2. The choice and fine-tuning of an algorithm often dominates any other decision.
    3. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
    4. Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
    5. Pre-computing results and storing them can be extremely effective (see the sketch after this list).
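
    A minimal sketch of point 5 in Python (the cache file name and the frequency computation are stand-ins, not anything from the original post): compute an expensive summary once, pickle it to disk, and reload it on later runs.

        import os
        import pickle
        from collections import Counter

        CACHE = "allele_freqs.pkl"   # hypothetical cache file name

        def allele_frequencies(rows):
            """Stand-in for an expensive pass over the full data set."""
            counts = Counter(allele for row in rows for allele in row)
            total = sum(counts.values())
            return {a: n / total for a, n in counts.items()}

        def cached_allele_frequencies(rows):
            # Compute once, store the result, reload it on every later run.
            if os.path.exists(CACHE):
                with open(CACHE, "rb") as f:
                    return pickle.load(f)
            freqs = allele_frequencies(rows)
            with open(CACHE, "wb") as f:
                pickle.dump(freqs, f)
            return freqs

        print(cached_allele_frequencies([["A", "T"], ["A", "A"]]))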

    Bandwidth and I/O

    Initially, bandwidth and I/O are often the bottleneck. To give you a sense of scale: at the theoretical limit of SATA 3, it takes about 30 minutes just to read 1 TB. If you need random access, need to read the data several times, or need to write, you want to do this in memory most of the time, or you need something substantially faster (e.g. iSCSI over InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel from different processes, or HDF5 on top of MPI-2 I/O, is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two comes "for free".
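
    As a toy illustration of parallel I/O from Python, assuming hypothetical shard files (written by the demo itself so that it runs): one worker process per file keeps several requests in flight at once.

        import os
        from multiprocessing import Pool

        def scan(path):
            # Stand-in for any per-file pass (parsing, filtering, summarizing).
            total = 0
            with open(path, "rb") as f:
                while chunk := f.read(1 << 20):   # stream in 1 MiB chunks
                    total += len(chunk)
            return path, total

        if __name__ == "__main__":
            # Hypothetical shard files, written here only so the demo is runnable.
            paths = [f"shard_{i}.dat" for i in range(4)]
            for p in paths:
                with open(p, "wb") as f:
                    f.write(os.urandom(1 << 20))

            # One worker per file keeps several I/O requests in flight at once,
            # which is what gets you near the interface's rated bandwidth.
            with Pool(processes=4) as pool:
                for path, n in pool.imap_unordered(scan, paths):
                    print(path, n)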

    Clusters

    Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can distribute your tasks effectively (e.g. with MapReduce). This might require totally different algorithms than the typical textbook examples. Development time spent here is often the best time spent.
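
    A MapReduce-style sketch in plain Python, with toy stand-in data: partial counts are computed in parallel (map) and merged at the end (reduce). On a real cluster the same shape of program would run under Hadoop, Spark, or MPI rather than multiprocessing.

        from collections import Counter
        from multiprocessing import Pool

        def map_chunk(lines):
            """Map step: count alleles in one chunk of rows (toy format)."""
            c = Counter()
            for line in lines:
                c.update(line.split())
            return c

        def reduce_counts(counters):
            """Reduce step: merge the partial counts into one result."""
            total = Counter()
            for c in counters:
                total.update(c)
            return total

        if __name__ == "__main__":
            rows = ["A T T G", "A A C G", "T T C G"] * 1000   # toy stand-in data
            chunks = [rows[i::4] for i in range(4)]           # split work 4 ways
            with Pool(4) as pool:
                partial = pool.map(map_chunk, chunks)         # map in parallel
            print(reduce_counts(partial).most_common(2))      # reduce locally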

    Algorithms

    In choosing between algorithms, the big O of an algorithm is very important, but algorithms with similar big O can differ dramatically in performance depending on locality. The less local an algorithm is (i.e. the more cache and main-memory misses it incurs), the worse it will perform: access to storage is orders of magnitude slower than access to main memory. Classical examples of locality improvements are tiling for matrix multiplication and loop interchange, sketched below.
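
    A sketch of the tiling idea in pure Python (illustrative only: interpreter overhead hides the cache effect here, but this is the same blocking pattern that pays off by large factors in C or Fortran):

        import random

        def matmul_tiled(A, B, n, tile=64):
            """Blocked n x n matrix multiply over lists of lists.

            Blocking keeps the three sub-blocks being worked on resident
            in cache instead of streaming whole rows/columns repeatedly.
            """
            C = [[0.0] * n for _ in range(n)]
            for ii in range(0, n, tile):           # origin of the C/A row block
                for kk in range(0, n, tile):       # origin of the inner block
                    for jj in range(0, n, tile):   # origin of the C/B column block
                        for i in range(ii, min(ii + tile, n)):
                            for k in range(kk, min(kk + tile, n)):
                                a_ik = A[i][k]     # hoisted; reused across j loop
                                for j in range(jj, min(jj + tile, n)):
                                    C[i][j] += a_ik * B[k][j]
            return C

        n = 128
        A = [[random.random() for _ in range(n)] for _ in range(n)]
        B = [[random.random() for _ in range(n)] for _ in range(n)]
        C = matmul_tiled(A, B, n)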

    Computer, Language, Specialized Tools

    If your bottleneck is I/O, algorithms for large data sets can benefit from more main memory (e.g. a 64-bit address space) or from programming languages / data structures with lower memory consumption (e.g., in Python, __slots__ might be useful), because more memory can mean less I/O per unit of CPU time. By the way, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
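
    A minimal illustration of the __slots__ point (class and field names made up for the example): a slotted class drops the per-instance __dict__, which saves real memory once you hold millions of small objects.

        import sys

        class Plain:
            def __init__(self, pos, allele):
                self.pos = pos
                self.allele = allele

        class Slotted:
            __slots__ = ("pos", "allele")   # no per-instance __dict__
            def __init__(self, pos, allele):
                self.pos = pos
                self.allele = allele

        p, s = Plain(42, "A"), Slotted(42, "A")
        print(sys.getsizeof(p.__dict__))  # per-instance dict overhead, in bytes
        print(hasattr(s, "__dict__"))     # False: attributes live in fixed slots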

    Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
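
    From Python, the usual route to such hardware features is a compiled, vectorized library. A rough comparison (array size arbitrary) of an interpreted loop against NumPy's compiled, typically SIMD-vectorized kernel:

        import time
        import numpy as np

        data = np.random.rand(1_000_000)

        t0 = time.perf_counter()
        total = 0.0
        for x in data:                 # one interpreted multiply-add per element
            total += x * x
        t1 = time.perf_counter()

        total_np = np.dot(data, data)  # compiled kernel; typically SIMD-vectorized
        t2 = time.perf_counter()

        print(f"loop: {t1 - t0:.3f}s  numpy: {t2 - t1:.3f}s")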

    How you locate and access data, and how you store metadata, can be very important for performance. You will often use flat files or domain-specific, non-standard packages to store data (i.e. not a relational database directly) that let you access the data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The pyTables you mention would be another example.
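
    A small PyTables/HDF5 sketch along those lines, assuming PyTables is installed; the file and node names, shape, and compression settings are illustrative (the shape mirrors the question's 48000-by-1600 file). Chunked, compressed storage lets you read just the slice you need:

        import numpy as np
        import tables   # PyTables: HDF5 with chunked, compressed access

        with tables.open_file("haplotypes.h5", mode="w") as h5:
            filters = tables.Filters(complevel=5, complib="blosc")
            arr = h5.create_carray(h5.root, "haps",
                                   atom=tables.UInt8Atom(),
                                   shape=(1600, 48000),
                                   filters=filters)
            # Toy stand-in data; a real pipeline would stream this in.
            arr[:] = np.random.randint(0, 2, size=(1600, 48000), dtype=np.uint8)

        with tables.open_file("haplotypes.h5", mode="r") as h5:
            # Read only one column; only the chunks touched are decompressed.
            column = h5.root.haps[:, 1234]
            print(column.shape, column.sum())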
