hpc

How Do I Attain Peak CPU Performance With Dot Product?

走远了吗. submitted on 2019-12-09 14:47:40
Question: Problem. I have been studying HPC, specifically using matrix multiplication as my project (see my other posts in profile). I achieve good performance in those, but not good enough. I am taking a step back to see how well I can do with a dot product calculation. Dot Product vs. Matrix Multiplication. The dot product is simpler, and will allow me to test HPC concepts without dealing with packing and other related issues. Cache blocking is still an issue, which forms my second question. Algorithm
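To make the two ideas in this excerpt concrete (a dot product, and processing it in cache-sized blocks), here is a minimal sketch in Python. It is not the asker's algorithm, which is cut off above. The names `blocked_dot`, `block`, and `lanes` are hypothetical, and the default block size is an arbitrary assumption. Pure Python will not approach peak CPU performance; the sketch only shows the loop structure that a compiled, vectorized version would typically follow: fixed-size blocks, and several independent partial sums inside each block.

```python
def blocked_dot(a, b, block=4096, lanes=4):
    """Dot product computed block by block with `lanes` independent partial sums.

    `block` and `lanes` are illustrative tuning knobs (assumptions), not
    values taken from the original question.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for start in range(0, len(a), block):
        stop = min(start + block, len(a))
        acc = [0.0] * lanes  # independent accumulators for this block
        i = start
        # Main loop: each accumulator handles one element per step.
        while i + lanes <= stop:
            for k in range(lanes):
                acc[k] += a[i + k] * b[i + k]
            i += lanes
        # Remainder loop for elements that do not fill a full step.
        while i < stop:
            acc[0] += a[i] * b[i]
            i += 1
        total += sum(acc)
    return total


print(blocked_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Keeping several accumulators breaks the single dependency chain on one running sum, which is the same reason unrolled compiled kernels use multiple vector registers. Note that it also changes the order of floating-point additions, so results can differ slightly from a naive left-to-right sum.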

Efficiently computing floating-point arithmetic hundreds of thousands of times in Bash

半城伤御伤魂 submitted on 2019-12-08 21:46:00
Question: Background. I work for a research institute that studies storm surges computationally, and I am attempting to automate some of the HPC commands using Bash. Currently, we download the data from NOAA and create the command file manually, line by line, entering the location of each file along with a time for the program to read the data from that file and a wind magnification factor. There are hundreds of these data files in each download NOAA produces, which come out every 6 hours

R: any faster R function than “tcrossprod” for symmetric dense matrix multiplication?

僤鯓⒐⒋嵵緔 submitted on 2019-12-08 13:20:35
Question: Let x = matrix(rnorm(1000000), nrow = 5000). I would like to compute the matrix product of x with its transpose: x %*% t(x). After googling, I found that a possibly faster way of doing this is tcrossprod(x), and the time taken is: user 2.975, system 0.000, elapsed 2.960. Is there any other R function that can do the task faster than the above function? Answer 1: No. At the R level this is already the fastest. But internally it calls the level-3 BLAS routine dsyrk . So if you can have a high performance BLAS
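As a rough illustration of why a symmetric routine can beat a general product for x %*% t(x), the sketch below computes only the upper triangle of the result and mirrors it, which is the idea behind a symmetric rank-k update such as dsyrk. This is a pure-Python teaching sketch, not how R or any BLAS library is implemented, and the function name `tcrossprod` is reused here only for readability.

```python
def tcrossprod(x):
    """Return x times its transpose for a matrix given as a list of rows.

    Only entries with j >= i are computed; the lower triangle is filled
    in by symmetry, so roughly half of the row-by-row products are skipped.
    """
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = sum(a * b for a, b in zip(x[i], x[j]))
            out[i][j] = s
            out[j][i] = s  # mirror: (x x^T)[j][i] == (x x^T)[i][j]
    return out


print(tcrossprod([[1, 2], [3, 4]]))  # [[5, 11], [11, 25]]
```

In practice the speed of the real routine depends mostly on which BLAS library R is linked against, which is the point the truncated answer is starting to make.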

Can a Parallel Processing Efficiency become > 1?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-08 09:12:11
Question: I read about efficiency in parallel computing, but never got a clear idea of it. I also read about achieving efficiency > 1 and concluded that it is possible when the speedup is super-linear. Is that correct and possible? If yes, can anybody tell me how and provide an example? Or, if it is not, then why? Answer 1: Let's agree on a few terms first: A set of processes may get scheduled for execution under several different strategies -- [SERIAL] - i.e. execute one after another has finished,
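For reference, the usual textbook definitions behind this question are speedup S = T_serial / T_parallel and efficiency E = S / p for p workers, so E > 1 simply means the speedup exceeded p. The sketch below encodes those two formulas. The timings are hypothetical numbers invented for illustration, not measurements, and the truncated answer above may define its terms differently.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Efficiency E = S / p, where p is the number of workers."""
    return speedup(t_serial, t_parallel) / workers


# Hypothetical timings in seconds, invented for illustration only.
print(efficiency(100.0, 25.0, 4))  # 1.0  (linear speedup: S == p)
print(efficiency(100.0, 20.0, 4))  # 1.25 (super-linear: S > p)
```

A commonly cited cause of measured super-linear speedup is that each worker's share of the data fits in cache while the serial run's full working set does not, so the serial baseline is slower per element than the parallel workers.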

how can Python see 12 cpus on a cluster where I got allocated 4 cores by LSF?

你。 submitted on 2019-12-08 07:53:20
Question: I access a Linux cluster where resources are allocated using LSF, which I think is a common tool and comes from Scali (http://www.scali.com/workload-management/high-performance-computing). In an interactive queue, I asked for and got the maximum number of cores: 4. But if I check how many CPUs Python's multiprocessing module sees, the number is 12, which is the number of physical cores on the node I was allocated. It looks like the multiprocessing module has problems respecting the bounds that
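The mismatch described here is expected: `multiprocessing.cpu_count()` reports the CPUs present on the machine, not the CPUs a job was granted. A minimal sketch of a more cautious count follows. The helper name `usable_cpus` is hypothetical, and reading the `LSB_DJOB_NUMPROC` environment variable is an assumption about how the scheduler is configured at a given site; check the local LSF documentation before relying on it. `os.sched_getaffinity` is available on Linux and only reflects the allocation if the scheduler actually pins jobs to cores.

```python
import multiprocessing
import os


def usable_cpus(env_var="LSB_DJOB_NUMPROC"):
    """Best-effort count of CPUs this process should use.

    `env_var` is an assumed scheduler variable; adjust it for your site.
    """
    # 1. Prefer an explicit hint exported by the batch scheduler, if present.
    hint = os.environ.get(env_var, "")
    if hint.isdigit() and int(hint) > 0:
        return int(hint)
    # 2. Otherwise use the CPU affinity mask of this process (Linux only).
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # 3. Last resort: every CPU on the node, which may exceed the allocation.
    return multiprocessing.cpu_count()


if __name__ == "__main__":
    print("CPUs on node:     ", multiprocessing.cpu_count())
    print("CPUs to use here: ", usable_cpus())
```

A pool can then be sized explicitly, for example `multiprocessing.Pool(processes=usable_cpus())`, instead of relying on the default, which uses the machine-wide count.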

C# Property System

给你一囗甜甜゛ submitted on 2019-12-08 03:12:08
Question: Update: Sorry. I didn't mean the whole reflection library was off limits. I just meant the insanely slow *.Invoke() stuff. Hi, I need to implement a property system in C# that allows both normal property access, [property_attribute()] return_type Property { get; set; }, and access by string, SetProperty(string name, object value); object GetProperty(string name); However, I do not want to register each property individually. I do not want to use reflection. I do not want to access properties

C# Property System

淺唱寂寞╮ submitted on 2019-12-08 01:30:25
Update: Sorry. I didn't mean the whole reflection library was off limits. I just meant the insanely slow *.Invoke() stuff. Hi, I need to implement a property system in C# that allows both normal property access, [property_attribute()] return_type Property { get; set; }, and access by string, SetProperty(string name, object value); object GetProperty(string name); However, I do not want to register each property individually. I do not want to use reflection. I do not want to access properties through a dictionary (i.e. no PropertyTable["abc"]=val; ). This scheme is required for a cluster computing

MPI_Isend and MPI_Irecv seem to be causing a deadlock

只谈情不闲聊 submitted on 2019-12-08 01:26:42
Question: I'm using non-blocking communication in MPI to send various messages between processes. However, I appear to be getting a deadlock. I have used PADB (see here) to look at the message queues and have got the following output:
    1:msg12: Operation 1 (pending_receive) status 0 (pending)
    1:msg12: Rank local 4 global 4
    1:msg12: Size desired 4
    1:msg12: tag_wild 0
    1:msg12: Tag desired 16
    1:msg12: system_buffer 0
    1:msg12: Buffer 0xcaad32c
    1:msg12: 'Receive: 0xcac3c80'
    1:msg12: 'Data: 4 * MPI_FLOAT'
    --

run Rmpi on cluster, specify library path

大憨熊 submitted on 2019-12-07 15:33:28
I'm trying to run an analysis in parallel on our computing cluster. Unfortunately I've had to set up Rmpi myself and may not have done so properly. Because I had to install all necessary packages into my home folder, I always have to call .libPaths('/home/myfolder/Rlib'); before I can load packages. However, it appears that doMPI attempts to load itself before I can set the library path.
    .libPaths('/home/myfolder/Rlib');
    cat("Step 1")
    library(doMPI)
    cl <- startMPIcluster()
    registerDoMPI(cl)
    cat("Step 2")
    Children_mcmc1 = foreach(i=1:2) %dopar% {
        cat("Step 3")
        .libPaths('/home/myfolder/Rlib');

Submit job with python code (mpi4py) on HPC cluster

↘锁芯ラ submitted on 2019-12-07 10:51:53
Question: I am working on Python code with MPI (mpi4py), and I want to run it across many nodes (each node has 16 processors) in a queue on an HPC cluster. My code is structured as below:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()
    count = 0
    for i in range(1, size):
        if rank == i:
            for j in range(5):
                res = some_function(some_argument)
                comm.send(res, dest=0, tag=count)
I am able to run this code perfectly fine on the head node of the cluster using the