Spark processing columns in parallel
Question: I've been playing with Spark, and I managed to get it to crunch my data. My data consists of a flat delimited text file with 50 columns and about 20 million rows. I have Scala scripts that process each column. In terms of parallel processing, I know that RDD operations run on multiple nodes. So every time I process a column, it is processed in parallel, but the columns themselves are processed sequentially. A simple example: if my data is a 5-column delimited text file and each