bigdata

Would Spark preserve key order with this sortByKey/map/collect sequence?

别等时光非礼了梦想. Submitted on 2019-12-23 01:55:35
Question: Let us say we have this:

    val sx = sc.parallelize(Array((0, 39), (4, 47), (3, 51), (1, 98), (2, 61)))

And we later call this:

    val sy = sx.sortByKey(true)

which would make sy = RDD[(0, 39), (1, 98), (2, 61), (3, 51), (4, 47)]. And then we do:

    collected = sy.map(x => (x._2 / 10, x._2)).collect

Would we always get the following? That is, would the original key order be preserved, despite the key values changing?

    collected = [(3, 39), (9, 98), (6, 61), (5, 51), (4, 47)]

Answer 1: Applying the map() transformation does not reorder the elements: map() is a narrow transformation that keeps the per-partition order produced by sortByKey, and collect() returns the partitions in order, so the values come back in the order of the original sorted keys.
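The pipeline is easy to sanity-check end to end; the sketch below does it in PySpark rather than the Scala API used in the question, purely as an illustration (the local master and app name are assumptions).

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "order-check")   # assumed local master with 2 partitions

    sx = sc.parallelize([(0, 39), (4, 47), (3, 51), (1, 98), (2, 61)])
    sy = sx.sortByKey(True)

    # map() is a narrow transformation: it keeps the per-partition order produced by
    # sortByKey, and collect() concatenates the partitions in order.
    collected = sy.map(lambda kv: (kv[1] // 10, kv[1])).collect()
    print(collected)   # [(3, 39), (9, 98), (6, 61), (5, 51), (4, 47)]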

Find connected components of big Graph using Boost

旧城冷巷雨未停 Submitted on 2019-12-23 01:25:10
Question: I have written code for finding the connected components of a very big graph (80 million edges), but it doesn't work: when the number of edges gets close to 40 million it crashes.

    int main() {
        using namespace boost;
        {
            int node1, node2;
            typedef adjacency_list<vecS, vecS, undirectedS> Graph;
            Graph G;
            std::ifstream infile("pairs.txt");
            std::string line;
            while (std::getline(infile, line)) {
                std::istringstream iss(line);
                iss >> node1 >> node2;
                add_edge(node1, node2, G);
            }
            cout << "writing file" << endl;
            int j = 0;
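Not a Boost fix, but for context: an adjacency_list with two vecS selectors carries noticeable per-edge overhead at 80 million edges, and a compressed sparse representation is much lighter. A minimal sketch of the same computation with SciPy, assuming 0-based integer node ids in pairs.txt:

    import numpy as np
    import pandas as pd
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import connected_components

    # Read the edge list (one "node1 node2" pair per line) into two integer columns.
    edges = pd.read_csv("pairs.txt", sep=r"\s+", header=None, names=["u", "v"]).to_numpy()
    n = int(edges.max()) + 1                      # assumes 0-based integer node ids

    # Sparse adjacency matrix; the stored values do not matter for connectivity.
    adj = coo_matrix((np.ones(len(edges), dtype=np.int8), (edges[:, 0], edges[:, 1])),
                     shape=(n, n))

    n_components, labels = connected_components(adj, directed=False)
    print(n_components, "components")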

dcast efficiently large datasets with multiple variables

江枫思渺然 Submitted on 2019-12-23 00:24:32
Question: I am trying to dcast a large dataset (millions of rows). I have one row for the arrival time and origin, and another row for the departure time and destination. There is an id to identify the unit in both cases. It looks similar to this:

    id  time              movement  origin  dest
    1   10/06/2011 15:54  ARR       15      15
    1   10/06/2011 16:14  DEP       15      29
    2   10/06/2011 17:59  ARR       73      73
    2   10/06/2011 18:10  DEP       73      75
    2   10/06/2011 21:10  ARR       75      75
    2   10/06/2011 21:20  DEP       75      73
    3   10/06/2011 17:14  ARR       17      17
    3   10/06/2011 18:01  DEP       17      48
    4   10
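For comparison only, here is the same wide reshape sketched with pandas instead of data.table's dcast; the file name movements.csv and the leg counter are assumptions, not part of the question.

    import pandas as pd

    # hypothetical input with the columns shown above: id, time, movement, origin, dest
    df = pd.read_csv("movements.csv", parse_dates=["time"], dayfirst=True)

    # Number each ARR/DEP leg per id so repeated trips (like id 2) stay on separate rows,
    # then spread movement into columns -- roughly dcast(id + leg ~ movement, value.var = ...).
    df["leg"] = df.groupby(["id", "movement"]).cumcount()
    wide = df.pivot_table(index=["id", "leg"], columns="movement",
                          values=["time", "origin", "dest"], aggfunc="first")
    print(wide.head())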

clusterExport to single thread in R parallel

送分小仙女□ Submitted on 2019-12-22 13:59:32
Question: I would like to split a large data.frame into chunks and pass each chunk individually to a different member of the cluster. Something like:

    library(parallel)
    cl <- makeCluster(detectCores())
    for (i in 1:detectCores()) {
        clusterExport(cl, mydata[indices[[i]]], <extra option to specify a thread/process>)
    }

Is this possible?

Answer 1: Here is an example that uses clusterCall inside a for loop to send a different chunk of the data frame to each of the workers:

    library(parallel)
    cl <- makeCluster
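The "one chunk per worker" pattern itself is language-agnostic; below is a minimal sketch of it using Python's multiprocessing rather than R's parallel package (the data and the worker function are placeholders).

    import multiprocessing as mp

    def work_on_chunk(chunk):
        # placeholder for whatever each worker should compute on its slice
        return len(chunk)

    if __name__ == "__main__":
        mydata = list(range(1_000_000))              # stand-in for the large data.frame
        n = mp.cpu_count()
        chunks = [mydata[i::n] for i in range(n)]    # one chunk per worker

        with mp.Pool(processes=n) as pool:
            results = pool.map(work_on_chunk, chunks)  # each worker gets exactly one chunk
        print(results)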

Casting date in Talend Data Integration

﹥>﹥吖頭↗ Submitted on 2019-12-22 13:56:21
Question: In a data flow from one table to another, I would like to cast a date. The date leaves the source table as a string in this format: "2009-01-05 00:00:00:000 + 01:00". I tried to convert it to a date using a tConvertType, but apparently that is not allowed. My second option is to cast the string to a date using a formula in a tMap component. So far I have tried these formulas:

    - TalendDate.formatDate("yyyy-MM-dd", row3.rafw_dz_begi);
    - TalendDate.formatDate("yyyy-MM-dd HH:mm:ss", row3.rafw
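formatDate turns a Date into a String, so what this mapping needs is a parse step rather than a format step. Purely to show a pattern that matches the literal above, here is a sketch in Python, not a Talend expression; normalising the oddly spaced offset is an assumption about the input.

    from datetime import datetime

    raw = "2009-01-05 00:00:00:000 + 01:00"
    cleaned = raw.replace(" + ", "+").replace(" - ", "-")   # "+ 01:00" -> "+01:00"

    # %f accepts the 3-digit millisecond field, %z accepts the "+01:00" offset
    dt = datetime.strptime(cleaned, "%Y-%m-%d %H:%M:%S:%f%z")
    print(dt.isoformat())   # 2009-01-05T00:00:00+01:00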

How to replace null NAN or Infinite values to default value in Spark Scala

那年仲夏 Submitted on 2019-12-22 11:08:05
Question: I'm reading CSVs into Spark and setting the schema so that all columns are DecimalType(10,0). When I query the data, I get the following error:

    NumberFormatException: Infinite or NaN

If I have NaN/null/infinite values in my dataframe, I would like to set them to 0. How do I do this? This is how I'm attempting to load the data:

    var cases = spark.read.option("header", false).
        option("nanValue", "0").
        option("nullValue", "0").
        option("positiveInf", "0").
        option("negativeInf", "0").
        schema(schema).
        csv(
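One common route (shown in PySpark here, not the asker's Scala) is to read the numeric columns as doubles first, since DoubleType can represent NaN and Infinity while DecimalType cannot, and then replace the problem values; the file name and per-column handling below are assumptions.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("fill-defaults").getOrCreate()

    # inferSchema yields DoubleType for numeric columns; cast to decimal after cleaning if needed
    cases = spark.read.option("header", False).csv("cases.csv", inferSchema=True)

    cases = cases.na.fill(0)   # replaces nulls and NaNs in numeric columns with 0
    for c, t in cases.dtypes:  # Infinity is neither null nor NaN, so handle it per column
        if t == "double":
            cases = cases.withColumn(c, F.when(F.col(c) == float("inf"), 0.0)
                                          .when(F.col(c) == float("-inf"), 0.0)
                                          .otherwise(F.col(c)))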

Fastest way to load huge .dat into array

陌路散爱 Submitted on 2019-12-22 10:23:45
Question: I have searched Stack Exchange extensively for a neat solution for loading a huge (~2 GB) .dat file into a numpy array, but didn't find one. So far I have managed to load it as a list quite quickly (<1 min):

    lst = []
    f = open('myhugefile0')
    for line in f:
        lst.append(line)
    f.close()

Using np.loadtxt freezes my computer and takes several minutes (~10 min). How can I open the file as an array without the allocation issue that seems to bottleneck np.loadtxt?

EDIT: Input data
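If the file is a plain text table of numbers, pandas' C parser is usually the quick win over np.loadtxt, and numpy.memmap avoids the allocation entirely for raw binary. A short sketch, where the separator and dtype are assumptions about the input:

    import numpy as np
    import pandas as pd

    # whitespace-separated text: read_csv's C engine is much faster than np.loadtxt
    arr = pd.read_csv("myhugefile0", sep=r"\s+", header=None, dtype=np.float32).to_numpy()
    print(arr.shape)

    # if the file were raw binary instead, a memory map avoids loading it all at once:
    # arr = np.memmap("myhugefile0", dtype=np.float32, mode="r")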

Bigtable performance influence column families

£可爱£侵袭症+ Submitted on 2019-12-22 08:06:22
Question: We are currently investigating the influence of using multiple column families on the performance of our Bigtable queries. We found that splitting the columns into multiple column families does not increase performance. Has anyone had similar experiences? Some more details about our benchmark setup: at the moment each row in our production table contains around 5 columns, each holding between 0.1 and 1 KB of data, and all columns are stored in one column family. When performing a
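For reference, the access pattern being benchmarked (read only one family out of several) looks roughly like this with the Python Bigtable client; the project, instance, table and family names are made up, so treat it as a sketch rather than the benchmark code.

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project", admin=True)
    table = client.instance("my-instance").table("benchmark-table")

    # Fetch a single row but only the cells in one column family;
    # with a single-family layout the filter would simply be omitted.
    row = table.read_row(b"row-0001", filter_=row_filters.FamilyNameRegexFilter("cf_hot"))
    if row is not None:
        for family, columns in row.cells.items():
            print(family, list(columns))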

key validation class type in cassandra UTF8 or LongType?

[亡魂溺海] Submitted on 2019-12-22 08:06:20
Question: Using Cassandra, I want to store 20 million+ row keys in a column family. My questions are: is there a REAL performance difference between long and UTF8 row keys, and is there any row key storage size problem? My user keys look like this:

    rowKey => 112512462152451
    rowKey => 135431354354343
    rowKey => 145646546546463
    rowKey => 154354354354354
    rowKey => 156454343435435
    rowKey => 154435435435745

Answer 1: Cassandra stores all data on disk (including row key values) as a hex byte array. In terms of performance, the datatype of
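On the storage-size part of the question, the difference is easy to see for the sample keys above: LongType stores a fixed 8-byte big-endian integer, while UTF8Type stores the UTF-8 bytes of the string (one byte per digit here). A small Python sketch:

    import struct

    key = 112512462152451                    # one of the sample row keys
    as_long = struct.pack(">q", key)         # LongType: fixed 8-byte big-endian value
    as_utf8 = str(key).encode("utf-8")       # UTF8Type: one byte per digit character

    print(len(as_long), "bytes as long")     # 8
    print(len(as_utf8), "bytes as utf8")     # 15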

Time-based drilldowns in Power BI powered by Azure Data Warehouse

血红的双手。 Submitted on 2019-12-22 07:57:29
Question: I have designed a simple Azure Data Warehouse in which I want to track the stock of my products on a periodic basis. Moreover, I want to be able to see that data grouped by month, week, day and hour, with the ability to drill down from top to bottom. I have defined 3 dimensions:

    - DimDate
    - DimTime
    - DimProduct

I have also defined a fact table to track product stocks:

    FactStocks
    - DateKey (20160510, 20160511, etc.)
    - TimeKey (0..23)
    - ProductKey (Product1, Product2)
    - StockValue (number, 1..9999)

My
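A note on the key scheme described above: with DateKey as a yyyymmdd integer and TimeKey as the hour, any timestamp maps to exactly one row of DimDate and one row of DimTime, which is what makes the month/week/day/hour drill-down possible. A tiny Python sketch of that mapping (the function name is made up):

    from datetime import datetime

    def to_keys(ts: datetime):
        # DateKey in yyyymmdd form (e.g. 20160510), TimeKey as the hour 0..23
        return int(ts.strftime("%Y%m%d")), ts.hour

    print(to_keys(datetime(2016, 5, 10, 14, 30)))   # (20160510, 14)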