bigdata

Inserting a JSON date object into MongoDB from R

Submitted by ℡╲_俬逩灬. on 2019-12-06 16:21:59
I am trying to insert forecasted values from a forecasting model, along with their timestamps, into MongoDB from R. The following code converts the R data frame to JSON and then to BSON. However, when the result is inserted into MongoDB, the timestamp is not recognized as a date object.

mongo1 <- mongo.create(host = "localhost:27017", db = "test", username = "test", password = "test")
rev <- data.frame(ts = c("2017-01-06 05:30:00", "2017-01-06 05:31:00", "2017-01-06 05:32:00", "2017-01-06 05:33:00", "2017-01-06 05:34:00"), value = c(10, 20, 30, 40, 50))
rev$ts <- as.POSIXct(strptime(rev$ts, format = "%Y-%m-%d %H:%M:%S", tz = ""))
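The question is about R, but the underlying requirement is language-independent: the timestamp has to reach the server as a native BSON Date rather than a formatted string. Below is a minimal sketch of that distinction using the MongoDB Java driver from Scala; the connection string, database, collection, and field names are illustrative placeholders, and R packages such as mongolite handle the equivalent conversion of POSIXct columns on insert.

import java.util.Date
import org.bson.Document
import com.mongodb.client.MongoClients

object DateInsertSketch {
  def main(args: Array[String]): Unit = {
    val client = MongoClients.create("mongodb://localhost:27017")
    val coll = client.getDatabase("test").getCollection("rev")

    // A java.util.Date is serialized as a BSON Date (shown as ISODate in the shell);
    // inserting a formatted string instead stores a plain string that date queries cannot use.
    val doc = new Document("ts", new Date()).append("value", 10)
    coll.insertOne(doc)

    client.close()
  }
}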

Using RowCounter on an HBase table

Submitted by 别说谁变了你拦得住时间么 on 2019-12-06 15:26:05
I am trying to calculate the number of rows in an HBase table. I can do that with a scanner, but it is a bulky process. I want to use RowCounter to fetch the row count from the HBase table. Is there any way to use it from Java code? Is there any example or code snippet available? Running RowCounter directly is plain simple from the command line:

hbase org.apache.hadoop.hbase.mapreduce.RowCounter [TABLE_NAME]

Please provide a code snippet to do the same from Java code. Thanks.

You can find the source code of the above here. To get the row count, we have to scan the HBase table; there is no other way.
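As a rough illustration of the scan-based counting mentioned in the answer, here is a sketch against the standard HBase client API (written in Scala; the same calls exist verbatim in Java, and the table name "mytable" is a placeholder). The FirstKeyOnlyFilter keeps the scan cheap by returning only the first cell of every row:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf("mytable"))   // placeholder table name

val scan = new Scan()
scan.setFilter(new FirstKeyOnlyFilter())   // one cell per row is enough for counting

val scanner = table.getScanner(scan)
var count = 0L
val it = scanner.iterator()
while (it.hasNext) { it.next(); count += 1 }
scanner.close()

println(s"row count: $count")
table.close()
connection.close()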

Would Spark preserve key order with this sortByKey/map/collect sequence?

Submitted by 可紊 on 2019-12-06 14:50:41
Let us say we have this:

val sx = sc.parallelize(Array((0, 39), (4, 47), (3, 51), (1, 98), (2, 61)))

and we later call:

val sy = sx.sortByKey(true)

which would make sy = RDD[(0, 39), (1, 98), (2, 61), (3, 51), (4, 47)]. Then we do:

collected = sy.map(x => (x._2 / 10, x._2)).collect

Would we always get the following? That is, would the original key order be preserved, despite changing the key values?

collected = [(3, 39), (9, 98), (6, 61), (5, 51), (4, 47)]

Applying the map() transformation and calling collect() does not change the ordering of the array elements returned by collect().
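A quick local check of that statement (a sketch, assuming a running SparkContext named sc): the values come back in the order sortByKey established, because map is applied within each partition and collect concatenates the partitions in order.

val sx = sc.parallelize(Array((0, 39), (4, 47), (3, 51), (1, 98), (2, 61)))
val sy = sx.sortByKey(true)                                 // sorted by the original keys
val collected = sy.map(x => (x._2 / 10, x._2)).collect()

// the value order matches the key-sorted order, even though the keys changed
assert(collected.map(_._2).sameElements(Array(39, 98, 61, 51, 47)))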

How to compare two data frames in Scala

Submitted by 柔情痞子 on 2019-12-06 14:45:19
I have two data frames with the same schema that I want to compare for a test.

df1
------------------------------------------
year | state | count2 | count3 | count4 |
2014 | NJ    | 12332  | 54322  | 53422  |
2014 | NJ    | 12332  | 53255  | 55324  |
2015 | CO    | 12332  | 53255  | 55324  |
2015 | MD    | 14463  | 76543  | 66433  |
2016 | CT    | 14463  | 76543  | 66433  |
2016 | CT    | 55325  | 76543  | 66433  |
------------------------------------------

df2
------------------------------------------
year | state | count2 | count3 | count4 |
2014 | NJ    | 12332  | 54322  | 53422  |
2014 | NJ    | 65333  | 65555  | 125    |
2015 | CO    | 12332  | 53255  | 55324  |
2015
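A common way to surface the mismatching rows is a set difference in each direction (a sketch, assuming df1 and df2 are Spark DataFrames with the schema shown above; note that except de-duplicates, so exceptAll is needed on Spark 2.4+ if duplicate rows matter):

val onlyInDf1 = df1.except(df2)   // rows present in df1 but missing from df2
val onlyInDf2 = df2.except(df1)   // rows present in df2 but missing from df1

onlyInDf1.show()
onlyInDf2.show()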

How can I efficiently create a user graph based on transaction data using Python?

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-06 13:43:43
I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I am creating edges if any two users have shared an IP address. From that transaction data, I've created a Pandas dataframe of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame 'df' with columns 'user' and 'ip'. I keep running into memory problems, and have tried a few different
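The core of the approach described above is a self-join on IP: group the unique (user, ip) pairs by ip and emit every pair of distinct users within a group. Sketched here in plain Scala collections purely to illustrate the shape of the computation (the pairs value stands in for 'df'; in pandas the equivalent is usually a merge of df with itself on the 'ip' column):

// unique (user, ip) pairs, standing in for the DataFrame 'df'
val pairs = Seq(("alice", "1.2.3.4"), ("bob", "1.2.3.4"), ("carol", "5.6.7.8"))

// group users by IP, then take every 2-combination of users sharing that IP;
// each combination is one (user_a, user_b) edge
val edges = pairs
  .groupBy { case (_, ip) => ip }
  .values
  .flatMap { group =>
    group.map(_._1).distinct.sorted.combinations(2).map { case Seq(a, b) => (a, b) }
  }
  .toSeq
  .distinct

println(edges)   // one edge: (alice, bob)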

Number of MapReduce tasks

Submitted by 三世轮回 on 2019-12-06 13:05:12
I need some help with how to get the correct number of map and reduce tasks in my application. Is there any way to discover this number? Thanks.

It is not possible to get the exact number of map and reduce task attempts for an application before it runs, because task failures (and their re-attempts) and speculative execution cannot be determined ahead of time; only an approximate number of tasks can be derived. The total number of map tasks for a MapReduce job depends on its input files and their FileFormat. For each input file, splits are computed, typically one per HDFS block; a 1 GB file with a 128 MB block size, for example, yields 8 splits and hence 8 map tasks.
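The reduce side, in contrast, is simply whatever the job asks for. A small sketch against the Hadoop MapReduce Java API (shown from Scala; the job name and paths are placeholders) showing where the reducer count is fixed explicitly, while the map count later falls out of the computed input splits:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

val job = Job.getInstance(new Configuration(), "task-count-example")

FileInputFormat.addInputPath(job, new Path("/data/input"))     // map tasks = number of input splits
FileOutputFormat.setOutputPath(job, new Path("/data/output"))

job.setNumReduceTasks(4)   // reduce tasks are chosen by the job itself (default is 1)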

HBase Scan TimeRange does not work in Scala

Submitted by 爱⌒轻易说出口 on 2019-12-06 13:04:03
I am writing Scala code to retrieve data based on its time range. Here is my code:

object Hbase_Scan_TimeRange {
  def main(args: Array[String]): Unit = {
    //===Basic HBase (non-deprecated API)===Start
    Logger.getLogger(this.getClass)
    Logger.getLogger("org").setLevel(Level.ERROR)
    BasicConfigurator.configure()
    val conf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(conf)
    val admin = connection.getAdmin()
    //===Basic HBase (non-deprecated API)===End

    val scan = new Scan()
    val _min = 1470387596203L
    val _max = 1470387596204L
    scan.setTimeRange(_min, _max)
    val
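The excerpt stops before the scan is executed. A possible continuation (a sketch; "mytable" is a placeholder table name), which is also where time-range problems usually show up, since setTimeRange filters on the cells' write timestamps in epoch milliseconds rather than on any application-level date column:

val table = connection.getTable(TableName.valueOf("mytable"))   // placeholder table name
val scanner = table.getScanner(scan)

val it = scanner.iterator()
while (it.hasNext) {
  val result = it.next()
  println(org.apache.hadoop.hbase.util.Bytes.toString(result.getRow))   // row keys with cells in [_min, _max)
}

scanner.close()
table.close()
connection.close()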

PRS, three swords as one: a big data power tool unsheathed [Python + R + Sublime]

Submitted by …衆ロ難τιáo~ on 2019-12-06 12:46:37
R is a language built specifically for data analysis and is a favorite of many researchers; as the "big data" concept has heated up, R has become red-hot as well. Python needs no introduction: simple and practical, second to none. Both runtimes are well supported in Sublime, and when the three come together it is dry tinder meeting a flame! The R and Python consoles are certainly powerful, but they are also inconvenient in many ways; after all, each is just a console. So let's bind the execution of R and Python to keyboard shortcuts. Open Sublime -> Preferences -> Key Bindings - User and paste in the text below.

[
  {
    "keys": ["shift+ctrl+p"],
    "caption": "SublimeREPL: Python - RUN current file",
    "command": "run_existing_window_command",
    "args": { "id": "repl_python_run", "file": "config/Python/Main.sublime-menu" }
  },
  {
    "keys": ["shift+ctrl+n"],
    "caption": "SublimeREPL: Python",
    "command": "run_existing_window_command",
    "args": { "id": "repl_python", "file": "config

Cassandra slowed down with more nodes

Submitted by 我与影子孤独终老i on 2019-12-06 11:33:20
I set up a Cassandra cluster on AWS. What I want is increased I/O throughput (number of reads/writes per second) as more nodes are added, as advertised. However, I got exactly the opposite: performance drops as new nodes are added. Do you know of any typical issues that prevent it from scaling? Here are some details: I am writing a text file (15 MB) into the column family, one record per line, 150,000 records in total. With 1 node it takes about 90 seconds to write, but with 2 nodes it takes 120 seconds. I can see the data is spread across the 2 nodes. However, there

Operating with big.matrix

Submitted by 笑着哭i on 2019-12-06 11:20:49
I have to work with big.matrix objects and I can't compute some functions on them. Consider the following big.matrix:

# create big.matrix object
x <- as.big.matrix(
  matrix(sample(1:10, 20, replace = TRUE), 5, 4,
         dimnames = list(NULL, c("a", "b", "c", "d")))
)

> x
An object of class "big.matrix"
Slot "address":
<pointer: 0x00000000141beee0>

The corresponding matrix object is:

# create matrix object
x2 <- x[,]

> x2
     a b  c  d
[1,] 6 9  5  3
[2,] 3 6 10  8
[3,] 7 1  2  8
[4,] 7 8  4 10
[5,] 6 3  6  4

If I compute this operation with the matrix object, it works:

sqrt(slam::col_sums(x2*x2))

> sqrt(slam::col_sums