bigdata

Easily set up your IPython + Notebook cloud-based scientific computing environment (with detailed steps)

99封情书 posted on 2019-11-27 08:14:02
IPython + Notebook provides a cloud-based development environment for scientific computing. It gives developers the computing power of the cloud together with a good development interface, without having to install any software locally. On top of that, the bandwidth needed between your machine and the cloud is very low.

Prerequisites: all you need is a local web browser!

Register a cloud computing account. We recommend registering on SuperVessel Cloud (sign-up at: http://www.ptopenlab.com ), for two reasons:
- SuperVessel is built with the support of the OpenPOWER Foundation and is completely free for developers.
- SuperVessel already offers an image with IPython + Notebook support, free for developers to use.
(For more on SuperVessel itself, see: http://my.oschina.net/u/1431433/blog/380504 )

Registration is simple (if you already have an account, skip straight to the next section):
1. Go to http://www.ptopenlab.com .
2. Click the "Register" button at the top right. In the dialog that pops up, fill in a valid email address and a password. The email address must be valid, because SuperVessel will send you an activation email.
3. Open the mailbox you registered with; there will be an email from admin@ptopenlab.com. Click the activation link inside, and your account is activated and ready to use.

Create a virtual machine that supports IPython + Notebook 1.
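Once your notebook VM is up, a first cell like the sketch below is a quick way to confirm what you are running on (illustrative only; nothing in it is specific to SuperVessel):

```python
# Quick environment check for a fresh notebook (illustrative sketch).
import platform
import sys

print(sys.version)         # Python interpreter provided by the image
print(platform.machine())  # on OpenPOWER hardware, expect ppc64/ppc64le
print(platform.node())     # hostname of the cloud VM, not your laptop
```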

How to most efficiently increase values at a specified range in a large array and then find the largest value

China☆狼群 posted on 2019-11-27 07:22:47
Question: So I just had a programming test for an interview, and I consider myself a decent programmer, but I was unable to meet the time constraints of the online test (and no debugger was allowed). Essentially the question was: given a range of indices [low, high] and a value to increase those indices by, after doing this M times to the array, find the largest value. So if you had an array of size 5 [0, 0, 0, 0, 0] and you were given the instructions [0, 3] 143, [2, 4] 100 and [2, 2] 100, the array
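The standard trick for this problem is a difference array: apply each update in O(1) by marking only the range endpoints, then make one prefix-sum pass to materialize the values and track the maximum. A minimal Python sketch of that idea (function and variable names are mine, not from the test):

```python
def max_after_range_updates(n, updates):
    """n: array size; updates: iterable of (low, high, value) triples.

    Instead of touching every index in [low, high] for each update,
    record +value at low and -value at high + 1, then one prefix-sum
    pass reconstructs the final array: O(n + m) instead of O(n * m).
    """
    diff = [0] * (n + 1)
    for low, high, value in updates:
        diff[low] += value
        diff[high + 1] -= value

    best = running = 0
    for i in range(n):
        running += diff[i]
        best = max(best, running)
    return best

# The example from the question: size-5 array, three updates.
print(max_after_range_updates(5, [(0, 3, 143), (2, 4, 100), (2, 2, 100)]))  # 343
```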

R foverlaps equivalent in Python

蹲街弑〆低调 posted on 2019-11-27 07:12:30
Question: I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python 3. What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same - table_A occurs on a per-second basis and can have multiple entries at
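One way to approximate foverlaps in pandas, assuming the [start_time, end_time] windows in the second table do not overlap one another, is an IntervalIndex lookup. A hedged sketch (the frames and column names below just mirror the question; they are placeholders):

```python
import pandas as pd

# Hypothetical frames mirroring the question: per-second events in table_a,
# [start_time, end_time] windows in table_b.
table_a = pd.DataFrame({"time": pd.to_datetime(["2019-01-01 00:00:01",
                                                "2019-01-01 00:00:05"]),
                        "val": [1, 2]})
table_b = pd.DataFrame({"start_time": pd.to_datetime(["2019-01-01 00:00:00"]),
                        "end_time": pd.to_datetime(["2019-01-01 00:00:10"]),
                        "label": ["window_1"]})

# One interval per table_b row; closed="both" matches foverlaps' inclusive
# endpoints. get_indexer requires the intervals to be non-overlapping.
idx = pd.IntervalIndex.from_arrays(table_b["start_time"], table_b["end_time"],
                                   closed="both")
positions = idx.get_indexer(table_a["time"])

# -1 means "no containing interval"; everything else joins to that table_b row.
joined = (table_a.assign(match=positions)
                 .query("match >= 0")
                 .merge(table_b, left_on="match", right_index=True))
print(joined)
```

If the windows can overlap, this lookup raises an error, and a cross-join filtered on the interval condition (or a packages-based interval join) is the usual fallback.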

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

让人想犯罪 __ posted on 2019-11-27 06:50:31
There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there are:
- The Spark Driver node (sparkDriverCount)
- The number of worker nodes available to a Spark cluster (numWorkerNodes)
- The number of Spark executors (numExecutors)
- The DataFrame being operated on by all workers/executors, concurrently (dataFrame)
- The number of rows in the dataFrame (numDFRows)
- The number of partitions on the dataFrame (numPartitions)
- And finally, the number of CPU cores available on each worker node (
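A common starting heuristic (a rule of thumb, not an official Spark formula) is to aim for a small multiple of the total executor-core count, then refine by measurement. A hedged PySpark sketch; every cluster number below is an illustrative assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-heuristic").getOrCreate()

# Illustrative cluster shape -- substitute your real numbers.
numWorkerNodes = 4
numExecutorsPerNode = 2
coresPerExecutor = 4
totalCores = numWorkerNodes * numExecutorsPerNode * coresPerExecutor  # 32

# A frequently cited rule of thumb is 2-4 tasks per core; start at 3x and tune.
numPartitions = totalCores * 3  # 96

dataFrame = spark.range(1_000_000)  # stand-in for the real DataFrame
# repartition() triggers a full shuffle; coalesce() only merges partitions.
dataFrame = dataFrame.repartition(numPartitions)
print(dataFrame.rdd.getNumPartitions())  # 96
```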

Load a small random sample from a large csv file into R data frame

一笑奈何 posted on 2019-11-27 05:47:25
Question: The csv file to be processed does not fit into memory. How can one read ~20K random lines of it to do basic statistics on the selected data frame?

Answer 1: You can also just do it in the terminal with perl. perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt This won't necessarily get you exactly 20,000 lines. (Here it'll grab about .01, or 1%, of the total lines.) It will, however, be really, really fast, and you'll have a nice copy of both files in your directory. You can then load the
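A Python sketch of the same idea; the reservoir-sampling variant below returns exactly 20,000 lines in one pass, unlike the Bernoulli perl one-liner (the file names are placeholders reused from the answer):

```python
import random

def sample_lines(path, k=20_000, seed=42):
    """Reservoir sampling: one pass, O(k) memory, exactly k lines
    (or all lines if the file has fewer than k)."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i < k:
                reservoir.append(line)
            else:
                # Keep line i with probability k / (i + 1), evicting a
                # uniformly chosen earlier pick.
                j = rng.randrange(i + 1)
                if j < k:
                    reservoir[j] = line
    return reservoir

# Note: a CSV header would be sampled like any other line; skip it if present.
with open("subset.txt", "w") as out:
    out.writelines(sample_lines("biglist.txt"))
```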

How to quickly export data from R to SQL Server

筅森魡賤 posted on 2019-11-27 05:25:58
Question: The standard RODBC package's sqlSave function, even as a single INSERT statement (parameter fast = TRUE), is terribly slow for large amounts of data due to non-minimal logging. How would I write data to my SQL server with minimal logging so it writes much more quickly? Currently trying: toSQL = data.frame(...); sqlSave(channel, toSQL, tablename = "Table1", rownames = FALSE, colnames = FALSE, safer = FALSE, fast = TRUE);

Answer 1: By writing the data to a CSV locally and then using a BULK INSERT (not readily
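The same CSV-then-BULK-INSERT route, sketched in Python with pyodbc for comparison (the connection string, paths, and table name are placeholders; note that BULK INSERT reads the file on the server, so the SQL Server service account must be able to reach it):

```python
import pyodbc

# Placeholder connection string -- adjust driver, server, and auth for
# your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;", autocommit=True)

# The CSV must already exist at a path the SQL Server process can read:
# BULK INSERT runs on the server, not on the client machine.
conn.execute(r"""
    BULK INSERT dbo.Table1
    FROM 'C:\staging\toSQL.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)
""")
conn.close()
```

The win comes from the bulk path itself (minimally logged under the right recovery model), not from Python vs R; the R equivalent is write.csv followed by the same BULK INSERT statement over RODBC.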

Load data into Hive with custom delimiter

隐身守侯 posted on 2019-11-27 04:53:58
Question: I'm trying to create an internal (managed) table in Hive that can store my incremental log data. The table goes like this: CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP) ROW FORMAT DELIMITED FIELDS TERMINATED BY '<=>' STORED AS TEXTFILE; I need to load data into this table periodically: LOAD DATA INPATH '/user/foo/data/logs' INTO TABLE logs; But the data is not getting inserted into the table properly; there might be some problem with the delimiter, and I can't find why. Example log
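A likely cause: Hive's FIELDS TERMINATED BY takes a single character, so the three-character '<=>' is not applied as written. One commonly suggested workaround is the MultiDelimitSerDe, which does accept a multi-character delimiter; a hedged sketch, assuming the hive-contrib jar is on the classpath (newer Hive versions ship the SerDe under a different package):

```sql
-- Sketch: same table as the question, but with a SerDe that supports a
-- multi-character field delimiter.
CREATE TABLE logs (foo INT, bar STRING, created_date TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "<=>")
STORED AS TEXTFILE;
```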

Strategies for reading in CSV files in pieces?

泪湿孤枕 posted on 2019-11-27 04:24:22
I have a moderate-sized file (4GB CSV) on a computer that doesn't have sufficient RAM to read it in (8GB on 64-bit Windows). In the past I would just have loaded it up on a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4GB of RAM (despite the hardware having 16GB per machine), so I need a short-term fix. Is there a way to read in part of a CSV file into R to fit available memory limitations? That way I could read in a third of the file at a time, subset it down to the rows and columns I need, and then read in the next third? Thanks to commenters for
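For comparison, pandas exposes this read-subset-discard pattern directly through chunksize; in R itself, read.csv's nrows/skip arguments or data.table::fread support the same approach. A Python sketch (the path, column names, and filter are placeholders):

```python
import pandas as pd

# Read a bounded number of rows at a time, subset immediately, and keep
# only the filtered pieces -- memory never holds the full 4 GB file.
pieces = []
for chunk in pd.read_csv("big.csv", chunksize=500_000):
    pieces.append(chunk.loc[chunk["keep_flag"] == 1, ["id", "value"]])

subset = pd.concat(pieces, ignore_index=True)
```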

Operation Time Out Error in cqlsh console of cassandra

北战南征 posted on 2019-11-27 03:53:47
Question: I have a three-node Cassandra cluster, and I have created one table which has more than 2,000,000 rows. When I execute this query ( select count(*) from userdetails ) in cqlsh, I get this error: OperationTimedOut: errors={}, last_host=192.168.1.2 When I run the count on fewer rows, or with limit 50,000, it works fine.

Answer 1: count(*) actually pages through all the data, so a select count(*) from userdetails without a limit would be expected to time out with that many rows. Some details here:
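If the server-side count keeps timing out, one workaround is to page through the rows client-side and count them. A sketch with the DataStax Python driver (the keyspace, selected column, fetch size, and timeout are illustrative assumptions; only the contact point and table come from the question):

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["192.168.1.2"])        # contact point from the error message
session = cluster.connect("mykeyspace")   # placeholder keyspace name

# Pull 5,000 rows per page instead of asking the coordinator to scan
# everything within a single request timeout; select one small column
# (placeholder name) to minimize transfer.
stmt = SimpleStatement("SELECT foo FROM userdetails", fetch_size=5000)
count = sum(1 for _ in session.execute(stmt, timeout=120))
print(count)

cluster.shutdown()
```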

Calculate Euclidean distance matrix using a big.matrix object

Deadly posted on 2019-11-27 03:38:05
Question: I have an object of class big.matrix in R with dimension 778844 x 2. The values are all integers (kilometres). My objective is to calculate the Euclidean distance matrix using the big.matrix and have as a result an object of class big.matrix. I would like to know if there is an optimal way of doing that. The reason for my choice of using the class big.matrix is memory limitation. I could transform my big.matrix to an object of class matrix and calculate the Euclidean distance matrix using
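One blockwise approach, sketched in Python with a numpy memmap standing in for big.matrix: compute the distance matrix one row-block at a time so peak RAM stays O(block x n). Note the full 778844 x 778844 result is roughly 2.4 TB even as float32, so in practice one usually consumes each block as it is produced rather than materializing the whole matrix; the sizes below are deliberately toy:

```python
import numpy as np

def pairwise_block(points, out, block=1000):
    """Fill out[i, j] with Euclidean distances, one row-block at a time."""
    n = points.shape[0]
    for start in range(0, n, block):
        stop = min(start + block, n)
        # (block, 1, 2) - (1, n, 2) broadcasts to (block, n, 2)
        diff = points[start:stop, None, :] - points[None, :, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=-1))

# Toy stand-in for the 778844 x 2 integer (km) coordinates.
n = 5000
pts = np.random.randint(0, 1000, size=(n, 2)).astype(np.float64)

# Disk-backed output, analogous to big.matrix's file-backed storage.
dist = np.memmap("dist.dat", dtype=np.float32, mode="w+", shape=(n, n))
pairwise_block(pts, dist)
dist.flush()
```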