bigdata

Why Cassandra COUNT(*) on a specific partition takes really long on relatively small datasets

无人久伴 submitted on 2019-12-11 06:17:54
Question: I have a table defined like this:

Keyspace:
CREATE KEYSPACE messages WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

Table:
CREATE TABLE messages.textmessages (
    categoryid int,
    date timestamp,
    messageid timeuuid,
    message text,
    userid int,
    PRIMARY KEY ((categoryid, date), messageid)
) WITH CLUSTERING ORDER BY (messageid ASC);

The goal is to have a wide-row time-series storage such that categoryid and date (beginning of day) constitute my
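For reference, a per-partition count against the schema quoted above could be issued from Python roughly as follows. This is a minimal sketch, assuming the DataStax cassandra-driver package, a local contact point, and hypothetical categoryid/date values; it is not taken from the original question.

from datetime import datetime
from cassandra.cluster import Cluster

# Hypothetical contact point; adjust to the actual cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("messages")

# Both partition-key columns must be bound to target a single partition.
day = datetime(2017, 1, 1)  # hypothetical "beginning of day" bucket
row = session.execute(
    "SELECT COUNT(*) FROM textmessages WHERE categoryid = %s AND date = %s",
    (42, day),
).one()
print("messages in partition:", row[0])

cluster.shutdown()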

Send executable jar to hadoop cluster and run as “hadoop jar”

对着背影说爱祢 submitted on 2019-12-11 06:14:01
Question: I commonly build an executable jar with a main method and run it from the command line as "hadoop jar Some.jar ClassWithMain input output". In this main method, Job and Configuration may be configured, and the Configuration class has a setter to specify the mapper or reducer class, e.g. conf.setMapperClass(Mapper.class). However, in the case of submitting the job remotely, I should set the jar and the Mapper (or more classes) using the Hadoop client API. job.setJarByClass(HasMainMethod.class); job.setMapperClass(Mapper
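As a side note, the command-line flow described at the start of this excerpt can be scripted from Python via subprocess; the sketch below is only an illustration, assuming the hadoop CLI is on the PATH and reusing the hypothetical Some.jar / ClassWithMain / input / output names from the quoted command.

import subprocess

# Hypothetical jar, main class and paths, taken from the command line quoted above.
cmd = ["hadoop", "jar", "Some.jar", "ClassWithMain", "input", "output"]

# Assumes the 'hadoop' CLI is installed and configured for the target cluster.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode)
print(result.stdout)
print(result.stderr)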

How to install and launch mahout for spark?

人走茶凉 submitted on 2019-12-11 06:07:14
Question: I am interested in learning machine learning algorithms for big data, and for that purpose I want to learn how to code in Mahout for Spark. I posted my original question here, but nobody answered, so I am modifying my question now. If anyone knows the detailed procedure for installing the LATEST Spark on Ubuntu 14.04 and integrating MAHOUT with it, I will be really grateful. Thanks in advance.

Answer 1: Currently Mahout uses: Spark 1.6.2, Scala 2.10.4. You can try to build your own version

ERROR tool.BaseSqoopTool: Error parsing arguments for job: I have tried to create a Sqoop job, but the following error occurred

那年仲夏 submitted on 2019-12-11 05:51:51
Question:

sqoop job --create myjob --import --connect "jdbc:mysql://localhost/classicmodels" --username root --password 123 --table customers -m 1 --taget-dir /manoj280217/sqoop

Error:
17/02/28 08:56:18 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
17/02/28 08:56:18 ERROR tool.BaseSqoopTool: Error parsing arguments for job:
17/02/28 08:56:18 ERROR tool.BaseSqoopTool: Unrecognized argument: --import
17/02/28 08:56:18 ERROR tool.BaseSqoopTool: Unrecognized argument: --connect
17/02/28 08:56:18 ERROR

R code runs too slow: how to rewrite this code

天大地大妈咪最大 submitted on 2019-12-11 05:48:09
Question: The file input.txt contains 8,000,000 rows and 4 columns. The first 2 columns are text; the last 2 columns are numbers. The number of unique symbols (e.g., "c33") in columns 1 and 2 is not fixed. The value of columns 3 and 4 is the number of unique symbols in columns 1 and 2, respectively, after splitting by "]". Each row of input.txt looks like this:

c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6

The desired result: row[ , ] represents characters like "c33 c21 c5 c7 c8 c9" or "TPS2 MIC17 ERG3
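To make the described transformation concrete, here is a tiny illustration of the logic (in Python rather than the R asked about, purely to spell out the relationship, and assuming whitespace-separated columns): columns 3 and 4 are the counts of unique "]"-separated symbols in columns 1 and 2.

# One sample row, assuming whitespace-separated columns as in the example above.
line = "c33]c21]c5]c7]c8]c9 TPS2]MIC17]ERG3]NNF1]CIS3]CWP2 6 6"
col1, col2, n1, n2 = line.split()

# Columns 3 and 4 equal the number of unique symbols after splitting by "]".
assert len(set(col1.split("]"))) == int(n1)  # 6 unique symbols in column 1
assert len(set(col2.split("]"))) == int(n2)  # 6 unique symbols in column 2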

How to get count of invalid data during parse

ぐ巨炮叔叔 submitted on 2019-12-11 05:28:53
Question: We are using Spark to parse a big CSV file, which may contain invalid data. We want to save the valid data into the data store and also return how many valid and how many invalid records we imported. I am wondering how we can do this in Spark, and what the standard approach is when reading data. My current approach uses an Accumulator, but it's not accurate due to how accumulators work in Spark.

// we define case class CSVInputData: all fields are defined as string
val csvInput = spark.read.option(
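One workaround that sidesteps accumulator semantics is to derive the invalid count from DataFrame counts instead. The sketch below is a rough pyspark illustration of that approach; the validity rule, column name, and file paths are hypothetical and not taken from the original code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-validity-count").getOrCreate()

# Read every field as string, as in the question, and cache so the file is scanned once.
df = spark.read.option("header", "true").csv("input.csv").cache()

# Hypothetical validity rule: the 'amount' column must be castable to double.
is_valid = F.col("amount").cast("double").isNotNull()

valid_df = df.filter(is_valid).cache()
total = df.count()
valid = valid_df.count()
invalid = total - valid

# Persist only the valid rows, then report both counts.
valid_df.write.mode("overwrite").parquet("valid_rows.parquet")
print("valid:", valid, "invalid:", invalid)

spark.stop()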

Problems using the EclairJS Server

て烟熏妆下的殇ゞ submitted on 2019-12-11 05:24:36
Question: I tried to use the EclairJS Server following the instructions available here: https://github.com/EclairJS/eclairjs/tree/master/server. After executing mvn package I got the following error:

Tests run: 293, Failures: 8, Errors: 9, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:51 min
[INFO] Finished at: 2018-04-10T07:13:41+00:00
[INFO]

Column-bind ff data frames in R

纵饮孤独 submitted on 2019-12-11 05:05:30
Question: I am trying to work with the ff package. In this context I try to cbind two ff data frames. I found a solution for combining an ffdf with an ff vector, but how do I combine two ffdf? Here is my code for combining an ffdf with an ff vector:

library(ff)
## read Bankfull flow ##
setwd(wd)
bf <- read.csv.ffdf(file="G_BANKFULL_km3month.csv", header=TRUE)
## read river discharge, global, monthly values 1971-2000 ##
memory.limit(size=16000)  # increase working memory
dis <- read.table.ffdf(file='RIVER_AVAIL_7100_WG22.txt',

Creating Spark tasks from within tasks (map functions) in the same application

折月煮酒 submitted on 2019-12-11 05:03:49
Question: Is it possible to do a map from a mapper function (i.e. from tasks) in pyspark? In other words, is it possible to open "sub-tasks" from a task? If so, how do I pass the SparkContext to the tasks - just as a variable? I would like to have a job that is composed of many tasks, and each of these tasks should create many tasks as well, without going back to the driver. My use case is like this: I am porting an application that was written using work queues to pyspark. In my old
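Note that SparkContext lives only on the driver and cannot be shipped to executor tasks, so "sub-tasks" in the sense above are not directly possible; a common alternative is to flatten the nested work with flatMap. The following is a minimal pyspark sketch of that pattern, with hypothetical work items and an arbitrary expansion rule, not the original application's logic.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nested-work-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical outer work items; each one expands into several sub-items.
work_items = sc.parallelize(range(10))

def expand(item):
    # Runs on executors: there is no SparkContext here, so this function cannot
    # launch Spark jobs of its own; it just returns the expanded sub-items.
    return [(item, sub) for sub in range(3)]

# flatMap turns the expanded sub-items into one distributed dataset,
# instead of trying to spawn new tasks from inside a task.
flattened = work_items.flatMap(expand)
print(flattened.count())

spark.stop()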

Is there a package like bigmemory in R that can deal with large list objects?

删除回忆录丶 submitted on 2019-12-11 04:16:52
Question: I know that the R package bigmemory works great for dealing with large matrices and data frames. However, I was wondering if there is any package, or any way, to work efficiently with a large list. Specifically, I created a list whose elements are vectors. I have a for loop, and during each iteration multiple values are appended to a selected element of that list (a vector). At first it runs fast, but once the iteration count passes maybe 10000 it slows down gradually (one iteration takes