bigdata

Get a list of subdirectories

大城市里の小女人 submitted on 2019-12-12 03:47:26
Question: I know I can do this:

    data = sc.textFile('/hadoop_foo/a')
    data.count()
    240
    data = sc.textFile('/hadoop_foo/*')
    data.count()
    168129

However, I would like to count the data of every subdirectory of "/hadoop_foo/" separately. Can I do that? In other words, what I want is something like this:

    subdirectories = magicFunction()
    for subdir in subdirectories:
        data = sc.textFile(subdir)
        data.count()

I tried:

    In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
    Out[9]: []

but I think that fails, because
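One way to get the subdirectory list, sketched under the assumption that "/hadoop_foo/" lives on HDFS (os.walk only walks the driver's local filesystem, which is why it returns an empty list): go through the Hadoop FileSystem API that PySpark exposes on its JVM gateway. The underscore-prefixed handles (sc._jvm, sc._jsc) are internal, so treat this as a sketch rather than a stable API:

    # list the immediate children of /hadoop_foo/ on HDFS and keep only directories
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    statuses = fs.listStatus(hadoop_fs.Path("/hadoop_foo/"))
    subdirectories = [str(s.getPath()) for s in statuses if s.isDirectory()]

    # count the records in each subdirectory separately
    for subdir in subdirectories:
        print(subdir, sc.textFile(subdir).count())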

Microsoft Windows Azure storage: the remote server returned an error 404 not found

折月煮酒 submitted on 2019-12-12 03:45:13
Question: I am constantly getting a "404 not found" error. I have created a cluster, a storage account and a container. The detailed error I get is:

    Unhandled Exception: System.AggregateException: One or more errors occurred. --- Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (404) Not Found. System.Net.WebException: The remote server returned an error: (404) Not Found.

This is my code:

    public static void ConnectToAzureCloudServer()
    {
        HadoopJobConfiguration

Quickly sampling large number of rows from large dataframes in python

霸气de小男生 submitted on 2019-12-12 03:44:40
Question: I have a very large dataframe (about 1.1M rows) and I am trying to sample it. I have a list of indexes (about 70,000 of them) that I want to select from the entire dataframe. This is what I've tried so far, but all these methods take far too much time:

Method 1 - using pandas:

    sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
    sample = sample[sample['Id'].isin(sample_index_array)]

Method 2: I tried to write all the sampled lines to another csv.

    f = open("data.csv",'r')
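A sketch of one alternative, assuming the 70,000 indexes correspond to 0-based row positions in data.csv (sample_index_array and data.csv are the names from the question; wanted is a hypothetical helper): a callable passed to skiprows lets read_csv drop the unwanted rows while parsing, instead of loading all 1.1M rows and filtering afterwards.

    import pandas as pd

    wanted = set(sample_index_array)  # set membership test is O(1) per row

    # keep the header line (0) and any data line whose 0-based row position is in `wanted`
    sample = pd.read_csv(
        "data.csv",
        skiprows=lambda i: i != 0 and (i - 1) not in wanted,
    )

Whether this is actually faster depends on how the indexes map onto file rows; if they are index labels rather than positions, reading the file once and selecting with .loc may be the simpler route.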

Python pandas error while removing extra white space

爱⌒轻易说出口 submitted on 2019-12-12 03:39:55
Question: I am trying to clean extra white space out of a column in a data frame. The data frame has close to 8 million records. The command:

    datt2.My_variable = datt2.My_variable.str.replace('\s+', ' ')

ends up giving the error below:

    MemoryError Traceback (most recent call last)
    <ipython-input-10-158a51cfaa3d> in <module>()
    ----> 1 datt2.My_variable=datt2.My_variable.str.replace('\s+', ' ')
    c:\python27\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
       1504 def replace(self,
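A sketch of one workaround, on the assumption that running the regex over all 8 million strings at once is what exhausts memory: apply the same replacement in slices so only part of the column is being rewritten at any moment (datt2 and My_variable are the names from the question; chunk_size is a hypothetical knob):

    import pandas as pd

    chunk_size = 500000
    pieces = []
    for start in range(0, len(datt2), chunk_size):
        chunk = datt2['My_variable'].iloc[start:start + chunk_size]
        pieces.append(chunk.str.replace(r'\s+', ' '))  # same regex, smaller working set
    datt2['My_variable'] = pd.concat(pieces)

Chunking only lowers the peak allocation; if the interpreter is simply out of address space (the traceback shows a Python 2.7 under c:\python27, which may well be 32-bit), a 64-bit interpreter or more RAM is the more fundamental fix.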

Compare two strings and find how closely they are related by meaning

点点圈 submitted on 2019-12-12 03:38:31
Question: Problem: I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Those are both songs by the same artist, so they should give a higher score (probability, percentage, etc.) than, say, "Brad Pitt" and "Jamaican Farewell". One way of doing this is an open-source Java tool named WikipediaMiner, which compares terms using the Wikipedia data dump, checking links, descriptions, etc. Question: Please suggest a better alternative,
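A different, word-vector-based sketch (not the Wikipedia link-structure approach WikipediaMiner uses), assuming spaCy and a model that ships word vectors are installed; relatedness scores for short proper-noun phrases computed this way are rough at best:

    import spacy

    # requires a vectors-bearing model, e.g.: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    def relatedness(a, b):
        # cosine similarity of the averaged token vectors of the two phrases
        return nlp(a).similarity(nlp(b))

    print(relatedness("Billie Jean", "Thriller"))
    print(relatedness("Brad Pitt", "Jamaican Farewell"))

Because the vectors encode general lexical semantics rather than encyclopedic links, there is no guarantee the first pair scores higher than the second; an approach grounded in Wikipedia or a music knowledge base stays closer to what the question asks for.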

Failed to get broadcast_1_piece0 of broadcast_1 in Spark Streaming job

有些话、适合烂在心里 submitted on 2019-12-12 03:28:02
Question: I am running Spark jobs on YARN in cluster mode. The job gets its messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. The first time I start the job it runs fine without any issue. If I kill the job and restart it, the executor throws the exception below upon receiving a message from Kafka:

    java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
        at org.apache.spark.util.Utils$.tryOrIOException(Utils
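A sketch of the commonly suggested workaround, written here in PySpark and under the assumption that the broadcast variable is captured by the checkpointed DStream closures: re-create the broadcast lazily from inside the output operation, so a restart from checkpoint does not try to fetch a broadcast id that no longer exists on the driver (load_lookup, process_partition and stream are hypothetical names):

    _broadcast = None

    def load_lookup():
        # hypothetical: build whatever data the job needs to broadcast
        return {}

    def get_broadcast(sc):
        # Created lazily on the (new) driver after a restart, instead of being
        # restored from the checkpointed closure.
        global _broadcast
        if _broadcast is None:
            _broadcast = sc.broadcast(load_lookup())
        return _broadcast

    def process_partition(partition, lookup):
        for message in partition:
            pass  # use `lookup` on each message here

    def handle_rdd(rdd):
        bc = get_broadcast(rdd.context)
        rdd.foreachPartition(lambda part: process_partition(part, bc.value))

    stream.foreachRDD(handle_rdd)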

Couldn't join two files with one key via Cascading

眉间皱痕 submitted on 2019-12-12 03:06:27
Question: Let's see what we have. First file [Interface Class]:

    list arrayList
    list linkedList

Second file [Class1 amount]:

    arrayList 120
    linkedList 4

I would like to join these two files by the key [Class] and get the count for each Interface:

    list arraylist 120
    list linkedlist 4

Code:

    public class Main {
        public static void main( String[] args ) {
            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            String doc2Path = args[ 2 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass(

Unable to launch Hive with MySQL metastore

让人想犯罪 __ submitted on 2019-12-12 02:56:16
Question: When I use Hive with the Derby metastore, it works fine. I want to use a MySQL metastore, so I followed this link: https://dzone.com/articles/how-configure-mysql-metastore. Now when I launch Hive by typing the "hive" command in the terminal, I get many errors, e.g.:

    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    Unable to open a test connection to the given database. JDBC url = jdbc:mysql
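For reference, a sketch of the kind of hive-site.xml entries this setup usually involves; the property names are the standard Hive metastore keys, while the host, database name, user and password values below are placeholders, not taken from the question:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>

"Unable to open a test connection" typically points at the JDBC URL, the credentials, a MySQL connector JAR missing from Hive's lib directory, or MySQL not allowing connections from that host, so those are the usual things to re-check against the linked guide.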

R Memory “Cannot allocate vector of size N”

北战南征 submitted on 2019-12-12 02:49:14
Question: I am trying to run the ExtremeBound package in R and it crashes when I run it because the memory seems to be too small... Here is the error message:

    Error: cannot allocate vector of size 2.6 Gb
    In addition: Warning messages:
    1: In colnames(vif.satisfied) <- colnames(include) <- colnames(weight) <- colnames(cdf.mu.generic) <- vars.labels :
      Reached total allocation of 16296Mb: see help(memory.size)
    2: In colnames(vif.satisfied) <- colnames(include) <- colnames(weight) <- colnames(cdf.mu

How to execute MapReduce programs in Oozie with Hadoop 2.2

自古美人都是妖i submitted on 2019-12-12 02:49:00
Question: I am using Hadoop 2.2.0 and Oozie 4.0.0 on Ubuntu. I am not able to execute MapReduce programs in Oozie. I am using the resource manager port, 8032, as the job tracker in Oozie. When I schedule a job in Oozie it goes to the running state, and it also runs in YARN, but after some time I get an error like the one below in the Hadoop logs, while the Oozie logs still show it as running:

    Error: 2014-05-30 10:38:14,322 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application