bigdata

Get a list of subdirectories

大城市里の小女人 submitted on 2019-12-12 03:47:26
Question: I know I can do this:

    data = sc.textFile('/hadoop_foo/a')
    data.count()
    240
    data = sc.textFile('/hadoop_foo/*')
    data.count()
    168129

However, I would like to count the data of every subdirectory of "/hadoop_foo/" separately. Can I do that? In other words, what I want is something like this:

    subdirectories = magicFunction()
    for subdir in subdirectories:
        data = sc.textFile(subdir)
        data.count()

I tried:

    In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
    Out[9]: []

but I think that fails, because
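One way to get the subdirectory list, sketched under the assumption that "/hadoop_foo/" lives on HDFS (os.walk only walks the driver's local filesystem, which is why it returns an empty list): go through the Hadoop FileSystem API that PySpark exposes on its JVM gateway. The underscore-prefixed handles (sc._jvm, sc._jsc) are internal, so treat this as a sketch rather than a stable API:

    # list the immediate children of /hadoop_foo/ on HDFS and keep only directories
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    statuses = fs.listStatus(hadoop_fs.Path("/hadoop_foo/"))
    subdirectories = [str(s.getPath()) for s in statuses if s.isDirectory()]

    # count the records in each subdirectory separately
    for subdir in subdirectories:
        print(subdir, sc.textFile(subdir).count())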

Microsoft Windows Azure storage: the remote server returned an error 404 not found

折月煮酒 submitted on 2019-12-12 03:45:13
Question: I am constantly getting a "404 not found" error. I have created a cluster, a storage account and a container. The detailed error I get is:

    Unhandled Exception: System.AggregateException: One or more errors occurred. --- Microsoft.WindowsAzure.Storage.StorageException: The remote server returned an error: (404) Not Found. System.Net.WebException: The remote server returned an error: (404) Not Found.

This is my code:

    public static void ConnectToAzureCloudServer()
    {
        HadoopJobConfiguration

Quickly sampling large number of rows from large dataframes in python

霸气de小男生 submitted on 2019-12-12 03:44:40
Question: I have a very large dataframe (about 1.1M rows) and I am trying to sample it. I have a list of indexes (about 70,000 of them) that I want to select from the entire dataframe. This is what I've tried so far, but all these methods take far too much time:

Method 1 - using pandas:

    sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
    sample = sample[sample['Id'].isin(sample_index_array)]

Method 2: I tried to write all the sampled lines to another csv.

    f = open("data.csv",'r')
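A sketch of one alternative, assuming the 70,000 indexes correspond to 0-based row positions in data.csv (sample_index_array and data.csv are the names from the question; wanted is a hypothetical helper): a callable passed to skiprows lets read_csv drop the unwanted rows while parsing, instead of loading all 1.1M rows and filtering afterwards.

    import pandas as pd

    wanted = set(sample_index_array)  # set membership test is O(1) per row

    # keep the header line (0) and any data line whose 0-based row position is in `wanted`
    sample = pd.read_csv(
        "data.csv",
        skiprows=lambda i: i != 0 and (i - 1) not in wanted,
    )

Whether this is actually faster depends on how the indexes map onto file rows; if they are index labels rather than positions, reading the file once and selecting with .loc may be the simpler route.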

Python pandas error while removing extra white space

爱⌒轻易说出口 submitted on 2019-12-12 03:39:55
Question: I am trying to clean extra white space out of a column in a data frame. The data frame has close to 8 million records. The command:

    datt2.My_variable = datt2.My_variable.str.replace('\s+', ' ')

ends up giving the error below:

    MemoryError Traceback (most recent call last)
    <ipython-input-10-158a51cfaa3d> in <module>()
    ----> 1 datt2.My_variable=datt2.My_variable.str.replace('\s+', ' ')
    c:\python27\lib\site-packages\pandas\core\strings.pyc in replace(self, pat, repl, n, case, flags)
       1504 def replace(self,
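A sketch of one workaround, on the assumption that running the regex over all 8 million strings at once is what exhausts memory: apply the same replacement in slices so only part of the column is being rewritten at any moment (datt2 and My_variable are the names from the question; chunk_size is a hypothetical knob):

    import pandas as pd

    chunk_size = 500000
    pieces = []
    for start in range(0, len(datt2), chunk_size):
        chunk = datt2['My_variable'].iloc[start:start + chunk_size]
        pieces.append(chunk.str.replace(r'\s+', ' '))  # same regex, smaller working set
    datt2['My_variable'] = pd.concat(pieces)

Chunking only lowers the peak allocation; if the interpreter is simply out of address space (the traceback shows a Python 2.7 under c:\python27, which may well be 32-bit), a 64-bit interpreter or more RAM is the more fundamental fix.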

Compare two strings and find how closely they are related by meaning

点点圈 submitted on 2019-12-12 03:38:31
Question: Problem: I have two strings, say, "Billie Jean" and "Thriller". I need to programmatically compare them and find how closely they are related. Those are both songs by the same artist, so they should give a higher score (probability, percentage, etc.) than, say, "Brad Pitt" and "Jamaican Farewell". One way of doing this is an open-source Java tool named WikipediaMiner, which compares terms using the Wikipedia data dump, checking links, descriptions, etc. Question: Please suggest a better alternative,
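A different, word-vector-based sketch (not the Wikipedia link-structure approach WikipediaMiner uses), assuming spaCy and a model that ships word vectors are installed; relatedness scores for short proper-noun phrases computed this way are rough at best:

    import spacy

    # requires a vectors-bearing model, e.g.: python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    def relatedness(a, b):
        # cosine similarity of the averaged token vectors of the two phrases
        return nlp(a).similarity(nlp(b))

    print(relatedness("Billie Jean", "Thriller"))
    print(relatedness("Brad Pitt", "Jamaican Farewell"))

Because the vectors encode general lexical semantics rather than encyclopedic links, there is no guarantee the first pair scores higher than the second; an approach grounded in Wikipedia or a music knowledge base stays closer to what the question asks for.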

Failed to get broadcast_1_piece0 of broadcast_1 in Spark Streaming job

有些话、适合烂在心里 submitted on 2019-12-12 03:28:02
Question: I am running Spark jobs on YARN in cluster mode. The job gets its messages from a Kafka direct stream. I am using broadcast variables and checkpointing every 30 seconds. The first time I start the job it runs fine without any issue. If I kill the job and restart it, the executor throws the exception below upon receiving a message from Kafka:

    java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
        at org.apache.spark.util.Utils$.tryOrIOException(Utils
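A sketch of the commonly suggested workaround, written here in PySpark and under the assumption that the broadcast variable is captured by the checkpointed DStream closures: re-create the broadcast lazily from inside the output operation, so a restart from checkpoint does not try to fetch a broadcast id that no longer exists on the driver (load_lookup, process_partition and stream are hypothetical names):

    _broadcast = None

    def load_lookup():
        # hypothetical: build whatever data the job needs to broadcast
        return {}

    def get_broadcast(sc):
        # Created lazily on the (new) driver after a restart, instead of being
        # restored from the checkpointed closure.
        global _broadcast
        if _broadcast is None:
            _broadcast = sc.broadcast(load_lookup())
        return _broadcast

    def process_partition(partition, lookup):
        for message in partition:
            pass  # use `lookup` on each message here

    def handle_rdd(rdd):
        bc = get_broadcast(rdd.context)
        rdd.foreachPartition(lambda part: process_partition(part, bc.value))

    stream.foreachRDD(handle_rdd)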

Couldn't join two files with one key via Cascading

眉间皱痕 submitted on 2019-12-12 03:06:27
Question: Let's see what we have. First file [Interface Class]:

    list arrayList
    list linkedList

Second file [Class1 amount]:

    arrayList 120
    linkedList 4

I would like to join these two files by the key [Class] and get the count for each Interface:

    list arraylist 120
    list linkedlist 4

Code:

    public class Main {
        public static void main( String[] args ) {
            String docPath = args[ 0 ];
            String wcPath = args[ 1 ];
            String doc2Path = args[ 2 ];
            Properties properties = new Properties();
            AppProps.setApplicationJarClass(

Unable to launch Hive with MySQL metastore

让人想犯罪 __ submitted on 2019-12-12 02:56:16
Question: When I use Hive with the Derby metastore, it works fine. I want to use a MySQL metastore, so I followed this link: https://dzone.com/articles/how-configure-mysql-metastore. Now when I launch Hive by typing the "hive" command in the terminal, I get many errors, e.g.:

    Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

    Unable to open a test connection to the given database. JDBC url = jdbc:mysql
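For reference, a sketch of the kind of hive-site.xml entries this setup usually involves; the property names are the standard Hive metastore keys, while the host, database name, user and password values below are placeholders, not taken from the question:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hiveuser</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>hivepassword</value>
    </property>

"Unable to open a test connection" typically points at the JDBC URL, the credentials, a MySQL connector JAR missing from Hive's lib directory, or MySQL not allowing connections from that host, so those are the usual things to re-check against the linked guide.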

R Memory “Cannot allocate vector of size N”

北战南征 submitted on 2019-12-12 02:49:14
Question: I am trying to run the ExtremeBound package in R and it crashes when I run it because the memory seems to be too small... Here is the error message:

    Error: cannot allocate vector of size 2.6 Gb
    In addition: Warning messages:
    1: In colnames(vif.satisfied) <- colnames(include) <- colnames(weight) <- colnames(cdf.mu.generic) <- vars.labels :
      Reached total allocation of 16296Mb: see help(memory.size)
    2: In colnames(vif.satisfied) <- colnames(include) <- colnames(weight) <- colnames(cdf.mu

How to execute MapReduce programs in Oozie with Hadoop 2.2

自古美人都是妖i submitted on 2019-12-12 02:49:00
Question: I am using Hadoop 2.2.0 and Oozie 4.0.0 on Ubuntu. I am not able to execute MapReduce programs in Oozie. I am using the resource manager port, 8032, as the job tracker in Oozie. When I schedule a job in Oozie it goes to the running state, and it also runs in YARN, but after some time I get an error like the one below in the Hadoop logs, while the Oozie logs still show it as running:

    Error: 2014-05-30 10:38:14,322 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application