bigdata

How to change sqoop metastore?

Submitted by 血红的双手。 on 2019-12-17 19:05:04
Question: I am using Sqoop version 1.4.2. I am trying to change the Sqoop metastore from the default HSQLDB to MySQL. I have configured the following properties in the sqoop-site.xml file:

    <property>
      <name>sqoop.metastore.client.enable.autoconnect</name>
      <value>false</value>
      <description>If true, Sqoop will connect to a local metastore for job management when no other metastore arguments are provided.</description>
    </property>
    <property>
      <name>sqoop.metastore.client.autoconnect.url</name>
      <value>jdbc:mysql://ip

How can I tell when my dataset in R is going to be too large?

Submitted by 你离开我真会死。 on 2019-12-17 17:24:35
Question: I am going to be undertaking some logfile analyses in R (unless I can't do it in R), and I understand that my data needs to fit in RAM (unless I use some kind of fix like an interface to a key-value store, maybe?). So I am wondering how to tell ahead of time how much room my data is going to take up in RAM, and whether I will have enough. I know how much RAM I have (not a huge amount - 3GB under XP), and I know how many rows and cols my logfile will end up as and what data types the col entries
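A quick back-of-envelope way to answer "will it fit" is to multiply rows by an estimated per-row cost. A minimal sketch of that arithmetic (the row count and per-column byte costs below are illustrative assumptions; R numeric values take roughly 8 bytes each, while character columns vary widely):

    # Rough estimate of the in-memory size of a rectangular dataset.
    n_rows = 5_000_000        # hypothetical number of log lines
    numeric_cols = 4          # ~8 bytes per numeric value
    string_cols = 3           # assume ~50 bytes per string value on average

    estimated_bytes = n_rows * (numeric_cols * 8 + string_cols * 50)
    print(f"Estimated footprint: {estimated_bytes / 1024**3:.2f} GiB (vs. 3 GiB of RAM)")

If the estimate lands anywhere near the available 3 GB, it is worth leaving generous headroom, since intermediate copies made during analysis can multiply the working footprint.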

Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

Submitted by 感情迁移 on 2019-12-17 15:34:31
Question: Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import a big JSON file into es-server.

Answer 1: You should use the Bulk API. Note that you will need to add a header line before each JSON document.

    $ cat requests
    { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
    { "field1" : "value1" }
    $ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
    {"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version
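For scripted imports, the same Bulk API call can be driven from Python. A minimal sketch, assuming the source file has one JSON document per line; the file name, index name, and local endpoint are placeholders:

    # Build bulk bodies in chunks: one action/metadata line per document,
    # followed by the document itself, newline-delimited.
    import json
    import requests

    BULK_URL = "http://localhost:9200/_bulk"

    def send_chunk(lines):
        body = "\n".join(lines) + "\n"   # a bulk body must end with a newline
        resp = requests.post(BULK_URL, data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()

    def bulk_import(path, index="test", chunk_docs=500):
        lines = []
        with open(path, encoding="utf8") as f:
            for doc in f:
                doc = doc.strip()
                if not doc:
                    continue
                lines.append(json.dumps({"index": {"_index": index}}))
                lines.append(doc)
                if len(lines) >= 2 * chunk_docs:
                    send_chunk(lines)
                    lines = []
        if lines:
            send_chunk(lines)

    bulk_import("documents.json")

Chunking keeps each bulk request at a manageable size instead of posting the entire file in one call.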

PySpark DataFrames - way to enumerate without converting to Pandas?

Submitted by 若如初见. on 2019-12-17 04:07:08
Question: I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, so that I can access a record by a certain index (or select a group of records in an index range). In pandas, I could just do:

    indexes = [2, 3, 6, 7]
    df[indexes]

Here I want something similar, without converting the DataFrame to pandas. The closest I can get is enumerating all the objects in the original DataFrame by:

    indexes = np.arange(df.count())
    df_indexed = df.withColumn('index', indexes)
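One common workaround is zipWithIndex on the underlying RDD, which assigns consecutive 0..n-1 positions without collecting anything to the driver. A minimal sketch, assuming df is an existing DataFrame in an active SparkSession:

    # df.rdd is the underlying RDD of Row objects; zipWithIndex pairs each Row
    # with its consecutive position without bringing data to the driver.
    indexed_rdd = df.rdd.zipWithIndex()

    # Row is a tuple subclass, so the index can be appended and the columns renamed.
    df_indexed = indexed_rdd.map(lambda pair: pair[0] + (pair[1],)) \
                            .toDF(df.columns + ["index"])

    # Select specific records by index, analogous to the pandas df[indexes] idiom.
    subset = df_indexed.filter(df_indexed["index"].isin([2, 3, 6, 7]))

Unlike monotonically_increasing_id, the indices produced by zipWithIndex are consecutive, which makes range selections straightforward.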

Access HBase in VirtualBox from a Windows Java application

Submitted by 那年仲夏 on 2019-12-14 04:12:20
Question: Hi, I am new to HBase and trying to practice with it. First of all I would like to describe my system configuration. BACKGROUND: I am using Windows 7 and installed Oracle VirtualBox, then installed Ubuntu Server on VirtualBox, and after that installed hbase0.98-hadoop2-bin.tar.gz on Ubuntu. I have configured HBase in standalone mode. My hbase-site.xml file looks like:

    <Configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///home/abc/hbase</value>
      </property>
      <property>
        <name>hbase

Hadoop WordCount sorted by word occurrences

Submitted by 孤街醉人 on 2019-12-14 04:10:00
Question: I need to run WordCount so that it gives me all the words and their occurrences, but sorted by the occurrences and not alphabetically. I understand that I need to create two jobs for this and run one after the other. I used the mapper and the reducer from Sorted word count using Hadoop MapReduce:

    package org.myorg;

    import java.io.IOException;
    import java.util.*;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapreduce
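The logic behind the two-job approach can be sketched outside Hadoop: the first job emits (word, count) pairs, and the second swaps them to (count, word) so that sorting by key orders the output by occurrences. A minimal local Python sketch of that idea (the sample text is an assumption; this is not the MapReduce code itself):

    # Job 1: count words. Job 2: swap key and value so the sort key is the count.
    from collections import Counter

    def job1_wordcount(lines):
        counts = Counter()
        for line in lines:
            counts.update(line.split())
        return counts                      # word -> occurrences

    def job2_sort_by_count(counts):
        # In the real second job, emitting (count, word) lets the framework's
        # shuffle/sort produce this ordering.
        return sorted(((c, w) for w, c in counts.items()), reverse=True)

    text = ["the quick brown fox", "the lazy dog", "the fox"]
    for count, word in job2_sort_by_count(job1_wordcount(text)):
        print(count, word)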

How can I apply ffdf to non-atomic data frames?

Submitted by 爱⌒轻易说出口 on 2019-12-14 03:44:43
Question: Many posts (such as this) claim that the ff package is superior to bigmemory because it can handle objects with both atomic and non-atomic components, but how? For example:

    UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
    STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE', 'ACTIVE','ACTIVE','INACTIVE')
    TERMINATED <- as.Date(c('1999-07-06','2008-12-05','2000-08-18','2000-08-18','2000-08-18', '2008-08-18','2008-08-18','2006-09-19','2006-09-19','2006-09-19

C++ buffered file reading

Submitted by 陌路散爱 on 2019-12-14 03:42:30
Question: I wonder whether reading a large text file line by line (e.g., with std::getline or fgets) can be buffered with a predefined read buffer size, or whether one must use special byte-wise functions. I mean reading very large files while optimizing the number of I/O operations (e.g., reading 32 MB from the HDD at a time). Of course I can handcraft buffered reading, but I thought standard file streams had that possibility.

Answer 1: Neither line-by-line, nor special byte-wise functions. Instead, the following should do your job:

Read a range of lines from the Yelp dataset in Python

Submitted by 只谈情不闲聊 on 2019-12-14 03:33:35
Question: I want to change this code so that it specifically reads from line 1400001 to line 1450000. What is the modification? The file is composed of a single object type, one JSON object per line. I also want to save the output to a .csv file. What should I do?

    revu=[]
    with open("review.json", 'r',encoding="utf8") as f:
        for line in f:
            revu = json.loads(line[1400001:1450000)

Answer 1: If it is JSON per line:

    revu=[]
    with open("review.json", 'r',encoding="utf8") as f:
        # expensive statement, depending on your filesize this might #
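One way to read only that line range lazily is itertools.islice, writing each parsed object out with csv.DictWriter. A minimal sketch, assuming one JSON object per line, flat records that share the same keys, and a placeholder output filename:

    import csv
    import json
    from itertools import islice

    START, STOP = 1400001, 1450000   # 1-based line numbers, inclusive

    with open("review.json", encoding="utf8") as f, \
         open("reviews_subset.csv", "w", newline="", encoding="utf8") as out:
        writer = None
        # islice skips to the wanted range without loading earlier lines into memory.
        for line in islice(f, START - 1, STOP):
            record = json.loads(line)
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)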

Number of buckets in LSH

Submitted by 社会主义新天地 on 2019-12-14 01:26:50
Question: In LSH, you hash slices of the documents into buckets. The idea is that documents falling into the same bucket are potentially similar, and thus possibly nearest neighbors. For 40,000 documents, what is a good (rough) value for the number of buckets? I have it as number_of_buckets = 40,000/4 now, but I feel it can be reduced further. Any ideas, please? Related: How to hash vectors into buckets in Locality Sensitive Hashing (using Jaccard distance)?

Answer 1: A common starting
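As background on what the buckets hold, the banding step of MinHash-based LSH can be sketched as follows; the signature length, band count, and the use of a plain dict (so the number of buckets is not fixed up front) are assumptions for illustration:

    # Split each MinHash signature into bands and hash every band into a bucket;
    # documents that share any band bucket become candidate pairs.
    from collections import defaultdict

    BANDS = 20
    ROWS_PER_BAND = 5                 # signature length = 20 * 5 = 100

    def band_buckets(signatures):
        """signatures: dict mapping doc_id -> list of 100 MinHash values."""
        buckets = defaultdict(set)    # (band index, band hash) -> doc ids
        for doc_id, sig in signatures.items():
            for b in range(BANDS):
                band = tuple(sig[b * ROWS_PER_BAND:(b + 1) * ROWS_PER_BAND])
                buckets[(b, hash(band))].add(doc_id)
        return buckets

    def candidate_pairs(buckets):
        # Every pair sharing a bucket is a candidate to verify with the exact distance.
        pairs = set()
        for docs in buckets.values():
            docs = sorted(docs)
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add((docs[i], docs[j]))
        return pairs

With banding, the tunable quantity is really the (bands, rows) pair rather than a raw bucket count; for 20 bands of 5 rows, the classic threshold estimate (1/b)^(1/r) works out to roughly 0.55 Jaccard similarity.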