bigdata

Hierarchical Clustering Large Sparse Distance Matrix R

我只是一个虾纸丫 submitted on 2020-01-04 07:56:40
Question: I am attempting to perform fastclust on a very large set of distances, but I am running into a problem. I have a very large CSV file (about 91 million rows, so a for loop takes too long in R) of similarities between keywords (about 50,000 unique keywords) that, when read into a data.frame, looks like:

    > df
      kwd1 kwd2 similarity
         a    b          1
         b    a          1
         c    a          2
         a    c          2

It is a sparse list, and I can convert it into a sparse matrix using sparseMatrix():

    > myMatrix
      a b c
    a . . .
    b 1 . .
    c 2 . .

However, when I attempt…
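Not part of the original question: a rough Python analogue of the triplet-to-sparse-matrix step, assuming the pairs sit in a CSV with the three columns shown above (the file name similarities.csv is a placeholder).

    import pandas as pd
    from scipy.sparse import coo_matrix

    # Placeholder file name; the original data has roughly 91 million rows.
    df = pd.read_csv("similarities.csv", names=["kwd1", "kwd2", "similarity"])

    # Map each keyword to an integer index so each pair addresses a matrix cell.
    keywords = pd.Index(pd.unique(df[["kwd1", "kwd2"]].values.ravel()))
    rows = keywords.get_indexer(df["kwd1"])
    cols = keywords.get_indexer(df["kwd2"])

    # Sparse matrix holding only the observed similarities
    # (rough analogue of R's sparseMatrix()).
    m = coo_matrix((df["similarity"].to_numpy(), (rows, cols)),
                   shape=(len(keywords), len(keywords)))

Hierarchical clustering itself (for example scipy.cluster.hierarchy.linkage) still expects a dense, condensed distance matrix, which is exactly the memory problem the question runs into.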

How to remove rows after a particular observation is seen for the first time

我怕爱的太早我们不能终老 submitted on 2020-01-04 07:54:38
Question: I have a dataset in which I have an account number and a "days past due" code for every observation. For each account number, as soon as the "days past due" column hits a code like "DLQ3", I want to remove the rest of the rows for that account (even if DLQ3 is the first observation for that account). My dataset looks like:

    Obs_month  Acc_No      OS_Bal    Days_past_due
    201005     2000000031  3572.68   NORM
    201006     2000000031  4036.78   NORM
    200810     2000000049  39741.97  NORM
    200811     2000000049  38437.54  DLQ3
    200812     2000000049  …
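The excerpt stops before any answer. The following pandas sketch is not from the original thread; it shows one way to express the rule, under the reading that each account keeps its rows up to and including its first DLQ3. The data values are illustrative, not the question's.

    import pandas as pd

    # Illustrative data, not the original values: account B hits DLQ3 on its fourth row overall.
    df = pd.DataFrame({
        "Acc_No":        ["A",    "A",    "B",    "B",    "B"],
        "Days_past_due": ["NORM", "NORM", "NORM", "DLQ3", "DLQ1"],
    })

    # Count DLQ3 rows seen for each account strictly before the current row.
    dlq = df["Days_past_due"].eq("DLQ3").astype(int)
    seen_before = dlq.groupby(df["Acc_No"]).cumsum() - dlq

    # Keep rows only while no DLQ3 has been seen yet,
    # i.e. up to and including the account's first DLQ3.
    trimmed = df[seen_before == 0]
    print(trimmed)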

hbase as database in web application

让人想犯罪 __ submitted on 2020-01-04 03:15:14
Question: A big question about using Hadoop or related technologies in a real web application. I just want to find out how a web app can use HBase as its database. I mean, is that what big-data apps actually do, or do they use normal databases and use these sorts of technologies only for analysis? Is it OK to have an online store with an HBase database, or something like that?

Answer 1: Yes, it is perfectly fine to have HBase as your backend. What I am doing to get this done (I have an online community and forum running…
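Not from the answer: a minimal sketch of what "HBase as the backend" can look like from application code, assuming an HBase Thrift server is reachable and the Python happybase client is installed; the host, table, and column names are placeholders.

    import happybase

    # Connect to the HBase Thrift gateway (host name is a placeholder).
    connection = happybase.Connection("hbase-thrift.example.com")
    users = connection.table("users")

    # Write one row: row key plus column-family:qualifier -> value.
    users.put(b"user#42", {b"profile:name": b"Alice",
                           b"profile:email": b"alice@example.com"})

    # Read it back, e.g. for a page render.
    row = users.row(b"user#42")
    print(row[b"profile:name"])

Whether this fits an online store depends on access patterns: HBase gives fast key lookups and scans, but no joins or secondary indexes out of the box.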

Can Spool Dir of flume be in remote machine?

烈酒焚心 submitted on 2020-01-04 02:45:06
Question: I was trying to fetch files from a remote machine into my HDFS whenever a new file arrives in a particular folder. I came across the concept of the spooling directory (spool dir) source in Flume, and it works fine when the spool dir is on the same machine where the Flume agent is running. Is there any way to configure a spool dir on a remote machine? Please help.

Answer 1: You might be aware that Flume can spawn multiple instances, i.e. you can install several Flume agents that pass the data between them. So to…
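The answer breaks off here. The usual shape of that multi-agent setup is sketched below as a pair of Flume agent configurations (hostnames, ports, and paths are placeholders, not from the thread): a spooldir source feeding an Avro sink on the remote machine, and an Avro source feeding an HDFS sink on the Hadoop side.

    # Agent "remote" on the machine that owns the spool directory
    remote.sources = spool
    remote.channels = mem
    remote.sinks = forward

    remote.sources.spool.type = spooldir
    remote.sources.spool.spoolDir = /data/incoming
    remote.sources.spool.channels = mem

    remote.channels.mem.type = memory

    remote.sinks.forward.type = avro
    remote.sinks.forward.hostname = hadoop-edge.example.com
    remote.sinks.forward.port = 4545
    remote.sinks.forward.channel = mem

    # Agent "collector" on the Hadoop machine
    collector.sources = in
    collector.channels = mem
    collector.sinks = tohdfs

    collector.sources.in.type = avro
    collector.sources.in.bind = 0.0.0.0
    collector.sources.in.port = 4545
    collector.sources.in.channels = mem

    collector.channels.mem.type = memory

    collector.sinks.tohdfs.type = hdfs
    collector.sinks.tohdfs.hdfs.path = hdfs://namenode.example.com:8020/flume/incoming
    collector.sinks.tohdfs.channel = mem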

How to get all table definitions in a database in Hive?

Deadly submitted on 2020-01-03 16:45:46
Question: I am looking to get all table definitions in Hive. I know that for a single table definition I can use something like:

    describe <<table_name>>
    describe extended <<table_name>>

But I couldn't find a way to get all table definitions. Is there any table in the metastore similar to Information_Schema in MySQL, or is there a command to get all table definitions?

Answer 1: You can do this by writing a simple bash script and some bash commands. First, write all table names in a database to a text file using:…
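The excerpt cuts off before the commands. A rough sketch of the approach the answer describes, driven from Python instead of bash by shelling out to the hive CLI; the database name and output file are placeholders.

    import subprocess

    db = "default"  # placeholder database name

    # Step 1: write all table names in the database to a list.
    tables = subprocess.run(
        ["hive", "-e", f"USE {db}; SHOW TABLES;"],
        capture_output=True, text=True, check=True,
    ).stdout.split()

    # Step 2: ask Hive for each table's DDL and collect everything in one file.
    with open("table_definitions.sql", "w") as out:
        for t in tables:
            ddl = subprocess.run(
                ["hive", "-e", f"USE {db}; SHOW CREATE TABLE {t};"],
                capture_output=True, text=True, check=True,
            ).stdout
            out.write(ddl + "\n")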

Using Rowcounter in Hbase table

本秂侑毒 submitted on 2020-01-03 04:16:04
Question: I am trying to calculate the number of rows in an HBase table. I can do that with a scanner, but it is a bulky process. I want to use RowCounter to fetch the row count from the HBase table. Is there any way I can use that in Java code? Is there any example or code snippet available? Directly using RowCounter is plain and simple with the command:

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter [TABLE_NAME]

Please provide any code snippet to use the same in Java code. Thanks.

Answer 1: You can find…
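The answer excerpt cuts off before its example. The question asks for Java; as a hedged stand-in (not from the thread), the same MapReduce job can be launched from application code by shelling out to the CLI command shown above, sketched here in Python with the table name as a placeholder. The ROWS counter the finished job prints is the row count.

    import subprocess

    table = "TABLE_NAME"  # placeholder

    # Launch the RowCounter MapReduce job exactly as the CLI command above does.
    result = subprocess.run(
        ["hbase", "org.apache.hadoop.hbase.mapreduce.RowCounter", table],
        capture_output=True, text=True,
    )

    # The job prints its counters, including a ROWS=<count> line, when it finishes;
    # scan both output streams rather than assuming which one it lands on.
    for line in (result.stdout + result.stderr).splitlines():
        if "ROWS=" in line:
            print(line.strip())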

Connection refused to quickstart.cloudera:8020

旧城冷巷雨未停 submitted on 2020-01-03 02:41:08
Question: I'm using the Cloudera QuickStart 5.5.0 VirtualBox image and trying to run this in a terminal. As you can see below, there is an exception. I've searched the internet for a solution and found the following.

1) Configuring the core-site.xml file: https://datashine.wordpress.com/2014/09/06/java-net-connectexception-connection-refused-for-more-details-see-httpwiki-apache-orghadoopconnectionrefused/ But I can only open this file read-only and haven't been able to change it. It seems I need to be the root or hdfs user…
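For reference, and not from the excerpt: on a typical QuickStart VM the file lives under /etc/hadoop/conf and has to be edited with elevated rights, and the NameNode address that a "connection refused to quickstart.cloudera:8020" error points at is controlled by fs.defaultFS. A minimal sketch, assuming those defaults (path and value are assumptions typical of the QuickStart image):

    $ sudo vi /etc/hadoop/conf/core-site.xml

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://quickstart.cloudera:8020</value>
    </property>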

HDFS space usage on fresh install

一个人想着一个人 submitted on 2020-01-02 18:57:16
Question: I just installed HDFS and launched the service, and there is already more than 800 MB of used space. What does it represent?

    $ hdfs dfs -df -h
    Filesystem                       Size    Used     Available  Use%
    hdfs://quickstart.cloudera:8020  54.5 G  823.7 M  43.4 G     1%

Source: https://stackoverflow.com/questions/43165646/hdfs-space-usage-on-fresh-install

Cassandra slowed down with more nodes

孤街浪徒 submitted on 2020-01-02 09:12:15
Question: I set up a Cassandra cluster on AWS. What I want is increased I/O throughput (number of reads/writes per second) as more nodes are added, as advertised. However, I got exactly the opposite: performance drops as new nodes are added. Do you know of any typical issues that prevent it from scaling? Here are some details: I am adding a text file (15 MB) to the column family. Each line is a record; there are 150,000 records. When there is 1 node, it takes about 90 seconds to write. But…