bigdata

How to install Big Data Hadoop on an Ubuntu 14.04 64-bit VM on Windows 7 32-bit

Submitted by 这一生的挚爱 on 2019-12-25 17:44:54
Question: I have a Windows 7 32-bit laptop and I wanted to practice Hadoop on 64-bit Ubuntu. I tried many ways but was not able to install/run Hadoop, since it requires a 64-bit Ubuntu OS. How can I install it on my Windows 32-bit laptop? Answer 1: Good day. I have finally managed to run a 64-bit Ubuntu VM on my Windows 7 32-bit machine and installed Hadoop in Ubuntu. This thread is just for those who want to practice Hadoop on 64-bit Ubuntu on a Windows 7 32-bit computer... Please follow the steps/links below that I have

How to convert annual data to monthly data using R?

Submitted by 孤人 on 2019-12-25 14:12:45
Question: I have annual GDP data for 15 years, 2000-2015. I want to convert this to monthly data, keeping only month and year, by copying each year's value to all the months of that year. How can I do this in R? E.g., in 2010 the value is 1708, and I want the same value for every month of 2010. My data:
> str(gdpnew)
'data.frame': 16 obs. of 3 variables:
 $ X : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Date : Date, format: "2000-12-31" "2001-12-31" "2002-12-31" ...
 $ Value: num
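
The question asks for R, but the expand-and-copy idea can be sketched in Python/pandas (the 1708 figure for 2010 comes from the question; the other toy values, the year range used, and the column/variable names below are illustrative assumptions, not the asker's data):

import pandas as pd

# Toy stand-in for the annual data frame described above:
# Date is a year-end timestamp, Value is the annual GDP figure.
gdp = pd.DataFrame({
    "Date": pd.to_datetime(["2009-12-31", "2010-12-31", "2011-12-31"]),
    "Value": [1650.0, 1708.0, 1795.0],  # 1708 matches the 2010 example in the question
})

# Build one row per month and copy each year's value to all of its months.
months = pd.DataFrame({"month": pd.date_range("2009-01-01", "2011-12-01", freq="MS")})
months["year"] = months["month"].dt.year
gdp["year"] = gdp["Date"].dt.year

monthly = months.merge(gdp[["year", "Value"]], on="year", how="left")
print(monthly.head(13))  # the 12 months of 2009 plus January 2010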

Fastest way to compare two huge CSV files in Python (NumPy)

Submitted by 巧了我就是萌 on 2019-12-25 12:53:44
Question: I am trying to find the intersecting subset between two pretty big CSV files of phone numbers (one has 600k rows, the other 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D NumPy arrays, and then using NumPy intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd
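
A hedged sketch of one way to do this without holding both files in memory: build a set from the small file and stream the big one in chunks. The dncTest.csv name and the 'phone' column come from the snippet above; the second file name, the chunk size, and everything after the truncation point are my assumptions.

import numpy as np
import pandas as pd

# Small list (~600k rows): load fully and convert to a fast membership set.
df_dnc = pd.read_csv('dncTest.csv', names=['phone'])
dnc = set(df_dnc['phone'])

# Big list (~300M rows): stream in chunks so it never has to fit in memory.
matches = []
for chunk in pd.read_csv('bigList.csv', names=['phone'], chunksize=5_000_000):
    hits = chunk.loc[chunk['phone'].isin(dnc), 'phone'].to_numpy()
    if hits.size:
        matches.append(hits)

# Deduplicate the collected hits; this is the intersection of the two files.
intersection = np.unique(np.concatenate(matches)) if matches else np.empty(0)
print(intersection.size)

If both columns do fit in memory, np.intersect1d on the two arrays (the approach the question describes) gives the same result directly.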

What is a better solution for storing and managing big data?

Submitted by 夙愿已清 on 2019-12-25 09:13:14
Question: I have a PHP application which requests an API every half hour and gets a chunk of data, approximately 30 million records every month, so data floods the database every day. By the end of the year it reaches approximately (30*12) = 360 million records or more. I just store those records, and when a user requests a specific record I fetch it. So what would be a better solution for this kind of application for storing the data and retrieving a specific record? I am now using MySQL. Is MySQL

Kafka Streams - first WordCount example doesn't count correctly on the first run

Submitted by ╄→гoц情女王★ on 2019-12-25 09:09:02
Question: I'm studying Kafka Streams and I have a problem with the first WordCount example in Java 8, taken from the documentation, using the latest available versions of Kafka Streams, Kafka Connect, and the WordCount lambda-expressions example. I follow these steps: I create an input topic in Kafka and an output one, start the streaming app, and then load the input topic by inserting some words from a .txt file. On the first count, in the output topic I see the words grouped correctly, but

Retrieve any three random qualifiers in HBase using Java

Submitted by 青春壹個敷衍的年華 on 2019-12-25 08:57:15
Question: My HBase table looks like this:
hbase(main):040:0> scan 'TEST'
ROW COLUMN+CELL
 4  column=data:108, timestamp=1399972960190, value=-240.0
 4  column=data:112, timestamp=1399972960138, value=-160.0
 4  column=data:12, timestamp=1399972922979, value=2
 4  column=data:120, timestamp=1399972960124, value=-152.0
 4  column=data:144, timestamp=1399972960171, value=-240.0
 4  column=data:148, timestamp=1399972960152, value=-240.0
 4  column=data:16, timestamp=1399972909606, value=9
 4  column=data:8, timestamp
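
The question targets Java, but the underlying idea (fetch the row, then sample three qualifiers at random) can be sketched with the Python happybase client, assuming an HBase Thrift server is running. The table name 'TEST', column family 'data', and row key '4' are taken from the scan above; the host is a placeholder.

import random
import happybase

# Connect to the HBase Thrift gateway (host is a placeholder).
connection = happybase.Connection('localhost')
table = connection.table('TEST')

# Fetch every qualifier in the 'data' family for row '4'.
row = table.row(b'4', columns=[b'data'])

# Pick any three qualifiers at random and print them with their values.
for column, value in random.sample(list(row.items()), k=3):
    print(column.decode(), '=', value.decode())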

What is the best way to pass data into PyCharm?

Submitted by 天大地大妈咪最大 on 2019-12-25 07:41:14
Question: I uploaded data into MySQL, and from there I am using PyCharm and the plotly.offline library to pass in that data. My end goal is to create a scatter plot of the US with information on places at a certain latitude and longitude. This is what I am trying to pass in: checkin_data = pd.read_sql('select bus.business_id, bus.latitude, bus.longitude, sum(chk.checkin_count ) as checkin_count from yelp.business bus inner join yelp.checkin chk ON bus.business_id=chk.business_id group by bus.business_id,
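
A minimal sketch of the plotting step the query above feeds into, assuming checkin_data ends up with latitude, longitude, and checkin_count columns; the Scattergeo trace, the USA map scope, and the toy rows below are my assumptions about the intended US scatter plot, not the asker's code.

import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot

# Toy stand-in for the result of the pd.read_sql() call in the question.
checkin_data = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "latitude": [36.17, 33.45],
    "longitude": [-115.14, -112.07],
    "checkin_count": [120, 80],
})

fig = go.Figure(
    data=[go.Scattergeo(
        lon=checkin_data["longitude"],
        lat=checkin_data["latitude"],
        text=checkin_data["checkin_count"].astype(str),
        mode="markers",
        marker=dict(size=checkin_data["checkin_count"] / 10),
    )],
    layout=go.Layout(title="Check-ins by location", geo=dict(scope="usa")),
)

# plotly.offline.plot writes a standalone HTML file and opens it in the browser.
plot(fig, filename="checkins.html")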

What is the fastest procedure to remove duplicates from a big table in MySQL

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-25 05:31:26
Question: I have a table in MySQL (50 million rows) into which new data is inserted periodically. The table has the following structure:
CREATE TABLE values (
  id double NOT NULL AUTO_INCREMENT,
  channel_id int(11) NOT NULL,
  val text NOT NULL,
  date_time datetime NOT NULL,
  PRIMARY KEY (id),
  KEY channel_date_index (channel_id,date_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Two rows must never have the same channel_id and date_time, but if such an insert occurs it is important to keep the newest value. Is there a
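
A hedged sketch of one common approach: delete existing duplicates keeping the newest row per (channel_id, date_time), add a unique key, and let future inserts upsert. The table and column names come from the CREATE TABLE above; the use of mysql.connector, the connection details, and the sample insert values are my assumptions, and on 50 million rows the DELETE should be rehearsed on a copy first.

import mysql.connector  # assumption: MySQL Connector/Python is installed

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")  # placeholders
cur = conn.cursor()

# 1) Remove existing duplicates, keeping the newest row (largest id) per pair.
cur.execute("""
    DELETE v1 FROM `values` v1
    JOIN `values` v2
      ON v1.channel_id = v2.channel_id
     AND v1.date_time  = v2.date_time
     AND v1.id < v2.id
""")

# 2) Enforce uniqueness so duplicate pairs cannot reappear.
cur.execute("ALTER TABLE `values` "
            "ADD UNIQUE KEY uq_channel_time (channel_id, date_time)")

# 3) New data overwrites the stored value instead of creating a duplicate.
cur.execute("""
    INSERT INTO `values` (channel_id, val, date_time)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE val = VALUES(val)
""", (42, "3.14", "2019-12-25 05:31:00"))  # illustrative values

conn.commit()
cur.close()
conn.close()

Once the unique key exists, the original non-unique channel_date_index covers the same columns and can be dropped.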

Customizing InputFormat in Hadoop

Submitted by 牧云@^-^@ on 2019-12-25 04:06:57
Question: I am trying to read from a very big database which consists of geo-referenced time series data, so I have a file in the following format: latitude,longitude,value@time1,value@time2,....value@timeN. This is the data for the entire earth. For my work I need latitude,longitude as the key and the time series values as the value. As far as I know, Hadoop has KeyValueTextInputFormat, but it treats the first tab as the delimiter. Is there a way to customize it? I really need a solution for
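
KeyValueTextInputFormat's separator is configurable (mapreduce.input.keyvaluelinerecordreader.key.value.separator), but it splits each line only at the first occurrence of a single separator byte, so it cannot produce a latitude,longitude key from comma-delimited lines on its own. As a hedged sketch of the required logic, here is what the split-at-the-first-two-commas step could look like as a Hadoop Streaming mapper in Python; in native Java MapReduce the same parsing would live in a custom RecordReader or in the mapper itself. This is an illustration of the idea, not the asker's setup.

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: emit "lat,lon" as the key and the
# remaining time-series values as the value (Streaming splits key/value at
# the first tab of each emitted line).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    parts = line.split(",", 2)   # split only at the first two commas
    if len(parts) < 3:
        continue                 # skip malformed records
    lat, lon, series = parts
    print("%s,%s\t%s" % (lat, lon, series))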