bigdata

How to install Big Data Hadoop on an Ubuntu 14.04 64-bit VM on Windows 7 32-bit

Submitted by 这一生的挚爱 on 2019-12-25 17:44:54
Question: I have a Windows 7 32-bit laptop and I wanted to practice Hadoop on 64-bit Ubuntu. I tried many ways but was not able to install/run Hadoop, since it requires a 64-bit Ubuntu OS. How can I install it on my Windows 32-bit laptop? Answer 1: Good day. I have finally managed to run a 64-bit Ubuntu VM on my Windows 7 32-bit machine and installed Hadoop in Ubuntu. This thread is just for those who want to practice Hadoop on 64-bit Ubuntu on a Windows 7 32-bit computer... Please follow the steps/links below that I have

How to convert annual data to monthly data using R?

Submitted by 孤人 on 2019-12-25 14:12:45
Question: I have annual GDP data for 15 years, 2000-2015. I want to convert this to monthly data, keeping only month and year, by copying each year's value to all the months of that year. How can I do this in R? E.g., in 2010 the value is 1708, and I want the same value for every month of 2010. My data:
> str(gdpnew)
'data.frame': 16 obs. of 3 variables:
 $ X : int 1 2 3 4 5 6 7 8 9 10 ...
 $ Date : Date, format: "2000-12-31" "2001-12-31" "2002-12-31" ...
 $ Value: num
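
The question asks for R, but the expand-and-copy idea can be sketched in Python/pandas (the 1708 figure for 2010 comes from the question; the other toy values, the year range used, and the column/variable names below are illustrative assumptions, not the asker's data):

import pandas as pd

# Toy stand-in for the annual data frame described above:
# Date is a year-end timestamp, Value is the annual GDP figure.
gdp = pd.DataFrame({
    "Date": pd.to_datetime(["2009-12-31", "2010-12-31", "2011-12-31"]),
    "Value": [1650.0, 1708.0, 1795.0],  # 1708 matches the 2010 example in the question
})

# Build one row per month and copy each year's value to all of its months.
months = pd.DataFrame({"month": pd.date_range("2009-01-01", "2011-12-01", freq="MS")})
months["year"] = months["month"].dt.year
gdp["year"] = gdp["Date"].dt.year

monthly = months.merge(gdp[["year", "Value"]], on="year", how="left")
print(monthly.head(13))  # the 12 months of 2009 plus January 2010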

Fastest way to compare two huge CSV files in Python (NumPy)

Submitted by 巧了我就是萌 on 2019-12-25 12:53:44
Question: I am trying to find the intersecting subset between two pretty big CSV files of phone numbers (one has 600k rows, the other 300 million). I am currently using pandas to open both files, converting the needed columns into 1-D NumPy arrays, and then using NumPy intersect to get the intersection. Is there a better way of doing this, either with Python or any other method? Thanks for any help.
import pandas as pd
import numpy as np
df_dnc = pd.read_csv('dncTest.csv', names = ['phone'])
df_test = pd
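
A hedged sketch of one way to do this without holding both files in memory: build a set from the small file and stream the big one in chunks. The dncTest.csv name and the 'phone' column come from the snippet above; the second file name, the chunk size, and everything after the truncation point are my assumptions.

import numpy as np
import pandas as pd

# Small list (~600k rows): load fully and convert to a fast membership set.
df_dnc = pd.read_csv('dncTest.csv', names=['phone'])
dnc = set(df_dnc['phone'])

# Big list (~300M rows): stream in chunks so it never has to fit in memory.
matches = []
for chunk in pd.read_csv('bigList.csv', names=['phone'], chunksize=5_000_000):
    hits = chunk.loc[chunk['phone'].isin(dnc), 'phone'].to_numpy()
    if hits.size:
        matches.append(hits)

# Deduplicate the collected hits; this is the intersection of the two files.
intersection = np.unique(np.concatenate(matches)) if matches else np.empty(0)
print(intersection.size)

If both columns do fit in memory, np.intersect1d on the two arrays (the approach the question describes) gives the same result directly.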

What is a better solution for storing and managing big data?

Submitted by 夙愿已清 on 2019-12-25 09:13:14
Question: I have a PHP application which requests an API every half hour and gets a chunk of data, approximately 30 million records every month, so data floods the database every day. By the end of the year it reaches approximately (30*12) = 360 million records or more. I just store those records, and when a user requests a specific record I fetch it. So what would be a better solution for this kind of application for storing the data and retrieving a specific record? I am now using MySQL. Is MySQL

Kafka Streams - first WordCount example doesn't count correctly on the first run

Submitted by ╄→гoц情女王★ on 2019-12-25 09:09:02
Question: I'm studying Kafka Streams and I have a problem with the first WordCount example in Java 8, taken from the documentation, using the latest available versions of Kafka Streams, Kafka Connect, and the WordCount lambda-expressions example. I follow these steps: I create an input topic in Kafka and an output one, start the streaming app, and then load the input topic by inserting some words from a .txt file. On the first count, in the output topic I see the words grouped correctly, but

Retrieve any three random qualifiers in HBase using Java

Submitted by 青春壹個敷衍的年華 on 2019-12-25 08:57:15
Question: My HBase table looks like this:
hbase(main):040:0> scan 'TEST'
ROW COLUMN+CELL
 4  column=data:108, timestamp=1399972960190, value=-240.0
 4  column=data:112, timestamp=1399972960138, value=-160.0
 4  column=data:12, timestamp=1399972922979, value=2
 4  column=data:120, timestamp=1399972960124, value=-152.0
 4  column=data:144, timestamp=1399972960171, value=-240.0
 4  column=data:148, timestamp=1399972960152, value=-240.0
 4  column=data:16, timestamp=1399972909606, value=9
 4  column=data:8, timestamp
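
The question targets Java, but the underlying idea (fetch the row, then sample three qualifiers at random) can be sketched with the Python happybase client, assuming an HBase Thrift server is running. The table name 'TEST', column family 'data', and row key '4' are taken from the scan above; the host is a placeholder.

import random
import happybase

# Connect to the HBase Thrift gateway (host is a placeholder).
connection = happybase.Connection('localhost')
table = connection.table('TEST')

# Fetch every qualifier in the 'data' family for row '4'.
row = table.row(b'4', columns=[b'data'])

# Pick any three qualifiers at random and print them with their values.
for column, value in random.sample(list(row.items()), k=3):
    print(column.decode(), '=', value.decode())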

What is the best way to pass data into PyCharm?

Submitted by 天大地大妈咪最大 on 2019-12-25 07:41:14
Question: I uploaded data into MySQL, and from there I am using PyCharm and the plotly.offline library to pass in that data. My end goal is to create a scatter plot of the US with information on places at a certain latitude and longitude. This is what I am trying to pass in: checkin_data = pd.read_sql('select bus.business_id, bus.latitude, bus.longitude, sum(chk.checkin_count ) as checkin_count from yelp.business bus inner join yelp.checkin chk ON bus.business_id=chk.business_id group by bus.business_id,
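
A minimal sketch of the plotting step the query above feeds into, assuming checkin_data ends up with latitude, longitude, and checkin_count columns; the Scattergeo trace, the USA map scope, and the toy rows below are my assumptions about the intended US scatter plot, not the asker's code.

import pandas as pd
import plotly.graph_objs as go
from plotly.offline import plot

# Toy stand-in for the result of the pd.read_sql() call in the question.
checkin_data = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "latitude": [36.17, 33.45],
    "longitude": [-115.14, -112.07],
    "checkin_count": [120, 80],
})

fig = go.Figure(
    data=[go.Scattergeo(
        lon=checkin_data["longitude"],
        lat=checkin_data["latitude"],
        text=checkin_data["checkin_count"].astype(str),
        mode="markers",
        marker=dict(size=checkin_data["checkin_count"] / 10),
    )],
    layout=go.Layout(title="Check-ins by location", geo=dict(scope="usa")),
)

# plotly.offline.plot writes a standalone HTML file and opens it in the browser.
plot(fig, filename="checkins.html")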

What is the fastest procedure to remove duplicates from a big table in MySQL

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-25 05:31:26
Question: I have a table in MySQL (50 million rows) into which new data is inserted periodically. The table has the following structure:
CREATE TABLE values (
  id double NOT NULL AUTO_INCREMENT,
  channel_id int(11) NOT NULL,
  val text NOT NULL,
  date_time datetime NOT NULL,
  PRIMARY KEY (id),
  KEY channel_date_index (channel_id,date_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Two rows must never have the same channel_id and date_time, but if such an insert occurs it is important to keep the newest value. Is there a
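
A hedged sketch of one common approach: delete existing duplicates keeping the newest row per (channel_id, date_time), add a unique key, and let future inserts upsert. The table and column names come from the CREATE TABLE above; the use of mysql.connector, the connection details, and the sample insert values are my assumptions, and on 50 million rows the DELETE should be rehearsed on a copy first.

import mysql.connector  # assumption: MySQL Connector/Python is installed

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")  # placeholders
cur = conn.cursor()

# 1) Remove existing duplicates, keeping the newest row (largest id) per pair.
cur.execute("""
    DELETE v1 FROM `values` v1
    JOIN `values` v2
      ON v1.channel_id = v2.channel_id
     AND v1.date_time  = v2.date_time
     AND v1.id < v2.id
""")

# 2) Enforce uniqueness so duplicate pairs cannot reappear.
cur.execute("ALTER TABLE `values` "
            "ADD UNIQUE KEY uq_channel_time (channel_id, date_time)")

# 3) New data overwrites the stored value instead of creating a duplicate.
cur.execute("""
    INSERT INTO `values` (channel_id, val, date_time)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE val = VALUES(val)
""", (42, "3.14", "2019-12-25 05:31:00"))  # illustrative values

conn.commit()
cur.close()
conn.close()

Once the unique key exists, the original non-unique channel_date_index covers the same columns and can be dropped.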

Customizing InputFormat in Hadoop

Submitted by 牧云@^-^@ on 2019-12-25 04:06:57
Question: I am trying to read from a very big database which consists of geo-referenced time series data, so I have a file in the following format: latitude,longitude,value@time1,value@time2,....value@timeN. This is the data for the entire earth. For my work I need latitude,longitude as the key and the time series values as the value. As far as I know, Hadoop has KeyValueTextInputFormat, but it treats the first tab as the delimiter. Is there a way to customize it? I really need a solution for
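
KeyValueTextInputFormat's separator is configurable (mapreduce.input.keyvaluelinerecordreader.key.value.separator), but it splits each line only at the first occurrence of a single separator byte, so it cannot produce a latitude,longitude key from comma-delimited lines on its own. As a hedged sketch of the required logic, here is what the split-at-the-first-two-commas step could look like as a Hadoop Streaming mapper in Python; in native Java MapReduce the same parsing would live in a custom RecordReader or in the mapper itself. This is an illustration of the idea, not the asker's setup.

#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper: emit "lat,lon" as the key and the
# remaining time-series values as the value (Streaming splits key/value at
# the first tab of each emitted line).
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    parts = line.split(",", 2)   # split only at the first two commas
    if len(parts) < 3:
        continue                 # skip malformed records
    lat, lon, series = parts
    print("%s,%s\t%s" % (lat, lon, series))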