bigdata

NumPy: 3-byte, 6-byte types (aka uint24, uint48)

Posted by 别等时光非礼了梦想 on 2019-12-22 06:39:19
Question: NumPy seems to lack built-in support for 3-byte and 6-byte types, aka uint24 and uint48. I have a large data set using these types and want to feed it to NumPy. What I currently do (for uint24):

    import numpy as np
    dt = np.dtype([('head', '<u2'), ('data', '<u2', (3,))])
    # I would like to be able to write
    # dt = np.dtype([('head', '<u2'), ('data', '<u3', (2,))])
    # dt = np.dtype([('head', '<u2'), ('data', '<u6')])
    a = np.memmap("filename", mode='r', dtype=dt)
    # convert 3 x 2byte data to 2 x …
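One workaround (a sketch only, not from the original post) is to memory-map the raw bytes as uint8 and widen each little-endian 3-byte field to uint32 by zero-padding its high byte; the record layout below (a 2-byte head followed by two uint24 values) is an assumption for illustration:

    import numpy as np

    # Sketch: treat each record as raw bytes, then widen the 3-byte
    # little-endian fields to uint32 by padding the high byte with zero.
    raw = np.memmap("filename", mode='r', dtype=np.uint8)   # hypothetical file
    rec = raw.reshape(-1, 8)                 # assumed layout: 2-byte head + 2 x 3-byte data
    head = rec[:, 0:2].copy().view('<u2').ravel()
    data24 = rec[:, 2:8].reshape(-1, 3)      # one row per uint24 value
    padded = np.zeros((data24.shape[0], 4), dtype=np.uint8)
    padded[:, :3] = data24                   # low bytes first, high byte stays zero
    values = padded.view('<u4').ravel()      # ordinary uint32 values now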

When to use DynamoDB - Use Cases

Posted by 拥有回忆 on 2019-12-22 06:28:28
Question: I've been trying to figure out which use cases are the best fit for Amazon DynamoDB. When I googled, most of the blogs said DynamoDB is used only for large amounts of data (Big Data). I have a relational DB background and NoSQL is new to me, so I've tried to relate this to my relational DB knowledge. Most of the concepts related to DynamoDB revolve around creating a schema-less table with partition keys/sort keys and querying it based on those keys. Also, there is no such …
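As a small illustration of the key-driven access pattern DynamoDB is built around (a sketch using boto3; the table and attribute names are hypothetical, not from the original question):

    import boto3
    from boto3.dynamodb.conditions import Key

    # Sketch: DynamoDB works best when every read is expressed against the
    # partition key (and optionally a sort-key condition), never a full scan.
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('Orders')                  # hypothetical table

    response = table.query(
        KeyConditionExpression=Key('customer_id').eq('C42') &
                               Key('order_date').begins_with('2013-01'))
    for item in response['Items']:
        print(item)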

What is the best way to load a huge result set in memory?

Posted by 对着背影说爱祢 on 2019-12-22 05:50:09
Question: I am trying to load two huge result sets (source and target) coming from different RDBMSs, but the problem I am struggling with is getting those two huge result sets into memory. These are the queries that pull the data from source and target:

    SQL Server - select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn
    Oracle     - select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn

Records in Source: 12377200
Records in Target: 12266800

Following are the …
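Since both queries already return rows ordered by LinkedColumn, one option (a sketch only, using the generic Python DB-API; the cursor objects and batch size are placeholders) is to stream each result set in chunks and compare the two sorted streams in a single merge pass instead of materialising either one in memory:

    # Sketch: pull rows in fixed-size batches rather than loading everything.
    def stream_rows(cursor, sql, batch=50000):
        cursor.execute(sql)
        while True:
            rows = cursor.fetchmany(batch)      # DB-API 2.0 chunked fetch
            if not rows:
                break
            for row in rows:
                yield row                       # (LinkedColumn, CompareColumn)

    source = stream_rows(sqlserver_cursor,      # placeholder cursors, e.g. from pyodbc
                         "select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn")
    target = stream_rows(oracle_cursor,         # and cx_Oracle connections
                         "select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn")

    # Both generators yield rows sorted by LinkedColumn, so they can be
    # merge-compared by always advancing the side with the smaller key.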

Hive ParseException - cannot recognize input near 'end' 'string'

Posted by ≯℡__Kan透↙ on 2019-12-22 04:04:05
Question: I am getting the following error when trying to create a Hive table from an existing DynamoDB table:

    NoViableAltException(88@[])
      at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:9123)
      at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:30750)
      ...more stack trace...
    FAILED: ParseException line 1:77 cannot recognize input near 'end' 'string' ',' in column specification

The query looks like this (simplified to …
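The parser is tripping over a column literally named end, which is a reserved word in HiveQL (it closes CASE expressions), so that identifier has to be backtick-quoted. A sketch of the idea with a hypothetical schema, driving the Hive CLI from Python:

    import subprocess

    # Sketch: backtick-quote the reserved column name `end` in the DDL.
    # Table name, other columns and the DynamoDB properties are assumptions.
    ddl = """
    CREATE EXTERNAL TABLE my_table (
        id      string,
        `end`   string,
        other   string
    )
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES ("dynamodb.table.name" = "MyDynamoTable");
    """

    # hive -e runs a quoted query string from the command line.
    subprocess.run(["hive", "-e", ddl], check=True)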

How to load xls data from multiple xls files into Hive?

Posted by Deadly on 2019-12-22 00:33:07
Question: I am learning to use Hadoop for Big Data related operations. I need to perform some queries on a collection of data sets split across 8 xls files. Each xls file has multiple sheets, and the query concerns only one of the sheets. The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html . I am not using any commercial distro of Hadoop for my tasks; I just have one master and one slave VM set up in VMware with Hadoop, Hive and Pig in them. I am a …
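Hive cannot read xls workbooks directly, so one common route (a sketch only; the file paths, sheet name and Hive table are assumptions) is to export the relevant sheet of each workbook to CSV with pandas and then load the CSVs into a Hive table:

    import glob
    import pandas as pd

    # Sketch: extract one named sheet from every workbook and write it as CSV,
    # which Hive can load. Paths and the sheet name are hypothetical.
    for path in glob.glob("workbooks/*.xls"):
        df = pd.read_excel(path, sheet_name="Table 1")   # needs an Excel engine such as xlrd
        df.to_csv(path.replace(".xls", ".csv"), index=False, header=False)

    # Then, in Hive, either LOAD DATA each CSV or point an external table at the directory:
    #   LOAD DATA LOCAL INPATH 'workbooks/file1.csv' INTO TABLE health_utilization;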

Custom InputFormat for reading JSON in Hadoop

Posted by 冷暖自知 on 2019-12-21 21:39:55
Question: I am a beginner with Hadoop. I have been told to create a custom InputFormat class to read JSON data. I have googled around and learnt how to create a custom InputFormat class that reads data from a file, but I am stuck on parsing the JSON data. My JSON data looks like this:

    [ { "_count": 30, "_start": 0, "_total": 180, "values": [ { "attachment": { "contentDomain": "techcarnival2013.eventbrite.com", "contentUrl": "http://techcarnival2013.eventbrite.com/", "imageUrl": "http://ebmedia.eventbrite.com/s3 …
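Setting aside the Hadoop InputFormat plumbing (which would normally be written in Java), the parsing step itself is just a walk over a nested structure: the top level is a list, each element carries a "values" array, and each value holds an "attachment" object. A small Python sketch of that traversal, assuming the layout shown above and a hypothetical file name:

    import json

    # Sketch: parse one document of the shape shown in the question and pull
    # a couple of fields out of every record in the "values" array.
    with open("data.json") as fh:
        pages = json.load(fh)                 # top level is a list of page objects

    for page in pages:
        for record in page.get("values", []):
            attachment = record.get("attachment", {})
            print(attachment.get("contentDomain"), attachment.get("contentUrl"))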

Cassandra data model for time series

Posted by 北慕城南 on 2019-12-21 20:15:44
Question: I am working on a Cassandra data model for storing time series (I'm a Cassandra newbie). I have two applications: intraday stock data and sensor data. The stock data will be saved with a time resolution of one minute. Seven data fields make up one timeframe: Symbol, Datetime, Open, High, Low, Close, Volume. I will query the data mostly by Symbol and Date, e.g. give me all data for AAPL between 2013-01-01 and 2013-01-31 ordered by Datetime. The recommendation for Cassandra queries is to query …
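One possible shape for the stock table (a sketch only, not from the original post) is to partition by symbol plus trading day so partitions stay bounded, and cluster by the minute timestamp so the range query above comes back already ordered; here it is expressed through the Python cassandra-driver, with keyspace and table names assumed:

    import datetime
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('marketdata')   # hypothetical keyspace

    # Sketch: (symbol, day) is the partition key, datetime the clustering key,
    # so reads for one symbol on one day are a single ordered partition scan.
    session.execute("""
        CREATE TABLE IF NOT EXISTS intraday_prices (
            symbol text, day date, datetime timestamp,
            open double, high double, low double, close double, volume bigint,
            PRIMARY KEY ((symbol, day), datetime)
        ) WITH CLUSTERING ORDER BY (datetime ASC)
    """)

    rows = session.execute(
        "SELECT * FROM intraday_prices WHERE symbol = %s AND day = %s",
        ('AAPL', datetime.date(2013, 1, 2)))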

R, issue with a Hierarchical clustering after a Multiple correspondence analysis

Posted by ▼魔方 西西 on 2019-12-21 15:45:35
Question: I want to cluster a dataset (600,000 observations), and for each cluster I want to get the principal components. My vectors are composed of one email and 30 qualitative variables. Each qualitative variable has 4 classes: 0, 1, 2 and 3. So the first thing I do is load the FactoMineR library and my data:

    library(FactoMineR)
    mydata = read.csv("/home/tom/Desktop/ACM/acm.csv")

Then I set my variables as qualitative (excluding the variable 'email' though):

    for(n in 1:length …
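The post's workflow is FactoMineR's MCA followed by hierarchical clustering in R. Purely as a rough analogue (an assumption-laden sketch, not the original R code, kept in Python for consistency with the other examples on this page), the same "reduce the categorical variables, then cluster the scores" idea looks like this with scikit-learn:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import AgglomerativeClustering

    # Sketch: one-hot encode the 30 categorical columns (classes 0-3), reduce
    # the indicator matrix (MCA is close to a weighted SVD of it), and cluster.
    df = pd.read_csv("acm.csv")                               # hypothetical copy of the data
    X = OneHotEncoder().fit_transform(df.drop(columns=["email"]).astype(str))
    coords = TruncatedSVD(n_components=5).fit_transform(X)

    # Hierarchical clustering is quadratic in memory, so work on a subsample.
    labels = AgglomerativeClustering(n_clusters=6).fit_predict(coords[:20000])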

R ff package ffsave 'zip' not found

Posted by 余生长醉 on 2019-12-21 12:11:12
Question: Reproducible example:

    library("ff")
    m <- matrix(1:12, 3, 4, dimnames=list(c("r1","r2","r3"), c("m1","m2","m3","m4")))
    v <- 1:3
    ffm <- as.ff(m)
    ffv <- as.ff(v)
    d <- data.frame(m, v)
    ffd <- ffdf(ffm, v=ffv, row.names=row.names(ffm))
    ffsave(ffd, file="C:\\Users\\R.wd\\ff\\ffd")
    ## Error in system(cmd, input = filelist, intern = TRUE) : 'zip' not found

System: Windows 7 64-bit, R 15.2 64-bit, Rtools installed, zip 300xn-x64 and unzip 600xn folders already added to the Windows Path, cmd line working, type …

Working with a big CSV file in MATLAB

Posted by 拜拜、爱过 on 2019-12-21 05:07:23
Question: I have to work with a big CSV file, up to 2 GB. More specifically, I have to upload all this data into a MySQL database, but before that I have to make a few calculations on it, so I need to do all of this in MATLAB (my supervisor also wants it done in MATLAB because he is only familiar with MATLAB :( ). Any idea how I can handle these big files?

Answer 1: You should probably use textscan to read the data in chunks and then process them. This will probably be more efficient than reading a single line at a …
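The suggested approach is MATLAB's textscan in chunks; the same chunked-read pattern, shown here in Python with pandas only to match the other examples on this page (file name, column and table are placeholders), looks like this:

    import pandas as pd

    # Sketch of the chunked approach: read the large CSV a fixed number of
    # rows at a time, run the per-chunk calculation, then push it to MySQL.
    for chunk in pd.read_csv("bigfile.csv", chunksize=100000):
        chunk["derived"] = chunk["value"] * 2.0                  # placeholder calculation
        # chunk.to_sql("my_table", engine, if_exists="append")   # engine = SQLAlchemy engine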