bigdata

Pandas: df.groupby() is too slow for a big data set. Are there any alternative methods?

限于喜欢 submitted on 2019-12-10 13:54:19
Question: df = df.groupby(df.index).sum() I have a dataframe with 3.8 million rows (a single column), and I'm trying to group them by index, but the computation takes forever to finish. Are there any alternative ways to deal with a very large data set? Thanks in advance! I'm writing in Python. The data looks like the sample below; the index is the customer ID, and I want to group qty_liter by the index. df = df.groupby(df.index).sum() But this line of code is taking too much time... the info about
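The groupby call itself is rarely the bottleneck at 3.8 million rows; a frequent culprit is that the column being summed is stored as object (string) dtype. The sketch below is a hedged illustration of that diagnosis using a synthetic stand-in for the data (the column name qty_liter comes from the question; everything else is made up and should be checked against df.info() on the real data):

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data: customer IDs as the index and a
# qty_liter column stored as strings (object dtype), which is a common
# reason groupby().sum() crawls. This cause is an assumption to verify.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame(
    {"qty_liter": rng.random(n).astype(str)},   # object dtype, like the slow case
    index=rng.integers(0, 50_000, n),
)

# 1) Make the summed column numeric; summing object/string data falls back
#    to slow Python-level operations.
df["qty_liter"] = pd.to_numeric(df["qty_liter"], errors="coerce")

# 2) Group on the index without sorting the group keys.
result = df.groupby(level=0, sort=False)["qty_liter"].sum()
print(result.head())
```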

extracting n-grams from huge text

家住魔仙堡 submitted on 2019-12-10 12:36:20
Question: For example, we have the following text: "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..." I need every possible contiguous section of this text: first one word at a time, then two at a time, then three at a time, up to five at a time, like this: ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...] twos : ['Spark is', 'is a', 'a
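A straightforward way to produce the one- to five-word grams is a sliding window over the token list. The sketch below uses the example sentence from the question; punctuation handling and chunked reading for truly huge text are left out:

```python
# Sketch: all 1- to 5-word n-grams via a sliding window over the tokens.
# Punctuation stays attached to words for brevity; for huge input, stream
# the text in chunks or turn ngrams() into a generator.
text = ("Spark is a framework for writing fast, distributed programs. "
        "Spark solves similar problems as Hadoop MapReduce does but with "
        "a fast in-memory approach and a clean functional style API.")

words = text.split()

def ngrams(tokens, n):
    """Return the n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

grams = {n: ngrams(words, n) for n in range(1, 6)}
print(grams[1][:5])   # ['Spark', 'is', 'a', 'framework', 'for']
print(grams[2][:3])   # ['Spark is', 'is a', 'a framework']
```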

SPARQL join query explanation: how does it work?

帅比萌擦擦* submitted on 2019-12-10 12:35:11
Question: My query: select ?x ?z where { ?x <http://purl.uniprot.org/core/name> ?y . ?x <http://purl.uniprot.org/core/volume> ?z . ?x <http://purl.uniprot.org/core/pages> "176-186" . } I need to write a custom parser for this query. When I run this query on a Jena model, it returns one record. Can anyone explain how this query is executed? I split the query into three parts: select ?x ?y where { ?x <http://purl.uniprot.org/core/name> ?y . } Total Records Found: 3034 select ?x ?z where { ?x <http:/
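The three triple patterns share the variable ?x, so the engine effectively intersects the subjects that match each pattern (a join on ?x); only subjects satisfying all three patterns survive. The sketch below is not Jena; it is a plain-Python illustration with hypothetical triples showing why each pattern alone returns many rows while the combined query can return just one:

```python
# Plain-Python illustration (not Jena) of how the basic graph pattern joins
# its three triple patterns on the shared variable ?x. The triples below are
# hypothetical placeholders, not real UniProt data.
NAME = "http://purl.uniprot.org/core/name"
VOLUME = "http://purl.uniprot.org/core/volume"
PAGES = "http://purl.uniprot.org/core/pages"

triples = [
    ("citation1", NAME, "Some Journal"),
    ("citation1", VOLUME, "12"),
    ("citation1", PAGES, "176-186"),
    ("citation2", NAME, "Other Journal"),
    ("citation2", VOLUME, "7"),
    ("citation2", PAGES, "1-10"),
]

def bindings(predicate, obj=None):
    """Map each matching subject (?x) to the object of one triple pattern."""
    return {s: o for s, p, o in triples
            if p == predicate and (obj is None or o == obj)}

names = bindings(NAME)               # pattern 1: many solutions on its own
volumes = bindings(VOLUME)           # pattern 2: many solutions on its own
pages = bindings(PAGES, "176-186")   # pattern 3: constrained by the literal

# The join keeps only ?x values present in all three patterns, which is why
# the full query returns one record even though each part returns many.
result = [(x, volumes[x]) for x in names if x in volumes and x in pages]
print(result)   # [('citation1', '12')]
```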

Filter DataFrame based on words in array in Apache Spark

爱⌒轻易说出口 submitted on 2019-12-10 12:21:09
Question: I am trying to filter a Dataset by keeping only those rows that contain words from an array. I am using the contains method; it works for a string but not for an array. Below is the code: val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop.cache() val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect() val newData = dataSet.select("*").filter(col("_source.raw_text").contains(threats_path)).show() It is not working because threats
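The question's code is Scala, but the idea carries over to PySpark: Column.contains() takes a single string, so one condition is built per word and the conditions are OR-ed together. In the sketch below, paths and column names are copied from the question, and the word list is assumed to be small enough to collect to the driver:

```python
# PySpark sketch of one way to fix the filter: contains() expects a single
# string, so build one condition per threat word and OR them together.
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dataset = (spark.read.option("header", "true")
           .option("inferschema", "true")
           .json("path/to/data")             # placeholder for the question's `path`
           .na.drop().cache())

# Collect the threat words to the driver (assumed to be a small list).
threats = [row.value for row in
           spark.read.text("src/main/resources/cyber_threats").collect()]

# One contains() condition per word, combined with OR.
condition = reduce(lambda a, b: a | b,
                   [F.col("_source.raw_text").contains(word) for word in threats])

dataset.filter(condition).show()
```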

cost of keys in a JSON document database (MongoDB, Elasticsearch)

柔情痞子 submitted on 2019-12-10 11:59:16
Question: I would like to know if anyone has experience with the speed or optimization effects of JSON key size in a document-store database like MongoDB or Elasticsearch. For example, I have 2 documents doc1: { keeeeeey1: 'abc', keeeeeeey2: 'xyz' } doc2: { k1: 'abc', k2: 'xyz' } Let's say I have 10 million records; storing data in the doc1 format would mean a larger DB file size than storing it in the doc2 format. Other than that, what are the disadvantages or negative effects in terms of speed or RAM or any other
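One concrete way to see the key-name overhead is to BSON-encode both document shapes and compare byte sizes. The sketch below uses the bson module that ships with pymongo and gives only a rough estimate, since indexes and block compression change the real on-disk difference:

```python
# Sketch: compare the encoded size of a document with long vs. short keys
# (requires the `bson` module that ships with pymongo). Raw document
# overhead only; indexes and compression affect the actual on-disk gap.
import bson

long_keys = {"keeeeeey1": "abc", "keeeeeeey2": "xyz"}
short_keys = {"k1": "abc", "k2": "xyz"}

size_long = len(bson.BSON.encode(long_keys))
size_short = len(bson.BSON.encode(short_keys))

print(size_long, size_short, size_long - size_short)
# Multiply the per-document difference by 10 million documents for a rough
# upper bound on the extra uncompressed storage.
```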

Deleting Columns in HBase

不羁的心 submitted on 2019-12-10 11:53:09
Question: In HBase, will calling the deleteColumn() method, i.e., essentially a schema change to a column family, or deleting column families, result in downtime for the HBase cluster? Answer 1: The deleteColumn method on an HBase Delete mutation deletes specific column(s) from a specific row. This is not a schema change, since HBase does not retain schema-level knowledge of the columns of each row (and each row can have a different number and different types of columns; think of it as a sparsely populated matrix). The same is
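For illustration only, here is the same kind of row-level delete done from Python with the third-party happybase client; host, table, and column names are hypothetical. It touches a single row and involves no schema change or downtime:

```python
# Row-level column delete via the third-party happybase client (the question
# is about the Java API). Host, table, and column names are hypothetical.
# This removes cells from one row only -- not a schema change, no downtime.
import happybase

connection = happybase.Connection("hbase-host")    # hypothetical Thrift host
table = connection.table("my_table")               # hypothetical table name

# Delete just the column 'cf:qty' from row 'row-1'.
table.delete(b"row-1", columns=[b"cf:qty"])

# Dropping an entire column family, by contrast, is an admin/schema operation
# that requires disabling the table first.
connection.close()
```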

GAS API implementation and usage

你说的曾经没有我的故事 submitted on 2019-12-10 11:20:05
Question: I'm trying to learn and use the GAS API to implement a random walk over my database, associating every visited vertex with the starting vertex. I'm having some trouble understanding how to do this; I've been reviewing the PATHS, BFS, PR, and other GAS classes as examples, but I'm not quite sure how to start. I think my implementation should extend BaseGASProgram and implement the required methods. Also, since the program is iterative, the frontier contains all the vertices of the current iteration. The
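Setting the GAS API aside, the computation being described, tagging every visited vertex with the vertex the walk started from, can be sketched in a few lines of plain Python. The graph, walk length, and vertex names below are made up for illustration and say nothing about BaseGASProgram itself:

```python
# Plain-Python sketch of the intended computation (this is NOT the GAS API):
# perform a random walk and record, for every visited vertex, the vertex
# the walk started from. Graph and walk length are made-up values.
import random

graph = {                       # adjacency list, hypothetical data
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

def random_walk(start, steps=10):
    """Return {visited_vertex: start_vertex} for one walk."""
    visited = {start: start}
    current = start
    for _ in range(steps):
        neighbors = graph.get(current)
        if not neighbors:
            break
        current = random.choice(neighbors)
        visited[current] = start   # associate the vertex with the walk's origin
    return visited

print(random_walk("A"))
```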

Split really large file into smaller files in Python - Too many open files

寵の児 submitted on 2019-12-10 11:01:24
Question: I have a really large CSV file (close to a terabyte) that I want to split into smaller CSV files based on info in each row. Since there is no way to do that in memory, my intended approach was to read each line, decide which file it should go into, and append it there. This, however, takes ages, since opening and closing each file takes too long. My second approach was to keep all the files (about 3000) open; this, however, does not work since I can't have that many files open in parallel. Additional details
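A middle ground between the two approaches is to buffer rows per target file in memory and append them in batches, so each output file is opened only once per flush instead of once per row. In the sketch below, the key column, file paths, and batch size are assumptions for illustration:

```python
# Sketch: buffer rows per target file and flush them in batches, so each
# output file is opened/closed once per flush instead of once per row.
# The key column index, paths, and batch size are illustrative assumptions.
import csv
from collections import defaultdict

SOURCE = "huge_input.csv"        # hypothetical input path
KEY_COLUMN = 0                   # column that decides the target file
FLUSH_EVERY = 100_000            # rows held in memory before flushing

buffers = defaultdict(list)
buffered_rows = 0

def flush():
    global buffered_rows
    for key, rows in buffers.items():
        # Append mode: each file is opened briefly, once per flush.
        with open(f"split_{key}.csv", "a", newline="") as out:
            csv.writer(out).writerows(rows)
    buffers.clear()
    buffered_rows = 0

with open(SOURCE, newline="") as src:
    for row in csv.reader(src):
        buffers[row[KEY_COLUMN]].append(row)
        buffered_rows += 1
        if buffered_rows >= FLUSH_EVERY:
            flush()
flush()   # write whatever is left
```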

operating with big.matrix

坚强是说给别人听的谎言 submitted on 2019-12-10 10:43:52
Question: I have to work with big.matrix objects, and I can't compute some functions on them. Let's consider the following big.matrix: # create big.matrix object x <- as.big.matrix( matrix( sample(1:10, 20, replace=TRUE), 5, 4, dimnames=list( NULL, c("a", "b", "c", "d")) ) ) > x An object of class "big.matrix" Slot "address": <pointer: 0x00000000141beee0> The corresponding matrix object is: # create matrix object x2<-x[,] > x2 a b c d [1,] 6 9 5 3 [2,] 3 6 10 8 [3,] 7 1 2 8 [4,] 7 8 4 10 [5,] 6 3 6 4 If I

HBase Scan TimeRange Does not Work in Scala

放肆的年华 submitted on 2019-12-10 10:39:44
Question: I wrote Scala code to retrieve data based on its time range. Here is my code: object Hbase_Scan_TimeRange { def main(args: Array[String]): Unit = { //===Basic Hbase (Non Deprecated)===Start Logger.getLogger(this.getClass) Logger.getLogger("org").setLevel(Level.ERROR) BasicConfigurator.configure() val conf = HBaseConfiguration.create() val connection = ConnectionFactory.createConnection(conf) val admin = connection.getAdmin() //===Basic Hbase (Non Deprecated)===End val scan = new Scan() val