bigdata

Best way to delete millions of rows by ID

被刻印的时光 ゝ submitted on 2019-11-26 12:22:09
Question: I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete, but any way I try to do this takes days. I tried putting them in a table and deleting in batches of 100; four days later it is still running, with only 297,268 rows deleted. (I had to select 100 IDs from an ID table, delete WHERE IN that list, then delete those 100 from the IDs table.) I tried: DELETE FROM tbl WHERE id IN (select * from ids) That's taking forever, too. Hard to …
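
A commonly suggested alternative for this kind of bulk delete is to send the IDs in much larger batches as a single array parameter, so each statement removes thousands of rows instead of 100. The sketch below assumes a table tbl with primary key id and a psycopg2 connection; the connection string, batch size, and the way the ID list is produced are placeholders, not the asker's setup.

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")          # placeholder connection details
    cur = conn.cursor()

    ids_to_delete = list(range(1, 2_000_001))       # stand-in for the real list of ~2M IDs
    BATCH = 10_000                                  # batches far larger than 100 amortize per-statement cost

    for start in range(0, len(ids_to_delete), BATCH):
        chunk = ids_to_delete[start:start + BATCH]
        # "= ANY(%s)" passes the whole chunk as one array parameter, avoiding a huge IN (...) list
        cur.execute("DELETE FROM tbl WHERE id = ANY(%s)", (chunk,))
        conn.commit()                               # committing per batch keeps transactions and locks short

If most of the table is being removed, another frequently recommended route is to copy the rows you want to keep into a new table and swap it in, but whether that is viable depends on constraints and downtime tolerance.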

Determining optimal number of Spark partitions based on workers, cores and DataFrame size

核能气质少年 submitted on 2019-11-26 12:09:53
Question: There are several similar-yet-different concepts in Spark-land surrounding how work gets farmed out to different nodes and executed concurrently. Specifically, there are: the Spark driver node (sparkDriverCount), the number of worker nodes available to the Spark cluster (numWorkerNodes), the number of Spark executors (numExecutors), the DataFrame being operated on by all workers/executors concurrently (dataFrame), the number of rows in the dataFrame (numDFRows), and the number of partitions on …
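
As a rough illustration of how those quantities are usually combined, the sketch below applies the common heuristic of 2-4 partitions per executor core; the executor and core counts, the multiplier of 3, and the stand-in DataFrame are all assumptions rather than values from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-sizing").getOrCreate()

    # these normally come from how the job was submitted; the defaults here are assumptions
    num_executors = int(spark.conf.get("spark.executor.instances", "4"))
    cores_per_executor = int(spark.conf.get("spark.executor.cores", "4"))

    # common heuristic: 2-4 partitions per available core across the cluster
    target_partitions = num_executors * cores_per_executor * 3

    df = spark.range(10_000_000)                 # stand-in for the real dataFrame (numDFRows rows)
    df = df.repartition(target_partitions)
    print(df.rdd.getNumPartitions())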

Is Spark's KMeans unable to handle bigdata?

*爱你&永不变心* submitted on 2019-11-26 10:01:42
Question: KMeans has several parameters for its training, with initialization mode defaulted to kmeans||. The problem is that it marches quickly (less than 10 minutes) through the first 13 stages, but then hangs completely, without yielding an error! Minimal example which reproduces the issue (it will succeed if I use 1000 points or random initialization): from pyspark.context import SparkContext from pyspark.mllib.clustering import KMeans from pyspark.mllib.random import RandomRDDs if __name__ == "__main__": …
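
Since the question notes that the job succeeds with random initialization, one workaround is simply to switch initializationMode away from the default k-means||. The sketch below does that; the synthetic data, k, and iteration count are made-up placeholders, not the asker's values.

    from pyspark.context import SparkContext
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.random import RandomRDDs

    if __name__ == "__main__":
        sc = SparkContext(appName="kmeans-bigdata")
        # synthetic points as a placeholder for the real RDD
        data = RandomRDDs.uniformVectorRDD(sc, numRows=1_000_000, numCols=50, numPartitions=64)

        model = KMeans.train(
            data,
            k=100,
            maxIterations=20,
            initializationMode="random",   # sidesteps the k-means|| init phase that appears to hang
        )
        print(model.clusterCenters[0])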

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

有些话、适合烂在心里 submitted on 2019-11-26 07:57:26
Question: What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200. Answer 1: From the answer here, spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not …
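
A minimal sketch of the distinction, assuming a fresh local session: spark.sql.shuffle.partitions governs DataFrame/SQL shuffles, while spark.default.parallelism governs RDD transformations. The values 50 and 8 are arbitrary, and in newer Spark versions adaptive query execution may coalesce the SQL shuffle partitions, so the first printed number is a pre-AQE expectation.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("shuffle-vs-parallelism")
             .config("spark.sql.shuffle.partitions", "50")   # DataFrame/SQL joins and aggregations
             .config("spark.default.parallelism", "8")       # RDD joins, reduceByKey, parallelize
             .getOrCreate())

    df = spark.range(1_000_000)
    grouped = df.groupBy((df.id % 10).alias("bucket")).count()
    print(grouped.rdd.getNumPartitions())    # expected 50, from spark.sql.shuffle.partitions

    rdd = spark.sparkContext.parallelize(range(1_000_000))
    counts = rdd.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.getNumPartitions())         # expected 8, from spark.default.parallelism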

How do I output the results of a HiveQL query to CSV?

纵然是瞬间 submitted on 2019-11-26 07:55:59
Question: We would like to put the results of a Hive query into a CSV file. I thought the command should look like this: insert overwrite directory '/home/output.csv' select books from table; When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way? Thanks! Answer 1: Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let …
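
One widely used alternative to INSERT OVERWRITE DIRECTORY is to run the query through the hive CLI and rewrite its tab-separated output as CSV. The sketch below wraps that idea in a small Python script; the query and output path are placeholders taken loosely from the question.

    import csv
    import subprocess

    query = "SELECT books FROM table"              # placeholder query
    result = subprocess.run(
        ["hive", "-e", query],
        capture_output=True, text=True, check=True,
    )

    # the hive CLI separates columns with tabs; re-emit them as comma-separated values
    with open("/home/output.csv", "w", newline="") as out:
        writer = csv.writer(out)
        for line in result.stdout.splitlines():
            writer.writerow(line.split("\t"))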

python - Using pandas structures with large CSV (iterate and chunksize)

こ雲淡風輕ζ submitted on 2019-11-26 07:34:14
Question: I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, trying to just read it normally: df = pd.read_csv('Check400_900.csv', sep='\t') doesn't work, so I found iterate and chunksize in a similar post and used df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) All good; I can, for example, print df.get_chunk(5) and search the whole file with just for chunk in df: print …
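
The usual next step after getting a chunked reader is to aggregate or filter each chunk and keep only the reduced results, so memory stays bounded. A sketch under the assumption of the same tab-separated file; the column name is a placeholder.

    import pandas as pd

    reader = pd.read_csv("Check400_900.csv", sep="\t", chunksize=100_000)

    counts = None
    kept = []
    for chunk in reader:
        # aggregate per chunk instead of holding the whole file
        c = chunk["some_column"].value_counts()          # "some_column" is a placeholder
        counts = c if counts is None else counts.add(c, fill_value=0)
        kept.append(chunk[chunk["some_column"] > 0])     # keep only the rows you actually need

    result = pd.concat(kept, ignore_index=True)
    print(counts.sort_values(ascending=False).head())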

Hadoop MapReduce secondary sorting

这一生的挚爱 submitted on 2019-11-26 05:28:39
Question: Can anyone explain how secondary sorting works in Hadoop? Why must one use a GroupingComparator, and how does it work in Hadoop? I was going through the link given below and got a doubt about how the grouping comparator works. Can anyone explain how the grouping comparator works? http://www.bigdataspeak.com/2013/02/hadoop-how-to-do-secondary-sort-on_25.html Answer 1: Grouping Comparator: Once the data reaches a reducer, all data is grouped by key. Since we have a composite key, we need to make sure records …
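
This is not Hadoop API code, but a small Python illustration of what the two comparators accomplish together: the sort comparator orders records by the full composite key, and the grouping comparator makes the reducer treat everything with the same natural key as one group, so the values arrive already ordered by the secondary field.

    from itertools import groupby

    # composite key = (natural key, secondary field); the string is the "value"
    records = [(("2019-01", 30), "c"), (("2019-01", 10), "a"),
               (("2019-02", 5), "d"), (("2019-01", 20), "b")]

    # role of the sort comparator: order by the FULL composite key
    records.sort(key=lambda kv: kv[0])

    # role of the grouping comparator: group by the NATURAL key only,
    # so each "reduce call" sees its values pre-sorted by the secondary field
    for natural_key, group in groupby(records, key=lambda kv: kv[0][0]):
        print(natural_key, [value for _, value in group])
    # 2019-01 ['a', 'b', 'c']
    # 2019-02 ['d']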

How to create a large pandas dataframe from an sql query without running out of memory?

孤街醉人 submitted on 2019-11-26 05:27:06
Question: I have trouble querying a table of more than 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory. This works: import pandas.io.sql as psql sql = "SELECT TOP 1000000 * FROM MyTable" data = psql.read_frame(sql, cnxn) ...but this does not work: sql = "SELECT TOP 2000000 * FROM MyTable" data = psql.read_frame(sql, cnxn) It returns this error: File "inference.pyx", line 931, in pandas.lib.to_object …
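
A sketch of the usual fix: ask pandas for the result in chunks rather than one huge frame (pandas.read_sql accepts a chunksize argument that yields smaller DataFrames), and reduce, filter, or downcast each chunk before keeping it. The driver, connection string, and table name below are placeholders; if even the reduced pieces do not fit, keep only per-chunk aggregates instead of concatenating.

    import pandas as pd
    import pyodbc

    cnxn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
    )

    pieces = []
    for chunk in pd.read_sql("SELECT * FROM MyTable", cnxn, chunksize=100_000):
        # filter, aggregate, or downcast dtypes here so each kept piece stays small
        pieces.append(chunk)

    data = pd.concat(pieces, ignore_index=True)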

Working with big data in python and numpy, not enough ram, how to save partial results on disc?

北城以北 submitted on 2019-11-26 02:51:32
Question: I am trying to implement algorithms for 1000-dimensional data with 200k+ data points in Python. I want to use numpy, scipy, sklearn, networkx, and other useful libraries. I want to perform operations such as pairwise distance between all of the points and do clustering on all of the points. I have implemented working algorithms that do what I want with reasonable complexity, but when I try to scale them to all of my data I run out of RAM. Of course I do; creating the matrix for pairwise …
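
One standard way to keep partial results on disk is numpy.memmap: compute the pairwise-distance matrix in row blocks and flush each finished slab to a disk-backed array, so only one (block x n) slab is ever held in RAM. The file names and block size are placeholders, and the sketch assumes the raw points themselves fit in memory; only the roughly 150 GB distance matrix is kept on disk.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    n, d = 200_000, 1_000
    X = np.memmap("points.dat", dtype="float32", mode="r", shape=(n, d))   # assumes the points are already on disk
    D = np.memmap("dists.dat", dtype="float32", mode="w+", shape=(n, n))   # output lives on disk, not in RAM

    block = 2_000
    for i in range(0, n, block):
        # each call materialises only a (block x n) slab in memory
        D[i:i + block] = pairwise_distances(X[i:i + block], X).astype("float32")
        D.flush()                                                          # push the finished slab to disk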

Calculating and saving space in PostgreSQL

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-25 23:07:35
Question: I have a table in pg like so:

    CREATE TABLE t (
        a BIGSERIAL NOT NULL,              -- 8 b
        b SMALLINT,                        -- 2 b
        c SMALLINT,                        -- 2 b
        d REAL,                            -- 4 b
        e REAL,                            -- 4 b
        f REAL,                            -- 4 b
        g INTEGER,                         -- 4 b
        h REAL,                            -- 4 b
        i REAL,                            -- 4 b
        j SMALLINT,                        -- 2 b
        k INTEGER,                         -- 4 b
        l INTEGER,                         -- 4 b
        m REAL,                            -- 4 b
        CONSTRAINT a_pkey PRIMARY KEY (a)
    );

The above adds up to 50 bytes per row. My experience is that I need another 40% to 50% for system overhead, without even any user-created indexes on the above. So, about 75 bytes per …
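
Much of that extra 40-50% is the per-tuple header (about 23 bytes, plus a 4-byte item pointer), alignment padding, and per-page overhead, so rather than estimating by hand it is often easier to measure the real footprint. A small sketch, assuming the table t above and placeholder connection details:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")       # placeholder connection details
    cur = conn.cursor()

    cur.execute("""
        SELECT pg_relation_size('t')        AS heap_bytes,
               pg_total_relation_size('t')  AS heap_plus_indexes_and_toast,
               (SELECT reltuples FROM pg_class WHERE relname = 't') AS approx_rows
    """)
    heap_bytes, total_bytes, approx_rows = cur.fetchone()

    if approx_rows and approx_rows > 0:
        # average on-disk bytes per row, including tuple headers, padding, and page overhead
        print("avg bytes/row:", heap_bytes / approx_rows)
    print("total incl. indexes and TOAST:", total_bytes)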