bigdata

Python Replace one line in >20GB text file

天大地大妈咪最大 Submitted on 2019-12-11 11:40:27
Question: I am fully aware that there are many approaches to this problem. What I need is a simple Python script that replaces only one line in a large text file. It is always the fourth line from the beginning. As the file (actually, files) is bigger than 20GB, I don't want to load it into memory or create a copy, just replace one line efficiently. I'll be glad for any help in this regard. A. PS. I know vi can do it, but I need it as a script, so that someone not familiar with vi would be able to do it
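Not from the original thread, but a minimal sketch of the usual in-place approach: it assumes the new text fits within the original fourth line, so the line can be overwritten via seek() and padded (here with spaces) to keep every later byte offset unchanged. True in-place replacement without rewriting the remaining ~20GB is only possible under that length constraint. The file name and replacement string below are hypothetical.

def replace_fourth_line(path, new_line):
    """Overwrite line 4 in place; new_line must fit within the original line's byte length."""
    with open(path, "r+b") as f:
        # Skip the first three lines to find where line 4 starts.
        for _ in range(3):
            f.readline()
        start = f.tell()
        old = f.readline()          # original fourth line, including the trailing '\n'
        new = new_line.encode()
        if len(new) + 1 > len(old):
            raise ValueError("replacement is longer than the original line")
        # Pad with spaces so the total file length (and line 5's offset) is unchanged.
        padded = new + b" " * (len(old) - len(new) - 1) + b"\n"
        f.seek(start)
        f.write(padded)

# Hypothetical usage:
# replace_fourth_line("huge_file.txt", "new fourth line")

If the new line is longer than the old one, everything after it has to be shifted, which effectively means rewriting the rest of the file.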

Pentaho Frame size (17727647) larger than max length (16384000)!

为君一笑 Submitted on 2019-12-11 11:34:50
Question: In Pentaho, when I run a Cassandra Input step that returns around 50,000 rows, I get this exception. Is there a way to control the query result size in Pentaho? Or is there a way to stream the query result instead of fetching it all in bulk?
2014/10/09 15:14:09 - Cassandra Input.0 - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : Unexpected error
2014/10/09 15:14:09 - Cassandra Input.0 - ERROR (version 5.1.0.0, build 1 from 2014-06-19_19-02-57 by buildguy) : org.pentaho.di

Chunked UrlDataSource For Solr DataImportHandler

不想你离开。 Submitted on 2019-12-11 11:13:51
Question: I'm looking into chunking my data source for optimal data import into Solr and was wondering whether it is possible to use a master URL that chunks data into sections. For example, File 1 may have
<chunks>
<chunk url="http://localhost/chunker?start=0&stop=100" />
<chunk url="http://localhost/chunker?start=100&stop=200" />
<chunk url="http://localhost/chunker?start=200&stop=300" />
<chunk url="http://localhost/chunker?start=300&stop=400" />
<chunk url="http://localhost/chunker?start=400&stop=500"

Python (pyspark) Error = ValueError: could not convert string to float: “17”

风格不统一 Submitted on 2019-12-11 10:46:30
Question: I am working with Python on Spark and reading my dataset from a .csv file whose first few rows are:
17 0.2 7
17 0.2 7
39 1.3 7
19 1 7
19 0 7
When I read the file line by line with the code below:
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
I get this error:
Traceback (most recent call last):
File "<stdin>", line 3, in parsePoint
ValueError: could not convert string to float
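The error message quotes the offending token, which suggests the value reaching float() still carries stray characters (for example quote marks around a quoted CSV field, or leftover whitespace from doubled separators). A defensive variant of parsePoint along those lines, offered as a sketch rather than the accepted fix from the thread:

from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    # split() with no argument collapses runs of whitespace and drops empty tokens;
    # strip() removes surrounding quotes and whitespace that float() cannot parse.
    tokens = line.replace(',', ' ').split()
    values = [float(t.strip('\'" \t\r')) for t in tokens]
    return LabeledPoint(values[0], values[1:])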

Error in running Livy Spark server in Hue

佐手、 Submitted on 2019-12-11 10:46:07
Question: When I run the following command: hue livy_server
the following error is shown:
Failed to run spark-submit executable: java.io.IOException: Cannot run program "spark-submit": error=2, No such file or directory
I have set SPARK_HOME=/home/amandeep/spark
Answer 1: If you run Livy in local mode, it expects to find the spark-submit script in its environment. Check your shell PATH variable.
Source: https://stackoverflow.com/questions/31014656/error-in-running-livy-spark-server-in-hue
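A quick way to confirm whether spark-submit is actually resolvable from the environment the server runs in, as a small sketch (the SPARK_HOME path is the one from the question):

import os
import shutil

# Is spark-submit on PATH for this process?
print("spark-submit found at:", shutil.which("spark-submit"))

# If it is not, prepend $SPARK_HOME/bin to PATH before launching the server.
spark_home = os.environ.get("SPARK_HOME", "/home/amandeep/spark")
os.environ["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + os.environ["PATH"]
print("spark-submit now at:", shutil.which("spark-submit"))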

Hadoop M/R secondary sort not working, based on last name of the user

本小妞迷上赌 Submitted on 2019-12-11 10:42:50
Question: I want to sort the output based on the last name of the user; the key being used is firstName. Following are the classes I am using, but I am not getting output sorted by lastName. I am new to Hadoop; I wrote this using help from various internet sources. Main class:
public class WordCount {
    public static class Map extends Mapper<LongWritable, Text, CustomKey, Text> {
        public static final Log log = LogFactory.getLog(Map.class);
        private final static IntWritable one = new IntWritable(1);

MonetDB refresh data in background best strategy with active connections making queries

感情迁移 Submitted on 2019-12-11 10:28:58
Question: I'm testing MonetDB and getting amazing performance while querying millions of rows on my laptop. I expect to work with billions in production, and I need to update the data as often as possible, let's say every 1 minute, or every 5 minutes in the worst case. That means just updating existing records or adding new ones; deletion can be scheduled once a day. I've seen good performance for the updates in my tests, but I'm a bit worried about the same operations over three or four times more data. About BULK insert, got 1

Counting filtered items on a Spark dataframe

拈花ヽ惹草 Submitted on 2019-12-11 10:28:16
Question: I have the following dataframe: df. At some point I need to filter out items based on timestamps (milliseconds). However, it is important to me to record how many rows were filtered out (in case it is too many, I want to fail the job). Naively I can do:
======Lots of calculations on df ======
val df_filtered = df.filter($"ts" >= startDay && $"ts" <= endDay)
val filtered_count = df.count - df_filtered.count
However, it feels like complete overkill, since Spark will perform the whole execution tree 3 times
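The extra passes can be avoided by computing the total and the kept count in a single aggregation. The question's snippet is Scala, but a pyspark sketch of the same idea follows (not taken from the original post; MAX_FILTERED is a hypothetical threshold):

from pyspark.sql import functions as F

# One aggregation job returns both the total row count and the count of rows
# inside [startDay, endDay]; when() without otherwise() yields null for
# non-matching rows, and count() skips nulls.
counts = df.agg(
    F.count(F.lit(1)).alias("total"),
    F.count(F.when((F.col("ts") >= startDay) & (F.col("ts") <= endDay), 1)).alias("kept"),
).first()

filtered_count = counts["total"] - counts["kept"]
if filtered_count > MAX_FILTERED:  # MAX_FILTERED is a hypothetical threshold
    raise RuntimeError("too many rows filtered out: %d" % filtered_count)

df_filtered = df.filter((F.col("ts") >= startDay) & (F.col("ts") <= endDay))

Calling df.cache() before the aggregation is another option if the later filter should not recompute the full lineage.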

Is there any better method than collect to read an RDD in Spark?

最后都变了- Submitted on 2019-12-11 09:54:19
Question: So, I want to read an RDD into an array. For that purpose, I could use the collect method. But that method is really annoying, as in my case it keeps giving Kryo buffer overflow errors. If I set the Kryo buffer size too high, it starts to have its own problems. On the other hand, I have noticed that if I just save the RDD to a file using the saveAsTextFile method, I get no errors. So, I was thinking, there must be some better method of reading an RDD into an array which isn't as
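One commonly suggested alternative is RDD.toLocalIterator(), which brings data to the driver one partition at a time, so no single serialized result has to hold the whole dataset. A minimal pyspark sketch (rdd stands for whatever RDD the question refers to):

# Collect to a local list one partition at a time instead of in one giant result.
result = []
for record in rdd.toLocalIterator():
    result.append(record)

The driver still needs enough memory for the final array, but each fetched chunk is only about one partition in size, which is what the serializer buffer has to accommodate.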

How to set up Spark cluster on Windows machines?

不打扰是莪最后的温柔 Submitted on 2019-12-11 08:48:02
Question: I am trying to set up a Spark cluster on Windows machines. The way to go here is using standalone mode, right? What are the concrete disadvantages of not using Mesos or YARN? And how much pain would it be to use either one of those? Does anyone have experience here?
Answer 1: FYI, I got an answer in the user group: https://groups.google.com/forum/#!topic/spark-users/SyBJhQXBqIs Standalone mode is indeed the way to go. Mesos does not work under Windows, and YARN probably doesn't either.
Answer 2: