How to read specific lines from sparkContext

半城伤御伤魂 submitted on 2019-12-07 03:19:30

Question


Hi, I am trying to read specific lines from a text file using Spark.

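import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("data.txt");
String firstLine = lines.first();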

I can use the .first() method to fetch the first line of the data.txt document. How can I access the Nth line of the document? I need a Java solution.


Answer 1:


Apache Spark RDDs are not meant to be used for lookups. The most "efficient" way to get the nth line (counting from zero) would be lines.take(n + 1).get(n). Every time you do this, it will read the first n + 1 lines of the file. You could call lines.cache() to avoid re-reading the file, but it will still move those lines over the network in a very inefficient dance.

If the data can fit on one machine, just collect it all once, and access it locally: List<String> local = lines.collect(); local.get(n);.
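As a rough sketch of the collect-once approach: once the result of lines.collect() is held in a local list, each lookup is a plain list access. The helper name nthLine and the hard-coded sample list (standing in for the collected lines, so the snippet runs without Spark) are assumptions for illustration; n is zero-based.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class LocalLineLookup {
    // Hypothetical helper: returns the zero-based nth line,
    // or an empty Optional if n is out of range.
    public static Optional<String> nthLine(List<String> local, int n) {
        if (n < 0 || n >= local.size()) {
            return Optional.empty();
        }
        return Optional.of(local.get(n));
    }

    public static void main(String[] args) {
        // Stand-in for: List<String> local = lines.collect();
        List<String> local = Arrays.asList("first", "second", "third");
        System.out.println(nthLine(local, 1).orElse("<no such line>"));
        System.out.println(nthLine(local, 5).orElse("<no such line>"));
    }
}
```

Returning an Optional instead of calling local.get(n) directly is a design choice in this sketch, so that an out-of-range n does not throw IndexOutOfBoundsException.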

If the data does not fit on one machine, you need a distributed system which supports efficient lookups. Popular examples are HBase and Cassandra.

It is also possible that your problem can be solved efficiently with Spark, but not via lookups. If you explain the larger problem in a separate question, you may get a solution like that. (Lookups are very common in single-machine applications, but distributed algorithms have to think differently.)




Answer 2:


I think this is as fast as it gets:

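// Pair each line with its zero-based index, keep the one at position n,
// and return just the line text (first() throws if n is out of range).
def getNthLine(n: Long): String =
  lines.zipWithIndex().filter(_._2 == n).first()._1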



Answer 3:


As @Daniel Darabos said, RDDs are not indexed for line lookups, so an alternative is to give each line an index:

lines.zipWithIndex.filter(_._2==n).map(_._1).first()

This gives each line an index and then calls first() on the filtered RDD. The method is somewhat inefficient and unnecessary when your RDD is small. But when your RDD is very large, collecting it to the driver becomes inefficient (and may run into memory constraints), so this method becomes the better alternative.
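Since the question asks for Java, here is a minimal sketch of the same zipWithIndex-and-filter idea using the Java RDD API. The class name NthLine, the helper nthLine, the app name, and the local[*] master are assumptions for illustration; n is zero-based, and the helper returns null rather than throwing when the RDD has no such line.

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class NthLine {
    // Hypothetical helper: returns the zero-based nth line,
    // or null if the RDD has no line at that index.
    public static String nthLine(JavaRDD<String> lines, long n) {
        List<String> hit = lines
                .zipWithIndex()                  // JavaPairRDD<String, Long> of (line, index)
                .filter(pair -> pair._2() == n)  // keep only the pair whose index is n
                .keys()                          // drop the index, keep the line text
                .take(1);                        // at most one element; unlike first(), no exception on empty
        return hit.isEmpty() ? null : hit.get(0);
    }

    public static void main(String[] args) {
        // Assumed settings: run locally and read the question's data.txt
        SparkConf conf = new SparkConf().setAppName("nth-line").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("data.txt");
            System.out.println(nthLine(lines, 2));
        }
    }
}
```

Each call still scans the whole RDD, so this suits occasional lookups on data too large to collect, not repeated random access.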



Source: https://stackoverflow.com/questions/35221033/how-to-read-specific-lines-from-sparkcontext
