sequencefile

Extend SequenceFileInputFormat to include file name+offset

我怕爱的太早我们不能终老 submitted on 2019-12-03 03:24:33
I would like to be able to create a custom InputFormat that reads sequence files, but additionally exposes the file path and offset within that file where the record is located. To take a step back, here's the use case: I have a sequence file containing variably-sized data. The keys are mostly irrelevant, and the values are up to a couple of megabytes containing a variety of different fields. I would like to index some of these fields in Elasticsearch along with the file name and offset. This way, I can query those fields from Elasticsearch, and then use the file name and offset to go back to the …
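The excerpt is cut off before any answer, but the general shape of one solution follows from the question itself: build a RecordReader directly on SequenceFile.Reader and note the byte position before each record is read. The sketch below is illustrative, not the asker's final code; it assumes the new org.apache.hadoop.mapreduce API, and the class names PathOffsetSequenceFileInputFormat and PathOffsetRecordReader, as well as the path:offset string key, are made up for the example.

```java
// Hypothetical sketch: a RecordReader built directly on SequenceFile.Reader that exposes
// "file path + byte offset" for every record. Class names are invented for illustration.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.util.ReflectionUtils;

public class PathOffsetSequenceFileInputFormat extends SequenceFileInputFormat<Text, Writable> {

  @Override
  public RecordReader<Text, Writable> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new PathOffsetRecordReader();
  }

  public static class PathOffsetRecordReader extends RecordReader<Text, Writable> {
    private SequenceFile.Reader reader;
    private Path path;
    private long start, end;
    private Writable key, value;                    // original key is read but then discarded
    private final Text pathAndOffset = new Text();
    private boolean more = true;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      Configuration conf = ctx.getConfiguration();
      FileSplit fileSplit = (FileSplit) split;
      path = fileSplit.getPath();
      start = fileSplit.getStart();
      end = start + fileSplit.getLength();
      reader = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
      key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      if (start > reader.getPosition()) {
        reader.sync(start);                         // skip to the first sync mark in this split
      }
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (!more) return false;
      long offset = reader.getPosition();           // byte offset of the record about to be read
      boolean gotRecord = reader.next(key, value);
      if (offset >= end && reader.syncSeen()) {     // past our split and into the next one
        more = false;
      } else {
        more = gotRecord;
      }
      if (more) {
        // e.g. index this string in Elasticsearch alongside the extracted fields
        pathAndOffset.set(path.toString() + ":" + offset);
      }
      return more;
    }

    @Override public Text getCurrentKey() { return pathAndOffset; }
    @Override public Writable getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException {
      return end == start ? 0.0f : Math.min(1.0f, (reader.getPosition() - start) / (float) (end - start));
    }
    @Override public void close() throws IOException { reader.close(); }
  }
}
```

The split-boundary handling mirrors Hadoop's own SequenceFileRecordReader: a record is dropped only when it starts at or past the split end after a sync mark has been seen, so neighbouring splits do not double-count records.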

Hadoop HDFS: Read sequence files that are being written

半世苍凉 submitted on 2019-12-02 04:45:37
I am using Hadoop 1.0.3. I write logs to a Hadoop sequence file in HDFS, calling syncFS() after each batch of logs, but I never close the file (except when performing daily rolling). What I want to guarantee is that the file is available to readers while it is still being written. I can read the bytes of the sequence file via FSDataInputStream, but if I try to use SequenceFile.Reader.next(key,val), it returns false at the first call. I know the data is in the file, since I can read it with FSDataInputStream or with the cat command, and I am 100% sure that syncFS() is called. I …
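The excerpt is truncated before any resolution. For reference, one common tail-following pattern is sketched below: periodically reopen a fresh SequenceFile.Reader, skip the records already consumed, and print whatever has become visible. The caveat is exactly the behaviour described above: a reader only sees data once the writer has synced and HDFS reports the longer file length, so on Hadoop 1.0.3 freshly appended records may still lag. Text keys/values and the 5-second poll interval are assumptions for the sketch.

```java
// Naive tail-follow over a sequence file that is still being written. It rescans from the
// start on every pass, which keeps the sketch simple and correct at the cost of efficiency.
// 'Text' keys/values are an assumption; substitute the real record types.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileTailer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    long recordsSeen = 0;                       // records already printed on earlier passes

    while (true) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        Text key = new Text();
        Text value = new Text();
        long i = 0;
        while (reader.next(key, value)) {       // returns false once the visible part is exhausted
          if (i++ >= recordsSeen) {             // only print records not seen before
            System.out.println(key + "\t" + value);
            recordsSeen = i;
          }
        }
      } finally {
        reader.close();
      }
      Thread.sleep(5000);                       // poll again; a fresh reader picks up any new length
    }
  }
}
```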

How can I use Mahout's sequencefile API code?

喜欢而已 submitted on 2019-12-01 08:06:45
Mahout has a command to create a sequence file: bin/mahout seqdirectory -c UTF-8 -i <input address> -o <output address>. I want to use this command as a code API. Julian Ortega: You can do something like this: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.SequenceFile; import org.apache.hadoop.io.Text; Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path outputPath = new Path("c:\\temp"); Text key = new Text(); // Example, this can be another type of …
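The answer is cut off mid-snippet. A fuller sketch in the same spirit is shown below: it does by hand roughly what bin/mahout seqdirectory does, walking an input directory and appending one (file name, UTF-8 file contents) pair per file to a sequence file. The class name, argument handling, in-memory buffering, and lack of recursion are all simplifications for illustration, not Mahout's actual implementation.

```java
// Sketch of the seqdirectory idea on the plain Hadoop API: one (file name, file contents)
// record per input file. args[0] is the input directory, args[1] the sequence file to create.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqDirectoryWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);
    Path outputFile = new Path(args[1]);

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, outputFile, Text.class, Text.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDirectory()) {
          continue;                              // seqdirectory recurses; this sketch does not
        }
        byte[] raw = new byte[(int) status.getLen()];   // assumes files small enough to buffer
        FSDataInputStream in = fs.open(status.getPath());
        try {
          in.readFully(0, raw);
        } finally {
          in.close();
        }
        Text key = new Text(status.getPath().getName());
        Text value = new Text(new String(raw, StandardCharsets.UTF_8));  // the -c UTF-8 part
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }
  }
}
```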

Reading Sequence File in PySpark 2.0

馋奶兔 submitted on 2019-11-29 12:44:09
I have a sequence file whose values look like (string_value, json_value); I don't care about the string value. In Scala I can read the file with val reader = sc.sequenceFile[String, String]("/path...") val data = reader.map{case (x, y) => (y.toString)} val jsondata = spark.read.json(data) I am having a hard time converting this to PySpark. I have tried using reader = sc.sequenceFile("/path", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text") data = reader.map(lambda x,y: str(y)) jsondata = spark.read.json(data) The errors are cryptic, but I can provide them if that helps. My question is, is …

Using PySpark, read/write 2D images on the Hadoop file system

会有一股神秘感。 submitted on 2019-11-29 09:36:38
Question: I want to be able to read/write images on an HDFS file system and take advantage of HDFS locality. I have a collection of images where each image is composed of 2D arrays of uint16, with basic additional information stored as an XML file. I want to create an archive on the HDFS file system and use Spark for analyzing the archive. Right now I am struggling over the best way to store the data on HDFS in order to be able to take full advantage of the Spark+HDFS structure. From what I …
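The question is truncated before the options it goes on to weigh. One layout that is commonly suggested for this kind of data, sketched below against the plain Hadoop API rather than in PySpark, is to pack the images into SequenceFiles with the image name as a Text key and the raw uint16 pixel buffer plus its dimensions as a BytesWritable value; Spark can then load the archive with sc.sequenceFile and decode each value in a map. The encoding, paths, and sizes here are placeholders, not the asker's eventual solution.

```java
// Sketch: pack many small images into one SequenceFile so HDFS/Spark deal with a few large,
// splittable files instead of millions of small ones. Paths, names, and sizes are placeholders.
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagePacker {

  /** Serializes one image (row-major uint16 pixels plus width/height) into a byte array. */
  static byte[] encode(short[] pixels, int width, int height) {
    ByteBuffer buf = ByteBuffer.allocate(8 + 2 * pixels.length);
    buf.putInt(width).putInt(height);
    for (short p : pixels) {
      buf.putShort(p);                         // each uint16 stored as two big-endian bytes
    }
    return buf.array();
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/archive/images.seq");     // placeholder output path

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // In a real run the pixels and metadata would come from the source images + XML files.
      short[] image = new short[512 * 512];
      writer.append(new Text("image-0001"), new BytesWritable(encode(image, 512, 512)));
    } finally {
      writer.close();
    }
  }
}
```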

hadoop mapreduce: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

六眼飞鱼酱① submitted on 2019-11-29 02:20:26
I am trying to write a Snappy block-compressed sequence file from a map-reduce job. I am using Hadoop 2.0.0-cdh4.5.0 and snappy-java 1.0.4.1. Here is my code: package jinvestor.jhouse.mr; import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.OutputStream; import java.util.Arrays; import java.util.List; import jinvestor.jhouse.core.House; import jinvestor.jhouse.core.util.HouseAvroUtil; import jinvestor.jhouse.download.HBaseHouseDAO; import org.apache.commons.io.IOUtils; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org …
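The listing above is cut off at the imports. For context, the sketch below shows how a job typically asks for Snappy block compression on a SequenceFile output; the UnsatisfiedLinkError in the title is usually not about these settings at all, but about the native libhadoop/libsnappy libraries missing from java.library.path, which snappy-java on the classpath does not replace. The class and method names in the sketch are illustrative, not the asker's code.

```java
// Typical output-side configuration for a Snappy block-compressed SequenceFile.
// The native Hadoop/Snappy libraries still have to be resolvable at runtime (e.g. via
// -Djava.library.path pointing at Hadoop's lib/native), or the job fails with the error above.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SnappySeqFileJobSetup {
  public static Job configure(Configuration conf, Path output) throws IOException {
    Job job = Job.getInstance(conf, "snappy block-compressed sequence file");
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, output);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job;
  }
}
```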

Write and read raw byte arrays in Spark - using SequenceFile

≯℡__Kan透↙ submitted on 2019-11-28 03:37:47
Question: How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again? Answer 1: A common problem seems to be getting a weird cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable's getBytes does not return just your bytes: it returns your bytes and then adds a ton of zeros on the end! You have to use copyBytes val rdd: RDD[Array[Byte]] = ??? // To write rdd.map(bytesArray => (NullWritable.get() …
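The answer's Scala snippet is cut off above. For reference, here is a sketch of the same recipe through Spark's Java API, with hypothetical helper names: each byte array is stored as a (NullWritable, BytesWritable) pair in a SequenceFile, and copyBytes() is used on the way back so the zero padding the answer complains about is stripped.

```java
// Write an RDD of raw byte arrays to a SequenceFile and read it back, per the answer above.
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RawBytesSequenceFile {

  public static void write(JavaRDD<byte[]> rdd, String path) {
    JavaPairRDD<NullWritable, BytesWritable> pairs = rdd.mapToPair(
        bytes -> new Tuple2<NullWritable, BytesWritable>(NullWritable.get(), new BytesWritable(bytes)));
    pairs.saveAsNewAPIHadoopFile(path, NullWritable.class, BytesWritable.class,
        SequenceFileOutputFormat.class);
  }

  public static JavaRDD<byte[]> read(JavaSparkContext sc, String path) {
    JavaPairRDD<NullWritable, BytesWritable> pairs =
        sc.sequenceFile(path, NullWritable.class, BytesWritable.class);
    // copyBytes() returns a copy trimmed to the real length; getBytes() would hand back
    // the padded backing array, which is the pitfall the answer warns about.
    return pairs.map(pair -> pair._2().copyBytes());
  }
}
```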
