Multiple Inputs with MRJob

Asked by 失恋的感觉 on 2020-12-29 14:58

I'm trying to learn to use Yelp's Python API for MapReduce, MRJob. Their simple word counter example makes sense, but I'm curious how one would handle an application involving multiple inputs.

5 Answers
  •  执念已碎
    2020-12-29 15:29

    In my understanding, you would not be using MRJob unless you wanted to leverage a Hadoop cluster or Hadoop services from Amazon, even though the examples run on local files.

    MRJob in principle uses "Hadoop streaming" to submit the job.

    This means that all inputs, whether specified as files or folders, are streamed from Hadoop to the mappers, and the mappers' results are streamed on to the reducers. Each mapper obtains a slice of the input and treats all of it as schematically identical, parsing each slice uniformly into key, value pairs.

    Following from this, all inputs must look schematically the same to the mapper. The only way to include two schematically different kinds of data is to interleave them in the same file, in such a way that the mapper can tell which lines are vector data and which are matrix data.
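As a minimal illustration of that uniform parsing, here is plain Python standing in for a Hadoop streaming mapper (the function name and the tab-separated key/value convention are my assumptions, not from the post):

```python
def streaming_mapper(stream):
    """Hadoop-streaming-style mapper: every input line, from whichever
    file or folder it came, arrives the same way and is parsed with one
    uniform scheme (here, tab-separated key/value)."""
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value
```

In real Hadoop streaming the mapper reads lines from stdin and writes tab-separated pairs to stdout; mrjob generates that plumbing for you.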

    You are actually doing it already.
    

    You can improve on this by prefixing each line with a tag that says whether it carries matrix data or vector data. Once the mapper sees vector data, it applies the preceding matrix data to it.

    matrix, 1, 2, ...
    matrix, 2, 4, ...
    vector, 3, 4, ...
    matrix, 1, 2, ...
    .....
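A plain-Python sketch of a mapper for that tagged format (no mrjob dependency; the "apply the buffered matrix rows when a vector arrives" logic is my reading of the answer, and the function names are hypothetical):

```python
def parse_tagged_line(line):
    """Split 'matrix, 1, 2, ...' or 'vector, 3, 4, ...' into (tag, values)."""
    parts = [p.strip() for p in line.split(",")]
    tag, values = parts[0], [float(v) for v in parts[1:]]
    if tag not in ("matrix", "vector"):
        raise ValueError("unknown tag: %r" % tag)
    return tag, values

def tagged_mapper(lines):
    """Buffer matrix rows; when a vector line arrives, apply the buffered
    rows to it (one dot product per row) and emit (row_index, result)."""
    matrix_rows = []
    for line in lines:
        tag, values = parse_tagged_line(line)
        if tag == "matrix":
            matrix_rows.append(values)
        else:  # vector: apply the preceding matrix data to it
            for i, row in enumerate(matrix_rows):
                yield i, sum(a * x for a, x in zip(row, values))
            matrix_rows = []
```

For example, feeding it the rows of a 2x2 matrix followed by a vector yields one partial result per matrix row.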
    

    That said, the process you have described works. You just have to have all the schematically distinct data in a single file.

    This still has issues, though. Key/value MapReduce works best when a single line contains the complete schema, i.e. one complete processing unit.

    From my understanding, you are already doing it correctly, but I suspect MapReduce is not a well-suited mechanism for this kind of data. I hope someone can clarify this further than I could.
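One way to get "a complete processing unit per line", sketched under my own assumptions (a hypothetical per-cell format, not from the post: each matrix line carries "matrix, i, j, value" and each vector line "vector, j, value", keyed by column index so a reducer can join them):

```python
from collections import defaultdict

def cell_mapper(lines):
    """Key every record by column index j, so that matching matrix cells
    and vector entries meet in the same reduce group."""
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if parts[0] == "matrix":
            i, j, value = int(parts[1]), int(parts[2]), float(parts[3])
            yield j, ("matrix", i, value)
        else:
            j, value = int(parts[1]), float(parts[2])
            yield j, ("vector", value)

def column_reducer(j, records):
    """Multiply each matrix cell in column j by the vector entry x_j and
    emit partial products keyed by row index i."""
    x_j = None
    cells = []
    for rec in records:
        if rec[0] == "vector":
            x_j = rec[1]
        else:
            cells.append((rec[1], rec[2]))
    for i, value in cells:
        yield i, value * x_j

def run_local(lines):
    """Tiny in-memory shuffle gluing mapper and reducer together, then
    summing the partial products per row."""
    groups = defaultdict(list)
    for k, v in cell_mapper(lines):
        groups[k].append(v)
    totals = defaultdict(float)
    for j, recs in groups.items():
        for i, partial in column_reducer(j, recs):
            totals[i] += partial
    return dict(totals)
```

Because every line is a self-contained record, no ordering or buffering assumptions are needed, which is the property the answer is pointing at.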
