How to pre-package external libraries when using Spark on a Mesos cluster

無奈伤痛 · 2020-12-10 05:30 · 1939 views

According to the Spark on Mesos docs, one needs to set the spark.executor.uri property to point to a Spark distribution:

val conf = new SparkConf()
  .set("spark.executor.uri", "<URL of the Spark distribution .tar.gz>")
4 Answers
  • 2020-12-10 05:33

    Yeah, you can copy the dependencies out to the workers and put them in the system-wide JVM lib directory in order to get them on the classpath.

    Then you can mark those dependencies as "provided" in your sbt build, and they won't be included in the assembly (see the sketch below). This does speed up assembly and transfer time.

    I haven't tried this on Mesos specifically, but I have used it on Spark standalone for dependencies that are needed by every job and rarely change.

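    As a sketch, the sbt side could look like this (names and versions are illustrative; anything installed system-wide on the workers is marked "provided" so that sbt-assembly leaves it out of the fat JAR):

    // build.sbt (the `assembly` task itself comes from the sbt-assembly plugin)
    name := "my-spark-job"

    scalaVersion := "2.10.5"

    libraryDependencies ++= Seq(
      // Spark is already on every worker, so don't bundle it.
      "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",
      // Illustrative: a library copied into the workers' system-wide JVM lib directory.
      "joda-time" % "joda-time" % "2.4" % "provided",
      // Libraries that change often still go into the assembly as usual.
      "com.databricks" %% "spark-csv" % "1.1.0"
    )
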
  • 2020-12-10 05:42

    When you say pre-package, do you really mean distributing the packages to all the slaves and setting up the jobs to use them, so that you don't need to download them every time? That might be an option; however, it sounds a bit cumbersome, because distributing everything to the slaves and keeping all the packages up to date is not an easy task.

    How about breaking your .tar.gz into smaller pieces, so that instead of a single fat file your jobs fetch several smaller files? In that case it should be possible to leverage the Mesos Fetcher Cache (see the sketch below). You'll see poor performance while an agent's cache is cold, but once it warms up (i.e. once one job has run and downloaded the common files locally), consecutive jobs will complete faster.

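    As a sketch of that wiring (assuming a Spark version new enough to expose the Mesos fetcher-cache setting, i.e. 2.1+; hosts and archive names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://<mesos-master-host>:5050")
      .setAppName("prepackaged-libs")
      // Let the Mesos Fetcher Cache keep downloaded URIs on each agent, so only
      // the first job on a cold agent pays the download cost.
      .set("spark.mesos.fetcherCache.enable", "true")
      // The Spark distribution itself, fetched (and cached) per agent.
      .set("spark.executor.uri", "http://<host>/spark-x.y.z-bin.tar.gz")
      // The dependency archives, split into smaller pieces as suggested above.
      .set("spark.mesos.uris", "http://<host>/deps-part1.tar.gz,http://<host>/deps-part2.tar.gz")
    val sc = new SparkContext(conf)
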
  • 2020-12-10 05:48

    Create a sample Maven project with all your dependencies and then use the maven-shade-plugin. It will create one shaded JAR in your target folder.

    Here is a sample pom:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <groupId>com</groupId>
        <artifactId>test</artifactId>
        <version>0.0.1</version>
        <properties>
            <java.version>1.7</java.version>
            <hadoop.version>2.4.1</hadoop.version>
            <spark.version>1.4.0</spark.version>
            <version.spark-csv_2.10>1.1.0</version.spark-csv_2.10>
            <version.spark-avro_2.10>1.0.0</version.spark-avro_2.10>
        </properties>
        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>${java.version}</source>
                        <target>${java.version}</target>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.3</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                        </execution>
                    </executions>
                    <configuration>
                        <!-- <minimizeJar>true</minimizeJar> -->
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                    <exclude>org/bdbizviz/**</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <finalName>spark-${project.version}</finalName>
                    </configuration>
                </plugin>
            </plugins>
        </build>
        <dependencies>
            <dependency> <!-- Hadoop dependency -->
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-client</artifactId>
                <version>${hadoop.version}</version>
                <exclusions>
                    <exclusion>
                        <artifactId>servlet-api</artifactId>
                        <groupId>javax.servlet</groupId>
                    </exclusion>
                    <exclusion>
                        <artifactId>guava</artifactId>
                        <groupId>com.google.guava</groupId>
                    </exclusion>
                </exclusions>
            </dependency>
            <dependency>
                <groupId>joda-time</groupId>
                <artifactId>joda-time</artifactId>
                <version>2.4</version>
            </dependency>
    
            <dependency> <!-- Spark Core -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.10</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency> <!-- Spark SQL -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.10</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency> <!-- Spark CSV -->
                <groupId>com.databricks</groupId>
                <artifactId>spark-csv_2.10</artifactId>
                <version>${version.spark-csv_2.10}</version>
            </dependency>
            <dependency> <!-- Spark Avro -->
                <groupId>com.databricks</groupId>
                <artifactId>spark-avro_2.10</artifactId>
                <version>${version.spark-avro_2.10}</version>
            </dependency>
            <dependency> <!-- Spark Hive -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive_2.10</artifactId>
                <version>${spark.version}</version>
            </dependency>
            <dependency> <!-- Spark Hive thriftserver -->
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-hive-thriftserver_2.10</artifactId>
                <version>${spark.version}</version>
            </dependency>
        </dependencies>
    </project>
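
    Running mvn package then produces the shaded JAR as target/spark-0.0.1.jar (per the finalName above), which you can ship as a single self-contained artifact.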
    
  • 2020-12-10 05:56

    After I discovered the Spark JobServer project, I decided that it is the most suitable option for my use case.

    It supports dynamic context creation via a REST API, as well as adding JARs to the newly created context manually or programmatically. It is also capable of running low-latency synchronous jobs, which is exactly what I need (a minimal job sketch follows at the end of this answer).

    I created a Dockerfile so you can try it out with the most recent supported versions of Spark (1.4.1), Spark JobServer (0.6.0) and built-in Mesos support (0.24.1):

    • https://github.com/tobilg/docker-spark-jobserver
    • https://hub.docker.com/r/tobilg/spark-jobserver/

    References:

    • https://github.com/spark-jobserver/spark-jobserver#features
    • https://github.com/spark-jobserver/spark-jobserver#context-configuration
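
    For completeness, a minimal sketch of what such a job looks like on the code side (class and config key names are illustrative, based on the JobServer 0.6.x SparkJob API); it is uploaded once as a JAR and then triggered through the REST API, synchronously if you pass sync=true:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    // Hypothetical example job: counts the words passed in via the "input.string" config value.
    object WordCountJob extends SparkJob {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("input.string")) SparkJobValid
        else SparkJobInvalid("Missing input.string config parameter")

      override def runJob(sc: SparkContext, config: Config): Any =
        sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
    }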