Question
Use case description: I want to use Ignite embedded mode to maintain an in-memory cache to speed up my Spark jobs.
1) How does TCP SPI discovery work in Ignite embedded mode? The documentation states that in embedded mode the life cycle of the Ignite nodes is managed by Spark, and the nodes are started and killed from inside the Spark job itself. Since the Ignite nodes are bound to YARN containers, is it still necessary to pass the SPI configuration, or does service discovery happen automatically/dynamically?
2) Building on the first question: how do we go about launching a Spark job with, say, 4 Spark executors but firing up only 2 Ignite nodes?
3) I am providing sample code below which I developed, and my job was killed because it exceeded the memory limit. I have already gone through the capacity planning page in the official documentation. My data is around 300 MB, and I expect it to consume around 1.5 GB of memory in the worst case, with no replication and an index on one integer field.
My cluster configuration: 1 master (24 GB memory, 2-core CPU) and 2 slaves (8 GB memory, 2-core CPU each).
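For context, a rough back-of-the-envelope check of the 1.5 GB worst-case estimate above; the expansion factors used here are my own assumptions, not figures taken from the capacity planning guide:
val rawDataMb     = 300.0  // size of the source data in Hive
val cacheOverhead = 3.0    // assumed worst-case expansion when stored as cache entries
val indexOverhead = 1.3    // assumed ~30% extra for the index on the integer field
val estimatedMb   = rawDataMb * cacheOverhead * indexOverhead // ≈ 1170 MB, in line with the ~1.5 GB worst case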
import java.io.Serializable
import java.util.Arrays
import scala.annotation.meta.field
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.ignite._
import org.apache.ignite.spark._
import org.apache.ignite.configuration._
import org.apache.ignite.cache.CacheMode
import org.apache.ignite.cache.query.annotations.QuerySqlField
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder
// standalone = false: embedded mode, Ignite server nodes are started inside the Spark executors
val ic = new IgniteContext(sc, () => {
  // Static IP finder: list the discovery addresses of the nodes in the cluster
  val discoverySpi = new TcpDiscoverySpi()
  val finder = new TcpDiscoveryVmIpFinder()
  finder.setAddresses(Arrays.asList("127.0.0.1:47500")) // and more IP addresses in my cluster
  discoverySpi.setIpFinder(finder)

  // Allow the default data region to grow to 16 GB of off-heap memory
  val dataStorage = new DataStorageConfiguration()
  dataStorage.getDefaultDataRegionConfiguration().setMaxSize(16L * 1024 * 1024 * 1024) // 16 GB

  val cfg = new IgniteConfiguration()
  cfg.setDiscoverySpi(discoverySpi)
  cfg.setDataStorageConfiguration(dataStorage)
  cfg
}, false)
// Cache value type; fields annotated for SQL queries, with an index on id
case class User(
  @(QuerySqlField @field)(index = true) id: Int,
  @(QuerySqlField @field) gender: String,
  @(QuerySqlField @field) marks: Int
) extends Serializable
val cacheCfg = new CacheConfiguration[Int, User]("sharedRDD")
cacheCfg.setIndexedTypes(classOf[Int], classOf[User])
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
val df2 = spark.sql("select * from db_name.User") // read data from a Hive table
val df2_rdd = df2.rdd
val data_rdd = df2_rdd.map(x => User(x.getInt(0), x.getString(1), x.getInt(2)))

val tcache: IgniteRDD[Int, User] = ic.fromCache(cacheCfg)
tcache.savePairs(data_rdd.map(x => (x.id, x)))

val result = tcache.sql("select * from User u1 left join User u2 on (u1.id = u2.id)") // test query for self join
This program works fine until I do the self join; simple queries like "select * from User limit 5" run perfectly.
Error Log:
WARN TcpDiscoverySpi: Failed to connect to any address from IP finder (will retry to join topology every 2 secs)
WARN TcpCommunicationSpi: Connect timed out
WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limit. Consider boosting spark.yarn.executor.memoryOverhead
I had already increased the spark.yarn.executor.memoryOverhead parameter to 2 GB and the executor memory to 6 GB. I am still not able to figure out what I am missing here, considering that my data size is only a mere 300 MB.
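For reference, a minimal sketch of how these values can be set; it assumes the job is submitted as an application with its own SparkConf (in spark-shell the same properties would be passed on the command line instead):
import org.apache.spark.{SparkConf, SparkContext}

val sparkConf = new SparkConf()
  .setAppName("ignite-embedded-cache")                // illustrative application name
  .set("spark.executor.memory", "6g")                 // executor heap, as mentioned above
  .set("spark.yarn.executor.memoryOverhead", "2048")  // off-heap overhead for YARN, in MB
val sc = new SparkContext(sparkConf)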
Answer 1:
- Discovery SPI works in embedded mode just as it does for a regular Apache Ignite node, so you need to configure the Discovery SPI (especially the IP finder) properly anyway. More details about node discovery can be found in the documentation; choose the option most applicable to your case (a minimal sketch is shown after this list).
- If you don't need an Apache Ignite instance in the Spark job, just don't create an IgniteContext object.
- I think your JVM consumes a significant part of this memory. You need to check your JVM settings.
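For example, a minimal discovery configuration sketch; the multicast-based IP finder shown here is just one option, and the multicast group value is purely illustrative:
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi
import org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder

// Multicast IP finder: nodes locate each other without a hard-coded address list
val ipFinder = new TcpDiscoveryMulticastIpFinder()
ipFinder.setMulticastGroup("228.10.10.157") // illustrative multicast group

val spi = new TcpDiscoverySpi()
spi.setIpFinder(ipFinder)

val cfg = new IgniteConfiguration()
cfg.setDiscoverySpi(spi)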
Source: https://stackoverflow.com/questions/47381037/ignite-tcp-spi-discovery-and-memory-management-in-ignite-embedded