String similarity with OR condition in MinHash Spark ML

…衆ロ難τιáo~ submitted on 2019-12-02 06:43:43

I don't think it is possible to set two input columns (one dataString column for each of a' and b') and then apply an OR while computing the distance. You can, however, transform dataset1 so that each row appears in both variants, x' + y' + a' and x' + y' + b', and then do the distance computation on that. This won't give you exactly the same answer as selecting a' or b' based on the corresponding row in dataset2 (I think you know how to do that expensive operation), but it still gives some sense of similarity.

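import org.apache.spark.sql.functions.{array, concat_ws, explode}
import spark.implicits._ // enables $"..."; assumes a SparkSession named `spark`

// One output row per variant: column "a" holds a' in one row and b' in the other.
val dataset1splitted =
    dataset1
    .withColumn( "a", explode( array( "a'", "b'" ) ) )
    .drop( "a'", "b'", "dataString" )
    .withColumn( "dataString", concat_ws( "|", $"x'", $"y'", $"a" ) )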
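
To make the OR idea concrete, here is a minimal plain-Scala sketch without Spark. It is only an illustration under assumptions: it computes the exact Jaccard distance (the quantity MinHash approximates) on the "|"-separated tokens, whereas your real pipeline may tokenize dataString differently, and the names `OrSimilaritySketch`, `tokens`, `variants` and `orDistance` are made up for this example. The OR is expressed by building both variants of a dataset1 row and keeping the smaller distance.

```scala
object OrSimilaritySketch {
  // Split a "|"-joined dataString into its set of tokens.
  // Assumption: one token per field; a real pipeline may use shingles instead.
  def tokens(dataString: String): Set[String] =
    dataString.split('|').filter(_.nonEmpty).toSet

  // Exact Jaccard distance: 1 - |A intersect B| / |A union B|.
  def jaccardDistance(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else 1.0 - (a intersect b).size.toDouble / (a union b).size.toDouble

  // Build both variants of one dataset1 row, mirroring the explode step.
  def variants(x: String, y: String, a: String, b: String): Seq[String] =
    Seq(a, b).map(last => Seq(x, y, last).mkString("|"))

  // OR semantics: a row matches through a' or b', so keep the closer variant.
  def orDistance(x: String, y: String, a: String, b: String, other: String): Double =
    variants(x, y, a, b)
      .map(v => jaccardDistance(tokens(v), tokens(other)))
      .min

  def main(args: Array[String]): Unit = {
    // The b' variant matches the dataset2 string exactly.
    println(orDistance("x1", "y1", "a1", "b1", "x1|y1|b1")) // prints 0.0
    // Only the a' variant shares two of the four distinct tokens.
    println(orDistance("x1", "y1", "a1", "b1", "x1|y2|a1")) // prints 0.5
  }
}
```

With the exploded dataset the same effect would come from keeping, per original dataset1 row, the smallest distance among its two variant rows after the similarity computation.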