Similarity algorithm advice, using two dimensional associative array

Posted by 假装没事ソ on 2021-02-08 12:10:38

Question


The main goal of this algorithm is to find similar news article titles from different web sources and group them, say those that are at least about 55% similar.

My current approach consists of the following steps:

  • Feed data from the MySQL database into a two-dimensional array, e.g. $arrayOne.
  • Make a copy of that array, e.g. $arrayTwo.
  • Create an empty array that will hold only the similar titles and their related content, e.g. $array_smlr.
  • Loop over $arrayOne and, for each article_title, check its similarity against every article_title in $arrayTwo.
  • If the similarity between two titles is above 55% and the articles are not from the same news source (so that I don't compare articles from the same source), add the match to $array_smlr.
  • Sort $array_smlr by similarity percentage, so that similar titles end up next to each other.

Below is my code for the steps above.

$result = mysqli_query($conn,"SELECT id_articles,article_img,article_title,LEFT(article_content , 200),psource, date_fetched FROM project.articles WHERE " . rtrim($values,' or') . " ORDER BY date_fetched DESC LIMIT 70");

$arrayOne=array();
$arrayTwo=array();

while($row = mysqli_fetch_assoc($result)){
    $arrayOne[] = $row;
}
$arrayTwo = $arrayOne;
$array_smlr=array();
foreach ($arrayOne as $rowOne) {
    foreach($arrayTwo as $rowTwo){
        $compare = similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        if ( round($p,2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource'] ){
            $data =  array('percentage' => round($p,2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' =>$rowTwo['date_fetched']);
            $array_smlr[]=$data; 
        }
    }
}
array_multisort($array_smlr);
foreach($array_smlr as $row3){
    echo $row3['percentage'] . $row3['article_title'] . $row3['psource'] . $row3['id_articles'] . $row3['date_fetched'] . "<br><br>";
}

This works only in a limited way: it is fine when exactly two titles are similar, but with three similar titles, for example, $array_smlr ends up containing duplicated rows.

I would appreciate any suggestions for optimizing this algorithm and improving its performance.
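A minimal sketch of the duplication, assuming three hypothetical rows (the titles and source names are invented for illustration and only the article ids are collected):

```php
<?php
// Hypothetical rows standing in for the MySQL result set.
$arrayOne = [
    ['id_articles' => 1, 'article_title' => 'Apple unveils new iPhone',       'psource' => 'siteA'],
    ['id_articles' => 2, 'article_title' => 'Apple unveils new iPhone today', 'psource' => 'siteB'],
    ['id_articles' => 3, 'article_title' => 'Apple unveils the new iPhone',   'psource' => 'siteC'],
];

$array_smlr = [];
foreach ($arrayOne as $rowOne) {
    foreach ($arrayOne as $rowTwo) {
        similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
        // Same condition as in the code above: similar enough and from a different source.
        if (round($p, 2) >= 55.50 && $rowOne['psource'] != $rowTwo['psource']) {
            $array_smlr[] = $rowTwo['id_articles'];
        }
    }
}

// Every article is appended once per matching partner, so each id shows up twice.
echo implode(',', $array_smlr), PHP_EOL; // prints 2,3,1,3,1,2
```

With three mutually similar titles the loop stores six rows for three articles, which is the duplication described above.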

Thanks,


Answer 1:


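You don't really need two arrays. Instead of a foreach loop without the $key variable, use one with $key on both loops over the same array, and skip the comparison when the two keys are the same. If you also index $array_smlr by the article id, each article is stored only once, so you avoid the duplicates as well.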

$array_smlr = array();
foreach ($arrayOne as $key => $rowOne) {
    foreach ($arrayOne as $ikey => $rowTwo) {
        // Skip comparing an article with itself.
        if ($ikey != $key) {
            similar_text($rowOne['article_title'], $rowTwo['article_title'], $p);
            if (round($p, 2) >= 55.50 and $rowOne['psource'] != $rowTwo['psource']) {
                $data = array('percentage' => round($p, 2), 'article_title' => $rowTwo['article_title'], 'psource' => $rowTwo['psource'], 'id_articles' => $rowTwo['id_articles'], 'date_fetched' => $rowTwo['date_fetched']);
                // Keying by article id overwrites a repeated match instead of appending a duplicate row.
                $array_smlr[$rowTwo['id_articles']] = $data;
            }
        }
    }
}
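
If you want whole groups rather than a flat list, one possible extension (not part of the loop above, just a sketch) is to compare each pair only once and merge matching articles with a small union-find structure. This assumes PHP 7 or later; the sample rows and the names findRoot and groupSimilarTitles are hypothetical and chosen only for illustration:

```php
<?php
// Hypothetical sample rows; in the question these come from mysqli_fetch_assoc().
$articles = [
    ['id_articles' => 1, 'article_title' => 'Apple unveils new iPhone',                'psource' => 'siteA'],
    ['id_articles' => 2, 'article_title' => 'Apple unveils new iPhone today',          'psource' => 'siteB'],
    ['id_articles' => 3, 'article_title' => 'Apple unveils the new iPhone',            'psource' => 'siteC'],
    ['id_articles' => 4, 'article_title' => 'Flood warnings issued for coastal towns', 'psource' => 'siteA'],
];

// Union-find "find": follow parent links up to the representative of a group.
function findRoot(array &$parent, int $i): int
{
    while ($parent[$i] !== $i) {
        $parent[$i] = $parent[$parent[$i]]; // path halving keeps the chains short
        $i = $parent[$i];
    }
    return $i;
}

// Returns a list of groups; each group is a list of rows with similar titles.
function groupSimilarTitles(array $rows, float $threshold = 55.5): array
{
    $rows = array_values($rows);
    $n = count($rows);
    if ($n === 0) {
        return [];
    }
    $parent = range(0, $n - 1); // every article starts in its own group

    for ($i = 0; $i < $n; $i++) {
        // Starting at $i + 1 visits each unordered pair once and never compares a row with itself.
        for ($j = $i + 1; $j < $n; $j++) {
            if ($rows[$i]['psource'] === $rows[$j]['psource']) {
                continue; // same source: not compared, as in the question
            }
            similar_text($rows[$i]['article_title'], $rows[$j]['article_title'], $percent);
            if ($percent >= $threshold) {
                $rootI = findRoot($parent, $i);
                $rootJ = findRoot($parent, $j);
                $parent[$rootI] = $rootJ; // merge the two groups
            }
        }
    }

    $groups = [];
    for ($i = 0; $i < $n; $i++) {
        $groups[findRoot($parent, $i)][] = $rows[$i];
    }
    // Keep only groups that actually contain more than one article.
    $groups = array_filter($groups, function ($group) {
        return count($group) > 1;
    });
    return array_values($groups);
}

$groups = groupSimilarTitles($articles);
foreach ($groups as $group) {
    echo implode(', ', array_column($group, 'article_title')), PHP_EOL;
}
```

Note the design choice here: grouping is transitive, so if A matches B and B matches C, all three land in one group even when A and C would not pass the threshold on their own or come from the same source. Whether that is what you want depends on how strict the groups should be.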


Source: https://stackoverflow.com/questions/30312819/similarity-algorithm-advice-using-two-dimensional-associative-array
