Delete files older than 10 days on HDFS

难免孤独 2020-12-08 22:15

Is there a way to delete files older than 10 days on HDFS?

In Linux I would use:

find /path/to/directory/ -type f -mtime +10 -name '*.txt' -execdir ...


        
5 Answers
  • 2020-12-08 22:25

    I attempted to implement the accepted solution above.

    Unfortunately, it only partially worked for me. I ran into three real-world problems.

    First, the hdfs client didn't have enough RAM to load and print all the files.

    Second, even when it could print all the files, awk could only handle ~8,300 records before breaking.

    Third, performance was abysmal: it deleted ~10 files per minute, which was no use when I was generating ~240 files per minute.

    So my final solution was this:

    tmpfile=$(mktemp)
    # List with a 2 GB client heap, keep the date/time/path columns, and collect
    # paths older than 35 days (35*24*60 minutes) into the temp file.
    # close(cmd) releases each date pipe; without it awk can exhaust file descriptors.
    HADOOP_CLIENT_OPTS="-Xmx2g" hdfs dfs -ls /path/to/directory | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" |
      awk 'BEGIN{ MIN=35*24*60; LAST=60*MIN; "date +%s" | getline NOW }
           { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd);
             DIFF=NOW-WHEN; if(DIFF > LAST){ print $3 } }' > "$tmpfile"
    hdfs dfs -rm -r $(cat $tmpfile)
    rm "$tmpfile"
    

    I don't know if there are additional limits on this solution, but it handles 50,000+ records in a timely fashion.

    EDIT: Interestingly, I ran into this issue again, and on the remove I had to batch my deletes, since the hdfs rm command couldn't take more than ~32,000 arguments; see the sketch below.
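
    One way to batch the deletes (a sketch, reusing the tmpfile of paths built above) is to let xargs split the list into fixed-size chunks:

    # Assumption: "$tmpfile" holds one HDFS path per line.
    # xargs passes at most 1000 paths to each hdfs dfs -rm invocation.
    xargs -n 1000 hdfs dfs -rm -r < "$tmpfile"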

  • 2020-12-08 22:29

    Yes, you can try HdfsFindTool:

    hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar \
      org.apache.solr.hadoop.HdfsFindTool \
      -find /path/to/dir -mtime +10 -name '^.*\.txt$' \
      | xargs hdfs dfs -rm -r -skipTrash
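
    Note that the stock hadoop fs -find that ships with Hadoop only supports the -name/-iname and -print expressions, not -mtime, which is why the Solr-bundled HdfsFindTool is used above. For a purely name-based cleanup the stock tool is enough:

    # Stock Hadoop find: filters by name only, no mtime test.
    hadoop fs -find /path/to/dir -name '*.txt' -print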
    
  • 2020-12-08 22:37
    # In hdfs dfs -ls output, $6 is the date, $7 the time, $8 the path.
    target_date=$(date -d "10 days ago" '+%Y-%m-%d %H:%M')
    hdfs dfs -ls -t /file/Path | awk -v cutoff="$target_date" '$8 != "" && $6" "$7 < cutoff {print $8}' | xargs -I% hdfs dfs -rm -r "%"
    
  • 2020-12-08 22:42

    Solution 1: Using multiple commands as answered by daemon12

    # 14400 minutes = 10 days; close(cmd) releases each date pipe.
    hdfs dfs -ls /file/Path | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" |
      awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW }
           { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd);
             DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) } }'
    

    Solution 2: Using a shell script

    today=$(date +'%s')
    # grep "^d" keeps directory entries only; $6 is the date, $8 the path.
    hdfs dfs -ls /file/Path/ | grep "^d" | while read -r line ; do
        dir_date=$(echo "${line}" | awk '{print $6}')
        difference=$(( ( today - $(date -d "${dir_date}" +%s) ) / ( 24*60*60 ) ))
        filePath=$(echo "${line}" | awk '{print $8}')

        if [ "${difference}" -gt 10 ]; then
            hdfs dfs -rm -r "${filePath}"
        fi
    done
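
    If HDFS trash is enabled on the cluster, -rm -r only moves the directories into the user's trash, so space is not reclaimed immediately; adding -skipTrash deletes them outright (use with care):

    hdfs dfs -rm -r -skipTrash "${filePath}"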
    
  • 2020-12-08 22:48

    How about this:

    hdfs dfs -ls /tmp | tr -s " " | cut -d' ' -f6-8 | grep "^[0-9]" |
      awk 'BEGIN{ MIN=14400; LAST=60*MIN; "date +%s" | getline NOW }
           { cmd="date -d'\''"$1" "$2"'\'' +%s"; cmd | getline WHEN; close(cmd);
             DIFF=NOW-WHEN; if(DIFF > LAST){ print "Deleting: "$3; system("hdfs dfs -rm -r "$3) } }'
    

    A detailed description is here.
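
    In short, the pipeline works stage by stage (14400 minutes = 10 days):

    # hdfs dfs -ls /tmp   list files (date = col 6, time = col 7, path = col 8)
    # tr -s " "           squeeze runs of spaces so cut can split on one space
    # cut -d' ' -f6-8     keep only the date, time, and path columns
    # grep "^[0-9]"       drop the "Found N items" header line
    # awk '...'           convert each date+time to epoch seconds and remove
    #                     every path older than NOW minus 14400 minutes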
