Finding directories older than N days in HDFS

慢半拍i · 2020-12-09 19:47

Can hadoop fs -ls be used to find all directories older than N days (from the current date)?

I am trying to write a cleanup routine to find and delete all directories that are older than N days.

5 Answers
  • 2020-12-09 20:37

    If you happen to be using the CDH distribution of Hadoop, it comes with a very useful HdfsFindTool command, which behaves like Linux's find command.

    If you're using the default parcel locations, here's how you'd do it:

    hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find PATH -mtime +N
    

    Where you'd replace PATH with the search path and N with the number of days.
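
    To make this a cleanup rather than just a search, the matched paths can be piped into the delete command. A minimal sketch, assuming your CDH build supports -type d and prints one matching path per line (the /tmp/hive path and the 30-day cutoff are just examples):

    hadoop jar /opt/cloudera/parcels/CDH/jars/search-mr-*-job.jar \
    org.apache.solr.hadoop.HdfsFindTool -find /tmp/hive -type d -mtime +30 \
    | xargs -n 20 hdfs dfs -rm -r -skipTrash

    Since removing a parent directory also removes its children, some of the matched subdirectories may already be gone by the time rm reaches them; those errors are noisy but harmless.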

  • 2020-12-09 20:37

    For real clusters it is not a good idea to use ls. If you have admin rights, it is more suitable to use the fsimage instead.

    I modified the ls-based script from another answer to illustrate the idea.

    First, fetch the fsimage:

    curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
    

    Then convert it to text (the same output format that lsr gives):

    hdfs oiv -i img.dump -o fsimage.txt
    

    Script:

    #!/bin/bash
    usage="Usage: dir_diff.sh [days]"
    
    if [ ! "$1" ]
    then
      echo "$usage"
      exit 1
    fi
    
    now=$(date +%s)
    # Fetch the latest fsimage and convert it to text, one line per path
    curl "http://localhost:50070/getimage?getimage=1&txid=latest" > img.dump
    hdfs oiv -i img.dump -o fsimage.txt
    # Keep only directories (permission string starts with "d");
    # field 6 is the modification date
    grep "^d" fsimage.txt | while read -r f; do
      dir_date=$(echo "$f" | awk '{print $6}')
      difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
      if [ "$difference" -gt "$1" ]; then
        echo "$f"
      fi
    done
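
    On Hadoop 2.x and later, the getimage URL above no longer exists and oiv ships different processors. A rough sketch of the equivalent fetch-and-convert steps, assuming admin privileges and a release whose oiv offers the Delimited processor (file names are illustrative):

    # Download the latest fsimage from the active NameNode
    hdfs dfsadmin -fetchImage /tmp
    # Convert it to tab-delimited text, one line per path, including the
    # permission string and modification time columns
    hdfs oiv -p Delimited -i /tmp/fsimage_* -o fsimage.txt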
    
  • 2020-12-09 20:38

    This compares field 6 of the ls output (the modification date, in YYYY-MM-DD format) lexically against a cutoff date:

    hdfs dfs -ls /hadoop/path/*.txt | awk '$6 < "2017-10-24"'
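
    If you don't want to hardcode the date, a small variant (assuming GNU date for the -d option) computes the cutoff from N days:

    cutoff=$(date -d "30 days ago" +%F)   # e.g. 2017-10-24
    hdfs dfs -ls /hadoop/path/*.txt | awk -v c="$cutoff" '$6 < c'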

  • 2020-12-09 20:44

    I didn't have the HdfsFindTool, nor the fsimage from curl, and I didn't much like the ls-to-grep-with-while-loop approach that shells out to date, awk, and hadoop for every line. But I appreciated the answers.

    I felt it could be done with just one ls, one awk, and maybe an xargs.

    I also added options to list the files or summarize them before choosing to delete them, as well as to pick a specific directory. Lastly, I leave directories alone and concern myself only with the files.

    #!/bin/bash
    USAGE="Usage: $0 [N days] (list|size|delete) [path, default /tmp/hive]"
    if [ ! "$1" ]; then
      echo "$USAGE"
      exit 1
    fi
    AGO=$(date --date "$1 days ago" "+%F %R")
    
    echo "# Will search for files older than $AGO"
    if [ ! "$2" ]; then
      echo "$USAGE"
      exit 1
    fi
    INPATH="${3:-/tmp/hive}"
    
    echo "# Will search under $INPATH"
    # Field 1 of the ls output is the permission string, so /^[^d]/ keeps
    # files only; fields 6 and 7 are the modification date and time.
    case $2 in
      list)
        hdfs dfs -ls -R "$INPATH" |\
          awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'"'
      ;;
      size)
        hdfs dfs -ls -R "$INPATH" |\
          awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" {
               sum += $5 ; cnt += 1} END {
               print cnt, "Files with total", sum, "Bytes"}'
      ;;
      delete)
        # Field 8 is the path; -skipTrash frees the space immediately
        hdfs dfs -ls -R "$INPATH" |\
          awk '$1 ~ /^[^d]/ && ($6 " " $7) < "'"$AGO"'" {print $8}' | \
          xargs hdfs dfs -rm -skipTrash
      ;;
      *)
        echo "$USAGE"
        exit 1
      ;;
    esac
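
    For example, a cautious workflow might check the candidates before deleting them (the script name hdfs_cleanup.sh is just an example):

    ./hdfs_cleanup.sh 30 size /tmp/hive     # count and total size of candidates
    ./hdfs_cleanup.sh 30 list /tmp/hive     # inspect the individual files
    ./hdfs_cleanup.sh 30 delete /tmp/hive   # remove them, skipping the trash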
    

    I hope others find this useful.

  • 2020-12-09 20:52

    This script lists all the directories that are older than [days]:

    #!/bin/bash
    usage="Usage: $0 [days]"
    
    if [ ! "$1" ]
    then
      echo "$usage"
      exit 1
    fi
    
    now=$(date +%s)
    # Lines starting with "d" are directories; field 6 is the modification
    # date.  (-lsr is deprecated; newer releases spell it "hdfs dfs -ls -R".)
    hadoop fs -lsr | grep "^d" | while read -r f; do
      dir_date=$(echo "$f" | awk '{print $6}')
      difference=$(( (now - $(date -d "$dir_date" +%s)) / (24 * 60 * 60) ))
      if [ "$difference" -gt "$1" ]; then
        echo "$f"
      fi
    done
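
    To turn that listing into an actual cleanup, you could extract the path column and hand it to the remove command. A sketch, assuming GNU xargs for -r (field 8 of the ls output is the path):

    ./dir_diff.sh 30 | awk '{print $8}' | xargs -r hdfs dfs -rm -r -skipTrash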
    