Too many attributes for ARFF format in Weka

Posted by 风流意气都作罢 on 2019-12-20 02:58:42

Question


I am working with a data set that has more than 10,000 dimensions. To use Weka I need to convert my text file to ARFF format, but because there are so many attributes, the file is still too large even with the sparse ARFF format. Is there a method, similar to the sparse format for data, that avoids writing so many attribute declarations in the header of the ARFF file?

For example:
@attribute A1 NUMERIC
@attribute A2 NUMERIC
...
...
@attribute A10000 NUMERIC


Answer 1:


I wrote an AWK script that converts lines like the following (in a TXT file) to an ARFF file.

example.txt source:

Att_0 | Att_1 | Att_2 | ... | Att_n
1 | 2 | 3 | ... | 999

My script (to_arff) is below. You can change the FS value depending on the separator used in the TXT file:

#!/usr/bin/awk -f
# Usage: ./<script>.awk data.txt > data.arff

BEGIN {
    # Input separator; the optional spaces around "|" are consumed too
    FS = " *[|] *";
    # WEKA separator
    separator = ",";
}

# Strip leading and trailing whitespace from every line (this re-splits the fields)
{ gsub(/^[ \t\r]+|[ \t\r]+$/, ""); }

# The first line holds the attribute names
NR == 1 {
    # WEKA headers
    # the relation's name is the source file's name, up to the first "."
    split(FILENAME, relation, ".");
    print "@RELATION " relation[1] "\n";
    # attributes are "numeric" by default
    # types available: numeric, <nominal> {n1, n2, ..., nN}, string and date [<date-format>]
    for (i = 1; i <= NF; i++) {
        print "@ATTRIBUTE " $i " NUMERIC";
    }
    print "\n@DATA";
}

# Every later line is one instance: join its fields with the WEKA separator
NR > 1 {
    s = $1;
    for (i = 2; i <= NF; i++)
        s = s separator $i;
    print s;
}

Output:

@RELATION example

@ATTRIBUTE Att_0 NUMERIC
@ATTRIBUTE Att_1 NUMERIC
@ATTRIBUTE Att_2 NUMERIC
...
@ATTRIBUTE Att_n NUMERIC

@DATA
1,2,3,...,999
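
As far as I know, ARFF has no shorthand for declaring a range of attributes, so the header still needs one @ATTRIBUTE line per column. What can shrink is the @DATA section, using the sparse row format the question mentions. Below is a minimal Python sketch (not part of the original answer) that generates the header in a loop and writes each row as {index value, ...}, listing only the non-zero cells with 0-based indices. The function name write_sparse_arff and the sample names A1 to A5 are made up for illustration.

```python
import io


def write_sparse_arff(relation, names, rows, out):
    """Write numeric rows as sparse ARFF to the text stream `out`.

    Hypothetical helper for illustration; all attributes are assumed NUMERIC.
    """
    out.write(f"@RELATION {relation}\n\n")
    # One declaration per attribute is still required in the header.
    for name in names:
        out.write(f"@ATTRIBUTE {name} NUMERIC\n")
    out.write("\n@DATA\n")
    for row in rows:
        # Sparse rows list only non-zero cells as "<0-based index> <value>".
        cells = [f"{i} {v}" for i, v in enumerate(row) if v != 0]
        out.write("{" + ", ".join(cells) + "}\n")


# Small demonstration with 5 attributes instead of 10,000.
names = [f"A{i}" for i in range(1, 6)]
buf = io.StringIO()
write_sparse_arff("example", names, [[0, 2, 0, 0, 9], [7, 0, 0, 0, 0]], buf)
arff_text = buf.getvalue()
print(arff_text)
```

Note that in sparse ARFF an omitted cell is read as 0, not as a missing value, so this only pays off when most cells are zero. To write straight to disk, pass an open file object instead of the StringIO buffer.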


Source: https://stackoverflow.com/questions/9234232/too-many-attributes-for-arff-format-in-weka
