I am working with an extremely large data set in a sparse matrix format.
The data has the following format: three tab-separated columns, where the string in the first column identifies a row, the string in the second column identifies an attribute, and the value in the third column is a weighted score.
church place 3
church institution 6
man place 86
man food 63
woman book 37
I would like to convert this to ARFF format using awk (if possible), so that given the input above I obtain the following output:
@relation 'filename'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string
@data
3,6,0,0,church
86,0,63,0,man
0,0,0,37,woman
I have seen an awk script here that produces a result quite similar to what I need. However, its input is a bit different. I tried adapting that code by changing FS = "|" to FS = "\t", but it does not produce the desired result. Does anyone have a suggestion for how I can adapt this awk code to convert my input into my desired output?
I've no idea what arff is (nor do I need to know to help you transpose your text to a different format) so let's start with this:
$ cat tst.awk
BEGIN { FS="\t" }
NR==1 { printf "@relation '%s'\n", FILENAME }
{
    row = $1
    attr = $2
    if (!seenRow[row]++) {
        rows[++numRows] = row
    }
    if (!seenAttr[attr]++) {
        printf "@attribute \"%s\" string\n", attr
        attrs[++numAttrs] = attr
    }
    score[row,attr] = $3
}
END {
    print "\n\n@data"
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        row = rows[rowNr]
        for (attrNr=1; attrNr<=numAttrs; attrNr++) {
            attr = attrs[attrNr]
            printf "%d,", score[row,attr]
        }
        print row
    }
}
$
$ cat file
church place 3
church institution 6
man place 86
man food 63
woman book 37
$
$ awk -f tst.awk file
@relation 'file'
@attribute "place" string
@attribute "institution" string
@attribute "food" string
@attribute "book" string
@data
3,6,0,0,church
86,0,63,0,man
0,0,0,37,woman
Now, tell us what's wrong with that and we can go from there.
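Since you mention that setting FS to "\t" did not help, one thing worth ruling out first is that your real file is separated by spaces rather than tabs. The sketch below is only a quick sanity check, not part of the conversion: check_tabs is a helper name I made up, and the two-line sample is invented so that one line uses tabs and the other uses spaces.

```shell
# check_tabs FILE: list lines that do not split into exactly 3 tab-separated fields.
# (check_tabs is a hypothetical helper name, not part of tst.awk.)
check_tabs() {
    awk -F'\t' 'NF != 3 { printf "line %d: %d field(s)\n", NR, NF }' "$1"
}

# Made-up sample: line 1 uses real tabs, line 2 uses spaces.
sample=$(mktemp)
printf 'church\tplace\t3\nman place 86\n' > "$sample"

check_tabs "$sample"    # prints: line 2: 1 field(s)
rm -f "$sample"
```

If this prints nothing for your file, every line has three tab-separated fields and the separator is not the problem. If it lists lines, tst.awk will treat each of those whole lines as a row name.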
Source: https://stackoverflow.com/questions/19046438/converting-sparse-matrix-to-arff-using-awk