I have a text file that contains one tweet per line, and the tweets need to be converted into a format suitable for machine learning. I'm using Python and basic Unix text manipulation (regex) to achieve this.
In awk:
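awk '
# First file (lookup): remember each word as an array key.
NR==FNR {
    a[$1]
    next
}
# Second file (tweets): print 1 for each word found in lookup, 0 otherwise.
{
    gsub(/!/, "", $0)   # Strip `!` so that cool! matches cool. Other rules can be added.
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets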
(More rules can be added to the gsub line to handle special cases.)
[jaypal:~/Temp] cat lookup
:)
cool
happy
fun
[jaypal:~/Temp] cat tweets
this has been a fun day :)
i find python cool! it makes me happy
[jaypal:~/Temp] awk '
NR==FNR {
    a[$1]
    next
}
{
    gsub(/!/, "", $0)
    for (i=1; i<=NF; i++) {
        if ($i in a) {
            printf "1 "
        }
        else {
            printf "0 "
        }
    }
    print ""
}' lookup tweets
0 0 0 0 1 0 1
0 0 0 1 0 0 0 1
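
Since the question mentions Python, here is a rough Python sketch of the same idea, in case that fits the rest of your pipeline better. The function names `load_lookup` and `encode` are made up for illustration, and the lookup words and tweets are inlined instead of being read from the `lookup` and `tweets` files. Like the awk version, it only strips `!` and splits on whitespace, so treat it as a starting point rather than a complete tokenizer.

```python
def load_lookup(lines):
    """Return the set of lookup words (first token of each non-empty line)."""
    return {line.split()[0] for line in lines if line.strip()}


def encode(tweet, lookup):
    """Map each whitespace-separated token to 1 if it is in lookup, else 0."""
    # Drop "!" first, mirroring the gsub() call in the awk script.
    tokens = tweet.replace("!", "").split()
    return [1 if token in lookup else 0 for token in tokens]


# Sample data copied from the test run above (assumption: inlined, not read from disk).
lookup = load_lookup([":)", "cool", "happy", "fun"])
tweets = [
    "this has been a fun day :)",
    "i find python cool! it makes me happy",
]

for tweet in tweets:
    print(" ".join(str(bit) for bit in encode(tweet, lookup)))
```

To work from the real files, you could pass `open("lookup")` to `load_lookup` and loop over `open("tweets")` instead of the inlined lists.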