I don't want to use CountVectorizer, but I'd like to reproduce its way of tokenizing. I know it removes some special characters and converts the text to lowercase. I tried this regex r
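For reference, here is a minimal sketch of the behavior being reproduced, assuming scikit-learn's documented defaults for CountVectorizer (`lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`). The function name `tokenize` is hypothetical, and the sketch ignores other options such as `strip_accents`, `stop_words`, and `ngram_range`.

```python
import re

# Assumption: mirrors CountVectorizer's documented default token pattern,
# which keeps runs of two or more word characters and treats punctuation
# as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text, then extract tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Hello, World! I'm a U.S. citizen_42."))
# → ['hello', 'world', 'citizen_42']
```

Note that single-character tokens ("I", "m", "a", "U", "S") are dropped, and the underscore counts as a word character, so `citizen_42` survives as one token.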