Efficient string matching in Apache Spark
Question: Using an OCR tool, I extracted text from screenshots (about 1-5 sentences each). However, when manually verifying the extracted text, I noticed several errors that occur from time to time. Given the text "Hello there 😊! I really like Spark ❤️!", I noticed that: 1) Characters like "I", "!", and "l" get replaced by "|". 2) Emojis are not extracted correctly: they are either replaced by other characters or left out. 3) Spaces are occasionally removed. As a result, I might end up with a