Extracting Arabic text in C# using iTextSharp


I have this code and I'm using it to extract the text of a PDF. It works well for a PDF in English, but when I try to extract text in Arabic it shows me something like

1 answer
  • 2020-12-18 14:43

    I had to change the extraction strategy like this:

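    // reader is an open PdfReader; i is the 1-based page number.
    var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
    // Reverse each run of Arabic characters; everything else is left unchanged.
    var te = Convert(t);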
    

    I also use this function to reverse the Arabic words and keep the English text unchanged:

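    // Requires: using System; using System.Text;

    // Reverses every run of consecutive Arabic characters and copies
    // all other characters (Latin text, digits, spaces) unchanged.
    private string Convert(string source)
    {
        var arabicWord = new StringBuilder();
        var sbDestination = new StringBuilder();

        foreach (var ch in source)
        {
            if (IsArabic(ch))
            {
                arabicWord.Append(ch);
            }
            else
            {
                // A non-Arabic character ends the current run: flush it reversed.
                if (arabicWord.Length > 0)
                {
                    sbDestination.Append(Reverse(arabicWord.ToString()));
                    arabicWord.Clear();
                }

                sbDestination.Append(ch);
            }
        }

        // Flush the last run if the text ended with an Arabic word.
        if (arabicWord.Length > 0)
            sbDestination.Append(Reverse(arabicWord.ToString()));

        return sbDestination.ToString();
    }

    // True if the character falls in one of the Arabic Unicode ranges.
    private bool IsArabic(char character)
    {
        return (character >= 0x0600 && character <= 0x06FF)    // Arabic
            || (character >= 0x0750 && character <= 0x077F)    // Arabic Supplement
            || (character >= 0xFB50 && character <= 0xFDFF)    // Arabic Presentation Forms-A
            || (character >= 0xFE70 && character <= 0xFEFC);   // Arabic Presentation Forms-B (up to the last lam-alef ligature)
    }

    // Reverses the characters of a string.
    private string Reverse(string source)
    {
        char[] chars = source.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }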
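
    For comparison, the same run-reversal idea can be sketched with a regular expression instead of a character loop. This is only an illustration, not part of iTextSharp: the class name `ArabicRunReverser` and the method `ReverseArabicRuns` are hypothetical, and the character ranges in the pattern (U+0600–06FF, U+0750–077F, U+FB50–FDFF, U+FE70–FEFC) are an assumption about which Arabic code points you want covered, so adjust them to match `IsArabic`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArabicRunReverser
{
    // Hypothetical helper, not an iTextSharp API.
    // Matches one or more consecutive characters from the assumed Arabic ranges.
    private static readonly Regex ArabicRun =
        new Regex(@"[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFC]+");

    // Reverses each Arabic run in place; all other characters stay where they are.
    public static string ReverseArabicRuns(string source)
    {
        return ArabicRun.Replace(source, match =>
        {
            char[] chars = match.Value.ToCharArray();
            Array.Reverse(chars);
            return new string(chars);
        });
    }

    public static void Main()
    {
        // The input starts with a five-letter Arabic run followed by Latin text.
        string fixedText = ReverseArabicRuns("\u0627\u0628\u062D\u0631\u0645 PDF");

        // The Arabic letters are reversed; the space and "PDF" are untouched.
        Console.WriteLine(fixedText == "\u0645\u0631\u062D\u0628\u0627 PDF"); // True
    }
}
```

    Like `Convert`, this reverses each word on its own and does not change the order of the words on a line, since a space ends the Arabic run.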
    