I have some PDF files. Using PDFBox, I converted them to text and stored the results in text files. Now, from the text files, I want to remove
You can use iText for this kind of task.
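// iText imports (iText 5.x package names)
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
// Needed by the second example, which catches IOException
import java.io.IOException;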
For example:
try {
    // INPUTFILE is assumed to be a String holding the path to the PDF.
    PdfReader reader = new PdfReader(INPUTFILE);
    int n = reader.getNumberOfPages();
    // Extract the content of one page; pages are numbered from 1 to n.
    String str = PdfTextExtractor.getTextFromPage(reader, 2);
    System.out.println(str);
    reader.close();
} catch (Exception e) {
    System.out.println(e);
}
Another example:
try {
    PdfReader reader = new PdfReader("c:/temp/test.pdf");
    System.out.println("This PDF has " + reader.getNumberOfPages() + " pages.");
    // Extract the text of page 2 (page numbers start at 1).
    String page = PdfTextExtractor.getTextFromPage(reader, 2);
    System.out.println("Page Content:\n\n" + page + "\n\n");
    System.out.println("Is this document tampered: " + reader.isTampered());
    System.out.println("Is this document encrypted: " + reader.isEncrypted());
    reader.close();
} catch (IOException e) {
    e.printStackTrace();
}
The examples above only extract the text. To remove hyperlinks, bullets, headings, and numbers, you will need to post-process the extracted text yourself.
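One way to do that post-processing is with plain regular expressions on the extracted text, line by line. The following is only a rough sketch, not something iText or PDFBox provides: the class name TextCleaner, the patterns, and especially the heading rule (a short line whose letters are all uppercase) are assumptions made for illustration, and you will probably need to tune them for your documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for cleaning already-extracted text.
public class TextCleaner {
    // URLs that start with http://, https://, or www.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");
    // A leading bullet glyph, or list numbering such as "1." or "2)".
    private static final Pattern BULLET =
            Pattern.compile("^\\s*(?:[\\u2022\\u25CF\\u25AA\\u2023*-]|\\d+[.)])\\s+");
    // Standalone numbers, including forms like 3.14 or 1,000.
    private static final Pattern NUMBER =
            Pattern.compile("\\b\\d+(?:[.,]\\d+)*\\b");

    // Assumed heuristic: a heading is a short line with no lowercase letters.
    static boolean looksLikeHeading(String line) {
        String t = line.trim();
        if (t.isEmpty() || t.length() > 60) {
            return false;
        }
        boolean hasLetter = false;
        for (char c : t.toCharArray()) {
            if (Character.isLetter(c)) {
                hasLetter = true;
                if (!Character.isUpperCase(c)) {
                    return false;
                }
            }
        }
        return hasLetter;
    }

    public static String clean(String text) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String s = URL.matcher(line).replaceAll("");
            s = BULLET.matcher(s).replaceFirst("");
            if (looksLikeHeading(s)) {
                continue; // drop the whole heading line
            }
            s = NUMBER.matcher(s).replaceAll("");
            // Collapse the gaps left behind by the removals.
            s = s.replaceAll("\\s{2,}", " ").trim();
            if (!s.isEmpty()) {
                kept.add(s);
            }
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String sample = "INTRODUCTION\n"
                + "\u2022 See http://example.com for details\n"
                + "1. There are 42 items here\n"
                + "Plain text line";
        System.out.println(clean(sample));
    }
}
```

You would call TextCleaner.clean(...) on the contents of each text file (or on the string returned by getTextFromPage). Be aware that the heading rule will also drop short all-caps lines that are not headings, and the number rule removes every standalone number, so check the output on a few real files first.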