do not remove extra lines while preprocessing the crawled text

南楼画角 提交于 2021-02-11 14:47:01

问题


while crawling with nutch, it is removing all the extra lines from the crawled text. I want to keep the text and whatever the new lines are present on the website. for example: on crawling this page https://www.modernfamilydental.net/, the expected output is :

\n\n\n\nSan Francisco, CA Dentist\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWould you like to switch to the accessible version of this site?\nGo to accessible site\n\nClose modal window\n\n\n\n\n\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\n\n\n\n\n\n\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n(415) 752-5244\n\n\n \n\n\n\n\n\n\n\n\n\nMenu\n\n\n\n\nHome\n\n\nServices\n \nLatest Equipment\n\n\nInsurance\n\n\nTeeth Whitening\n\n\nCrowns & Bridges\n\n\nSmile Makeovers\n\n\nResin Composite Bonding\n\n\nVeneers\n\n\nImplant Retained Dentures\n\n\nNight Guards\n\n\nMetal-Free Restoration\n\n\nInvisalign\n\n\nDental Examination

but the output from nutch is :

San Francisco, CA Dentist\nWould you like to switch to the accessible version of this site?\nGo to accessible site\nClose modal window\nDon\'t need the accessible version of this site?\nHide the accessibility button\n\nClose modal window\nAccessibility View\n\n\nClose toolbar\n\n\n\n\nJavascript must be enabled for the correct page display\nModern Family Dental Hao Tran, DMD\nDentist located in Laurel Heights, San Francisco, CA\n(415) 752-5244\nMenu\nHome\nServices\nLatest Equipment\nInsurance\nTeeth Whitening\nCrowns & Bridges\nSmile Makeovers\n\n\nResin Composite Bonding\nVeneers\nImplant Retained Dentures\nNight Guards\nMetal-Free Restoration\nInvisalign\nDental Examination

What configuration changes do I need to make in the config file?

p.s. pardon me if my question is silly, since I am new to nutch.

来源:https://stackoverflow.com/questions/63700213/do-not-remove-extra-lines-while-preprocessing-the-crawled-text

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!