Handle o:p tag in BeautifulSoup

折月煮酒 submitted on 2021-01-29 09:14:46

Question


I was extracting some disease information from http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html

but the data is contained inside a tag that I don't know how to handle.

One way I found is to use the find_all function, but is there a way to reach it with something like tr.td.span.[o:p or something]?


<td width="584" nowrap="" valign="top" style="width:438.0pt;padding:0in 5.4pt 0in 5.4pt;
  height:12.75pt">
  <p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;">UMLS:C0008031_pain
  chest
<o:p>&nbsp;</o:p>
</span>
</p>
  </td>


Answer 1:


import pandas as pd

df = pd.read_html(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")[0]

df.to_csv("out.csv", index=False, header=False)

That covers the case where you want the full table, written to out.csv.

If you only need the values from the third column (skipping the header row), use:

import pandas as pd

df = pd.read_html(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")[0]

print(df[2][1:].values.tolist())

With bs4, use:

import requests
from bs4 import BeautifulSoup

r = requests.get(
    "http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html")


soup = BeautifulSoup(r.text, 'html.parser')

# find_all is the current method name; findAll is a deprecated alias.
for item in soup.find_all("p", {'class': 'MsoNormal'}):
    text = item.get_text(strip=True)
    if text.startswith("UMLS"):
        print(text)
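
On the dot-notation part of the question: a name like o:p cannot be written as a Python attribute, so tr.td.span.o:p is a syntax error, but the tag can still be looked up by passing its name as a string. The following is a minimal sketch, assuming the html.parser backend (other parsers may treat the colon as a namespace prefix and behave differently). The markup is a trimmed copy of the snippet from the question wrapped in a hypothetical table row, not the live page.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question; the <table><tr> wrapper is an
# assumption added so that tr.td.span navigation has something to walk.
html = """
<table><tr>
<td width="584" valign="top">
  <p class="MsoNormal"><span style="font-size:10.0pt">UMLS:C0008031_pain
  chest
<o:p>&nbsp;</o:p>
</span>
</p>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Dot notation works for ordinary tag names.
span = soup.tr.td.span

# "o:p" is not a valid attribute name, so look it up by string instead.
op_tag = span.find("o:p")

# Collapse line breaks, indentation, and the non-breaking space
# into single spaces to get a clean one-line value.
symptom = " ".join(span.get_text().split())

print(op_tag.name)
print(symptom)
```

In this page the o:p element only holds a non-breaking space, so the useful text lives in the enclosing span, which is why the sketch reads the span's text rather than the o:p tag's.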


Source: https://stackoverflow.com/questions/59658390/handle-op-tag-in-beautifulsoup
