How to extract data from tags which are children of another tag with Scrapy and Python?

Submitted by 浪尽此生 on 2021-02-05 12:23:01

Question


This is the HTML from which I want to extract data, but whenever I run my spider I get seemingly random values. Can anyone help me out with this?

I want to extract the following: Mumbai, Maharashtra, 1958, Government, UGC, and Indian Institute of Technology, Bombay.

HTML:

<div class="instituteInfo">
    <ul class="clg-info">
        <li>
            <a href="link here" target="_blank">Mumbai</a>,
            <a href="link here" target="_blank">Maharashtra</a>
        </li>
        <li>Estd : <span>1958</span></li>
        <li>Ownership : <span>Government</span></li>
        <li>Approved By : <span>UGC</span></li>
        <li>Affiliated To : <a href="link">Indian Institute of Technology, Bombay</a></li>
    </ul>
</div>

This is the code of the parse function in Scrapy:

def parse(self, response):
    for students in response.css('div.topBlockInstituteInfoBottom'):
        yield {
            'name': students.css('h1.titleNameCol::text').extract(),
            'city': students.css('ul.clg-info li a::text')[0].extract(),
            'state': students.css('ul.clg-info li a::text')[1].extract(),
            'estd': students.css('ul.clg-info li span::text')[0].extract(),
            'ownership': students.css('ul.clg-info li span::text')[3].extract(),
            'approved by': students.css('ul.clg-info li span::text')[4].extract(),
            'affiliation': students.css('ul.clg-info li a::text')[5].extract(),
        }

Answer 1:


You haven't provided much detail about the issue, such as the output you're getting or the website in question, but I'm willing to bet that the problem is the index positions ([0], [1], [3], ...) you use to pick out your items.

Do the values you describe as random or incorrect show up when you run the spider, that is, when you are crawling the site's entire directory of institutes, meaning the HTML snippet is just one of the many pages you are scraping? If so...

Then your issue is almost certainly an index issue. You are using fixed positions to select one tag from among others in the same node, but what happens if the pages on the site are not all structured the same? The index stays the same but the position of your content does not, so you end up with either no value or the wrong one.

In the for loop:
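To make the point concrete, here is a minimal sketch in plain Python, with no Scrapy involved. The two lists stand in for the span texts that two profile pages might return; the second page and its values are made up for illustration, as are the helper names by_position and by_label.

```python
# Hypothetical <span> texts scraped from two profile pages.
page_a_spans = ["1958", "Government", "UGC"]
page_b_spans = ["Private", "AICTE"]  # made-up page that has no "Estd" entry


def by_position(spans):
    """Pick fields by fixed index, the way the spider in the question does."""
    return {
        "estd": spans[0] if len(spans) > 0 else None,
        "ownership": spans[1] if len(spans) > 1 else None,
    }


def by_label(pairs):
    """Pick fields by the label that sits next to each value."""
    lookup = dict(pairs)
    return {"estd": lookup.get("Estd"), "ownership": lookup.get("Ownership")}


# Same pages, but with each value paired with its label.
page_b_pairs = [("Ownership", "Private"), ("Approved By", "AICTE")]

print(by_position(page_a_spans))  # fields line up on page A
print(by_position(page_b_spans))  # on page B every field is shifted by one
print(by_label(page_b_pairs))     # label lookup stays correct, estd is None
```

On the second page the positional version silently stores the ownership value under 'estd', which is exactly the kind of "random" output described in the question.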

def parse(self, response):
    for students in response.css('div.topBlockInstituteInfoBottom'):

I cannot verify whether the CSS selector you are using is correct, because I don't know the site in question. But going by the HTML snippet you showed, the selector in your loop is wrong:

<div class="instituteInfo">
    <ul class="c...

and again...

for students in response.css('div.topBlockInstituteInfoBottom'):

See where I'm going with this? You should be using .instituteInfo as the CSS selector in your loop.


Updated answer


As I suspected. I checked out the site and, of course, not all of the institution profile pages have the same information, so they are structured slightly differently, and being one node off is enough to make garbage of your data.

A very subtle difference is all it takes.

It's because of this exact thing that using XPath is downright crucial. If you're not familiar with it, essentially you can use a special syntax to pinpoint your data further, without any ifs, ands, or buts (if you're doing it right, that is).

Think of it as regular expressions for structured data such as HTML or XML. You can use functions such as contains(). So if I want to state WHICH link in a path I mean (that's the "a" tag, for its href, of course), say the one that contains the word "next", then I would use:

//html/body/..../a[contains(., "next")]/@href

or you can just go straight to using:

//a[contains(., "next")]/@href

But the more specific you are in stating the path to your content, or the node, the less likely you are to run into any confusion in your parse.

In your case, do exactly this. I'm not going to take away all the fun of learning something new, though, so here is what one of them should look like:

response.xpath("//ul[@class='clg-info']/li[contains(., 'Ownership')]/span/text()").extract()

You don't need to be in the Scrapy shell to check what it outputs. If you open any browser's dev tools and do a Ctrl+F search in the element inspector, it should accept XPath. The output I got was "Private", because I stated that, at the path/node level of 'clg-info', I am looking for the li that contains the plain-text word Ownership (it doesn't have to be the full word either), and then moved over to the span inside it. You just have to look at the HTML while doing it and it's obvious.

PRO TIP: remember the earlier example about finding a link (an a tag) that contains the word "next"? Can you think of how that could be useful? =) Navigating through web pages can be such a pain, but know your XPath and regex and there is no content you can't parse. Once you get good, you can even start to really understand how to de-obfuscate JS in web pages BY HAND, none of that JSNice stuff, lol.



Source: https://stackoverflow.com/questions/44692759/how-to-extract-data-from-tags-which-are-child-of-another-tag-through-scrapy-and
