Parsing a web page in Python using Beautiful Soup


名媛妹妹 2021-02-15 10:26

I am having trouble extracting data from a website. The page source is here:

view-source:http://release24.pl/wpis/23714/%22La+mer+a+boire%22+%282011


      
      
        
2 answers

刺人心 (original poster) 2021-02-15 10:56
              

            
            
                        
The secret of using BeautifulSoup is to find the hidden patterns of your HTML document. For example, your loop

for ul in soup.findAll('p') :
    print(ul)


is a step in the right direction, but it returns every paragraph, not only the ones you are looking for. The paragraphs you want, however, all have the class i. Inside each of them are two spans, one with the class i and another with the class vi. Conveniently, those spans contain the data you are looking for:


<p class="i">
    <span class="i">Tytuł............................................</span>
    <span class="vi">: La mer à boire</span>
</p>



So, first get all the paragraphs with the given class:

>>> ps = soup.findAll('p', {'class': 'i'})
>>> ps
[<p class="i"><span class="i">Tytuł...  ...</span></p>]
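
In the current bs4 package the same lookup is usually spelled find_all, and the class filter can also be passed as the class_ keyword. A minimal sketch, assuming bs4 is installed; the HTML string below is a made-up stand-in that mimics the structure described above, not a copy of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for illustration.
SAMPLE_HTML = """
<p>An unrelated paragraph</p>
<p class="i"><span class="i">Produkcja.....</span><span class="vi">: Francja</span></p>
<p class="i"><span class="i">Gatunek.......</span><span class="vi">: Dramat</span></p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all is the bs4 spelling of findAll; class_ avoids the reserved word "class".
all_paragraphs = soup.find_all("p")
info_paragraphs = soup.find_all("p", class_="i")

# Only the paragraphs carrying class "i" survive the filter.
print(len(all_paragraphs), len(info_paragraphs))
```

The dictionary form used in the session, {'class': 'i'}, selects the same elements; which spelling to use is a matter of taste.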


Now, using list comprehensions, we can generate a list of pairs, where each pair contains the first and the second span from the paragraph:

>>> spans = [(p.find('span', {'class': 'i'}), p.find('span', {'class': 'vi'})) for p in ps]
>>> spans
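[(<span class="i">Tyt... ...</span>, <span class="vi">: La mer à boire</span>),
 (<span class="i">Ocena... ...</span>, <span class="vi">: IMDB - 6.3/10 (24)</span>),
 (<span class="i">Produkcja.. ...</span>, <span class="vi">: Francja</span>),
 # and so on
]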


Now that we have the spans, we can get the texts from them:

>>> texts = [(span_i.text, span_vi.text) for span_i, span_vi in spans]
>>> texts
[(u'Tytu\u0142............................................', u': La mer \xe0 boire'),
 (u'Ocena.............................................', u': IMDB - 6.3/10 (24)'),
 (u'Produkcja.........................................', u': Francja'), 
  # and so on
]


These texts still need some cleaning, but that is easy to fix. To remove the trailing dots from the first element, we can use rstrip():

>>> u'Produkcja.........................................'.rstrip('.')
u'Produkcja'


The leading ': ' can be removed with lstrip():

>>> u': Francja'.lstrip(': ')
u'Francja'
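
One detail worth knowing: lstrip() and rstrip() treat their argument as a set of characters to strip, not as a literal prefix or suffix. The sketch below illustrates the difference on made-up values; the helper name strip_prefix is hypothetical and only there for comparison:

```python
def strip_prefix(text, prefix):
    """Remove prefix once from the start of text, if it is present."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# lstrip(': ') removes every leading ':' and ' ' character, in any order.
print(repr(': Francja'.lstrip(': ')))       # 'Francja'
print(repr(':: : Francja'.lstrip(': ')))    # 'Francja'
print(repr(':'.lstrip(': ')))               # ''

# strip_prefix removes the exact string ': ' at most once.
print(repr(strip_prefix(': Francja', ': ')))  # 'Francja'
print(repr(strip_prefix(':', ': ')))          # ':'
```

For this page either approach gives the same result on ordinary values; they only differ on values that are just a colon or that begin with repeated colons and spaces.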


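To apply both fixes to every pair, we just need another list comprehension. Note that it uses replace(': ', '') rather than lstrip() for the second element, so a value with no space after the colon keeps its colon, as the last two entries show: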

>>> result = [(text_i.rstrip('.'), text_vi.replace(': ', '')) for text_i, text_vi in texts]
>>> result
[(u'Tytu\u0142', u'La mer \xe0 boire'),
 (u'Ocena', u'IMDB - 6.3/10 (24)'),
 (u'Produkcja', u'Francja'),
 (u'Gatunek', u'Dramat'),
 (u'Czas trwania', u'98 min.'),
 (u'Premiera', u'22.02.2012 - \u015awiat'),
 (u'Re\u017cyseria', u'Jacques Maillot'),
 (u'Scenariusz', u'Pierre Chosson, Jacques Maillot'),
 (u'Aktorzy', u'Daniel Auteuil, Maud Wyler, Yann Trégouët, Alain Beigel'),
 (u'Wi\u0119cej na', u':'),
 (u'Trailer', u':Obejrzyj zwiastun')]
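
The steps above can be collected into one function. This is only a sketch under stated assumptions: bs4 is installed, the function name extract_fields and the sample HTML are invented for illustration, and paragraphs lacking either span are skipped, since find() returns None for a missing element and .text on None would raise AttributeError:

```python
from bs4 import BeautifulSoup

def extract_fields(html):
    """Return cleaned (label, value) pairs from <p class="i"> paragraphs."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for p in soup.find_all("p", class_="i"):
        label = p.find("span", class_="i")
        value = p.find("span", class_="vi")
        if label is None or value is None:
            continue  # skip paragraphs that do not follow the pattern
        # Trailing dots pad the label; a leading ': ' precedes the value.
        pairs.append((label.text.rstrip("."), value.text.lstrip(": ")))
    return pairs

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = """
<p class="i"><span class="i">Produkcja.....</span><span class="vi">: Francja</span></p>
<p class="i"><span class="i">Gatunek.......</span><span class="vi">: Dramat</span></p>
<p class="i">no spans here</p>
"""

fields = extract_fields(SAMPLE_HTML)
print(fields)
```

This sketch uses lstrip(': ') for the value, so a value consisting only of ':' would become an empty string, unlike the replace(': ', '') call in the session above. Passing the pairs to dict() gives a lookup table keyed by label.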


And that is it. I hope this step-by-step example makes BeautifulSoup clearer for you.
    
             
                                                        
            
            
              
                