Check if string in list, depending on last two characters

*爱你&永不变心* submitted on 2019-12-11 01:18:58

Question


Set-up

I am using Scrapy to scrape housing ads. For each ad I retrieve a postal code, which consists of four digits followed by two letters, e.g. 1053ZM.

I have an Excel sheet linking districts to postal codes in the following way,

district    postcode_min    postcode_max
   A           1011AB           1011BD
   A           1011BG           1011CE
   A           1011CH           1011CZ

So the first data row (the second row of the sheet, counting the header) states that the postcodes 1011AB, 1011AC, ..., 1011AZ, 1011BA, ..., 1011BD all belong to district A. Both ends of each range are included.

The actual list contains 1214 rows.


Problem

I'd like to match each ad with its respective district, using its postal code and the list.

I am not sure what the best approach is, or how to implement it.

I've come up with two different approaches:

  1. Create all postcodes between postcode_min and postcode_max, assign all postcodes and their respective districts to a dictionary to subsequently match using a loop.

I.e. create,

 d = {'A': ['1011AB','1011AC',...,'1011BD',
            '1011BG','1011BH',...,'1011CE',
            '1011CH','1011CI',...,'1011CZ'],
      'B': [...],           
      }

and then,

district = 'unknown' # default if no district matches
found = False
for distr in d.keys(): # loop over districts
     for code in d[distr]: # loop over district's postal codes
         if postal_code == code: # assign if ad's postal code equals code
             district = distr
             found = True
             break
     if found:
         break

  2. Make Python understand there is a range between the postcode_min and the postcode_max, assign ranges and their respective districts to a dictionary, and match using a loop.

I.e. something like,

d = {'A': [range(1011AB,1011BD), range(1011BG,1011CE),range(1011CH,1011CZ)],
     'B': [...]
    }

and then,

found = False
for distr in d.keys(): # loop over districts
     for range in d[distr]: # loop over district's ranges
         if postal_code in range: # assign if ad's postal code in range                 
             district = distr
             found = True
             break
         else:
             district = 'unknown'
     if found:
         break

Issues

For approach 1:

  • How do I create all the postal codes and assign them to a dictionary?

For approach 2:

I used range() for explanatory purposes only; I know range() does not work on strings like this.

  • What do I need to effectively have a range() as in the example above?
  • How do I correctly loop over these ranges?

I think my preference lies with approach 2, but I am happy to work with either one. Or with another solution if you have one.
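
For approach 1, one possible way to build the dictionary is sketched below. It assumes that both ends of a range share the same four digits and differ only in the letters, which holds for the sample rows but is not confirmed for the full list. The names `expand_range`, `rows`, `postcode_to_district` and `district_for` are hypothetical. Mapping each postcode straight to its district, rather than each district to a list of postcodes, turns the match into a single dictionary lookup.

```python
from itertools import product
from string import ascii_uppercase


def expand_range(postcode_min, postcode_max):
    """Return every postcode from postcode_min to postcode_max, inclusive.

    Assumes both codes share the same four digits and differ only in letters.
    """
    digits = postcode_min[:4]
    if postcode_max[:4] != digits:
        raise ValueError("range spans more than one four-digit prefix")
    low, high = postcode_min[4:], postcode_max[4:]
    # product() yields the letter pairs in alphabetical order: AA, AB, ..., ZZ
    suffixes = ("".join(pair) for pair in product(ascii_uppercase, repeat=2))
    return [digits + s for s in suffixes if low <= s <= high]


# Hypothetical rows, as they might be read from the Excel sheet
rows = [
    ("A", "1011AB", "1011BD"),
    ("A", "1011BG", "1011CE"),
    ("A", "1011CH", "1011CZ"),
]

# Map each postcode directly to its district
postcode_to_district = {
    code: district
    for district, low, high in rows
    for code in expand_range(low, high)
}


def district_for(postal_code):
    """Look up the district, falling back to 'unknown'."""
    return postcode_to_district.get(postal_code, "unknown")
```

With 1214 rows this dictionary stays small enough to hold in memory, at the cost of generating every postcode up front.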


Answer 1:


You can collect the min and max values from the Excel sheet into one flat list per district, stored as consecutive (min, max) pairs, like this:

d = {'A': ['1011AB', '1011BD', '1011BG', '1011CE',  '1011CH', '1011CZ'],
     'B': ['1061WB', '1061WB'],
     }

def is_in_postcode_range(current_postcode, low, high):
    # Plain string comparison works because every postcode has the same format
    return low <= current_postcode <= high

def get_district_by_post_code(postcode):
    for district, codes in d.items():
        # codes holds consecutive (min, max) pairs; check each pair
        if any(is_in_postcode_range(postcode, codes[i], codes[i+1]) for i in range(0, len(codes), 2)):
            return district
    # No district matched; returning None inside the loop would skip later districts
    return None

Usage:

print(get_district_by_post_code('1011AC'))  # A
print(get_district_by_post_code('1011BE'))  # None
print(get_district_by_post_code('1061WB'))  # B
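
The comparison in this answer works because Python compares strings character by character, and every postcode has the same fixed layout (four digits, then two uppercase letters). As a minimal sketch of the same idea, the ranges can also be kept as one tuple per spreadsheet row, which avoids the paired-index arithmetic. The names `ranges` and `find_district` are hypothetical, and the rows are sample data only.

```python
# Hypothetical rows, one tuple per spreadsheet row:
# (district, postcode_min, postcode_max)
ranges = [
    ("A", "1011AB", "1011BD"),
    ("A", "1011BG", "1011CE"),
    ("A", "1011CH", "1011CZ"),
    ("B", "1061WB", "1061WB"),
]


def find_district(postcode, table=ranges):
    """Return the district whose inclusive range contains postcode, or None."""
    for district, low, high in table:
        # Lexicographic comparison is valid here because all codes are
        # four digits followed by two uppercase letters.
        if low <= postcode <= high:
            return district
    return None
```

This is a linear scan over all rows, which should be acceptable for a table of about 1200 ranges.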



Answer 2:


You can use intervaltree to achieve much better lookup speed, and interpret the postal code as a number in base 36 (10 digits and 26 letters).

from intervaltree import IntervalTree
t = IntervalTree()
for district,postcode_min,postcode_max in your_district_table:
    # We read the postcode as a number in base 36
    postcode_min = int(postcode_min, 36)
    postcode_max = int(postcode_max, 36)
    t[postcode_min:postcode_max] = district

intervaltree treats the upper bound of an interval as exclusive. If the ranges are inclusive (they include the "max" postcode, as in the question), use this instead:

    t[postcode_min:postcode_max+1] = district

Finally, you can look up districts by post_code like this:

def get_district(post_code):
    intervals = t[int(post_code, 36)] # a set of matching Interval objects
    if not intervals:
        return None
    # I assume you have only one district that matches a postal code
    return next(iter(intervals)).data # The value of the matching interval
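
If adding a dependency is not desirable, a similar logarithmic-time lookup can be sketched with the standard-library `bisect` module. This is a swapped-in technique, not part of the answer above, and it assumes the ranges never overlap. The names `rows`, `rows_sorted`, `lower_bounds` and `lookup` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical rows: (district, postcode_min, postcode_max), both ends inclusive
rows = [
    ("A", "1011AB", "1011BD"),
    ("A", "1011BG", "1011CE"),
    ("A", "1011CH", "1011CZ"),
    ("B", "1061WB", "1061WB"),
]

# Sort by lower bound so a binary search can find the candidate range
rows_sorted = sorted(rows, key=lambda row: row[1])
lower_bounds = [row[1] for row in rows_sorted]


def lookup(postcode):
    """Return the district for postcode, or None if no range contains it.

    Assumes the ranges do not overlap.
    """
    # Index of the last range whose lower bound is <= postcode
    i = bisect_right(lower_bounds, postcode) - 1
    if i < 0:
        return None
    district, low, high = rows_sorted[i]
    return district if postcode <= high else None
```

Only one candidate range has to be checked per lookup, because with non-overlapping ranges the postcode can only fall in the range with the nearest lower bound at or below it.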


Source: https://stackoverflow.com/questions/43975616/check-if-string-in-list-depending-on-last-two-characters
