Slow Python loop to search data in another data frame

無奈伤痛 2020-12-19 18:34

I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the different stations where each observation starts and ends (called 'info'). Looping over 'data' row by row to look up each station's coordinates in 'info' is very slow.

2 Answers
  • 2020-12-19 19:12

    This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.

    1. Prepare your dataset

    In big data work, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of your data that are optimized for the processing that follows, as sketched below.
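
    For instance, a minimal sketch of that idea (the file name, columns and dtype choices are hypothetical) that loads only what the later steps need:

    # Hypothetical sketch: load only the needed columns with
    # memory-friendly dtypes, then work on the relevant subset.
    import pandas as pd

    data = pd.read_csv('observations.csv',
                       usecols=['station', 'year', 'latitude', 'longitude'],
                       dtype={'station': 'category'})

    # keep only the rows the later steps actually use
    data_2018 = data[data['year'] == 2018]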

    2. Optimize your script

    Short code does not always mean optimized code in terms of performance. Without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out how to avoid as much computation as possible while still getting exactly the same result. Avoid any unnecessary work, and prefer vectorized operations over row-by-row loops, as sketched below.
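
    As an illustration with the DataFrames from the question, replacing a per-row loop with a single vectorized lookup usually removes most of the cost (a sketch of the general technique, not the asker's exact code):

    # Slow: one Python-level lookup per row of data
    # for i in data.index:
    #     row = info[info['station'] == data.loc[i, 'station']]
    #     data.loc[i, 'latitude'] = row['latitude'].iloc[0]

    # Fast: one vectorized mapping over the whole column
    s_lat = info.set_index('station')['latitude']
    data['latitude'] = data['station'].map(s_lat)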

    You can also consider splitting the work over multiple threads or processes if appropriate; see the sketch below.
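
    A hedged sketch of that idea: split the frame into chunks and process them in parallel. Here process_chunk is a hypothetical function; for CPU-bound pandas work, processes usually beat threads because of Python's GIL.

    import numpy as np
    import pandas as pd
    from concurrent.futures import ProcessPoolExecutor

    def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
        # hypothetical per-chunk work goes here
        return chunk

    if __name__ == '__main__':
        # 'data' is the DataFrame from the question
        chunks = np.array_split(data, 8)   # 8 roughly equal pieces
        with ProcessPoolExecutor(max_workers=8) as pool:
            data = pd.concat(pool.map(process_chunk, chunks))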

    As a general rule, avoid for loops that you break out of partway through. Whenever you don't know up front how many iterations you will need, a while (or, in languages that have one, do...while) loop makes the stop condition explicit; see the example below.
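
    For example, when consuming data of unknown length in batches, a while loop keeps the stop condition visible (fetch_batch and process are hypothetical placeholders):

    offset = 0
    batch = fetch_batch(offset, size=10_000)   # hypothetical data source
    while not batch.empty:
        process(batch)                         # hypothetical processing step
        offset += len(batch)
        batch = fetch_batch(offset, size=10_000)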

    3. Consider using distributed storage and computing

    This is a subject in itself that is far too big to explain fully here.

    Storing, accessing and processing data in a serialized way is fast for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.

    These frameworks aim to do everything in parallel. Many of them rely on a programming model called MapReduce, illustrated below.
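
    To make the idea concrete, here is a toy, single-machine illustration of the MapReduce pattern in plain Python (the input is hypothetical); real frameworks run the map and reduce phases in parallel across many machines:

    from collections import Counter
    from functools import reduce

    records = ['station_a', 'station_b', 'station_a']   # hypothetical input

    # Map phase: emit a partial count per record
    mapped = [Counter({station: 1}) for station in records]

    # Reduce phase: merge the partial counts
    totals = reduce(lambda a, b: a + b, mapped, Counter())
    # Counter({'station_a': 2, 'station_b': 1})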

    The first widely adopted distributed data storage framework was Hadoop (notably the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.

    In any case, if you are willing to use this ecosystem, it will probably be more appropriate not to run MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, engine such as Spark or Apache Ignite on top of HDFS, as sketched below. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop.
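
    As a hedged sketch of what the question's lookup could look like on Spark (the paths and schema are hypothetical), the per-row search becomes a distributed join:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('stations').getOrCreate()

    # hypothetical paths; these could just as well live on HDFS
    data = spark.read.csv('hdfs:///data/observations.csv', header=True)
    info = spark.read.csv('hdfs:///data/stations.csv', header=True)

    # the lookup from the question becomes a distributed join
    enriched = data.join(info, on='station', how='left')
    enriched.write.parquet('hdfs:///data/enriched')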

    Again, this subject is a whole different world, but it might very well fit your situation. Feel free to read up on all these concepts and frameworks, and leave questions in the comments if needed.

  • 2020-12-19 19:31

    This is one solution. You can also use pandas.merge to add the two new columns to data and perform the equivalent mapping; see the sketch after the code.

    # create series mappings from info
    s_lat = info.set_index('station')['latitude']
    s_lon = info.set_index('station')['longitude']
    
    # calculate Boolean mask on year
    mask = data['year'] == '2018'
    
    # apply mappings; if no match is found, use fillna to keep the original values
    data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
                                     .fillna(data.loc[mask, 'latitude'])
    
    data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
                                      .fillna(data.loc[mask, 'longitude'])
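
    For reference, a sketch of the pandas.merge alternative mentioned above (the suffix handling shown is one reasonable choice, not the only one):

    # merge station coordinates onto data, then prefer the mapped
    # values for 2018 rows, falling back to the originals
    merged = data.merge(info[['station', 'latitude', 'longitude']],
                        on='station', how='left', suffixes=('', '_info'))

    mask = merged['year'] == '2018'
    for col in ['latitude', 'longitude']:
        merged.loc[mask, col] = merged.loc[mask, col + '_info']\
                                      .fillna(merged.loc[mask, col])
    merged = merged.drop(columns=['latitude_info', 'longitude_info'])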
    