dask

Repartition Dask DataFrame to get even partitions

谁说胖子不能爱 Submitted on 2020-08-24 06:37:37
Question: I have a Dask DataFrame whose index (client_id) is not unique. Repartitioning and resetting the index leaves me with very uneven partitions: some contain only a few rows, others hundreds of thousands. For instance, the following code:

for p in range(ddd.npartitions):
    print(len(ddd.get_partition(p)))

prints out something like this:

55 17 5 41 51 1144 4391 75153 138970 197105 409466 415925 486076 306377 543998 395974 530056 374293 237 12 104 52 28

My DataFrame is one-hot encoded and has over 500 …
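A common way to even out partitions like these (a minimal sketch, not the asker's code; the toy frame and the "100MB" target below are illustrative assumptions) is to let Dask split and merge partitions by in-memory size with repartition(partition_size=...), rather than relying on the index values:

import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in for a one-hot-encoded frame keyed by client_id.
pdf = pd.DataFrame({"client_id": [1, 1, 2, 3, 3, 3], "x": range(6)})
ddd = dd.from_pandas(pdf, npartitions=3).set_index("client_id")

# Rebalance partitions to a roughly uniform in-memory size.
# partition_size takes a byte string; "100MB" is an arbitrary example.
ddd_even = ddd.repartition(partition_size="100MB")

# Inspect the resulting partition lengths.
print(ddd_even.map_partitions(len).compute())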

Pandas dataframes too large to append to dask dataframe?

不羁岁月 Submitted on 2020-08-20 02:37:24
Question: I'm not sure what I'm missing here; I thought Dask would resolve my memory issues. I have 100+ pandas DataFrames saved in .pickle format. I would like to get them all into the same DataFrame but keep running into memory issues. I've already increased the memory buffer in Jupyter. It seems I may be missing something in creating the Dask DataFrame, as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers? Below is the basic process I used:

import pandas as pd
import …
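One pattern that avoids holding all 100+ frames in RAM at once (a sketch under assumptions: the glob pattern is hypothetical and all pickles are assumed to share the same columns; this is not the asker's original code) is to wrap each pandas read in dask.delayed and build the Dask DataFrame from those lazy parts:

import glob

import dask
import dask.dataframe as dd
import pandas as pd

# Hypothetical location of the pickled pandas DataFrames.
paths = sorted(glob.glob("data/frames/*.pickle"))

# One lazy read task per file; nothing is loaded yet.
parts = [dask.delayed(pd.read_pickle)(path) for path in paths]

# Describe the expected schema from the first file without keeping its data,
# then assemble the Dask DataFrame from the lazy parts.
meta = pd.read_pickle(paths[0]).iloc[:0]
ddf = dd.from_delayed(parts, meta=meta)

# Reduce or write out lazily (e.g. to Parquet) instead of computing the whole
# thing at once, so partitions are loaded a few at a time.
ddf.to_parquet("data/combined.parquet")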

[Debunking the Lie] Is Modin really faster than pandas?

痴心易碎 Submitted on 2020-08-11 00:57:33
I recently read an article on a WeChat public account recommending a so-called wonder tool whose read speed supposedly crushes pandas. Impressive, if true. In the spirit of learning, I searched for the article online and noticed a few things that didn't add up. Is it really so? Let's uncover the truth together.

First, install the packages:

# pip install ray
# pip install dask
# pip install modin

Installed versions:

Successfully installed aiohttp-3.6.2 async-timeout-3.0.1 google-2.0.3 multidict-4.7.6 py-spy-0.3.3 ray-0.8.5 redis-3.4.1 yarl-1.4.2
Requirement already satisfied: dask in /Applications/anaconda3/lib/python3.7/site-packages (2.11.0)
Successfully installed modin-0.7.3 pandas-1.0.3
Successfully uninstalled ray-0.8.5
Successfully installed pyarrow-0.16.0 ray-0.8.4

Import the package to test:

import modin.pandas as pd  # ImportError: Please `pip install modin[ray]` to …
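A head-to-head read test in the spirit of the article can be sketched like this (the CSV path and the Ray engine setting are illustrative assumptions, not the author's exact benchmark):

import os
import time

os.environ["MODIN_ENGINE"] = "ray"  # tell Modin to use the Ray backend

import modin.pandas as mpd
import pandas as pd

csv_path = "big_file.csv"  # hypothetical large CSV used for the comparison

start = time.time()
df_pandas = pd.read_csv(csv_path)
print("pandas read_csv:", round(time.time() - start, 2), "s")

start = time.time()
df_modin = mpd.read_csv(csv_path)
print("modin read_csv:", round(time.time() - start, 2), "s")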

4Paradigm: Distributed Machine Learning Frameworks and High-Dimensional Real-Time Recommender Systems

荒凉一梦 Submitted on 2020-08-07 16:45:46
Introduction: With the rapid growth of the internet and the spread of information technology, the data generated by business operations is growing exponentially and AI models are becoming ever more complex; now that Moore's law has run out, putting AI into production faces all kinds of difficulties. The topic of this talk is how a distributed machine learning framework supports a high-dimensional real-time recommender system.

Machine learning is essentially the fitting of a high-dimensional function, which can be turned into classification or regression through a probability transformation. Recommendation is essentially a binary classification problem, recommend or do not recommend, i.e. picking out the users who are likely to respond. From an engineering perspective, this article describes the challenges a recommender system faces in model training and inference, and how 4Paradigm's distributed machine learning framework GDBT addresses these engineering problems.

The main topics are:

The challenges recommender systems pose for machine learning infrastructure
Performance bottlenecks of different algorithms in large-scale distributed machine learning, and how to address them
The network pressure on 4Paradigm's distributed machine learning framework GDBT, and directions for optimization

01 The challenges recommender systems pose for machine learning infrastructure

1. Massive data plus high-dimensional features deliver the best results

In traditional recommender systems, fitting the data with simple models or rules was enough to get good results (complex models overfit easily, and the results only get worse). But once the data volume reaches a certain scale, simple models and rules can no longer exploit the full value of the data, because as the data grows, the ceiling on recommendation quality rises with it. To chase that precision, models become more and more complex. For recommender systems in particular, there are large numbers of discrete features, such as user IDs, item IDs, and their various combinations …
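To make the "high-dimensional discrete features" point concrete, here is a small illustrative sketch (not from the talk; the feature names, hash width, and CRC-based hashing are assumptions) of how ID and combination features blow up into a huge sparse space that is typically handled by hashing:

import zlib

HASH_DIM = 2 ** 24  # fixed sparse dimensionality, an arbitrary example

def hash_feature(name: str, value: str) -> int:
    # Map a (feature, value) pair to a column index in [0, HASH_DIM).
    return zlib.crc32(f"{name}={value}".encode()) % HASH_DIM

sample = {"user_id": "u_10423", "item_id": "i_88731", "city": "beijing"}

# Each distinct ID value gets its own column; combination (cross) features
# such as user_id x item_id multiply the number of possible columns further.
indices = [hash_feature(k, v) for k, v in sample.items()]
indices.append(hash_feature("user_id_x_item_id",
                            sample["user_id"] + "|" + sample["item_id"]))

print(sorted(indices))  # non-zero positions of one sparse training row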

How to speed up import of large xlsx files?

最后都变了- Submitted on 2020-08-02 19:32:09
Question: I want to process a large 200 MB Excel (xlsx) file with 15 sheets and 1 million rows of 5 columns each, and create a pandas DataFrame from the data. Importing the Excel file is extremely slow (up to 10 minutes). Unfortunately, the Excel input format is mandatory (I know that csv is faster...). How can I speed up importing a large Excel file into a pandas DataFrame? It would be great to get the time down to around 1-2 minutes, if possible, which would be much more …
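Since this listing is about dask, one option worth trying (a sketch; the file name is hypothetical, and xlsx parsing itself remains CPU-bound per sheet, so the gain depends on spare cores) is to read the 15 sheets in parallel with dask.delayed and concatenate the results:

import dask
import dask.dataframe as dd
import pandas as pd

# xlsx parsing holds the GIL, so use the multiprocessing scheduler.
dask.config.set(scheduler="processes")

xlsx_path = "big_workbook.xlsx"  # hypothetical 200 MB, 15-sheet file

# Discover the sheet names up front.
sheet_names = pd.ExcelFile(xlsx_path).sheet_names

# One lazy pandas.read_excel task per sheet, so sheets can be parsed
# concurrently in separate worker processes.
parts = [dask.delayed(pd.read_excel)(xlsx_path, sheet_name=name)
         for name in sheet_names]

ddf = dd.from_delayed(parts)

# Either keep working on ddf lazily, or materialize a single pandas frame
# (this needs enough RAM for all 15 sheets at once).
df = ddf.compute()
print(len(df))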
