pandas | 易学教程

Parallelizing comparisons between two dataframes with multiprocessing

Question: I have the following function, which compares the rows of two dataframes (data and ref) and returns the index of both rows when there is a match.

    def get_gene(row):
        m = (np.equal(row[0], ref.iloc[:, 0].values)
             & np.greater_equal(row[2], ref.iloc[:, 2].values)
             & np.less_equal(row[3], ref.iloc[:, 3].values))
        return ref.index[m] if m.any() else None

Because this process takes a long time (25 min for 1.6M rows in data versus 20K rows in ref), I tried to speed things up by
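The excerpt stops before the asker's own attempt, so the following is only a minimal sketch of one way the comparison could be spread over worker processes. The names `match_chunk`, `parallel_match` and `n_workers` are hypothetical, the column positions (0, 2, 3) are taken from `get_gene` above, and matches are returned as plain lists of index labels rather than as a pandas `Index`.

```python
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd


def match_chunk(chunk, ref):
    """For each row of `chunk`, list the index labels of matching `ref` rows.

    A row matches when column 0 is equal, column 2 is >= ref column 2
    and column 3 is <= ref column 3 (the same test as get_gene).
    Rows without any match yield None.
    """
    key = ref.iloc[:, 0].to_numpy()
    start = ref.iloc[:, 2].to_numpy()
    end = ref.iloc[:, 3].to_numpy()
    results = []
    for row in chunk.itertuples(index=False):
        mask = (row[0] == key) & (row[2] >= start) & (row[3] <= end)
        results.append(ref.index[mask].tolist() if mask.any() else None)
    return results


def parallel_match(data, ref, n_workers=4):
    """Split `data` into contiguous chunks and match them in worker processes."""
    bounds = np.linspace(0, len(data), n_workers + 1, dtype=int)
    chunks = [data.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    with Pool(n_workers) as pool:
        parts = pool.map(partial(match_chunk, ref=ref), chunks)
    # Flatten the per-chunk lists back into one list aligned with data's rows.
    return [hit for part in parts for hit in part]
```

On platforms that start workers with the spawn method, `parallel_match` should be called from under an `if __name__ == "__main__":` guard. Each worker receives its own pickled copy of `ref`, which is a cost worth measuring before assuming a speed-up.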

Reading a CSV file to pandas works in windows, not in ubuntu

Question: I wrote a script in Python on Windows and want to run it on my Raspberry Pi with Ubuntu. I am reading a CSV file whose line separator is a newline. To load the dataframe I use the following code:

    dfaux = pd.read_csv(r'/home/ubuntu/Downloads/data.csv', sep=';')

which loads a dataframe with just one row. I have also tried including the argument lineterminator = '\n\t', which throws this error message:

    ValueError: Only length-1 line terminators supported

In Windows I see the line breaks in the csv file
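The excerpt does not show which bytes the file actually uses as line endings, so what follows is a diagnostic sketch rather than a confirmed fix. It assumes the file uses "\r\n" or bare "\r" endings, and it normalizes them to "\n" before handing the text to `pd.read_csv`. The helper name `read_csv_normalized` and its parameters are hypothetical.

```python
import io

import pandas as pd


def read_csv_normalized(raw, sep=";", encoding="utf-8"):
    """Parse CSV bytes after converting every line ending to a single newline."""
    text = raw.decode(encoding)
    # "\r\n" first, so that a Windows ending does not turn into two newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return pd.read_csv(io.StringIO(text), sep=sep)
```

Printing `open(path, "rb").read(200)` first shows the raw bytes, including the real line terminator, which is usually the quickest way to see why a file collapses into one row.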

How to split datatable dataframe into train and test dataset in python

Question: I am using a datatable dataframe. How can I split it into train and test datasets? As I would with a pandas dataframe, I tried to use train_test_split(dt_df,classes) from sklearn.model_selection, but it doesn't work and I get an error.

    import datatable as dt
    import numpy as np
    from sklearn.model_selection import train_test_split

    dt_df = dt.fread(csv_file_path)
    classe = dt_df[:, "classe"]
    del dt_df[:, "classe"]
    X_train, X_test, y_train, y_test = train_test_split(dt_df, classe, test_size=test
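One possible workaround, sketched here under assumptions rather than taken from the original thread, is to split row positions instead of the frame itself and then select rows by position. The helper `train_test_indices` and its parameters are hypothetical and use only NumPy.

```python
import numpy as np


def train_test_indices(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx): sorted, disjoint row positions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    # The first n_test shuffled positions form the test set, the rest the train set.
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])
```

Assuming the datatable Frame accepts a list of integers as its row selector, the split could then be applied as `dt_df[train_idx.tolist(), :]` and `dt_df[test_idx.tolist(), :]`, with the same index lists reused for the `classe` column.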

Python scrape table from website?

Question: I'd like to scrape every treasury yield rate that is available on the treasury.gov website: https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yieldAll How would I go about collecting this information? I'm assuming that I'd have to use BeautifulSoup or Selenium or something like that (preferably BS4). I'd eventually like to put this data in a pandas DataFrame.

Answer 1: Here's one way you can grab the data in a table using requests and beautifulsoup import
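The answer's code is cut off in this excerpt, so the sketch below is not that answer. It swaps in the standard library's `html.parser` (instead of BeautifulSoup) so that it runs without extra dependencies, and it parses an HTML string that is assumed to have been downloaded already. The class `TableParser` and the function `table_to_frame` are hypothetical names.

```python
from html.parser import HTMLParser

import pandas as pd


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_frame(html):
    """Build a DataFrame from the rows found in `html`; the first row is the header."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return pd.DataFrame(body, columns=header)
```

The same row-then-cell loop carries over to BeautifulSoup by iterating over `soup.find_all("tr")`. All cells come back as strings, so numeric columns still need converting, for example with `pd.to_numeric`.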

write unicode data to mssql with python?

Question: I'm trying to write a table from a .csv file containing Hebrew text to an SQL Server database. The table is valid and pandas reads the data correctly (PyCharm even displays the Hebrew properly), but when I try to write it to a table in the database I get question marks ("???") where the Hebrew should be. This is what I've tried, using pandas and sqlalchemy:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('mssql+pymssql://server/test?charset=utf8')
    connection

Pandas: Sort a Multiindex Dataframe's multi-level column with mixed datatypes

Question: Below is my dataframe:

    In [2804]: df = pd.DataFrame({'A':[1,2,3,4,5,6],
                                  'D':[{"value": '126', "perc": None, "unit": None},
                                       {"value": 324, "perc": None, "unit": None},
                                       {"value": 'N/A', "perc": None, "unit": None},
                                       {},
                                       {"value": '100', "perc": None, "unit": None},
                                       np.nan]})

    In [2794]: df.columns = pd.MultiIndex.from_product([df.columns, ['E']])

    In [2807]: df
    Out[2807]:
       A                                             D
       E                                             E
    0  1  {'value': '126', 'perc': None, 'unit': None}
    1  2    {'value': 324, 'perc': None, 'unit': None}
    2  3  {'value': 'N/A',
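The question is truncated before it states the desired ordering, so this sketch assumes the goal is to sort rows by the numeric `value` stored inside the dicts of column `('D', 'E')`, with non-numeric or missing entries placed last. The helper names `extract_value` and `sort_by_nested_value` are hypothetical.

```python
import numpy as np
import pandas as pd


def extract_value(cell, key="value"):
    """Return cell[key] for dict cells, None for anything else (NaN, empty dict)."""
    if isinstance(cell, dict):
        return cell.get(key)
    return None


def sort_by_nested_value(df, column, key="value"):
    """Sort `df` by the numeric value nested in the dicts of `column`."""
    raw = df[column].map(lambda cell: extract_value(cell, key))
    # Strings such as '126' become numbers; 'N/A' and None become NaN.
    numeric = pd.to_numeric(raw, errors="coerce")
    order = numeric.sort_values(na_position="last", kind="stable").index
    return df.loc[order]


df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "D": [
        {"value": "126", "perc": None, "unit": None},
        {"value": 324, "perc": None, "unit": None},
        {"value": "N/A", "perc": None, "unit": None},
        {},
        {"value": "100", "perc": None, "unit": None},
        np.nan,
    ],
})
df.columns = pd.MultiIndex.from_product([df.columns, ["E"]])

sorted_df = sort_by_nested_value(df, ("D", "E"))
```

Coercing to numbers first avoids comparing strings with integers directly, which is what normally fails when a column mixes datatypes.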