data-science

Unique values within Pandas group of groups

北战南征 submitted on 2019-12-12 10:53:41
Question: I have a dataframe that I need to group, then subgroup. From the subgroups I need to return what the subgroup is, as well as the unique values for a column.

    df = pandas.DataFrame({'country': pandas.Series(['US', 'Canada', 'US', 'US']),
                           'gender': pandas.Series(['male', 'female', 'male', 'female']),
                           'industry': pandas.Series(['real estate', 'shipping', 'telecom', 'real estate']),
                           'income': pandas.Series([1, 2, 3, 4])})

    def subgroup(g):
        return g.groupby(['gender'])

    s = df.groupby(['country'])
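One way to get those unique values without the custom subgroup helper is to group on both keys at once; this is only a minimal sketch built on the df from the question:

    import pandas

    # df as constructed in the question above
    df = pandas.DataFrame({'country': ['US', 'Canada', 'US', 'US'],
                           'gender': ['male', 'female', 'male', 'female'],
                           'industry': ['real estate', 'shipping', 'telecom', 'real estate'],
                           'income': [1, 2, 3, 4]})

    # Group on country and gender together; .unique() returns the distinct
    # values of the chosen column for each (country, gender) subgroup.
    unique_industries = df.groupby(['country', 'gender'])['industry'].unique()
    print(unique_industries)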

Retrieve final hidden activation layer output from sklearn's MLPClassifier

為{幸葍}努か submitted on 2019-12-12 10:06:07
Question: I would like to do some tests with the neural network's final hidden activation layer outputs using sklearn's MLPClassifier after fitting the data. For example, if I create a classifier, assuming data X_train with labels y_train and two hidden layers of sizes (300, 100):

    clf = MLPClassifier(hidden_layer_sizes=(300, 100))
    clf.fit(X_train, y_train)

I would like to be able to call a function somehow to retrieve the final hidden activation layer vector of length 100 for use in additional tests. Assuming a…
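MLPClassifier does not expose a public call that returns intermediate activations, but the fitted weights are available as clf.coefs_ and clf.intercepts_, so one common workaround is to replay the forward pass by hand. A minimal sketch, assuming the default activation='relu' and the (300, 100) architecture above; final_hidden_activations is a hypothetical helper name:

    import numpy as np

    def final_hidden_activations(clf, X):
        """Replay the forward pass up to, and including, the last hidden layer."""
        activation = np.asarray(X)
        # coefs_[i] / intercepts_[i] map layer i to layer i+1; the last pair feeds
        # the output layer, so stop one step short of it.
        for weights, bias in zip(clf.coefs_[:-1], clf.intercepts_[:-1]):
            activation = activation @ weights + bias
            activation = np.maximum(activation, 0)  # assumes activation='relu' (the default)
        return activation  # shape (n_samples, 100) for hidden_layer_sizes=(300, 100)

    # hidden = final_hidden_activations(clf, X_train)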

scikit-learn: applying an arbitrary function as part of a pipeline

别来无恙 submitted on 2019-12-12 08:56:37
Question: I've just discovered the Pipeline feature of scikit-learn, and I find it very useful for testing different combinations of preprocessing steps before training my model. A pipeline is a chain of objects that implement the fit and transform methods. Until now, when I wanted to add a new preprocessing step, I would write a class that inherits from sklearn.base.BaseEstimator. However, I'm thinking that there must be a simpler method. Do I really need to wrap every function I want to apply in an estimator…
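scikit-learn already ships sklearn.preprocessing.FunctionTransformer, which wraps an arbitrary callable as a transformer, so a stateless step usually needs no hand-written estimator class. A minimal sketch; the log1p step, the toy data and the LogisticRegression model are only illustrative:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ('log', FunctionTransformer(np.log1p)),  # any plain function can go here
        ('clf', LogisticRegression()),
    ])

    X = np.abs(np.random.randn(20, 3))  # toy non-negative data for log1p
    y = np.random.randint(0, 2, size=20)
    pipe.fit(X, y)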

ValueError: labels ['timestamp'] not contained in axis

 ̄綄美尐妖づ submitted on 2019-12-12 05:09:00
Question: I am learning machine learning and I came across this code. I am trying to run the file "Recommender-Systems.py" from the above source, but it throws the error ValueError: labels ['timestamp'] not contained in axis. How can it be fixed? Here's a Dropbox link to the u.data file. Answer 1: Your data is missing the headers, so they are being wrongly inferred from the first row. You need to change Recommender-Systems.py slightly and supply the headers manually. The right header is available in the…
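A minimal sketch of the fix the answer describes, assuming u.data is the standard tab-separated MovieLens file (user id, item id, rating, timestamp); the column names below are illustrative and should match whatever Recommender-Systems.py expects:

    import pandas as pd

    # The file has no header row, so supply the column names explicitly.
    columns = ['user_id', 'item_id', 'rating', 'timestamp']
    ratings = pd.read_csv('u.data', sep='\t', names=columns)

    # The 'timestamp' label now exists, so dropping it no longer raises the ValueError.
    ratings = ratings.drop('timestamp', axis=1)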

combine or iterate pandas rows on specific columns

懵懂的女人 submitted on 2019-12-12 04:58:27
Question: I am struggling to figure out this row-by-row iteration in pandas. I have a dataset that contains chat conversations between two parties. I would like to combine the dataset into a row-by-row conversation between Person 1 and Person 2. Sometimes people type multiple sentences, and these appear as multiple records within the dataframe. This is the loop that I have come up with: line_text to be combined, timestamp to be updated with the latest time if the line_by shows that the same…
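One vectorised alternative to a hand-written loop is to label runs of consecutive rows by the same speaker and aggregate each run; a rough sketch, assuming line_by, line_text and timestamp columns as described (the sample data is invented):

    import pandas as pd

    chat = pd.DataFrame({
        'line_by':   ['Person 1', 'Person 1', 'Person 2', 'Person 1'],
        'line_text': ['Hi', 'Are you there?', 'Yes, hello', 'Great'],
        'timestamp': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 10:01',
                                     '2019-01-01 10:02', '2019-01-01 10:03']),
    })

    # A new run starts whenever the speaker changes relative to the previous row.
    run_id = (chat['line_by'] != chat['line_by'].shift()).cumsum()

    combined = (chat.groupby(run_id)
                    .agg({'line_by': 'first',
                          'line_text': ' '.join,   # combine the multi-line messages
                          'timestamp': 'max'})     # keep the latest time in the run
                    .reset_index(drop=True))
    print(combined)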

Mapping the index of the feature importances to the index of columns in a dataframe

匆匆过客 submitted on 2019-12-12 04:25:18
Question: Hello, I plotted a graph using feature_importance from xgboost. However, the graph labels the features as "f-values", so I do not know which feature is being represented in the graph. One way I have heard of to solve this is to map the index of the features within my dataframe to the index of the feature_importance "f-values" and select the columns manually. How do I go about doing this? Also, if there is another way of doing this, help would truly be appreciated. Here is my code below: feature…
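One way to resolve the f0/f1 labels is that xgboost numbers features in the column order of the training data, so 'fN' corresponds to the N-th column of the dataframe. A minimal sketch with invented toy data, using the sklearn-style feature_importances_ attribute:

    import pandas as pd
    import xgboost as xgb

    X = pd.DataFrame({'age': [25, 32, 47, 51],
                      'salary': [30, 60, 80, 90],
                      'tenure': [1, 4, 9, 12]})
    y = [0, 0, 1, 1]
    model = xgb.XGBClassifier(n_estimators=10).fit(X, y)

    # feature_importances_ follows the column order of X, so pairing them
    # replaces f0, f1, ... with readable column names.
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))

    # The same order applies to the plot labels: 'f0' -> first column, and so on.
    name_map = {'f{}'.format(i): col for i, col in enumerate(X.columns)}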

Redirect Bigquery Data to Prediction

南楼画角 submitted on 2019-12-12 03:55:25
Question: We are developing a POC in Google Sheets. There is some configuration involved, but in a nutshell it downloads data from BigQuery and redirects it to the Prediction API. Our BigQuery tables are over 41 MB, which is not allowed/supported by Sheets. We thought about downloading 5 MB packages of data from BigQuery. Although the Prediction API provides methods for inserting lots of data, the update method allows uploading only one line/instance at a time. Is there any way to redirect BigQuery data straight to…
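One commonly suggested route (sketched here only under assumptions, since the exact setup isn't shown) is to skip Sheets entirely: export the BigQuery table to Google Cloud Storage and then train the Prediction model from the exported file's gs:// location instead of uploading rows one by one. A rough sketch of the export step with the google-cloud-bigquery client; the project, dataset, table and bucket names are placeholders, and the exact call style varies with the client library version:

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')                  # placeholder project
    destination_uri = 'gs://my-bucket/prediction-training-*.csv'    # placeholder bucket

    # Export the table to Cloud Storage as CSV; the Prediction service can then be
    # pointed at the exported gs:// data rather than receiving rows through Sheets.
    extract_job = client.extract_table('my-project.my_dataset.my_table', destination_uri)
    extract_job.result()  # wait for the export to finish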

Dask Bag read_text() line order

老子叫甜甜 submitted on 2019-12-12 03:43:30
Question: Does dask.bag.read_text() preserve the line order? Is it still preserved when reading from multiple files?

    bag = db.read_text('program.log')
    bag = db.read_text(['program.log', 'program.log.1'])

Answer 1: Informally, yes, most Dask.bag operations do preserve order. This behavior is not strictly guaranteed; however, I don't see any reason to anticipate a change in the near future. Source: https://stackoverflow.com/questions/39652733/dask-bag-read-text-line-order
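A small check of the behavior the answer describes, assuming two local log files (the file names are only illustrative); per the answer, this ordering is informal rather than strictly guaranteed:

    import dask.bag as db

    # Lines within each file keep their order, and with a list of paths the
    # files are concatenated in the order they are given.
    bag = db.read_text(['program.log', 'program.log.1'])
    lines = bag.compute()  # a plain Python list of lines
    print(lines[:5])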

My CSV file has some email addresses, some of which are incomplete. How do I make them fully recognizable using Python?

喜欢而已 submitted on 2019-12-11 17:18:13
Question: I'm a beginner in data science with Python. I'm working on a dataset for which I have to do the following tasks using the Python petl library: a. clean the data in clinics.csv, which involves using Python and regex to standardise email addresses so they are usable as an HTML link, and b. output the merged and cleaned data into a CSV file with the name clinic_locations.csv. So far I've been able to handle part of point (b), i.e. I've easily extracted data from the XML file and combined it with the CSV…
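A rough sketch of the kind of petl-plus-regex cleaning step described in point (a); clinics.csv, the Email column name and the repair rule are assumptions about the data, not details taken from the question:

    import re
    import petl as etl

    EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

    def standardise_email(value):
        # Lower-case the value and keep only a well-formed address, if one is present.
        match = EMAIL_RE.search(str(value).strip().lower())
        return match.group(0) if match else value

    clinics = etl.fromcsv('clinics.csv')                        # assumed input file
    clinics = etl.convert(clinics, 'Email', standardise_email)  # 'Email' column name is assumed
    etl.tocsv(clinics, 'clinic_locations.csv')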

random_state parameter in sklearn's train_test_split

拥有回忆 submitted on 2019-12-11 17:15:04
Question: What difference do different values of random_state make to the output? For instance, if I set it to 0 versus 100, what difference would it make to the output? Answer 1: From the docs: random_state is the seed used by the random number generator. In general, a seed is used to create reproducible outputs. In the case of train_test_split, random_state determines how your data set is split. Unless you want to create reproducible runs, you can skip this parameter. For instance, if it is set to 0 and…
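A minimal sketch of what that means in practice: the same random_state always produces the same split, while different values produce different, equally valid splits:

    from sklearn.model_selection import train_test_split

    data = list(range(10))

    a_train, a_test = train_test_split(data, test_size=0.3, random_state=0)
    b_train, b_test = train_test_split(data, test_size=0.3, random_state=0)
    c_train, c_test = train_test_split(data, test_size=0.3, random_state=100)

    print(a_test == b_test)  # True: same seed, same split
    print(a_test == c_test)  # almost certainly False: a different seed gives a different split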