sampling

How to perform undersampling (the right way) with Python scikit-learn?

柔情痞子 submitted 2020-01-04 06:00:15

Question: I am attempting to perform undersampling of the majority class using Python scikit-learn. Currently my code looks for the N of the minority class and then tries to undersample exactly that N from the majority class, so both the test and training data end up with this 1:1 distribution. But what I really want is this 1:1 distribution on the training data ONLY, while testing on the original distribution in the testing data. I am not quite sure how to do the latter as there is some dict…
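The key is the order of operations: split first, balance afterwards, and only the training portion. A minimal pure-Python sketch of that split-then-balance order (function names are made up for illustration; in practice imbalanced-learn's RandomUnderSampler applied after scikit-learn's train_test_split achieves the same effect):

```python
import random

def split_then_balance(X, y, test_frac=0.3, seed=0):
    """Split first; undersample the majority class in the TRAINING part only."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    Xtr, ytr = [X[i] for i in tr], [y[i] for i in tr]
    Xte, yte = [X[i] for i in te], [y[i] for i in te]

    # Group training rows by class and downsample to the minority count.
    by_class = {}
    for xi, yi in zip(Xtr, ytr):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):
            Xb.append(xi)
            yb.append(label)
    return Xb, yb, Xte, yte   # the test set keeps the original ratio

X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10            # 9:1 imbalance
Xb, yb, Xte, yte = split_then_balance(X, y)
```

Because the undersampling never sees the held-out indices, the test distribution stays at the original 9:1 while the balanced training set is exactly 1:1.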

C++ discrete distribution sampling with frequently changing probabilities

一笑奈何 submitted 2020-01-03 14:15:10

Question: Problem: I need to sample from a discrete distribution constructed from weights, e.g. {w1, w2, w3, …}, and thus from the probability distribution {p1, p2, p3, …}, where pi = wi/(w1 + w2 + …). Some of the wi change very frequently, but only a very low proportion of all of them. The distribution itself therefore has to be renormalised every time this happens, so I believe the alias method does not work efficiently here, because one would need to rebuild the whole distribution from scratch each time. The method I…
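One structure that avoids the full rebuild is a Fenwick (binary indexed) tree over the weights: updating a single wi and drawing a sample are both O(log n), and no explicit renormalisation is needed because the tree's total prefix sum is always the current sum of weights. A sketch (class and method names are made up, not from the question):

```python
import random

class FenwickSampler:
    """Draw index i with probability w[i] / sum(w).  Updating one weight
    and drawing are both O(log n), so frequently changing a few weights
    never forces an alias-table-style rebuild from scratch."""

    def __init__(self, weights):
        self.n = len(weights)
        self.w = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)   # 1-based Fenwick tree
        for i, wi in enumerate(weights):
            self.update(i, wi)

    def update(self, i, new_weight):
        delta = new_weight - self.w[i]
        self.w[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & (-j)

    def _prefix(self, j):                  # sum of w[0..j-1]
        s = 0.0
        while j > 0:
            s += self.tree[j]
            j -= j & (-j)
        return s

    def sample(self, rng=random):
        # Descend the tree to the smallest index whose prefix sum exceeds u.
        u = rng.random() * self._prefix(self.n)
        idx, bit = 0, 1 << (self.n.bit_length() - 1)
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] < u:
                u -= self.tree[nxt]
                idx = nxt
            bit >>= 1
        return idx                         # 0-based index

sampler = FenwickSampler([1.0, 2.0, 3.0])
sampler.update(1, 5.0)    # cheap O(log n) change, no rebuild
pick = sampler.sample()   # drawn with weights 1:5:3
```

The same idea translates directly to C++ (a `std::vector<double>` tree with the identical bit tricks).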

How to get a random (bootstrap) sample from pandas multiindex

∥☆過路亽.° submitted 2020-01-03 12:24:29

Question: I'm trying to create a bootstrapped sample from a multiindex dataframe in Pandas. Below is some code to generate the kind of data I need: from itertools import product import pandas as pd import numpy as np df = pd.DataFrame({'group1': [1, 1, 1, 2, 2, 3], 'group2': [13, 18, 20, 77, 109, 123], 'value1': [1.1, 2, 3, 4, 5, 6], 'value2': [7.1, 8, 9, 10, 11, 12] }) df = df.set_index(['group1', 'group2']) print(df) The df dataframe looks like: value1 value2 group1 group2 1 13 1.1 7.1 18 2.0 8.0 20 3…
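Assuming the goal is to resample whole group1 blocks with replacement (a cluster bootstrap), one sketch using the dataframe from the question draws the level-0 labels with replacement and concatenates the selected blocks under a new replicate level, so a group drawn twice stays distinguishable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group1': [1, 1, 1, 2, 2, 3],
                   'group2': [13, 18, 20, 77, 109, 123],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])

def bootstrap_groups(df, seed=0):
    """Draw group1 labels with replacement; .loc[[g]] keeps the multiindex."""
    rng = np.random.default_rng(seed)
    labels = df.index.get_level_values('group1').unique()
    drawn = rng.choice(labels, size=len(labels), replace=True)
    # keys= adds a 'replicate' level so duplicated groups do not collide
    return pd.concat([df.loc[[g]] for g in drawn],
                     keys=range(len(drawn)), names=['replicate'])

boot = bootstrap_groups(df)
```

Passing a list to `.loc` (`df.loc[[g]]` rather than `df.loc[g]`) is what preserves the group1 level in the returned block.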

Android accelerometer sampling rate/delay stabilization

て烟熏妆下的殇ゞ submitted 2020-01-03 03:27:06

Question: I'm trying to detect the force of a tap using the accelerometer data together with the onTouch method. As far as I know, the fastest sampling frequency for the accelerometer is 200–202 Hz, but this variability is giving me problems when trying to match the timestamps of the onTouch event and the peak in the accelerometer data. Is there a way to stabilize the accelerometer readings to avoid this problem? Like controlling the specific thread or something? Answer 1: If you want…

Selecting nodes with probability proportional to trust

六眼飞鱼酱① submitted 2020-01-02 09:38:51

Question: Does anyone know of an algorithm or data structure for selecting items with a probability of selection proportional to some attached value? In other words: http://en.wikipedia.org/wiki/Sampling_%28statistics%29#Probability_proportional_to_size_sampling The context here is a decentralized reputation system, and the attached value is therefore the trust one user has in another. In this system all nodes start either as friends, which are completely trusted, or as unknowns…
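For a one-off proportional draw, the textbook approach is a cumulative sum over the trust values plus a binary search: node i is chosen with probability trust[i] / total trust. A stdlib sketch (the function name and node names are made up):

```python
import bisect
import itertools
import random

def pick_proportional(nodes, trusts, rng=random):
    """Select one node with probability trust / total trust."""
    cum = list(itertools.accumulate(trusts))   # e.g. [1, 3, 10]
    u = rng.random() * cum[-1]                 # uniform in [0, total)
    return nodes[bisect.bisect_right(cum, u)]  # first prefix sum > u

node = pick_proportional(['alice', 'bob', 'carol'], [1, 2, 7])
```

Building the cumulative array is O(n) and each draw is O(log n); if trust values change often between draws, a Fenwick tree over the weights supports O(log n) updates as well.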

Reproducible splitting of data into training and testing in R

本小妞迷上赌 submitted 2020-01-01 22:15:12

Question: A common way of sampling/splitting data in R is using sample, e.g. on row numbers. For example: require(data.table) set.seed(1) population <- as.character(1e5:(1e6-1)) # some made-up ID names N <- 1e4 # sample size sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids test <- sample(N-1, N/2, replace = F) test1 <- sample1[test, .(id)] The problem is that this isn't very robust to changes in the data. For example, if we drop just one observation: sample2 <- sample1[…
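A split that is robust in the sense the question asks for can be had by deciding train/test membership from a hash of each ID alone, so dropping or adding other observations never reassigns an existing ID. A Python sketch of the idea (the R question's spirit, not its code; the salt string is arbitrary):

```python
import hashlib

def in_test(sample_id, test_frac=0.5, salt="split-v1"):
    """Deterministic membership: depends only on the ID itself,
    never on which other rows happen to be present."""
    digest = hashlib.sha256((salt + str(sample_id)).encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2.0**64  # in [0, 1)
    return bucket < test_frac

ids = [str(i) for i in range(100000, 100100)]   # made-up ID names
test_ids = [i for i in ids if in_test(i)]
train_ids = [i for i in ids if not in_test(i)]
```

The same scheme ports directly to R (e.g. hashing IDs with the digest package); changing the salt gives an independent, equally reproducible split.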

Efficiently picking a random element from a chained hash table?

一笑奈何 submitted 2019-12-31 08:49:47

Question: Just for practice (and not as a homework assignment) I have been trying to solve this problem (CLRS, 3rd edition, exercise 11.2-6): Suppose we have stored n keys in a hash table of size m, with collisions resolved by chaining, and that we know the length of each chain, including the length L of the longest chain. Describe a procedure that selects a key uniformly at random from among the keys in the hash table and returns it in expected time O(L · (1 + m/n)). What I have thought of so far is that the…
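The usual solution to this exercise is rejection sampling over (slot, position) pairs: pick a uniform slot and a uniform position in 0..L-1, and accept only if that position actually holds a key. Every key is accepted with the same probability 1/(mL), so accepted keys are uniform, and the expected number of trials mL/n gives the stated O(L · (1 + m/n)) bound. A sketch (hypothetical helper, not the asker's code):

```python
import random

def random_key(table, L, rng=random):
    """table: list of m chains (lists of keys); L: longest chain length.
    Pick a uniform (slot, position) pair; accept only when that position
    actually holds a key, so every key is equally likely."""
    m = len(table)
    while True:
        slot = rng.randrange(m)
        pos = rng.randrange(L)
        if pos < len(table[slot]):
            return table[slot][pos]

table = [['a'], [], ['b', 'c']]   # m = 3 slots, L = 2
key = random_key(table, L=2)
```

Each trial costs O(1) given the chain lengths, which is why the expected-time bound depends only on the acceptance rate n/(mL).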

Fast Poisson Disk Sampling [Robert Bridson] in Python

非 Y 不嫁゛ submitted 2019-12-31 04:50:05

Question: First of all, I implemented the ordinary, slow Poisson disk sampling algorithm in the 2D plane, and it works just fine. This slow version computes the distances between all points and checks that the point you wish to place is at least R away from all the others. The fast version by Robert Bridson, available here: https://www.cs.ubc.ca/~rbridson/docs/bridson-siggraph07-poissondisk.pdf, suggests discretizing your 2D plane into square cells with side length R/sqrt(2), since each cell can at…
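Bridson's speed-up in brief: with cell side R/sqrt(2) a cell holds at most one sample, so the minimum-distance check only scans a small fixed neighbourhood of cells instead of all points. A compact reimplementation sketched from the linked paper (not the asker's code; parameter k is the paper's per-point trial budget):

```python
import math
import random

def poisson_disk(width, height, R, k=30, seed=0):
    rng = random.Random(seed)
    cell = R / math.sqrt(2)                       # at most one sample per cell
    cols = int(math.ceil(width / cell))
    rows = int(math.ceil(height / cell))
    grid = [[None] * cols for _ in range(rows)]

    def cell_of(p):
        return int(p[1] // cell), int(p[0] // cell)

    def fits(p):
        if not (0 <= p[0] < width and 0 <= p[1] < height):
            return False
        r0, c0 = cell_of(p)
        # points within R are at most 2 cells away in each direction
        for r in range(max(r0 - 2, 0), min(r0 + 3, rows)):
            for c in range(max(c0 - 2, 0), min(c0 + 3, cols)):
                q = grid[r][c]
                if q is not None and (q[0] - p[0])**2 + (q[1] - p[1])**2 < R * R:
                    return False
        return True

    first = (rng.uniform(0, width), rng.uniform(0, height))
    samples, active = [first], [first]
    r0, c0 = cell_of(first)
    grid[r0][c0] = first
    while active:
        i = rng.randrange(len(active))
        base = active[i]
        for _ in range(k):
            # candidate in the annulus [R, 2R] around an active point
            ang = rng.uniform(0, 2 * math.pi)
            rad = rng.uniform(R, 2 * R)
            p = (base[0] + rad * math.cos(ang), base[1] + rad * math.sin(ang))
            if fits(p):
                samples.append(p)
                active.append(p)
                r, c = cell_of(p)
                grid[r][c] = p
                break
        else:
            active.pop(i)    # no room left around this point
    return samples

points = poisson_disk(50, 50, 5.0)
```

The all-pairs distance check of the slow version is replaced by the 5x5 cell scan in `fits`, which is what makes the whole run O(n).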

Efficient algorithm for generating unique (non-repeating) random numbers

空扰寡人 submitted 2019-12-30 10:39:06

Question: I want to solve the following problem. I have to sample from an extremely large set, on the order of 10^20 elements, extracting a sample without repetitions of about 10%–20% of the set. Given the size of the set, I believe an algorithm like Fisher–Yates is not feasible. I'm thinking that something like a random path tree might work in O(n log n) and that it can't be done faster, but I want to ask whether something like this has already been implemented. Thank you for your time! Answer 1: I don't know how…
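Worth noting: a sample that is literally 10–20% of 10^20 elements cannot even be written out, so any practical answer assumes the sample size k itself is manageable. For that case, Floyd's algorithm draws k distinct values from an arbitrarily large range in O(k) expected time and O(k) memory, with no Fisher–Yates shuffle of the universe. A sketch:

```python
import random

def floyd_sample(n, k, seed=0):
    """Floyd's algorithm: k distinct values from range(n) in O(k)
    expected time and O(k) memory -- the full universe is never touched."""
    rng = random.Random(seed)
    chosen = set()
    for j in range(n - k, n):
        t = rng.randrange(j + 1)
        # if t was already chosen, j itself is guaranteed fresh
        chosen.add(t if t not in chosen else j)
    return chosen

ids = floyd_sample(10**20, 1000)   # instant, despite the 10^20 range
```

If k is too large for a set but a stream of the sample suffices, iterating a pseudorandom permutation of the index range (e.g. via format-preserving encryption) gives repetition-free samples with O(1) memory.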

How to get sound data sample value in c#

邮差的信 submitted 2019-12-29 09:14:09

Question: I need to get the sample values of the sound data of a WAV file, so that from those sample values I can compute the amplitude of the sound data in every second. Important: is there any way to get audio data sample values using the NAudio library or the wmp library? I am getting the sample values this way: byte[] data = File.ReadAllBytes(File_textBox.Text); var samples = new int[data.Length]; int x = 0; for (int i = 44; i < data.Length; i += 2) { samples[x] = BitConverter.ToInt16(data, i); x++;…