large-data

With Haskell, how do I process large volumes of XML?

Submitted by ≡放荡痞女 on 2019-12-02 18:19:26
I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document order by a particular user all ran into nasty thrashing.

TagSoup:

    import Control.Monad
    import Text.HTML.TagSoup

    userid = "83805"

    main = do
      posts <- liftM parseTags (readFile "posts.xml")
      print $ head $ map (fromAttrib "Id") $ filter (~== ("<row OwnerUserId=" ++ userid ++ ">")) posts

hxt:

    import Text.XML.HXT.Arrow
    import Text.XML.HXT.XPath

    userid = "83805"

    main = do
      runX $
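
The question asks specifically about Haskell; purely to illustrate the streaming idea (parse incrementally and discard each element once inspected, instead of materialising the whole document), here is a minimal Python sketch. The file name and attribute names come from the excerpt above; the rest is an assumption, not from the original thread.

    import xml.etree.ElementTree as ET

    USER_ID = "83805"

    def first_post_id(path, user_id):
        # Stream the document with iterparse; each <row> is inspected and then
        # cleared, so memory use stays roughly flat regardless of file size.
        for _event, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "row" and elem.get("OwnerUserId") == user_id:
                return elem.get("Id")
            elem.clear()
        return None

    print(first_post_id("posts.xml", USER_ID))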

Using dplyr for frequency counts of interactions, must include zero counts

Submitted by 蓝咒 on 2019-12-02 17:45:43
My question involves writing code using the dplyr package in R. I have a relatively large dataframe (approx. 5 million rows) with 2 columns: the first with an individual identifier ( id ), and the second with a date ( date ). At present, each row indicates the occurrence of an action (taken by the individual in the id column) on the date in the date column. There are about 300,000 unique individuals and about 2,600 unique dates. For example, the beginning of the data looks like this:

    id        date
    John12    2006-08-03
    Tom2993   2008-10-11
    Lisa825   2009-07-03
    Tom2993   2008-06-12
    Andrew13  2007-09-11

I'd like to
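
The excerpt is cut off, but the title makes the goal clear: count occurrences per id/date combination while keeping combinations that never occur as explicit zeros (in the tidyverse this is roughly what tidyr::complete is for). As a language-neutral illustration of the underlying technique (count, then reindex against the full id × date cross product and fill gaps with zero), here is a minimal pandas sketch using the example rows above; the column names and toy data are the only assumptions.

    import pandas as pd

    df = pd.DataFrame({
        "id":   ["John12", "Tom2993", "Lisa825", "Tom2993", "Andrew13"],
        "date": ["2006-08-03", "2008-10-11", "2009-07-03",
                 "2008-06-12", "2007-09-11"],
    })

    # Count observed (id, date) pairs, then reindex against the full cross
    # product of ids and dates so missing combinations appear as zeros.
    counts = df.groupby(["id", "date"]).size()
    full_index = pd.MultiIndex.from_product(
        [df["id"].unique(), df["date"].unique()], names=["id", "date"])
    counts = counts.reindex(full_index, fill_value=0).reset_index(name="n")

Note that at the scale described (300,000 ids by 2,600 dates) the full cross product is about 780 million rows, so in practice the cross product would need to be restricted, e.g. to the dates on which each id is active.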

Best of breed indexing data structures for Extremely Large time-series

Submitted by 帅比萌擦擦* on 2019-12-02 17:10:21
I'd like to ask fellow SO'ers for their opinions regarding best-of-breed data structures to be used for indexing time-series (aka column-wise data, aka flat linear). Two basic types of time-series exist, based on the sampling/discretisation characteristic:

- Regular discretisation (every sample is taken with a common frequency)
- Irregular discretisation (samples are taken at arbitrary time points)

Queries that will be required:

- All values in the time range [t0,t1]
- All values in the time range [t0,t1] that are greater/less than v0
- All values in the time range [t0,t1] that are in the value range [v0,v1
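
For the time-range queries listed above, the usual baseline is to keep the series sorted by timestamp and binary-search the two endpoints; value predicates can then be applied to the resulting slice or served by a secondary index. A minimal Python sketch of that baseline, with illustrative function and variable names (not from the original thread):

    import bisect

    def range_query(timestamps, values, t0, t1, v0=None, v1=None):
        # timestamps is sorted ascending; values is the parallel column.
        lo = bisect.bisect_left(timestamps, t0)
        hi = bisect.bisect_right(timestamps, t1)
        return [(t, v)
                for t, v in zip(timestamps[lo:hi], values[lo:hi])
                if (v0 is None or v >= v0) and (v1 is None or v <= v1)]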

Transpose long to wide in SAS

Submitted by 佐手、 on 2019-12-02 15:28:51
Question: I have a very large data set (18 million observations) that I would like to transpose by subsetting based on one variable and creating 900 new variables out of those subsets. Example code and desired output format below. Example data:

    data long1 ;
      input famid year faminc ;
      cards ;
    var1 96 40000
    var1 97 40500
    var1 98 41000
    var2 96 45000
    var2 97 45400
    var2 98 45800
    var3 96 75000
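
The desired reshape (one row per famid, one faminc column per year) is what SAS's PROC TRANSPOSE produces. As a language-neutral illustration of the same long-to-wide pivot, here is a small pandas sketch built from the example rows above; the frame and column names are purely illustrative, not from the thread.

    import pandas as pd

    long1 = pd.DataFrame({
        "famid":  ["var1", "var1", "var1", "var2", "var2", "var2"],
        "year":   [96, 97, 98, 96, 97, 98],
        "faminc": [40000, 40500, 41000, 45000, 45400, 45800],
    })

    # One row per famid, one faminc<year> column per year.
    wide = long1.pivot(index="famid", columns="year", values="faminc")
    wide.columns = [f"faminc{y}" for y in wide.columns]
    print(wide.reset_index())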

D3: How to show large dataset

Submitted by 馋奶兔 on 2019-12-02 14:43:02
I have a large dataset comprising 10^5 data points, and I'm now considering the following question related to large datasets: is there any efficient way to visualize a very large dataset? In my case I have a set of users and each user has 10^3 items, so there are 10^5 items in total. I want to show all the items for each user at a time to enable quick comparison between users. Somebody suggested using a list, but I don't think a list is the only choice when dealing with a dataset this big. Note that I want to show all the items for each user at a time. This means I want to show all the data points when clicking on a
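
One widely used way to keep a browser-side chart responsive at this scale is to pre-aggregate the data before rendering, for example keeping only each bucket's minimum and maximum so spikes survive the reduction; the reduced array is then small enough to bind to SVG elements or draw on a canvas. A minimal sketch of that downsampling idea (illustrative only, not from the thread):

    def minmax_downsample(values, n_buckets):
        # Collapse a long series into at most 2 * n_buckets points, keeping
        # each bucket's minimum and maximum so extremes remain visible.
        size = max(1, len(values) // n_buckets)
        out = []
        for i in range(0, len(values), size):
            bucket = values[i:i + size]
            out.extend((min(bucket), max(bucket)))
        return out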

How to pass a string larger than 200 characters to a stored procedure via a parameter

Submitted by 冷暖自知 on 2019-12-02 13:08:14
I got stuck on one problem: in my code I have to make a sum request over all articles present in my datatable, so I concatenate all article IDs into one string like 'a1,a2,a3', and this is supposed to work. But I have large IDs and around 150 articles, so the string I try to pass to the stored procedure is around 1300 characters, and it is truncated at 200 characters when it reaches the stored procedure. Do you know any solution for passing a large string to a stored procedure without SQL Server truncating it? I can write here the C# code or SQL stored procedure if it can help you to help

Designing an external memory sorting algorithm

Submitted by 谁说胖子不能爱 on 2019-12-02 06:08:12
Question: Suppose I have a very large list stored in external memory that needs to be sorted. Assuming this list is too large for internal memory, what major factors should be considered in designing an external sorting algorithm?

Answer 1: Before you go building your own external sort, you might look at the tools your operating system supplies. Windows has SORT.EXE, which works well enough on some text files, although it has ... idiosyncrasies. GNU sort, too, works pretty well. You could give either of those
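
The classic roll-your-own design is a k-way external merge sort: sort chunks that fit in memory, write each chunk out as a sorted run, then merge the runs. A minimal Python sketch of that idea for line-oriented records; the chunk size and file handling are illustrative assumptions, not from the original thread.

    import heapq
    import itertools
    import tempfile

    def external_sort(input_path, output_path, chunk_lines=1_000_000):
        # Pass 1: read memory-sized chunks, sort each, spill to a temp file.
        runs = []
        with open(input_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile("w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        # Pass 2: k-way merge of the sorted runs using a heap.
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))
        for run in runs:
            run.close()

The main design factors this illustrates are the run size (how much fits in memory), the number of runs that can be merged at once (open-file and buffer limits), and sequential rather than random I/O in both passes.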

MemoryError - how to download large file via Google Drive SDK using Python

Submitted by 試著忘記壹切 on 2019-12-02 03:03:46
I'm running out of memory when downloading a big file from my Google Drive. I assume that tmp = content.read(1024) does not work, but how do I fix it? Thank you.

    def download_file(service, file_id):
        drive_file = service.files().get(fileId=file_id).execute()
        download_url = drive_file.get('downloadUrl')
        title = drive_file.get('title')
        originalFilename = drive_file.get('originalFilename')
        if download_url:
            resp, content = service._http.request(download_url)
            if resp.status == 200:
                file = 'tmp.mp4'
                with open(file, 'wb') as f:
                    while True:
                        tmp = content.read(1024)
                        if not tmp:
                            break
                        f.write(tmp)
                return
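
The likely culprit is that service._http.request() (httplib2) returns the whole response body as a bytes object, so the entire file is already in memory before content.read(1024) runs (and bytes has no read method). A minimal sketch of a streaming alternative, assuming the google-api-python-client with the Drive v3 API, where files().get_media combined with MediaIoBaseDownload writes the file to disk in chunks:

    import io
    from googleapiclient.http import MediaIoBaseDownload

    def download_file_chunked(service, file_id, dest_path, chunk_size=1024 * 1024):
        # Stream the file to disk chunk by chunk instead of holding the whole
        # response body in memory.
        request = service.files().get_media(fileId=file_id)
        with io.FileIO(dest_path, 'wb') as fh:
            downloader = MediaIoBaseDownload(fh, request, chunksize=chunk_size)
            done = False
            while not done:
                _status, done = downloader.next_chunk()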