data-scrubbing

Group duplicate columns and sum the corresponding column values using pandas [duplicate]

限于喜欢 posted on 2019-12-11 06:08:43
Question: This question already has answers here: Pandas group-by and sum (6 answers). Closed last year.

I am preprocessing Apache server log data. I have 3 columns: ID, TIME, and BYTES. Example:

ID   TIME    BYTES
1    13:00   10
2    13:02   30
3    13:03   40
4    13:02   50
5    13:03   70

I want to achieve something like this:

ID   TIME    BYTES
1    13:00
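The grouping described in the question can be sketched with pandas as follows. The excerpt cuts off before the full desired output is shown, so this is only one plausible reading: rows sharing a TIME value are collapsed, BYTES is summed, and the first ID in each group is kept. The choice to keep the first ID is an assumption, not something the question confirms.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "TIME": ["13:00", "13:02", "13:03", "13:02", "13:03"],
    "BYTES": [10, 30, 40, 50, 70],
})

# Group rows that share a TIME value and sum their BYTES.
# Assumption: the first ID seen in each group is the one to keep.
result = (
    df.groupby("TIME", as_index=False, sort=False)
      .agg(ID=("ID", "first"), BYTES=("BYTES", "sum"))
)

# Restore the original column order.
result = result[["ID", "TIME", "BYTES"]]
print(result)
```

If the ID column is not needed in the output, `df.groupby("TIME", as_index=False)["BYTES"].sum()` is a shorter variant of the same idea.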

Anonymizing customer data for development or testing

狂风中的少年 posted on 2019-11-27 02:09:28
Question: I need to take production data with real customer info (names, addresses, phone numbers, etc.) and move it into a dev environment, but I'd like to remove any semblance of real customer info. Some of the answers to this question can help me generate NEW test data, but how do I then replace those columns in my production data while keeping the other relevant columns? Let's say I had a table with 10000 fake names. Should I do a cross-join with a SQL update? Or do something like UPDATE table SET
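One way to overwrite only the sensitive columns from a table of fake names is a correlated subquery in the UPDATE, sketched below with Python's built-in sqlite3 module. The table and column names (`customers`, `fake_names`, `order_total`), the sample rows, and the modulo mapping are all hypothetical and chosen for illustration. The sketch also assumes the fake-name ids run contiguously from 1 to N. It is not the statement the question was about to write, which is cut off in the excerpt.

```python
import sqlite3

# In-memory database standing in for a copy of production data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: customers holds real data, fake_names holds replacements.
cur.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, phone TEXT, order_total REAL)"
)
cur.execute("CREATE TABLE fake_names (id INTEGER PRIMARY KEY, name TEXT)")

cur.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Real Person A", "555-0101", 19.99),
        (2, "Real Person B", "555-0102", 5.00),
        (3, "Real Person C", "555-0103", 42.50),
    ],
)
cur.executemany(
    "INSERT INTO fake_names (id, name) VALUES (?, ?)",
    [(1, "Fake One"), (2, "Fake Two")],
)

# Assumption: fake_names.id values are contiguous from 1 to fake_count.
fake_count = cur.execute("SELECT COUNT(*) FROM fake_names").fetchone()[0]

# Map each customer to a fake name by id modulo the number of fake names,
# and blank out the phone number. Other columns are left untouched.
cur.execute(
    """
    UPDATE customers
    SET name = (
            SELECT f.name FROM fake_names AS f
            WHERE f.id = (customers.id % ?) + 1
        ),
        phone = '000-0000'
    """,
    (fake_count,),
)
conn.commit()

rows = cur.execute(
    "SELECT id, name, phone, order_total FROM customers ORDER BY id"
).fetchall()
print(rows)
```

The modulo mapping avoids a cross join: each customer row picks exactly one fake row, so the update stays linear in the number of customers. The mapping is deterministic, so a randomized assignment may be preferable if the original ids are themselves sensitive.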