Find top x by count from MySQL in Python?

隐身守侯 submitted on 2019-12-25 14:01:25

Question


I have a csv file like this:

nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014

I am reading the CSV file above and extracting the domain name from each email address, along with the number of email addresses per domain name and date. I need to insert all of this into a MySQL table called domains, which I am able to do successfully.

Problem statement: Now I need to use the same table to report the top 50 domains by count, sorted by percentage growth over the last 30 days compared to the total. I don't understand how to do this.

Below is the code that successfully inserts into the MySQL database. It does not yet perform the reporting task above, because I don't know how to approach it.

#!/usr/bin/python
import csv
import os
import MySQLdb

from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # get current working directory path
    currentdirpath = os.getcwd()
    filepath = os.path.join(currentdirpath, filename)
    return filepath
# ===========================================================
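def read_CSV(filepath):

    # open the path that was passed in rather than a hard-coded file name
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            # strip both fields: the sample rows have a space after the comma
            domain_counts[row[0].split('@')[1].strip()][row[1].strip()] += 1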

    db = MySQLdb.connect(host="localhost", # your host, usually localhost
                         user="root", # your username
                         passwd="abcdef1234", # your password
                         db="test") # name of the data base
    cur = db.cursor()

    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""


    # .items() works on both Python 2 and Python 3
    for domain, data in domain_counts.items():
        for email_date, email_count in data.items():
            cur.execute(q, (domain, email_count, email_date))

    db.commit()
    db.close()

# ======================= main program =======================================
path = get_file_path('emails.csv') 
read_CSV(path) # read the input file

What is the right way to do the reporting task using the domains table?

Update:

Here is my domains table:

mysql> describe domains;
+----------------+-------------+------+-----+---------+----------------+
| Field          | Type        | Null | Key | Default | Extra          |
+----------------+-------------+------+-----+---------+----------------+
| id             | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name    | varchar(20) | NO   |     | NULL    |                |
| cnt            | int(11)     | YES  |     | NULL    |                |
| date_of_entry  | date        | NO   |     | NULL    |                |
+----------------+-------------+------+-----+---------+----------------+

And here is data I have in them:

mysql> select * from domains;
+----+---------------+-----+---------------+
| id | domain_name   | cnt | date_of_entry |
+----+---------------+-----+---------------+
|  1 | wawa.com      |   2 | 2014-04-30    |
|  2 | wawa.com      |   2 | 2014-05-01    |
|  3 | wawa.com      |   3 | 2014-05-31    |
|  4 | uwaterloo.ca  |   4 | 2014-04-30    |
|  5 | uwaterloo.ca  |   3 | 2014-05-01    |
|  6 | uwaterloo.ca  |   1 | 2014-05-31    |
|  7 | anonymous.com |   2 | 2014-04-30    |
|  8 | anonymous.com |   4 | 2014-05-01    |
|  9 | anonymous.com |   8 | 2014-05-31    |
| 10 | hotmail.com   |   4 | 2014-04-30    |
| 11 | hotmail.com   |   1 | 2014-05-01    |
| 12 | hotmail.com   |   3 | 2014-05-31    |
| 13 | gmail.com     |   6 | 2014-04-30    |
| 14 | gmail.com     |   4 | 2014-05-01    |
| 15 | gmail.com     |   8 | 2014-05-31    |
+----+---------------+-----+---------------+

Answer 1:


The report you need can be produced in SQL on the MySQL side. Python then only has to run the query, fetch the result set, and print the rows.

Consider the following aggregate query, which uses a subquery and a derived table and follows this percentage growth formula:

((domain's total cnt in the last 30 days) - (domain's total cnt before the last 30 days))
 / (all domains' total cnt before the last 30 days)
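To make the formula concrete, here is a minimal Python 3 sketch that applies it to the sample rows from the question. The function name pct_growth, the window_days parameter, and the fixed reference date of 2014-06-01 are assumptions made for illustration; the SQL in this answer uses CURRENT_DATE instead.

```python
from collections import defaultdict
from datetime import date, timedelta

# (domain_name, cnt, date_of_entry) rows copied from the question's sample table.
ROWS = [
    ("wawa.com", 2, date(2014, 4, 30)),
    ("wawa.com", 2, date(2014, 5, 1)),
    ("wawa.com", 3, date(2014, 5, 31)),
    ("uwaterloo.ca", 4, date(2014, 4, 30)),
    ("uwaterloo.ca", 3, date(2014, 5, 1)),
    ("uwaterloo.ca", 1, date(2014, 5, 31)),
    ("anonymous.com", 2, date(2014, 4, 30)),
    ("anonymous.com", 4, date(2014, 5, 1)),
    ("anonymous.com", 8, date(2014, 5, 31)),
    ("hotmail.com", 4, date(2014, 4, 30)),
    ("hotmail.com", 1, date(2014, 5, 1)),
    ("hotmail.com", 3, date(2014, 5, 31)),
    ("gmail.com", 6, date(2014, 4, 30)),
    ("gmail.com", 4, date(2014, 5, 1)),
    ("gmail.com", 8, date(2014, 5, 31)),
]


def pct_growth(rows, today, window_days=30):
    """Map each domain to (recent cnt - earlier cnt) / (earlier cnt of all domains)."""
    cutoff = today - timedelta(days=window_days)
    recent = defaultdict(int)   # cnt on or after the cutoff, per domain
    earlier = defaultdict(int)  # cnt before the cutoff, per domain
    for domain, cnt, entry_date in rows:
        bucket = recent if entry_date >= cutoff else earlier
        bucket[domain] += cnt
    baseline = sum(earlier.values())
    if baseline == 0:
        return {}  # no earlier data, so growth is undefined (the SQL would give NULL)
    domains = set(recent) | set(earlier)
    return {d: (recent[d] - earlier[d]) / baseline for d in domains}


# Hypothetical "today", chosen so that only the 2014-05-31 rows fall in the window.
growth = pct_growth(ROWS, today=date(2014, 6, 1))
top = sorted(growth.items(), key=lambda kv: kv[1], reverse=True)[:50]
for domain, value in top:
    print(domain, value)
```

With these rows and that reference date, the earlier total across all domains is 32, so anonymous.com ranks first at (8 - 6) / 32 = 0.0625.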

SQL

SELECT  domain_name, pct_growth
FROM (

SELECT t1.domain_name,
        # SUM OF THIS DOMAIN'S CNT OVER THE LAST 30 DAYS
        (SUM(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY)
                  THEN t1.cnt ELSE 0 END)
         -
         # SUM OF THIS DOMAIN'S CNT BEFORE THE LAST 30 DAYS
         SUM(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY)
                  THEN t1.cnt ELSE 0 END)
        ) /
        # SUM OF ALL DOMAINS' CNT BEFORE THE LAST 30 DAYS
        (SELECT SUM(t2.cnt) FROM domains t2
          WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
         AS pct_growth

FROM domains t1
GROUP BY t1.domain_name
) As derivedTable

ORDER BY pct_growth DESC
LIMIT 50;

Python

cur = db.cursor()  # db is an open MySQLdb connection, as in the question
sql = "SELECT * FROM ..."  # SEE ABOVE

cur.execute(sql)

for row in cur.fetchall():
    print(row)
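For readers who want to try the query-then-fetch pattern without a MySQL server, the following is a rough, self-contained sketch that swaps in Python's built-in sqlite3 module as a stand-in. This is not the MySQLdb setup from the question: the helper names (demo_connection, top_domains), the explicit cutoff parameter, and the four-row subset of the sample data are all assumptions for illustration. SQLite has no CURRENT_DATE - INTERVAL syntax, so the cutoff date is passed in as an ISO string, and the `* 1.0` forces floating-point division, which SQLite would otherwise do as integer division.

```python
import sqlite3

# Same shape as the MySQL query above, adapted to SQLite (assumption: ISO date strings).
REPORT_SQL = """
SELECT domain_name,
       (SUM(CASE WHEN date_of_entry >= :cutoff THEN cnt ELSE 0 END)
        - SUM(CASE WHEN date_of_entry < :cutoff THEN cnt ELSE 0 END)) * 1.0
       / (SELECT SUM(cnt) FROM domains WHERE date_of_entry < :cutoff) AS pct_growth
FROM domains
GROUP BY domain_name
ORDER BY pct_growth DESC
LIMIT :limit
"""


def demo_connection():
    """Build an in-memory table holding a small subset of the question's rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE domains ("
        "id INTEGER PRIMARY KEY, domain_name TEXT NOT NULL, "
        "cnt INTEGER, date_of_entry TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO domains (domain_name, cnt, date_of_entry) VALUES (?, ?, ?)",
        [
            ("uwaterloo.ca", 4, "2014-04-30"),
            ("uwaterloo.ca", 1, "2014-05-31"),
            ("anonymous.com", 2, "2014-04-30"),
            ("anonymous.com", 8, "2014-05-31"),
        ],
    )
    conn.commit()
    return conn


def top_domains(conn, cutoff, limit=50):
    """Run the report and return (domain_name, pct_growth) rows, best first."""
    cur = conn.cursor()
    cur.execute(REPORT_SQL, {"cutoff": cutoff, "limit": limit})
    return cur.fetchall()


conn = demo_connection()
for row in top_domains(conn, cutoff="2014-05-02"):
    print(row)
```

Against MySQL you would keep the original query text and connection from the question. Note that MySQLdb uses %s-style placeholders rather than sqlite3's :name style.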



Answer 2:


If I understand correctly, you just need the ratio of the count for the past thirty days to the total count. You can get this using conditional aggregation. So, assuming that cnt is always greater than 0 (so that the divisor sum(cnt) is never zero):

select d.domain_name,
       sum(cnt) as CntTotal,
       sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) as Cnt30Days,
       (sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;
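As a rough cross-check of what the conditional aggregation computes, here is a small Python 3 sketch of the same ratio on two domains from the question's sample data. The function name ratio_30_days and the reference date 2014-06-01 are hypothetical, and the sketch uses a fixed 30-day window, whereas the query above uses interval 1 month relative to now().

```python
from datetime import date, timedelta

# (domain_name, cnt, date_of_entry) rows for two domains from the question's table.
SAMPLE = [
    ("hotmail.com", 4, date(2014, 4, 30)),
    ("hotmail.com", 1, date(2014, 5, 1)),
    ("hotmail.com", 3, date(2014, 5, 31)),
    ("gmail.com", 6, date(2014, 4, 30)),
    ("gmail.com", 4, date(2014, 5, 1)),
    ("gmail.com", 8, date(2014, 5, 31)),
]


def ratio_30_days(rows, today, window_days=30):
    """Return (domain, total cnt, recent cnt, recent / total), highest ratio first."""
    cutoff = today - timedelta(days=window_days)
    totals, recent = {}, {}
    for domain, cnt, entry_date in rows:
        totals[domain] = totals.get(domain, 0) + cnt
        if entry_date >= cutoff:  # plays the role of the CASE WHEN ... branch
            recent[domain] = recent.get(domain, 0) + cnt
    report = [
        (d, total, recent.get(d, 0), recent.get(d, 0) / total)
        for d, total in totals.items()
        if total > 0  # skip zero totals to avoid dividing by zero
    ]
    return sorted(report, key=lambda r: r[3], reverse=True)


for line in ratio_30_days(SAMPLE, today=date(2014, 6, 1)):
    print(line)
```

Here gmail.com sorts ahead of hotmail.com because 8 of its 18 addresses fall in the window, against 3 of 8 for hotmail.com.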


Source: https://stackoverflow.com/questions/33196264/find-top-x-by-count-from-mysql-in-python
