Data source

The dataset can be downloaded from many places online, so I won't upload it here.

Difficulties encountered

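a) Assigning values into a DataFrame can raise a warning (not an error, but it is better not to ignore it):
'SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead'
Solutions:
1. Select the rows and the column in a single .loc call instead of chaining two indexing operations: data.loc[data.bidder == 'x', 'y'] = 100
2. Call copy() to state explicitly that we are working on a copy rather than modifying the original data.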
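The two fixes above can be sketched on a toy table. This is a minimal illustration, not the code from this post: the DataFrame contents and the `price` and `flag` column names are made up for the example.

```python
import pandas as pd

# Hypothetical auction data; the column names are invented for illustration.
data = pd.DataFrame({"bidder": ["x", "y", "x"], "price": [10, 20, 30]})

# Fix 1: choose the rows and the column in one .loc call, so the
# assignment is applied to the original DataFrame itself.
data.loc[data["bidder"] == "x", "price"] = 100

# Fix 2: take an explicit copy of the filtered rows before adding a column,
# so it is clear that only the copy is modified.
subset = data[data["bidder"] == "y"].copy()
subset["flag"] = True

print(data)
print(subset)
```

After this runs, `data` has the updated prices but no `flag` column, because the new column exists only on the copy.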

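b) I rushed through the pandas basics when I first read them, so the core concepts of DataFrame and Series, and many of their operations, are still unclear to me. I plan to strengthen this area through more practice. Purely conceptual material feels too abstract to me and is hard to read.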

Code:

# Print the 10 most frequent time zones in the dataset
import json
from collections import Counter

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


# Build a dict mapping each time zone to the number of times it occurs
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts


# Return the n most frequent (count, tz) pairs, in ascending order of count
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]


# Load the data: each line is a JSON record, parsed into a Python dict
path = '../datasets/example.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Top 10 using the hand-written functions
counts = get_counts(time_zones)
print(top_counts(counts))

# Top 10 using the Counter class from collections
counts = Counter(time_zones)
print(counts.most_common(10))

# Top 10 using pandas: value_counts() sorts by frequency, descending
frame = pd.DataFrame(records)
print(frame['tz'].value_counts()[:10])

# Visualization
# Clean the data: label missing and empty time zones
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)
plt.show()
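To check the counting and cleaning steps without the real file, here is a small sketch on hand-made records. The records are invented for illustration, and `Series.replace` is used here in place of the boolean-mask assignment; both relabel empty strings.

```python
from collections import Counter

import pandas as pd

# Made-up records: one has an empty tz, one has no tz field at all.
records = [
    {"tz": "America/New_York"},
    {"tz": "Asia/Tokyo"},
    {"tz": ""},
    {"a": "record without a tz field"},
    {"tz": "America/New_York"},
]

# Counter only sees records that actually contain the key.
time_zones = [rec["tz"] for rec in records if "tz" in rec]
most_common = Counter(time_zones).most_common(1)

# In the DataFrame, the missing key becomes NaN, which fillna() labels.
frame = pd.DataFrame(records)
clean_tz = frame["tz"].fillna("Missing").replace("", "Unknown")
tz_counts = clean_tz.value_counts()

print(most_common)
print(tz_counts)
```

The record without a `tz` field is silently skipped by the list comprehension but shows up as "Missing" in the pandas version, which is why the two approaches can report different totals.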

exp2

import json

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

path = '../datasets/example.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]
frame = pd.DataFrame(records)

# frame.a.dropna() drops records without an agent string;
# the first whitespace-separated token of each string is kept
results = pd.Series([x.split()[0] for x in frame.a.dropna()])

# copy() is needed here to create an explicit copy; otherwise the
# assignment below raises:
#   SettingWithCopyWarning:
#   A value is trying to be set on a copy of a slice from a DataFrame.
#   Try using .loc[row_indexer,col_indexer] = value instead
# Solutions:
# 1. Use a single .loc call: data.loc[data.bidder == 'x', 'y'] = 100
# 2. Use copy() to state explicitly that we are working on a copy,
#    not modifying the original data
clean_frame = frame[frame.a.notnull()].copy()
clean_frame['os'] = np.where(clean_frame['a'].str.contains('Windows'),
                             'Windows', 'Not Windows')

by_tz_os = clean_frame.groupby(['tz', 'os'])
# Reshape the group sizes into a table: one row per tz, one column per os
agg_counts = by_tz_os.size().unstack().fillna(0)
# Sort by row total and get the corresponding positions
indexer = agg_counts.sum(axis=1).argsort()

# Keep the 10 time zones with the largest totals
count_subset = agg_counts.take(indexer[-10:])
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
sns.barplot(y='tz', x='total', hue='os', data=count_subset)
plt.show()
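The reshaping steps are easier to follow on a tiny table. The sketch below uses made-up time zone labels and agent strings, and sorts with a plain NumPy array (`to_numpy().argsort()`) rather than the Series-based `argsort()` used above; treat it as an illustration of the idea, not a replacement for that code.

```python
import numpy as np
import pandas as pd

# Invented sample: tz labels and agent strings are for illustration only.
frame = pd.DataFrame({
    "tz": ["A", "A", "B", "B", "B", "C", "D"],
    "a": [
        "Mozilla (Windows NT)", "Mozilla (Macintosh)",
        "Mozilla (Windows NT)", "Mozilla (Windows NT)", "Opera (X11)",
        "Mozilla (Windows NT)", None,
    ],
})

# Drop rows without an agent string, then copy before adding a column.
clean = frame[frame["a"].notnull()].copy()
clean["os"] = np.where(clean["a"].str.contains("Windows"),
                       "Windows", "Not Windows")

# size() counts rows per (tz, os); unstack() moves os into columns;
# fillna(0) fills combinations that never occurred.
agg_counts = clean.groupby(["tz", "os"]).size().unstack().fillna(0)

# Positions of the rows ordered by total count, ascending.
order = agg_counts.sum(axis=1).to_numpy().argsort()

# Take the two rows with the largest totals.
top = agg_counts.take(order[-2:])
print(agg_counts)
print(top)
```

Time zone "C" has no "Not Windows" rows, so `unstack()` leaves a NaN there, and that is the gap `fillna(0)` fills.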

Summary

1. Learned how to resolve SettingWithCopyWarning
2. Drew bar charts with seaborn
3. Practiced several numpy and pandas data-processing functions

Source: https://blog.csdn.net/qq_38953503/article/details/88694068
