Data source
The dataset is widely available for download online, so I won't upload it here.
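Difficulties encountered
a) Assigning values to a DataFrame can trigger a warning (not an error, but best not ignored):
'SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead'
Solutions:
1. Replace chained indexing with a single .loc call: data.loc[data.bidder == 'x', y] = 100
2. Use copy() to tell pandas explicitly that we are working on a copy rather than modifying the original data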
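A minimal sketch of both fixes on a small made-up DataFrame. The column names `bidder` and `y` follow the snippet above, and the data itself is hypothetical:

```python
import pandas as pd

# Hypothetical data, only for illustration
data = pd.DataFrame({"bidder": ["x", "z", "x"], "y": [1, 2, 3]})

# Fix 1: select rows and column in one .loc call, then assign
data.loc[data["bidder"] == "x", "y"] = 100

# Fix 2: take an explicit copy of the filtered rows before modifying it
subset = data[data["bidder"] == "z"].copy()
subset["y"] = -1  # changes only the copy; `data` is left untouched
```

With fix 1 the assignment goes straight into `data`; with fix 2 it goes only into `subset`, which is usually what you want when building a cleaned-up working frame.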
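b) I read through the pandas basics too hastily: I am still unclear on the core concepts of DataFrame and Series and on many of their operations. I will shore this up with more practice. The conceptual material feels too abstract to me and is hard to read.
Code: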
# Print the 10 most frequent time zones in the data source
import json
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


# Build a dict mapping each time zone to the number of times it occurs
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts


# Return the top n (count, time zone) pairs, in ascending order of count
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]


# Load the data source, parsing each JSON line into a Python dict
path = '../datasets/example.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

# Top 10 using the hand-written functions
counts = get_counts(time_zones)
print(top_counts(counts))

# Top 10 using collections.Counter
counts = Counter(time_zones)
print(counts.most_common(10))

# Top 10 using a DataFrame
frame = pd.DataFrame(records)
print(frame['tz'].value_counts()[:10])

# Visualization
# Clean the data: label missing and empty time zones
clean_tz = frame['tz'].fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
tz_counts = clean_tz.value_counts()
subset = tz_counts[:10]
sns.barplot(y=subset.index, x=subset.values)
plt.show()
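To check that the hand-written counting and `Counter` agree, here is a small sketch that runs both on a made-up list of time zone strings. The sample values are hypothetical and not taken from the real dataset:

```python
from collections import Counter

# Hypothetical sample, only for illustration
sample = [
    "America/New_York", "Europe/London", "America/New_York", "",
    "Asia/Tokyo", "America/New_York", "Europe/London",
]


def get_counts(sequence):
    # Same idea as the function above, written with dict.get
    counts = {}
    for x in sequence:
        counts[x] = counts.get(x, 0) + 1
    return counts


manual = get_counts(sample)
# Sort (time zone, count) pairs by count, largest first, and keep two
top2_manual = sorted(manual.items(), key=lambda kv: kv[1], reverse=True)[:2]
top2_counter = Counter(sample).most_common(2)
```

Both approaches give the same two leading entries here; `Counter` simply saves writing the counting and sorting by hand.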
Example 2
import json

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

path = '../datasets/example.txt'
with open(path) as f:
    records = [json.loads(line) for line in f]
frame = pd.DataFrame(records)

# frame.a.dropna() drops missing values from column 'a';
# keep the first whitespace-separated token of each remaining value
results = pd.Series([x.split()[0] for x in frame.a.dropna()])

# Use copy() here to work on an explicit copy; otherwise the assignment below triggers
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame.
# Try using .loc[row_indexer,col_indexer] = value instead
# Solutions:
# 1. Replace chained indexing with a single .loc call: data.loc[data.bidder == 'x', y] = 100
# 2. Use copy() to state explicitly that we are modifying a copy, not the original data
clean_frame = frame[frame.a.notnull()].copy()
clean_frame['os'] = np.where(clean_frame['a'].str.contains('Windows'), 'Windows', 'Not Windows')
by_tz_os = clean_frame.groupby(['tz', 'os'])

# Reshape: one row per time zone, one column per OS, missing combinations filled with 0
agg_counts = by_tz_os.size().unstack().fillna(0)

# Sort by total count and get the corresponding positional indices
indexer = agg_counts.sum(axis=1).argsort()
count_subset = agg_counts.take(indexer[-10:])
count_subset = count_subset.stack()
count_subset.name = 'total'
count_subset = count_subset.reset_index()
sns.barplot(y='tz', x='total', hue='os', data=count_subset)
plt.show()
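The reshape-and-sort step is easier to follow on a tiny table. The sketch below uses made-up records (the `tz` and `a` values are hypothetical) and keeps the top 2 time zones instead of the top 10:

```python
import numpy as np
import pandas as pd

# Hypothetical records, only for illustration
df = pd.DataFrame({
    "tz": ["A", "A", "B", "B", "B", "C"],
    "a": [
        "Mozilla (Windows)", "Mozilla (Mac)", "Mozilla (Windows)",
        "Mozilla (Windows)", "Mozilla (Linux)", "Mozilla (Mac)",
    ],
})
df["os"] = np.where(df["a"].str.contains("Windows"), "Windows", "Not Windows")

# Rows: time zones; columns: OS labels; combinations that never occur become 0
agg = df.groupby(["tz", "os"]).size().unstack().fillna(0)

# Positions that would sort the row totals in ascending order
order = agg.sum(axis=1).to_numpy().argsort()

# The last two positions are the two time zones with the largest totals
top2 = agg.take(order[-2:])
```

Here `argsort` returns positions rather than labels, which is why the result is passed to `take` and not to `loc`.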
Summary
1. Learned how to resolve SettingWithCopyWarning
2. Drew bar charts with seaborn
3. Practiced some numpy and pandas data-processing functions
Source: https://blog.csdn.net/qq_38953503/article/details/88694068