Save the “Out[]” table of a pandas dataframe as a figure

五迷三道 提交于 2019-11-28 10:49:04

Here is a somewhat hackish solution but it gets the job done. You wanted a .pdf but you get a bonus .png. :)

import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

from PySide.QtGui import QImage
from PySide.QtGui import QPainter
from PySide.QtCore import QSize
from PySide.QtWebKit import QWebPage

arrays = [np.hstack([ ['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
df =pd.DataFrame(np.zeros((3,6)),columns=columns,index=pd.date_range('20000103',periods=3))

h = "<!DOCTYPE html> <html> <body> <p> " + df.to_html() + " </p> </body> </html>";
page = QWebPage()
page.setViewportSize(QSize(5000,5000))

frame = page.mainFrame()
frame.setHtml(h, "text/html")

img = QImage(1000,700, QImage.Format(5))
painter = QPainter(img)
frame.render(painter)
painter.end()
a = img.save("html.png")

pp = PdfPages('html.pdf')
fig = plt.figure(figsize=(8,6),dpi=1080) 
ax = fig.add_subplot(1, 1, 1)
img2 = plt.imread("html.png")
plt.axis('off')
ax.imshow(img2)
pp.savefig()
pp.close()

Edits welcome.

It is, I believe, an HTML table that your IDE is rendering. This is what ipython notebook does.

You can get a handle to it thusly:

from IPython.display import HTML
import pandas as pd
data = pd.DataFrame({'spam':['ham','green','five',0,'kitties'],
                     'eggs':[0,1,2,3,4]})
h = HTML(data.to_html())
h

and save to an HTML file:

my_file = open('some_file.html', 'w')
my_file.write(h.data)
my_file.close()
J Richard Snape

I think what is needed here is a consistent way of outputting a table to a pdf file amongst graphs output to pdf.

My first thought is not to use the matplotlib backend i.e.

from matplotlib.backends.backend_pdf import PdfPages

because it seemed somewhat limited in formatting options and leaned towards formatting the table as an image (thus rendering the text of the table in a non-selectable format)

If you want to mix dataframe output and matplotlib plots in a pdf without using the matplotlib pdf backend, I can think of two ways.

  1. Generate your pdf of matplotlib figures as before and then insert pages containing the dataframe table afterwards. I view this as a difficult option.
  2. Use a different library to generate the pdf. I illustrate one option to do this below.

First, install xhtml2pdf library. This seems a little patchily supported, but is active on Github and has some basic usage documentation here. You can install it via pip i.e. pip install xhtml2pdf

Once you've done that, here is a barebones example embedding a matplotlib figure, then the table (all text selectable), then another figure. You can play around with CSS etc to alter the formatting to your exact specifications, but I think this fulfils the brief:

from xhtml2pdf import pisa             # this is the module that will do the work
import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

# Utility function
def convertHtmlToPdf(sourceHtml, outputFilename):
    # open output file for writing (truncated binary)
    resultFile = open(outputFilename, "w+b")

    # convert HTML to PDF
    pisaStatus = pisa.CreatePDF(
            sourceHtml,                # the HTML to convert
            dest=resultFile,           # file handle to recieve result
            path='.')                  # this path is needed so relative paths for 
                                       # temporary image sources work

    # close output file
    resultFile.close()                 # close output file

    # return True on success and False on errors
    return pisaStatus.err

# Main program
if __name__=='__main__':   

    arrays = [np.hstack([ ['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
    columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
    df = pd.DataFrame(np.zeros((3,6)),columns=columns,index=pd.date_range('20000103',periods=3))

    # Define your data
    sourceHtml = '<html><head>'         
    # add some table CSS in head
    sourceHtml += '''<style>
                     table, td, th {
                           border-style: double;
                           border-width: 3px;
                     }

                     td,th {
                           padding: 5px;
                     }
                     </style>'''
    sourceHtml += '</head><body>'
    #Add a matplotlib figure(s)
    plt.plot(range(20))
    plt.savefig('tmp1.jpg')
    sourceHtml += '\n<p><img src="tmp1.jpg"></p>'

    # Add the dataframe
    sourceHtml += '\n<p>' + df.to_html() + '</p>'

    #Add another matplotlib figure(s)
    plt.plot(range(70,100))
    plt.savefig('tmp2.jpg')
    sourceHtml += '\n<p><img src="tmp2.jpg"></p>'

    sourceHtml += '</body></html>'
    outputFilename = 'test.pdf'

    convertHtmlToPdf(sourceHtml, outputFilename)

Note There seems to be a bug in xhtml2pdf at the time of writing which means that some CSS is not respected. Particularly pertinent to this question is that it seems impossible to get double borders around the table


EDIT

In response comments, it became obvious that some users (well, at least @Keith who both answered and awarded a bounty!) want the table selectable, but definitely on a matplotlib axis. This is somewhat more in keeping with the original method. Hence - here is a method using the pdf backend for matplotlib and matplotlib objects only. I do not think the table looks as good - in particular the display of hierarchical column headers, but that's a matter of choice, I guess. I'm indebted to this answer and comments for the way to format axes for table display.

import numpy as np
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

# Main program
if __name__=='__main__':   
    pp = PdfPages('Output.pdf')
    arrays = [np.hstack([ ['one']*3, ['two']*3]), ['Dog', 'Bird', 'Cat']*2]
    columns = pd.MultiIndex.from_arrays(arrays, names=['foo', 'bar'])
    df =pd.DataFrame(np.zeros((3,6)),columns=columns,index=pd.date_range('20000103',periods=3))

    plt.plot(range(20))
    pp.savefig()
    plt.close()

    # Calculate some sizes for formatting - constants are arbitrary - play around
    nrows, ncols = len(df)+1, len(df.columns) + 10
    hcell, wcell = 0.3, 1.
    hpad, wpad = 0, 0   

    #put the table on a correctly sized figure    
    fig=plt.figure(figsize=(ncols*wcell+wpad, nrows*hcell+hpad))
    plt.gca().axis('off')
    matplotlib_tab = pd.tools.plotting.table(plt.gca(),df, loc='center')    
    pp.savefig()
    plt.close()

    #Add another matplotlib figure(s)
    plt.plot(range(70,100))
    pp.savefig()
    plt.close()

    pp.close()
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!