Converting XML to Pandas

Submitted by 旧时模样 on 2021-02-11 14:21:17

Question


Is there a way to convert an XML file (financial statements from the IB API) to pandas without knowing the exact column headers? Each row should correspond to a date (there are four or more data points per column). It would also be great to get the balance sheet, income statement, and cash flow statement separately. I have tried Beautiful Soup but am getting frustrated, because it seems I need to look up each column header specifically, and I don't know how to get the data for each date.

I'm trying to get three separate DataFrames (one for each financial statement). Sorry, I don't know how to add a table here, but they should look something like this.

df1 (income statement):

|---------------------|------------------|------------------|
|      Date           |     SREV         |  VDES            |
|---------------------|------------------|------------------|
|   2018-09-29        | 265595.000000    | 12.208930        |
|---------------------|------------------|------------------|
|  .....              |   ......         | .....            |
|---------------------|------------------|------------------|

The example is taken from this segment:

    <Statement Type="INC">
        <FPHeader>
            <PeriodLength>52</PeriodLength>
            <periodType Code="W">Weeks</periodType>
            <UpdateType Code="UPD">Updated Normal</UpdateType>
            <AccountingStd/>
            <StatementDate>2018-09-29</StatementDate>
            <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
            <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
            <Source Date="2018-11-05">10-K</Source>
        </FPHeader>
        <lineItem coaCode="SREV">265595.000000</lineItem>
    (...)
        <lineItem coaCode="VDES">12.208930</lineItem>

This is the XML file (about half due to character limit):

<?xml version="1.0" encoding="utf-8"?>
<ReportFinancialStatements Major="1" Minor="0" Revision="1">
    <CoIDs>
        <CoID Type="RepNo">05680</CoID>
        <CoID Type="CompanyName">Apple Inc.</CoID>
        <CoID Type="IRSNo">942404110</CoID>
        <CoID Type="CIKNo">0000320193</CoID>
    </CoIDs>
    <Issues>
        <Issue Desc="Common Stock" ID="1" Order="1" Type="C">
            <IssueID Type="Name">Ordinary Shares</IssueID>
            <IssueID Type="Ticker">AAPL</IssueID>
            <IssueID Type="RIC">AAPL.O</IssueID>
            <IssueID Type="DisplayRIC">AAPL.OQ</IssueID>
            <IssueID Type="InstrumentPI">331724</IssueID>
            <IssueID Type="QuotePI">7645713</IssueID>
            <Exchange Code="NASD" Country="USA">NASDAQ</Exchange>
            <MostRecentSplit Date="2014-06-09">7.0</MostRecentSplit>
        </Issue>
    </Issues>
    <CoGeneralInfo>
        <CoStatus Code="1">Active</CoStatus>
        <CoType Code="EQU">Equity Issue</CoType>
        <LastModified>2020-01-23</LastModified>
        <LatestAvailableAnnual>2019-09-28</LatestAvailableAnnual>
        <LatestAvailableInterim>2019-09-28</LatestAvailableInterim>
        <ReportingCurrency Code="USD">U.S. Dollars</ReportingCurrency>
        <MostRecentExchange Date="2020-01-22">1.0</MostRecentExchange>
    </CoGeneralInfo>
    <StatementInfo>
        <COAType Code="IND">Industry</COAType>
        <BalanceSheetDisplay Code="CUR">Differentiates</BalanceSheetDisplay>
        <CashFlowMethod Code="IND">Indirect</CashFlowMethod>
    </StatementInfo>
    <Notes>
        <CFAAvailability Code="1"/>
        <IAvailability Code="1"/>
        <ISIAvailability Code="1"/>
        <BSIAvailability Code="1"/>
        <CFIAvailability Code="1"/>
    </Notes>
    <FinancialStatements>
        <COAMap>
            <mapItem coaItem="SREV" lineID="100" precision="1" statementType="INC">Revenue</mapItem>
(...)
            <mapItem coaItem="SCTP" lineID="1050" precision="1" statementType="CAS">Cash Taxes Paid</mapItem>
        </COAMap>
        <AnnualPeriods>
            <FiscalPeriod EndDate="2019-09-28" FiscalYear="2019" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">260174.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">11.885790</lineItem>
                </Statement>
                <Statement Type="BAL">
                    <FPHeader>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ACSH">12204.000000</lineItem>
(...)
                    <lineItem coaCode="STBP">20.365340</lineItem>
                </Statement>
                <Statement Type="CAS">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2019-09-28</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2019-10-31">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ONET">55256.000000</lineItem>
(...)
                    <lineItem coaCode="SNCC">24311.000000</lineItem>
                </Statement>
            </FiscalPeriod>
            <FiscalPeriod EndDate="2018-09-29" FiscalYear="2018" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2018-09-29</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2018-11-05">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">265595.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">12.208930</lineItem>
                </Statement>
                <Statement Type="BAL">
                    <FPHeader>
                        <UpdateType Code="CLA">Reclassified Normal</UpdateType>
                        <StatementDate>2018-12-29</StatementDate>
                        <Source Date="2019-01-30">10-Q</Source>
                    </FPHeader>
                    <lineItem coaCode="ACSH">11575.000000</lineItem>
(...)
                    <lineItem coaCode="STBP">22.533610</lineItem>
                </Statement>
                <Statement Type="CAS">
                    <FPHeader>
                        <PeriodLength>52</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <StatementDate>2018-09-29</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2018-11-05">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="ONET">59531.000000</lineItem>
(...)
                    <lineItem coaCode="SNCC">5624.000000</lineItem>
                </Statement>
            </FiscalPeriod>
            <FiscalPeriod EndDate="2017-09-30" FiscalYear="2017" Type="Annual">
                <Statement Type="INC">
                    <FPHeader>
                        <PeriodLength>53</PeriodLength>
                        <periodType Code="W">Weeks</periodType>
                        <UpdateType Code="UPD">Updated Normal</UpdateType>
                        <AccountingStd/>
                        <StatementDate>2017-09-30</StatementDate>
                        <AuditorName Code="EY">Ernst &amp; Young LLP</AuditorName>
                        <AuditorOpinion Code="UNQ">Unqualified</AuditorOpinion>
                        <Source Date="2017-11-03">10-K</Source>
                    </FPHeader>
                    <lineItem coaCode="SREV">229234.000000</lineItem>
(...)
                    <lineItem coaCode="VDES">9.206750</lineItem>
                </Statement>
                <Statement Type="BAL">

Answer 1:


I know it's not pretty, but this works:

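from ib_insync import IB, Stock
from bs4 import BeautifulSoup as bs
import pandas as pd

# Connect to a running TWS or IB Gateway session on this host and port.
ib = IB()
ib.connect('127.0.0.1', 7497, clientId=1)

security = Stock('AAPL', 'SMART', 'USD')

# Request the fundamentals; the report comes back as an XML string.
fundamentals = ib.reqFundamentalData(security, reportType='ReportsFinStatements')

# The 'xml' parser of Beautiful Soup requires lxml to be installed.
soup = bs(fundamentals, 'xml')

# One list of row dictionaries per statement type.
bal_l = []
inc_l = []
cas_l = []

for period in soup.find_all('FiscalPeriod'):
    # As written, this keeps only the non-annual periods;
    # change != to == to collect the annual statements instead.
    if period.get('Type') != "Annual":
        for statement in period.find_all('Statement'):
            # Skip reclassified statements (UpdateType Code="CLA").
            if statement.find('UpdateType').get('Code') != 'CLA':
                dic = {}

                t = statement.get('Type')
                d = statement.find('Source').get('Date')
                d1 = statement.find('StatementDate').text
                dic['date'] = d
                dic['StatementDate'] = d1

                # Each lineItem becomes a column named after its coaCode,
                # so the column headers do not have to be known in advance.
                for item in statement.find_all('lineItem'):
                    dic[item.get('coaCode')] = item.text

                if t == 'BAL':
                    bal_l.append(dic)
                    print(t, d, dic)
                elif t == 'INC':
                    inc_l.append(dic)
                elif t == 'CAS':
                    cas_l.append(dic)

# inc_l and cas_l can be turned into DataFrames the same way.
# Note that the values are still strings at this point.
balancesheet = pd.DataFrame(bal_l).sort_values('date')

with pd.option_context('display.max_rows', 1000, 'display.max_columns', None):
    print(balancesheet)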
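
For trying the same idea without an IB connection, here is a minimal sketch that builds one dictionary per Statement, keyed by coaCode, from a small inline sample. It swaps Beautiful Soup for the standard library's xml.etree.ElementTree, and it makes several assumptions: the SAMPLE string is a hypothetical, heavily shortened copy of the report in the question, the names statement_rows and to_frames are made up for illustration, and every lineItem is assumed to hold a numeric value. It is not meant as a definitive implementation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily shortened sample in the same shape as the report above.
SAMPLE = """<ReportFinancialStatements>
  <FinancialStatements>
    <AnnualPeriods>
      <FiscalPeriod EndDate="2019-09-28" FiscalYear="2019" Type="Annual">
        <Statement Type="INC">
          <FPHeader><StatementDate>2019-09-28</StatementDate></FPHeader>
          <lineItem coaCode="SREV">260174.000000</lineItem>
          <lineItem coaCode="VDES">11.885790</lineItem>
        </Statement>
        <Statement Type="BAL">
          <FPHeader><StatementDate>2019-09-28</StatementDate></FPHeader>
          <lineItem coaCode="ACSH">12204.000000</lineItem>
        </Statement>
      </FiscalPeriod>
      <FiscalPeriod EndDate="2018-09-29" FiscalYear="2018" Type="Annual">
        <Statement Type="INC">
          <FPHeader><StatementDate>2018-09-29</StatementDate></FPHeader>
          <lineItem coaCode="SREV">265595.000000</lineItem>
          <lineItem coaCode="VDES">12.208930</lineItem>
        </Statement>
      </FiscalPeriod>
    </AnnualPeriods>
  </FinancialStatements>
</ReportFinancialStatements>"""


def statement_rows(xml_text, period_type="Annual"):
    """Return {statement type: [row dict, ...]} for the chosen period type."""
    root = ET.fromstring(xml_text)
    rows = defaultdict(list)
    for period in root.iter("FiscalPeriod"):
        if period.get("Type") != period_type:
            continue
        for statement in period.iter("Statement"):
            row = {"Date": statement.findtext("FPHeader/StatementDate")}
            # Column names come from the data itself, so none are hard-coded.
            for item in statement.iter("lineItem"):
                # Assumption: every lineItem contains a numeric value.
                row[item.get("coaCode")] = float(item.text)
            rows[statement.get("Type")].append(row)
    return dict(rows)


def to_frames(rows):
    """Build one DataFrame per statement type, indexed and sorted by date."""
    import pandas as pd  # imported here so the parsing step runs without pandas

    frames = {}
    for kind, records in rows.items():
        frame = pd.DataFrame(records)
        frame["Date"] = pd.to_datetime(frame["Date"])
        frames[kind] = frame.set_index("Date").sort_index()
    return frames


rows = statement_rows(SAMPLE)
```

With pandas installed, to_frames(rows)["INC"] should give a frame with one row per statement date and the columns SREV and VDES, which is the layout asked for in the question. Keeping the parsing step separate from the DataFrame step makes it easy to inspect the raw row dictionaries first.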


Source: https://stackoverflow.com/questions/59911686/converting-xml-to-pandas
