Converting 900 MB .csv into ROOT (CERN) TTree

Posted by 点点圈 on 2019-12-08 09:23:57

Question


I am new to programming and ROOT (CERN), so go easy on me. Simply, I want to convert a ~900 MB (11M lines x 10 columns) .csv file into a nicely organized .root TTree. Could someone provide the best way to go about this?

Here is an example line of data with headers (it's 2010 US census block population and population density data):

"Census County Code","Census Tract Code","Census Block Code","County/State","Block Centroid Latitude (degrees)","Block Centroid W Longitude (degrees)","Block Land Area (sq mi)","Block Land Area (sq km)","Block Population","Block Population Density (people/sq km)"

1001,201,1000,Autauga AL,32.469683,-86.480959,0.186343,0.482626154,61,126.3918241

I've pasted what I've written so far below.

I particularly can't figure out this error when running: "C:41:1: error: unknown type name 'UScsvToRoot'".

This may be really really stupid, but how do you read in strings in ROOT (for reading in the County/State name)? Like what is the data type? Do I just have to use char’s? I’m blanking.

#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TNtuple.h"
#include "TSystem.h"

void UScsvToRoot() {

   TString dir = gSystem->UnixPathName(__FILE__);
   dir.ReplaceAll("UScsvToRoot.C","");
   dir.ReplaceAll("/./","/");
   ifstream in;
   in.open(Form("%sUSPopDens.csv",dir.Data()));

   Int_t countyCode,tractCode,blockCode;
   // how to import County/State string?
   Float_t lat,long,areaMi,areaKm,pop,popDens;
   Int_t nlines = 0;
   TFile *f = new TFile("USPopDens.root","RECREATE");
   TNtuple *ntuple = new TNtuple("ntuple","data from csv file","countyCode:tractCode:blockCode:countyState:lat:long:areaMi:areaKm:pop:popDens");

   while (1) {
      in >> countyCode >> tractCode >> blockCode >> countyState >> lat >> long >> areaMi >> areaKm >> pop >> popDens;
      if (!in.good()) break;
      ntuple->Fill(countyCode,tractCode,blockCode,countyState,lat,long,areaMi,areaKm,pop,popDens);
      nlines++;
   }

   in.close();

   f->Write();
}

Answer 1:


OK, I am going to give this a shot, but a few comments up front:

For questions on ROOT, you should strongly consider going to the ROOT homepage and from there to the forum. While Stack Overflow is an excellent source of information, questions specific to the ROOT framework are better suited to the ROOT forum.

If you are new to ROOT, you should take a look at the tutorial page; it has many examples of how to use the various features of ROOT.

You should also make use of the ROOT reference guide, which has documentation on all ROOT classes.

To your code: if you look at the documentation for the class TNtuple that you are using, you will see that the description plainly says:

A simple tree restricted to a list of float variables only.

So trying to store a string in a TNtuple will not work. You need to use the more general class TTree for that.

To read your file and store the information in a tree, you have two options. The first is to define the branches manually and then fill the tree as you loop over the file:

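#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TTree.h"
#include "TSystem.h"
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <string>

void UScsvToRoot() {
   // Look for the CSV file in the same directory as this macro
   TString dir = gSystem->UnixPathName(__FILE__);
   dir.ReplaceAll("UScsvToRoot.C","");
   dir.ReplaceAll("/./","/");
   std::ifstream in;
   in.open(Form("%sUSPopDens.csv",dir.Data()));
   if (!in.is_open()) {
      std::cerr << "could not open USPopDens.csv" << std::endl;
      return;
   }

   Int_t countyCode,tractCode,blockCode;
   char countyState[1024];
   Float_t lat,lon,areaMi,areaKm,pop,popDens;
   Int_t nlines = 0;
   TFile *f = new TFile("USPopDens.root","RECREATE");
   TTree *tree = new TTree("ntuple","data from csv file");

   tree->Branch("countyCode",&countyCode,"countyCode/I");
   tree->Branch("tractCode",&tractCode,"tractCode/I");
   tree->Branch("blockCode",&blockCode,"blockCode/I");
   tree->Branch("countyState",countyState,"countyState/C");
   tree->Branch("lat",&lat,"lat/F");
   tree->Branch("lon",&lon,"lon/F");
   tree->Branch("areaMi",&areaMi,"areaMi/F");
   tree->Branch("areaKm",&areaKm,"areaKm/F");
   tree->Branch("pop",&pop,"pop/F");
   tree->Branch("popDens",&popDens,"popDens/F");

   // The fields are separated by commas and "County/State" contains a space,
   // so read whole lines and split them on ',' instead of using operator>>.
   std::string line, field[10];
   std::getline(in, line);   // skip the header line
   while (std::getline(in, line)) {
      std::istringstream row(line);
      Int_t n = 0;
      while (n < 10 && std::getline(row, field[n], ',')) n++;
      if (n != 10) continue;   // skip blank or incomplete lines
      countyCode = atoi(field[0].c_str());
      tractCode  = atoi(field[1].c_str());
      blockCode  = atoi(field[2].c_str());
      strncpy(countyState, field[3].c_str(), sizeof(countyState)-1);
      countyState[sizeof(countyState)-1] = '\0';   // make sure the string is 0-terminated
      lat     = atof(field[4].c_str());
      lon     = atof(field[5].c_str());
      areaMi  = atof(field[6].c_str());
      areaKm  = atof(field[7].c_str());
      pop     = atof(field[8].c_str());
      popDens = atof(field[9].c_str());
      tree->Fill();
      nlines++;
   }

   std::cout << "read " << nlines << " lines" << std::endl;
   in.close();

   f->Write();
}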
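
A note on the reading loop: the file is comma-separated and the County/State field contains a space, so each record has to be split on commas rather than on whitespace, and with 11M lines a malformed field is easy to miss. Below is a minimal sketch of that step in plain C++ (no ROOT needed), assuming fields never contain quotes or embedded commas, as in the sample line from the question. The helper names splitCsvLine and parseFloatStrict are made up for this illustration.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split one CSV record on commas. Assumes no quoted fields or embedded commas.
// Note: an empty field at the very end of the line is not reported.
std::vector<std::string> splitCsvLine(const std::string &line) {
   std::vector<std::string> fields;
   std::istringstream row(line);
   std::string field;
   while (std::getline(row, field, ',')) fields.push_back(field);
   return fields;
}

// Convert a field to float. Returns false if the field is empty, is not a
// number, or has trailing characters after the number.
bool parseFloatStrict(const std::string &field, float &value) {
   if (field.empty()) return false;
   char *end = nullptr;
   value = std::strtof(field.c_str(), &end);
   return end != field.c_str() && *end == '\0';
}
```

For the sample line in the question, splitCsvLine returns 10 fields with "Autauga AL" as the fourth, and parseFloatStrict rejects non-numeric text such as the quoted header names, so bad rows can be counted or skipped instead of silently becoming 0.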

The call to TTree::Branch tells ROOT

  • the name of your branch
  • the address of the variable from which root will read the information
  • the format of the branch

The branch that contains the string information is of type C, which, if you look at the TTree documentation, means

  • C : a character string terminated by the 0 character

N.B. I gave the character array a certain size; you should check for yourself what size is appropriate for your data.
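
One way to handle the buffer size is to copy each string into the fixed array with an explicit bound, so that an unexpectedly long name is truncated rather than written past the end of the array. A minimal sketch in plain C++ follows; the helper name copyToBuffer is hypothetical and not part of ROOT.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Copy src into a fixed-size char buffer, always leaving it 0-terminated
// (which is what a "/C" branch expects). Returns true if the whole string
// fit, false if it had to be truncated.
bool copyToBuffer(const std::string &src, char *buffer, std::size_t bufferSize) {
   if (bufferSize == 0) return false;
   const std::size_t n = src.size() < bufferSize ? src.size() : bufferSize - 1;
   std::memcpy(buffer, src.data(), n);
   buffer[n] = '\0';
   return n == src.size();
}
```

Counting how often this returns false over the whole file is a cheap way to find out whether the chosen array size is large enough.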

The other possibility is to do away with the ifstream and simply use the ReadFile method of TTree, which you would employ like this:

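#include "Riostream.h"
#include "TString.h"
#include "TFile.h"
#include "TTree.h"
#include "TSystem.h"

void UScsvToRoot() {

   // Look for the CSV file in the same directory as this macro
   TString dir = gSystem->UnixPathName(__FILE__);
   dir.ReplaceAll("UScsvToRoot.C","");
   dir.ReplaceAll("/./","/");

   TFile *f = new TFile("USPopDens.root","RECREATE");
   TTree *tree = new TTree("ntuple","data from csv file");
   // Branch descriptor: name/type for each column, in file order; ',' is the delimiter.
   // ReadFile ignores lines starting with '#', so prefix the quoted header line
   // of the CSV with '#' (or delete it) before reading.
   tree->ReadFile(Form("%sUSPopDens.csv",dir.Data()),"countyCode/I:tractCode/I:blockCode/I:countyState/C:lat/F:lon/F:areaMi/F:areaKm/F:pop/F:popDens/F",',');
   f->Write();
}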

You can read the section on TTrees in the ROOT Users Guide for more information; among many other things, it also has an example using TTree::ReadFile.

Let me know if this helps.




Answer 2:


I think you might be better off just using root_pandas. In the comprehensive answer by @Erik you still end up specifying the variables of interest by hand (countyCode/I, …). That has its advantages, to name some generic ones: you know what you will get, and you get an error message if an expected variable is missing. On the other hand, it gives you the chance to introduce typos, and if you read multiple CSV files you won't notice when one of them has more variables. Ultimately, copying variable names and determining variable types is something a computer should be very good at.

With root_pandas your code should be something like:

import pandas
import root_pandas  # importing root_pandas adds the to_root method to DataFrame

df = pandas.read_csv("USPopDens.csv")
df.to_root("USPopDens.root")


Source: https://stackoverflow.com/questions/31420191/converting-900-mb-csv-into-root-cern-ttree
