Searching and Handling git objects

Submitted by 心不动则不痛 on 2019-12-13 00:19:37

Question


I'm trying to filter through the historical content of a file in my git repository. There is a line in some of the files that contains the string 'BEAM:A_BOOK', and in the 7th comma separated value of this line is a value I want to retrieve for further processing. I think, ideally, I'd end up with something like a dictionary with the SHA-1 hash of the commit, and this A_BOOK value for the past versions of this file.

Example of first few lines of a File. Note the value I'd hope to retrieve from this version of the file would be '56.0':

# Date: 2018-12-21 01:49:16.888

PV,SELECTED,TIMESTAMP,STATUS,SEVERITY,VALUE_TYPE,VALUE,READBACK,READBACK_VALUE,DELTA,READ_ONLY

REA_EXP:LINE,0,1544047322.881066957,NO_ALARM,NONE,enum,"JENSA~[UDF;AT-TPC;GPL;JENSA]",,"---",,true

REA_BTS19:BEAM:OPTICSFILE,0,1541798820.065952460,NO_ALARM,NONE,string,"BTS19_test3.data",,"---",,true

REA_BTS19:BEAM:A_BOOK,0,1545322510.562031883,NO_ALARM,NONE,double,"56.0",,"---",,true

Ultimately, I'll extend this to retrieve a couple values and do some math to perform more complicated filtering. More background: we store the Atomic Mass and Charge values for ion beams we deliver for nuclear physics experiments in text files under version control. These text files act as our 'save sets', and are filled with more than this mass and charge info, as they also include machine values we would restore if we wanted to run that beam again. My goal is to filter these files by the Charge:Mass ratio of the beams we ran with them.

So far, this seems to get me most of my information:

git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) | grep RFQ-JENSA_Setpoint.snp

Which spits out something like this:

16eca44985214b790eb6ca8241ad86728b4fd3ae:RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1531323944.085330133,NO_ALARM,NONE,double,"2.0",,"---",,true

6e585c905444f25e18edfe1eeb32ced2de72ed7c:RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1531323944.085330133,NO_ALARM,NONE,double,"2.0",,"---",,true

bc202d5f21f9829fa3701ca636657ee1b0a73e25:RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1531323944.085330133,NO_ALARM,NONE,double,"2.0",,"---",,true

etc...

However, I'd like to see something like:

<hash>:<Retrieved A_BOOK Value>

Or, based on the output I just showed, I'd hope to see something like this:

16eca44985214b790eb6ca8241ad86728b4fd3ae:2.0

6e585c905444f25e18edfe1eeb32ced2de72ed7c:2.0

bc202d5f21f9829fa3701ca636657ee1b0a73e25:2.0

etc...

And eventually include some math to show something more meaningful:

<hash>:<Retrieved Q_BOOK Value>/<Retrieved A_BOOK Value>

Is there a better way to go about this? What's a good way to retrieve this information?

Thank you!


Answer 1:


Given that you're interested in a particular file within each revision, consider adding -- <pathspec> to the git grep invocation. That is, instead of:

git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) | grep RFQ-JENSA_Setpoint.snp

you could start with:

git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) -- RFQ-JENSA_Setpoint.snp

You will still get the lines, but faster, since git grep can skip all the files that don't have RFQ-JENSA_Setpoint.snp in their names. (Note that a <pathspec> is not the same as a regular expression: if you really wanted to allow any character, e.g., RFQ-JENSA_SetpointXsnp and RFQ-JENSA_SetpointYsnp as file names, you'd have to use -- 'RFQ-JENSA_Setpoint?snp' here. I'm guessing your second grep was overly permissive. REs are more expressive in general than path globs, but for this particular case, even if you really did mean "any character", glob has ? to allow that.)

Complicating matters, you may find that in a large repository, $(git rev-list --all) produces enough strings to overflow argv limits. (What the argv limits are on your system is not something I can guess.) In that case, you may need to pipe git rev-list --all through xargs:

git rev-list --all | xargs -I % git grep 'BTS19:BEAM:A_BOOK' % -- RFQ-JENSA_Setpoint.snp

Annoyingly, this spawns one separate git grep for each revision, which will slow you right back down. (If you have a BSD-style xargs you can use -J instead of -I; or consider the GNU parallel command.)
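One way to keep the batching that xargs does by default, while still placing the revisions before the -- separator, is to wrap git grep in a small sh -c script. This is a sketch rather than part of the original answer: grep_history is a hypothetical helper name, and it assumes a POSIX sh and xargs. Each git grep then receives as many revisions as fit within the argument limits, instead of one.

```shell
# Hypothetical helper: grep every reachable commit for a pattern in one file.
#   $1 = pattern, $2 = pathspec
grep_history() {
    # xargs appends each batch of hashes after the fixed arguments, so inside
    # the script $1 is the pattern, $2 is the pathspec, and the remaining
    # arguments are revisions. "$0" is just the placeholder name "sh".
    git rev-list --all |
        xargs sh -c 'pat=$1; file=$2; shift 2
                     git grep -e "$pat" "$@" -- "$file"' sh "$1" "$2"
}
```

Usage would look like: grep_history 'BTS19:BEAM:A_BOOK' RFQ-JENSA_Setpoint.snp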

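To break these up and extract the 7th comma-separated value, consider replacing the first : with , and using awk. The hash then becomes field 1 and every other field shifts by one, so the 7th value of the original line ends up in field 8: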

... | sed 's/:/,/' | awk -F, '{print $1 ":" $8}'

although if you need proper CSV quote handling, a separate tool is probably more appropriate. (Given your example, this prints <hash>:"2.0", with the quotes still attached.)
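If the quotes are unwanted, the sed step can be folded into awk and the quotes stripped there. The sketch below is an illustration under stated assumptions, not a definitive implementation: extract_value and charge_mass_ratio are hypothetical names, no field is assumed to contain a comma, and the charge line is assumed to be a PV whose name ends in :Q_BOOK (the question names Q_BOOK but does not show that line). Both functions read git grep output on standard input.

```shell
# Print "<hash>:<value>" for each "<hash>:<file>:<PV>,...,"<value>",..." line.
extract_value() {
    awk -F, '{
        split($1, head, ":")   # $1 is "<hash>:<file>:<PV>"; head[1] is the hash
        value = $7             # 7th comma-separated field holds the value
        gsub(/"/, "", value)   # strip the double quotes
        print head[1] ":" value
    }'
}

# Print "<hash>:<Q_BOOK / A_BOOK>" for commits where both lines were found.
charge_mass_ratio() {
    awk -F, '
        {
            split($1, head, ":")
            hash = head[1]
            value = $7
            gsub(/"/, "", value)
            # Remember the order in which commits first appear.
            if (!(hash in seen)) { seen[hash] = 1; order[++n] = hash }
            if ($1 ~ /:A_BOOK$/) a[hash] = value
            if ($1 ~ /:Q_BOOK$/) q[hash] = value   # assumed PV suffix
        }
        END {
            for (i = 1; i <= n; i++) {
                h = order[i]
                # Skip commits missing either value, or with a zero mass.
                if ((h in a) && (h in q) && a[h] + 0 != 0)
                    printf "%s:%s\n", h, q[h] / a[h]
            }
        }'
}
```

For example, git grep -e 'BEAM:A_BOOK' -e 'BEAM:Q_BOOK' $(git rev-list --all) -- RFQ-JENSA_Setpoint.snp | charge_mass_ratio would feed both kinds of line to the second function, since git grep treats multiple -e patterns as alternatives.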



Source: https://stackoverflow.com/questions/53951431/searching-and-handling-git-objects
