Question
The source code tree (R) for my dissertation research software reflects a traditional research workflow: "collect data -> prepare data -> analyze data -> collect results -> publish results". I use make to establish and maintain the workflow (most of the project's sub-directories contain Makefile files).
However, I frequently need to execute individual parts of the workflow via particular Makefile targets in the project's sub-directories (not via the top-level Makefile). This creates the problem of setting up Makefile rules to maintain dependencies between targets from different parts of the workflow, in other words, between targets in Makefile files located in different sub-directories.
The following represents the setup for my dissertation project:
+-- diss-floss (Project's root)
|-- import (data collection)
|-- cache (R data objects, representing different data sources, in sub-directories)
|-+ prepare (data cleaning, transformation, merging and sampling)
| |-- R modules, including 'transform.R'
|-+ analysis (data analyses, including exploratory data analysis (EDA))
| |-- R modules, including 'eda.R'
|-+ results (results of the analyses, in sub-directories)
| |-+ eda (*.svg, *.pdf, ...)
| |-- ...
|-- present (auto-generated presentation for defense)
Snippets of targets from some of my Makefile files:
"~/diss-floss/Makefile" (almost full):
# Major variable definitions
PROJECT="diss-floss"
HOME_DIR="~/diss-floss"
REPORT=$(PROJECT)-slides
COLLECTION_DIR=import
PREPARATION_DIR=prepare
ANALYSIS_DIR=analysis
RESULTS_DIR=results
PRESENTATION_DIR=present
RSCRIPT=Rscript
# Targets and rules
all: rprofile collection preparation analysis results presentation

rprofile:
	R CMD BATCH ./.Rprofile

collection:
	cd $(COLLECTION_DIR) && $(MAKE)

preparation: collection
	cd $(PREPARATION_DIR) && $(MAKE)

analysis: preparation
	cd $(ANALYSIS_DIR) && $(MAKE)

results: analysis
	cd $(RESULTS_DIR) && $(MAKE)

presentation: results
	cd $(PRESENTATION_DIR) && $(MAKE)
## Phony targets and rules (for commands that do not produce files)
#.html
.PHONY: demo clean

# run demo presentation slides
demo: presentation
	# knitr(Markdown) => HTML page
	# HTML5 presentation via RStudio/RPubs or Slidify
	# OR
	# Shiny app

# remove intermediate files
clean:
	rm -f tmp*.bz2 *.Rdata
"~/diss-floss/import/Makefile":
importFLOSSmole: getFLOSSmoleDataXML.R
	@$(RSCRIPT) $(R_OPTS) $<
...
"~/diss-floss/prepare/Makefile":
transform: transform.R
	$(RSCRIPT) $(R_OPTS) $<
...
"~/diss-floss/analysis/Makefile":
eda: eda.R
	@$(RSCRIPT) $(R_OPTS) $<
Currently, I am concerned about creating the following dependency:
Data collected by making a target from the Makefile in import always needs to be transformed by making the corresponding target from the Makefile in prepare before being analyzed via, for example, eda.R. If I manually run make in import and then, forgetting about the transformation, run make eda in analysis, things do not go well. Therefore, my question is:
How could I use features of the make utility (in the simplest way possible) to establish and maintain rules for dependencies between targets from Makefile files in different directories?
Answer 1:
The problem with your use of make right now is that you are only listing the code as dependencies, not the data. That's where a lot of the magic happens. If the "analyze" step knew what files it was going to use and could list those as dependencies, it could look back to see how they were made and what dependencies they had. And if an earlier file in the pipeline was updated, it could run all the necessary steps to bring the file up to date. For example:
import: rawdata.csv
rawdata.csv:
	scp remoteserver:/rawdata.csv .

transform: transdata.csv
transdata.csv: gogo.pl rawdata.csv
	perl gogo.pl $< > $@

plot: plot.png
plot.png: plot.R transdata.csv
	Rscript plot.R
So if I do a make import, it will download a new csv file. Then if I run make plot, it will try to make plot.png, but that depends on transdata.csv, which depends on rawdata.csv, and since rawdata.csv was updated, it will have to update transdata.csv before it is ready to run the R script. If you don't explicitly set a lot of the file dependencies, you're missing out on a lot of the power of make. But to be fair, it can be tricky sometimes to get all the right dependencies in there (especially if you produce multiple outputs from one step).
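One common way to deal with the multiple-outputs case is a sentinel ("stamp") file: the step runs once, touches the stamp, and downstream rules depend on the stamp rather than on each individual output. This is only a sketch, not part of the answer above; split.R, part1.csv, part2.csv and report.R are hypothetical names:

# A hypothetical split.R writes both part1.csv and part2.csv;
# the stamp file split.done records that both are up to date.
split.done: split.R rawdata.csv
	Rscript split.R
	touch $@

# Downstream steps depend on the stamp, so the split step runs once,
# not once per output.
report.pdf: report.R split.done
	Rscript report.R

(Recent GNU Make, 4.3 and later, also supports grouped targets via &:, but the stamp-file pattern works with any version of make.)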
Answer 2:
The following are my thoughts (with some ideas from @MrFlick's answer - thank you) on adding my research workflow's data dependencies to the project's current make infrastructure (with snippets of code). I have also tried to reflect the desired workflow by specifying dependencies between make targets.
import/Makefile:
importFLOSSmole: getFLOSSmoleDataXML.R FLOSSmole.RData
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
(similar targets for other data sources)
prepare/Makefile:
IMPORT_DIR=../import

prepare: import \
         transform \
         cleanup \
         merge \
         sample

import: $(IMPORT_DIR)/importFLOSSmole.done # and/or other flag files, as needed

transform: transform.R import
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

cleanup: cleanup.R transform
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

merge: merge.R cleanup
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

sample: sample.R merge
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
analysis/Makefile:
PREP_DIR=../prepare

analysis: prepare \
          eda \
          efa \
          cfa \
          sem

prepare: $(PREP_DIR)/transform.done # and/or other flag files, as needed

eda: eda.R prepare
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

efa: efa.R eda
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

cfa: cfa.R efa
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

sem: sem.R cfa
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
The contents of Makefile files in directories results and present are still TBD.
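One possible shape for results/Makefile, following the same flag-file convention as above (only a sketch; the figures target and the export.R script are placeholder names):

ANALYSIS_DIR=../analysis

# analyses must have produced their flag files before results are collected
analysis: $(ANALYSIS_DIR)/eda.done # and/or other flag files, as needed

# a placeholder export.R script would render/copy figures into results/eda
figures: export.R analysis
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done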
I would appreciate your thoughts and advice on the above!
Source: https://stackoverflow.com/questions/23910056/creating-make-rules-for-dependencies-across-targets-in-projects-sub-directories