learning nextflow after putting my eggs in the snakemake basket
hope the nextflow guy doesn't turn up in my mentions again
what the fuck are channels
i remain openly sceptical
i quite like snakemake's file-oriented approach; proliferating wildcards in filenames (or looking them up in a simple data structure) makes for very fast and straightforward prototyping, but it can struggle to scale. nextflow's queue approach overcomes this
by the way a channel is just a stupid name for a queue
don't @ me
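to be fair to the name, here's a toy sketch of what i mean (everything here invented): values go in one end, a process takes them off the other, exactly like a queue

```
// a channel behaving exactly like a queue: put values in, a process pops them
numbers = Channel.from(1, 2, 3)

process square {
    input:
    val x from numbers

    output:
    stdout into results

    script:
    """
    echo \$(( ${x} * ${x} ))
    """
}

// print each result as it comes off the queue
results.subscribe { println it.trim() }
```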
finding the groovy syntax quite jarring, to be honest
i really didn't miss groovy from my jenkins CI days
the nf tutorial is very easy to understand, although i don't feel like it has explained how to get results "out" other than stdout. with snakemake you define the pipeline from its end back to the start, which is counter-intuitive at first, but quite easy once you get the hang of it
kinda sad to compare the money and backing that has gone into nextflow with what snakemake has had. i really think snakemake is excellent at a lot of things and has great potential for future improvements, but it cannot innovate as quickly as nextflow, which has a massive team
the thing about these channels that poses an advantage over snakemake for some usecases, is that you don't necessarily need to know about what jobs need doing at the start. eg. nf's `watchPath` channel means you can sit and listen for new data, instead of respawning the pipeline
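for the record it's basically a one-liner, something like this (path invented):

```
// sit and listen for new fastq files instead of respawning the whole pipeline
Channel
    .watchPath('/data/incoming/*.fastq', 'create')
    .set { new_fastqs }
```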
the experimental DSL2 syntax is actually much more like what you'd be used to coming from snakemake, but the fact it's experimental means it could change at any time
https://twitter.com/samstudio8/status/1239132603216801792
so i have a minimal example that can read some paths in from a CSV, run processes via SLURM and write some "results" files. starting to get the hang of things. the biggest difference i think is not having to think about when the pipeline is finished
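roughly the shape of it, with made-up column names:

```
// read the sample sheet and turn each row into a (sample, file) tuple
Channel
    .fromPath('samples.csv')
    .splitCsv(header: true)
    .map { row -> tuple(row.sample_id, file(row.fastq_path)) }
    .set { samples }
```

and then a single line in nextflow.config (`process.executor = 'slurm'`) sends every process to the cluster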
more than this, files are almost entirely abstracted away from your workflow. you even have to explicitly opt in to an output directory for a process with `publishDir`
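a sketch of what that looks like (paths invented):

```
process summarise {
    // without this directive the outputs stay buried in work/'s hashed dirs
    publishDir 'results/summaries/', mode: 'copy'

    input:
    file stats from stats_ch

    output:
    file 'summary.txt'

    script:
    """
    cat ${stats} > summary.txt
    """
}
```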
currently trying to find out why this process always gets re-executed when the files exist
how embarrassing
it's because i was using "--resume" not "-resume"
i find it quite weird that this "resume" behaviour isn't the default
can i just say, irrespective of my ranting, the documentation, generally, is excellent. snakemake has a bit to learn in that regard.
these channel operators are a big deal. a bunch of my snakemake steps exist just to collect/combine/merge result files, and this can be achieved implicitly in nextflow. will enjoy exploring these.
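the kind of thing i mean (channel names invented):

```
// collect() gathers every item into a single list, handy for a merge step
per_sample_stats
    .collect()
    .set { all_stats }

// groupTuple() bundles tuples that share a key, e.g. all lanes of one sample
lane_bams
    .groupTuple()
    .set { sample_bams }
```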
it's operations like these that i suspect lead to long DAG generation times in snakemake. i wonder how we could implement some helper functions in snakemake to collapse those munging jobs.
repeatedly forgetting to prepend `file` to the output filename is causing me some trouble
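for anyone else tripping over this, a sketch of the gotcha (process invented):

```
process tally {
    input:
    file fq from fastqs

    output:
    // what i keep typing:   "tally.txt" into tallies
    // what nextflow wants:
    file "tally.txt" into tallies

    script:
    """
    wc -l ${fq} > tally.txt
    """
}
```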
alright i have a minimal workflow that can watch a directory for fastq files, run nanostats on them asynchronously and publish the result to s3. i could have done the same thing with snakemake in a while loop and been drinking wine by now.
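the whole thing is about this big (bucket and paths invented; i'm using NanoStat from nanopack for the stats):

```
// watch for new fastqs as they land
fastqs = Channel.watchPath('/data/run/*.fastq', 'create')

process nanostats {
    // copy each result straight to s3 as it finishes
    publishDir 's3://my-bucket/nanostats/', mode: 'copy'

    input:
    file fq from fastqs

    output:
    file "${fq.baseName}.stats.txt"

    script:
    """
    NanoStat --fastq ${fq} > ${fq.baseName}.stats.txt
    """
}
```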
forgot to say that it was pretty easy to get conda working; it uses almost exactly the same syntax as snakemake for this - just pass an environment YAML to a process, and it takes care of the rest. one small gotcha: the `name` key needs to match the filename?
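for the record it's just a directive on the process (env file and tool invented):

```
process plot {
    // nextflow builds and caches the env from the YAML itself;
    // the name: key inside apparently has to match the filename
    conda 'envs/plotting.yaml'

    input:
    file stats from stats_ch

    output:
    file 'plot.png'

    script:
    """
    plot_stats ${stats} plot.png
    """
}
```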
i really like that processes can be "labelled" to apply properties, this would be nice in snakemake
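e.g. tag the process once and attach the resources to the tag in config (label and numbers invented):

```
process assemble {
    label 'big_mem'

    input:
    file reads from reads_ch

    output:
    file 'assembly.fa'

    script:
    """
    assembler ${reads} > assembly.fa
    """
}

// and in nextflow.config:
// process {
//     withLabel: big_mem {
//         memory = '64 GB'
//         cpus   = 16
//     }
// }
```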
about a month of basic nextflowing. channels are cool, i like the idea of passing things in and out of "queues". DSL2 looks neat but haven't tried it yet. i really really miss the snakemake params stanza where you can dynamically set properties of a rule based on its inputs
i really really don't get how to do this with nextflow. the closest i've come is using conditional scripts (https://www.nextflow.io/docs/latest/process.html#conditional-scripts) and passing the switch variable through the channels as part of the tuple
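here's the shape of that workaround, with an invented tool:

```
process call_variants {
    input:
    // `mode` is the switch variable riding along in the tuple
    tuple val(sample_id), file(bam), val(mode) from samples_ch

    output:
    file "${sample_id}.vcf"

    script:
    if( mode == 'fast' )
        """
        caller --fast ${bam} > ${sample_id}.vcf
        """
    else
        """
        caller --sensitive ${bam} > ${sample_id}.vcf
        """
}
```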
my go-to paradigm for snakemake is to load all the metadata about my samples/files into a dataframe and use key properties (like sample/file name) to look up things whenever i like. i'm still lost for a good way to do this with nextflow.
i could probably use the conditional scripting to set up parameters in a similar way, but i think the snakemake solution of being able to call a lambda is really elegant
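the closest analogue i've sketched (columns and tool invented): slurp the sheet into a plain groovy map up front, then poke it from the script block, params-style

```
// i believe nextflow adds splitCsv to file() objects too, so this can
// happen once, outside any channel
metadata = [:]
file('samples.csv').splitCsv(header: true).each { row ->
    metadata[row.sample_id] = row
}

process demux {
    input:
    val sample_id from sample_ids

    output:
    file "${sample_id}.demuxed.fastq"

    script:
    def barcode = metadata[sample_id].barcode   // ad-hoc lookup at script time
    """
    demux --barcode ${barcode} --out ${sample_id}.demuxed.fastq
    """
}
```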
i feel like i am missing something
but tbh i just want snakemake with channels
-resume seems to cause an immense amount of load on our server that makes it unusable
can only assume it is overzealously hashing all the files in my nxf workdir or something
tbh the main recurring gotcha is writing `#` instead of `//` for comments
today i just want to have a process fail but still publish the log somewhere outside of my nextflow working dir. i'd hoped setting the `errorStrategy` to `ignore` would let the publish directives still work, but alas.
would be cool if the publish directives could take an additional parameter that let them work even if the process failed
(i know i can tweak `validExitStatus`, but i actually still want the process to fail so the bad files don't get passed to the next channel)
i'm just going to run the process twice (it's fast)
the first time we'll let `validExitStatus` swallow the failure so we can get the logs out, the second time we'll just let it crash and burn and ignore it
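sketched out with an invented qc tool, the gross hack looks like this:

```
fastqs.into { fastqs_for_logs; fastqs_for_gate }

// pass 1: validExitStatus swallows the failure, so publishDir still fires
process qc_logs {
    publishDir 'logs/', mode: 'copy'
    validExitStatus 0,1

    input:
    file fq from fastqs_for_logs

    output:
    file "${fq}.log"

    script:
    """
    qc_tool ${fq} > ${fq}.log 2>&1
    """
}

// pass 2: the same command crashes for real and gets ignored, so only
// passing files reach the downstream channel
process qc_gate {
    errorStrategy 'ignore'

    input:
    file fq from fastqs_for_gate

    output:
    file "${fq}" into passed_fastqs

    script:
    """
    qc_tool ${fq} > /dev/null
    """
}
```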
don't @ me pal
man this is gross
i know i should just write a filter or something
but tbh i dont want to