Class 4 : grep and awk¶§

Class date:	2017 Feb 7 Tues
Last updated:	Apr 18, 2017

Goals¶§

Review
remember BED format (chr, start, end)
learn grep and awk basics to filter and manipulate text

grep¶§

Use grep(1) to identify lines in a file that match a specified pattern.

To find any instance of chr5 in the lamina.bed file

# grep [pattern] [filename]
$ grep chr5 lamina.bed | head

To find all lines that start with a number sign:

# The caret (^) matches the beginning of the line
# FYI dollar sign ($) matches the end
$ grep '^#' lamina.bed

To find any line that does not start with “chr”:

# the -v flag inverts the match (grep "not" [pattern])
$ grep -v '^chr' lamina.bed

Beware of using grep to find patterns that might be partial matches:

# this will match chr1, chr10, chr11 etc.
$ grep chr1 lamina.bed | cut -f1 | uniq

You can find exact matches that are split on words with the -w flag:

# this will only match chr1
$ grep -w chr1 lamina.bed | cut -f1 | uniq

Beware of using grep to search for numbers:

# finds all strings that match `100`
$ grep 100 lamina.bed | head -n 20

# better, but doesn't look at numeric value
$ grep -w 100 lamina.bed | head -n 20

Tip

If you’re trying to find numeric values in a file, use awk instead:

$ awk '$2 == 500' lamina.bed

Exercises¶§

use grep to identify lines in lamina.bed where the second field (start) begins with 100.
use grep to identify lines in lamina.bed where the third field (end) ends with 99 .
use grep with its -w flag to count the number of ‘chr1’ records in lamina.bed.
use grep to count how many fastq records are in the data-sets/fastq/t_R1.fastq.gz file (fastq records begin with an ‘@’ symbol)
use grep to count the number of fastq records in data-sets/fastq/SP1.fq

awk¶§

http://en.wikipedia.org/wiki/AWK

AWK is an interpreted programming language designed for text processing and typically used as a data extraction and reporting tool. It is a standard feature of most Unix-like operating systems.

Named after authors A ho, W einberger & K ernighan

This is programming

basic principles¶§

awk operates on each line of a text file
in an awk program, $1 is an alias for the 1st column, $2 for the 2nd, etc.
awk can filter lines by a pattern

awk program structure¶§

BEGIN runs before the program starts
END runs after the program runs through all lines in the file
PATTERN and ACTIONS check and execute on each line.

awk 'BEGIN {} (PATTERN) { ACTIONS } END {}' some.file.txt

awk BEGIN¶§

You can use BEGIN without a file. We just do one thing then exit:

awk 'BEGIN { print 12 * 12 }'

Same with END:

awk 'END { print 12 * 13 }'
# then type ctrl+d so it knows it's not getting more input.

filtering¶§

A simple and powerful use of awk is lines that match a pattern or meet set of criteria. Here, we match (and implicitly print) only lines where the first column is chr12:

awk '($1 == "chr12")' lamina.bed

We can also filter on start position using ‘&&’ which means ‘and’:

awk '($1 == "chr12" && $2 < 9599990)' lamina.bed

Important

= and == are not the same thing, and are frequently mixed up.

= is the assignment operator == tests for equality != tests for inequality.

program structure¶§

awk '($1 == "chr12" && $2 < 9599990)' lamina.bed

Important

when we are checking as a character (“chr12”) we need the quotes.
when we are checking as a number (9599990) can not use quotes.
can’t use commas (e.g. 9,599,990) in numbers

in-class exercise¶§

we will do the first of these together.

how many regions (lines) in lamina.bed have a start less than 1,234,567 on any chromosome?
how many regions in lamina.bed have a start less than 1,234,567 on chromosome 8?
how many regions (lines) in lamina.bed have a start between 50,000 and 951,000
how many regions in lamina.bed overlap the interval chr12:5,000,000-6,000,000 ?

Important

the last question is not trivial and understanding it will be useful

awk program structure (actions)¶§

print total bases covered on chromosome 13:

awk '($1 == "chr13") { coverage = coverage + $3 - $2 }
     END { print coverage }' lamina.bed

Important

the entire awk program must be wrapped in quotes. Nearly always best to use

single quotes (‘) on the outside.
coverage is a variable that stores values; we don’t use

a $ to access it like we do in bash or like we do for the $1, $2, ... columns

in-class exercise¶§

below is how we find coverage for chr13.

awk '($1 == "chr13") { coverage += $3 - $2 }
     END{ print coverage }' lamina.bed

how can we find the total coverage for all chromsomes except 13?

awk continued¶§

The $0 variable contains the entire line.

multiple patterns

awk '$3 >= 5000 { print $0"\tGREATER" }
     $3  < 5000   { print $0"\tLESS" }' \
      states.tab

remember we can simply filter to the lines > 5000 with:

awk '$3 >= 5000' states.tab

awk special variables¶§

we know $1, $2, ... for the column numbers
NR is a special variable that holds the line number
NF is a special variable that holds the number of fields in the line
FS and OFS are the (F)ield and (O)output (F)ield (S)eparators i.e. the delimiters (default is any space character)

using awk to count lines with NR¶§

$ wc -l lamina.bed

$ awk 'END { print NR }' lamina.bed

using FS and OFS¶§

Let’s convert lamina.bed to comma-delimited but only for chr12

remember FS is the input separator and OFS is the output delimiter

$ awk 'BEGIN{FS="\t"; OFS=","}
    ($1 == "chr12"){ print $1,$2,$3 }' lamina.bed

regular expressions¶§

we won’t cover these in detail, but you can match on regular expressions.

The following finds lines containing chr2 (chr2, chr20, chr21) in the first column

$ awk '$1 ~ /chr2/' lamina.bed

Often we can get by without regular expressions but they are extremeley powerful and available in nearly all programming languages.

advanced awk¶§

You can do a lot more with awk, here are several resources:

In Class Exercises - Class 4¶§

we will do the first 2 of these together.

use NR to print each line of lamina.bed preceded by it’s line number
use NF to see how many columns are in each row of states.tab

review¶§

$1, $2, $3 (default sep is space)
adjust sep with: OFS=”t”; FS=”,”
$0 # entire line

BEGIN {}
(match) { coverage += $3 - $2 }
END { print coverage }

NR is line number; NF is number of fields;
BEGIN {} filter { action } END { }

Exercises¶§

are there any regions in lamina.bed with start > end?
what is the total coverage [sum of (end - start)] of regions on chr13 in lamina.bed?
what is the mean value (4th column) on chromome 3 of lamina.bed
print out only the header and the entry for colorado in states.tab
what is the (single-number) sum of all the incomes for states.tab with illiteracy rate less than 0.1? greater than 2?
use NR to filter out the header from lamina.bed (hint: what is NR for the header?)

Class 4 : grep and awk — Genome Analysis Workshop 0.2 documentation

Class 4 : grep and awk¶§

Goals¶§

grep¶§

Exercises¶§

awk¶§

basic principles¶§

awk program structure¶§

awk BEGIN¶§

filtering¶§

program structure¶§

in-class exercise¶§

awk program structure (actions)¶§

in-class exercise¶§

awk continued¶§

awk special variables¶§

using awk to count lines with NR¶§

using FS and OFS¶§

regular expressions¶§

advanced awk¶§

In Class Exercises - Class 4¶§

review¶§

Exercises¶§