=========================[ Readings ]=========================

The BPF/Libpcap paper from 1992:
  http://www.vodun.org/papers/net-papers/van_jacobson_the_bpf_packet_filter.pdf

A quick tutorial on using IPtables:
  https://www.netfilter.org/documentation/HOWTO/packet-filtering-HOWTO-7.html

On IPtables internals:
  https://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html
especially Section 3:
  https://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html

==================[ Packet-level networking ]==================

Berkeley sockets APIs abstract away the packet level of networking, giving you
abstractions of streams (SOCK_STREAM) or "datagrams" (SOCK_DGRAM, i.e., UDP).
However, most network tools---and the kernel itself---work with raw packets.
Today we surveyed several of these mechanisms:

1) Raw sockets (SOCK_RAW). These are the primary way to send a crafted packet,
   bypassing the kernel's construction of its headers (at the cost of having to
   build all the headers yourself, including the pesky TCP/IP checksums; see
   the sketch at the end of this section). But for sniffing raw packets, there
   is a better way.

2) Libpcap + BPF (Berkeley Packet Filter). BPF is a kernel mechanism for taking
   packets from the network driver, buffering and filtering them, and then
   giving them to a userland application like tcpdump or Wireshark for
   analysis. These applications use the libpcap library to read these packets.
   We went through the code of tcpdump in class; I suggest you spend more time
   with it. See below for more detail.

3) IPtables/Netfilter in Linux and PF in *BSD (including MacOS). These
   firewalls dissect packets to apply rule-based decisions to them (e.g., block
   all packets for port 22 from a particular network, or allow only packets
   from a particular network like 129.170.0.0/16 and drop packets from all
   other sources). These firewalls also rewrite packets, replacing, e.g.,
   source or destination IP addresses as needed by NAT. You specify the rules
   with the command-line tool iptables.

   Linux's Netfilter has an amazing feature: you can write rules to match
   packets, "steal" them from the kernel into a buffer, hand them to a userland
   program, and then, after the userland program works on them, re-inject the
   packets back into the kernel, changed or unchanged. This feature exists in
   two versions: IPQUEUE (older) and NFQUEUE (newer). That way you could do NAT
   in your own userland program if the kernel's version is, for some reason,
   insufficient.

4) Virtual network interfaces, TUN/TAP on Linux, UTUN on MacOS. These are used
   by VPNs such as OpenVPN. With these drivers loaded, you can tell the kernel
   to create and configure a network interface with its own IP address and
   netmask, and to route into it packets for certain destinations. Then a
   userland program can open() this interface and read these packets as byte
   buffers with read(), and send packets on that interface with write(). This
   is close to raw sockets, but requires setting up routing so that packets
   actually show up on the interface.

We will discuss (4) later in the course. (1)--(3) are the key tools of network
administration. These mechanisms are like the Force: they hold networks
together and make them understandable and manageable.
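To make (1) concrete, here is a minimal sketch (my illustration, not code we
wrote in class) of crafting and sending an ICMP echo request over a raw
socket. It assumes Linux's <netinet/ip_icmp.h> and must run as root; with
IPPROTO_ICMP the kernel still builds the IP header, but the ICMP header and
its checksum are ours to build.

  /* Sketch (not class code): craft and send an ICMP echo request over a
   * raw socket. Assumes Linux headers; requires root. */
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <netinet/ip_icmp.h>

  /* The classic Internet checksum: one's-complement sum of 16-bit words. */
  static unsigned short cksum(void *data, int len)
  {
      unsigned long sum = 0;
      unsigned short *p = data;
      while (len > 1) { sum += *p++; len -= 2; }
      if (len == 1) sum += *(unsigned char *)p;
      sum = (sum >> 16) + (sum & 0xffff);
      sum += (sum >> 16);
      return (unsigned short)~sum;
  }

  int main(void)
  {
      int s = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
      if (s < 0) { perror("socket (needs root)"); return 1; }

      struct {
          struct icmphdr hdr;            /* we build this ourselves */
          unsigned char payload[16];     /* arbitrary zeroed payload */
      } pkt;
      memset(&pkt, 0, sizeof(pkt));
      pkt.hdr.type = ICMP_ECHO;                      /* echo request */
      pkt.hdr.un.echo.id = htons(getpid() & 0xffff);
      pkt.hdr.un.echo.sequence = htons(1);
      pkt.hdr.checksum = cksum(&pkt, sizeof(pkt));   /* over header + payload */

      struct sockaddr_in dst = { .sin_family = AF_INET };
      inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

      if (sendto(s, &pkt, sizeof(pkt), 0,
                 (struct sockaddr *)&dst, sizeof(dst)) < 0)
          perror("sendto");
      close(s);
      return 0;
  }

Running "tcpdump -i lo icmp" in another terminal should show the request going
out and the kernel's echo reply coming back.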
=====================[ Libpcap, tcpdump ]=====================

Read the 1992 paper:
  http://www.vodun.org/papers/net-papers/van_jacobson_the_bpf_packet_filter.pdf

Libpcap allows you to configure capture and filtering of packets by the kernel,
and then run an endless loop that calls your "callback" function for every
captured packet. This callback function gets a pointer to the raw bytes of a
packet (3rd argument), a struct with packet metadata (2nd argument), and a
pointer to some data structure of your own (1st argument) that is the same
between all calls; it is intended to let the different invocations of the
handler share data without making that data global. From "man pcap_loop":

  #include <pcap/pcap.h>

  typedef void (*pcap_handler)(u_char *user, const struct pcap_pkthdr *h,
                               const u_char *bytes);

  struct pcap_pkthdr {
          struct timeval ts;      /* time stamp */
          bpf_u_int32 caplen;     /* length of portion present */
          bpf_u_int32 len;        /* length this packet (off wire) */
  #ifdef __APPLE__
          char comment[256];
  #endif
  };

This becomes the main loop of a pcap-using program:

  int pcap_loop(pcap_t *p, int cnt, pcap_handler callback, u_char *user);

Most of Libpcap's functions (and there are many, all named pcap_*) take a
"pcap descriptor" pcap_t, which is created when you open either a live
interface for sniffing (pcap_open_live) or a saved file (pcap_open_offline).
You can think of this descriptor as an object (in the C++ or Java sense) on
which these functions work as methods (in fact, C++ methods under the hood are
functions that take an extra argument, a pointer to the object, called the
"implicit" argument, which is available inside these functions as "this").

Note this pattern of a callback called multiple times with a shared context.
It is a very important C design pattern.

Looking through tcpdump's main source file, tcpdump.c (get
http://www.tcpdump.org/release/tcpdump-4.9.0.tar.gz), you see the skeleton of
libpcap-based programs:

  *ebuf = '\0';
  pc = pcap_open_live(device, ndo->ndo_snaplen, !pflag, 1000, ebuf);
  if (pc == NULL) {
          ...
          error("%s", ebuf);
  }

(device is a string name, such as "eth0", "lo", or "any"; ebuf is a buffer to
catch a pcap-specific error message, if any), and then

  status = pcap_loop(pd, cnt, callback, pcap_userdata);

(the main loop; only an error or ^C will break out of it; the rest is up to
the callback. Any shared context is passed via pcap_userdata.)

See my pcap-hexprint.c example in
http://www.cs.dartmouth.edu/~sergey/cs60/pcap/

Practice with extending the printer to print the IP addresses of the source
and destination, and the ports for TCP and UDP, for each frame.

--------------------[ BPF filtering ]--------------------

Libpcap allows you to set a filter for packets to capture. This saves both CPU
and storage. The filtering occurs in the kernel, on a buffer of frames received
by the network driver; only those passing the filter are copied into the
buffers given to your libpcap-based application.

Even though a packet filter ultimately performs the same checks of packet
fields that you'd write in C, it would be foolhardy to let tcpdump users inject
C code into the kernel. Instead, the filter is compiled into bytecode for a
very simple virtual machine; that bytecode is sent to the kernel and executed
there in a VM. The idea is that limited bytecode is easier to analyze and
constrain to just the desired operations than C code.
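Putting the pieces of this section together, here is a minimal sketch of a
libpcap sniffer (mine, not the pcap-hexprint.c from the course page): it opens
an interface, compiles and installs a filter, then loops, handing every
matching packet to a callback. The interface name "eth0" and the filter "icmp"
are arbitrary choices for the sketch; link with -lpcap and run as root.

  /* Minimal libpcap sketch (not class code): open an interface, compile and
   * install a BPF filter, then loop, calling the callback on each packet. */
  #include <stdio.h>
  #include <pcap/pcap.h>

  static void handler(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes)
  {
      long *count = (long *)user;        /* shared context, no globals */
      (*count)++;
      printf("packet %ld: caplen=%u len=%u\n", *count, h->caplen, h->len);
  }

  int main(void)
  {
      char errbuf[PCAP_ERRBUF_SIZE];
      long count = 0;

      pcap_t *pc = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
      if (pc == NULL) {
          fprintf(stderr, "pcap_open_live: %s\n", errbuf);
          return 1;
      }

      struct bpf_program prog;
      /* Compile the filter expression to BPF bytecode ... */
      if (pcap_compile(pc, &prog, "icmp", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
          /* ... and push that bytecode into the kernel. */
          pcap_setfilter(pc, &prog) == -1) {
          fprintf(stderr, "filter: %s\n", pcap_geterr(pc));
          return 1;
      }

      /* The endless loop: -1 means "until error or interrupt". */
      pcap_loop(pc, -1, handler, (u_char *)&count);

      pcap_close(pc);
      return 0;
  }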
You can see the bytecode with the -d option to tcpdump, and it's really easy
to read:

  tcpdump -d -i en0 icmp

  (000) ldh      [12]                // get 2 bytes at offset 12, the ethertype
  (001) jeq      #0x800  jt 2  jf 5  // compare with 0x800, ethertype for IP
                                     // jump to 002 if true, 005 if false
  (002) ldb      [23]                // get one byte at offset 23, i.e., offset 9
                                     // into the IP header, the protocol number
  (003) jeq      #0x1    jt 4  jf 5  // compare with 1, proto number for ICMP
                                     // jump to 004 if true, 005 if false
  (004) ret      #65535              // accept packet, we want it
  (005) ret      #0                  // discard packet

Check out the filters for "host 10.10.10.10" (anything to or from that IP) and
for "not port 22" (useful when you run tcpdump on a machine you are ssh-ing
into).

Check out where in tcpdump's code pcap_compile() creates a bytecode blob from
your filter string, and where pcap_setfilter() sends it into the kernel.

Note that frames given by a device driver to BPF may have link-layer headers
of different sizes for different drivers and links. An Ethernet header of 14
bytes (with the ethertype/protocol number in its last two bytes, offsets
12--13) is just one option. For this reason, pcap provides the function
pcap_datalink(). The returned number is one of the DLT_ constants from
/usr/include/pcap/bpf.h and is used to determine the offset of the IP header
from the start of the frame, and the layout of the frame.
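For example, a capture program might turn the DLT_ value into the number of
link-layer bytes to skip before the IP header. A small sketch of mine follows;
the lengths for these common link types are the standard ones, but check
pcap/bpf.h and the libpcap documentation for whatever your interface actually
reports:

  /* Sketch: map pcap_datalink()'s DLT_ value to the number of bytes to skip
   * to reach the IP header. Only a few common link types are handled. */
  #include <pcap/pcap.h>

  static int linkhdr_len(pcap_t *pc)
  {
      switch (pcap_datalink(pc)) {
      case DLT_EN10MB:    return 14;  /* Ethernet: dst(6) + src(6) + ethertype(2) */
      case DLT_NULL:      return 4;   /* BSD loopback: 4-byte address family */
      case DLT_LINUX_SLL: return 16;  /* Linux "cooked" capture, e.g. the "any" device */
      case DLT_RAW:       return 0;   /* no link header; frame starts with IP */
      default:            return -1;  /* unknown here: look it up before parsing */
      }
  }

In a callback, the IP header then starts at bytes + linkhdr_len(pc), ignoring
complications such as VLAN tags.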
--------------------[ Control flow ]--------------------

Tcpdump's code consists mainly of parsers-and-printers for the various
protocols, contained in files named print-*.c. The parsers for protocols lower
in the stack pass control to those higher in the stack, based on values in the
packet such as the "ethertype" in the Ethernet frame ("ether[12:2]" in
tcpdump's filter language), the "protocol" field in the IP packet ("ip[9]"),
well-known ports like UDP port 53 for DNS, and so on.

For example, there's print-ip.c, with the main entry function ip_print(),
which prints the bulk of the info you see in tcpdump's output. This function
is called from files that deal with link-layer protocols, most frequently
print-ether.c.

[I use ctags for quickly finding where a function or variable is defined in
code. Consult this tutorial (where ctags are applied to the Linux kernel
tree):
https://courses.cs.washington.edu/courses/cse451/10au/tutorials/tutorial_ctags.html]

  $ grep -r ' ip_print' . | grep -v tags
  ...
  ./print-ether.c:        ip_print(ndo, p, length);
  ...

The caller is ethertype_print(), itself called from ether_print(), called in
turn from ether_if_print(), the top-level routine of the printer:

  /*
   * This is the top level routine of the printer.  'p' points
   * to the ether header of the packet, 'h->len' is the length
   * of the packet off the wire, and 'h->caplen' is the number
   * of bytes actually captured.
   */
  u_int
  ether_if_print(netdissect_options *ndo, const struct pcap_pkthdr *h,
                 const u_char *p)
  {
          return (ether_print(ndo, p, h->len, h->caplen, NULL, NULL));
  }

But who calls ether_if_print()? It is called based on the data link type---the
same as returned by pcap_datalink()---via the top-level table called
"printers". In version 4.3.0, this table is in tcpdump.c; in the latest
version, 4.9.0, it's in a separate file, print.c, and has some new utility
functions, like has_printer() and lookup_printer() (trace the use of these
through the code):

  static const struct printer printers[] = {
          { ether_if_print,          DLT_EN10MB },
  ...
  #ifdef DLT_IEEE802_15_4
          { ieee802_15_4_if_print,   DLT_IEEE802_15_4 },  // this is ZigBee
  #endif
  ...
          { raw_if_print,            DLT_RAW },
  #ifdef DLT_IPV4
          { raw_if_print,            DLT_IPV4 },
  #endif
  ...
  #ifdef DLT_IPV6
          { raw_if_print,            DLT_IPV6 },
  #endif
  ...
  #ifdef HAVE_PCAP_USB_H
  #ifdef DLT_USB_LINUX
          { usb_linux_48_byte_print, DLT_USB_LINUX },     // pcap can sniff the USB bus!
  #endif /* DLT_USB_LINUX */
  ...
  #ifdef DLT_IEEE802_11
          { ieee802_11_if_print,     DLT_IEEE802_11 },    // this is Wi-Fi in monitor mode
  #endif

and so on. Note this pattern, a dispatch table: it is also very common in C.

I suggest you spend some time with tcpdump's code to see what a mature network
tool looks like.
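To show the dispatch-table pattern in its smallest form, here is a sketch of
mine; the names and key values are made up for illustration and are not
tcpdump's. It has the same shape as printers[] and lookup_printer() above: an
array of {key, handler} pairs and a lookup function that walks it.

  /* Generic dispatch-table sketch; names and keys are invented for
   * illustration, not taken from tcpdump. */
  #include <stdio.h>

  typedef void (*handler_fn)(const unsigned char *pkt, unsigned len);

  static void handle_ethernet(const unsigned char *pkt, unsigned len)
  {
      (void)pkt;
      printf("ethernet frame, %u bytes\n", len);
  }

  static void handle_raw_ip(const unsigned char *pkt, unsigned len)
  {
      (void)pkt;
      printf("raw IP packet, %u bytes\n", len);
  }

  enum { LINK_ETHERNET, LINK_RAW_IP };   /* made-up keys for the sketch */

  /* The dispatch table: a key paired with the function that handles it. */
  static const struct dispatch {
      int        linktype;
      handler_fn handler;
  } table[] = {
      { LINK_ETHERNET, handle_ethernet },
      { LINK_RAW_IP,   handle_raw_ip   },
  };

  static handler_fn lookup(int linktype)
  {
      for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); i++)
          if (table[i].linktype == linktype)
              return table[i].handler;
      return NULL;   /* no handler registered for this type */
  }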
=====================[ Netfilter and IPtables ]=====================

Tcpdump's code parses packets in userland. The kernel must also process
packets, hand them to the appropriate functions for further processing, and so
on. So kernel code branches out into a tree of per-protocol handlers just like
tcpdump's (for processing rather than printing, of course).

We saw how Netfilter hooks fit with the kernel's networking code in slides
4--9 of
http://www.cs.dartmouth.edu/~sergey/netreads/joanna-rutkowska-passive-covert-channels-slides.pdf
All functions in blue boxes are actual function names in the Linux kernel;
e.g., arp_rcv() is at
http://lxr.free-electrons.com/source/net/ipv4/arp.c#L901
and so on.

Netfilter is very powerful, and can act as a packet-filtering firewall, a NAT,
a port forwarder, a DDoS countermeasure, or all of the above. For examples,
see:

  http://blog.erratasec.com/2016/10/configuring-raspberry-pi-as-router.html
  https://javapipe.com/iptables-ddos-protection
  https://www.karlrupp.net/en/computer/nat_tutorial

Examples from my machine:

  # iptables-save
  # Generated by iptables-save v1.4.21 on Wed Apr 26 18:46:44 2017
  *filter
  :INPUT ACCEPT [522874:336579980]
  :FORWARD ACCEPT [48849:6966028]
  :OUTPUT ACCEPT [165758:31420812]
  -A INPUT -s 218.64.0.0/15 -i eth0 -p tcp -m tcp --dport 22 -j DROP
  -A INPUT -s 103.193.200.0/22 -i eth0 -p tcp -m tcp --dport 22 -j DROP
  -A INPUT -s 121.16.0.0/13 -i eth0 -p tcp -m tcp --dport 22 -j DROP
  -A INPUT -s 210.13.0.0/16 -i eth0 -p tcp -m tcp --dport 22 -j DROP
  -A INPUT -s 220.191.0.0/16 -i eth0 -p tcp -m tcp --dport 22 -j DROP
  -A INPUT -s 221.192.0.0/14 -i eth0 -p tcp -m tcp --dport 22 -j DROP

These block incoming connections from networks attempting to brute-force my
SSH password for root:

  # grep 218.65. /var/log/auth.log | head
  Apr 23 10:48:03 throk sshd[30863]: reverse mapping checking getaddrinfo for 38.30.65.218.broad.xy.jx.dynamic.163data.com.cn [218.65.30.38] failed - POSSIBLE BREAK-IN ATTEMPT!
  Apr 23 10:48:03 throk sshd[30863]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.65.30.38 user=root
  Apr 23 10:48:04 throk sshd[30863]: Failed password for root from 218.65.30.38 port 10518 ssh2
  Apr 23 10:48:06 throk sshd[30863]: Failed password for root from 218.65.30.38 port 10518 ssh2
  Apr 23 10:48:10 throk sshd[30863]: Failed password for root from 218.65.30.38 port 10518 ssh2
  Apr 23 10:48:10 throk sshd[30863]: Received disconnect from 218.65.30.38: 11: [preauth]
  Apr 23 10:48:10 throk sshd[30863]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=218.65.30.38 user=root
  Apr 23 10:48:40 throk sshd[30865]: reverse mapping checking getaddrinfo for 38.30.65.218.broad.xy.jx.dynamic.163data.com.cn [218.65.30.38] failed - POSSIBLE BREAK-IN ATTEMPT!
  ...

  # grep 218.65. /var/log/auth.log | wc
    19763  288473 2345516

That's nearly 20,000 logged attempts from Apr 23 to Apr 25, when I blocked it,
and over 2MB of log space. This is the Internet these days, folks.

Moreover, these rules are mostly futile as a defense, since brute-force
scanners change networks all the time, and there's no lack of devices with
default passwords for root---like many IoT cameras---to join the ranks of the
scanners. It's a losing battle; I just put in these rules to save myself some
scrolling while I had to watch the log in real time for something else.

The right way to frustrate the scanners is to allow connections from specific
networks and drop the rest:

  # iptables -A INPUT -s 129.170.0.0/16 -i eth0 -p tcp -m tcp --dport 22 -j ACCEPT
  # iptables -A INPUT -s <another trusted network> -i eth0 -p tcp -m tcp --dport 22 -j ACCEPT
  # iptables -A INPUT -i eth0 -p tcp -m tcp --dport 22 -j DROP

(-A means append. The rule order matters: if I were to start with the DROP
rule, my own connection would get dropped, too! It's a good idea to leave
yourself a connection that you know will not be blocked by your new rules, or
to test new rules with "iptables-save > sane.rules" followed by
"iptables <your new rule> && sleep 20 && iptables-restore < sane.rules", which
will reload the known-sane rules after 20 seconds. I learned this the hard
way.)

Use your virtual machine to try various rules from
https://www.netfilter.org/documentation/HOWTO/packet-filtering-HOWTO-7.html

One rule we will need in Lab 4 drops outgoing TCP RST packets (so that our own
raw TCP implementation is not interfered with by the kernel's TCP stack, which
knows nothing about our connections):

  # iptables -A OUTPUT -p tcp --tcp-flags RST RST -j DROP

(The first RST is the set of flags to examine; the second is exactly which of
these flags must be set for the rule to match.)

We verified this by setting the rule in the VM and observing that the RSTs
sent in response to a netcat connection attempt no longer reach the host; as a
result, netcat does not immediately quit after getting a RST back, but keeps
sending SYN packets.
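Finally, to tie this back to the NFQUEUE feature mentioned in item (3) at the
top: a userland program receives queued packets through libnetfilter_queue and
returns a verdict on each. Below is a rough sketch of mine, patterned on the
library's standard usage rather than anything we wrote in class; check the
libnetfilter_queue documentation for the exact signatures on your system. You
would steer packets into it with a rule such as
"iptables -A OUTPUT -p icmp -j NFQUEUE --queue-num 0", and link with
-lnetfilter_queue.

  /* Rough sketch (not class code): read packets queued by an iptables
   * NFQUEUE rule and accept each one unchanged. Needs root. */
  #include <stdio.h>
  #include <stdint.h>
  #include <sys/socket.h>
  #include <arpa/inet.h>
  #include <linux/netfilter.h>                       /* NF_ACCEPT */
  #include <libnetfilter_queue/libnetfilter_queue.h>

  /* Called once per queued packet; we must return a verdict. */
  static int cb(struct nfq_q_handle *qh, struct nfgenmsg *nfmsg,
                struct nfq_data *nfa, void *data)
  {
      uint32_t id = 0;
      struct nfqnl_msg_packet_hdr *ph = nfq_get_msg_packet_hdr(nfa);
      if (ph)
          id = ntohl(ph->packet_id);

      unsigned char *payload;
      int len = nfq_get_payload(nfa, &payload);
      if (len >= 0)
          printf("packet id %u, %d bytes (starts with the IP header)\n", id, len);

      /* Re-inject the packet unchanged; could also be NF_DROP, or a
       * modified payload passed in the last two arguments. */
      return nfq_set_verdict(qh, id, NF_ACCEPT, 0, NULL);
  }

  int main(void)
  {
      struct nfq_handle *h = nfq_open();
      if (!h) { fprintf(stderr, "nfq_open failed\n"); return 1; }

      struct nfq_q_handle *qh = nfq_create_queue(h, 0, &cb, NULL);  /* queue 0 */
      if (!qh) { fprintf(stderr, "nfq_create_queue failed\n"); return 1; }

      nfq_set_mode(qh, NFQNL_COPY_PACKET, 0xffff);   /* copy whole packets */

      char buf[65536];
      int fd = nfq_fd(h), n;
      while ((n = recv(fd, buf, sizeof(buf), 0)) >= 0)
          nfq_handle_packet(h, buf, n);              /* dispatches to cb() */

      nfq_destroy_queue(qh);
      nfq_close(h);
      return 0;
  }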