-------------------------[ Readings ]-------------------------

Today we looked into the programming patterns and tools that can help you
emulate Linux's TCP stack with the most accuracy and the least effort.

Work through the details of the following diagrams: the TCP state machine,
the sliding window algorithm, retransmission timers for multiple packets,
and TCP connection teardown:

http://www.tcpipguide.com/free/t_TCPOperationalOverviewandtheTCPFiniteStateMachineF-2.htm
http://www.tcpipguide.com/free/t_TCPConnectionTermination-2.htm
http://www.tcpipguide.com/free/t_TCPSlidingWindowDataTransferandAcknowledgementMech-5.htm
http://www.tcpipguide.com/free/t_TCPSegmentRetransmissionTimersandtheRetransmission-3.htm

The TCP 3-way handshake has a similar diagram:

http://www.tcpipguide.com/free/t_TCPConnectionEstablishmentSequenceNumberSynchroniz-2.htm

Handling TCP packets and connections in the Linux kernel has converged on
several data structures. These posts summarize Linux's per-packet structure,
the SKB, and its specific subset used to track TCP packets:

http://vger.kernel.org/~davem/skb.html
http://vger.kernel.org/~davem/skb_redundancy.html   --how a packet is handled through the layers
http://vger.kernel.org/~davem/skb_list.html
http://vger.kernel.org/~davem/skb_data.html
http://vger.kernel.org/~davem/tcp_skbcb.html        --TCP-specific per-packet data
http://vger.kernel.org/~davem/tcp_output.html       --how outgoing packets are queued

Of course, the kernel has a much bigger scope of functionality to implement
than you do, so copying these data structures for your own use would be
tremendously excessive; you will be able to get away with a lot less. But I
believe it helps to see what data is collected into SKBs and TCBs, and why.
Note that these structures do not include variables to track the TCP sliding
window, only the packets themselves and per-packet information!

----------[ Multiplexing listening on several sockets ]----------------

The blueprint for a typical TCP server is built on the accept() blocking
system call: the server blocks on a listening socket until new input is
available, then hands off the processing of that input to a fork()-ed child
process or thread, while continuing to listen for more input. The decision
that accept() returns a new socket descriptor---which is inherited through a
fork---is what makes this design work.

However, note that this scheme breaks down if you have more than one source
of input. Imagine that you have not one but two or more sockets to listen
on, or a socket to read from and a timer to attend to. A blocking system
call like accept() or recv() is only meant to react to _one_ source of input
and one kind of event; you need something else to handle more than one! This
was the motivation for introducing first the select() system call, then the
poll() system call, and, lately, the even more versatile epoll(). In class,
we considered poll().

Poll() allows you to wait on an array of sockets (more precisely, an array
of "struct pollfd" entries, each holding a file descriptor and the bitmasks
of events desired and observed), blocking until one of these sockets has
data to read (or will allow you to write data into it, or encounters an
error condition---we won't be using these events). You get to define which
events you are interested in for each file descriptor by setting the bitmask
"events" in the corresponding array entry. When the call to poll() unblocks,
this means that one or more of the desired events have occurred, and you can
find which one(s) by examining the "revents" member of each entry.
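As a quick illustration---a minimal sketch, not part of the lab handout---here
is what waiting on two listening sockets with poll() might look like. It
assumes the descriptors listen_fd1 and listen_fd2 have already been set up
with socket(), bind(), and listen(); error handling is abbreviated:

#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>

/* Sketch: multiplex two listening sockets with a single poll() loop.
   listen_fd1 and listen_fd2 are assumed to be valid listening sockets. */
void serve(int listen_fd1, int listen_fd2)
{
    struct pollfd fds[2];

    fds[0].fd = listen_fd1;
    fds[0].events = POLLIN;          /* we only care about "readable" */
    fds[1].fd = listen_fd2;
    fds[1].events = POLLIN;

    for (;;) {
        int n = poll(fds, 2, -1);    /* -1: block until some event occurs */
        if (n < 0) {
            perror("poll");
            break;
        }
        for (int i = 0; i < 2; i++) {
            if (fds[i].revents & POLLIN) {
                /* On a listening socket, POLLIN means accept() won't block. */
                int conn = accept(fds[i].fd, NULL, NULL);
                if (conn >= 0) {
                    /* ... hand off conn to a child process/thread,
                       or handle it right here ... */
                }
            }
        }
    }
}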
In the timer example linked in the next section, we combine this use of
poll() with Linux timers.

There are various sources comparing select() vs poll() vs epoll(). A useful
intro is http://www.linux-mag.com/id/357/ , and
http://stackoverflow.com/questions/970979/what-are-the-differences-between-poll-and-select
provides more pointers.

----------------[ Linux timers ]----------------

Linux provides a convenient way of combining poll()-ing sockets and acting
on timers. Specifically, you can create a timer that causes a poll()-able
file descriptor to have data to read exactly when the timer expires. That
way, a single poll() call can wait for two kinds of events: data arriving
on a socket (in our case, a raw packet arriving on a raw socket), and a
timer (such as a TCP retransmit timer) expiring.

Example from class, combining timerfd_create() and poll():

http://www.cs.dartmouth.edu/~sergey/cs60/lab4/tcp-timer.c

Read "man 2 poll", "man 7 time", and "man timerfd_create" in Linux for the
documentation of these calls and their arguments.

----------------[ Libnet ]----------------

Libnet is a library for crafting IP packets of various kinds; it saves you
the effort of computing IP and TCP checksums and a number of other manual
tasks needed to send a packet via raw sockets. Libnet was the basis of the
first generation of network security tools that exposed many vulnerabilities
of TCP/IP implementations. You can install Libnet in your virtual machines
with "apt-get install libnet1-dev" as root.

You will find the libnet tutorial at
https://repolinux.wordpress.com/2011/09/18/libnet-1-1-tutorial/ ,
and code examples from it on Github at
https://github.com/repolho/Libnet-1.1-tutorial-examples .

My example from class is in
http://www.cs.dartmouth.edu/~sergey/cs60/lab4/libnet-example-icmp.c

I posted a local copy of the libnet manual at
http://www.cs.dartmouth.edu/~sergey/cs60/libnet1-doc/
Specifically, the function list is in
http://www.cs.dartmouth.edu/~sergey/cs60/libnet1-doc/libnet-functions_8h.html

Note that Libnet has a lot of knowledge about IP protocols encoded not only
into its functions like libnet_build_ipv4() but also into its environment.
When you build a TCP, UDP, or ICMP payload and save the tag from it, and
then finish the packet by adding the IPv4 layer, you won't need to rebuild
the IPv4 layer again for another ICMP packet---if you reuse the tag. Libnet
uses tags to keep track of how packets were built, including their outer
layers, and will rebuild them automatically when it can. This functionality
is, of course, heuristic, but it can save you a lot of effort.

Also note that Libnet's raw packet sending function, libnet_write(l), takes
just the opaque context l, not the packet! This means that Libnet has the
concept of the current packet being built, with all of its layers belonging
together, and will send _that_ packet when you request it with
libnet_write(). At the same time, you can have many buffers for packets
lying around from previous libnet_build_* calls, identified by tags, and
reuse them as the "current" packet. See
https://repolinux.wordpress.com/2011/09/18/libnet-1-1-tutorial/#sending-multiple-packets
about the use of tags.
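To make the tag mechanics concrete, here is a minimal sketch (not the class
example) of sending two ICMP echo requests while building the IPv4 layer
only once. The destination address, payload, and the id/seq values are
placeholders; error handling is mostly omitted:

#include <libnet.h>
#include <stdio.h>

int main(void)
{
    char errbuf[LIBNET_ERRBUF_SIZE];
    libnet_t *l = libnet_init(LIBNET_RAW4, NULL, errbuf);  /* raw IPv4 injection */
    if (l == NULL) {
        fprintf(stderr, "libnet_init: %s\n", errbuf);
        return 1;
    }

    uint32_t dst = libnet_name2addr4(l, (char *)"10.0.0.1", LIBNET_DONT_RESOLVE);
    uint32_t src = libnet_get_ipaddr4(l);
    uint8_t payload[] = "hello";

    /* Build the ICMP echo layer; a ptag of 0 means "create a new block". */
    libnet_ptag_t icmp_tag = libnet_build_icmpv4_echo(
        ICMP_ECHO, 0,
        0,                    /* checksum 0: let libnet compute it */
        0x42, 1,              /* id, seq (placeholders) */
        payload, sizeof(payload) - 1, l, 0);

    /* Build the IPv4 layer once, on top of the ICMP layer. */
    libnet_build_ipv4(
        LIBNET_IPV4_H + LIBNET_ICMPV4_ECHO_H + sizeof(payload) - 1,
        0, 1234, 0, 64, IPPROTO_ICMP, 0, src, dst, NULL, 0, l, 0);

    libnet_write(l);          /* sends the "current" packet */

    /* Reuse icmp_tag to update just the ICMP layer (new sequence number);
       the saved IPv4 layer is reused and fixed up automatically. */
    libnet_build_icmpv4_echo(
        ICMP_ECHO, 0, 0, 0x42, 2,
        payload, sizeof(payload) - 1, l, icmp_tag);

    libnet_write(l);          /* sends the updated packet */

    libnet_destroy(l);
    return 0;
}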
One caveat about Libnet: it exists in two incompatible versions, 1.0 and
1.1. Some tutorials and examples (such as my favorite tool Dsniff,
https://www.monkey.org/~dugsong/dsniff/) use the older 1.0 version, which
is very different from the more complex 1.1 version we discussed. Libnet
1.0, though, has useful functions and is easier to read. You can find this
older, deprecated version at
http://packetfactory.openwall.net/projects/libnet/index.html
together with its reference manual.

Thanks,
--Sergey