
Sunday, March 18, 2012

Rails cookie handling -- serialization and format

A typical Rails session cookie has this format: cookie-value--signature (the two dashes are literal). The "cookie-value" part is a URL-encoded, base64-encoded string of the binary dump (via Marshal.dump) of whatever was set in the session. The signature part is an HMAC-SHA1 digest, created using the cookie-value as the data and a secret key. This secret key is typically defined in [app-root]/config/initializers/secret_token.rb.
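
Splitting a raw cookie string into these two parts is a one-liner. A minimal sketch (raw_cookie is a hypothetical variable holding the cookie as received):

# Split on the literal "--" separator (raw_cookie is hypothetical)
cookie_value, signature = raw_cookie.split("--")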

Let us try and reverse engineer a session cookie for a local app that I am running. I am using Devise for authentication, which in turn uses Warden. I use the Firecookie extension to Firebug to keep track of cookies. It is pretty handy.

Here is the session cookie set by Rails:

# Cookie as seen in Firebug
BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm--51f90f7176326f61636b89ee9a1fce2a4972d24f


As mentioned at the beginning, it has two parts separated by two dashes (--).

The cookie-value in this case is:

# The cookie-value part
BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm


The signature is:
51f90f7176326f61636b89ee9a1fce2a4972d24f

Whenever Rails receives a cookie it checks that the cookie has not been tampered with, by recomputing the HMAC-SHA1 signature of the cookie-value and comparing it with the signature that was sent along. We can do this verification ourselves. Fire up irb and try the following:
$ irb

irb(main):003:0> cookie_str = "BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm"
=> "BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm"
irb(main):004:0> require 'openssl'
=> true


# This cookie_secret comes from [app-root]/config/initializers/secret_token.rb. Obviously you need to keep this secret for your production apps.
irb(main):005:0> cookie_secret = '392cacbaac74af104375eb91324e254ba232424130e69022690aa98c1d0dfade159260588677e2859204298181385a83b923e58c4ef24bb3a40bdad9a41431b4'
=> "392cacbaac74af104375eb91324e254ba232424130e69022690aa98c1d0dfade159260588677e2859204298181385a83b923e58c4ef24bb3a40bdad9a41431b4"

irb(main):006:0> OpenSSL::HMAC.hexdigest(OpenSSL::Digest::SHA1.new, cookie_secret, cookie_str)
=> "51f90f7176326f61636b89ee9a1fce2a4972d24f"

As can be seen, the HMAC-SHA1 hexdigest generated from the cookie-value matches the signature part of the cookie. Hence the cookie has not been tampered with.

Now that the cookie's authenticity is verified, let us see what information it holds.

Let us retrace the steps Rails took to generate this cookie value, so that we can reverse them and recover the session data. The steps taken by Rails are (a sketch in code follows the list):
  1. session_dump = Marshal.dump(session)
  2. b64_encoded_session = Base64.encode64(session_dump)
  3. final_cookie_value = url_encode(b64_encoded_session)
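
A minimal sketch of this forward direction (the session hash and its contents are made up for illustration, and CGI.escape stands in for the URL-encoding step):

require 'base64'
require 'cgi'

# A made-up session hash standing in for the real Rails session
session = { "session_id" => "5047d93043d4a3908a907e60698df97f" }

session_dump        = Marshal.dump(session)           # 1. binary dump
b64_encoded_session = Base64.encode64(session_dump)   # 2. base64 encode
final_cookie_value  = CGI.escape(b64_encoded_session) # 3. URL encode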

The reverse process would be:
  1. url_decoded_cookie = CGI::unescape(cookie_value)
  2. b64_decoded_session = Base64.decode64(url_decoded_cookie)
  3. session = Marshal.load(b64_decoded_session)

And with a beautiful language like Ruby, all three steps can be done in a single line of code. Here it is:
(By the way, I need to require 'mongo' because one of the values contained here is of type BSON::ObjectId, which is defined in the mongo gem. Without it Marshal.load will error out.)

irb(main):001:0> require 'mongo'
=> true
irb(main):002:0> require 'cgi'
=> true
irb(main):003:0> require 'base64'
=> true
irb(main):004:0> cookie_str = "BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm"
=> "BAh7B0kiGXdhcmRlbi51c2VyLnVzZXIua2V5BjoGRVRbCEkiCVVzZXIGOwBGWwZvOhNCU09OOjpPYmplY3RJZAY6CkBkYXRhWxFpVGkvaQGsaQGwaRBpAdFpCGk9aQHtaQBpAGkGSSIiJDJhJDEwJEZseHh3c293Q29LcHhneWMxODR2b08GOwBUSSIPc2Vzc2lvbl9pZAY7AEYiJTUwNDdkOTMwNDNkNGEzOTA4YTkwN2U2MDY5OGRmOTdm"

# Reverse engineering the cookie to get the session object
irb(main):005:0> session = Marshal.load(Base64.decode64(CGI.unescape(cookie_str)))
=> {"warden.user.user.key"=>["User", [BSON::ObjectId('4f2aacb00bd10338ed000001')], "$2a$10$FlxxwsowCoKpxgyc184voO"], "session_id"=>"5047d93043d4a3908a907e60698df97f"}

This is the session data that the session cookie was holding. This data is subsequently used by Warden and Devise to fetch the user from the DB and do the authentication.

And that is how Rails handles cookies (at least how Rails 3.0.11 does; I am not sure if things have changed in later versions).

Thursday, March 15, 2012

NAS and SAN explained -- with technical differences.

Acronyms and fancy buzz words (specifically computer science related ones) have always troubled me, at times making me very angry at the person using them and in many cases eventually leaving me in a confused state. So whenever I come across such acronyms/buzz words I try to dissect them and prepare a mental visual map that I will use every time the acronym comes up in the future. The acronyms for this write-up are NAS (Network Attached Storage) and SAN (Storage Area Network).

These might be very simple and obvious things for many people but I am sure I have lost quite a bit of my hair whenever someone mentioned these acronyms to me. So here is my attempt to decipher them.

First the basics. Both of these consist of two building blocks: storage and network. Or, to put it in a less naive manner, both SAN and NAS allow applications on one machine to access data present on another machine. Okay, so why two names, why two acronyms? To answer that, let me take up these two building blocks separately.

In the simplest sense, "storage" means dealing with files stored on the hard disk attached to the system. We do that with the APIs (or "methods" if you want to avoid the acronym) made available by the filesystem and libraries built using those methods. As application programmers we almost never worry about how the files are actually stored on the disk. That is the responsibility of the filesystem, the kernel and the disk driver. The application always views the data stored on the disk in terms of files (used in a generic sense to refer to both files and directories) - essentially as a stream of bytes. If we dig a little deeper we find that these disks are actually made available to the filesystem by the disk drivers as block devices - i.e. whenever they accept or return data they do it in quanta of blocks. A disk doesn't return a single byte of data when you read from it; it always returns one or more blocks. From what I understand, the size of a block these days is typically 4KB. The amount of data transferred to or from the disk is a multiple of this block size. Allocation of space for files is also made in terms of blocks, which sometimes leads to a file utilizing its last block only partially (and that is why we see a difference between a file's actual size and the size it occupies on disk).
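
We can actually see this block-based allocation from Ruby. A small sketch (assuming a Unix-like system; note that File::Stat#blocks is counted in 512-byte units on most platforms):

require 'tempfile'

# Create a 1-byte file and look at how space is allocated for it
file = Tempfile.new('block-demo')
file.write('x')
file.flush

stat = File.stat(file.path)
puts stat.size    # => 1 (logical size in bytes)
puts stat.blksize # => preferred I/O block size, typically 4096
puts stat.blocks  # => 512-byte blocks actually allocated (a full block's worth)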

That's about storage. To summarize: data is made available as files by the filesystem software, but the device actually makes it available as blocks.

Network, in the simplest sense, is communication between two processes - running either on the same machine or on different machines. To simplify it further, let's just limit ourselves to the case of communication between two processes on two different machines. Typically one of these two processes will be a server process and the other a client process. The server process listens on a specified port to which the client can connect. The client can then send requests over the connection, which the server will "serve" by sending back a suitable response. The format of the request and the response is specified beforehand, and the client and the server agree to conform to that specification. This conformance is what is called the "protocol" which the two processes (or in this case the two machines) use for their communication. The client typically asks for some data, and the server fetches it from some place and sends the requested data back as the response. The client doesn't know where the server is fetching the data from and the server doesn't know what the client is doing with the data. The protocol is all that matters to them.
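
Here is a minimal sketch of that client/server pattern in Ruby (the port number and the one-line "protocol" are made up for illustration):

require 'socket'

# Server: listen on a port and serve each request with a canned response
server = TCPServer.new(9090)
server_thread = Thread.new do
  client = server.accept
  request = client.gets                    # read one request line
  client.puts("data for #{request.chomp}") # respond as per our tiny protocol
  client.close
end

# Client: connect, send a request, read the response
socket = TCPSocket.new('localhost', 9090)
socket.puts('block-42')
puts socket.gets                           # => "data for block-42"
socket.close
server_thread.join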

That's network. No summary here.

Okay, so how do storage and network come together now?

In the storage example the data on the hard disk (referred to as "our hard disk" henceforth) was being accessed by applications running on the same machine (referred to as the "host machine" henceforth). Now what if applications running on a different machine (referred to as the "new machine" henceforth) want to access the data on our hard disk? Let us call this requirement "remote data access".

The traditional filesystem software is designed to interact with a disk made available to it on the local system by the disk driver, and the driver is designed to handle a disk that is attached to this local system. For our "remote data access", either the filesystem software has to get smarter and start talking to the disk on our host machine, or the disk driver has to become smarter and make the disk on our host machine available as a local device on the new machine. It is these two options that the two acronyms stand for. One acronym means a smarter filesystem with the same old driver, and the other means a smarter driver with the same old filesystem. That's the difference between the two, and the reason there are two names and two acronyms!

NAS - Network Attached Storage -- This one has a smarter filesystem and the same old driver. In our setup, the filesystem on the "new machine" knows that the disk is on the "host machine" and every time an application requests a file (either for reading or writing) it has to contact the "host machine" over network and retrieve the file. In other words the filesystem on the "new machine" makes a request to the "host machine" - making it a client process. To accept and respond to that request there must be a server process running on the "host machine". This server process fetches the requested file from the disk (using the old driver) and sends it back to the client. The client process, which is the filesystem software, in turn makes that file available to the application that requested it. We can see that the data on the server is made available to the client as a file. This is what defines NAS.

So for the filesystem software to get smart, it now needs two components - a client part used by the applications and a server part which handles the disk. There are quite a few such smart filesystems out there. The most common in the UNIX/Linux world is NFS - Network File System. The server part of NFS is named "nfsd". On the client side, the standard "mount" command is smart enough to mount drives with the "nfs" filesystem type.
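
For example, mounting an NFS export on the client looks something like this (the server name "fileserver" and the export path are made up for illustration; run as root):

# On the client: mount the export /srv/data from the host "fileserver"
$ mount -t nfs fileserver:/srv/data /mnt/data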

Note that here the filesystem software is aware that the disk (and hence the data) is on a remote machine. This is another defining trait of NAS.

More details are available here: http://nfs.sourceforge.net/ and here: https://help.ubuntu.com/8.04/serverguide/C/network-file-system.html

SAN - Storage Area Network -- This one has a smarter disk driver and the same old filesystem. The disk driver on the "new machine" lies to the OS and the filesystem software that there is a disk attached to the system locally. The OS and the filesystem software believe the driver and continue to use the fake disk that the driver provided. Whenever the disk driver is asked to fetch a block (not a file, a block), it in turn sends a request to the "host machine" and retrieves that block of data - thereby becoming the client process in the setup. Accordingly there will be a server process running on the "host machine" which accepts this request, fetches the corresponding block from the actual disk and sends it back to the client. The client, which is the smart disk driver in this case, in turn passes that data to the filesystem software and eventually to the application that requested the file data. It is evident here that the data on the server was made available to the client as "blocks" and not as files. This is what defines SAN.

Note that here the filesystem (and every other component apart from disk driver) is not aware that the disk (and the data) is on a remote machine. This is another defining trait of SAN.

A very common and popular appearance of SAN these days is in the various cloud offerings. For instance, the Amazon cloud offering has a service named EBS - Elastic Block Store - which makes network storage available as a locally attached disk. We can have all the regular filesystems like ext4 or xfs on top of this EBS drive.
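
So, assuming the attached EBS volume shows up as /dev/xvdf on the instance (the device name varies), putting a regular filesystem on it looks just like dealing with a local disk:

# Create an ext4 filesystem on the attached volume and mount it (as root)
$ mkfs.ext4 /dev/xvdf
$ mount /dev/xvdf /mnt/ebs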

That's it. The two acronyms have been conquered...!