

Thursday, March 15, 2012

NAS and SAN explained -- with technical differences.

Acronyms and fancy buzzwords (specifically computer science related ones) have always troubled me, at times making me very angry at the person using them and in many cases eventually leaving me in a confused state. So whenever I come across such acronyms/buzzwords I try to dissect them and prepare a mental visual map that I can use every time the acronym comes up in the future. The acronyms for this write-up are NAS (Network Attached Storage) and SAN (Storage Area Network).

These might be very simple and obvious things for many people, but I am sure I lost quite a bit of my hair whenever someone mentioned these acronyms to me. So here is my attempt to decipher them.

First the basics. Both of these consist of two building blocks, storage and network; or, to put it in a less naive manner, both SAN and NAS allow applications on one machine to access data present on another machine. Okay, so why two names, why two acronyms? To answer that, let me take up these two building blocks separately.

In the simplest sense "Storage" means dealing with files stored on the hard disk attached to the system. We do that with the APIs (or "methods" if you want to avoid the acronym) made available by the filesystem and libraries built using those methods. As application programmers we almost never worry about how the files are actually stored on the disk. That is the responsibility of the filesystem, the kernel and the disk driver. The application always views the data stored on the disk in terms of files (used in a generic sense to refer to both files and directories) - more so as a stream of bytes. If we dig a little deeper we find that these disks are actually made available to the filesystem by the disk drivers as block devices - i.e. whenever they accept or return data they do it in quanta of blocks. A disk doesn't return you a single byte of data when you read from it. It always returns one or more blocks. From what I understand the size of a block these days is typically 4KB. The amount of data transferred to or from the disk is a multiple of this block size. Allocation of space for files is also made in terms of blocks, which sometimes leads to a file utilizing its last block only partially (and that is why we see a difference between a file's actual size and the size it occupies on disk).
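
A quick way to see this block-level allocation in action is to compare a file's logical size with the space actually allocated for it. Here is a minimal Python sketch (assuming a Linux machine and an existing file named example.txt, both hypothetical here; on Linux, st_blocks counts 512-byte units regardless of the preferred block size):

```python
import os

st = os.stat("example.txt")

print("Logical file size  :", st.st_size, "bytes")
print("Preferred I/O block:", st.st_blksize, "bytes")
# st_blocks is reported in 512-byte units on Linux, regardless of st_blksize
print("Space allocated    :", st.st_blocks * 512, "bytes")
```

For any file whose size is not an exact multiple of the block size, the allocated space will be larger than the logical size.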

That's about storage. To summarize: data is made available as files by the filesystem software, but the device actually makes it available as blocks.

Network in the simplest sense is communication between two processes - running either on the same machine or on different machines. To simplify it further let's just limit ourselves to the case of communication between two processes on two different machines. Typically one of these two processes will be a server process and the other a client process. The server process would be listening on a specified port to which the client can connect. The client can then send requests over the connection which the server will "serve" by sending back a suitable response. The format of the request and the response is specified beforehand and the client and the server agree to conform to that specification. This conformance is what is called the "protocol" which the two processes (or in this case the two machines) are using for their communication. The client typically asks for some data and the server fetches it from some place and sends the requested data as the response. The client doesn't know where the server is fetching the data from and the server doesn't know what the client is doing with the data. The protocol is all that matters to them.
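
To make this concrete, here is a minimal, self-contained Python sketch of one such exchange - a toy protocol with a hypothetical address and port, where the client sends a one-line request and the server sends back a one-line response:

```python
import socket
import threading

# A toy request/response "protocol". Host and port are hypothetical.
HOST, PORT = "127.0.0.1", 5000

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))   # the server listens on a specified port...
srv.listen(1)

def serve_one():
    conn, _ = srv.accept()
    with conn:
        request = conn.recv(1024).decode()
        # The server doesn't know what the client does with the data;
        # it just answers in the agreed-upon format.
        conn.sendall(("you asked for: " + request).encode())

threading.Thread(target=serve_one, daemon=True).start()

# ...and the client connects to it, sends a request and reads the response.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.sendall(b"greeting")
    print(cli.recv(1024).decode())   # prints: you asked for: greeting

srv.close()
```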

That's network. No summary here.

Okay, so how do storage and network come together now?

In the storage example the data on the hard disk (referred to as "our hard disk" henceforth) was being accessed by applications running on the same machine (referred to as the "host machine" henceforth). Now what if applications running on a different machine (referred to as the "new machine" henceforth) want to access the data on our hard disk? Let us call this requirement "remote data access".

The traditional filesystem software is designed to interact with a disk that was made available to it on the local system by the disk driver, and the driver is designed to handle a disk that is attached to this local system. For our "remote data access", either the filesystem software has to get smarter and start talking to the disk available on our host machine, or the disk driver has to become smarter and make the disk on our host machine available as a local device on the new machine. It is these two options that the two acronyms stand for. One acronym means a smarter filesystem with the same old driver and the other means a smarter driver with the same old filesystem. That's the difference between the two, and the reason there are two names and two acronyms!

NAS - Network Attached Storage -- This one has a smarter filesystem and the same old driver. In our setup, the filesystem on the "new machine" knows that the disk is on the "host machine" and every time an application requests a file (either for reading or writing) it has to contact the "host machine" over the network and retrieve the file. In other words the filesystem on the "new machine" makes a request to the "host machine" - making it a client process. To accept and respond to that request there must be a server process running on the "host machine". This server process fetches the requested file from the disk (using the old driver) and sends it back to the client. The client process, which is the filesystem software, in turn makes that file available to the application that requested it. We can see that the data on the server is made available to the client as a file. This is what defines NAS.
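
Here is a toy Python sketch of that file-level exchange - purely illustrative, with a made-up request format and a plain function call standing in for the network round trip (the real NFS wire protocol is far more involved); it should run as-is on a Linux box where /etc/hostname exists:

```python
# Toy file-level (NAS-style) exchange - illustrative only, not the real NFS protocol.

def file_server(request: str) -> bytes:
    """Runs on the "host machine": fetches whole files from its local disk."""
    op, path = request.split(" ", 1)        # e.g. "GET_FILE /etc/hostname"
    assert op == "GET_FILE"
    with open(path, "rb") as f:             # local filesystem + old driver do the work
        return f.read()

def nas_client_read(path: str) -> bytes:
    """The "smart filesystem" on the "new machine": asks the server by file name."""
    return file_server("GET_FILE " + path)  # stands in for a network round trip

# The unit of exchange is a *file* - that is what makes this NAS-like.
print(nas_client_read("/etc/hostname"))
```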

So for the filesystem software to get smart, it needs two components - a client part used by the applications and a server part which handles the disk. There is quite a bit of such "smart filesystem software" out there. The most common in the UNIX/Linux world is NFS - Network File System. The server part of NFS is named "nfsd". On the client side, the standard "mount" command is smart enough to mount drives with the "nfs" filesystem type.

Note that here the filesystem software is aware that the disk (and hence the data) is on a remote machine. This is another defining trait of NAS.

More details are available here : http://nfs.sourceforge.net/ and here : https://help.ubuntu.com/8.04/serverguide/C/network-file-system.html

SAN - Storage Area Network -- This one has a smarter disk driver and the same old filesystem. The disk driver on the "new machine" lies to the OS and the filesystem software that there is a disk attached to the system locally. The OS and the filesystem software believe the driver and continue to use the fake disk that the driver provided. Whenever the disk driver is asked to fetch a block (not a file, a block), it in turn sends a request to the "host machine" and retrieves that block of data - thereby becoming the client process in the setup. Accordingly there will be a server process running on the "host machine" which accepts this request, fetches the corresponding block from the actual disk and sends it back to the client. The client, which is the smart disk driver in this case, in turn passes that data to the filesystem software and eventually to the application that requested the file data. It is evident here that the data on the server was made available to the client as "blocks" and not as files. This is what defines SAN.
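
And here is the corresponding toy Python sketch of the block-level exchange - again purely illustrative (real SANs speak protocols like iSCSI or Fibre Channel), with a hypothetical disk-image file as the backing store and a plain function call standing in for the network:

```python
# Toy block-level (SAN-style) exchange - illustrative only, not iSCSI/Fibre Channel.
BLOCK_SIZE = 4096

def block_server(block_number: int) -> bytes:
    """Runs on the "host machine": returns one raw block from the backing store."""
    with open("/tmp/backing_store.img", "rb") as disk:   # hypothetical disk image
        disk.seek(block_number * BLOCK_SIZE)
        return disk.read(BLOCK_SIZE)

def san_driver_read_block(block_number: int) -> bytes:
    """The "smart driver" on the "new machine": asks for blocks, never files.
    The filesystem above it has no idea the disk is remote."""
    return block_server(block_number)       # stands in for a network round trip

# The unit of exchange is a *block* - that is what makes this SAN-like.
first_block = san_driver_read_block(0)
```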

Note that here the filesystem (and every other component apart from disk driver) is not aware that the disk (and the data) is on a remote machine. This is another defining trait of SAN.

A very common and popular appearance of SAN these days is in the various cloud offerings. For instance the Amazon cloud offering has a service named EBS - Elastic Block Store - which makes network storage available as a locally attached disk. We can have all the regular filesystems like ext4 or xfs on top of an EBS drive.

That's it. The two acronyms have been conquered... !

Sunday, November 27, 2011

A testimonial to one of the best mentors I have had

I have previously written about my internship at IBM in this post and this post. The first post makes it pretty clear that the internship was a very productive (very likely the most productive) part of my engineering student life. The second post briefly talks about Gautham Pai as being my guru. Both are very much true and I am thankful to Gautham for having provided me the opportunity to be on the Eclifox team. Although I have expressed my gratitude to Gautham a few times, I never really wrote it down anywhere - neither on my blog nor on any social networking site. A few months ago Gautham started his own company, Jnaapti, which is a technical skill development company. Basically he is doing what he is very good at, i.e. bringing out the best in anyone willing to learn and succeed. As part of the company's operations he conducts training sessions for various corporate clients and also mentors engineering students, helping them understand their subjects better using a useful project as a means of teaching. He is experimenting with various educational methodologies and different ways to teach/mentor students remotely. I am very confident that his efforts are going to vastly change the landscape of computer science engineering education. Having been mentored and guided by Gautham at various points, I thought now would be a good time to pen down a testimonial and finally put that gratitude in words. So here it goes :


Every software engineer who has been in the industry, even for a short time, surely knows the gap between academic teaching and industry requirements, and the initial uphill task of coping when a fresh engineering graduate joins any company. It would not have been different for me, if not for Gautham's guidance as my senior at college and my mentor at IBM during my internship. With Gautham's mentoring, the internship was probably one of the most productive spans of my 4 years of engineering studies and also the one packed with maximum learning. Additionally it opened up a number of opportunities for me which I previously did not even know existed - like my participation in Google Summer of Code and later being with the Mozilla community for quite some time, and many others.

Traits like general intelligence, theoretical understanding of the subjects and the ability to solve problems are undoubtedly necessary, but not sufficient. An engineer should be able to think not just about the next line of code that he is going to write but also about the product that he is building or helping build. He should also know that any technology is just a tool to get the work done and not something that limits you. That way you just pick up any new tool that you come across and find useful or necessary for the job. This also means you keep up with the latest happenings in the tech world via blogs, articles, mailing lists etc. Above all, the zeal to do more, to come up with new ideas, to start executing those ideas and the persistence to see them through, in the course carefully managing a team as a leader, are what will make an engineer truly successful.

I, of course, did not realize or understand all this during my internship. These were not handed out to me in a bulleted list of to-dos. Rather it was all nicely baked into the project that I (with a few friends) carried out and I was set on the right path without any additional effort. More than that, all of this was demonstrated to us in practice by Gautham himself and some of it just rubbed off on me, making me a much better problem solver, much better product developer, much better ENGINEER than I would have been otherwise. Now when I look back at my internship days and my days as an engineer after that, I clearly see the impact and how much it has helped. That's Gautham and his way of mentoring. Thank you Gautham for letting me be part of the Eclifox team and for your guidance till date and into the future too. :). (For the readers : Eclifox was what we built during our internship - http://buzypi.in/2007/10/11/eclifox-bringing-eclipse-to-the-browser/ - and I am very proud of it.)
In case you are wondering where all of this finally landed me, here is my LinkedIn profile. :)
Keep up the great work Gautham. Wish you all the success and happiness.

Friday, November 25, 2011

Transactions - both single-node and distributed - are hardwired into Windows - since Win 95

Transactions, or "Atomic Transactions" to be precise, are very well known to anyone who has worked with databases. With the recent advent of NoSQL databases and the CAP theorem being used/abused by anyone and everyone, words like "consistency" and "transactional model" have become run-of-the-mill jargon. But what is actually interesting is that the concept of transactions, or the transactional model, goes beyond our typical RDBMS. Things get even more challenging when we try to achieve transactions in a distributed system. Because transactions inherently lock the resource(s)/data they are operating on until the transaction completes, those resources can very easily become inaccessible altogether in a distributed setup if one of the nodes fails, or if there is some problem with the network, or any such thing - thereby increasing the complexity of implementing distributed transactions manyfold compared to transactions on a single node.

Today I was trying to figure out if there is a way to "simulate" (albeit in a very crude fashion) some sort of transactions in my application which uses MongoDB (which doesn't support transactions by design - to avoid the locking mentioned above, although ironically there is a global write lock..!!). Searching on the internet led me to this blog of a RavenDB developer. The author there mentions that RavenDB supports both sharding and transactions, which means it has implemented distributed transaction support. At first read I was pretty impressed (this was the first time I had heard about RavenDB). Before I could ask the author about the implementation details I saw a comment in which the author had mentioned that they use DTC (which again was a new thing). Turns out DTC, the Distributed Transaction Coordinator, is a service that is baked right into the Windows OS itself, dating all the way back to the Windows 95 days (wow.. now I am impressed with Windows..!). Here is the MSDN article describing the service.

The MSDN article clearly explains the basics of distributed transactions and how they are modeled. What is worth noting is that, by abstracting out the code for carrying out distributed transactions as a service, multiple resource managers (like different databases, queue servers, file servers/managers, etc.) can all interact together in a single transaction. For example, let's say you have a web application wherein a client request results in a job being picked up from a queue for processing, and simultaneously you update the status of the job in a DB and also create a new file associated with the start of the job. Very evidently the three resource managers and the web application itself can be (very likely will be) on different nodes. With something like DTC you can easily create a new transaction, send across a commit message, and you will get a success notification only if all three actions were successful - or else none of the actions go through. Of course, this is possible only if all three resource managers involved here adhere to Microsoft's DTC specification and provide the necessary interface to work with it.
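
The machinery underneath this kind of coordination is typically a two-phase commit: the coordinator first asks every resource manager to "prepare" (vote), and only if all of them vote yes does it tell them all to commit. Here is a bare-bones Python sketch of that idea - just the concept, not the actual DTC API (which is exposed through COM/.NET interfaces):

```python
# Bare-bones two-phase commit sketch - the idea behind DTC, not its actual API.

class ResourceManager:
    def __init__(self, name):
        self.name = name

    def prepare(self):
        # Phase 1: do the work and hold locks, but don't make it visible yet.
        # Returning True means "yes, I can commit".
        print(self.name, "prepared")
        return True

    def commit(self):
        print(self.name, "committed")    # Phase 2: make the work permanent

    def rollback(self):
        print(self.name, "rolled back")

def run_transaction(participants):
    if all(rm.prepare() for rm in participants):  # every vote must be "yes"
        for rm in participants:
            rm.commit()
        return True
    for rm in participants:                       # any "no" -> everyone rolls back
        rm.rollback()
    return False

run_transaction([ResourceManager("queue"), ResourceManager("database"),
                 ResourceManager("file server")])
```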

The previous example might make DTC appear like this Jason Bourne kind of super dude who can take care of all the heavy lifting and also do it very efficiently. But remember even Bourne gets shot at and also loses his girl. So DTC is not fully immune to problems either. Here is one blog post titled "My beef with MSDTC and two phase commits". It is definitely worth reading. Note that my impression about DTC is purely based on reading the documentation. I have not written a single line of code using DTC.

Sunday, October 30, 2011

Computer hardware test/analysis tools

Today I happened to look into the causes of overheating of my Dell Studio 14 (1458) laptop (i7 720QM, 4GB RAM, Win 7, with ATI Mobility Radeon HD 5450 - 1GB) and came across these various tools which people use to test/analyze their hardware.

ThrottleStop
CPUmark
CPU RightMark
Furmark
ORTHOS
HWMonitor
RMClock - Undervolt processor.
GPU-Z

I am yet to use any of these, but I do hope to run them soon and post some results here, JFK. :)

Saturday, October 15, 2011

Machine Learning online course - Class 1

As I mentioned in my previous blog post I am going to use this blog as my course notebook. All posts related to this course will have a "ml-class" tag, just in case.

The first class was all introduction stuff, as expected. What I really liked about this class was the real world examples used. They were very useful in understanding what to expect from this course. Anyways, here are my notes for the class :

Initially there were formal definitions of Machine Learning, one of them with rhyming phrases. I think we can skip those parts.

There are two types of learning algorithms - Supervised and Unsupervised

1) Supervised - A bunch of right answers are already provided to the machine. The machine has to try and get more of those right answers for the next set of questions.
The data provided already has some sense of direction or some sort of inference. It is like a set of input and output values, and we have to predict the output value for a new input value based on the existing data. Here the resultant dimension is known and defined. We have to find a suitable function which, when applied on the given set of input values, will best match the corresponding output values. This same function will then be used to predict output values for new inputs.
  - Eg :
    1) Predicting the price of the house of a particular size given the price of various houses of varying sizes.
    2) Predicting whether a tumor is malignant or not based on its size, given the answers for tumors of various sizes
 
  Different Types
    1) Regression - Machine tries to predict a continuous valued attribute, i.e. the value of the attribute we are trying to predict belongs to a continuous range. (The house price example - see the sketch after this list.)
    2) Classification - Machine tries to predict a discrete valued output, i.e. the range of values is a finite small set of discrete values. (The tumor example)
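
A minimal sketch of the regression case - fitting a straight line to made-up house-price numbers with numpy's least-squares polynomial fit (the numbers here are invented for illustration):

```python
import numpy as np

# Made-up training data: house size (sq. ft.) -> price (made-up units).
sizes = np.array([1000, 1500, 2000, 2500, 3000])
prices = np.array([30.0, 42.0, 55.0, 68.0, 80.0])

# Fit price = w * size + b by least squares (degree-1 polynomial).
w, b = np.polyfit(sizes, prices, deg=1)

# The same fitted function is then used to predict the price of a new house.
print("Predicted price for 1800 sq. ft. :", w * 1800 + b)
```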
   
2) Unsupervised learning - The data set given doesn't provide anything conclusive. It is just a data set and we are expected to make sense out of it and come up with the inference. There is no expected or target domain defined. It has to be inferred by examining the data. Very likely several target domains will be defined over the course of analyzing the data.
  - Types :
    1) Clustering of data -
      -Eg : Google news example. Several articles about the same topic are grouped/clustered together. The input data set for this is just a bunch of articles (which is just one dimension/attribute). The other dimension (which is the common topic) itself is not well defined, i.e. the topics are not known beforehand. We keep defining them as we go. So we have to infer that some of the articles belong to the same/similar topic and can be grouped together. (A tiny clustering sketch follows.)
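
Here is a tiny sketch of clustering with the standard k-means algorithm - made-up 1-D points instead of news articles, and just the textbook algorithm, not whatever Google News actually uses:

```python
import numpy as np

def kmeans(points, k, iterations=10):
    centers = points[:k].astype(float)        # start with k arbitrary centers
    for _ in range(iterations):
        # Assign each point to its nearest center...
        labels = np.argmin(np.abs(points[:, None] - centers[None, :]), axis=1)
        # ...then move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean()
    return labels, centers

points = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])   # two obvious groups
labels, centers = kmeans(points, k=2)
print(labels, centers)   # e.g. [0 0 0 1 1 1] and centers near 1.0 and 5.0
```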

That's it. Done with the first class. YAY.. !. I am yet to attempt the review exercises. I have decided to go for review exercises of this and the next class together.

ಹರಿಃ ಓಂ.

Friday, October 14, 2011

Starting with the Stanford online Machine Learning class

Today I am starting with the Stanford online Machine Learning classes, taught by Andrew Ng. This is my second attempt at learning machine learning, via the same medium, with the same professor and under the same program. This course has been available online for about 3 years, albeit the current one is much more polished and very meticulously designed for online learning, unlike the old ones which were just recordings of the actual classroom teaching. Two years ago (i.e. in 2009) I, along with some of my friends/colleagues (from different teams), decided to learn machine learning at our office. Quickly a team of interested folks was formed. There was a friend who had completed his post-graduation in the US and he had studied and AFAIK also worked on machine learning stuff. Then there were two other friends who had completed their post-graduation from IISc. Then there was my boss, who had also done his post-graduate studies in the US. Apart from these folks, there were some smart folks not keen on a post-grad degree. And finally there was me. Yeah, me too. :). The idea was that all of us would watch the video lectures and one of us would present a session weekly.

We started with a bang, with an initial class on the basics of statistics and probability, taught by that experienced and well-learned friend of mine. He called it "Statistics 101". It was good. There was no video lecture for this, so it was useful for me as my math needed a lot of dusting. This was followed by the first video lecture, which I believe had an introduction to ML in general and also introduced Linear Regression with one variable. It was taught/presented by another friend who had finished his post-graduation from the prestigious IISc. It went well too. (A side note : this friend seemed to have picked up the teaching traits/style of his IISc professors and I was getting the feeling of actually attending a class at IISc.) More importantly most of us had watched the video and read the lecture notes once beforehand. All in all, the plan was on track. Till then.

But then it all fell apart from the subsequent week. The meeting time clashed with another actual work-related meeting for some. An upcoming release caused a couple of us to give it a skip. Then, after two missed meetings, the interest had pretty much waned and the ever-increasing workload did not help anyone. After the meeting was postponed many times, with several declined responses to the calendar invite, the "Statistics 101" friend, who had set up the meeting, removed it entirely from the calendar, and the Machine Learning studies ceased to exist officially too.

Now, a couple of months ago, when I found out about the ML class being offered again, in an entirely new package tailor-made for online learning, I decided that this time I would take it seriously and learn ML for real. I signed up immediately, not just for ML, but also for the DB and AI classes that are being offered simultaneously - and for the "Advanced Track" in all of them (which now looks like a bad move; I don't think I will be able to take up the AI class). Although classes officially started on Sunday/Monday, I could not get to them until today. I just kept postponing it. Thanks to the review questions deadline, it came down to a now-or-never situation and I finally took the bold step of starting with the ML and DB classes. Luckily the DB class doesn't have any assignments due on 16th Oct, so I just watched the introductory video. Then I started with the ML video lectures, which I am going through right now. I hope to keep up with the course schedule and get to all assignments on time, although they allow two delayed submissions. More importantly, I hope to learn something that I can use at my work right away, because I know there is scope for that at work.

I intend to blog continuously as I go through the course. This will sort of act as my notebook and also keep my blog alive and updated, and give it some meaningful content. :)

Good luck to myself. .. !
Hari: Om.
ಹರಿಃ ಓಂ.