Wednesday, December 22, 2010

How Hadoop's HDFS input file get mapped to MapReduce map?

Though quite late, I started on Hadoop about a week back; it took me a couple of days (with the help of my team members) to set up a local Hadoop installation on my system using Cygwin.
I wrote an example MapReduce job in which the Mapper processes a given file to calculate GPS displacement for a person based on latitude and longitude information, and the Reducer then figures out the maximum displacement from the combined displacement list.
Everything went well until I got stuck at a point where I was unable to understand how KeyIn and ValueIn are mapped from the HDFS file read. How can I customize what goes into the key and what goes into the value? The Hadoop wiki states:

"It is not necessary for the InputFormat to generate both meaningful keys and values. For example, the default output from TextInputFormat consists of input lines as values and somewhat meaningless line start file offsets as keys - most applications only use the lines and ignores the offsets."


Hence, it depends on the specific implementation of RecordReader. In the case of TextInputFormat we use LineRecordReader, which produces fairly meaningless LongWritable keys (Writable is Hadoop's serialization interface; LongWritable is its implementation for the long datatype) as input to the Mapper, holding the byte offset of each line. KeyValueLineRecordReader in KeyValueTextInputFormat (not in hadoop-core-0.20.2, but I can see it in the mapreduce trunk) reads the text file and separates key and value by the \t (tab) separator in each input line.
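
For example, a bare-bones Mapper for such a job might look roughly like this (DisplacementMapper and the field layout are just illustrative, not the actual code I wrote):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat / LineRecordReader, KeyIn is the byte offset of the
// line within the file (LongWritable) and ValueIn is the line itself (Text).
public class DisplacementMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The offset key is mostly useless; the interesting data is in the line.
        String[] fields = line.toString().split(",");  // e.g. person,latitude,longitude
        // ... compute the displacement from latitude/longitude here ...
        context.write(new Text(fields[0]), line);
    }
}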

Monday, November 22, 2010

Using HPROF and HAT to profile

HPROF is actually a JVM native agent library which is dynamically loaded through a command line option, at JVM startup, and becomes part of the JVM process.

Live object allocation in the heap, generated by running javac under the agent:
javac -J-agentlib:hprof=heap=dump Hello.java
Each frame in the stack trace contains class name, method name, source file name, and the line number. The user can set the maximum number of frames collected by the HPROF agent (depth option). The default depth is 4. Stack traces reveal not only which methods performed heap allocation, but also which methods were ultimately responsible for making calls that resulted in memory allocation.

CPU usage Sampling Profiles:
javac -J-agentlib:hprof=cpu=samples Hello.java
The cpu=samples option doesn't use BCI (bytecode instrumentation); HPROF just spawns a separate thread that sleeps for a fixed number of microseconds, then wakes up and samples all the running thread stacks using JVM TI.

CPU usage Times Profiles:
javac -J-agentlib:hprof=cpu=times Hello.java
The cpu=times option attempts to track the running stack of all threads, and keep accurate CPU time usage on all methods. This option probably places the greatest strain on the VM, where every method entry and method exit is tracked. Applications that make many method calls will be impacted more than others.
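
The same agent options work for profiling your own application launched with java, not just javac. A trivial program to try them on might look like this (Busy is just an illustrative name; the command to run it is shown in the comment):

// Run with, for example:  java -agentlib:hprof=cpu=samples Busy
// (the profile is written to java.hprof.txt by default)
public class Busy {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 50000000; i++) {
            sum += (long) Math.sqrt(i) * i;  // keep the CPU busy so the samples have something to show
        }
        System.out.println(sum);
    }
}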


Using HAT
Generate a binary hprof file using the -Xrunhprof flag:
java -Xrunhprof:file=dump.hprof,format=b Main

To run HAT:
jhat -port 7002 dump.hprof
(by default the port is 7000)

Happy profiling your applications :)

Tuesday, September 21, 2010

Camel in Action – Book Review


Recently, I reviewed the MEAP edition of Camel in Action. I guess it was a last-stage review; the book is supposed to be released later this year. It was a very interesting and informative experience for me. The book is very well devised and provides a seamless flow across the chapters as we learn Camel.

The authors, Claus Ibsen and Jonathan Anstey, have provided both Java DSL and Spring DSL examples for plugging in almost every processor, whether it is one of the Enterprise Integration Patterns (EIPs), a route, a transformer, or one of the components supported by Camel. They have explained Camel development very well: adding customized routes to the CamelContext using a RouteBuilder, adding intermediate routes in the routing engine, or using one of the roughly 80 components such as File, FTP, CXF, JMS, JPA, Quartz, HTTP, etc. They have taken a simple scenario, Rider Auto Parts, come up with fairly practical requirements, and shown how Camel can be used to implement and configure solutions for those requirements. We also get a comprehensive list of the well-known data formats supported by Camel and details of their usage. What I appreciate in Camel, and what is also well explained in this book, is the good support for debugging your messages and for introducing mock components and testing. My personal interest has been CXF, and I was happy to see a good description of configuring the CXF component both by referencing a bean and by configuring it through a URI. The authors have diligently shown both contract-first and code-first approaches. Every chapter and example refers to downloadable source code which can be run quickly with a simple Maven command, and several of Camel's Maven archetypes for creating your own projects are explained as well. A good point is that Camel development is well supported by the Eclipse IDE.
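
As a rough illustration of the Java DSL style the book walks through, a route added via a RouteBuilder might look like this (the endpoint URIs here are made up for the sketch, not taken from the book):

import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class RiderRoutes {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        // A RouteBuilder adds a customized route to the CamelContext:
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                // poll a directory and hand each file to a JMS queue
                from("file:data/inbox?noop=true")
                    .to("jms:queue:orders");
            }
        });
        context.start();
        Thread.sleep(5000);  // let the route run briefly
        context.stop();
    }
}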

Beginners will learn many things: what Camel is, what all it can do, and how simple it is to use. Intermediate users, I feel, will get hands-on exposure to some alternative ways of configuring and using Camel, its components, data formats and EIPs, and to monitoring and managing Camel. For advanced users, I guess the chapters on concurrency and transactions will be helpful in getting more insight into Camel.

Tuesday, May 18, 2010

Small facts about Generics

  • Type parameters are treated as non-static typed variables; hence you cannot declare a static field of a type parameter's type or use a type parameter inside a static method.

  • A type parameter may hide an actual, already declared Java type with the same name, so take caution when choosing type-variable names.

  • The compiler allows a declared type parameter to extend an already declared final Java type; it just gives you a warning, not an error.

  • You may supply an entirely different type argument while creating an object and work with it, but reflection on that object won't give you the actual type arguments of the instance; it will only give you the type variables declared in the class declaration (see the sketch after this list).
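
A small sketch of the first and last points (the class name Box is just for illustration):

import java.lang.reflect.TypeVariable;
import java.util.Arrays;

public class Box<T> {
    private T value;              // fine: an instance field may use T
    // private static T cache;   // does not compile: T cannot be used in a static context

    public void set(T value) { this.value = value; }
    public T get() { return value; }

    public static void main(String[] args) {
        Box<String> box = new Box<String>();
        box.set("hello");

        // Reflection only reports the declared type variable "T",
        // not the actual type argument (String) used at instantiation.
        TypeVariable<?>[] vars = box.getClass().getTypeParameters();
        System.out.println(Arrays.toString(vars)); // prints [T]
    }
}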

Saturday, May 01, 2010

ConcurrentModificationException - How to avoid / remove it?

As per the Java Doc:

This exception may be thrown by methods that have detected concurrent modification of an object when such modification is not permissible. For example, it is not generally permissible for one thread to modify a Collection while another thread is iterating over it.

The exception is caused by fail-fast iterators; any collection using these iterators can throw a CME. It happens mainly for two reasons:

  • The state of the iterator gets changed by the same or another thread in such a fashion that all other references feel that their understanding of the iterator's state is dirty. For instance, threadA is iterating using iterator.next() while some other threadB takes a reference to the collection and removes an entry from it; threadA is then hinted that its knowledge of the iterator's state is no longer right, in other words it has become dirty. An important point is that another thread is not required to cause a CME; it can happen within a single thread as well, when you are iterating over the collection and calling collectionImpl.remove or collectionImpl.add, which makes the iterator reference feel like it has been fooled, as its state is about to change.
  • Another reason could be that the underlying collection is still being populated (not yet fully initialized) when some other thread starts iterating over it.

Code like the one below will throw a ConcurrentModificationException:

import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;

public class Sample {

    List<String> strings = new ArrayList<String>();

    public void fillList() {
        for (int i = 0; i < 10; i++) {
            strings.add("" + i);
        }
    }

    public void iterateList() {
        /* WRONG WAY, will produce CME:
         * the iterator reference is old; the iterator's idea of the
         * collection's state becomes dirty after collectionImpl.remove/add/etc.
         */
        Iterator<String> itr = strings.iterator();
        strings.remove("7");
        while (itr.hasNext()) {
            System.out.println(itr.next());
        }
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        s.fillList();
        s.iterateList();
    }
}


The solutions to avoid ConcurrentModificationException are as follows:



1. Use Weakly Consistent Iterators - Java SE 5 and Java SE 6 ship many collection implementations that use weakly consistent iterators, which don't throw a CME (see the sketch after this list).




  • CopyOnWriteArraySet, CopyOnWriteArrayList (Java SE 5): They copy the internal array on each modification. Hence, make sure that when you use these implementations your usage is mostly iteration rather than modification.


  • ConcurrentSkipListSet, ConcurrentSkipListMap (Java SE 6): They are skip-list based implementations that provide concurrency along with sorted ordering.


  • ConcurrentHashMap (Java SE 5): provides extra atomic methods such as putIfAbsent.
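
For instance, a rough sketch of option 1 using CopyOnWriteArrayList (the class name SnapshotIterationDemo is just illustrative), with the same scenario as the broken example above but without the CME:

import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SnapshotIterationDemo {
    public static void main(String[] args) {
        List<String> strings = new CopyOnWriteArrayList<String>();
        for (int i = 0; i < 10; i++) {
            strings.add("" + i);
        }

        // The iterator works on a snapshot of the internal array, so
        // modifying the list mid-iteration does not throw a CME.
        Iterator<String> itr = strings.iterator();
        strings.remove("7");
        while (itr.hasNext()) {
            System.out.println(itr.next()); // prints 0..9 from the snapshot
        }
    }
}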



2. Make sure you are using the iterator in the right way:



import java.util.List;
import java.util.ArrayList;
import java.util.Iterator;

public class Sample {

    List<String> strings = new ArrayList<String>();

    public void fillList() {
        for (int i = 0; i < 10; i++) {
            strings.add("" + i);
        }
    }

    public void iterateList() {
        // RIGHT WAY 1
        // Get a new reference to the iterator after the modification.
        /*
        strings.remove("7");
        for (String str : strings) {
            System.out.println(str);
        }
        */

        // OR RIGHT WAY 2
        // Remove the element through the iterator reference itself,
        // so that the iterator doesn't become dirty.
        /*
        Iterator<String> itrNew = strings.iterator();
        while (itrNew.hasNext()) {
            String str = itrNew.next();
            if ("7".equals(str)) {
                itrNew.remove();
            } else {
                System.out.println(str);
            }
        }
        */

        // OR RIGHT WAY 3, quite similar to RIGHT WAY 1
        // Get the iterator reference after your collection modification operation.
        /*
        strings.remove("7");
        Iterator<String> itrOther = strings.iterator();
        while (itrOther.hasNext()) {
            System.out.println(itrOther.next());
        }
        */
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        s.fillList();
        s.iterateList();
    }
}


3. Make sure that your collection instances are populated before your threads start iterating over them.



import java.util.ArrayList;
import java.util.List;

class MyClass {

    private final List myList = makeList();

    private static List makeList() {
        List list = new ArrayList();
        // do what you need to initialize this list
        return list;
    }
}


A ConcurrentSet is not included in Java 1.6; instead, the Collections class provides a convenient method, newSetFromMap, which returns a set backed by a map.



If you want, you can use the Decorator pattern to write a ConcurrentSet that takes a regular Set as a constructor argument and internally uses a ConcurrentHashMap<E, Boolean> to provide a concurrent set.
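
A minimal sketch of the newSetFromMap approach (the class name ConcurrentSetDemo is just for illustration):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetDemo {
    public static void main(String[] args) {
        // A concurrent Set backed by a ConcurrentHashMap (Java 6+):
        Set<String> concurrentSet =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        concurrentSet.add("a");
        concurrentSet.add("b");

        // Safe to modify while iterating: the backing map's iterators
        // are weakly consistent, so no CME is thrown.
        for (String s : concurrentSet) {
            concurrentSet.add("c");
            System.out.println(s);
        }
    }
}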

Wednesday, April 21, 2010

Linux Shell script to find the process listening on a given port

port=$1
procinfo=$(netstat --numeric-ports -nlp 2> /dev/null | grep ^tcp | grep -w ${port} | tail -n 1 | awk '{print $7}')
case "${procinfo}" in
"")
echo "No process listening on port ${port}"
;;
"-")
echo "Process is running on ${port}, but current user does not have rights to see process information."
;;
*)
echo "${procinfo} is running on port ${port}"
ps -uwep ${procinfo%%/*}
;;
esac

Wednesday, February 10, 2010

Ctrl-S hanged vi editor

This has been the standard almost since the beginning of time.

Ctrl-S is the XOFF character, used for software flow control on serial links that do not have hardware flow control. It doesn't hang vi itself; it just pauses the terminal output, so the editor only looks frozen.
Receipt of the XON character (Ctrl-Q) releases the flow control and allows the output to continue.