Hadoop Summary

Our production Hadoop cluster is based on Cloudera's CDH3u6 repository. Here are some details:

  • namenode (also the login node):
    • whitetail also runs the Hadoop Scheduler and Health Monitor
    • ssh to it directly or from any of our other tails
  • resources: access to 600 GB of memory and 1.75 TB of Hadoop Distributed File System (HDFS) storage
    • these could be doubled in the near future if needed
  • HDFS is not backed up!
    • You must request a writable work area at /userdata/username
    • Be sure to download your results to /home/username (the regular filesystem)
  • Data to be shared (dictionaries, anagrams, etc.) can be posted in /shareddata
    • request such items to be posted there
  • Basic tools (request other tools to be installed)
    • shell scripting
    • python
    • perl (Hadoop::Streaming)
    • java (both Oracle in /usr/java and openJDK)
    • R+RHadoop (rmr2, rhdfs, rhbase)
    • HBase (NoSQL database)
    • MySQL
      • request a database to be set up for you (limited space)
  • Note: the permissions are a bit weird in HDFS, but I think it is sorted out.
    • If this turns into a problem we'll let everybody run as user hdfs …
  • Note: some http links will not work because they point to the private network
    • If you wish to view them, launch firefox from whitetail …
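Since the cluster supports streaming jobs in shell, Python, and Perl, a word-count job makes a handy first test of the setup. The sketch below is a minimal, hypothetical example (the file name and the streaming-jar path are assumptions; the exact jar location varies by CDH install). It combines the mapper and reducer in one script, which is convenient because streaming stages are plain stdin/stdout filters and can be debugged locally with pipes before submitting to the cluster:

```python
#!/usr/bin/env python
# wordcount.py -- minimal Hadoop Streaming word count (hypothetical example).
#
# Debug locally without the cluster (streaming stages are stdin/stdout filters):
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
#
# Submit on the cluster (jar path is an assumption; check your CDH install):
#   hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar \
#       -input /userdata/username/input -output /userdata/username/out \
#       -mapper "python wordcount.py map" \
#       -reducer "python wordcount.py reduce" \
#       -file wordcount.py
import sys

def mapper(lines):
    # Emit one "word<TAB>1" record per word on each input line.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(lines):
    # Hadoop sorts by key between stages, so records for one word arrive
    # contiguously; sum each run and emit "word<TAB>total".
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for record in stage(line.rstrip("\n") for line in sys.stdin):
        print(record)
```

The local pipe test mirrors exactly what Hadoop does (map, sort by key, reduce), so a job that works in the pipe will generally behave the same on the cluster, just partitioned across nodes.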

cluster/121.txt · Last modified: 2013/09/16 11:09 by hmeij