When using a cluster filesystem such as Hadoop, the best approach is usually to compress the data before you store it in Hadoop, because this reduces the time spent transferring data over the network.
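
As a minimal sketch of compressing before upload, the commands below gzip a local file and, only if a `hadoop` client is installed, copy the compressed file into the cluster. The file name `events.log` and the target directory `/data/raw` are hypothetical placeholders, not names from this article.

```shell
#!/bin/sh
# Create a small sample file (hypothetical data, for illustration only).
printf 'line one\nline two\nline three\n' > events.log

# Compress locally; -c writes to stdout, so the original file is kept.
gzip -c events.log > events.log.gz

# Upload the compressed file only if a Hadoop client is available.
# /data/raw is an assumed target directory; replace it with your own.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -put events.log.gz /data/raw/
fi
```

Keep in mind that a gzip file cannot be split, so a MapReduce job will typically read each `.gz` file with a single map task.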

In a LAN environment, where network bandwidth is less likely to be a bottleneck, you can instead enable compression in Hadoop itself and let it do the job.
Typically, GzipCodec is used for this. The following instructions show how to use GzipCodec in Hadoop.

Enable GzipCodec as the default compression:

Edit the core-site.xml file on the NameNode host machine:

        <property>
          <name>io.compression.codecs</name>
          <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
          <description>A list of the compression codec classes that can be used
            for compression/decompression.</description>
        </property>
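
After editing, a quick sanity check can confirm that the file really lists GzipCodec. The sketch below assumes the configuration lives under `$HADOOP_CONF_DIR`, falling back to `/etc/hadoop/conf`; both locations are assumptions and may differ on your installation. The helper name `has_gzip_codec` is made up for this example.

```shell
#!/bin/sh
# Assumed location of core-site.xml; adjust to your installation.
CONF_FILE="${HADOOP_CONF_DIR:-/etc/hadoop/conf}/core-site.xml"

# Hypothetical helper: succeeds if the given file mentions GzipCodec.
has_gzip_codec() {
    grep -qF 'org.apache.hadoop.io.compress.GzipCodec' "$1"
}

if [ -f "$CONF_FILE" ] && has_gzip_codec "$CONF_FILE"; then
    echo "GzipCodec is listed in $CONF_FILE"
else
    echo "GzipCodec not found in $CONF_FILE"
fi
```

This only checks that the class name appears in the file; it does not verify that Hadoop can actually load the codec.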

Edit the mapred-site.xml file on the JobTracker host machine:

        <property>
          <name>mapred.compress.map.output</name>
          <value>true</value>
        </property>
         
        <property>
          <name>mapred.map.output.compression.codec</name>
          <value>org.apache.hadoop.io.compress.GzipCodec</value>
        </property>
         
        <property>
          <name>mapred.output.compression.type</name>
          <value>BLOCK</value>
        </property>

(Optional) Set the following two configuration parameters to enable job output compression.

Edit the mapred-site.xml file on the Resource Manager host machine (the JobTracker host on clusters that do not run YARN):

        <property> 
          <name>mapred.output.compress</name>
          <value>true</value>
        </property>

        <property>
          <name>mapred.output.compression.codec</name>
          <value>org.apache.hadoop.io.compress.GzipCodec</value>
        </property>

Restart the cluster so that the new configuration is picked up. The scripts below restart the HDFS daemons only:

$ stop-dfs.sh
$ start-dfs.sh

 

To use GzipCodec with a one-time only job:

For a one-time job, pass the same properties on the command line instead of editing the configuration files. Try the command below (or similar).

hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
    "-Dmapred.compress.map.output=true" \
    "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    "-Dmapred.output.compress=true" \
    "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    -outKey org.apache.hadoop.io.Text \
    -outValue org.apache.hadoop.io.Text input output
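
To check that the job output was really written with GzipCodec, one option is to look at the start of an output file. The sketch below assumes the job wrote SequenceFile output (which, as far as I know, is the default for the sort example) and that an output file named `output/part-00000` exists; both are assumptions. A compressed SequenceFile records the codec class name in its header, so searching the first bytes for that name is a rough but workable test. The helper name `header_names_gzip` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the first 512 bytes on stdin
# contain the GzipCodec class name, as a SequenceFile header does
# when the file was written with that codec.
header_names_gzip() {
    head -c 512 | grep -qaF 'org.apache.hadoop.io.compress.GzipCodec'
}

# Run the check only if a Hadoop client is available.
# output/part-00000 is an assumed file name; list the directory
# with "hadoop fs -ls output" to find the real one.
if command -v hadoop >/dev/null 2>&1; then
    if hadoop fs -cat output/part-00000 2>/dev/null | header_names_gzip; then
        echo "output is compressed with GzipCodec"
    else
        echo "GzipCodec not named in the file header"
    fi
fi
```

To read the records themselves, `hadoop fs -text output/part-00000` decodes the file to plain text.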