org.apache.hadoop.tools
Class CopyListing

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.hadoop.tools.CopyListing
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable
Direct Known Subclasses:
FileBasedCopyListing, GlobbedCopyListing, SimpleCopyListing, SimpleFileBasedCopyListing

public abstract class CopyListing
extends org.apache.hadoop.conf.Configured

The CopyListing abstraction determines how the list of sources and targets is constructed for DistCp's copy function. The copy listing is a SequenceFile located at the path passed to buildListing(). Each entry is a pair of (source-relative path, source file status), and the paths in the file statuses are fully qualified.


Constructor Summary
protected CopyListing(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.security.Credentials credentials)
          Protected constructor, to initialize configuration.
 
Method Summary
 void buildListing(org.apache.hadoop.fs.Path pathToListFile, DistCpOptions options)
          Creates the input listing that DistCp uses to perform the copy.
protected  void checkForDuplicates(org.apache.hadoop.fs.Path pathToListFile)
          Validate the final path listing to see if there are any duplicate entries.
protected abstract  void doBuildListing(org.apache.hadoop.fs.Path pathToListFile, DistCpOptions options)
          The interface to be implemented by sub-classes, to create the source/target file listing.
protected abstract  long getBytesToCopy()
          Return the total bytes that DistCp should copy for the source paths. This does not consider whether a file is unchanged and would be skipped during the copy.
static CopyListing getCopyListing(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.security.Credentials credentials, DistCpOptions options)
          Public Factory method with which the appropriate CopyListing implementation may be retrieved.
protected  org.apache.hadoop.security.Credentials getCredentials()
          Get the credentials used to update the delegation tokens for accessed FS objects.
protected abstract  long getNumberOfPaths()
          Return the total number of paths to copy, including directories. This does not consider whether a file or directory is already present and would be skipped during the copy.
protected  void setCredentials(org.apache.hadoop.security.Credentials credentials)
          Set the Credentials store on which FS delegation tokens will be cached.
protected abstract  void validatePaths(DistCpOptions options)
          Validate input and output paths
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CopyListing

protected CopyListing(org.apache.hadoop.conf.Configuration configuration,
                      org.apache.hadoop.security.Credentials credentials)
Protected constructor, to initialize configuration.

Parameters:
configuration - The input configuration, with which the source/target FileSystems may be accessed.
credentials - Credentials object on which the FS delegation tokens are cached. If null, delegation token caching is skipped.
Method Detail

buildListing

public final void buildListing(org.apache.hadoop.fs.Path pathToListFile,
                               DistCpOptions options)
                        throws IOException
Creates the input listing that DistCp uses to perform the copy. The listing is a sequence file that has the relative path of a file as the key and the file status information of the source file as the value.

For instance, if the source path is /tmp/data and the traversed path is /tmp/data/dir1/dir2/file1, then the sequence file would contain:

key: /dir1/dir2/file1 and value: FileStatus(/tmp/data/dir1/dir2/file1)

The file also contains directory entries. That is, if /tmp/data/dir1/dir2/file1 is the only file under /tmp/data, the resulting sequence file would contain the following entries:

key: /dir1 and value: FileStatus(/tmp/data/dir1)
key: /dir1/dir2 and value: FileStatus(/tmp/data/dir1/dir2)
key: /dir1/dir2/file1 and value: FileStatus(/tmp/data/dir1/dir2/file1)

Cases requiring special handling: if the source path is a file (/tmp/file1), the contents of the listing file will be as follows:

TARGET DOES NOT EXIST: Key-"", Value-FileStatus(/tmp/file1)
TARGET IS FILE       : Key-"", Value-FileStatus(/tmp/file1)
TARGET IS DIR        : Key-"/file1", Value-FileStatus(/tmp/file1)

Parameters:
pathToListFile - Output file where the listing would be stored
options - Input options to DistCp
Throws:
IOException - Exception if any
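
The key rules described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class ListingKeySketch, the enum TargetState, and both helper methods are hypothetical names that do not exist in DistCp, paths are modelled as plain strings rather than Path/FileStatus objects, and returning "" for the source root itself is an assumption of the sketch.

```java
class ListingKeySketch {

    /** Hypothetical stand-in for the three target situations named above. */
    enum TargetState { DOES_NOT_EXIST, IS_FILE, IS_DIR }

    /** Key for a path found while traversing a source directory. */
    static String relativeKey(String sourceRoot, String traversedPath) {
        if (traversedPath.equals(sourceRoot)) {
            return ""; // assumption of this sketch: the root maps to the empty key
        }
        String prefix = sourceRoot.endsWith("/") ? sourceRoot : sourceRoot + "/";
        if (!traversedPath.startsWith(prefix)) {
            throw new IllegalArgumentException(traversedPath + " is not under " + sourceRoot);
        }
        // Keep the slash that separates the root from the remainder.
        return traversedPath.substring(prefix.length() - 1);
    }

    /** Key for the special case where the source path is itself a file. */
    static String keyForSingleFile(String sourceFile, TargetState target) {
        if (target == TargetState.IS_DIR) {
            // The file keeps its name inside the target directory.
            return sourceFile.substring(sourceFile.lastIndexOf('/'));
        }
        // Target is missing or is a file: the source maps directly onto the target.
        return "";
    }

    public static void main(String[] args) {
        System.out.println(relativeKey("/tmp/data", "/tmp/data/dir1/dir2/file1")); // /dir1/dir2/file1
        System.out.println(keyForSingleFile("/tmp/file1", TargetState.IS_DIR));     // /file1
    }
}
```

The empty key in the single-file case reflects that the source file maps directly onto the target path, so there is no relative component to append.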

validatePaths

protected abstract void validatePaths(DistCpOptions options)
                               throws IOException,
                                      org.apache.hadoop.tools.CopyListing.InvalidInputException
Validate input and output paths.

Parameters:
options - Input options
Throws:
InvalidInputException - If inputs are invalid
IOException - Any exception with the FS

doBuildListing

protected abstract void doBuildListing(org.apache.hadoop.fs.Path pathToListFile,
                                       DistCpOptions options)
                                throws IOException
The interface to be implemented by sub-classes, to create the source/target file listing.

Parameters:
pathToListFile - Path on HDFS where the listing file is written.
options - Input options for DistCp (indicating source/target paths).
Throws:
IOException - Thrown on failure to create the listing file.
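
Because buildListing is final while validatePaths and doBuildListing are abstract, the class appears to follow the template-method pattern. The sketch below illustrates that shape with plain Java stand-ins. The order of the calls (validate, build, then check for duplicates) is an assumption, since this page does not state it, and ListingTemplateSketch and InMemoryListing are hypothetical classes, not part of DistCp.

```java
import java.util.ArrayList;
import java.util.List;

abstract class ListingTemplateSketch {

    // Records each step so the call order can be observed.
    protected final List<String> steps = new ArrayList<>();

    // Final driver: subclasses cannot reorder or skip the steps.
    public final void buildListing(String pathToListFile) {
        validatePaths();
        doBuildListing(pathToListFile);
        checkForDuplicates(pathToListFile); // assumed to run after the listing is written
    }

    protected abstract void validatePaths();

    protected abstract void doBuildListing(String pathToListFile);

    // Shared step with a default implementation in the base class.
    protected void checkForDuplicates(String pathToListFile) {
        steps.add("checkForDuplicates");
    }

    public static void main(String[] args) {
        InMemoryListing listing = new InMemoryListing();
        listing.buildListing("/tmp/listing.seq");
        System.out.println(listing.steps);
    }
}

// Hypothetical subclass that only supplies the two abstract steps.
class InMemoryListing extends ListingTemplateSketch {

    @Override
    protected void validatePaths() {
        steps.add("validatePaths");
    }

    @Override
    protected void doBuildListing(String pathToListFile) {
        steps.add("doBuildListing:" + pathToListFile);
    }
}
```

A subclass in this style only decides how paths are validated and how entries are produced; the surrounding sequence stays fixed in the base class.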

getBytesToCopy

protected abstract long getBytesToCopy()
Return the total bytes that DistCp should copy for the source paths. This does not consider whether a file is unchanged and would be skipped during the copy.

Returns:
total bytes to copy

getNumberOfPaths

protected abstract long getNumberOfPaths()
Return the total number of paths to copy, including directories. This does not consider whether a file or directory is already present and would be skipped during the copy.

Returns:
Total number of paths to copy
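
To make the two totals concrete, here is a minimal sketch of how they could be derived from a set of listing entries. It assumes that directories contribute no bytes but still count as paths, as the descriptions above suggest. ListingTotalsSketch and its Entry type are hypothetical stand-ins for the real (relative path, FileStatus) pairs.

```java
import java.util.Arrays;
import java.util.List;

class ListingTotalsSketch {

    /** Hypothetical, simplified listing entry. */
    static final class Entry {
        final String relativePath;
        final boolean isDirectory;
        final long length;

        Entry(String relativePath, boolean isDirectory, long length) {
            this.relativePath = relativePath;
            this.isDirectory = isDirectory;
            this.length = length;
        }
    }

    /** Sums file lengths; no entry is skipped for already existing at the target. */
    static long bytesToCopy(List<Entry> entries) {
        long total = 0;
        for (Entry e : entries) {
            if (!e.isDirectory) {
                total += e.length;
            }
        }
        return total;
    }

    /** Counts every entry, directories included. */
    static long numberOfPaths(List<Entry> entries) {
        return entries.size();
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("/dir1", true, 0),
                new Entry("/dir1/dir2", true, 0),
                new Entry("/dir1/dir2/file1", false, 1024));
        System.out.println(bytesToCopy(entries));   // 1024
        System.out.println(numberOfPaths(entries)); // 3
    }
}
```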

checkForDuplicates

protected void checkForDuplicates(org.apache.hadoop.fs.Path pathToListFile)
                           throws org.apache.hadoop.tools.CopyListing.DuplicateFileException,
                                  IOException
Validate the final path listing to see if there are any duplicate entries.

Parameters:
pathToListFile - Path listing built by doBuildListing
Throws:
IOException - If any issue occurs while checking for duplicates
DuplicateFileException - If there are duplicates
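
One plausible way to detect duplicate entries is to sort the listing keys and compare neighbours, since equal keys end up adjacent after sorting. The sketch below shows that idea on an in-memory list of keys. It is not taken from the DistCp source: DuplicateCheckSketch and DuplicateEntryException are hypothetical names, and the real method operates on the sequence file at pathToListFile rather than on a list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DuplicateCheckSketch {

    /** Hypothetical stand-in for CopyListing.DuplicateFileException. */
    static class DuplicateEntryException extends Exception {
        DuplicateEntryException(String message) {
            super(message);
        }
    }

    /** Throws if the same relative key appears more than once. */
    static void checkForDuplicates(List<String> keys) throws DuplicateEntryException {
        List<String> sorted = new ArrayList<>(keys); // leave the caller's list untouched
        Collections.sort(sorted);
        for (int i = 1; i < sorted.size(); i++) {
            // After sorting, duplicates can only be direct neighbours.
            if (sorted.get(i).equals(sorted.get(i - 1))) {
                throw new DuplicateEntryException("Duplicate listing key: " + sorted.get(i));
            }
        }
    }

    public static void main(String[] args) {
        try {
            checkForDuplicates(Arrays.asList("/dir1", "/dir1/file1", "/dir1"));
        } catch (DuplicateEntryException e) {
            System.out.println(e.getMessage()); // Duplicate listing key: /dir1
        }
    }
}
```

Duplicates of this kind can arise, for example, when two source paths contribute the same relative path, in which case both entries would map to the same target.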

setCredentials

protected void setCredentials(org.apache.hadoop.security.Credentials credentials)
Set the Credentials store on which FS delegation tokens will be cached.

Parameters:
credentials - Credentials object

getCredentials

protected org.apache.hadoop.security.Credentials getCredentials()
Get the credentials used to update the delegation tokens for accessed FS objects.

Returns:
Credentials object

getCopyListing

public static CopyListing getCopyListing(org.apache.hadoop.conf.Configuration configuration,
                                         org.apache.hadoop.security.Credentials credentials,
                                         DistCpOptions options)
Public Factory method with which the appropriate CopyListing implementation may be retrieved.

Parameters:
configuration - The input configuration.
credentials - Credentials object on which the FS delegation tokens are cached
options - The input options, to help choose the appropriate CopyListing implementation.
Returns:
An instance of the appropriate CopyListing implementation.


Copyright © 2014 InMobi. All rights reserved.