org.apache.hadoop.tools
Class CopyListing
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.hadoop.tools.CopyListing
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
- Direct Known Subclasses:
- FileBasedCopyListing, GlobbedCopyListing, SimpleCopyListing, SimpleFileBasedCopyListing
public abstract class CopyListing
- extends org.apache.hadoop.conf.Configured
The CopyListing abstraction is responsible for how the list of
sources and targets is constructed for DistCp's copy function.
The copy-listing should be a SequenceFile,
located at the path specified to buildListing(),
each entry being a pair of (source-relative path, source file status),
with the paths in the file status being fully qualified.
Constructor Summary |
protected |
CopyListing(org.apache.hadoop.conf.Configuration configuration,
org.apache.hadoop.security.Credentials credentials)
Protected constructor, to initialize configuration. |
Method Summary |
void |
buildListing(org.apache.hadoop.fs.Path pathToListFile,
DistCpOptions options)
Build listing function creates the input listing that distcp uses to
perform the copy. |
protected void |
checkForDuplicates(org.apache.hadoop.fs.Path pathToListFile)
Validate the final resulting path listing to see if there are any duplicate entries |
protected abstract void |
doBuildListing(org.apache.hadoop.fs.Path pathToListFile,
DistCpOptions options)
The interface to be implemented by sub-classes, to create the source/target file listing. |
protected abstract long |
getBytesToCopy()
Return the total bytes that DistCp should copy for the source paths.
This doesn't consider whether a file is identical at the target and would be skipped during copy |
static CopyListing |
getCopyListing(org.apache.hadoop.conf.Configuration configuration,
org.apache.hadoop.security.Credentials credentials,
DistCpOptions options)
Public Factory method with which the appropriate CopyListing implementation may be retrieved. |
protected org.apache.hadoop.security.Credentials |
getCredentials()
Get credentials, to update the delegation tokens for accessed FS objects |
protected abstract long |
getNumberOfPaths()
Return the total number of paths to distcp, including directories.
This doesn't consider whether a file/dir is already present and would be skipped during copy |
protected void |
setCredentials(org.apache.hadoop.security.Credentials credentials)
Set the Credentials store, on which FS delegation tokens will be cached |
protected abstract void |
validatePaths(DistCpOptions options)
Validate input and output paths |
Methods inherited from class org.apache.hadoop.conf.Configured |
getConf, setConf |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
CopyListing
protected CopyListing(org.apache.hadoop.conf.Configuration configuration,
org.apache.hadoop.security.Credentials credentials)
- Protected constructor, to initialize configuration.
- Parameters:
configuration:
- The input configuration,
with which the source/target FileSystems may be accessed.
credentials
- - Credentials object on which the FS delegation tokens are cached. If null,
delegation token caching is skipped.
buildListing
public final void buildListing(org.apache.hadoop.fs.Path pathToListFile,
DistCpOptions options)
throws IOException
- Build listing function creates the input listing that distcp uses to
perform the copy.
The listing is a sequence file with the relative path of a file as the key
and the file status information of the source file as the value.
For instance, if the source path is /tmp/data and the traversed path is
/tmp/data/dir1/dir2/file1, then the sequence file would contain
key: /dir1/dir2/file1 and value: FileStatus(/tmp/data/dir1/dir2/file1)
The file would also contain directory entries. That is, if /tmp/data/dir1/dir2/file1
is the only file under /tmp/data, the resulting sequence file would contain the
following entries:
key: /dir1 and value: FileStatus(/tmp/data/dir1)
key: /dir1/dir2 and value: FileStatus(/tmp/data/dir1/dir2)
key: /dir1/dir2/file1 and value: FileStatus(/tmp/data/dir1/dir2/file1)
Cases requiring special handling:
If the source path is a file (/tmp/file1), the contents of the listing will be as follows:
TARGET DOES NOT EXIST: Key-"", Value-FileStatus(/tmp/file1)
TARGET IS FILE : Key-"", Value-FileStatus(/tmp/file1)
TARGET IS DIR : Key-"/file1", Value-FileStatus(/tmp/file1)
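The key rules above can be sketched in plain Java. This is a minimal illustration, not the DistCp implementation: plain strings stand in for org.apache.hadoop.fs.Path, the class ListingKeySketch and the enum TargetState are hypothetical names, and the sketch assumes absolute paths with no trailing slash on the source root.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified stand-in for how listing keys relate to source paths.
 * Plain strings replace org.apache.hadoop.fs.Path; this is not the DistCp code.
 */
final class ListingKeySketch {

    /** Hypothetical enum describing the state of the copy target. */
    enum TargetState { DOES_NOT_EXIST, IS_FILE, IS_DIR }

    /** Key for a path found while traversing under a source directory. */
    static String keyUnderSourceDir(String sourceRoot, String traversedPath) {
        if (!traversedPath.startsWith(sourceRoot + "/")) {
            throw new IllegalArgumentException(traversedPath + " is not under " + sourceRoot);
        }
        // Strip the source root, keeping the leading slash of the remainder.
        return traversedPath.substring(sourceRoot.length());
    }

    /** Key for a source that is itself a single file (absolute path assumed). */
    static String keyForSingleFile(String sourceFile, TargetState target) {
        if (target == TargetState.IS_DIR) {
            // The file lands inside the target directory under its own name.
            return sourceFile.substring(sourceFile.lastIndexOf('/'));
        }
        // Target is missing or is a file: the key is the empty string.
        return "";
    }

    /** Keys for every ancestor directory below the root, then the file itself. */
    static List<String> keysForFileAndParents(String sourceRoot, String file) {
        String relative = keyUnderSourceDir(sourceRoot, file);
        List<String> keys = new ArrayList<>();
        int slash = relative.indexOf('/', 1);
        while (slash != -1) {
            keys.add(relative.substring(0, slash));
            slash = relative.indexOf('/', slash + 1);
        }
        keys.add(relative);
        return keys;
    }
}
```

With the example paths used above, keysForFileAndParents("/tmp/data", "/tmp/data/dir1/dir2/file1") yields the three keys /dir1, /dir1/dir2 and /dir1/dir2/file1, and keyForSingleFile("/tmp/file1", TargetState.IS_DIR) yields /file1.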
- Parameters:
pathToListFile
- - Output file where the listing would be stored
options
- - Input options to distcp
- Throws:
IOException
- - Exception if any
validatePaths
protected abstract void validatePaths(DistCpOptions options)
throws IOException,
org.apache.hadoop.tools.CopyListing.InvalidInputException
- Validate input and output paths
- Parameters:
options
- - Input options
- Throws:
InvalidInputException:
- If inputs are invalid
IOException:
- On any failure while accessing the FS
doBuildListing
protected abstract void doBuildListing(org.apache.hadoop.fs.Path pathToListFile,
DistCpOptions options)
throws IOException
- The interface to be implemented by sub-classes, to create the source/target file listing.
- Parameters:
pathToListFile:
- Path on HDFS where the listing file is written.
options:
- Input Options for DistCp (indicating source/target paths.)
- Throws:
IOException:
- Thrown on failure to create the listing file.
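Because buildListing is final while validatePaths and doBuildListing are abstract, the class reads as a template-method design. The sketch below illustrates that shape only. The call order shown (validate, build, then check for duplicates) is an assumption and is not stated on this page; strings stand in for Path and DistCpOptions, and ListingTemplateSketch and RecordingListing are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-in showing a final public method driving overridable hooks.
 * The call order is an assumption; strings replace Path and DistCpOptions.
 */
abstract class ListingTemplateSketch {

    /** Records hook invocations so the order can be inspected. */
    protected final List<String> calls = new ArrayList<>();

    /** Final, like buildListing(): subclasses customize only the hooks. */
    public final void buildListing(String pathToListFile, String options) {
        validatePaths(options);
        doBuildListing(pathToListFile, options);
        checkForDuplicates(pathToListFile);
    }

    protected abstract void validatePaths(String options);

    protected abstract void doBuildListing(String pathToListFile, String options);

    /** Concrete step shared by all subclasses, mirroring the non-abstract method. */
    protected void checkForDuplicates(String pathToListFile) {
        calls.add("checkForDuplicates");
    }
}

/** Hypothetical subclass that only records what was asked of it. */
final class RecordingListing extends ListingTemplateSketch {

    @Override
    protected void validatePaths(String options) {
        calls.add("validatePaths");
    }

    @Override
    protected void doBuildListing(String pathToListFile, String options) {
        calls.add("doBuildListing");
    }

    List<String> recordedCalls() {
        return calls;
    }
}
```

A subclass written this way supplies only the listing strategy; the surrounding validation and duplicate check stay in one place in the base class.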
getBytesToCopy
protected abstract long getBytesToCopy()
- Return the total bytes that DistCp should copy for the source paths.
This doesn't consider whether a file is identical at the target and would be skipped during copy.
- Returns:
- total bytes to copy
getNumberOfPaths
protected abstract long getNumberOfPaths()
- Return the total number of paths to distcp, including directories.
This doesn't consider whether a file/dir is already present and would be skipped during copy.
- Returns:
- Total number of paths to distcp
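How the two totals relate to the listing entries can be sketched as follows. This is an illustration under stated assumptions, not the DistCp code: Entry is a hypothetical stand-in for a (relative path, FileStatus) pair, and treating directories as contributing zero bytes is an assumption.

```java
import java.util.List;

/** Simplified model of the totals a listing reports; not the DistCp code. */
final class ListingTotalsSketch {

    /** Hypothetical stand-in for one (relative path, FileStatus) entry. */
    static final class Entry {
        final String relativePath;
        final boolean directory;
        final long length;

        Entry(String relativePath, boolean directory, long length) {
            this.relativePath = relativePath;
            this.directory = directory;
            this.length = length;
        }
    }

    /** Sum of file lengths; directories are assumed to contribute no bytes. */
    static long bytesToCopy(List<Entry> entries) {
        long total = 0L;
        for (Entry entry : entries) {
            if (!entry.directory) {
                total += entry.length;
            }
        }
        return total;
    }

    /** Every entry counts as a path, directories included. */
    static long numberOfPaths(List<Entry> entries) {
        return entries.size();
    }
}
```

Neither total looks at the target, which matches the note that files already present or identical there are still counted.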
checkForDuplicates
protected void checkForDuplicates(org.apache.hadoop.fs.Path pathToListFile)
throws org.apache.hadoop.tools.CopyListing.DuplicateFileException,
IOException
- Validate the final resulting path listing to see if there are any duplicate entries
- Parameters:
pathToListFile
- - path listing built by doBuildListing
- Throws:
IOException
- - On any issue while checking for duplicates
DuplicateFileException
- - If there are duplicates
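A duplicate can arise, for example, when two source files with the same name are copied into one target directory, since both would map to the same key. The sketch below shows one simple way to detect repeated keys. It is an illustration only: the strategy actually used by checkForDuplicates is not described on this page, a list of strings stands in for the sequence file, and DuplicateKeyException is a hypothetical stand-in for CopyListing.DuplicateFileException.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified duplicate check over listing keys; not the DistCp code. */
final class DuplicateCheckSketch {

    /** Hypothetical stand-in for CopyListing.DuplicateFileException. */
    static final class DuplicateKeyException extends Exception {
        DuplicateKeyException(String message) {
            super(message);
        }
    }

    /** Throws if any relative path appears more than once in the listing. */
    static void checkForDuplicates(List<String> relativePaths) throws DuplicateKeyException {
        Set<String> seen = new HashSet<>();
        for (String key : relativePaths) {
            // Set.add returns false when the key was already present.
            if (!seen.add(key)) {
                throw new DuplicateKeyException("Duplicate listing key: " + key);
            }
        }
    }
}
```

An in-memory set is enough for a sketch; a very large listing would likely need a different approach, such as sorting the entries and comparing neighbours.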
setCredentials
protected void setCredentials(org.apache.hadoop.security.Credentials credentials)
- Set the Credentials store, on which FS delegation tokens will be cached
- Parameters:
credentials
- - Credentials object
getCredentials
protected org.apache.hadoop.security.Credentials getCredentials()
- Get credentials, to update the delegation tokens for accessed FS objects
- Returns:
- Credentials object
getCopyListing
public static CopyListing getCopyListing(org.apache.hadoop.conf.Configuration configuration,
org.apache.hadoop.security.Credentials credentials,
DistCpOptions options)
- Public Factory method with which the appropriate CopyListing implementation may be retrieved.
- Parameters:
configuration:
- The input configuration.
credentials
- - Credentials object on which the FS delegation tokens are cached
options:
- The input Options, to help choose the appropriate CopyListing Implementation.
- Returns:
- An instance of the appropriate CopyListing implementation.
Copyright © 2014 InMobi. All rights reserved.