How to copy files from HDFS into a Git repo, and vice versa?

ParuchuriM

I have files in HDFS that need to be compared with other files in a Git repo, so I want to copy the HDFS files into the Git repo. The comparison will be done by another tool that can't talk to HDFS.

Is this doable?

If so, please advise how. Or is there another way to get the files out?



answered 6 months ago Michal Lonski #1

Some ideas that come to my mind:

  1. You can copy the files from HDFS to the local machine and then run the tool that compares the files.

    a) You can do it manually, using command line tools:

    hadoop fs -copyToLocal <hdfs file> <local file>

    b) You can compose an Oozie workflow containing an action that runs your 'comparer' and fetches the files from HDFS using the distributed cache.

    c) If you do not have the command-line tools available, you can fetch the files over HTTP using WebHDFS.

  2. You can stream the file content from HDFS and compare it on the fly using the file system API.
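
For option 2, a lightweight stand-in for the file system API is to stream the file with `hdfs dfs -cat` and pipe it straight into `cmp`, so nothing is written to local disk. This is only a sketch: it assumes the `hdfs` client is on the PATH, and the function name `hdfs_matches_local` and the paths in the comments are hypothetical. The Java `FileSystem` API would be the in-process route and is not shown here.

```shell
# Compare an HDFS file with a local file (for example one in a Git checkout)
# without saving an intermediate copy.
# Assumes the `hdfs` CLI is installed and configured for the cluster.
# $1 = HDFS path, e.g. /data/reports/part-00000 (hypothetical)
# $2 = local path, e.g. repo/expected/part-00000 (hypothetical)
hdfs_matches_local() {
    # `hdfs dfs -cat` writes the file content to stdout; `cmp -s -` reads it
    # from stdin and exits 0 only if it is byte-identical to the local file.
    # Caveat: the exit status is cmp's, so a failed `hdfs` call that prints
    # nothing would look like a match against an empty local file.
    hdfs dfs -cat "$1" | cmp -s - "$2"
}
```

It could be used as `hdfs_matches_local /data/reports/part-00000 repo/expected/part-00000 && echo same || echo differs`.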

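For option 1c, WebHDFS exposes files through a REST endpoint of the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=OPEN`, so a plain HTTP client such as `curl` is enough. The sketch below assumes WebHDFS is enabled on the cluster and that no authentication is required. The helper names are hypothetical, and the port depends on the installation (commonly 50070 on Hadoop 2 and 9870 on Hadoop 3).

```shell
# Build the WebHDFS URL that opens (reads) a file.
# $1 = NameNode host, $2 = NameNode HTTP port, $3 = absolute HDFS path.
# Paths with spaces or other special characters would need URL encoding,
# which this sketch does not do.
webhdfs_open_url() {
    printf 'http://%s:%s/webhdfs/v1%s?op=OPEN' "$1" "$2" "$3"
}

# Download an HDFS file to a local path over WebHDFS.
# $4 = local destination. `-L` follows the NameNode's redirect to a DataNode,
# and `-f` makes curl fail on HTTP errors instead of saving the error body.
webhdfs_fetch() {
    curl -sS -f -L -o "$4" "$(webhdfs_open_url "$1" "$2" "$3")"
}
```

A call such as `webhdfs_fetch namenode.example.com 9870 /data/file.txt repo/file.txt` (host and paths are placeholders) would leave the file inside the Git working tree, where the comparison tool can read it or where it can be committed with `git add`.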