Handling GBK-Encoded Files in Hadoop

Date: 2020-09-21  Category: 程序人生 (Programmer Life)

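Wherever the Hadoop source code deals with character encoding, UTF-8 is hard-coded. In practice, however, it is common to have input files that are GBK-encoded, or output files that must be written as GBK.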

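If the input file is GBK, it is enough to call transformTextToUTF8(text, "GBK") on each Text as it is read in the mapper or reducer. This re-decodes the raw bytes as GBK, so that everything downstream works with UTF-8.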

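import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Re-decodes the raw bytes held by a Text using the given source encoding (e.g. "GBK").
// new Text(String) then stores the decoded value as UTF-8.
public static Text transformTextToUTF8(Text text, String encoding) {
    try {
        // getBytes() returns the backing array; only the first getLength() bytes are valid.
        String value = new String(text.getBytes(), 0, text.getLength(), encoding);
        return new Text(value);
    } catch (UnsupportedEncodingException e) {
        // Fail fast rather than continuing with a null value.
        throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
    }
}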
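
The decoding step can be tried without a Hadoop cluster. The following is a minimal sketch using only the JDK. The class name GbkDecodeDemo and the padded buffer are made up for illustration, and it assumes the JDK ships the GBK charset (standard full JDKs do). The padding imitates a Text backing array that is longer than the valid data, which is why the length is passed explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it mirrors the decode step of transformTextToUTF8 without Hadoop.
public class GbkDecodeDemo {

    // Decode only the first `length` bytes of `buffer` using the given encoding.
    public static String decode(byte[] buffer, int length, String encoding) {
        return new String(buffer, 0, length, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two characters of "Chinese" (zhongwen)
        byte[] gbk = original.getBytes(Charset.forName("GBK"));

        // Simulate a reused buffer with extra trailing bytes past the valid length.
        byte[] buffer = Arrays.copyOf(gbk, gbk.length + 4);

        String decoded = decode(buffer, gbk.length, "GBK");
        byte[] utf8 = decoded.getBytes(StandardCharsets.UTF_8);

        System.out.println("round trip ok: " + decoded.equals(original));
        System.out.println(gbk.length + " GBK bytes, " + utf8.length + " UTF-8 bytes");
    }
}
```

Decoding the whole buffer instead of the first gbk.length bytes would append stray characters to the result, which is the same reason the method above passes text.getLength().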

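If the output file must be GBK, write a replacement for the TextOutputFormat class: declare public class GBKFileOutputFormat<K, V> extends FileOutputFormat<K, V>, copy the source of TextOutputFormat into it, and change the hard-coded UTF-8 encoding to GBK. Finally, in the job's run method, set job.setOutputFormatClass(GBKFileOutputFormat.class);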
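
The core of that change is the record writer, which emits key, separator, value, and newline as bytes. Below is a rough JDK-only sketch of that logic, not Hadoop's actual class: GbkLineWriter and its constructor are hypothetical names, and it takes plain Objects instead of Writables. One thing worth checking when adapting the copied source: in the TextOutputFormat versions I am aware of, Text values are written as their raw (UTF-8) bytes without re-encoding, so that branch likely needs to go through toString() and the GBK charset as well. Treat that as an assumption to verify against your Hadoop version.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

// Hypothetical stand-in for the line writer inside a GBK output format.
public class GbkLineWriter {
    private static final Charset GBK = Charset.forName("GBK");
    private static final byte[] NEWLINE = "\n".getBytes(GBK);

    private final OutputStream out;
    private final byte[] separator;

    public GbkLineWriter(OutputStream out, String separator) {
        this.out = out;
        this.separator = separator.getBytes(GBK);
    }

    // Always encode via toString() so every value is written as GBK bytes.
    private void writeObject(Object o) throws IOException {
        out.write(o.toString().getBytes(GBK));
    }

    // Writes "key<separator>value\n"; a null key or value is skipped, and
    // nothing is written when both are null.
    public synchronized void write(Object key, Object value) throws IOException {
        boolean nullKey = (key == null);
        boolean nullValue = (value == null);
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!nullKey && !nullValue) {
            out.write(separator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(NEWLINE);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GbkLineWriter writer = new GbkLineWriter(sink, "\t");
        writer.write("\u4e2d\u6587", 1);
        System.out.println("bytes written: " + sink.size());
    }
}
```

In a real GBKFileOutputFormat the same write logic would sit inside the RecordWriter returned by getRecordWriter, with the stream opened on the task's output path as in the copied TextOutputFormat source.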

When reposting, please credit the source: http://www.heiqu.com/0df8adc02aab57615771723027318c73.html
