Hive建表语句中得FIELDS TERMINATED BY 只能是单字符,遇到多字符作为分隔符的就尴尬了。目前我们的字段分隔符是’@#@’ 。遇到这个问题除了变更分隔符外,hive也可以使用serde的方式来支持多字符作为分隔符。
例如一个分隔符为’@#@’的数据,有3个字段
create table hive_test(
id string,
tour_cd string,
flt_statis_cd string )
ROW FORMAT
SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES
( ‘input.regex’ = ‘^([^@#]*)@#@([^@#]*)@#@([^@#]*)’,
‘output.format.string’ = ‘%1$s %2$s %3$s ‘)
STORED AS TEXTFILE;
input.regex 就是按照java的字段分割正则表达式方式编写。
output.format.string 按照顺序往后递增即可。
需要注意的是,字段类型只支持string,不然就会报错:
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[3] named id_valid_ind has type int)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
建完以后就可以往hive表里面load数据了。但是用的时候很可能报这个错。
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:891) at org.apache.hadoop.hive.ql.exec.MapOperator.initObjectInspector(MapOperator.java:233) at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:366) ... 33 more
执行add jar 命令 将hive-contrib.jar 加入再执行hive语句即可
hive> add jar /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar;
Added /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar to class path
Added resource: /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar
一个带分区的外部表,自定义多字符字符串作为分隔符的建表语句例子
create EXTERNAL table hive_test(
seg_fr_bs string,
tour_cd string,
flt_statis_cd string )
PARTITIONED BY(dt STRING)
ROW FORMAT
SERDE ‘org.apache.hadoop.hive.contrib.serde2.RegexSerDe’
WITH SERDEPROPERTIES
( ‘input.regex’ = ‘^([^@#]*)@#@([^@#]*)@#@([^@#]*))’,
‘output.format.string’ = ‘%1$s %2$s %3$s’)
STORED AS TEXTFILE
LOCATION ‘/user/adhoc/file/pir2_base_ics_wxl’;