Using a Multi-Character String as the Field Delimiter in Hive

FIELDS TERMINATED BY in a Hive CREATE TABLE statement only accepts a single character, which is awkward when the data uses a multi-character delimiter. Our current field delimiter is '@#@'. Besides changing the delimiter itself, Hive can also support multi-character delimiters through a SerDe.

For example, for data delimited by '@#@' with three fields:

create table hive_test(
id string,
tour_cd string,
flt_statis_cd string )
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( 'input.regex' = '^([^@#]*)@#@([^@#]*)@#@([^@#]*)',
'output.format.string' = '%1$s %2$s %3$s' )
STORED AS TEXTFILE;

input.regex is written as an ordinary Java regular expression, with one capture group per field.

output.format.string simply lists the positional arguments in increasing order, one per field.
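To sanity-check the pattern before creating the table, you can try it against a sample row; Python's regex syntax matches Java's closely enough for this pattern. A minimal sketch (the sample row and its values are made up):

```python
import re

# Same pattern as input.regex above: each capture group is a run of
# characters containing neither '@' nor '#', separated by the literal '@#@'.
pattern = re.compile(r'^([^@#]*)@#@([^@#]*)@#@([^@#]*)')

row = '1001@#@CN@#@D'  # hypothetical sample record
match = pattern.match(row)
print(match.groups())  # ('1001', 'CN', 'D')
```

Note the caveat: because each group excludes '@' and '#', a field that legitimately contains either character will fail to match, and Hive will return NULLs for that row. A stricter pattern would be needed to handle such data.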

Note that all columns must be of type string, otherwise you get an error like:

FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[3] named id_valid_ind has type int)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Once the table is created you can load data into it. But at query time you will very likely hit this error:

Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.contrib.serde2.RegexSerDe
	at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:247)
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:891)
	at org.apache.hadoop.hive.ql.exec.MapOperator.initObjectInspector(MapOperator.java:233)
	at org.apache.hadoop.hive.ql.exec.MapOperator.setChildren(MapOperator.java:366)
	... 33 more

Run the add jar command to put hive-contrib.jar on the classpath, then re-run the Hive statement:


hive> add jar /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar;
Added /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar to class path
Added resource: /usr/lib/hive/lib/hive-contrib-0.9.0-Intel.jar

Here is an example CREATE TABLE statement for a partitioned external table using a custom multi-character string as the delimiter:


create EXTERNAL table hive_test(
seg_fr_bs string,
tour_cd string,
flt_statis_cd string )
PARTITIONED BY(dt STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( 'input.regex' = '^([^@#]*)@#@([^@#]*)@#@([^@#]*)',
'output.format.string' = '%1$s %2$s %3$s')
STORED AS TEXTFILE
LOCATION '/user/adhoc/file/pir2_base_ics_wxl';
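Because the table is partitioned, files under LOCATION are not visible until a partition is registered in the metastore. A hedged sketch of how that might look (the dt value and sub-directory name here are hypothetical examples):

```sql
-- Register a partition pointing at its sub-directory (example value).
ALTER TABLE hive_test ADD PARTITION (dt='20130724')
LOCATION '/user/adhoc/file/pir2_base_ics_wxl/dt=20130724';

-- Then query as usual.
SELECT seg_fr_bs, tour_cd, flt_statis_cd
FROM hive_test
WHERE dt = '20130724'
LIMIT 10;
```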


Walking

All through school, countless sayings and parables told me that effort alone leads to success, and that failure only means you haven't worked hard enough. Yet after years of working, I have little to show for it.

Rushed off my feet by one project after another, led by the nose by real-world pressures, I never had time to stop and think: effort is fine, but have I really chosen the right way to travel? To reach a city I can walk, cycle, take a bus, drive, ride a high-speed train, or fly.

My boss tells me walking is great, that people on planes miss the scenery on the ground. But do I really want to spend my short life this way? What if the scenery I want to see is at the destination? And by the time I want to fly, I can no longer afford the ticket.

Fresh out of school I couldn't see clearly and didn't know which road to choose. But the trend is now increasingly obvious: the IT outsourcing business is getting harder. What really matters, though, is not the trend itself but the change in the trend. The change in the trend is what decides success.

Can the company transform itself? It has never lacked ideas, judging by the promises, big and small, that management keeps dangling. What it lacks most is execution. To bring in the new you must push out the old, yet right now our limited manpower is spent defending yesterday, busy building out the same old data-warehouse stack. Everyone knows the future is changing, but everyone is busy working for the past. So where is the confidence for a transformation supposed to come from?

RHEL 6 kernel: nf_conntrack: table full, dropping packet.

During HAWQ stress testing I suddenly found I could no longer connect to the server. On inspection, the service had inexplicably failed over to the standby node.

Checking heartbeat's log:

Jul 24 15:55:17 big3hd02.corp.haier.com heartbeat: [23081]: info: Link big3hd01.corp.haier.com:bond0 dead.
Jul 24 15:55:17 big3hd02.corp.haier.com ipfail: [23133]: info: Link Status update: Link big3hd01.corp.haier.com/bond0 now has status dead
Jul 24 15:55:18 big3hd02.corp.haier.com ipfail: [23133]: info: Asking other side for ping node count.
Jul 24 15:55:18 big3hd02.corp.haier.com ipfail: [23133]: info: Checking remote count of ping nodes.
Jul 24 15:55:21 big3hd02.corp.haier.com ipfail: [23133]: info: Telling other node that we have more visible ping nodes.
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: big3hd01.corp.haier.com wants to go standby [all]
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: standby: other_holds_resources: 3
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: New standby state: 2
Jul 24 15:55:26 big3hd02.corp.haier.com heartbeat: [23081]: info: New standby state: 2
Jul 24 15:55:27 big3hd02.corp.haier.com heartbeat: [23081]: info: other_holds_resources: 0
Jul 24 15:55:41 big3hd02.corp.haier.com heartbeat: [23081]: info: Link big3hd01.corp.haier.com:bond0 up.

The primary node's bond0 interface had become unreachable, so the service failed over to the standby. That was puzzling, so I checked the primary node's system log:

Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 kernel: nf_conntrack: table full, dropping packet.
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=210 [-1]: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:54:59 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=198 [-1]: Operation not permitted
Jul 24 15:54:59 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=198 [-1]: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:55:01 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:55:01 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: ERROR: glib: Error sending packet: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3685]: ERROR: glib: ucast_write: Unable to send HBcomm packet bond0 10.135.24.2:694 len=197 [-1]: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: info: glib: euid=0 egid=0
Jul 24 15:55:03 big3hd01 heartbeat: [3687]: ERROR: write_child: write failure on ping 10.135.25.254.: Operation not permitted
Jul 24 15:55:03 big3hd01 heartbeat: [3685]: ERROR: write_child: write failure on ucast bond0.: Operation not permitted
Jul 24 15:55:04 big3hd01 kernel: __ratelimit: 169 callbacks suppressed
Jul 24 15:55:04 big3hd01 kernel: nf_conntrack: table full, dropping packet.

It was full of kernel: nf_conntrack: table full, dropping packet. messages.

Check the netfilter setting:

sysctl net.nf_conntrack_max

net.nf_conntrack_max = 65536

Check the current number of tracked connections:

wc -l /proc/net/nf_conntrack showed more than 50,000 entries.

Raise net.nf_conntrack_max to 200000:

sysctl -w net.nf_conntrack_max=200000

While monitoring the concurrency test, wc -l /proc/net/nf_conntrack climbed to more than 70,000, well above the old limit of 65536. No wonder the log was full of kernel: nf_conntrack: table full, dropping packet. messages.


With the larger table, the test ran again without a hitch and the concurrency problem was solved.
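One caveat: sysctl -w only changes the running kernel, so the value resets on reboot. To make it stick, the setting would also go into /etc/sysctl.conf, along the lines of this sketch:

```shell
# /etc/sysctl.conf -- persist the larger conntrack table across reboots
net.nf_conntrack_max = 200000
```

and be applied with sysctl -p.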


Distributed Algorithms in NoSQL Databases

Scalability is one of the main drivers of the NoSQL movement. As such, it encompasses distributed system coordination, failover, resource management and many other capabilities. It sounds like a big umbrella, and it is. Although it can hardly be said that NoSQL movement brought fundamentally new techniques into distributed data processing, it triggered an avalanche of practical studies and real-life trials of different combinations of protocols and algorithms. These developments gradually highlight a system of relevant database building blocks with proven practical efficiency. In this article I’m trying to provide more or less systematic description of techniques related to distributed operations in NoSQL databases.

In the rest of this article we study a number of distributed activities, like replication or failure detection, that could happen in a database. These activities, highlighted in bold below, are grouped into three major sections:

  • Data Consistency. Historically, NoSQL paid a lot of attention to tradeoffs between consistency, fault-tolerance and performance to serve geographically distributed systems, low-latency or highly available applications. Fundamentally, these tradeoffs spin around data consistency, so this section is devoted to data replication and data repair.
  • Data Placement. A database should accommodate itself to different data distributions, cluster topologies and hardware configurations. In this section we discuss how to distribute or rebalance data in such a way that failures are handled rapidly, persistence guarantees are maintained, queries are efficient, and system resources like RAM or disk space are used evenly throughout the cluster.
  • System Coordination. Coordination techniques like leader election are used in many databases to implement fault tolerance and strong data consistency. However, even decentralized databases typically track their global state and detect failures and topology changes. This section describes several important techniques that are used to keep the system in a coherent state.

Continue reading "Distributed Algorithms in NoSQL Databases"

Reading

The difference between reading and not reading can be summed up in one line:

Skip books for a day, and nobody can tell; skip them for a week, and you start to curse; skip them for a month, and your IQ loses to a pig's.

Lately I've been reading less, and I've realized how scarce the time to sit down and read seriously really is; what I have are scraps of fragmented time. So things have flipped from before: I used to prefer paper books and could never get into e-books, but now I genuinely can't sit still long enough for a proper reading session. Once I noticed how little I was reading, I got anxious. Paper books cost too much, some titles can't even be bought domestically, and a phone screen is too small, so I quickly bought myself an iPad to put those fragments of time to use.

After reading quite a few more books this way, I've concluded that e-books really are better: one device holds many books, so if one book isn't working I can switch to another without getting up; I can annotate freely; it reopens to the page where I left off; and for unfamiliar words, a long press brings up a definition.

A few days ago I tried a paper book again and found I was no longer used to it. The force of the old paper-book habit was weaker than I imagined. Technology really is changing life, changing every one of us.

Reading is stepping into another person's mind, a communion with the wise, and a genuinely pleasurable feeling. That's why choosing what to read matters so much. They say that if Mao had read foreign writings on democratic politics instead of the Twenty-Four Histories and other so-called imperial statecraft, China might well have taken a different road.