The poolboy max_overflow parameter pitfall

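Problem

A service node was running at a fairly low QPS (about 2,000 database accesses per second) with a pool of 100 worker processes and max_overflow set to 100. Its performance suddenly dropped, and it could handle only about 1,500 database accesses per second. Request latency rose from a few milliseconds to several hundred milliseconds, then gradually recovered.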

Cause

The investigation gradually narrowed the problem down to checkout in the poolboy process pool used for MongoDB:

check out

handle_call({checkout, CRef, Block}, {FromPid, _} = From, State) ->
    #state{supervisor = Sup,
           workers = Workers,
           monitors = Monitors,
           overflow = Overflow,
           max_overflow = MaxOverflow} = State,
    case Workers of
        %% An idle worker is available: hand it out directly.
        [Pid | Left] ->
            MRef = erlang:monitor(process, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{workers = Left}};
        %% No idle worker, but overflow capacity remains: a new worker is
        %% started here, inside the pool process, before the reply is sent.
        [] when MaxOverflow > 0, Overflow < MaxOverflow ->
            {Pid, MRef} = new_worker(Sup, FromPid),
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            {reply, Pid, State#state{overflow = Overflow + 1}};
        %% Pool exhausted and the caller does not want to wait.
        [] when Block =:= false ->
            {reply, full, State};
        %% Pool exhausted: queue the caller until a worker is checked in.
        [] ->
            MRef = erlang:monitor(process, FromPid),
            Waiting = queue:in({From, CRef, MRef}, State#state.waiting),
            {noreply, State#state{waiting = Waiting}}
    end;

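As the code shows, when max_overflow is nonzero, a momentary overload creates new workers. Each of these workers connects to MongoDB, which takes 1-2 ms. Because new_worker is called inside handle_call, the cost of creating the worker blocks the master process (the pool process that serves every checkout).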
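To make the blocking effect concrete, here is a minimal sketch. It is not poolboy code: the module name `overflow_block_demo`, the `master_loop/1` process and the use of `timer:sleep/1` in place of a real MongoDB connect are all assumptions made for illustration. It only shows that when a single process performs a slow step per request, concurrent callers are served one after another.

```erlang
-module(overflow_block_demo).
-export([run/2]).

%% Hypothetical stand-in for a pool master: one process that answers
%% checkout requests one at a time. ConnectMs simulates the time a new
%% worker would spend connecting to the database inside the master.
%% Returns the elapsed wall-clock time in milliseconds.
run(Requests, ConnectMs) ->
    Master = spawn_link(fun() -> master_loop(ConnectMs) end),
    Parent = self(),
    Start = erlang:monotonic_time(millisecond),
    %% Fire all requests concurrently, as independent callers would.
    Callers = [spawn_link(fun() ->
                   Ref = make_ref(),
                   Master ! {checkout, self(), Ref},
                   receive {Ref, worker} -> Parent ! {done, self()} end
               end) || _ <- lists:seq(1, Requests)],
    %% Wait until every caller has received its reply.
    [receive {done, C} -> ok end || C <- Callers],
    Elapsed = erlang:monotonic_time(millisecond) - Start,
    unlink(Master),
    exit(Master, kill),
    Elapsed.

master_loop(ConnectMs) ->
    receive
        {checkout, From, Ref} ->
            timer:sleep(ConnectMs),   %% simulated synchronous connect
            From ! {Ref, worker},
            master_loop(ConnectMs)
    end.
```

With these assumptions, `overflow_block_demo:run(50, 2)` should take on the order of 100 ms or longer even though all 50 callers start at the same time, because the single master process pays the simulated 2 ms connect cost for each request in turn.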

check in

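On check-in, the overflow worker is destroyed again, so connections are created and destroyed over and over, and all of that work happens in the master process. As a result, every request is blocked behind the master process's connection creation and teardown, and QPS collapses.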

handle_checkin(Pid, State) ->
    #state{supervisor = Sup,
           waiting = Waiting,
           monitors = Monitors,
           overflow = Overflow,
           strategy = Strategy} = State,
    case queue:out(Waiting) of
        %% A caller is waiting: hand the returned worker straight to it.
        {{value, {From, CRef, MRef}}, Left} ->
            true = ets:insert(Monitors, {Pid, CRef, MRef}),
            gen_server:reply(From, Pid),
            State#state{waiting = Left};
        %% Nobody is waiting and overflow workers exist: the returned
        %% worker is dismissed instead of being kept in the pool.
        {empty, Empty} when Overflow > 0 ->
            ok = dismiss_worker(Sup, Pid),
            State#state{waiting = Empty, overflow = Overflow - 1};
        %% Nobody is waiting and there is no overflow: put the worker back.
        {empty, Empty} ->
            Workers = case Strategy of
                lifo -> [Pid | State#state.workers];
                fifo -> State#state.workers ++ [Pid]
            end,
            State#state{workers = Workers, waiting = Empty, overflow = 0}
    end.

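Conclusion

Do not use poolboy's max_overflow. If creating or destroying child processes has a noticeable cost, it easily blocks the poolboy master process, and frequent creation and destruction of workers leads to an avalanche.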
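A pool definition that follows this advice might look like the fragment below. This is only a sketch: the pool name `mongo_pool`, the worker module `my_mongo_worker` and the size of 200 are hypothetical and not taken from the original service. The size should be chosen to cover the expected peak, since no extra workers will be created on demand.

```erlang
%% Hypothetical pool definition: mongo_pool and my_mongo_worker are
%% illustrative names, and 200 is an assumed size, not a measured one.
PoolArgs = [{name, {local, mongo_pool}},
            {worker_module, my_mongo_worker},
            {size, 200},          %% fixed number of long-lived workers
            {max_overflow, 0}],   %% never start extra workers on checkout
WorkerArgs = [],
ChildSpec = poolboy:child_spec(mongo_pool, PoolArgs, WorkerArgs).
```

With max_overflow set to 0, the overflow clause in checkout never matches, so a caller that finds the pool empty either waits in the queue or receives full, and connection setup cost is paid only when the pool starts or replaces a worker.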

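Every bug looks obvious in hindsight, yet tracking it down takes real effort. The monitoring data cannot be published on a personal blog, so much of the reasoning has been left out. I hope the conclusion is still helpful.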

Author: enjolras1205
Original article: https://segmentfault.com/a/1190000018412465
