[Building a Crawler Data-Scraping Platform] [3-5-02] Killing Timed-Out Processes on CentOS

news/2024/7/7 7:09:47

Contents

  • 0. Series index
  • 1. Background
  • 2. The script
  • 3. How it works
    • Steps
    • Code
      • Getting the process PID
      • Getting the process running time
  • 4. Summary

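0. Series index

The articles in this series are not written strictly in index order, so you can jump straight to the chapter you need from the index.

Click here to jump to the index

1. Background

When running crawlers, some tasks will inevitably run past their time limit. Our strategy for such tasks is to kill the task's process outright, so that it does not block other tasks.

2. The script

Code first: if that is all you need, take it and go.

Full code: click here.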

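#!/bin/bash

# Kill the given process if it has been running for at least max_time seconds.
# Returns 1 if the process does not exist, 0 otherwise.
function kill_timeout_process()
{
    local pid=$1
    local max_time=$2
    local stat start_time user_hz sys_uptime runs_time

    # starttime is field 22 of /proc/[pid]/stat, in clock ticks since boot.
    # Field 2 (the process name) may contain spaces, so drop everything up to
    # the last ")" first; starttime is then the 20th remaining field.
    stat=$(cat "/proc/$pid/stat" 2>/dev/null)
    stat=${stat##*) }
    start_time=$(echo "$stat" | cut -d" " -f20)

    if [[ "$start_time" == "" ]]
    then
        echo "[$pid] does not exist."
        return 1
    fi

    user_hz=$(getconf CLK_TCK)
    sys_uptime=$(cut -d" " -f1 /proc/uptime)

    # Running time = system uptime - process start time, in whole seconds.
    runs_time=$(( ${sys_uptime%.*} - start_time / user_hz ))

    echo "[$pid] runs for $runs_time seconds."

    if [ "$runs_time" -ge "$max_time" ]
    then
        kill -9 "$pid" && echo "[$pid] killed."
    fi

    return 0
}

process_name="scrapy"
timeout=7200

for pid in $(pgrep "$process_name")
do
    kill_timeout_process "$pid" "$timeout"
done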

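3. How it works

If you are curious about how the script above works, read on.

Steps

1. Find the candidate processes
2. Get each process's running time
3. Check whether the running time exceeds the configured limit
4. Kill the timed-out processes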
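As a cross-check, the four steps can also be sketched with `ps -o etimes=`, which prints a process's elapsed running time in seconds. This is a swapped-in technique, not the method this article's script uses, and it assumes a procps version whose `ps` supports the `etimes` field (older releases may not). The function name `elapsed_by_ps` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the elapsed running time of a PID in seconds,
# or return 1 if the process does not exist (or etimes is unsupported).
elapsed_by_ps() {
    local pid=$1 secs
    secs=$(ps -o etimes= -p "$pid" 2>/dev/null) || return 1
    # ps pads the value with spaces; strip them.
    secs=${secs//[[:space:]]/}
    [[ -n $secs ]] || return 1
    echo "$secs"
}

max_time=7200
# Step 1: find the candidates by exact process name.
for pid in $(pgrep -x scrapy); do
    # Step 2: get the running time; skip processes that already exited.
    runs_time=$(elapsed_by_ps "$pid") || continue
    # Steps 3 and 4: compare against the limit and kill.
    if (( runs_time >= max_time )); then
        kill -9 "$pid"
    fi
done
```

Comparing its result with the /proc-based calculation is a quick way to sanity-check the arithmetic on a given machine.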

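Code

Let's go through the steps above one at a time.

Getting the process PID

Both ps and pgrep can return a process ID.

In the examples below, scrapy is a string contained in the command of the processes I want to find.

Using ps: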

ps -ef | grep "scrapy" | grep -v grep | awk '{print $2}'

Output:

[root@docker]# ps -ef | grep "scrapy" | grep -v grep | awk '{print $2}'
5369
5371
14628
14634
16514
16516
17383
17384

Using pgrep:

pgrep scrapy

Output:

[root@docker]# pgrep scrapy
5371
14634
17384
17857
27193
27587
27592
27599
27604

As you can see, if all you need is the PID, pgrep is the simpler option, so that is what we use.
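One caveat when choosing between the two: without options, pgrep matches its pattern against the process name only, whereas the ps -ef | grep pipeline matches anywhere in the whole output line, so the two can return different sets of PIDs. A minimal sketch of pgrep's two matching modes, using its standard `-x` and `-f` options; the function names and the "scrapy crawl" pattern are only illustrative:

```shell
#!/bin/bash
# Hypothetical helpers wrapping two standard pgrep matching modes.

# -x: the process name must equal the pattern exactly,
# so "scrapy" does not also match a process named "scrapyd".
pids_by_exact_name() {
    pgrep -x "$1"
}

# -f: match the pattern against the full command line instead of the name,
# which is closer to what the ps -ef | grep pipeline does.
pids_by_cmdline() {
    pgrep -f "$1"
}

pids_by_exact_name scrapy
pids_by_cmdline "scrapy crawl"
```

For a kill script, the exact-name form is the more conservative choice, since it is less likely to pick up unrelated processes.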

Getting the process running time

The running time is computed as:

process running time = current system uptime - system uptime when the process started
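To make the formula concrete, here is a small worked example. The uptime value is the /proc/uptime sample shown in this article; the start tick count and the rate of 100 ticks per second are made-up values for illustration, and the function name `elapsed_seconds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: running time in whole seconds, given
#   $1 system uptime in seconds (may have a fractional part)
#   $2 process start time in clock ticks since boot
#   $3 clock ticks per second
elapsed_seconds() {
    local uptime=$1 start_ticks=$2 hz=$3
    echo $(( ${uptime%.*} - start_ticks / hz ))
}

# 5047164 - 504000000 / 100 = 5047164 - 5040000
elapsed_seconds 5047164.56 504000000 100   # prints 7164
```

With the script's limit of 7200 seconds, this example process would survive for another 36 seconds.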

How do we get the current system uptime?

The system uptime is stored in /proc/uptime, for example:

[root@docker]# cat /proc/uptime
5047164.56 34461345.27

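The file contains two numbers. The first is how long the system has been running, in seconds. The second is the time spent idle, summed over all CPU cores, which is why it can be larger than the first.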
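A minimal sketch of reading both values in bash without spawning cat or cut; the variable names are arbitrary:

```shell
#!/bin/bash
# Read the two fields of /proc/uptime: seconds since boot and idle seconds.
read -r uptime_secs idle_secs < /proc/uptime

# Bash arithmetic is integer-only, so drop the fractional part.
uptime_int=${uptime_secs%.*}

echo "system has been up for $uptime_int seconds"
```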

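How do we get the system uptime at which the process started?

For this we can turn to /proc/[pid]/stat, which holds the runtime statistics of the process with that pid. The relevant part of the proc(5) manual page reads: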

/proc/[pid]/stat
          Status information about the process.  This is used by ps(1).
          It is defined in the kernel source file fs/proc/array.c.

          The fields, in order, with their proper scanf(3) format speci‐
          fiers, are listed below.  Whether or not certain of these
          fields display valid information is governed by a ptrace
          access mode PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT
          check (refer to ptrace(2)).  If the check denies access, then
          the field value is displayed as 0.  The affected fields are
          indicated with the marking [PT].

          (1) pid  %d
                    The process ID.

          (2) comm  %s
                    The filename of the executable, in parentheses.
                    This is visible whether or not the executable is
                    swapped out.

          (3) state  %c
                    One of the following characters, indicating process
                    state:

                    R  Running

                    S  Sleeping in an interruptible wait

                    D  Waiting in uninterruptible disk sleep

                    Z  Zombie

                    T  Stopped (on a signal) or (before Linux 2.6.33)
                       trace stopped

                    t  Tracing stop (Linux 2.6.33 onward)

                    W  Paging (only before Linux 2.6.0)

                    X  Dead (from Linux 2.6.0 onward)

                    x  Dead (Linux 2.6.33 to 3.13 only)

                    K  Wakekill (Linux 2.6.33 to 3.13 only)

                    W  Waking (Linux 2.6.33 to 3.13 only)

                    P  Parked (Linux 3.9 to 3.13 only)

          (4) ppid  %d
                    The PID of the parent of this process.

          (5) pgrp  %d
                    The process group ID of the process.

          (6) session  %d
                    The session ID of the process.

          (7) tty_nr  %d
                    The controlling terminal of the process.  (The minor
                    device number is contained in the combination of
                    bits 31 to 20 and 7 to 0; the major device number is
                    in bits 15 to 8.)

          (8) tpgid  %d
                    The ID of the foreground process group of the con‐
                    trolling terminal of the process.

          (9) flags  %u
                    The kernel flags word of the process.  For bit mean‐
                    ings, see the PF_* defines in the Linux kernel
                    source file include/linux/sched.h.  Details depend
                    on the kernel version.

                    The format for this field was %lu before Linux 2.6.

          (10) minflt  %lu
                    The number of minor faults the process has made
                    which have not required loading a memory page from
                    disk.

          (11) cminflt  %lu
                    The number of minor faults that the process's
                    waited-for children have made.

          (12) majflt  %lu
                    The number of major faults the process has made
                    which have required loading a memory page from disk.

          (13) cmajflt  %lu
                    The number of major faults that the process's
                    waited-for children have made.

          (14) utime  %lu
                    Amount of time that this process has been scheduled
                    in user mode, measured in clock ticks (divide by
                    sysconf(_SC_CLK_TCK)).  This includes guest time,
                    guest_time (time spent running a virtual CPU, see
                    below), so that applications that are not aware of
                    the guest time field do not lose that time from
                    their calculations.

          (15) stime  %lu
                    Amount of time that this process has been scheduled
                    in kernel mode, measured in clock ticks (divide by
                    sysconf(_SC_CLK_TCK)).

          (16) cutime  %ld
                    Amount of time that this process's waited-for chil‐
                    dren have been scheduled in user mode, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).  (See
                    also times(2).)  This includes guest time,
                    cguest_time (time spent running a virtual CPU, see
                    below).

          (17) cstime  %ld
                    Amount of time that this process's waited-for chil‐
                    dren have been scheduled in kernel mode, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (18) priority  %ld
                    (Explanation for Linux 2.6) For processes running a
                    real-time scheduling policy (policy below; see
                    sched_setscheduler(2)), this is the negated schedul‐
                    ing priority, minus one; that is, a number in the
                    range -2 to -100, corresponding to real-time priori‐
                    ties 1 to 99.  For processes running under a non-
                    real-time scheduling policy, this is the raw nice
                    value (setpriority(2)) as represented in the kernel.
                    The kernel stores nice values as numbers in the
                    range 0 (high) to 39 (low), corresponding to the
                    user-visible nice range of -20 to 19.

                    Before Linux 2.6, this was a scaled value based on
                    the scheduler weighting given to this process.

          (19) nice  %ld
                    The nice value (see setpriority(2)), a value in the
                    range 19 (low priority) to -20 (high priority).

          (20) num_threads  %ld
                    Number of threads in this process (since Linux 2.6).
                    Before kernel 2.6, this field was hard coded to 0 as
                    a placeholder for an earlier removed field.

          (21) itrealvalue  %ld
                    The time in jiffies before the next SIGALRM is sent
                    to the process due to an interval timer.  Since ker‐
                    nel 2.6.17, this field is no longer maintained, and
                    is hard coded as 0.

          (22) starttime  %llu
                    The time the process started after system boot.  In
                    kernels before Linux 2.6, this value was expressed
                    in jiffies.  Since Linux 2.6, the value is expressed
                    in clock ticks (divide by sysconf(_SC_CLK_TCK)).

                    The format for this field was %lu before Linux 2.6.

          (23) vsize  %lu
                    Virtual memory size in bytes.

          (24) rss  %ld
                    Resident Set Size: number of pages the process has
                    in real memory.  This is just the pages which count
                    toward text, data, or stack space.  This does not
                    include pages which have not been demand-loaded in,
                    or which are swapped out.

          (25) rsslim  %lu
                    Current soft limit in bytes on the rss of the
                    process; see the description of RLIMIT_RSS in
                    getrlimit(2).

          (26) startcode  %lu  [PT]
                    The address above which program text can run.

          (27) endcode  %lu  [PT]
                    The address below which program text can run.

          (28) startstack  %lu  [PT]
                    The address of the start (i.e., bottom) of the
                    stack.

          (29) kstkesp  %lu  [PT]
                    The current value of ESP (stack pointer), as found
                    in the kernel stack page for the process.

          (30) kstkeip  %lu  [PT]
                    The current EIP (instruction pointer).

          (31) signal  %lu
                    The bitmap of pending signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (32) blocked  %lu
                    The bitmap of blocked signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (33) sigignore  %lu
                    The bitmap of ignored signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (34) sigcatch  %lu
                    The bitmap of caught signals, displayed as a decimal
                    number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (35) wchan  %lu  [PT]
                    This is the "channel" in which the process is wait‐
                    ing.  It is the address of a location in the kernel
                    where the process is sleeping.  The corresponding
                    symbolic name can be found in /proc/[pid]/wchan.

          (36) nswap  %lu
                    Number of pages swapped (not maintained).

          (37) cnswap  %lu
                    Cumulative nswap for child processes (not main‐
                    tained).

          (38) exit_signal  %d  (since Linux 2.1.22)
                    Signal to be sent to parent when we die.

          (39) processor  %d  (since Linux 2.2.8)
                    CPU number last executed on.

          (40) rt_priority  %u  (since Linux 2.5.19)
                    Real-time scheduling priority, a number in the range
                    1 to 99 for processes scheduled under a real-time
                    policy, or 0, for non-real-time processes (see
                    sched_setscheduler(2)).

          (41) policy  %u  (since Linux 2.5.19)
                    Scheduling policy (see sched_setscheduler(2)).
                    Decode using the SCHED_* constants in linux/sched.h.

                    The format for this field was %lu before Linux
                    2.6.22.

          (42) delayacct_blkio_ticks  %llu  (since Linux 2.6.18)
                    Aggregated block I/O delays, measured in clock ticks
                    (centiseconds).

          (43) guest_time  %lu  (since Linux 2.6.24)
                    Guest time of the process (time spent running a vir‐
                    tual CPU for a guest operating system), measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (44) cguest_time  %ld  (since Linux 2.6.24)
                    Guest time of the process's children, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (45) start_data  %lu  (since Linux 3.3)  [PT]
                    Address above which program initialized and unini‐
                    tialized (BSS) data are placed.

          (46) end_data  %lu  (since Linux 3.3)  [PT]
                    Address below which program initialized and unini‐
                    tialized (BSS) data are placed.

          (47) start_brk  %lu  (since Linux 3.3)  [PT]
                    Address above which program heap can be expanded
                    with brk(2).

          (48) arg_start  %lu  (since Linux 3.5)  [PT]
                    Address above which program command-line arguments
                    (argv) are placed.

          (49) arg_end  %lu  (since Linux 3.5)  [PT]
                    Address below program command-line arguments (argv)
                    are placed.

          (50) env_start  %lu  (since Linux 3.5)  [PT]
                    Address above which program environment is placed.

          (51) env_end  %lu  (since Linux 3.5)  [PT]
                    Address below which program environment is placed.

          (52) exit_code  %d  (since Linux 3.5)  [PT]
                    The thread's exit status in the form reported by
                    waitpid(2).

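As shown above, the script reads the process's starttime, which is field 22 of /proc/$pid/stat. Two things to note. First, starttime is not a wall-clock time: it is the number of clock ticks that had elapsed since boot when the process started, so dividing it by the tick rate (CLK_TCK, as reported by getconf CLK_TCK) gives the process's start time in seconds after boot. Second, a plain cut -d" " -f22 only lands on this field when the process name in field 2 contains no spaces, because that name is printed inside parentheses and may itself contain spaces.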
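A sketch of reading starttime in a way that tolerates spaces in the process name, by cutting the line after the last ")" before splitting it. The function name `proc_start_ticks` is hypothetical, and the sketch assumes bash on Linux with /proc mounted.

```shell
#!/bin/bash
# Hypothetical helper: print field 22 (starttime, in clock ticks) for a PID,
# or return 1 if /proc/<pid>/stat cannot be read.
proc_start_ticks() {
    local pid=$1 stat rest
    local -a fields
    stat=$(cat "/proc/$pid/stat" 2>/dev/null) || return 1
    # Remove "pid (comm) "; what is left starts at field 3 (state).
    rest=${stat##*) }
    read -r -a fields <<< "$rest"
    # Field 22 overall is the 20th remaining field, i.e. array index 19.
    echo "${fields[19]}"
}

hz=$(getconf CLK_TCK)
ticks=$(proc_start_ticks $$) &&
    echo "this shell started $(( ticks / hz )) seconds after boot"
```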

The remaining parts (the comparison, the check, and killing the process) are basic, so I will not go through them one by one.
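One possible variation on the kill step, not what this article's script does: send SIGTERM first so the process has a chance to clean up, and fall back to SIGKILL only if it is still alive after a grace period. The function name `stop_process` and the 10-second default are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: ask a process to exit, then force it if it is still
# alive after a grace period (seconds, default 10).
stop_process() {
    local pid=$1 grace=${2:-10} i
    # If the signal cannot be delivered, the process is already gone
    # (or is not ours to signal); nothing more to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    for (( i = 0; i < grace; i++ )); do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

In the script, this would take the place of the kill -9 call, for example stop_process "$pid" 10.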

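4. Summary

The idea behind killing timed-out processes is simple. The main requirement is some familiarity with Linux: knowing where the relevant information (system uptime, process start time, and so on) is stored. The rest is routine.

May you all grow stronger.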


http://www.niftyadmin.cn/n/4411688.html
