Contents
  1. Extract data to CSV
  2. View file schema
  3. Encrypting the Hive metastore database password
    3.1. Create the credential file
    3.2. Apply the configuration
  4. Partition repair
  5. Modify column comments
  6. Add columns
  7. List all databases and tables
  • Metadata
    1. View partition status of all tables
    2. Modifying the metadata
  • Beeline
  • Query
    1. Maximum consecutive login days
      1.1. Consecutive login days as of each day
  • Using the Map type
    Extract data to CSV

    insert overwrite directory '/tmp/blog' row format delimited fields terminated by ','  STORED AS TEXTFILE select * from tbl where  concat(year,month,day) = '20210721' and phoneNumber='xx' and platform='android';

    Change the storage location:
    alter table dim.tb_hcy_number_segment set location 'hdfs://hcycluster2/hive/warehouse/dim.db/tb_hcy_number_segment';

    View file schema

    hive --orcfiledump [hdfs file path]
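    For a table-level view of the schema and storage details, DESCRIBE FORMATTED also works from within Hive; a minimal example using one of the tables mentioned above:

    -- shows columns, location, SerDe and input/output formats for the table
    describe formatted dim.tb_hcy_number_segment;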

    Encrypting the Hive metastore database password

    Create the credential file

    hadoop credential create javax.jdo.option.ConnectionPassword
    [HDFS file] -provider /tmp/hive.jceks
    [HDFS file] -provider jceks://hdfs/hive/hive.jceks
    [local file] -provider jceks://file//tmp/hive.jceks
    [optional: pass the password explicitly] -value AAA

    Apply the configuration

    Remove the javax.jdo.option.ConnectionPassword property,
    then add the following property:

    <property>
      <name>hadoop.security.credential.provider.path</name>
      <value>jceks://hdfs/hive/hive.jceks</value>
    </property>

    Partition repair

    If partition directories are deleted directly on HDFS, the partitions still exist in Hive and have to be dropped with ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
    For batch repair, use:
    MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
    The default is ADD: partitions that exist on HDFS but are missing from the metastore are added to the metastore;
    DROP removes metastore entries whose HDFS directories no longer exist;
    SYNC is ADD + DROP.
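    A small sketch of the repair flow, reusing a table name from below; the scenarios are hypothetical:

    -- a partition directory was removed on HDFS, but the partition is still in the metastore
    MSCK REPAIR TABLE ods.burpoint_active_log_d DROP PARTITIONS;
    -- directories copied onto HDFS without ADD PARTITION can be registered with
    MSCK REPAIR TABLE ods.burpoint_active_log_d ADD PARTITIONS;
    -- or do both in one pass
    MSCK REPAIR TABLE ods.burpoint_active_log_d SYNC PARTITIONS;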

    Modify column comments

    alter table ads.whitelist_extraction ALTER COLUMN platform comment '1:营销投放白名单,2:139邮箱侧勾选协议用户,3:各端勾选协议用户,4:配置中心'
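    Where ALTER COLUMN is not supported, the older CHANGE COLUMN form achieves the same thing (this assumes platform is a string column; the name and type are repeated unchanged):

    ALTER TABLE ads.whitelist_extraction CHANGE COLUMN platform platform string COMMENT '1:营销投放白名单,2:139邮箱侧勾选协议用户,3:各端勾选协议用户,4:配置中心';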

    Add columns

    ALTER TABLE ods.burpoint_active_log_d ADD COLUMNS (clouddevice string);
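    For a partitioned table, ADD COLUMNS only changes the table-level schema by default; adding CASCADE (available since Hive 1.1) also propagates the new column to the metadata of existing partitions:

    ALTER TABLE ods.burpoint_active_log_d ADD COLUMNS (clouddevice string) CASCADE;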

    List all databases and tables

    val dbs = spark.sql("show databases").collect().map(_.getString(0))
    spark.sql("create table tmp.dbtbl_test(namespace string, tablename string) row format delimited fields terminated by ',' STORED AS TEXTFILE ")
    dbs.foreach(r=>{spark.sql(s"show tables in $r ").select("namespace","tablename").write.insertInto("tmp.dbtbl_test") })
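    Once populated, the result table can be queried like any other; for example, to count tables per database (a quick sanity check, assuming the Scala snippet above ran successfully):

    select namespace, count(*) as tbl_cnt from tmp.dbtbl_test group by namespace order by tbl_cnt desc;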
    # modification time of each table directory
    hdfs dfs -ls -t /hive/warehouse/*.db |grep '^d' |awk '{ print $8 " " $6 }' >> rawinfo
    sed 's/\/hive\/warehouse\///' rawinfo |sed 's/\.db\// /' | sort -n -k 1 -k 2 -t ' ' > sorted_info
    # size of each table directory
    hdfs dfs -dus /hive/warehouse/*.db/* |awk '{ print $2 " " $1}' | sed 's/\/hive\/warehouse\///' |sed 's/\.db\// /' >sizeinfo

    Metadata

    • DBS
      • DB_ID
      • DESC
      • DB_LOCATION_URI
      • NAME
      • OWNER_NAME
      • OWNER_TYPE
    • DATABASE_PARAMS - corresponds to the WITH DBPROPERTIES clause of CREATE DATABASE
      • DB_ID
      • PARAM_KEY
      • PARAM_VALUE
    • DB_PRIVS
    • TBLS
      • TBL_ID
      • CREATE_TIME
      • DB_ID
      • LAST_ACCESS_TIME - when is this updated? Spark does not appear to update it; observed values are all 0
      • OWNER
      • RETENTION - reserved field
      • IS_REWRITE_ENABLED
      • SD_ID - storage descriptor id, references the SDS table
      • TBL_NAME
      • TBL_TYPE
      • VIEW_EXPANDED_TEXT
      • VIEW_ORIGINAL_TEXT
    • TABLE_PARAMS - corresponds to TBLPROPERTIES in CREATE TABLE
      • TBL_ID
      • PARAM_KEY
      • PARAM_VALUE
    • TBL_PRIVS
    • PARTITIONS - all partition records
      • PART_ID
      • CREATE_TIME
      • LAST_ACCESS_TIME
      • PART_NAME
      • SD_ID
      • TBL_ID
    • PARTITION_KEYS - partition key definitions
      • TBL_ID
      • PKEY_COMMENT
      • PKEY_NAME
      • PKEY_TYPE
      • INTEGER_IDX
    • PARTITION_KEY_VALS - all partition values, one row per partition column
      • PART_ID
      • PART_KEY_VAL
      • INTEGER_IDX
    • PARTITION_PARAMS
      • PART_ID
      • PARAM_KEY: numFiles/totalSize/transient_lastDdlTime
      • PARAM_VALUE
    • SDS - table-level storage descriptors
      • SD_ID
      • CD_ID
      • INPUT_FORMAT
      • IS_COMPRESSED
      • IS_STOREDASSUBDIRECTORIES
      • LOCATION
      • NUM_BUCKETS
      • OUTPUT_FORMAT
      • SERDE_ID
    • SD_PARAMS - corresponds to WITH SERDEPROPERTIES
      • SD_ID
      • PARAM_KEY
      • PARAM_VALUE
    • SERDES
      • SERDE_ID
      • NAME
      • SLIB - SerDe class name
    • SERDE_PARAMS
      • SERDE_ID
      • PARAM_KEY: path/serialization.format/field.delim
      • PARAM_VALUE
    • CDS
      • CD_ID
    • COLUMNS_V2
      • CD_ID
      • COMMENT
      • COLUMN_NAME
      • TYPE_NAME
      • INTEGER_IDX
    • FUNCS
      • FUNC_ID
      • CLASS_NAME
      • CREATE_TIME
      • DB_ID
      • FUNC_NAME
    • FUNC_RU
      • FUNC_ID
      • RESOURCE_TYPE
      • RESOURCE_URI
      • INTEGER_IDX
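    As an illustration of how these tables relate, a sketch of a query against a MySQL metastore that resolves a table's columns via DBS -> TBLS -> SDS -> COLUMNS_V2 (the table name 't1' is just a placeholder):

    select d.NAME as db_name, t.TBL_NAME, c.INTEGER_IDX, c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT
    from DBS d
    join TBLS t on t.DB_ID = d.DB_ID
    join SDS s on s.SD_ID = t.SD_ID
    join COLUMNS_V2 c on c.CD_ID = s.CD_ID
    where t.TBL_NAME = 't1'
    order by c.INTEGER_IDX;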

    View partition status of all tables

    use hive;

    select t.TBL_ID, TBL_NAME, min_part, max_part, part_cnt
    from (
        select TBL_ID, min(PART_NAME) min_part, max(PART_NAME) max_part, count(*) part_cnt
        from hive.PARTITIONS
        group by TBL_ID
    ) t
    left join TBLS on TBLS.TBL_ID = t.TBL_ID
    left join DBS on DBS.DB_ID = TBLS.DB_ID
    where TBLS.DB_ID = 4;

    If you delete the files first and then repair the partitions, the deleted files go to the HDFS trash, and removing files this way is slow.
    MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

    Drop partitions:
    ALTER TABLE table_name DROP PARTITION (partition_spec)[, PARTITION (partition_spec), ...]
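    A concrete example (the partition spec is hypothetical, matching the year/month/day layout used earlier):

    ALTER TABLE ods.burpoint_active_log_d DROP IF EXISTS PARTITION (year='2021', month='07', day='21');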

    Modifying the metadata

    https://blog.csdn.net/haozhugogo/article/details/73274832

    -- get DB_ID TBL_ID SD_ID
    select * from tbls where TBL_NAME='t1';

    -- DB_LOCATION_URI
    select * from dbs where DB_ID = 41;

    -- spark.sql.sources.schema -> column-name casing in tables created by Spark can make Hive read those columns as null;
    -- on the Spark side this relies on: set spark.sql.caseSensitive=false;
    select * from table_params where TBL_ID = 89239;

    -- get CDS
    select * from sds where SD_ID = 245768;

    -- COLUMN_NAME TYPE_NAME
    select * from columns_v2 where CD_ID = 90196;
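    The linked post edits these rows directly in the metastore database; a hedged sketch of that idea (assumes a MySQL metastore, back it up first; the new paths are placeholders):

    -- point a database and one of its tables at a new warehouse location
    update DBS set DB_LOCATION_URI = 'hdfs://hcycluster2/hive/warehouse/newdb.db' where DB_ID = 41;
    update SDS set LOCATION = 'hdfs://hcycluster2/hive/warehouse/newdb.db/t1' where SD_ID = 245768;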

    Beeline

    !set fetchsize 200

    !addlocaldriverjar Add driver jar file in the beeline client side.
    !addlocaldrivername Add driver name that needs to be supported in the beeline client side.
    !all Execute the specified SQL against all the current connections

    !batch Start or execute a batch of statements
    !brief Set verbose mode off
    !call Execute a callable statement
    !close Close the current connection to the database
    !closeall Close all current open connections
    !commit Commit the current transaction (if autocommit is off)
    !connect Open a new connection to the database.
    !go Select the current connection
    !help Print a summary of command usage
    !history Display the command history
    !list List the current connections
    !metadata Obtain metadata information
    !nullemptystring Set to true to get historic behavior of printing null as
    empty string. Default is false.
    !outputformat Set the output format for displaying results
    (table,vertical,csv2,dsv,tsv2,xmlattrs,xmlelements, and
    deprecated formats(csv, tsv))
    !reconnect Reconnect to the database
    !record Record all output to the specified file
    !rehash Fetch table and column names for command completion
    !run Run a script from the specified file
    !save Save the current variables and aliases
    !scan Scan for installed JDBC drivers
    !script Start saving a script to a file
    !set Set a beeline variable
    !sh Execute a shell command
    !sql Execute a SQL command
    !tables List all the tables in the database
    !typeinfo Display the type map for the current connection
    !verbose Set verbose mode on

    Query

    Maximum consecutive login days

    If the logins are consecutive (e.g. for user id 1, the logins on 2021-08-01 through 2021-08-03 are consecutive), then the login date minus the row number is the same for all of them: 2021-07-31. So grouping by user and by (login date minus row number) groups together exactly the days that form one consecutive-login streak.

    select id,
           date_sub(login_rk.login_date, rk) as login,
           min(login_date) as min_date,
           max(login_date) as max_date,
           count(1)
    from (
        select id,
               login_date,
               row_number() over (partition by id order by login_date asc) as rk
        from login_data
    ) as login_rk
    group by id, date_sub(login_rk.login_date, rk)
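    The query above returns one row per streak; to keep only the longest streak per user, a small follow-up sketch on the same login_data table:

    select id, max(continuous_days) as max_continuous_days
    from (
        select id, count(1) as continuous_days
        from (
            select id,
                   login_date,
                   row_number() over (partition by id order by login_date asc) as rk
            from login_data
        ) as login_rk
        group by id, date_sub(login_date, rk)
    ) t
    group by id;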

    Consecutive login days as of each day

    SELECT login.id,
           login.login_date,
           DATEDIFF(login_date, min_date) + 1 as continue_days
    FROM (
        SELECT id,
               MIN(login_date) as min_date,
               MAX(login_date) as max_date
        FROM (
            SELECT id,
                   login_date,
                   ROW_NUMBER() OVER (PARTITION BY id ORDER BY login_date ASC) as rk
            FROM login_data
        ) AS login_rk
        GROUP BY id, date_sub(login_rk.login_date, rk)
    ) AS max_login_days
    RIGHT JOIN (
        SELECT id, login_date FROM login_data
    ) AS login
        ON login.id = max_login_days.id
        AND login.login_date BETWEEN max_login_days.min_date AND max_login_days.max_date
    ORDER BY id

    Using the Map type