Hive
Extract data as CSV
```sql
insert overwrite directory '/tmp/blog'
row format delimited fields terminated by ','
stored as textfile
select * from tbl
where concat(year, month, day) = '20210721'
  and phoneNumber = 'xx'
  and platform = 'android';
```
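If the export should land on the local filesystem rather than HDFS, Hive also accepts the LOCAL keyword; a minimal sketch (path and filter are illustrative):

```sql
-- same export, written to a directory on the local filesystem of the node running the query
insert overwrite local directory '/tmp/blog_local'
row format delimited fields terminated by ','
select * from tbl where platform = 'android';
```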
Change the storage location
alter table dim.tb_hcy_number_segment
set location 'hdfs://hcycluster2/hive/warehouse/dim.db/tb_hcy_number_segment';
View a file's schema
hive --orcfiledump [hdfs file path]
Encrypt the database connection password for Hive
Create the credential file:
hadoop credential create javax.jdo.option.ConnectionPassword
[HDFS file] -provider jceks://hdfs/hive/hive.jceks (the provider must be a jceks:// URI; a bare path like /tmp/hive.jceks does not work)
[local file] -provider jceks://file//tmp/hive.jceks
[optional: supply the password explicitly instead of being prompted] -value AAA
Apply the configuration
Remove the javax.jdo.option.ConnectionPassword property and add the credential provider path instead:

```xml
<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://hdfs/hive/hive.jceks</value>
</property>
```
Partition repair
Deleting partition files directly on HDFS leaves the partitions registered in Hive; they must be dropped with ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
For batch repair, use: MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];
- the default is ADD: partitions that exist on HDFS but not in the metastore get their metadata added;
- DROP removes metadata for partitions that are gone from HDFS but still in the metastore;
- SYNC is ADD + DROP; see the example below.
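A sketch of the SYNC case (the table name is borrowed from a later section, and the ADD/DROP/SYNC keywords need Hive 3.x or later):

```sql
-- add metadata for partitions that appeared on HDFS, drop metadata for ones that are gone
MSCK REPAIR TABLE ods.burpoint_active_log_d SYNC PARTITIONS;
```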
Change a column comment
alter table ads.whitelist_extraction ALTER COLUMN platform comment '1:营销投放白名单,2:139邮箱侧勾选协议用户,3:各端勾选协议用户,4:配置中心'
Add columns
ALTER TABLE ods.burpoint_active_log_d ADD COLUMNS (clouddevice string);
List all databases and tables
```scala
val dbs = spark.sql("show databases").collect().map(_.getString(0))
```
Table directory modification date:
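The original snippet here was cut off after its comment line. A metastore-side sketch that covers both points, assuming direct MySQL access to the metastore (schema documented in the next section); note that transient_lastDdlTime records the last DDL/write touch, not necessarily every file-level change:

```sql
-- list every table with its database and last-DDL time, straight from the metastore
select d.NAME as db_name, t.TBL_NAME as table_name,
       from_unixtime(cast(p.PARAM_VALUE as unsigned)) as last_ddl_time
from dbs d
join tbls t on t.DB_ID = d.DB_ID
left join table_params p on p.TBL_ID = t.TBL_ID and p.PARAM_KEY = 'transient_lastDdlTime'
order by d.NAME, t.TBL_NAME;
```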
Metadata
- DBS
  - DB_ID
  - DESC
  - DB_LOCATION_URI
  - NAME
  - OWNER_NAME
  - OWNER_TYPE
- DATABASE_PARAMS - corresponds to the WITH DBPROPERTIES parameters at database creation
  - DB_ID
  - PARAM_KEY
  - PARAM_VALUE
- DB_PRIVS
- TBLS
  - TBL_ID
  - CREATE_TIME
  - DB_ID
  - LAST_ACCESS_TIME - when is it updated? Spark has been observed not to update it; the values are all 0
  - OWNER
  - RETENTION - reserved field
  - IS_REWRITE_ENABLED
  - SD_ID - storage descriptor (serialization/storage configuration), references the SDS table
  - TBL_NAME
  - TBL_TYPE
  - VIEW_EXPANDED_TEXT
  - VIEW_ORIGINAL_TEXT
- TABLE_PARAMS - corresponds to the TBLPROPERTIES parameters at table creation
  - TBL_ID
  - PARAM_KEY
  - PARAM_VALUE
- TBL_PRIVS
- PARTITIONS - all partition records
  - PART_ID
  - CREATE_TIME
  - LAST_ACCESS_TIME
  - PART_NAME
  - SD_ID
  - TBL_ID
- PARTITION_KEYS - partition key definitions
  - TBL_ID
  - PKEY_COMMENT
  - PKEY_NAME
  - PKEY_TYPE
  - INTEGER_IDX
- PARTITION_KEY_VALS - all partition values, with each partition column split into its own row
  - PART_ID
  - PART_KEY_VAL
  - INTEGER_IDX
- PARTITION_PARAMS
  - PART_ID
  - PARAM_KEY: numFiles/totalSize/transient_lastDdlTime
  - PARAM_VALUE
- SDS - table-level storage information
  - SD_ID
  - CD_ID
  - INPUT_FORMAT
  - IS_COMPRESSED
  - IS_STOREDASSUBDIRECTORIES
  - LOCATION
  - NUM_BUCKETS
  - OUTPUT_FORMAT
  - SERDE_ID
- SD_PARAMS - corresponds to WITH SERDEPROPERTIES
  - SD_ID
  - PARAM_KEY
  - PARAM_VALUE
- SERDES
  - SERDE_ID
  - NAME
  - SLIB - SerDe class name
- SERDE_PARAMS
  - SERDE_ID
  - PARAM_KEY: path/serialization.format/field.delim
  - PARAM_VALUE
- CDS
  - CD_ID
- COLUMNS_V2
  - CD_ID
  - COMMENT
  - COLUMN_NAME
  - TYPE_NAME
  - INTEGER_IDX
- FUNCS
  - FUNC_ID
  - CLASS_NAME
  - CREATE_TIME
  - DB_ID
  - FUNC_NAME
- FUNC_RU
  - FUNC_ID
  - RESOURCE_TYPE
  - RESOURCE_URI
  - INTEGER_IDX
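These tables chain together as DBS -> TBLS -> SDS -> CDS -> COLUMNS_V2. A sketch query pulling a table's column list straight from the metastore (MySQL; the table name is reused from earlier on this page):

```sql
-- columns of dim.tb_hcy_number_segment, in declaration order
select c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT
from dbs d
join tbls t on t.DB_ID = d.DB_ID
join sds s on s.SD_ID = t.SD_ID
join columns_v2 c on c.CD_ID = s.CD_ID
where d.NAME = 'dim' and t.TBL_NAME = 'tb_hcy_number_segment'
order by c.INTEGER_IDX;
```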
View partition status of all tables
```sql
use hive;
```
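The rest of this snippet was truncated; a sketch of the kind of query the heading implies, assuming the metastore lives in a MySQL database named hive as above:

```sql
-- partition count per table, largest first
select d.NAME as db_name, t.TBL_NAME as table_name, count(p.PART_ID) as partition_cnt
from tbls t
join dbs d on d.DB_ID = t.DB_ID
left join partitions p on p.TBL_ID = t.TBL_ID
group by d.NAME, t.TBL_NAME
order by partition_cnt desc;
```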
Deleting the files first and then repairing partitions (MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];) sends the files to the HDFS trash, and deleting files this way is slow.
Dropping the partitions directly: ALTER TABLE table_name DROP PARTITION (partition_spec)[, PARTITION (partition_spec), ...]
Modifying metadata directly
https://blog.csdn.net/haozhugogo/article/details/73274832
-- get DB_ID, TBL_ID, SD_ID
select * from tbls where TBL_NAME = 't1';
-- DB_LOCATION_URI
select * from dbs where DB_ID = 41;
-- spark.sql.sources.schema: column-name casing in tables created by Spark can make Hive read the columns as null;
-- on the Spark side this depends on: set spark.sql.caseSensitive=false;
select * from table_params where TBL_ID = 89239;
-- get CDS
select * from sds where SD_ID = 245768;
-- COLUMN_NAME, TYPE_NAME
select * from columns_v2 where CD_ID = 90196;
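Once the IDs are located, the fix itself is a plain UPDATE against the metastore. A sketch for relocating a database (path and DB_ID are illustrative; back up the metastore before touching it):

```sql
-- repoint the database location; DB_ID 41 comes from the lookup above
update dbs
set DB_LOCATION_URI = 'hdfs://hcycluster2/hive/warehouse/dim.db'
where DB_ID = 41;
```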
Beeline
!set fetchsize 200
!addlocaldriverjar Add driver jar file in the beeline client side.
!addlocaldrivername Add driver name that needs to be supported in the beeline client side.
!all Execute the specified SQL against all the current connections
!batch Start or execute a batch of statements
!brief Set verbose mode off
!call Execute a callable statement
!close Close the current connection to the database
!closeall Close all current open connections
!commit Commit the current transaction (if autocommit is off)
!connect Open a new connection to the database.
!go Select the current connection
!help Print a summary of command usage
!history Display the command history
!list List the current connections
!metadata Obtain metadata information
!nullemptystring Set to true to get historic behavior of printing null as
empty string. Default is false.
!outputformat Set the output format for displaying results
(table,vertical,csv2,dsv,tsv2,xmlattrs,xmlelements, and
deprecated formats(csv, tsv))
!reconnect Reconnect to the database
!record Record all output to the specified file
!rehash Fetch table and column names for command completion
!run Run a script from the specified file
!save Save the current variables and aliases
!scan Scan for installed JDBC drivers
!script Start saving a script to a file
!set Set a beeline variable
!sh Execute a shell command
!sql Execute a SQL command
!tables List all the tables in the database
!typeinfo Display the type map for the current connection
!verbose Set verbose mode on
Query
Maximum consecutive login days
If the logins are consecutive - for the user with id 1, say, 2021-08-01 through 2021-08-03 are consecutive login days - then subtracting each row's row number (ordered by login date) from its login date gives the same value for the whole run, here 2021-07-31. Grouping by user and by that date-minus-row-number value therefore yields exactly the runs of consecutive login days; the full query is sketched below.
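The original snippet was truncated to select id,; a sketch of the complete query, assuming a table login(id, login_date) with one row per user per login day:

```sql
select id, max(days) as max_consecutive_days
from (
    select id, grp, count(*) as days
    from (
        -- login_date minus row_number is constant within one consecutive run
        select id, login_date,
               date_sub(login_date, row_number() over (partition by id order by login_date)) as grp
        from login
    ) t
    group by id, grp
) s
group by id;
```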
Consecutive login days as of each day
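The query here was also truncated (it began SELECT login.id,, hinting at a self-join on a login table). The same grouping trick covers it; within each run, the row number is exactly the streak length up to that date:

```sql
select id, login_date,
       -- position within the run = consecutive login days ending on login_date
       row_number() over (partition by id, grp order by login_date) as consecutive_days
from (
    select id, login_date,
           date_sub(login_date, row_number() over (partition by id order by login_date)) as grp
    from login
) t;
```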