Problem Description
I want to extract the hashtag timeline, grouped by date, from the notes data column. Data is a JSON column. We have to count hashtags that can be in the 't' OR 'd' property.
Table: notes
----------------------------------------------------------------------
| id | data                                   | created_at          |
----------------------------------------------------------------------
| 1  | {"t":"#hash1 title","d":"#hash1 desc"} | 2018-01-01 10:00:00 |
| 2  | {"t":"#hash1 title","d":"#hash1 desc"} | 2018-01-01 11:00:00 |
| 3  | {"t":"title","d":"#hash1 #hash2 desc"} | 2018-01-03 10:00:00 |
As described below, the required output needs to have each hashtag with its corresponding timeline in the format: DATE:COUNT|DATE:COUNT|DATE:COUNT
Required Output
----------------------------------------------------------
| hashtag | timeline                  |
----------------------------------------------------------
| #hash1  | 2018-01-01:4|2018-01-03:1 |
| #hash2  | 2018-01-03:1              |
What is the most efficient single query that has all these features:
- Extract Hashtags from 't' AND 'd' property of data.
- Count the Hashtags grouped by date.
- Concatenate respective hashtag timeline in the desired format.
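To make the required transformation concrete, here is a rough pure-Python sketch of the same pipeline (extract hashtags from the 't'/'d' values, count per date, join into DATE:COUNT|… strings). The variable names and the use of Python's `re` module are illustrative only; they are not part of the BigQuery solution.

```python
import re
from collections import defaultdict

# Sample rows from the notes table: (id, data, created_at)
rows = [
    (1, '{"t":"#hash1 title","d":"#hash1 desc"}', '2018-01-01 10:00:00'),
    (2, '{"t":"#hash1 title","d":"#hash1 desc"}', '2018-01-01 11:00:00'),
    (3, '{"t":"title","d":"#hash1 #hash2 desc"}', '2018-01-03 10:00:00'),
]

counts = defaultdict(lambda: defaultdict(int))  # hashtag -> date -> count
for _id, data, created_at in rows:
    day = created_at[:10]  # date part of the timestamp
    # Pull the values of the "t" and "d" properties, then the hashtags inside.
    for val in re.findall(r'"(?:t|d)":"(.*?)"', data):
        for tag in re.findall(r'(#.*?)(?:$|\s)', val):
            counts[tag][day] += 1

# Join each hashtag's per-day counts into the DATE:COUNT|DATE:COUNT format.
timelines = {
    tag: '|'.join(f'{day}:{cnt}' for day, cnt in sorted(days.items()))
    for tag, days in counts.items()
}
# timelines == {'#hash1': '2018-01-01:4|2018-01-03:1', '#hash2': '2018-01-03:1'}
```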
UPDATE 1: Below is my query. It's inefficient because I have to UNNEST twice. I am not able to figure out how to make it efficient.
WITH r0 AS (
  SELECT JSON_EXTRACT_SCALAR(data, '$[d]') AS data, created_at FROM `notes`
  UNION ALL
  SELECT JSON_EXTRACT_SCALAR(data, '$[t]') AS data, created_at FROM `notes`
), r1 AS (
  SELECT created_at,
    REGEXP_EXTRACT_ALL(data, r"#(\w*[0-9a-zA-Z]+\w*[0-9a-zA-Z])") AS hashtags
  FROM r0
), r2 AS (
  SELECT ARRAY_AGG(DATE(created_at)) AS created_at_dates, hashtag
  FROM r1, UNNEST(hashtags) hashtag
  GROUP BY hashtag
), r3 AS (
  SELECT created_at_date, hashtag
  FROM r2, UNNEST(created_at_dates) created_at_date
), r4 AS (
  SELECT hashtag, created_at_date, COUNT(created_at_date) AS day_val
  FROM r3
  GROUP BY hashtag, created_at_date
  ORDER BY created_at_date
)
SELECT hashtag,
  STRING_AGG(CONCAT(CAST(created_at_date AS STRING), ':', CAST(day_val AS STRING)), '|') AS timeline
FROM r4
GROUP BY hashtag
Recommended Answer
Below is for BigQuery Standard SQL
#standardSQL
SELECT hashtag,
  STRING_AGG(CONCAT(day, ':', cnt), '|' ORDER BY day) AS timeline
FROM (
  SELECT hashtag,
    CAST(DATE(created_at) AS STRING) day,
    CAST(COUNT(1) AS STRING) cnt
  FROM `project.dataset.table`,
    UNNEST(REGEXP_EXTRACT_ALL(data, r'"(?:t|d)":(".*?")')) val,
    UNNEST(REGEXP_EXTRACT_ALL(val, r'(#.*?)\s')) hashtag
  GROUP BY hashtag, day
)
GROUP BY hashtag
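The two UNNEST stages in this query correspond to two regex passes: one pulls the "t"/"d" property values out of the JSON string, the other pulls hashtags out of those values. A small Python sketch of the same two patterns (BigQuery uses RE2, but these patterns behave the same under Python's `re`; the variable names are illustrative):

```python
import re

data = '{"t":"#hash1 title","d":"#hash1 #hash2 desc"}'

# Stage 1: extract the "t" and "d" property values (quotes included,
# matching the query's first REGEXP_EXTRACT_ALL pattern).
vals = re.findall(r'"(?:t|d)":(".*?")', data)
# ['"#hash1 title"', '"#hash1 #hash2 desc"']

# Stage 2: extract hashtags that are followed by whitespace.
tags = [t for v in vals for t in re.findall(r'(#.*?)\s', v)]
# ['#hash1', '#hash1', '#hash2']
```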
If you need to extract more than just the t and d properties, you just add them to the (?:t|d) list, as opposed to using multiple UNION ALLs.
If you execute the above on the sample data in your question, the result is:
Row hashtag timeline
1   #hash1  2018-01-01:4|2018-01-03:1
2   #hash2  2018-01-03:1
Update to address the "deep structure" mentioned in @user2576951's comment.
See the update below, along with dummy data to test with.
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 id, '{"x":"title","t":"#hash1 title","d":"help #hash1 desc"}' data, TIMESTAMP '2018-01-01 10:00:00' created_at UNION ALL
  SELECT 2, '{"t":"#hash1 title","y":"title","d":"#hash1 desc"}', '2018-01-01 11:00:00' UNION ALL
  SELECT 3, '{"t":"title","d":"#hash1 #hash2 desc","z":"title"}', '2018-01-03 10:00:00' UNION ALL
  SELECT 4, '{"t":"title","d":"description","snippet":{"t":"#hash1","st":"#hash1", "ssd":"#hash3"}}', '2018-02-03 10:00:00'
)
SELECT hashtag,
  STRING_AGG(CONCAT(day, ':', cnt), '|' ORDER BY day) AS timeline
FROM (
  SELECT hashtag,
    CAST(DATE(created_at) AS STRING) day,
    CAST(COUNT(1) AS STRING) cnt
  FROM `project.dataset.table`,
    UNNEST(REGEXP_EXTRACT_ALL(data, r'"(?:t|d|st|sd)":"(.*?)"')) val,
    UNNEST(REGEXP_EXTRACT_ALL(val, r'(#.*?)(?:$|\s)')) hashtag
  GROUP BY hashtag, day
)
GROUP BY hashtag
-- ORDER BY hashtag
with output
Row hashtag timeline
1   #hash1  2018-01-01:4|2018-01-03:1|2018-02-03:2
2   #hash2  2018-01-03:1
As you can see here, hashtags are collected from nested elements, and "ssd" was not matched even though "sd" is part of it.
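This key-matching behavior can be checked with the same patterns in Python (RE2 and Python's `re` agree here; variable names are illustrative). The pattern requires a quote immediately before the key name, so "ssd" never matches "sd", and (?:$|\s) also catches a hashtag at the very end of a value:

```python
import re

# Row 4 from the dummy data: nested "snippet" object with an "ssd" key.
data = ('{"t":"title","d":"description",'
        '"snippet":{"t":"#hash1","st":"#hash1", "ssd":"#hash3"}}')

# "ssd" is skipped: the leading quote in the pattern anchors the key name.
vals = re.findall(r'"(?:t|d|st|sd)":"(.*?)"', data)
# ['title', 'description', '#hash1', '#hash1']

# (?:$|\s) matches a hashtag even when nothing follows it in the value.
tags = [t for v in vals for t in re.findall(r'(#.*?)(?:$|\s)', v)]
# ['#hash1', '#hash1']
```

This is why row 4 contributes 2018-02-03:2 to #hash1's timeline and #hash3 is never counted.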
I think the above addresses both of your comments/concerns.
Other Recommended Answer
I'm not sure if this is "most efficient", but this should do what you want:
select hashtag,
  array_agg(concat(cast(created_at as string), ':', cast(cnt as string))) as timeline
from (select hashtag, created_at, count(*) as cnt
      from ((select json_extract_scalar(data, '$[d]') as hashtag, created_at from t)
            union all
            (select json_extract_scalar(data, '$[t]') as hashtag, created_at from t)
           ) h
      group by hashtag, created_at
     ) ch
group by hashtag;