【论文阅读 ICTIR‘2022】Revisiting Open Domain Query Facet Extraction and Generation
创始人 | 2024-04-15 04:02:17

Table of Contents

  • Revisiting Open Domain Query Facet Extraction and Generation
    • Motivation
    • Contributions
    • Method
      • Facet Extraction and Generation
      • Facet Extraction as Sequence Labeling
      • Autoregressive Facet Generation
      • Facet Generation as Extreme Multi-Label Classification
      • Facet Generation by Prompting Large Language Models
      • Unsupervised Facet Extraction from SERP
      • Facet Lists Aggregation
    • Data

Revisiting Open Domain Query Facet Extraction and Generation

https://dl.acm.org/doi/abs/10.1145/3539813.3545138

Motivation

Revisit the task of query facet extraction and generation and study various formulations of this task.

  • Also explore various aggregation approaches, based on relevance and diversity, to combine the facet sets produced by the different formulations of the task.

Contributions

  • Introduces novel formulations of the facet extraction and generation task (enabled by recent advances in text understanding and generation)
  • Through offline evaluation, demonstrates that the models studied in the paper significantly outperform state-of-the-art baselines, and that combining them further improves recall
  • Releases an open-source toolkit, named Faspect, that implements the facet extraction and generation methods in the paper

Method

Facet Extraction and Generation

We focus on the extraction and generation of facets from the search engine result page (SERP) for a given query.

  • training set: $\{(q_i, D_i, F_i)\}_{i=1}^{n}$

    • $q_i$ is an open-domain search query
    • $D_i = [d_{i1}, d_{i2}, \cdots, d_{ik}]$ denotes the top $k$ documents returned by a retrieval model in response to the query
    • $F_i = \{f_{i1}, f_{i2}, \cdots, f_{im}\}$ is a set of $m$ ground-truth facets associated with query $q_i$

The task is to train a model to return an accurate list of facets.
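As a concrete (purely illustrative) picture of one training instance, the triple $(q_i, D_i, F_i)$ can be modeled as a small record type; the field names and example values below are my own, not from the paper:

```python
from typing import List, NamedTuple

class Instance(NamedTuple):
    query: str            # q_i: an open-domain search query
    documents: List[str]  # D_i: top-k documents returned by a retrieval model
    facets: List[str]     # F_i: ground-truth facets for q_i

# one hypothetical training instance
ex = Instance(
    query="mortgage rates",
    documents=["doc 1 snippet ...", "doc 2 snippet ..."],
    facets=["current rates", "rate calculator", "refinance"],
)
```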

Facet Extraction as Sequence Labeling

We can cast the facet extraction problem as a sequence labeling task over the document tokens:

  • $w_x \in \mathrm{tokenize}(d_{ij})$

Our model $M_{\theta_{ext}}$ classifies each document token into B, I, or O. We use RoBERTa and apply an MLP with an output dimensionality of three to each token representation of the encoder.

  • input: [CLS] query tokens [SEP] doc tokens [SEP]

  • objective (token-level cross-entropy):

    $\mathcal{L}_{ext} = -\sum_{x} \log p(y_x \mid w_x)$

    • where $y_x \in \{\mathrm{B}, \mathrm{I}, \mathrm{O}\}$ is the gold label of token $w_x$

    • where $p$ can be computed by applying a softmax operator to the model's output for the $x^{th}$ token

  • inference: get the model output for all the documents in $D_i$, decode the B/I/O tags into facet strings, and sort the extracted facets by frequency
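A minimal sketch of this inference step, assuming the per-token B/I/O labels have already been predicted (the `decode_bio` helper and the toy inputs are illustrative, not the paper's released code): decode contiguous B/I spans into facet strings, then rank facets by how often they appear across the documents in $D_i$.

```python
from collections import Counter

def decode_bio(tokens, labels):
    """Turn parallel token/label lists into facet strings (B starts a span, I continues it)."""
    facets, span = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if span:
                facets.append(" ".join(span))
            span = [tok]
        elif lab == "I" and span:
            span.append(tok)
        else:  # "O", or a stray "I" with no open span
            if span:
                facets.append(" ".join(span))
            span = []
    if span:
        facets.append(" ".join(span))
    return facets

def rank_facets(tagged_docs):
    """Aggregate extracted facets over all documents and sort by frequency."""
    counts = Counter()
    for tokens, labels in tagged_docs:
        counts.update(decode_bio(tokens, labels))
    return [f for f, _ in counts.most_common()]

ranked = rank_facets([
    (["cheap", "flights", "to", "rome"], ["B", "I", "O", "O"]),
    (["book", "cheap", "flights", "now"], ["O", "B", "I", "O"]),
    (["rome", "hotels"], ["B", "I"]),
])
# ranked == ["cheap flights", "rome hotels"]
```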

Autoregressive Facet Generation

We perform facet generation using an autoregressive text generation model.

For every query $q_i$, we concatenate the facets in $F_i$ using a separation token to form the target sequence $y_i$.

The model is BART (a Transformer-based encoder-decoder model for text generation), and we use two variations:

  • variations:

    • one only takes the query tokens as input and generates the facets

    • the other takes the query tokens and the document tokens of all documents in the SERP (separated by [SEP]) as input and generates the facet tokens one by one

  • objective (sequence negative log-likelihood):

    $\mathcal{L}_{gen} = -\sum_{t} \log p(y_{i,t} \mid y_{i,<t}, v)$

    • $v$ is the BART encoder's output
  • inference: perform autoregressive text generation with beam search and sampling, conditioning the probability of each token on the previously generated tokens
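A tiny sketch of the target-side preprocessing, under one assumption: the paper only says "a separation token", so the textual separator `" | "` below is a stand-in. Facets are joined into one sequence $y_i$ for training, and a generated string is split back into a facet list at inference.

```python
SEP = " | "  # stand-in for the paper's (unspecified) separation token

def facets_to_target(facets):
    """Concatenate the gold facets F_i into a single target string y_i."""
    return SEP.join(facets)

def target_to_facets(generated):
    """Split a generated sequence back into a clean facet list, dropping empties."""
    return [f.strip() for f in generated.split(SEP.strip()) if f.strip()]
```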

Facet Generation as Extreme Multi-Label Classification

We treat the facet generation task as an extreme multi-label text classification problem.

  • The intuition behind this approach is that some facets tend to appear very frequently across different queries.

The model is RoBERTa, denoted $M_{\theta_{mcl}}$.

  • We get the probability of every facet by applying a linear transformation to the representation of the [CLS] token, followed by a sigmoid activation.

  • objective (binary cross-entropy):

    $\mathcal{L}_{mcl} = -\sum_{j} \left[ y_{i,j} \log y'_{i,j} + (1 - y_{i,j}) \log (1 - y'_{i,j}) \right]$

    • where $y'_{i,j}$ is the predicted probability of relevance of the facet $f_j$ given the query $q_i$ and the list of documents $D_i$

      • it can be computed by applying a sigmoid operator to the model's output for the $j^{th}$ facet class:

        $y'_{i,j} = \sigma\!\left(M_{\theta_{mcl}}(q_i, D_i)_j\right)$
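A sketch of the selection step at inference, assuming a fixed facet vocabulary and a probability threshold (both illustrative choices, not specified above): apply a sigmoid to each class logit and keep the facets whose probability clears the threshold, highest first.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def select_facets(logits, facet_vocab, threshold=0.5):
    """Score each facet class with a sigmoid and return those above the threshold, best first."""
    scored = [(f, sigmoid(z)) for f, z in zip(facet_vocab, logits)]
    kept = [(f, p) for f, p in scored if p > threshold]
    return [f for f, _ in sorted(kept, key=lambda t: -t[1])]

chosen = select_facets([2.0, -1.0, 0.3], ["price", "size", "reviews"])
# chosen == ["price", "reviews"]
```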

Facet Generation by Prompting Large Language Models

We investigate the few-shot effectiveness of large-scale pre-trained autoregressive language models.

model: GPT-3

  • generate facets using a prompt consisting of a task description followed by a small number of examples

    • In the prompt, we state the number of facets at the beginning of every example output, so that we can control the number of facets GPT-3 generates.

    [prompt example figure omitted]
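A sketch of how such a few-shot prompt might be assembled (the wording and format are illustrative, not the paper's exact prompt): a task description, a few query-to-facets demonstrations with the facet count stated up front, then the target query.

```python
def build_prompt(task_description, examples, query, n_facets):
    """examples: (query, facet_list) pairs used as few-shot demonstrations."""
    parts = [task_description]
    for q, facets in examples:
        parts.append(f"Query: {q}\nFacets ({len(facets)}): " + ", ".join(facets))
    # ask for n_facets facets for the target query; the model completes after the colon
    parts.append(f"Query: {query}\nFacets ({n_facets}):")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Generate facets for the given web search query.",
    [("laptops", ["price", "brand"])],
    "running shoes",
    4,
)
```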

Unsupervised Facet Extraction from SERP

Use rule-based heuristics to extract candidate facets from the SERP and re-rank them.

Facet Lists Aggregation

We explore three aggregation methods: learning-to-rank (facet relevance ranking), MMR diversification, and round-robin diversification.

  • Facet Relevance Ranking:

    • use a bi-encoder model to assign a score to each candidate facet for each query and re-rank them based on their score in descending order

      • score: the dot product of the query and facet representations: $\mathrm{sim}(q_i, f_i) = E(q_i) \cdot E(f_i)$

      • $E$: the average token embedding of a BERT model pre-trained on multiple text similarity tasks. To find the optimal parameters, we minimize a cross-entropy loss over every positive query-facet pair $(q_i, f_i^+)$ in the MIMICS dataset:

        $\mathcal{L} = -\log \frac{\exp(\mathrm{sim}(q_i, f_i^+))}{\exp(\mathrm{sim}(q_i, f_i^+)) + \sum_{j=1}^{B-1} \exp(\mathrm{sim}(q_i, f_{i,j}^-))}$

        • $B$ is the training batch size
        • $\{f_{i,j}^-\}_{j=1}^{B-1}$ is the set of in-batch negative examples
  • MMR diversification:

    • use a popular diversification approach, named Maximal Marginal Relevance (MMR).

      • The intuition is that different models may generate redundant facets

      • score function (for each candidate $f \in R \setminus S$):

        $\mathrm{score}(f) = \lambda\, \mathrm{sim}(q, f) - (1 - \lambda) \max_{f' \in S} \mathrm{sim}(f, f')$

        • $R$ is the list of extracted facets for a given query
        • $S$ is the set of already selected facets
  • Round Robin Diversification:

    • iterate over the four lists of facets generated by the different models, alternately selecting the facet with the highest score from each list until we reach the desired number of facets.
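The two diversification strategies above can be sketched over toy embedding vectors (the similarity function, λ, and all inputs are illustrative): greedy MMR reselection, and round-robin interleaving of several ranked lists with de-duplication.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mmr_select(query_emb, cand_embs, lam=0.5, k=5):
    """Greedy MMR: balance relevance to the query against similarity to already-selected facets."""
    selected, remaining = [], dict(cand_embs)
    while remaining and len(selected) < k:
        def score(f):
            rel = dot(query_emb, remaining[f])
            red = max((dot(remaining[f], cand_embs[s]) for s in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

def round_robin(facet_lists, k=5):
    """Alternate over the models' ranked lists, taking the best unseen facet from each in turn."""
    out, seen = [], set()
    queues = [list(lst) for lst in facet_lists]
    while len(out) < k and any(queues):
        for q in queues:
            while q:
                f = q.pop(0)
                if f not in seen:
                    seen.add(f)
                    out.append(f)
                    break
            if len(out) == k:
                break
    return out
```

With a low λ, MMR skips a facet that duplicates an already-selected one even when it is highly relevant, which is exactly the redundancy case motivating the aggregation step.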

Data

MIMICS: contains web search queries sampled from the Bing query logs, and for each query, it provides up to 5 facets and the returned result snippets.

  • train: MIMICS-Click
  • evaluation: MIMICS-Manual
