Learn Tech, Learn English: The Two Phases of an Elasticsearch Search (Querying and Fetching)

2025/1/31 11:49:45 Tags: elasticsearch, jenkins, big data

To understand Elasticsearch’s distributed search, let’s take a moment to understand how querying and fetching work. Unlike simple CRUD tasks, distributed search is like navigating through a maze of shards spread across the cluster.

In Elasticsearch, a CRUD operation handles a single document identified by a unique combination of index, type, and routing value (by default, the document’s _id), so the request can go straight to the shard that holds it. Search queries are more complex: they have no fixed destination and must consult a copy of every shard in the target index or indices to locate potential matches.
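The contrast between the two kinds of request can be sketched in a few lines of Python. This is a minimal sketch that assumes the simplified routing rule `shard = hash(routing) % number_of_primary_shards`; `zlib.crc32` is only a stand-in for the hash Elasticsearch really uses, and both function names are hypothetical.

```python
import zlib


def shard_for_document(routing: str, num_primary_shards: int) -> int:
    """Pick the single shard that owns a document (simplified routing sketch)."""
    # crc32 is a stand-in hash for illustration, not the one Elasticsearch uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def shards_for_search(num_primary_shards: int) -> list:
    """A search without a routing value has to visit every shard."""
    return list(range(num_primary_shards))


if __name__ == "__main__":
    # A CRUD request resolves to exactly one shard number.
    print(shard_for_document("doc-42", 5))
    # A search fans out to all of them.
    print(shards_for_search(5))
```

The same routing value always maps to the same shard, which is why a document lookup needs no broadcast while a search does.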

However, discovering matching documents marks just the beginning. The search API needs to combine results from various shards into a unified, organized list before displaying them to the user. This initiates the two-step process of querying and fetching.

By default, Elasticsearch utilizes a search method known as “Query Then Fetch.” This approach progresses through the following steps:

  1. The client sends a query to Elasticsearch.
  2. The query is broadcast to each shard.
  3. Each shard finds all matching documents and calculates scores using local term/document frequencies.
  4. Each shard builds a priority queue of results (sort, pagination with from/size, etc.).
  5. Each shard returns metadata about its results to the requesting node. Note that the actual documents are not sent yet, just the document IDs and scores.
  6. Scores from all the shards are merged and sorted on the requesting node, and documents are selected according to the query criteria.
  7. Finally, the actual documents are retrieved from the individual shards where they reside.
  8. Results are returned to the client.

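Note: The coordinating node is responsible for steps 1, 2, and 8. It is also the requesting node that merges the shard results in step 6 and sends the fetch requests in step 7.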

Query Phase (3,4,5,6): the search query is sent to every shard, initiating local execution and the creation of a priority queue containing matching documents.

Fetch Phase (7): while the query phase identifies relevant documents, the fetch phase is responsible for fetching the actual documents from their respective shards.
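The eight steps can be mimicked with a small in-memory simulation. This is a sketch under stated assumptions, not how Elasticsearch is implemented: the `Shard` class and `search` function are hypothetical, scores are hard-coded instead of being computed from term/document frequencies, and document IDs are assumed to be unique across shards.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Shard:
    """A toy shard: maps doc id -> (precomputed score, source document)."""
    shard_id: int
    docs: dict

    def query(self, from_: int, size: int) -> list:
        # Query phase: return only lightweight (score, shard_id, doc_id)
        # entries for the local top from_ + size hits, never the documents.
        entries = [(score, self.shard_id, doc_id)
                   for doc_id, (score, _source) in self.docs.items()]
        return heapq.nlargest(from_ + size, entries)

    def fetch(self, doc_ids: list) -> dict:
        # Fetch phase: return the full documents for the requested ids.
        return {doc_id: self.docs[doc_id][1] for doc_id in doc_ids}


def search(shards: list, from_: int = 0, size: int = 10) -> list:
    """Coordinating-node logic: scatter the query, merge, then fetch."""
    # Steps 2-5: broadcast the query and collect each shard's top entries.
    merged = []
    for shard in shards:
        merged.extend(shard.query(from_, size))
    # Step 6: sort globally by score, then apply from/size pagination.
    merged.sort(reverse=True)
    page = merged[from_:from_ + size]
    # Step 7: ask only the shards that own a selected document.
    wanted = {}
    for _score, shard_id, doc_id in page:
        wanted.setdefault(shard_id, []).append(doc_id)
    by_id = {shard.shard_id: shard for shard in shards}
    sources = {}
    for shard_id, doc_ids in wanted.items():
        sources.update(by_id[shard_id].fetch(doc_ids))
    # Step 8: return the hits in global score order.
    return [{"_id": doc_id, "_score": score, "_source": sources[doc_id]}
            for score, _shard_id, doc_id in page]


if __name__ == "__main__":
    shards = [
        Shard(0, {"a": (1.2, {"title": "A"}), "b": (0.4, {"title": "B"})}),
        Shard(1, {"c": (2.0, {"title": "C"}), "d": (0.9, {"title": "D"})}),
    ]
    for hit in search(shards, from_=0, size=3):
        print(hit)
```

Because every shard has to return its own top `from + size` entries, the coordinating node sorts up to `number_of_shards × (from + size)` entries before it can pick a page, which is why deep pagination gets expensive.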

This two-phase split keeps search efficient and scalable in a distributed setting. In the query phase, the search request goes to one copy of each shard (primary or replica), which runs the search locally and compiles a priority queue of matching documents. This is the first step in narrowing down the search results.

The fetch phase then retrieves the selected documents and delivers the final search results. It bridges query execution and result retrieval, completing the search process.

Additional information:

Enabling Elasticsearch’s slow logs separately for the query and fetch phases allows precise monitoring and optimization of search performance. By setting separate thresholds for query and fetch durations, administrators can pinpoint potential bottlenecks and adjust system parameters accordingly.

For instance, configuring slow logs with specific thresholds for query and fetch phases can be done as follows:

PUT *,-.*/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "100ms"
}

#or with curl

curl -XPUT "http://localhost:9200/*,-.*/_settings" -H "Content-Type: application/json" -d'
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.fetch.warn": "100ms"
}'
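The same request can also be prepared from Python using only the standard library. This is a minimal sketch: the helper name `build_slowlog_request` is hypothetical, the default host is the `localhost:9200` from the curl example, and the function only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_slowlog_request(base_url: str = "http://localhost:9200",
                          index_pattern: str = "*,-.*",
                          query_warn: str = "1s",
                          fetch_warn: str = "100ms") -> urllib.request.Request:
    """Build (but do not send) the PUT _settings request from the curl example."""
    body = {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_slowlog_request()
    print(req.get_method(), req.full_url)
    # To apply the settings against a reachable cluster, uncomment:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Keeping the two thresholds as separate parameters mirrors the point of the configuration: the query and fetch phases get their own limits.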

Elasticsearch query vs fetch times

Fetch time is expected to be much lower than query time. A topic on the Elastic discussion forum covers this speed difference.
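One way to compare the two phases on a real cluster is the index stats API (`GET <index>/_stats/search`), whose `search` section reports cumulative `query_total`, `query_time_in_millis`, `fetch_total`, and `fetch_time_in_millis` counters. The sketch below turns those counters into per-operation averages; the helper name is hypothetical and the sample numbers are invented for illustration, not measurements.

```python
def average_phase_times(search_stats: dict) -> dict:
    """Return the average milliseconds per query-phase and fetch-phase operation."""
    def avg(total_ms: int, count: int) -> float:
        # Guard against division by zero on an index that was never searched.
        return total_ms / count if count else 0.0

    return {
        "query_avg_ms": avg(search_stats["query_time_in_millis"],
                            search_stats["query_total"]),
        "fetch_avg_ms": avg(search_stats["fetch_time_in_millis"],
                            search_stats["fetch_total"]),
    }


if __name__ == "__main__":
    # Invented sample shaped like the "search" section of a stats response.
    sample = {
        "query_total": 2000,
        "query_time_in_millis": 9000,
        "fetch_total": 500,
        "fetch_time_in_millis": 250,
    }
    print(average_phase_times(sample))
```

The counters are cumulative since the shards were started, so comparing two snapshots taken some time apart gives a better picture of current behavior than a single reading.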

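Summary:

  1. The two-phase process of distributed search

    • Distributed search in Elasticsearch is split into a Query Phase and a Fetch Phase.

    • Query phase: the search request is broadcast to every shard; each shard executes the query locally and returns metadata (such as scores) for the matching documents.

    • Fetch phase: based on the query-phase results, the actual document content is retrieved from the individual shards.

  2. Query phase workflow

    • The client sends the search request to the coordinating node (Coordinator Node).

    • The coordinating node broadcasts the query to every shard of the index (primary or replica).

    • Each shard executes the query locally, calculates document scores, and builds a priority queue.

    • Each shard returns metadata (such as document IDs and scores) to the coordinating node, which merges and sorts the results from all shards.

  3. Fetch phase workflow

    • Based on the query-phase results, the coordinating node requests the actual document content from the relevant shards.

    • The shards return the document content, and the coordinating node returns the final results to the client.

  4. Slow log monitoring

    • Slow logs can be enabled separately for the query phase and the fetch phase to monitor and optimize search performance.

    • Example configuration: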

      PUT *,-.*/_settings
      {
        "index.search.slowlog.threshold.query.warn": "1s",
        "index.search.slowlog.threshold.fetch.warn": "100ms"
      }
  5. Query time versus fetch time

    • Fetch time is usually far lower than query time, because the query phase involves more computation and sorting.


http://www.niftyadmin.cn/n/5838632.html
