Construction of an Intelligent Retrieval System for the Virtual Space Science Observatory

doi:10.11871/jfdc.issn.2096-742X.2025.04.002

Frontiers of Data and Computing, 2025, Vol. 7, Issue 4: 20-32.

CSTR: 32002.14.jfdc.CN10-1649/TP.2025.04.002

Special Issue: Intelligent Algorithm Models and Tools for Space Science Big Data

LI Yunlong¹, JIAO Qirong¹, WANG Cifeng², ZOU Ziming¹,*

1. National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
2. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China

Received: 2025-05-31 Online: 2025-08-20 Published: 2025-08-21
Corresponding author: ZOU Ziming

About the authors:
LI Yunlong is an associate researcher and a master's advisor at the National Space Science Center, Chinese Academy of Sciences. His main research interests include intelligent retrieval and knowledge mining of big data in space science. In this paper, he is mainly responsible for the implementation and validation of the prototype system. E-mail: liyunlong@nssc.ac.cn

ZOU Ziming is a researcher and a doctoral supervisor at the National Space Science Center, Chinese Academy of Sciences, and Director of the National Space Science Data Center. He has long been engaged in research at the intersection of space science and data science, including scientific data governance theory, standards development, space information organization and interoperability, systems engineering for solar-terrestrial space big data, and data mining and knowledge discovery in space weather. In this paper, he is mainly responsible for the design of the prototype system. E-mail: mzou@nssc.ac.cn

Funding:
National Key Research and Development Program of China, key special project "Basic Research Conditions and Major Scientific Instrument and Equipment Development" (2022YFF0711400)

Abstract

[Background] With the rapid growth and increasingly multimodal nature of space science data, traditional retrieval based on metadata fields struggles to meet researchers' needs for complex semantic queries and for queries that were not predefined. Intelligent retrieval systems capable of semantic understanding are urgently needed. [Objective] This study aims to build an intelligent retrieval system for space science data that addresses the limitations of conventional metadata queries in semantic understanding and multimodal data retrieval, improving the efficiency and accuracy with which researchers discover heterogeneous space science data. [Methods] The system builds a dynamic semantic parsing mechanism on large language models and combines BM25 with dense vector retrieval to provide hybrid retrieval of datasets. For image and time-series data, models such as DINOv2, VISTA, and Timer-XL extract content features that are used to construct a multimodal semantic index. The system adopts a layered architecture that integrates full-text search with a vector database and supports multiple query modes, including natural language, tags, and data examples. [Conclusion] By integrating multiple AI models, the intelligent retrieval system for the Virtual Space Science Observatory significantly improves the flexibility and accuracy of data discovery and offers a new paradigm for the efficient use of large-scale space science data.

Key words: space science, big data, intelligent retrieval
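
The hybrid retrieval described in the Methods can be illustrated with a minimal sketch. The abstract states only that BM25 and dense vector retrieval are combined; the fusion rule used here (reciprocal rank fusion), the toy documents, and the hand-written three-dimensional vectors are all assumptions made for illustration. In a real system the vectors would come from a text embedding model, and function names such as `hybrid_search` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(scores):
    """Document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + position) over all rankings."""
    fused = Counter()
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + pos)
    return sorted(fused, key=lambda i: fused[i], reverse=True)

def hybrid_search(query, query_vec, docs, doc_vecs):
    # Assumed fusion strategy: rank by each retriever, then fuse the ranks.
    sparse = rank(bm25_scores(query, docs))
    dense = rank([cosine(query_vec, v) for v in doc_vecs])
    return rrf_fuse([sparse, dense])

# Toy corpus and hand-written vectors (hypothetical stand-ins for embeddings).
DOCS = [
    "solar wind proton density time series",
    "mars surface crater images",
    "solar magnetic field measurements",
]
DOC_VECS = [[0.9, 0.1, 0.2], [0.1, 0.9, 0.1], [0.7, 0.0, 0.6]]
QUERY = "solar wind density"
QUERY_VEC = [0.8, 0.1, 0.3]

print(hybrid_search(QUERY, QUERY_VEC, DOCS, DOC_VECS))  # [0, 2, 1]
```

Rank-based fusion is used in this sketch because BM25 scores and cosine similarities lie on different scales, so combining positions avoids having to normalize the raw scores.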

LI Yunlong, JIAO Qirong, WANG Cifeng, ZOU Ziming. Construction of an Intelligent Retrieval System for the Virtual Space Science Observatory[J]. Frontiers of Data and Computing, 2025, 7(4): 20-32. https://cstr.cn/32002.14.jfdc.CN10-1649/TP.2025.04.002

Table 1

Node	GPU	CPU	Memory (GB)	Storage
Retrieval service	A30 x4	x80	256	80 TB
Database	—	x40	256	200 TB
Index construction	A10 x8	x80	256	3.2 PB
