Upload files to a Vector Store

Upload file content to a Vector Store in batches.

post

Upload files to a vector store.

Path parameters

vector-store-id · string · Required

The ID of the vector store.

Header parameters

x-api-key · string · Required

The API key for authentication.

Body

files · string · binary[] · Optional

The files to upload.

Responses

201

Files uploaded successfully.

application/json

207

Some files failed to upload.


POST /api/v1/vector-stores/{vector-store-id}/documents/upload HTTP/1.1
Host: api.rememberizer.ai
x-api-key: text
Content-Type: multipart/form-data
Accept: */*
Content-Length: 20

{
  "files": [
    "binary"
  ]
}

{
  "documents": [
    {
      "id": 1,
      "name": "text"
    }
  ],
  "errors": [
    {
      "file": "text",
      "error": "text"
    }
  ]
}

Example requests

curl -X POST \
  https://api.rememberizer.ai/api/v1/vector-stores/vs_abc123/documents/upload \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@/path/to/document1.pdf" \
  -F "files=@/path/to/document2.docx"

Replace YOUR_API_KEY with your actual Vector Store API key and vs_abc123 with your Vector Store ID, and provide the paths to your local files.

const uploadFiles = async (vectorStoreId, files) => {
  const formData = new FormData();
  
  // Append each file to the form data
  for (const file of files) {
    formData.append('files', file);
  }
  
  const response = await fetch(`https://api.rememberizer.ai/api/v1/vector-stores/${vectorStoreId}/documents/upload`, {
    method: 'POST',
    headers: {
      'x-api-key': 'YOUR_API_KEY'
      // Note: do not set the Content-Type header; it is set automatically with the correct boundary
    },
    body: formData
  });
  
  const data = await response.json();
  console.log(data);
};

// Example using a file input element
const fileInput = document.getElementById('fileInput');
uploadFiles('vs_abc123', fileInput.files);

Replace YOUR_API_KEY with your actual Vector Store API key and vs_abc123 with your Vector Store ID.

import os
from contextlib import ExitStack

import requests

def upload_files(vector_store_id, file_paths):
    headers = {
        "x-api-key": "YOUR_API_KEY"
    }

    # ExitStack closes every opened file, even if the request raises
    with ExitStack() as stack:
        files = [
            ('files', (os.path.basename(file_path), stack.enter_context(open(file_path, 'rb'))))
            for file_path in file_paths
        ]

        response = requests.post(
            f"https://api.rememberizer.ai/api/v1/vector-stores/{vector_store_id}/documents/upload",
            headers=headers,
            files=files
        )

    data = response.json()
    print(data)

upload_files('vs_abc123', ['/path/to/document1.pdf', '/path/to/document2.docx'])

Replace YOUR_API_KEY with your actual Vector Store API key and vs_abc123 with your Vector Store ID, and provide the paths to your local files.

require 'net/http'
require 'uri'
require 'json'

def upload_files(vector_store_id, file_paths)
  uri = URI("https://api.rememberizer.ai/api/v1/vector-stores/#{vector_store_id}/documents/upload")
  
  # Create a new HTTP object
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  
  # Create the POST request that will carry the multipart form
  request = Net::HTTP::Post.new(uri)
  request['x-api-key'] = 'YOUR_API_KEY'
  
  # Create a multipart boundary
  boundary = "RubyFormBoundary#{rand(1000000)}"
  request['Content-Type'] = "multipart/form-data; boundary=#{boundary}"
  
  # Build the request body, one part per file
  body = []
  file_paths.each do |file_path|
    file_name = File.basename(file_path)
    file_content = File.read(file_path, mode: 'rb')
    
    body << "--#{boundary}\r\n"
    body << "Content-Disposition: form-data; name=\"files\"; filename=\"#{file_name}\"\r\n"
    body << "Content-Type: #{get_content_type(file_name)}\r\n\r\n"
    body << file_content
    body << "\r\n"
  end
  body << "--#{boundary}--\r\n"
  
  request.body = body.join
  
  # Send the request
  response = http.request(request)
  
  # Parse and return the response
  JSON.parse(response.body)
end

# Helper method to determine the content type from the file extension
def get_content_type(filename)
  ext = File.extname(filename).downcase
  case ext
  when '.pdf' then 'application/pdf'
  when '.doc' then 'application/msword'
  when '.docx' then 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  when '.txt' then 'text/plain'
  when '.md' then 'text/markdown'
  when '.json' then 'application/json'
  else 'application/octet-stream'
  end
end

# Example usage
result = upload_files('vs_abc123', ['/path/to/document1.pdf', '/path/to/document2.docx'])
puts result


<div data-gb-custom-block data-tag="hint" data-style='info'>

Replace `YOUR_API_KEY` with your actual Vector Store API key and `vs_abc123` with your Vector Store ID, and provide the paths to your local files.

</div>

</div>

</div>

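## Path parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| vector-store-id | string | **Required.** The ID of the vector store to upload files to. |

## Request body

This endpoint accepts a `multipart/form-data` request with one or more files in the `files` field.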
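
To make the request shape concrete, here is a minimal sketch that assembles a `multipart/form-data` body by hand with only the Python standard library, repeating the `files` field once per file. The helper name `encode_multipart` and the sample file contents are illustrative assumptions, not part of the Rememberizer API. In practice an HTTP client such as `requests` builds this body for you.

```python
import uuid


def encode_multipart(files, field_name="files"):
    """Build a multipart/form-data body (hypothetical helper).

    files: list of (filename, content_bytes, content_type) tuples.
    Returns (content_type_header, body_bytes).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for filename, content, content_type in files:
        # Each file becomes one part carrying the same field name
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        parts.append(header.encode("utf-8"))
        parts.append(content)
        parts.append(b"\r\n")
    # The closing delimiter ends the body
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


content_type, body = encode_multipart([
    ("notes.txt", b"hello", "text/plain"),
    ("data.json", b"{}", "application/json"),
])
print(content_type)
print(body.decode("utf-8"))
```

The returned header value is what would go in `Content-Type`; the boundary it names must match the delimiters in the body.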

## Response format

```json
{
  "documents": [
    {
      "id": 1234,
      "name": "document1.pdf",
      "type": "application/pdf",
      "size": 250000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    },
    {
      "id": 1235,
      "name": "document2.docx",
      "type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "size": 180000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    }
  ],
  "errors": []
}
```

If some files fail to upload, they are listed in the `errors` array:

```json
{
  "documents": [
    {
      "id": 1234,
      "name": "document1.pdf",
      "type": "application/pdf",
      "size": 250000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    }
  ],
  "errors": [
    {
      "file": "document2.docx",
      "error": "Unsupported file format"
    }
  ]
}
```
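
As a rough illustration of consuming this response, the sketch below splits a parsed response body into accepted documents and failed files. The function name `summarize_upload` is hypothetical, and the sample payload is trimmed from the example response shown here.

```python
def summarize_upload(result):
    """Split a parsed upload response into accepted documents and failed files."""
    # Map each accepted file name to the document ID assigned by the server
    accepted = {doc["name"]: doc["id"] for doc in result.get("documents", [])}
    # Map each rejected file name to its error message
    failed = {err["file"]: err["error"] for err in result.get("errors", [])}
    return accepted, failed


sample = {
    "documents": [{"id": 1234, "name": "document1.pdf", "status": "processing"}],
    "errors": [{"file": "document2.docx", "error": "Unsupported file format"}],
}
accepted, failed = summarize_upload(sample)
print("accepted:", accepted)
print("failed:", failed)
```

Keeping the document IDs makes it possible to look up each document's processing status later.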

## Authentication

This endpoint requires authentication with an API key passed in the `x-api-key` header.
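
One way to avoid hard-coding the key in source files is to read it from the environment. The sketch below assumes an environment variable named `REMEMBERIZER_API_KEY`; that name and the helper `build_headers` are illustrative choices, not something the API prescribes.

```python
import os


def build_headers(env_var="REMEMBERIZER_API_KEY"):
    """Return the authentication header, reading the key from the environment.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"x-api-key": api_key}
```

The returned dictionary can be passed as the `headers` argument of the upload requests shown in the examples on this page.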

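## Supported file formats

- PDF (.pdf)
- Microsoft Word (.doc, .docx)
- Microsoft Excel (.xls, .xlsx)
- Microsoft PowerPoint (.ppt, .pptx)
- Text files (.txt)
- Markdown (.md)
- JSON (.json)
- HTML (.html, .htm)

## File size limits

- Maximum size per file: 50 MB
- Maximum total request size: 100 MB
- Maximum number of files per request: 20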
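
These limits can be checked on the client before sending a request. The following is a minimal sketch of such a pre-flight check. The function name `check_batch` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which this page does not specify.

```python
import os

# Limits taken from this page; the byte interpretation of "MB" is an assumption
MAX_FILE_BYTES = 50 * 1024 * 1024
MAX_REQUEST_BYTES = 100 * 1024 * 1024
MAX_FILES = 20
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".txt", ".md", ".json", ".html", ".htm",
}


def check_batch(files):
    """Validate a batch before upload.

    files: list of (filename, size_in_bytes) tuples.
    Returns a list of problem descriptions; an empty list means the batch looks fine.
    """
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    total = 0
    for name, size in files:
        ext = os.path.splitext(name)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported extension {ext!r}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds the per-file limit")
        total += size
    if total > MAX_REQUEST_BYTES:
        problems.append("batch exceeds the total request limit")
    return problems


print(check_batch([("report.pdf", 2_000_000), ("setup.exe", 1_000)]))
```

For real files, the sizes could come from `os.path.getsize`. The server remains the authority on what it accepts, so this check only saves round trips.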

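## Error responses

| Status code | Description |
|-------------|-------------|
| 400 | Bad Request - no files provided or invalid request format |
| 401 | Unauthorized - invalid or missing API key |
| 404 | Not Found - vector store not found |
| 413 | Payload Too Large - files exceed the size limit |
| 415 | Unsupported Media Type - file format is not supported |
| 500 | Internal Server Error |
| 207 | Multi-Status - some files uploaded successfully, but others failed |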
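
A client usually needs to turn these status codes into a decision. The sketch below shows one possible mapping. The category names and the choice to treat 5xx responses as retryable are assumptions made for illustration, not behavior documented by the API.

```python
def classify_upload_status(status_code):
    """Map an HTTP status code from the upload endpoint to a coarse action."""
    if status_code == 201:
        return "success"          # every file was accepted
    if status_code == 207:
        return "partial"          # inspect the errors array for failed files
    if status_code in (400, 401, 404, 413, 415):
        return "fix_request"      # retrying the same request will not help
    if status_code >= 500:
        return "retry"            # assumed transient; retry with a delay
    return "unexpected"


for code in (201, 207, 413, 500):
    print(code, classify_upload_status(code))
```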

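## Processing status

Files are initially accepted with a `processing` status. You can check a document's processing status with the "Get a list of documents in a Vector Store" endpoint. Each document reports one of the following statuses:

- `done`: the document was processed successfully
- `error`: an error occurred during processing
- `processing`: the document is still being processed

Processing time depends on file size and complexity. It typically ranges from 30 seconds to 5 minutes per document.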
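
Because processing is asynchronous, a client typically polls until each document reaches `done` or `error`. The sketch below shows the polling loop only. `fetch_statuses` is a hypothetical callable that you would implement against the document-list endpoint and that returns a `{document_id: status}` mapping. The interval and timeout defaults are arbitrary choices.

```python
import time


def wait_for_processing(fetch_statuses, document_ids, interval=5.0, timeout=300.0):
    """Poll until every document is 'done' or 'error', or the timeout elapses.

    fetch_statuses: hypothetical callable returning {document_id: status}.
    Returns {document_id: last_known_status}.
    """
    deadline = time.monotonic() + timeout
    pending = set(document_ids)
    final = {}
    while pending:
        statuses = fetch_statuses()
        for doc_id in list(pending):
            status = statuses.get(doc_id)
            if status in ("done", "error"):
                final[doc_id] = status
                pending.discard(doc_id)
        if not pending or time.monotonic() >= deadline:
            break
        time.sleep(interval)
    # Anything still pending at the deadline is reported as still processing
    for doc_id in pending:
        final[doc_id] = "processing"
    return final


# Demonstration with a fake status source instead of a real API call
fake_responses = iter([
    {1234: "processing", 1235: "done"},
    {1234: "done", 1235: "done"},
])
print(wait_for_processing(lambda: next(fake_responses), [1234, 1235], interval=0))
```

Passing the status source in as a callable keeps the loop independent of the exact endpoint URL and response shape.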

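## Batch operations

To upload many files to your Vector Store efficiently, Rememberizer supports batch operations. This approach helps optimize performance when handling large numbers of documents.

### Batch upload implementation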

import os
import time
from pathlib import Path

import requests

def batch_upload_to_vector_store(vector_store_id, folder_path, batch_size=5, file_types=None):
    """
    Upload all files in a directory to a Vector Store in batches.

    Args:
        vector_store_id: ID of the Vector Store
        folder_path: Path of the folder containing the files to upload
        batch_size: Number of files to upload per batch
        file_types: Optional list of file extensions to filter by (e.g. ['.pdf', '.docx'])

    Returns:
        A list of upload results, one per batch
    """
    api_key = "YOUR_API_KEY"
    headers = {"x-api-key": api_key}

    # Collect the files in the directory
    files = []
    for entry in os.scandir(folder_path):
        if entry.is_file():
            file_path = Path(entry.path)
            # Filter by extension if file_types was given
            if file_types is None or file_path.suffix.lower() in file_types:
                files.append(file_path)

    print(f"Found {len(files)} files to upload")
    results = []
    total_batches = (len(files) + batch_size - 1) // batch_size

    # Process the files batch by batch
    for i in range(0, len(files), batch_size):
        batch = files[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{total_batches}: {len(batch)} files")

        upload_files = []
        try:
            # Open files inside the try block so a failed open still triggers cleanup
            for file_path in batch:
                upload_files.append(('files', (file_path.name, open(file_path, 'rb'))))

            response = requests.post(
                f"https://api.rememberizer.ai/api/v1/vector-stores/{vector_store_id}/documents/upload",
                headers=headers,
                files=upload_files
            )

            if response.status_code in (200, 201, 207):
                batch_result = response.json()
                results.append(batch_result)
                print(f"Batch uploaded successfully - {len(batch_result.get('documents', []))} documents processed")

                # Report any per-file errors
                if batch_result.get('errors'):
                    print(f"Errors encountered: {len(batch_result['errors'])}")
                    for error in batch_result['errors']:
                        print(f"- {error['file']}: {error['error']}")
            else:
                print(f"Batch upload failed with status code {response.status_code}: {response.text}")
                results.append({"error": f"Batch failed: {response.text}"})

        except Exception as e:
            print(f"Exception during batch upload: {e}")
            results.append({"error": str(e)})

        finally:
            # Close every file handle, whether the upload succeeded or not
            for _, (_, file_obj) in upload_files:
                file_obj.close()

        # Rate limiting: pause between batches
        if i + batch_size < len(files):
            print("Pausing before the next batch...")
            time.sleep(2)

    return results

# Example usage
results = batch_upload_to_vector_store(
    'vs_abc123',
    '/path/to/documents/folder',
    batch_size=5,
    file_types=['.pdf', '.docx', '.txt']
)

/**
 * Upload files to a Vector Store in batches
 * 
 * @param {string} vectorStoreId - ID of the Vector Store
 * @param {FileList|File[]} files - Files to upload
 * @param {Object} options - Configuration options
 * @returns {Promise<Array>} - List of upload results
 */
async function batchUploadToVectorStore(vectorStoreId, files, options = {}) {
  const {
    batchSize = 5,
    delayBetweenBatches = 2000,
    onProgress = null
  } = options;
  
  const apiKey = 'YOUR_API_KEY';
  const results = [];
  const fileList = Array.from(files);
  const totalBatches = Math.ceil(fileList.length / batchSize);
  
  console.log(`Preparing to upload ${fileList.length} files in ${totalBatches} batches`);
  
  // Process the files in batches
  for (let i = 0; i < fileList.length; i += batchSize) {
    const batch = fileList.slice(i, i + batchSize);
    const batchNumber = Math.floor(i / batchSize) + 1;
    
    console.log(`Processing batch ${batchNumber}/${totalBatches}: ${batch.length} files`);
    
    if (onProgress) {
      onProgress({
        currentBatch: batchNumber,
        totalBatches: totalBatches,
        filesInBatch: batch.length,
        totalFiles: fileList.length,
        completedFiles: i
      });
    }
    
    // Create the FormData for this batch
    const formData = new FormData();
    batch.forEach(file => {
      formData.append('files', file);
    });
    
    try {
      const response = await fetch(
        `https://api.rememberizer.ai/api/v1/vector-stores/${vectorStoreId}/documents/upload`,
        {
          method: 'POST',
          headers: {
            'x-api-key': apiKey
          },
          body: formData
        }
      );
      
      if (response.ok) {
        const batchResult = await response.json();
        results.push(batchResult);
        
        console.log(`Batch uploaded successfully - ${batchResult.documents?.length || 0} documents processed`);
        
        // Report any per-file errors
        if (batchResult.errors && batchResult.errors.length > 0) {
          console.warn(`Errors encountered: ${batchResult.errors.length}`);
          batchResult.errors.forEach(error => {
            console.warn(`- ${error.file}: ${error.error}`);
          });
        }
      } else {
        console.error(`Batch upload failed with status ${response.status}: ${await response.text()}`);
        results.push({ error: `Batch failed with status: ${response.status}` });
      }
    } catch (error) {
      console.error(`Exception during batch upload: ${error.message}`);
      results.push({ error: error.message });
    }
    
    // Add a delay between batches to avoid hitting rate limits
    if (i + batchSize < fileList.length) {
      console.log(`Pausing ${delayBetweenBatches}ms before the next batch...`);
      await new Promise(resolve => setTimeout(resolve, delayBetweenBatches));
    }
  }
  
  console.log(`Upload complete. Processed ${fileList.length} files.`);
  return results;
}

// Example usage with a file input element
document.getElementById('upload-button').addEventListener('click', async () => {
  const fileInput = document.getElementById('file-input');
  const vectorStoreId = 'vs_abc123';
  
  const progressBar = document.getElementById('progress-bar');
  
  try {
    const results = await batchUploadToVectorStore(vectorStoreId, fileInput.files, {
      batchSize: 5,
      onProgress: (progress) => {
        // Update the progress UI
        const percentage = Math.round((progress.completedFiles / progress.totalFiles) * 100);
        progressBar.style.width = `${percentage}%`;
        progressBar.textContent = `${percentage}% (batch ${progress.currentBatch}/${progress.totalBatches})`;
      }
    });
    
    console.log('Full upload results:', results);
  } catch (error) {
    console.error('Upload failed:', error);
  }
});

require 'net/http'
require 'uri'
require 'json'
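require 'mime/types' # provided by the mime-types gem, not the standard library

# Upload files to a Vector Store in batches
#
# @param vector_store_id [String] ID of the Vector Store
# @param folder_path [String] Path of the folder containing the files to upload
# @param batch_size [Integer] Number of files to upload per batch
# @param file_types [Array<String>] Optional array of file extensions to filter by
# @param delay_between_batches [Float] Seconds to wait between batches
# @return [Array] List of upload results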
def batch_upload_to_vector_store(vector_store_id, folder_path, batch_size: 5, file_types: nil, delay_between_batches: 2.0)
  api_key = 'YOUR_API_KEY'
  results = []
  
  # Collect the files in the directory
  files = Dir.entries(folder_path)
    .select { |f| File.file?(File.join(folder_path, f)) }
    .select { |f| file_types.nil? || file_types.include?(File.extname(f).downcase) }
    .map { |f| File.join(folder_path, f) }
  
  puts "Found #{files.count} files to upload"
  total_batches = (files.count.to_f / batch_size).ceil
  
  # Process the files in batches
  files.each_slice(batch_size).with_index do |batch, batch_index|
    puts "Processing batch #{batch_index + 1}/#{total_batches}: #{batch.count} files"
    
    # Prepare the HTTP request
    uri = URI("https://api.rememberizer.ai/api/v1/vector-stores/#{vector_store_id}/documents/upload")
    request = Net::HTTP::Post.new(uri)
    request['x-api-key'] = api_key
    
    # Create the multipart form boundary
    boundary = "RubyBoundary#{rand(1000000)}"
    request['Content-Type'] = "multipart/form-data; boundary=#{boundary}"
    
    # Build the request body, one part per file
    body = []
    batch.each do |file_path|
      file_name = File.basename(file_path)
      mime_type = MIME::Types.type_for(file_path).first&.content_type || 'application/octet-stream'
      
      begin
        file_content = File.binread(file_path)
        
        body << "--#{boundary}\r\n"
        body << "Content-Disposition: form-data; name=\"files\"; filename=\"#{file_name}\"\r\n"
        body << "Content-Type: #{mime_type}\r\n\r\n"
        body << file_content
        body << "\r\n"
      rescue => e
        puts "Error reading file #{file_path}: #{e.message}"
      end
    end
    body << "--#{boundary}--\r\n"
    
    request.body = body.join
    
    # Send the request
    begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true
      response = http.request(request)
      
      if [200, 201, 207].include?(response.code.to_i)
        batch_result = JSON.parse(response.body)
        results << batch_result
        
        puts "Batch uploaded successfully - #{batch_result['documents']&.count || 0} documents processed"
        
        # Report any per-file errors
        if batch_result['errors'] && !batch_result['errors'].empty?
          puts "Errors encountered: #{batch_result['errors'].count}"
          batch_result['errors'].each do |error|
            puts "- #{error['file']}: #{error['error']}"
          end
        end
      else
        puts "Batch upload failed with status code #{response.code}: #{response.body}"
        results << { "error" => "Batch failed: #{response.body}" }
      end
    rescue => e
      puts "Exception during batch upload: #{e.message}"
      results << { "error" => e.message }
    end
    
    # Rate limiting: pause between batches
    if batch_index < total_batches - 1
      puts "Pausing #{delay_between_batches} seconds before the next batch..."
      sleep(delay_between_batches)
    end
  end
  
  puts "Upload complete. Processed #{files.count} files."
  results
end

# Example usage
results = batch_upload_to_vector_store(
  'vs_abc123',
  '/path/to/documents/folder',
  batch_size: 5,
  file_types: ['.pdf', '.docx', '.txt'],
  delay_between_batches: 2.0
)

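### Batch upload best practices

To optimize performance and reliability when uploading large numbers of files:

1. Manage batch size: keep batches between 5 and 10 files. Too many files in a single request increases the risk of timeouts.
2. Implement rate limiting: add a delay between batches (2-3 seconds recommended) to avoid hitting API rate limits.
3. Add retry logic: in production systems, retry failed uploads with exponential backoff.
4. Validate file types: pre-filter files before uploading to make sure they are of a supported type.
5. Monitor batch progress: in user-facing applications, give progress feedback during batch operations.
6. Handle partial success: the API may return a 207 status code to indicate partial success. Always check the status of each individual document.
7. Clean up resources: make sure all file handles are closed properly when errors occur.
8. Parallelize wisely: for very large uploads (thousands of files), consider running several concurrent batch processes against different vector stores, then merging the results if needed.
9. Use checksums: for critical data, verify file integrity with checksums before and after upload.
10. Log comprehensively: keep detailed logs of all upload operations to make troubleshooting easier.

By following these best practices, you can efficiently manage large-scale document ingestion into your vector stores.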
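
The retry recommendation can be sketched as a small wrapper around any upload call. This is a minimal illustration under stated assumptions: the helper name `retry_with_backoff`, the default delays, and the added jitter are choices made for this example, and it retries on any exception, which a production system would likely narrow to transient failures.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The delay doubles after each failure (base_delay, 2x, 4x, ...), is capped at
    max_delay, and gets random jitter so parallel clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))


# Demonstration with an operation that fails twice before succeeding
state = {"calls": 0}

def flaky_upload():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary network failure")
    return {"documents": [], "errors": []}

print(retry_with_backoff(flaky_upload, base_delay=0.01))
print("attempts:", state["calls"])
```

In a batch uploader, `operation` would be a function that uploads one batch and raises when the response indicates a retryable failure.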


Last updated 2 months ago