Upload Files to a Vector Store

Upload file content to a Vector Store with batch operations

Upload files to a Vector Store

post

Upload files to a vector store.

Path parameters

vector-store-id · string · Required

The ID of the vector store.

Header parameters

x-api-key · string · Required

The API key for authentication.

Body

files · string · binary[] · Optional

The files to upload.

Responses

201

Files uploaded successfully.

application/json

207

Some files failed to upload.

POST /api/v1/vector-stores/{vector-store-id}/documents/upload HTTP/1.1
Host: api.rememberizer.ai
x-api-key: text
Content-Type: multipart/form-data
Accept: */*
Content-Length: 20

{
  "files": [
    "binary"
  ]
}

{
  "documents": [
    {
      "id": 1,
      "name": "text"
    }
  ],
  "errors": [
    {
      "file": "text",
      "error": "text"
    }
  ]
}

Example Requests

curl -X POST \
  https://api.rememberizer.ai/api/v1/vector-stores/vs_abc123/documents/upload \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@/path/to/document1.pdf" \
  -F "files=@/path/to/document2.docx"

Replace YOUR_API_KEY with your actual Vector Store API key and vs_abc123 with your Vector Store ID, and provide the paths to your local files.

const uploadFiles = async (vectorStoreId, files) => {
  const formData = new FormData();
  
  // Add multiple files to the form data
  for (const file of files) {
    formData.append('files', file);
  }
  
  const response = await fetch(`https://api.rememberizer.ai/api/v1/vector-stores/${vectorStoreId}/documents/upload`, {
    method: 'POST',
    headers: {
      'x-api-key': 'YOUR_API_KEY'
      // Note: do not set the Content-Type header; it is set automatically with the correct boundary
    },
    body: formData
  });
  
  const data = await response.json();
  console.log(data);
};

// Example usage with a file input element
const fileInput = document.getElementById('fileInput');
uploadFiles('vs_abc123', fileInput.files);

Replace YOUR_API_KEY with your actual Vector Store API key and vs_abc123 with your Vector Store ID.

import os
from contextlib import ExitStack

import requests

def upload_files(vector_store_id, file_paths):
    headers = {
        "x-api-key": "YOUR_API_KEY"
    }

    # Open the files inside an ExitStack so every handle is closed after the request
    with ExitStack() as stack:
        files = [
            ('files', (os.path.basename(file_path), stack.enter_context(open(file_path, 'rb'))))
            for file_path in file_paths
        ]

        response = requests.post(
            f"https://api.rememberizer.ai/api/v1/vector-stores/{vector_store_id}/documents/upload",
            headers=headers,
            files=files
        )

    data = response.json()
    print(data)
    return data

upload_files('vs_abc123', ['/path/to/document1.pdf', '/path/to/document2.docx'])

require 'net/http'
require 'uri'
require 'json'

def upload_files(vector_store_id, file_paths)
  uri = URI("https://api.rememberizer.ai/api/v1/vector-stores/#{vector_store_id}/documents/upload")
  
  # Create a new HTTP object
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  
  # Create a multipart form request
  request = Net::HTTP::Post.new(uri)
  request['x-api-key'] = 'YOUR_API_KEY'
  
  # Create a multipart boundary
  boundary = "RubyFormBoundary#{rand(1000000)}"
  request['Content-Type'] = "multipart/form-data; boundary=#{boundary}"
  
  # Build the request body
  body = []
  file_paths.each do |file_path|
    file_name = File.basename(file_path)
    file_content = File.read(file_path, mode: 'rb')
    
    body << "--#{boundary}\r\n"
    body << "Content-Disposition: form-data; name=\"files\"; filename=\"#{file_name}\"\r\n"
    body << "Content-Type: #{get_content_type(file_name)}\r\n\r\n"
    body << file_content
    body << "\r\n"
  end
  body << "--#{boundary}--\r\n"
  
  request.body = body.join
  
  # Send the request
  response = http.request(request)
  
  # Parse and return the response
  JSON.parse(response.body)
end

# Helper method to determine the content type
def get_content_type(filename)
  ext = File.extname(filename).downcase
  case ext
  when '.pdf' then 'application/pdf'
  when '.doc' then 'application/msword'
  when '.docx' then 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
  when '.txt' then 'text/plain'
  when '.md' then 'text/markdown'
  when '.json' then 'application/json'
  else 'application/octet-stream'
  end
end

# Example usage
result = upload_files('vs_abc123', ['/path/to/document1.pdf', '/path/to/document2.docx'])
puts result


<div data-gb-custom-block data-tag="hint" data-style='info'>

Replace `YOUR_API_KEY` with your actual Vector Store API key and `vs_abc123` with your Vector Store ID, and provide the paths to your local files.

</div>

</div>

</div>

## Path Parameters

| Parameter         | Type   | Description                                                  |
|-------------------|--------|--------------------------------------------------------------|
| vector-store-id   | string | **Required.** The ID of the vector store to upload files to. |

## Request Body

This endpoint accepts a `multipart/form-data` request with one or more files in the `files` field.

## Response Format

```json
{
  "documents": [
    {
      "id": 1234,
      "name": "document1.pdf",
      "type": "application/pdf",
      "size": 250000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    },
    {
      "id": 1235,
      "name": "document2.docx",
      "type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "size": 180000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    }
  ],
  "errors": []
}
```

If some files fail to upload, they are listed in the errors array:

{
  "documents": [
    {
      "id": 1234,
      "name": "document1.pdf",
      "type": "application/pdf",
      "size": 250000,
      "status": "processing",
      "created": "2023-06-15T10:15:00Z",
      "vector_store": "vs_abc123"
    }
  ],
  "errors": [
    {
      "file": "document2.docx",
      "error": "Unsupported file format"
    }
  ]
}
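
As a minimal sketch of how a client might consume this response shape, the snippet below separates accepted documents from failed files. The function name `summarize_upload` is hypothetical, and the sample body simply mirrors the partial-failure example above.

```python
def summarize_upload(response_body):
    """Split an upload response into accepted document names and failed files."""
    accepted = [doc["name"] for doc in response_body.get("documents", [])]
    # Map each failed file name to the error message reported for it
    failed = {err["file"]: err["error"] for err in response_body.get("errors", [])}
    return accepted, failed


# Sample body copied from the partial-failure example
sample = {
    "documents": [{"id": 1234, "name": "document1.pdf", "status": "processing"}],
    "errors": [{"file": "document2.docx", "error": "Unsupported file format"}],
}

accepted, failed = summarize_upload(sample)
print("Accepted:", accepted)
print("Failed:", failed)
```

Using `.get(..., [])` keeps the helper tolerant of a body that omits either array, which is an assumption about defensive handling rather than documented API behavior.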

Authentication

This endpoint requires authentication using an API key in the `x-api-key` header.

Supported File Formats

PDF (.pdf)
Microsoft Word (.doc, .docx)
Microsoft Excel (.xls, .xlsx)
Microsoft PowerPoint (.ppt, .pptx)
Text files (.txt)
Markdown (.md)
JSON (.json)
HTML (.html, .htm)

File Size Limits

Individual file size limit: 50MB
Total request size limit: 100MB
Maximum number of files per request: 20
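
The formats and limits above can be checked on the client before a request is sent. The following is a rough sketch, not an official validator: the names `check_batch`, `SUPPORTED_EXTENSIONS`, and the `MAX_*` constants are hypothetical, and it assumes 1MB means 1024 * 1024 bytes, which the documentation does not specify.

```python
import os

# Formats and limits as listed above; 1 MB is assumed to be 1024 * 1024 bytes.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".txt", ".md", ".json", ".html", ".htm",
}
MAX_FILE_BYTES = 50 * 1024 * 1024
MAX_REQUEST_BYTES = 100 * 1024 * 1024
MAX_FILES_PER_REQUEST = 20


def check_batch(files):
    """Return a list of problems for a batch given as (file_name, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_FILES_PER_REQUEST:
        problems.append(
            f"{len(files)} files exceeds the limit of {MAX_FILES_PER_REQUEST} per request"
        )
    total_bytes = 0
    for name, size in files:
        extension = os.path.splitext(name)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported extension {extension!r}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the per-file limit")
        total_bytes += size
    if total_bytes > MAX_REQUEST_BYTES:
        problems.append(f"batch total of {total_bytes} bytes exceeds the per-request limit")
    return problems


print(check_batch([("report.pdf", 1_000_000), ("setup.exe", 2_000)]))
```

Taking `(name, size)` pairs rather than paths keeps the check independent of the file system; in practice the sizes could come from `os.path.getsize`.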

Error Responses

| Status Code | Description                                                              |
|-------------|--------------------------------------------------------------------------|
| 400         | Bad Request - No files were provided or the request format is invalid    |
| 401         | Unauthorized - Invalid or missing API key                                |
| 404         | Not Found - Vector Store not found                                       |
| 413         | Payload Too Large - Files exceed the size limit                          |
| 415         | Unsupported Media Type - File format not supported                       |
| 500         | Internal Server Error                                                    |
| 207         | Multi-Status - Some files were uploaded successfully, but others failed  |
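
One way a client might act on these status codes is sketched below. The function `interpret_status` and its outcome labels are hypothetical; only the codes documented on this page (201 and 207 for uploads, plus the error codes listed here) are mapped, and anything else is reported as unknown rather than guessed at.

```python
# Status codes and meanings taken from this page.
STATUS_DESCRIPTIONS = {
    201: "Files uploaded successfully",
    207: "Multi-Status - some files uploaded, others failed",
    400: "Bad Request - no files provided or invalid request format",
    401: "Unauthorized - invalid or missing API key",
    404: "Not Found - Vector Store not found",
    413: "Payload Too Large - files exceed the size limit",
    415: "Unsupported Media Type - file format not supported",
    500: "Internal Server Error",
}


def interpret_status(status_code):
    """Return an (outcome, description) pair for an upload response status code."""
    if status_code not in STATUS_DESCRIPTIONS:
        return "unknown", "Status code not listed in this document"
    if status_code == 201:
        outcome = "success"
    elif status_code == 207:
        outcome = "partial"  # inspect the errors array for the failed files
    else:
        outcome = "error"
    return outcome, STATUS_DESCRIPTIONS[status_code]


for code in (201, 207, 413):
    print(code, interpret_status(code))
```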

Processing Status

Files are initially accepted with a status of `processing`. You can check the processing status of documents using the Get a List of Documents in a Vector Store endpoint. Each document's status will be one of the following:

done: The document was processed successfully
error: An error occurred during processing
processing: The document is still being processed

Processing time depends on the size and complexity of the file. Typical processing time is between 30 seconds and 5 minutes per document.
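
A polling loop for these statuses could look roughly like the sketch below. Because this page does not give the URL or response shape of the list-documents endpoint, the lookup is passed in as a hypothetical `fetch_statuses` callable that returns a `{document_id: status}` mapping. The 10-second interval and 10-minute timeout are assumptions chosen to cover the typical processing times mentioned above.

```python
import time


def wait_for_documents(fetch_statuses, document_ids, poll_interval=10.0,
                       timeout=600.0, sleep=time.sleep):
    """Poll until every document is 'done' or 'error', or until the timeout expires.

    fetch_statuses is a caller-supplied function (hypothetical here) that
    returns a dict mapping document id to its current status string.
    """
    pending = set(document_ids)
    final = {}
    deadline = time.monotonic() + timeout
    while pending:
        statuses = fetch_statuses()
        for doc_id in list(pending):
            if statuses.get(doc_id) in ("done", "error"):
                final[doc_id] = statuses[doc_id]
                pending.discard(doc_id)
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Report anything unfinished as still processing
            for doc_id in pending:
                final[doc_id] = "processing"
            break
        sleep(poll_interval)
    return final


# Example with a stand-in lookup that finishes on the second poll
fake_responses = iter([{1234: "processing"}, {1234: "done"}])
print(wait_for_documents(lambda: next(fake_responses), [1234], poll_interval=0))
```

Passing the lookup as a function keeps the loop testable and avoids hard-coding an endpoint that is documented elsewhere.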

Batch Operations

To upload many files to your Vector Store efficiently, Rememberizer supports batch operations. This approach helps optimize performance when handling a large number of documents.

Batch Upload Implementation

import os
import requests
import time
from pathlib import Path

def batch_upload_to_vector_store(vector_store_id, folder_path, batch_size=5, file_types=None):
    """
    Upload all files from a folder to a Vector Store in batches
    
    Args:
        vector_store_id: ID of the vector store
        folder_path: Path to the folder containing the files to upload
        batch_size: Number of files to upload in each batch
        file_types: Optional list of file extensions to filter by (e.g. ['.pdf', '.docx'])
        
    Returns:
        List of upload results
    """
    api_key = "YOUR_API_KEY"
    headers = {"x-api-key": api_key}
    
    # Get the list of files in the folder
    files = []
    for entry in os.scandir(folder_path):
        if entry.is_file():
            file_path = Path(entry.path)
            # Filter by file extension if specified
            if file_types is None or file_path.suffix.lower() in file_types:
                files.append(file_path)
    
    print(f"Found {len(files)} files to upload")
    results = []
    
    # Process the files in batches
    for i in range(0, len(files), batch_size):
        batch = files[i:i+batch_size]
        print(f"Processing batch {i//batch_size + 1}/{(len(files) + batch_size - 1)//batch_size}: {len(batch)} files")
        
        upload_files = []
        try:
            # Open the files for this batch
            for file_path in batch:
                upload_files.append(('files', (file_path.name, open(file_path, 'rb'))))
            
            # Upload the batch
            response = requests.post(
                f"https://api.rememberizer.ai/api/v1/vector-stores/{vector_store_id}/documents/upload",
                headers=headers,
                files=upload_files
            )
            
            if response.status_code in (200, 201, 207):
                batch_result = response.json()
                results.append(batch_result)
                print(f"Batch uploaded - {len(batch_result.get('documents', []))} documents accepted")
                
                # Check for per-file errors
                if batch_result.get('errors'):
                    print(f"Errors encountered: {len(batch_result['errors'])}")
                    for error in batch_result['errors']:
                        print(f"- {error['file']}: {error['error']}")
            else:
                print(f"Batch upload failed with status code {response.status_code}: {response.text}")
                results.append({"error": f"Batch upload failed: {response.text}"})
                
        except Exception as e:
            print(f"Exception during batch upload: {str(e)}")
            results.append({"error": str(e)})
        finally:
            # Close all file handles, whether or not the upload succeeded
            for _, (_, file_obj) in upload_files:
                file_obj.close()
        
        # Rate limiting - pause between batches
        if i + batch_size < len(files):
            print("Pausing before the next batch...")
            time.sleep(2)
    
    return results

# Example usage
results = batch_upload_to_vector_store(
    'vs_abc123',
    '/path/to/documents/folder',
    batch_size=5,
    file_types=['.pdf', '.docx', '.txt']
)

/**
 * Upload files to a Vector Store in batches
 * 
 * @param {string} vectorStoreId - ID of the Vector Store
 * @param {FileList|File[]} files - Files to upload
 * @param {Object} options - Configuration options
 * @returns {Promise<Array>} - List of upload results
 */
async function batchUploadToVectorStore(vectorStoreId, files, options = {}) {
  const {
    batchSize = 5,
    delayBetweenBatches = 2000,
    onProgress = null
  } = options;
  
  const apiKey = 'YOUR_API_KEY';
  const results = [];
  const fileList = Array.from(files);
  const totalBatches = Math.ceil(fileList.length / batchSize);
  
  console.log(`Preparing to upload ${fileList.length} files in ${totalBatches} batches`);
  
  // Process the files in batches
  for (let i = 0; i < fileList.length; i += batchSize) {
    const batch = fileList.slice(i, i + batchSize);
    const batchNumber = Math.floor(i / batchSize) + 1;
    
    console.log(`Processing batch ${batchNumber}/${totalBatches}: ${batch.length} files`);
    
    if (onProgress) {
      onProgress({
        currentBatch: batchNumber,
        totalBatches: totalBatches,
        filesInBatch: batch.length,
        totalFiles: fileList.length,
        completedFiles: i
      });
    }
    
    // Create the FormData for this batch
    const formData = new FormData();
    batch.forEach(file => {
      formData.append('files', file);
    });
    
    try {
      const response = await fetch(
        `https://api.rememberizer.ai/api/v1/vector-stores/${vectorStoreId}/documents/upload`,
        {
          method: 'POST',
          headers: {
            'x-api-key': apiKey
          },
          body: formData
        }
      );
      
      if (response.ok) {
        const batchResult = await response.json();
        results.push(batchResult);
        
        console.log(`Batch uploaded - ${batchResult.documents?.length || 0} documents accepted`);
        
        // Check for per-file errors
        if (batchResult.errors && batchResult.errors.length > 0) {
          console.warn(`Errors encountered: ${batchResult.errors.length}`);
          batchResult.errors.forEach(error => {
            console.warn(`- ${error.file}: ${error.error}`);
          });
        }
      } else {
        console.error(`Batch upload failed with status ${response.status}: ${await response.text()}`);
        results.push({ error: `Batch failed with status: ${response.status}` });
      }
    } catch (error) {
      console.error(`Exception during batch upload: ${error.message}`);
      results.push({ error: error.message });
    }
    
    // Add a delay between batches to avoid rate limits
    if (i + batchSize < fileList.length) {
      console.log(`Pausing for ${delayBetweenBatches}ms before the next batch...`);
      await new Promise(resolve => setTimeout(resolve, delayBetweenBatches));
    }
  }
  
  console.log(`Upload complete. Processed ${fileList.length} files.`);
  return results;
}

// Example usage with a file input element
document.getElementById('upload-button').addEventListener('click', async () => {
  const fileInput = document.getElementById('file-input');
  const vectorStoreId = 'vs_abc123';
  
  const progressBar = document.getElementById('progress-bar');
  
  try {
    const results = await batchUploadToVectorStore(vectorStoreId, fileInput.files, {
      batchSize: 5,
      onProgress: (progress) => {
        // Update the progress UI
        const percentage = Math.round((progress.completedFiles / progress.totalFiles) * 100);
        progressBar.style.width = `${percentage}%`;
        progressBar.textContent = `${percentage}% (Batch ${progress.currentBatch}/${progress.totalBatches})`;
      }
    });
    
    console.log('Upload results:', results);
  } catch (error) {
    console.error('Upload failed:', error);
  }
});

require 'net/http'
require 'uri'
require 'json'
require 'mime/types' # provided by the mime-types gem

# Upload files to a Vector Store in batches
#
# @param vector_store_id [String] ID of the Vector Store
# @param folder_path [String] Path to the folder containing the files to upload
# @param batch_size [Integer] Number of files to upload in each batch
# @param file_types [Array<String>] Optional array of file extensions to filter by
# @param delay_between_batches [Float] Seconds to wait between batches
# @return [Array] List of upload results
def batch_upload_to_vector_store(vector_store_id, folder_path, batch_size: 5, file_types: nil, delay_between_batches: 2.0)
  api_key = 'YOUR_API_KEY'
  results = []
  
  # Get the list of files in the folder
  files = Dir.entries(folder_path)
    .select { |f| File.file?(File.join(folder_path, f)) }
    .select { |f| file_types.nil? || file_types.include?(File.extname(f).downcase) }
    .map { |f| File.join(folder_path, f) }
  
  puts "Found #{files.count} files to upload"
  total_batches = (files.count.to_f / batch_size).ceil
  
  # Process the files in batches
  files.each_slice(batch_size).with_index do |batch, batch_index|
    puts "Processing batch #{batch_index + 1}/#{total_batches}: #{batch.count} files"
    
    # Prepare the HTTP request
    uri = URI("https://api.rememberizer.ai/api/v1/vector-stores/#{vector_store_id}/documents/upload")
    request = Net::HTTP::Post.new(uri)
    request['x-api-key'] = api_key
    
    # Create a multipart boundary
    boundary = "RubyBoundary#{rand(1000000)}"
    request['Content-Type'] = "multipart/form-data; boundary=#{boundary}"
    
    # Build the request body
    body = []
    batch.each do |file_path|
      file_name = File.basename(file_path)
      mime_type = MIME::Types.type_for(file_path).first&.content_type || 'application/octet-stream'
      
      begin
        file_content = File.binread(file_path)
        
        body << "--#{boundary}\r\n"
        body << "Content-Disposition: form-data; name=\"files\"; filename=\"#{file_name}\"\r\n"
        body << "Content-Type: #{mime_type}\r\n\r\n"
        body << file_content
        body << "\r\n"
      rescue => e
        puts "Error reading file #{file_path}: #{e.message}"
      end
    end
    body << "--#{boundary}--\r\n"
    
    request.body = body.join
    
    # Send the request
    begin
      http = Net::HTTP.new(uri.host, uri.port)
      http.use_ssl = true
      response = http.request(request)
      
      if [200, 201, 207].include?(response.code.to_i)
        batch_result = JSON.parse(response.body)
        results << batch_result
        
        puts "Batch uploaded - #{batch_result['documents']&.count || 0} documents accepted"
        
        # Check for per-file errors
        if batch_result['errors'] && !batch_result['errors'].empty?
          puts "Errors encountered: #{batch_result['errors'].count}"
          batch_result['errors'].each do |error|
            puts "- #{error['file']}: #{error['error']}"
          end
        end
      else
        puts "Batch upload failed with status code #{response.code}: #{response.body}"
        results << { "error" => "Batch failed: #{response.body}" }
      end
    rescue => e
      puts "Exception during batch upload: #{e.message}"
      results << { "error" => e.message }
    end
    
    # Rate limiting - pause between batches
    if batch_index < total_batches - 1
      puts "Pausing for #{delay_between_batches} seconds before the next batch..."
      sleep(delay_between_batches)
    end
  end
  
  puts "Upload complete. Processed #{files.count} files."
  results
end

# Example usage
results = batch_upload_to_vector_store(
  'vs_abc123',
  '/path/to/documents/folder',
  batch_size: 5,
  file_types: ['.pdf', '.docx', '.txt'],
  delay_between_batches: 2.0
)

Batch Upload Best Practices

To optimize performance and reliability when uploading large volumes of files:

Manage Batch Size: Keep batches at 5-10 files for best performance. Too many files in a single request increases the risk of timeouts.
Implement Rate Limiting: Add a delay between batches (2-3 seconds recommended) to avoid hitting API rate limits.
Add Error Retry Logic: For production systems, implement retry logic with exponential backoff for failed uploads.
Validate File Types: Pre-filter files to make sure they are a supported type before attempting to upload them.
Monitor Batch Progress: For user-facing applications, provide progress feedback on batch operations.
Handle Partial Success: The API may return a 207 status code for partial success. Always check the status of each document.
Clean Up Resources: Make sure all file handles are closed properly, especially when errors occur.
Parallelize Wisely: For very large uploads (thousands of files), consider running several batch processes concurrently against different vector stores, then combining the results afterwards if needed.
Implement Integrity Checks: For critical data, verify file integrity before and after upload, for example with checksums.
Log Results Thoroughly: Keep detailed logs of all upload operations for troubleshooting.

By following these best practices, you can efficiently manage large-scale document ingestion into your vector stores.
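
The retry recommendation above could be sketched as follows. This is an illustration under stated assumptions, not the documented client behavior: `upload_with_retry` and the `send_batch` callable are hypothetical names, only 5xx responses and network-level exceptions are treated as retryable, and the 2-second base delay is borrowed from the rate-limiting advice.

```python
import time


def upload_with_retry(send_batch, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call send_batch until it succeeds, doubling the wait after each failure.

    send_batch is a caller-supplied function (hypothetical here) that performs
    one upload request and returns a (status_code, body) pair.
    """
    last_error = "no attempts made"
    for attempt in range(1, max_attempts + 1):
        try:
            status_code, body = send_batch()
            if status_code < 500:
                # Success, partial success, or a client error that a retry will not fix
                return status_code, body
            last_error = f"server returned {status_code}"
        except OSError as exc:
            # Network-level failures; the exceptions raised by requests derive from OSError
            last_error = str(exc)
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 2s, 4s, 8s, ...
    raise RuntimeError(f"upload failed after {max_attempts} attempts: {last_error}")


# Example with a stand-in sender that fails twice before succeeding
outcomes = iter([(503, {}), (503, {}), (201, {"documents": [], "errors": []})])
print(upload_with_retry(lambda: next(outcomes), sleep=lambda seconds: None))
```

Client errors such as 400 or 415 are returned immediately because resending the same request would fail the same way.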
