[LangChain] Crawling and Collecting Data

taipm, November 30, 2024

Problem: Data collection

Data collection is an important and common problem in practice, for example when you work with large language models and need to retrieve supplementary information that the current AI models do not have. It also comes up regularly when building question-answering chatbots or applying RAG-related techniques.

langchain-community

Installation:

pip install langchain_community beautifulsoup4

WebBaseLoader

Used to load the text content of web pages as documents.

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_path = "https://vnexpress.net/cam-thuoc-la-dien-tu-tu-nam-2025-4822280.html",  # trailing comma so the options below can be uncommented safely
    # header_template = None,
    # verify_ssl = True,
    # proxies = None,
    # continue_on_failure = False,
    # autoset_encoding = True,
    # encoding = None,
    # web_paths = (),
    # requests_per_second = 2,
    # default_parser = "html.parser",
    # requests_kwargs = None,
    # raise_for_status = False,
    # bs_get_text_kwargs = None,
    # bs_kwargs = None,
    # session = None,
    # show_progress = True,
)

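docs = []
docs_lazy = loader.lazy_load()  # lazy iterator: documents are produced as you iterate

# Async variant: alazy_load() is consumed with `async for`, not `await`:
# async for doc in loader.alazy_load():
#     docs.append(doc)

for doc in docs_lazy:
    print(doc.page_content)
    docs.append(doc)

# Preview the first 100 characters and the metadata of the first document
print(docs[0].page_content[:100])
print(docs[0].metadata)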
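
The difference between the lazy and async variants is easier to see without a network call. The following is a minimal sketch, not LangChain code: `FakeLoader` and `collect_async` are hypothetical names made up for illustration, and the class only assumes the same two method names used in the snippet above (`lazy_load` and `alazy_load`). It shows that a lazy loader is consumed with a plain `for` loop, while the async variant is an async generator consumed with `async for` inside a coroutine.

```python
import asyncio
from typing import AsyncIterator, Iterator, List


class FakeLoader:
    """Hypothetical stand-in for a document loader; it performs no network I/O."""

    def __init__(self, pages: List[str]) -> None:
        self.pages = pages

    def lazy_load(self) -> Iterator[str]:
        # Generator: each page is produced only when the caller asks for it.
        for page in self.pages:
            yield page

    async def alazy_load(self) -> AsyncIterator[str]:
        # Async generator: consumed with `async for`, not with `await`.
        for page in self.pages:
            await asyncio.sleep(0)  # placeholder for an awaited network request
            yield page


async def collect_async(loader: FakeLoader) -> List[str]:
    # Gather everything the async generator yields into a list.
    docs = []
    async for doc in loader.alazy_load():
        docs.append(doc)
    return docs


loader = FakeLoader(["page one", "page two"])
sync_docs = list(loader.lazy_load())
async_docs = asyncio.run(collect_async(loader))
print(sync_docs == async_docs)
```

In a plain script, the coroutine has to be started with `asyncio.run(...)` as above; in a notebook that already runs an event loop, `await collect_async(loader)` would be the usual form.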

Last updated on December 4, 2024

Founder and operator of MicroAI.Club, MicroTrade.Club, ... Programming is a hobby to kill spare time ...
