- 作者帖子
崇鹂游客网站:https://sjw.history.go.kr/main.do
韩国有《朝鲜王朝实录》《承政院日记》两种重要的历史资料,记录了涉及中国的一些历史事实、野史传闻。
民国辈学者吴晗《朝鲜李朝实录中的中国史料》作了初步工作,但里面或许有漏抄,有误抄。现在由于韩国有关网站提供了图片和可复制的文本(或许较为精确),我们可以进一步做这个工作。
所以我让AI写了一个抓取《承政院日记》工具:

但是有几个问题:
1、不知道为什么抓取得很慢,似乎是因为AI不是写的多线程。经测试网站也有一定的防抓性,毕竟算一点外网。
2、我希望他的表述是:公元纪年+朝鲜王号+中国年号+月+日,但AI一直没写好。
例如:
时间表述问题,朝鲜王号纪年应该是韩国人设计网站时加上去的,我看原书没有用朝鲜的年号,如果按原书应该是明代年号+月+日+干支。具体他们这个月日历法是不是跟中国的一样我也不清楚,但还是按原书比较好。
原书:

抓取网站后输出的txt文本:
1623년 인조1년 3월 13일(问题处)
==================================================【座目】
都承旨 李德泂。左承旨 兪晉曾。右承旨 鄭岦 。左副承旨。右副承旨 權盡己。同副承旨 閔聖徵。注書 崔夢亮。假注書。事變假注書。----------------------------------------
【慶運宮에 머묾。慈殿이 國寶를 전하고 光海君을 藥房에 두고 世子를 폐함.】
○ 上在慶運宮。上命大將李貴, 都承旨李德泂, 同副承旨閔聖徵等, 備儀仗往請奉迎。李貴等, 詣慶運宮陳啓事狀, 屢請奉往, 大妃不許。上乃親詣慶運宮, 有司進輦設儀衛, 上命徹去, 請乘轎亦不從, 乘馬而行, 舁光海以隨。上至慶運宮下馬, 步入西廳門外, 再拜痛哭, 侍衛將士及侍臣, 皆痛哭。上仍俯伏待罪, 慈殿下敎曰, 綾陽君, 宗子也, 入承大統, 宜矣。克成莫大之功, 有何待罪之事? 上對曰, 搶攘之中, 事多未遑, 今始來詣, 不勝惶恐。慈殿命納傳國寶及啓字。李貴奏曰, 慈殿當出御正殿, 招大臣傳寶, 何必徑入國寶, 以致人疑乎? 慈殿屢促之, 上命左議政朴弘耉, 奉入國寶, 而久無成命。上伏地良久, 至于夜深。上謂群臣曰, 予欲退家待罪, 群臣力諫止之。俄而慈殿, 命引見嗣君, 上進入內庭, 諸將皆從。慈殿, 設先王虛座, 上再拜痛哭, 侍臣皆哭。慈殿御寢殿垂簾, 置御寶於床, 引上入。上俯伏而哭, 慈殿曰, 勿哭。宗社大慶, 何用哭爲? 上避席而拜曰, 大事未定, 日暮始來, 臣罪萬死。慈殿曰, 勿辭。有何罪? 予以薄命, 不幸遭人倫之變, 逆魁逞憾先王, 以我爲讎, 屠戮我父母, 魚肉我宗族, 剝殺我孺子, 幽囚我別宮, 寡身久處深宮, 人間消息, 邈不聞知, 不意今日, 乃見是事。又謂群臣曰, 逆魁於先王, 實是仇讎, 朝廷之上, 奸臣布列, 加予以大惡之名, 而拘囚十餘年, 疇昔之夢, 先王, 語予以此事, 賴卿等復明人倫, 得見今日, 卿等之功, 何可勝言? 群臣請速傳寶, 慈殿曰, 未亡人, 得至今日, 實是上帝之靈貺, 嗣君, 可拜謝上帝。閔聖徵曰, 爲此擧措, 殊極未安, 不敢承命。傳寶然後, 嗣君當出外討捕凶黨, 以定人心。群臣皆曰, 嗣君卽位之後, 當告宗廟, 傳寶甚急。慈殿曰, 傳寶大事, 不可草草行禮, 明日當於西廳, 備禮行之。且無天朝之命, 何以正位? 宜權署國事。都承旨李德泂請對啓曰, 國家危亂, 幾至於亡, 嗣君爲宗社大計, 躬擐甲冑, 擧此大事。人心已歸, 天命已定, 而傳寶之事, 夜深不決, 何也? 若不速傳國寶, 以正位號, 何以鎭定? 請亟傳國寶, 以答臣民之望。慈殿曰, 受寶有節次, 何以暮夜, 急迫傳授乎? 卿等之言如此, 須與大臣相議, 當以何寶傳之乎? 大臣曰, 以昭信寶·受命寶, 傳之宜當, 諭書寶, 亦可傳授。上曰, 臣無才德, 不敢當。慈殿曰, 王室至親, 臣民愛戴, 非德而何? 嗣君自此, 可爲聖主, 實宗社之洪福也。乃命承傳色金天霖等, 奉御寶跪傳於上, 上拜受。侍臣啓曰, 旣已傳寶, 宜亟出御正殿, 以正大位。慈殿曰, 初欲備禮, 從容傳授, 卿等之言, 不可違, 故如是行之矣。仍謂上曰, 逆魁之罪, 知之乎? 惟我德薄, 不能盡母子之道, 使倫紀斁滅, 國家幾亡, 賴嗣君孝, 上安宗社, 下雪讎怨, 感激何極? 又謂諸臣曰, 逆魁父子, 今置何處乎? 對曰, 皆在闕下矣。慈殿曰, 不共戴天之讎, 忍之已久, 願親斫渠父子之頭, 以祭亡靈。幽囚十餘年, 至今不死者, 蓋待今日耳, 願得甘心焉。諸臣啓曰, 自古廢黜之君, 臣子不敢以刑戮擬議, 無道之主, 莫如桀·紂, 而湯·武放之。今此下敎, 臣等所不忍聞也。德泂曰, 慈聖之於廢君, 天倫已定, 子雖不孝, 母不可以不慈也。此下敎, 非徒不忍聞, 亦不敢奉承。慈殿曰, 吾與嗣君, 同御正殿, 則當雪吾讎, 今嗣君卽位, 能體我心, 爲吾復讎, 則可謂孝矣。上曰, 百官在, 臣何敢擅也? 慈殿曰, 嗣君年已壯長, 豈受百官指揮? 德泂曰, 嗣君入內, 夜將朝矣, 尙未卽位, 將士·軍民, 皆有悶鬱之心, 請速出外。慈殿曰, 父母之讎, 不共天, 兄弟之讎, 不同國。逆魁, 自絶母子之道, 於我有必報之讎, 無可赦之道矣。德泂曰, 昔中廟反正, 優待廢王, 以終天年, 此可法也。慈殿曰, 卿言誠是。德泂曰, 姦逆之徒, 散處外間, 不無意外之變, 請速卽位頒敎, 及時討捕, 鎭撫群情。慈殿曰, 別堂, 乃先王視事之所, 已令宮人灑掃矣。上起拜出, 卽位于別堂, 仍視事達曙。侍臣及將士, 帶劍宿衛。置光海于藥房, 廢世子于都摠府, 以兵守之, 令司饔院供之。罷營建·儺禮·火器等十二都監, 開義禁府·典獄署, 悉放罪人。時, 爾瞻之徒, 多逃竄, 遣軍人搜捕。又多有冀免其罪, 爭先投謁者, 皆縛而拘之。下諭于都元帥韓浚謙, 誅平安監司朴燁, 義州府尹鄭遵于境上。又命誅諸道調度使金純·池應鯤·金忠輔·王明恢·權忠男·李文賓等, 濟州牧使梁濩, 亦命拿來誅之。已上因傳敎考出實錄实际输出到txt的日期:1623년,인조1년,3월 13일(1623年,朝鲜成祖1年3月13日)
我希望展示的日期:1623년,인조1년,天启3년3월 13일,계묘(1623年,朝鲜成祖1年,天启3年3月13日癸卯)更接近原书。
崇鹂游客import os import re import time import requests from bs4 import BeautifulSoup import tkinter as tk from tkinter import ttk, scrolledtext, messagebox from threading import Thread from urllib.parse import urljoin # ==================== 配置 ==================== BASE_URL = "https://sjw.history.go.kr" HEADERS = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" } SESSION = requests.Session() SESSION.headers.update(HEADERS) OUTPUT_DIR = "sjw_output" REQUEST_TIMEOUT = 15 MAX_RETRIES = 3 # ==================== 工具函数 ==================== def safe_filename(text): return re.sub(r'[\\/*?:"<>|]', "_", text) def ensure_dir(path): if not os.path.exists(path): os.makedirs(path) def clean_html(html): if not html: return "" soup = BeautifulSoup(html, "html.parser") text = soup.get_text() text = re.sub(r'\s+', ' ', text).strip() return text def safe_request(method, url, **kwargs): for attempt in range(1, MAX_RETRIES + 1): try: kwargs.setdefault('timeout', REQUEST_TIMEOUT) resp = SESSION.request(method, url, **kwargs) resp.encoding = 'utf-8' return resp except Exception as e: if attempt == MAX_RETRIES: raise e time.sleep(2) # ==================== 核心抓取函数 ==================== def get_dynasty_list(): url = urljoin(BASE_URL, "/main.do") resp = safe_request('GET', url) soup = BeautifulSoup(resp.text, "html.parser") items = soup.select(".view-list .item a") dynasties = [] for a in items: href = a.get("href", "") match = re.search(r"searchMonthList\('([^']+)',\s*'([^']+)'", href) if match: code = match.group(1) name = match.group(2) dynasties.append((code, name)) return dynasties def get_year_month_list(dynasty_code, dynasty_name): url = urljoin(BASE_URL, "/search/inspectionMonthList.do") data = { "treeID": dynasty_code, "treeLevel": "1", "treeType": "왕대별", "treeKingName": dynasty_name } resp = safe_request('POST', url, data=data) soup = BeautifulSoup(resp.text, "html.parser") year_blocks = soup.select(".month-box") result = [] for block in year_blocks: h4 = block.select_one(".month-title h4") if not h4: continue reign_text = h4.get_text(strip=True) reign_year_match = re.search(r'(\d+)년', reign_text) reign_year = reign_year_match.group(1) if reign_year_match else "0" span = block.select_one(".month-title span") if span: western_text = span.get_text(strip=True) western_match = re.search(r'(\d+)년', western_text) western_year = western_match.group(1) if western_match else reign_year else: western_year = reign_year month_links = block.select(".month-list a") for a in month_links: href = a.get("href", "") match = re.search(r"searchDayList\('([^']+)'", href) if match: month_id = match.group(1) month_text = a.get_text(strip=True) is_leap = "윤" in month_text month_num_match = re.search(r'(\d+)', month_text) month_num = month_num_match.group(1) if month_num_match else "0" result.append((reign_year, western_year, month_id, month_num, is_leap)) return result def get_day_list(month_id, dynasty_name): url = urljoin(BASE_URL, "/search/inspectionDayList.do") data = { "treeID": month_id, "treeLevel": "3", "treeType": "왕대별", "treeKingName": dynasty_name } resp = safe_request('POST', url, data=data) soup = BeautifulSoup(resp.text, "html.parser") day_buttons = soup.select(".day-select .day-select-list .item") day_items = [] for btn in day_buttons: onclick = btn.get("onclick", "") match = re.search(r"searchDayList\('([^']+)'", onclick) if match: day_id = match.group(1) btn_text = btn.get_text(strip=True) day_num_match = re.search(r'(\d+)일', btn_text) day_num = int(day_num_match.group(1)) if day_num_match else 0 if day_num != 0: day_items.append((day_id, day_num)) return day_items def extract_date_metadata(soup): """ 从日期页面提取干支日和年号信息 返回 (ganji, era_text) ganji: 如 '임인' era_text: 如 '天啓(明/熹宗) 3년' (保留原始括号和空格) """ date_elem = soup.select_one(".title-head .title .date") if not date_elem: return "", "" date_text = date_elem.get_text(strip=True) # 示例: "승정원일기 1책 (탈초본 1책) 인조 1년 3월 12일 임인 1/2 기사 1623년 天啓(明/熹宗) 3년" ganji_match = re.search(r'일\s+([가-힣]{2})', date_text) ganji = ganji_match.group(1) if ganji_match else "" parts = date_text.split() if len(parts) >= 2: last = parts[-1] # "3년" second_last = parts[-2] # "天啓(明/熹宗)" # 保留原始括号和空格 era_text = f"{second_last} {last}" else: era_text = "" return ganji, era_text def get_day_articles_and_jwakmok(day_id): url = urljoin(BASE_URL, f"/id/{day_id}") resp = safe_request('GET', url) soup = BeautifulSoup(resp.text, "html.parser") ganji, era_text = extract_date_metadata(soup) jwakmok = "" level5 = soup.find("level5", id=lambda x: x and x.endswith("-00000")) if level5: jwakmok = clean_html(str(level5)) articles = [] links = soup.select(".day-cont .list .item a") seen = set() for a in links: href = a.get("href", "") match = re.search(r"searchView\('([^']+)'", href) if match: article_id = match.group(1) if article_id in seen: continue seen.add(article_id) title_tag = soup.find("p", id=f"TITLE_{article_id}") title = title_tag.get_text(strip=True) if title_tag else "" articles.append((article_id, title)) return ganji, era_text, jwakmok, articles def get_article_content(article_id): url = urljoin(BASE_URL, f"/id/{article_id}") resp = safe_request('GET', url) soup = BeautifulSoup(resp.text, "html.parser") title_tag = soup.find("h3", id=lambda x: x and x.startswith("TITLE_")) title = title_tag.get_text(strip=True) if title_tag else "" view_text = soup.select_one(".view-text") content = clean_html(str(view_text)) if view_text else "" return title, content # ==================== 抓取任务 ==================== class CrawlerThread(Thread): def __init__(self, dynasty_code, dynasty_name, start_year, end_year, selected_months, start_day, end_day, delay, log_callback, finish_callback): super().__init__() self.dynasty_code = dynasty_code self.dynasty_name = dynasty_name self.start_year = start_year self.end_year = end_year self.selected_months = selected_months self.start_day = start_day self.end_day = end_day self.delay = delay self.log = log_callback self.finish = finish_callback self.stop_flag = False self.processed_days = set() self.processed_articles = set() def run(self): try: self.log(f"开始抓取 {self.dynasty_name} 数据...") all_months = get_year_month_list(self.dynasty_code, self.dynasty_name) for reign_year, western_year, month_id, month_num, is_leap in all_months: if self.stop_flag: break try: current_western = int(western_year) except: continue if current_western < self.start_year or current_western > self.end_year: continue if self.selected_months and int(month_num) not in self.selected_months: continue month_str = f"{'윤' if is_leap else ''}{month_num}월" self.log(f"处理 {self.dynasty_name} {reign_year}년 ({western_year}년) {month_str}") day_items = get_day_list(month_id, self.dynasty_name) for day_id, day_num in day_items: if self.stop_flag: break if day_num < self.start_day or day_num > self.end_day: continue if day_id in self.processed_days: self.log(f" 跳过重复日期 {day_id}") continue self.processed_days.add(day_id) self.log(f" 处理日期 {day_num}일 (ID: {day_id})") try: ganji, era_text, jwakmok, articles = get_day_articles_and_jwakmok(day_id) except Exception as e: self.log(f" 获取日期文章失败: {str(e)}") continue if not jwakmok and not articles: self.log(f" 该日期无内容") continue month_padded = f"{int(month_num):02d}" day_padded = f"{day_num:02d}" out_dir = os.path.join(OUTPUT_DIR, self.dynasty_name, western_year, month_padded) ensure_dir(out_dir) out_file = os.path.join(out_dir, f"{day_padded}.txt") with open(out_file, "w", encoding="utf-8") as f: # 构建单行头部,各部分空格分隔 parts = [f"{western_year}년", f"{self.dynasty_name}{reign_year}년"] if era_text: parts.append(era_text) parts.append(month_str) parts.append(f"{day_num}일") if ganji: parts.append(ganji) header_line = " ".join(parts) f.write(header_line + "\n") f.write("=" * 50 + "\n\n") if jwakmok: # 移除开头的"座目"及其后可能的空白 jwakmok = re.sub(r'^\s*座目\s*', '', jwakmok, flags=re.UNICODE) f.write("【座目】\n") f.write(jwakmok + "\n\n") f.write("-" * 40 + "\n\n") for article_id, article_title in articles: if self.stop_flag: break if article_id in self.processed_articles: self.log(f" 跳过重复文章 {article_id}") continue self.processed_articles.add(article_id) self.log(f" 抓取文章: {article_title[:30]}...") try: title, content = get_article_content(article_id) except Exception as e: self.log(f" 抓取文章失败: {str(e)}") continue f.write(f"【{title}】\n") f.write(content + "\n\n") f.write("-" * 40 + "\n\n") time.sleep(self.delay) self.log(f" 已保存到 {out_file}") time.sleep(self.delay * 1.5) self.log("抓取完成!") except Exception as e: self.log(f"严重错误: {str(e)}") finally: self.finish() # ==================== GUI界面 ==================== class CrawlerApp: def __init__(self, root): self.root = root self.root.title("승정원일기 抓取工具 v2.4 (日期单行格式)") self.root.geometry("700x600") self.dynasties = [] self.crawler_thread = None self.info_thread = None self.create_widgets() self.load_dynasties() def create_widgets(self): # 第一行:朝代选择 tk.Label(self.root, text="朝代:").grid(row=0, column=0, sticky="w", padx=5, pady=5) self.dynasty_combo = ttk.Combobox(self.root, state="readonly", width=20) self.dynasty_combo.grid(row=0, column=1, sticky="w", padx=5, pady=5) self.dynasty_combo.bind("<<ComboboxSelected>>", self.on_dynasty_selected) # 加载信息按钮 self.info_btn = tk.Button(self.root, text="加载朝代年份信息", command=self.load_dynasty_info, bg="lightgray") self.info_btn.grid(row=0, column=2, padx=5, pady=5) self.info_label = tk.Label(self.root, text="", fg="blue") self.info_label.grid(row=0, column=3, padx=5, pady=5, sticky="w") # 年份范围 tk.Label(self.root, text="公元年份范围:").grid(row=1, column=0, sticky="w", padx=5, pady=5) frame_year = tk.Frame(self.root) frame_year.grid(row=1, column=1, sticky="w", padx=5, pady=5) self.start_year = tk.Entry(frame_year, width=6) self.start_year.pack(side="left") tk.Label(frame_year, text=" - ").pack(side="left") self.end_year = tk.Entry(frame_year, width=6) self.end_year.pack(side="left") # 月份选择 tk.Label(self.root, text="月份 (留空则全部):").grid(row=2, column=0, sticky="w", padx=5, pady=5) self.month_vars = [] month_frame = tk.Frame(self.root) month_frame.grid(row=2, column=1, columnspan=2, sticky="w", padx=5, pady=5) for i in range(1, 13): var = tk.BooleanVar() self.month_vars.append(var) cb = tk.Checkbutton(month_frame, text=f"{i}월", variable=var) cb.pack(side="left") # 日期范围 tk.Label(self.root, text="日期范围:").grid(row=3, column=0, sticky="w", padx=5, pady=5) frame_day = tk.Frame(self.root) frame_day.grid(row=3, column=1, sticky="w", padx=5, pady=5) self.start_day = tk.Entry(frame_day, width=4) self.start_day.insert(0, "1") self.start_day.pack(side="left") tk.Label(frame_day, text=" - ").pack(side="left") self.end_day = tk.Entry(frame_day, width=4) self.end_day.insert(0, "31") self.end_day.pack(side="left") # 请求延迟 tk.Label(self.root, text="请求延迟(秒):").grid(row=4, column=0, sticky="w", padx=5, pady=5) self.delay_entry = tk.Entry(self.root, width=6) self.delay_entry.insert(0, "0.2") self.delay_entry.grid(row=4, column=1, sticky="w", padx=5, pady=5) # 开始抓取按钮 self.start_btn = tk.Button(self.root, text="开始抓取", command=self.start_crawl, bg="lightblue") self.start_btn.grid(row=5, column=0, columnspan=2, pady=10) # 日志框 self.log_text = scrolledtext.ScrolledText(self.root, height=20, state="normal") self.log_text.grid(row=6, column=0, columnspan=4, padx=5, pady=5, sticky="nsew") self.root.grid_rowconfigure(6, weight=1) self.root.grid_columnconfigure(3, weight=1) def load_dynasties(self): try: self.dynasties = get_dynasty_list() names = [name for _, name in self.dynasties] self.dynasty_combo['values'] = names if names: self.dynasty_combo.current(0) except Exception as e: messagebox.showerror("错误", f"无法获取朝代列表: {str(e)}") def on_dynasty_selected(self, event): self.info_label.config(text="") def load_dynasty_info(self): dynasty_name = self.dynasty_combo.get() if not dynasty_name: messagebox.showerror("错误", "请先选择朝代") return dynasty_code = next((code for code, name in self.dynasties if name == dynasty_name), None) if not dynasty_code: return self.info_btn.config(state="disabled") self.info_label.config(text="正在加载...", fg="blue") self.info_thread = Thread(target=self._fetch_dynasty_info, args=(dynasty_code, dynasty_name)) self.info_thread.daemon = True self.info_thread.start() def _fetch_dynasty_info(self, dynasty_code, dynasty_name): try: all_months = get_year_month_list(dynasty_code, dynasty_name) years = sorted(set(int(w) for _, w, _, _, _ in all_months if w.isdigit())) if not years: self.root.after(0, lambda: self.info_label.config(text="无年份数据", fg="red")) return min_year = min(years) max_year = max(years) existing_months = set(int(m) for _, _, _, m, _ in all_months if m.isdigit()) self.root.after(0, lambda: self._update_info_ui(min_year, max_year, existing_months)) except Exception as e: self.root.after(0, lambda: self.info_label.config(text=f"加载失败: {str(e)}", fg="red")) finally: self.root.after(0, lambda: self.info_btn.config(state="normal")) def _update_info_ui(self, min_year, max_year, existing_months): self.start_year.delete(0, tk.END) self.start_year.insert(0, str(min_year)) self.end_year.delete(0, tk.END) self.end_year.insert(0, str(max_year)) for i, var in enumerate(self.month_vars): month_num = i + 1 var.set(month_num in existing_months) self.info_label.config(text=f"共 {len(existing_months)} 个月份 (年份 {min_year}-{max_year})", fg="green") def log(self, msg): self.log_text.insert(tk.END, msg + "\n") self.log_text.see(tk.END) self.root.update() def start_crawl(self): if self.crawler_thread and self.crawler_thread.is_alive(): messagebox.showwarning("警告", "已有抓取任务正在进行") return dynasty_name = self.dynasty_combo.get() if not dynasty_name: messagebox.showerror("错误", "请选择朝代") return dynasty_code = next((code for code, name in self.dynasties if name == dynasty_name), None) if not dynasty_code: return try: start_year = int(self.start_year.get()) end_year = int(self.end_year.get()) if start_year > end_year: messagebox.showerror("错误", "起始年份不能大于结束年份") return except ValueError: messagebox.showerror("错误", "年份必须为整数") return selected_months = [i+1 for i, var in enumerate(self.month_vars) if var.get()] try: start_day = int(self.start_day.get()) end_day = int(self.end_day.get()) if start_day < 1 or end_day > 31 or start_day > end_day: messagebox.showerror("错误", "日期范围无效 (1-31)") return except ValueError: messagebox.showerror("错误", "日期必须为整数") return try: delay = float(self.delay_entry.get()) if delay < 0: delay = 0.2 except: delay = 0.2 self.log_text.delete(1.0, tk.END) self.log("初始化抓取...") self.crawler_thread = CrawlerThread( dynasty_code, dynasty_name, start_year, end_year, selected_months, start_day, end_day, delay, self.log, self.crawl_finished ) self.crawler_thread.start() self.start_btn.config(state="disabled") def crawl_finished(self): self.start_btn.config(state="normal") self.log("任务结束。") # ==================== 主程序 ==================== if __name__ == "__main__": root = tk.Tk() app = CrawlerApp(root) root.mainloop()
书格AI参与者您好!
关于您在抓取《承政院日记》工具时遇到的问题,我将针对您的描述提供一些建议和回应:
1. 抓取速度慢的问题:您提到网站可能存在一定的防抓性,并且AI生成的代码并非多线程。这确实是导致抓取速度慢的常见原因。要解决这个问题,您可以尝试以下方法:
* 增加线程或进程:如果您的AI工具支持,可以尝试修改代码,使其能够并发地抓取多个页面。这通常需要使用多线程或多进程库,例如Python中的`threading`或`multiprocessing`。
* 调整抓取间隔:在每次请求之间增加一个随机的延迟时间,这有助于模拟人类浏览行为,减少被网站识别为爬虫的可能性。
* 使用代理IP:如果网站对IP访问频率有限制,使用代理IP池可以有效地分散请求,避免IP被封禁。
* 检查网络状况:确保您的网络连接稳定,带宽充足。2. 日期格式化问题:您希望将日期表述为“公元纪年+朝鲜王号+中国年号+月+日+干支”,并更接近原书的格式。从您提供的输出示例来看,目前缺少中国年号和干支信息。解决这个问题需要以下步骤:
* 获取中国年号:您需要找到一个可靠的数据源或算法,能够根据朝鲜的国王纪年和日期,对应查询到当时中国的年号。例如,在“1623년 인조1년 3월 13일”这个日期,需要查询到对应的明朝年号(天启三年)。
* 获取干支信息:同样,需要一个方法能够根据公元纪年或农历日期,计算出对应的干支。
* 修改文本解析逻辑:您的AI工具需要能够识别并提取网页中的日期信息,然后将这些信息与您查询到的中国年号和干支信息进行组合,按照您期望的格式进行输出。这可能涉及到正则表达式、字符串处理或日期时间库的使用。
* 比对原书格式:为了确保抓取到的日期格式与原书保持一致,建议您手动检查一些抓取结果,并与原书进行比对,以便调整解析和格式化的逻辑。希望这些建议能帮助您改进抓取工具。
(以上内容由Gemini AI自动答复,仅供参考!)
崇鹂游客速度问题,就是假设每天都有日记,一年365篇,每个王在位20年,一个王大概就有六七千篇(txt文件),而《承政院日记》涉及朝鲜12个王,这么多文件,如何快速下载呢?或者细水长流?
我观察到外网的图书馆,就算没有防抓,似乎晚上就会变慢乃至完全不畅,而内网没有变化,不确定是就我家这样,还是运营商限制。
目前的让AI写的工具,普适性很差,暂停、接续功能都不具备,只是可以通过年号时间限定下载的区间,但也比较容易犯错误,不够傻瓜式易用。所以发帖咨询一下各位大佬
- 作者帖子
正在查看 4 个帖子:1-4 (共 4 个帖子)
正在查看 4 个帖子:1-4 (共 4 个帖子)
正在查看 4 个帖子:1-4 (共 4 个帖子)



