AI News Hub

🔗 Build a Link Extractor & Broken Link Checker (Python + PySide6)

DEV Community
Mate Technologies

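In this tutorial, we'll build a desktop app that:

✅ Extracts links from files (.txt, .pdf, .html)

📦 Step 1: Install Dependencies

First, install the required packages:

```bash
pip install PySide6 requests PyPDF2
```

🧠 Step 2: Import Required Libraries

We start by importing everything we need:

```python
import os
import sys
import re
import requests
import time
import platform
import subprocess

from PySide6.QtWidgets import *
from PySide6.QtCore import Qt, QThread, Signal, QTimer
from PySide6.QtGui import QColor, QIcon, QGuiApplication
import PyPDF2
```

💡 Explanation: PySide6 provides the GUI, requests checks whether each link responds, PyPDF2 reads text out of PDF files, and re finds the URLs in that text.

Step 3: Create the Worker Thread

We use a thread so the UI doesn't freeze while scanning.

```python
class LinkWorker(QThread):
    found = Signal(str, bool)   # (url, is_broken)
    progress = Signal(int)      # value for the progress bar
    finished = Signal()
```

💡 Why? GUI apps must stay responsive, so heavy work runs in a thread.

Step 3.1: Initialize Worker

```python
    def __init__(self, folder, file_types, check_broken,
                 include_words=None, exclude_words=None):
        super().__init__()
        self.folder = folder
        self.file_types = file_types
        self.check_broken = check_broken
        self.include_words = include_words or []
        self.exclude_words = exclude_words or []
        self.seen_links = set()
        self._running = True
```

💡 Features: the worker stores the folder, the enabled file types, and the optional include and exclude words. seen_links is a set, so each link can be recorded once, and _running is a flag the scan loop can check to stop early.

Step 3.2: Collect Files

```python
    def run(self):
        all_files = []
        for root, _, files in os.walk(self.folder):
            for f in files:
                ext = os.path.splitext(f)[1].lower()
                if (ext == '.txt' and self.file_types['txt']) or \
                   (ext == '.pdf' and self.file_types['pdf']) or \
                   (ext in ['.html', '.htm'] and self.file_types['html']):
                    all_files.append(os.path.join(root, f))
```

💡 What happens: os.walk visits every subfolder, and each file whose extension matches an enabled type is added to all_files.

Step 3.3: Extract Links

```python
urls = re.findall(r'https?://[^\s"\'>]+', text)
```

💡 Regex explained: https?:// matches http:// or https://, and [^\s"\'>]+ then consumes characters up to the next whitespace, quote, or >.

For PDF files, the text comes from PyPDF2:

```python
reader = PyPDF2.PdfReader(f)
text = ""
for page in reader.pages:
    # extract_text() can come back empty for image-only pages
    text += (page.extract_text() or "") + "\n"
```

🎯 Step 3.4: Apply Filters

```python
if self.include_words and not any(w in url for w in self.include_words):
    continue
if self.exclude_words and any(w in url for w in self.exclude_words):
    continue
```

💡 Example: with the include word "github", only URLs containing "github" are kept; with the exclude word "ads", any URL containing "ads" is skipped.

Step 3.5: Check for Broken Links

```python
    def check_link(self, url):
        """Return True if the link is broken."""
        try:
            res = requests.get(url, timeout=10)
            return not (200 <= res.status_code < 400)
        except requests.RequestException:
            # Timeouts, DNS failures, and refused connections count as broken
            return True
```

💡 Logic: a status code from 200 to 399 counts as working. Any other status, a timeout, or a connection error marks the link as broken.

Step 4: Build the Main Window

Create the main window:

```python
class LinkApp(QWidget):
    def __init__(self):
        super().__init__()
        self.setWindowTitle("LinkGuardian")
        self.setMinimumSize(1000, 600)
```

📁 Step 4.1: Folder Selection

```python
self.path_input = QLineEdit()
self.path_input.setReadOnly(True)

browse_btn = QPushButton("Browse")
browse_btn.clicked.connect(self.browse_folder)
```

```python
    def browse_folder(self):
        folder = QFileDialog.getExistingDirectory(self)
        if folder:
            self.path_input.setText(folder)
            self.folder = folder
```

⚙️ Step 4.2: Options (Checkboxes)

```python
self.txt_checkbox = QCheckBox(".txt")
self.pdf_checkbox = QCheckBox(".pdf")
self.html_checkbox = QCheckBox(".html")
self.check_broken_checkbox = QCheckBox("Check Broken Links")
```

Step 4.3: Filters

```python
self.include_input = QLineEdit()
self.include_input.setPlaceholderText("Include words")

self.exclude_input = QLineEdit()
self.exclude_input.setPlaceholderText("Exclude words")
```

▶️ Step 4.4: Start Scan

```python
    def start_scan(self):
        def words(text):
            # "".split(",") gives [""], and "" is contained in every URL,
            # so drop empty entries before they reach the filters
            return [w.strip() for w in text.split(",") if w.strip()]

        self.worker = LinkWorker(
            self.folder,
            {
                'txt': self.txt_checkbox.isChecked(),
                'pdf': self.pdf_checkbox.isChecked(),
                'html': self.html_checkbox.isChecked()
            },
            self.check_broken_checkbox.isChecked(),
            words(self.include_input.text()),
            words(self.exclude_input.text())
        )
        self.worker.found.connect(self.add_link)
        self.worker.start()
```

Empty entries are dropped from both word lists because an empty string is contained in every URL, so a blank exclude box would otherwise filter out every link. Note that self.folder is only set in browse_folder, so a folder must be chosen before the scan starts.

🎨 Step 5: Display Results

```python
    def add_link(self, link, is_broken):
        item = QListWidgetItem(link)
        color = QColor("red") if is_broken else QColor("green")
        item.setForeground(color)
        self.results_list.addItem(item)
```

💡 Result: broken links appear in red and working links in green.

Step 6: Progress Bar

```python
self.progress_bar = QProgressBar()
self.progress_bar.setMaximum(100)
```

Update it from the worker:

```python
self.worker.progress.connect(self.progress_bar.setValue)
```

Step 7: Copy All Links

```python
    def copy_all_links(self):
        links = "\n".join(
            self.results_list.item(i).text()
            for i in range(self.results_list.count())
        )
        QGuiApplication.clipboard().setText(links)
```

🌍 Step 8: Open Links on Double Click

```python
    def open_item(self, item):
        url = item.text()
        system = platform.system()
        if system == "Windows":
            os.startfile(url)
        elif system == "Darwin":  # macOS has no xdg-open
            subprocess.Popen(["open", url])
        else:
            subprocess.Popen(["xdg-open", url])
```

🚀 Step 9: Run the App

```python
if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = LinkApp()
    window.show()
    sys.exit(app.exec())
```

🎉 Final Result

You now have a professional desktop tool that:

✔ Extracts links from files

💡 Bonus Ideas

Want to upgrade it further?

Export results to CSV
Add domain grouping
Add link preview
Add multi-threaded link checking (faster 🚀)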
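
Two of the bonus ideas, multi-threaded link checking and CSV export, can be sketched with the standard library alone. This is a minimal sketch, not part of the original app: `check_links`, `write_csv`, the `max_workers` default, and the `url,status` column layout are all hypothetical names and choices made for illustration. The checker is passed in as a plain callable that returns True for a broken link, which is the same convention `check_link` uses.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def check_links(urls, checker, max_workers=8):
    """Check URLs in parallel and return (url, is_broken) pairs.

    `checker` is any callable that takes a URL and returns True when the
    link is broken. Both function names here are hypothetical.
    """
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    if not unique:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order even though calls overlap
        flags = list(pool.map(checker, unique))
    return list(zip(unique, flags))


def write_csv(results, out):
    """Write (url, is_broken) pairs to an open text file object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["url", "status"])  # assumed column layout
    for url, broken in results:
        writer.writerow([url, "broken" if broken else "ok"])


if __name__ == "__main__":
    # Stand-in checker so the demo runs offline; a real run would pass
    # a function that performs the HTTP request.
    def fake_checker(url):
        return "dead" in url

    results = check_links(
        ["https://example.com/a", "https://example.com/dead", "https://example.com/a"],
        fake_checker,
    )
    buffer = io.StringIO()
    write_csv(results, buffer)
    print(buffer.getvalue())
```

Inside `LinkWorker.run`, something like `for url, broken in check_links(urls, self.check_link): self.found.emit(url, broken)` would likely fit, since the signals are still emitted from the worker thread. When writing to a real file, open it with `newline=""` as the csv module documentation recommends.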