Snippet Code Writer For Web Scraping

4 Views · Tested · Tips · 4 examples
$3.99
After purchasing, you will gain access to the prompt file which you can use with GPT or the app builder. You'll receive 20 free generation credits with this purchase. By purchasing this prompt, you agree to our terms of service.
2 weeks ago

Prompt Details

Model: Chat - GPT-4o (gpt-4o)
Token size: 127 ($0.00635 / call)
Example input
[ website or webpage name ]: Wikipedia
[ programming language and library ]: Python (BeautifulSoup)
[ data elements to extract ]: headings and content paragraphs
Example output
Here's a Python code snippet to scrape Wikipedia data using `requests` and `BeautifulSoup`. This script extracts the main headings and paragraphs, saves them to JSON, and adheres to best practices for web scraping.

To use this script, ensure you have `requests` and `beautifulsoup4` installed. If not, you can install them with:

```bash
pip install requests beautifulsoup4
```

Here's the code:

```python
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from urllib import robotparser


def fetch_page_content(url):
    """
    Fetches page content and checks for a successful HTTP response.
    Pauses briefly before each request to avoid anti-scraping detection.

    Args:
        url (str): URL of the Wikipedia page to scrape.

    Returns:
        BeautifulSoup object: Parsed page content, or None if the request fails.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
    }
    time.sleep(1)  # Polite delay so repeated calls don't hammer the server
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'html.parser')
        else:
            print(f"Failed to retrieve content: HTTP Status {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None


def extract_content(soup):
    """
    Extracts main headings and associated paragraphs from the BeautifulSoup object.

    Args:
        soup (BeautifulSoup): Parsed Wikipedia page content.

    Returns:
        list: List of dictionaries with 'heading' and 'paragraphs'.
    """
    content = []
    current_heading = None

    # Wikipedia page content is generally within the "mw-parser-output" div
    for element in soup.select('.mw-parser-output > h2, .mw-parser-output > h3, .mw-parser-output > p'):
        # Extract headings (H2 or H3 tags)
        if element.name in ["h2", "h3"]:
            heading = element.get_text(strip=True)
            # Clean up heading by removing citation or bracketed text
            heading = re.sub(r'\[.*?\]', '', heading)
            current_heading = {'heading': heading, 'paragraphs': []}
            content.append(current_heading)
        # Extract paragraph text and add it to the latest heading's 'paragraphs' list
        elif element.name == "p" and current_heading is not None:
            paragraph = element.get_text(strip=True)
            if paragraph:
                current_heading['paragraphs'].append(paragraph)

    return content


def save_to_json(data, filename="wikipedia_content.json"):
    """
    Saves extracted data to a JSON file.

    Args:
        data (list): Extracted content data.
        filename (str): The name of the output JSON file.
    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    print(f"Data saved to {filename}")


def main(url):
    """
    Main function to control the scraping flow.

    Args:
        url (str): Wikipedia page URL.
    """
    # Check Wikipedia's robots.txt before scraping the target page
    rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    rp.read()
    if not rp.can_fetch("*", url):
        print("Scraping disallowed by Wikipedia's robots.txt.")
        return

    soup = fetch_page_content(url)
    if soup:
        content = extract_content(soup)
        save_to_json(content)


# Example usage
if __name__ == "__main__":
    wikipedia_url = "https://en.wikipedia.org/wiki/Web_scraping"  # Replace with the desired Wikipedia page URL
    main(wikipedia_url)
```

### Explanation of Each Section

- **`fetch_page_content`**: Makes an HTTP GET request to retrieve the page content. It includes a user-agent header to simulate a browser request, pauses briefly between requests, handles exceptions, and checks the HTTP response status. This prevents unnecessary parsing if the page isn't retrieved successfully.
- **`extract_content`**: Uses `BeautifulSoup` to extract `h2` and `h3` headings and paragraphs from the main content area. It structures the data under each heading and only adds non-empty paragraphs, which are organized in lists under the `paragraphs` key of each dictionary (see the illustrative output below).
- **`save_to_json`**: Saves the extracted data into a JSON file for easy reading and processing. You can also adapt this function to save in CSV format if needed (a sketch follows after this explanation).
- **`main`**: Orchestrates the process, first checking Wikipedia's `robots.txt` (via `urllib.robotparser`) to confirm the target page may be scraped. It uses a modular structure, so you can change the target URL or modify the extraction parameters easily.

This code is set up for efficient, compliant scraping.
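For reference, the list built by `extract_content` (and written to the JSON file) has roughly the shape below. The heading and paragraph strings are purely illustrative, not actual page content:

```python
# Illustrative shape of the data returned by extract_content (example values only)
content = [
    {
        "heading": "History",
        "paragraphs": [
            "First paragraph under this heading...",
            "Second paragraph under this heading...",
        ],
    },
    # ... one dictionary per H2/H3 heading on the page
]
```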
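And here is a minimal sketch of the CSV adaptation mentioned above. The `save_to_csv` name and the one-row-per-paragraph layout are assumptions for illustration, not part of the original script:

```python
import csv

def save_to_csv(data, filename="wikipedia_content.csv"):
    """Writes one CSV row per paragraph, pairing each paragraph with its heading."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["heading", "paragraph"])  # header row
        for section in data:
            for paragraph in section["paragraphs"]:
                writer.writerow([section["heading"], paragraph])
    print(f"Data saved to {filename}")
```

To use it, call `save_to_csv(content)` in `main` in place of (or alongside) `save_to_json(content)`.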