Example output
Here's a Python script that scrapes a Wikipedia page using `requests` and `BeautifulSoup`. It extracts the main headings and their paragraphs, saves them to JSON, and follows basic good practices for web scraping: a custom user agent, error handling, a `robots.txt` check, and a short delay before each request.
To use this script, make sure `requests` and `beautifulsoup4` are installed. If not, you can install them with:
```bash
pip install requests beautifulsoup4
```
Here's the code:
```python
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from urllib import robotparser


def fetch_page_content(url):
    """
    Fetches page content and checks for a successful HTTP response.
    Pauses briefly before each request to avoid hammering the server.

    Args:
        url (str): URL of the Wikipedia page to scrape.

    Returns:
        BeautifulSoup object: Parsed page content, or None if the request fails.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36"
    }
    time.sleep(1)  # Short, polite delay before the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return BeautifulSoup(response.text, 'html.parser')
        else:
            print(f"Failed to retrieve content: HTTP status {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None


def extract_content(soup):
    """
    Extracts main headings and associated paragraphs from the BeautifulSoup object.

    Args:
        soup (BeautifulSoup): Parsed Wikipedia page content.

    Returns:
        list: List of dictionaries with 'heading' and 'paragraphs' keys.
    """
    content = []
    current_heading = None
    # Wikipedia article content generally sits inside the "mw-parser-output" div
    for element in soup.select('.mw-parser-output > h2, .mw-parser-output > h3, .mw-parser-output > p'):
        # Extract headings (h2 or h3 tags)
        if element.name in ["h2", "h3"]:
            heading = element.get_text(strip=True)
            # Clean up the heading by removing bracketed text such as "[edit]"
            heading = re.sub(r'\[.*?\]', '', heading)
            current_heading = {'heading': heading, 'paragraphs': []}
            content.append(current_heading)
        # Add non-empty paragraph text to the latest heading's 'paragraphs' list
        elif element.name == "p" and current_heading is not None:
            paragraph = element.get_text(strip=True)
            if paragraph:
                current_heading['paragraphs'].append(paragraph)
    return content


def save_to_json(data, filename="wikipedia_content.json"):
    """
    Saves extracted data to a JSON file.

    Args:
        data (list): Extracted content data.
        filename (str): The name of the output JSON file.
    """
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    print(f"Data saved to {filename}")


def main(url):
    """
    Main function to control the scraping flow.

    Args:
        url (str): Wikipedia page URL.
    """
    # Check Wikipedia's robots.txt before scraping the target URL
    robots = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    robots.read()
    if not robots.can_fetch("*", url):
        print("Scraping disallowed by Wikipedia's robots.txt.")
        return

    soup = fetch_page_content(url)
    if soup:
        content = extract_content(soup)
        save_to_json(content)


# Example usage
if __name__ == "__main__":
    wikipedia_url = "https://en.wikipedia.org/wiki/Web_scraping"  # Replace with the desired Wikipedia page URL
    main(wikipedia_url)
```
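For reference, the data written to `wikipedia_content.json` is a list with one object per heading; the values below are placeholders rather than actual article text:
```python
# Shape of the data returned by extract_content() and written to JSON
# (placeholder values, not real article content):
example = [
    {
        "heading": "Example section heading",
        "paragraphs": [
            "First paragraph under this heading...",
            "Second paragraph under this heading...",
        ],
    },
]
```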
### Explanation of Each Section
- **`fetch_page_content`**: Makes an HTTP GET request to retrieve the page. It sends a custom user-agent header, waits briefly before the request, sets a timeout, handles request exceptions, and checks the HTTP status code, so the script doesn't try to parse a page it failed to retrieve.
- **`extract_content`**: Uses `BeautifulSoup` to pull `h2`/`h3` headings and paragraphs out of the main content area. The data is grouped by heading: non-empty paragraphs are collected in a list under the `paragraphs` key of each heading's dictionary.
- **`save_to_json`**: Writes the extracted data to a JSON file for easy reading and further processing. You can also adapt this function to save CSV instead; a minimal sketch follows this list.
- **`main`**: Orchestrates the process and checks Wikipedia's `robots.txt` (via `urllib.robotparser`) to confirm the target URL may be fetched. The modular structure makes it easy to change the target URL or adjust the extraction logic.
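As noted above, the JSON writer can be swapped for CSV. Here is a minimal sketch using the standard-library `csv` module; the `save_to_csv` name and the one-row-per-paragraph layout are illustrative choices, not part of the original script:
```python
import csv

def save_to_csv(data, filename="wikipedia_content.csv"):
    """Writes one row per paragraph: (heading, paragraph)."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["heading", "paragraph"])  # header row
        for section in data:
            for paragraph in section["paragraphs"]:
                writer.writerow([section["heading"], paragraph])
    print(f"Data saved to {filename}")
```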
Together, these pieces keep the scraper polite (delay, timeout, `robots.txt` check) and easy to extend to other pages or output formats.
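If you later scrape more than one page, keep the per-request delay. A rough sketch follows; the page list is hypothetical, and it reuses `fetch_page_content`, `extract_content`, and `save_to_json` from the script above:
```python
# Hypothetical extension: scrape a few related articles, one output file per page.
pages = [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://en.wikipedia.org/wiki/Data_scraping",  # placeholder; use the pages you need
]

for page_url in pages:
    soup = fetch_page_content(page_url)  # already sleeps briefly before each request
    if soup:
        sections = extract_content(soup)
        # Derive an output filename from the article title in the URL
        save_to_json(sections, filename=page_url.rsplit("/", 1)[-1] + ".json")
```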