Example output
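Certainly! Let's start by importing the necessary libraries, loading the spaCy English model and the movie reviews dataset, and splitting the data into training and test sets. The spaCy preprocessing itself comes in the next step. Ensure you have spaCy, pandas, and scikit-learn installed (`pip install spacy pandas scikit-learn`) and download the English model (`python -m spacy download en_core_web_sm`).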
```python
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Load spaCy English model
nlp = spacy.load('en_core_web_sm')
# Load the movie reviews dataset (adjust the path; the code below expects 'review' and 'sentiment' columns)
df = pd.read_csv('movie_reviews.csv')
# Split the dataset into training and testing sets
train_data, test_data, train_labels, test_labels = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42
)
# Continue
```
Continue
Great! Now let's preprocess the text data using spaCy. We'll tokenize the text, remove stop words, and lemmatize the words to create a cleaner representation of the text data.
```python
def preprocess_text(text):
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]
    return ' '.join(tokens)
# Apply the preprocessing function to the training and testing data
train_data = train_data.apply(preprocess_text)
test_data = test_data.apply(preprocess_text)
# Continue
```
Now that we've preprocessed the text data, we can move on to building a machine learning model. Let's use the TF-IDF vectorizer for feature extraction and a Support Vector Machine (SVM) classifier.
```python
# Create a pipeline with TF-IDF vectorizer and SVM classifier
model = make_pipeline(TfidfVectorizer(), SVC())
# Train the model
model.fit(train_data, train_labels)
# Make predictions on the test set
predictions = model.predict(test_data)
# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions, average='weighted')
recall = recall_score(test_labels, predictions, average='weighted')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
```
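If you'd like to see what the TF-IDF step is doing under the hood, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of `TfidfVectorizer`'s defaults. The `tfidf_vectors` helper and the toy documents are hypothetical, and the whitespace tokenization is simpler than what scikit-learn actually does, so treat the numbers as illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute smoothed, L2-normalized TF-IDF vectors (illustrative helper)."""
    # Simplified tokenization: lowercase and split on whitespace
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)

    # Document frequency: number of documents containing each term
    df = Counter(tok for doc in tokenised for tok in set(doc))

    # Smoothed inverse document frequency (assumed to match the default settings)
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # Scale each document vector to unit length
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors


# Hypothetical toy reviews, already preprocessed
toy_docs = ["great movie great cast", "boring movie", "great fun"]
vocab, vectors = tfidf_vectors(toy_docs)

for doc, vec in zip(toy_docs, vectors):
    weights = {tok: round(w, 3) for tok, w in zip(vocab, vec) if w > 0}
    print(doc, "->", weights)
```

In the second toy review, "boring" appears in only one document while "movie" appears in two, so "boring" ends up with the larger weight. That is the behavior the SVM relies on: rarer, more distinctive words carry more signal.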
Now you have a basic sentiment analysis model using spaCy, TF-IDF, and SVM. If you are satisfied with the results, we can proceed with hyperparameter tuning to improve the model's performance. If not, we can explore different models or features.
Let me know if you'd like to continue with hyperparameter tuning or explore other options!
Great! Let's continue by exploring hyperparameter tuning for the Support Vector Machine (SVM) classifier. We'll use grid search to find the best combination of hyperparameters.
```python
from sklearn.model_selection import GridSearchCV
# Define the hyperparameters to tune
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [1, 10, 100],
    'svc__kernel': ['linear', 'rbf']
}
# Create a pipeline with TF-IDF vectorizer and SVM classifier
model = make_pipeline(TfidfVectorizer(), SVC())
# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(train_data, train_labels)
# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Continue
```
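Before running the search on a large dataset, it can help to estimate how much work it involves. The sketch below simply enumerates the same grid with `itertools.product` and counts the fits; the `candidates` and `cv_folds` names are mine, not part of scikit-learn. The parameter keys use the `step__parameter` form because `make_pipeline` names each step after its lowercased class name.

```python
from itertools import product

# Same grid as above
param_grid = {
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [1, 10, 100],
    'svc__kernel': ['linear', 'rbf'],
}
cv_folds = 5  # matches cv=5 in the grid search

# Every combination of the listed values is one candidate configuration
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} cross-validation fits")
print("First candidate:", candidates[0])
```

Each extra value you add to the grid multiplies the total, and SVM training time grows quickly with dataset size, so it may be worth starting with a small grid or a subsample of the reviews.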
Now that we have the best hyperparameters, let's evaluate the tuned model. With the default `refit=True`, `GridSearchCV` has already retrained the best pipeline on the full training set, so no extra `fit` call is needed.
```python
# best_estimator_ is already refit on the full training set (refit=True by default)
best_model = grid_search.best_estimator_
# Make predictions on the test set
predictions = best_model.predict(test_data)
# Evaluate the model
accuracy = accuracy_score(test_labels, predictions)
precision = precision_score(test_labels, predictions, average='weighted')
recall = recall_score(test_labels, predictions, average='weighted')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
```
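A quick note on reading these numbers: with `average='weighted'`, each class's precision and recall is weighted by how many true examples that class has. Here is a small pure-Python sketch of that calculation, as I understand it. The `weighted_precision_recall` helper and the toy labels are hypothetical, and a class that is never predicted is assumed to contribute a precision of 0.

```python
from collections import Counter


def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall (illustrative)."""
    support = Counter(y_true)    # true examples per class
    predicted = Counter(y_pred)  # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = 0.0
    for label, n_true in support.items():
        weight = n_true / total
        # Assumption: a class that is never predicted gets precision 0
        class_precision = correct[label] / predicted[label] if predicted[label] else 0.0
        class_recall = correct[label] / n_true
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall


# Hypothetical toy labels
y_true = ['pos', 'pos', 'pos', 'neg']
y_pred = ['pos', 'pos', 'neg', 'neg']

precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
```

One consequence is that support-weighted recall works out to the same value as accuracy for single-label classification, so the Recall line above mostly repeats the Accuracy line. If the classes are imbalanced, per-class or macro-averaged scores may tell you more.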
This should give you a more fine-tuned model. If you're satisfied with the performance, you can use this model for sentiment analysis. If you want to explore further improvements or additional steps, feel free to let me know!