Welcome to Chapter 11! In the previous chapter, 5-Clustering, we learned how to group similar data points together even when we didn't know their labels in advance.
We have now mastered numbers. We can predict prices (Regression) and sort categories (Classification). But there is one type of data that humans generate more than anything else, and it is very messy for computers: Language.
Computers speak in 1s and 0s. Humans speak in words, slang, sarcasm, and emojis. How do we bridge this gap?
This brings us to the folder 6-NLP (Natural Language Processing).
Imagine you own a hotel. Every day, guests leave dozens of reviews. How do you know, at a glance, whether they are happy or angry? You would need a computer that can read. We will use this hotel-review scenario throughout the chapter.
Natural Language Processing (NLP) is the branch of AI that gives computers the ability to understand text and spoken words in much the same way human beings can.
A computer cannot do math on the word "Cat." It needs to turn that word into a number first.
Imagine a sentence is a Lego castle. To understand it, we first smash it apart into individual bricks.
["I"], ["love"], ["coding"], ["."].Some words are like filler in a sandwich. Words like "the", "is", "and", or "a" appear everywhere but don't carry much meaning. We usually throw these away to save space.
This is the magic trick. We don't try to teach the computer grammar. Instead, we just count how many times a word appears.
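Before we reach for scikit-learn, here is the counting trick in its simplest possible form, using Python's built-in `Counter`. The example sentence is made up for illustration.

```python
from collections import Counter

# Count word occurrences -- no grammar needed, just tallying.
words = "the hotel is great and the staff is great".split()
counts = Counter(words)

print(counts["great"])  # 2
print(counts["hotel"])  # 1
```

That tally is, at heart, everything the "Bag of Words" model does; scikit-learn just automates it across many documents at once.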
In this chapter, we will use our trusty toolkit Scikit-learn to turn text into numbers. This process is called Vectorization.
Let's pretend we have three short reviews.
# Our dataset of reviews
reviews = [
"I love this hotel",
"I hate this hotel",
"The hotel is okay"
]
Explanation: This is a simple list of strings. Computers can't analyze this yet.
We use a tool called CountVectorizer. It looks at all the reviews and builds a dictionary of every unique word it sees.
from sklearn.feature_extraction.text import CountVectorizer
# 1. Create the tool
vectorizer = CountVectorizer()
# 2. Teach the tool our vocabulary
vectorizer.fit(reviews)
# 3. Print the dictionary it learned
print(vectorizer.vocabulary_)
Output:
{'hate': 0, 'hotel': 1, 'is': 2, 'love': 3, 'okay': 4, 'the': 5, 'this': 6}
Explanation: The computer assigned an ID number to every word. "Hate" is word #0. "Hotel" is word #1. Notice that "I" is missing: by default, CountVectorizer ignores single-letter words.
Now we convert our sentences into "Vectors" (lists of numbers).
# Transform the reviews into numbers
numbers = vectorizer.transform(reviews)
# Show the array (The Matrix)
print(numbers.toarray())
Output:
[[0 1 0 1 0 0 1]  <-- "I love this hotel"
 [1 1 0 0 0 0 1]  <-- "I hate this hotel"
 [0 1 1 0 1 1 0]] <-- "The hotel is okay"
Explanation:
Look closely at the first row [0 1 0 1 0 0 1]:
- The 0 in the "hate" column means: Does "hate" appear? No.
- The 1 in the "hotel" column means: Does "hotel" appear? Yes.
- The 1 in the "love" column means: Does "love" appear? Yes.

We have successfully turned English into math! Now we can feed these numbers into a classifier (like we learned in Chapter 9) to predict if the review is happy or sad.
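Here is a minimal sketch of that final step. The sentiment labels below are made up for illustration; in a real project they would come from your dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["I love this hotel", "I hate this hotel", "The hotel is okay"]
labels = ["happy", "sad", "happy"]  # hypothetical sentiment labels

# Turn the text into the count matrix we built above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Train a simple classifier on the word counts.
model = MultinomialNB()
model.fit(X, labels)

# Predict the mood of a brand-new review.
new_review = vectorizer.transform(["I love it"])
print(model.predict(new_review))  # ['happy']
```

Note that we reuse the same fitted `vectorizer` on the new review, so its words map to the same column IDs the model was trained on.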
How does the machine read? It uses a pipeline: break the text into tokens, throw away the stop words, count what remains, and hand the counts to a model.
Counting words (Bag of Words) is great, but it has a flaw: common words like "hotel" appear in almost every review, so they dominate the counts without telling us anything useful.
In the 6-NLP lessons, you will learn about TF-IDF (Term Frequency - Inverse Document Frequency). This is a formula that upgrades our counting: it boosts words that are rare and distinctive, and shrinks words that appear everywhere.
from sklearn.feature_extraction.text import TfidfTransformer
# We use the counts from the previous step
tfidf = TfidfTransformer()
# Calculate the new weighted scores
weighted_numbers = tfidf.fit_transform(numbers)
# Print the new math
print(weighted_numbers.toarray())
Explanation:
Instead of simple 1s and 0s, you will now see decimal numbers like 0.54 or 0.21. A higher score means the word is more distinctive for that review; a lower score means it shows up everywhere and tells us little.
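In practice, the whole reading pipeline (count, weight, classify) is often chained together with scikit-learn's `Pipeline`, so raw text goes in one end and predictions come out the other. The reviews and labels here are hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Chain the reading steps: count words, weight them, classify.
pipe = Pipeline([
    ("count", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classify", MultinomialNB()),
])

reviews = ["I love this hotel", "I hate this hotel"]
labels = ["happy", "sad"]  # hypothetical sentiment labels
pipe.fit(reviews, labels)

print(pipe.predict(["I love it"]))  # ['happy']
```

A handy shortcut: scikit-learn also provides `TfidfVectorizer`, which combines the counting and weighting steps into a single tool.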
You interact with NLP every single day: spam filters read your email, autocomplete finishes your sentences, and translation apps turn one language into another.
In this chapter, we explored 6-NLP. We learned that:
- Computers can't do math on words, so we must turn text into numbers (Vectorization).
- Tokenization smashes sentences into individual word "bricks".
- Stop words like "the" and "is" carry little meaning and are usually removed.
- Bag of Words simply counts how often each word appears, using CountVectorizer.
- TF-IDF weights those counts so rare, distinctive words score higher than filler.
Now that we can analyze numbers, categories, groups, and text, we have one final frontier. All of our data so far has been static. But the real world moves and changes over time.
How do we predict the weather for tomorrow or the stock price for next week?