Singular Value Decomposition — with source code
So guys, in today's blog we will see how to perform Singular Value Decomposition on the book titles in our dataset using TruncatedSVD. This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD).
Contrary to PCA, this estimator does not center the data before computing the singular value decomposition, which means it can work with sparse matrices efficiently. When we perform SVD (Singular Value Decomposition) on text data, it is also called LSA (Latent Semantic Analysis). So, without wasting any time…
Read the full article with source code here — https://machinelearningprojects.net/singular-value-decomposition/
Let's do it…
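One quick aside before we load the book titles: here is a minimal, self-contained sketch of the sparse-input point mentioned above. The matrix here is random toy data, not our dataset, and the shapes are just for illustration:
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# a 100 x 50 sparse matrix with ~5% non-zero entries (toy data)
X = sparse_random(100, 50, density=0.05, random_state=42)

svd = TruncatedSVD(n_components=2, random_state=42)
Z = svd.fit_transform(X)  # works directly on the sparse matrix, no densifying needed
print(Z.shape)            # (100, 2)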
Step 1 — Importing libraries required for Singular Value Decomposition.
import nltk
from nltk.stem import WordNetLemmatizer
import numpy as np
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

nltk.download('punkt')
nltk.download('wordnet')
Step 2 — Reading lines from our text file.
titles = [line.strip() for line in open('all_book_titles.txt')]
titles
Step 3 — Creating Stopwords set.
stopwords = set(word.strip() for word in open('stopwords.txt'))
stopwords = stopwords.union({
'introduction', 'edition', 'series', 'application',
'approach', 'card', 'access', 'package', 'plus', 'etext',
'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
'third', 'second', 'fourth', })

word_lemmatizer = WordNetLemmatizer()
- We have our default stopwords like a, is, that, this, etc. in stopwords.txt, so first of all we load all of those into a variable called stopwords.
- Secondly, we manually add some more stopwords like edition, introduction, series, etc. to the set. These are very common words in book titles.
- Lastly, we initialize WordNetLemmatizer() (see the quick example below).
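For instance, the lemmatizer we just initialized maps plural nouns back to their base form (WordNetLemmatizer treats words as nouns by default):
print(word_lemmatizer.lemmatize('titles'))        # 'title'
print(word_lemmatizer.lemmatize('applications'))  # 'application'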
Step 4 — Creating tokenizer function.
def tokenizer(s):
    s = s.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    tokens = [t for t in tokens if len(t) > 2]
    tokens = [word_lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    return tokens
- This takes a string, converts it to lowercase, tokenizes it, removes tokens of length 2 or less, lemmatizes the remaining words, removes stopwords, and finally removes all tokens containing a digit.
Step 5 — Checking tokenizer.
tokenizer('my name is abhishek and i am 19 years old!!')
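With the default stopword list this should return something like ['name', 'abhishek', 'year', 'old'] — '19' is dropped by the length filter, 'years' is lemmatized to 'year', and the exact output depends on what is in stopwords.txt.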
Step 6 — Creating word_2_int and int_2_words dictionaries.
word_2_int = {}
int_2_words = {}
ind = 0
error_count = 0

for title in titles:
    try:
        title = title.encode('ascii', 'ignore').decode('utf-8')  # drop non-ASCII characters rather than raising
        tokens = tokenizer(title)
        for token in tokens:
            if token not in word_2_int:
                word_2_int[token] = ind
                int_2_words[ind] = token
                ind += 1
    except Exception as e:
        print(e)
        print(title)
        error_count += 1
- We wrap the body in try-except as a safety net: the encode('ascii', 'ignore') call actually drops special characters rather than raising, but an unusual title could still trip up tokenization, and in that case we print it and count the error.
- Then we simply take the title, tokenize it, and traverse those tokens.
- If a token is not yet in our vocabulary (word_2_int), we add it and assign it an index.
- Similarly, we fill the inverse dictionary int_2_words for future use.
- And then we simply increment the index (a quick sanity check follows below).
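A quick sanity check on what we just built; the exact numbers depend on the dataset:
print(len(word_2_int))               # vocabulary size
print(error_count)                   # how many titles failed to parse
print(list(word_2_int.items())[:5])  # a few (word, index) pairs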
Step 7 — Creating tokens_2_vectors function.
def tokens_2_vectors(tokens):
    X = np.zeros(len(word_2_int))
    for t in tokens:
        if t in word_2_int:  # skip tokens that are not in the vocabulary
            X[word_2_int[t]] = 1
    return X
- This function converts a tokenized title into a vector.
- It creates an array of our vocabulary size with all elements set to 0.
- It then sets a 1 at the position of every vocabulary word that is present in the title whose vector we are creating (see the example below).
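For example, vectorizing the first title in the dataset (the actual numbers depend on your data file):
tokens = tokenizer(titles[0])
v = tokens_2_vectors(tokens)
print(v.shape)       # (vocabulary size,)
print(int(v.sum()))  # number of distinct vocabulary words found in this title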
Step 8 — Creating a final matrix and fitting it into our SVD.
final_matrix = np.zeros((len(titles), len(word_2_int)))

for i in range(len(titles)):
    title = titles[i]
    tokens = tokenizer(title)
    final_matrix[i, :] = tokens_2_vectors(tokens)

svd = TruncatedSVD()
Z = svd.fit_transform(final_matrix)
Z.shape
- Here we create a final matrix with rows as the number of titles (2373) and columns as the number of words in our vocabulary.
- Then we fill this matrix with the vector of each and every title.
- Finally, we fit our TruncatedSVD on the final matrix.
- Z.shape comes out as (2373, 2): our data has 2373 titles, and each title is now represented by just 2 components, i.e. its position in the plane (TruncatedSVD keeps n_components=2 by default; see the variance check below).
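It is also worth checking how much variance those two components capture; explained_variance_ratio_ is a standard attribute of a fitted TruncatedSVD:
print(svd.explained_variance_ratio_)        # variance captured by each of the 2 components
print(svd.explained_variance_ratio_.sum())  # total variance retained in the 2-D projection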
Step 9 — Visualize the results.
fig = plt.figure(figsize=(15, 9))

# each word's coordinates in the 2-D latent space come from the fitted components
W = svd.components_.T  # shape: (vocabulary size, 2)

plt.scatter(W[:, 0], W[:, 1])
for i in range(len(word_2_int)):
    plt.annotate(int_2_words[i], (W[i, 0], W[i, 1]))
- You will see that similar words sit closer together in this plot. (Z holds the title coordinates, so to plot word positions we take each word's coordinates from svd.components_, whose columns hold the word loadings on the two components.)
If you have any query regarding Singular Value Decomposition, do let me know by contacting me on email or LinkedIn. I have tried my best to explain this code.
To explore more Machine Learning, Deep Learning, Computer Vision, NLP, Flask Projects visit my blog — Machine Learning Projects
For further code explanation and source code visit here — https://machinelearningprojects.net/singular-value-decomposition/
So this is all for this blog folks. Thanks for reading it, I hope you are taking something with you after reading this, and till the next time 👋…
Read my previous post: TOPIC MODELING USING LATENT DIRICHLET ALLOCATION
Source: https://towardsdev.com/singular-value-decomposition-with-source-code-easiest-way-54ea6823a5e8