12 Advanced Topics
12.1 Text Analysis with nltk
(by Shivaram Karandikar)
12.1.1 Introduction
nltk, or the Natural Language Toolkit, is a Python package which provides a set of tools for text analysis. nltk is used in Natural Language Processing (NLP), a field of computer science which focuses on the interaction between computers and human languages. nltk is a very powerful tool for text analysis and is used by many researchers and data scientists. In this tutorial, we will learn how to use nltk to analyze text.
12.1.2 Getting Started
First, we must install nltk using pip.
python -m pip install nltk
Certain functions also require additional datasets and models. We can download a popular subset with
python -m nltk.downloader popular
12.1.3 Tokenizing
To analyze text, it needs to be broken down into smaller pieces. This is called tokenization. nltk offers two ways to tokenize text: sentence tokenization and word tokenization.
To demonstrate this, we will use the following text, a passage from the 1951 science fiction novel Foundation by Isaac Asimov.
= """The sum of human knowing is beyond any one man; any thousand men. With the destruction of our social fabric, science will be broken into a million pieces. Individuals will know much of exceedingly tiny facets of what there is to know. They will be helpless and useless by themselves. The bits of lore, meaningless, will not be passed on. They will be lost through the generations. But, if we now prepare a giant summary of all knowledge, it will never be lost. Coming generations will build on it, and will not have to rediscover it for themselves. One millennium will do the work of thirty thousand.""" fd_string
12.1.3.1 Sentence Tokenization
import nltk
from nltk import sent_tokenize, word_tokenize

nltk.download("popular")  # only needs to download once
fd_sent = sent_tokenize(fd_string)
print(fd_sent)
[nltk_data] Downloading collection 'popular'
[nltk_data] Done downloading collection popular
['The sum of human knowing is beyond any one man; any thousand men.', 'With the destruction of our social fabric, science will be broken into a million pieces.', 'Individuals will know much of exceedingly tiny facets of what there is to know.', 'They will be helpless and useless by themselves.', 'The bits of lore, meaningless, will not be passed on.', 'They will be lost through the generations.', 'But, if we now prepare a giant summary of all knowledge, it will never be lost.', 'Coming generations will build on it, and will not have to rediscover it for themselves.', 'One millennium will do the work of thirty thousand.']
12.1.3.2 Word Tokenization
fd_word = word_tokenize(fd_string)
print(fd_word)
['The', 'sum', 'of', 'human', 'knowing', 'is', 'beyond', 'any', 'one', 'man', ';', 'any', 'thousand', 'men', '.', 'With', 'the', 'destruction', 'of', 'our', 'social', 'fabric', ',', 'science', 'will', 'be', 'broken', 'into', 'a', 'million', 'pieces', '.', 'Individuals', 'will', 'know', 'much', 'of', 'exceedingly', 'tiny', 'facets', 'of', 'what', 'there', 'is', 'to', 'know', '.', 'They', 'will', 'be', 'helpless', 'and', 'useless', 'by', 'themselves', '.', 'The', 'bits', 'of', 'lore', ',', 'meaningless', ',', 'will', 'not', 'be', 'passed', 'on', '.', 'They', 'will', 'be', 'lost', 'through', 'the', 'generations', '.', 'But', ',', 'if', 'we', 'now', 'prepare', 'a', 'giant', 'summary', 'of', 'all', 'knowledge', ',', 'it', 'will', 'never', 'be', 'lost', '.', 'Coming', 'generations', 'will', 'build', 'on', 'it', ',', 'and', 'will', 'not', 'have', 'to', 'rediscover', 'it', 'for', 'themselves', '.', 'One', 'millennium', 'will', 'do', 'the', 'work', 'of', 'thirty', 'thousand', '.']
Both the sentence tokenization and word tokenization functions return a list of strings. We can use these lists to perform further analysis.
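For instance, as a small sketch combining the two tokenizers (using the fd_string, fd_sent, and fd_word objects defined above), we can count how many word tokens each sentence contains:

# Count the word tokens in each tokenized sentence of the passage.
print(len(fd_sent), "sentences and", len(fd_word), "word tokens in total")
for sent in fd_sent:
    print(len(word_tokenize(sent)), "tokens in:", sent)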
12.1.4 Removing Stopwords
The output of the word tokenization gave us a list of words. However, some of these words are not useful for our analysis. These words are called stopwords. nltk provides a list of stopwords for several languages. We can use this list to remove stopwords from our text.
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print(stop_words)
{"you'd", 'being', 'at', 'after', 'their', 's', 'all', 'couldn', 'here', 'same', 'she', "weren't", 'some', 'too', 'i', 'can', 'of', 'you', 'than', 'd', "you'll", 'the', 't', 'above', "should've", 'has', 'ours', 'where', 'from', 'wouldn', 'is', 'hasn', 'am', 'its', 'ain', 'our', 'should', 'in', 'your', 'those', 'had', 'if', 'weren', 'into', 'have', "doesn't", "isn't", 'aren', 'me', 'whom', 'how', 'll', 'themselves', 'and', 'myself', 'over', 'once', 'did', 'then', 'so', 'we', 'mustn', 'won', 'both', 'between', 'now', 'with', "she's", 'shan', 'about', 'his', 'itself', 'or', 'up', "wouldn't", "couldn't", "aren't", 'having', 'these', 'to', 'each', 'own', 'were', 'until', 'been', 'are', 'yours', 'off', 'this', 'such', 'don', 've', 'they', 'was', "shan't", 'hadn', 'an', "hadn't", 'under', 'when', 'haven', "mustn't", 'who', "didn't", "wasn't", 'ma', 'just', 'mightn', 'very', 'be', 'no', 'more', 'other', 'out', "don't", "won't", 'm', 'ourselves', "haven't", 'by', 'herself', 'o', 'a', 'most', 'hers', 'few', 'during', "you're", "you've", 'further', 'why', "hasn't", 'as', "needn't", "that'll", "mightn't", 'nor', "it's", 'only', 'doesn', 'that', 'through', 'theirs', 'isn', 're', 'her', 'my', 'them', 'what', 'which', 'y', 'before', 'there', 'didn', 'not', 'while', 'down', 'against', 'below', 'himself', 'does', 'because', 'again', 'needn', 'shouldn', 'yourselves', 'wasn', 'will', 'do', 'but', 'doing', 'on', 'he', 'it', 'any', 'him', "shouldn't", 'for', 'yourself'}
fd_filtered = [w for w in fd_word if w.casefold() not in stop_words]
print(fd_filtered)
['sum', 'human', 'knowing', 'beyond', 'one', 'man', ';', 'thousand', 'men', '.', 'destruction', 'social', 'fabric', ',', 'science', 'broken', 'million', 'pieces', '.', 'Individuals', 'know', 'much', 'exceedingly', 'tiny', 'facets', 'know', '.', 'helpless', 'useless', '.', 'bits', 'lore', ',', 'meaningless', ',', 'passed', '.', 'lost', 'generations', '.', ',', 'prepare', 'giant', 'summary', 'knowledge', ',', 'never', 'lost', '.', 'Coming', 'generations', 'build', ',', 'rediscover', '.', 'One', 'millennium', 'work', 'thirty', 'thousand', '.']
The resulting list is significantly shorter. There are some words that nltk considers stopwords that we may want to keep, depending on the objective of our analysis. Reducing the size of our data can reduce the time it takes to perform our analysis; however, removing too many words can reduce accuracy, which is especially important when we are trying to perform sentiment analysis.
12.1.5 Stemming
Stemming is a method which allows us to reduce the number of variants of a word. For example, the words connecting, connected, and connection are all variants of the same word connect. nltk includes a few different stemmers based on different algorithms. We will use the Snowball stemmer, an improved version of the 1979 Porter stemmer.
from nltk.stem.snowball import SnowballStemmer

snow_stem = SnowballStemmer(language='english')
fd_stem = [snow_stem.stem(w) for w in fd_word]
print(fd_stem)
['the', 'sum', 'of', 'human', 'know', 'is', 'beyond', 'ani', 'one', 'man', ';', 'ani', 'thousand', 'men', '.', 'with', 'the', 'destruct', 'of', 'our', 'social', 'fabric', ',', 'scienc', 'will', 'be', 'broken', 'into', 'a', 'million', 'piec', '.', 'individu', 'will', 'know', 'much', 'of', 'exceed', 'tini', 'facet', 'of', 'what', 'there', 'is', 'to', 'know', '.', 'they', 'will', 'be', 'helpless', 'and', 'useless', 'by', 'themselv', '.', 'the', 'bit', 'of', 'lore', ',', 'meaningless', ',', 'will', 'not', 'be', 'pass', 'on', '.', 'they', 'will', 'be', 'lost', 'through', 'the', 'generat', '.', 'but', ',', 'if', 'we', 'now', 'prepar', 'a', 'giant', 'summari', 'of', 'all', 'knowledg', ',', 'it', 'will', 'never', 'be', 'lost', '.', 'come', 'generat', 'will', 'build', 'on', 'it', ',', 'and', 'will', 'not', 'have', 'to', 'rediscov', 'it', 'for', 'themselv', '.', 'one', 'millennium', 'will', 'do', 'the', 'work', 'of', 'thirti', 'thousand', '.']
Stemming algorithms are susceptible to errors. Related words that should share a stem sometimes do not, which is known as understemming (a false negative). Unrelated words that should not share a stem sometimes do, which is known as overstemming (a false positive).
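As a rough illustration (the exact stems depend on the stemmer and its version), the Snowball stemmer typically maps university, universal, and universe to a single stem, a case of overstemming, while alumnus and alumni keep different stems, a case of understemming:

# Overstemming: distinct words collapse to one stem.
# Understemming: related words fail to share a stem.
# (Results shown are typical for the English Snowball stemmer and may vary by version.)
for word in ["university", "universal", "universe", "alumnus", "alumni"]:
    print(word, "->", snow_stem.stem(word))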
12.1.6 POS Tagging
nltk also enables us to label the part of speech of each word in a text. This is known as part-of-speech (POS) tagging. nltk uses the Penn Treebank tagset, a set of tags used to label words in a text. The tags are as follows:
nltk.help.upenn_tagset()
$: dollar
$ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
' ''
(: opening parenthesis
( [ {
): closing parenthesis
) ] }
,: comma
,
--: dash
--
.: sentence terminator
. ! ?
:: colon or ellipsis
: ; ...
CC: conjunction, coordinating
& 'n and both but either et for less minus neither nor or plus so
therefore times v. versus vs. whether yet
CD: numeral, cardinal
mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
EX: existential there
there
FW: foreign word
gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
astride among uppon whether out inside pro despite on by throughout
below within for towards near behind atop around if like until below
next into if beside ...
JJ: adjective or numeral, ordinal
third ill-mannered pre-war regrettable oiled calamitous first separable
ectoplasmic battery-powered participatory fourth still-to-be-named
multilingual multi-disciplinary ...
JJR: adjective, comparative
bleaker braver breezier briefer brighter brisker broader bumper busier
calmer cheaper choosier cleaner clearer closer colder commoner costlier
cozier creamier crunchier cuter ...
JJS: adjective, superlative
calmest cheapest choicest classiest cleanest clearest closest commonest
corniest costliest crassest creepiest crudest cutest darkest deadliest
dearest deepest densest dinkiest ...
LS: list item marker
A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
SP-44007 Second Third Three Two * a b c d first five four one six three
two
MD: modal auxiliary
can cannot could couldn't dare may might must need ought shall should
shouldn't will would
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
Apache Apaches Apocrypha ...
NNS: noun, common, plural
undergraduates scotches bric-a-brac products bodyguards facets coasts
divestitures storehouses designs clubs fragrances averages
subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
all both half many quite such sure this
POS: genitive marker
' 's
PRP: pronoun, personal
hers herself him himself hisself it itself me myself one oneself ours
ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
her his mine my our ours their thy your
RB: adverb
occasionally unabatingly maddeningly adventurously professedly
stirringly prominently technologically magisterially predominately
swiftly fiscally pitilessly ...
RBR: adverb, comparative
further gloomier grander graver greater grimmer harder harsher
healthier heavier higher however larger later leaner lengthier less-
perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
best biggest bluntest earliest farthest first furthest hardest
heartiest highest largest least less most nearest second tightest worst
RP: particle
aboard about across along apart around aside at away back before behind
by crop down ever fast for forth from go high i.e. in into just later
low more off on open out over per pie raising start teeth that through
under unto up up-pp upon whole with you
SYM: symbol
% & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
to
UH: interjection
Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
man baby diddle hush sonuvabitch ...
VB: verb, base form
ask assemble assess assign assume atone attention avoid bake balkanize
bank begin behold believe bend benefit bevel beware bless boil bomb
boost brace break bring broil brush build ...
VBD: verb, past tense
dipped pleaded swiped regummed soaked tidied convened halted registered
cushioned exacted snubbed strode aimed adopted belied figgered
speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
telegraphing stirring focusing angering judging stalling lactating
hankerin' alleging veering capping approaching traveling besieging
encrypting interrupting erasing wincing ...
VBN: verb, past participle
multihulled dilapidated aerosolized chaired languished panelized used
experimented flourished imitated reunifed factored condensed sheared
unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
predominate wrap resort sue twist spill cure lengthen brush terminate
appear tend stray glisten obtain comprise detest tease attract
emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
bases reconstructs marks mixes displeases seals carps weaves snatches
slumps stretches authorizes smolders pictures emerges stockpiles
seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
that what whatever which whichever
WP: WH-pronoun
that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
whose
WRB: Wh-adverb
how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
` ``
We can use the function nltk.pos_tag() on our list of tokenized words. This will return a list of tuples, where each tuple contains a word and its corresponding tag.
fd_tag = nltk.pos_tag(fd_word)
print(fd_tag)
[('The', 'DT'), ('sum', 'NN'), ('of', 'IN'), ('human', 'JJ'), ('knowing', 'NN'), ('is', 'VBZ'), ('beyond', 'IN'), ('any', 'DT'), ('one', 'CD'), ('man', 'NN'), (';', ':'), ('any', 'DT'), ('thousand', 'CD'), ('men', 'NNS'), ('.', '.'), ('With', 'IN'), ('the', 'DT'), ('destruction', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('social', 'JJ'), ('fabric', 'NN'), (',', ','), ('science', 'NN'), ('will', 'MD'), ('be', 'VB'), ('broken', 'VBN'), ('into', 'IN'), ('a', 'DT'), ('million', 'CD'), ('pieces', 'NNS'), ('.', '.'), ('Individuals', 'NNS'), ('will', 'MD'), ('know', 'VB'), ('much', 'RB'), ('of', 'IN'), ('exceedingly', 'RB'), ('tiny', 'JJ'), ('facets', 'NNS'), ('of', 'IN'), ('what', 'WP'), ('there', 'EX'), ('is', 'VBZ'), ('to', 'TO'), ('know', 'VB'), ('.', '.'), ('They', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('helpless', 'JJ'), ('and', 'CC'), ('useless', 'JJ'), ('by', 'IN'), ('themselves', 'PRP'), ('.', '.'), ('The', 'DT'), ('bits', 'NNS'), ('of', 'IN'), ('lore', 'NN'), (',', ','), ('meaningless', 'NN'), (',', ','), ('will', 'MD'), ('not', 'RB'), ('be', 'VB'), ('passed', 'VBN'), ('on', 'IN'), ('.', '.'), ('They', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('lost', 'VBN'), ('through', 'IN'), ('the', 'DT'), ('generations', 'NNS'), ('.', '.'), ('But', 'CC'), (',', ','), ('if', 'IN'), ('we', 'PRP'), ('now', 'RB'), ('prepare', 'VBP'), ('a', 'DT'), ('giant', 'JJ'), ('summary', 'NN'), ('of', 'IN'), ('all', 'DT'), ('knowledge', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('never', 'RB'), ('be', 'VB'), ('lost', 'VBN'), ('.', '.'), ('Coming', 'VBG'), ('generations', 'NNS'), ('will', 'MD'), ('build', 'VB'), ('on', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('will', 'MD'), ('not', 'RB'), ('have', 'VB'), ('to', 'TO'), ('rediscover', 'VB'), ('it', 'PRP'), ('for', 'IN'), ('themselves', 'PRP'), ('.', '.'), ('One', 'CD'), ('millennium', 'NN'), ('will', 'MD'), ('do', 'VB'), ('the', 'DT'), ('work', 'NN'), ('of', 'IN'), ('thirty', 'JJ'), ('thousand', 'NN'), ('.', '.')]
The tokenized words from the quote should be easy to tag correctly. The function may encounter difficulty with less conventional words (e.g. Old English), but it will attempt to tag based on context.
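As a quick, hedged sketch of this fallback behaviour, we can tag a sentence of invented words; the tagger has never seen them, so it relies on word position and suffixes (the exact tags it chooses may vary by model version):

# None of these invented words appear in the tagger's training data,
# so the assigned tags come entirely from context and word shape.
print(nltk.pos_tag(word_tokenize("The gostak distims the doshes.")))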
12.1.7 Lemmatizing
Lemmatizing is similar to stemming, but it is more accurate. Lemmatizing is a process which reduces words to their lemma, the base form of a word. nltk includes a lemmatizer based on the WordNet database. We can demonstrate this using a quote from the 1868 novel Little Women by Louisa May Alcott.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
quote = "The dim, dusty room, with the busts staring down from the tall book-cases, the cosy chairs, the globes, and, best of all, the wilderness of books, in which she could wander where she liked, made the library a region of bliss to her."
quote_token = word_tokenize(quote)
quote_lemma = [lemmatizer.lemmatize(w) for w in quote_token]
print(quote_lemma)
['The', 'dim', ',', 'dusty', 'room', ',', 'with', 'the', 'bust', 'staring', 'down', 'from', 'the', 'tall', 'book-cases', ',', 'the', 'cosy', 'chair', ',', 'the', 'globe', ',', 'and', ',', 'best', 'of', 'all', ',', 'the', 'wilderness', 'of', 'book', ',', 'in', 'which', 'she', 'could', 'wander', 'where', 'she', 'liked', ',', 'made', 'the', 'library', 'a', 'region', 'of', 'bliss', 'to', 'her', '.']
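One detail worth noting: by default lemmatize() treats every word as a noun, which is why staring and liked were left unchanged above. Passing a part-of-speech hint ("v" for verb, "a" for adjective) lets WordNet find the true lemma:

# Without a POS hint the lemmatizer assumes a noun and leaves the word alone.
print(lemmatizer.lemmatize("liked"))            # 'liked'
print(lemmatizer.lemmatize("liked", pos="v"))   # 'like'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'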
12.1.8 Chunking/Chinking
While tokenizing allows us to distinguish individual words and sentences within a larger body of text, chunking allows us to identify phrases based on grammar rules that we specify.
#nltk.download("averaged_perceptron_tagger")
= nltk.pos_tag(quote_token) quote_tag
We can then define grammar rules to apply to the text. These rules use regular expressions, whose operators are listed below:
Operator | Behavior
. | Wildcard, matches any character
^abc | Matches some pattern abc at the start of a string
abc$ | Matches some pattern abc at the end of a string
[abc] | Matches one of a set of characters
[A-Z0-9] | Matches one of a range of characters
ed|ing|s | Matches one of the specified strings (disjunction)
* | Zero or more of the previous item, e.g. a*, [a-z]* (also known as Kleene closure)
+ | One or more of the previous item, e.g. a+, [a-z]+
? | Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} | Exactly n repeats, where n is a non-negative integer
{n,} | At least n repeats
{,n} | No more than n repeats
{m,n} | At least m and no more than n repeats
a(b|c)+ | Parentheses that indicate the scope of the operators
import re

grammar = r"""
NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>}               # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>}           # Chunk NP, VP
"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(quote_tag)
tree.pretty_print(unicodelines=True)
(parse tree output: the tagged quote is displayed as a tree rooted at S, with the NP, PP, VP, and CLAUSE chunks found by the grammar shown as subtrees)
As you can see, the generated tree shows the chunks that were identified by the grammar rules. There is also a chink operator, which is the opposite of chunk: it removes a sequence of tokens from within a larger chunk.
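nltk does not use a separate chinking function; instead, a chink rule is written inside the same grammar string using }...{ brackets. A minimal sketch (the rule name and tags here are illustrative, not taken from the original chapter):

# Chunk everything into one NP, then chink (remove) verbs and prepositions
# so the chunks are split around them.
chink_grammar = r"""
NP:
    {<.*>+}         # chunk every sequence of tags
    }<VB.*|IN>{     # chink verbs and prepositions back out
"""
chink_parser = nltk.RegexpParser(chink_grammar)
print(chink_parser.parse(quote_tag))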
12.1.9 Named Entity Recognition
Previous methods have been able to identify the parts of speech of each word in a text. However, we may want to identify specific entities within the text, for example, the names of people, places, and organizations. nltk includes a named entity recognizer which can identify these entities. We can demonstrate this using a quote from The Iliad by Homer.
= "In the war of Troy, the Greeks having sacked some of the neighbouring towns, and taken from thence two beautiful captives, Chryseïs and Briseïs, allotted the first to Agamemnon, and the last to Achilles."
homer = word_tokenize(homer)
homer_token = nltk.pos_tag(homer_token) homer_tag
#nltk.download("maxent_ne_chunker")
#nltk.download("words")
= nltk.ne_chunk(homer_tag)
tree2 =True) tree2.pretty_print(unicodelines
(named-entity tree output: Troy, Greeks, Briseïs, Agamemnon, and Achilles are chunked as GPE, Chryseïs as PERSON, and the remaining tokens stay directly under the S root)
In the tree, some of the words that should be tagged as PERSON are tagged as GPE, or Geo-Political Entity. In these cases, we can also generate a tree which does not specify the type of named entity.
tree3 = nltk.ne_chunk(homer_tag, binary=True)
tree3.pretty_print(unicodelines=True)
(binary named-entity tree output: Troy, Greeks, Chryseïs, Briseïs, Agamemnon, and Achilles are each chunked simply as NE)
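Rather than reading the entities off the printed tree, we can also pull them out programmatically; a small sketch using the binary tree from above:

# Collect the words inside every subtree labelled NE.
entities = [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree3.subtrees()
            if subtree.label() == "NE"]
print(entities)  # expected to include Troy, Greeks, Chryseïs, Briseïs, Agamemnon, Achilles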
12.1.10 Analyzing Corpora
nltk includes a number of corpora, which are large bodies of text. We will try out some methods on the 1851 novel Moby Dick by Herman Melville.
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
12.1.10.1 Concordance
concordance allows us to find all instances of a word in a text. We can use this to find all instances of the word “whale” in Moby Dick.
"whale") text1.concordance(
Displaying 25 of 1226 matches:
s , and to teach them by what name a whale - fish is to be called in our tongue
t which is not true ." -- HACKLUYT " WHALE . ... Sw . and Dan . HVAL . This ani
ulted ." -- WEBSTER ' S DICTIONARY " WHALE . ... It is more immediately from th
ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE
HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE
least , take the higgledy - piggledy whale statements , however authentic , in
dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a
patient Job ." -- RABELAIS . " This whale ' s liver was two cartloads ." -- ST
Touching that monstrous bulk of the whale or ork we have received nothing cert
of oil will be extracted out of one whale ." -- IBID . " HISTORY OF LIFE AND D
ise ." -- KING HENRY . " Very like a whale ." -- HAMLET . " Which to secure , n
restless paine , Like as the wounded whale to shore flies thro ' the maine ." -
. OF SPERMA CETI AND THE SPERMA CETI WHALE . VIDE HIS V . E . " Like Spencer '
t had been a sprat in the mouth of a whale ." -- PILGRIM ' S PROGRESS . " That
EN ' S ANNUS MIRABILIS . " While the whale is floating at the stern of the ship
e ship called The Jonas - in - the - Whale . ... Some say the whale can ' t ope
in - the - Whale . ... Some say the whale can ' t open his mouth , but that is
masts to see whether they can see a whale , for the first discoverer has a duc
for his pains . ... I was told of a whale taken near Shetland , that had above
oneers told me that he caught once a whale in Spitzbergen that was white all ov
2 , one eighty feet in length of the whale - bone kind came in , which ( as I w
n master and kill this Sperma - ceti whale , for I could never hear of any of t
. 1729 . "... and the breath of the whale is frequendy attended with such an i
ed with hoops and armed with ribs of whale ." -- RAPE OF THE LOCK . " If we com
contemptible in the comparison . The whale is doubtless the largest animal in c
12.1.10.2 Dispersion Plot
dispersion_plot allows us to see how a word is used throughout a text. We can use this to see the representation of characters throughout Moby Dick.
"Ahab", "Ishmael", "Starbuck", "Queequeg"]) text1.dispersion_plot([
/usr/local/lib/python3.11/site-packages/nltk/draw/__init__.py:15: UserWarning: nltk.draw package not loaded (please install Tkinter library).
warnings.warn("nltk.draw package not loaded (please install Tkinter library).")
12.1.10.3 Frequency Distribution
FreqDist allows us to see the frequency of each word in a text. We can use this to see the most common words in Moby Dick.
from nltk import FreqDist

fdist1 = FreqDist(text1)
print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
We can use the list of stop words generated previously to help us focus on meaningful words.
text1_imp = [w for w in text1 if w not in stop_words and w.isalpha()]
fdist2 = FreqDist(text1_imp)
fdist2.most_common(20)
[('I', 2124),
('whale', 906),
('one', 889),
('But', 705),
('like', 624),
('The', 612),
('upon', 538),
('man', 508),
('ship', 507),
('Ahab', 501),
('ye', 460),
('old', 436),
('sea', 433),
('would', 421),
('And', 369),
('head', 335),
('though', 335),
('boat', 330),
('time', 324),
('long', 318)]
We can visualize the frequency distribution using plot.
fdist2.plot(20, cumulative=True)
<AxesSubplot: xlabel='Samples', ylabel='Cumulative Counts'>
12.1.10.4 Collocations
collocations allows us to find words that commonly appear together. We can use this to find the most common collocations in Moby Dick.
text1.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
12.1.11 Conclusion
In this tutorial, we have learned how to use nltk to perform basic text analysis. There are many methods included in this package that help provide structure to text. These methods can be used in conjunction with other packages to perform more complex analysis. For example, a dataframe of open-ended customer feedback could be processed to identify common themes, as well as the polarity of the feedback.
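As a hedged sketch of that idea, nltk ships a VADER sentiment analyzer (after downloading the vader_lexicon resource) that scores the polarity of short pieces of text; the feedback strings below are made up for illustration:

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()
feedback = ["The checkout process was quick and painless.",  # hypothetical comments
            "Support never answered my emails."]
for comment in feedback:
    print(sia.polarity_scores(comment)["compound"], comment)  # compound score in [-1, 1]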
12.1.12 Resources
12.2 Neural Networks with Tensorflow (by Giovanni Lunetta)
A neural network is a type of machine learning algorithm that is inspired by the structure and function of the human brain. It consists of layers of interconnected nodes, or neurons, that can learn to recognize patterns in data and make predictions or decisions based on that input.
Neural networks are used in a wide variety of applications, including image and speech recognition, natural language processing, predictive analytics, robotics, and more. They have been especially effective in tasks that require pattern recognition, such as identifying objects in images, translating between languages, and predicting future trends in data.
12.2.1 Neural Network Architecture
A neural network consists of one or more layers of neurons, each of which takes input from the previous layer and produces output for the next layer. The input layer receives raw data, while the output layer produces predictions or decisions based on that input. The hidden layers in between contain neurons that can learn to recognize patterns in the data and extract features that are useful for making predictions.
Each neuron in a neural network has a set of weights and biases that determine how it responds to input. These values are adjusted during training to improve the accuracy of the network’s predictions. The activation function of a neuron determines how it responds to input, such as by applying a threshold or sigmoid function.
from IPython.display import Image
# Image(filename='ai-artificial-neural-network-alex-castrounis.png')
The input layer: The three blue nodes on the left side of the diagram represent the input layer. This layer receives input data, such as pixel values from an image or numerical features from a dataset.
The hidden layer: The four white nodes in the middle of the diagram represent the hidden layer. This layer performs computations on the input data and generates output values that are passed to the output layer.
The output layer: The orange node on the right side of the diagram represents the output layer. This layer generates the final output of the neural network, which can be a binary classification (0 or 1) or a continuous value.
The arrows: The arrows in the diagram represent the connections between nodes in adjacent layers. Each arrow has an associated weight, which is a parameter learned during the training process. The weights determine the strength of the connections between the nodes and are used to compute the output values of each node.
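To make the weights, bias, and activation concrete, here is a minimal sketch of a single neuron in plain NumPy; the input values and weights are made up, and a sigmoid activation is used purely for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # three hypothetical inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input, learned during training
b = 0.2                          # bias term

# The neuron computes a weighted sum of its inputs plus the bias,
# then passes the result through its activation function.
output = sigmoid(np.dot(w, x) + b)
print(output)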
12.2.2 ReLU Activation Function
The ReLU (Rectified Linear Unit) activation function is used in neural networks to introduce non-linearity into the model. Non-linearity allows neural networks to learn more complex relationships between inputs and outputs.
ReLU is a simple function that returns the input if it is positive, and 0 otherwise. This means that ReLU “activates” (returns a non-zero output) only if the input is positive, which can be thought of as a way for the neuron to “turn on” when the input is significant enough. In contrast, a linear function would simply scale the input by a constant factor, which would not introduce any non-linearity into the model.
In simple terms, ReLU allows the neural network to selectively activate certain neurons based on the importance of the input, which helps it learn more complex patterns in the data.
import numpy as np
import matplotlib.pyplot as plt
def linear(x):
    return x

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-10, 10, 100)
y_linear = linear(x)
y_relu = relu(x)

plt.plot(x, y_linear, label='Linear')
plt.plot(x, y_relu, label='ReLU')
plt.legend()
plt.xlabel('Input')
plt.ylabel('Output')
plt.show()
12.2.3 Demonstration
TensorFlow is an open-source software library developed by Google that is widely used for building and training machine learning models, including neural networks. TensorFlow provides a range of tools and abstractions that make it easier to build and optimize complex models, as well as tools for deploying models in production.
Here’s an example of how to use TensorFlow to build a neural network for a softmax regression model:
First we start by importing the proper packages:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import plot_model
from tensorflow.keras.losses import SparseCategoricalCrossentropy
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
2023-04-10 20:47:18.379479: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
TensorFlow and Keras are closely related, as Keras is a high-level API that is built on top of TensorFlow. Keras provides a user-friendly interface for building neural networks, making it easy to create, train, and evaluate models without needing to know the details of TensorFlow’s low-level API.
Keras was initially developed as a standalone library, but since version 2.0, it has been integrated into TensorFlow as its official high-level API. This means that Keras can now be used as a part of TensorFlow, providing a unified and comprehensive platform for deep learning.
In other words, Keras is essentially a wrapper around TensorFlow that provides a simpler and more intuitive interface for building neural networks. While TensorFlow provides a lower-level API that offers more control and flexibility, Keras makes it easier to get started with building deep learning models, especially for beginners.
# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=2.0, random_state=75)

# plot the example dataset
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.title('Example Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
We will talk about three ways to implement a softmax regression machine learning model: first using stochastic gradient descent (SGD) as the optimizer, next using a potentially more efficient algorithm called the Adam algorithm, and finally using the Adam algorithm again, but in a more numerically stable way.
12.2.3.1 Stochastic Gradient Descent
sgd_model = tf.keras.Sequential(
    [
        Dense(10, activation='relu'),
        Dense(5, activation='relu'),
        Dense(4, activation='softmax')    # <-- softmax activation here
    ]
)
sgd_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),    # <-- Note
)
sgd_history = sgd_model.fit(
    X_train, y_train,
    epochs=30
)
Epoch 1/30
1/63 [..............................] - ETA: 38s - loss: 1.7720
35/63 [===============>..............] - ETA: 0s - loss: 1.4569
63/63 [==============================] - 1s 1ms/step - loss: 1.3344
Epoch 2/30
1/63 [..............................] - ETA: 0s - loss: 1.2658
42/63 [===================>..........] - ETA: 0s - loss: 1.0011
63/63 [==============================] - 0s 1ms/step - loss: 0.9616
Epoch 3/30
1/63 [..............................] - ETA: 0s - loss: 0.9663
49/63 [======================>.......] - ETA: 0s - loss: 0.8074
63/63 [==============================] - 0s 1ms/step - loss: 0.7756
Epoch 4/30
1/63 [..............................] - ETA: 0s - loss: 0.7792
48/63 [=====================>........] - ETA: 0s - loss: 0.6900
63/63 [==============================] - 0s 1ms/step - loss: 0.6672
Epoch 5/30
1/63 [..............................] - ETA: 0s - loss: 0.5995
49/63 [======================>.......] - ETA: 0s - loss: 0.6040
63/63 [==============================] - 0s 1ms/step - loss: 0.6121
Epoch 6/30
1/63 [..............................] - ETA: 0s - loss: 0.4462
49/63 [======================>.......] - ETA: 0s - loss: 0.5770
63/63 [==============================] - 0s 1ms/step - loss: 0.5784
Epoch 7/30
1/63 [..............................] - ETA: 0s - loss: 0.7130
49/63 [======================>.......] - ETA: 0s - loss: 0.5520
63/63 [==============================] - 0s 1ms/step - loss: 0.5533
Epoch 8/30
1/63 [..............................] - ETA: 0s - loss: 0.5232
45/63 [====================>.........] - ETA: 0s - loss: 0.5186
63/63 [==============================] - 0s 1ms/step - loss: 0.5324
Epoch 9/30
1/63 [..............................] - ETA: 0s - loss: 0.8772
46/63 [====================>.........] - ETA: 0s - loss: 0.5226
63/63 [==============================] - 0s 1ms/step - loss: 0.5147
Epoch 10/30
1/63 [..............................] - ETA: 0s - loss: 0.5530
46/63 [====================>.........] - ETA: 0s - loss: 0.4912
63/63 [==============================] - 0s 1ms/step - loss: 0.4989
Epoch 11/30
1/63 [..............................] - ETA: 0s - loss: 0.3820
47/63 [=====================>........] - ETA: 0s - loss: 0.4914
63/63 [==============================] - 0s 1ms/step - loss: 0.4848
Epoch 12/30
1/63 [..............................] - ETA: 0s - loss: 0.5388
47/63 [=====================>........] - ETA: 0s - loss: 0.4677
63/63 [==============================] - 0s 1ms/step - loss: 0.4727
Epoch 13/30
1/63 [..............................] - ETA: 0s - loss: 0.5586
47/63 [=====================>........] - ETA: 0s - loss: 0.4674
63/63 [==============================] - 0s 1ms/step - loss: 0.4623
Epoch 14/30
1/63 [..............................] - ETA: 0s - loss: 0.5675
47/63 [=====================>........] - ETA: 0s - loss: 0.4329
63/63 [==============================] - 0s 1ms/step - loss: 0.4523
Epoch 15/30
1/63 [..............................] - ETA: 0s - loss: 0.4606
46/63 [====================>.........] - ETA: 0s - loss: 0.4390
63/63 [==============================] - 0s 1ms/step - loss: 0.4448
Epoch 16/30
1/63 [..............................] - ETA: 0s - loss: 0.5161
49/63 [======================>.......] - ETA: 0s - loss: 0.4613
63/63 [==============================] - 0s 1ms/step - loss: 0.4384
Epoch 17/30
1/63 [..............................] - ETA: 0s - loss: 0.5498
49/63 [======================>.......] - ETA: 0s - loss: 0.4424
63/63 [==============================] - 0s 1ms/step - loss: 0.4326
Epoch 18/30
1/63 [..............................] - ETA: 0s - loss: 0.3196
49/63 [======================>.......] - ETA: 0s - loss: 0.4350
63/63 [==============================] - 0s 1ms/step - loss: 0.4280
Epoch 19/30
1/63 [..............................] - ETA: 0s - loss: 0.4926
50/63 [======================>.......] - ETA: 0s - loss: 0.4204
63/63 [==============================] - 0s 1ms/step - loss: 0.4238
Epoch 20/30
1/63 [..............................] - ETA: 0s - loss: 0.3793
50/63 [======================>.......] - ETA: 0s - loss: 0.4085
63/63 [==============================] - 0s 1ms/step - loss: 0.4200
Epoch 21/30
1/63 [..............................] - ETA: 0s - loss: 0.3395
49/63 [======================>.......] - ETA: 0s - loss: 0.4139
63/63 [==============================] - 0s 1ms/step - loss: 0.4169
Epoch 22/30
1/63 [..............................] - ETA: 0s - loss: 0.3211
36/63 [================>.............] - ETA: 0s - loss: 0.4134
61/63 [============================>.] - ETA: 0s - loss: 0.4142
63/63 [==============================] - 0s 2ms/step - loss: 0.4144
Epoch 23/30
1/63 [..............................] - ETA: 0s - loss: 0.5288
34/63 [===============>..............] - ETA: 0s - loss: 0.4221
63/63 [==============================] - 0s 1ms/step - loss: 0.4126
Epoch 24/30
1/63 [..............................] - ETA: 0s - loss: 0.4448
48/63 [=====================>........] - ETA: 0s - loss: 0.4133
63/63 [==============================] - 0s 1ms/step - loss: 0.4105
Epoch 25/30
1/63 [..............................] - ETA: 0s - loss: 0.4868
49/63 [======================>.......] - ETA: 0s - loss: 0.4047
63/63 [==============================] - 0s 1ms/step - loss: 0.4084
Epoch 26/30
1/63 [..............................] - ETA: 0s - loss: 0.3879
48/63 [=====================>........] - ETA: 0s - loss: 0.4123
63/63 [==============================] - 0s 1ms/step - loss: 0.4062
Epoch 27/30
1/63 [..............................] - ETA: 0s - loss: 0.3862
49/63 [======================>.......] - ETA: 0s - loss: 0.4194
63/63 [==============================] - 0s 1ms/step - loss: 0.4044
Epoch 28/30
1/63 [..............................] - ETA: 0s - loss: 0.4644
50/63 [======================>.......] - ETA: 0s - loss: 0.4052
63/63 [==============================] - 0s 1ms/step - loss: 0.4031
Epoch 29/30
1/63 [..............................] - ETA: 0s - loss: 0.3527
49/63 [======================>.......] - ETA: 0s - loss: 0.3962
63/63 [==============================] - 0s 1ms/step - loss: 0.4023
Epoch 30/30
1/63 [..............................] - ETA: 0s - loss: 0.5316
49/63 [======================>.......] - ETA: 0s - loss: 0.3994
63/63 [==============================] - 0s 1ms/step - loss: 0.4012
Here is a step-by-step explanation of the code:

First, we create a sequential model using the tf.keras.Sequential() function. This is a linear stack of layers; here the layers are passed to the constructor as a list.
Then we add three Dense layers to the model. The first two layers use the relu activation function and the last layer uses the softmax activation function.
We import SparseCategoricalCrossentropy from tensorflow.keras.losses. This is our loss function, which will be used to evaluate the model during training.
We compile the model using sgd_model.compile(), specifying SparseCategoricalCrossentropy() as our loss function.
We fit the model to the training data using sgd_model.fit(), specifying the training data (X_train and y_train) and the number of epochs* (30).

In summary, the code creates a sequential model with three dense layers, using the relu activation function in the first two layers and the softmax activation function in the output layer. The model is then compiled with the SparseCategoricalCrossentropy() loss function, and finally trained for 30 epochs using the fit() method.
*In machine learning, the term “epochs” refers to the number of times the entire training dataset is used to train the model. During each epoch, the model processes the entire dataset, updates its parameters based on the computed errors, and moves on to the next epoch until the desired level of accuracy is achieved. Increasing the number of epochs may improve the model accuracy, but it also increases the risk of overfitting on the training data. Therefore, the number of epochs is a hyperparameter that must be tuned to achieve the best possible results.
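One common way to watch for the overfitting mentioned above while tuning the number of epochs is to hold out part of the training data for validation; a hedged sketch (the 20% split is an arbitrary choice, not part of the original example):

# Keras sets aside the last 20% of the data and reports val_loss after each
# epoch; a val_loss that rises while loss keeps falling is a sign of overfitting.
history = sgd_model.fit(
    X_train, y_train,
    epochs=30,
    validation_split=0.2
)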
sgd_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 10) 30
dense_1 (Dense) (None, 5) 55
dense_2 (Dense) (None, 4) 24
=================================================================
Total params: 109
Trainable params: 109
Non-trainable params: 0
_________________________________________________________________
In this example, the first hidden layer has 10 neurons, so there are 10 * 2 + 10 = 30 parameters (2 input features, plus 10 bias terms). The second hidden layer has 5 neurons, so there are 5 * 10 + 5 = 55 parameters (10 inputs from the previous layer, plus 5 bias terms). The output layer has 4 neurons, so there are 4 * 5 + 4 = 24 parameters (5 inputs from the previous layer, plus 4 bias terms).
The Non-trainable params: 0 line means that none of the layers have been marked as non-trainable; all 109 parameters are updated during training.
The None values in the output shape column represent the variable batch size that is inputted during the training process.
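As a quick sketch of what that means in practice, the same fitted model accepts any number of rows at prediction time:

# The leading None is the batch dimension, so the batch size is free to vary.
print(sgd_model.predict(X_train[:5]).shape)    # (5, 4)
print(sgd_model.predict(X_train[:500]).shape)  # (500, 4)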
p_nonpreferred = sgd_model.predict(X_train)
print(p_nonpreferred[:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
1/63 [..............................] - ETA: 4s
61/63 [============================>.] - ETA: 0s
63/63 [==============================] - 0s 857us/step
[[3.0265430e-05 9.9000406e-01 7.7044405e-03 2.2612682e-03]
[9.8559028e-09 2.8865002e-03 4.3772280e-02 9.5334125e-01]]
largest value 0.9999842 smallest value 1.1022385e-20
p_nonpreferred = sgd_model.predict(X_train): This line uses the predict method of the model object to make predictions on the input data X_train. The resulting predictions are stored in the p_nonpreferred variable.
print(p_nonpreferred[:2]): This line prints the first two rows of p_nonpreferred. Each row represents the predicted probabilities for a single observation in the training set. The four columns represent the predicted probabilities for each of the four classes in the dataset.
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred)): This line prints the largest and smallest values from p_nonpreferred, which gives an idea of the range of the predictions. The np.max and np.min functions from NumPy are used to find the maximum and minimum values in p_nonpreferred.
The output is a matrix with two printed rows (we displayed two input examples) and four columns (because the output layer has four neurons). Each element of the matrix is the probability that the input example belongs to the corresponding class. For example, the probability that the first input example belongs to the second class (which has the highest probability) is about 0.990.
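If we want hard class labels rather than probabilities, a small follow-up sketch is to take the argmax across the four columns:

# Pick the most probable class for each example.
predicted_classes = np.argmax(p_nonpreferred, axis=1)
print(predicted_classes[:2])  # the first example is assigned class 1, the second class 3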
12.2.3.2 ADAM Algorithm
adam_model = Sequential(
    [
        Dense(25, activation='relu'),
        Dense(15, activation='relu'),
        Dense(4, activation='softmax')    # <-- softmax activation here
    ]
)
adam_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),    # <-- change to 0.01 and rerun
)

adam_history = adam_model.fit(
    X_train, y_train,
    epochs=30
)
Epoch 1/30
1/63 [..............................] - ETA: 29s - loss: 2.1227
46/63 [====================>.........] - ETA: 0s - loss: 1.4394
63/63 [==============================] - 1s 1ms/step - loss: 1.3151
Epoch 2/30
1/63 [..............................] - ETA: 0s - loss: 0.9224
43/63 [===================>..........] - ETA: 0s - loss: 0.7678
63/63 [==============================] - 0s 1ms/step - loss: 0.7279
Epoch 3/30
1/63 [..............................] - ETA: 0s - loss: 0.5913
49/63 [======================>.......] - ETA: 0s - loss: 0.5599
63/63 [==============================] - 0s 1ms/step - loss: 0.5584
Epoch 4/30
1/63 [..............................] - ETA: 0s - loss: 0.5087
49/63 [======================>.......] - ETA: 0s - loss: 0.5084
63/63 [==============================] - 0s 1ms/step - loss: 0.5000
Epoch 5/30
1/63 [..............................] - ETA: 0s - loss: 0.5003
48/63 [=====================>........] - ETA: 0s - loss: 0.4758
63/63 [==============================] - 0s 1ms/step - loss: 0.4721
Epoch 6/30
1/63 [..............................] - ETA: 0s - loss: 0.2643
48/63 [=====================>........] - ETA: 0s - loss: 0.4650
63/63 [==============================] - 0s 1ms/step - loss: 0.4582
Epoch 7/30
1/63 [..............................] - ETA: 0s - loss: 0.5475
47/63 [=====================>........] - ETA: 0s - loss: 0.4494
63/63 [==============================] - 0s 1ms/step - loss: 0.4458
Epoch 8/30
1/63 [..............................] - ETA: 0s - loss: 0.3605
47/63 [=====================>........] - ETA: 0s - loss: 0.4394
63/63 [==============================] - 0s 1ms/step - loss: 0.4361
Epoch 9/30
1/63 [..............................] - ETA: 0s - loss: 0.4289
48/63 [=====================>........] - ETA: 0s - loss: 0.4337
63/63 [==============================] - 0s 1ms/step - loss: 0.4278
Epoch 10/30
1/63 [..............................] - ETA: 0s - loss: 0.5932
48/63 [=====================>........] - ETA: 0s - loss: 0.4265
63/63 [==============================] - 0s 1ms/step - loss: 0.4206
Epoch 11/30
1/63 [..............................] - ETA: 0s - loss: 0.3905
47/63 [=====================>........] - ETA: 0s - loss: 0.4015
63/63 [==============================] - 0s 1ms/step - loss: 0.4156
Epoch 12/30
1/63 [..............................] - ETA: 0s - loss: 0.3275
47/63 [=====================>........] - ETA: 0s - loss: 0.4049
63/63 [==============================] - 0s 1ms/step - loss: 0.4085
Epoch 13/30
1/63 [..............................] - ETA: 0s - loss: 0.4582
48/63 [=====================>........] - ETA: 0s - loss: 0.4046
63/63 [==============================] - 0s 1ms/step - loss: 0.4050
Epoch 14/30
1/63 [..............................] - ETA: 0s - loss: 0.3297
47/63 [=====================>........] - ETA: 0s - loss: 0.4181
63/63 [==============================] - 0s 1ms/step - loss: 0.4021
Epoch 15/30
1/63 [..............................] - ETA: 0s - loss: 0.3931
48/63 [=====================>........] - ETA: 0s - loss: 0.3867
63/63 [==============================] - 0s 1ms/step - loss: 0.4025
Epoch 16/30
1/63 [..............................] - ETA: 0s - loss: 0.3410
40/63 [==================>...........] - ETA: 0s - loss: 0.4134
63/63 [==============================] - 0s 1ms/step - loss: 0.3997
Epoch 17/30
1/63 [..............................] - ETA: 0s - loss: 0.2734
48/63 [=====================>........] - ETA: 0s - loss: 0.3849
63/63 [==============================] - 0s 1ms/step - loss: 0.3969
Epoch 18/30
1/63 [..............................] - ETA: 0s - loss: 0.3366
47/63 [=====================>........] - ETA: 0s - loss: 0.3955
63/63 [==============================] - 0s 1ms/step - loss: 0.3970
Epoch 19/30
1/63 [..............................] - ETA: 0s - loss: 0.4878
48/63 [=====================>........] - ETA: 0s - loss: 0.3868
63/63 [==============================] - 0s 1ms/step - loss: 0.3957
Epoch 20/30
1/63 [..............................] - ETA: 0s - loss: 0.2970
47/63 [=====================>........] - ETA: 0s - loss: 0.3980
63/63 [==============================] - 0s 1ms/step - loss: 0.3939
Epoch 21/30
1/63 [..............................] - ETA: 0s - loss: 0.4061
47/63 [=====================>........] - ETA: 0s - loss: 0.4047
63/63 [==============================] - 0s 1ms/step - loss: 0.3944
Epoch 22/30
1/63 [..............................] - ETA: 0s - loss: 0.2295
47/63 [=====================>........] - ETA: 0s - loss: 0.3838
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 23/30
1/63 [..............................] - ETA: 0s - loss: 0.4797
48/63 [=====================>........] - ETA: 0s - loss: 0.3941
63/63 [==============================] - 0s 1ms/step - loss: 0.3928
Epoch 24/30
1/63 [..............................] - ETA: 0s - loss: 0.6253
47/63 [=====================>........] - ETA: 0s - loss: 0.3789
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 25/30
1/63 [..............................] - ETA: 0s - loss: 0.3452
47/63 [=====================>........] - ETA: 0s - loss: 0.3896
63/63 [==============================] - 0s 1ms/step - loss: 0.3923
Epoch 26/30
1/63 [..............................] - ETA: 0s - loss: 0.2694
47/63 [=====================>........] - ETA: 0s - loss: 0.4114
63/63 [==============================] - 0s 1ms/step - loss: 0.3926
Epoch 27/30
1/63 [..............................] - ETA: 0s - loss: 0.3815
48/63 [=====================>........] - ETA: 0s - loss: 0.3874
63/63 [==============================] - 0s 1ms/step - loss: 0.3908
Epoch 28/30
1/63 [..............................] - ETA: 0s - loss: 0.5244
48/63 [=====================>........] - ETA: 0s - loss: 0.3865
63/63 [==============================] - 0s 1ms/step - loss: 0.3920
Epoch 29/30
1/63 [..............................] - ETA: 0s - loss: 0.4065
48/63 [=====================>........] - ETA: 0s - loss: 0.4086
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 30/30
1/63 [..............................] - ETA: 0s - loss: 0.3673
48/63 [=====================>........] - ETA: 0s - loss: 0.3790
63/63 [==============================] - 0s 1ms/step - loss: 0.3907
adam_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_3 (Dense) (None, 25) 75
dense_4 (Dense) (None, 15) 390
dense_5 (Dense) (None, 4) 64
=================================================================
Total params: 529
Trainable params: 529
Non-trainable params: 0
_________________________________________________________________
The None values in the Output Shape column stand for the batch size, which is not fixed until data is passed to the model during training. The number of parameters in each layer depends on the number of inputs to the layer and the number of neurons in it, plus one bias term per neuron.
In this example, the first hidden layer has 25 neurons and receives 2 input features, so there are 25 * 2 + 25 = 75 parameters (50 weights plus 25 bias terms). The second hidden layer has 15 neurons, so there are 15 * 25 + 15 = 390 parameters (25 inputs from the previous layer, plus 15 bias terms). The output layer has 4 neurons, so there are 15 * 4 + 4 = 64 parameters (15 inputs from the previous layer, plus 4 bias terms).
The line Non-trainable params: 0 means that none of the layers have been marked as non-trainable, so all 529 parameters are updated during training.
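As a quick sanity check, these counts can be reproduced by hand. The small sketch below (my own addition) just applies the weights-plus-biases formula described above; the 2-feature input is inferred from the 75-parameter first layer.
def dense_params(n_inputs, n_units):
    # a Dense layer stores one weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

print(dense_params(2, 25))   # 75  -> dense_3 (2 input features)
print(dense_params(25, 15))  # 390 -> dense_4
print(dense_params(15, 4))   # 64  -> dense_5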
p_nonpreferred = adam_model.predict(X_train)
print(p_nonpreferred[:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
63/63 [==============================] - 0s 866us/step
[[3.7956154e-03 9.6981263e-01 1.5898595e-02 1.0493187e-02]
[4.6749294e-05 3.6971366e-03 6.8161853e-02 9.2809433e-01]]
largest value 0.999983 smallest value 1.492862e-13
Here, the only difference between these two machine learning models is the optimizer. That line of code, optimizer=tf.keras.optimizers.Adam(0.001), specifies the optimizer to be used during training; in this case, the Adam optimizer with a learning rate of 0.001. Adam is an adaptive optimization algorithm that is commonly used in deep learning because it dynamically adjusts the step size for each parameter during training, which often makes training converge faster and more reliably than plain SGD.
import numpy as np
import matplotlib.pyplot as plt

# Define the objective function (quadratic)
def objective(x, y):
    return x**2 + y**2

# Define the Adam update rule
def adam_update(x, y, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = np.array([2*x, 2*y])                 # gradient of the quadratic objective
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)               # bias-corrected second moment
    dx = -alpha * m_hat[0] / (np.sqrt(v_hat[0]) + eps)
    dy = -alpha * m_hat[1] / (np.sqrt(v_hat[1]) + eps)
    return dx, dy, m, v

# Define the parameters for the optimization
theta = np.array([2.0, 2.0])
m = np.zeros(2)
v = np.zeros(2)
t = 0
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8

# Generate the parameter space grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = objective(X, Y)

# Generate the parameter space plot
fig, ax = plt.subplots()
ax.contour(X, Y, Z, levels=30, cmap='jet')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Parameter Space of Adam')

# Perform several iterations of Adam and plot the updates
for i in range(20):
    t += 1
    dx, dy, m, v = adam_update(theta[0], theta[1], m, v, t, alpha, beta1, beta2, eps)
    theta += np.array([dx, dy])
    ax.arrow(theta[0]-dx, theta[1]-dy, dx, dy, head_width=0.1, head_length=0.1, fc='b', ec='b')

plt.show()
plt.plot(sgd_history.history['loss'], label='SGD')
plt.plot(adam_history.history['loss'], label='Adam')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
12.2.3.3 Preferred ADAM Algorithm
As we have talked about in class before, numerical roundoff errors happen when coding in Python because floating-point numbers are stored with finite precision.
x1 = 2.0 / 10000
print(f"{x1:.18f}")  # print 18 digits to the right of the decimal point
0.000200000000000000
x2 = 1 + (1/10000) - (1 - 1/10000)
print(f"{x2:.18f}")
0.000199999999999978
It turns out that while the implementation of the loss function for softmax was correct, there is a different and better way of reducing numerical roundoff errors which leads to more accurate computations.
If we go back to how a loss function for softmax regression is implemented we see that the loss function is expressed in the following formula: \[ \text{loss}(a_1, a_2, \dots, a_n, y) = \begin{cases} -\log(a_1) & \text{if } y = 1 \\ -\log(a_2) & \text{if } y = 2 \\ \vdots & \vdots \\ -\log(a_n) & \text{if } y = n \end{cases} \]
where \(a_j\) is computed from: \[ a_j = \frac{e^{z_j}}{\sum\limits_{k=1}^n e^{z_k}} = P(y=j \mid \vec{x}) \]
Computing \(a_j\) as an intermediate value and then passing it to the loss can lead to numerical roundoff errors in TensorFlow; it is more accurate to let the loss work with the \(z_j\) values directly and combine the two steps.
In terms of code, that is exactly what loss=SparseCategoricalCrossentropy()
is doing. Therefore, it would be more accurate if we could implement the loss function as follows: \[
\text{loss}(a_1, a_2, \dots, a_n, y) =
\begin{cases}
-\log(\frac{e^{z_1}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 1 \\
-\log(\frac{e^{z_2}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 2 \\
\vdots & \vdots \\
-\log(\frac{e^{z_n}}{\sum\limits_{k=1}^n e^{z_k}}) & \text{if } y = n
\end{cases}
\]
We achieve this in two steps. The first is making the output layer a linear activation; the second is adding a from_logits=True
parameter to the loss=tf.keras.losses.SparseCategoricalCrossentropy
line of code. By using a linear activation function instead of softmax, the model will output a vector of raw scores (logits) rather than probabilities, and TensorFlow can combine the softmax and the cross-entropy internally in a numerically stable way.
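Before looking at the Keras code, here is a small NumPy sketch (my own toy example, not part of the original notebook) of why combining the softmax and the log helps: with one very large logit, the naive two-step computation overflows, while the algebraically rearranged version stays well behaved.
import numpy as np

z = np.array([1.0, 2.0, 500.0])  # toy logits with one very large entry
y = 2                            # suppose the true class is the third one

# naive route: compute the softmax probabilities first, then take -log(a_y)
a = np.exp(z) / np.sum(np.exp(z))   # np.exp(500) overflows to inf, so a becomes [0, 0, nan]
print(-np.log(a[y]))                # nan

# rearranged route: -log(e^{z_y} / sum_k e^{z_k}) = log(sum_k e^{z_k}) - z_y,
# evaluated with the max subtracted out so nothing overflows
m = np.max(z)
print(m + np.log(np.sum(np.exp(z - m))) - z[y])  # 0.0 (the exact loss is ~1e-216, far below float precision) instead of nan
With that in mind, here is the preferred model in Keras: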
preferred_model = Sequential(
    [
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_history = preferred_model.fit(
    X_train, y_train,
    epochs=30
)
Epoch 1/30
63/63 [==============================] - 1s 1ms/step - loss: 1.1161
Epoch 2/30
63/63 [==============================] - 0s 1ms/step - loss: 0.6332
Epoch 3/30
63/63 [==============================] - 0s 1ms/step - loss: 0.5024
Epoch 4/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4612
Epoch 5/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4412
Epoch 6/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4306
Epoch 7/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4233
Epoch 8/30
63/63 [==============================] - 0s 2ms/step - loss: 0.4162
Epoch 9/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4114
Epoch 10/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4089
Epoch 11/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4047
Epoch 12/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4031
Epoch 13/30
63/63 [==============================] - 0s 1ms/step - loss: 0.4015
Epoch 14/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3983
Epoch 15/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3974
Epoch 16/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3970
Epoch 17/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3951
Epoch 18/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3955
Epoch 19/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3941
Epoch 20/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 21/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3910
Epoch 22/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3906
Epoch 23/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 24/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3909
Epoch 25/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3908
Epoch 26/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3890
Epoch 27/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3918
Epoch 28/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3898
Epoch 29/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3899
Epoch 30/30
63/63 [==============================] - 0s 1ms/step - loss: 0.3897
p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))
63/63 [==============================] - 0s 862us/step
two example output vectors:
[[-0.6257403 5.117499 0.07288799 1.0743215 ]
[-2.9822855 0.81028026 3.5368693 6.0620565 ]]
largest value 18.288116 smallest value -6.9841967
Notice that in the preferred model the outputs are not probabilities; they can range from large negative numbers to large positive numbers. If probabilities are desired, the output must be passed through a softmax.
sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))
two example output vectors:
[[3.11955088e-03 9.73529756e-01 6.27339305e-03 1.70773119e-02]
[1.08768334e-04 4.82606189e-03 7.37454817e-02 9.21319604e-01]]
largest value 0.999989 smallest value 1.80216e-11
This code applies the softmax function to p_preferred, the raw outputs (logits) of the preferred model, and then converts the resulting tensor to a NumPy array using the .numpy() method. The resulting array sm_preferred contains the probabilities for each of the possible output classes for the input data.
The second line of code then prints the first two rows of sm_preferred, which correspond to the probabilities for the first two input examples in the dataset.
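If only the predicted category is needed, the softmax step can be skipped entirely, because softmax is monotonic and the largest logit and the largest probability always occur at the same index. A small sketch (my own addition):
predicted_classes = np.argmax(p_preferred, axis=1)  # index of the largest logit for each example
print(predicted_classes[:2])  # agrees with np.argmax(sm_preferred, axis=1)[:2]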
Let's check the loss curves one final time:
plt.plot(adam_history.history['loss'], label='ADAM')
plt.plot(preferred_history.history['loss'], label='Pref_ADAM')
plt.plot(sgd_history.history['loss'], label='SGD')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()
12.2.4 References
- https://www.tensorflow.org/api_docs/python/tf/nn/softmax
- https://www.tensorflow.org/
- https://www.whyofai.com/blog/ai-explained
- https://www.coursera.org/specializations/machine-learning-introduction
12.3 Web Scraping with Selenium (by Michael Zheng)
Selenium is a free, open-source automation testing suite for web applications that works across different browsers and platforms, with a focus on automating web-based applications.
12.3.1 Selenium vs BeautifulSoup?
Selenium is a web browser automation tool that can interact with web pages like a human user, whereas BeautifulSoup is a library for parsing HTML and XML documents. This means Selenium has more functionality since it can automate browser actions such as clicking buttons, filling out forms and navigating between pages.
However, Selenium is not as fast as BeautifulSoup. Thus, if your web scraping problem can be solved with BeautifulSoup, use that.
An example of a website that can't be scraped with BeautifulSoup is one that doesn't fully load unless prompted to: https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid.
- Go to the link and inspect the first photo
- Collapse the ‘TaxonPhoto undefined’ div container and scroll to the last ‘TaxonPhoto undefined’
- Go back to the web page and scroll down to load new images
See those ‘TaxonPhoto undefined’ elements that are popping up on the right side of the screen as we scroll? Those are more photos that are being rendered as we directly interact with the web page. BeautifulSoup can only scrape HTML elements from what’s already loaded on the web page. It cannot dynamically interact with the page to load more HTML elements. Luckily, Selenium can do that!
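To make the contrast concrete, here is a minimal sketch (illustrative only; example.com is a placeholder URL, and a local Chrome driver is assumed to be available). BeautifulSoup only ever sees the HTML returned by a single request, while Selenium drives a browser that can scroll so more elements get rendered.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://example.com"  # placeholder URL

# BeautifulSoup: parse the HTML returned by one request; nothing more is ever loaded
soup = BeautifulSoup(requests.get(url).text, "html.parser")
print(len(soup.find_all("div")))  # only the divs present in the initial response

# Selenium: drive a real browser, which can scroll to trigger lazy loading
driver = webdriver.Chrome()
driver.get(url)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom
print(len(driver.find_elements(By.XPATH, "//div")))  # on a dynamic page, this now includes newly rendered divs
driver.quit()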
12.3.2 Example: Plant Images Scraper
I will demonstrate the functionality of Selenium by building a program to scrape plant images from a website. Hopefully, this example will be enough for anybody following along to get started with Selenium.
12.3.2.1 Components of a Website
Websites are developed using 3 main languages: javascript, html, and css.
We don't need to get too far into what each of these languages does, but just know that html describes the content of a website and how it is organized for the browser; that is what we will interact with to extract data from the website.
12.3.2.2 HTML
In HTML, the contents of a website are organized into containers called div.
These div containers are given identifiers using class and id:
<div class="widget"></div>
In this example, the div container is given the class name "widget".
<div id="widget"></div>
In this example, the div container is given the id name "widget".
We can use the find_elements method in Selenium to retrieve the containers that we want by using their XPATH, which is the address of the containers specified in the HTML file.
Say we want to retrieve all the "widget" containers on a web page. Then, we can use the find_elements method. The method can locate containers in many ways, but here we specify By.XPATH and look for containers whose ids start with "widget"; we can do the same with classes by replacing @id with @class (see the sketch after the code below).
find_elements(By.XPATH, "//*[starts-with(@id, 'widget')]")
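For completeness, here is a sketch of the class-based variant mentioned above (it assumes a Selenium driver object already exists; the "widget" class name is just an illustration):
# find all elements whose class attribute starts with 'widget'
widgets = driver.find_elements(By.XPATH, "//*[starts-with(@class, 'widget')]")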
12.3.2.3 Additional Selenium Functionalities
Selenium is very powerful and contains many useful features for interacting with browsers. We will not be using most of them in this project, but they’re still good to know.
As we mentioned earlier, find_elements will retrieve all matching elements on the page. There is also find_element (note that element is singular), which returns only one element of the specified type: the first one it comes across.
Besides XPATH, there are other techniques for locating div containers. For instance, we can also use:
# Find the element with name "my-element"
element = driver.find_element(By.NAME, 'my-element')
# Find the element with ID "my-element"
element = driver.find_element(By.ID, 'my-element')
# Find the element with class name "my-element"
element = driver.find_element(By.CLASS_NAME, 'my-element')
# Find the element with CSS selector "#my-element .my-class"
element = driver.find_element(By.CSS_SELECTOR, '#my-element .my-class')
==========================================================================================
You can also interact with text fields in browsers via Selenium.
Say you are automating a scraper that needs to login to a website. Well we know how to find the elements using the find_element
method:
# Find the username and password fields
username_field = driver.find_element(By.NAME, 'username')
password_field = driver.find_element(By.NAME, 'password')
Now those two variables are pointing to the corresponding text fields on the page. So, we can enter in our username and password by using the send_keys
method:
username_field.send_keys('myusername')
password_field.send_keys('mypassword')
To complete the login, we need to click on the login button. We can do this by using the click
method:
# Find the login button and click it
login_button = driver.find_element(By.XPATH, '//button[@type="submit"]')
login_button.click()
==========================================================================================
Sometimes you may need to wait for an element to appear on the page before you can interact with it. You can do this using the WebDriverWait class provided by Selenium. For example:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
search_results = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'search')))
Selenium will wait for a maximum of 10 seconds for the element with the id "search" to appear on the page. If 10 seconds pass and the element doesn't appear, a TimeoutException is raised. Otherwise, the driver retrieves the element and stores it in the variable search_results.
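If you want the scraper to keep going when the element never shows up, the TimeoutException can be caught; a minimal sketch (assuming driver is already defined):
from selenium.common.exceptions import TimeoutException

try:
    search_results = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'search'))
    )
except TimeoutException:
    search_results = None  # the element never appeared; decide how to handle this case here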
Now that we have an understanding of how to interact with HTML elements using Selenium, let's get started with building the program!
- Step 1: Import Libraries
import time # will be used to allow sufficient time for web pages to load
import requests # will be used to send requests to web pages to download images
# selenium functions
from selenium import webdriver # how selenium uses the browser on your laptop
from selenium.webdriver.chrome.service import Service # tells selenium what browser to use
from webdriver_manager.chrome import ChromeDriverManager # a package to manage chrome driver dependencies so you don't have to
from selenium.webdriver.common.by import By # method for using XPATHS to locate div elements
- Step 2: Scrape Image Links
Let's make a plan for how we are going to scrape these images:
Go to this link:
https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid
Scroll down; notice how the page takes some time to load more images (this is where the ‘time’ library will come into play)
Right click on a picture and Inspect
Navigate to the div container with id that starts with ‘cover-image…’
Notice that the images are stored in an AWS S3 data lake, with the link to the image encapsulated by url(…)
Copy and paste the link into browser to open the image
One more important point: the image URL is mixed in with a bunch of other text that starts with "width: 100%…", so we need to remove all the text surrounding the link
Let's define a function called image_links_scraper. Its job will be to extract the image link for each image on the website. It takes in 2 parameters: link, the website that we want to scrape, and max_images, the total number of images we want to scrape.
def image_links_scraper(link, max_images):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    # 1. downloads the latest google chrome driver (executable that selenium uses to launch google chrome)
    # 2. service is responsible for starting the webdriver, an interface for interacting with browsers, using the chrome driver
    # 3. once the webdriver is started, we can use it to interact with chrome
    # whenever we want to interact with the browser we call a method from driver

    driver.get(link)  # the get method opens the browser to the specified link

    image_links = []  # we will store the scraped image links in this list

    ### ISSUE (step 2) ###
    current_height = driver.execute_script("return document.body.scrollHeight")
    # executes a javascript command to get the current height of the page
    # (the length of the page from top to bottom before it loads new images)

    while True:  # keep scrolling down on the browser to load new images until we reach the end of the page
        driver.execute_script(f"window.scrollTo({current_height}, document.body.scrollHeight);")
        # run a javascript command to scroll to the bottom of the page

        elements = driver.find_elements(By.XPATH, "//*[starts-with(@id, 'cover')]")
        # find all elements whose 'id' attribute starts with 'cover', because these div containers hold the image links

        if len(elements) >= max_images:  # check whether we have collected enough image containers, as specified by max_images
            break  # if so, stop scrolling

        time.sleep(5)  # wait for the page to load; dependent on internet speed

        new_height = driver.execute_script("return document.body.scrollHeight")
        # get the new page height after scrolling

        if current_height == new_height:  # the page height has stopped changing
            break  # we've reached the end of the page and need to stop scrolling
        else:
            current_height = new_height  # otherwise, we need to keep scrolling

    # at this point, we have not scraped any images;
    # we only have the div container elements that contain the image links we want to extract.
    # now we go through each element and extract the links
    for element in elements:
        ### ISSUE (step 7) ###
        s = element.get_attribute('style')  # returns the text in the 'style' attribute
        start = 'width: 100%; min-height: 183px; background-size: cover; background-position: center center; background-repeat: no-repeat; background-image: url("'
        # the useless text before the link
        end = '");'  # the useless text after the link
        link = s[len(start):-len(end)]  # perform string slicing to get only the URL from the entire string
        image_links.append(link)  # add the image link to the list
        print(link)  # print the links as we extract them to watch the function work in real time

    driver.quit()  # once we're done automating the browser, we should close it using the quit() method of the driver object

    return image_links
- Step 3: Download the Images
Now, we take the image links extracted from the previous step and download the images located at each link.
Let's define a function called download_images that takes in 2 parameters: image_links, the list returned by image_links_scraper, and folder_name, the name of the folder to save the scraped images to.
def download_images(image_links, folder_name):
    i = 1  # keep track of the image number to give each image an identifier

    for link in image_links:  # iterate through all the image links
        r = requests.get(link).content  # retrieve the image content by sending a request to the URL
        file_name = f'{folder_name}/{i}.jpg'  # generate the image file name (image number) and directory

        with open(file_name, 'wb') as f:
            f.write(r)  # save the image

        i += 1  # update the image number for the next iteration
- Step 4: Run Everything All Together
The result is a dataset of plant images saved in a folder called _selenium_download.
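One detail worth noting: download_images writes into folder_name but does not create the folder. If it does not already exist on your machine, a small addition like the following (my own sketch, run after folder_name is assigned and before calling download_images) creates it first:
import os

os.makedirs(folder_name, exist_ok=True)  # create the output folder if it is missing; does nothing if it already exists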
# if __name__ == '__main__':
link = 'https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid'  # website to scrape images from
max_images = 20  # number of images to scrape
folder_name = '_selenium_download'  # name of folder to save images to

image_links = image_links_scraper(link, max_images)
download_images(image_links, folder_name)
https://inaturalist-open-data.s3.amazonaws.com/photos/2418087/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/7400447/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/7400452/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/148667986/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148667997/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668006/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668015/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668025/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668036/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668042/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668053/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668061/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668071/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668077/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668090/medium.jpeg
https://static.inaturalist.org/photos/114478526/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630561/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630572/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630582/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630593/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/101461689/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/194174354/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/194174425/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/51452451/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455593/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455616/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455665/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/98405618/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/101535531/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/101535544/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/150810477/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/163894686/medium.jpg