12  Advanced Topics

12.1 Text Analysis with nltk (by Shivaram Karandikar)

12.1.1 Introduction

nltk, the Natural Language Toolkit, is a Python package that provides a set of tools for text analysis. It is widely used in Natural Language Processing (NLP), a field of computer science that focuses on the interaction between computers and human languages, and it is a popular choice among researchers and data scientists. In this tutorial, we will learn how to use nltk to analyze text.

12.1.2 Getting Started

First, we must install nltk using pip.

python -m pip install nltk

Specific functions require additional datasets and models to work. We can download a popular subset with

python -m nltk.downloader popular

12.1.3 Tokenizing

To analyze text, it needs to be broken down into smaller pieces. This is called tokenization. nltk offers two ways to tokenize text: sentence tokenization and word tokenization.

import nltk

To demonstrate this, we will use the following text, a passage from the 1951 science fiction novel Foundation by Isaac Asimov.

fd_string = """The sum of human knowing is beyond any one man; any thousand men. With the destruction of our social fabric, science will be broken into a million pieces. Individuals will know much of exceedingly tiny facets of what there is to know. They will be helpless and useless by themselves. The bits of lore, meaningless, will not be passed on. They will be lost through the generations. But, if we now prepare a giant summary of all knowledge, it will never be lost. Coming generations will build on it, and will not have to rediscover it for themselves. One millennium will do the work of thirty thousand."""

12.1.3.1 Sentence Tokenization

from nltk import sent_tokenize, word_tokenize
nltk.download("popular") # only needs to download once
fd_sent = sent_tokenize(fd_string)
print(fd_sent)
[nltk_data] Downloading collection 'popular'
[nltk_data]    | ...
[nltk_data]  Done downloading collection popular
['The sum of human knowing is beyond any one man; any thousand men.', 'With the destruction of our social fabric, science will be broken into a million pieces.', 'Individuals will know much of exceedingly tiny facets of what there is to know.', 'They will be helpless and useless by themselves.', 'The bits of lore, meaningless, will not be passed on.', 'They will be lost through the generations.', 'But, if we now prepare a giant summary of all knowledge, it will never be lost.', 'Coming generations will build on it, and will not have to rediscover it for themselves.', 'One millennium will do the work of thirty thousand.']

12.1.3.2 Word Tokenization

fd_word = word_tokenize(fd_string)
print(fd_word)
['The', 'sum', 'of', 'human', 'knowing', 'is', 'beyond', 'any', 'one', 'man', ';', 'any', 'thousand', 'men', '.', 'With', 'the', 'destruction', 'of', 'our', 'social', 'fabric', ',', 'science', 'will', 'be', 'broken', 'into', 'a', 'million', 'pieces', '.', 'Individuals', 'will', 'know', 'much', 'of', 'exceedingly', 'tiny', 'facets', 'of', 'what', 'there', 'is', 'to', 'know', '.', 'They', 'will', 'be', 'helpless', 'and', 'useless', 'by', 'themselves', '.', 'The', 'bits', 'of', 'lore', ',', 'meaningless', ',', 'will', 'not', 'be', 'passed', 'on', '.', 'They', 'will', 'be', 'lost', 'through', 'the', 'generations', '.', 'But', ',', 'if', 'we', 'now', 'prepare', 'a', 'giant', 'summary', 'of', 'all', 'knowledge', ',', 'it', 'will', 'never', 'be', 'lost', '.', 'Coming', 'generations', 'will', 'build', 'on', 'it', ',', 'and', 'will', 'not', 'have', 'to', 'rediscover', 'it', 'for', 'themselves', '.', 'One', 'millennium', 'will', 'do', 'the', 'work', 'of', 'thirty', 'thousand', '.']

Both the sentence tokenization and word tokenization functions return a list of strings. We can use these lists to perform further analysis.
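For example, we can check how many sentences and how many word-level tokens the passage contains:

print(len(fd_sent))  # number of sentences (9 for this passage)
print(len(fd_word))  # number of word and punctuation tokens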

12.1.4 Removing Stopwords

The output of the word tokenization gave us a list of words. However, some of these words are not useful for our analysis. These words are called stopwords. nltk provides a list of stopwords for several languages. We can use this list to remove stopwords from our text.

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)
{"you'd", 'being', 'at', 'after', 'their', 's', 'all', 'couldn', 'here', 'same', 'she', "weren't", 'some', 'too', 'i', 'can', 'of', 'you', 'than', 'd', "you'll", 'the', 't', 'above', "should've", 'has', 'ours', 'where', 'from', 'wouldn', 'is', 'hasn', 'am', 'its', 'ain', 'our', 'should', 'in', 'your', 'those', 'had', 'if', 'weren', 'into', 'have', "doesn't", "isn't", 'aren', 'me', 'whom', 'how', 'll', 'themselves', 'and', 'myself', 'over', 'once', 'did', 'then', 'so', 'we', 'mustn', 'won', 'both', 'between', 'now', 'with', "she's", 'shan', 'about', 'his', 'itself', 'or', 'up', "wouldn't", "couldn't", "aren't", 'having', 'these', 'to', 'each', 'own', 'were', 'until', 'been', 'are', 'yours', 'off', 'this', 'such', 'don', 've', 'they', 'was', "shan't", 'hadn', 'an', "hadn't", 'under', 'when', 'haven', "mustn't", 'who', "didn't", "wasn't", 'ma', 'just', 'mightn', 'very', 'be', 'no', 'more', 'other', 'out', "don't", "won't", 'm', 'ourselves', "haven't", 'by', 'herself', 'o', 'a', 'most', 'hers', 'few', 'during', "you're", "you've", 'further', 'why', "hasn't", 'as', "needn't", "that'll", "mightn't", 'nor', "it's", 'only', 'doesn', 'that', 'through', 'theirs', 'isn', 're', 'her', 'my', 'them', 'what', 'which', 'y', 'before', 'there', 'didn', 'not', 'while', 'down', 'against', 'below', 'himself', 'does', 'because', 'again', 'needn', 'shouldn', 'yourselves', 'wasn', 'will', 'do', 'but', 'doing', 'on', 'he', 'it', 'any', 'him', "shouldn't", 'for', 'yourself'}
fd_filtered = [w for w in fd_word if w.casefold() not in stop_words]
print(fd_filtered)
['sum', 'human', 'knowing', 'beyond', 'one', 'man', ';', 'thousand', 'men', '.', 'destruction', 'social', 'fabric', ',', 'science', 'broken', 'million', 'pieces', '.', 'Individuals', 'know', 'much', 'exceedingly', 'tiny', 'facets', 'know', '.', 'helpless', 'useless', '.', 'bits', 'lore', ',', 'meaningless', ',', 'passed', '.', 'lost', 'generations', '.', ',', 'prepare', 'giant', 'summary', 'knowledge', ',', 'never', 'lost', '.', 'Coming', 'generations', 'build', ',', 'rediscover', '.', 'One', 'millennium', 'work', 'thirty', 'thousand', '.']

The resulting list is significantly shorter. Depending on the objective of our analysis, there may be words that nltk considers stopwords but that we would rather keep. Reducing the size of our data shortens the time our analysis takes, but removing too many words can reduce its accuracy, which matters especially when we are trying to perform sentiment analysis.
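For instance, if negation matters for our analysis (as it often does in sentiment analysis), we might keep words such as "not" by removing them from the stopword set before filtering. A minimal sketch, reusing fd_word and stop_words from above:

custom_stop_words = stop_words - {"not", "no", "never"}  # keep negation words
fd_filtered_neg = [w for w in fd_word if w.casefold() not in custom_stop_words]
print(fd_filtered_neg)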

12.1.5 Stemming

Stemming is a method which allows us to reduce the number of variants of a word. For example, the words connecting, connected, and connection are all variants of the same word connect. nltk includes a few different stemmers based on different algorithms. We will use the Snowball stemmer, an improved version of the 1979 Porter stemmer.

from nltk.stem.snowball import SnowballStemmer
snow_stem = SnowballStemmer(language='english')
fd_stem = [snow_stem.stem(w) for w in fd_word]
print(fd_stem)
['the', 'sum', 'of', 'human', 'know', 'is', 'beyond', 'ani', 'one', 'man', ';', 'ani', 'thousand', 'men', '.', 'with', 'the', 'destruct', 'of', 'our', 'social', 'fabric', ',', 'scienc', 'will', 'be', 'broken', 'into', 'a', 'million', 'piec', '.', 'individu', 'will', 'know', 'much', 'of', 'exceed', 'tini', 'facet', 'of', 'what', 'there', 'is', 'to', 'know', '.', 'they', 'will', 'be', 'helpless', 'and', 'useless', 'by', 'themselv', '.', 'the', 'bit', 'of', 'lore', ',', 'meaningless', ',', 'will', 'not', 'be', 'pass', 'on', '.', 'they', 'will', 'be', 'lost', 'through', 'the', 'generat', '.', 'but', ',', 'if', 'we', 'now', 'prepar', 'a', 'giant', 'summari', 'of', 'all', 'knowledg', ',', 'it', 'will', 'never', 'be', 'lost', '.', 'come', 'generat', 'will', 'build', 'on', 'it', ',', 'and', 'will', 'not', 'have', 'to', 'rediscov', 'it', 'for', 'themselv', '.', 'one', 'millennium', 'will', 'do', 'the', 'work', 'of', 'thirti', 'thousand', '.']

Stemming algorithms are susceptible to errors. Related words that should share a stem may end up with different stems, known as understemming (a false negative). Unrelated words that should not share a stem may be reduced to the same one, known as overstemming (a false positive).
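A quick illustration with the Snowball stemmer defined above (the exact stems depend on the stemmer and its version, so treat these results as indicative rather than definitive):

# Overstemming: distinct words collapse to the same stem
print([snow_stem.stem(w) for w in ["universal", "university", "universe"]])
# Understemming: related words end up with different stems
print([snow_stem.stem(w) for w in ["alumnus", "alumni", "alumnae"]])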

12.1.6 POS Tagging

nltk also enables us to label the part of speech of each word in a text, which is known as part-of-speech (POS) tagging. nltk uses the Penn Treebank tagset, whose tags are as follows:

nltk.help.upenn_tagset()
$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``

We can use the function nltk.pos_tag() on our list of tokenized words. This will return a list of tuples, where each tuple contains a word and its corresponding tag.

fd_tag = nltk.pos_tag(fd_word)
print(fd_tag)
[('The', 'DT'), ('sum', 'NN'), ('of', 'IN'), ('human', 'JJ'), ('knowing', 'NN'), ('is', 'VBZ'), ('beyond', 'IN'), ('any', 'DT'), ('one', 'CD'), ('man', 'NN'), (';', ':'), ('any', 'DT'), ('thousand', 'CD'), ('men', 'NNS'), ('.', '.'), ('With', 'IN'), ('the', 'DT'), ('destruction', 'NN'), ('of', 'IN'), ('our', 'PRP$'), ('social', 'JJ'), ('fabric', 'NN'), (',', ','), ('science', 'NN'), ('will', 'MD'), ('be', 'VB'), ('broken', 'VBN'), ('into', 'IN'), ('a', 'DT'), ('million', 'CD'), ('pieces', 'NNS'), ('.', '.'), ('Individuals', 'NNS'), ('will', 'MD'), ('know', 'VB'), ('much', 'RB'), ('of', 'IN'), ('exceedingly', 'RB'), ('tiny', 'JJ'), ('facets', 'NNS'), ('of', 'IN'), ('what', 'WP'), ('there', 'EX'), ('is', 'VBZ'), ('to', 'TO'), ('know', 'VB'), ('.', '.'), ('They', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('helpless', 'JJ'), ('and', 'CC'), ('useless', 'JJ'), ('by', 'IN'), ('themselves', 'PRP'), ('.', '.'), ('The', 'DT'), ('bits', 'NNS'), ('of', 'IN'), ('lore', 'NN'), (',', ','), ('meaningless', 'NN'), (',', ','), ('will', 'MD'), ('not', 'RB'), ('be', 'VB'), ('passed', 'VBN'), ('on', 'IN'), ('.', '.'), ('They', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('lost', 'VBN'), ('through', 'IN'), ('the', 'DT'), ('generations', 'NNS'), ('.', '.'), ('But', 'CC'), (',', ','), ('if', 'IN'), ('we', 'PRP'), ('now', 'RB'), ('prepare', 'VBP'), ('a', 'DT'), ('giant', 'JJ'), ('summary', 'NN'), ('of', 'IN'), ('all', 'DT'), ('knowledge', 'NN'), (',', ','), ('it', 'PRP'), ('will', 'MD'), ('never', 'RB'), ('be', 'VB'), ('lost', 'VBN'), ('.', '.'), ('Coming', 'VBG'), ('generations', 'NNS'), ('will', 'MD'), ('build', 'VB'), ('on', 'IN'), ('it', 'PRP'), (',', ','), ('and', 'CC'), ('will', 'MD'), ('not', 'RB'), ('have', 'VB'), ('to', 'TO'), ('rediscover', 'VB'), ('it', 'PRP'), ('for', 'IN'), ('themselves', 'PRP'), ('.', '.'), ('One', 'CD'), ('millennium', 'NN'), ('will', 'MD'), ('do', 'VB'), ('the', 'DT'), ('work', 'NN'), ('of', 'IN'), ('thirty', 'JJ'), ('thousand', 'NN'), ('.', '.')]

The tokenized words from the quote should be easy to tag correctly. The function may encounter difficulty with less conventional words (e.g. Old English), but it will attempt to tag based on context.
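As a quick, hedged illustration, pos_tag will still assign tags to invented words based on suffixes and context, so the tags below are plausible guesses rather than guaranteed results:

nonsense = word_tokenize("The slithy toves gyred and gimbled in the wabe.")
print(nltk.pos_tag(nonsense))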

12.1.7 Lemmatizing

Lemmatizing is similar to stemming, but it is generally more accurate. Lemmatizing reduces words to their lemma, the dictionary (base) form of a word. nltk includes a lemmatizer based on the WordNet database. We can demonstrate this using a quote from the 1868 novel Little Women by Louisa May Alcott.

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
quote = "The dim, dusty room, with the busts staring down from the tall book-cases, the cosy chairs, the globes, and, best of all, the wilderness of books, in which she could wander where she liked, made the library a region of bliss to her."
quote_token = word_tokenize(quote)
quote_lemma = [lemmatizer.lemmatize(w) for w in quote_token]
print(quote_lemma)
['The', 'dim', ',', 'dusty', 'room', ',', 'with', 'the', 'bust', 'staring', 'down', 'from', 'the', 'tall', 'book-cases', ',', 'the', 'cosy', 'chair', ',', 'the', 'globe', ',', 'and', ',', 'best', 'of', 'all', ',', 'the', 'wilderness', 'of', 'book', ',', 'in', 'which', 'she', 'could', 'wander', 'where', 'she', 'liked', ',', 'made', 'the', 'library', 'a', 'region', 'of', 'bliss', 'to', 'her', '.']
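By default, WordNetLemmatizer treats every word as a noun, which is why verbs such as 'staring' and 'liked' are unchanged above. Supplying a part of speech usually gives better results; a small sketch, reusing the lemmatizer defined above:

print(lemmatizer.lemmatize("liked", pos="v"))  # 'like'
print(lemmatizer.lemmatize("made", pos="v"))   # 'make'
print(lemmatizer.lemmatize("was", pos="v"))    # 'be'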

12.1.8 Chunking/Chinking

While tokenizing allows us to distinguish individual words and sentences within a larger body of text, chunking allows us to identify phrases based on a grammar we specify.

#nltk.download("averaged_perceptron_tagger")
quote_tag = nltk.pos_tag(quote_token)

We can then define grammar rules to apply to the text. These rules use regular expression operators, which are listed below:

Operator Behavior
. Wildcard, matches any character
^abc Matches some pattern abc at the start of a string
abc$ Matches some pattern abc at the end of a string
[abc] Matches one of a set of characters
[A-Z0-9] Matches one of a range of characters
ed|ing|s Matches one of the specified strings (disjunction)
* Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)
+ One or more of previous item, e.g. a+, [a-z]+
? Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?
{n} Exactly n repeats where n is a non-negative integer
{n,} At least n repeats
{,n} No more than n repeats
{m,n} At least m and no more than n repeats
a(b|c)+ Parentheses that indicate the scope of the operators
grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(quote_tag)
tree.pretty_print(unicodelines=True)
(Output: a wide parse tree. The grammar groups the tagged words into NP chunks such as "The dim dusty room", "the tall book-cases", "the cosy chairs", "the globes", "the wilderness", "books", "the library", "a region", and "bliss", and into PP chunks such as "with the busts", "from the tall book-cases", "of books", and "of bliss"; the remaining tokens are left outside any chunk.)

As you can see, the generated tree shows the chunks that were identified by the grammar rules. There is also a chinking operation, the opposite of chunking: it removes a matching sequence of tokens from within a chunk.
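A minimal sketch of chinking, reusing quote_tag from above: everything is first chunked into one NP, and verbs and prepositions are then chinked back out.

chink_grammar = r"""
  NP:
    {<.*>+}          # chunk everything into one NP
    }<VB.*|IN>+{     # then chink (remove) verbs and prepositions
  """
chink_parser = nltk.RegexpParser(chink_grammar)
chink_tree = chink_parser.parse(quote_tag)
print(chink_tree)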

12.1.9 Named Entity Recognition

Previous methods have been able to identify the parts of speech of each word in a text. However, we may want to identify specific entities within the text. For example, we may want to identify the names of people, places, and organizations. nltk includes a named entity recognizer which can identify these entities. We can demonstrate this using a quote from The Iliad by Homer.

homer = "In the war of Troy, the Greeks having sacked some of the neighbouring towns, and taken from thence two beautiful captives, Chryseïs and Briseïs, allotted the first to Agamemnon, and the last to Achilles."
homer_token = word_tokenize(homer)
homer_tag = nltk.pos_tag(homer_token)
#nltk.download("maxent_ne_chunker")
#nltk.download("words")
tree2 = nltk.ne_chunk(homer_tag)
tree2.pretty_print(unicodelines=True)
(Output: a wide parse tree in which Troy, Greeks, Briseïs, Agamemnon, and Achilles are tagged as GPE and Chryseïs is tagged as PERSON; all other tokens are left outside any named entity.)

In the tree, some of the words that should be tagged as PERSON are tagged as GPE, or Geo-Political Entity. In these cases, we can also generate a tree which does not specify the type of named entity.

tree3 = nltk.ne_chunk(homer_tag, binary=True)
tree3.pretty_print(unicodelines=True)
(Output: the same parse tree, but with Troy, Greeks, Chryseïs, Briseïs, Agamemnon, and Achilles each labelled simply as NE.)

12.1.10 Analyzing Corpora

nltk includes a number of corpora, which are large bodies of text. We will try out some methods on the 1851 novel Moby Dick by Herman Melville.

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

12.1.10.1 Concordance

concordance allows us to find all instances of a word in a text. We can use this to find all instances of the word “whale” in Moby Dick.

text1.concordance("whale")
Displaying 25 of 1226 matches:
s , and to teach them by what name a whale - fish is to be called in our tongue
t which is not true ." -- HACKLUYT " WHALE . ... Sw . and Dan . HVAL . This ani
ulted ." -- WEBSTER ' S DICTIONARY " WHALE . ... It is more immediately from th
ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE
HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE
least , take the higgledy - piggledy whale statements , however authentic , in 
 dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a
 patient Job ." -- RABELAIS . " This whale ' s liver was two cartloads ." -- ST
 Touching that monstrous bulk of the whale or ork we have received nothing cert
 of oil will be extracted out of one whale ." -- IBID . " HISTORY OF LIFE AND D
ise ." -- KING HENRY . " Very like a whale ." -- HAMLET . " Which to secure , n
restless paine , Like as the wounded whale to shore flies thro ' the maine ." -
. OF SPERMA CETI AND THE SPERMA CETI WHALE . VIDE HIS V . E . " Like Spencer ' 
t had been a sprat in the mouth of a whale ." -- PILGRIM ' S PROGRESS . " That 
EN ' S ANNUS MIRABILIS . " While the whale is floating at the stern of the ship
e ship called The Jonas - in - the - Whale . ... Some say the whale can ' t ope
 in - the - Whale . ... Some say the whale can ' t open his mouth , but that is
 masts to see whether they can see a whale , for the first discoverer has a duc
 for his pains . ... I was told of a whale taken near Shetland , that had above
oneers told me that he caught once a whale in Spitzbergen that was white all ov
2 , one eighty feet in length of the whale - bone kind came in , which ( as I w
n master and kill this Sperma - ceti whale , for I could never hear of any of t
 . 1729 . "... and the breath of the whale is frequendy attended with such an i
ed with hoops and armed with ribs of whale ." -- RAPE OF THE LOCK . " If we com
contemptible in the comparison . The whale is doubtless the largest animal in c

12.1.10.2 Dispersion Plot

dispersion_plot allows us to see how a word is used throughout a text. We can use this to see the representation of characters throughout Moby Dick.

text1.dispersion_plot(["Ahab", "Ishmael", "Starbuck", "Queequeg"])
/usr/local/lib/python3.11/site-packages/nltk/draw/__init__.py:15: UserWarning: nltk.draw package not loaded (please install Tkinter library).
  warnings.warn("nltk.draw package not loaded (please install Tkinter library).")

12.1.10.3 Frequency Distribution

FreqDist allows us to see the frequency of each word in a text. We can use this to see the most common words in Moby Dick.

from nltk import FreqDist
fdist1 = FreqDist(text1)
print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>

We can use the list of stop words generated previously to help us focus on meaningful words. Note that the comparison below is case-sensitive, so capitalized stopwords such as 'The', 'But', and 'And' remain in the results.

text1_imp = [w for w in text1 if w not in stop_words and w.isalpha()]
fdist2 = FreqDist(text1_imp)
fdist2.most_common(20)
[('I', 2124),
 ('whale', 906),
 ('one', 889),
 ('But', 705),
 ('like', 624),
 ('The', 612),
 ('upon', 538),
 ('man', 508),
 ('ship', 507),
 ('Ahab', 501),
 ('ye', 460),
 ('old', 436),
 ('sea', 433),
 ('would', 421),
 ('And', 369),
 ('head', 335),
 ('though', 335),
 ('boat', 330),
 ('time', 324),
 ('long', 318)]

We can visualize the frequency distribution using plot.

fdist2.plot(20, cumulative=True)

<AxesSubplot: xlabel='Samples', ylabel='Cumulative Counts'>

12.1.10.4 Collocations

collocations allows us to find words that commonly appear together. We can use this to find the most common collocations in Moby Dick.

text1.collocations()
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand

12.1.11 Conclusion

In this tutorial, we have learned how to use nltk to perform basic text analysis. There are many methods included in this package that help provide structure to text. These methods can be used in conjunction with other packages to perform more complex analysis. For example, a dataframe of open-ended customer feedback could be processed to identify common themes, as well as the polarity of the feedback.
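As a hedged sketch of that last idea, nltk ships with the VADER sentiment analyzer, which assigns polarity scores to a piece of text (it relies on the vader_lexicon resource, which may need to be downloaded separately; the feedback string is a made-up example):

nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
feedback = "The checkout process was quick, but the delivery was disappointingly slow."
print(sia.polarity_scores(feedback))  # dict of neg/neu/pos/compound scores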

12.1.12 Resources

12.2 Neural Networks with TensorFlow (by Giovanni Lunetta)

A neural network is a type of machine learning algorithm that is inspired by the structure and function of the human brain. It consists of layers of interconnected nodes, or neurons, that can learn to recognize patterns in data and make predictions or decisions based on that input.

Neural networks are used in a wide variety of applications, including image and speech recognition, natural language processing, predictive analytics, robotics, and more. They have been especially effective in tasks that require pattern recognition, such as identifying objects in images, translating between languages, and predicting future trends in data.

12.2.1 Neural Network Architecture

A neural network consists of one or more layers of neurons, each of which takes input from the previous layer and produces output for the next layer. The input layer receives raw data, while the output layer produces predictions or decisions based on that input. The hidden layers in between contain neurons that can learn to recognize patterns in the data and extract features that are useful for making predictions.

Each neuron in a neural network has a set of weights and biases that determine how it responds to input. These values are adjusted during training to improve the accuracy of the network’s predictions. The activation function of a neuron determines how it responds to input, such as by applying a threshold or sigmoid function.

(Figure: a diagram of a small feed-forward neural network, ai-artificial-neural-network-alex-castrounis.png, showing an input layer of three nodes, a hidden layer of four nodes, and a single output node.)

The input layer: The three blue nodes on the left side of the diagram represent the input layer. This layer receives input data, such as pixel values from an image or numerical features from a dataset.

The hidden layer: The four white nodes in the middle of the diagram represent the hidden layer. This layer performs computations on the input data and generates output values that are passed to the output layer.

The output layer: The orange node on the right side of the diagram represents the output layer. This layer generates the final output of the neural network, which can be a binary classification (0 or 1) or a continuous value.

The arrows: The arrows in the diagram represent the connections between nodes in adjacent layers. Each arrow has an associated weight, which is a parameter learned during the training process. The weights determine the strength of the connections between the nodes and are used to compute the output values of each node.
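To make this concrete, here is a tiny sketch (not part of the tutorial's model) of what a single neuron computes: a weighted sum of its inputs plus a bias, passed through an activation function, here the sigmoid.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs from the previous layer
w = np.array([0.8, 0.1, -0.4])   # weights learned during training
b = 0.2                          # bias learned during training

print(sigmoid(np.dot(w, x) + b)) # the neuron's output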

12.2.2 ReLU Activation Function

The ReLU (Rectified Linear Unit) activation function is used in neural networks to introduce non-linearity into the model. Non-linearity allows neural networks to learn more complex relationships between inputs and outputs.

ReLU is a simple function that returns the input if it is positive, and 0 otherwise. This means that ReLU “activates” (returns a non-zero output) only if the input is positive, which can be thought of as a way for the neuron to “turn on” when the input is significant enough. In contrast, a linear function would simply scale the input by a constant factor, which would not introduce any non-linearity into the model.

In simple terms, ReLU allows the neural network to selectively activate certain neurons based on the importance of the input, which helps it learn more complex patterns in the data.

import numpy as np
import matplotlib.pyplot as plt

def linear(x):
    return x

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-10, 10, 100)
y_linear = linear(x)
y_relu = relu(x)

plt.plot(x, y_linear, label='Linear')
plt.plot(x, y_relu, label='ReLU')
plt.legend()
plt.xlabel('Input')
plt.ylabel('Output')
plt.show()

12.2.3 Demonstration

TensorFlow is an open-source software library developed by Google that is widely used for building and training machine learning models, including neural networks. TensorFlow provides a range of tools and abstractions that make it easier to build and optimize complex models, as well as tools for deploying models in production.

Here’s an example of how to use TensorFlow to build a neural network for a softmax regression model:

First we start by importing the proper packages:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import plot_model
from tensorflow.keras.losses import SparseCategoricalCrossentropy

import numpy as np

from sklearn.datasets import make_blobs

import matplotlib.pyplot as plt
2023-04-10 20:47:18.379479: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

TensorFlow and Keras are closely related, as Keras is a high-level API that is built on top of TensorFlow. Keras provides a user-friendly interface for building neural networks, making it easy to create, train, and evaluate models without needing to know the details of TensorFlow’s low-level API.

Keras was initially developed as a standalone library, but since version 2.0, it has been integrated into TensorFlow as its official high-level API. This means that Keras can now be used as a part of TensorFlow, providing a unified and comprehensive platform for deep learning.

In other words, Keras is essentially a wrapper around TensorFlow that provides a simpler and more intuitive interface for building neural networks. While TensorFlow provides a lower-level API that offers more control and flexibility, Keras makes it easier to get started with building deep learning models, especially for beginners.

# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=2.0,random_state=75)

# plot the example dataset
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.title('Example Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

We will look at three ways to implement a softmax regression machine learning model: first using stochastic gradient descent (SGD) as the optimizer, next using a potentially more efficient algorithm called the Adam algorithm, and finally using the Adam algorithm again, but more efficiently.

12.2.3.1 Stochastic Gradient Descent

sgd_model = tf.keras.Sequential([
        Dense(10, activation = 'relu'),
        Dense(5, activation = 'relu'),
        Dense(4, activation = 'softmax')    # <-- softmax activation here
    ]
)
sgd_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # <-- Note
)
sgd_history = sgd_model.fit(
                    X_train,y_train,
                    epochs=30
)
Epoch 1/30
 1/63 [..............................] - ETA: 38s - loss: 1.7720
35/63 [===============>..............] - ETA: 0s - loss: 1.4569 
63/63 [==============================] - 1s 1ms/step - loss: 1.3344
Epoch 2/30
 1/63 [..............................] - ETA: 0s - loss: 1.2658
42/63 [===================>..........] - ETA: 0s - loss: 1.0011
63/63 [==============================] - 0s 1ms/step - loss: 0.9616
Epoch 3/30
 1/63 [..............................] - ETA: 0s - loss: 0.9663
49/63 [======================>.......] - ETA: 0s - loss: 0.8074
63/63 [==============================] - 0s 1ms/step - loss: 0.7756
Epoch 4/30
 1/63 [..............................] - ETA: 0s - loss: 0.7792
48/63 [=====================>........] - ETA: 0s - loss: 0.6900
63/63 [==============================] - 0s 1ms/step - loss: 0.6672
Epoch 5/30
 1/63 [..............................] - ETA: 0s - loss: 0.5995
49/63 [======================>.......] - ETA: 0s - loss: 0.6040
63/63 [==============================] - 0s 1ms/step - loss: 0.6121
Epoch 6/30
 1/63 [..............................] - ETA: 0s - loss: 0.4462
49/63 [======================>.......] - ETA: 0s - loss: 0.5770
63/63 [==============================] - 0s 1ms/step - loss: 0.5784
Epoch 7/30
 1/63 [..............................] - ETA: 0s - loss: 0.7130
49/63 [======================>.......] - ETA: 0s - loss: 0.5520
63/63 [==============================] - 0s 1ms/step - loss: 0.5533
Epoch 8/30
 1/63 [..............................] - ETA: 0s - loss: 0.5232
45/63 [====================>.........] - ETA: 0s - loss: 0.5186
63/63 [==============================] - 0s 1ms/step - loss: 0.5324
Epoch 9/30
 1/63 [..............................] - ETA: 0s - loss: 0.8772
46/63 [====================>.........] - ETA: 0s - loss: 0.5226
63/63 [==============================] - 0s 1ms/step - loss: 0.5147
Epoch 10/30
 1/63 [..............................] - ETA: 0s - loss: 0.5530
46/63 [====================>.........] - ETA: 0s - loss: 0.4912
63/63 [==============================] - 0s 1ms/step - loss: 0.4989
Epoch 11/30
 1/63 [..............................] - ETA: 0s - loss: 0.3820
47/63 [=====================>........] - ETA: 0s - loss: 0.4914
63/63 [==============================] - 0s 1ms/step - loss: 0.4848
Epoch 12/30
 1/63 [..............................] - ETA: 0s - loss: 0.5388
47/63 [=====================>........] - ETA: 0s - loss: 0.4677
63/63 [==============================] - 0s 1ms/step - loss: 0.4727
Epoch 13/30
 1/63 [..............................] - ETA: 0s - loss: 0.5586
47/63 [=====================>........] - ETA: 0s - loss: 0.4674
63/63 [==============================] - 0s 1ms/step - loss: 0.4623
Epoch 14/30
 1/63 [..............................] - ETA: 0s - loss: 0.5675
47/63 [=====================>........] - ETA: 0s - loss: 0.4329
63/63 [==============================] - 0s 1ms/step - loss: 0.4523
Epoch 15/30
 1/63 [..............................] - ETA: 0s - loss: 0.4606
46/63 [====================>.........] - ETA: 0s - loss: 0.4390
63/63 [==============================] - 0s 1ms/step - loss: 0.4448
Epoch 16/30
 1/63 [..............................] - ETA: 0s - loss: 0.5161
49/63 [======================>.......] - ETA: 0s - loss: 0.4613
63/63 [==============================] - 0s 1ms/step - loss: 0.4384
Epoch 17/30
 1/63 [..............................] - ETA: 0s - loss: 0.5498
49/63 [======================>.......] - ETA: 0s - loss: 0.4424
63/63 [==============================] - 0s 1ms/step - loss: 0.4326
Epoch 18/30
 1/63 [..............................] - ETA: 0s - loss: 0.3196
49/63 [======================>.......] - ETA: 0s - loss: 0.4350
63/63 [==============================] - 0s 1ms/step - loss: 0.4280
Epoch 19/30
 1/63 [..............................] - ETA: 0s - loss: 0.4926
50/63 [======================>.......] - ETA: 0s - loss: 0.4204
63/63 [==============================] - 0s 1ms/step - loss: 0.4238
Epoch 20/30
 1/63 [..............................] - ETA: 0s - loss: 0.3793
50/63 [======================>.......] - ETA: 0s - loss: 0.4085
63/63 [==============================] - 0s 1ms/step - loss: 0.4200
Epoch 21/30
 1/63 [..............................] - ETA: 0s - loss: 0.3395
49/63 [======================>.......] - ETA: 0s - loss: 0.4139
63/63 [==============================] - 0s 1ms/step - loss: 0.4169
Epoch 22/30
 1/63 [..............................] - ETA: 0s - loss: 0.3211
36/63 [================>.............] - ETA: 0s - loss: 0.4134
61/63 [============================>.] - ETA: 0s - loss: 0.4142
63/63 [==============================] - 0s 2ms/step - loss: 0.4144
Epoch 23/30
 1/63 [..............................] - ETA: 0s - loss: 0.5288
34/63 [===============>..............] - ETA: 0s - loss: 0.4221
63/63 [==============================] - 0s 1ms/step - loss: 0.4126
Epoch 24/30
 1/63 [..............................] - ETA: 0s - loss: 0.4448
48/63 [=====================>........] - ETA: 0s - loss: 0.4133
63/63 [==============================] - 0s 1ms/step - loss: 0.4105
Epoch 25/30
 1/63 [..............................] - ETA: 0s - loss: 0.4868
49/63 [======================>.......] - ETA: 0s - loss: 0.4047
63/63 [==============================] - 0s 1ms/step - loss: 0.4084
Epoch 26/30
 1/63 [..............................] - ETA: 0s - loss: 0.3879
48/63 [=====================>........] - ETA: 0s - loss: 0.4123
63/63 [==============================] - 0s 1ms/step - loss: 0.4062
Epoch 27/30
 1/63 [..............................] - ETA: 0s - loss: 0.3862
49/63 [======================>.......] - ETA: 0s - loss: 0.4194
63/63 [==============================] - 0s 1ms/step - loss: 0.4044
Epoch 28/30
 1/63 [..............................] - ETA: 0s - loss: 0.4644
50/63 [======================>.......] - ETA: 0s - loss: 0.4052
63/63 [==============================] - 0s 1ms/step - loss: 0.4031
Epoch 29/30
 1/63 [..............................] - ETA: 0s - loss: 0.3527
49/63 [======================>.......] - ETA: 0s - loss: 0.3962
63/63 [==============================] - 0s 1ms/step - loss: 0.4023
Epoch 30/30
 1/63 [..............................] - ETA: 0s - loss: 0.5316
49/63 [======================>.......] - ETA: 0s - loss: 0.3994
63/63 [==============================] - 0s 1ms/step - loss: 0.4012

Here is a step-by-step explanation of the code:

  1. First, we create a sequential model using the tf.keras.Sequential() function, which represents a linear stack of layers.

  2. Then we pass three dense layers to the model as a list. The first two layers have the relu activation function and the last layer has the softmax activation function.

  3. We import SparseCategoricalCrossentropy from tensorflow.keras.losses. This is our loss function, which will be used to evaluate the model during training.

  4. We compile the model using model.compile(), specifying the SparseCategoricalCrossentropy() as our loss function.

  5. We fit the model to the training data using model.fit(), specifying the training data (X_train and y_train) and the number of epochs* (30).

In summary, the code creates a sequential model with three dense layers, using the relu activation function in the first two layers and the softmax activation function in the output layer. The model is then compiled using the SparseCategoricalCrossentropy() loss function, and finally, the model is trained for 30 epochs using the model.fit() method.

*In machine learning, the term “epochs” refers to the number of times the entire training dataset is used to train the model. During each epoch, the model processes the entire dataset, updates its parameters based on the computed errors, and moves on to the next epoch until the desired level of accuracy is achieved. Increasing the number of epochs may improve the model accuracy, but it also increases the risk of overfitting on the training data. Therefore, the number of epochs is a hyperparameter that must be tuned to achieve the best possible results.
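One common way to choose the number of epochs without overfitting is to hold out part of the training data for validation. A hedged sketch (not part of the original code) using Keras's validation_split argument; note that it continues training the already-fitted sgd_model, which is fine for illustration:

val_history = sgd_model.fit(
    X_train, y_train,
    epochs=10,
    validation_split=0.2,  # hold out 20% of the data for validation
    verbose=0,
)
# A training loss that keeps falling while the validation loss rises suggests overfitting.
print(val_history.history["loss"][-1], val_history.history["val_loss"][-1])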

sgd_model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 10)                30        
                                                                 
 dense_1 (Dense)             (None, 5)                 55        
                                                                 
 dense_2 (Dense)             (None, 4)                 24        
                                                                 
=================================================================
Total params: 109
Trainable params: 109
Non-trainable params: 0
_________________________________________________________________

In this example, the input has 2 features and the first hidden layer has 10 neurons, so there are 2 * 10 + 10 = 30 parameters (2 weights per neuron plus 10 bias terms). The second hidden layer has 5 neurons, so there are 10 * 5 + 5 = 55 parameters (10 inputs from the previous layer, plus 5 bias terms). The output layer has 4 neurons, so there are 5 * 4 + 4 = 24 parameters (5 inputs from the previous layer, plus 4 bias terms).

The value 0 reported for non-trainable parameters means that none of the layers have been frozen (marked as non-trainable).
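We can reproduce these parameter counts by hand:

n_input, n_h1, n_h2, n_out = 2, 10, 5, 4
print(n_input * n_h1 + n_h1)  # 30 parameters in the first hidden layer
print(n_h1 * n_h2 + n_h2)     # 55 parameters in the second hidden layer
print(n_h2 * n_out + n_out)   # 24 parameters in the output layer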

The None values in the output shape column represent the variable batch size that is inputted during the training process.

p_nonpreferred = sgd_model.predict(X_train)
print(p_nonpreferred [:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
 1/63 [..............................] - ETA: 4s
61/63 [============================>.] - ETA: 0s
63/63 [==============================] - 0s 857us/step
[[3.0265430e-05 9.9000406e-01 7.7044405e-03 2.2612682e-03]
 [9.8559028e-09 2.8865002e-03 4.3772280e-02 9.5334125e-01]]
largest value 0.9999842 smallest value 1.1022385e-20

p_nonpreferred = sgd_model.predict(X_train): This line uses the predict method of the sgd_model object to make predictions on the input data X_train. The resulting predictions are stored in the p_nonpreferred variable.

print(p_nonpreferred [:2]): This line prints the first two rows of p_nonpreferred. Each row represents the predicted probabilities for a single observation in the training set. The four columns represent the predicted probabilities for each of the four classes in the dataset.

print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred)): This line prints out the largest and smallest values from p_nonpreferred, which can give an idea of the range of the predictions. The np.max and np.min functions from NumPy are used to find the maximum and minimum values in p_nonpreferred.

The printed output is a matrix with two rows (the first two training examples) and four columns (because the output layer has four neurons). Each element of the matrix is the probability that the input example belongs to the corresponding class. For example, the probability that the first input example belongs to class 1 (which has the highest probability) is about 0.990.
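To turn these probabilities into predicted classes, we can take the index of the largest value in each row:

predicted_classes = np.argmax(p_nonpreferred, axis=1)
print(predicted_classes[:2])  # class index with the highest probability for the first two examples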

12.2.3.2 Adam Algorithm

adam_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
adam_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001), # < change to 0.01 and rerun
)

adam_history = adam_model.fit(
                    X_train,y_train,
                    epochs=30
)
Epoch 1/30
 1/63 [..............................] - ETA: 29s - loss: 2.1227
46/63 [====================>.........] - ETA: 0s - loss: 1.4394 
63/63 [==============================] - 1s 1ms/step - loss: 1.3151
Epoch 2/30
 1/63 [..............................] - ETA: 0s - loss: 0.9224
43/63 [===================>..........] - ETA: 0s - loss: 0.7678
63/63 [==============================] - 0s 1ms/step - loss: 0.7279
Epoch 3/30
 1/63 [..............................] - ETA: 0s - loss: 0.5913
49/63 [======================>.......] - ETA: 0s - loss: 0.5599
63/63 [==============================] - 0s 1ms/step - loss: 0.5584
Epoch 4/30
 1/63 [..............................] - ETA: 0s - loss: 0.5087
49/63 [======================>.......] - ETA: 0s - loss: 0.5084
63/63 [==============================] - 0s 1ms/step - loss: 0.5000
Epoch 5/30
 1/63 [..............................] - ETA: 0s - loss: 0.5003
48/63 [=====================>........] - ETA: 0s - loss: 0.4758
63/63 [==============================] - 0s 1ms/step - loss: 0.4721
Epoch 6/30
 1/63 [..............................] - ETA: 0s - loss: 0.2643
48/63 [=====================>........] - ETA: 0s - loss: 0.4650
63/63 [==============================] - 0s 1ms/step - loss: 0.4582
Epoch 7/30
 1/63 [..............................] - ETA: 0s - loss: 0.5475
47/63 [=====================>........] - ETA: 0s - loss: 0.4494
63/63 [==============================] - 0s 1ms/step - loss: 0.4458
Epoch 8/30
 1/63 [..............................] - ETA: 0s - loss: 0.3605
47/63 [=====================>........] - ETA: 0s - loss: 0.4394
63/63 [==============================] - 0s 1ms/step - loss: 0.4361
Epoch 9/30
 1/63 [..............................] - ETA: 0s - loss: 0.4289
48/63 [=====================>........] - ETA: 0s - loss: 0.4337
63/63 [==============================] - 0s 1ms/step - loss: 0.4278
Epoch 10/30
 1/63 [..............................] - ETA: 0s - loss: 0.5932
48/63 [=====================>........] - ETA: 0s - loss: 0.4265
63/63 [==============================] - 0s 1ms/step - loss: 0.4206
Epoch 11/30
 1/63 [..............................] - ETA: 0s - loss: 0.3905
47/63 [=====================>........] - ETA: 0s - loss: 0.4015
63/63 [==============================] - 0s 1ms/step - loss: 0.4156
Epoch 12/30
 1/63 [..............................] - ETA: 0s - loss: 0.3275
47/63 [=====================>........] - ETA: 0s - loss: 0.4049
63/63 [==============================] - 0s 1ms/step - loss: 0.4085
Epoch 13/30
 1/63 [..............................] - ETA: 0s - loss: 0.4582
48/63 [=====================>........] - ETA: 0s - loss: 0.4046
63/63 [==============================] - 0s 1ms/step - loss: 0.4050
Epoch 14/30
 1/63 [..............................] - ETA: 0s - loss: 0.3297
47/63 [=====================>........] - ETA: 0s - loss: 0.4181
63/63 [==============================] - 0s 1ms/step - loss: 0.4021
Epoch 15/30
 1/63 [..............................] - ETA: 0s - loss: 0.3931
48/63 [=====================>........] - ETA: 0s - loss: 0.3867
63/63 [==============================] - 0s 1ms/step - loss: 0.4025
Epoch 16/30
 1/63 [..............................] - ETA: 0s - loss: 0.3410
40/63 [==================>...........] - ETA: 0s - loss: 0.4134
63/63 [==============================] - 0s 1ms/step - loss: 0.3997
Epoch 17/30
 1/63 [..............................] - ETA: 0s - loss: 0.2734
48/63 [=====================>........] - ETA: 0s - loss: 0.3849
63/63 [==============================] - 0s 1ms/step - loss: 0.3969
Epoch 18/30
 1/63 [..............................] - ETA: 0s - loss: 0.3366
47/63 [=====================>........] - ETA: 0s - loss: 0.3955
63/63 [==============================] - 0s 1ms/step - loss: 0.3970
Epoch 19/30
 1/63 [..............................] - ETA: 0s - loss: 0.4878
48/63 [=====================>........] - ETA: 0s - loss: 0.3868
63/63 [==============================] - 0s 1ms/step - loss: 0.3957
Epoch 20/30
 1/63 [..............................] - ETA: 0s - loss: 0.2970
47/63 [=====================>........] - ETA: 0s - loss: 0.3980
63/63 [==============================] - 0s 1ms/step - loss: 0.3939
Epoch 21/30
 1/63 [..............................] - ETA: 0s - loss: 0.4061
47/63 [=====================>........] - ETA: 0s - loss: 0.4047
63/63 [==============================] - 0s 1ms/step - loss: 0.3944
Epoch 22/30
 1/63 [..............................] - ETA: 0s - loss: 0.2295
47/63 [=====================>........] - ETA: 0s - loss: 0.3838
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 23/30
 1/63 [..............................] - ETA: 0s - loss: 0.4797
48/63 [=====================>........] - ETA: 0s - loss: 0.3941
63/63 [==============================] - 0s 1ms/step - loss: 0.3928
Epoch 24/30
 1/63 [..............................] - ETA: 0s - loss: 0.6253
47/63 [=====================>........] - ETA: 0s - loss: 0.3789
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 25/30
 1/63 [..............................] - ETA: 0s - loss: 0.3452
47/63 [=====================>........] - ETA: 0s - loss: 0.3896
63/63 [==============================] - 0s 1ms/step - loss: 0.3923
Epoch 26/30
 1/63 [..............................] - ETA: 0s - loss: 0.2694
47/63 [=====================>........] - ETA: 0s - loss: 0.4114
63/63 [==============================] - 0s 1ms/step - loss: 0.3926
Epoch 27/30
 1/63 [..............................] - ETA: 0s - loss: 0.3815
48/63 [=====================>........] - ETA: 0s - loss: 0.3874
63/63 [==============================] - 0s 1ms/step - loss: 0.3908
Epoch 28/30
 1/63 [..............................] - ETA: 0s - loss: 0.5244
48/63 [=====================>........] - ETA: 0s - loss: 0.3865
63/63 [==============================] - 0s 1ms/step - loss: 0.3920
Epoch 29/30
 1/63 [..............................] - ETA: 0s - loss: 0.4065
48/63 [=====================>........] - ETA: 0s - loss: 0.4086
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 30/30
 1/63 [..............................] - ETA: 0s - loss: 0.3673
48/63 [=====================>........] - ETA: 0s - loss: 0.3790
63/63 [==============================] - 0s 1ms/step - loss: 0.3907
adam_model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_3 (Dense)             (None, 25)                75        
                                                                 
 dense_4 (Dense)             (None, 15)                390       
                                                                 
 dense_5 (Dense)             (None, 4)                 64        
                                                                 
=================================================================
Total params: 529
Trainable params: 529
Non-trainable params: 0
_________________________________________________________________

The None values in the output shape column represent the batch size, which can vary from one training step to the next. The number of parameters in each layer depends on the number of inputs and the number of neurons in the layer, along with any additional bias terms.

In this example, the model receives 2 input features and the first hidden layer has 25 neurons, so there are 25 * (2 + 1) = 75 parameters (2 weights plus 1 bias per neuron). The second hidden layer has 15 neurons, so there are 15 * 25 + 15 = 390 parameters (25 inputs from the previous layer, plus 15 bias terms). The output layer has 4 neurons, so there are 15 * 4 + 4 = 64 parameters (15 inputs from the previous layer, plus 4 bias terms).

The line Non-trainable params: 0 means that none of the layers have been marked as non-trainable, so all 529 parameters are updated during training.
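
As a quick sanity check, we can reproduce these counts ourselves. This small sketch only assumes the layer sizes shown in the summary above (2 input features, then 25, 15, and 4 neurons):

layer_sizes = [2, 25, 15, 4] # number of inputs, then the neurons in each Dense layer
total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out # one weight per input per neuron, plus one bias per neuron
    total += params
    print(f"{n_in} -> {n_out}: {params} parameters")
print("total:", total) # 75 + 390 + 64 = 529, matching adam_model.summary()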

p_nonpreferred = adam_model.predict(X_train)
print(p_nonpreferred [:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))
 1/63 [..............................] - ETA: 2s
60/63 [===========================>..] - ETA: 0s
63/63 [==============================] - 0s 866us/step
[[3.7956154e-03 9.6981263e-01 1.5898595e-02 1.0493187e-02]
 [4.6749294e-05 3.6971366e-03 6.8161853e-02 9.2809433e-01]]
largest value 0.999983 smallest value 1.492862e-13

Here, the only difference between these two machine learning models is the optimizer. The line optimizer=tf.keras.optimizers.Adam(0.001) specifies the optimizer to be used during training; in this case, the Adam optimizer with a learning rate of 0.001. Adam is an adaptive optimization algorithm that is commonly used in deep learning because it dynamically adjusts the learning rate for each parameter during training, which can speed up convergence and help the model avoid getting stuck in poor local minima.

Code
import numpy as np
import matplotlib.pyplot as plt

# Define the objective function (quadratic)
def objective(x, y):
    return x**2 + y**2

# Define the Adam update rule
def adam_update(x, y, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    g = np.array([2*x, 2*y])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    dx = - alpha * m_hat[0] / (np.sqrt(v_hat[0]) + eps)
    dy = - alpha * m_hat[1] / (np.sqrt(v_hat[1]) + eps)
    return dx, dy, m, v

# Define the parameters for the optimization
theta = np.array([2.0, 2.0])
m = np.zeros(2)
v = np.zeros(2)
t = 0
alpha = 0.1
beta1 = 0.9
beta2 = 0.999
eps = 1e-8

# Generate the parameter space grid
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
Z = objective(X, Y)

# Generate the parameter space plot
fig, ax = plt.subplots()
ax.contour(X, Y, Z, levels=30, cmap='jet')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Parameter Space of Adam')

# Perform several iterations of Adam and plot the updates
for i in range(20):
    t += 1
    dx, dy, m, v = adam_update(theta[0], theta[1], m, v, t, alpha, beta1, beta2, eps)
    theta += np.array([dx, dy])
    ax.arrow(theta[0]-dx, theta[1]-dy, dx, dy, head_width=0.1, head_length=0.1, fc='b', ec='b')
plt.show()

plt.plot(sgd_history.history['loss'], label='SGD')
plt.plot(adam_history.history['loss'], label='Adam')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

12.2.3.3 Preferred ADAM Algorithm

As we have talked about in class before, numerical roundoff errors happen when coding in Python because floating-point numbers can only store a finite number of digits, so some computations are not exact.

x1 = 2.0 / 10000
print(f"{x1:.18f}") # print 18 digits to the right of the decimal point
0.000200000000000000
x2 = 1 + (1/10000) - (1 - 1/10000)
print(f"{x2:.18f}")
0.000199999999999978

It turns out that, while the implementation of the loss function for softmax above was correct, there is a different way of setting it up that reduces numerical roundoff errors and leads to more accurate computations.

If we go back to how a loss function for softmax regression is implemented we see that the loss function is expressed in the following formula: \[ \text{loss}(a_1, a_2, \dots, a_n, y) = \begin{cases} -\log(a_1) & \text{if } y = 1 \\ -\log(a_2) & \text{if } y = 2 \\ \vdots & \vdots \\ -\log(a_n) & \text{if } y = n \end{cases} \]

where \(a_j\) is computed from: \[ a_j = \frac{e^{z_j}}{\sum\limits_{k=1}^n e^{z_k}} = P(y=j \mid \vec{x}) \]

Computing \(a_j\) explicitly in the softmax output layer and then passing it to the loss can lead to numerical roundoff errors in TensorFlow: very large or very small intermediate values of \(e^{z_j}\) lose precision before the logarithm is ever taken.

In terms of code, that is exactly what loss=SparseCategoricalCrossentropy() is doing. Therefore, it would be more accurate if we could implement the loss function as follows: \[ \text{loss}(a_1, a_2, \dots, a_n, y) = \begin{cases} -\log(\frac{e^{z_1}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 1 \\ -\log(\frac{e^{z_2}}{e^{z_1} + e^{z_2} + ... + e^{z_n}}) & \text{if } y = 2 \\ \vdots & \vdots \\ -\log(\frac{e^{z_n}}{\sum\limits_{k=1}^n e^{z_k}}) & \text{if } y = n \end{cases} \]
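
To see why this matters numerically, here is a small sketch (not part of the model code; the logits vector z below is made up) comparing the naive loss computed from the softmax probabilities with the mathematically equivalent computation that works directly on the logits:

import numpy as np

z = np.array([1000.0, 0.0, -1000.0, 5.0]) # hypothetical logits for one example
y = 0 # true class index

# naive: compute the softmax probabilities first, then take the log
a = np.exp(z) / np.sum(np.exp(z)) # np.exp(1000) overflows to inf (NumPy warns), so a[y] becomes nan
naive_loss = -np.log(a[y]) # nan

# stable: work directly with the logits using the log-sum-exp trick
log_sum_exp = np.max(z) + np.log(np.sum(np.exp(z - np.max(z))))
stable_loss = -(z[y] - log_sum_exp) # approximately 0, as expected for a confident correct prediction

print("naive:", naive_loss, "stable:", stable_loss)

Passing from_logits=True tells TensorFlow to use this kind of rearranged, more stable computation internally.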

We achieve this in two steps. First, we make the output layer use a linear activation; second, we add the from_logits=True argument to the loss=tf.keras.losses.SparseCategoricalCrossentropy line of code. With a linear activation instead of softmax, the model outputs a vector of raw scores (logits) rather than probabilities, and TensorFlow can combine the softmax and the cross-entropy into a single, more numerically stable computation.

preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),
)

preferred_history = preferred_model.fit(
                    X_train,y_train,
                    epochs=30
)
Epoch 1/30
 1/63 [..............................] - ETA: 28s - loss: 1.7850
48/63 [=====================>........] - ETA: 0s - loss: 1.1980 
63/63 [==============================] - 1s 1ms/step - loss: 1.1161
Epoch 2/30
 1/63 [..............................] - ETA: 0s - loss: 0.6643
48/63 [=====================>........] - ETA: 0s - loss: 0.6635
63/63 [==============================] - 0s 1ms/step - loss: 0.6332
Epoch 3/30
 1/63 [..............................] - ETA: 0s - loss: 0.5495
49/63 [======================>.......] - ETA: 0s - loss: 0.5082
63/63 [==============================] - 0s 1ms/step - loss: 0.5024
Epoch 4/30
 1/63 [..............................] - ETA: 0s - loss: 0.3845
47/63 [=====================>........] - ETA: 0s - loss: 0.4624
63/63 [==============================] - 0s 1ms/step - loss: 0.4612
Epoch 5/30
 1/63 [..............................] - ETA: 0s - loss: 0.4256
35/63 [===============>..............] - ETA: 0s - loss: 0.4366
63/63 [==============================] - 0s 1ms/step - loss: 0.4412
Epoch 6/30
 1/63 [..............................] - ETA: 0s - loss: 0.6007
37/63 [================>.............] - ETA: 0s - loss: 0.4405
63/63 [==============================] - 0s 1ms/step - loss: 0.4306
Epoch 7/30
 1/63 [..............................] - ETA: 0s - loss: 0.5292
41/63 [==================>...........] - ETA: 0s - loss: 0.4281
63/63 [==============================] - 0s 1ms/step - loss: 0.4233
Epoch 8/30
 1/63 [..............................] - ETA: 0s - loss: 0.2345
35/63 [===============>..............] - ETA: 0s - loss: 0.4112
63/63 [==============================] - 0s 2ms/step - loss: 0.4162
Epoch 9/30
 1/63 [..............................] - ETA: 0s - loss: 0.3684
37/63 [================>.............] - ETA: 0s - loss: 0.4130
63/63 [==============================] - 0s 1ms/step - loss: 0.4114
Epoch 10/30
 1/63 [..............................] - ETA: 0s - loss: 0.5446
45/63 [====================>.........] - ETA: 0s - loss: 0.4132
63/63 [==============================] - 0s 1ms/step - loss: 0.4089
Epoch 11/30
 1/63 [..............................] - ETA: 0s - loss: 0.4163
45/63 [====================>.........] - ETA: 0s - loss: 0.4109
63/63 [==============================] - 0s 1ms/step - loss: 0.4047
Epoch 12/30
 1/63 [..............................] - ETA: 0s - loss: 0.4489
46/63 [====================>.........] - ETA: 0s - loss: 0.3998
63/63 [==============================] - 0s 1ms/step - loss: 0.4031
Epoch 13/30
 1/63 [..............................] - ETA: 0s - loss: 0.4602
48/63 [=====================>........] - ETA: 0s - loss: 0.3974
63/63 [==============================] - 0s 1ms/step - loss: 0.4015
Epoch 14/30
 1/63 [..............................] - ETA: 0s - loss: 0.4532
47/63 [=====================>........] - ETA: 0s - loss: 0.4028
63/63 [==============================] - 0s 1ms/step - loss: 0.3983
Epoch 15/30
 1/63 [..............................] - ETA: 0s - loss: 0.6738
46/63 [====================>.........] - ETA: 0s - loss: 0.3814
63/63 [==============================] - 0s 1ms/step - loss: 0.3974
Epoch 16/30
 1/63 [..............................] - ETA: 0s - loss: 0.4042
47/63 [=====================>........] - ETA: 0s - loss: 0.3991
63/63 [==============================] - 0s 1ms/step - loss: 0.3970
Epoch 17/30
 1/63 [..............................] - ETA: 0s - loss: 0.4041
49/63 [======================>.......] - ETA: 0s - loss: 0.3998
63/63 [==============================] - 0s 1ms/step - loss: 0.3951
Epoch 18/30
 1/63 [..............................] - ETA: 0s - loss: 0.3383
47/63 [=====================>........] - ETA: 0s - loss: 0.3877
63/63 [==============================] - 0s 1ms/step - loss: 0.3955
Epoch 19/30
 1/63 [..............................] - ETA: 0s - loss: 0.6438
48/63 [=====================>........] - ETA: 0s - loss: 0.4018
63/63 [==============================] - 0s 1ms/step - loss: 0.3941
Epoch 20/30
 1/63 [..............................] - ETA: 0s - loss: 0.1844
48/63 [=====================>........] - ETA: 0s - loss: 0.3957
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 21/30
 1/63 [..............................] - ETA: 0s - loss: 0.2338
47/63 [=====================>........] - ETA: 0s - loss: 0.3884
63/63 [==============================] - 0s 1ms/step - loss: 0.3910
Epoch 22/30
 1/63 [..............................] - ETA: 0s - loss: 0.3001
48/63 [=====================>........] - ETA: 0s - loss: 0.4025
63/63 [==============================] - 0s 1ms/step - loss: 0.3906
Epoch 23/30
 1/63 [..............................] - ETA: 0s - loss: 0.5729
48/63 [=====================>........] - ETA: 0s - loss: 0.3979
63/63 [==============================] - 0s 1ms/step - loss: 0.3931
Epoch 24/30
 1/63 [..............................] - ETA: 0s - loss: 0.5538
47/63 [=====================>........] - ETA: 0s - loss: 0.3958
63/63 [==============================] - 0s 1ms/step - loss: 0.3909
Epoch 25/30
 1/63 [..............................] - ETA: 0s - loss: 0.3733
47/63 [=====================>........] - ETA: 0s - loss: 0.4052
63/63 [==============================] - 0s 1ms/step - loss: 0.3908
Epoch 26/30
 1/63 [..............................] - ETA: 0s - loss: 0.3424
43/63 [===================>..........] - ETA: 0s - loss: 0.3974
63/63 [==============================] - 0s 1ms/step - loss: 0.3890
Epoch 27/30
 1/63 [..............................] - ETA: 0s - loss: 0.3206
45/63 [====================>.........] - ETA: 0s - loss: 0.3886
63/63 [==============================] - 0s 1ms/step - loss: 0.3918
Epoch 28/30
 1/63 [..............................] - ETA: 0s - loss: 0.5887
40/63 [==================>...........] - ETA: 0s - loss: 0.3983
63/63 [==============================] - 0s 1ms/step - loss: 0.3898
Epoch 29/30
 1/63 [..............................] - ETA: 0s - loss: 0.4373
42/63 [===================>..........] - ETA: 0s - loss: 0.3609
63/63 [==============================] - 0s 1ms/step - loss: 0.3899
Epoch 30/30
 1/63 [..............................] - ETA: 0s - loss: 0.1973
47/63 [=====================>........] - ETA: 0s - loss: 0.4034
63/63 [==============================] - 0s 1ms/step - loss: 0.3897
p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))
 1/63 [..............................] - ETA: 2s
61/63 [============================>.] - ETA: 0s
63/63 [==============================] - 0s 862us/step
two example output vectors:
 [[-0.6257403   5.117499    0.07288799  1.0743215 ]
 [-2.9822855   0.81028026  3.5368693   6.0620565 ]]
largest value 18.288116 smallest value -6.9841967

Notice that in the preferred model, the outputs are not probabilities; they can range from large negative numbers to large positive numbers. If we want probabilities, the output must be passed through a softmax, as shown below.

sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))
two example output vectors:
 [[3.11955088e-03 9.73529756e-01 6.27339305e-03 1.70773119e-02]
 [1.08768334e-04 4.82606189e-03 7.37454817e-02 9.21319604e-01]]
largest value 0.999989 smallest value 1.80216e-11

This code applies the softmax activation function to the output of a neural network model p_preferred, and then converts the resulting tensor to a numpy array using the .numpy() method. The resulting array sm_preferred contains the probabilities for each of the possible output classes for the input data.

The second line of code then prints the first two rows of sm_preferred, which correspond to the probabilities for the first two input examples in the dataset.

Let's compare the loss curves of the three models one final time:

plt.plot(adam_history.history['loss'], label='ADAM')
plt.plot(preferred_history.history['loss'], label='Pref_ADAM')
plt.plot(sgd_history.history['loss'], label='SGD')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

12.2.4 References

  1. https://www.tensorflow.org/api_docs/python/tf/nn/softmax
  2. https://www.tensorflow.org/
  3. https://www.whyofai.com/blog/ai-explained
  4. https://www.coursera.org/specializations/machine-learning-introduction

12.3 Web Scraping with Selenium (by Michael Zheng)

Selenium is a free, open-source suite for automating and testing web applications across different browsers and platforms.

12.3.1 Selenium vs BeautifulSoup?

Selenium is a web browser automation tool that can interact with web pages like a human user, whereas BeautifulSoup is a library for parsing HTML and XML documents. This means Selenium has more functionality since it can automate browser actions such as clicking buttons, filling out forms and navigating between pages.

However, Selenium is not as fast as BeautifulSoup. Thus, if your web scraping problem can be solved with BeautifulSoup, use that.

An example of a website that can’t be scraped by BeautifulSoup is a website that doesn’t fully load unless prompted to: https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid.

  • Go to the link and inspect the first photo
  • Collapse the ‘TaxonPhoto undefined’ div container and scroll to the last ‘TaxonPhoto undefined’
  • Go back to the web page and scroll down to load new images

See those ‘TaxonPhoto undefined’ elements that are popping up on the right side of the screen as we scroll? Those are more photos that are being rendered as we directly interact with the web page. BeautifulSoup can only scrape HTML elements from what’s already loaded on the web page. It cannot dynamically interact with the page to load more HTML elements. Luckily, Selenium can do that!

12.3.2 Example: Plant Images Scraper

I will demonstrate the functionalities of Selenium by building a program to scrape plant images from a website. Hopefully, this example will be enough for anybody to get started with Selenium.

12.3.2.1 Components of a Website

Websites are developed using 3 main languages: JavaScript, HTML, and CSS.

We don’t need to get too deep into what each of these languages does; just know that HTML tells a browser how to display the content of a website, and that is what we will interact with to extract data from the website.

12.3.2.2 HTML

In HTML, the contents of a website are organized into containers called divs.

These div containers are given identifiers using class and id attributes.

<div class="widget"></div>

In this example, the div container is given the class name “widget”.

<div id="widget"></div>

In this example, the div container is given the id name “widget”.

We can use the find_elements method in Selenium to retrieve the containers that we want by using their XPath, which is essentially the address of a container within the HTML document.

Say we want to retrieve all the “widget” containers on a web page. We can use the find_elements method, which can locate containers using several strategies; here we specify By.XPATH. The expression below matches every element whose id starts with “widget”; we can do the same with classes by replacing @id with @class.

find_elements(By.XPATH, "//*[starts-with(@id, 'widget')]")
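
For instance, the class-based variant mentioned above would look like this (a sketch; it assumes a driver object has already been created, as shown later in this section):

# same idea, but matching elements whose class starts with "widget"
elements = driver.find_elements(By.XPATH, "//*[starts-with(@class, 'widget')]")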

12.3.2.3 Additional Selenium Functionalities

Selenium is very powerful and contains many useful features for interacting with browsers. We will not be using most of them in this project, but they’re still good to know.

As we mentioned earlier, find_elements retrieves all matching elements on the page. There is also find_element (note that element is singular), which returns only the first matching element it comes across.

Besides XPATH, there are other techniques for locating div containers. For instance, we can also use:

# Find the element with name "my-element"
element = driver.find_element(By.NAME, 'my-element')

# Find the element with ID "my-element"
element = driver.find_element(By.ID, 'my-element')

# Find the element with class name "my-element"
element = driver.find_element(By.CLASS_NAME, 'my-element')

# Find the element with CSS selector "#my-element .my-class"
element = driver.find_element(By.CSS_SELECTOR, '#my-element .my-class')

==========================================================================================

You can also interact with text fields in browsers via Selenium.

Say you are automating a scraper that needs to log in to a website. We already know how to find the elements using the find_element method:

# Find the username and password fields
username_field = driver.find_element(By.NAME, 'username')
password_field = driver.find_element(By.NAME, 'password')

Now those two variables are pointing to the corresponding text fields on the page. So, we can enter in our username and password by using the send_keys method:

username_field.send_keys('myusername')
password_field.send_keys('mypassword')

To complete the login, we need to click on the login button. We can do this by using the click method:

# Find the login button and click it
login_button = driver.find_element(By.XPATH, '//button[@type="submit"]')
login_button.click()

==========================================================================================

Sometimes you may need to wait for an element to appear on the page before you can interact with it. You can do this using the WebDriverWait class provided by Selenium. For example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

search_results = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'search')))

Selenium will wait for a maximum of 10 seconds for the element with the id “search” to appear on the page. If 10 seconds pass and the element still hasn’t appeared, a TimeoutException is raised. Otherwise, the driver retrieves the element and stores it in the variable search_results.
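
If you want the script to keep going when the wait times out instead of crashing, you can catch the exception. A minimal sketch (the id 'search' and the 10-second timeout are just the values from the example above):

from selenium.common.exceptions import TimeoutException

try:
    search_results = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'search'))
    )
except TimeoutException:
    search_results = None # the element never appeared; handle this however makes sense for your scraper
    print("Timed out waiting for the 'search' element")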

Now that we have an understanding of how to interact with HTML elements using Selenium, let’s get started with building the program!

  • Step 1: Import Libraries
import time # will be used to allow sufficient time for web pages to load

import requests # will be used to send requests to web pages to download images
import os # will be used to create the folder that the downloaded images are saved into

# selenium functions
from selenium import webdriver # how selenium uses the browser on your laptop
from selenium.webdriver.chrome.service import Service # tells selenium what browser to use
from webdriver_manager.chrome import ChromeDriverManager # a package to manage chrome driver dependencies so you don't have to
from selenium.webdriver.common.by import By # method for using XPATHS to locate div elements
  • Step 2: Scrape Image Links

Let’s make a plan for how we are going to scrape these images:

  1. Go to this link: https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid

  2. Scroll down; notice how the page takes some time to load more images (this is where the ‘time’ library will come into play)

  3. Right click on a picture and Inspect

  4. Navigate to the div container with id that starts with ‘cover-image…’

  5. Notice that the images are stored in an AWS S3 bucket, with the link to each image wrapped in url(…)

  6. Copy and paste the link into browser to open the image

  7. One more important point: the image URL is mixed in with a bunch of other text in the style attribute, which starts with “width: 100%…”, so we need to strip away the text surrounding the link

Let’s define a function called image_links_scraper. Its job will be to extract the image links for each image on the website. It takes in 2 parameters: link, the link to the website that we want to scrape, and max_images, the total number of images we want to scrape.

def image_links_scraper(link, max_images):
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install())) 
    # 1. downloads the latest google chrome driver (executable that selenium uses to launch google chrome)
    # 2. service is responsible for starting the webdriver, an interface for interacting with browsers, using the chrome driver
    # 3. once the webdriver is started, we can use it to interact with chrome

    # whenever we want to interact with the browser we call a method from driver

    driver.get(link) 
    # get method opens the browser to the specified link

    image_links = []
    # we will store the scraped image links in this list

    ### ISSUE (step 2 of the plan): the page only loads more images as we scroll ###
    
    current_height = driver.execute_script("return document.body.scrollHeight") 
    # executes a javascript command to get the current height of the page (which is the length of the page from the top to the bottom before it loads new images)

    while True: # keep scrolling down on the browser to load new images until we reach the end of the page
        driver.execute_script(f"window.scrollTo({current_height}, document.body.scrollHeight);") 
        # run javascript command to scroll to the bottom of the page

        elements = driver.find_elements(By.XPATH, "//*[starts-with(@id, 'cover')]") 
        # find all elements where the 'id' tag starts with the string 'cover' because these div containers have the image links

        if len(elements) >= max_images: # check to see if we have scraped enough image links, as specified by the max_images parameter
            break # if so, stop scrolling

        time.sleep(5)
        # wait for page to load; dependent on internet speed

        new_height = driver.execute_script("return document.body.scrollHeight") 
        # get new page height after scrolling

        if current_height == new_height: # check to see if the page height has stopped changing
            break # if so, we've reached the end of the page and need to stop scrolling
        else:
            current_height = new_height # otherwise, we need to keep scrolling

    # at this point, we have not scraped any images
    # we only have the div container elements that contain the image links we want to extract
    
    # now we go through each element and extract the links
    for element in elements:
        # ### ISSUE (step 7 of the plan): the image URL is mixed in with other style text ###

        s = element.get_attribute('style') 
        # returns the text in the 'style' attribute

        start = 'width: 100%; min-height: 183px; background-size: cover; background-position: center center; background-repeat: no-repeat; background-image: url("'
        # the useless text before the link

        end = '");' 
        # the useless text after the link

        link = s[len(start):-len(end)] 
        # perform string splicing to get only the URL from the entire string

        image_links.append(link) 
        # add the image link to the list

        print(link) 
        # print the links as we extract them to visualize function in real-time

    driver.quit()
    # once we're done automating the browser, we should close it using the quit() method of the driver object

    return image_links
  • Step 3: Download the Images

Now, we take the image links extracted from the previous step and download the images located at each link.

Let’s define a function called download_images that takes in 2 parameters: image_links, the list of links returned by image_links_scraper, and folder_name, the name of the folder to save the scraped images to.

def download_images(image_links, folder_name):
    os.makedirs(folder_name, exist_ok=True) # create the destination folder if it doesn't already exist
    i = 1 # keep track of the image number to give each image an identifier
    
    for link in image_links: # iterate through all the image links
        r = requests.get(link).content # retrieve the image content from URL by sending a request to the website
        file_name = f'{folder_name}/{i}.jpg' # generate image file name (image number) and directory

        with open(file_name, 'wb') as f:
            f.write(r) # save the image
        
        i += 1 # update the image number for the next iteration
  • Step 4: Run Everything All Together

The result is a dataset of plant images saved in a folder called _selenium_download.

# if __name__ == '__main__':
link = 'https://www.inaturalist.org/taxa/52083-Toxicodendron-pubescens/browse_photos?layout=grid' # website to scrape images from
max_images = 20 # number of images to scrape
folder_name = '_selenium_download' # name of folder to save images to
image_links = image_links_scraper(link, max_images)
download_images(image_links, folder_name)
https://inaturalist-open-data.s3.amazonaws.com/photos/2418087/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/7400447/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/7400452/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/148667986/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148667997/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668006/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668015/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668025/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668036/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668042/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668053/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668061/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668071/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668077/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/148668090/medium.jpeg
https://static.inaturalist.org/photos/114478526/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630561/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630572/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630582/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/122630593/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/101461689/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/194174354/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/194174425/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/51452451/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455593/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455616/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/51455665/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/98405618/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/101535531/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/101535544/medium.jpg
https://inaturalist-open-data.s3.amazonaws.com/photos/150810477/medium.jpeg
https://inaturalist-open-data.s3.amazonaws.com/photos/163894686/medium.jpg