word2vec_basic.py

ssh://sci@192.168.67.128:22/usr/bin/python3 -u /home/win_pymine_clean/feature_wifi/word2vec_basic.py
Found and verified text8.zip
Data size 17005207
Most common words (+UNK) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5238, 3084, 12, 6, 195, 2, 3136, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
3084 originated -> 5238 anarchism
3084 originated -> 12 as
12 as -> 6 a
12 as -> 3084 originated
6 a -> 195 term
6 a -> 12 as
195 term -> 2 of
195 term -> 6 a
2017-09-18 01:35:09.854800: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 01:35:09.874305: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 01:35:09.874359: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 01:35:09.874379: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-18 01:35:09.874396: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Initialized
Average loss at step 0 : 261.962097168
Nearest to but: rtp, sheng, modules, defenders, banknote, contrasting, proximal, va,
Nearest to state: pigeon, neutralize, gacy, independent, lug, rebellious, scholar, infestation,
Nearest to five: analytics, bhp, exams, synapomorphies, paolo, nanotubes, leaching, lombards,
Nearest to their: fragment, allocations, filthy, forgets, leaks, myron, albatross, response,
Nearest to as: bodybuilders, expecting, indivisible, rembrandt, clades, redirection, summand, checkmate,
Nearest to other: editorship, aan, beauties, exterminated, guan, nolo, bess, pearce,
Nearest to than: explainable, sword, capacitor, savory, rally, coleridge, imr, subordinates,
Nearest to are: ignatius, repetitive, eras, soter, elan, collectivity, fractions, perugia,
Nearest to its: castillo, hogshead, rna, ljubljana, fiberglass, ausgleich, toxin, balderus,
Nearest to see: abrasive, rifles, und, sufferer, nvidia, caricatured, manly, chee,
Nearest to and: meiosis, bridges, insurrection, feistel, sacrum, sepsis, reinforced, oscillator,
Nearest to three: honest, communally, gast, hawkins, highway, menzies, flocked, nijmegen,
Nearest to it: peh, namespaces, revolting, murat, mcgrath, toltec, fiorello, aediles,
Nearest to war: buprenorphine, rakis, decide, funky, stilicho, rnir, satirized, hydrolysis,
Nearest to however: pyramid, whittier, edifices, heraclea, alveolar, counties, carousel, carnap,
Nearest to states: renormalization, ideology, defending, immerse, semi, fireplace, roach, classically,
Average loss at step 2000 : 113.145377743
Average loss at step 4000 : 53.0118778615
Average loss at step 6000 : 33.3697857151
Average loss at step 8000 : 23.4925061049
Average loss at step 10000 : 17.8379190623
Nearest to but: and, soroban, var, electromagnetism, suitable, yeast, ruins, connotation,
Nearest to state: feel, independent, refrigerator, scholar, inventor, once, archive, disambiguation,
Nearest to five: zero, bckgr, nine, coke, eight, one, seven, hakama,
Nearest to their: response, wiesenthal, the, fragment, a, naturally, dawn, prevent,
Nearest to as: by, namibia, is, in, encourage, and, heroes, nazi,
Nearest to other: clock, concluded, bang, house, western, existent, obey, strategic,
Nearest to than: sword, rally, equilibrium, first, for, dhabi, contained, sample,
Nearest to are: is, under, certain, were, fractions, was, wickets, ignatius,
Nearest to its: faber, the, explanations, aristophanes, rna, ed, apo, one,
Nearest to see: abbot, entire, yeast, kohanim, rifles, solstice, multi, jurisdiction,
Nearest to and: in, of, with, or, nine, the, UNK, by,
Nearest to three: zero, jpg, nine, one, two, six, seven, five,
Nearest to it: archie, ideas, he, decided, viet, sigma, a, this,
Nearest to war: decide, density, from, patent, supposedly, passenger, agave, evil,
Nearest to however: pyramid, and, atheists, cuisine, counties, neq, spotted, technique,
Nearest to states: agave, semi, ideology, what, nine, defending, defendants, classically,
Average loss at step 12000 : 14.0006338406
Average loss at step 14000 : 11.8489933434
Average loss at step 16000 : 9.97434587622
Average loss at step 18000 : 8.41560820138
Average loss at step 20000 : 7.94222211516
Nearest to but: and, circ, electromagnetism, northrop, suitable, soroban, is, which,
Nearest to state: subkey, refrigerator, feel, inventor, once, rebellious, first, independent,
Nearest to five: nine, eight, zero, seven, six, three, four, agouti,
Nearest to their: his, the, acacia, a, wiesenthal, fragment, response, its,
Nearest to as: and, in, for, by, is, agouti, from, harbour,
Nearest to other: concluded, clock, existent, operatorname, western, rai, obey, bang,
Nearest to than: for, sword, rally, archimedean, antwerp, equilibrium, or, nine,
Nearest to are: is, were, was, agouti, have, ignatius, gollancz, under,
Nearest to its: the, his, operatorname, rna, magdeburg, their, faber, requisite,
Nearest to see: kohanim, seabirds, agouti, abbot, entire, yeast, rifles, subkey,
Nearest to and: or, circ, dasyprocta, operatorname, in, agouti, of, for,
Nearest to three: six, two, nine, seven, five, eight, one, zero,
Nearest to it: he, this, she, there, archie, which, viet, dasyprocta,
Nearest to war: decide, density, passenger, patent, operatorname, stilicho, dasyprocta, ns,
Nearest to however: and, cuisine, educators, pyramid, edifices, credo, atheists, operatorname,
Nearest to states: agave, agouti, semi, dasyprocta, ideology, hadith, what, defending,
Average loss at step 22000 : 7.19506471896
Average loss at step 24000 : 6.85889063323
Average loss at step 26000 : 6.72703241718
Average loss at step 28000 : 6.40524325681
Average loss at step 30000 : 5.92727108788
Nearest to but: and, circ, which, suitable, electromagnetism, although, operatorname, agouti,
Nearest to state: subkey, feel, refrigerator, inventor, diodes, rebellious, first, once,
Nearest to five: eight, six, seven, four, three, zero, nine, two,
Nearest to their: his, the, its, acacia, a, wiesenthal, fragment, punts,
Nearest to as: by, harbour, and, agouti, in, circ, is, for,
Nearest to other: clock, concluded, western, rai, different, operatorname, traits, existent,
Nearest to than: or, for, rally, sword, and, ann, archimedean, antwerp,
Nearest to are: were, is, was, have, agouti, be, under, gollancz,
Nearest to its: the, their, his, operatorname, magdeburg, requisite, rna, tonnage,
Nearest to see: kohanim, seabirds, trinomial, agouti, aba, abbot, yeast, subkey,
Nearest to and: or, circ, dasyprocta, agouti, operatorname, with, in, of,
Nearest to three: six, five, two, four, seven, eight, nine, agouti,
Nearest to it: he, this, she, there, which, amalthea, archie, dasyprocta,
Nearest to war: decide, buprenorphine, density, ns, patent, passenger, aforementioned, northrop,
Nearest to however: and, almighty, educators, cuisine, technique, edifices, expressly, operatorname,
Nearest to states: agave, semi, agouti, defending, ideology, dasyprocta, hadith, what,
Average loss at step 32000 : 5.91289714551
Average loss at step 34000 : 5.74882777894
Average loss at step 36000 : 5.79514622164
Average loss at step 38000 : 5.50678780043
Average loss at step 40000 : 5.23573713326
Nearest to but: and, circ, although, which, or, zero, that, electromagnetism,
Nearest to state: subkey, inventor, feel, refrigerator, diodes, rebellious, independent, first,
Nearest to five: four, three, six, eight, seven, zero, two, nine,
Nearest to their: his, its, the, acacia, operatorname, wiesenthal, agouti, punts,
Nearest to as: agouti, harbour, circ, zero, and, in, by, aba,
Nearest to other: different, clock, western, operatorname, concluded, fountain, dasyprocta, rai,
Nearest to than: or, for, rally, and, sword, simple, hbf, qualification,
Nearest to are: were, is, have, agouti, zero, be, was, multivibrator,
Nearest to its: their, the, his, operatorname, magdeburg, dasyprocta, agouti, zero,
Nearest to see: kohanim, trinomial, seabirds, agouti, aba, rifles, abrasive, abbot,
Nearest to and: or, zero, circ, dasyprocta, operatorname, in, agouti, but,
Nearest to three: four, five, six, seven, eight, two, zero, agouti,
Nearest to it: he, she, this, which, there, zero, archie, sal,
Nearest to war: decide, buprenorphine, northrop, density, ordinarily, aforementioned, slovenes, collide,
Nearest to however: and, but, educators, almighty, technique, that, operatorname, cuisine,
Nearest to states: semi, agouti, agave, defending, ideology, dasyprocta, hadith, what,
Average loss at step 42000 : 5.36539960575
Average loss at step 44000 : 5.24343806195
Average loss at step 46000 : 5.22763335347
Average loss at step 48000 : 5.23191133296
Average loss at step 50000 : 4.98666904318
Nearest to but: and, although, circ, however, or, four, which, two,
Nearest to state: subkey, inventor, feel, refrigerator, rebellious, leche, diodes, independent,
Nearest to five: six, four, three, eight, seven, dasyprocta, agouti, two,
Nearest to their: his, its, the, operatorname, punts, acacia, agouti, biostatistics,
Nearest to as: harbour, circ, agouti, kapoor, six, is, by, aba,
Nearest to other: different, four, many, clock, some, western, three, operatorname,
Nearest to than: or, and, for, rally, sword, simple, but, operatorname,
Nearest to are: were, is, have, be, was, agouti, gad, multivibrator,
Nearest to its: their, the, his, operatorname, dasyprocta, magdeburg, agouti, her,
Nearest to see: kohanim, trinomial, seabirds, aba, originator, erythrocytes, rifles, agouti,
Nearest to and: or, dasyprocta, circ, but, operatorname, six, agouti, four,
Nearest to three: four, six, five, seven, eight, two, one, dasyprocta,
Nearest to it: he, this, she, there, which, amalthea, archie, dasyprocta,
Nearest to war: decide, buprenorphine, ordinarily, northrop, aforementioned, density, slovenes, scenario,
Nearest to however: but, and, almighty, operatorname, technique, educators, that, four,
Nearest to states: defending, agouti, semi, agave, ideology, dasyprocta, hadith, classically,
Average loss at step 52000 : 5.02374834335
Average loss at step 54000 : 5.19878741825
Average loss at step 56000 : 5.04657789743
Average loss at step 58000 : 5.05832611966
Average loss at step 60000 : 4.95238713527
Nearest to but: and, however, or, although, circ, which, operatorname, kapoor,
Nearest to state: subkey, inventor, refrigerator, feel, pulau, ursinus, leche, michelob,
Nearest to five: six, four, three, eight, seven, two, dasyprocta, zero,
Nearest to their: his, its, the, her, operatorname, agouti, punts, a,
Nearest to as: in, circ, agouti, kapoor, by, harbour, is, aba,
Nearest to other: different, many, some, three, operatorname, clock, four, western,
Nearest to than: or, and, for, but, simple, rally, qualification, sword,
Nearest to are: were, is, have, be, gad, agouti, was, fractions,
Nearest to its: their, his, the, her, operatorname, magdeburg, biostatistics, agouti,
Nearest to see: but, kohanim, trinomial, seabirds, erythrocytes, originator, aba, solstice,
Nearest to and: or, circ, operatorname, dasyprocta, agouti, including, but, pulau,
Nearest to three: six, four, five, two, seven, eight, dasyprocta, agouti,
Nearest to it: he, this, she, there, which, they, amalthea, archie,
Nearest to war: buprenorphine, aforementioned, ordinarily, decide, rivaling, northrop, density, ursus,
Nearest to however: but, and, that, almighty, operatorname, technique, when, which,
Nearest to states: agouti, agave, defending, dasyprocta, ideology, semi, hadith, kingdom,
Average loss at step 62000 : 5.00446435535
Average loss at step 64000 : 4.85484569466
Average loss at step 66000 : 4.64125810647
Average loss at step 68000 : 5.0052472105
Average loss at step 70000 : 4.88670423937
Nearest to but: and, however, although, or, callithrix, circ, mico, which,
Nearest to state: subkey, inventor, ursinus, refrigerator, pulau, feel, capuchin, leche,
Nearest to five: four, six, eight, three, seven, zero, nine, two,
Nearest to their: its, his, the, her, some, operatorname, agouti, michelob,
Nearest to as: agouti, circ, kapoor, harbour, aba, operatorname, thaler, in,
Nearest to other: different, many, some, three, profits, operatorname, several, fountain,
Nearest to than: or, but, and, simple, qualification, rally, operatorname, wadsworth,
Nearest to are: were, have, is, be, gad, agouti, almighty, was,
Nearest to its: their, his, the, her, operatorname, agouti, ajanta, biostatistics,
Nearest to see: kohanim, but, trinomial, methodist, entire, originator, clo, aba,
Nearest to and: or, circ, callithrix, but, dasyprocta, operatorname, thaler, agouti,
Nearest to three: six, five, four, seven, eight, two, dasyprocta, one,
Nearest to it: he, she, this, there, which, they, amalthea, abakan,
Nearest to war: callithrix, ordinarily, buprenorphine, aforementioned, rivaling, decide, northrop, kapoor,
Nearest to however: but, and, that, which, almighty, although, when, though,
Nearest to states: thaler, kingdom, agouti, agave, defending, semi, ideology, dasyprocta,
Average loss at step 72000 : 4.7651747334
Average loss at step 74000 : 4.80770084751
Average loss at step 76000 : 4.72523927844
Average loss at step 78000 : 4.80518674481
Average loss at step 80000 : 4.79347246975
Nearest to but: however, and, although, callithrix, mico, circ, or, which,
Nearest to state: subkey, feel, pulau, inventor, capuchin, hadith, refrigerator, ursinus,
Nearest to five: four, six, seven, eight, three, two, nine, zero,
Nearest to their: its, his, the, her, our, agouti, operatorname, michelob,
Nearest to as: agouti, circ, kapoor, microcebus, thaler, callithrix, harbour, astoria,
Nearest to other: many, different, some, three, profits, operatorname, western, clock,
Nearest to than: or, but, and, kosar, simple, qualification, operatorname, wadsworth,
Nearest to are: were, is, have, gad, be, agouti, almighty, was,
Nearest to its: their, his, the, her, biostatistics, operatorname, agouti, dasyprocta,
Nearest to see: but, kohanim, trinomial, originator, aba, clo, milestones, yankees,
Nearest to and: or, callithrix, but, operatorname, circ, dasyprocta, thaler, agouti,
Nearest to three: five, four, six, two, seven, eight, dasyprocta, agouti,
Nearest to it: he, this, there, she, which, they, amalthea, abakan,
Nearest to war: ordinarily, callithrix, buprenorphine, aforementioned, rivaling, northrop, decide, lossy,
Nearest to however: but, that, although, when, though, almighty, operatorname, and,
Nearest to states: kingdom, thaler, agouti, defending, agave, dasyprocta, ideology, semi,
Average loss at step 82000 : 4.76438662171
Average loss at step 84000 : 4.76674228442
Average loss at step 86000 : 4.76899926138
Average loss at step 90000 : 4.7310421375
Nearest to but: and, however, or, which, callithrix, although, while, circ,
Nearest to state: subkey, pulau, capuchin, ursinus, michelob, hadith, feel, refrigerator,
Nearest to five: four, seven, eight, six, three, zero, nine, two,
Nearest to their: its, his, the, her, our, them, agouti, some,
Nearest to as: agouti, by, in, microcebus, harbour, circ, callithrix, aba,
Nearest to other: different, many, some, including, several, profits, operatorname, three,
Nearest to than: or, but, and, kosar, simple, qualification, archaeopteryx, wadsworth,
Nearest to are: were, have, is, be, include, gad, agouti, almighty,
Nearest to its: their, his, the, her, biostatistics, agouti, dasyprocta, operatorname,
Nearest to see: but, kohanim, originator, trinomial, aba, bragi, yankees, clo,
Nearest to and: or, but, callithrix, operatorname, circ, including, however, dasyprocta,
Nearest to three: four, five, two, six, seven, eight, one, dasyprocta,
Nearest to it: he, this, she, there, which, they, amalthea, abakan,
Nearest to war: callithrix, ordinarily, globemaster, aforementioned, buprenorphine, northrop, rivaling, kapoor,
Nearest to however: but, that, and, although, which, calypso, though, operatorname,
Nearest to states: kingdom, thaler, defending, agave, agouti, ashes, state, dasyprocta,
Average loss at step 92000 : 4.66148071206
Average loss at step 94000 : 4.72550550318
Average loss at step 96000 : 4.68572968149
Average loss at step 98000 : 4.59241366351
Average loss at step 100000 : 4.68939590132
Nearest to but: however, and, although, or, while, which, callithrix, circ,
Nearest to state: subkey, pulau, capuchin, ursinus, michelob, leche, inventor, hadith,
Nearest to five: four, seven, three, six, eight, two, zero, nine,
Nearest to their: its, his, the, her, our, them, some, agouti,
Nearest to as: harbour, agouti, circ, aba, operatorname, kapoor, by, goo,
Nearest to other: different, many, some, including, operatorname, profits, several, these,
Nearest to than: or, and, but, kosar, simple, much, archaeopteryx, qualification,
Nearest to are: were, have, is, be, include, these, gad, agouti,
Nearest to its: their, his, the, her, biostatistics, agouti, dasyprocta, apo,
Nearest to see: kohanim, aba, batted, bragi, but, trinomial, originator, milestones,
Nearest to and: or, but, callithrix, circ, dasyprocta, operatorname, agouti, including,
Nearest to three: five, four, six, two, seven, eight, dasyprocta, agouti,
Nearest to it: he, this, there, she, they, which, abakan, amalthea,
Nearest to war: callithrix, globemaster, ordinarily, aforementioned, northrop, buprenorphine, kapoor, rivaling,
Nearest to however: but, although, that, though, which, and, while, calypso,
Nearest to states: kingdom, thaler, agouti, agave, defending, dasyprocta, state, ideology,
Please install sklearn, matplotlib, and scipy to show embeddings.

Process finished with exit code 0

https://raw.githubusercontent.com/tensorflow/tensorflow/r1.3/tensorflow/examples/tutorials/word2vec/word2vec_basic.py

# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Basic word2vec example."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import random
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf

# Step 1: Download the data.

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urllib.request.urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception(
        'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)
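# Note: if the download is blocked, a text8.zip placed manually in the working
# directory is used as-is (the os.path.exists check above skips the download),
# but it must still be exactly 31344016 bytes to pass the size check.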

# Read the data into a list of strings.

def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

vocabulary = read_data(filename)
print('Data size', len(vocabulary))
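# Sanity check against the log above: text8 is one long whitespace-separated
# string of lowercase words, so len(vocabulary) is the token count (17005207
# in the logged run) and vocabulary starts with ['anarchism', 'originated', ...].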

# Step 2: Build the dictionary and replace rare words with UNK token.

vocabulary_size = 50000

def build_dataset(words, n_words):
  """Process raw inputs into a dataset."""
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(n_words - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count += 1
    data.append(index)
  count[0][1] = unk_count
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
  return data, count, dictionary, reversed_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                             vocabulary_size)
del vocabulary # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
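# For orientation (comment added to this post, not in the original script):
# `data` is the corpus re-encoded as integer ids, `count` pairs each kept word
# with its frequency (index 0 is the UNK bucket), `dictionary` maps word -> id
# and `reverse_dictionary` maps id -> word.  In the run logged above,
# reverse_dictionary[5238] == 'anarchism' and dictionary['the'] == 1.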

data_index = 0

# Step 3: Function to generate a training batch for the skip-gram model.

def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1  # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  if data_index + span > len(data):
    data_index = 0
  buffer.extend(data[data_index:data_index + span])
  data_index += span
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [skip_window]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    if data_index == len(data):
      buffer[:] = data[:span]
      data_index = span
    else:
      buffer.append(data[data_index])
      data_index += 1
  # Backtrack a little bit to avoid skipping words in the end of a batch
  data_index = (data_index + len(data) - span) % len(data)
  return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
  print(batch[i], reverse_dictionary[batch[i]],
        '->', labels[i, 0], reverse_dictionary[labels[i, 0]])
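# With skip_window=1 and num_skips=2 each center word is paired with both of
# its immediate neighbors, which is exactly the '3084 originated -> 5238
# anarchism' / '3084 originated -> 12 as' pattern printed near the top of the
# log above.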

# Step 4: Build and train a skip-gram model.

batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.

# We pick a random validation set to sample nearest neighbors. Here we limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.

valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64 # Number of negative examples to sample.
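# The 16 validation ids are sampled without replacement from the 100 most
# frequent words, which is why the log above probes common words such as
# 'but', 'five', 'their', 'as' and 'states'.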

graph = tf.Graph()

with graph.as_default():

  # Input data.
  train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

  # Ops and variables pinned to the CPU because of missing GPU implementation
  with tf.device('/cpu:0'):
    # Look up embeddings for inputs.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Construct the variables for the NCE loss
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

  # Compute the average NCE loss for the batch.
  # tf.nce_loss automatically draws a new sample of the negative labels each
  # time we evaluate the loss.
  loss = tf.reduce_mean(
      tf.nn.nce_loss(weights=nce_weights,
                     biases=nce_biases,
                     labels=train_labels,
                     inputs=embed,
                     num_sampled=num_sampled,
                     num_classes=vocabulary_size))
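  # For intuition (comment added to this post): tf.nn.nce_loss treats every
  # (input, label) pair as a small binary classification task -- the true
  # context word plus num_sampled (64) randomly drawn negatives are scored
  # against nce_weights/nce_biases, and the per-example loss is (roughly) the
  # summed sigmoid cross-entropy over those candidates.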

  # Construct the SGD optimizer using a learning rate of 1.0.
  optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

  # Compute the cosine similarity between minibatch examples and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(
      normalized_embeddings, valid_dataset)
  similarity = tf.matmul(
      valid_embeddings, normalized_embeddings, transpose_b=True)
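  # Since every row of normalized_embeddings has unit L2 norm, this matmul
  # gives the cosine similarity between each validation word and the whole
  # vocabulary; (-sim[i, :]).argsort()[1:top_k + 1] below then picks the
  # neighbors printed in the log.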

  # Add variable initializer.
  init = tf.global_variables_initializer()

# Step 5: Begin training.

num_steps = 100001

with tf.Session(graph=graph) as session:
  # We must initialize all variables before we use them.
  init.run()
  print('Initialized')

  average_loss = 0
  for step in xrange(num_steps):
    batch_inputs, batch_labels = generate_batch(
        batch_size, num_skips, skip_window)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    # We perform one update step by evaluating the optimizer op (including it
    # in the list of returned values for session.run()
    _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
      if step > 0:
        average_loss /= 2000
      # The average loss is an estimate of the loss over the last 2000 batches.
      print('Average loss at step ', step, ': ', average_loss)
      average_loss = 0

    # Note that this is expensive (~20% slowdown if computed every 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in xrange(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8  # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
        log_str = 'Nearest to %s:' % valid_word
        for k in xrange(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log_str = '%s %s,' % (log_str, close_word)
        print(log_str)
  final_embeddings = normalized_embeddings.eval()
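# A small post-training sketch, not part of the original tutorial: query the
# neighbors of any vocabulary word from final_embeddings with plain NumPy.
# The helper name `nearest_words` is illustrative only.
def nearest_words(query_word, top_k=8):
  idx = dictionary[query_word]
  sims = np.dot(final_embeddings, final_embeddings[idx])  # cosine: rows are unit-norm
  return [reverse_dictionary[j] for j in (-sims).argsort()[1:top_k + 1]]

print(nearest_words('three'))  # expected: mostly other number words, as in the log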

# Step 6: Visualize the embeddings.

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
  assert low_dim_embs.shape[0] >= len(labels), 'More labels than embeddings'
  plt.figure(figsize=(18, 18))  # in inches
  for i, label in enumerate(labels):
    x, y = low_dim_embs[i, :]
    plt.scatter(x, y)
    plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')

  plt.savefig(filename)

try:
  # pylint: disable=g-import-not-at-top
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000, method='exact')
  plot_only = 500
  low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
  labels = [reverse_dictionary[i] for i in xrange(plot_only)]
  plot_with_labels(low_dim_embs, labels)

except ImportError:
  print('Please install sklearn, matplotlib, and scipy to show embeddings.')
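# In the run logged above this ImportError branch was taken ('Please install
# sklearn, matplotlib, and scipy to show embeddings.'), so no tsne.png was
# produced.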
