This page answers frequently asked questions (FAQs) for CS224N / Ling 237.

6/7/07 What kinds of grades are you guys giving out? *UPDATED*
5/11/07 PA3 tips
5/08/07 How do I interpret the output of the scoring script?
5/07/07 Ah! I don't understand this whole feature index thing in PA3.
5/04/07 "stepSize underflow" in PA2 Part II
4/28/07 Distortion Parameters for Model 2
4/24/07 What numbers should I be getting for PA2?
4/24/07 What should getAlignmentProb() return?
4/18/07 For PA2 Part I, why is the test data part of the training data?
4/18/07 Questions on Knight, section 31
4/13/07 Do I have to create my validation set for PA1?
4/12/07 Increasing perplexity with more training data?
4/12/07 Do we smooth at train time or test time?
4/12/07 How do I use stop tokens in my n-gram model?
4/12/07 Does PA1 require a strict proof?
4/12/07 Smoothing implementation details
4/12/07 Smoothing and conditional probabilities
4/12/07 Smoothing and unknown words
4/9/07 What numbers should I be getting for PA1?
4/9/07 Having trouble compiling/running PA1?
4/1/07 Do I have to do my final project in Java?


Grade Distributions for PA1, PA2, and PA3

May 19 2007

For PA1, the mean was 9.4 with a std dev of 1.17. For PA2, the mean was 9.14 with a std dev of 1.21. For PA3, the mean was 9.0 with a std dev of 1.5. Of course, the final grades are not going to be bell curved, so don't worry too much about being a point or two off... just blow us away with your final project.

PA3 tips

May 11 2007

1. For the parsing part, you may find that you get underflow from multiplying many probabilities together. Instead, switch to log probabilities and add them (and store the log probs in your grammar; don't re-log them every time).

2. For the parser, don't write three nested loops over all non-terminals (the pseudo-code says 'for each A, B, C in nonterms'). Instead, loop over non-terminals once and then look up all rules with that non-terminal as the left (or right) child.

3. For the maxent part, when computing the derivative, don't put an outer loop over the derivative components and an inner loop over the data; this will be VERY SLOW. Instead, iterate through your data and incrementally build up the derivative, based on the features present in each datum (see the sketch after this list).

4. For the maxent portion, what numbers should you be getting? The baseline features should give you somewhere around 40% F1. I think you should all be able to get at least into the 60s, and possibly the low 70s, with some clever feature engineering. *Pay no attention to the accuracy; care only about the F-score.*
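
For tip 3, here is a minimal sketch of what the data-driven derivative loop can look like, written against the EncodedDatum / IndexLinearizer interface shown in the feature-index entry below. The getLogProbabilities helper and getLabelIndex() call are stand-ins for whatever your own implementation provides, and the regularization term is omitted; this computes the gradient of the negative log-likelihood as (expected counts minus empirical counts).

        double[] derivative = new double[x.length];
        for (EncodedDatum datum : data) {
          // model distribution over labels for this datum (stand-in helper)
          double[] logProbs = getLogProbabilities(datum, x, encoding, indexLinearizer);
          for (int label = 0; label < encoding.getNumLabels(); label++) {
            double prob = Math.exp(logProbs[label]);
            for (int i = 0; i < datum.getNumActiveFeatures(); i++) {
              int feat = datum.getFeatureIndex(i);
              int index = indexLinearizer.getLinearIndex(feat, label);
              double count = datum.getFeatureCount(i);
              // expected count of this (feature, label) pair under the current model
              derivative[index] += prob * count;
              // empirical count: subtract it only for the observed (gold) label
              if (label == datum.getLabelIndex()) {
                derivative[index] -= count;
              }
            }
          }
        }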



How do I interpret the output of the scoring script?

May 8 2007

The scoring script outputs lots of numbers, and it might not be clear which ones are important. Here's an example output (please don't take these numbers as indicative of what you should be aiming for!):

    processed 69051 tokens with 7119 phrases; found: 5236 phrases; correct: 3134.
accuracy:  84.42%; precision:  59.85%; recall:  44.02%; FB1:  50.73
              DNA: precision:  47.65%; recall:  23.16%; FB1:  31.17  680
              RNA: precision:  11.11%; recall:   5.34%; FB1:   7.22  63
        cell_line: precision:  65.24%; recall:  23.11%; FB1:  34.13  187
        cell_type: precision:  56.76%; recall:  44.38%; FB1:  49.82  717
          protein: precision:  63.36%; recall:  54.87%; FB1:  58.81  3589
  
The first line isn't important, and on the second line the first number is accuracy, but you should ignore that number as well. As discussed in class, accuracy is not a good measure for NER because the O class is so common. What we really care about are precision, recall, and F-score. The second line gives these values, micro-averaged over all entity types, while the following lines give them for each individual entity type. These numbers are all important, but the most important one is the overall F-score. It's at the end of the second line (FB1), and in this case it's 50.73.
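
(If you want a sanity check on where that number comes from: FB1 is the F1 score, the harmonic mean of precision and recall, F1 = 2PR / (P + R). Here that's 2 × 0.5985 × 0.4402 / (0.5985 + 0.4402) ≈ 0.5073, i.e. 50.73.)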



Ah! I don't understand this whole feature index thing in PA3.

May 7 2007

Those of you who have built feature-based classifiers before may be a little confused about how features are represented in our EncodedDatum class. Basically, there are two different ways that one can think about the relationship between features, labels, and weights. The first is that you can think of features as functions of both the observations AND the labels (e.g. "word=The & label=protein") and then learn a weight for each feature. An alternative representation is to build features over only the observations (e.g. "word=The") and then learn a weight for each (feature, label) pair (e.g. ("word=The", protein)). They are more-or-less equivalent (technically the first version is more expressive), and we opted for the latter representation because we felt that it would be simpler for you guys, and generally make it harder for you to accidentally "cheat" when building your features. It also simplifies a lot of things computationally. But it results in the use of the IndexLinearizer class, which might be a bit confusing to the uninitiated. So, we have included here a little bit of code which will iterate through each datum and each possible label, and get the weight for that (feature, label) pair. Hopefully this will help clear up some confusion:

        for (EncodedDatum datum : data) {
          for (int label = 0; label < encoding.getNumLabels(); label++) {
            int numFeats = datum.getNumActiveFeatures();
            for (int i = 0; i < numFeats; i++) {
              int feat = datum.getFeatureIndex(i);                      // index of the i-th active feature
              int index = indexLinearizer.getLinearIndex(feat, label);  // linearized (feature, label) index
              double val = datum.getFeatureCount(i);                    // this feature's value in the datum
              double weight = x[index];                                 // weight for this (feature, label) pair
            }
          }
        }
     

"stepSize underflow" on PA2 Part II

May 4 2007

Have you seen an error saying "BacktrackingSearcher.minimize: stepSize underflow."? Here's what it means. (This description should make sense if you've seen line search minimization before.) Inside the minimizer, a direction to walk is determined using the partial derivative information you supply, and then a line search is done in that direction. The minimizer heads out a certain distance in the indicated direction. If the function value at that point isn't sufficiently below the previous function value (relative to the step size and the derivative calculated at the previous point), it enters a loop in which it repeatedly shortens the step size until the function value is sufficiently below the previous one. You get the "stepSize underflow" message when it decides that there is no distance it can walk in the indicated direction which will lower the function value (as you want to do in minimization). In all cases of which we are aware, this doesn't indicate a problem in the optimization code, but rather that the function value or derivative information you are supplying is wrong. For instance, if the derivatives say that a certain direction is downhill, but it is actually uphill, and that direction is chosen, then you'll get this message.
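
To make the mechanism concrete, here is a minimal sketch of the backtracking idea, under the assumption that valueAt() stands in for evaluating your objective function; it is not the actual BacktrackingSearcher code, and the constants are illustrative. Note how derivative information that wrongly claims a direction is downhill makes the sufficient-decrease test fail on every iteration, which is exactly what produces the underflow.

        // Sketch only: valueAt() stands in for your objective function evaluation.
        static double[] backtrack(double[] x, double[] direction, double fx, double[] gradient) {
          double slope = 0.0;  // directional derivative along 'direction'
          for (int i = 0; i < x.length; i++) slope += gradient[i] * direction[i];
          double stepSize = 1.0;
          while (true) {
            double[] candidate = new double[x.length];
            for (int i = 0; i < x.length; i++) candidate[i] = x[i] + stepSize * direction[i];
            // sufficient decrease: the new value must drop by a fraction of (stepSize * slope)
            if (valueAt(candidate) <= fx + 1e-4 * stepSize * slope) return candidate;
            stepSize *= 0.5;  // otherwise shorten the step and try again
            if (stepSize < 1e-20)
              throw new RuntimeException("stepSize underflow: check the value/derivatives you supply");
          }
        }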



Distortion Parameters for Model 2

28 April 2007

There have been a lot of questions about the distortion parameters used for Model 2. I've added the following paragraph to the PA2 handout, which hopefully clarifies this:

"For each bucket j, you should have a parameter d(j) to indicate the probability of that distortion. These parameters should be learned during the EM process. (It turns out that choosing reasonable functions for d, instead of trying to learn parameters, can work pretty well. We'd like you to attempt learning distortion parameters with the EM algorithm, but dealing with distortions in some other sensible way could also warrant credit.)"

Note that since d is over buckets determined by the indices of the English/French words and the lengths of the English/French sentences, it is indeed represented by a one-dimensional table of floating points as mentioned in Knight's tutorial.
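
For concreteness, here is one possible way to set up such a table in code; the bucketing function below (displacement from the proportionally expected position, clipped to a fixed range) is purely illustrative and not the required scheme.

        static final int NUM_BUCKETS = 21;                 // e.g. displacements clipped to [-10, 10]
        double[] distortion = new double[NUM_BUCKETS];     // the one-dimensional table d(j)

        // Map an alignment decision to a bucket: how far the English position sits from
        // where we'd expect it if the two sentences lined up proportionally.
        int bucket(int frenchPos, int englishPos, int frenchLen, int englishLen) {
          int expected = (int) Math.round((double) frenchPos * englishLen / frenchLen);
          int displacement = englishPos - expected;
          return Math.max(-10, Math.min(10, displacement)) + 10;   // clip and shift into [0, 20]
        }

In the E-step you accumulate expected counts into a parallel array indexed the same way, and in the M-step you normalize those counts to obtain the new d values.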



What numbers should I be getting for PA2?

24 April 2007

As low as you can go. Never be satisfied!

Model, # Training Sentences    Precision    Recall    AER      Time
Model 1, 1k                    0.557        0.640     0.411    1m 20s
Model 1, 10k                   0.613        0.691     0.357    6m 40s
Model 2, 1k                    0.666        0.781     0.294    1m
Model 2, 10k                   0.695        0.831     0.258    6m 20s


Don't worry too much about the times (though if you optimize well, you should be able to train on over 100k sentences and get below 20% AER). Model 3 would be very cool to see!

What should getAlignmentProb() return?

24 April 2007

Say f is the source sentence, e is the target sentence (as it is in all our examples). Then getAlignmentProb() should return P(a, f | e).
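
(For Model 1, concretely, this works out to the length probability times the uniform alignment probability times Π_j t(f_j | e_{a_j}); see the entry on Knight section 31 below for how P(a, f | e) decomposes into those pieces.)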



For PA2 Part I, why is the test data part of the training data?

18 April 2007

If you look at the starter code for PA2, you'll notice there's a suspicious line:

trainingSentencePairs.addAll(testSentencePairs);

The reason we include the test data is simply to avoid the unseen-word problem and to simplify the assignment. And since we're doing unsupervised learning, the EM process doesn't make use of the annotated alignments in the test data, so it's not cheating.



Questions on Knight section 31

18 April 2007

There are some questions regarding formulas found in section 31 of the Kevin Knight tutorial. He begins the section as follows:

P(a | e, f) = P(a, f | e) / P(f | e) = P(a, f | e) / Σ_a P(a, f | e)

Consider the computation of the denominator above. Using Model 1, we can write it as:

Σ_a Π_j t(f_j | e_{a_j})

Knight is being imprecise here, and it's quite confusing. The latter expression is not really equal to the denominator above, because it neglects to take account of the length of the French sentence.

Here's a more precise account. Let's ignore the summation over a for the moment, and just focus on P(a, f | e). Since len(f) is a deterministic function of f, P(a, f | e) = P(a, f, len(f) | e). Now we repeatedly apply the chain rule for conditional probability:

P(a, f, len(f) | e) = P(len(f) | e) ×
P(a | len(f), e) ×
P(f | a, len(f), e)

So it's really the product of three probabilities, and Knight kind of ignored the first two, which determine how long the French sentence will be and what the alignment will be. He was really focused on the third piece, and it is true that

P(f | a, len(f), e) = Π_j t(f_j | e_{a_j})

So Knight's slip is very confusing. I think the reason he made it is just that he was focused on something else. The point he is trying to make in this section is something quite different: that the sum and product in his expression can be effectively interchanged (read the section for details), saving a huge number of multiplications.
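
(For the curious, the interchange is the identity Σ_a Π_{j=1..m} t(f_j | e_{a_j}) = Π_{j=1..m} Σ_{i=0..l} t(f_j | e_i), where m = len(f), l = len(e), and i = 0 denotes the NULL word if you use one. The left side sums over (l+1)^m alignments; the right side needs only about l × m table lookups.)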


Using a validation set for PA1

13 April 2007

You do not have to create your own. Uncomment lines 227-228 in LanguageModelTester.java (after loading the training sentences, but before loading the test pairs) and you will then have a validation set to use.

You can use this set to choose parameters, such as the lambdas for linear interpolation (trying out hand-picked values is fine).

Increasing perplexity with more training data?

12 April 2007

Some students who have been trying to investigate learning curves have reported seeing test-set perplexity increase as the amount of training data grows. This is counter-intuitive: shouldn't more training data yield a better model, which is therefore able to attain lower perplexity? Chris came up with a possible explanation involving the handling of the <UNK> token. Remember that the <UNK> token actually represents an equivalence class of tokens. As more training data is added, this equivalence class shrinks. Because the meaning of the <UNK> token is changing, model perplexities are not directly comparable. Especially when the amount of training data is small, adding more data will rapidly lower the model probability of <UNK>, causing the entropy and perplexity of the model distribution to grow.

If you've been looking at learning curves, an interesting investigation — not specifically required for the assignment — would be to measure the learning curve while holding the definition of <UNK> constant. This would mean allowing the models trained on small data sets to "know about" all the words in the largest training set. All known words in this sense would get explicit counts, which could be 0, and then you'd still have an <UNK> token representing all words which did not appear in even the largest training set.


Do we smooth at train time or test time?

12 April 2007

Generally speaking, smoothing should be the last step of training a model: first you collect counts from your training data, and then you compute a smoothed model distribution which can be applied to (that is, used to make predictions about) any test data.

In principle, it would be possible to postpone the computation of a smoothed probability until test time. But (a) it's not very efficient, because most smoothing algorithms require iterating through all the training data, which you shouldn't have to do more than once, and (b) if you're wanting to do this because your smoothing computation depends upon something in the test data, then you're doing things wrong. (For example, model probabilities should not depend on how many unknown words appear in the test data.)


How do I use stop tokens in my n-gram model?

12 April 2007

Real sentences are not infinite; they begin and end. To capture this in your n-gram model, you'll want to use so-called "stop" tokens, which are just arbitrary markers indicating the beginning and end of the sentence.

It's typically done as follows. Let <s> and </s> (or whatever) be arbitrary tokens indicating the start and end of a sentence, respectively. During training, wrap these tokens around each sentence before counting n-grams. So, if you're building a bigram model, and the sentence is

I like fish tacos

you'll change this to

<s> I like fish tacos </s>

and you'll collect counts for 5 bigrams, starting with (<s>, I) and ending with (tacos, </s>). If you encountered the same sentence during testing, you'd predict its probability as follows:

P(<s> I like fish tacos </s>) = P(<s>) · P(I | <s>) · ... · P(tacos | fish) · P(</s> | tacos)

where P(<s>) = 1. (After all, the sentence must begin.)
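
Here is a minimal, self-contained sketch of the counting step; the class and variable names are illustrative, not those of the starter code.

        import java.util.*;

        class BigramCountingSketch {
          public static void main(String[] args) {
            Map<String, Integer> bigramCounts = new HashMap<String, Integer>();
            List<String> sentence = Arrays.asList("I", "like", "fish", "tacos");

            // wrap the sentence with the boundary tokens before counting
            List<String> padded = new ArrayList<String>();
            padded.add("<s>");
            padded.addAll(sentence);
            padded.add("</s>");

            // collect the 5 bigrams: (<s>, I), (I, like), ..., (tacos, </s>)
            for (int i = 1; i < padded.size(); i++) {
              String bigram = padded.get(i - 1) + " " + padded.get(i);
              Integer old = bigramCounts.get(bigram);
              bigramCounts.put(bigram, old == null ? 1 : old + 1);
            }
            System.out.println(bigramCounts);
          }
        }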


Does PA1 require a strict proof?

12 April 2007

Q. Is it necessary to give strict mathematical proof that the smoothing we've done is proper probability distribution? Or is it enough to just give a brief explanation?

A. You should give a concise, rigorous proof. No hand-waving. I'll show an example on Friday. Note that it's important that your proof applies to your actual implementation, not some ideal abstraction.


Smoothing implementation details

12 April 2007

Do you have questions regarding details of various smoothing methods? (For example, maybe you're wondering how to compute those alphas for Katz back-off smoothing.)

You might benefit from looking at a smoothing tutorial Bill put together last year.

For greater detail, an excellent source is the Chen & Goodman paper, An empirical study of smoothing techniques for language modeling.


Smoothing and conditional probabilities

12 April 2007

Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:

P(x | y) = P(x, y) / P(y)

What if you want to use smoothing? The wrong way to compute the smoothed conditional probability distribution P(x | y) would be:

  1. From the joint P(x, y), compute a smoothed joint P'(x, y).
  2. Separately, from the marginal P(y), compute a smoothed marginal P''(y).
  3. Divide them: let P'''(x | y) = P'(x, y) / P''(y).

The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. (In fact, doing this might even yield "probabilities" greater than 1.) The right way to compute the smoothed conditional probability distribution P(x | y) is:

  1. From the joint P(x, y), compute a smoothed joint P'(x, y).
  2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
  3. Divide them: let P'(x | y) = P'(x, y) / P'(y).

Here, there is only one smoothing operation. We compute a smoothed joint distribution, and compute everything else from that.

If there's interest, I can show a worked-out example of this in Friday's section.
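
In the meantime, here is a tiny self-contained numerical sketch of the right way, using add-one smoothing of the joint over a toy two-word vocabulary purely for illustration:

        class SmoothedConditionalSketch {
          public static void main(String[] args) {
            String[] vocab = { "a", "b" };
            // raw joint counts c(y, x): rows are y, columns are x
            double[][] counts = { { 3, 0 },     // c(a,a)=3, c(a,b)=0
                                  { 1, 2 } };   // c(b,a)=1, c(b,b)=2
            double total = 6;
            int V = vocab.length;

            // step 1: smooth the joint (add-one, for illustration)
            double[][] smoothedJoint = new double[V][V];
            for (int y = 0; y < V; y++)
              for (int x = 0; x < V; x++)
                smoothedJoint[y][x] = (counts[y][x] + 1) / (total + V * V);    // P'(x, y)

            for (int y = 0; y < V; y++) {
              // step 2: marginal computed from the *smoothed* joint
              double marginal = 0;
              for (int x = 0; x < V; x++) marginal += smoothedJoint[y][x];     // P'(y)
              // step 3: divide
              for (int x = 0; x < V; x++)
                System.out.printf("P'(%s | %s) = %.3f%n",
                                  vocab[x], vocab[y], smoothedJoint[y][x] / marginal);
            }
          }
        }

Each conditional distribution in the output sums to 1, as it should; dividing by a separately smoothed marginal would not guarantee that.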

(It would also be correct to compute all the conditional distributions before doing any smoothing, and then to smooth each conditional distribution separately. This is a valid alternative to smoothing the joint distribution, and because it's simple to implement, this is often the approach used in practice. However, the results might not be as good, because less information is used in computing each smoothing function. This would be an interesting question to investigate in your PA1 submission.)


Smoothing and unknown words

12 April 2007

A few people have inquired about smoothing and unknown words (or more generally, n-grams). The basic idea of smoothing is to take some probability mass from the words seen during training and reallocate it to words not seen during training. Assume we have decided how much probability mass to reallocate, according to some smoothing scheme. The question is, how do we decide how to allocate this probability mass among unknown words, when we don't even know how many unknown words there are? (No fair peeking at the test data!)

There are multiple approaches, but no perfect solution. (This is an opportunity for you to experiment and innovate.) A straightforward and widely-used approach is to assume a special token <UNK> which represents (an equivalence class of) all unknown words. All of the reallocated probability mass is assigned to this special token, and any unknown word encountered during testing is treated as an instance of this token.
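
As a concrete fragment (illustrative only, not starter-code API), the mapping itself can be as simple as:

        static final String UNK = "<UNK>";

        // Treat any word outside the training vocabulary as the <UNK> token
        // when scoring test data.
        String normalize(String word, Set<String> vocabulary) {
          return vocabulary.contains(word) ? word : UNK;
        }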

Another approach is to make the (completely unwarranted) assumption that there is some total vocabulary of fixed size B from which all data (training and test) has been drawn. Assuming a fixed value for B allows you to fix a value for N0, the number of unknown words, and the reallocated probability mass can then be divided equally (or according to some other scheme) among the N0 unknown words. The question then arises: how do you choose B (or equivalently, N0)? There is no principled way to do it, but you might think of B as a hyperparameter to be tuned using the validation data (see M&S p. 207).

Both of these approaches have the shortcoming that they treat all unknown words alike, that is, they will assign the same probability to any unknown word. You might think that it's possible to do better than this. Here are two unknown words you might encounter, say, on the internet: "flavodoxin" and "B000EQHXQY". Intuitively, which should be considered more probable? What kind of knowledge are you applying? Do you think a machine could make the same judgment?

A paper by Efron & Thisted, Estimating the number of unseen species: How many words did Shakespeare know?, addresses related issues.


PA1 perplexity and WER

9 April 2007

Here are some estimates for what kind of numbers you should be seeing as you do PA1:

Model      Test Set Perplexity    HUB WER
Unigram    1300                   .09
Bigram     1200                   .075
Trigram    300                    .06

For the bigram and trigram models, you should be able to get your training set perplexity below 100. Of course, these numbers are just guidelines to help you check whether you are on track. Try your best to get them as low as possible!



Compiling and Running PA1

9 April 2007

Follow the instructions on the first page of the PA1 handout. On the second page, the instruction to "cd cs224n/java" is incorrect - there is no such directory (the handout has now been fixed).

You can run ant in one of two ways. From the directory ~/cs224n/pa1/java/ you can type "./ant" to use the symbolic link, or you can add the cs224n bin directory to your path variable: "setenv PATH ${PATH}:/afs/ir/class/cs224n/bin/apache-ant-1.6.2/bin/". Once you change the PATH variable, you can invoke ant by simply typing "ant" (note, however, that to compile the files you should still make sure you are in directory ~/cs224n/pa1/java/ so it can find build.xml).

ant will compile the .java files in the ~/cs224n/pa1/java/src folder and put the resulting .class files in the ~/cs224n/pa1/java/classes folder. To run these classes, you must make sure java knows to look in the classes folder:

Make sure you are in ~/cs224n/pa1/java/. Check to see whether you have anything in the CLASSPATH variable by typing "printenv CLASSPATH". If nothing was printed out, you can set the CLASSPATH by simply typing "setenv CLASSPATH ./classes". If something was printed, type "setenv CLASSPATH ${CLASSPATH}:./classes". This will preserve whatever was already stored in the variable.

You should now be able to run the class files. From ~/cs224n/pa1/java/ try typing "java cs224n.assignments.LanguageModelTester" or try out the included shell script "./run".



Do I have to do my final project in Java?

1 April 2007

No. You can use Perl, C, C++, or any other widely used programming language. Extra credit if you design a Turing machine to compute your final project. Double extra credit if you build a diesel-powered mechanical computer to compute your final project. Triple extra credit if you build a human-level AI capable of autonomously conceiving, executing, and presenting your final project.


 

Site design by Bill MacCartney