lsh 6/11/2023

The first part that contributes to the complexity of the transformer on long sequences is the dot-product attention. To improve it, I will show you how to use locality-sensitive hashing, which you've learned before.

In this picture, you can see what attention is doing. Take the word "it": attention focuses on certain words to determine whether "it" refers to the street or to the animal. For example, in the sentence "the animal didn't cross the street because it was too tired," "it" refers to the animal. In the second example, "the animal didn't cross the street because it was too wide," "it" again refers to either the street or the animal, but in this case it refers to the street. In both attention patterns, you can see that "it" is either the animal or the street. You only need to look at the nouns, because "it" can only refer to the nouns, not to all of the words. A pronoun is a word that substitutes for a noun, like "it" for the animal or "it" for the street in the examples you just saw. So when working with pronouns, you know you only need to look at the other nouns, and you can ignore words like "it," "was," and "tired." This is an example of why you only want to work with the nearest neighbors to speed up the attention part.

You already covered nearest neighbors in Course 1, Week 4 of the NLP Specialization. If KNN or locality-sensitive hashing sounds unfamiliar, please take a moment to review those lessons; feel free to skip ahead if you're already comfortable with both KNN and LSH. You already know from earlier courses that you can use locality-sensitive hashing to reduce the computational cost of finding k-nearest neighbors. Let me show you how to do the same for attention.

Using locality-sensitive hashing, you can hash both the query q and the key k. This helps you group similar query and key vectors together, just like the nouns and pronouns in the examples you saw before. Then you only run attention on keys that are in the same hash buckets as the query. When choosing the hash, you want the buckets to be roughly the same size. You know that hash(x) = sign(xR), where R is a random matrix of size d (the dimension) by the number of hash bins, and the sign tells you which side of each plane x falls on. The process is then repeated depending on the number of hashes that you have.

You also already know how standard attention works, so let me show you how to speed it up using LSH attention. First, you hash Q and K; then you perform standard attention, but only within same-hash bins. This reduces the search space for each key to the LSH buckets it shares with the query. You can repeat this process multiple times to increase the probability of finding Q and K in the same bin, and it can be done efficiently to take advantage of parallel computing.

Now I'll show you how to integrate LSH into attention layers. To start, you modify the model so that it outputs a single vector at each position, which serves both as a query and a key. This is called QK attention, and it performs just as well as regular attention. Next, you map each vector to a bucket with LSH, then you sort the vectors by LSH bucket, and finally you do attention only within each bucket. You could do this one bucket at a time, but that doesn't take advantage of hardware parallelism. Instead, I'll show you how to do a batch computation. The first step for batching is to split the sorted sequence into fixed-size chunks, which allows for some parallel computation. Then you let each chunk attend within itself and to the adjacent chunks. This covers the case where a hash bucket is split over more than one chunk, as you can see for the blue, yellow, and magenta buckets here. The sketches below illustrate these steps.
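To make the hashing step concrete, here is a minimal NumPy sketch of the random-hyperplane hash described above. The function name lsh_hash and the bit-packing step are my own illustration: the lesson only specifies hash(x) = sign(xR), and packing the resulting signs into bits is one common way to turn them into integer bucket ids (giving 2 to the power of n_hyperplanes buckets).

```python
import numpy as np

def lsh_hash(x, n_hyperplanes, rng):
    """Random-hyperplane LSH sketch: the sign of x @ R says which side of each
    random plane x falls on; the sign pattern is packed into a bucket id."""
    d = x.shape[-1]
    R = rng.normal(size=(d, n_hyperplanes))            # random projection matrix
    signs = np.sign(x @ R) > 0                         # (..., n_hyperplanes) booleans
    powers = 2 ** np.arange(n_hyperplanes)             # interpret the signs as bits
    return signs @ powers                              # integer bucket per vector

rng = np.random.default_rng(0)
vectors = rng.normal(size=(16, 64))                    # 16 vectors of dimension 64
buckets = lsh_hash(vectors, n_hyperplanes=4, rng=rng)  # bucket ids in [0, 16)
```

Repeating this with several independent R matrices gives the multiple hash rounds mentioned above, which raises the chance that similar vectors share a bucket in at least one round.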
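The shared query-key modification can be sketched the same way. This is a toy illustration, not the course's Trax code; the parameter names w_qk and w_v are assumptions, and the only point is that a single projection per position serves as both the query and the key.

```python
import numpy as np

def shared_qk_attention_scores(x, w_qk, w_v):
    """Shared-QK ("QK attention") sketch: one projection per position is used
    as both query and key; values keep their own projection."""
    qk = x @ w_qk                                  # (seq_len, d_head), plays both roles
    v = x @ w_v                                    # value vectors
    scores = qk @ qk.T / np.sqrt(qk.shape[-1])     # queries and keys are the same vectors
    return scores, v

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 32))                      # toy sequence: 10 positions, d_model = 32
scores, v = shared_qk_attention_scores(x, rng.normal(size=(32, 8)), rng.normal(size=(32, 8)))
```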
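Finally, here is a simplified, single-hash sketch of the batched computation: sort the positions by bucket, split the sorted sequence into fixed-size chunks, and let each chunk attend to itself and the chunk before it, masking out keys whose bucket differs from the query's. The function, its previous-chunk-only lookback, and the masking scheme are illustrative assumptions rather than the exact Reformer implementation.

```python
import numpy as np

def chunked_lsh_attention(qk, v, buckets, chunk_size):
    """Batched LSH attention sketch: sort by bucket, chunk, and attend within
    each chunk plus its neighbor, so buckets split across a chunk boundary
    are still covered."""
    seq_len, d = qk.shape
    order = np.argsort(buckets, kind="stable")          # sort positions by LSH bucket
    qk_s, v_s, b_s = qk[order], v[order], buckets[order]

    n_chunks = seq_len // chunk_size                    # assumes seq_len % chunk_size == 0
    qk_c = qk_s.reshape(n_chunks, chunk_size, d)
    v_c = v_s.reshape(n_chunks, chunk_size, d)
    b_c = b_s.reshape(n_chunks, chunk_size)

    out_sorted = np.zeros_like(v_s)
    for i in range(n_chunks):
        prev = i - 1                                     # neighboring chunk
        keys = np.concatenate([qk_c[prev], qk_c[i]]) if i > 0 else qk_c[i]
        vals = np.concatenate([v_c[prev], v_c[i]]) if i > 0 else v_c[i]
        key_b = np.concatenate([b_c[prev], b_c[i]]) if i > 0 else b_c[i]

        scores = qk_c[i] @ keys.T / np.sqrt(d)           # standard dot-product scores
        same_bucket = b_c[i][:, None] == key_b[None, :]  # only attend within the same bucket
        scores = np.where(same_bucket, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out_sorted[i * chunk_size:(i + 1) * chunk_size] = weights @ vals

    out = np.empty_like(out_sorted)                      # undo the sort
    out[order] = out_sorted
    return out

rng = np.random.default_rng(1)
qk = rng.normal(size=(32, 16))                           # stand-in shared QK vectors
v = rng.normal(size=(32, 16))
buckets = rng.integers(0, 8, size=32)                    # in practice these come from the LSH hash
out = chunked_lsh_attention(qk, v, buckets, chunk_size=8)
```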
In Course 4 of the Natural Language Processing Specialization, you will: a) translate complete English sentences into German using an encoder-decoder attention model, b) build a Transformer model to summarize text, c) use T5 and BERT models to perform question-answering, and d) build a chatbot using a Reformer model. By the end of this Specialization, you will have designed NLP applications that perform question-answering and sentiment analysis, created tools to translate languages and summarize text, and even built a chatbot!

Learners should have a working knowledge of machine learning, intermediate Python including experience with a deep learning framework (e.g., TensorFlow, Keras), as well as proficiency in calculus, linear algebra, and statistics. Please make sure that you've completed Course 3, Natural Language Processing with Sequence Models, before starting this course.

This Specialization is designed and taught by two experts in NLP, machine learning, and deep learning. Younes Bensouda Mourri is an Instructor of AI at Stanford University who also helped build the Deep Learning Specialization. Łukasz Kaiser is a Staff Research Scientist at Google Brain and the co-author of TensorFlow, the Tensor2Tensor and Trax libraries, and the Transformer paper.