Learning Vocabulary By Frequency

Featured image by TeroVesalainen from Pixabay

Some words are more common than others. This basic linguistic principle has been researched extensively (warning: PDF). That research shows a correlation between word frequency and vocabulary acquisition, though this was already taken for granted as common knowledge.

If you ask most people who have successfully learned a language how they did it, they’ll tell you they hit a certain level, then began breaking the language down and targeting each part separately. Vocabulary and grammar are the core parts of a language, so how do you sanely break them down? Grammar is a completely different beast which we won’t get into in this article. For vocabulary though, you see a trend in what is used often, what is used sometimes, and what is rarely used.

Making Sense of Vocabulary

How many words are there in the English language? How do we count or determine a word, and what are the implications of doing so? If we are conservative and use the number of words accepted in the Oxford Dictionary, we end up with 171,476 words. This further gets filtered down per The Economist to roughly 20,000 to 35,000 for the average adult (who took a vocabulary test).

The BBC reported it took about 800 words to function in day-to-day communications, 3,000 to understand film or similar, and 8,000 to 9,000 to understand most written works. As we dig deeper, we find that there are further subdivisions dividing the rough number of words into various strata of competency. Within different fields, there are also varying levels of competency based on jargon and specific constructions.

The Chinese language exam, the HSK (Hanyu Shuiping Kaoshi), has levels corresponding to knowing 150, 300, 600, 1,200, 2,500, and 5,000 or more words. This exam is a bit more optimistic about where you’ll be at each level, but the principle is the same: work in order from the most common words to the least. But how do we figure out what is common and what is not?

Breaking Down Vocabulary

One of the hardest things to determine is what words make up each “level” for teaching. When I worked as a language teacher and a private language tutor, I had to prepare materials based on student levels and their goals. Goals are important because certain common vocabulary can be useless in certain learning situations.

Take for instance a student who wants to go abroad to study in a short-term program. They have a short-term goal and a limited duration to use their language. You probably won’t want to prepare materials about discussing sports or similar unless it impacts their studies, or basic social functioning. The more in line with the student’s goals you get, the more easily you can keep them motivated and get them to the level they need to be.

Most language instruction materials will focus on a specific goal for the language usage. One course may be specifically designed to help a learner learn the language thoroughly and so it focuses on the 150 words for the first level of the HSK (which won’t necessarily match with 1 through 150 on a frequency chart). A travel course is going to diverge heavily from this since its 150 words are going to need to make a person able to survive in a variety of situations. “Medicine” may be more common, but “insulin shot” is going to be a lot more useful if you need it on a trip.

You also have active and passive vocabulary. The word “brother” isn’t something you’ll use much if you’re an only child, but is still essential to know. When it is a passive word, you’re familiar with and understand the meaning, but you don’t actively use it.

Good learners find the right materials for their goals. What are you going to use the vocabulary for? Are you learning for a trip, or for a lifetime?

When to Learn by Frequency

Once you get past the beginner phase, both for the language and, more importantly, for language learning itself, you can begin to incorporate frequency charts more easily. By that point, a learner will understand the majority of the basic grammar (though maybe not in depth).

The DaoDeJing is composed of 806 unique characters, but 10 of them get you access to over a quarter of the text, and a meager 15 get you access to a third. By 40, you’re at half. While this is pretty useless for reading the general work, the same principle can be further expanded. We don’t care about learning one work, we want to get the most access to all of them (or conceptually the same with speech). Some topics come up more than others, and knowing how to make it through them makes the difference between having specialized knowledge and general knowledge of the vocabulary.
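Coverage figures like these come from counting how often each character (or word) appears, then adding up the counts from the most frequent item down until you hit a target share of the text. Here is a minimal sketch of that idea in Python. The function name `items_needed_for_coverage` and the toy sample string are made up for illustration; they are not from any particular corpus tool, and the sample is not the real text.

```python
from collections import Counter


def items_needed_for_coverage(tokens, target):
    """Return how many of the most frequent distinct tokens are needed
    for their combined count to reach `target` (0 to 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0
    running = 0
    # most_common() yields (token, count) pairs from most to least frequent.
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= target:
            return needed
    return len(counts)


# Toy sample: each character is treated as one token.
sample = "aaaabbbccd"
print(items_needed_for_coverage(sample, 0.5))  # prints 2
```

For a word-based language, you would pass a list of words instead of a string. How you split a text into words is its own problem and is left out here.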

Conversely, knowledge that is too general leaves you unable to understand and communicate in specialized situations. If you work in a standard office, you should know how to say “stapler”, “staple remover”, etc. or equivalent. If you’re diabetic, learn to say “insulin” if it is necessary for you. Remember, you want access to the majority of your possible linguistic interactions. If you only use formal language for all of your language interactions, learning the equivalent of “yeet” won’t help you much.

Understanding Frequency

Once you know why you’re learning, you can begin breaking down the language. We’re going to go over some of the principles with the most common words in English. These numbers were sourced from Word Frequencies in Written and Spoken English (CC BY-SA 2.0 UK).

The top 49 spoken words in English (out of a little under 4,900, so ~1%)

1% of the language gets you to 50%. This effect is extremely pronounced for the core vocabulary (~7% gets you 80%), but it gets more and more in line with the Pareto Principle as you advance into different niches and topics. As you weed out the common grammatical glue and filler words, you’ll find that certain things pop up more and more in your specific use of the language that don’t fit the general chart. “Security” is number 1405 in our spoken list, but it is more common in my speech than “Christmas” (number 468) since I work in IT.

Focusing on frequency tends to be most efficient once you get past the intermediate level if this is your first foreign language, or the early beginner level if you’re an experienced language learner. Otherwise, you have too many grammatical words which may overwhelm you at your language level. “The” is number one on basically any frequency list, but every ESL book I’ve used opted not to include it in their first dialogues because it’s a hard concept to master for most learners.

Usage and Vocabulary

Now that we know what words are most common, we want to begin to prioritize them. Each word is going to have at least one meaning or grammatical function. Some will have more. How do we determine how to target the other usages of a word?

I split the given term up on a separate document and focus on the usages which are consistent between a travel dictionary, a standard dictionary, and an extended dictionary. The other usages might be important, but they can be moved into a separate list unless they’re common. Some may be suitable for passive familiarity only. Focus on the core meaning(s) and core usage, then expand to the others if they touch on your field or goals. For our “security” example, the usage I tend towards implies “cybersecurity” rather than locks on doors and guards.

After you build your usage lists, reincorporate the common variants back into the main list. If there is more than one common usage, just include it as a separate item with the original term. Try to find a good example usage with common words or words you’re learning as well. Feel free to further add notes to this list or references to other materials if necessary.
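If you keep these lists in code rather than a document, one way to model them is an entry per term with a list of usages, each flagged as common or not. This is only a sketch of the idea. The class names `Usage` and `Entry`, the `common` flag, and the function `main_list_rows` are all invented for illustration, and the example sentences are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Usage:
    meaning: str
    example: str = ""
    common: bool = False  # True if the usage is consistent across your dictionaries


@dataclass
class Entry:
    term: str
    usages: list = field(default_factory=list)
    notes: str = ""


def main_list_rows(entries):
    """Flatten entries so each common usage becomes its own
    (term, meaning, example) row in the main list."""
    return [
        (entry.term, usage.meaning, usage.example)
        for entry in entries
        for usage in entry.usages
        if usage.common
    ]


entries = [
    Entry("security", [
        Usage("cybersecurity", "We need a security review.", common=True),
        Usage("guards and locks", "Call building security."),
    ]),
    Entry("house", [
        Usage("a building to live in", "This is my house.", common=True),
        Usage("to provide shelter (verb)", "The hall can house fifty people."),
    ]),
]

for row in main_list_rows(entries):
    print(row)
```

The usages that are not flagged stay attached to their entry, so they are still there when you want to move them into a secondary or passive list later.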

Applying Usage

Let’s see this usage in action. The following list is composed of random words just to showcase our process and doesn’t follow the actual frequency portion unless you have some really interesting goals.

An example list. Break down the vocabulary into a way you can remember how to reproduce everything related to its usage.

Once you finish this process, you have your base list. You can either use this to study from or take it further. What you include is going to depend on your language goals. For instance, with something like Thai, you may not learn the writing system if you’re just going on vacation. Feel free to skip information which doesn’t advance the language for your goals.

This approach will get you far, but lacks the depth of more targeted language learning. Jargon and specialized terms are going to be buried down the list unless you make a special effort to target them.

How to Use Frequency Learning to Go Further with Vocabulary

Try to find a frequency chart for items relevant to your target language usage. If you’re planning to travel, use a good travel book or list. If you just want to talk to people, focus on general frequency usage but also target topics you’re interested in or which are common interests. Even though I don’t like watching sports, I still know how to at least discuss basketball and similar in Mandarin because it was relevant in common conversation when I lived abroad.

The next step is to take (or build) a relevant frequency chart for our targeted field, and cross it with our original chart. If we’re working with something like musicology, “music” is probably going to be high, but still well after our standard grammar words. What we want to do is pull out any of the common words from our first chart which are not specific to our second. You can usually remove the first 500 to 1,000 words from our general chart to make a good baseline (as you advance, this gets to be a larger number). You can do this via spreadsheet or a preferred programming language (I tend to like Perl or Lua for this).
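I mentioned Perl and Lua, but the same filtering step is short in any language. Here is a rough sketch in Python, assuming the general chart is a list of words already sorted from most to least common and the topic chart is a mapping of word to count. The function name `strip_general_words` and the tiny sample data are invented for illustration.

```python
def strip_general_words(topic_counts, general_ranked, cutoff=500):
    """Drop every word from the topic chart that appears in the
    top `cutoff` entries of the general frequency list."""
    common = set(general_ranked[:cutoff])
    return {
        word: count
        for word, count in topic_counts.items()
        if word not in common
    }


# Made-up data: a general list (most common first) and a music-related chart.
general_ranked = ["the", "of", "and", "music", "security"]
topic_counts = {"the": 120, "of": 90, "music": 40, "tempo": 12, "chord": 9}

# A cutoff of 3 stands in for the 500 to 1,000 you would use on real data.
print(strip_general_words(topic_counts, general_ranked, cutoff=3))
```

With this sample, “the” and “of” are dropped while “music” survives, because it sits below the cutoff in the general list.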

Once you remove the common words, you’re left with a list of words you need for your goals (and a count). Add up all of the counts and calculate out the relative percentages that these words make up, and address them accordingly. I tend to keep multiple lists via something like Anki which I go through in parallel (general word set, topic A, topic B, etc.). Though this whole last step isn’t necessarily required for learning vocabulary, it definitely helps make it more efficient. If you lack the ability to find or create this data, consider using a premade resource (like a vocabulary book). It will be less effective, but better than nothing.
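The percentage step is just each count divided by the total. A small sketch in Python follows, assuming the filtered chart is a mapping of word to count. The function name `relative_percentages` and the sample counts are made up for illustration.

```python
def relative_percentages(counts):
    """Return (word, count, percent) tuples sorted from most to least
    frequent, where percent is the word's share of all counts."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (word, count, round(100 * count / total, 1))
        for word, count in ranked
    ]


# Made-up counts for a filtered topic chart.
filtered = {"tempo": 30, "music": 50, "chord": 20}

for word, count, percent in relative_percentages(filtered):
    print(f"{word}\t{count}\t{percent}%")
```

The tab-separated output is a convenience choice on my part here; it pastes cleanly into a spreadsheet if you would rather sort and review the list there.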

Limitations of This Method

This method is powerful but is limited depending on usage. The more usage items or grammatical items and the less familiar a learner is with the language, the harder it is to make proper use of frequency data. What are the rules for “the”? Another limitation is available data. Many languages have a copious amount of data (though it can be hard to find depending on your native language and language level), but other languages have less.

You also have to take into account the difficulty in processing the data. It takes having the data available and the skill to handle spreadsheet processing, or more ideally programming knowledge, to slog through this entire process. It’s possible to do by hand, but painful. The good news is, unless you change goals, it’s a one-time expense. Even if it’s not, you can just do a subset of the process rather than processing thousands of entries.

Process Summary

The process for using frequency for vocabulary learning is simple. Use (or build) a basic frequency chart for the target language. Organize the chart by commonality and begin breaking it down based on usage. Build a list of the various use cases for target words and examples based on how common they actually are. “House” may be a common word, but the verb “to house” is not as commonly used.

If you have the ability to do so, build or find a secondary frequency chart which is specific to whatever goals you have. If you’re learning to travel, use a travel guide; if you’re learning to work, learn office terms and industry-specific jargon. The frequency of use is only going to be as relevant as the frequency of your linguistic interactions.

This process is extremely efficient, but it does require you to be at a level where you can linguistically assess the value of a resource. A vocabulary list may look great, but if everything is dated, what value does it provide? On the other hand, the list may look great and not include authentic terms. If you start with frequency lists for vocabulary too early, you also run the risk of not having the grammatical backing to use the terms. You have to have the level in order to really determine whether you’re helping or hurting yourself.

Originally published at https://somedudesays.com on February 10, 2020.

I write about technology, linguistics (mainly Chinese), and anything else that interests me. Check out https://somedudesays.com for more from me!
