How to determine which skills to put on a resume

We develop many skills over time. Unfortunately, hiring managers don't have the time, energy, or need to hear about each and every one of them. Standard advice for job-seekers is to keep the resume to a single page, and when choosing which skills to list, to pick the ones most closely related to the position: read the job description and identify the skills you have that would support the role. Let's do better.

Objective:
Determine which terms are most relevant to a Data Scientist's resume in the New York City market.

Procedure:

  1. Search the jobsite for the target role with requests
  2. Parse the soup of search results to find the URLs of all job-description HTML documents
  3. Download and cache the job-description HTML
  4. Extract the job description from the HTML
  5. Extract features from each job description with NLTK
  6. Calculate tf-idf for each feature
  7. Filter and sort the results; the most relevant terms will float to the top

Find the full source on GitHub

$python --version
Python 3.8.8
$uname -srp
NetBSD 9.99.81 x86_64

Search Jobsite for target role
We're not going to use the web front-end for this; that would take too much time. Instead we are going to use Python's requests module to navigate. Since downloading content from many jobsites violates their terms of service, I will use a fictitious jobsite, "infact.com", to demonstrate. An example URL for a Data Scientist role in New York, NY could look like:

#!/usr/bin/env python3
ROLE = 'Data Scientist'
LOCATION = 'New York, NY'
URL = f'https://www.infact.com/jobs/?q={ROLE}&l={LOCATION}&sort=date'

Now we use the requests module to fetch the HTML from infact.com's server:

import requests
response = requests.get(URL)
HTML = response.text
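
As an aside, the spaces and comma in ROLE and LOCATION really ought to be URL-encoded. Rather than hand-building the query string, you can let requests encode the parameters for you:

import requests

# requests URL-encodes the query parameters automatically
response = requests.get('https://www.infact.com/jobs/',
                        params={'q': ROLE, 'l': LOCATION, 'sort': 'date'})
HTML = response.text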

Since infact.com is completely fictitious, we'll imagine we have an HTML document which contains the job listings for the role and location queried, sorted by date posted. Further, we can imagine this data is provided in a systematic way which links each listing to another page with a full job description. An excellent way to navigate through HTML tags is with the BeautifulSoup module. Let's guess that on infact.com the job content is loaded dynamically through JavaScript, and assume that inside a <script> tag each job listing is contained in a dictionary-like object called 'jobmap.'

import re       #for regular expressions
from bs4 import BeautifulSoup
soup = BeautifulSoup(HTML, 'html.parser')
jobmap = soup.find(text=re.compile('jobmap'))

Parse search results to find URLs of all job description HTML documents
The jobmap probably contains lots of data like a unique identifier of the role to be filled, the employer, the date posted, etc. Let's call the unique identifier a jobkey, or jk for short. Although not incredibly challenging, this part can get pretty hacky. With some looping, logic, and string functions, I leave it to the reader to extract the jobkeys for each job-posting. Be creative :) A sketch of one possible starting point follows.
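For illustration only, suppose the jobkeys appear inside the script text as jk:'<hex string>' pairs (the real markup will differ); a regular expression then gets us most of the way there:

import re

# Hypothetical pattern: assumes each listing carries a jobkey like jk:'1038eef0fed4b586d'
JK_PATTERN = re.compile(r"jk:'([0-9a-f]+)'")

script_text = str(jobmap)                        # the <script> text found above
jobkeys = JK_PATTERN.findall(script_text)
jobmap = {'jk': list(dict.fromkeys(jobkeys))}    # dedupe while preserving order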

With a list or dictionary of jobkeys, we can use the requests module to fetch the HTML document which contains the full job-description. Let's assume we have a dictionary of the following format:

jk
1038eef0fed4b586d
296a3d424eb8855e1
3faf9d0ffb3232d4f
......
Nffffffffffffffff

Download and cache all job description HTML documents
The HTTP-request part closely follows the format we used earlier. What's different is that a new URL is programmatically created for each jobkey, and the results are cached on disk.

Nota bene: Caching page-content is often a violation of a webpage's TOS. Check in with the host's TOS and with your own morality before caching.

import os
import requests

BASE_URL = 'https://www.infact.com/viewjob?jk='
os.makedirs('./cache', exist_ok=True)   # make sure the cache directory exists
for jk in jobmap['jk']:
   filename = './cache/' + jk
   if os.path.isfile(filename):
      continue                          # already cached; skip the request
   else:
      URL = BASE_URL + jk
      response = requests.get(URL)
      with open(filename, 'w') as f:
         f.write(response.text)
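
One small courtesy if you fetch in a loop like this: pause between requests so you aren't hammering the server. A short sleep after each fetch that actually hits the network is plenty:

import time

# inside the download loop, after each request
time.sleep(2)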

Process the beautiful content: TF-IDF
Tf-idf (term frequency-inverse document frequency) is a weighting scheme that scores how important a term is to a corpus. Term frequency rewards terms that occur often; inverse document frequency penalizes terms that occur in nearly every document. A term that shows up constantly, but only in a subset of the postings, ends up with the highest weight, which is exactly the kind of signal we want when deciding what belongs on a resume.
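To make that concrete, here is a toy illustration on a made-up three-document corpus (the exact log-scaled variants used for the real corpus are spelled out in the Calculations section below):

import math

# Hypothetical three-document mini-corpus, purely for illustration
docs = [['python', 'sql', 'python'],
        ['python', 'excel'],
        ['python', 'sql', 'spark']]

def tfidf(term):
   count = sum(doc.count(term) for doc in docs)     # raw term frequency over the corpus
   doc_count = sum(term in doc for doc in docs)     # number of documents containing the term
   tf = 1 + math.log(count) if count else 0         # log-scaled term frequency
   idf = math.log(len(docs) / doc_count) if doc_count else 0
   return tf * idf

print(tfidf('python'))   # 0.0  -- appears in every document, so it carries no signal
print(tfidf('spark'))    # ~1.1 -- rare, so it stands out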

Count raw term frequency
The task is to break each job description down first into sentences and then into tokens. Once we have tokens, stop-words are filtered out and each term is lemmatized according to its part of speech. Finally, uni-, bi-, and tri-grams are created from each sentence. Of course one could proceed to larger grams, but going much further into gram-space tends to create a sparse dataset. NLTK is a great tool for this task, though others exist. Each processed feature is then counted.

#!/usr/bin/env python3
# "process_content.py"

import os
from collections import Counter

from nltk import pos_tag
from nltk.tokenize import sent_tokenize, RegexpTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

tokenizer = RegexpTokenizer(r'\w+')
wnl = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
term_counter = Counter()

# every cached job description downloaded earlier, opened in binary mode
files = [open(os.path.join('./cache', fn), 'rb') for fn in os.listdir('./cache')]

for file in files:
   raw = file.read().decode('utf8').lower()
   for sent in sent_tokenize(raw):
      words = tokenizer.tokenize(sent)
      tagged_words = pos_tag(words)
      filtered_words = [t for t in tagged_words if t[0] not in stop_words]
      # convert_function maps Penn Treebank tags onto WordNet POS tags (see the full source)
      lemmas = [wnl.lemmatize(w[0], pos=convert_function(w[1])) for w in filtered_words]
      # build uni-, bi-, and tri-grams from each sentence's lemmas
      for num in range(1, 4):
         term_counter.update(ngrams(lemmas, num))

# print a small preview of the counted features
[ print(v) for i, v in enumerate(term_counter.items()) if i < 5 ]

The last line of the script prints out a little preview so we can be sure we're on the right track.

$./process_content.py
  (('applied',), 32)
  (('commercial',), 42)
  (('artificial','intelligence',), 79)
  (('program',), 247)
  (('equal','opportunity','employer',), 105)

Count raw document frequency
We will use what we learned in the tf bit above to help us with the next part. Namely, we now have a list of all of the features which appear in the corpus. This list will be used to determine which features are in each document. It is probably possible to determine document frequency at the same time as raw term frequency, but the added complexity probably isn't worth the efficiency gain.

# Given an input list of TERMS and a file, this function returns a
# list of bools, one for each term, indicating whether it appears in the document.

from nltk.probability import FreqDist

def which_terms_in_doc(TERMS, FILENAME):
   termsindoc = FreqDist()
   with open(FILENAME, 'rb') as f:
      sent = get_sentences(f)
      lemmas = map(get_lemmas, sent)
      termsindoc.update(get_terms(lemmas))
   return [term in termsindoc for term in TERMS]

There are a few locally defined functions in the snippet above. Their purpose should be self-evident; if you're interested in their definitions, please refer to the full source on GitHub. A sketch of how their output feeds the document counts follows.
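To connect this to the next step, the per-document boolean lists need to be accumulated into per-term document counts. A minimal sketch of that bookkeeping, using the names doc_counter, listings, and n_docs that appear below, might look like:

import os
import numpy as np

TERMS = list(term_counter.keys())        # every feature seen in the corpus
listings = os.listdir('./cache')         # the cached job descriptions
n_docs = len(listings)

# element i counts how many documents contain TERMS[i]
doc_counts = np.zeros(len(TERMS), dtype=int)
for fn in listings:
   doc_counts += np.array(which_terms_in_doc(TERMS, os.path.join('./cache', fn)))

doc_counter = dict(zip(TERMS, doc_counts))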

Calculations
There are a few different ways to calculate term frequency and document frequency. The simplest is to divide the number of occurrences by the total number of terms or documents. For the term frequency, Cambridge University suggests a modification:
    tf_t = 1 + ln(c_t)   if c_t > 0, otherwise 0

where c_t is the raw count of term t across the corpus.
For document frequency I follow a similar logarithmic treatment:
    idf_t = ln(N / df_t)

where N is the total number of documents and df_t is the number of documents containing term t.
Finally, we multiply the two weights together, term by term, to calculate tf-idf:
    tf-idf_t = tf_t × idf_t

Easy-peasy.
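Translated into Python, the log-scaled term frequency is a one-liner (numpy's log is the natural logarithm); this is the logtf helper the snippet in the next section leans on, with idf computed inline there:

import numpy as np

def logtf(count):
   # log-scaled term frequency: 1 + ln(count) when the term occurs, else 0
   return 1 + np.log(count) if count > 0 else 0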

Filter & Sort
There are many terms which are specific to a single document. Since we're attempting to learn about the role as a whole, it is prudent to filter out the rarest terms. I kept only the terms which appear in more than 1/8 of the documents. One day I'd like to plot a term-count versus percent-of-documents curve and see its shape; for now, 1/8 is a guess.

import numpy as np
import pandas as pd

# one row per term: raw corpus count plus document count
df = pd.DataFrame.from_dict(term_counter, orient='index', columns=['term_count'])
df['doc_count'] = pd.Series(doc_counter)
df['tf_corpus'] = df['term_count'].map(logtf)
df['idf'] = np.log(n_docs / df['doc_count'])
df['tf-idf_corpus'] = df['tf_corpus'] * df['idf']

# Select only terms which appear in more than 1/8 of documents
df = df[ df['doc_count'] > len(listings)/8 ]
df.sort_values(by='tf-idf_corpus', ascending=False).to_csv('./tfidf.csv')

$column -t -s, < ./tfidf.csv | head -n 50
term                    term_count  doc_count  tf_corpus  idf     tf-idf_corpus
risk                    177         47.0       6.176      2.5583  15.800
investment              100         40.0       5.605      2.719   15.244
healthcare              112         44.0       5.718      2.624   15.007
data analyst            153         52.0       6.030      2.457   14.818
medium                  188         58.0       6.236      2.348   14.64
city                    71          38.0       5.2626     2.7709  14.582
sale                    113         49.0       5.727      2.516   14.414
agency                  81          42.0       5.394      2.67    14.407
inc                     70          39.0       5.248      2.7449  14.40
analytic                101         47.0       5.61       2.5583  14.36
colleague               63          39.0       5.143      2.7449  14.117
director                75          43.0       5.31       2.647   14.07
c                       62          39.0       5.127      2.7449  14.073
dashboard               61          39.0       5.110      2.7449  14.029
portfolio               58          38.0       5.060      2.7709  14.02
integration             57          38.0       5.04       2.7709  13.974
associate               65          41.0       5.174      2.6949  13.944
data science team       62          40.0       5.127      2.719   13.944
senior data             78          45.0       5.356      2.6018  13.937
ml                      151         60.0       6.017      2.3141  13.925
hand experience         56          38.0       5.02       2.7709  13.92
architecture            77          45.0       5.343      2.6018  13.903
operational             58          39.0       5.060      2.7449  13.890
america                 55          38.0       5.007      2.7709  13.875
least                   79          46.0       5.3694     2.5798  13.852
insurance               86          48.0       5.454      2.537   13.839
asset                   68          43.0       5.219      2.647   13.81
natural                 56          39.0       5.02       2.7449  13.794
firm                    92          50.0       5.5217     2.4965  13.785
engagement              53          38.0       4.970      2.7709  13.772
science team            65          43.0       5.174      2.647   13.698
story                   54          39.0       4.9889     2.7449  13.694
business intelligence   62          42.0       5.127      2.67    13.693
reporting               84          49.0       5.430      2.516   13.667
condition               67          44.0       5.204      2.624   13.658
outcome                 58          41.0       5.060      2.6949  13.637
domain                  55          40.0       5.007      2.719   13.61
care                    125         59.0       5.8283     2.3309  13.585
b                       49          38.0       4.891      2.7709  13.55
8                       49          38.0       4.891      2.7709  13.55
enterprise              86          51.0       5.454      2.4767  13.50
influence               48          38.0       4.871      2.7709  13.497
experience use          48          38.0       4.871      2.7709  13.497
tech                    74          48.0       5.30       2.537   13.458
content                 84          51.0       5.430      2.4767  13.450
enhance                 54          41.0       4.9889     2.6949  13.44
accommodation           95          54.0       5.553      2.419   13.437
define                  70          47.0       5.248      2.5583  13.427
ai                      192         71.0       6.2574     2.145   13.42
assist                  56          42.0       5.02       2.67    13.422
compute                 56          42.0       5.02       2.67    13.422
operate                 51          40.0       4.931      2.719   13.41
track                   58          43.0       5.060      2.647   13.39
may                     72          48.0       5.276      2.537   13.388
employment opportunity  46          38.0       4.828      2.7709  13.379
prior                   46          38.0       4.828      2.7709  13.379
applicable              55          42.0       5.007      2.67    13.373
would                   50          40.0       4.912      2.719   13.358
party                   50          40.0       4.912      2.719   13.358
thrive                  52          41.0       4.951      2.6949  13.343
brand                   132         63.0       5.882      2.2653  13.32
give                    59          44.0       5.07       2.624   13.325
internal external       47          39.0       4.850      2.7449  13.313
potential               47          39.0       4.850      2.7449  13.313
use data                51          41.0       4.931      2.6949  13.291
interest                66          47.0       5.189      2.5583  13.277
federal                 53          42.0       4.970      2.67    13.27
direct                  53          42.0       4.970      2.67    13.27
next                    44          38.0       4.7841     2.7709  13.256
social                  55          43.0       5.007      2.647   13.25
love                    55          43.0       5.007      2.647   13.25
characteristic          46          39.0       4.828      2.7449  13.254
guide                   46          39.0       4.828      2.7449  13.254
connect                 46          39.0       4.828      2.7449  13.254
business need           48          40.0       4.871      2.719   13.247
point                   57          44.0       5.04       2.624   13.23
4 year                  57          44.0       5.04       2.624   13.23
record                  57          44.0       5.04       2.624   13.23
best practice           59          45.0       5.07       2.6018  13.211
activity                73          50.0       5.290      2.4965  13.207
...                     ...         ...        ...        ...     ...
work                    1393        291.0      8.239      0.735   6.058

Get Smart
We've done it. These terms may matter most when carefully choosing which characters to include on a resume. Do you have experience assessing risk? If so, it'd be a good idea to highlight it. Can you build a dashboard? If not, now's a good time to learn. The knowledge in this table is a goldmine.

If you're feeling daring, you might let the less common skills imply the more common ones: a reader can reasonably assume you have the basics if you list the advanced material. "Python" just so happens to fall 9 lines before the end of the list.