My projects


Top 5 companies in Canada for "data analyst" (keywords: Data Extraction, REST API, Python, JSON, Adzuna)

- As a nerd, I'm always curious about which employers in Canada have the most vacancies in the data analysis field. So, here it comes. This simple mini project is built on raw data extracted from Adzuna, a large third-party job platform based in the UK that gathers listings from other job sites such as Indeed, Glassdoor and LinkedIn. Adzuna also provides a RESTful API, so I can get the data in a clean and neat format (JSON) in a safer and more efficient way than with other methods such as web scraping (scraping returns raw HTML and sometimes runs into copyright or legality problems; in one word, it's annoying). Most importantly, it's free. There is actually no "analysis" in this project: Adzuna does that on its servers and returns the results. All I need to do is access their endpoint with an HTTP GET request, extract the results, parse them from JSON into a data frame, visualize them, and then save them to a CSV file. So, let's go!

Register on the Adzuna API site and get API credentials (an app ID and an app key)
Get the endpoint for "Top companies" in Canada. An endpoint is like a "web address" (URL); Adzuna has 9 endpoints for different functions, e.g. get the job list in a certain area, get top companies, get salaries (what I need now is "get top companies"). As per Adzuna, this endpoint returns a "leaderboard" of the top five employers by number of vacancies, and I can filter by location and/or job title
Get familiar with the parameters defined for that endpoint. I only used the "what" parameter (what = 'data%20analyst'); '%20' here represents a space character. For a more specific search, say the top 5 employers by number of Java developer vacancies in Ontario, just set what = 'java%20developer', location0 = 'Canada', location1 = 'Ontario' (by default, location0 = 'Canada' when we select country = 'ca')
In Python (I used Jupyter Notebook), set up the URL including the parameters, in my case:
url = f'https://api.adzuna.com/v1/api/jobs/ca/top_companies?app_id={APP_ID}&app_key={APP_KEY}&what={WHAT}'
Make a GET request to that URL to get an HTTP response
If the response code is 200, access the value of the "leaderboard" key in the returned dictionary (a key-value pair in JSON) and convert it into a pandas data frame with two columns, "canonical_name" and "count"; otherwise, report the error code
Save this data frame into a CSV file
Plot the results in a bar chart (or simply insert a chart in Excel)
Done!
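
The parameter handling in the steps above can also be sketched with Python's standard library, so that the '%20' encoding does not have to be typed by hand. This is only a sketch under assumptions: build_top_companies_url is a hypothetical helper name (it is not part of Adzuna's API), the location0 / location1 values are the ones from the Ontario example, and I am assuming the endpoint accepts them as ordinary query parameters.

```python
from urllib.parse import urlencode, quote

BASE_URL = 'https://api.adzuna.com/v1/api/jobs/{country}/top_companies'


def build_top_companies_url(app_id, app_key, what, country='ca', **locations):
    """Build the request URL (hypothetical helper, not part of Adzuna's API)."""
    params = {'app_id': app_id, 'app_key': app_key, 'what': what}
    # Optional location filters, e.g. location0='Canada', location1='Ontario'
    params.update(locations)
    # quote_via=quote encodes a space as %20 (the default would use '+')
    query = urlencode(params, quote_via=quote)
    return BASE_URL.format(country=country) + '?' + query


# 'YOUR_ID' and 'YOUR_KEY' are placeholders for real credentials
url = build_top_companies_url('YOUR_ID', 'YOUR_KEY', 'java developer',
                              location0='Canada', location1='Ontario')
print(url)
```

Passing the search phrase with a plain space and letting urlencode do the escaping avoids mistakes when the phrase contains other special characters.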

Here I copy & paste my code:


						# Created by Han Zhang (you can copy it but pls put my name on it, if it doesn't work for you, email me)
						import requests
						import pandas as pd
						import matplotlib.pyplot as plt

						# Replace the placeholders with your own Adzuna credentials
						APP_ID = 'YOUR_ID'
						APP_KEY = 'YOUR_KEY'

						# Define the job to search ('%20' is a URL-encoded space)
						WHAT = 'data%20analyst'

						# File to store results
						csv_file = 'Top_Companies.csv'

						url = f'https://api.adzuna.com/v1/api/jobs/ca/top_companies?app_id={APP_ID}&app_key={APP_KEY}&what={WHAT}'

						# Make the API request
						response = requests.get(url)

						if response.status_code == 200:
						    data = response.json()

						    # Parse JSON and extract the leaderboard data
						    leaderboard = data['leaderboard']

						    # Convert leaderboard data to a DataFrame
						    df = pd.DataFrame(leaderboard)

						    # Save the DataFrame to a CSV file
						    df.to_csv(csv_file, index=False)
						    print(f"Data saved to {csv_file}")

						    # Plot the data in a horizontal bar chart
						    if not df.empty and 'canonical_name' in df and 'count' in df:
						        plt.figure(figsize=(10, 6))
						        bars = plt.barh(df['canonical_name'], df['count'], color='skyblue')

						        # Add labels and title
						        plt.xlabel('Count', fontsize=12)
						        plt.ylabel('Company Name', fontsize=12)
						        plt.title('Top Companies by Count of Job Vacancies (data analyst) - valid as of 2024-12-10 12:55', fontsize=14)
						        plt.gca().invert_yaxis()  # Invert the Y-axis so the top company appears first

						        # Add data labels next to each bar
						        for bar in bars:
						            width = bar.get_width()
						            plt.text(width + 0.2, bar.get_y() + bar.get_height()/2, str(int(width)),
						                     va='center', ha='left', fontsize=10, color='black')

						        # Show the plot
						        plt.tight_layout()
						        plt.show()
						    else:
						        print("DataFrame is empty or required columns are missing.")
						else:
						    # Use an f-string: '&' cannot join a string and an integer
						    print(f"DATA WAS NOT EXTRACTED, ERROR CODE: {response.status_code}")
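
As a side note, the parse-and-save step does not strictly need pandas; the standard csv module can do it too. Here is a minimal sketch, assuming the response has the same shape the code above reads (a "leaderboard" list of dictionaries with "canonical_name" and "count"). The sample payload and company names are made up for illustration, and save_leaderboard is a hypothetical helper name.

```python
import csv
import json
import os
import tempfile

# Made-up payload that mimics the shape read by the code above (not real data)
sample_json = '''
{"leaderboard": [
    {"canonical_name": "Company A", "count": 12},
    {"canonical_name": "Company B", "count": 7}
]}
'''


def save_leaderboard(payload, path):
    """Write the leaderboard rows to a CSV file and return them."""
    rows = [
        {'canonical_name': item['canonical_name'], 'count': item['count']}
        for item in payload.get('leaderboard', [])
    ]
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['canonical_name', 'count'])
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Write to the system temp directory so the sketch leaves no file in the project
path = os.path.join(tempfile.gettempdir(), 'top_companies_sample.csv')
rows = save_leaderboard(json.loads(sample_json), path)
print(rows)
```

Keeping only the two named fields means any extra keys in the response are ignored rather than ending up as extra CSV columns.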

Here is the result of my script: the top 5 companies by number of data analyst vacancies in Canada. (Please note that your result might differ from mine if you run it at a different time or day.)

The result is very interesting. That the top company has fewer than 20 vacancies in the entire country makes me doubt it. I don't know how Adzuna handles natural language, but my experience with Google search tells me "the fewer keywords, the more results". So I searched for "data" instead of "data analyst", and the result proved me right. But that brings another problem: it's too vague. Data what? Tons of jobs are related to data. In conclusion, both searches are valid; if I searched for "apex developer for just salesforce platform", I might get 0 results. The key point is to define the purpose of the analysis in advance and to find a "tolerance point" between too narrow and too broad.
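
One way to look for that "tolerance point" is to run the same request for several search phrases and compare the totals. The sketch below is only an illustration: compare_terms and fake_fetch are hypothetical names, and the numbers returned by fake_fetch are invented so the example runs offline. In practice, fetch would wrap the requests.get call from the code above.

```python
def compare_terms(terms, fetch):
    """Return {term: total vacancies across the returned leaderboard}.

    `fetch` is any callable that takes a search phrase and returns the
    parsed JSON dictionary of the top_companies request.
    """
    totals = {}
    for term in terms:
        leaderboard = fetch(term).get('leaderboard', [])
        totals[term] = sum(item['count'] for item in leaderboard)
    return totals


def fake_fetch(term):
    # Invented numbers for illustration only: fewer words, more matches
    base = 100 // len(term.split())
    return {'leaderboard': [
        {'canonical_name': 'Company A', 'count': base},
        {'canonical_name': 'Company B', 'count': base // 2},
    ]}


totals = compare_terms(['data', 'data analyst', 'senior data analyst'], fake_fetch)
for term, total in totals.items():
    print(f'{term!r}: {total}')
```

Passing the fetch function in as an argument keeps the comparison logic testable without an API key or a network connection.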

Anyway, this is a first step in playing with the Adzuna API, and it's still at an abstract level. I will dig into it and explore more.

References:
[1] Adzuna API (https://developer.adzuna.com/overview)
[2] RedHat (https://www.redhat.com/en/topics/api/what-is-a-rest-api)
