Project 1: Tidying Up Plastic

April 6, 2021   

0.0 Introduction

No one can go a day without using plastic. It has become an essential item in many of the products we buy or use regularly, furniture to clothing to electronics. Because plastic is so ubiquitous, it becomes easy at times to forget about the impact it has on the environment. Through this project, I hope to learn about the different types of plastic (I keep forgetting what all the recycling numbers mean) and their main producers. That way I can do my part to recycle correctly and avoid plastics that have the largest impact on the environment. I also hope to learn how external factors play a role in a country’s plastic waste production.

0.1 Imports

I will be using three datasets for this report: plastics, plaEcon, and gdp.

The plastics dataset was imported from tidytuesdayR and contains information of plastic waste collected by country for several households. The plastic waste is sorted by recycling codes and labeled with the parent company that produced the plastic. tidytuesdayR must be installed to import the dataset (see code below). The seven plastic types are listed below in order of their recycling number:

  1. pet (polyethylene terephthalate)
    • used in soft drink bottles, food containers
  2. hdle (high-density polyethylene)
    • used in plastic milk containers, plastic bags, bottle caps, oil cans
  3. pvc (polyvinyl chloride)
    • used in window frames, chemical bottles, pipes
  4. ldpe (low-density polyethylene)
    • used in plastic/Ziploc bags, squeeze bottles
  5. pp (polypropylene)
    • used in flower pots, bumpers, industrial fibers, microwavable food containers
  6. ps (polystyrene)
    • used in toys, video cassettes, beer cups, Styrofoam
  7. o (other)
    • polycarbonate, polyamide, bioplastics

The plaEcon or new-plastics-economy-global commitment dataset was imported from data.world.com and contains information about various companies and the plastic packaging volume (tons) they produced in the year 2019.

Lastly, the gdp dataset was imported from World Bank and contains the GDP per capita (USD) for all countries between the years 1960 and 2020. However, we will only select the GDP for the year 2019, since the other two datasets are limited to this year.

The last two datasets (as well as the .Rmd and .R files) can be found on my Project 1 GitHub folder. Code to import the datasets is below. Just hit run and everything should be fine!

# uncomment to install these package for project 1
#install.packages(c("tidytuesdayR", "gapminder", "plotly"))

library(tidyverse)
library(ggplot2)
library(plotly)    # used for fancy plots
library(gapminder) # used for fancy plots
library(cluster)   # PAM

## plastics
plastics <- tidytuesdayR::tt_load('2021-01-26')$plastics
## 
##  Downloading file 1 of 1: `plastics.csv`
## Plastic production per volume (tons) for major plastic 
## producing companies in 2019
plaEcon <- read.csv("https://raw.githubusercontent.com/viggy-ravi/sds348_project1/main/data/new-plastics-economy-global%20commitment.csv",
                 sep=",", na.strings="NA", strip.white=T, stringsAsFactors=F)

## Worldbank GDP per Capita 
gdp <- read.csv("https://raw.githubusercontent.com/viggy-ravi/sds348_project1/main/data/worldbank_gdp.csv",
                sep=",", na.strings="NA", strip.white=T, stringsAsFactors=F,
                skip=4)

1.0 Tidying

1.1 Exploring the data

Before we jump right in, we first need to explore the data a bit to understand what information is contained in these datasets, if they are missing any values (NA or otherwise), and how we can best join the three datasets together. Let’s start with the plastics dataset.

# glimpse at plastics
plastics %>% summary()
##    country               year      parent_company         empty         
##  Length:13380       Min.   :2019   Length:13380       Min.   :   0.000  
##  Class :character   1st Qu.:2019   Class :character   1st Qu.:   0.000  
##  Mode  :character   Median :2019   Mode  :character   Median :   0.000  
##                     Mean   :2019                      Mean   :   0.412  
##                     3rd Qu.:2020                      3rd Qu.:   0.000  
##                     Max.   :2020                      Max.   :2208.000  
##                                                       NA's   :3243      
##       hdpe               ldpe                o                  pet          
##  Min.   :   0.000   Min.   :    0.00   Min.   :     0.00   Min.   :    0.00  
##  1st Qu.:   0.000   1st Qu.:    0.00   1st Qu.:     0.00   1st Qu.:    0.00  
##  Median :   0.000   Median :    0.00   Median :     0.00   Median :    0.00  
##  Mean   :   3.046   Mean   :   10.32   Mean   :    49.61   Mean   :   20.94  
##  3rd Qu.:   0.000   3rd Qu.:    0.00   3rd Qu.:     2.00   3rd Qu.:    0.00  
##  Max.   :3728.000   Max.   :11700.00   Max.   :120646.00   Max.   :36226.00  
##  NA's   :1646       NA's   :2077       NA's   :267         NA's   :214       
##        pp                 ps                pvc          grand_total       
##  Min.   :   0.000   Min.   :   0.000   Min.   :  0.00   Min.   :     0.00  
##  1st Qu.:   0.000   1st Qu.:   0.000   1st Qu.:  0.00   1st Qu.:     1.00  
##  Median :   0.000   Median :   0.000   Median :  0.00   Median :     1.00  
##  Mean   :   8.221   Mean   :   1.862   Mean   :  0.35   Mean   :    90.15  
##  3rd Qu.:   0.000   3rd Qu.:   0.000   3rd Qu.:  0.00   3rd Qu.:     6.00  
##  Max.   :6046.000   Max.   :2101.000   Max.   :622.00   Max.   :120646.00  
##  NA's   :1496       NA's   :1972       NA's   :4328     NA's   :14         
##    num_events       volunteers   
##  Min.   :  1.00   Min.   :    1  
##  1st Qu.:  4.00   1st Qu.:  114  
##  Median : 15.00   Median :  400  
##  Mean   : 33.37   Mean   : 1118  
##  3rd Qu.: 42.00   3rd Qu.: 1416  
##  Max.   :145.00   Max.   :31318  
##                   NA's   :107
plastics %>% select(country) %>% unique() %>% count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    69

From the quick summary, we see that there is a total of 13,328 rows that describe the plastic waste for 69 countries. The plastic is sorted into the first seven recycling codes (in numerical order: pet, hdpe, pvc, ldpe, pp, ps, o) and contains some other information about how the plastic was collected. For the purpose of this report, we are mostly interested in the country, parent_company, and the seven recycling codes. Also, based on the summary, we see that there are a lot of missing values that may need to be taken care of later. (This dataset looks like it needs to be pivoted longer!)

# look at pla_econ
head(plaEcon)
##                     company   company_type plastic_packaging_volume_in_tons
## 1 Colgate-Palmolive Company packaged goods                           287008
## 2               Danone S.A. packaged goods                           750000
## 3                    Diageo packaged goods                            40000
## 4                 Essity AB packaged goods                    not disclosed
## 5                 H&M Group packaged goods                    not disclosed
## 6      Henkel AG & Co. KGaA packaged goods                    not disclosed
plaEcon %>% summary()
##    company          company_type       plastic_packaging_volume_in_tons
##  Length:191         Length:191         Length:191                      
##  Class :character   Class :character   Class :character                
##  Mode  :character   Mode  :character   Mode  :character

The plaEcon dataset is simpler and only contains information about each major company, the amount of plastic packaging it produced in the year 2019, and what the plastic was used for. Note that some of the plastic packaging is not disclosed in this dataset. We might want to change these to NA values so its easier to manipulate later on.

# glimpse at gdp
dim(gdp)
## [1] 264  66
# columns
names(gdp)
##  [1] "Country.Name"   "Country.Code"   "Indicator.Name" "Indicator.Code"
##  [5] "X1960"          "X1961"          "X1962"          "X1963"         
##  [9] "X1964"          "X1965"          "X1966"          "X1967"         
## [13] "X1968"          "X1969"          "X1970"          "X1971"         
## [17] "X1972"          "X1973"          "X1974"          "X1975"         
## [21] "X1976"          "X1977"          "X1978"          "X1979"         
## [25] "X1980"          "X1981"          "X1982"          "X1983"         
## [29] "X1984"          "X1985"          "X1986"          "X1987"         
## [33] "X1988"          "X1989"          "X1990"          "X1991"         
## [37] "X1992"          "X1993"          "X1994"          "X1995"         
## [41] "X1996"          "X1997"          "X1998"          "X1999"         
## [45] "X2000"          "X2001"          "X2002"          "X2003"         
## [49] "X2004"          "X2005"          "X2006"          "X2007"         
## [53] "X2008"          "X2009"          "X2010"          "X2011"         
## [57] "X2012"          "X2013"          "X2014"          "X2015"         
## [61] "X2016"          "X2017"          "X2018"          "X2019"         
## [65] "X2020"          "X"

Lastly, the gdp dataset contains information about the GDP for 264 countries from the year 1960 to 2020. We are only interested in the year 2019.

1.2 Data cleaning and pivoting

Great! Now we can start reducing out datasets to contain only the necessary columns. I will go ahead and rename some of the column names so they’re easier to call later on. I will also pivot longer the plastics dataset to contain the recycling code (recyc_codes) and its respective count for each (country, parent_company) grouping. I will also convert the not disclosed rows to NA values in the plaEcon dataset and select only the X2019 column in the gdp dataset.

# tidy plastics
plastics <- plastics %>% 
  select(-year, -empty, -num_events, -volunteers, -grand_total) %>%
  # the dataset has additional rows for Grand Total that we can filter out
  filter(parent_company != 'Grand Total')

tidy_plastic <- plastics %>% 
  pivot_longer(3:9, names_to="recyc_codes", values_to="count")

glimpse(tidy_plastic)
## Rows: 93,296
## Columns: 4
## $ country        <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Ar~
## $ parent_company <chr> "Unbranded", "Unbranded", "Unbranded", "Unbranded", "Un~
## $ recyc_codes    <chr> "hdpe", "ldpe", "o", "pet", "pp", "ps", "pvc", "hdpe", ~
## $ count          <dbl> 155, 50, 532, 848, 122, 114, 17, 0, 0, 0, 222, 35, 0, 0~
## looks tidy!
glimpse(tidy_plastic)
## Rows: 93,296
## Columns: 4
## $ country        <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Ar~
## $ parent_company <chr> "Unbranded", "Unbranded", "Unbranded", "Unbranded", "Un~
## $ recyc_codes    <chr> "hdpe", "ldpe", "o", "pet", "pp", "ps", "pvc", "hdpe", ~
## $ count          <dbl> 155, 50, 532, 848, 122, 114, 17, 0, 0, 0, 222, 35, 0, 0~
# tidy pla_econ
tidy_plaEcon <- plaEcon %>%
  rename(ppv = plastic_packaging_volume_in_tons) %>%
  mutate(ppv = na_if(ppv, "not disclosed"))

# tidy gdp
tidy_gdp <- gdp %>% 
  select(Country.Name, X2019) %>%
  rename(gdp2019 = X2019)

1.3 Spelling corrections

It looks like a few country names are different between the plastics and gdp datasets. Ideally, we could have converted the country names to ISO country codes, but since there are only a few differences, I will do this manually. (I compared the differences by directly looking at the World Bank website.)

# find differing countries
anti_join(tidy_plastic, tidy_gdp, by=c("country"="Country.Name")) %>% 
  select(country) %>% distinct()
## # A tibble: 9 x 1
##   country                                           
##   <chr>                                             
## 1 Cote D_ivoire                                     
## 2 ECUADOR                                           
## 3 EMPTY                                             
## 4 Hong Kong                                         
## 5 NIGERIA                                           
## 6 Taiwan_ Republic of China (ROC)                   
## 7 United States of America                          
## 8 Korea                                             
## 9 United Kingdom of Great Britain & Northern Ireland
# This is what I need to change

# Cote D_ivoire                   -> Cote d'Ivoire (not in first join) (ignore)
# ECUADOR                         -> Ecuador
# Hong Kong                       -> Hong Kong SAR, China
# NIGERIA                         -> Nigeria
# Taiwan_ Republic of China       -> not in World Bank (ignore)
# United States of America        -> United States
# Korea                           -> Korea, Rep.
# United Kingdom of Great Britain -> United Kingdom

tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "ECUADOR", "Ecuador")
tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "Hong Kong", "Hong Kong SAR, China")
tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "NIGERIA", "Nigeria")
tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "United States of America", "United States")
tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "Korea", "Korea, Rep.")
tidy_plastic$country <- str_replace_all(tidy_plastic$country, 
                                        "United Kingdom of Great Britain & Northern Ireland", 
                                        "United Kingdom")

tidy_plastic %>% filter(country %in% c("Ecuador",
                                      "Hong Kong SAR, China",
                                      "Nigeria",
                                      "United States",
                                      "Korea, Rep.",
                                      "United Kingdom")) %>%
  select(country) %>% distinct()
## # A tibble: 6 x 1
##   country             
##   <chr>               
## 1 Ecuador             
## 2 Hong Kong SAR, China
## 3 Nigeria             
## 4 United Kingdom      
## 5 United States       
## 6 Korea, Rep.

Wonderful! We can start confidently joining now.

2.0 Joining/Merging

Overall, it looks like the plastics and plaEcon datasets can be joined along the parent_company/company columns, and the plastics and gdp datasets can be joined along the country columns.

# how many unique companies are there
tidy_plastic %>% select(parent_company) %>% unique() %>% count() %>% pull()
## [1] 10822
tidy_plaEcon %>% select(company) %>% unique() %>% count() %>% pull()
## [1] 182
# inner join on plastics and plaEcon
temp <- inner_join(tidy_plastic, tidy_plaEcon, by=c("parent_company"="company")) %>% 
  mutate(count = replace_na(count, 0))

# how many mutual companies
temp %>% select(parent_company) %>% unique() %>% count() %>% pull()
## [1] 14
# how many unique countries are there
temp %>% select(country) %>% unique() %>% count() %>% pull()
## [1] 59
tidy_gdp %>% select(Country.Name) %>% unique() %>% count() %>% pull()
## [1] 264
# inner join on plastics/plaEcon and gdp
df <- inner_join(temp, tidy_gdp, by=c("country"="Country.Name")) %>%
  mutate(ppv = as.numeric(ppv)) %>%
  drop_na() %>%
  distinct()

# how many mutual countries
df %>% select(country) %>% unique() %>% count() %>% pull()
## [1] 58

Since all datasets contain some valuable information/columns, an inner join is necessary. From the first inner join, we saw that plastics contains a lot more companies (10,822) than the plaEcon dataset (182). So, we will be losing a lot of information from the plastic dataset. This may impact the accuracy of future correlations. After the join, we see that only 14 companies lasted with 1925 data points. Next, from the second inner join, we see that temp contains 59 countries (dropped by 10) and gdp contains 264 countries. The final data frame contains 58 countries, as expected (the EMPTY countries were dropped). I have dropped the rows where ppv has NA values, since we will be needing this column for future analysis. There is also no easy way to estimate these values since they are specific to the company.

3.0 Wrangling

Now for the interesting part! There are a few chunks below, so I will list the main ideas of them here, and describe the results at the end of this section.

  • Summary of df data frame
  • Convert (mutate) recycling codes to their corresponding numbers (1-7)
  • Countries with the most/least diversity of plastics (by number of companies) - we will use some of this data in a future plot
  • Count number of plastics per recycling code
  • Top 5 companies that produce the most plastic packaging per volume
  • Plastic count in higher/lower gdp rated countries (median gdp)

3.1 Summary of data frame

## summary 
df %>% summary()
##    country          parent_company     recyc_codes            count        
##  Length:1234        Length:1234        Length:1234        Min.   :   0.00  
##  Class :character   Class :character   Class :character   1st Qu.:   0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :   0.00  
##                                                           Mean   :  26.95  
##                                                           3rd Qu.:   2.00  
##                                                           Max.   :3268.00  
##  company_type            ppv             gdp2019       
##  Length:1234        Min.   :  40000   Min.   :  679.3  
##  Class :character   1st Qu.: 129000   1st Qu.: 3659.0  
##  Mode  :character   Median : 610000   Median : 9912.3  
##                     Mean   :1413248   Mean   :21529.4  
##                     3rd Qu.:3000000   3rd Qu.:33228.2  
##                     Max.   :3000000   Max.   :81993.7

3.2 Mutate

## mutate recycling code to number
df <- df %>%
  mutate(recyc_nums = case_when(
    recyc_codes == "pet"  ~ 1,
    recyc_codes == "hdpe" ~ 2,
    recyc_codes == "pvc"  ~ 3,
    recyc_codes == "ldpe" ~ 4,
    recyc_codes == "pp"   ~ 5,
    recyc_codes == "ps"   ~ 6,
    recyc_codes == "o"    ~ 7
  )) %>%
  arrange(country, parent_company, recyc_nums)
glimpse(df)
## Rows: 1,234
## Columns: 8
## $ country        <chr> "Argentina", "Argentina", "Argentina", "Argentina", "Ar~
## $ parent_company <chr> "Carrefour", "Carrefour", "Carrefour", "Carrefour", "Ca~
## $ recyc_codes    <chr> "pet", "hdpe", "pvc", "ldpe", "pp", "ps", "o", "pet", "~
## $ count          <dbl> 0, 0, 0, 2, 2, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 1, 0, 1~
## $ company_type   <chr> "retail & hospitality", "retail & hospitality", "retail~
## $ ppv            <dbl> 57000, 57000, 57000, 57000, 57000, 57000, 57000, 129000~
## $ gdp2019        <dbl> 9912.282, 9912.282, 9912.282, 9912.282, 9912.282, 9912.~
## $ recyc_nums     <dbl> 1, 2, 3, 4, 5, 6, 7, 1, 2, 2, 3, 4, 5, 5, 6, 7, 7, 1, 2~

3.3 Plastic diversity

# find countries with most diverse plastics
df_max <- df %>% 
  select(country, parent_company) %>%
  group_by(country) %>%
  summarise(num_companies = n()) %>%
  slice_max(num_companies, n=5)
df_max
## # A tibble: 5 x 2
##   country       num_companies
##   <chr>                 <int>
## 1 United States            49
## 2 Argentina                46
## 3 Switzerland              42
## 4 Vietnam                  38
## 5 India                    36
# find countries with least diverse plastics
df_min <- df %>% 
  select(country, parent_company) %>%
  group_by(country) %>%
  summarise(num_companies = n()) %>%
  slice_min(num_companies, n=5)
df_min
## # A tibble: 8 x 2
##   country     num_companies
##   <chr>               <int>
## 1 Benin                   7
## 2 Cyprus                  7
## 3 El Salvador             7
## 4 Romania                 7
## 5 Rwanda                  7
## 6 Serbia                  7
## 7 Singapore               7
## 8 Togo                    7
# save the country names for a future plot
top_countries <- df_max %>% pull(country)
bot_countries <- df_min %>% pull(country)

3.4 Plastic count

# group_by (recycling code), summarize (count)
df %>% 
  group_by(recyc_codes) %>%
  summarise(total_count = sum(count), number = mean(recyc_nums)) %>%
  arrange(desc(total_count))
## # A tibble: 7 x 3
##   recyc_codes total_count number
##   <chr>             <dbl>  <dbl>
## 1 pet               20186      1
## 2 o                  4867      7
## 3 pp                 4506      5
## 4 hdpe               1858      2
## 5 ldpe               1704      4
## 6 pvc                  76      3
## 7 ps                   54      6

3.5 Plastic producers

# top 5 companies (ppv)
df %>% 
  select(parent_company, ppv) %>%
  distinct() %>%
  top_n(5) %>%
  arrange(desc(ppv))
## # A tibble: 5 x 2
##   parent_company            ppv
##   <chr>                   <dbl>
## 1 The Coca-Cola Company 3000000
## 2 Tetra Pak              721000
## 3 Unilever               610000
## 4 Mars, Incorporated     129000
## 5 SC Johnson              90000

3.6 GDP analysis of plastic waste

# filter recyc_codes by median gdp
med_gdp <- df %>% summarise(median(gdp2019)) %>% pull()
med_gdp
## [1] 9912.282
df_top_med <- df %>%
  filter(gdp2019 > med_gdp) %>%
  group_by(recyc_codes) %>%
  summarise(total_count = sum(count), number = mean(recyc_nums)) %>%
  arrange(desc(total_count))
df_top_med
## # A tibble: 7 x 3
##   recyc_codes total_count number
##   <chr>             <dbl>  <dbl>
## 1 pet                2032      1
## 2 o                  1296      7
## 3 pp                  348      5
## 4 ldpe                142      4
## 5 hdpe                131      2
## 6 ps                    7      6
## 7 pvc                   7      3
df_top_med %>% summarise(upper_med_plastic_count = sum(total_count)) %>% pull()
## [1] 3963
df_low_med <- df %>%
  filter(gdp2019 < med_gdp) %>%
  group_by(recyc_codes) %>%
  summarise(total_count = sum(count), number = mean(recyc_nums)) %>%
  arrange(desc(total_count))
df_low_med
## # A tibble: 7 x 3
##   recyc_codes total_count number
##   <chr>             <dbl>  <dbl>
## 1 pet               17895      1
## 2 pp                 4106      5
## 3 o                  3547      7
## 4 hdpe               1716      2
## 5 ldpe               1559      4
## 6 pvc                  69      3
## 7 ps                   47      6
df_low_med %>% summarise(lower_med_plastic_count = sum(total_count)) %>% pull()
## [1] 28939

The data frame summary indicates that there are 1234 data points (left) and all NA's seem to have filtered out through the joins. The most interesting columns seem to be the ppv and gdp2019 purely based on their interquartile range vales. First, I will convert the recycling codes to their corresponding number codes (1-7). Specifically, pet=1, hdpe=2, pvc=3, ldpe=4, pp=5, ps=6, o=7. This will help arrange/sort any future analysis if necessary. From section 3.4 it looks like pet is the most common plastic found in waste.

Section 3.3 shows the plastic diversity of various countries. I measured this by counting the number of distinct parent companies for each country (the parent company refers the the company that produced the plastic that was later found in waste). I was expecting countries with higher GDP in general to have more diverse plastics, while countries with lower GDP to have less diverse plastics. Based on the results, it looks like the United States, Argentina, Switzerland, Vietnam, and India had the most diversity of plastics, while Benin, Cyprus, El Salvador, Romania, Serbia, Singapore, and Togo had the least diversity of plastic. That was an interesting result, as there is a mix of high GDP and low GDP countries in both groups. For example, India has a GDP of ~2100 USD (high diversity) while Singapore has a GDP of 65,000 USD (low diversity). It looks like both GDP and other external factors (like lifestyle and business presence) might play a factor in the diversity of plastics in countries.

I wanted to delve deeper into how GDP affects plastic waste collected. Section 3.6 explores the amount of plastic produced by countries with GDP higher and lower than the median GDP ~9,900 USD. From previous studies, its clear that countries with lower GDP typically produce more plastic waste since they have less infrastructure to recycle their waste. This was reflected in the results, as higher GDP countries produced ~4,000 pieces of plastic waste, while lower GDP countries produced ~29,000 pieces of plastic waste!

4.0 Visualization

4.1 Correlation plot

Based on the analysis done in section 3.0, it seemed clear that there is a correlation between GDP and plastic waste. However, the correlation plot below doesn’t seem to reflect that. It looks like the 2019 GDP does not seem to correlate with any individual plastic type (or the total plastic count). This might be due to limited data, both in the plastic dataset and in the gdp dataset. In general, GDP seems to have a slight negative correlation with each plastic (similarly there was a higher negative correlation between GDP and total plastic count). Interestingly, some plastics seem to correlate with each other. Specifically, ps and ‘pet’ have a reasonable structures with hdpe and ldpe.

To produce this correlation plot, I joined the original plastics dataset with the gdp dataset to see how GDP correlates with each type of plastic.

df2 <- inner_join(plastics, tidy_gdp, by=c("country"="Country.Name"))

cormat2 <- df2 %>% select(is.numeric) %>% cor(use="pair")

tidycor2 <- cormat2 %>% as.data.frame %>% rownames_to_column("var1") %>%
  pivot_longer(-1, names_to="var2", values_to="correlation")

tidycor2 %>% ggplot(aes(var1, var2, fill=correlation)) + 
  geom_tile() +
  scale_fill_gradient2(low="red", mid="white", high="blue") +
  geom_text(aes(label=round(correlation,2)), color="black", size=4) +
  theme(axis.text.x = element_text(angle=90, hjust=1)) + 
  coord_fixed()

4.2 Interactive plot

The plot below is one of my most favorite plots. The plot below shows the distribution of plastic found in each country by plastic type. The data points are different sizes based on the GDP of the country (larger circle = higher GDP). This is an interactive map, where you can click on the circles in the legend (right) to remove/add countries from the plot. You can also double-click on a circle in the legend to isolate the country on the plot. Moreover, you can hover over points on the plot to get more information about the data. I limited the countries to the most and least diverse countries (from section 3.3) since plotting all 58 countries was a bit too busy. However, the main idea still stands: in general, countries with larger GDP (larger circles) produce less plastic waste, while countries with lower GDP (smaller circles) produce more plastic waste. Feel free to remove the larger circles from the plot to see all the countries.

# plot 1: count vs gdp by country

## concatenate top and bottom countries (diversity)
## top: US, Argentina, Switzerland, Vietnam, India
## bot: Benin, Cyprus, El Salvador, Romania, Rowanda, Serbia, Singapore, Togo
filter_countries <- c(top_countries, bot_countries)

## filter out these countries only
df_filter <- df %>% filter(country %in% filter_countries)

## plot
p1 <- ggplot(df_filter, aes(recyc_codes, count, size=gdp2019, color=country)) + 
  geom_point() +
  labs(title="Amount of Plastic by Country and GDP (2019)",
       x="Recycling Codes", y="Plastic Count")
ggplotly(p1)

4.3 Plastic type vs company type

The last plot shows the distribution of plastic types for each company type (packaging goods, packaging producers, and retail & hospitality). Packaging goods has by far the most total plastic, with the most plastic type being pet. Packaging producers seems to mostly contain only o (other) types of plastics. Lastly, there is limited data for the retail & hospitality.

## plot 2: company by ppv
company_type_stats <- df %>%
  select(company_type, recyc_codes, count) %>%
  group_by(company_type, recyc_codes) %>%
  summarise(mean_plastic = floor(mean(count)),
            sd_plastic = sd(count),
            n=n(), .groups = 'drop',
            se_plastic = sd_plastic/sqrt(n)) %>%
  mutate(sd_plastic = replace_na(sd_plastic, 0)) %>%
  mutate(se_plastic = replace_na(se_plastic, 0))

ggplot(company_type_stats, aes(recyc_codes, mean_plastic)) +
  facet_wrap(~company_type) + 
  geom_bar(stat='summary', fun=mean) +
  geom_errorbar(aes(y=mean_plastic,
                 ymin=mean_plastic-se_plastic,
                 ymax=mean_plastic+se_plastic)) +
  labs(title="Average plastic type per company type",
       x="Recycling Codes", y="Global Average Plastic Count")

5.0 PAM clustering

# step 1: process data
df_gower <- df %>%
  select(-recyc_nums) %>%
  group_by(country) %>%
  summarise(count = sum(count), gdp2019) %>%
  distinct() %>%
  ungroup() %>%
  mutate_if(is.character, as.factor)
# step 2: choose k value
gower <- daisy(df_gower, metric="gower")

sil_width <- vector()
for (i in 2:10){
  pam_fit <- pam(gower, diss=T, k=i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

ggplot() + 
  geom_line(aes(x=1:10, y=sil_width)) + 
  scale_x_continuous(name="k", breaks=1:10)

# step 4: run cluster analysis (PAM)
pam3 <- pam(gower, diss=T, k=3)

df_gower %>% mutate(cluster=pam3$clustering)%>%
  group_by(cluster) %>% 
  summarise_if(is.numeric, .funs=list("mean"=mean, 
                                      "median"=median,
                                      "sd"=sd), na.rm=T) %>%
  pivot_longer(contains("_")) %>%
  separate(name, sep="_", into=c("variable", "stat")) %>%
  pivot_wider(names_from="variable", values_from="value") %>%
  arrange(stat)
## # A tibble: 9 x 4
##   cluster stat   count gdp2019
##     <int> <chr>  <dbl>   <dbl>
## 1       1 mean    345.   7549.
## 2       2 mean    139.  47561.
## 3       3 mean   6061.   2510.
## 4       1 median  210.   6703.
## 5       2 median   46   46195.
## 6       3 median 6949    2230.
## 7       1 sd      439.   5915.
## 8       2 sd      191.  16709.
## 9       3 sd     2216.    869.
# step 5: visualize cluster
plot(pam3, which=2)

For the clustering analysis, I looked at the variables country, total plastic count, and GDP. From the silhouette width plot, we see that k=3 clusters is the best option for the given data. With further analysis, we see that group 1 has a mean plastic count of 345 and GDP of ~7,550 USD, group 2 has a mean plastic count of 139 and GDP of ~47,560 USD, and group 3 has a mean plastic count of 6,061 and GDP of 2,510 USD. Moreover, group 2 is the high GDP/low plastic producing countries, group 1 is the medium GDP/medium plastic producing countries, and group 3 is the low GDP/high plastic producing countries.



comments powered by Disqus