This research project was conducted under my supervision at Rensselaer Polytechnic Institute (RPI). The student who worked on this project under my guidelines and mentoring is Karthik Dusi. The project was part of the Data Analytics course that I taught during Fall 2019 at Rensselaer Polytechnic Institute.
The outcome of this project was published as a conference paper at the AAAI Symposium on AI for Social Good in August 2020.
Access to water is one of the fundamental human rights. The lack of clean water plagues many countries worldwide and is one of the world’s largest health concerns. The poor suffer most from a lack of access to improved water sources and often contract infectious diseases from unsafe water. This paper examines how the AI community can further research into the clean water data that is available and investigate the socio-economic factors that prevent some communities from gaining access to safe water sources. Preliminary and exploratory data analysis (EDA) was done on the UN data to understand the patterns, relations, and trends between related variables. Key correlations were investigated between access to improved water sources and socioeconomic factors such as GDP, corruption, and infrastructure, to understand which has the greatest effect on access. To do so, the data was curated with the Pandas package and visualizations were built in Python with the Seaborn package.
Over 1.1 billion people in the world lack access to a water source, according to the World Wildlife Organization's 2020 report. According to Worldwildlife.org, 2.7 billion people suffer from water scarcity for at least one month a year, and 2.4 billion people are affected by inadequate access to clean water. These numbers have been rising and are expected to keep rising: scientists predict that by 2025, over two-thirds of the world’s population will face water shortage issues (World Wildlife Organization 2020).
The UN provides datasets that it has collected, as well as datasets from related organizations like the WHO, at the UN Data Repository. Other sites like Our World in Data also have data relevant to understanding the problems behind the lack of access to clean water. To understand basic correlations and present ideas for the reader, a dataset with information on populations using improved water sources was explored (World Health Organization 2014). This dataset gives the percentage of the population using an improved water source for 192 countries, and further divides the percentages by whether people live in rural or urban areas. Rural areas are defined as areas that are not part of major metropolitan areas, which in turn are defined by population density and distance from the metropolitan city; the rural data reflects data collected in those areas. Similarly, the urban data reflects data collected in areas that are part of major metropolitan areas. Historical data is available from 1990 to 2012 for these countries, giving ample data to explore and analyze.
As shown in Figure 1, the project workflow starts with obtaining the data from the sources, cleaning the data, and storing it in a local repository.
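The obtain, clean, and store steps can be sketched with Pandas roughly as follows. This is a minimal illustration, not the project's actual pipeline: the column names (Country, Year, Total, Urban, Rural), the cleaning rules, the output file name, and the inline rows are all assumptions made for the example, and the values are placeholders rather than real UN figures.

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder rows standing in for a downloaded CSV file.
# Country names and percentages are invented for illustration only.
RAW_CSV = """Country,Year,Total,Urban,Rural
Country A,2012,95,99,90
Country A,2012,95,99,90
Country B,2012,,88,54
Country B,2011,70,87,52
"""


def clean_water_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows missing the total-access percentage."""
    cleaned = raw.drop_duplicates()
    cleaned = cleaned.dropna(subset=["Total"])
    return cleaned.reset_index(drop=True)


# Step 1: obtain the data (here read from an in-memory string).
raw = pd.read_csv(io.StringIO(RAW_CSV))

# Step 2: clean the data.
cleaned = clean_water_data(raw)

# Step 3: store the cleaned data locally (a temporary directory in this sketch).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "water_access_clean.csv"
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(cleaned)
```

In practice the first step would read the files downloaded from the UN and WHO sources, and the last step would write to a persistent project folder instead of a temporary directory.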
EDA was done on the collected dataset to understand the correlation between GDP per capita and the percentage of the total population with access to improved water sources, as well as the correlations between GDP and the percentages of the urban and rural populations with access to improved water sources.
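A correlation check of this kind might look like the sketch below. It assumes Pearson correlation, which the text does not specify, and uses hypothetical column names with invented values, so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Invented values for five hypothetical countries; not real GDP or access data.
data = pd.DataFrame(
    {
        "gdp_per_capita": [500, 1500, 4000, 12000, 40000],
        "pct_total": [50, 65, 80, 92, 100],
        "pct_urban": [75, 85, 93, 97, 100],
        "pct_rural": [35, 50, 70, 88, 100],
    }
)

access_cols = ["pct_total", "pct_urban", "pct_rural"]

# Pearson correlation of each access column with GDP per capita.
gdp_corr = data[access_cols].corrwith(data["gdp_per_capita"])
print(gdp_corr)
```

Comparing the three coefficients side by side is one simple way to see whether GDP tracks rural access differently from urban access.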
Combining this initial dataset with other indicators provided by the World Bank (The World Bank 2018, 2020) added variables measuring Government Effectiveness, Overall Infrastructure, and the Corruption Perception Index. The initial dataset ranged from 1990 to 2012, but the World Bank dataset only covered 1995 to 2012. For preliminary purposes, the following analyses were done on data for the year 2012. Looking at only 124 countries in 2012, the heatmap in Figure 6 was generated to investigate correlations.
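The merge and heatmap steps could be sketched as follows. The column names, the inner join on country and year, and the "Blues" colormap are assumptions made for the illustration, and every value is invented. An inner join keeps only countries present in both tables, which is one plausible reason a combined dataset covers fewer countries than the original.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented access percentages for hypothetical countries (not real data).
water = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "E"],
        "year": [2012] * 5,
        "pct_total": [55, 70, 82, 93, 99],
        "pct_rural": [38, 55, 71, 89, 98],
    }
)

# Invented indicator values; country "F" has no match in the water table.
indicators = pd.DataFrame(
    {
        "country": ["A", "B", "C", "D", "F"],
        "year": [2012] * 5,
        "gov_effectiveness": [-1.1, -0.4, 0.2, 1.0, 1.6],
        "infrastructure": [2.1, 3.0, 3.4, 4.8, 5.9],
        "corruption_index": [22, 31, 45, 62, 80],
    }
)

# Keep only country-year pairs that appear in both tables.
merged = water.merge(indicators, on=["country", "year"], how="inner")

# Restrict to the single year under study and compute pairwise correlations.
snapshot = merged[merged["year"] == 2012]
corr = snapshot.drop(columns=["country", "year"]).corr()

# Darker blue cells mark higher correlation values.
ax = sns.heatmap(corr, annot=True, cmap="Blues", vmin=-1, vmax=1)
ax.set_title("Correlation of access and indicators (toy data)")
plt.tight_layout()
```

Calling plt.savefig with a file name after the last line would write the figure to disk.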
In the heatmap, a stronger blue color indicates a higher correlation between the two variables. We see a medium to strong correlation between the corruption perception index (Corruption Perceptions Index 2020) and the percentage of the rural population with access. One interpretation is that some rural populations lack access to improved water sources because of corruption, as reflected in the corruption perception index. We can also observe a strong relationship between government effectiveness and the percentage of the rural population that has access to improved water sources, which makes sense given that more effective governments are able to provide water sources to all parts of the country. There is a medium correlation between the infrastructure rating and the percentage of the total population with access to improved water sources. This could be because the infrastructure rating considers all infrastructure in the country; it may be more prudent to look only at water-related infrastructure, such as drainage basins, sewers, and reservoirs.
The AI community can help leverage this data and turn it into a usable tool for governments and relief organizations by helping them predict where resources must be allocated first to enhance access to improved water sources. A model that predicts where clean water sources will deplete, given trends in GDP, infrastructure, and other socioeconomic factors, would be very useful, as several scholars assert that by 2025, two-thirds of the world’s population will face water shortage (World Wildlife Organization 2020).
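As a rough illustration of the prediction idea, the sketch below fits a straight-line trend to each country's historical rural access and ranks countries by projected access. Linear extrapolation is a stand-in chosen for simplicity, not a method from the paper, and the column names, target year, and values are all hypothetical. A realistic model would also use the socioeconomic indicators as features.

```python
import numpy as np
import pandas as pd

# Invented access history for two hypothetical countries (not real data).
history = pd.DataFrame(
    {
        "country": ["A"] * 3 + ["B"] * 3,
        "year": [2010, 2011, 2012] * 2,
        "pct_rural": [40.0, 42.0, 44.0, 60.0, 59.0, 58.0],
    }
)


def project_access(group: pd.DataFrame, target_year: int) -> float:
    """Extrapolate a least-squares linear trend to target_year, clipped to 0-100."""
    years = group["year"].to_numpy(dtype=float)
    values = group["pct_rural"].to_numpy(dtype=float)
    center = years.mean()  # center the years for numerical stability
    slope, intercept = np.polyfit(years - center, values, deg=1)
    projected = slope * (target_year - center) + intercept
    return float(np.clip(projected, 0.0, 100.0))


projections = {
    country: project_access(group, target_year=2015)
    for country, group in history.groupby("country")
}

# Countries with the lowest projected access come first in the priority list.
priority = sorted(projections, key=projections.get)
print(projections, priority)
```

Ranking by projected access is only one possible allocation rule; ranking by the projected decline would instead highlight countries whose access is shrinking fastest.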
Additionally, models can be used to investigate where water quality is low. Machines and water filters that continuously check whether the water is safe to drink could be given a data collection feature, providing data that data scientists can use to narrow down where water contamination is happening. Prototypes of devices that detect low water quality and report the data to a database already exist and could be used for this application. Using machine learning techniques and neural networks, this data, coupled with other socio-economic datasets, can be used for further analysis and to develop prediction models. Lack of clean water leads to many infectious diseases, such as deadly diarrheal diseases, cholera, and typhoid. By using a model to see where no clean water is available, medical professionals can help prevent the spread of infectious diseases in those areas. Stakeholders for this type of application would be public policy experts, healthcare professionals, and infrastructure professionals, who could help provide data and insights into which socioeconomic factors most often prevent access to clean water.
This paper presents a preliminary understanding of what could be done to collect and explore data to help solve the problem of access to improved water sources. Looking at correlations between key indicators and populations with access to water sources provides a basic understanding of which features to use in future models. Additionally, focusing on rural rather than urban populations may be more productive, since urban areas tend to be well developed and have good water sources. We plan to use the insights gained from this initial analysis to test different hypotheses and research questions in the future.