Coursework designed to help you understand further concepts and demonstrate practical skills related to the Hadoop environment.


Coursework tasks and problem statement

1. Identify and evaluate a number of publicly available datasets related to air pollution and the severity of respiratory disease. These may be from sources such as kaggle.com or data.gov.uk
2. Select appropriate datasets, as informed by their interests
3. Integrate and import these datasets into a suitable data storage and processing system, providing rationale for their choice
4. Perform meaningful analysis of the data to derive some simple, useful information, as can be obtained from the datasets selected.
5. Provide visualisation of the analysis through any Hadoop-related technologies which the students deem suitable.
"If you need the complete report and codes, please leave a comment on this blog. Few screenshots, steps, and codes are not included assuming that you are able to figure them out. If you are having difficulty getting the output, then please leave a comment and I will help you with that."

Sample Coursework for the above task

Identifying and evaluating datasets related to air pollution and the severity of respiratory disease using Hadoop.

Abstract: In order to determine whether air pollution has any impact on respiratory disorders, I will analyze various datasets on the subject in this assignment. As is well known, a situation like this involves a lot of data, making it a big data problem. I will use the Hadoop ecosystem to extract relevant information from the data gathered. By utilizing the Hadoop Distributed File System (HDFS) and visualization tools, we can discover some important insights. Most of the complexity involved in distributed processing and storage is abstracted and managed by Hadoop. The datasets are stored in HDFS, and the filtered and aggregated information is visualized using Apache Zeppelin and Apache Hive. By analyzing historical data, this assignment will determine how respiratory diseases are related to air pollution in every nation. It will also explain why I chose HDFS, Hive, Zeppelin, YARN, etc. over other Hadoop ecosystem technologies.

  1. Introduction 
Recorded data were scarce before the world became digital, and the majority of data were documents such as tables. With a single system, we could easily finish any task involving such files, so storing and analyzing data was not a problem. After the development of the internet, once the globe went digital, the amount of data captured rose quickly, and that data also came in a variety of formats. Given the size of such data sets, their analysis has grown into a significant undertaking. Data is produced every microsecond, and as a result the problem of big data has emerged. Using a single storage device and processor to manage massive data is challenging. The Hadoop framework is what we have to address these problems. The first of its three core elements, the Hadoop Distributed File System (HDFS), is made to cope with big data: it stores huge data across distributed nodes and clusters. The second element is MapReduce, which divides enormous amounts of diverse data into smaller pieces for processing (the map phase) and then aggregates the partial results into the final output (the reduce phase). The third element, YARN, is the resource manager for jobs running on a Hadoop cluster. Hadoop also includes a number of other components, such as Hive, Pig, Spark, Flume, and Sqoop. In order to tackle our big data problem about the correlation between air pollution and respiratory disorders, we will leverage some of these Hadoop components.

    2. Discussion of the problem and justification of the dataset

The stated problem is finding a link between air pollution and respiratory ailments in various nations from 2011 to 2017. From personal experience, we can generally affirm that there is a relationship: anytime we are in an environment with poor air quality, we might experience some suffocation or discomfort. But that is just personal experience, and we need to consider whether this is the case everywhere. For instance, during the COVID-19 pandemic, the majority of people who contracted the coronavirus had a history of respiratory disorders such as asthma, COPD, and lung cancer. Therefore, it is necessary to determine the causes of respiratory disorders, whether there are any differences in the number of patients between nations and, if so, what the most prevalent practices in those countries are, and so on. Since we are doing an analysis based on air pollution statistics, we must determine the air quality in countries with more and with fewer respiratory disease sufferers. If there is a correlation, we must investigate the causes of air pollution in these nations. For example, countries with the most factories will produce the most hazardous air pollutants, while countries that use solid fuels such as wood for cooking will also produce air pollutants, though these may not be as dangerous. We must also determine whether there is a connection between air pollution and particular respiratory disorders. For instance, not all respiratory disorders are necessarily caused by air pollution; some can also be caused by lifestyle choices. Cigarette smoking, for example, is more likely to cause lung cancer than air pollution is. I have downloaded the "Cause of Deaths" and "PM2.5 Global Air Pollution 2010-2017" datasets from Kaggle. I chose these datasets because they contain data on fatality rates related to air pollution from 2011 to 2017 across several nations. After analyzing these two datasets, we will be able to understand how air pollution causes respiratory illnesses, which can be fatal.

    3.  The documentation of the technical solution 

Since we are dealing with large amounts of data, we must use Hadoop to analyze the data that has already been chosen. Apache Hive, Hadoop MapReduce, and Apache Pig, which are all available in the Hadoop environment, can be used to execute this task. In our case, I use Apache Hive due to its extensive querying flexibility; I favor querying over scripting. Let me explain why I prefer Apache Hive over Apache Pig and MapReduce:

|                      | Apache Pig          | Apache Hive         | Hadoop MapReduce   |
|----------------------|---------------------|---------------------|--------------------|
| Language             | Scripting           | Query               | Compiled           |
| Level of abstraction | High                | High                | Low                |
| Lines of code        | Fewer lines of code | Fewer lines of code | More lines of code |
| Code efficiency      | Less efficient      | Less efficient      | Highly efficient   |


Here we can see that Hive and Pig have similar characteristics, but since I favor querying over Pig-style scripting, I will use Apache Hive. Apache Hive is a fault-tolerant, distributed data warehouse that makes it possible to conduct large-scale analytics. A data warehouse provides rapid examination of data, which facilitates data-driven decision making. Hive's SQL-like language enables users to read, write, and manage petabytes of data. Hive is built on top of Apache Hadoop, an open-source platform for data storage and analysis. Hive is well optimized for managing massive amounts of data and has tight Hadoop integration. Hive is distinguished by its SQL-like interface, which may be combined with Apache Tez or MapReduce to query massive datasets.

The diagram depicts the operation of Apache Hive. The Hive drivers support applications written in many languages, including Python and Java. Clients such as JDBC and ODBC connect applications to the Hive server. Hive services such as HiveServer2 enable clients to execute queries against Hive; further Hive services include the compiler, driver, and metastore. Like Apache Pig, Apache Hive uses the MapReduce framework as its default engine for query execution, as MapReduce is capable of processing massive volumes of data, and YARN serves as the resource manager. Given that Hive is part of the Hadoop ecosystem, HDFS is used for distributed storage. Hive can run in both local and MapReduce modes. Because we are working with two distinct datasets and need to run several tasks over them, it is preferable to use a query language like HiveQL, which runs on a MapReduce engine and lets simple queries analyze massive volumes of data. Because HiveQL is so similar to SQL, and because we can use the JDBC interpreter with Zeppelin's visualization features, Apache Hive is the technology I have chosen to analyze the selected datasets.
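
For illustration, a client can connect to HiveServer2 over JDBC using the Beeline CLI. This is a minimal sketch; the host, port, and user are assumptions for a typical single-node setup, not values from the original report:

# connect to HiveServer2 over its JDBC endpoint (assumed defaults)
beeline -u "jdbc:hive2://localhost:10000/default" -n hadoop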

After selecting our datasets and the appropriate analysis technology, Apache Hive, the next step is making these datasets accessible in the Hadoop Distributed File System (HDFS).

Screenshot i

I established the directory /user/hadoop in HDFS with open access permissions and used the PuTTY client to run the "wget" command to download the datasets from Kaggle (Screenshot i). At this stage, we need to copy our datasets from the local directory to the newly created HDFS directory (Screenshot ii). I further made two directories under /user/hadoop called "airpollution" and "resp_des_death" (Screenshot iii). A sketch of these shell commands follows the screenshots below.

Screenshot ii

Screenshot iii
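
The following is a minimal sketch of these shell steps. The download URLs and local file names are placeholders, and the open permissions simply mirror the description above:

# download the datasets (URLs are placeholders for the actual Kaggle links)
wget -O cause_of_deaths.csv https://example.com/cause_of_deaths.csv
wget -O AirPollution.csv https://example.com/AirPollution.csv

# create the HDFS directories with open access permissions
hdfs dfs -mkdir -p /user/hadoop/airpollution /user/hadoop/resp_des_death
hdfs dfs -chmod -R 777 /user/hadoop

# copy the datasets from the local directory into HDFS
hdfs dfs -put cause_of_deaths.csv /user/hadoop/resp_des_death/
hdfs dfs -put AirPollution.csv /user/hadoop/airpollution/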

The cause-of-deaths dataset has 208,114 rows, while the pollution dataset has 19,280 rows. However, we need to organize these data and combine the necessary parts in another table. Because we are only considering air pollution and respiratory disease for the years 2011 to 2017, we need to remove a few rows and columns from the present datasets and then combine the results into one single dataset. Before uploading to HDFS, I first made some adjustments to the datasets in the local directory. I then uploaded the cause_of_deaths.csv and AirPollution.csv files to the HDFS folders created under /user/hadoop using the Ambari server's Files view.

Then, using the default database, I created two tables named "airpollution" and "cause_of_deaths" by uploading the corresponding "AirPollution.csv" and "cause_of_deaths.csv" files. Now let's look at the statistics of both tables: the raw data size for 'airpollution' is 59,527 bytes and for 'cause_of_deaths' is 18,001,861 bytes. We can now use Apache Hive to run a few queries to further organize the data.
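
For readers without Ambari, an equivalent table could be declared over the uploaded CSV directly in Hive. This is only a sketch: the column names and types are assumptions based on the fields used later, and the cause_of_deaths schema would additionally list its many cause columns:

CREATE EXTERNAL TABLE airpollution (
  country STRING,            -- country name (assumed column)
  year INT,                  -- observation year (assumed column)
  air_quality_index DOUBLE   -- PM2.5-based air quality index (assumed column)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/airpollution'
TBLPROPERTIES ('skip.header.line.count'='1');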

SELECT Country, Year, Lower_Respiratory_Infections, Chronic_Respiratory_Diseases FROM cause_of_deaths WHERE Year > 2010 AND Year < 2018

The cause_of_deaths table has a large number of death causes; however, we are only selecting the information pertaining to respiratory diseases. After extracting only the data linked with respiratory disorders such as lower respiratory infections and chronic respiratory diseases for the period from 2011 to 2017 across the different nations, the output table will be saved into the directory with the name 'respiratory_desc.csv'.
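
One way to materialize this filtered output from Hive is a directory write; the following is a sketch, and note that INSERT OVERWRITE DIRECTORY produces delimited part files under the target directory rather than a single file named respiratory_desc.csv:

-- write the filtered respiratory-disease rows out as comma-delimited files
INSERT OVERWRITE DIRECTORY '/user/hadoop/resp_des_death/respiratory_desc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT country, year, lower_respiratory_infections, chronic_respiratory_diseases
FROM cause_of_deaths
WHERE year > 2010 AND year < 2018;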

In the same way, I will reduce the 'AirPollution.csv' file to the required data and save it to the airpollution directory with the name 'AirPollution_redu.csv'. After deleting duplicate and irrelevant data from the files, we must merge the two newly created tables into one new table in order to carry out the analysis. I combined the tables into a single one named "merged_data", with the filename "merged_data.csv" in the directory "merged_file", as sketched below.
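
One way to build the merged table is a CREATE TABLE ... AS SELECT over a join on country and year. This is a sketch assuming the two reduced tables are registered in Hive as respiratory_desc and airpollution_redu (names inferred from the files above):

-- join the reduced respiratory-disease and air-pollution tables per country and year
CREATE TABLE merged_data AS
SELECT r.country,
       r.year,
       r.lower_respiratory_infections,
       r.chronic_respiratory_diseases,
       a.air_quality_index
FROM respiratory_desc r
JOIN airpollution_redu a
  ON r.country = a.country AND r.year = a.year;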

    4. Result of the analysis and insights 

Next, I verified the correlation between lower respiratory infections and the air quality index for the evaluation, and I named the resulting column corr_lri_aqi. I also checked the correlation between chronic respiratory diseases and the air quality index, and I named the resulting column corr_crd_aqi.

SELECT country, corr(lower_respiratory_infections, air_quality_index) corr_LRI_AQI FROM merged_data GROUP BY country

We got the following result with the column name corr_lri_aqi; the screenshot shows only the first few countries with their correlations.


In the same way, I checked the correlation between chronic respiratory diseases and the air quality index.

SELECT country, corr(chronic_respiratory_diseases, air_quality_index) corr_CRD_AQI FROM merged_data GROUP BY country

We got the result with the column name corr_crd_aqi. The screenshot below shows only the first few countries.

As we all know, Zeppelin is a fantastic tool for visualization in the Hadoop environment. In order to run Hive queries in Zeppelin and produce a visualization, I used the jdbc(hive) interpreter. Let's check how Zeppelin renders the visualization after I combined the correlation queries for chronic respiratory diseases and lower respiratory infections into a single query against the air quality index. I won't go into great detail regarding the Zeppelin visualization here because I have already demonstrated it in a screen-captured video.

%jdbc(hive)
SELECT country,
       avg(air_quality_index) AS avg_AQI,
       corr(lower_respiratory_infections, air_quality_index) AS corr_LRI_AQI,
       corr(chronic_respiratory_diseases, air_quality_index) AS corr_CRD_AQI
FROM merged_data
GROUP BY country

We can see from this visualization that the air quality index and respiratory disorders have a strong correlation. When the air quality index is high, air pollution is more severe and there is a higher chance of respiratory disease.

Let's now look at the Tez view, which displays all Hive queries together with their performance data, to see how we might use it to locate specific Hive queries.

Now let's look at which countries had the highest and lowest air quality indices between 2011 and 2017. We can see that, between 2011 and 2017, Finland had the lowest air quality index and Nepal had the highest.
Let's compare respiratory illnesses and air quality indices for Nepal and Finland exclusively for this period. We can see from this visualization that Nepal is more susceptible to respiratory illnesses than Finland, and the explanation is that Nepal's air quality index is higher than Finland's.
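
Behind this comparison, a ranking query of the following shape could be used; this is a sketch (merged_data already covers only 2011 to 2017):

-- rank countries by average air quality index over the period
SELECT country, avg(air_quality_index) AS avg_AQI
FROM merged_data
GROUP BY country
ORDER BY avg_AQI DESC
-- ORDER BY avg_AQI ASC would surface the lowest air quality indices instead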

We will now check for respiratory illnesses in nations with air quality indices exceeding 60. We can observe that the majority of Asian nations, including China, India, and others, have air quality indices above 60 and more respiratory ailments. To safeguard the ecosystem and the wellbeing of all living creatures, air pollution must be reduced right away. 

Also, when we check the countries which have an air quality index below 10, we get the corresponding result. After evaluating the two outcomes, we can conclude that countries with larger populations have both more respiratory disease and a higher air quality index.
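
Both threshold checks are simple filters over the merged table; a sketch, assuming the column names used above:

-- countries and years with poor air quality
SELECT country, year, air_quality_index,
       lower_respiratory_infections, chronic_respiratory_diseases
FROM merged_data
WHERE air_quality_index > 60
-- replace the condition with air_quality_index < 10 for the second check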

In order to determine whether population affects air pollution, let's look at India and examine its air quality index from 2011 to 2017. We can see that India consistently has an average air quality index of around 90, which is exceedingly poor for the nation.
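
A sketch of the per-country drill-down used here, again assuming the column names from the merged table:

-- yearly average air quality index for a single country
SELECT year, avg(air_quality_index) AS avg_AQI
FROM merged_data
WHERE country = 'India'
GROUP BY year
ORDER BY year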

We can observe the same thing when looking at China or any other country with a large population. India should reduce air pollution to improve global health. Finally, I checked the air quality index for all the countries, and the result shows that countries with higher populations have both higher air pollution and more respiratory diseases.

    5. Conclusion 

In conclusion, we find that every country's air quality and respiratory disorders are closely associated. Additionally, population has a significant impact on air quality: air pollution will increase as the population grows. A large number of factories, vehicles, and solid fuels like wood in a country will result in a high air quality index, which will increase the number of respiratory disorders there. Additionally, when we consider Asian nations, we find that they have higher rates of respiratory illness than other regions of the world. This is because Asian nations burn large farm fields after harvesting to prepare the land for subsequent cultivation, and because the majority of factories are located in Asian nations, while the rest of the world only consumes the products. Additionally, as the population grows, more people drive motorcycles, which emit air pollutants. The fact that respiratory diseases are also present in nations with low population density, albeit at a much lower rate, indicates that other factors, such as climate and way of life, can also contribute to respiratory diseases. In conclusion, I believe that respiratory illnesses are related to air pollution, and every country should take immediate steps to eliminate air pollution in order to raise healthy generations that can breathe clean air.

"If you need the complete report and codes, please leave a comment on this blog. Few screenshots, steps, and codes are not included assuming that you are able to figure them out. If you are having difficulty getting the output, then please leave a comment and I will help you with that."

Continue with your research on:
1. Autonomously and independently identify deficiencies when interacting with a range of technologies and leveraging knowledge of these deficiencies to improve future practice.

 2. Examine, select and autonomously apply skills to leverage data stored in a range of database/data storage paradigms.


References:
https://aws.amazon.com/big-data/what-is-hive/
https://www.projectpro.io/article/mapreducevs-pig-vs-hive/163#
