Coursework tasks and Problem statement
1. Identify and evaluate a number of publicly available datasets related to air pollution and the severity of respiratory disease. These may be from sources such as kaggle.com or data.gov.uk.
2. Select appropriate datasets, as informed by their interests
3. Integrate and import these datasets into a suitable data storage and processing system, providing rationale for their choice
4. Perform meaningful analysis of the data to derive some simple, useful information, as can be obtained from the selected datasets.
5. Provide visualisation of the analysis through any Hadoop-related technologies which the students deem suitable.
"If you need the complete report and codes, please leave a comment on this blog. Few screenshots, steps, and codes are not included assuming that you are able to figure them out. If you are having difficulty getting the output, then please leave a comment and I will help you with that."
Sample Coursework for the above task
Identifying and evaluating datasets related to air pollution and the severity of respiratory disease using Hadoop.
Abstract: In order to determine whether air pollution has any impact on respiratory disorders, I will analyze various datasets on the subject in this assignment. As is well known, this situation involves a great deal of data, making it a big data problem. To extract relevant information from the data we gathered, I will use the Hadoop ecosystem. By utilizing the Hadoop Distributed File System and visualization tools, we can discover some important insights. Most of the complexity involved in distributed processing and storage is abstracted away and managed by Hadoop. The datasets are stored in HDFS, and the filtered and shuffled information is visualized using Apache Zeppelin and Apache Hive. By analyzing historical data, this assignment will determine how respiratory diseases are related to air pollution in each nation. It will also explain why I chose HDFS, Hive, Zeppelin, YARN, etc. over other Hadoop ecosystem technologies.
1. Introduction
Before the world became digital, recorded data were scarce, and the majority of data took the form of documents such as tables. Any task involving such files could easily be completed on a single system, so storing and analyzing data was not a problem. After the development of the internet, and once the globe went digital, the amount of data captured rose quickly, and that data also came in a variety of formats. Given the size of these datasets, their analysis has grown into a significant undertaking. Data is produced every microsecond, and as a result the problem of big data has emerged. Managing massive data with a single storage device and processor is challenging. The Hadoop framework is what we have to address these problems. The first of its three core elements, designed to cope with big data, is the Hadoop Distributed File System (HDFS), which stores huge data across distributed nodes and clusters. The second element is MapReduce, which divides enormous amounts of diverse data into smaller pieces for processing and then aggregates the intermediate results to produce the final output. The third, YARN, is the resource manager for the jobs running on a Hadoop cluster. Hadoop also includes a number of other components, including Hive, Pig, Spark, Flume, and Sqoop. In order to tackle our big data problem concerning the correlation between air pollution and respiratory disorders, we will leverage some of these Hadoop components.
2. Discussion of the problem and justification of the dataset
The stated problem is to find a link between air pollution and respiratory ailments in various nations from 2011 to 2017. From personal experience, we can generally affirm that there is a relationship: whenever we are in an environment with poor air quality, we may experience some suffocation or discomfort. But that is just personal experience, and we need to consider whether this is the case everywhere. For instance, during the COVID-19 pandemic, the majority of people who contracted the coronavirus had a history of respiratory disorders such as asthma, COPD, and lung cancer. Therefore, it is necessary to determine the causes of respiratory disorders, whether there are differences in the number of patients between nations and, if so, what the most prevalent practices in those countries are, and so on. Since we are doing an analysis based on air pollution statistics, we must determine the air quality in countries with more and with fewer respiratory disease sufferers. If there is a correlation, we must investigate the causes of air pollution in these nations. For example, countries with the most factories will produce the most hazardous air pollutants, while countries that use solid fuels such as wood for cooking will also produce air pollutants, though these may not be as dangerous. We must also determine whether there is a connection between air pollution and particular respiratory disorders. For instance, not all respiratory disorders need be caused by air pollution; some can also be caused by lifestyle choices. Cigarette smoking, for example, is more likely to cause lung cancer than air pollution is. I have downloaded the "cause_of_deaths" and "PM2.5 Global Air Pollution 2010-2017" datasets from Kaggle. I chose these datasets because they contain data on the fatality rate from air pollution from 2011 to 2017 across several nations. After analyzing these two datasets, we will be able to understand how air pollution causes respiratory illnesses, which can be fatal.
3. The documentation of the technical solution
Since we are dealing with large amounts of data, we must use Hadoop to analyze the data we have already chosen. Apache Hive, Hadoop MapReduce, and Apache Pig, which are all available in the Hadoop environment, can be used to execute this task. In our case, I use Apache Hive due to its extensive querying flexibility; I favor querying over scripting. Let me explain why I prefer Apache Hive over Apache Pig and MapReduce.
|                      | Apache Pig     | Apache Hive    | Hadoop MapReduce |
|----------------------|----------------|----------------|------------------|
| Language             | Scripting      | Query          | Compiled         |
| Level of abstraction | High           | High           | Low              |
| Lines of code        | Fewer lines    | Fewer lines    | More lines       |
| Code efficiency      | Less efficient | Less efficient | Highly efficient |
Here we can see that Hive and Pig have similar characteristics, but since I favour querying over Pig-style scripting, I will use Apache Hive. Apache Hive, an open-source, distributed data warehouse, makes it possible to conduct large-scale analytics. A data warehouse enables the rapid examination of data, which facilitates data-driven decision making. SQL enables Hive users to read, write, and manage petabytes of data. Hive is developed on top of Apache Hadoop, an open-source platform for data storage and analysis. Hive is well optimized for managing massive amounts of data and has tight Hadoop integration. Hive is distinguished by its SQL-like interface, which may be combined with Apache Tez or MapReduce to query massive datasets.

The diagram depicts the operation of Apache Hive. The Hive drivers can support applications written in any language, including Python, Java, and others. Clients such as JDBC and ODBC connect applications to the Hive server. Hive services such as HiveServer2 enable clients to execute queries against Hive. In addition, Hive provides other services such as the compiler, the driver, and the metastore. Like Apache Pig, Apache Hive utilizes the MapReduce framework as the default engine for query execution, as MapReduce is capable of processing massive volumes of data. In addition, YARN serves as the resource manager. Given that Hive is part of the Hadoop ecosystem, HDFS is used for distributed storage. Hive can run in both local and MapReduce modes. It is always preferable to use a query language like Hive, since it uses the MapReduce engine, which enables simple queries to analyze massive volumes of data; this matters here because we are working with two distinct datasets and need to run several tasks over them. Because it is so similar to SQL, and because we can use the JDBC interpreter when employing Zeppelin's visualization features, Apache Hive is the technology I have chosen to analyze the selected datasets.
Making these datasets accessible in the Hadoop Distributed File System (HDFS) is the next step after selecting our datasets and the appropriate analysis technology, that is, Apache Hive.

I established the directory /user/hadoop in HDFS with open access permissions and used the PuTTY client to run the "wget" command to download the datasets from Kaggle (Scap-i). At this stage, we need to copy our datasets from the local directory to the newly created HDFS directory (Scap-ii). I further made two directories under /user/hadoop called "airpollution" and "resp_des_death" (Scap-iii); a sketch of the equivalent commands follows the screenshot captions below.
Scap-ii
Scap-iii
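The exact commands are only shown in the screenshots; the following is a minimal sketch of the equivalent step, assuming the Hive CLI (which can run HDFS commands via the dfs keyword) and that the downloaded CSV files sit in the current local directory.

-- Sketch only: create the HDFS directories and copy the downloaded files in.
dfs -mkdir -p /user/hadoop/airpollution;
dfs -mkdir -p /user/hadoop/resp_des_death;
dfs -put AirPollution.csv /user/hadoop/airpollution/;
dfs -put cause_of_deaths.csv /user/hadoop/resp_des_death/;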
The cause of death dataset has 208114 rows, while the pollution dataset has 19280 rows. However, we need to organize these data and combine the necessary parts in another table. Because we are only considering air pollution and respiratory disease for the years 2011 to 2017, we need to remove a few rows and columns from the present datasets and then combine the results into one single dataset. Before uploading to HDFS, I initially made some adjustments to the datasets in the local directory. I then uploaded the cause_of_deaths.csv and AirPollution.csv files to the HDFS folders made under /user/hadoop using the Ambari server's file view.
Then, using the default database, I created two tables with the names "airpollution" and "cause_of_deaths" by uploading the corresponding "AirPollution.csv" and "cause_of_deaths.csv" files. Now let's see the statistics of both tables. We can see that the raw data size for 'airpollution' is 59527 and for 'cause_of_deaths' is 18001861. We can now use Apache Hive to run a few queries to organize the data further.
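The tables were created through the upload wizard, so the DDL below is only a hedged reconstruction; the column lists are assumptions inferred from the queries used later, not the full CSV schemas.

-- Sketch only: external tables over the directories created earlier.
-- Adjust the column lists to the real CSV headers.
CREATE EXTERNAL TABLE IF NOT EXISTS cause_of_deaths (
  country STRING,
  code STRING,
  year INT,
  lower_respiratory_infections INT,
  chronic_respiratory_diseases INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/resp_des_death'
TBLPROPERTIES ('skip.header.line.count' = '1');

CREATE EXTERNAL TABLE IF NOT EXISTS airpollution (
  country STRING,
  year INT,
  air_quality_index DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/airpollution'
TBLPROPERTIES ('skip.header.line.count' = '1');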
SELECT country, year, lower_respiratory_infections, chronic_respiratory_diseases
FROM cause_of_deaths
WHERE year > 2010 AND year < 2018;

The cause_of_deaths table has a large number of death causes; however, we are only choosing the information pertaining to respiratory diseases. After extracting the data associated exclusively with respiratory disorders such as lower respiratory infections and chronic respiratory diseases for the period from 2011 to 2017 across the different nations, the output table is saved into the directory with the name 'respiratory_desc.csv'.
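The save step itself is not shown in the screenshots; here is a minimal sketch, assuming Hive's INSERT OVERWRITE DIRECTORY syntax and the directory layout described above.

-- Sketch: write the filtered rows out to an HDFS directory as CSV.
INSERT OVERWRITE DIRECTORY '/user/hadoop/resp_des_death/respiratory_desc'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT country, year, lower_respiratory_infections, chronic_respiratory_diseases
FROM cause_of_deaths
WHERE year > 2010 AND year < 2018;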
In the same way, I reduced the 'AirPollution.csv' file to the required data and saved it to the airpollution directory with the name 'AirPollution_redu.csv'. After deleting duplicate and irrelevant data from the files, we must merge the two newly created tables into one new table in order to carry out the analysis. I combined the tables into a single one and gave it the name "merged_data", along with the filename "merged_data.csv" and the directory "merged_file".
4. Result of the analysis and insights
Next, for the evaluation, I computed the correlation between lower respiratory infections and the air quality index, naming the resulting column corr_lri_aqi. I also computed the correlation between chronic respiratory diseases and the air quality index, naming the resulting column corr_crd_aqi.
SELECT country,
       corr(lower_respiratory_infections, air_quality_index) AS corr_lri_aqi
FROM merged_data
GROUP BY country;
This gives the result with the column name corr_lri_aqi; the screenshot shows only the first few countries with their correlations.
In the same way, I checked the correlation between chronic respiratory diseases and the air quality index.
SELECT country,
       corr(chronic_respiratory_diseases, air_quality_index) AS corr_crd_aqi
FROM merged_data
GROUP BY country;
This gives the result with the column name corr_crd_aqi; the screenshot below shows only the first few countries.
Zeppelin is a fantastic tool for visualization in the Hadoop environment, as we are all aware. In order to run Hive queries on Zeppelin and produce a visualization, I used the jdbc(hive) interpreter. Let's check how Zeppelin renders the visualization after I combined the correlation queries for chronic respiratory diseases and lower respiratory infections into a single query against the air quality index. I won't go into great detail about the Zeppelin visualization here because I have already demonstrated it in a screen-captured video.
%jdbc(hive)
SELECT country, air_quality_index,
       corr(lower_respiratory_infections, air_quality_index) AS corr_lri_aqi,
       corr(chronic_respiratory_diseases, air_quality_index) AS corr_crd_aqi
FROM merged_data
GROUP BY country, air_quality_index
We can see from this visualization that air quality indices and respiratory disorders are strongly correlated: where the air quality index is high, air pollution is more severe and there is a higher chance of respiratory disease.
Let's now look at the Tez view, which displays all Hive queries together with their performance data, to see how we might use it to locate specific Hive queries.
Now let's look at which country had the highest and which the lowest air quality indices between 2011 and 2017.
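The query behind this step is not shown; a minimal sketch, assuming countries are ranked by their average AQI over the period:

-- Sketch: average AQI per country, 2011-2017, highest first.
SELECT country, avg(air_quality_index) AS avg_aqi
FROM merged_data
WHERE year BETWEEN 2011 AND 2017
GROUP BY country
ORDER BY avg_aqi DESC;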
We can now see that, between 2011 and 2017, Finland had the lowest air quality indices and Nepal had the highest.
Let's compare respiratory illnesses and air quality indices for Nepal and Finland exclusively over this time period.
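A hedged sketch of how this comparison might be queried against the merged table:

-- Sketch: side-by-side figures for the two extreme countries.
SELECT country, year, air_quality_index,
       lower_respiratory_infections, chronic_respiratory_diseases
FROM merged_data
WHERE country IN ('Nepal', 'Finland')
ORDER BY country, year;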
We can see from this visualization that Nepal is more susceptible to respiratory illnesses than Finland, and the explanation is that Nepal's air quality index is higher than Finland's.
We will now check for respiratory illnesses in nations with air quality indices exceeding 60.
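A sketch of the threshold query, again assuming the merged_data table described above:

-- Sketch: countries with AQI above 60, with their respiratory figures.
SELECT country, year, air_quality_index,
       lower_respiratory_infections, chronic_respiratory_diseases
FROM merged_data
WHERE air_quality_index > 60;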
We can observe that the majority of Asian nations, including China, India, and others, have air quality indices above 60 and more respiratory ailments. To safeguard the ecosystem and the wellbeing of all living creatures, air pollution must be reduced right away.
Also, when we check the countries which have an air quality index of less than 10, we get the corresponding result. After evaluating the two outcomes, we can conclude that countries with a larger population have both higher rates of respiratory disease and a higher air quality index.
In order to determine whether population affects air pollution, let's look at India and examine its air quality index from 2011 to 2017. We can see that India consistently has an air quality index average of 90, which is exceedingly poor for the nation.
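A minimal sketch of the per-year check, again against merged_data:

-- Sketch: India's average AQI year by year, 2011-2017.
SELECT year, avg(air_quality_index) AS avg_aqi
FROM merged_data
WHERE country = 'India' AND year BETWEEN 2011 AND 2017
GROUP BY year
ORDER BY year;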
We can observe the same thing when looking at China or any other country with a large population. India should reduce air pollution to improve global health.
Finally, I checked the air quality index for all the countries and obtained the following result.
The above result shows that the countries with higher populations have higher air pollution and higher rates of respiratory disease.
5. Conclusion
In conclusion, we find that every country's air quality and respiratory disorders are closely associated. Additionally, population has a significant impact on air quality: air pollution will increase as the population grows. The number of factories, vehicles, and solid fuels such as wood used in a country will result in a high air quality index, which in turn increases the number of respiratory disorders there. Additionally, when we consider Asian nations, we find that they have higher rates of respiratory illness than other regions of the world. This is because Asian nations burn large cultivated fields after harvesting to prepare the land for subsequent cultivation, and because the majority of factories are located in Asian nations, while the rest of the world largely consumes their products. Additionally, as the population grows, more people drive motorcycles, which emit air pollutants. The fact that respiratory diseases are also present in nations with a low population density, albeit at a much lower rate, indicates that other factors, such as climate and lifestyle, can also contribute to respiratory disease. However, in conclusion, I believe that respiratory illnesses are related to air pollution, and every country should take immediate steps to reduce air pollution in order to raise healthy generations that can breathe clean air.
"If you need the complete report and codes, please leave a comment on this blog. Few screenshots, steps, and codes are not included assuming that you are able to figure them out. If you are having difficulty getting the output, then please leave a comment and I will help you with that."
Continue with your research on
1. Autonomously and independently identify deficiencies when interacting with a range of technologies, and leverage knowledge of these deficiencies to improve future practice.
2. Examine, select, and autonomously apply skills to leverage data stored in a range of database/data storage paradigms.
References:
- https://aws.amazon.com/big-data/what-is-hive/
- https://www.projectpro.io/article/mapreduce-vs-pig-vs-hive/163