Department of Computer Engineering (2019 course)
Course Name & Code: Data Science and Big Data Analytics Laboratory (310256)
Group A: Data Science
1. Data Wrangling I
2. Data Wrangling II
3. Descriptive Statistics - Measures of Central Tendency and Variability
4. Data Analytics I
5. Data Analytics II
6. Data Analytics III
7. Text Analytics
8. Data Visualization I
9. Data Visualization II
10. Data Visualization III
Group B: Big Data Analytics – JAVA/SCALA
1. Write code in Java for a simple Word Count application that counts the number of occurrences of each word in a given input set, using the Hadoop MapReduce framework on a local standalone setup.
2. Design a distributed application using MapReduce that processes a system log file.
3. Locate a dataset (e.g., sample_weather.txt) for working on weather data, then write a program that reads the text input files and finds the average temperature, dew point and wind speed.
4. Write a simple program in Scala using the Apache Spark framework.
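The word count in item 1 follows the usual map, shuffle, reduce pattern. The sketch below only illustrates that data flow: it is written in plain Python rather than Java, it does not use Hadoop, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example, not part of any framework.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in-process over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

The same shape would carry over to the weather exercise in item 3, assuming the mapper emits (field name, reading) pairs and the reducer divides the sum of the readings by their count instead of only summing.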
Group C: Mini Projects/Case Study – PYTHON/R
1. Write a case study on Global Innovation Network and Analysis (GINA). The components of the analytic plan are: 1. Discovery: business problem framed, 2. Data, 3. Model planning: analytic technique, and 4. Results and key findings.
2. Use the following dataset and classify tweets into positive and negative tweets. https://www.kaggle.com/ruchi798/data-science-tweets
3. Use the following dataset and classify tweets into positive and negative tweets. https://www.kaggle.com/ruchi798/data-science-tweets
4. Use the covid_vaccine_statewise.csv dataset and perform the following analytics on it. https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons vaccinated with the first dose, state-wise, in India
c. Number of persons vaccinated with the second dose, state-wise, in India
d. Number of males vaccinated
e. Number of females vaccinated
5. Write a case study on data-driven processing for Digital Marketing OR Health care systems using the Hadoop ecosystem components listed below. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming-based data processing
● Spark: In-memory data processing
● PIG, HIVE: Query-based processing of data services
● HBase: NoSQL database (provides real-time reads and writes)
● Mahout, Spark MLlib: Machine learning algorithm libraries (provide analytical tools)
● Solr, Lucene: Searching and indexing
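For parts b and c of item 4 above, a minimal sketch of the state-wise dose counts might look like the following. It rests on two assumptions that should be checked against the actual file: that the header uses the column names `State`, `First Dose Administered` and `Second Dose Administered`, and that the counts are cumulative per state, so the largest value seen for a state is its current total. The helper names `to_number` and `latest_per_state` are hypothetical.

```python
from collections import defaultdict

# Assumed column names; verify them against the CSV header before use.
STATE_COL = "State"
FIRST_DOSE_COL = "First Dose Administered"
SECOND_DOSE_COL = "Second Dose Administered"


def to_number(text):
    """Parse a numeric cell; blank or missing cells count as 0."""
    text = (text or "").strip()
    return float(text) if text else 0.0


def latest_per_state(rows, column):
    """Return {state: largest value seen in `column`}.

    Assumes the column holds cumulative counts, so the maximum is the latest total.
    """
    totals = defaultdict(float)
    for row in rows:
        state = row[STATE_COL]
        totals[state] = max(totals[state], to_number(row.get(column)))
    return dict(totals)
```

The rows can come from `csv.DictReader` opened on covid_vaccine_statewise.csv; passing `FIRST_DOSE_COL` or `SECOND_DOSE_COL` as `column` then gives the two state-wise tables. If the file also contains a nationwide aggregate row, it would need to be filtered out before comparing states.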