Comcast Data Engineer Interview Questions and Answers

Your complete guide to acing the interview: Comcast Data Engineer interview questions and answers. Learn how to excel in the Comcast Data Engineer interview process with insider hints, professional advice, and in-depth insights. Our extensive Q&A section can help you stand out.


Who is a Data Engineer?

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. Data engineers design and build pipelines that transform and transport data into a usable format by the time it reaches the data scientist. But who exactly are data engineers, and what responsibilities do they have? Let's find out.

Working as a data engineer gives you the opportunity to make a real difference, and a data engineering career can add incredible value to a business. According to DICE's 2020 Tech Job Report, data engineer was the fastest-growing job of 2019, growing by 50% year over year, and growth estimates for 2017-2025 range from 18% to a whopping 31% per annum.

A Career as a Data Engineer at Comcast

The demand for big data professionals has never been higher: data scientist and data engineer roles rank among the top emerging jobs on LinkedIn, and many people are building high-salary careers as data engineers.

Average salaries for data engineers start at around $65,000. Comcast runs an in-house Engineering, Development and Innovation Centre, and creates incredible technology and entertainment that connects millions of people to the moments and experiences that matter most.

The company supports cutting-edge technology, products, and services, including:

  • Comcast Cyber Security
  • Next Generation Access Networks
  • Comcast Technology Solutions
  • Business Services
  • Xfinity X1
  • Xfinity xFi
  • X1 Voice Remote
  • Xfinity connected home and IoT initiatives

Comcast Data Engineer Interview Questions and Answers

Here is a selection of Comcast Data Engineer interview questions and answers that Comcast has recently asked. These questions are appropriate for both new graduates and experienced professionals. All of the questions below have been answered by the experts from our PySpark Training in Chennai.

1. In Spark, what is the difference between Dataframe, Dataset, and RDD?

As a result of its tabular nature, a Dataframe has additional metadata that allows Spark to perform specific optimizations on the resulting query. An RDD, on the other hand, is more of a black box of data that cannot be optimised, because there are no constraints on the operations that may be performed on it. A Dataset sits in between: it offers the compile-time type safety of an RDD (in Scala and Java) while still going through the same Catalyst query optimizer as a Dataframe.

However, you can use the rdd method to convert a Dataframe to an RDD, and the toDF method to convert an RDD to a Dataframe. Because of the built-in query optimization, it is generally recommended to use a Dataframe wherever possible.
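A minimal PySpark sketch of moving between the two abstractions (the sample data and column names are purely illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# DataFrame: tabular, schema-aware, eligible for Catalyst optimizations
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# DataFrame -> RDD of Row objects
rdd = df.rdd

# RDD -> DataFrame again (toDF is available on RDDs once a SparkSession exists)
df_again = rdd.toDF(["id", "letter"])
df_again.show()
```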

2. coalesce() vs repartition() in Spark

coalesce() avoids a full shuffle. If the number of partitions is known to be decreasing, the executors can safely keep data on the minimum number of partitions, only moving data off the extra nodes onto the ones we keep. repartition(), by contrast, always performs a full shuffle and can either increase or decrease the partition count.
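A short sketch of both calls (the partition counts here are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1_000_000)

# repartition(n) performs a full shuffle and can increase or decrease partitions
evenly_spread = df.repartition(8)

# coalesce(n) only merges existing partitions, avoiding a full shuffle;
# it is intended for reducing the partition count
fewer = evenly_spread.coalesce(2)

print(evenly_spread.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())
```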

3. What is the difference between map and flatMap, and when should each be used?

map transforms an RDD of length N into another RDD of length N; for example, two lines map to two lines. flatMap, on the other hand, takes an RDD of length N and turns it into a collection of N collections, which it then flattens into a single RDD of results.
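A tiny PySpark illustration of the difference (the sample lines are made up):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.parallelize(["hello world", "spark rdd"])

# map: one output element per input element (length stays 2)
words_per_line = lines.map(lambda line: line.split(" "))
print(words_per_line.collect())   # [['hello', 'world'], ['spark', 'rdd']]

# flatMap: each input can produce zero or more outputs, then the result is flattened
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())            # ['hello', 'world', 'spark', 'rdd']
```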

4. What exactly is the distinction between cache and persist?

The only distinction between cache and persist is one of syntax: cache is a synonym for persist with the default storage level, i.e. persist(MEMORY_ONLY).
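A minimal sketch of the two calls in PySpark (the RDDs are throwaway examples):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

rdd_a = sc.parallelize(range(100))
rdd_a.cache()                                # shorthand for persist(StorageLevel.MEMORY_ONLY)

rdd_b = sc.parallelize(range(100))
rdd_b.persist(StorageLevel.MEMORY_AND_DISK)  # persist() lets you choose any storage level

print(rdd_a.count(), rdd_b.count())          # actions materialize (and cache) the RDDs
```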

5. On an RDD, do we need to call cache or persist?

Python is my preferred language over Scala. However, because Spark is built in Scala, I expected my code to run faster in Scala than in Python for obvious reasons. With that premise, I studied and wrote the Scala version of some popular data preparation routines for 1 GB of data. The data comes from Kaggle's Springleaf competition. Just a quick rundown of the data: it is made up of different types, such as int, float, string, and Boolean. I'm only using 6 of the 8 cores for Spark processing, so I set minPartitions=6 to ensure that each core has something to do.

6. What is the best way to read multiple text files into a single RDD?

You can specify entire directories, wildcards, and even a comma-separated list of directories and wildcards, for example sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file"). This is an exposure of Hadoop's FileInputFormat, so it also works with Hadoop-compatible file systems, as Nick Chammas points out.
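A PySpark sketch of the same idea; the paths below are placeholders, not real locations:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A comma-separated list of directories, wildcards, and individual files
# (all paths here are purely illustrative)
rdd = sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
print(rdd.count())
```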

7. How can I modify the names of dataframe columns in Pyspark?

I come from a pandas background, so I'm used to reading data from CSV files into a dataframe and then changing the column names with a simple assignment such as df.columns = new_column_name_list. In a PySpark dataframe produced with sqlContext, however, this does not work.
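A short PySpark sketch of the renaming patterns that do work (the data and names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["_1", "_2"])

# Rename every column at once
new_names = ["id", "score"]
df_renamed = df.toDF(*new_names)

# Or rename a single column
df_renamed = df_renamed.withColumnRenamed("score", "value")
df_renamed.printSchema()
```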

8. The number of cores vs. the number of executors in Apache Spark

I believe the answer is a little simpler than some of the suggestions. The cluster network graph provides the answer: the usage for run 1 remains constant at around 50 MB/s, while run 3 doubles the stable consumption to roughly 100 MB/s.

9. In Spark SQL’s dataframe, how do I alter column types?

Though this may be the best solution, I believe the alternatives proposed by msemelman, Martin Senne, and others, based on withColumn, withColumnRenamed, and cast, are simpler and cleaner. Consider that a Spark dataframe is an (immutable) RDD of Rows, so we never actually replace a column; we simply construct a new dataframe with a new schema each time.
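A minimal PySpark sketch of casting column types (the columns and types are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("1", "2.5")], ["id", "price"])

# "Changing" a type really means building a new dataframe with a cast column
df_typed = (df
            .withColumn("id", col("id").cast("int"))
            .withColumn("price", col("price").cast("double")))
df_typed.printSchema()
```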

10. How do I disable Spark’s INFO logging?

This could be because of the way Spark builds its classpath. My guess is that Hadoop's log4j.properties file appears on the classpath before Spark's, preventing your modifications from being applied.
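As an alternative that sidesteps the log4j.properties ordering problem, you can raise the log level at runtime. A minimal sketch (note it only affects messages emitted after the call, not startup output):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Only WARN and above will be printed from this point on
spark.sparkContext.setLogLevel("WARN")
```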

11. In Pyspark, how can I add a new column to a Spark dataframe?

You cannot add an arbitrary column to a Spark dataframe. New columns can only be created from literals, by transforming existing columns, or by joining in data from another dataframe.
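A short PySpark sketch of the two most common cases (the column names and values are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "amount"])

df2 = (df
       .withColumn("source", lit("comcast"))         # constant value via a literal
       .withColumn("amount_x2", col("amount") * 2))  # derived from an existing column
df2.show()
```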

12. In an Apache Spark dataframe, how do you concatenate columns?

Checking for null values is required, because the result of the concatenation will be null if any one of the columns is null, even if the other columns contain data.
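A PySpark sketch of the null problem and two ways around it (the sample names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, coalesce, lit, col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("John", None), (None, "Doe")], ["first", "last"])

# concat() returns null if any input is null
df = df.withColumn("naive", concat(col("first"), col("last")))

# Guard each column with coalesce(), or use concat_ws(), which skips nulls
df = df.withColumn("safe", concat(coalesce(col("first"), lit("")),
                                  coalesce(col("last"), lit(""))))
df = df.withColumn("joined", concat_ws(" ", col("first"), col("last")))
df.show()
```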

13. In Spark, how are stages divided into tasks?

For each stage, a separate task must be launched for each partition of data. Consider that each partition will most likely be stored in a different physical location, such as an HDFS block or a local file system directory/volume.


Comcast Data Engineer Interview Questions for Freshers

1. What is the best way to store custom objects in a Dataset?

Although things have improved since 2.2/2.3, which added built-in encoder support for Set, Seq, Map, Date, Timestamp, and BigDecimal, this answer remains true and useful. If you stick to constructing types using only case classes and the standard Scala types, you should be fine with just the implicits in SQLImplicits.

2. map vs mapPartitions in Apache Spark?

The map method transforms each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).
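A small PySpark sketch contrasting the two (the data and partition count are arbitrary):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(10), numSlices=2)

# map: called once per element
doubled = rdd.map(lambda x: x * 2)

# mapPartitions: called once per partition with an iterator of its elements,
# and may emit any number of output elements per partition
def sum_partition(iterator):
    yield sum(iterator)

partition_sums = rdd.mapPartitions(sum_partition)
print(doubled.collect(), partition_sums.collect())
```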

3. Using spark-csv, how do you create a single CSV file?

It creates a folder with several files because each partition is saved separately. If you need a single output file (still inside a folder), you can repartition(1) (preferred if the upstream data is large, but it requires a shuffle) or coalesce(1).
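A sketch of forcing a single part file with the built-in CSV writer (the output path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

# coalesce(1) forces a single partition, so the output folder contains one part file
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("/tmp/single_csv_output"))
```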

4. How can I change the memory of an Apache Spark executor?

As you've discovered, adjusting spark.executor.memory has no effect because Spark is executing in local mode. The reason for this is that the Worker "lives" within the driver JVM process that you start when you launch spark-shell, which uses 512 MB of memory by default. You can raise this by increasing spark.driver.memory to a greater value, such as 5g.

5. In Spark, how can I convert an RDD object to a dataframe?

A number of createDataFrame methods in SparkSession make a dataframe from an RDD. I'm sure one of these will work in your situation.

  • For example, createDataFrame(rowRDD: RDD[Row], schema: StructType) creates a dataframe from an RDD of Rows according to the specified schema (see the sketch below).
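A PySpark sketch of the same idea, with illustrative rows and schema:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.master("local[*]").getOrCreate()

row_rdd = spark.sparkContext.parallelize([Row(id=1, name="a"), Row(id=2, name="b")])
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# createDataFrame(rowRDD, schema) builds a dataframe from an RDD of Rows
df = spark.createDataFrame(row_rdd, schema)
df.show()
```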

6. In a Pyspark dataframe, how do you show distinct column values?

To get only distinct rows based on a subset of columns, use df.dropDuplicates(['col1', 'col2']).
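A brief PySpark sketch of both variants (the sample data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["col1", "col2"])

df.select("col1").distinct().show()          # distinct values of a single column
df.dropDuplicates(["col1", "col2"]).show()   # distinct rows based on a subset of columns
```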

7. How do I get a Python list from a Spark dataframe column?

I did a benchmarking analysis, and list(mvv_count_df.select('mvv').toPandas()['mvv']) turned out to be the fastest approach. I was quite taken aback.

I used a 5-node i3.xlarge cluster (each node has 30.5 GB of RAM and 4 CPUs) with Spark 2.4.5 to test the different approaches on 100 thousand and 100 million row datasets. The data was evenly spread among 20 single-column compressed Parquet files.
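A sketch of the approaches being compared (the dataframe is illustrative; the toPandas variant assumes pandas is installed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
mvv_count_df = spark.createDataFrame([(1, 5), (2, 9), (3, 3)], ["mvv", "count"])

# toPandas-based approach (fast when the column fits in driver memory)
mvv_list = list(mvv_count_df.select("mvv").toPandas()["mvv"])

# Pure-Spark alternatives
mvv_list_2 = [row.mvv for row in mvv_count_df.select("mvv").collect()]
mvv_list_3 = mvv_count_df.select("mvv").rdd.flatMap(lambda x: x).collect()
print(mvv_list, mvv_list_2, mvv_list_3)
```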

8. How do I define dataframe partitioning?

In Spark versions up to 1.6, if you create a HiveContext rather than a plain old SQLContext, you can use HiveQL DISTRIBUTE BY colX (which ensures each of N reducers gets non-overlapping ranges of x) and CLUSTER BY colX. In newer versions, DataFrame.repartition(col("colX")) achieves the same effect.

9. In Spark, how can I replace the output directory?

"Whether to replace files added by SparkContext.addFile() when the destination file exists and its contents do not match those of the source," according to the documentation for the parameter spark.files.overwrite. As a result, the saveAsTextFile method is unaffected by it.
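For dataframe writes, the usual way to replace an existing output directory is the save mode; a minimal sketch (the output path is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(10)

# mode("overwrite") replaces the output directory if it already exists
df.write.mode("overwrite").parquet("/tmp/output_dir")
```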

10. How do you stop a Spark application from running?

  • Copy the application ID from the Spark scheduler, for example application_1428487296152_25597.
  • Connect to the server where the job was started.
  • Run yarn application -kill application_1428487296152_25597.

11. Is the SparkContext object defined inside the main function or somewhere else?

I had a similar issue, and my mistake was to create the SparkContext outside of the main method, at class level. It worked perfectly when I created it from within the main function.


Comcast Data Engineer Interview Questions for Experienced Candidates

1. How do you load a local file with sc.textFile instead of reading from HDFS?

You should explicitly specify sc.textFile("file:///path/to/the/file"). The issue occurs when a Hadoop environment is configured.

If the scheme is missing, SparkContext.textFile invokes org.apache.hadoop.mapred.FileInputFormat.getSplits, which in turn calls org.apache.hadoop.fs.FileSystem.getDefaultUri. This method reads the "fs.defaultFS" parameter of the Hadoop configuration. That parameter is normally "hdfs://..." if the HADOOP_CONF_DIR environment variable is set; otherwise it is "file://".

2. Is it possible to operate Apache Spark without Hadoop?

Although Spark may run without Hadoop, certain of its features rely on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works great once it's done.

3. In Spark-shell, what do the numbers on the progress bar mean?

The progress bar shows the current stage ID, followed by (the number of completed tasks + the number of currently running tasks) / the total number of tasks in that stage, for example [Stage 7:===> (14174 + 5) / 62500].

4. Python string to date format conversion

The easiest method to do this in Spark 2.2+ is probably to use the to_date or to_timestamp functions, which both accept a format option.
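A minimal PySpark sketch (the sample strings and format pattern are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, to_timestamp, col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("12/25/2019",), ("01/02/2020",)], ["date_str"])

# Both functions accept an optional format string (Spark 2.2+)
df = df.withColumn("as_date", to_date(col("date_str"), "MM/dd/yyyy"))
df = df.withColumn("as_ts", to_timestamp(col("date_str"), "MM/dd/yyyy"))
df.printSchema()
```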

5. In Apache Spark, extract column values from a Dataframe as a List.

Without the mapping, you just obtain a Row object, which contains every column from the query. Keep in mind that you will almost certainly get a list of type Any. If you want to specify the result type, you can use .map(r => r(0).asInstanceOf[YOUR_TYPE]).

6. How do you fix "TypeError: an integer is required (got type bytes)" when launching Pyspark after installing Spark?

This is happening because you're using Python 3.8. Pyspark's most recent pip release (Pyspark 2.4.4 at the time of writing) does not support Python 3.8. For the time being, you should stick with Python 3.7.

7. In the Apache Spark web UI, what does “Stage Skipped” mean?

It usually indicates that data has been retrieved from cache and that the stage did not have to be re-run. It's consistent with your DAG, which indicates that the next stage requires shuffling (reduceByKey); whenever shuffling is involved, Spark automatically keeps the generated shuffle data, so the stage can be skipped on reuse.

8. What is the difference between spark.default.parallelism and spark.sql.shuffle.partitions?

spark.sql.shuffle.partitions controls the number of partitions used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not explicitly set by the user. It's worth noting that spark.default.parallelism appears to apply only to raw RDDs and is ignored when working with dataframes.
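A sketch of setting both properties at session creation; the chosen values are arbitrary:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.default.parallelism", "8")     # default partitions for raw RDD operations
         .config("spark.sql.shuffle.partitions", "64") # partitions used by dataframe joins/aggregations
         .getOrCreate())

print(spark.sparkContext.defaultParallelism)
print(spark.conf.get("spark.sql.shuffle.partitions"))
```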

9. How do I save a dataframe to Hive?

This is commonly attempted by converting the dataframe to an RDD, saving it as a text file, and then loading it into Hive, but a dataframe can also be saved to Hive directly.
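A minimal sketch of writing a dataframe straight into a Hive table; it assumes a Spark build with Hive support and a configured metastore, and the table name is illustrative:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() requires Hive libraries and a reachable metastore
spark = (SparkSession.builder
         .appName("save-to-hive")
         .enableHiveSupport()
         .getOrCreate())

df = spark.range(10).withColumnRenamed("id", "value")

# Writes the dataframe directly as a managed Hive table, no RDD/text-file detour needed
df.write.mode("overwrite").saveAsTable("interview_demo_table")
```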

10. What is the best way to utilise Pyspark with Python 3?

Take a look at the documentation. The shebang line most likely points to the 'env' binary, which searches the path for the first compatible executable. You can change the environment so that it resolves to python3, hardcode the python3 binary in the script, or run the script explicitly with python3, bypassing the shebang line.
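Another common approach, instead of editing the shebang line, is to point PySpark's interpreter environment variables at Python 3. A sketch, assuming python3 is on the PATH:

```python
import os

# Must be set before the SparkSession/SparkContext is created
os.environ["PYSPARK_PYTHON"] = "python3"          # worker interpreter
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"   # driver interpreter

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.sparkContext.pythonVer)
```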

11. Is it possible to get the top 1000 rows from a Spark Dataframe?

limit(n) takes the first n rows and returns a new Dataset. It differs from head in that head returns an array, whereas limit returns a new Dataset.

12. How do you split a string column in a Spark dataframe into multiple columns?

The proper technique here is to use pyspark.sql.functions.split() and then flatten the resulting ArrayType column into multiple top-level columns. It's simple in this example because each array has just two elements; to retrieve each part of the array as a column, simply call Column.getItem().
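A compact PySpark sketch (the sample string and delimiter are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("2019-12-25",)], ["date_str"])

parts = split(col("date_str"), "-")           # ArrayType column
df = (df
      .withColumn("year", parts.getItem(0))
      .withColumn("month", parts.getItem(1))
      .withColumn("day", parts.getItem(2)))
df.show()
```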

13. In the Spark dataframe write method, how can you overwrite specific partitions?

This is a very common issue. With Spark versions prior to 2.0, the only option is to write directly into the partition directory. If you're running Spark before 2.0, you'll need to disable metadata files. If you're using Spark before 1.6.2, you'll also need to remove the _SUCCESS file from /root/path/to/data/partition_col=value; otherwise, automatic partition discovery will fail.

14. How can I change the Python version of the driver in Spark?

You must ensure that the standalone project you’re launching uses Python 3. If you’re submitting your standalone programme using spark-submit, it should work well; however, if you’re launching it with Python, make sure you’re using Python 3.

15. How do you install Spark on Windows?

However, if you're simply playing with Spark and don't need it to run on Windows for any reason other than that your own machine runs Windows, I strongly advise you to install it on a Linux virtual machine. The simplest way to get started is to download one of Cloudera's or Hortonworks' ready-made images and either use the bundled Spark or install your own from source or from the prebuilt binaries available on the Spark website.

16. How can I keep saveAsTextFile output from being split into many files?

If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(path). This basically means: do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true.
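A PySpark sketch of the same pattern (the output path is illustrative and must not already exist):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(100), 8)

# coalesce(1, shuffle=True) is equivalent to repartition(1): compute in parallel,
# then shuffle everything down to one partition so a single part file is written
rdd.coalesce(1, shuffle=True).saveAsTextFile("/tmp/single_text_output")
```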
