Analysis of Items with Item Response Theory (IRT) Approach on Final Assessment for Al-Quran Hadith Subjects

Measurement in education, especially in the teaching and learning process, can be carried out with measuring tools in the form of tests and non-tests. Islamic Religious Education is treated the same as other subjects: student success is measured through evaluation, a systematic process for obtaining information about the effectiveness of teaching and learning activities. Evaluation also helps teachers achieve learning objectives and describes student achievement against predetermined criteria. This study analyzes the Year-End Assessment of MAN 2 Bantul for the 2020/2021 academic year, which consists of 25 multiple-choice questions. The items analyzed in the report have a total of 265 student responses. The items were analyzed with modern test theory (Item Response Theory). The model fit test showed that the instrument fits the 2 PL (two-parameter logistic) model, which has the lowest AIC value, namely 4874.85. The analysis of the difficulty parameter indicates that 4 questions are very easy, 4 are easy, 15 are moderate, 1 is difficult, and 1 is very difficult. This shows that the distribution of difficulty levels is quite balanced. The analysis of the discriminating power parameter shows 23 good questions, 1 fairly good question, and 1 bad question, so the discriminating power of the items is quite good. The estimation of students' abilities with the MLE estimator showed that no student had an ability below -4. There are 15 students with abilities above 2.00, 165 students in the good category (ability -2 to 2), and 9 students with abilities below -2. Based on the plot of the information function, the test gives the most information for individuals with low ability, around -1.2, and it measures accurately for students with abilities in the range of -2.5 to 1.2.


Introduction
A test is a task, or a series of tasks, designed to obtain information about educational and psychological traits or attributes. The ability of students in education can be determined through a series of tests (Parmaningsih and Saputro 2021). The test is part of the assessment process. Assessment is the process of collecting information about students and classes for instructional decision-making purposes (Istiyono 2020). Another related term is measurement. Measurement is the assignment of a number to a certain characteristic possessed by people or objects according to clear rules or formulations. The characteristics of measurement are the use of numbers and of certain rules or formulas (Parmaningsih and Saputro 2021). The relationship between tests, measurements, and assessments shows that learning outcomes can only be assessed properly when the assessment uses information obtained by measuring those outcomes (Matondang, Jalinus, and Ambiyar 2020). The quality of the test items must meet the appropriate requirements (Mulyani, Efendi, and Ramalis 2021). This is important because test results are used as the basis for decisions about the examinees, for example passing or failing, being accepted or rejected.
In conducting an assessment, an instrument is needed that can measure students' abilities accurately. Summative or end-of-semester exams are usually given in multiple-choice form. The process of assigning a number to something or someone based on a rule is called measurement; Mardapi defines it as the activity of systematically determining numbers for an object (Mardapi 2012). Results must be as accurate as possible. Although measurement does not itself produce a value judgment or determine whether something is good or bad, its results can be used to make decisions.
There are at least two forms of assessment in educational practice, namely formative assessment and summative assessment. In the context of assessment for learning, summative assessment is important. Once the learning process is complete, summative assessment is used to describe the quality of students. Conducted at the end of the lesson, it focuses on learning and on achievement of the desired outcome. Summative assessment is also used to track students' progress in the program by providing a summary of their experience.
Measurement in the field of education, especially in the teaching and learning process, can be done with measuring tools in the form of tests and non-tests. Measurement is quantitative, and determining a value quantitatively requires a measuring instrument, one of which is a test. The measuring instrument or test instrument commonly used in evaluating student learning outcomes is a set of questions. This set of questions must be of good quality in order to measure the actual abilities of students (Sarea and Hadi 2015).
Measuring tools that can be used to evaluate the learning process include homework assignments, quizzes, midterm exams (UTS), and final semester exams (UAS). A test is one form of instrument: a set of questions that have true or false answers, or fully or partially true answers, used to determine the learning achievements or competencies that students have reached in certain fields (Mardapi 2012). This report discusses tests in the field of Islamic Religious Education (PAI).
The summative assessment, also known as PAS (Semester Final Assessment), is sometimes made too difficult or too easy, which makes it hard for educators to distinguish students' abilities. Therefore, the test questions need to be analyzed so that the results accurately reflect the students' abilities (Istiyono 2020).
The analysis of test items takes two forms: classical test theory (CTT) and item response theory (IRT). IRT is the more recent of the two. This modern theory was developed by academics to overcome the shortcomings of classical item analysis, so IRT can be regarded as a complement to classical test theory. Item response theory (IRT) is a collection of statistical models that have been used to model responses to educational and psychological test items along with the latent traits that determine how individuals respond to those items. The theoretical and computational framework of IRT was developed from the 1950s to the 1980s and since then has been widely used by organizational researchers for various applications and research domains (Foster, Min, and Zickar 2017). Balsis et al. show that IRT has been used in analyzing Alzheimer's disease, which affects neurological, cognitive, and behavioral processes (Balsis et al. 2018). McGrory et al. show that IRT has been used in research on dementia (McGrory et al. 2014). Computationally, IRT analysis can be carried out with Stata (Linden 2018) as well as with software such as IRTEQ, STUIRT, and POLYEQUATE (Malatesta and Lee 2019).
Islamic Religious Education is treated the same as other subjects. Student success is likewise measured through evaluation, which is a systematic process to obtain information about the effectiveness of teaching and learning activities. Evaluation can also assist teachers in achieving learning objectives and describe student achievement in accordance with predetermined criteria (Azizah and Zainudin 2020).
Islamic Religious Education seeks to develop a Muslim personality that integrates the spiritual, physical, emotional, intellectual, and social dimensions, in order to instill noble character and form a complete human being. It also seeks to make humans caliphs on earth (khalifah fil ardh) who are responsible and continue to serve Allah, especially in the face of difficulties. Therefore, Islamic religious education must be taught in its entirety. However, Islamic religious education subjects in schools are still less attractive to students, and are still less successful in instilling positive behavior in students, because learning is monotonous. The low quality of learning is caused by the weak methodological skills of the teacher. Besides teaching, teachers prepare lesson plans and assessments, yet the evaluation plan often receives little attention in its implementation. This strongly affects the success of student learning. Moreover, the questions that are written often do not measure the aspects planned in the lesson plan (RPP) (Sham 2019).
A test is said to be a good measuring tool if it meets the test requirements, namely validity, reliability, objectivity, practicality, and economy. Therefore, this study examines the feasibility of tests in the field of PAI. To date, no research in PAI has analyzed test items using IRT; existing studies still use CTT analysis at the senior high (SMA), junior high (SMP), and elementary (SD) school levels (Ahmad and Sukiman 2019; Rahmaini and Taufiq 2018; Rusmayani 2020; Sarea 2018). For this reason, it is important to conduct an IRT analysis of the test items used in PAI subjects.
This study tests the items in the Year-End Assessment of PAI subjects at MAN 2 Bantul for the 2020/2021 academic year. The assessment consists of 25 multiple-choice questions. The items analyzed in the report have a total of 265 student responses. The items are analyzed with modern test theory (Item Response Theory).

Research Method
The research subjects were 188 students of MAN 2 Bantul in the 2020/2021 academic year. The object of research is the set of questions for the Even Semester Final Exam of MAN 2 Bantul for the 2020/2021 academic year in the Al-Quran Hadith subject, which consists of 25 multiple-choice questions. The data are the final semester exam results of the 10th grade students of MAN 2 Bantul in this subject. The test results are scored dichotomously: a correct answer is given a score of 1 and a wrong answer a score of 0.
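The dichotomous scoring rule can be illustrated with a short Python sketch. It is only an illustration and not the procedure used in the study, which was run in R. The function name, the answer key, and the student answers are hypothetical, and treating an unanswered item as wrong is an assumption.

```python
def score_dichotomous(responses, answer_key):
    """Score each response 1 if it matches the key and 0 otherwise."""
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return [1 if given == correct else 0
            for given, correct in zip(responses, answer_key)]


# Hypothetical key and answer sheet for a 5-item excerpt (not the real test).
answer_key = ["A", "C", "E", "B", "D"]
student_answers = ["A", "C", "B", "B", None]  # None marks an unanswered item

scores = score_dichotomous(student_answers, answer_key)
print(scores)       # [1, 1, 0, 1, 0]
print(sum(scores))  # 3
```

Applying the function to every answer sheet gives the 0/1 response matrix (students by items) on which the IRT models are fitted.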
The scored data were analyzed with three models, namely the 1 PL, 2 PL, and 3 PL models, and the items fitting each model were identified. Then, the model fit test was carried out. The purpose of the model fit test is to find out which of the three models is best for estimating item parameters and subject abilities with IRT. In this study, the model fit test was conducted by comparing the AIC (Akaike Information Criterion) values of all models. The analysis was carried out using the R program. The most suitable model is the one with the lowest AIC (Snipes and Taylor 2014). After the best-fitting model was selected, the item parameters and student abilities were estimated using the maximum likelihood estimation (MLE) method. The estimation results were then interpreted. Finally, the value of the information function was calculated to determine the reliability of the items.

IRT 1 PL Analysis
An item in the Rasch model has only one item characteristic parameter, namely the item difficulty level (index), denoted by b. The difficulty parameter corresponds to the ability level at which a test taker has a 50% chance of answering the item correctly. The following are the results of the 1 PL analysis for the 25 items of the Al-Quran Hadith subject at MAN 2 Bantul. In the output, Dffclt denotes the difficulty level of an item and Dscrmn its discriminating power. The Dscrmn value is the same for every item because the analysis uses the one-parameter logistic model. The estimation results are then categorized by range. Based on the category table, the distribution of difficulty levels across the 25 items is quite balanced: very easy and easy items each make up 16%, difficult and very difficult items each make up only 4%, and moderate items make up 60%. However, the distribution of difficulty levels could still be improved by adding more questions in the difficult or very difficult categories.

Image 1. ICC 1 PL Charts
Based on the ICC graph, the distribution of difficulty levels is quite good. The ability a student needs to answer each of the 25 items of the Al-Quran Hadith test at MAN 2 Bantul correctly with a probability of 50% ranges from about -2.5 to 2.8.
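The link between the difficulty parameter and the 50% point of the curve can be checked with a minimal Python sketch of the one-parameter logistic curve, P(theta) = 1 / (1 + exp(-a(theta - b))). The function name and the three difficulty values are illustrative assumptions; the study's own estimates came from R.

```python
import math


def icc_1pl(theta, b, a=1.0):
    """Probability of a correct answer under the one-parameter logistic model.

    theta: ability, b: item difficulty, a: discrimination shared by all items.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical difficulties spanning the range read from the ICC plot.
for b in (-2.5, 0.0, 2.8):
    # At theta == b the probability is 0.5, whatever the shared value of a.
    print(b, icc_1pl(theta=b, b=b))
```

A student of average ability (theta = 0) therefore has more than a 50% chance on an item with b = -2.5 and less than a 50% chance on an item with b = 2.8.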

IRT 2 PL Analysis
The two-parameter model estimates the item difficulty level (bi) and the item discriminating power (ai). The ability of an item to distinguish high-ability test takers from low-ability test takers is referred to as item discriminating power (Istiyono 2020). The following are the results of the 2 PL analysis for the 25 items of the Al-Quran Hadith subject at MAN 2 Bantul. In the output, Dffclt describes the difficulty level of an item, while Dscrmn describes its discriminating power. The estimation results are then classified into categories by range. Based on the category table, most of the questions are in the good category. There are 23 items that meet the good criteria; these items have discriminating power values in the range of 0.00 to 0.5. The other 2 items fall into the fairly good and bad categories: item number 11 is fairly good, while item number 2 is bad. Item number 11 can be revised so that it better distinguishes students' abilities. Item number 2 can be deleted and replaced with a better item.

Image 2. ICC 2PL Graphics
The item characteristic curves are shown in Image 2. The different slopes of the curves in the 2 PL model indicate that the discriminating power differs from item to item. An item distinguishes high-ability test takers from low-ability test takers better when its curve is steeper (larger gradient or slope) (Istiyono 2020).
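The relation between the discriminating power and the steepness of the curve can be sketched in Python as follows. Under the 2 PL curve P(theta) = 1 / (1 + exp(-a(theta - b))), the slope at theta = b equals a / 4. The two items below are hypothetical and are not estimates from this study.

```python
import math


def icc_2pl(theta, a, b):
    """Probability of a correct answer under the two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def slope_at_difficulty(a, b, h=1e-5):
    """Numerical slope of the curve at theta = b (central difference)."""
    return (icc_2pl(b + h, a, b) - icc_2pl(b - h, a, b)) / (2.0 * h)


# Two hypothetical items with the same difficulty but different discrimination.
flat_item = {"a": 0.4, "b": 0.0}
steep_item = {"a": 1.8, "b": 0.0}

for item in (flat_item, steep_item):
    # Analytically the slope at theta = b equals a / 4.
    print(item["a"], slope_at_difficulty(**item))

# Gap in success probability between a high (+1) and a low (-1) ability.
gap_flat = icc_2pl(1.0, **flat_item) - icc_2pl(-1.0, **flat_item)
gap_steep = icc_2pl(1.0, **steep_item) - icc_2pl(-1.0, **steep_item)
print(gap_flat, gap_steep)
```

The steeper item separates the two ability levels by a much larger difference in success probability, which is what a high discriminating power means in practice.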

IRT 3 PL Analysis
This three-parameter logistic model uses three parameters to describe item characteristics, namely item difficulty level (bi), item discriminating power (ai), and pseudo-guessing (ci). The pseudo-guessing parameter represents the chance that an item is answered correctly by guessing, which in this model is not fixed at zero. The following are the results of the 3 PL analysis for the 25 items of the Al-Quran Hadith subject at MAN 2 Bantul. In the output, Dffclt describes the difficulty level of an item, Dscrmn its discriminating power, and Gussng the pseudo-guessing value. The estimation results are then categorized by range. The value of ci for a test item ranges between 0 and 1. An item is said to be good if the value of ci is not more than 1/k, where k is the number of answer choices (Hulin, Drasgow, and Parsons 1983). Each test item has 5 choices (A, B, C, D, E). Therefore, an item is said to be good if its ci value is not more than 0.20.
Based on the category table, 10 of the 25 items are in the bad category and 15 are in the good category. For items categorized as bad on the pseudo-guessing parameter, even low-ability test takers have a chance of answering correctly even though the questions are quite difficult (Istiyono 2020).

Image 3. ICC 3PL Graphics
In the item characteristic curves for the 3 PL analysis, the ci parameter gives each curve a non-zero lower asymptote. The picture above shows that the probability of answering correctly does not start from zero.
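The lower asymptote and the 1/k rule can be illustrated with a small Python sketch of the 3 PL curve, P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))). The function names and parameter values are hypothetical and serve only to show the behavior described above.

```python
import math


def icc_3pl(theta, a, b, c):
    """P(correct) under the 3 PL model; c is the lower asymptote."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def guessing_is_acceptable(c, n_options):
    """Apply the rule quoted above: c should not exceed 1 / k."""
    return c <= 1.0 / n_options


# Hypothetical item parameters, not estimates from the study.
a, b, c = 1.0, 0.5, 0.30

print(icc_3pl(-10.0, a, b, c))                    # close to c, not to zero
print(guessing_is_acceptable(c, n_options=5))     # False: 0.30 > 0.20
print(guessing_is_acceptable(0.15, n_options=5))  # True
```

With five answer choices, an item whose estimated ci is 0.30 would be flagged as bad under this rule, because a very low-ability test taker still answers it correctly about 30% of the time.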

Model Fit Test
The purpose of the model fit test is to determine which analysis model best fits the data. If the data fit a model, the items behave consistently as the model predicts. There are several ways to test the fit of a model, one of which is based on the AIC (Akaike Information Criterion) value. The AIC balances the magnitude of the likelihood against the number of parameters in the model. The most suitable model is the one with the lowest AIC (Snipes and Taylor 2014).

Image 4. Model Fit Test
Based on the results of the above analysis, the smallest AIC value is found in the 2 PL model, which is 4874.85. So it can be concluded that the items fit the 2 PL model.
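The comparison can be sketched in Python using the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The log-likelihood values below are hypothetical and are not the values obtained in this study. The parameter counts are also an assumption: 25 difficulties plus one shared discrimination for 1 PL, and 2 or 3 parameters per item for 2 PL and 3 PL.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(fits):
    """Return the name of the model with the lowest AIC.

    fits maps a model name to a (log_likelihood, n_params) pair.
    """
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical log-likelihoods for a 25-item test (NOT the study's values).
fits = {
    "1PL": (-2480.0, 26),  # 25 difficulties + 1 shared discrimination
    "2PL": (-2400.0, 50),  # 25 difficulties + 25 discriminations
    "3PL": (-2395.0, 75),  # 2PL parameters + 25 guessing parameters
}

for name, (loglik, k) in fits.items():
    print(name, aic(loglik, k))
print(best_model(fits))
```

In this made-up example the 3 PL model has the highest likelihood, but the penalty for its 25 extra parameters outweighs the gain, so the 2 PL model has the lowest AIC.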

Student Ability Estimation Results
Below is a graph of the estimated abilities of the students who took the test. The graph shows that no student has an ability below -4.00. Meanwhile, 2 of the 188 students have abilities above 6.00.
Furthermore, there are 15 students with high abilities (above 2.00), 165 students in the medium category (ability from -2 to 2), and 9 students with low abilities (below -2). Examples of students with high abilities are students number 45, 49, and 55, with abilities of 2.498397, 2.342558, and 3.765938, respectively. Examples of students with medium abilities are students number 33, 64, and 84, with abilities of 1.267403, 0.227422, and 0.764376. Examples of students with low abilities are students number 4, 23, and 56, with abilities of -2.42629, -2.25918, and -2.14486.
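How a maximum likelihood ability estimate is obtained from a response pattern can be sketched in Python with a Newton-Raphson search under the 2 PL model. This is a minimal sketch and not the routine used in the study, which was run in R. It assumes known, positive discriminations, and all item parameters and responses below are hypothetical.

```python
import math


def prob_2pl(theta, a, b):
    """Probability of a correct answer under the 2 PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def score_gradient(theta, responses, a_params, b_params):
    """First derivative of the 2 PL log-likelihood with respect to theta."""
    return sum(a * (u - prob_2pl(theta, a, b))
               for u, a, b in zip(responses, a_params, b_params))


def estimate_theta_mle(responses, a_params, b_params, tol=1e-8, max_iter=100):
    """Newton-Raphson maximum likelihood estimate of ability."""
    # The MLE is not finite for all-correct or all-wrong response patterns.
    if all(u == 1 for u in responses):
        return math.inf
    if all(u == 0 for u in responses):
        return -math.inf
    theta = 0.0
    for _ in range(max_iter):
        probs = [prob_2pl(theta, a, b) for a, b in zip(a_params, b_params)]
        grad = sum(a * (u - p) for u, a, p in zip(responses, a_params, probs))
        # Test information = minus the second derivative of the log-likelihood.
        info = sum(a * a * p * (1.0 - p) for a, p in zip(a_params, probs))
        step = max(-1.0, min(1.0, grad / info))  # damp very large steps
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical item parameters and one student's scored responses.
a_params = [1.2, 0.8, 1.0, 1.5, 0.9]
b_params = [-1.0, -0.5, 0.0, 0.5, 1.0]
responses = [1, 1, 0, 1, 0]

theta_hat = estimate_theta_mle(responses, a_params, b_params)
print(theta_hat)
```

Because the estimate is unbounded for students who answer every item correctly or every item incorrectly, software usually reports a large capped value for such patterns; very extreme ability estimates should be read with that in mind.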

Image 5. Graph of Student Ability Distribution
The boxplot shows 3 outliers below and 6 outliers above the whiskers. That is, 3 students have abilities well below the bulk of the distribution, so these three students should receive more attention. The 6 upper outliers indicate that 6 students have abilities well above the rest of the group.
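The way a boxplot flags outliers can be sketched in Python with the common 1.5 x IQR rule. This is an assumption about the rule behind the plot: the quartile method used here may differ slightly from the one used by the plotting routine in R, and the ability values are hypothetical.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Split values into low and high outliers using the 1.5 x IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    low = [v for v in values if v < lower_fence]
    high = [v for v in values if v > upper_fence]
    return low, high


# Hypothetical ability estimates (not the study's data).
abilities = [-5.0, -1.0, -0.5, 0.0, 0.2, 0.5, 1.0, 1.2, 6.0]
low, high = boxplot_outliers(abilities)
print(low, high)  # [-5.0] [6.0]
```

Students in the low list are candidates for additional attention, while those in the high list perform well above the rest of the group.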

Information Function
The following is a plot of the test information function based on these 25 items. Because the model fit test shows that the instrument fits the 2 PL model, the information function of the 2 PL model is used. The plot shows that the information function reaches a maximum of about 1.6 and that the test measures best at abilities below 0, at approximately -1.2. That is, the test produces optimal information when given to individuals with low abilities. Overall, the test is appropriate for students with abilities from around -2.5 to 1.2.

Conclusion
From the analysis conducted on the 25 questions of the Al-Quran Hadith subject at MAN 2 Bantul, the following can be concluded. The model fit test shows that the instrument fits the 2 PL (two-parameter logistic) model, which has the lowest AIC value, namely 4874.85. The analysis of the difficulty parameter indicates that 4 questions are very easy, 4 are easy, 15 are moderate, 1 is difficult, and 1 is very difficult. This shows that the distribution of difficulty levels is quite balanced. The analysis of the discriminating power parameter shows 23 good questions, 1 fairly good question, and 1 bad question, so the discriminating power of the items is quite good. The estimation of students' abilities with the MLE estimator showed that no student had an ability below -4. There are 15 students with abilities above 2.00, 165 students in the good category (ability -2 to 2), and 9 students with abilities below -2. Based on the plot of the information function, the test gives the most information for individuals with low ability, around -1.2, and it measures accurately for students with abilities in the range of -2.5 to 1.2.