
How to Scrape Melmod Website

Learn to extract valuable data from Melmod effortlessly with step-by-step guidance. Read now for a web scraping journey!

Introduction

As I mentioned earlier, I want to crawl some valuable data from Mods For Melon Playground. This guide provides a step-by-step walkthrough. Let’s get started!

Setting Up the Project

Begin by creating a new Kotlin project and adding Selenium WebDriver as a dependency in your build file. If you missed the initial setup, check out the details in the previous article.
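As a rough reminder of what that setup might look like, here is a minimal sketch of the dependency section, assuming the project uses Gradle with the Kotlin DSL, JUnit 5 for the test annotations, and OpenCSV for the CSVWriter and CSVReader classes used below. The version numbers are assumptions; prefer whatever the previous article pins.

```kotlin
// build.gradle.kts (sketch). Version numbers are assumptions, not taken from the article.
dependencies {
    // Selenium WebDriver, including ChromeDriver and ChromeOptions
    testImplementation("org.seleniumhq.selenium:selenium-java:4.16.1")
    // OpenCSV, assumed to be the source of CSVWriter and CSVReader
    testImplementation("com.opencsv:opencsv:5.9")
    // JUnit 5, for @BeforeEach, @AfterEach and @Test
    testImplementation("org.junit.jupiter:junit-jupiter:5.10.1")
}

tasks.test {
    // Run the tests on the JUnit Platform
    useJUnitPlatform()
}
```

Everything is declared as testImplementation because the scraping classes in this guide live in the test module.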

Scraping Data from Melmod

Get List of All Article Links

Our first objective is to crawl a list of article links from the Melmod website. To achieve this, create a new class, MelModGetLinks.kt, within the test module.

Structure Project

Inside this class, initialize the WebDriver and the CSVWriter, as in the following snippet:

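// Imports assume JUnit 5, Selenium 4 and OpenCSV, matching the annotations and calls used in this class
import com.opencsv.CSVWriter
import org.junit.jupiter.api.AfterEach
import org.junit.jupiter.api.Assertions
import org.junit.jupiter.api.BeforeEach
import org.junit.jupiter.api.Test
import org.openqa.selenium.By
import org.openqa.selenium.WebDriver
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import java.io.FileWriter

class MelModGetLinks {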
    private lateinit var driver: WebDriver
    private lateinit var csvWriter: CSVWriter

    @BeforeEach
    fun setup() {
        // Create ChromeOptions
        val chromeOptions = ChromeOptions()

        // Disable images loading
        val prefs: MutableMap<String, Any> = HashMap()
        prefs["profile.managed_default_content_settings.images"] = 2
        chromeOptions.setExperimentalOption("prefs", prefs)

        // Run Chrome in headless mode so that no browser window is rendered
        chromeOptions.addArguments("--headless")

        // Initialize the WebDriver
        driver = ChromeDriver(chromeOptions)

        // Navigate to the website
        driver.get("https://melmod.com/mods/")

        // The output CSV file
        csvWriter = CSVWriter(FileWriter("src/test/resources/melmod-link.csv"))
    }

    @AfterEach
    fun tearDown() {
        driver.quit()
        csvWriter.close()
    }
}

With the setup in place, go back to the Melmod website. As we can see, every article has post in its class attribute:

https://melmod.com/mods/

So we can get every article name and detail link by finding all elements that have post in their class name, as in the following snippet:

@Test
fun `Get list links from pages - success`() {
    // Add header to csv file
    addHeaderForCSV()

    // Get link from pages
    val firstPage = 1
    val lastPage = 2
    val pages = (firstPage..lastPage).toList()

    for (page in pages) {
        // Navigate to the website in specific page
        driver.get("https://melmod.com/mods/page/$page/")

        // Get all articles available with class `post`
        val articles = driver.findElements(By.className("post"))
        // JUnit expects (expected, actual); each listing page should show 10 articles
        Assertions.assertEquals(10, articles.size)

        // Get link of each article
        articles.forEachIndexed { index, article ->
            val h2 = article.findElement(By.className("entry-title"))
            val a = h2.findElement(By.tagName("a"))
            val link = a.getAttribute("href")
            insertToCSV(10*(page-1)+(index+1), link, h2.text)
        }
    }
}

private fun addHeaderForCSV() {
    val header = arrayOf("Index", "Link", "Name")
    csvWriter.writeNext(header)
}

private fun insertToCSV(index: Int, link: String, name: String) {
    val row = arrayOf(index.toString(), link, name)
    csvWriter.writeNext(row)
    println("CSV: $index, $link, $name")
}

This code opens each listing page of the Melmod website and extracts every article name and detail link by class name. The result appears in src/test/resources:

melmod-link.csv
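The Index column comes from the expression 10*(page-1)+(index+1) in the test above, which turns a page number and a zero-based position on that page into one running number across all pages. A minimal sketch of the same arithmetic as a standalone function follows; the name globalIndex and the pageSize parameter are hypothetical, and the default of 10 mirrors the assertion that each page lists 10 articles.

```kotlin
// Hypothetical helper: converts (page, position on page) into a running 1-based index.
// The default page size of 10 mirrors the assertion in the test above.
fun globalIndex(page: Int, indexOnPage: Int, pageSize: Int = 10): Int {
    require(page >= 1) { "pages are numbered from 1" }
    require(indexOnPage in 0 until pageSize) { "position must be between 0 and ${pageSize - 1}" }
    return pageSize * (page - 1) + (indexOnPage + 1)
}

fun main() {
    println(globalIndex(page = 1, indexOnPage = 0)) // 1: first article on page 1
    println(globalIndex(page = 2, indexOnPage = 0)) // 11: first article on page 2
    println(globalIndex(page = 2, indexOnPage = 9)) // 20: last article on page 2
}
```

Keeping the page size in one place makes the numbering easier to adjust if the site ever changes how many articles it shows per page.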

Extract Valuable Data from Each Article

Once we have the article links stored in src/test/resources/melmod-link.csv, the next step is to open each article and extract the desired data (image, name, and mod file link):

  1. Image: find the div element that has featured-image in its class attribute.
  2. Name: find the h1 tag that has entry-title in its class attribute. The name was already collected in the first step, so there is no need to fetch it again here.
  3. Mod File Link: find the button that has wp-block-button in its class attribute.

For more information, you can see the image below:

The full source code:

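// Imports assume JUnit 5, Selenium 4 and OpenCSV, matching the annotations and calls used in this class
import com.opencsv.CSVReader
import com.opencsv.CSVWriter
import org.junit.jupiter.api.AfterEach
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.BeforeEach
import org.junit.jupiter.api.Test
import org.openqa.selenium.By
import org.openqa.selenium.WebDriver
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions
import java.io.FileReader
import java.io.FileWriter

class MelModGetFiles {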
    private lateinit var driver: WebDriver
    private lateinit var csvReader: CSVReader
    private lateinit var csvWriter: CSVWriter

    @BeforeEach
    fun setup() {
        // Create ChromeOptions
        val chromeOptions = ChromeOptions()

        // Disable images loading
        val prefs: MutableMap<String, Any> = HashMap()
        prefs["profile.managed_default_content_settings.images"] = 2
        chromeOptions.setExperimentalOption("prefs", prefs)

        // Run Chrome in headless mode so that no browser window is rendered
        chromeOptions.addArguments("--headless")

        // Initialize the WebDriver
        driver = ChromeDriver(chromeOptions)

        // Initialize the CSVReader
        csvReader = CSVReader(FileReader("src/test/resources/melmod-link.csv"))

        // Initialize the CSVWriter
        csvWriter = CSVWriter(FileWriter("src/test/resources/melmod-fileMods.csv"))
    }

    @AfterEach
    fun tearDown() {
        driver.quit()
        csvReader.close()
        csvWriter.close()
    }

    @Test
    fun `Get all file links from melmod-link csv - success`() {
        // Write header for output
        addHeader()

        // Skip the header row of the input file
        csvReader.readNext()

        // Read the remaining rows until readNext() returns null at the end of the file
        while (true) {
            val record = csvReader.readNext() ?: break

            // Each row holds: index, link, name
            val (index, link, name) = record
            findFileLinkAndAddToCSV(index, link, name)
        }
    }

    private fun addHeader() {
        val header = arrayOf("Index", "Name", "Image", "File")
        csvWriter.writeNext(header)
    }

    private fun findFileLinkAndAddToCSV(
        index: String,
        link: String,
        name: String
    ) {
        // Navigate to mod detail
        driver.get(link)

        // Get the mod image
        val imageDiv = driver.findElement(By.className("featured-image"))
        val imageTag = imageDiv.findElement(By.tagName("img"))
        val imageLink = imageTag.getAttribute("src")

        // Get the mod file link
        val downloadButton = driver.findElement(By.className("wp-block-button"))
        // JUnit expects (expected, actual)
        assertEquals("download", downloadButton.text)

        val a = downloadButton.findElement(By.tagName("a"))
        val fileLink = a.getAttribute("href")

        insertToCSV(index, name, imageLink, fileLink)
    }

    private fun insertToCSV(
        index: String,
        name: String,
        image: String,
        file: String
    ) {
        val row = arrayOf(index, name, image, file)
        csvWriter.writeNext(row)
        println("CSV: $index, $name, $image, $file")
    }
}

This code iterates through each article link, navigates to the corresponding page, and extracts the image and file link.
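One thing to keep in mind with this loop: findElement throws when an element is missing and a failed assertion ends the test, so a single unusual page stops the run before the remaining links are visited. Below is a minimal sketch of a loop that records the failing rows and carries on. It is not the code used above: LinkRow, ModRow and scrapeAll are hypothetical names, and the extract parameter stands in for the Selenium lookups so that the sketch runs without a browser.

```kotlin
// One input row from melmod-link.csv.
data class LinkRow(val index: String, val link: String, val name: String)

// One output row, mirroring the Index/Name/Image/File header.
data class ModRow(val index: String, val name: String, val image: String, val file: String)

// Hypothetical helper: `extract` stands in for the Selenium lookups and returns (image, file).
// Rows whose extraction throws are collected instead of aborting the whole run.
fun scrapeAll(
    links: List<LinkRow>,
    extract: (String) -> Pair<String, String>
): Pair<List<ModRow>, List<LinkRow>> {
    val rows = mutableListOf<ModRow>()
    val failed = mutableListOf<LinkRow>()
    for (row in links) {
        // runCatching captures any Throwable, including assertion errors
        runCatching { extract(row.link) }
            .onSuccess { (image, file) -> rows += ModRow(row.index, row.name, image, file) }
            .onFailure { failed += row }
    }
    return rows to failed
}

fun main() {
    val links = listOf(
        LinkRow("1", "https://example.com/ok", "Good mod"),
        LinkRow("2", "https://example.com/broken", "Broken mod")
    )
    // Fake extractor: fails for the second link to simulate a page without a download button
    val (rows, failed) = scrapeAll(links) { link ->
        if (link.endsWith("broken")) error("no download button")
        "$link/image.png" to "$link/file.zip"
    }
    println("scraped=${rows.size}, failed=${failed.map { it.index }}")
}
```

The failed rows could then be written to a separate CSV file and retried later, rather than restarting the whole scrape.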

Result

After completing these 2 steps, you’ll have successfully scraped all the data you need. The resulting information will be stored in src/test/resources/melmod-fileMods.csv, as illustrated in the provided image:

Scraping Result

Drawbacks

While web scraping is a powerful tool, it comes with certain drawbacks that should be considered:

  • Performance: Currently, the extraction of data from each article takes approximately 2 minutes, which is slow. I plan to find a way to improve it later.
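One possible direction, offered only as a sketch and not something measured against Melmod, is to process several links at once on a small thread pool. The helper name fetchInParallel and its parameters are hypothetical, and the fetch function stands in for the per-article Selenium work. A WebDriver instance should not be shared between threads, so in a real run each worker would need its own driver.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical helper: applies `fetch` to every link on a fixed-size thread pool
// and returns the results in the same order as the input list.
fun <T> fetchInParallel(links: List<String>, workers: Int, fetch: (String) -> T): List<T> {
    val pool = Executors.newFixedThreadPool(workers)
    try {
        val tasks = links.map { link -> Callable { fetch(link) } }
        // invokeAll blocks until every task is done and keeps the task order
        return pool.invokeAll(tasks).map { it.get() }
    } finally {
        pool.shutdown()
    }
}

fun main() {
    val links = listOf("https://example.com/a", "https://example.com/b", "https://example.com/c")
    // Stand-in for the real scraping: just take the last path segment of each link
    val results = fetchInParallel(links, workers = 2) { it.substringAfterLast('/') }
    println(results) // [a, b, c]
}
```

How many workers are reasonable depends on the machine and on what the website tolerates, so a small number is the safer starting point.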

Conclusion

Armed with the ability to fetch article links and extract valuable data, you are now equipped to scrape essential information from Mods For Melon Playground.

Remember to check the website’s terms of service and policies before scraping to ensure compliance. Feel free to customize the code according to your specific scraping needs. Happy coding!
