Introduction

Domestication of plants is a process based on many natural and non-natural factors. The ‘domestication syndrome’ is a concept which explains the key traits selected in major crops by humans. For instance, reduced grain shattering (i.e. retention of seeds on the parent plant), synchronized flowering and grain maturation, increased grain size and number, compact plant architecture, reduction of grain dormancy, and increased apical dominance are all part of the domestication syndrome (Harlan 1992). In sorghum, domestication was initiated based on allelic changes in two loci in response to the selection pressures imposed by harvesting techniques: in the change from a shattering and open panicled phenotype to non-shattering and compact panicled phenotype (Mann et al. 1983; House 1985). This was likely followed by selection for phenotypes with traits such as increased grain size and total number of branches within the inflorescence, and a reduction in rachis internode length. As a result, cultivated crop lines carry a higher yield compared to their wild relatives.

Grain size and weight are two of the important traits selected during domestication determined by the rate of cell division, the size of the cells and the duration of the grain filling period which are under both genetic and environmental control (Nicolas et al. 1984, 1985). Those are also key determinants of yield (Lee et al. 2002; Tao et al. 2017) as well as important quality attributes (Lee et al. 2002). Grain size and weight are complex quantitative traits controlled by multiple genes. Many important QTLs associated with grain size have been identified in Arabidopsis, rice, and maize (Li et al. 2011; Song et al. 2007; Wang et al. 2015a). For instance, Grain Size 3 (GS3) (Takano-Kai et al. 2009), Grain Size 5 (GS5) (Li et al. 2011), Grain Width 8 (GW8) (Wang et al. 2012), Grain Width and Weight 5 (GW5) (Liu et al. 2017), Grain Width 2 (GW2) (Song et al. 2007) and Grain Length 7 (GL7) (Wang et al. 2015b) regulate grain size by controlling cell division in rice.

In domestication, larger grains were selected over smaller grains because larger grains were easier to sow, harvest and process (Tao et al. 2017), offer increased yield as well as facilitate rapid seedling growth (Manga and Yadav 1995). This selection has led to reduction in genetic diversity in the cultivated accessions of cereal crops (Doebley et al. 2006). Studies have identified selection signatures in grain size related genes in cereals such as GS3 (Botella 2012), and GS5 (Li et al. 2011) in rice. However, limited genomic studies have been undertaken on grain size regulating genes in sorghum (Tao et al. 2017).

Genes related to the grain size in rice have been well documented (Song et al. 2007; Wang et al. 2008, 2015b). The GS3 gene encodes a protein which controls the grain length in rice (Fan et al. 2006). Studies by Takano-Kai et al. (2009) using genomic approaches have shown that a mutation in the GS3 gene was associated with the enhanced grain length in O. sativa by controlling grain elongation. GW2 is a gene in rice which controls grain size by encoding a RING-type protein with E3 ubiquitin ligase activity which acts in the ubiquitin–proteasome pathway. Larger grains are a result of loss of function of the GW2 gene (Song et al. 2007). The qGW7/GL7 gene encodes a protein homologous to longifolia 1 in Arabidopsis and regulates longitudinal cell elongation. Mutations of GL7 resulted in an increase in grain length in rice (Wang et al. 2015b). GW8, also known as OsSPL16, is associated with grain size and encodes squamosa promoter-binding protein–like 16. Loss of function of this gene is related to more slender grain varieties such as Basmati (Wang et al. 2008). GW5 is a gene in rice encoding a nuclear protein which controls grain width and weight and which also acts in the ubiquitin–proteasome pathway. A deletion in the GW5 gene is associated with increased grain width in rice (Weng et al. 2008). GS5 in rice encodes a putative serine carboxypeptidase which regulates grain weight, filling and width and consequently increases grain size (Li et al. 2011). Interestingly, in rice, many of these genes (GS3, GW2 and GW5) negatively regulate grain size, as the wild accessions had smaller grains while mutations in the alleles of these genes in cultivated species resulted in larger grains (Li et al. 2011; Zou et al. 2020).

The genomic resources of crop wild relatives are well documented in rice, wheat, sugarcane, and maize (Stalker 1980; Plucknett and Smith 2014; Brozynska et al. 2016) but less so in sorghum (Cowan et al. 2022; Ananda et al. 2020; Mace et al. 2013). The indigenous Australian sorghums are ecologically widely adaptable (Cowan et al. 2020; Myrans et al. 2021; Myrans et al. 2020). This high diversity is a result of having separate origins from domesticated sorghums, outcrossing of cultivars with highly variable wild races and cross pollination between races (Doggett 1988). The diversity among the wild species of sorghum is higher than the diversity among the cultivated species suggesting that the diversity has been reduced during the domestication process. However, gene flow is suggested to be asymmetric (Mutegi et al. 2012) since the rate of gene flow from crop-to-wild is higher than vice versa although the rare phenomenon of bidirectional gene flow can be observed in sorghum which is not common among other major crops (Mace et al. 2013).

In Australia, sorghum is mainly cultivated for animal feed (Venkateswaran et al. 2019). There are 17 species of sorghum native to Australia across four subgenera Chaetosorghum, Heterosorghum, Parasorghum, and Stiposorghum (Lazarides et al. 1991; Ananda et al. 2020). Most are found in the semiarid tropical regions of northern Australia with only one species (S. leiocladum (Hack.) C.E. Hubb.) extending to cool temperate regions (Myrans et al. 2020). The monotypic subgenus Chaetosorghum contains the endemic species S. macrospermum E.D. Garber, which is an annual that has 40 chromosomes (2n = 40). It is restricted in distribution to the Northern Territory of Australia, and has a small, sessile spikelet with an ovoid to ellipsoid caryopsis and a reduced pedicellate spikelet. Sorghum laxiflorum F.M. Bailey belongs to the subgenus Heterosorghum and is widely distributed throughout Australia, the Philippines, and Papua New Guinea. It is an annual with 40 chromosomes (2n = 40), a large, sessile spikelet, an obovoid to ellipsoid caryopsis and a reduced pedicellate spikelet. The subgenus Parasorghum contains the species S. grande Lazarides, S. leiocladum, S. matarankense E.D. Garber & Snyder, S. nitidum Pers., S. purpureosericeum (Hochst. ex A. Rich.) Schweinf. & Asch., S. versicolor Andersson, S. timorense Buse ex de Vriese and S. trichocladum Kuntze, which are distributed across Australia, Africa, Asia, and Mexico. They are mainly perennials with varying chromosome numbers (2n = 10, 20, 30 and 40), with a minute sessile spikelet and a developed pedicellate spikelet, and the five species S. grande, S. leiocladum, S. matarankense, S. nitidum and S. timorense are native or endemic to Australia. The Stiposorghum subgenus contains ten species, S. amplum Lazarides, S. angustum S.T. Blake, S. brachypodum Lazarides, S. bulbosum Lazarides, S. ecarinatum Lazarides, S. exstans Lazarides, S. interjectum Lazarides, S. intrans F. Muell. ex Benth., S. plumosum P. Beauv and S. stipoideum (Ewart & Jean White) C.A. Gardner & C.E. Hubb., all of which are endemic to Australia. These species are mainly perennials with varying chromosome numbers (2n = 10, 20, 30 and 40) with a small sessile and a well-developed pedicellate spikelet (Lazarides et al. 1991).

Variations in the grain morphology between the domesticated and wild sorghum species were studied by Shapter et al. (2008). The typical cultivated sorghum grains are spherical in shape (Tao et al. 2017). The size of the grain is determined by the cell size, cell number and number of starch granules (Nicolas et al. 1984; Yang et al. 2009). No consistent measurements for the grain size characteristics in sorghum are available in the literature with individual grain weight being used as an indicator for grain size. The weight of the grain is determined by the rate and duration of grain filling (Tao et al. 2017; Nicolas et al. 1984). Therefore, understanding the genetic basis of the grain size in sorghum will provide useful genetic information about the domestication of sorghum and use of this trait in crop improvement.

Materials and methods

Plant material and DNA sequencing

A total of 15 accessions from seven species representing the five sorghum subgenera were used in this experiment (Table 1). Plants were grown at the Australian Grains Genebank, Horsham, Vic, Australia (36° 43′ 21.93764″ S and 142° 10′ 29.50331″ E) following the protocol described in Ananda et al. (2021). Total genomic DNA was extracted from pulverized leaf tissue samples of the 15 sorghum accessions using the cetyltrimethylammonium bromide (CTAB) method optimized for sorghum (Furtado 2014), and DNA samples were sequenced on an Illumina HiSeq 2000 platform at the Ramaciotti Centre, University of New South Wales, Australia. The data yield obtained after trimming was 20X to 36X of the genome size (Ananda et al. 2021).

Table 1 Details of the samples selected for sequencing sourced from the Australian Grains Genebank

Statistical analysis

Grain weight (g), grain width (mm), grain length (mm) and grain thickness (mm) were measured for 10 grains per accession. Linear dimensions were measured using a ruler under a 10X magnification light ring and recorded to two decimal places. For S. leiocladum, S. matarankense and S. laxiflorum, 10 grains were weighed together as individual grains were too light to register on the balance. For all other species, the 10 grains were weighed individually. Morphological measurements of the grains of the 15 accessions were analysed using One-way ANOVA in Minitab (Minitab, LLC, 2021. https://www.minitab.com) at the significance level α ≤ 0.05. Multiple means were compared using the Tukey pairwise comparison test in Minitab.
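As an illustration of the One-way ANOVA step, the following minimal sketch computes the F statistic and its degrees of freedom from first principles. It is not the Minitab procedure used in the study; the function name `one_way_anova` and the grain length values are hypothetical and serve only to show how between-accession and within-accession variation are compared.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples (one per group)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual grains around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical grain lengths (mm) for three accessions; not data from this study
grain_length = {
    "accession_A": [4.0, 4.2, 4.1],
    "accession_B": [3.0, 3.1, 2.9],
    "accession_C": [2.0, 2.1, 1.9],
}
f_stat, df_b, df_w = one_way_anova(list(grain_length.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

A large F value relative to the F distribution with the reported degrees of freedom indicates that at least one accession mean differs; a post hoc test such as Tukey's is then needed to identify which pairs differ.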

Variant analysis

A comparative variant analysis of the selected grain size regulating genes (Table 2) was conducted using the basic variant analysis tool in the CLC Genomics Workbench (CLC-GWB 11.0, http://www.clcbio.com). Raw sequencing reads were imported to CLC-GWB together with the annotated nuclear genome sequence of S. bicolor from NCBI (accession NC_012870.2) as reference (Paterson et al. 2009). Raw reads were subjected to Quality Control (QC) analysis and trimmed to meet a quality score limit of 0.01 (with most calls at Phred score > 30). Variant analysis was undertaken sequentially as follows: trimmed reads were mapped against the reference genome of S. bicolor, followed by structural variant analysis using a p-value of 0.0001 as the threshold, and finally the reads were subjected to local realignment using the Indel track as the guidance variant track.

Table 2 Details of the grain size related genes selected for variant analysis
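The trimming limit of 0.01 and the Phred scores mentioned above are related through the standard Phred definition, Q = -10 log10(p), where p is the base-call error probability. The sketch below only illustrates that conversion; it does not reproduce the trimming algorithm implemented in CLC-GWB, and the function names are hypothetical.

```python
import math


def phred_from_error(p):
    """Convert a base-call error probability into a Phred quality score."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Convert a Phred quality score into a base-call error probability."""
    return 10 ** (-q / 10)


# An error probability of 0.01 corresponds to about Q20,
# and Q30 corresponds to an error probability of about 0.001.
print(round(phred_from_error(0.01)), error_from_phred(30))
```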

Variant analysis was conducted using the locally realigned mapping file. The total numbers of homozygous and heterozygous variants were obtained by filtering on frequency values equal to 100% and in the range of > 25%–< 75%, respectively. The numbers of synonymous and nonsynonymous amino acid changes were also determined.
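The frequency-based filter described above can be sketched as a simple classification rule. This is a minimal illustration, assuming each variant carries an allele frequency expressed as a percentage; the function name, the "unclassified" label for variants outside both ranges, and the example records are hypothetical and not part of the CLC-GWB workflow.

```python
def classify_zygosity(freq_percent):
    """Classify a variant by allele frequency (%), using the thresholds in the text."""
    if freq_percent == 100:
        return "homozygous"
    if 25 < freq_percent < 75:
        return "heterozygous"
    # Variants outside both ranges are not counted in either class
    return "unclassified"


# Hypothetical variant records: (position, allele frequency in %)
variants = [(1021, 100.0), (1187, 48.3), (2040, 80.0), (2311, 25.0)]
counts = {}
for _position, freq in variants:
    label = classify_zygosity(freq)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```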

Phylogenetic analysis

Trimmed reads were mapped against the annotated S. bicolor reference genome. The consensus sequences were extracted for each species and converted into coding DNA sequence (CDS) and genome tracks. From the CDS and genome tracks, annotations for the selected grain size related genes (Table 2) were selected for each accession. These genes (exons only) were concatenated to give a final annotated sequence per accession. The concatenated sequences of all the accessions were aligned using the MAFFT alignment tool in Geneious 11.1.5 software (www.geneious.com) with default parameters. A neighbour joining tree was constructed with 1000 bootstrap replicates in Geneious software.
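The concatenation step can be sketched as follows. The example joins the exons of each gene in a fixed gene order to give one sequence per accession, then computes a simple proportion of differing sites (p-distance) between two already aligned sequences of equal length. The alignment (MAFFT) and tree building (neighbour joining in Geneious) used in the study are not reproduced here; all names and sequences are hypothetical.

```python
def concatenate_exons(exons_by_gene, gene_order):
    """Join the exons of each gene, in a fixed gene order, into one sequence."""
    return "".join("".join(exons_by_gene[gene]) for gene in gene_order)


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical exon sequences for two accessions and two genes
gene_order = ["geneA", "geneB"]
accession_1 = {"geneA": ["ATG", "GCC"], "geneB": ["TTA"]}
accession_2 = {"geneA": ["ATG", "GCT"], "geneB": ["TTA"]}

seq_1 = concatenate_exons(accession_1, gene_order)
seq_2 = concatenate_exons(accession_2, gene_order)
distance = p_distance(seq_1, seq_2)
print(seq_1, seq_2, distance)
```

Using the same gene order for every accession keeps homologous positions in the same columns, which is what makes the concatenated sequences comparable in the alignment.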

Results

Morphological characteristics of the grains

Figure 1 shows the morphological variation of the sorghum grains representing the five subgenera, Eusorghum (S. bicolor), Chaetosorghum (S. macrospermum), Heterosorghum (S. laxiflorum), Parasorghum (S. matarankense, S. leiocladum, S. purpureosericeum) and Stiposorghum (S. brachypodum). The cultivated species, S. bicolor, had distinctly larger grains compared to the grains of the wild sorghum species. The two cultivated accessions used in this study had a spherical shaped grain with a creamy white colour. All the wild sorghum species had smaller and narrower grains with brown to dark brown colour. S. macrospermum had the largest grain followed by S. purpureosericeum and S. brachypodum while S. leiocladum had the smallest grain (Fig. 1).

Fig. 1

Morphological characteristics of the sorghum grains showing the wide variation in the size, shape, and colour of the grains.

Statistical analysis

Table 3 shows the morphological characteristics of the grains using grain weight, width, length, and thickness as defining parameters and analysed using One-way ANOVA test. According to the results, all four parameters were significantly different between the accessions (significance level of α ≤ 0.05) (Table 3).

Table 3 Descriptive statistics of the One-way ANOVA test for the grain size related parameters

The cultivated species S. bicolor had significantly higher values compared to the grains of the wild species for each of the characters. However, of the two S. bicolor accessions, the S. bicolor accession 314,746 had significantly higher grain weight, width, length, and thickness compared to S. bicolor accession 112,151. Among the wild species, all the grain size parameters of the two S. macrospermum accessions, 302,367 and 326,072, were distinct from the majority of the other wild species. Grain weight was highest in S. bicolor 314,746 while it was lowest in the accessions of S. matarankense 326,065 and 326,066. In the pairwise comparison, the species S. purpureosericeum and S. brachypodum were not significantly different in grain weight from S. laxiflorum and S. leiocladum. Similarly, the highest grain width was observed in S. bicolor 314,746 whereas the lowest was observed in S. matarankense. In the pairwise comparison, the species S. purpureosericeum and S. brachypodum were grouped together, while S. laxiflorum, S. matarankense and S. leiocladum were grouped together. Grain length was highest in S. macrospermum 302,367 while S. leiocladum 326,062 had the lowest. Interestingly, grain length was not significantly different for the two species S. bicolor 314,746 and S. brachypodum 326,073, and the same was true for S. laxiflorum, S. leiocladum, and S. matarankense. Likewise, the highest grain thickness was observed in S. bicolor 314,746, whereas the lowest was observed in S. matarankense 326,066. Moreover, the species S. laxiflorum, S. leiocladum, and S. matarankense had grain thickness values which were not significantly different (Table S1, Fig. 2).

Fig. 2

Statistical analysis of morphological characteristics of grains from cultivated and wild sorghum species. a Average grain weight (g), b Average grain length (mm), c Average grain thickness (mm), d Average grain width (mm). Multiple means comparisons based on Tukey pairwise comparison test. Levels not connected by the same letter are significantly different (p < 0.05). Dark blue: Eusorghum, Light blue: Chaetosorghum, Yellow: Heterosorghum, Pink: Parasorghum, and Purple: Stiposorghum

Variant analysis of the coding regions of selected grain size related genes in the different Sorghum species

Based on the reference genome of S. bicolor BTx623, variant analysis within the coding sequence regions of the selected grain size related genes from different sorghum species was carried out using the basic variant analysis tool. The highest number of total variants, including single nucleotide polymorphisms (SNPs), insertions and deletions (Indels) and multi-nucleotide variants (MNVs), was found in S. purpureosericeum 326,075 while the lowest number of total variants was found in S. bicolor 112,151 (Table 4). The total number of SNPs was also highest in S. purpureosericeum 326,075 and lowest in S. bicolor 112,151. In the accessions of S. bicolor 314,746 and 112,151, S. macrospermum 302,367 and 326,072, S. laxiflorum 326,060 and 326,074 and S. leiocladum 326,061, the number of homozygous SNPs was higher than that of the heterozygous SNPs, whereas the opposite was observed for the remaining species (Table 4).

Table 4 Variants identified in the coding regions of grain size related genes in different species of sorghum by comparison to the reference genome of S. bicolor (https://www.ncbi.nlm.nih.gov/nuccore/NC_012870.2)

According to the basic variant analysis of the CDS regions of the selected genes, all the wild species had a similar number of variants per gene. In the sorghum reference genome, some of these selected genes have several transcript variants resulting in different protein sequences. For instance, Sobic.001G335800 has three transcript variants giving rise to three protein products (XP_021307644.1, XP_021307643.1, and XP_002467688.1), Sobic.002G257900 has two transcript variants giving rise to two proteins (XP_002460490.1 and XP_021308005.1), and Sobic.004G107300 has two transcript variants (XP_002453598.2 and XP_021315956.1). As expected, no variants were observed in the CDS regions for any of the selected genes of the two S. bicolor accessions compared to the reference S. bicolor, since they are all the BTx623 genotype (Table 5).

Table 5 Variants in the CDS region of the selected grain size related genes compared to S. bicolor reference genome

Sobic.001G335800 gene (qGW7/GL7)

Within the wild sorghums, the highest total number of variants within the Sobic.001G335800 gene was observed in all three transcript variants in the two S. macrospermum accessions followed by S. matarankense. The lowest number of variants was observed in the two accessions of S. leiocladum. Similar results were observed for the total number of SNPs found in the region. Compared to the number of homozygous SNPs, the number of heterozygous SNPs was higher in the species of S. macrospermum and S. laxiflorum while lower in the rest of the species. In all wild sorghum species, the number of nonsynonymous amino acid changes was higher than the number of synonymous changes (Table 5).

Sobic.001G341700 gene (GS3)

The highest total number of variants within Sobic.001G341700 was found in the two S. macrospermum accessions while the lowest was found in S. matarankense 326,066. Similar results were observed for the total number of SNPs. Compared to the number of homozygous SNPs, the number of heterozygous SNPs was higher in the species of S. macrospermum, S. laxiflorum and S. brachypodum 302,670. The number of nonsynonymous amino acid changes was higher than the synonymous amino acid changes only in the species S. macrospermum, S. laxiflorum, S. leiocladum 326,062, S. matarankense 326,065 and S. brachypodum 302,670 (Table 5).

Sobic.002G257900 gene (GW8)

In the CDS regions of the two transcript variants of Sobic.002G257900, the highest total number of variants was observed in the three S. purpureosericeum accessions followed by S. laxiflorum, while the lowest was observed in S. macrospermum. A parallel situation was observed for the total number of SNPs found in the region. Compared to homozygous SNPs, the number of heterozygous SNPs was higher in the species of S. macrospermum, S. laxiflorum, S. leiocladum 326,061, and S. purpureosericeum. The number of nonsynonymous amino acid changes was higher than synonymous amino acid changes in S. macrospermum 302,367, S. laxiflorum, and S. purpureosericeum 326,068 and 326,075 (Table 5).

Sobic.003G035400 gene (GW5/qSW5)

The highest total number of variants in the CDS region of the two transcript variants of the Sobic.003G035400 gene was present in S. brachypodum 302,670 followed by S. laxiflorum 326,060 whereas S. matarankense had the lowest number. Similar results were observed for the total number of SNPs. Compared to the number of homozygous SNPs, the number of heterozygous SNPs was higher in the species of S. macrospermum, S. laxiflorum, S. leiocladum, and S. brachypodum. In all species, the number of nonsynonymous amino acid changes was higher than the number of synonymous amino acid changes (Table 5).

Sobic.004G107300 gene (GW2)

For the CDS regions of the two transcript variants of the Sobic.004G107300 gene, the accessions S. matarankense 326,066 and S. leiocladum carried the highest number and lowest number of total variants, respectively. A parallel situation applied for the total number of SNPs. Except for the species S. macrospermum and S. laxiflorum, the number of heterozygous SNPs was higher than the number of homozygous SNPs in all the species. For all the species, a higher number of synonymous compared to nonsynonymous amino acid changes was observed (Table 5).

Sobic.009G053600 gene (GS5)

The highest total number of variants was observed in S. leiocladum and S. macrospermum while the lowest was observed in S. matarankense 326,065. Similar results were observed for the total number of SNPs found in the region. In comparison to the number of homozygous SNPs, the number of heterozygous SNPs was higher in the species S. macrospermum, S. laxiflorum, S. leiocladum and S. purpureosericeum. The number of nonsynonymous amino acid changes was lower than the number of synonymous amino acid changes in all species (Table 5).

In some accessions, the variant analysis of the selected grain size related genes identified SNP variants which resulted in premature stop codons. In both accessions of S. macrospermum and S. laxiflorum, SNP variants resulting in premature stop codons were observed in all three transcript variants of the Sobic.001G335800 (qGW7/GL7) gene, as the result of changes of a glycine at the positions 958, 1057 and 1057 into a stop codon. In the two accessions of S. macrospermum, an additional change was observed in the Sobic.001G341700 (GS3) gene resulting in the change of a glycine at position 241 into a stop codon. In Sobic.009G053600 (GS5), stop codons were observed in S. macrospermum 326,072 (glycine28 > *), S. laxiflorum 326,060 (glycine28 > *) and 326,074 (glycine28 > *), S. leiocladum 326,061 (tyrosine483 > *) and 326,062 (glycine637 > *, tyrosine1236 > *), and S. purpureosericeum 326,068 (tyrosine483 > *), 326,071 (tyrosine483 > *) and 326,075 (tyrosine483 > *). A unique amino acid change was observed in Sobic.001G341700 (GS3) in the S. brachypodum 302,670 accession, changing an arginine at position 262 into a stop codon. A stop codon was observed in the Sobic.004G107300 (GW2) gene in all three accessions of S. purpureosericeum at position 466 (Table 6).

Table 6 SNP variants leading to premature stop codons in selected grain size related genes in wild sorghum species
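To make the distinction between synonymous, nonsynonymous and stop-introducing SNPs concrete, the sketch below classifies a single-base change within a codon using the standard genetic code. It is an illustration only: the function name is hypothetical, the example codons are generic textbook cases rather than variants from Table 6, and, unlike the counts reported here, the sketch labels stop-introducing changes separately from other nonsynonymous changes.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def snp_effect(ref_codon, pos_in_codon, alt_base):
    """Return (effect, ref_aa, alt_aa) for a SNP at a 0-based position in a codon."""
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        effect = "synonymous"
    elif alt_aa == "*":
        effect = "stop_gained"
    else:
        effect = "nonsynonymous"
    return effect, ref_aa, alt_aa


# Generic examples (not variants from this study)
print(snp_effect("CGA", 0, "T"))  # arginine codon changed to TGA
print(snp_effect("GGA", 2, "C"))  # glycine codon changed to GGC
print(snp_effect("GGA", 1, "C"))  # glycine codon changed to GCA
```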

The consensus sequences of the coding regions of the selected grain size related genes were concatenated and then aligned to the reference S. bicolor genome derived sequences to construct a neighbour-joining tree. The topology of the tree was supported by high bootstrap values for all clades. In the neighbour-joining tree, two distinct clades were observed with Eusorghum, Chaetosorghum and Heterosorghum in one clade while Stiposorghum and Parasorghum clustered in a separate clade. All the accessions within subgenera were clustered together in the same clade (Fig. 3) which resembled the phylogenetic tree in Ananda et al. (2021).

Fig. 3

Neighbour-joining tree constructed (1000 bootstrap replicates) from the CDS sequences of selected grain size related genes (Sobic.001G335800 (qGW7/GL7), Sobic.001G341700 (GS3), Sobic.002G257900 (GW8), Sobic.003G035400 (GW5/qSW5), Sobic.004G107300 (GW2), and Sobic.009G053600 (GS5)) derived from the 15 sorghum accessions covering the five subgenera in the Sorghum genus, with O. sativa as the outgroup. Each subgenus is shown in a different colour. The bootstrap value (/100) is marked on each node.

Discussion

Grain size and weight are two of the key yield components in cereals. Grain size in sorghum varies across the genus (Dillon et al. 2007) but information on genes controlling grain size is scarce. In our current study, we demonstrate a clear difference in the shape, colour, size, and weight between cultivated and wild sorghum species from across the genus. Significant differences in grain size were also detected between the two different S. bicolor (Eusorghum) lines. It could be that these two accessions were derived from two different parent accessions (Ananda et al. 2021) or due to environmental effects when the seed lines were grown. The two monotypic subgenera Chaetosorghum and Heterosorghum are closely related (Ananda et al. 2021) (Fig. 3). Nevertheless, the size of the grains was significantly different.

Interestingly, significant differences were observed even within subgenera, with the grains of S. matarankense, S. purpureosericeum, and S. leiocladum, which belong to Parasorghum, having significantly different grain size parameters.

In this study, we identified the presence of variants in the CDS regions of a number of grain size related genes in wild sorghum species. The number of variants was lowest in the two S. bicolor accessions as the sequence comparisons were made using the sequence of S. bicolor (genotype BTx623) as reference. The wild sorghum species are distant to S. bicolor, and the mapping percentages therefore differed significantly and were low for some of the species. S. macrospermum and S. laxiflorum are closely related to S. bicolor and thus contained a higher percentage of trimmed reads mapping to the reference genome resulting in identification of a higher number of variants. This makes direct comparisons between species difficult. This imbalance can be resolved by using the same species as the reference sequence when the whole genome sequences of these wild species become available.

Within the same species, no significant differences were observed in the number of SNPs within a given gene, suggesting that the accessions are indeed closely related. Furthermore, except for the species S. bicolor, S. macrospermum and S. laxiflorum, and the accession S. leiocladum 326,061, all the other accessions had a higher number of homozygous SNPs than heterozygous SNPs. Although sorghum is considered a self-pollinated crop, cross-pollination ranging from 5 to 15% has been reported (Poehlman 2013). Therefore, S. bicolor, S. macrospermum, and S. laxiflorum might have a higher cross-pollination rate than self-pollination.

In this study, six key grain size related genes were analysed for variants using the S. bicolor annotated genome as a reference. Some of the genes were annotated with more than one transcript variant as an indicator of alternative splicing and different protein products. In all the sorghum species, the same pattern in the number of variants was observed in all the transcript variants for a particular gene. The highest percentage of variants was observed in the Sobic.001G335800 (qGW7/GL7) gene (9% of the length), which had the longest transcript. Among the other genes, the Sobic.004G107300 (GW2) and Sobic.002G257900 (GW8) genes were more conserved within the genus as they had a comparatively lower percentage of variants (3% of the length).
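The percentages quoted above express the number of variants relative to the length of the coding sequence. A minimal sketch of that arithmetic is given below; the function name and the counts and lengths in the example are hypothetical placeholders, not values from Table 5.

```python
def variant_percentage(n_variants, cds_length_bp):
    """Number of variants expressed as a percentage of the CDS length."""
    if cds_length_bp <= 0:
        raise ValueError("CDS length must be positive")
    return 100 * n_variants / cds_length_bp


# Hypothetical example: 270 variants in a 3000 bp CDS versus 45 in a 1500 bp CDS
print(variant_percentage(270, 3000), variant_percentage(45, 1500))
```

Normalising by length allows genes with different CDS lengths to be compared, since a longer gene would otherwise tend to show more variants simply because it has more sites.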

In sorghum and other cereals, mutations have been reported to cause a loss of function of some grain size related genes (Song et al. 2007; Zou et al. 2020; Tao et al. 2017). In our study, for some of the genes, nonsynonymous changes which introduce stop codons were observed in some accessions. In the GS3 gene in S. bicolor, a premature stop codon in the fifth exon was shown to result from a single C to A nucleotide change preventing expression of the gene and resulting in an increase in grain weight (Tao et al. 2020). In our analysis of the same gene, mutations causing G to A nucleotide changes resulted in conversion of codons for the amino acids glycine and arginine into stop codons in S. macrospermum and S. brachypodum, respectively. Our measurements of grain size show that S. macrospermum has the highest grain weight among the wild accessions followed by S. brachypodum. Therefore, a loss of function caused by the stop codons identified in the Sobic.001G341700 (GS3) gene in these two sorghum species might contribute to the observed increased grain weight. In rice, loss of function mutation of the GW2 gene is known to cause increased grain weight and width and thus larger grains (Song et al. 2007). In this study, a SNP mutation was observed in both transcript variants of the same gene in the S. purpureosericeum accessions, causing introduction of premature stop codons. The grain morphology of S. purpureosericeum was characterized by a comparatively higher grain width and weight among the Parasorghum species. This may be due to the loss of function of the Sobic.004G107300 (GW2) gene. Sequence changes which introduce premature stop codons were also observed in the Sobic.001G335800 (qGW7/GL7) gene and may therefore affect grain length in S. macrospermum and S. laxiflorum. These two species belong to the subgenera Chaetosorghum and Heterosorghum, respectively, and are phylogenetically closely related (Ananda et al. 2021). Nevertheless, their grain size parameters are drastically different. Therefore, it is difficult to assess the effect of the stop codon in qGW7/GL7 (Sobic.001G335800) in these two species.

The GS5 gene controls grain width, weight and filling in rice (Li et al. 2011). In our analyses, stop codons were observed in GS5 (Sobic.009G053600) in S. laxiflorum, S. leiocladum and S. purpureosericeum. The grain weights of S. laxiflorum and S. leiocladum are low but, except for one accession (S. purpureosericeum 326,068), not significantly different. However, the grain width of both accessions of S. purpureosericeum was significantly greater than that of the other two species. This might be due to the reduction or loss of function of the GS5 (Sobic.009G053600) gene. To determine the exact effect of these premature stop codons on the function of these grain size related genes in sorghum, further experiments with more samples are required.

The phylogenetic tree shown here, based on the six selected grain size related genes, had a similar topology to the phylogenetic tree published in the study of Ananda et al. (2021). In that study, we suggested that the Sorghum genus could be divided into two main groups based on chloroplast and nuclear gene phylogeny, with Eusorghum, Chaetosorghum and Heterosorghum in one group and Parasorghum and Stiposorghum in the other. The results presented here support this view.

Our current study was targeted towards addressing some of the less studied features of the wild sorghums that will be important in efforts to use wild sorghum for re-wilding of the elite sorghum cultivars to overcome the “domestication syndrome” and gain plant vitality. However, the mapping percentages of the species were vastly different because the genome of the cultivated species S. bicolor was the only one able to be used as the reference. Thus, species more closely related to S. bicolor had a higher mapping percentage compared to other species, which affects the variant analysis, giving rise to a higher number of variants. This study could be extended by including more accessions covering all species from the genus and multiple populations representing the diversity within each species. It provides a preliminary guide to identify the key gene targets in the wild sorghum species to improve the grain quality of sorghum. Multiple accessions of several wild sorghum species are currently being sequenced and, in the future, this will provide a broad range of reference sequences for more accurate mapping. Further sequence analysis and experimental data will also allow more accurate determination of ploidy levels and more accurate basic variant analysis between and within species. The information about the diversity of grain size related genes of the wild accessions would be beneficial in future experiments.

Conclusions

The genus Sorghum has a wide variation in the grain size related parameters, with the wild sorghum species having higher diversity. The six selected grain size related genes, Sobic.001G335800 (qGW7/GL7), Sobic.001G341700 (GS3), Sobic.002G257900 (GW8), Sobic.003G035400 (GW5/qSW5), Sobic.004G107300 (GW2), and Sobic.009G053600 (GS5), showed polymorphism in the coding sequence regions. Grain size related genes from wild sorghums have a higher degree of polymorphism compared to the cultivated sorghum species. Mutations which cause stop codons in the grain size related genes might lead to reduction or loss of function of the genes and may explain the variation in grain sizes observed. These results suggest that analysis of the genomes of wild sorghum species should allow the discovery of useful genes for the control of grain size in sorghum and other grasses.